Converting Bounding Boxes to Segmentation Masks

In this post I am going to share my journey converting my fashion dataset from bounding boxes to segmentation masks using SAM (Segment Anything Model).
This is my first blog post, so I’m not great at intros. I am going to go straight into the code and process. Let’s begin:

Step 1: Prepare environment

If you are working on your own machine, you should set up a virtual environment. Read this to learn why and how:

My choice of virtual environment manager is either conda or pipenv, but in this tutorial I will be developing on Kaggle, which comes with its own environment, so I don’t have to worry about it.

Step 2: Install libraries

Here are the libraries you need and the pip installation command:

Transformers by Hugging Face (shout out to Transformers: Rise of the Beasts)

pip install transformers

This pulls in NumPy, but not PyTorch or matplotlib. Kaggle notebooks come with both preinstalled, though, so… that was easy lol
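If you are not on Kaggle, it is worth confirming that everything is importable before going further. Here is a small sketch that uses only the standard library’s importlib.util.find_spec. The REQUIRED list is my assumption of what this tutorial touches: these are import names, not pip names (cv2 comes from opencv-python, PIL from Pillow, and segment_anything is the SAM package the code below imports).

```python
import importlib.util

# Import names (not pip package names) used in this tutorial.
# This list is an assumption based on the code in this post.
REQUIRED = ["torch", "torchvision", "transformers", "numpy", "matplotlib",
            "cv2", "PIL", "segment_anything"]


def check_packages(names):
    """Map each import name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, ok in check_packages(REQUIRED).items():
        print(f"{name:18} {'ok' if ok else 'MISSING'}")
```

Anything reported as MISSING can be installed with pip before moving on.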

Also, here is the tutorial I will be referencing for this code:

We are going to copy some functions from this notebook. Overall it’s a great intro to SAM, so check it out.

import torch
import torchvision
print("PyTorch version:", torch.__version__)
print("Torchvision version:", torchvision.__version__)
print("CUDA is available:", torch.cuda.is_available())
import sys
# Install OpenCV and matplotlib, then the segment-anything package from Git,
# using the same Python interpreter as this notebook kernel
!{sys.executable} -m pip install opencv-python matplotlib
!{sys.executable} -m pip install 'git+'

# Download images into ./images
!mkdir images
!wget -P images
!wget -P images


Step 3: Let’s view our classes

import torch
from transformers import SamModel, SamProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")

For this dataset, we are going to convert YOLO-format bounding boxes to segmentation masks.

Here are the classes for my dataset:

Objects = ['Person', 'Top', 'Shoe', 'Dress', 'Pants', 'Luggage & bags', 'Hat', 'Skirt', 'Shorts', 'Accessory', 'Sunglasses', 'Swimwear']

Here are the image and the labels we are going to be working with:

2 0.36851637065410614 0.8876559734344482 0.08369550108909607 0.1444988250732422
2 0.4354017525911331 0.747386246919632 0.06847158074378967 0.1340007185935974
7 0.4495207667350769 0.5054635405540466 0.23303723335266113 0.21894299983978271
0 0.45071592926979065 0.5515952035784721 0.2737799286842346 0.8076206594705582
1 0.48304708302021027 0.33529189229011536 0.1939612329006195 0.1412053108215332
5 0.6979173719882965 0.6846143305301666 0.1545822024345398 0.11421650648117065
5 0.34287188947200775 0.4507826119661331 0.10764887928962708 0.10002323985099792

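The YOLO format for bounding boxes works like this:

  • One row per object
  • Each row is class x_center y_center width height
  • x_center and width are divided by the image width, y_center and height by the image height, so all four values fall between 0 and 1
  • Class numbers are zero-indexed (0 is 'Person' in the list above)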
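To make the format concrete, here is a small parser for one label row. It is a sketch, not code from the original notebook: the YoloBox and parse_yolo_line names are mine. It reuses the Objects class list from above (repeated so the block runs on its own) and rejects rows with the wrong number of fields, an unknown class id, or values outside the normalized range.

```python
from typing import NamedTuple

# Class list for this dataset (copied from the Objects list above).
Objects = ['Person', 'Top', 'Shoe', 'Dress', 'Pants', 'Luggage & bags', 'Hat',
           'Skirt', 'Shorts', 'Accessory', 'Sunglasses', 'Swimwear']


class YoloBox(NamedTuple):
    """One YOLO label row: class id plus normalized center and size values."""
    class_id: int
    x_center: float
    y_center: float
    width: float
    height: float

    @property
    def class_name(self) -> str:
        return Objects[self.class_id]


def parse_yolo_line(line: str) -> YoloBox:
    """Parse 'class x_center y_center width height' into a YoloBox."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    box = YoloBox(int(parts[0]), *(float(p) for p in parts[1:]))
    if not 0 <= box.class_id < len(Objects):
        raise ValueError(f"unknown class id {box.class_id}")
    # YOLO stores coordinates as fractions of the image width and height
    if not all(0.0 <= v <= 1.0 for v in box[1:]):
        raise ValueError(f"values must be normalized to [0, 1]: {line!r}")
    return box
```

For example, parsing the third row of the label file above gives a box whose class_name is 'Skirt'.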

Here is a good tutorial on YOLO; it’s worth a read:

Now load your image:

import cv2
image_BGR = cv2.imread('/kaggle/input/bounding-to-seg-test/000027.jpg')

Let’s display our bounding boxes. First we have to read in the labels; here is the function I will be using:

# Reading annotation txt file that has bounding boxes coordinates in YOLO format
def getLabels(labelPath):
    with open(labelPath) as f:
        # Preparing list for annotation of BB (bounding boxes)
        labels = []
        for line in f:
            labels += [line.rstrip()]

    return labels

Let us plot our classes:

import matplotlib.pyplot as plt

# Going through all BB and converting the normalized YOLO values to pixels
def readLabelBB(labels, w, h):
    parsedLabels = []
    for label in labels:
        bb_current = label.split()
        class_id = int(bb_current[0])
        x_center, y_center = int(float(bb_current[1]) * w), int(float(bb_current[2]) * h)
        box_width, box_height = int(float(bb_current[3]) * w), int(float(bb_current[4]) * h)
        parsedLabels.append((class_id, x_center, y_center, box_width, box_height))
    return parsedLabels

def plotLabels(image, labels):
    image = image.copy()  # draw on a copy so the loaded image stays unmodified
    h, w = image.shape[:2]
    parsedLabels = readLabelBB(labels, w, h)
    for class_id, x_center, y_center, box_width, box_height in parsedLabels:
        # Now, from YOLO data format, we can get top left corner coordinates
        # that are x_min and y_min
        x_min = int(x_center - (box_width / 2))
        y_min = int(y_center - (box_height / 2))
        # Drawing bounding box on the image
        cv2.rectangle(image, (x_min, y_min), (x_min + box_width, y_min + box_height), [172, 10, 127], 2)

        # Preparing text with the class name for the current bounding box
        class_current = 'Class: {}'.format(Objects[class_id])

        # Putting the class name just above the bounding box
        cv2.putText(image, class_current, (x_min, y_min - 5), cv2.FONT_HERSHEY_COMPLEX, 0.7, [172, 10, 127], 2)

    # Setting default size of the plot
    plt.rcParams['figure.figsize'] = (15, 15)

    # Initializing the plot; OpenCV images are BGR, matplotlib expects RGB
    plt.figure()
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.title('Fashion Classes', fontsize=18)

    # Showing the plot
    plt.show()

# Plot our classes
labels = getLabels('path-to-labels')
plotLabels(image_BGR, labels)

Here are our classes. It looks like the labeling model mistook a vent for a bag, but that is fine. I used a pretrained model to label my dataset: a very ingenious way to create 60,000 labeled images for 60 dollars. I might make a blog post on that later.

Step 4: Let us write our bounding box to segmentation code.

For this part we need to convert the YOLO coordinates into the format SAM expects: a flat list [x0, y0, x1, y1] holding the pixel coordinates of the top-left and bottom-right corners of each bounding box. To do this we will use the following function:

def getConvertedBoxes(labels, image_width, image_height):
    parsedLabels = []
    for i in range(len(labels)):
        bb_current = labels[i].split()
        x_center, y_center = float(bb_current[1]), float(bb_current[2])
        box_width, box_height = float(bb_current[3]), float(bb_current[4])
        # Convert to top left and bottom right coordinates
        x0 = int((x_center - box_width / 2) * image_width)
        y0 = int((y_center - box_height / 2) * image_height)
        x1 = int((x_center + box_width / 2) * image_width)
        y1 = int((y_center + box_height / 2) * image_height)
        parsedLabels.append([x0, y0, x1, y1])
    return parsedLabels
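One thing getConvertedBoxes does not handle: a box that touches the image border can land slightly outside the image after conversion, and int() truncates rather than rounds. Below is a hedged variant (yolo_to_xyxy is a hypothetical name, not from the SAM notebook) that rounds to the nearest pixel and clamps every corner to the image. It takes a single label row so it is easy to test.

```python
from typing import List


def yolo_to_xyxy(label: str, image_width: int, image_height: int) -> List[int]:
    """Convert one YOLO row to [x0, y0, x1, y1] pixel coords clamped to the image."""
    _, xc, yc, bw, bh = label.split()
    xc, yc, bw, bh = float(xc), float(yc), float(bw), float(bh)

    def clamp(value: float, upper: int) -> int:
        # Round to the nearest pixel, then keep the result inside [0, upper]
        return max(0, min(upper, round(value)))

    x0 = clamp((xc - bw / 2) * image_width, image_width)
    y0 = clamp((yc - bh / 2) * image_height, image_height)
    x1 = clamp((xc + bw / 2) * image_width, image_width)
    y1 = clamp((yc + bh / 2) * image_height, image_height)
    return [x0, y0, x1, y1]
```

For a whole label file this becomes [yolo_to_xyxy(l, w, h) for l in labels]. Note that Python’s round() sends exact .5 values to the nearest even integer, which is fine at the one-pixel level.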

After converting them, we will plot them. We will use the following plot functions from the SAM demo notebook; since the boxes are now in the format those functions expect, they should work as-is:

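import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

# Plot helpers from the SAM demo notebook
def show_mask(mask, ax, random_color=False):
    # Overlay a binary mask as a semi-transparent colored layer
    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30/255, 144/255, 255/255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)
def show_box(box, ax):
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0,0,0,0), lw=2))
def show_boxes_on_image(raw_image, boxes):
    plt.figure(figsize=(10, 10))
    plt.imshow(raw_image)
    for box in boxes:
        show_box(box, plt.gca())
    plt.show()

path = "path-to-image"
raw_image = Image.open(path).convert("RGB")
w, h = raw_image.size  # PIL reports (width, height)
convertedBoxes = getConvertedBoxes(labels, w, h)
show_boxes_on_image(raw_image, convertedBoxes)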

Just like magic:

YOLO bounding boxes converted to [x0, y0, x1, y1] pixel coordinates.

Now we feed the boxes into SAM’s predictor to get the masks. First, load the model and the image:

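import cv2
import torch
from segment_anything import sam_model_registry, SamPredictor

sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"

device = "cuda" if torch.cuda.is_available() else "cpu"

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
# Move the model to the GPU (or the CPU as a fallback)
sam.to(device=device)

predictor = SamPredictor(sam)
image = cv2.imread('/notebooks/images/000027.jpg')
# OpenCV loads BGR; SAM expects RGB
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Compute the image embedding once; the box prompts below reuse it
predictor.set_image(image)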
Image for segmentation

Step 5: Get Masks

We got the bounding boxes for these masks by converting the YOLO labels into the box format SAM expects; see Step 4 for the code that does that.

Here are the bounding boxes for this picture.

bounding_boxes = [[209, 532, 262, 626], [256, 444, 300, 531], [213, 258, 362, 401], [200, 96, 376, 623], [247, 172, 371, 265], [397, 409, 496, 484], [184, 261, 253, 327]]

To feed these boxes into the SAM predictor, we have to convert them into a tensor:

input_boxes = torch.tensor(bounding_boxes, device=predictor.device)

Then rescale the boxes to SAM’s input resolution and extract the masks:

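transformed_boxes = predictor.transform.apply_boxes_torch(input_boxes, image.shape[:2])
masks, _, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=transformed_boxes,
    multimask_output=False,  # one mask per box: shape (num_boxes, 1, H, W)
)
plt.figure(figsize=(10, 10))
plt.imshow(image)
for mask in masks:
    show_mask(mask.cpu().numpy(), plt.gca(), random_color=True)
for box in input_boxes:
    show_box(box.cpu().numpy(), plt.gca())
plt.show()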
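Why the apply_boxes_torch call? As far as I understand the segment-anything source, SAM’s image encoder works on images resized so that the longest side is 1024 pixels, and predictor.transform (a ResizeLongestSide object) rescales box corners by the same factors. The pure-Python functions below are my own re-creation of that arithmetic for illustration only; preprocess_shape and resize_boxes_longest_side are hypothetical names, and real code should keep calling apply_boxes_torch.

```python
from typing import List, Sequence, Tuple


def preprocess_shape(old_h: int, old_w: int, long_side: int = 1024) -> Tuple[int, int]:
    """(height, width) after resizing so the longest side equals long_side."""
    scale = long_side / max(old_h, old_w)
    new_h, new_w = old_h * scale, old_w * scale
    return int(new_h + 0.5), int(new_w + 0.5)  # round to the nearest pixel


def resize_boxes_longest_side(boxes: Sequence[Sequence[float]],
                              original_size: Tuple[int, int],
                              long_side: int = 1024) -> List[List[float]]:
    """Rescale [x0, y0, x1, y1] boxes from original pixels to the resized image."""
    old_h, old_w = original_size
    new_h, new_w = preprocess_shape(old_h, old_w, long_side)
    sx, sy = new_w / old_w, new_h / old_h  # x scales with width, y with height
    return [[x0 * sx, y0 * sy, x1 * sx, y1 * sy] for x0, y0, x1, y1 in boxes]
```

For a 480x640 (height x width) image, the resized shape is 768x1024, so every coordinate is multiplied by 1.6. This is why boxes must be passed through the transform rather than fed to predict_torch in original pixel units.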

Here are the results

Bounding box to mask results

Here are the masks without the person mask, which was coloring everything green:

Masks without the person mask
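To drop the person mask before plotting, note that SAM returns one mask per prompt box in the same order as the boxes, so the class ids from the label file line up with the masks. Here is a minimal sketch, assuming the masks have already been moved to NumPy with masks.cpu().numpy(). The drop_classes function and its exclude parameter are hypothetical names; class 0 is 'Person' in the Objects list.

```python
import numpy as np


def drop_classes(masks: np.ndarray, boxes, class_ids, exclude=(0,)):
    """Remove masks (and their boxes) whose class id is in `exclude`.

    masks: array shaped (num_boxes, ...), one entry per prompt box, in label order.
    class_ids: the first column of each YOLO label row, in the same order.
    """
    keep = np.array([cid not in exclude for cid in class_ids], dtype=bool)
    kept_boxes = [box for box, k in zip(boxes, keep) if k]
    return masks[keep], kept_boxes
```

Usage could look like class_ids = [int(line.split()[0]) for line in labels] followed by masks_np, boxes_kept = drop_classes(masks.cpu().numpy(), bounding_boxes, class_ids), then the same show_mask / show_box loop as above.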

It worked!!!!!!!!!
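Eyeballing the overlay is a good start, but with 60,000 images you will want an automatic check. One option (my own suggestion, not part of the SAM notebook) is to compute the tight bounding box of each predicted mask and its IoU with the prompt box. Masks that leak far outside their box, or cover only a sliver of it, get a low score and can be flagged for review. The helper names are hypothetical, and any cutoff you choose is an assumption to tune on your data.

```python
import numpy as np


def mask_to_xyxy(mask: np.ndarray):
    """Tight [x0, y0, x1, y1] box around the True pixels of an (H, W) mask, or None."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # empty mask
    # +1 turns the last pixel index into a right/bottom edge coordinate
    return [int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1]


def box_iou(a, b) -> float:
    """Intersection over union of two [x0, y0, x1, y1] boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_box_agreement(mask: np.ndarray, prompt_box) -> float:
    """IoU between a predicted mask's bounding box and the box used to prompt SAM."""
    mask_box = mask_to_xyxy(mask)
    return 0.0 if mask_box is None else box_iou(mask_box, prompt_box)
```

With the predictor output above, masks[:, 0].cpu().numpy() gives one (H, W) mask per box, so [mask_box_agreement(m, b) for m, b in zip(masks[:, 0].cpu().numpy(), bounding_boxes)] scores every mask in the image.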

Here is the link to the full GitHub repo: conversion_code
