In this post I am going to share my journey converting my fashion dataset from bounding boxes to segmentation masks using SAM (the Segment Anything Model).
This is my first blog post, so I’m not great at intros. I am just going to go straight into the code and process. Let’s begin:
Step 1: Prepare environment
If you are working on your own machine, then you have to set up a virtual environment. Read this to learn why and how: https://medium.com/analytics-vidhya/virtual-environments-in-python-186cbd4a1b94
My choice of virtual env is either conda or pipenv… but in this tutorial I will be developing on Kaggle, which comes with its own environment, so I don’t have to worry about it.
Step 2: Install libraries
Here are the libraries you need and the pip installation command:
Transformers by Hugging Face (shout out Transformers: Rise of the Beasts)
pip install transformers
This pulls in numpy and most of the other tools you’ll need to run the Segment Anything Model, and Kaggle already has matplotlib and friends installed. So… that was easy lol
Also, here is the tutorial I will be referencing while writing this code: https://github.com/huggingface/notebooks/blob/main/examples/segment_anything.ipynb
We are going to copy some functions from this notebook, and overall it’s a great intro to SAM, so check it out.
import torch
import torchvision
print("PyTorch version:", torch.__version__)
print("Torchvision version:", torchvision.__version__)
print("CUDA is available:", torch.cuda.is_available())
import sys
!{sys.executable} -m pip install opencv-python matplotlib
!{sys.executable} -m pip install 'git+https://github.com/facebookresearch/segment-anything.git'
!mkdir images
!wget -P images https://raw.githubusercontent.com/facebookresearch/segment-anything/main/notebooks/images/truck.jpg
!wget -P images https://raw.githubusercontent.com/facebookresearch/segment-anything/main/notebooks/images/groceries.jpg
!wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
Step 3: Let’s view our classes
import torch
from transformers import SamModel, SamProcessor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SamModel.from_pretrained("facebook/sam-vit-huge").to(device)
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
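Quick aside: here is a minimal sketch of how the SamModel/SamProcessor pair we just loaded can take a box prompt and return masks, following the Hugging Face SAM example linked above. The image path and the box values below are placeholders; later in the post we do the same thing with the original segment_anything repo instead.

# Minimal sketch: box-prompted mask prediction with the Transformers SAM API.
# The image path and box coordinates are placeholders.
from PIL import Image

raw_image = Image.open("path-to-image").convert("RGB")
input_boxes = [[[200, 96, 376, 623]]]  # one image, one box: [x0, y0, x1, y1]

inputs = processor(raw_image, input_boxes=input_boxes, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Resize the low-resolution predicted masks back to the original image size
masks = processor.image_processor.post_process_masks(
    outputs.pred_masks.cpu(),
    inputs["original_sizes"].cpu(),
    inputs["reshaped_input_sizes"].cpu(),
)
print(len(masks), masks[0].shape)  # one entry per image; masks[0] holds the per-box masks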
Now, for our dataset, we are going to be converting YOLO-format bounding boxes to segmentation masks.
Here are the classes for my dataset:
Objects = ['Person', 'Top', 'Shoe', 'Dress', 'Pants', 'Luggage & bags', 'Hat', 'Skirt', 'Shorts', 'Accessory', 'Sunglasses', 'Swimwear']
Here are the labels and the image we are going to be working with:
2 0.36851637065410614 0.8876559734344482 0.08369550108909607 0.1444988250732422
2 0.4354017525911331 0.747386246919632 0.06847158074378967 0.1340007185935974
7 0.4495207667350769 0.5054635405540466 0.23303723335266113 0.21894299983978271
0 0.45071592926979065 0.5515952035784721 0.2737799286842346 0.8076206594705582
1 0.48304708302021027 0.33529189229011536 0.1939612329006195 0.1412053108215332
5 0.6979173719882965 0.6846143305301666 0.1545822024345398 0.11421650648117065
5 0.34287188947200775 0.4507826119661331 0.10764887928962708 0.10002323985099792

The YOLO bounding box format works like this:
- One row per object
- Each row is: class x_center y_center width height
- Box coordinates must be normalized by the dimensions of the image (i.e. have values between 0 and 1)
- Class numbers are zero-indexed (start from 0)
Here is a good tutorial on YOLO if you want to dig deeper: https://blog.paperspace.com/train-yolov5-custom-data/
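To make that concrete, here is a tiny sketch that converts the first label line above into pixel-space corner coordinates. The 640x640 image size is made up purely for illustration; the real code later reads the size from the image itself.

# A made-up image size, just for illustration
img_w, img_h = 640, 640

# First label line from the file above: class 2 ('Shoe')
line = "2 0.36851637065410614 0.8876559734344482 0.08369550108909607 0.1444988250732422"
cls, xc, yc, bw, bh = line.split()
xc, yc, bw, bh = float(xc), float(yc), float(bw), float(bh)

# Top-left and bottom-right corners in pixels
x0 = int((xc - bw / 2) * img_w)
y0 = int((yc - bh / 2) * img_h)
x1 = int((xc + bw / 2) * img_w)
y1 = int((yc + bh / 2) * img_h)
print(cls, (x0, y0, x1, y1))  # '2' (209, 521, 262, 614) for a 640x640 image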
Now load your image:
import cv2
image_BGR = cv2.imread('/kaggle/input/bounding-to-seg-test/000027.jpg')
Let us display our bounding boxes. First we have to read in the labels; here is the function I will be using:
# Reading the annotation txt file that has bounding box coordinates in YOLO format
def getLabels(labelPath):
    with open(labelPath) as f:
        # Preparing list for annotations of BB (bounding boxes)
        labels = []
        for line in f:
            labels += [line.rstrip()]
    return labels
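Quick sanity check: getLabels just returns the raw lines of the label file as strings, one per object. The label path below is my guess at where a matching .txt would sit next to the image on Kaggle, so adjust it to your own paths.

# Hypothetical path: a .txt with the same stem as the image
labels = getLabels('/kaggle/input/bounding-to-seg-test/000027.txt')
print(len(labels))   # 7 objects in this example
print(labels[0])     # '2 0.36851637065410614 0.8876559734344482 ...'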
Let us plot our classes:
import matplotlib.pyplot as plt

# Going through all BB
def readLabelBB(labels, w, h):
    parsedLabels = []
    for i in range(len(labels)):
        bb_current = labels[i].split()
        class_index = int(bb_current[0])
        x_center, y_center = int(float(bb_current[1]) * w), int(float(bb_current[2]) * h)
        box_width, box_height = int(float(bb_current[3]) * w), int(float(bb_current[4]) * h)
        parsedLabels.append((class_index, x_center, y_center, box_width, box_height))
    return parsedLabels

def plotLabels(image, labels):
    h, w = image.shape[:2]
    parsedLabels = readLabelBB(labels, w, h)
    for i in range(len(parsedLabels)):
        class_index, x_center, y_center, box_width, box_height = parsedLabels[i]
        # From the YOLO format we can get the top-left corner coordinates,
        # x_min and y_min
        x_min = int(x_center - (box_width / 2))
        y_min = int(y_center - (box_height / 2))
        # Drawing the bounding box on the original image
        cv2.rectangle(image, (x_min, y_min), (x_min + box_width, y_min + box_height), [172, 10, 127], 2)
        # Preparing text with the class name for the current bounding box
        class_current = 'Class: {}'.format(Objects[class_index])
        # Putting the class name on the original image
        cv2.putText(image, class_current, (x_min, y_min - 5), cv2.FONT_HERSHEY_COMPLEX, 0.7, [172, 10, 127], 2)

    # Plotting this example
    # Setting default size of the plot
    plt.rcParams['figure.figsize'] = (15, 15)
    # Initializing the plot
    fig = plt.figure()
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.axis('off')
    plt.title('Fashion Classes', fontsize=18)
    # Showing the plot
    plt.show()
# Plot our classes
labels = getLabels('path-to-labels')
plotLabels(image_BGR, labels)
Here are our classes. It looks like it mistook a vent for a bag, but that is fine. I used a pre-trained model to create my dataset. Very ingenious way to create 60,000 labeled images for 60 dollars. I might make a blog post on that later.

Step 4: Let’s write our bounding-box-to-segmentation code
For this part we need to convert each YOLO box into a list of four values: the flattened coordinates of the top-left and bottom-right corners of the bounding box, i.e. [x0, y0, x1, y1] in pixels. To do this we will use the following function.
def getConvertedBoxes(labels, image_width, image_height):
    parsedLabels = []
    for i in range(len(labels)):
        bb_current = labels[i].split()
        x_center, y_center = float(bb_current[1]), float(bb_current[2])
        box_width, box_height = float(bb_current[3]), float(bb_current[4])
        # Convert to top-left and bottom-right coordinates
        x0 = int((x_center - box_width / 2) * image_width)
        y0 = int((y_center - box_height / 2) * image_height)
        x1 = int((x_center + box_width / 2) * image_width)
        y1 = int((y_center + box_height / 2) * image_height)
        parsedLabels.append([x0, y0, x1, y1])
    return parsedLabels
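As a quick check, calling it looks like this (a small sketch; labels and image_BGR are the ones we loaded earlier, and the width and height come straight from the OpenCV image):

# image_BGR was read with cv2.imread earlier; its shape is (height, width, channels)
h, w = image_BGR.shape[:2]
convertedBoxes = getConvertedBoxes(labels, w, h)
print(convertedBoxes[0])  # [x0, y0, x1, y1] for the first object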
After converting them, we will plot them. We will use the following plot functions from the SAM demo notebook; since our boxes are now in the format it expects, they work without modification.
import matplotlib.pyplot as plt

def show_box(box, ax):
    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0, 0, 0, 0), lw=2))

def show_boxes_on_image(raw_image, boxes):
    plt.figure(figsize=(10, 10))
    plt.imshow(raw_image)
    for box in boxes:
        show_box(box, plt.gca())
    plt.axis('on')
    plt.show()
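The demo notebook also defines a show_mask helper that we’ll need later to overlay the predicted masks. Here is a version of it, lightly adapted from that notebook, so the later plotting code runs:

import numpy as np

def show_mask(mask, ax, random_color=False):
    if random_color:
        # Random translucent color for each mask
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30 / 255, 144 / 255, 255 / 255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)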
from PIL import Image

path = "path-to-image"
raw_image = Image.open(path).convert("RGB")
w, h = raw_image.size
inputBoxes = getConvertedBoxes(labels, w, h)
show_boxes_on_image(raw_image, inputBoxes)
Just like magic:

So now let’s input the boxes into SAM and get the masks. We’ll use the SamPredictor from the segment_anything repo:
import sys
sys.path.append("..")
from segment_anything import sam_model_registry, SamPredictor
sam_checkpoint = "sam_vit_h_4b8939.pth"
model_type = "vit_h"
device = "cuda"
sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)
predictor = SamPredictor(sam)
image = cv2.imread('/notebooks/images/000027.jpg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# The predictor needs the image embedded before we can prompt it with boxes
predictor.set_image(image)

Step 5: Get Masks
We got the bounding boxes for these masks by converting the YOLO format into the corner format SAM expects; take a look at Step 4 above to see the code that did that.
Here are the bounding boxes for this picture.
bounding_boxes = [[209, 532, 262, 626], [256, 444, 300, 531], [213, 258, 362, 401], [200, 96, 376, 623], [247, 172, 371, 265], [397, 409, 496, 484], [184, 261, 253, 327]]
To input these boxes into the SAM predictor, we have to convert them into a tensor.
input_boxes = torch.tensor(bounding_boxes, device=predictor.device)
After doing so, extract the masks:
transformed_boxes = predictor.transform.apply_boxes_torch(input_boxes, image.shape[:2])
masks, _, _ = predictor.predict_torch(
    point_coords=None,
    point_labels=None,
    boxes=transformed_boxes,
    multimask_output=False,
)
plt.figure(figsize=(10, 10))
plt.imshow(image)
for mask in masks:
    show_mask(mask.cpu().numpy(), plt.gca(), random_color=True)
for box in input_boxes:
    show_box(box.cpu().numpy(), plt.gca())
plt.axis('off')
plt.show()
Here are the results:
Here are the masks without the person mask, which was coloring everything green:

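If you want to reproduce that second plot, here is one way to do it: a small sketch that skips any box whose label class is 0 ('Person' in the Objects list). It assumes the order of bounding_boxes matches the order of the lines in labels, which is how getConvertedBoxes builds them.

# Class index for each box, in the same order as bounding_boxes
class_ids = [int(line.split()[0]) for line in labels]

plt.figure(figsize=(10, 10))
plt.imshow(image)
for i, mask in enumerate(masks):
    if class_ids[i] == 0:  # skip the 'Person' mask
        continue
    show_mask(mask.cpu().numpy(), plt.gca(), random_color=True)
plt.axis('off')
plt.show()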
It worked!!!!!!!!!
Here is the link to the full github repo: conversion_code