For this tutorial, we will be finetuning a pre-trained Mask R-CNN model on the Penn-Fudan Database for Pedestrian Detection and Segmentation. It contains 170 images with 345 instances of pedestrians, and we will use it to illustrate how to use the new features in torchvision to train an instance segmentation model on a custom dataset.

## Defining the Dataset

The reference scripts for training object detection, instance segmentation and person keypoint detection allow for easily adding new custom datasets. The dataset should inherit from the standard `torch.utils.data.Dataset` class and implement `__len__` and `__getitem__`.

The only specificity that we require is that the dataset `__getitem__` should return:

- image: a PIL Image of size (H, W)
- target: a dict containing the following fields
  - `boxes` (`FloatTensor[N, 4]`): the coordinates of the N bounding boxes in `[x0, y0, x1, y1]` format, ranging from 0 to W and 0 to H
  - `labels` (`Int64Tensor[N]`): the label for each bounding box
  - `image_id` (`Int64Tensor[1]`): an image identifier. It should be unique between all the images in the dataset, and is used during evaluation
  - `area` (`Tensor[N]`): the area of the bounding box. This is used during evaluation with the COCO metric, to separate the metric scores between small, medium and large boxes.
  - `iscrowd` (`UInt8Tensor[N]`): instances with `iscrowd=True` will be ignored during evaluation
  - (optionally) `masks` (`UInt8Tensor[N, H, W]`): the segmentation masks for each one of the objects
  - (optionally) `keypoints` (`FloatTensor[N, K, 3]`): for each one of the N objects, it contains the K keypoints in `[x, y, visibility]` format, defining the object. `visibility=0` means that the keypoint is not visible. Note that for data augmentation, the notion of flipping a keypoint is dependent on the data representation, and you should probably adapt references/detection/transforms.py for your new keypoint representation

If your dataset's `__getitem__` returns targets in the above format, it will work with the reference scripts for both training and evaluation, and the evaluation will use the scripts from pycocotools.

Additionally, if you want to use aspect ratio grouping during training (so that each batch only contains images with similar aspect ratios), then it is recommended to also implement a `get_height_and_width` method, which returns the height and the width of the image. If this method is not provided, we query all elements of the dataset via `__getitem__`, which loads the image in memory and is slower than if a custom method is provided.
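For illustration, here is a minimal sketch of such a method on a hypothetical path-backed dataset (the class and attribute names are ours, not from the reference scripts). Opening the file with PIL reads only the header, so the size is available without decoding the whole image:

```python
from PIL import Image


class PathBackedDataset:
    """Hypothetical dataset that stores image file paths."""

    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def get_height_and_width(self, idx):
        # PIL only parses the header here, avoiding a full decode
        with Image.open(self.paths[idx]) as img:
            width, height = img.size  # PIL reports (width, height)
        return height, width
```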

## Writing a custom dataset for PennFudan

Let’s write a dataset for the PennFudan dataset. After downloading and extracting the zip file, we have the following folder structure:
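Roughly, the extracted archive is laid out as below (the file names shown are illustrative; the real dataset has 170 image/mask pairs):

```
PennFudanPed/
  PedMasks/
    FudanPed00001_mask.png
    FudanPed00002_mask.png
    ...
  PNGImages/
    FudanPed00001.png
    FudanPed00002.png
    ...
```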

Here is one example of a pair of images and segmentation masks

So each image has a corresponding segmentation mask, where each color corresponds to a different instance. Let's write a `torch.utils.data.Dataset` class for this dataset.
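One possible implementation, closely following the torchvision reference tutorial (the class name and the exact box-extraction helper are one way to do it, not the only one):

```python
import os

import numpy as np
import torch
from PIL import Image


class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to ensure they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we do not convert the mask to RGB, because each integer
        # value in the mask corresponds to a different instance
        mask = np.array(Image.open(mask_path))
        obj_ids = np.unique(mask)[1:]  # the first id is the background, drop it
        # split the color-encoded mask into a set of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        boxes = []
        for m in masks:
            pos = np.nonzero(m)
            xmin, xmax = np.min(pos[1]), np.max(pos[1])
            ymin, ymax = np.min(pos[0]), np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        num_objs = len(obj_ids)
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one foreground class (person)
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {
            "boxes": boxes,
            "labels": labels,
            "masks": masks,
            "image_id": torch.tensor([idx]),
            "area": area,
            "iscrowd": iscrowd,
        }

        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target

    def __len__(self):
        return len(self.imgs)
```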

That’s all for the dataset. Now let’s define a model that can perform predictions on this dataset.

In this tutorial, we will be using Mask R-CNN, which is built on top of Faster R-CNN. Faster R-CNN is a model that predicts both bounding boxes and class scores for potential objects in the image.

Mask R-CNN adds an extra branch into Faster R-CNN, which also predicts segmentation masks for each instance.

There are two common situations where one might want to modify one of the available models in the torchvision model zoo. The first is when we want to start from a pre-trained model and just finetune the last layer. The other is when we want to replace the backbone of the model with a different one (for faster predictions, for example).

Let's see how we would do one or the other in the following sections.

### 1 - Finetuning from a pretrained model

Let’s suppose that you want to start from a model pre-trained on COCO and want to finetune it for your particular classes. Here is a possible way of doing it:

### 2 - Modifying the model to add a different backbone


### An Instance segmentation model for PennFudan Dataset

In our case, we want to fine-tune from a pre-trained model, given that our dataset is very small, so we will be following approach number 1.

Here we want to also compute the instance segmentation masks, so we will be using Mask R-CNN:

That's it; this will make the model ready to be trained and evaluated on your custom dataset.

## Putting everything together

In references/detection/, we have a number of helper functions to simplify training and evaluating detection models. Here, we will use references/detection/engine.py, references/detection/utils.py and references/detection/transforms.py. Just copy them to your folder and use them here.

Let’s write some helper functions for data augmentation / transformation:

## Testing forward() method (Optional)

Before iterating over the dataset, it’s good to see what the model expects during training and inference time on sample data.

Let’s now write the main function which performs the training and the validation:
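A sketch of such a main function, following the reference tutorial's conventions. It assumes a `PennFudanDataset` class, a `get_transform` helper and a `get_model_instance_segmentation` factory as developed in this tutorial, plus the `engine.py` and `utils.py` helpers copied from references/detection (their imports sit inside `main()` so that merely defining the function does not require them):

```python
def main():
    import torch
    import utils  # copied from references/detection
    from engine import train_one_epoch, evaluate  # copied from references/detection

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    num_classes = 2  # person + background

    # use our dataset and the transforms helper defined earlier
    dataset = PennFudanDataset("PennFudanPed", get_transform(train=True))
    dataset_test = PennFudanDataset("PennFudanPed", get_transform(train=False))

    # split the dataset: keep the last 50 images for testing
    indices = torch.randperm(len(dataset)).tolist()
    dataset = torch.utils.data.Subset(dataset, indices[:-50])
    dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

    data_loader = torch.utils.data.DataLoader(
        dataset, batch_size=2, shuffle=True, num_workers=4,
        collate_fn=utils.collate_fn)
    data_loader_test = torch.utils.data.DataLoader(
        dataset_test, batch_size=1, shuffle=False, num_workers=4,
        collate_fn=utils.collate_fn)

    model = get_model_instance_segmentation(num_classes)
    model.to(device)

    # SGD on the trainable parameters, with a stepwise LR schedule
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
    lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

    num_epochs = 10
    for epoch in range(num_epochs):
        # train for one epoch, printing every 10 iterations
        train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
        lr_scheduler.step()
        # evaluate on the test dataset
        evaluate(model, data_loader_test, device=device)
```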

You should get as output for the first epoch:

So after one epoch of training, we obtain a COCO-style mAP of 60.6, and a mask mAP of 70.4.

After training for 10 epochs, I got the following metrics

But what do the predictions look like? Let's take one image from the dataset and check.

The trained model predicts 9 instances of the person class in this image; let's look at a couple of them:
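As a sketch (the helper name and score threshold are ours, not from the tutorial), converting one prediction into viewable PIL images might look like this, given a trained `model`, an image tensor `img` from the test set, and the `device` used above:

```python
import torch
from PIL import Image


def visualize_prediction(model, img, device, score_threshold=0.5):
    """Run the model on one CHW float image tensor; return PIL images."""
    model.eval()
    with torch.no_grad():
        prediction = model([img.to(device)])[0]
    # the input image: CHW float in [0, 1] -> HWC uint8
    image = Image.fromarray(img.mul(255).permute(1, 2, 0).byte().cpu().numpy())
    masks = []
    for score, mask in zip(prediction["scores"], prediction["masks"]):
        if score >= score_threshold:
            # each predicted mask has shape [1, H, W] with values in [0, 1]
            masks.append(Image.fromarray(mask[0].mul(255).byte().cpu().numpy()))
    return image, masks
```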

The results look pretty good!

## Wrapping up

In this tutorial, you have learned how to create your own training pipeline for instance segmentation models, on a custom dataset. For that, you wrote a torch.utils.data.Dataset class that returns the images and the ground truth boxes and segmentation masks. You also leveraged a Mask R-CNN model pre-trained on COCO train2017 in order to perform transfer learning on this new dataset.

For a more complete example, which includes multi-machine / multi-GPU training, check references/detection/train.py, which is present in the torchvision repo.