Description
Instructions
This is a programming assignment to create, train, and test a CNN for semantic segmentation, following the FCN-32s and FCN-16s versions described in https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Long_Fully_Convolutional_Networks_2015_CVPR_paper.pdf.
Architecture
Construct a modified version of the Fully Convolutional Network FCN-32s. The paper describes a version using VGG-16 (though some figures are based on AlexNet); in this assignment, you are asked to replace the VGG-16 backbone with the ResNet-18 model specified in Figure 1 (from the ResNet paper).
Figure 1: Architectures of ResNet models in the original ResNet paper. Building blocks are shown in brackets, with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.
You need to make the following modifications to convert the network to fully convolutional form:
- Replace the average pool layer with an average pooling layer (referred to as “avgpool”) with a kernel size of 7 x 7, stride 1, and no padding.
- Add a new convolutional layer with a kernel size of 1 x 1. The number of kernels is (number of classes + 1), including a background class; this layer computes the probabilities of each class over its spatial extent.
- Add a transpose convolution layer with stride 32 and kernel size 64 that up-samples the classifier output back to the input image size. In PyTorch you can use torch.nn.ConvTranspose2d.
As discussed in class, the average pooling layer further reduces resolution (beyond the 32x reduction). We can get around this difficulty by zero-padding the input image by 100 pixels on each side. For the upsampled output, you may need to crop the center so that it matches the input image size. You can follow the example for VGG given in the reference FCN implementation. A neater approach that avoids the 100-pixel padding was also discussed in class; you may follow that approach instead if you prefer.
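To make the modifications above concrete, here is a minimal sketch of one possible FCN-32s head under these assumptions; the class name FCN32s, the use of children()[:-2] to drop ResNet's own avgpool and fc layers, and the center-crop arithmetic are illustrative choices, not the required implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class FCN32s(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        backbone = resnet18(pretrained=True)
        # Keep everything up to conv5_x (layer4); drop ResNet's own avgpool and fc.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.avgpool = nn.AvgPool2d(kernel_size=7, stride=1, padding=0)
        self.classifier = nn.Conv2d(512, num_classes + 1, kernel_size=1)
        self.upsample = nn.ConvTranspose2d(num_classes + 1, num_classes + 1,
                                           kernel_size=64, stride=32)

    def forward(self, x):
        h, w = x.shape[2:]
        x = F.pad(x, (100, 100, 100, 100))  # zero-pad the input by 100 on each side
        x = self.features(x)                # roughly 1/32 of the padded size
        x = self.avgpool(x)
        x = self.classifier(x)              # per-class scores at low resolution
        x = self.upsample(x)                # back up by a factor of 32
        top = (x.shape[2] - h) // 2         # center-crop to the original size
        left = (x.shape[3] - w) // 2
        return x[:, :, top:top + h, left:left + w]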
You can obtain a pre-trained ResNet-18 with torchvision.models.resnet18(pretrained=True) and perform the forward pass only up to, but not including, the final linear layer. Once you have the modified FCN-32s structure built on the pretrained ResNet-18, we would like you to construct a modified version of FCN-16s as well. In FCN-16s, you additionally take the output of a previous layer (the final layer of conv4_x) and combine it with the upsampled feature from avgpool; the procedure is defined in the FCN paper.
Note that for FCN-16s you need the output of an intermediate layer, and the default module obtained from torchvision doesn’t provide a function to return it. There are two ways to go about this:
1. Copy the whole forward function of ResNet-18 and create a new function that performs the forward pass only up to the required layer, or change the forward function to return the outputs of all the layers you need. For example, the default torchvision forward (simplified) is:

def forward(self, x):
    x1 = self.layer1(x)
    x2 = self.layer2(x1)
    return x2
After modification, it should be

def forward(self, x):
    x1 = self.layer1(x)
    x2 = self.layer2(x1)
    return x2, x1
2. Use class IntermediateLayerGetter. Please refer to this discussion.
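As a minimal sketch of the second option, assuming torchvision's IntermediateLayerGetter utility (note that it lives in the private module torchvision.models._utils); in torchvision's ResNet, layer3 corresponds to conv4_x and layer4 to conv5_x in Figure 1:

import torch
from torchvision.models import resnet18
from torchvision.models._utils import IntermediateLayerGetter

backbone = resnet18(pretrained=True)
# Map torchvision module names to the names we want in the returned dict.
getter = IntermediateLayerGetter(backbone,
                                 return_layers={"layer3": "conv4_out", "layer4": "conv5_out"})

feats = getter(torch.randn(1, 3, 375, 1242))     # returns an OrderedDict of feature maps
print(feats["conv4_out"].shape, feats["conv5_out"].shape)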
Dataset
In this assignment, you’ll use the Kitti dataset for semantic segmentation. You can download the data from http://www.cvlibs.net/datasets/kitti/eval_semseg.php?benchmark=semantics2015. You’ll need to:
1. Go to the website.
2. Download the dataset listed in Figure 2.
3. You will need to provide your email to get a download link.
4. Once downloaded, you can upload the zip file to Colab.
5. In the execution cell, run !unzip followed by the path to the file (e.g. !unzip data_semantics.zip) to unzip it.
6. For visualization, you also need to download the development kit and unzip it in the same way as above.
Figure 2: The dataset enclosed in the red box is what we need for this experiment.
You should now have
- data_semantics, which contains images and ground truth, and
- devkit_semantics, which contains metadata of Kitti.
data_semantics has two folders: training and testing; we will not use the testing folder. In the training folder, you can find four folders: image_2, instance, semantic, and semantic_rgb. We will not use the instance folder. For model training, use the image_2 and semantic folders.
image_2 contains 200 images and semantic contains the corresponding ground truth. You can use ‘opencv’ or ‘skimage’ to read the images and ground truth. The ground truth gives the category of each pixel. There are 35 categories (see the figure below) defined in Kitti (not all of them exist in a single image, but we need not care about the missing classes); you can find the corresponding labels in the file devkit_semantics/devkit/helpers/labels.py. In this label file, we only care about 3 columns: “name”, “id”, and “color”; “name” is the human-readable label for each class, “id” is the label for each class in the ground truth, and “color” gives the mapping from class to RGB color. For visualization, you will need to map your prediction to colors according to the (“id”, “color”) correspondence. In our experiment, we will use the first 34 categories; it is possible that for some specific categories the number of instances is 0. Please mark these as N/A in your report and do not take them into account for the average IoU.
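As a short sketch of reading a ground-truth image and looking up class names, assuming the devkit helper exposes a labels list with name, id, and color fields (as in the Cityscapes-style labels.py) and that the example file name used here exists:

import sys
import numpy as np
from skimage import io

sys.path.append("devkit_semantics/devkit/helpers")
from labels import labels                       # entries carrying name, id, and color

id2name = {label.id: label.name for label in labels}

# Assumed example file name; any image in the semantic folder works the same way.
gt = io.imread("data_semantics/training/semantic/000000_10.png")   # H x W array of class ids
for cid in np.unique(gt):
    print(int(cid), id2name.get(int(cid), "unknown"))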
Kitti does not offer ground truth for the testing set, so for this assignment you should split the original training set into training/validation/testing sets. Sort the images alphabetically and split them by a 70%/15%/15% ratio for training/validation/testing. (Hint: os.listdir() may return files in shuffled order; use sorted() after listing the files.) The folder semantic_rgb contains the visual ground truth of the segmentations. You will use it for visual comparison with your predictions.
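A minimal sketch of this split, assuming the folder layout above; the variable names are placeholders:

import os

img_dir = "data_semantics/training/image_2"
gt_dir = "data_semantics/training/semantic"      # used later when building the dataset

names = sorted(os.listdir(img_dir))              # sort so the split is deterministic
n = len(names)                                   # 200 images
n_train, n_val = int(0.70 * n), int(0.15 * n)

train_names = names[:n_train]
val_names = names[n_train:n_train + n_val]
test_names = names[n_train + n_val:]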
PyTorch does not offer a predefined dataset class for Kitti. You will need to implement a KittiDataset class by yourself.
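A minimal KittiDataset sketch, assuming images and ground-truth masks share file names across image_2 and semantic and reusing the split lists above; transforms and any resizing are left out:

import os
import numpy as np
import torch
from skimage import io
from torch.utils.data import Dataset

class KittiDataset(Dataset):
    def __init__(self, img_dir, gt_dir, file_names):
        self.img_dir, self.gt_dir = img_dir, gt_dir
        self.file_names = file_names                 # e.g. train_names from the split above

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, idx):
        name = self.file_names[idx]
        img = io.imread(os.path.join(self.img_dir, name))   # H x W x 3, uint8
        gt = io.imread(os.path.join(self.gt_dir, name))     # H x W, integer class ids
        img = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
        gt = torch.from_numpy(gt.astype(np.int64))          # long tensor for CrossEntropyLoss
        return img, gt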
Training
Train the network using the images in the train set described in the Dataset section with your own hyper-parameters. Please note that the number of epochs should be no more than 50; otherwise you may need to adjust your learning rate for your setting.
Use cross entropy loss over the individual pixels as the loss function. You could use torch.nn.CrossEntropyLoss.
It should not be necessary to do data augmentation, as there are many samples in each image. However, if you choose to experiment with augmentations, note that you would need to perform the same augmentation on the ground-truth segmentation mask as well. For example, in a classification setting a flipped dog is still a dog, but in the image segmentation setting a flipped dog requires a flipped mask. Record the IoU and loss after each iteration so that you can monitor them and plot them to show results.
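A minimal training-loop sketch, reusing the hypothetical FCN32s, KittiDataset, and split names from the earlier sketches; the optimizer, learning rate, and other hyper-parameters are placeholders, and batch size 1 is used because Kitti image sizes vary slightly:

import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = FCN32s(num_classes=34).to(device)            # hypothetical module from the earlier sketch
train_loader = DataLoader(KittiDataset(img_dir, gt_dir, train_names),
                          batch_size=1, shuffle=True)

criterion = torch.nn.CrossEntropyLoss()              # cross entropy over individual pixels
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

loss_history = []
for epoch in range(50):                              # no more than 50 epochs
    for imgs, gts in train_loader:
        imgs, gts = imgs.to(device), gts.to(device)
        logits = model(imgs)                         # N x (num_classes + 1) x H x W scores
        loss = criterion(logits, gts)                # gts: N x H x W long tensor of class ids
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loss_history.append(loss.item())             # record for the loss plot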
Evaluation metric
Use the following two evaluation metrics:
- Pixel-level intersection-over-union (IoU): Pixel-level IoU = TP/(TP+FP+FN), where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively. Pixel-level IoU should be computed for each class separately (treat the other classes as negative when computing for one class). NOTE: You should compute IoU over all testing images, not on a single image.
- Mean Intersection-over-Union (mIoU): The simple average of the per-class pixel-level IoUs; it reflects the model’s generality across all classes. Please write your own code to evaluate network performance (see the sketch after this list).
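A minimal sketch of these metrics, assuming predictions and ground truth for the whole test set are accumulated before dividing, so that IoU is computed over all testing images; classes that never occur come out as N/A (NaN) and are ignored in the mean:

import numpy as np

def evaluate_iou(preds, gts, num_classes=34):
    # preds, gts: lists of H x W arrays of class ids, one pair per test image.
    tp = np.zeros(num_classes)
    fp = np.zeros(num_classes)
    fn = np.zeros(num_classes)
    for pred, gt in zip(preds, gts):
        for c in range(num_classes):
            tp[c] += np.sum((pred == c) & (gt == c))
            fp[c] += np.sum((pred == c) & (gt != c))
            fn[c] += np.sum((pred != c) & (gt == c))
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)  # N/A when a class never appears
    miou = np.nanmean(iou)                                        # average over the valid classes
    return iou, miou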
Results
Map your output labels to color images, which should look like the images in the semantic_rgb folder. The color mapping is provided in the file devkit_semantics/devkit/helpers/labels.py.
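A minimal sketch of this mapping, assuming the same Cityscapes-style labels helper as above; only non-negative ids are used here:

import sys
import numpy as np

sys.path.append("devkit_semantics/devkit/helpers")
from labels import labels                              # entries carrying id and color

id2color = {label.id: label.color for label in labels if label.id >= 0}

def colorize(pred):
    # pred: H x W array of class ids -> H x W x 3 uint8 color image.
    out = np.zeros((*pred.shape, 3), dtype=np.uint8)
    for cid, color in id2color.items():
        out[pred == cid] = color
    return out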
SUBMISSION
You should turn in a PDF report along with your code in a compressed file with the following components:
- A brief description of the programs you write, including the source listing.
- The evolution of the loss function over training steps.
- A summary and discussion of the results, including the effects of parameter choices. Compare the two versions of the modified FCN (32s and 16s). Include visualizations of the results; show some successful examples and some failure cases.