CutMix : An Augmentation Technique for Image Classification and Object Detection

Aanshi Patwari
Published in Clique Community

Jul 8, 2020

Data augmentation generates additional training data without collecting new samples; it increases the diversity of the dataset and helps reduce overfitting of the model.

CutMix is an augmentation strategy whose name comes from combining two regional dropout methods: CutOut and MixUp. The method cuts a patch from one training image and pastes it onto another image of the pair, and the ground-truth labels are mixed in proportion to the area of the patch. This retains the regularization effect of regional dropout while outperforming state-of-the-art augmentation strategies.

CutMix applied training images

CutMix is similar to MixUp in that both mix a pair of samples by interpolating the images and their labels. However, MixUp's globally blended images are ambiguous and unnatural, which makes prediction harder for the model, whereas CutMix keeps every pixel locally natural while still performing linear interpolation of the one-hot labels.

It is similar to CutOut in that both remove a patch from the original image. In CutOut the removed patch is replaced by a black patch, while in CutMix it is replaced by a same-sized patch from another image.
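The difference can be made concrete with a toy sketch. Here `cut_patch` is an illustrative helper (not part of any library) that fills a removed region either CutOut-style or CutMix-style:

```python
import numpy as np

def cut_patch(image, other, x1, y1, x2, y2, mode="cutmix"):
    """Remove the region [y1:y2, x1:x2] from `image` and fill it:
    CutOut fills it with zeros (a black patch), CutMix fills it
    with the same-sized region taken from a second image."""
    out = image.copy()
    if mode == "cutout":
        out[y1:y2, x1:x2] = 0.0
    else:  # cutmix
        out[y1:y2, x1:x2] = other[y1:y2, x1:x2]
    return out

# Toy 4x4 single-channel "images": A is all ones, B is all twos.
a = np.ones((4, 4))
b = np.full((4, 4), 2.0)
cutout_result = cut_patch(a, b, 1, 1, 3, 3, mode="cutout")  # 2x2 hole of zeros
cutmix_result = cut_patch(a, b, 1, 1, 3, 3, mode="cutmix")  # 2x2 patch of twos
```

The information dropped by CutOut is simply discarded, whereas CutMix replaces it with informative pixels from another training sample.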

While CutOut and MixUp improve classification accuracy, they perform poorly on localization and object detection; CutMix improves classification accuracy, localization, and object detection alike.

Overview of results on ImageNet classification and ImageNet Localization

CutMix also increases robustness and improves uncertainty estimates by generating more unseen samples than the other two methods.

Algorithm

The aim of CutMix is to generate a new training sample (x̃, ỹ) from two distinct training samples (x1, y1) and (x2, y2). Here, x ∈ ℝ^(W×H×C) is a training image and y is its label. The generated sample is used with the original loss function to train the model. The combining operation is:

x̃ = M ⊙ x1 + (1 − M) ⊙ x2

ỹ = λy1 + (1 − λ)y2

where M ∈ {0,1}^(W×H) is a binary mask indicating where to drop out and fill in from the two images, 1 is a mask filled with ones, and ⊙ denotes element-wise multiplication. The combination ratio λ is sampled from a Beta(α, α) distribution and equals the proportion of the combined image's area that comes from the first image.
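The combining operation can be sketched in a few lines of NumPy. The bounding-box sampling below follows the paper's scheme (patch side lengths proportional to √(1 − λ), so the patch area is roughly 1 − λ of the image); the function names `rand_bbox` and `cutmix` are illustrative, not a specific library API:

```python
import numpy as np

def rand_bbox(W, H, lam):
    """Sample a patch covering roughly (1 - lam) of the image area:
    each side is scaled by sqrt(1 - lam), centered at a random point."""
    cut_w = int(W * np.sqrt(1.0 - lam))
    cut_h = int(H * np.sqrt(1.0 - lam))
    cx, cy = np.random.randint(W), np.random.randint(H)
    x1 = np.clip(cx - cut_w // 2, 0, W)
    x2 = np.clip(cx + cut_w // 2, 0, W)
    y1 = np.clip(cy - cut_h // 2, 0, H)
    y2 = np.clip(cy + cut_h // 2, 0, H)
    return x1, y1, x2, y2

def cutmix(img_a, lab_a, img_b, lab_b, alpha=1.0):
    """Generate one CutMix sample (x̃, ỹ) from two (image, one-hot label) pairs."""
    lam = np.random.beta(alpha, alpha)          # combination ratio λ
    H, W = img_a.shape[:2]
    bx1, by1, bx2, by2 = rand_bbox(W, H, lam)
    x = img_a.copy()
    x[by1:by2, bx1:bx2] = img_b[by1:by2, bx1:bx2]  # paste patch from B onto A
    # Recompute λ as the exact area ratio after clipping at the borders.
    lam = 1.0 - (bx2 - bx1) * (by2 - by1) / (W * H)
    y = lam * lab_a + (1.0 - lam) * lab_b          # mix labels by the same ratio
    return x, y, lam

np.random.seed(0)                        # reproducibility of the sketch
img_a = np.zeros((8, 8, 3))              # toy "image" A, all zeros
img_b = np.ones((8, 8, 3))               # toy "image" B, all ones
lab_a = np.array([1.0, 0.0])             # one-hot labels
lab_b = np.array([0.0, 1.0])
x_mix, y_mix, lam = cutmix(img_a, lab_a, img_b, lab_b)
# The fraction of pixels taken from B equals 1 - λ, matching the label mix.
```

Recomputing λ from the clipped box keeps the label mixing ratio exactly equal to the area ratio, which is what the formula for ỹ requires.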

Experiments

Below are the results of experiments demonstrating the capability of CutMix on image classification and weakly supervised object localization, as well as the transferability of a CutMix-pretrained model to object detection.

1. Image Classification

For this, the ImageNet-1K benchmark is used: the dataset consists of 1.2M training images and 50K validation images across 1K categories. The model is trained for 300 epochs with an initial learning rate of 0.1, decayed by a factor of 0.1 at epochs 75, 150, and 225. The batch size is set to 256 and the hyper-parameter α is set to 1. The table below compares the results of different regularization methods:
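In training, the mixed label ỹ need not be materialized explicitly: because cross-entropy is linear in the target, the loss against the mixed label equals the λ-weighted sum of the losses against the two original labels. A minimal numerical check (plain NumPy, with an assumed toy prediction vector):

```python
import numpy as np

def cross_entropy(probs, onehot):
    """Cross-entropy between a predicted distribution and a (soft) target."""
    return -np.sum(onehot * np.log(probs))

probs = np.array([0.7, 0.2, 0.1])   # toy model prediction over 3 classes
lab_a = np.array([1.0, 0.0, 0.0])   # one-hot label of image A
lab_b = np.array([0.0, 1.0, 0.0])   # one-hot label of image B
lam = 0.6                           # combination ratio λ

# Loss against the mixed label ỹ = λ·y_A + (1 − λ)·y_B ...
mixed_label_loss = cross_entropy(probs, lam * lab_a + (1 - lam) * lab_b)
# ... equals the λ-weighted sum of the two per-label losses.
weighted_loss = lam * cross_entropy(probs, lab_a) + (1 - lam) * cross_entropy(probs, lab_b)
```

This identity is why CutMix training loops typically compute `lam * criterion(out, target_a) + (1 - lam) * criterion(out, target_b)` rather than building soft labels.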

ImageNet classification results using the ResNet-50 model.

The comparison shows that CutMix outperforms the other methods, achieving the best top-1 error of 21.40%. CutMix also gives the best performance when applied at the feature level.

2. Weakly Supervised Object Localization (WSOL)

WSOL is the task of localizing target objects by training a classifier with class labels only. For localization it is important that the CNN extracts cues from the whole object rather than only from its most discriminative parts. CutMix helps the classifier extract cues from a larger part of the object than other techniques do.
Existing WSOL methods are trained and evaluated with VGG-GAP and ResNet-50 as base architectures, together with the augmentation strategies, on the CUB200-2011 and ImageNet datasets.

WSOL Results on CUB200–2011 and ImageNet dataset

The results show that CutMix achieves the highest localization accuracy, even when combined with existing WSOL methods.

3. Transferability

Since CutMix is effective at localizing the less discriminative parts of an object, a CutMix-pretrained model also performs well when used for object detection. SSD and Faster R-CNN detection models with an ImageNet-pretrained ResNet-50 backbone are fine-tuned on the Pascal VOC 2007 and 2012 trainval data and evaluated with the mAP metric.

Transferability of the pre-trained model

The results suggest that the localization effect of CutMix leads to better detection performances.

Conclusion

Training CNNs with CutMix leads to strong classification and localization ability. It adds negligible computational overhead and yields higher accuracy, improving the overall performance of the models it is used with.

The GitHub repository for CutMix: https://github.com/clovaai/CutMix-PyTorch

References

[1] S. Yun, D. Han, S. Oh, S. Chun, J. Choe, and Y. Yoo, "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features", arXiv.org, 2020. [Online]. Available: https://arxiv.org/abs/1905.04899. [Accessed: 06-Jul-2020].

[2] D. Walawalkar, "Attentive CutMix: An Enhanced Data Augmentation Approach for Deep Learning Based Image Classification", arXiv.org, 2020. [Online]. Available: https://arxiv.org/abs/2003.13048.
