Abstract
To improve instance-level detection/segmentation performance, existing self-supervised and semi-supervised methods extract either task-unrelated or task-specific training signals from unlabeled data. We show that these two approaches, at the two extreme ends of the task-specificity spectrum, are suboptimal for the task performance. Utilizing too little task-specific training signals causes underfitting to the ground-truth labels of downstream tasks, while the opposite causes overfitting to the ground-truth labels. To this end, we propose a novel Class-Agnostic Semi-Supervised Learning (CA-SSL) framework to achieve a more favorable task-specificity balance in extracting training signals from unlabeled data. CA-SSL has three training stages that act on either ground-truth labels (labeled data) or pseudo labels (unlabeled data). This decoupling strategy avoids the complicated scheme in traditional SSL methods that balances the contributions from both data types. Especially, we introduce a warmup training stage to achieve a more optimal balance in task specificity by ignoring class information in the pseudo labels, while preserving localization training signals. As a result, our warmup model can better avoid underfitting/overfitting when fine-tuned on the ground-truth labels in detection and segmentation tasks. Using 3.6M unlabeled data, we achieve a significant performance gain of \(4.7\%\) over ImageNet-pretrained baseline on FCOS object detection. In addition, our warmup model demonstrates excellent transferability to other detection and segmentation frameworks.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Berthelot, D., et al.: MixMatch: a holistic approach to semi-supervised learning. In: NeurlPS (2019)
et al, R.: Not all unlabeled data are equal: learning to weight data in semi-supervised learning. In: NeurlPS (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chen, K., et al.: Hybrid task cascade for instance segmentation. In: CVPR (2019)
Chen, K., et al.: MMDetection: open MMLab detection toolbox and benchmark (2019)
Chen, X., Fan, H., Girshick, R., He, K.: Improved baselines with momentum contrastive learning. arXiv (2020)
Chen, X., Girshick, R., He, K., Dollár, P.: Tensormask: A foundation for dense object segmentation. In: ICCV (2019)
Chen, X., He, K.: Exploring simple Siamese representation learning. In: CVPR (2021)
Chen, Y., et al.: Scale-aware automatic augmentation for object detection. In: CVPR (2021)
Cheng, B., Schwing, A.G., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurlPS (2021)
Chu, R., Sun, Y., Li, Y., Liu, Z., Zhang, C., Wei, Y.: Vehicle re-identification with viewpoint-aware metric learning. In: ICCV (2019)
Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)
Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. In: NeurIPS (2016)
Dai, Z., Cai, B., Lin, Y., Chen, J.: UP-DETR: unsupervised pre-training for object detection with transformers. In: CVPR (2021)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
Dollár, P., Zitnick, C.L.: Fast edge detection using structured forests. PAMI 37, 1558–1570 (2015)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR (2020)
He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: ICCV (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Hénaff, O.J., Koppula, S., Alayrac, J.B., van den Oord, A., Vinyals, O., Carreira, J.: Efficient visual pretraining with contrastive detection. In: ICCV (2021)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: CVPR (2018)
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)
Jeong, J., Lee, S., Kim, J., Kwak, N.: Consistency-based semi-supervised learning for object detection (2019)
Kim, D., Lin, T.Y., Angelova, A., Kweon, I.S., Kuo, W.: Learning open-world object proposals without learning to classify. arXiv (2021)
Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR (2019)
Krasin, I., et al.: OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset (2016). http://github.com/openimages
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NeurIPS (2012)
Kuznetsova, A., et al.: The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. IJCV 128, 1956–1981 (2020). https://doi.org/10.1007/s11263-020-01316-z
Li, Y., Huang, D., Qin, D., Wang, L., Gong, B.: Improving object detection with Selective self-supervised self-training. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12374, pp. 589–607. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58526-6_35
Li, Y., et al.: Fully convolutional networks for panoptic segmentation with point-based supervision. arXiv (2021)
Lin, T., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: CVPR (2017)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV (2017)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path aggregation network for instance segmentation. In: CVPR (2018)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: ICCV (2021)
Morrison, D., et al.: Cartman: the low-cost cartesian manipulator that won the amazon robotics challenge. In: ICRA (2018)
Mottaghi, R., et al.: The role of context for object detection and semantic segmentation in the wild. In: CVPR (2014)
Qi, L., Jiang, L., Liu, S., Shen, X., Jia, J.: Amodal instance segmentation with KINS dataset. In: CVPR (2019)
Qi, L., et al.: Multi-scale aligned distillation for low-resolution detection. In: CVPR (2021)
Qi, L., et al.: Open-world entity segmentation. arXiv (2021)
Qi, L., Zhang, X., Chen, Y., Chen, Y., Sun, J., Jia, J.: PointINS: point-based instance segmentation. arXiv (2020)
Radosavovic, I., Dollár, P., Girshick, R., Gkioxari, G., He, K.: Data distillation: towards omni-supervised learning. In: CVPR (2018)
Ramanathan, V., Wang, R., Mahajan, D.: PreDet: large-scale weakly supervised pre-training for detection. In: ICCV (2021)
Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
Sharma, A., Khan, N., Mubashar, M., Sundaramoorthi, G., Torr, P.: Class-agnostic segmentation loss and its application to salient object detection and segmentation. In: ICCV (2021)
Shu, G.: Human detection, tracking and segmentation in surveillance video (2014)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
Sohn, K., Zhang, Z., Li, C.L., Zhang, H., Lee, C.Y., Pfister, T.: A simple semi-supervised learning framework for object detection. arXiv (2020)
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: CVPR (2019)
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, inception-ResNet and the impact of residual connections on learning. arXiv (2016)
Szegedy, C., et al.: Going deeper with convolutions. In: CVPR (2015)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: CVPR (2016)
Tang, P., Ramaiah, C., Wang, Y., Xu, R., Xiong, C.: Proposal learning for semi-supervised object detection. In: WACV (2021)
Tian, Z., Shen, C., Chen, H.: Conditional convolutions for instance segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 282–298. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_17
Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: ICCV (2019)
Wang, K., Yan, X., Zhang, D., Zhang, L., Lin, L.: Towards human-machine cooperation: self-supervised sample mining for object detection. In: CVPR (2018)
Wang, W., Feiszli, M., Wang, H., Tran, D.: Unidentified video objects: a benchmark for dense, open-world segmentation. arXiv (2021)
Wang, X., Zhang, R., Shen, C., Kong, T., Li, L.: Dense contrastive learning for self-supervised visual pre-training. In: CVPR (2021)
Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2 (2019). http://github.com/facebookresearch/detectron2
Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 472–487. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1_29
Xie, E., et al.: DetCo: unsupervised contrastive learning for object detection. In: ICCV (2021)
Xie, E., et al.: PolarMask: single shot instance segmentation with polar representation. In: CVPR (2020)
Xu, M., et al.: End-to-end semi-supervised object detection with soft teacher. In: ICCV (2021)
Zhang, R., Tian, Z., Shen, C., You, M., Yan, Y.: Mask encoding for single shot instance segmentation. In: CVPR (2020)
Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., Torralba, A.: Places: a 10 million image database for scene recognition. TPAMI 40, 1452–1464 (2017)
Zhou, X., Wang, D., Krähenbühl, P.: Objects as points (2019)
Zoph, B., et al.: Rethinking pre-training and self-training. In: NeurIPS (2020)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Qi, L. et al. (2022). CA-SSL: Class-Agnostic Semi-Supervised Learning for Detection and Segmentation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13691. Springer, Cham. https://doi.org/10.1007/978-3-031-19821-2_4
Download citation
DOI: https://doi.org/10.1007/978-3-031-19821-2_4
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19820-5
Online ISBN: 978-3-031-19821-2
eBook Packages: Computer ScienceComputer Science (R0)