
When CNN Meet with ViT: Towards Semi-supervised Learning for Multi-class Medical Image Semantic Segmentation

  • Conference paper

Computer Vision – ECCV 2022 Workshops (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13807)

Abstract

Because high-quality annotations are scarce in the medical imaging community, semi-supervised learning methods are highly valued for image semantic segmentation tasks. In this paper, an advanced consistency-aware, pseudo-label-based self-ensembling approach is presented to fully exploit the complementary strengths of the Vision Transformer (ViT) and the Convolutional Neural Network (CNN) in semi-supervised learning. The proposed framework consists of a feature-learning module, in which a ViT and a CNN enhance each other, and a guidance module that provides robust, consistency-aware supervision. In the feature-learning module, pseudo labels are inferred and used recurrently and separately by the CNN view and the ViT view, expanding the training set so that each view benefits the other. Meanwhile, a perturbation scheme is designed for the feature-learning module, and network-weight averaging is used to build the guidance module. In this way, the framework combines the feature-learning strengths of the CNN and the ViT, improves performance via dual-view co-training, and enables consistency-aware supervision in a semi-supervised manner. A topological exploration of all alternative supervision modes between the CNN and the ViT is validated in detail, identifying the most promising configuration of our method for semi-supervised medical image segmentation. Experimental results show that the proposed method achieves state-of-the-art performance on a public benchmark dataset across a variety of metrics. The code is publicly available at https://github.com/ziyangwang007/CV-SSL-MIS.
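To make the training signal described in the abstract concrete, the following is a minimal PyTorch-style sketch of how dual-view co-training with cross pseudo-label supervision and a weight-averaged guidance module can fit into a single update step. It is not the authors' released implementation (the repository linked above contains that): the toy `TinySeg` networks, the Gaussian-noise perturbation, the unit loss weights, and the EMA decay value are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted above) of CNN/ViT dual-view semi-supervised
# training: two student views supervise each other with pseudo labels on
# unlabelled images, and an EMA-averaged guidance model supplies a consistency
# target under input perturbation. TinySeg is a toy stand-in for real backbones.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySeg(nn.Module):
    """Toy segmentation network standing in for a CNN or ViT segmenter."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, num_classes, 1),
        )
    def forward(self, x):
        return self.net(x)

def ema_update(teacher, student, alpha=0.99):
    # Network-weight averaging for the guidance module (mean-teacher style).
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(alpha).add_(s, alpha=1 - alpha)

cnn_view, vit_view = TinySeg(), TinySeg()   # stand-ins for the CNN and ViT views
teacher = copy.deepcopy(cnn_view)            # EMA guidance module
opt = torch.optim.Adam(list(cnn_view.parameters()) + list(vit_view.parameters()), lr=1e-3)
ce = nn.CrossEntropyLoss()

for step in range(10):                       # toy loop on random tensors
    x_l = torch.randn(2, 1, 64, 64)          # labelled batch
    y_l = torch.randint(0, 4, (2, 64, 64))
    x_u = torch.randn(2, 1, 64, 64)          # unlabelled batch

    # Supervised loss on labelled data for both views.
    loss_sup = ce(cnn_view(x_l), y_l) + ce(vit_view(x_l), y_l)

    # Cross pseudo-label supervision: each view is trained on the other
    # view's hard predictions for the unlabelled batch.
    p_cnn, p_vit = cnn_view(x_u), vit_view(x_u)
    loss_cps = ce(p_cnn, p_vit.argmax(1).detach()) + ce(p_vit, p_cnn.argmax(1).detach())

    # Consistency with the EMA guidance module under a simple perturbation.
    with torch.no_grad():
        target = F.softmax(teacher(x_u + 0.1 * torch.randn_like(x_u)), dim=1)
    loss_cons = F.mse_loss(F.softmax(p_cnn, dim=1), target)

    loss = loss_sup + loss_cps + loss_cons
    opt.zero_grad(); loss.backward(); opt.step()
    ema_update(teacher, cnn_view)            # update the guidance module
```

In practice the two views would be full CNN and ViT segmentation networks, the consistency term would typically cover both views with the unsupervised loss weights ramped up over training; the sketch only shows how the cross pseudo-label and weight-averaged guidance signals combine in one step.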



Author information

Corresponding author

Correspondence to Ziyang Wang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 253 KB)


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, Z., Li, T., Zheng, JQ., Huang, B. (2023). When CNN Meet with ViT: Towards Semi-supervised Learning for Multi-class Medical Image Semantic Segmentation. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13807. Springer, Cham. https://doi.org/10.1007/978-3-031-25082-8_28

  • DOI: https://doi.org/10.1007/978-3-031-25082-8_28

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-25081-1

  • Online ISBN: 978-3-031-25082-8

  • eBook Packages: Computer Science (R0)
