Abstract
The Transformer has taken the computer vision field by storm in recent years and is becoming increasingly popular in both academia and industry. However, its remarkable success is largely fueled by training on massive amounts of data. In real applications, sufficient annotated data is not always available. When only a small set of labeled samples exists (a so-called tiny dataset), the Transformer performs far worse than a convolutional neural network (CNN) and occupies a large memory footprint during training. To address these issues, we build a deep fusion model that integrates a CNN with the Transformer architecture. Specifically, a convolutional stem (Conv-stem) is implanted in the first stage of the Transformer to reduce the memory footprint. The second and third stages of the Transformer encoder are then reorganized into a parallel structure that integrates CNN branches. Finally, depth-wise separable convolution blocks are appended to the encoder to enhance feature representation. Extensive experiments demonstrate the effectiveness of the proposed model, which outperforms other popular methods on tiny datasets by large margins.
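The full paper is not reproduced here, so the following PyTorch sketch is only an illustration of the three ideas the abstract names, not the authors' implementation: a convolutional stem before the first stage, parallel CNN/attention branches for the middle stages, and depth-wise separable convolution blocks appended after the encoder. All module names, channel sizes, and the additive fusion rule are assumptions.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Convolutional stem (assumed design): downsamples 4x before any
    attention, shrinking the token count and hence attention memory."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.stem(x)

class ParallelBlock(nn.Module):
    """Parallel CNN branch + self-attention branch, fused by addition.
    The abstract only says stages 2-3 become a 'parallel structure';
    the exact fusion is a guess here."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depth-wise
            nn.Conv2d(dim, dim, 1),                          # point-wise
            nn.BatchNorm2d(dim), nn.GELU(),
        )
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):                       # x: (B, C, H, W)
        conv_out = self.conv_branch(x)
        B, C, H, W = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, HW, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        attn_out = attn_out.transpose(1, 2).reshape(B, C, H, W)
        return x + conv_out + attn_out           # residual fusion (assumed)

class DWSepConvBlock(nn.Module):
    """Depth-wise separable convolution block appended after the
    encoder to strengthen local feature representation."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)
        self.bn = nn.BatchNorm2d(dim)
        self.act = nn.GELU()
    def forward(self, x):
        return x + self.act(self.bn(self.pw(self.dw(x))))

if __name__ == "__main__":
    # Toy forward pass on a CIFAR-sized input.
    net = nn.Sequential(ConvStem(3, 64), ParallelBlock(64), DWSepConvBlock(64))
    print(net(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 64, 8, 8])
```

The intuition behind such a hybrid is well established: early convolutional downsampling cuts the quadratic token cost of self-attention, while the parallel convolution branch supplies the locality and translation-equivariance priors that pure Transformers must otherwise learn from data, which is exactly what a tiny dataset cannot provide.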
Acknowledgment
This work was supported by the Intelligent Policing Key Laboratory of Sichuan Province (Grant No. ZNJW2024FKMS004).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhang, Y., Zhang, B., Yi, K. (2025). TinyConv-PVT: A Deeper Fusion Model of CNN and Transformer for Tiny Dataset. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, C.-L., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol. 15304. Springer, Cham. https://doi.org/10.1007/978-3-031-78128-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78127-8
Online ISBN: 978-3-031-78128-5