{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,3,22]],"date-time":"2025-03-22T12:40:09Z","timestamp":1742647209001,"version":"3.40.2"},"reference-count":40,"publisher":"Institution of Engineering and Technology (IET)","issue":"10","license":[{"start":{"date-parts":[[2023,6,18]],"date-time":"2023-06-18T00:00:00Z","timestamp":1687046400000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["61672466 62011530130"],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["ietresearch.onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["IET Image Processing"],"published-print":{"date-parts":[[2023,8]]},"abstract":"Abstract<\/jats:title>Human keypoints detection is different from general detection tasks and requires networks that can learn visual information and anatomical constraints. Since CNN is excellent in extracting texture features of images and transformer can learn the correlation among keypoints well, many CTPNets (CNN+transformer type human pose estimation networks) have emerged. However, these networks are unconcerned with the processing of the features extracted from the CNN and naturally expand only from the channel dimension, ignoring the spatial features in the visual information that are essential for complex detection tasks like pose estimation. So the channel spatial integrated transformer for human pose estimation, termed CSIT, is proposed. The visual information are summarized as texture and spatial information, and a parallel network is used to expand the feature maps in the channel and spatial dimensions to learn texture features and spatial features respectively. 
In addition, anatomically constrained information is learned via keypoint embeddings. At the end of the network, keypoints are predicted with the 1D vector representation, which performs better and is more compatible with the transformer's characteristics. Experiments show that CSIT outperforms mainstream CTPNets on the COCO test\u2010dev dataset and also achieves satisfactory results on the MPII dataset.<\/jats:p>","DOI":"10.1049\/ipr2.12850","type":"journal-article","created":{"date-parts":[[2023,6,19]],"date-time":"2023-06-19T03:35:00Z","timestamp":1687145700000},"page":"3002-3011","update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":7,"title":["CSIT: Channel Spatial Integrated Transformer for human pose estimation"],"prefix":"10.1049","volume":"17","author":[{"given":"Shaohua","family":"Li","sequence":"first","affiliation":[{"name":"Graphics and Data Intelligence Team, Key Laboratory of Autonomous Intelligent System Software Systems and Applications School of Computer Science and Technology, Zhejiang Sci\u2010Tech University Hangzhou China"}]},{"given":"Haixiang","family":"Zhang","sequence":"additional","affiliation":[{"name":"Graphics and Data Intelligence Team, Key Laboratory of Autonomous Intelligent System Software Systems and Applications School of Computer Science and Technology, Zhejiang Sci\u2010Tech University Hangzhou China"}]},{"given":"Hanjie","family":"Ma","sequence":"additional","affiliation":[{"name":"Graphics and Data Intelligence Team, Key Laboratory of Autonomous Intelligent System Software Systems and Applications School of Computer Science and Technology, Zhejiang Sci\u2010Tech University Hangzhou China"}]},{"given":"Jie","family":"Feng","sequence":"additional","affiliation":[{"name":"Graphics and Data Intelligence Team, Key Laboratory of Autonomous Intelligent System Software Systems and Applications School of Computer Science and Technology, Zhejiang Sci\u2010Tech University 
Hangzhou China"}]},{"given":"Mingfeng","family":"Jiang","sequence":"additional","affiliation":[{"name":"Graphics and Data Intelligence Team, Key Laboratory of Autonomous Intelligent System Software Systems and Applications School of Computer Science and Technology, Zhejiang Sci\u2010Tech University Hangzhou China"}]}],"member":"265","published-online":{"date-parts":[[2023,6,18]]},"reference":[{"key":"e_1_2_9_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/TII.2022.3143605"},{"key":"e_1_2_9_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2020.09.068"},{"key":"e_1_2_9_4_1","doi-asserted-by":"publisher","DOI":"10.1109\/EMBC.2017.8037221"},{"key":"e_1_2_9_5_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.gaitpost.2020.05.027"},{"key":"e_1_2_9_6_1","doi-asserted-by":"publisher","DOI":"10.3390\/en16031078"},{"key":"e_1_2_9_7_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.eswa.2022.118807"},{"key":"e_1_2_9_8_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.143"},{"key":"e_1_2_9_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00198"},{"key":"e_1_2_9_10_1","unstructured":"Mao W. et\u00a0al.:Tfpose: Direct human pose estimation with transformers. arXiv:2103.15320 (2021)"},{"key":"e_1_2_9_11_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20065-6_25"},{"key":"e_1_2_9_12_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-20068-7_6"},{"key":"e_1_2_9_13_1","doi-asserted-by":"crossref","unstructured":"Newell A. Yang K. Deng J.:Stacked hourglass networks for human pose estimation. In:14th European Conference Computer Vision\u2013ECCV 2016 pp.483\u2013499.Springer Cham(2016)","DOI":"10.1007\/978-3-319-46484-8_29"},{"key":"e_1_2_9_14_1","doi-asserted-by":"crossref","unstructured":"Xiao B. Wu H. Wei Y.:Simple baselines for human pose estimation and tracking. 
In:Proceedings of the European Conference on Computer Vision (ECCV) pp.466\u2013481.Springer Cham(2018)","DOI":"10.1007\/978-3-030-01231-1_29"},{"key":"e_1_2_9_15_1","doi-asserted-by":"crossref","unstructured":"Sun K. Xiao B. Liu D. Wang J.:Deep high\u2010resolution representation learning for human pose estimation. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition pp.5693\u20135703.IEEE Piscataway NJ(2019)","DOI":"10.1109\/CVPR.2019.00584"},{"key":"e_1_2_9_16_1","doi-asserted-by":"crossref","unstructured":"Wei S.\u2010E. Ramakrishna V. Kanade T. Sheikh Y.:Convolutional pose machines. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp.4724\u20134732.IEEE Piscataway NJ(2016)","DOI":"10.1109\/CVPR.2016.511"},{"key":"e_1_2_9_17_1","doi-asserted-by":"crossref","unstructured":"Cheng B. et\u00a0al.:Higherhrnet: Scale\u2010aware representation learning for bottom\u2010up human pose estimation. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition pp.5386\u20135395.IEEE Piscataway NJ(2020)","DOI":"10.1109\/CVPR42600.2020.00543"},{"key":"e_1_2_9_18_1","doi-asserted-by":"crossref","unstructured":"Luo Z. et\u00a0al.:Rethinking the heatmap regression for bottom\u2010up human pose estimation. In:Proceedings of the IEEE\/CVF conference on computer vision and pattern recognition pp.13264\u201313273.IEEE Piscataway NJ(2021)","DOI":"10.1109\/CVPR46437.2021.01306"},{"key":"e_1_2_9_19_1","doi-asserted-by":"crossref","unstructured":"Gu K. Yang L. Yao A.:Removing the bias of integral pose regression. In:Proceedings of the IEEE\/CVF International Conference on Computer Vision pp.11067\u201311076.IEEE Piscataway NJ(2021)","DOI":"10.1109\/ICCV48922.2021.01088"},{"key":"e_1_2_9_20_1","unstructured":"Qu H. Xu L. Cai Y. Foo L.G. Liu J.:Heatmap distribution matching for human pose estimation. arXiv:2210.00740(2022)"},{"key":"e_1_2_9_21_1","unstructured":"Devlin J. Chang M.\u2010W. Lee K. 
Toutanova K.:Bert: pre\u2010training of deep bidirectional transformers for language understanding.arXiv:1810.04805(2018)"},{"key":"e_1_2_9_22_1","unstructured":"Liu Y. et\u00a0al.:Roberta: a robustly optimized bert pretraining approach.arXiv:1907.11692(2019)"},{"key":"e_1_2_9_23_1","unstructured":"Radford A. Narasimhan K. Salimans T. Sutskever I. et\u00a0al.:Improving language understanding by generative pre\u2010training(2018)"},{"key":"e_1_2_9_24_1","doi-asserted-by":"crossref","unstructured":"Lewis M. et\u00a0al.:Bart: denoising sequence\u2010to\u2010sequence pre\u2010training for natural language generation translation and comprehension.arXiv:1910.13461(2019)","DOI":"10.18653\/v1\/2020.acl-main.703"},{"key":"e_1_2_9_25_1","first-page":"5485","article-title":"Exploring the limits of transfer learning with a unified text\u2010to\u2010text transformer","volume":"21","author":"Raffel C.","year":"2020","journal-title":"J. Mach. Learn. Res."},{"key":"e_1_2_9_26_1","unstructured":"Zhang S. et\u00a0al.:OPT: open pre\u2010trained transformer language models.arXiv:2205.01068(2022)"},{"key":"e_1_2_9_27_1","unstructured":"Dosovitskiy A. et\u00a0al.:An image is worth 16x16 words: transformers for image recognition at scale.arXiv:2010.11929(2020)"},{"key":"e_1_2_9_28_1","unstructured":"Touvron H. et\u00a0al.:Training data\u2010efficient image transformers & distillation through attention. In:International Conference on Machine Learning pp.10347\u201310357.Microtome Publishing Brookline MA(2021)"},{"key":"e_1_2_9_29_1","doi-asserted-by":"crossref","unstructured":"Chen C.\u2010F.R. Fan Q. Panda R.:Crossvit: Cross\u2010attention multi\u2010scale vision transformer for image classification. In:Proceedings of the IEEE\/CVF International Conference on Computer Vision pp.357\u2013366.IEEE Piscataway NJ(2021)","DOI":"10.1109\/ICCV48922.2021.00041"},{"key":"e_1_2_9_30_1","doi-asserted-by":"crossref","unstructured":"Lanchantin J. Wang T. Ordonez V. 
Qi Y.:General multi\u2010label image classification with transformers. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition pp.16478\u201316488.IEEE Piscataway NJ(2021)","DOI":"10.1109\/CVPR46437.2021.01621"},{"key":"e_1_2_9_31_1","doi-asserted-by":"crossref","unstructured":"Carion N. et\u00a0al.:End\u2010to\u2010end object detection with transformers. In:16th European Conference Computer Vision\u2013ECCV 2020 pp.213\u2013229.Springer Cham(2020)","DOI":"10.1007\/978-3-030-58452-8_13"},{"key":"e_1_2_9_32_1","unstructured":"Zhu X. et\u00a0al.:Deformable DETR: deformable transformers for end\u2010to\u2010end object detection.arXiv:2010.04159(2020)"},{"key":"e_1_2_9_33_1","doi-asserted-by":"crossref","unstructured":"Dai Z. Cai B. Lin Y. Chen J.:UP\u2010DETR: unsupervised pre\u2010training for object detection with transformers. In:Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition pp.1601\u20131610.IEEE Piscataway NJ(2021)","DOI":"10.1109\/CVPR46437.2021.00165"},{"key":"e_1_2_9_34_1","doi-asserted-by":"crossref","unstructured":"Zheng C. et\u00a0al.:3D human pose estimation with spatial and temporal transformers. In:Proceedings of the IEEE\/CVF International Conference on Computer Vision pp.11656\u201311665.IEEE Piscataway NJ(2021)","DOI":"10.1109\/ICCV48922.2021.01145"},{"key":"e_1_2_9_35_1","doi-asserted-by":"publisher","DOI":"10.1109\/LSP.2022.3163678"},{"key":"e_1_2_9_36_1","doi-asserted-by":"publisher","DOI":"10.1007\/s10044-023-01130-6"},{"key":"e_1_2_9_37_1","doi-asserted-by":"crossref","unstructured":"Papandreou G. et\u00a0al.:Towards accurate multi\u2010person pose estimation in the wild. In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp.4903\u20134911.IEEE Piscataway NJ(2017)","DOI":"10.1109\/CVPR.2017.395"},{"key":"e_1_2_9_38_1","doi-asserted-by":"crossref","unstructured":"Chen Y. et\u00a0al.:Cascaded pyramid network for multi\u2010person pose estimation. 
In:Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition pp.7103\u20137112.IEEE Piscataway NJ(2018)","DOI":"10.1109\/CVPR.2018.00742"},{"key":"e_1_2_9_39_1","doi-asserted-by":"crossref","unstructured":"Fang H.\u2010S. Xie S. Tai Y.\u2010W. Lu C.:RMPE: regional multi\u2010person pose estimation. In:Proceedings of the IEEE International Conference on Computer Vision pp.2334\u20132343.IEEE Piscataway NJ(2017)","DOI":"10.1109\/ICCV.2017.256"},{"key":"e_1_2_9_40_1","unstructured":"Stoffl L. Vidal M. Mathis A.:End\u2010to\u2010end trainable multi\u2010instance pose estimation with transformers.arXiv:2103.12115(2021)"},{"key":"e_1_2_9_41_1","unstructured":"Xu Y. Zhang J. Zhang Q. Tao D.:ViTPose: Simple vision transformer baselines for human pose estimation.arXiv:2204.12484(2022)"}],"container-title":["IET Image Processing"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/ipr2.12850","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,3,22]],"date-time":"2025-03-22T12:05:19Z","timestamp":1742645119000},"score":1,"resource":{"primary":{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/10.1049\/ipr2.12850"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,6,18]]},"references-count":40,"journal-issue":{"issue":"10","published-print":{"date-parts":[[2023,8]]}},"alternative-id":["10.1049\/ipr2.12850"],"URL":"https:\/\/doi.org\/10.1049\/ipr2.12850","archive":["Portico"],"relation":{},"ISSN":["1751-9659","1751-9667"],"issn-type":[{"type":"print","value":"1751-9659"},{"type":"electronic","value":"1751-9667"}],"subject":[],"published":{"date-parts":[[2023,6,18]]},"assertion":[{"value":"2023-04-22","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication 
History"}},{"value":"2023-06-05","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-06-18","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}