Abstract
Purpose
Multi- and cross-modal learning consolidates information from multiple data sources, which may offer a holistic representation of complex scenarios. Cross-modal learning is particularly interesting because synchronized data streams are immediately useful as self-supervisory signals. The prospect of achieving self-supervised continual learning in surgical robotics is exciting, as it may enable lifelong learning that adapts to different surgeons and cases, ultimately leading to a more general machine understanding of surgical processes.
Methods
We present a learning paradigm using synchronous video and kinematics from robot-mediated surgery. Our approach relies on an encoder–decoder network that maps optical flow to the corresponding kinematics sequence. Clustering on the latent representations reveals meaningful groupings for surgeon gesture and skill level. We demonstrate the generalizability of the representations on the JIGSAWS dataset by classifying skill and gestures on tasks not used for training.
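The abstract does not fix architecture details; as a hedged illustration only, the following PyTorch sketch shows one way such a cross-modal encoder–decoder could map an optical-flow sequence to the synchronous kinematics sequence. All layer sizes, the latent dimension, and the 76-channel kinematics dimension (as in JIGSAWS) are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch, not the authors' exact model: a convolutional encoder compresses
# each optical-flow frame into a latent vector, and a GRU decoder regresses the
# synchronous kinematics sequence from those latents.
import torch
import torch.nn as nn


class FlowToKinematics(nn.Module):
    def __init__(self, latent_dim=64, kin_dim=76):  # dims are assumptions
        super().__init__()
        # Encoder: 2-channel optical flow (dx, dy) -> latent vector per frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim),
        )
        # Decoder: latent sequence -> kinematics sequence.
        self.decoder = nn.GRU(latent_dim, 128, batch_first=True)
        self.head = nn.Linear(128, kin_dim)

    def forward(self, flow):                      # flow: (B, T, 2, H, W)
        b, t = flow.shape[:2]
        z = self.encoder(flow.flatten(0, 1))      # (B*T, latent_dim)
        z = z.view(b, t, -1)                      # latent representation
        h, _ = self.decoder(z)
        return self.head(h), z                    # predicted kinematics, latents


# Self-supervised objective: regress the synchronous kinematics (no labels needed).
model = FlowToKinematics()
flow = torch.randn(4, 10, 2, 60, 80)              # toy batch
kin = torch.randn(4, 10, 76)
pred, latents = model(flow)
loss = nn.functional.mse_loss(pred, kin)
loss.backward()
```

Training on the regression error between predicted and recorded kinematics requires no manual labels, which is what makes the synchronized stream a self-supervisory signal.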
Results
For tasks seen in training, we report 59 to 70% accuracy in surgical gesture classification. On tasks beyond the training setup, we note 45 to 65% accuracy. Qualitatively, we find that unseen gestures form clusters in the latent space of novice actions, which may enable the automatic identification of novel interactions in a lifelong learning scenario.
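As a hedged sketch of how such downstream numbers could be obtained from the frozen representations, one might project the latents with UMAP [14] for qualitative cluster inspection and fit a lightweight classifier such as XGBoost [3] for gesture or skill accuracy; the random features, labels, and hyperparameters below are placeholders, not the paper's exact evaluation pipeline.

```python
# Illustrative only: per-segment latent vectors from the frozen encoder are
# inspected with UMAP and evaluated with a lightweight gesture classifier.
import numpy as np
import umap                     # umap-learn
import xgboost as xgb
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 64))          # stand-in for encoder latents
gestures = rng.integers(0, 10, size=500)      # stand-in gesture labels

# 2-D projection for qualitative inspection of gesture/skill clusters.
embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(latents)
print("2-D embedding for plotting:", embedding.shape)

# Supervised downstream evaluation on the frozen representations.
X_tr, X_te, y_tr, y_te = train_test_split(latents, gestures,
                                          test_size=0.3, random_state=0)
clf = xgb.XGBClassifier(n_estimators=100, max_depth=4)
clf.fit(X_tr, y_tr)
print("gesture accuracy:", clf.score(X_te, y_te))
```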
Conclusion
By predicting the synchronous kinematics sequence, the model learns optical flow representations of surgical scenes that separate well even for tasks it had not seen before. While these representations are immediately useful for a variety of tasks, the self-supervised learning paradigm may also enable research in lifelong and user-specific learning.






References
Ahmidi N, Tao L, Sefati S, Gao Y, Lea C, Haro BB, Zappella L, Khudanpur S, Vidal R, Hager GD (2017) A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery. IEEE Trans Biomed Eng 64(9):2025–2041
Arandjelovic R, Zisserman A (2018) Objects that sound. In: Proceedings of the European conference on computer vision, pp. 435–451
Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 785–794
DiPietro R, Hager GD (2018) Unsupervised learning for surgical motion by learning to predict the future. In: International conference on medical image computing and computer-assisted intervention, pp. 281–288. Springer
DiPietro R, Hager GD (2019) Automated surgical activity recognition with one labeled sequence. In: International conference on medical image computing and computer-assisted intervention, pp. 458–466. Springer
Farnebäck G (2003) Two-frame motion estimation based on polynomial expansion. In: Scandinavian conference on image analysis, pp. 363–370. Springer
Funke I, Mees ST, Weitz J, Speidel S (2019) Video-based surgical skill assessment using 3D convolutional neural networks. Int J Comput Assist Radiol Surg 14(7):1217–1225
Gao Y, Vedula SS, Reiley CE, Ahmidi N, Varadarajan B, Lin HC, Tao L, Zappella L, Béjar B, Yuh DD, Chen CCG, Vidal R, Khudanpur S, Hager GD (2014) JHU-ISI gesture and skill assessment working set (JIGSAWS): a surgical activity dataset for human motion modeling. In: MICCAI workshop: M2CAI, vol. 3, p. 3
Guthart GS, Salisbury JK (2000) The Intuitive™ telesurgery system: overview and application. In: IEEE international conference on robotics and automation, vol. 1, pp. 618–621
Jing L, Tian Y (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Trans Pattern Anal Mach Intell
Kazanzides P, Chen Z, Deguet A, Fischer GS, Taylor RH, DiMaio SP (2014) An open-source research kit for the da Vinci® surgical system. In: IEEE international conference on robotics and automation, pp. 6434–6439
Long YH, Wu JY, Lu B, Jin YM, Unberath M, Liu YH, Heng PA, Dou Q (2020) Relational graph learning on visual and kinematics embeddings for accurate gesture recognition in robotic surgery
Mazomenos E, Watson D, Kotorov R, Stoyanov D (2018) Gesture classification in robotic surgery using recurrent neural networks with kinematic information. In: 8th Joint workshop on new technologies for computer/robotic assisted surgery
McInnes L, Healy J, Melville J (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426
Murali A, Garg A, Krishnan S, Pokorny FT, Abbeel P, Darrell T, Goldberg K (2016) TSC-DL: unsupervised trajectory segmentation of multi-modal surgical demonstrations with deep learning. In: IEEE international conference on robotics and automation, pp. 4150–4157
Qin Y, Feyzabadi S, Allan M, Burdick JW, Azizian M (2020) daVinciNet: joint prediction of motion and surgical state in robot-assisted surgery. arXiv preprint arXiv:2009.11937
Sarikaya D, Jannin P (2019) Surgical gesture recognition with optical flow only. arXiv preprint arXiv:1904.01143
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. Adv Neural Inf Process Syst pp. 568–576
Tanwani AK, Sermanet P, Yan A, Anand R, Phielipp M, Goldberg K (2020) Motion2Vec: semi-supervised representation learning from surgical videos. arXiv preprint arXiv:2006.00545
van Amsterdam B, Nakawala H, De Momi E, Stoyanov D (2019) Weakly supervised recognition of surgical gestures. In: IEEE international conference on robotics and automation, pp. 9565–9571
Wang Z, Fey AM (2018) Deep learning with convolutional neural network for objective skill evaluation in robot-assisted surgery. Int J Comput Assist Radiol Surg 13(12):1959–1970
Weiss MY, Melnyk R, Mix D, Ghazi A, Vates GE, Stone JJ (2020) Design and validation of a cervical laminectomy simulator using 3D printing and hydrogel phantoms. Oper Neurosurg 18(2):202–208
Wu JY, Kazanzides P, Unberath M (2020) Leveraging vision and kinematics data to improve realism of biomechanic soft tissue simulation for robotic surgery. Int J Comput Assist Radiol Surg pp. 1–8
Zhang Y, Lu H (2018) Deep cross-modal projection learning for image-text matching. In: Proceedings of the European conference on computer vision, pp. 686–701
Zhen L, Hu P, Wang X, Peng D (2019) Deep supervised cross-modal retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 10394–10403
Funding
This research was supported in part by a collaborative research agreement with the Multi-Scale Medical Robotics Center in Hong Kong.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest. This article does not contain any studies with human participants or animals performed by any of the authors.
Appendix
To test whether our method can generalize to new and clinically more relevant scenarios, we also collect a dataset on the da Vinci Research Kit [11] using a hydrogel hysterectomy phantom (University of Rochester Medical Center; constructed similarly to the phantom presented in [22]). The procedure was performed by a gynecology fellow, and we annotate one section of the procedure that contains suturing with the gestures corresponding to those in JIGSAWS. Because JIGSAWS does not provide the calibration matrix between the tools and the camera, the kinematics would be misaligned between the two datasets. We therefore limit our investigation to test-time inference without any retraining, studying whether our method can recognize gestures in this setting.
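A minimal sketch of this test-time-only pipeline, assuming dense Farnebäck optical flow [6] computed with OpenCV and a frozen flow-to-kinematics model trained on JIGSAWS; the video path, tensor layout, and model interface are assumptions for illustration, not the released setup.

```python
# Sketch: compute dense Farnebäck optical flow on the phantom video and stack it
# as input for the frozen, JIGSAWS-trained model (inference only, no retraining).
import cv2
import numpy as np
import torch

cap = cv2.VideoCapture("hysterectomy_suturing.mp4")   # placeholder path
ok, prev = cap.read()
if not ok:
    raise RuntimeError("could not read video")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense two-frame Farnebäck flow: (H, W, 2) array of (dx, dy) displacements.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flows.append(flow.transpose(2, 0, 1))              # -> (2, H, W)
    prev_gray = gray

# (1, T, 2, H, W) tensor; this would be passed through the frozen model in eval
# mode under torch.no_grad() to obtain latents for gesture recognition.
flow_seq = torch.from_numpy(np.stack(flows)).float().unsqueeze(0)
```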
Observations We densely annotate the video, reporting one gesture every 50 frames, rather than predicting only at the beginning of each gesture; this accounts for possible misalignment in gesture start times. Given the numerous differences in how the data were collected and annotated, we report only qualitative observations. The gestures that transferred best were G3 ("pushing needle through tissue") and G14 ("reaching for suture with right hand"). The former involves moving tissue, which produces denser optical flow than moving the instruments alone, potentially making it easier to recognize in a new scene. The latter solely involves movement of the right instrument (unlike the potentially similar gesture "pulling suture with right hand", which also moves the suture), which may likewise aid its transfer. Figure 5 shows a sequence of frames correctly labeled as "pushing needle through tissue", while Fig. 6 shows gestures incorrectly given the same label.
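For concreteness, a hedged sketch of the one-label-per-50-frames reporting scheme; `latents` (per-frame features from the frozen model) and `clf` (a gesture classifier fitted on JIGSAWS latents) are assumed to exist, and the names and pooling choice are ours.

```python
# Emit one gesture label per 50-frame window over the full video, rather than
# predicting only at annotated gesture onsets.
import numpy as np

WINDOW = 50  # frames per reported gesture


def dense_gesture_labels(latents, clf, window=WINDOW):
    """Return one predicted gesture per `window` frames of the sequence."""
    labels = []
    for start in range(0, len(latents), window):
        chunk = latents[start:start + window]
        feature = chunk.mean(axis=0, keepdims=True)   # simple window pooling
        labels.append(int(clf.predict(feature)[0]))
    return labels

# Example (hypothetical inputs): labels = dense_gesture_labels(latent_sequence, clf)
```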
Cite this article
Wu, J.Y., Tamhane, A., Kazanzides, P. et al. Cross-modal self-supervised representation learning for gesture and skill recognition in robotic surgery. Int J CARS 16, 779–787 (2021). https://doi.org/10.1007/s11548-021-02343-y