Abstract
Automatically recognizing surgical gestures is a crucial step towards a thorough understanding of surgical skill. Possible areas of application include automatic skill assessment, intra-operative monitoring of critical surgical steps, and semi-automation of surgical tasks. Solutions that rely only on the laparoscopic video and do not require additional sensor hardware are especially attractive as they can be implemented at low cost in many scenarios. However, surgical gesture recognition based only on video is a challenging problem that requires effective means to extract both visual and temporal information from the video. Previous approaches mainly rely on frame-wise feature extractors, either handcrafted or learned, which fail to capture the dynamics in surgical video. To address this issue, we propose to use a 3D Convolutional Neural Network (CNN) to learn spatiotemporal features from consecutive video frames. We evaluate our approach on recordings of robot-assisted suturing on a bench-top model, which are taken from the publicly available JIGSAWS dataset. Our approach achieves high frame-wise surgical gesture recognition accuracies of more than 84%, outperforming comparable models that either extract only spatial features or model spatial and low-level temporal information separately. For the first time, these results demonstrate the benefit of spatiotemporal CNNs for video-based surgical gesture recognition.
Keywords
- Surgical gesture
- Spatiotemporal modeling
- Video understanding
- Action segmentation
- Convolutional Neural Network
1 Introduction
Surgical gestures [1] are the basic elements of every surgical process. Recognizing which surgical gesture is being performed is crucial for understanding the current surgical situation and for providing meaningful computer assistance to the surgeon. Automatic surgical gesture recognition also offers new possibilities for surgical training. For example, it may enable a computer-assisted surgical training system to observe whether gestures are performed in the correct order or to identify with which gestures a trainee struggles the most.
Especially appealing is the exploitation of ubiquitous video feeds for surgical gesture recognition, such as the feed of the laparoscopic camera, which displays the surgical field in conventional and robot-assisted minimally invasive surgery. The problem of video-based surgical gesture recognition is formalized as follows: A video of length T is a sequence of video frames \(v_t, t = 1, ..., T\). The problem is to predict the gesture \(g(t) \in \mathcal{G}\) performed at time t for each \(t = 1, ..., T\), where \(\mathcal{G} = \{1, ..., G\}\) is the set of surgical gestures. Variations of surgical gesture recognition differ in the amount of information that is available to obtain an estimate \(\hat{g}(t)\) of the current gesture, e.g., (i) only the current video frame, i.e., \(\hat{g}(t) = \hat{g}(v_t)\) (frame-wise recognition), (ii) only frames up until the current timestep, i.e., \(\hat{g}(t) = \hat{g}(v_k, ..., v_t), k \ge 1\) (on-line recognition), or (iii) the complete video, i.e., \(\hat{g}(t) = \hat{g}(v_1, ..., v_T)\) (off-line recognition).
The main challenge in video-based surgical gesture recognition is the high dimensionality, high level of redundancy, and high complexity of video data. State-of-the-art methods tackle the problem by transforming video frames into feature representations, which are fed into temporal models that infer the sequence of gestures from the input sequence. These temporal models have been continuously improved in recent years, starting with variants of Hidden Markov Models [12] and Conditional Random Fields [9, 13] and evolving into deep learning-based methods such as Recurrent Neural Networks [3], Temporal Convolutional Networks (TCN) [10], and Deep Reinforcement Learning (RL) [11].
To obtain feature representations from video frames, early approaches compute bag-of-features histograms from feature descriptors extracted around space-time interest points or along dense trajectories [13]. More recently, Convolutional Neural Networks (CNNs) have become a popular tool for visual feature extraction. For example, Lea et al. train a CNN (S-CNN) for frame-wise gesture recognition [9] and use the latent video frame encodings as feature representations, which are further processed by a TCN for gesture recognition [10]. A TCN combines 1D convolutional filters with pooling and channel-wise normalization layers to hierarchically capture temporal relationships at low-, intermediate-, and high-level time scales.
Features extracted from individual video frames cannot represent the dynamics in surgical video, i.e., changes between adjacent frames. To alleviate this problem, Lea et al. [10] propose adding a number of difference images to the input fed to the S-CNN. For timestep t, difference images are calculated within a window of 2 s around frame \(v_t\). Also, they suggest using a spatiotemporal CNN (ST-CNN) [9], which applies a large temporal 1D convolutional filter to the latent activations obtained by an S-CNN. In contrast, we propose to use a 3D CNN to learn spatiotemporal features from stacks of consecutive video frames, thus modeling the temporal evolution of video frames directly.
To the best of our knowledge, we are the first to design a 3D CNN for surgical gesture recognition that predicts gesture labels for consecutive frames of surgical video. An evaluation on the suturing task of the publicly available JIGSAWS [1] dataset demonstrates the superiority of our approach compared to 2D CNNs that estimate surgical gestures based on spatial features extracted from individual video frames. Averaging the dense predictions of the 3D CNN over time even achieves compelling frame-wise gesture recognition accuracies of over 84%. Source code can be accessed at https://gitlab.com/nct_tso_public/surgical_gesture_recognition.
2 Methods
In the following, we detail the architecture and training procedure of the proposed 3D CNN for video-based surgical gesture recognition.
2.1 Network Architecture
Ji et al. [6] proposed 3D CNNs as a natural extension of well-known (2D) CNNs. While 2D CNNs apply 2D convolutions and 2D pooling kernels to extract features along the spatial dimensions of a video frame \(v \in \mathbb{R}^{C \times H \times W}\), 3D CNNs apply 3D convolutions and 3D pooling kernels to extract features along the spatial and temporal dimensions of a stack of video frames \(\vartheta = [v_k, v_{k + 1}, ..., v_{k + L - 1}] \in \mathbb{R}^{C \times L \times H \times W}\). Recently, Carreira et al. [2] suggested creating 3D CNN architectures by inflating established deep 2D CNN architectures along the temporal dimension. In essence, all \(N \times N\) kernels are expanded into their cubic \(N \times N \times N\) counterparts.
The proposed 3D CNN for surgical gesture recognition is based on 3D ResNet-18 [4], which is created by inflating an 18-layer residual network [5]. The network takes as input stacks of 16 consecutive video frames (as proposed in [4]) with a resolution of \(224 \times 224\) pixels. More precisely, to obtain an estimate \(\hat{g}(t)\) of the gesture being performed at time t, we feed the video snippet \(\vartheta _t = (v_{t - 15}, ..., v_{t - 1}, v_t)\) to the network. Because we process the video at 5 fps, the network can refer to the previous three seconds of video in order to infer \(\hat{g}(t)\). We abstain from feeding future video frames to the network so that the method remains applicable to on-line gesture recognition.
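As a simple illustration of how such an input snippet could be assembled, consider the following sketch; it assumes the video has already been decoded into a tensor sampled at 5 fps, and the names used here (snippet_at, video) are illustrative rather than taken from the published code.

```python
import torch

def snippet_at(video: torch.Tensor, t: int, length: int = 16) -> torch.Tensor:
    """video: tensor of shape (C, T, H, W), sampled at 5 fps.
    Returns the snippet (v_{t-15}, ..., v_t) of shape (C, 16, H, W)."""
    assert t >= length - 1, "the snippet needs 15 frames of history"
    return video[:, t - length + 1 : t + 1]
```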
The original 3D ResNet-18 architecture is designed to predict one distinct action label per video snippet using a one-hot encoding. In contrast, surgical gesture recognition is a dense labeling problem, where each frame \(v_k\) of a video snippet has a distinct label g(k). This means that one video snippet may contain frames that belong to different gestures. To account for this, we adapt our network to output dense gesture label estimates \(\hat{\gamma }_t = (\hat{g}_t(t - 15), ..., \hat{g}_t(t - 1), \hat{g}_t(t)) \in \mathbb {R}^{G \times 16}\). Here, G denotes the number of distinct surgical gestures. The component \(\hat{g}_t(t - i), i = 0, ..., 15,\) of \(\hat{\gamma }_t\) is the estimate for gesture label \(g(t - i)\), obtained at time t.
Specifically, we adapt the max pooling layer of 3D ResNet-18 so that downsampling is only performed along the spatial dimensions. Thus, the feature maps after the final average pooling layer have a dimension of \(512 \times 2\). This is upsampled to the output dimension \(G \times 16\) using a transposed 1D convolution (\(\text {conv}^T\)) with kernel size 11 and stride 5.
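A minimal sketch of this dense-prediction head is given below, assuming the backbone delivers feature maps of shape (batch, 512, 2, H', W') for a 16-frame input; the module name and the exact spatial size are assumptions of this sketch rather than details from the paper.

```python
import torch
import torch.nn as nn

class DenseGestureHead(nn.Module):
    """Maps 3D ResNet-18 features (B, 512, 2, H', W') to dense gesture scores (B, G, 16)."""
    def __init__(self, num_gestures: int):
        super().__init__()
        # pool only over the spatial dimensions, keep the temporal dimension
        self.avg_pool = nn.AdaptiveAvgPool3d((None, 1, 1))
        # transposed 1D convolution: (2 - 1) * 5 + 11 = 16 output timesteps
        self.upsample = nn.ConvTranspose1d(512, num_gestures, kernel_size=11, stride=5)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        x = self.avg_pool(features)     # (B, 512, 2, 1, 1)
        x = x.flatten(start_dim=2)      # (B, 512, 2)
        return self.upsample(x)         # (B, G, 16); softmax over G is applied afterwards
```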
An overview of the network architecture is given in Table 1. The input is downsampled in the initial convolutional and max pooling layers and then passed through a number of residual blocks. When convolutions are applied with stride 2 to downsample feature maps, the number of feature maps is doubled. For details on residual blocks, please see the original papers [4, 5]. We apply batch normalization and the ReLU non-linearity after each convolutional layer. An exception is the final transposed convolution, which is normalized using a softmax layer.
2.2 Network Training
We train our 3D CNN on video snippets \(\vartheta _t = (v_{t - 15}, ..., v_{t - 1}, v_t)\) to predict the corresponding ground truth gesture labels \(\gamma _t = (g(t - 15), ..., g(t - 1), g(t))\). To this end, we minimize the loss \(\mathcal {L}(\gamma _t, \hat{\gamma }_t) = \sum _{i=0}^{15}\omega _i \mathcal {L}_{CE}(g(t-i), \hat{g}_t(t-i))\), where \(\mathcal {L}_{CE}\) denotes the cross entropy loss. We found it beneficial to penalize errors on more recent predictions more heavily and therefore train with weighting factors \(\omega _i = \frac{(16 - i)^2}{\sum _{j=0}^{15} (16 - j)^2}\).
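The following sketch expresses this loss in PyTorch, assuming the network outputs a snippet of logits of shape (batch, G, 16) ordered like \(\hat{\gamma }_t\), i.e., from the oldest frame (i = 15) to the most recent one (i = 0); the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def snippet_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (B, G, 16), targets: (B, 16) gesture ids, both ordered oldest to newest."""
    # omega_i = (16 - i)^2 / sum_j (16 - j)^2, with i = 0 denoting the current frame
    i = torch.arange(16, device=logits.device)
    omega = (16 - i).float() ** 2
    omega = omega / omega.sum()
    ce = F.cross_entropy(logits, targets, reduction="none")  # (B, 16), oldest to newest
    # position p in the snippet corresponds to i = 15 - p, hence the flipped weights
    return (ce * omega.flip(0)).sum(dim=1).mean()
```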
Because of their large number of parameters, 3D CNNs are difficult to train, especially on small datasets [4]. Thus, it is important to begin training from a suitable initialization of network parameters. We investigate two approaches for network initialization: (i) Initializing the network with parameters obtained by training on Kinetics [2], one of the largest human action datasets available so far. For this, a publicly available pretrained 3D ResNet-18 model [4] is used. (ii) Bootstrapping network parameters from an ImageNet-pretrained 2D ResNet-18 model that was further trained on individual video frames to perform frame-wise gesture recognition. As described in [2], the 3D filters of the 3D ResNet-18 are initialized by repeating the weights of the corresponding 2D filters N times along the temporal dimension and then dividing them by N.
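A minimal sketch of this bootstrapping step is shown below; repeating a 2D kernel N times and dividing by N makes the inflated filter produce the same response on a video of identical, repeated frames. The assumption that parameter names match between the 2D and 3D ResNet-18 state dictionaries is specific to this sketch.

```python
import torch

def inflate_conv_weight(w2d: torch.Tensor, temporal_size: int) -> torch.Tensor:
    """(out, in, k, k) -> (out, in, N, k, k), repeated N times and divided by N."""
    return w2d.unsqueeze(2).repeat(1, 1, temporal_size, 1, 1) / temporal_size

def bootstrap_from_2d(model3d, state_dict_2d):
    state_3d = model3d.state_dict()
    for name, w in state_dict_2d.items():
        if name in state_3d and w.dim() == 4 and state_3d[name].dim() == 5:
            # inflate 2D convolution kernels along the temporal dimension
            state_3d[name] = inflate_conv_weight(w, state_3d[name].shape[2])
        elif name in state_3d and w.shape == state_3d[name].shape:
            state_3d[name] = w  # e.g. batch normalization parameters
    model3d.load_state_dict(state_3d)
```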
During training, we sample video snippets \(\vartheta _t\) at random temporal positions t from the training videos. Per epoch, we sample about 3000 snippets in a class-balanced manner, meaning that each gesture \(g \in \mathcal {G}\) is represented equally in the set of sampled snippets. For data augmentation, we use scale jittering and corner cropping as proposed in [14]. Here, all frames within one training snippet are augmented in the same manner. We train the 3D CNN for 250 epochs using the Adam [7] optimizer with a batch size of 32 and an initial learning rate of \(2.5 \cdot 10^{-4}\). The learning rate is divided by a factor of 5 every 50 epochs. Our 3D CNN implementation is based on code provided by [4].
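For illustration, the optimizer and learning rate schedule could be set up as in the sketch below; model, train_loader (yielding class-balanced, augmented snippets), and the snippet_loss from the previous sketch are assumed to be defined elsewhere and are not part of the paper's code.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
# divide the learning rate by 5 every 50 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.2)

for epoch in range(250):
    for snippets, targets in train_loader:  # batches of 32 snippets of 16 frames each
        optimizer.zero_grad()
        loss = snippet_loss(model(snippets), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()
```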
3 Evaluation
We evaluate our approach on 39 videos of robot-assisted suturing tasks performed on a bench-top model, which are taken from the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) [1]. The recorded tasks were performed by eight participants with varying surgical experience. The videos were annotated with surgical gestures such as positioning the tip of the needle or pushing the needle through the tissue. In total, \(G = 10\) different gestures are used. We follow the leave-one-user-out (LOUO) setup for cross-validation as defined in [1]. Thus, for each experiment, we train one model per left-out user.
We report the following evaluation metrics: (i) Frame-wise accuracy, i.e., the ratio of correctly predicted gesture labels in a video. (ii) Average \(F_1\) score, where we calculate the \(F_1\) score, i.e., the harmonic mean of precision and recall, with respect to each gesture class and average the results over all classes. (iii) Edit score, as proposed in [9], which employs the Levenshtein distance to assess the quality of predicted gesture segments. (iv) Segmental \(F_1\) score with threshold 10% (\(F_1@10\)), as proposed in [8]. Here, a predicted gesture segment is considered a true positive if its intersection over union (IoU) with the corresponding ground truth segment exceeds 10%, and the \(F_1\) score is calculated from the total numbers of true positives, false positives, and false negatives. For each experiment, evaluation metrics are calculated for every video in the dataset and then averaged.
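As an illustration of metric (iii), the sketch below computes a segmental edit score the way it is commonly defined in the action segmentation literature: frame-wise labels are collapsed into a segment sequence and a normalized Levenshtein distance between predicted and ground-truth segment sequences is taken. That [9] uses exactly this normalization is an assumption of the sketch.

```python
from itertools import groupby

def to_segments(labels):
    """Collapse a frame-wise label sequence into its segment-level sequence."""
    return [g for g, _ in groupby(labels)]

def levenshtein(a, b):
    """Edit distance between two sequences via dynamic programming."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def edit_score(pred_labels, true_labels):
    p, y = to_segments(pred_labels), to_segments(true_labels)
    return (1.0 - levenshtein(p, y) / max(len(p), len(y))) * 100.0
```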
As a baseline experiment, we train a 2D ResNet-18 [5], i.e., the 2D counterpart to the proposed 3D CNN, for frame-wise gesture recognition. Here, we follow the training procedure described in Sect. 2.2, except that we train on video snippets of length 1, i.e., individual video frames. The 2D ResNet-18 is initialized with ImageNet-pretrained weights.
Additionally, we perform two experiments where we train the proposed 3D CNN for surgical gesture recognition: one where we initialize the 3D CNN with Kinetics-pretrained weights (3D CNN (K)) and one where we bootstrap weights from a pretrained 2D ResNet-18 as described in Sect. 2.2 (3D CNN (B)). To account for the stochastic nature of CNN optimization, we repeat the three experiments four times and report the averaged results. For the 3D CNN (B) experiment, we initialize the models in the \(i^{\text {th}}\) experiment repetition by bootstrapping weights from the corresponding 2D ResNet-18 models (with respect to the LOUO splits) that were trained during the \(i^{\text {th}}\) repetition of the baseline experiment.
We evaluate the trained 3D CNN models either snippet-wise or in combination with a sliding window (+ window). For snippet-wise evaluation, the estimated gesture label \(\hat{g}(t)\) at time t is simply \(\hat{g}_t(t)\). With the sliding window approach, we accumulate the dense predictions of the 3D CNN over time. This yields the overall estimate \(\bar{\hat{g}}(t) = \sum _{i = 0}^{15} \hat{g}_{t + i}(t)\) for the gesture at time t. To obtain \(\bar{\hat{g}}(t)\), information of 15 future time steps is used, which corresponds to the next three seconds of video.
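The accumulation can be implemented as in the following sketch, assuming the dense snippet predictions are stored as a tensor preds of shape (T, G, 16), where column j of preds[t] holds the scores \(\hat{g}_t(t - 15 + j)\); the function name is hypothetical.

```python
import torch

def accumulate(preds: torch.Tensor) -> torch.Tensor:
    """preds: (T, G, 16). Returns accumulated gesture scores of shape (T, G)."""
    T, G, L = preds.shape
    acc = torch.zeros(T, G, device=preds.device)
    for t in range(T):
        for i in range(L):
            if t + i < T:
                # the snippet ending at time t + i predicts frame t at column 15 - i
                acc[t] += preds[t + i, :, L - 1 - i]
    return acc

# frame-wise estimates with the sliding window: labels = accumulate(preds).argmax(dim=1)
```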
To make comparisons to prior studies possible, we additionally evaluate the 2D ResNet-18 and the 3D CNN models at 10 fps. This means that we extract video snippets at 10 Hz, instead of 5 Hz, from the video. For the 3D CNNs, the individual snippets still consist of 16 frames sampled at 5 fps. To apply the sliding window approach, we temporally upsample the prediction \(\hat{\gamma }_t \in \mathbb {R}^{G \times 16}\) to \(\tilde{\gamma }_t \in \mathbb {R}^{G \times 32}\) by repeating each of its 16 per-frame predictions twice along the temporal dimension.
The experimental results are listed in Table 2. For comparison, we state the results of some previous methods that were described in Sect. 1. Further experiments can be found in the supplementary document.
S-CNN + TCN refers to the method where spatial features are extracted from video frames using an S-CNN and fed to a TCN that predicts surgical gestures [8, 10]. Here, the results were reproduced using the ED-TCN architecture described in [8] with 2 layers and temporal filter size C. For causal evaluation, filters are applied from \(v_{t - C}\) to \(v_t\) instead of \(v_{t - C/2}\) to \(v_{t + C/2}\). We use source code provided by the authors of [8, 10]. The reported results are averaged over four LOUO cross-validation runs.
4 Discussion
As can be seen in Table 2, the proposed variant of 3D ResNet-18 for snippet-wise gesture recognition yields comparable or better frame-wise evaluation results (accuracy and average \(F_1\)) and considerably better segment-based evaluation results (edit score and \(F_1@10\)) compared to the 2D counterpart. This demonstrates the benefit of modeling several consecutive video frames to capture the temporal evolution of video.
Accumulating the 3D CNN predictions using a sliding window with a duration of three seconds provides a further boost to recognition performance. Not only does the sliding window approach produce better gesture segments, it also improves frame-wise accuracies. Considering future video snippets most likely helps to resolve ambiguities in individual snippets.
Minor differences can be observed between the two network initialization variants, Kinetics pretraining (K) and 2D weight bootstrapping (B): while pretraining on Kinetics yields higher frame-wise accuracies, the bootstrapping approach yields better gesture segments. In combination with the sliding window, the differences are marginal.
When testing at 10 fps instead of 5 fps, we observe a notable degradation of the segment-based measures for both the 2D ResNet-18 and the 3D variants. Most likely, the higher evaluation frequency amplifies noise in the gesture predictions, which is penalized by the edit score and the \(F_1@{10}\) metric. For the 3D CNNs, this effect can be alleviated by filtering with the sliding window.
Compared to the ST-CNN, the 3D CNN yields considerably better results with regard to all evaluation metrics when evaluated with the sliding window approach. Apparently, for the given task, modeling spatiotemporal features in video snippets achieves better results than modeling spatial and temporal information separately, as is the case for the ST-CNN.
In combination with the sliding window, the proposed 3D CNN also outperforms the state-of-the-art methods S-CNN + TCN and S-CNN + TCN + Deep RL in terms of accuracy and average \(F_1\). These methods apply very long temporal filters while the proposed approach only processes a few seconds of video to estimate the current gesture. Thus, it is surprising that the quality of gesture segments, as measured by edit score and \(F_1@10\), is almost equal.
Note that the proposed method operates with a delay of only 3 s and can therefore provide information, such as feedback in a surgical training scenario, in a more timely manner than methods with a longer look ahead time.
5 Conclusion
We present a 3D CNN to predict dense gesture labels for surgical video. The conducted experiments demonstrate the benefits of using an inherently spatiotemporal model to extract features from consecutive video frames. Future work will investigate options for combining spatiotemporal feature extractors with models that capture high-level temporal dependencies, such as LSTMs or TCNs.
References
Ahmidi, N., Tao, L., Sefati, S., Gao, Y., Lea, C., Haro, B.B., et al.: A dataset and benchmarks for segmentation and recognition of gestures in robotic surgery. IEEE Trans. Biomed. Eng. 64(9), 2025–2041 (2017)
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR, pp. 4724–4733. IEEE (2017)
DiPietro, R., Lea, C., Malpani, A., Ahmidi, N., Vedula, S.S., Lee, G.I., et al.: Recognizing surgical activities with recurrent neural networks. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9900, pp. 551–558. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46720-7_64
Hara, K., Kataoka, H., Satoh, Y.: Learning spatio-temporal features with 3D residual networks for action recognition. In: ICCV-W, pp. 3154–3160. IEEE (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778. IEEE (2016)
Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)
Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: CVPR, pp. 156–165. IEEE (2017)
Lea, C., Reiter, A., Vidal, R., Hager, G.D.: Segmental spatiotemporal CNNs for fine-grained action segmentation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 36–52. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_3
Lea, C., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks: a unified approach to action segmentation. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9915, pp. 47–54. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49409-8_7
Liu, D., Jiang, T.: Deep reinforcement learning for surgical gesture segmentation and classification. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11073, pp. 247–255. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00937-3_29
Tao, L., Elhamifar, E., Khudanpur, S., Hager, G.D., Vidal, R.: Sparse hidden Markov models for surgical gesture classification and skill evaluation. In: Abolmaesumi, P., Joskowicz, L., Navab, N., Jannin, P. (eds.) IPCAI 2012. LNCS, vol. 7330, pp. 167–177. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-30618-1_17
Tao, L., Zappella, L., Hager, G.D., Vidal, R.: Surgical gesture segmentation and recognition. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013. LNCS, vol. 8151, pp. 339–346. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40760-4_43
Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2
Acknowledgements
The authors thank Colin Lea for sharing code and precomputed S-CNN features to reproduce results from [10] as well as the Helmholtz-Zentrum Dresden-Rossendorf (HZDR) for granting access to their GPU cluster.