Abstract
3D colon reconstruction from Optical Colonoscopy (OC) to detect non-examined surfaces remains an unsolved problem. The challenges arise from the nature of optical colonoscopy data, characterized by highly reflective, low-texture surfaces, drastic illumination changes and frequent tracking loss. Recent methods demonstrate compelling results, but suffer from either (1) fragile frame-to-frame (or frame-to-model) pose estimation, resulting in many tracking failures, or (2) reliance on point-based representations at the cost of scan quality. In this paper, we propose a novel reconstruction framework that addresses these issues end to end and yields quantitatively and qualitatively accurate and robust 3D colon reconstructions. Our SLAM approach, which employs correspondences based on contrastive deep features and deep consistent depth maps, estimates globally optimized poses, is able to recover from frequent tracking failures, and produces a globally consistent 3D model, all within a single framework. We perform an extensive experimental evaluation on multiple synthetic and real colonoscopy videos, showing high-quality results and comparisons against relevant baselines.
References
Alyabsi, M., Algarni, M., Alshammari, K.: Trends in colorectal cancer incidence rates in Saudi Arabia (2001–2016) using Saudi national registry: Early- versus late-onset disease. Front. Oncol. 11, 3392 (2021)
Bian, J., et al.: Unsupervised scale-consistent depth and ego-motion learning from monocular video. In: NeurIPS (2019)
International Agency for Research on Cancer: Globocan 2020: Cancer fact sheets - colorectal cancer. https://gco.iarc.fr/today/data/factsheets/cancers/10_8_9-Colorectum-fact-sheet.pdf
Chen, H.X., Li, K., Fu, Z., Liu, M., Chen, Z., Guo, Y.: Distortion-aware monocular depth estimation for omnidirectional images. IEEE Signal Process. Lett. 28, 334–338 (2021)
Chen, K., et al.: MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
Chen, R.J., Bobrow, T.L., Athey, T., Mahmood, F., Durr, N.J.: Slam endoscopy enhanced by adversarial depth prediction. In: KDD Workshop on Applied Data Science for Healthcare 2019 (2019)
Choi, S., Zhou, Q.Y., Koltun, V.: Robust reconstruction of indoor scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Curless, B., Levoy, M.: A volumetric method for building complex models from range images. Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (1996)
Dai, A., Nießner, M., Zollhöfer, M., Izadi, S., Theobalt, C.: Bundlefusion: real-time globally consistent 3d reconstruction using on-the-fly surface re-integration. CoRR (2016)
Dai, J., et al.: Deformable convolutional networks. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 764–773 (2017)
DeTone, D., Malisiewicz, T., Rabinovich, A.: Superpoint: self-supervised interest point detection and description. CoRR (2017). http://arxiv.org/abs/1712.07629
Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. (2018)
Godard, C., Mac Aodha, O., Firman, M., Brostow, G.J.: Digging into self-supervised monocular depth prediction (October 2019)
Gower, J.: Generalized procrustes analysis. Psychometrika 40(1), 33–51 (1975)
Grisetti, G., Kümmerle, R., Stachniss, C., Burgard, W.: A tutorial on graph-based slam. IEEE Intell. Transp. Syst. Mag. 2(4), 31–43 (2010)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
Jau, Y.Y., Zhu, R., Su, H., Chandraker, M.: Deep keypoint-based camera pose estimation with geometric constraints. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4950–4957 (2020). https://doi.org/10.1109/IROS45743.2020.9341229
Kabsch, W.: A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A 32(5), 922–923 (1976)
Kumar, V.R., et al.: Fisheyedistancenet: self-supervised scale-aware distance estimation using monocular fisheye camera for autonomous driving. 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 574–581 (2020)
Liang, Z., Richards, R.: Virtual colonoscopy vs optical colonoscopy. Expert Opin. Med. Diagn. 4(2), 159–169 (2010). PMID: 20473367
Lin, T.Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944 (2017)
Lorensen, W.E., Cline, H.E.: Marching cubes: a high resolution 3d surface construction algorithm. SIGGRAPH Comput. Graph. 21(4), 163–169 (1987)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157. IEEE (1999)
Ma, R., et al.: Colon10k: a benchmark for place recognition in colonoscopy. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 1279–1283 (2021). https://doi.org/10.1109/ISBI48211.2021.9433780
Ma, R., et al.: Rnnslam: reconstructing the 3d colon to visualize missing regions during a colonoscopy. Med. Image Anal. 72, 102100 (2021)
Mirzaei, H., Panahi, M., Etemad, K., Ghanbari-Motlagh, A., Holakouie-Naini, K.A.: Evaluation of pilot colorectal cancer screening programs in Iran. Iranian J. Epidem. 12(3) (2016)
Mohaghegh, P., Ahmadi, F., Shiravandi, M., Nazari, J.: Participation rate, risk factors, and incidence of colorectal cancer in the screening program among the population covered by the health centers in Arak, Iran. Inter. J. Cancer Manag. 14(7), e113278 (2021)
Moshfeghi, K., Mohammadbeigi, A., Hamedi-Sanani, D., Bahrami, M.: Evaluation the role of nutritional and individual factors in colorectal cancer. Zahedan J. Res. Med. Sci. 13(4), e93934 (2011)
Mur-Artal, R., Montiel, J.M.M., Tardós, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. CoRR (2015). http://arxiv.org/abs/1502.00956
van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. ArXiv (2018)
Ozyoruk, K.B., et al.: Endoslam dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos. Med. Image Anal. 71, 102058 (2021)
Rau, A., et al.: Implicit domain adaptation with conditional generative adversarial networks for depth prediction in endoscopy. Int. J. Comput. Assist. Radiol. Surg. 14 (2019)
Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. CoRR (2014). http://arxiv.org/abs/1409.0575
Shao, S., et al.: Self-supervised monocular depth and ego-motion estimation in endoscopy: Appearance flow to the rescue. Med. Image Anal., 102338 (2021)
Smith, K., et al.: Data from ct colonography. the cancer imaging archive (2015). https://doi.org/10.7937/K9/TCIA.2015.NWTESAY1
Sohn, K.: Improved deep metric learning with multi-class n-pair loss objective. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS 2016, pp. 1857–1865. Curran Associates Inc., Red Hook, NY, USA (2016)
Widya, A.R., Monno, Y., Okutomi, M., Suzuki, S., Gotoda, T., Miki, K.: Learning-based depth and pose estimation for monocular endoscope with loss generalization. CoRR abs/2107.13263 (2021)
Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance-level discrimination. ArXiv (2018)
Yao, H., Stidham, R.W., Gao, Z., Gryak, J., Najarian, K.: Motion-based camera localization system in colonoscopy videos. Med. Image Anal. 73, 102180 (2021)
Zhang, S., Zhao, L., Huang, S., Ye, M., Hao, Q.: A template-based 3d reconstruction of colon structures and textures from stereo colonoscopic images. IEEE Trans. Med. Robotics Bionics 3(1), 85–95 (2021)
Zhang, Y., et al.: ColDE: a depth estimation framework for colonoscopy reconstruction (2021)
Zhang, Y., Wang, S., Ma, R., McGill, S.K., Rosenman, J.G., Pizer, S.M.: Lighting enhancement aids reconstruction of colonoscopic surfaces (2021)
Zhang, Z., Scaramuzza, D.: A tutorial on quantitative trajectory evaluation for visual(-inertial) odometry. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7244–7251 (2018)
Zhou, Q.Y., Koltun, V.: Dense scene reconstruction with points of interest. ACM Trans. Graph. 32 (2013)
Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. CoRR (2017). http://arxiv.org/abs/1704.07813
A. Appendix
1.1 A.1 Synthetic Data Generation
For the purpose of reproducibility, we state the parameters that were used to build each synthetic sequence using the synthetic colonoscopy simulator [41]. The parameters are summarised in Table 4, where RP stands for Random Path. Note that the simulator does not allow the user to set the seed of the random number generator that selects the random path.
1.2 A.2 Depth Training and Implementation Details
We use the AdamW optimizer [23], with \(\beta _1 = 0.9\), \(\beta _2 = 0.999\). We train the synthetic and the Colon10k models for 40 epochs, with a batch size of 16 on a 24 GB Nvidia RTX 3090. The initial learning rate is \(10^{-4}\); we halve it at the 16th, 24th and 32nd epochs. For the 3D colon print model, we train for 200 epochs and halve the learning rate at the 80th, 120th and 170th epochs. We center-crop the synthetic images to \(400\times 400\) to remove vignetting effects. The Colon10k images are provided in an undistorted, center-cropped version of \(270\times 216\) pixels. Finally, the cropped image is scaled to \(224\times 224\) before being fed to the network. For the 3D colon print, we employ test-time training due to the scarcity of the data and the fact that the training process is completely self-supervised.
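For reference, the optimizer and step-wise learning-rate schedule above can be set up in PyTorch roughly as follows; this is a minimal sketch that only mirrors the stated hyper-parameters (the `model` placeholder and the loop body are assumptions, not the authors' training code).

```python
import torch

# Minimal sketch of the stated setup: AdamW (beta1=0.9, beta2=0.999),
# initial LR 1e-4, halved at epochs 16, 24 and 32, trained for 40 epochs.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3))  # placeholder network

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[16, 24, 32], gamma=0.5)

for epoch in range(40):
    # ... iterate over batches of size 16, compute the self-supervised losses,
    #     call optimizer.step() per batch ...
    scheduler.step()  # advance the LR schedule once per epoch
```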
To generate the specular reflection mask for each frame, we convert the input frames to YUV color-space and apply a threshold of 90% on the Y channel and dilate the resulting binary mask with a kernel of 13 pixels.
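The mask computation described above can be sketched with OpenCV as follows; the exact thresholding convention (a fixed 90% of the 8-bit luma range rather than a per-frame percentile) and the function name are assumptions.

```python
import cv2
import numpy as np

def specular_mask(frame_bgr: np.ndarray) -> np.ndarray:
    """Sketch of the specular-reflection mask: threshold the Y (luma) channel
    at 90% of its range and dilate the binary mask with a 13-pixel kernel."""
    yuv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YUV)
    y = yuv[..., 0]
    mask = (y >= 0.9 * 255).astype(np.uint8)   # 90% threshold on the Y channel
    kernel = np.ones((13, 13), np.uint8)       # 13-pixel dilation kernel
    return cv2.dilate(mask, kernel)
```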
We use MMLab’s [5] implementation of ResNet [16], deformable convolutions and FPN. All ResNet encoders and the FPN were pre-trained on ImageNet [34]. We use ResNet50 for the depth encoder. For the pose encoder and FPN, we use ResNet18. Deformable convolution layers are applied in the depth encoder stages of conv3, conv4 and conv5. We set \(\lambda _{ph-extra}=0.1\) and \(\lambda _{dc}=0.1\), \(\tau =0.01\).
1.3 A.3 Correspondence Matching Qualitative Results
Matching examples of ContraFeat, SIFT [24] and SuperPoint [11] are shown in Fig. 8. ContraFeat tends to produce more correct matches that are spread evenly throughout the image, and is less susceptible to drastic illumination changes.
1.4 A.4 Comparison of the Estimated Trajectories and Ground Truth Trajectories
Fig. 9 compares the estimated trajectory and the ground truth trajectory on the 3D colon print for DSO [12], our framework using SuperPoint [11] and our proposed method. The pose estimation from the network is of arbitrary scale. Therefore, we first align the two trajectories using a similarity transform [18], followed by first-frame alignment for better visualization and comparison. Note that the trajectory estimated by our framework is more accurate, with loops of shape similar to the ground truth trajectory.
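The similarity-transform alignment step can be sketched with the standard closed-form Umeyama/Procrustes solution; this is a generic implementation under the stated assumptions (estimated trajectory X aligned to ground truth Y), not the authors' exact code.

```python
import numpy as np

def umeyama_alignment(X: np.ndarray, Y: np.ndarray):
    """Similarity (sim(3)) alignment of estimated positions X (N x 3) to
    ground-truth positions Y (N x 3): finds s, R, t such that Y ~ s * R @ x + t."""
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    cov = Yc.T @ Xc / X.shape[0]                 # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    var_x = (Xc ** 2).sum() / X.shape[0]
    s = np.trace(np.diag(D) @ S) / var_x          # scale
    t = mu_y - s * R @ mu_x                       # translation
    return s, R, t
```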
1.5 A.5 Extra Qualitative Depth-Map Prediction Results
In Fig. 10 and Fig. 11 we show additional depth-map predictions on the 3D colon print and Colon10K [25] data.
1.6 A.6 Extra Qualitative 3D Reconstruction Results on Colon10K
In Fig. 12 and Fig. 13 we show additional viewpoints of the 3D reconstructions produced by our proposed framework on the Colon10K [25] and 3D colon print data.
1.7 A.7 SuperPoint Training
SuperPoint [11] was trained using the PyTorch implementation of [17], with their suggested improvements that enable end-to-end training: a soft-argmax at the detector head and a sparse descriptor loss that allows efficient training. Photometric augmentations were adapted to the colon dataset by lowering the contrast, blur and noise levels to values that still enable feature extraction even from the deeper, shadowed areas of the colon. The network was trained for about 100 epochs, with a batch size of 10 and a learning rate of 0.0003. The best checkpoint was chosen based on validation-set precision and recall.
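For illustration, the toned-down photometric augmentations could look roughly like the following torchvision sketch applied to float image tensors in [0, 1]; the concrete strengths are illustrative assumptions, not the values actually used for the colon data.

```python
import torch
from torchvision import transforms

# Hedged sketch of reduced-strength photometric augmentations: mild contrast
# jitter, mild blur and low additive noise so features survive in dark regions.
colon_photometric = transforms.Compose([
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 0.5)),
    transforms.Lambda(lambda x: (x + 0.01 * torch.randn_like(x)).clamp(0, 1)),
])
```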
1.8 A.8 Supplementary Video Results
In the supplementary video rgb_tex_geo.mp4 we show a full endoscopic investigation of the 3D colon print, comparing the resemblance between the reconstructed model and the captured RGB images. This is accomplished by re-rendering the reconstructed model using the camera intrinsics, the predicted camera poses and the framework's output mesh. The video visualises the captured frames (left) next to the re-rendered, textured reconstruction (right); an example can be seen in Fig. 14. An additional camera fly-through video, labeled fly_through.mkv, shows the final reconstruction of the 3D colon print.
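Conceptually, the re-rendering step amounts to placing a virtual camera with the known intrinsics at each predicted pose and rasterizing the output mesh. A minimal sketch using pyrender is shown below; the library choice, placeholder intrinsics and variable names are assumptions for illustration, not the authors' tooling.

```python
import numpy as np
import pyrender
import trimesh

# Hedged re-rendering sketch: render the reconstructed mesh from each predicted
# camera pose with the known intrinsics, for side-by-side comparison with the
# captured RGB frames. Note pyrender uses the OpenGL camera convention, so
# OpenCV-style poses need the usual y/z axis flip.
fx = fy = 200.0; cx = cy = 112.0; W = H = 224        # placeholder intrinsics
mesh = trimesh.load('reconstruction.ply')             # framework's output mesh
scene = pyrender.Scene(ambient_light=np.ones(3))
scene.add(pyrender.Mesh.from_trimesh(mesh))

camera = pyrender.IntrinsicsCamera(fx=fx, fy=fy, cx=cx, cy=cy)
cam_node = scene.add(camera, pose=np.eye(4))
renderer = pyrender.OffscreenRenderer(viewport_width=W, viewport_height=H)

predicted_poses = [np.eye(4)]                          # 4x4 camera-to-world poses
for T_wc in predicted_poses:
    scene.set_pose(cam_node, pose=T_wc)
    color, depth = renderer.render(scene)              # compare against captured RGB
```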