PlaneFormers: From Sparse View Planes to 3D Reconstruction

Agarwala, Samir; Jin, Linyi; Rockwell, Chris; Fouhey, David F.

doi:10.1007/978-3-031-20062-5_12

Samir Agarwala¹²,
Linyi Jin¹²,
Chris Rockwell¹² &
…
David F. Fouhey¹²

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13663))

Included in the following conference series:

European Conference on Computer Vision

2271 Accesses

Abstract

We present an approach for the planar surface reconstruction of a scene from images with limited overlap. This reconstruction task is challenging since it requires jointly reasoning about single image 3D reconstruction, correspondence between images, and the relative camera pose between images. Past work has proposed optimization-based approaches. We introduce a simpler approach, the PlaneFormer, that uses a transformer applied to 3D-aware plane tokens to perform 3D reasoning. Our experiments show that our approach is substantially more effective than prior work, and that several 3D-specific design decisions are crucial for its success. Code is available at https://github.com/samiragarwala/PlaneFormers.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

¥17,985 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: JPY 3498; Price includes VAT (Japan)

eBook: JPY 12583; Price includes VAT (Japan)

Softcover Book: JPY 15729; Price includes VAT (Japan)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Unstructured Multi-view Depth Estimation Using Mask-Based Multiplane Representation

PlaneStereo: Plane-aware Multi-view Stereo

Article 28 November 2024

CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians

References

Agarwal, S., Snavely, N., Seitz, S.M., Szeliski, R.: Bundle adjustment in the large. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 29–42. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15552-9_3
Chapter Google Scholar
Bloem, P.: August 2019. http://peterbloem.nl/blog/transformers
Bozic, A., Palafox, P., Thies, J., Dai, A., Nießner, M.: Transformerfusion: monocular RGB scene reconstruction using transformers. In: NeurIPS, vol. 34 (2021)
Google Scholar
Cai, R., Hariharan, B., Snavely, N., Averbuch-Elor, H.: Extreme rotation estimation using dense correlation volumes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)
Google Scholar
Cai, Z., et al.: MessyTable: instance association in multiple camera views. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 1–16. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_1
Chapter Google Scholar
Chang, A., et al.: Matterport3D: learning from RGB-D data in indoor environments. In: 3DV (2017)
Google Scholar
Chen, A., et al.: MVSNeRF: fast generalizable radiance field reconstruction from multi-view stereo. In: ICCV, pp. 14124–14133 (2021)
Google Scholar
Chen, K., Snavely, N., Makadia, A.: Wide-baseline relative camera pose estimation with directional learning. In: CVPR, pp. 3258–3268, June 2021
Google Scholar
Chen, W., Qian, S., Fan, D., Kojima, N., Hamilton, M., Deng, J.: Oasis: a large-scale dataset for single image 3D in the wild. In: CVPR (2020)
Google Scholar
Choy, C., Dong, W., Koltun, V.: Deep global registration. In: CVPR, pp. 2514–2523 (2020)
Google Scholar
Choy, C.B., Xu, D., Gwak, J.Y., Chen, K., Savarese, S.: 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 628–644. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_38
Chapter Google Scholar
Eigen, D., Fergus, R.: Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: ICCV (2015)
Google Scholar
El Banani, M., Gao, L., Johnson, J.: Unsupervised R &R: unsupervised point cloud registration via differentiable rendering. In: CVPR (2021)
Google Scholar
Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In: CVPR (2017)
Google Scholar
Furukawa, Y., Curless, B., Seitz, S.M., Szeliski, R.: Manhattan-world stereo. In: CVPR (2009)
Google Scholar
Gkioxari, G., Malik, J., Johnson, J.: Mesh R-CNN. In: ICCV (2019)
Google Scholar
Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge (2004). ISBN 0521540518
Google Scholar
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
Google Scholar
Hoiem, D., Efros, A.A., Hebert, M.: Geometric context from a single image. In: ICCV, vol. 1, pp. 654–661. IEEE (2005)
Google Scholar
Huang, P.H., Matzen, K., Kopf, J., Ahuja, N., Huang, J.B.: DeepMVS: learning multi-view stereopsis. In: CVPR (2018)
Google Scholar
Huang, Z., et al.: Deep volumetric video from very sparse multi-view performance capture. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11220, pp. 351–369. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01270-0_21
Chapter Google Scholar
Jain, A., Tancik, M., Abbeel, P.: Putting nerf on a diet: semantically consistent few-shot view synthesis. In: ICCV, pp. 5885–5894 (2021)
Google Scholar
Jiang, C., Sud, A., Makadia, A., Huang, J., Nießner, M., Funkhouser, T., et al.: Local implicit grid representations for 3D scenes. In: CVPR, pp. 6001–6010 (2020)
Google Scholar
Jin, L., Qian, S., Owens, A., Fouhey, D.F.: Planar surface reconstruction from sparse views. In: ICCV (2021)
Google Scholar
Jin, Y., et al.: Image matching across wide baselines: from paper to practice. IJCV 129(2), 517–547 (2020)
Article Google Scholar
Kar, A., Häne, C., Malik, J.: Learning a multi-view stereo machine. In: NeurIPS (2017)
Google Scholar
Kopf, J., Rong, X., Huang, J.B.: Robust consistent video depth estimation. In: CVPR, pp. 1611–1621 (2021)
Google Scholar
Kuhn, H.W.: The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2(1–2), 83–97 (1955)
Article MathSciNet MATH Google Scholar
Li, Z., Snavely, N.: Megadepth: learning single-view depth prediction from internet photos. In: CVPR, pp. 2041–2050 (2018)
Google Scholar
Lin, C.H., Ma, W.C., Torralba, A., Lucey, S.: Barf: bundle-adjusting neural radiance fields. In: ICCV (2021)
Google Scholar
Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformers. In: CVPR (2021)
Google Scholar
Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
Google Scholar
Lindenberger, P., Sarlin, P.E., Larsson, V., Pollefeys, M.: Pixel-perfect structure-from-motion with featuremetric refinement. In: ICCV, pp. 5987–5997 (2021)
Google Scholar
Liu, C., Kim, K., Gu, J., Furukawa, Y., Kautz, J.: Planercnn: 3D plane detection and reconstruction from a single image. In: CVPR (2019)
Google Scholar
Liu, C., Yang, J., Ceylan, D., Yumer, E., Furukawa, Y.: Planenet: piece-wise planar reconstruction from a single RGB image. In: CVPR, pp. 2579–2588 (2018)
Google Scholar
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
Article Google Scholar
Ma, Y., Soatto, S., Košecká, J., Sastry, S.: An Invitation to 3-D Vision: From Images to Geometric Models, vol. 26. Springer, New York (2004). https://doi.org/10.1007/978-0-387-21779-6
Book MATH Google Scholar
Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: learning 3D reconstruction in function space. In: CVPR, pp. 4460–4470 (2019)
Google Scholar
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 405–421. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_24
Chapter Google Scholar
Mur-Artal, R., Montiel, J.M.M., Tardos, J.D.: Orb-slam: a versatile and accurate monocular slam system. TOG 31(5), 1147–1163 (2015)
Google Scholar
Pritchett, P., Zisserman, A.: Wide baseline stereo matching. In: ICCV (1998)
Google Scholar
Qian, S., Jin, L., Fouhey, D.F.: Associative3D: volumetric reconstruction from sparse views. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12360, pp. 140–157. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58555-6_9
Chapter Google Scholar
Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. TPAMI (2020)
Google Scholar
Raposo, C., Lourenço, M., Antunes, M., Barreto, J.P.: Plane-based odometry using an RGB-D camera. In: BMVC (2013)
Google Scholar
Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: Superglue: learning feature matching with graph neural networks. In: CVPR (2020)
Google Scholar
Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)
Google Scholar
Schönberger, J.L., Zheng, E., Frahm, J.-M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 501–518. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_31
Chapter Google Scholar
Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: CVPR (2017)
Google Scholar
Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: LoFTR: detector-free local feature matching with transformers. In: CVPR (2021)
Google Scholar
Sun, J., Xie, Y., Chen, L., Zhou, X., Bao, H.: Neuralrecon: real-time coherent 3D reconstruction from monocular video. In: CVPR, pp. 15598–15607 (2021)
Google Scholar
Teed, Z., Deng, J.: Droid-slam: deep visual slam for monocular, stereo, and RGB-D cameras. In: NeurIPS, vol. 34 (2021)
Google Scholar
Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment — a modern synthesis. In: Triggs, B., Zisserman, A., Szeliski, R. (eds.) IWVA 1999. LNCS, vol. 1883, pp. 298–372. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-44480-7_21
Chapter Google Scholar
Ummenhofer, B., et al.: Demon: depth and motion network for learning monocular stereo. In: CVPR (2017)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: NeurIPS (2017)
Google Scholar
Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G.: Pixel2mesh: generating 3D mesh models from single RGB images. In: ECCV, pp. 52–67 (2018)
Google Scholar
Wang, Q., et al.: IBRNet: learning multi-view image-based rendering. In: CVPR (2021)
Google Scholar
Wang, W., Hu, Y., Scherer, S.: TartanVO: a generalizable learning-based VO. In: CoRL (2020)
Google Scholar
Wang, X., Fouhey, D.F., Gupta, A.: Designing deep networks for surface normal estimation. In: CVPR (2015)
Google Scholar
Wiles, O., Gkioxari, G., Szeliski, R., Johnson, J.: Synsin: end-to-end view synthesis from a single image. In: CVPR, pp. 7467–7477 (2020)
Google Scholar
Wong, S.: Takaratomy transformers henkei octane. https://live.staticflickr.com/3166/2970928056_c3b59be5ca_b.jpg
Wu, C., Clipp, B., Li, X., Frahm, J.M., Pollefeys, M.: 3D model matching with viewpoint-invariant patches (VIP). In: CVPR (2008)
Google Scholar
Yang, F., Zhou, Z.: Recovering 3D planes from a single image via convolutional neural networks. In: ECCV (2018)
Google Scholar
Yi, K.M., Trulls, E., Ono, Y., Lepetit, V., Salzmann, M., Fua, P.: Learning to find good correspondences. In: CVPR, pp. 2666–2674 (2018)
Google Scholar
Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelNeRF: neural radiance fields from one or few images. In: CVPR (2021)
Google Scholar
Yu, Z., Zheng, J., Lian, D., Zhou, Z., Gao, S.: Single-image piece-wise planar 3D reconstruction via associative embedding. In: CVPR, pp. 1029–1037 (2019)
Google Scholar
Zhang, J., et al.: Learning two-view correspondences and geometry using order-aware network. In: ICCV, pp. 5845–5854 (2019)
Google Scholar
Zhang, Z.: Iterative point matching for registration of free-form curves and surfaces. IJCV 13(2), 119–152 (1994)
Article Google Scholar
Zhang, Z., Cole, F., Tucker, R., Freeman, W.T., Dekel, T.: Consistent depth of moving objects in video. TOG 40(4), 1–12 (2021)
Google Scholar
Zhao, H., Jiang, L., Jia, J., Torr, P.H., Koltun, V.: Point transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16259–16268 (2021)
Google Scholar

Download references

Acknowledgements

This work was supported by the DARPA Machine Common Sense Program. We would like to thank Richard Higgins and members of the Fouhey lab for helpful discussions and feedback.

Author information

Authors and Affiliations

University of Michigan, Ann Arbor, USA
Samir Agarwala, Linyi Jin, Chris Rockwell & David F. Fouhey

Authors

Samir Agarwala
View author publications
You can also search for this author in PubMed Google Scholar
Linyi Jin
View author publications
You can also search for this author in PubMed Google Scholar
Chris Rockwell
View author publications
You can also search for this author in PubMed Google Scholar
David F. Fouhey
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Samir Agarwala .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 16939 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Agarwala, S., Jin, L., Rockwell, C., Fouhey, D.F. (2022). PlaneFormers: From Sparse View Planes to 3D Reconstruction. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13663. Springer, Cham. https://doi.org/10.1007/978-3-031-20062-5_12

Download citation

DOI: https://doi.org/10.1007/978-3-031-20062-5_12
Published: 11 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20061-8
Online ISBN: 978-3-031-20062-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

PlaneFormers: From Sparse View Planes to 3D Reconstruction