Deep Nets: What have They Ever Done for Vision? | International Journal of Computer Vision Skip to main content

Advertisement

Log in

Deep Nets: What have They Ever Done for Vision?

  • Published:
International Journal of Computer Vision Aims and scope Submit manuscript

Abstract

This is an opinion paper about the strengths and weaknesses of Deep Nets for vision. They are at the heart of the enormous recent progress in artificial intelligence and are of growing importance in cognitive science and neuroscience. They have had many successes but also have several limitations and there is limited understanding of their inner workings. At present Deep Nets perform very well on specific visual tasks with benchmark datasets but they are much less general purpose, flexible, and adaptive than the human visual system. We argue that Deep Nets in their current form are unlikely to be able to overcome the fundamental problem of computer vision, namely how to deal with the combinatorial explosion, caused by the enormous complexity of natural images, and obtain the rich understanding of visual scenes that the human visual achieves. We argue that this combinatorial explosion takes us into a regime where “big data is not enough” and where we need to rethink our methods for benchmarking performance and evaluating vision algorithms. We stress that, as vision algorithms are increasingly used in real world applications, that performance evaluation is not merely an academic exercise but has important consequences in the real world. It is impractical to review the entire Deep Net literature so we restrict ourselves to a limited range of topics and references which are intended as entry points into the literature. The views expressed in this paper are our own and do not necessarily represent those of anybody else in the computer vision community.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
¥17,985 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (Japan)

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

Notes

  1. The first author remembers that in the mid 1990s and early 2000s the term “neural network” in the title of a submission to a computer vision conference was sadly a good predictor for rejection and recalls sympathizing with researchers who were pursuing such unfashionable ideas.

  2. In addition to visualization, training a small neural network (also known as readout functions) on top of deep features is another popular technique of assessing how much they encode some particular properties, which is now widely adopted in the self-supervised learning literature (Noroozi et al. 2016; Zhang et al. 2016).

  3. Admittedly in ResNets (He et al. 2016), there is only one “decision layer”, and the analogy to “template matching” also weakens at higher layers due to presence of the residual connection.

  4. https://michaelbach.de/ot/.

  5. This issue we are describing is in general related to how Deep Nets can take unintended, “shortcut” solutions, for example the chromatic aberration noticed in Doersch et al. (2015), or the low level statistics and edge continuity noticed in Noroozi et al. (2016). In this paper we highlight “over-sensitivity to context” as the representative example, for both familiarity and keeping the discussion contained.

  6. The first author remembers that when studying text detection for the visually impaired we were so concerned about dataset biases that we recruited blind subjects who would walk the streets of San Francisco taking images automatically (but found the main difference from regular images was that there was a greater variety of angles).

  7. Available from Sowerby Research Centre, British Aerospace.

  8. https://www.caranddriver.com/features/a32266303/self-driving-cars-are-taking-longer-to-build-than-everyone-thought/.

  9. https://www.ncbi.nlm.nih.gov/books/NBK210143/.

  10. https://www.wired.com/story/done-right-ai-make-policing-fairer/.

  11. Quote during a public talk by a West Coast Professor who, perhaps coincidentally, had a start-up company.

References

  • Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2274–2282.

    Google Scholar 

  • Alcorn, MA., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W., & Nguyen, A. (2019). Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In CVPR, Computer Vision Foundation/IEEE (pp. 4845–4854).

  • Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016). Neural module networks. In CVPR, IEEE Computer Society (pp. 39–48).

  • Arbib, M. A., & Bonaiuto, J. J. (2016). From neuron to cognition via computational neuroscience. Cambridge: MIT Press.

    Google Scholar 

  • Arterberry, M. E., & Kellman, P. J. (2016). Development of perception in infancy: The cradle of knowledge revisited. Oxford: Oxford University Press.

    Google Scholar 

  • Athalye, A., Carlini, N., & Wagner, DA. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, JMLR.org, JMLR Workshop and Conference Proceedings (Vol. 80, pp. 274–283).

  • Barlow, H., & Tripathy, S. P. (1997). Correspondence noise and signal pooling in the detection of coherent visual motion. Journal of Neuroscience, 17(20), 7954–7966.

    Google Scholar 

  • Bashford, A., & Levine, P. (2010). The Oxford handbook of the history of eugenics. OUP USA.

  • Battaglia, P. W., Hamrick, J. B., & Tenenbaum, J. B. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45), 18327–18332.

    Google Scholar 

  • Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2), 115.

    Google Scholar 

  • Biggio, B., Corona, I., Maiorca, D., Nelson, B., Srndic, N., Laskov, P., Giacinto, G., & Roli, F. (2013). Evasion attacks against machine learning at test time. In ECML/PKDD (3), Springer, Lecture Notes in Computer Science (Vol. 8190, pp. 387–402).

  • Bowyer, KW., Kranenburg, C., & Dougherty, S. (1999). Edge detector evaluation using empirical ROC curves. In CVPR, IEEE Computer Society (pp. 1354–1359).

  • Boyden, E. S., Zhang, F., Bamberg, E., Nagel, G., & Deisseroth, K. (2005). Millisecond-timescale, genetically targeted optical control of neural activity. Nature Neuroscience, 8(9), 1263.

    Google Scholar 

  • Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (pp. 77–91).

  • Canny, J. F. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 679–698.

    Google Scholar 

  • Chang, AX., Funkhouser, TA., Guibas, LJ., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., & Yu, F. (2015). Shapenet: An information-rich 3d model repository. CoRR abs/1512.03012.

  • Changizi, M. (2010). The vision revolution: How the latest research overturns everything we thought we knew about human vision. Benbella books.

  • Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.

    Google Scholar 

  • Chen, X., & Yuille, AL. (2014). Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS (pp. 1736–1744).

  • Chen, X., & Yuille, AL. (2015). Parsing occluded people by flexible compositions. In CVPR, IEEE Computer Society (pp. 3945–3954).

  • Chen, Y., Zhu, L., Lin, C., Yuille, AL., & Zhang, H. (2007). Rapid inference on a novel AND/OR graph for object detection, segmentation and parsing. In NIPS, Curran Associates, Inc., (pp. 289–296).

  • Chomsky, N. (2014). Aspects of the theory of syntax. Cambridge: MIT Press.

    Google Scholar 

  • Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports, 6, 27755.

    Google Scholar 

  • Clune, J., Mouret, J. B., & Lipson, H. (2013). The evolutionary origins of modularity. Proceedings of the Royal Society B: Biological Sciences, 280(1755), 20122863.

    Google Scholar 

  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. MCSS, 2(4), 303–314.

    MathSciNet  MATH  Google Scholar 

  • Darwiche, A. (2018). Human-level intelligence or animal-like abilities? Commun ACM, 61(10), 56–67.

    Google Scholar 

  • Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F. (2009). Imagenet: A large-scale hierarchical image database. In CVPR, IEEE Computer Society (pp. 248–255).

  • Doersch, C., Gupta, A., & Efros, AA. (2015). Unsupervised visual representation learning by context prediction. In ICCV, IEEE Computer Society (pp. 1422–1430).

  • Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In NIPS (pp. 2366–2374)

  • Everingham, M., Gool, L. J. V., Williams, C. K. I., Winn, J. M., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

    Google Scholar 

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D. A., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

    Google Scholar 

  • Firestone, C. (2020). Performance versus competence in human-machine comparisons. In Proceedings of the National Academy of Sciences In Press.

  • Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition, Competition and cooperation in neural nets (pp. 267–285). Berlin: Springer.

    Google Scholar 

  • Geisler, W. S. (2011). Contributions of ideal observer theory to vision research. Vision Research, 51(7), 771–781.

    Google Scholar 

  • Geman, S. (2007). Compositionality in vision. In The grammar of vision: probabilistic grammar-based models for visual scene understanding and object categorization.

  • George, D., Lehrach, W., Kansky, K., Lázaro-Gredilla, M., Laan, C., Marthi, B., et al. (2017). A generative vision model that trains with high data efficiency and breaks text-based captchas. Science, 358(6368), eaag2612.

    Google Scholar 

  • Gibson, J. J. (1986). The ecological approach to visual perception. Hove: Psychology Press.

    Google Scholar 

  • Girshick, RB., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, IEEE Computer Society (pp. 580–587).

  • Goodfellow, IJ., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, AC., & Bengio, Y. (2014). Generative adversarial nets. In NIPS (pp. 2672–2680).

  • Goodfellow, IJ., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In International Conference on Learning Representations.

  • Gopnik, A., Meltzoff, A. N., & Kuhl, P. K. (1999). The scientist in the crib: Minds, brains, and how children learn. New York: William Morrow and Co.

    Google Scholar 

  • Gopnik, A., Glymour, C., Sobel, D. M., Schulz, L. E., Kushnir, T., & Danks, D. (2004). A theory of causal learning in children: Causal maps and bayes nets. Psychological Review, 111(1), 3.

    Google Scholar 

  • Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New Jersey: John Wiley.

    Google Scholar 

  • Gregoriou, G. G., Rossi, A. F., Ungerleider, L. G., & Desimone, R. (2014). Lesions of prefrontal cortex reduce attentional modulation of neuronal responses and synchrony in v4. Nature Neuroscience, 17(7), 1003–1011.

    Google Scholar 

  • Gregory, R. L. (1973). Eye and brain: The psychology of seeing. New York: McGraw-Hill.

    Google Scholar 

  • Grenander, U. (1993). General pattern theory-A mathematical study of regular structures. Oxford: Clarendon Press.

    MATH  Google Scholar 

  • Guu, K., Pasupat, P., Liu, EZ., & Liang, P. (2017). From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In ACL (1), Association for Computational Linguistics (pp. 1051–1062).

  • Guzmán, A. (1968). Decomposition of a visual scene into three-dimensional bodies. In Proceedings of the December 9–11, 1968, Fall Joint Computer Conference, Part I (pp. 291–304)

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR, IEEE Computer Society (pp. 770–778).

  • He, K., Fan, H., Wu, Y., Xie, S., & Girshick, RB. (2019). Momentum contrast for unsupervised visual representation learning. CoRR abs/1911.05722.

  • Hoffman, J., Tzeng, E., Park, T., Zhu, J., Isola, P., Saenko, K., et al. (2018). Cycada: Cycle-consistent adversarial domain adaptation. ICML, PMLR, Proceedings of Machine Learning Research, 80, 1994–2003.

    Google Scholar 

  • Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In ECCV (3), Springer, Lecture Notes in Computer Science. (Vol. 7574, pp. 340–353).

  • Hornik, K., Stinchcombe, M. B., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.

    MATH  Google Scholar 

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, JMLR.org, JMLR Workshop and Conference Proceedings (Vol. 37, pp. 448–456).

  • Jabr, F. (2012). The connectome debate: Is mapping the mind of a worm worth it. New York: Scientific American.

    Google Scholar 

  • Jégou, S., Drozdzal, M., Vázquez, D., Romero, A., & Bengio, Y.(2017) The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In CVPR Workshops, IEEE Computer Society (pp. 1175–1183).

  • Julesz, B. (1971). Foundations of cyclopean perception. Chicago: U. Chicago Press.

    Google Scholar 

  • Kaushik, D., Hovy, EH., & Lipton, ZC. (2020). Learning the difference that makes A difference with counterfactually-augmented data. In ICLR, OpenReview.net.

  • Kokkinos, I. (2017). Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, IEEE Computer Society (pp. 5454–5463).

  • Konishi, S., Yuille, AL., Coughlan, JM., & Zhu, SC. (1999). Fundamental bounds on edge detection: An information theoretic evaluation of different edge cues. In CVPR, IEEE Computer Society (pp. 1573–1579)

  • Konishi, S., Yuille, A. L., Coughlan, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(1), 57–74.

    Google Scholar 

  • Kortylewski, A., He, J., Liu, Q., & Yuille, AL. (2020). Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion. CoRR abs/2003.04490.

  • Kortylewski, A., Liu, Q., Wang, H., Zhang, Z., & Yuille, AL. (2020). Combining compositional models and deep networks for robust object classification under occlusion. In WACV, IEEE (pp. 1322–1330).

  • Krizhevsky, A., Sutskever, I., & Hinton, GE. (2012). Imagenet classification with deep convolutional neural networks. In NIPS (pp. 1106–1114).

  • LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.

    Google Scholar 

  • Lee, T. S., & Mumford, D. (2003). Hierarchical bayesian inference in the visual cortex. JOSA A, 20(7), 1434–1448.

    Google Scholar 

  • Lin, X., Wang, H., Li, Z., Zhang, Y., Yuille, AL., & Lee, TS. (2017). Transfer of view-manifold learning to similarity perception of novel objects. In International Conference on Learning Representations.

  • Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L., Fei-Fei, L., Yuille, AL., Huang, J., & Murphy, K. (2018). Progressive neural architecture search. In ECCV (1), Springer, Lecture Notes in Computer Science (Vol. 11205, pp. 19–35).

  • Liu, C., Dollár, P., He, K., Girshick, RB., Yuille, AL., & Xie, S. (2020). Are labels necessary for neural architecture search? CoRR abs/2003.12056.

  • Liu, R., Liu, C., Bai, Y., & Yuille, AL. (2019). Clevr-ref+: Diagnosing visual reasoning with referring expressions. In CVPR, Computer Vision Foundation/IEEE (pp. 4185–4194).

  • Liu, Z., Knill, D. C., & Kersten, D. (1995). Object classification for human and ideal observers. Vision Research, 35(4), 549–568.

    Google Scholar 

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR, IEEE Computer Society (pp. 3431–3440).

  • Lu, H., & Yuille, AL. (2005). Ideal observers for detecting motion: Correspondence noise. In NIPS (pp. 827–834).

  • Lyu, J., Qiu, W., Wei, X., Zhang, Y., Yuille, AL., & Zha, Z. (2019). Identity preserve transform: Understand what activity classification models have learnt. CoRR abs/1912.06314.

  • Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. CoRR abs/1706.06083.

  • Mao, J., Wei, X., Yang, Y., Wang, J., Huang, Z., & Yuille, AL. (2015). Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In ICCV, IEEE Computer Society (pp. 2533–2541).

  • Marcus, G. (2018). Deep learning: A critical appraisal. CoRR abs/1801.00631.

  • Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: Henry Holt and Co. Inc.

    Google Scholar 

  • Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, IEEE Computer Society (pp. 4040–4048).

  • McManus, J. N., Li, W., & Gilbert, C. D. (2011). Adaptive shape processing in primary visual cortex. Proceedings of the National Academy of Sciences, 108(24), 9739–9746.

    Google Scholar 

  • Mengistu, H., Huizinga, J., Mouret, J., & Clune, J. (2016). The evolutionary origins of hierarchy. PLoS Computational Biology, 12(6), e1004829.

    Google Scholar 

  • Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. CoRR abs/1411.1784.

  • Mu, J., Qiu, W., Hager, GD., & Yuille, AL. (2019). Learning from synthetic animals. CoRR abs/1912.08265.

  • Mumford, D. (1994). Pattern theory: a unifying perspective. In First European Congress of Mathematics, Springer (pp. 187–224).

  • Mumford, D., & Desolneux, A. (2010). Pattern theory: The stochastic analysis of real-world signals. Cambridge: CRC Press.

    MATH  Google Scholar 

  • Murez, Z., Kolouri, S., Kriegman, DJ., Ramamoorthi, R., Kim, K. (2018). Image to image translation for domain adaptation. In CVPR (pp. 4500–4509), 10.1109/CVPR.2018.00473, http://openaccess.thecvf.com/content_cvpr_2018/html/Murez_Image_to_Image_CVPR_2018_paper.html.

  • Noroozi, M., Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV (6), Springer, Lecture Notes in Computer Science (Vol. 9910, pp. 69–84).

  • Papandreou, G., Chen, L., Murphy, KP., Yuille, AL. (2015). Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, IEEE Computer Society (pp. 1742–1750).

  • Pearl, J. (1989). Probabilistic reasoning in intelligent systems—networks of plausible inference. Morgan Kaufmann series in representation and reasoning, Morgan Kaufmann.

  • Pearl, J. (2009). Causality. Cambridge: Cambridge University Press.

    MATH  Google Scholar 

  • Penn, D. C., Holyoak, K. J., & Povinelli, D. J. (2008). Darwin’s mistake: Explaining the discontinuity between human and nonhuman minds. Behavioral and Brain Sciences, 31(2), 109–130.

    Google Scholar 

  • Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., & Dean, J. (2018). Efficient neural architecture search via parameter sharing. ICML, PMLR, Proceedings of Machine Learning Research, 80, 4092–4101.

    Google Scholar 

  • Poirazi, P., & Mel, B. W. (2001). Impact of active dendrites and structural plasticity on the memory capacity of neural tissue. Neuron, 29(3), 779–796.

    Google Scholar 

  • Qiao, S., Liu, C., Shen, W., & Yuille, AL. (2018). Few-shot image recognition by predicting parameters from activations. In CVPR, IEEE Computer Society (pp. 7229–7238).

  • Qiu, W., & Yuille, AL. (2016). Unrealcv: Connecting computer vision to unreal engine. In ECCV Workshops (3), Lecture Notes in Computer Science (Vol. 9915, pp. 909–916).

  • Ren, S., He, K., Girshick, RB., & Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS (pp. 91–99).

  • Ren, Z., Yan, J., Ni, B., Liu, B., Yang, X., & Zha, H. (2017). Unsupervised deep learning for optical flow estimation. In AAAI, AAAI Press (pp. 1495–1501).

  • Rensink, R. A., O’Regan, J. K., & Clark, J. J. (1997). To see or not to see: The need for attention to perceive changes in scenes. Psychological Science, 8(5), 368–373.

    Google Scholar 

  • Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019.

    Google Scholar 

  • Rosenfeld, A., Zemel, RS., & Tsotsos, JK. (2018). The elephant in the room. CoRR abs/1808.03305.

  • Rother, C., Kolmogorov, V., & Blake, A. (2004). “Grabcut”: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3), 309–314.

    Google Scholar 

  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.

    MATH  Google Scholar 

  • Russell, S. J., & Norvig, P. (2010). Artificial Intelligence—A Modern Approach. Pearson Education: Third International Edition.

  • Sabour, S., Frosst, N., & Hinton, GE. (2017). Dynamic routing between capsules. In NIPS (pp. 3856–3866).

  • Salakhutdinov, R., Tenenbaum, JB., & Torralba, A. (2012). One-shot learning with a hierarchical nonparametric bayesian model. In ICML Unsupervised and Transfer Learning, JMLR.org, JMLR Proceedings (Vol. 27, pp. 195–206).

  • Santoro, A., Hill, F., Barrett, DGT., Morcos, AS., & Lillicrap, TP. (2018). Measuring abstract reasoning in neural networks. In ICML, JMLR.org, JMLR Workshop and Conference Proceedings (Vol. 80, pp. 4477–4486).

  • Seung, S. (2012). Connectome: How the brain’s wiring makes us who we are. HMH.

  • Shen, W., Zhao, K., Jiang, Y., Wang, Y., Bai, X., & Yuille, A. L. (2017a). Deepskeleton: Learning multi-task scale-associated deep side outputs for object skeleton extraction in natural images. IEEE Transactions on Image Processing, 26(11), 5298–5311.

    MathSciNet  MATH  Google Scholar 

  • Shen, Z., Liu, Z., Li, J., Jiang, Y., Chen, Y., & Xue, X. (2017) DSOD: learning deeply supervised object detectors from scratch. In ICCV, IEEE Computer Society (pp. 1937–1945).

  • Shu ,M., Liu, C., Qiu, W., & Yuille, AL. (2020). Identifying model weakness with adversarial examiner. In AAAI, AAAI Press, (pp. 11998–12006).

  • Simons, D. J., & Chabris, C. F. (1999). Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception, 28(9), 1059–1074.

    Google Scholar 

  • Simonyan ,K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.

  • Smirnakis, SM., & Yuille, AL. (1995). Neural implementation of bayesian vision theories by unsupervised learning. In The Neurobiology of Computation, Springer, (pp. 427–432).

  • Smith, L., & Gasser, M. (2005). The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1–2), 13–29.

    Google Scholar 

  • Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, IJ., & Fergus, R. (2014). Intriguing properties of neural networks. In International Conference on Learning Representations.

  • Tjan, B. S., Braje, W. L., Legge, G. E., & Kersten, D. (1995). Human efficiency for recognizing 3-d objects in luminance noise. Vision Research, 35(21), 3053–3069.

    Google Scholar 

  • Torralba, A., & Efros, AA. (2011). Unbiased look at dataset bias. In CVPR, IEEE Computer Society (pp. 1521–1528).

  • Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., & Madry, A. (2019). Robustness may be at odds with accuracy. In ICLR (Poster), OpenReview.net.

  • Tu, Z., Chen, X., Yuille, AL., & Zhu, SC. (2003). Image parsing: Unifying segmentation, detection, and recognition. In ICCV, IEEE Computer Society (pp. 18–25).

  • Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In CVPR (pp. 2962–2971), 10.1109/CVPR.2017.316, https://doi.org/10.1109/CVPR.2017.316

  • Uesato, J., O’Donoghue, B., Kohli, P., & van den Oord, A. (2018). Adversarial risk and the dangers of evaluating against weak attacks. ICML, PMLR, Proceedings of Machine Learning Research, 80, 5032–5041.

    Google Scholar 

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, AN., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In NIPS (pp. 5998–6008).

  • Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching networks for one shot learning. In NIPS (pp. 3630–3638).

  • Wang, J., Zhang, Z., Premachandran, V., & Yuille, AL. (2015). Discovering internal representations from object-cnns using population encoding. CoRR abs/1511.06855.

  • Wang, J., Zhang, Z., Xie, C., Zhou, Y., Premachandran, V., Zhu, J., et al. (2018). Visual concepts and compositional voting. Annals of Mathematical Sciences and Applications, 2(3), 4.

    MathSciNet  MATH  Google Scholar 

  • Wang, P., & Yuille, AL. (2016). DOC: deep occlusion estimation from a single image. In ECCV (1), Springer, Lecture Notes in Computer Science (Vol. 9905, pp. 545–561).

  • Wang, T., Zhao, J., Yatskar, M., Chang, K., & Ordonez, V. (2019). Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In ICCV, IEEE (pp. 5309–5318).

  • Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In ICCV, IEEE Computer Society (pp. 2794–2802).

  • Wen, H., Shi, J., Zhang, Y., Lu, K. H., Cao, J., & Liu, Z. (2017). Neural encoding and decoding with deep learning for dynamic natural vision. Cerebral Cortex, 28, 1–25.

    Google Scholar 

  • Wu, Z., Xiong, Y., Yu, SX., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In CVPR, IEEE Computer Society (pp. 3733–3742).

  • Xia, F., Wang, P., Chen, L., & Yuille, AL. (2016), Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV (5), Springer, Lecture Notes in Computer Science (Vol. 9909, pp. 648–663).

  • Xia, Y., Zhang, Y., Liu, F., Shen, W., & Yuille, AL. (2020).Synthesize then compare: Detecting failures and anomalies for semantic segmentation. CoRR abs/2003.08440.

  • Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., & Yuille, AL. (2017). Adversarial examples for semantic segmentation and object detection. In ICCV, IEEE Computer Society (pp. 1378–1387).

  • Xie, C., Wang, J., Zhangm, Z., Ren, Z., & Yuille, AL. (2018). Mitigating adversarial effects through randomization. In International Conference on Learning Representations.

  • Xie, L., & Yuille, AL. (2017). Genetic CNN. In ICCV, IEEE Computer Society (pp. 1388–1397).

  • Xie, S., & Tu, Z. (2015). Holistically-nested edge detection. In ICCV, IEEE Computer Society (pp. 1395–1403).

  • Xu, L., Krzyzak, A., & Yuille, A. L. (1994). On radial basis function nets and kernel regression: Statistical consistency, convergence rates, and receptive field size. Neural Networks, 7(4), 609–628.

    MATH  Google Scholar 

  • Yamane, Y., Carlson, E. T., Bowman, K. C., Wang, Z., & Connor, C. E. (2008). A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nature Neuroscience, 11(11), 1352–1360.

    Google Scholar 

  • Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23), 8619–8624.

    Google Scholar 

  • Yang, C., Kortylewski, A., Xie, C., Cao, Y., & Yuille, AL. (2020). Patchattack: A black-box texture-based attack with reinforcement learning. CoRR abs/2004.05682.

  • Yosinski, J., Clune, J., Nguyen, AM., Fuchs. TJ., & Lipson, H. (2015). Understanding neural networks through deep visualization. CoRR abs/1506.06579.

  • Yuille, A., & Kersten, D. (2006). Vision as bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301–308.

    Google Scholar 

  • Yuille, A. L., & Mottaghi, R. (2016). Complexity of representation and inference in compositional models with part sharing. Journal of Machine Learning Research, 17, 292–319.

    MathSciNet  MATH  Google Scholar 

  • Zbontar, J., & LeCun, Y. (2015). Computing the stereo matching cost with a convolutional neural network. In CVPR, IEEE Computer Society (pp. 1592–1599).

  • Zeiler, MD., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV (1), Springer, Lecture Notes in Computer Science (Vol. 8689, pp. 818–833).

  • Zendel, O., Murschitz, M., Humenberger, M., & Herzner, W. (2015). CV-HAZOP: introducing test data validation for computer vision. In ICCV, IEEE Computer Society (pp. 2066–2074).

  • Zhang, R., Isola, P., & Efros, AA. (2016). Colorful image colorization. In ECCV (3), Springer, Lecture Notes in Computer Science (Vol. 9907, pp. 649–666).

  • Zhang, Y., Qiu, W., Chen, Q., Hu, X., & Yuille, AL. (2018). Unrealstereo: Controlling hazardous factors to analyze stereo vision. In 3DV, IEEE Computer Society (pp. 228–237).

  • Zhang, Z., Shen, W., Qiao, S., Wang, Y., Wang, B., & Yuille, AL. (2020). Robust face detection via learning small faces on hard images. In WACV, IEEE (pp. 1350–1359).

  • Zhou, B., Lapedriza, À., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In NIPS (pp. 487–495).

  • Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene cnns. In International Conference on Learning Representations.

  • Zhou, T., Brown, M., Snavely, N., & Lowe, DG. (2017). Unsupervised learning of depth and ego-motion from video. In CVPR, IEEE Computer Society (pp. 6612–6619).

  • Zhou, Z., & Firestone, C. (2019). Humans can decipher adversarial images. Nature Communications, 10(1), 1–9.

    Google Scholar 

  • Zhu, H., Tang, P., Yuille, AL., Park, S., & Park, J. (2019). Robustness of object recognition under extreme occlusion in humans and computational models. In CogSci, cognitivesciencesociety.org (pp. 3213–3219).

  • Zhu, L., Chen, Y., Torralba, A., Freeman, WT., Yuille, AL. (2010). Part and appearance sharing: Recursive compositional models for multi-view. In CVPR, IEEE Computer Society (pp. 1919–1926).

  • Zhu, S., & Mumford, D. (2006). A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4), 259–362.

    MATH  Google Scholar 

  • Zhu, Z., Xie, L., & Yuille, AL. (2017). Object recognition with and without objects. In IJCAI, ijcai.org (pp. 3609–3615).

  • Zitnick, C. L., Agrawal, A., Antol, S., Mitchell, M., Batra, D., & Parikh, D. (2016). Measuring machine intelligence through visual question answering. AI Magazine, 37(1), 63–72.

    Google Scholar 

  • Zoph, B., & Le, QV. (2017). Neural architecture search with reinforcement learning. In ICLR, OpenReview.net.

Download references

Acknowledgements

This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC Award CCF-1231216 and ONR N00014-15-1-2356. We thank Kyle Rawlins, Tal Linzen, Wei Shen, and Adam Kortylewski for providing feedback and Weichao Qiu, Daniel Kersten, Ed Connor, Chaz Firestone, Vicente Ordonez, and Greg Hager for discussions on some of these topics. We thank the reviewers for some very helpful feedback which greatly improved the paper.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Chenxi Liu.

Additional information

Communicated by Ivan Laptev.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

For those readers unfamiliar with Monty Python see: https://youtu.be/Qc7HmhrgTuQ

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yuille, A.L., Liu, C. Deep Nets: What have They Ever Done for Vision?. Int J Comput Vis 129, 781–802 (2021). https://doi.org/10.1007/s11263-020-01405-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11263-020-01405-z

Keywords

Navigation