Abstract
This is an opinion paper about the strengths and weaknesses of Deep Nets for vision. They are at the heart of the enormous recent progress in artificial intelligence and are of growing importance in cognitive science and neuroscience. They have had many successes but also have several limitations, and there is limited understanding of their inner workings. At present, Deep Nets perform very well on specific visual tasks with benchmark datasets, but they are much less general-purpose, flexible, and adaptive than the human visual system. We argue that Deep Nets in their current form are unlikely to overcome the fundamental problem of computer vision, namely how to deal with the combinatorial explosion caused by the enormous complexity of natural images, and to obtain the rich understanding of visual scenes that the human visual system achieves. We argue that this combinatorial explosion takes us into a regime where "big data is not enough" and where we need to rethink our methods for benchmarking performance and evaluating vision algorithms. We stress that, as vision algorithms are increasingly used in real-world applications, performance evaluation is not merely an academic exercise but has important consequences in the real world. It is impractical to review the entire Deep Net literature, so we restrict ourselves to a limited range of topics and references which are intended as entry points into the literature. The views expressed in this paper are our own and do not necessarily represent those of anybody else in the computer vision community.
Notes
The first author remembers that in the mid 1990s and early 2000s the term “neural network” in the title of a submission to a computer vision conference was sadly a good predictor for rejection and recalls sympathizing with researchers who were pursuing such unfashionable ideas.
In addition to visualization, training a small neural network (known as a readout function) on top of deep features is another popular technique for assessing how much those features encode particular properties; it is now widely adopted in the self-supervised learning literature (Noroozi et al. 2016; Zhang et al. 2016).
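The readout idea can be sketched as follows. This is a toy, self-contained illustration (not any specific paper's setup): random 2-D vectors stand in for frozen deep features of two classes, and the "readout" is a single logistic unit trained by gradient descent; its accuracy indicates how linearly decodable the property of interest is from the features.

```python
import math
import random

# Toy stand-in for frozen deep features: each "image" has been reduced to a
# 2-D feature vector by a fixed (untrained) feature extractor. Class 0
# clusters near (0, 0) and class 1 near (2, 2) -- hypothetical data.
random.seed(0)
feats = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(50)] + \
        [(random.gauss(2, 0.3), random.gauss(2, 0.3)) for _ in range(50)]
labels = [0] * 50 + [1] * 50

# Linear readout: only these weights are trained; the features stay frozen.
w = [0.0, 0.0]
b = 0.0
lr = 0.1
for _ in range(200):
    for (x1, x2), y in zip(feats, labels):
        p = 1.0 / (1.0 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))
        g = p - y  # gradient of the cross-entropy loss w.r.t. the logit
        w[0] -= lr * g * x1
        w[1] -= lr * g * x2
        b -= lr * g

# High readout accuracy => the property is (linearly) decodable from features.
correct = sum(
    ((w[0] * x1 + w[1] * x2 + b) > 0) == (y == 1)
    for (x1, x2), y in zip(feats, labels)
)
accuracy = correct / len(labels)
print(accuracy)
```

In practice the readout is kept small (often linear, as here) precisely so that high accuracy reflects what the frozen features encode rather than what the readout itself can compute.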
Admittedly, in ResNets (He et al. 2016) there is only one "decision layer", and the analogy to "template matching" also weakens at higher layers due to the presence of residual connections.
The issue we are describing relates, more generally, to how Deep Nets can adopt unintended "shortcut" solutions, for example the chromatic aberration noticed in Doersch et al. (2015), or the low-level statistics and edge continuity noticed in Noroozi et al. (2016). In this paper we highlight "over-sensitivity to context" as the representative example, both for familiarity and to keep the discussion contained.
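A minimal toy illustration of why shortcut solutions are dangerous (hypothetical data, not from any cited paper): if a spurious context cue is perfectly correlated with the label during training, a learner that relies only on the context scores perfectly on training data yet fails completely once the context changes.

```python
import random

random.seed(0)

def make_data(n, context_matches_label):
    """Each example is (shape, context); 'shape' is the intended cue."""
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        shape = label                                    # genuine cue
        context = label if context_matches_label else 1 - label  # spurious cue
        data.append(((shape, context), label))
    return data

train = make_data(100, context_matches_label=True)
test = make_data(100, context_matches_label=False)  # context swapped at test time

# A "shortcut" classifier that looks only at the context bit:
shortcut = lambda x: x[1]
train_acc = sum(shortcut(x) == y for x, y in train) / len(train)
test_acc = sum(shortcut(x) == y for x, y in test) / len(test)
print(train_acc, test_acc)  # perfect on train, chance-or-worse once context flips
```

Nothing in the training loss distinguishes the shortcut from the intended solution here, which is why such failures surface only under distribution shift, e.g. objects placed in unusual contexts.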
The first author remembers that, when studying text detection for the visually impaired, we were so concerned about dataset biases that we recruited blind subjects who would walk the streets of San Francisco taking images automatically (but found that the main difference from regular images was a greater variety of angles).
Available from Sowerby Research Centre, British Aerospace.
Quote during a public talk by a West Coast Professor who, perhaps coincidentally, had a start-up company.
References
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., & Süsstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2274–2282.
Alcorn, MA., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W., & Nguyen, A. (2019). Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In CVPR, Computer Vision Foundation/IEEE (pp. 4845–4854).
Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016). Neural module networks. In CVPR, IEEE Computer Society (pp. 39–48).
Arbib, M. A., & Bonaiuto, J. J. (2016). From neuron to cognition via computational neuroscience. Cambridge: MIT Press.
Arterberry, M. E., & Kellman, P. J. (2016). Development of perception in infancy: The cradle of knowledge revisited. Oxford: Oxford University Press.
Athalye, A., Carlini, N., & Wagner, DA. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, JMLR.org, JMLR Workshop and Conference Proceedings (Vol. 80, pp. 274–283).
Barlow, H., & Tripathy, S. P. (1997). Correspondence noise and signal pooling in the detection of coherent visual motion. Journal of Neuroscience, 17(20), 7954–7966.
Bashford, A., & Levine, P. (2010). The Oxford handbook of the history of eugenics. OUP USA.
Battaglia, P. W., Hamrick, J. B., & Tenenbaum, J. B. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45), 18327–18332.
Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2), 115.
Biggio, B., Corona, I., Maiorca, D., Nelson, B., Srndic, N., Laskov, P., Giacinto, G., & Roli, F. (2013). Evasion attacks against machine learning at test time. In ECML/PKDD (3), Springer, Lecture Notes in Computer Science (Vol. 8190, pp. 387–402).
Bowyer, KW., Kranenburg, C., & Dougherty, S. (1999). Edge detector evaluation using empirical ROC curves. In CVPR, IEEE Computer Society (pp. 1354–1359).
Boyden, E. S., Zhang, F., Bamberg, E., Nagel, G., & Deisseroth, K. (2005). Millisecond-timescale, genetically targeted optical control of neural activity. Nature Neuroscience, 8(9), 1263.
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (pp. 77–91).
Canny, J. F. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6), 679–698.
Chang, AX., Funkhouser, TA., Guibas, LJ., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., & Yu, F. (2015). Shapenet: An information-rich 3d model repository. CoRR abs/1512.03012.
Changizi, M. (2010). The vision revolution: How the latest research overturns everything we thought we knew about human vision. Benbella books.
Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2018). Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834–848.
Chen, X., & Yuille, AL. (2014). Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS (pp. 1736–1744).
Chen, X., & Yuille, AL. (2015). Parsing occluded people by flexible compositions. In CVPR, IEEE Computer Society (pp. 3945–3954).
Chen, Y., Zhu, L., Lin, C., Yuille, AL., & Zhang, H. (2007). Rapid inference on a novel AND/OR graph for object detection, segmentation and parsing. In NIPS, Curran Associates, Inc., (pp. 289–296).
Chomsky, N. (2014). Aspects of the theory of syntax. Cambridge: MIT Press.
Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence. Scientific Reports, 6, 27755.
Clune, J., Mouret, J. B., & Lipson, H. (2013). The evolutionary origins of modularity. Proceedings of the Royal Society B: Biological Sciences, 280(1755), 20122863.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. MCSS, 2(4), 303–314.
Darwiche, A. (2018). Human-level intelligence or animal-like abilities? Commun ACM, 61(10), 56–67.
Deng, J., Dong, W., Socher, R., Li, L., Li, K., & Li, F. (2009). Imagenet: A large-scale hierarchical image database. In CVPR, IEEE Computer Society (pp. 248–255).
Doersch, C., Gupta, A., & Efros, AA. (2015). Unsupervised visual representation learning by context prediction. In ICCV, IEEE Computer Society (pp. 1422–1430).
Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In NIPS (pp. 2366–2374).
Everingham, M., Gool, L. J. V., Williams, C. K. I., Winn, J. M., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D. A., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Firestone, C. (2020). Performance versus competence in human-machine comparisons. In Proceedings of the National Academy of Sciences In Press.
Fukushima, K., & Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition, Competition and cooperation in neural nets (pp. 267–285). Berlin: Springer.
Geisler, W. S. (2011). Contributions of ideal observer theory to vision research. Vision Research, 51(7), 771–781.
Geman, S. (2007). Compositionality in vision. In The grammar of vision: probabilistic grammar-based models for visual scene understanding and object categorization.
George, D., Lehrach, W., Kansky, K., Lázaro-Gredilla, M., Laan, C., Marthi, B., et al. (2017). A generative vision model that trains with high data efficiency and breaks text-based captchas. Science, 358(6368), eaag2612.
Gibson, J. J. (1986). The ecological approach to visual perception. Hove: Psychology Press.
Girshick, RB., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, IEEE Computer Society (pp. 580–587).
Goodfellow, IJ., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, AC., & Bengio, Y. (2014). Generative adversarial nets. In NIPS (pp. 2672–2680).
Goodfellow, IJ., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
Gopnik, A., Meltzoff, A. N., & Kuhl, P. K. (1999). The scientist in the crib: Minds, brains, and how children learn. New York: William Morrow and Co.
Gopnik, A., Glymour, C., Sobel, D. M., Schulz, L. E., Kushnir, T., & Danks, D. (2004). A theory of causal learning in children: Causal maps and bayes nets. Psychological Review, 111(1), 3.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New Jersey: John Wiley.
Gregoriou, G. G., Rossi, A. F., Ungerleider, L. G., & Desimone, R. (2014). Lesions of prefrontal cortex reduce attentional modulation of neuronal responses and synchrony in v4. Nature Neuroscience, 17(7), 1003–1011.
Gregory, R. L. (1973). Eye and brain: The psychology of seeing. New York: McGraw-Hill.
Grenander, U. (1993). General pattern theory-A mathematical study of regular structures. Oxford: Clarendon Press.
Guu, K., Pasupat, P., Liu, EZ., & Liang, P. (2017). From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In ACL (1), Association for Computational Linguistics (pp. 1051–1062).
Guzmán, A. (1968). Decomposition of a visual scene into three-dimensional bodies. In Proceedings of the December 9–11, 1968, Fall Joint Computer Conference, Part I (pp. 291–304).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR, IEEE Computer Society (pp. 770–778).
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, RB. (2019). Momentum contrast for unsupervised visual representation learning. CoRR abs/1911.05722.
Hoffman, J., Tzeng, E., Park, T., Zhu, J., Isola, P., Saenko, K., et al. (2018). Cycada: Cycle-consistent adversarial domain adaptation. ICML, PMLR, Proceedings of Machine Learning Research, 80, 1994–2003.
Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In ECCV (3), Springer, Lecture Notes in Computer Science. (Vol. 7574, pp. 340–353).
Hornik, K., Stinchcombe, M. B., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, JMLR.org, JMLR Workshop and Conference Proceedings (Vol. 37, pp. 448–456).
Jabr, F. (2012). The connectome debate: Is mapping the mind of a worm worth it. New York: Scientific American.
Jégou, S., Drozdzal, M., Vázquez, D., Romero, A., & Bengio, Y. (2017). The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In CVPR Workshops, IEEE Computer Society (pp. 1175–1183).
Julesz, B. (1971). Foundations of cyclopean perception. Chicago: U. Chicago Press.
Kaushik, D., Hovy, EH., & Lipton, ZC. (2020). Learning the difference that makes A difference with counterfactually-augmented data. In ICLR, OpenReview.net.
Kokkinos, I. (2017). Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, IEEE Computer Society (pp. 5454–5463).
Konishi, S., Yuille, AL., Coughlan, JM., & Zhu, SC. (1999). Fundamental bounds on edge detection: An information theoretic evaluation of different edge cues. In CVPR, IEEE Computer Society (pp. 1573–1579).
Konishi, S., Yuille, A. L., Coughlan, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(1), 57–74.
Kortylewski, A., He, J., Liu, Q., & Yuille, AL. (2020). Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion. CoRR abs/2003.04490.
Kortylewski, A., Liu, Q., Wang, H., Zhang, Z., & Yuille, AL. (2020). Combining compositional models and deep networks for robust object classification under occlusion. In WACV, IEEE (pp. 1322–1330).
Krizhevsky, A., Sutskever, I., & Hinton, GE. (2012). Imagenet classification with deep convolutional neural networks. In NIPS (pp. 1106–1114).
LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
Lee, T. S., & Mumford, D. (2003). Hierarchical bayesian inference in the visual cortex. JOSA A, 20(7), 1434–1448.
Lin, X., Wang, H., Li, Z., Zhang, Y., Yuille, AL., & Lee, TS. (2017). Transfer of view-manifold learning to similarity perception of novel objects. In International Conference on Learning Representations.
Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L., Fei-Fei, L., Yuille, AL., Huang, J., & Murphy, K. (2018). Progressive neural architecture search. In ECCV (1), Springer, Lecture Notes in Computer Science (Vol. 11205, pp. 19–35).
Liu, C., Dollár, P., He, K., Girshick, RB., Yuille, AL., & Xie, S. (2020). Are labels necessary for neural architecture search? CoRR abs/2003.12056.
Liu, R., Liu, C., Bai, Y., & Yuille, AL. (2019). Clevr-ref+: Diagnosing visual reasoning with referring expressions. In CVPR, Computer Vision Foundation/IEEE (pp. 4185–4194).
Liu, Z., Knill, D. C., & Kersten, D. (1995). Object classification for human and ideal observers. Vision Research, 35(4), 549–568.
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR, IEEE Computer Society (pp. 3431–3440).
Lu, H., & Yuille, AL. (2005). Ideal observers for detecting motion: Correspondence noise. In NIPS (pp. 827–834).
Lyu, J., Qiu, W., Wei, X., Zhang, Y., Yuille, AL., & Zha, Z. (2019). Identity preserve transform: Understand what activity classification models have learnt. CoRR abs/1912.06314.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. CoRR abs/1706.06083.
Mao, J., Wei, X., Yang, Y., Wang, J., Huang, Z., & Yuille, AL. (2015). Learning like a child: Fast novel visual concept learning from sentence descriptions of images. In ICCV, IEEE Computer Society (pp. 2533–2541).
Marcus, G. (2018). Deep learning: A critical appraisal. CoRR abs/1801.00631.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: Henry Holt and Co. Inc.
Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, IEEE Computer Society (pp. 4040–4048).
McManus, J. N., Li, W., & Gilbert, C. D. (2011). Adaptive shape processing in primary visual cortex. Proceedings of the National Academy of Sciences, 108(24), 9739–9746.
Mengistu, H., Huizinga, J., Mouret, J., & Clune, J. (2016). The evolutionary origins of hierarchy. PLoS Computational Biology, 12(6), e1004829.
Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. CoRR abs/1411.1784.
Mu, J., Qiu, W., Hager, GD., & Yuille, AL. (2019). Learning from synthetic animals. CoRR abs/1912.08265.
Mumford, D. (1994). Pattern theory: a unifying perspective. In First European Congress of Mathematics, Springer (pp. 187–224).
Mumford, D., & Desolneux, A. (2010). Pattern theory: The stochastic analysis of real-world signals. Cambridge: CRC Press.
Murez, Z., Kolouri, S., Kriegman, DJ., Ramamoorthi, R., & Kim, K. (2018). Image to image translation for domain adaptation. In CVPR (pp. 4500–4509). https://doi.org/10.1109/CVPR.2018.00473.
Noroozi, M., Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV (6), Springer, Lecture Notes in Computer Science (Vol. 9910, pp. 69–84).
Papandreou, G., Chen, L., Murphy, KP., Yuille, AL. (2015). Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In ICCV, IEEE Computer Society (pp. 1742–1750).
Pearl, J. (1989). Probabilistic reasoning in intelligent systems—networks of plausible inference. Morgan Kaufmann series in representation and reasoning, Morgan Kaufmann.
Pearl, J. (2009). Causality. Cambridge: Cambridge University Press.
Penn, D. C., Holyoak, K. J., & Povinelli, D. J. (2008). Darwin’s mistake: Explaining the discontinuity between human and nonhuman minds. Behavioral and Brain Sciences, 31(2), 109–130.
Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., & Dean, J. (2018). Efficient neural architecture search via parameter sharing. ICML, PMLR, Proceedings of Machine Learning Research, 80, 4092–4101.
Poirazi, P., & Mel, B. W. (2001). Impact of active dendrites and structural plasticity on the memory capacity of neural tissue. Neuron, 29(3), 779–796.
Qiao, S., Liu, C., Shen, W., & Yuille, AL. (2018). Few-shot image recognition by predicting parameters from activations. In CVPR, IEEE Computer Society (pp. 7229–7238).
Qiu, W., & Yuille, AL. (2016). Unrealcv: Connecting computer vision to unreal engine. In ECCV Workshops (3), Lecture Notes in Computer Science (Vol. 9915, pp. 909–916).
Ren, S., He, K., Girshick, RB., & Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS (pp. 91–99).
Ren, Z., Yan, J., Ni, B., Liu, B., Yang, X., & Zha, H. (2017). Unsupervised deep learning for optical flow estimation. In AAAI, AAAI Press (pp. 1495–1501).
Rensink, R. A., O’Regan, J. K., & Clark, J. J. (1997). To see or not to see: The need for attention to perceive changes in scenes. Psychological Science, 8(5), 368–373.
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019.
Rosenfeld, A., Zemel, RS., & Tsotsos, JK. (2018). The elephant in the room. CoRR abs/1808.03305.
Rother, C., Kolmogorov, V., & Blake, A. (2004). “Grabcut”: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3), 309–314.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
Russell, S. J., & Norvig, P. (2010). Artificial intelligence: A modern approach (3rd international ed.). Pearson Education.
Sabour, S., Frosst, N., & Hinton, GE. (2017). Dynamic routing between capsules. In NIPS (pp. 3856–3866).
Salakhutdinov, R., Tenenbaum, JB., & Torralba, A. (2012). One-shot learning with a hierarchical nonparametric bayesian model. In ICML Unsupervised and Transfer Learning, JMLR.org, JMLR Proceedings (Vol. 27, pp. 195–206).
Santoro, A., Hill, F., Barrett, DGT., Morcos, AS., & Lillicrap, TP. (2018). Measuring abstract reasoning in neural networks. In ICML, JMLR.org, JMLR Workshop and Conference Proceedings (Vol. 80, pp. 4477–4486).
Seung, S. (2012). Connectome: How the brain’s wiring makes us who we are. HMH.
Shen, W., Zhao, K., Jiang, Y., Wang, Y., Bai, X., & Yuille, A. L. (2017a). Deepskeleton: Learning multi-task scale-associated deep side outputs for object skeleton extraction in natural images. IEEE Transactions on Image Processing, 26(11), 5298–5311.
Shen, Z., Liu, Z., Li, J., Jiang, Y., Chen, Y., & Xue, X. (2017). DSOD: Learning deeply supervised object detectors from scratch. In ICCV, IEEE Computer Society (pp. 1937–1945).
Shu, M., Liu, C., Qiu, W., & Yuille, AL. (2020). Identifying model weakness with adversarial examiner. In AAAI, AAAI Press (pp. 11998–12006).
Simons, D. J., & Chabris, C. F. (1999). Gorillas in our midst: Sustained inattentional blindness for dynamic events. Perception, 28(9), 1059–1074.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
Smirnakis, SM., & Yuille, AL. (1995). Neural implementation of bayesian vision theories by unsupervised learning. In The Neurobiology of Computation, Springer, (pp. 427–432).
Smith, L., & Gasser, M. (2005). The development of embodied cognition: Six lessons from babies. Artificial Life, 11(1–2), 13–29.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, IJ., & Fergus, R. (2014). Intriguing properties of neural networks. In International Conference on Learning Representations.
Tjan, B. S., Braje, W. L., Legge, G. E., & Kersten, D. (1995). Human efficiency for recognizing 3-d objects in luminance noise. Vision Research, 35(21), 3053–3069.
Torralba, A., & Efros, AA. (2011). Unbiased look at dataset bias. In CVPR, IEEE Computer Society (pp. 1521–1528).
Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., & Madry, A. (2019). Robustness may be at odds with accuracy. In ICLR (Poster), OpenReview.net.
Tu, Z., Chen, X., Yuille, AL., & Zhu, SC. (2003). Image parsing: Unifying segmentation, detection, and recognition. In ICCV, IEEE Computer Society (pp. 18–25).
Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. (2017). Adversarial discriminative domain adaptation. In CVPR (pp. 2962–2971). https://doi.org/10.1109/CVPR.2017.316.
Uesato, J., O’Donoghue, B., Kohli, P., & van den Oord, A. (2018). Adversarial risk and the dangers of evaluating against weak attacks. ICML, PMLR, Proceedings of Machine Learning Research, 80, 5032–5041.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, AN., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In NIPS (pp. 5998–6008).
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching networks for one shot learning. In NIPS (pp. 3630–3638).
Wang, J., Zhang, Z., Premachandran, V., & Yuille, AL. (2015). Discovering internal representations from object-cnns using population encoding. CoRR abs/1511.06855.
Wang, J., Zhang, Z., Xie, C., Zhou, Y., Premachandran, V., Zhu, J., et al. (2018). Visual concepts and compositional voting. Annals of Mathematical Sciences and Applications, 2(3), 4.
Wang, P., & Yuille, AL. (2016). DOC: deep occlusion estimation from a single image. In ECCV (1), Springer, Lecture Notes in Computer Science (Vol. 9905, pp. 545–561).
Wang, T., Zhao, J., Yatskar, M., Chang, K., & Ordonez, V. (2019). Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In ICCV, IEEE (pp. 5309–5318).
Wang, X., & Gupta, A. (2015). Unsupervised learning of visual representations using videos. In ICCV, IEEE Computer Society (pp. 2794–2802).
Wen, H., Shi, J., Zhang, Y., Lu, K. H., Cao, J., & Liu, Z. (2017). Neural encoding and decoding with deep learning for dynamic natural vision. Cerebral Cortex, 28, 1–25.
Wu, Z., Xiong, Y., Yu, SX., & Lin, D. (2018). Unsupervised feature learning via non-parametric instance discrimination. In CVPR, IEEE Computer Society (pp. 3733–3742).
Xia, F., Wang, P., Chen, L., & Yuille, AL. (2016). Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV (5), Springer, Lecture Notes in Computer Science (Vol. 9909, pp. 648–663).
Xia, Y., Zhang, Y., Liu, F., Shen, W., & Yuille, AL. (2020). Synthesize then compare: Detecting failures and anomalies for semantic segmentation. CoRR abs/2003.08440.
Xie, C., Wang, J., Zhang, Z., Zhou, Y., Xie, L., & Yuille, AL. (2017). Adversarial examples for semantic segmentation and object detection. In ICCV, IEEE Computer Society (pp. 1378–1387).
Xie, C., Wang, J., Zhang, Z., Ren, Z., & Yuille, AL. (2018). Mitigating adversarial effects through randomization. In International Conference on Learning Representations.
Xie, L., & Yuille, AL. (2017). Genetic CNN. In ICCV, IEEE Computer Society (pp. 1388–1397).
Xie, S., & Tu, Z. (2015). Holistically-nested edge detection. In ICCV, IEEE Computer Society (pp. 1395–1403).
Xu, L., Krzyzak, A., & Yuille, A. L. (1994). On radial basis function nets and kernel regression: Statistical consistency, convergence rates, and receptive field size. Neural Networks, 7(4), 609–628.
Yamane, Y., Carlson, E. T., Bowman, K. C., Wang, Z., & Connor, C. E. (2008). A neural code for three-dimensional object shape in macaque inferotemporal cortex. Nature Neuroscience, 11(11), 1352–1360.
Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23), 8619–8624.
Yang, C., Kortylewski, A., Xie, C., Cao, Y., & Yuille, AL. (2020). Patchattack: A black-box texture-based attack with reinforcement learning. CoRR abs/2004.05682.
Yosinski, J., Clune, J., Nguyen, AM., Fuchs. TJ., & Lipson, H. (2015). Understanding neural networks through deep visualization. CoRR abs/1506.06579.
Yuille, A., & Kersten, D. (2006). Vision as bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301–308.
Yuille, A. L., & Mottaghi, R. (2016). Complexity of representation and inference in compositional models with part sharing. Journal of Machine Learning Research, 17, 292–319.
Zbontar, J., & LeCun, Y. (2015). Computing the stereo matching cost with a convolutional neural network. In CVPR, IEEE Computer Society (pp. 1592–1599).
Zeiler, MD., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In ECCV (1), Springer, Lecture Notes in Computer Science (Vol. 8689, pp. 818–833).
Zendel, O., Murschitz, M., Humenberger, M., & Herzner, W. (2015). CV-HAZOP: introducing test data validation for computer vision. In ICCV, IEEE Computer Society (pp. 2066–2074).
Zhang, R., Isola, P., & Efros, AA. (2016). Colorful image colorization. In ECCV (3), Springer, Lecture Notes in Computer Science (Vol. 9907, pp. 649–666).
Zhang, Y., Qiu, W., Chen, Q., Hu, X., & Yuille, AL. (2018). Unrealstereo: Controlling hazardous factors to analyze stereo vision. In 3DV, IEEE Computer Society (pp. 228–237).
Zhang, Z., Shen, W., Qiao, S., Wang, Y., Wang, B., & Yuille, AL. (2020). Robust face detection via learning small faces on hard images. In WACV, IEEE (pp. 1350–1359).
Zhou, B., Lapedriza, À., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In NIPS (pp. 487–495).
Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., & Torralba, A. (2015). Object detectors emerge in deep scene cnns. In International Conference on Learning Representations.
Zhou, T., Brown, M., Snavely, N., & Lowe, DG. (2017). Unsupervised learning of depth and ego-motion from video. In CVPR, IEEE Computer Society (pp. 6612–6619).
Zhou, Z., & Firestone, C. (2019). Humans can decipher adversarial images. Nature Communications, 10(1), 1–9.
Zhu, H., Tang, P., Yuille, AL., Park, S., & Park, J. (2019). Robustness of object recognition under extreme occlusion in humans and computational models. In CogSci, cognitivesciencesociety.org (pp. 3213–3219).
Zhu, L., Chen, Y., Torralba, A., Freeman, WT., Yuille, AL. (2010). Part and appearance sharing: Recursive compositional models for multi-view. In CVPR, IEEE Computer Society (pp. 1919–1926).
Zhu, S., & Mumford, D. (2006). A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision, 2(4), 259–362.
Zhu, Z., Xie, L., & Yuille, AL. (2017). Object recognition with and without objects. In IJCAI, ijcai.org (pp. 3609–3615).
Zitnick, C. L., Agrawal, A., Antol, S., Mitchell, M., Batra, D., & Parikh, D. (2016). Measuring machine intelligence through visual question answering. AI Magazine, 37(1), 63–72.
Zoph, B., & Le, QV. (2017). Neural architecture search with reinforcement learning. In ICLR, OpenReview.net.
Acknowledgements
This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC Award CCF-1231216 and ONR N00014-15-1-2356. We thank Kyle Rawlins, Tal Linzen, Wei Shen, and Adam Kortylewski for providing feedback and Weichao Qiu, Daniel Kersten, Ed Connor, Chaz Firestone, Vicente Ordonez, and Greg Hager for discussions on some of these topics. We thank the reviewers for some very helpful feedback which greatly improved the paper.
Additional information
Communicated by Ivan Laptev.
For those readers unfamiliar with Monty Python see: https://youtu.be/Qc7HmhrgTuQ
Cite this article
Yuille, A.L., Liu, C. Deep Nets: What have They Ever Done for Vision?. Int J Comput Vis 129, 781–802 (2021). https://doi.org/10.1007/s11263-020-01405-z