Abstract
Unsupervised learning of representations has been found useful in many applications and offers several advantages, e.g., where there are many unlabeled examples and few labeled ones (semi-supervised learning), or where the unlabeled or labeled examples come from a distribution different from but related to the one of interest (self-taught learning, multi-task learning, and domain adaptation). Some of these algorithms have been used successfully to learn a hierarchy of features, i.e., to build a deep architecture, either as initialization for a supervised predictor or as a generative model. Deep learning algorithms can yield representations that are more abstract and that better disentangle the hidden factors of variation underlying the unknown generating distribution, i.e., representations that capture invariances and discover non-local structure in that distribution. This chapter reviews the main motivations and ideas behind deep learning algorithms and their representation-learning components, as well as recent results in this area, and proposes a vision of challenges and hopes on the road ahead, focusing on the questions of invariance and disentangling.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Bengio, Y., Courville, A. (2013). Deep Learning of Representations. In: Bianchini, M., Maggini, M., Jain, L. (eds) Handbook on Neural Information Processing. Intelligent Systems Reference Library, vol 49. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36657-4_1
DOI: https://doi.org/10.1007/978-3-642-36657-4_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36656-7
Online ISBN: 978-3-642-36657-4
eBook Packages: Engineering, Engineering (R0)