Abstract
We study the few-shot learning (FSL) problem, where a model learns to recognize new objects with extremely few labeled training examples per category. Most previous FSL approaches resort to the meta-learning paradigm, in which the model accumulates inductive bias by learning from many training tasks so as to solve new, unseen few-shot tasks. In contrast, we propose a simple semi-supervised FSL approach that exploits the unlabeled data accompanying a few-shot task to improve its performance. (i) First, we propose a Dependency Maximization method based on the Hilbert-Schmidt norm of the cross-covariance operator, which maximizes the statistical dependency between the embedded features of the unlabeled data and their label predictions, together with the supervised loss over the support set. We then use the resulting model to infer pseudo-labels for the unlabeled data. (ii) Furthermore, we propose an Instance Discriminant Analysis to evaluate the credibility of each pseudo-labeled example and select the most faithful ones into an augmented support set, which is used to retrain the model as in the first step. We iterate the above process until the pseudo-labels of the unlabeled set become stable. Our experiments demonstrate that the proposed method outperforms previous state-of-the-art methods on four widely used few-shot classification benchmarks (mini-ImageNet, tiered-ImageNet, CUB, and CIFAR-FS), as well as on the standard few-shot semantic segmentation benchmark PASCAL-5\(^{i}\).


References
Chen, W.-Y., Liu, Y.-C., Kira, Z., Wang, Y.-C. F., & Huang, J.-B. (2019). A closer look at few-shot classification. ICLR.
Raina, R., Battle, A., Lee, H., Packer, B., & Ng, A. Y. (2007). Self-taught learning: transfer learning from unlabeled data. ICML.
Antoniou, A., Edwards, H., & Storkey, A. (2018). How to train your maml. arXiv preprint arXiv:1810.09502
Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. ICML.
Rusu, A. A., Rao, D., Sygnowski, J., Vinyals, O., Pascanu, R., Osindero, S., & Hadsell, R. (2019). Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960
Sun, Q., Liu, Y., Chua, T.-S., & Schiele, B. (2019). Meta-transfer learning for few-shot learning. CVPR.
Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). Matching networks for one shot learning. NeurIPS.
Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. NeurIPS.
Ye, H.-J., Hu, H., Zhan, D.-C., & Sha, F. (2020). Few-shot learning via embedding adaptation with set-to-set functions. CVPR.
Hou, R., Chang, H., Bingpeng, M., Shan, S., & Chen, X. (2019). Cross attention network for few-shot classification. NeurIPS.
Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., & Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. CVPR.
Bateni, P., Goyal, R., Masrani, V., Wood, F., & Sigal, L. (2020). Improved few-shot visual classification. CVPR.
Zhang, C., Cai, Y., Lin, G., & Shen, C. (2020). Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. CVPR.
Simon, C., Koniusz, P., Nock, R., & Harandi, M. (2020). Adaptive subspaces for few-shot learning. CVPR.
Gao, H., Shou, Z., Zareian, A., Zhang, H., & Chang, S.-F. (2018). Low-shot learning via covariance-preserving adversarial augmentation networks. NeurIPS.
Li, K., Zhang, Y., Li, K., & Fu, Y. (2020). Adversarial feature hallucination networks for few-shot learning. CVPR.
Zhang, R., Che, T., Ghahramani, Z., Bengio, Y., & Song, Y. (2018). Metagan: An adversarial approach to few-shot learning. NeurIPS.
Liu, Y., Lee, J., Park, M., Kim, S., Yang, E., Hwang, S. J., & Yang, Y. (2018). Learning to propagate labels: Transductive propagation network for few-shot learning. arXiv preprint arXiv:1805.10002
Rodríguez, P., Laradji, I., Drouin, A., & Lacoste, A. (2020). Embedding propagation: Smoother manifold for few-shot classification. ECCV.
Hu, S. X., Moreno, P. G., Xiao, Y., Shen, X., Obozinski, G., Lawrence, N. D., & Damianou, A. (2020). Empirical bayes transductive meta-learning with synthetic gradients. ICLR.
Dhillon, G. S., Chaudhari, P., Ravichandran, A., & Soatto, S. (2020). A baseline for few-shot image classification. ICLR.
Boudiaf, M., Masud, Z. I., Rony, J., Dolz, J., Piantanida, P., & Ayed, I. B. (2020). Transductive information maximization for few-shot learning. NeurIPS.
Li, X., Sun, Q., Liu, Y., Zhou, Q., Zheng, S., Chua, T.-S., & Schiele, B. (2019). Learning to self-train for semi-supervised few-shot classification. NeurIPS.
Ren, M., Triantafillou, E., Ravi, S., Snell, J., Swersky, K., Tenenbaum, J. B., Larochelle, H., & Zemel, R. S. (2018). Meta-learning for semi-supervised few-shot classification. ICLR.
Wang, Y., Xu, C., Liu, C., Zhang, L., & Fu, Y. (2020). Instance credibility inference for few-shot learning. CVPR.
Baker, C. R. (1973). Joint measures and cross-covariance operators. Transactions of the American Mathematical Society.
Gretton, A., Bousquet, O., Smola, A., & Schölkopf, B. (2005). Measuring statistical dependence with hilbert-schmidt norms. ALT.
Ziko, I., Dolz, J., Granger, E., & Ayed, I. B. (2020). Laplacian regularized few-shot learning. ICML.
Liu, J., Song, L., & Qin, Y. (2020a). Prototype rectification for few-shot learning. ECCV.
Lichtenstein, M., Sattigeri, P., Feris, R., Giryes, R., & Karlinsky, L. (2020). Tafssl: Task-adaptive feature sub-space learning for few-shot classification. ECCV.
Lee, K., Maji, S., Ravichandran, A., & Soatto, S. (2019). Meta-learning with differentiable convex optimization. CVPR.
Mangla, P., Kumari, N., Sinha, A., Singh, M., Balasubramanian, V. N., & Krishnamurthy, B. (2020). Charting the right manifold: Manifold mixup for few-shot learning. WACV.
Ravi, S., & Larochelle, H. (2017). Optimization as a model for few-shot learning. ICLR.
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The caltech-ucsd birds-200-2011 dataset. Computation and Neural Systems Technical Report.
Zagoruyko, S., & Komodakis, N. (2016). Wide residual networks. BMVC.
Zhang, C., Lin, G., Liu, F., Yao, R., & Shen, C. (2019b). Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. CVPR.
Zhang, C., Lin, G., Liu, F., Guo, J., Wu, Q., & Yao, R. (2019a). Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. CVPR.
Liu, W., Zhang, C., Lin, G., & Liu, F. (2020b). Crnet: Cross-reference networks for few-shot segmentation. CVPR.
Gairola, S., Hemani, M., Chopra, A., & Krishnamurthy, B. (2020). Simpropnet: Improved similarity propagation for few-shot image segmentation. arXiv preprint arXiv:2004.15014
Yang, Y., Meng, F., Li, H., Wu, Q., Xu, X., & Chen, S. (2020b). A new local transformation module for few-shot segmentation. ICMM.
Yang, B., Liu, C., Li, B., Jiao, J., & Ye, Q. (2020a). Prototype mixture models for few-shot semantic segmentation. ECCV.
Liu, Y., Zhang, X., Zhang, S., & He, X. (2020c). Part-aware prototype network for few-shot semantic segmentation. ECCV.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. IJCV.
Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. CVPR.
Merikoski, J. K., Sarria, H., & Tarazaga, P. (1994). Bounds for singular values using traces. Linear Algebra and its Applications, 210, 227–254.
Von Neumann, J. (1937). Some matrix-inequalities and metrization of matric-space. Tomsk University Review, 1, 286–300.
Horn, R. A., & Johnson, C. R. (2012). Matrix analysis. Cambridge University Press.
Appendix
Before providing the proof of Theorem 3, we list two useful lemmas that will be used repeatedly in what follows.
Lemma 1
([45]) The non-increasingly ordered singular values of a matrix \(\mathbf {M}\) obey \(0\le \sigma _i\le \dfrac{\Vert \mathbf {M}\Vert _F}{\sqrt{i}}\), where \(\Vert \cdot \Vert _F\) denotes the matrix Frobenius norm.
Lemma 2
([46]) Let \(\sigma _i(\mathbf {M})\) and \(\sigma _i(\mathbf {N})\) be the non-increasingly ordered singular values of matrices \(\mathbf {M},\mathbf {N}\in {\mathbb {R}}^{a\times b}\). Then, \({{\,\mathrm{tr}\,}}\{\mathbf {M}\mathbf {N}^T\}\le \sum _{i=1}^r\sigma _i(\mathbf {M})\sigma _i(\mathbf {N})\), where \(r=\min (a,b)\).
1.1 Proof of Theorem 3
Proof
Fisher's criterion can be rewritten as \(\psi ={{\,\mathrm{tr}\,}}\{\bar{\mathbf {S}}^{-1}\mathbf {S}_B\}\), where \(\bar{\mathbf {S}}=\mathbf {F}\mathbf {F}^T\) (\(\mathbf {F}\) is the matrix containing all features of the unlabeled set, arranged in columns) and \(\mathbf {S}_B=\sum _{c=1}^NM_c\varvec{\mu }_c\varvec{\mu }_c^T=\sum _{c=1}^N\mathbf {S}_{c}\) (\(\varvec{\mu }_c\) is the mean feature vector of class c). For notational clarity and simplicity, we assume that all data are centered and that the data mean does not change when a single sample is removed. This is justifiable when the number of unlabeled data is sufficiently large, which is the case considered here.
Suppose the removed instance has a pseudo-label belonging to class u. After removing the instance \(f(\mathbf {x}_u)\), the two scatter matrices become \(\bar{\mathbf {S}}'=\bar{\mathbf {S}}-f(\mathbf {x}_u)f(\mathbf {x}_u)^T\) and \(\mathbf {S}_B'=\mathbf {S}_B+\mathbf {S}_u'-\mathbf {S}_u=\mathbf {S}_B+\mathbf {E}_B\), where \(\mathbf {S}_u'=(M_u-1)\varvec{\mu }_u'\varvec{\mu }_u'^T\) and \(\varvec{\mu }_u'=(M_u\varvec{\mu }_u-f(\mathbf {x}_u))/(M_u-1)\). Then, we can rewrite:
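A sketch of this rewrite, reconstructed from the definitions above (the original display may differ in form):
\[
\psi '={{\,\mathrm{tr}\,}}\{(\bar{\mathbf {S}}')^{-1}\mathbf {S}_B'\}={{\,\mathrm{tr}\,}}\{(\bar{\mathbf {S}}-f(\mathbf {x}_u)f(\mathbf {x}_u)^T)^{-1}(\mathbf {S}_B+\mathbf {E}_B)\}.
\]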
We can then define the IDA as:
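As a sketch consistent with the three-term decomposition bounded below, the IDA presumably measures the change of the criterion caused by removing the instance:
\[
d\psi _u=\psi -\psi '={{\,\mathrm{tr}\,}}\{\bar{\mathbf {S}}^{-1}\mathbf {S}_B\}-{{\,\mathrm{tr}\,}}\{(\bar{\mathbf {S}}-f(\mathbf {x}_u)f(\mathbf {x}_u)^T)^{-1}(\mathbf {S}_B+\mathbf {E}_B)\}.
\]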
The latter term can be reformulated by the Woodbury identity [47]:
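In this rank-one case the Woodbury identity reduces to the Sherman-Morrison form; a sketch of the omitted display:
\[
(\bar{\mathbf {S}}-f(\mathbf {x}_u)f(\mathbf {x}_u)^T)^{-1}=\bar{\mathbf {S}}^{-1}+\dfrac{\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}}{1-f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)}.
\]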
Substituting this term into the above IDA equation, we have:
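Expanding and cancelling \({{\,\mathrm{tr}\,}}\{\bar{\mathbf {S}}^{-1}\mathbf {S}_B\}\), the decomposition presumably reads (a reconstruction whose grouping matches the three terms bounded below):
\[
d\psi _u={{\,\mathrm{tr}\,}}\{\bar{\mathbf {S}}^{-1}\tilde{\mathbf {E}}_B\}+{{\,\mathrm{tr}\,}}\Big\{\dfrac{\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}\mathbf {S}_B}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1}\Big\}+{{\,\mathrm{tr}\,}}\Big\{\dfrac{\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}\mathbf {E}_B}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1}\Big\},
\]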
where \(\tilde{\mathbf {E}}_B=-{\mathbf {E}}_B\). To upper-bound \(d\psi _u\), we derive an upper-bound for each of the three terms, given that the trace operation is additive.
Upper-bound for \({{\,\mathrm{tr}\,}}\{\dfrac{\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}\mathbf {S}_B}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1}\}\): From Lemma 2, we have:
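Using the cyclic property of the trace and the fact that \(f(\mathbf {x}_u)f(\mathbf {x}_u)^T\) has a single nonzero singular value \(f(\mathbf {x}_u)^Tf(\mathbf {x}_u)\), a sketch of the omitted inequality:
\[
{{\,\mathrm{tr}\,}}\Big\{\dfrac{\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}\mathbf {S}_B}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1}\Big\}\le \dfrac{\sigma _1(\bar{\mathbf {S}}^{-1}\mathbf {S}_B\bar{\mathbf {S}}^{-1})\,f(\mathbf {x}_u)^Tf(\mathbf {x}_u)}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1},
\]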
where \(\sigma _1(\cdot )\) denotes the largest singular value. Given that the largest singular value is the spectral norm, by submultiplicativity of the norm we have:
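A sketch of the omitted display:
\[
\sigma _1(\bar{\mathbf {S}}^{-1}\mathbf {S}_B\bar{\mathbf {S}}^{-1})=\Vert \bar{\mathbf {S}}^{-1}\mathbf {S}_B\bar{\mathbf {S}}^{-1}\Vert _2\le \Vert \bar{\mathbf {S}}^{-1}\Vert _2\,\Vert \mathbf {S}_B\Vert _2\,\Vert \bar{\mathbf {S}}^{-1}\Vert _2=\Vert \bar{\mathbf {S}}^{-1}\Vert _2^2\,\Vert \mathbf {S}_B\Vert _2.
\]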
For the first norm, \(\Vert \bar{\mathbf {S}}^{-1}\Vert _2=1/\sigma _{min}(\bar{\mathbf {S}})\). Typically, \(\bar{\mathbf {S}}\) is regularized by a ridge parameter \(\rho >0\), i.e., \(\bar{\mathbf {S}}+\rho \mathbf {I}\), so that \(\sigma _{min}(\bar{\mathbf {S}})>\rho\) and hence \(\Vert \bar{\mathbf {S}}^{-1}\Vert _2<1/\rho\). For the second norm, \(\Vert \mathbf {S}_B\Vert _2=\Vert \sum _{c=1}^NM_c\varvec{\mu }_c\varvec{\mu }_c^T\Vert _2\le \sum _{c=1}^NM_c\Vert \varvec{\mu }_c\varvec{\mu }_c^T\Vert _2=\sum _{c=1}^NM_c\varvec{\mu }_c^T\varvec{\mu }_c=\delta\). It follows that \(\sigma _1(\bar{\mathbf {S}}^{-1}\mathbf {S}_B\bar{\mathbf {S}}^{-1})\le \delta /\rho ^2\). Finally, based on von Neumann's trace inequality [46], \(f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1={{\,\mathrm{tr}\,}}\{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)\}-1=C{\sigma _1(\bar{\mathbf {S}}^{-1})f(\mathbf {x}_u)^Tf(\mathbf {x}_u)}-1\), where \(C\in [-1,1]\). Hence, for simplicity, we use the approximation \(f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1\approx f(\mathbf {x}_u)^Tf(\mathbf {x}_u)/\rho -1\). Then, we can derive the upper-bound for \({{\,\mathrm{tr}\,}}\{\dfrac{\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}\mathbf {S}_B}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1}\}\) as:
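Combining \(\sigma _1(\bar{\mathbf {S}}^{-1}\mathbf {S}_B\bar{\mathbf {S}}^{-1})\le \delta /\rho ^2\) with the denominator approximation, the bound presumably takes the form (a sketch; the final expression may be algebraically simplified):
\[
{{\,\mathrm{tr}\,}}\Big\{\dfrac{\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}\mathbf {S}_B}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1}\Big\}\le \dfrac{\delta \,f(\mathbf {x}_u)^Tf(\mathbf {x}_u)}{\rho ^2\big(f(\mathbf {x}_u)^Tf(\mathbf {x}_u)/\rho -1\big)}.
\]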
Upper-bound for \({{\,\mathrm{tr}\,}}\{\bar{\mathbf {S}}^{-1}\tilde{\mathbf {E}}_B\}\): From Lemma 2, we have:
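A sketch of the Lemma 2 step:
\[
{{\,\mathrm{tr}\,}}\{\bar{\mathbf {S}}^{-1}\tilde{\mathbf {E}}_B\}\le \sum _{i=1}^{4}\sigma _i(\bar{\mathbf {S}}^{-1})\,\sigma _i(\tilde{\mathbf {E}}_B),
\]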
since \(\text {rank}(\tilde{\mathbf {E}}_B)\le 4\) [47]. Then, with Lemma 1, we have \(\sigma _i(\tilde{\mathbf {E}}_B)\le \dfrac{\Vert \tilde{\mathbf {E}}_B\Vert _F}{\sqrt{i}}=\dfrac{\Vert {\mathbf {E}}_B\Vert _F}{\sqrt{i}}\). By substituting the definition of \({\mathbf {E}}_B\) and using the triangle inequality, we have:
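Writing out \(\mathbf {E}_B\) explicitly and applying the triangle inequality, a sketch of the omitted step:
\[
\Vert \mathbf {E}_B\Vert _F=\dfrac{\big\Vert M_u\varvec{\mu }_u\varvec{\mu }_u^T-M_u\varvec{\mu }_uf(\mathbf {x}_u)^T-M_uf(\mathbf {x}_u)\varvec{\mu }_u^T+f(\mathbf {x}_u)f(\mathbf {x}_u)^T\big\Vert _F}{M_u-1}\le \dfrac{M_u\Vert \varvec{\mu }_u\varvec{\mu }_u^T\Vert _F+2M_u\Vert \varvec{\mu }_uf(\mathbf {x}_u)^T\Vert _F+\Vert f(\mathbf {x}_u)f(\mathbf {x}_u)^T\Vert _F}{M_u-1}.
\]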
Based on the property that \(\Vert \mathbf {M}\Vert _F^2={{\,\mathrm{tr}\,}}(\mathbf {M}^T\mathbf {M})\):
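Evaluating these rank-one Frobenius norms gives \(\Vert \varvec{\mu }_u\varvec{\mu }_u^T\Vert _F=\varvec{\mu }_u^T\varvec{\mu }_u\), \(\Vert \varvec{\mu }_uf(\mathbf {x}_u)^T\Vert _F=\sqrt{(\varvec{\mu }_u^T\varvec{\mu }_u)(f(\mathbf {x}_u)^Tf(\mathbf {x}_u))}\) and \(\Vert f(\mathbf {x}_u)f(\mathbf {x}_u)^T\Vert _F=f(\mathbf {x}_u)^Tf(\mathbf {x}_u)\), so that, with \(\nu _u\) collecting the \(\varvec{\mu }_u\)-dependent terms (a sketch of the omitted display):
\[
\Vert \mathbf {E}_B\Vert _F\le \dfrac{\nu _u+f(\mathbf {x}_u)^Tf(\mathbf {x}_u)}{M_u-1},
\]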
where the definition of \(\nu _u\) is listed in Theorem 3 of our paper. With the bound on \(\sigma _1(\bar{\mathbf {S}}^{-1})<1/\rho\), we can derive the upper-bound for \({{\,\mathrm{tr}\,}}\{\bar{\mathbf {S}}^{-1}\tilde{\mathbf {E}}_B\}\) as:
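Combining the two bounds above, a sketch of the resulting inequality:
\[
{{\,\mathrm{tr}\,}}\{\bar{\mathbf {S}}^{-1}\tilde{\mathbf {E}}_B\}\le \sum _{i=1}^{4}\dfrac{1}{\rho }\cdot \dfrac{\Vert \mathbf {E}_B\Vert _F}{\sqrt{i}}\le \dfrac{\nu _u+f(\mathbf {x}_u)^Tf(\mathbf {x}_u)}{\rho (M_u-1)}\sum _{i=1}^{4}\dfrac{1}{\sqrt{i}}.
\]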
Upper-bound for \({{\,\mathrm{tr}\,}}\{\dfrac{\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}\mathbf {E}_B}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1}\}\): With a derivation similar to that for the first term, we have:
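A sketch of the analogous bound:
\[
{{\,\mathrm{tr}\,}}\Big\{\dfrac{\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}\mathbf {E}_B}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1}\Big\}\le \dfrac{\sigma _1(\bar{\mathbf {S}}^{-1}\mathbf {E}_B\bar{\mathbf {S}}^{-1})\,f(\mathbf {x}_u)^Tf(\mathbf {x}_u)}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1}.
\]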
Again, by submultiplicativity of the norm, \(\sigma _1(\bar{\mathbf {S}}^{-1}\mathbf {E}_B\bar{\mathbf {S}}^{-1})\le \Vert \bar{\mathbf {S}}^{-1}\Vert _2^2\Vert \mathbf {E}_B\Vert _2\). From the derivation of the second term, we readily get \(\Vert \mathbf {E}_B\Vert _2=\sigma _1(\mathbf {E}_B)\le \Vert \mathbf {E}_B\Vert _F\le \dfrac{\nu _u+f(\mathbf {x}_u)^Tf(\mathbf {x}_u)}{M_u-1}\). Using the upper-bound for \(\Vert \bar{\mathbf {S}}^{-1}\Vert _2\), we obtain \(\sigma _1(\bar{\mathbf {S}}^{-1}\mathbf {E}_B\bar{\mathbf {S}}^{-1})\le \dfrac{\Vert \mathbf {E}_B\Vert _F}{\rho ^2}\le \dfrac{\nu _u+f(\mathbf {x}_u)^Tf(\mathbf {x}_u)}{(M_u-1)\rho ^2}\). Finally, we can derive the upper-bound for the third term:
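Substituting the bound on \(\sigma _1(\bar{\mathbf {S}}^{-1}\mathbf {E}_B\bar{\mathbf {S}}^{-1})\) together with the denominator approximation, the third bound presumably takes the form (a sketch):
\[
{{\,\mathrm{tr}\,}}\Big\{\dfrac{\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}\mathbf {E}_B}{f(\mathbf {x}_u)^T\bar{\mathbf {S}}^{-1}f(\mathbf {x}_u)-1}\Big\}\le \dfrac{\big(\nu _u+f(\mathbf {x}_u)^Tf(\mathbf {x}_u)\big)\,f(\mathbf {x}_u)^Tf(\mathbf {x}_u)}{(M_u-1)\,\rho ^2\,\big(f(\mathbf {x}_u)^Tf(\mathbf {x}_u)/\rho -1\big)}.
\]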
Finally, we conclude the upper-bound for \(d\psi _u\) by combining the upper-bounds for the three additive terms.