Abstract
Given imbalanced data, it is hard to train a good classifier using deep learning because of the poor generalization of minority classes. Traditionally, the well-known synthetic minority oversampling technique (SMOTE) for data augmentation, a data mining approach for imbalanced learning, has been used to improve this generalization. However, it is unclear whether SMOTE also benefits deep learning. In this work, we study why the original SMOTE is insufficient for deep learning, and enhance SMOTE using soft labels. Connecting the resulting soft SMOTE with Mixup, a modern data augmentation technique, leads to a unified framework that puts traditional and modern data augmentation techniques under the same umbrella. A careful study within this framework shows that Mixup improves generalization by implicitly achieving uneven margins between majority and minority classes. We then propose a novel margin-aware Mixup technique that more explicitly achieves uneven margins. Extensive experimental results demonstrate that our proposed technique yields state-of-the-art performance on deep imbalanced classification while achieving superior performance on extremely imbalanced data. The code is open-sourced in our developed package https://github.com/ntucllab/imbalanced-DL to foster future research in this direction.
Availability of Data and Material
Experiments are based on public benchmark data.
Code availability: released open-source at https://github.com/ntucllab/imbalanced-DL.
References
Awoyemi, J.O., Adetunmbi, A.O., Oluwadare, S.A.: Credit card fraud detection using machine learning techniques: a comparative analysis. In: 2017 ICCNI, pp. 1–9 (2017). https://doi.org/10.1109/ICCNI.2017.8123782
Roy, A., Sun, J., Mahoney, R., Alonzi, L., Adams, S., Beling, P.: Deep learning detecting fraud in credit card transactions. In: 2018 SIEDS, pp. 129–134 (2018). https://doi.org/10.1109/SIEDS.2018.8374722
Horn, G.V., Perona, P.: The devil is in the tails: fine-grained classification in the wild. CoRR abs/1709.01450 (2017). https://arxiv.org/abs/1709.01450
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Zhong, Y., et al.: Unequal training for deep face recognition with long-tailed noisy data. In: 2019 CVPR, pp. 7804–7813 (2019). https://doi.org/10.1109/CVPR.2019.00800
Huang, C., Li, Y., Loy, C.C., Tang, X.: Deep imbalanced learning for face recognition and attribute prediction. TPAMI 42(11), 2781–2794 (2020). https://doi.org/10.1109/TPAMI.2019.2914680
Cui, Y., Jia, M., Lin, T.-Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: CVPR (2019)
Chawla, N., Bowyer, K., Hall, L.O., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)
He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks, pp. 1322–1328 (2008). https://doi.org/10.1109/IJCNN.2008.4633969
Huang, C., Li, Y., Loy, C.C., Tang, X.: Learning deep representation for imbalanced classification. In: 2016 CVPR, pp. 5375–5384 (2016). https://doi.org/10.1109/CVPR.2016.580
Liu, X., Zhou, Z.: The influence of class imbalance on cost-sensitive learning: an empirical study. In: ICDM 2006, pp. 970–974 (2006). https://doi.org/10.1109/ICDM.2006.158
Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with label-distribution-aware margin loss. In: NeurIPS (2019)
Bej, S., Davtyan, N., Wolfien, M., Nassar, M., Wolkenhauer, O.: LoRAS: an oversampling approach for imbalanced datasets. CoRR abs/1908.08346 (2019). https://arxiv.org/abs/1908.08346
DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. In: ICLR 2017, Toulon, France, 24–26 April 2017, Workshop Track Proceedings (2017). https://openreview.net/forum?id=HyaF53XYx
Inoue, H.: Data augmentation by pairing samples for images classification (2018). https://openreview.net/forum?id=SJn0sLgRb
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: beyond empirical risk minimization. In: ICLR (2018). https://openreview.net/forum?id=r1Ddp1-Rb
Mathew, J., Luo, M., Pang, C.K., Chan, H.L.: Kernel-based SMOTE for SVM classification of imbalanced datasets. In: IECON 2015, pp. 001127–001132 (2015). https://doi.org/10.1109/IECON.2015.7392251
Dablain, D., Krawczyk, B., Chawla, N.: DeepSMOTE: fusing deep learning and SMOTE for imbalanced data. IEEE Trans. Neural Netw. Learn. Syst., 1–15 (2022). https://doi.org/10.1109/TNNLS.2021.3136503
Goodfellow, I.J., et al.: Generative adversarial nets. In: NIPS (2014)
Kim, J., Jeong, J., Shin, J.: M2m: imbalanced classification via major-to-minor translation. In: 2020 CVPR, pp. 13893–13902 (2020). https://doi.org/10.1109/CVPR42600.2020.01391
Chou, H.-P., Chang, S.-C., Pan, J.-Y., Wei, W., Juan, D.-C.: Remix: rebalanced mixup. arXiv preprint arXiv:2007.03943 (2020)
Johnson, J., Khoshgoftaar, T.: Survey on deep learning with class imbalance. J. Big Data 6, 1–54 (2019)
Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018). https://doi.org/10.1016/j.neunet.2018.07.011
Khan, S.H., Hayat, M., Bennamoun, M., Sohel, F.A., Togneri, R.: Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 29(8), 3573–3587 (2018). https://doi.org/10.1109/TNNLS.2017.2732482
Liu, W., Wen, Y., Yu, Z., Yang, M.: Large-margin softmax loss for convolutional neural networks. In: ICML, pp. 507–516 (2016)
Wang, F., Cheng, J., Liu, W., Liu, H.: Additive margin softmax for face verification. IEEE Sig. Process. Lett. 25(7), 926–930 (2018). https://doi.org/10.1109/LSP.2018.2822810
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto (2009)
Ye, H.-J., Chen, H.-Y., Zhan, D.-C., Chao, W.-L.: Identifying and compensating for feature deviation in imbalanced deep learning. arXiv preprint arXiv:2001.01385 (2020)
Reyzin, L., Schapire, R.E.: How boosting the margin can also boost classifier complexity. In: Proceedings of the 23rd International Conference on Machine Learning. ICML 2006, pp. 753–760 (2006). https://doi.org/10.1145/1143844.1143939
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 CVPR, pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90
Yang, Y., Xu, Z.: Rethinking the value of labels for improving class-imbalanced learning. In: NeurIPS (2020)
Funding
This work is mainly supported by the Ministry of Science and Technology (MOST) of Taiwan under 107-2628E-002-008-MY3.
Author information
Contributions
Cheng contributed the detailed literature survey, the initial idea of studying Mixup for deep imbalanced classification, the experimental comparison, the code implementation and release, and the initial manuscript writing; Ha rigorously reviewed the code implementation, addressed issues and bugs, and expanded the scope of the experimental comparison by incorporating additional methods such as DeepSMOTE and M2m; Lin contributed the bigger picture of linking SMOTE and Mixup, the initial idea of the margin-aware extension, and suggestions on the research methodology.
Ethics declarations
Conflicts of interest/Competing interests: n/a
Ethics approval: n/a
Consent to participate: n/a
Consent for publication: n/a
Appendices
Appendix A Margin Statistics Analysis
We discuss the Mixup-based approaches [16, 21] and their effects on margin statistics, compared with the margin-based state-of-the-art LDAM [12].
A.1 Margin Perspectives
To better analyze and quantify the effect of different learning algorithms on the majority- and minority-class margins, we define the margin gap metric γgap as
$$\gamma_{\mathrm{gap}} = \frac{1}{|\mathcal{C}_{\mathrm{maj}}|}\sum_{i \in \mathcal{C}_{\mathrm{maj}}} \gamma_i \;-\; \frac{1}{|\mathcal{C}_{\mathrm{min}}|}\sum_{j \in \mathcal{C}_{\mathrm{min}}} \gamma_j,$$
where γi denotes the margin of class i, and i and j range over the majority classes Cmaj and the minority classes Cmin, respectively. To decide which classes are majority classes and which are minority classes, we set a threshold: if a class contains more than 1/K of the total training samples, we categorize it as a majority class; the others are viewed as minority classes.
A large margin gap thus corresponds to majority classes having larger margins and minority classes having smaller margins, and hence to poor generalizability for the minority classes. We hope to achieve a smaller margin gap when the classes are imbalanced. Note that this metric can be negative when the margins of the minority classes are larger than those of the majority classes. To determine whether the margin gap is a good indicator of top-1 validation accuracy, we further evaluate their correlation with Spearman's rank order correlation ρ in Fig. A1.
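For concreteness, a minimal NumPy sketch of this metric, assuming the gap is computed as the difference between the mean per-class margin of the majority classes and that of the minority classes (the helper and the values below are illustrative, not taken from our released package):

```python
import numpy as np

def margin_gap(class_margins, class_counts):
    """Margin gap: mean per-class margin of the majority classes minus
    that of the minority classes. A class counts as a majority class if
    its sample count exceeds 1/K of the total number of training samples."""
    class_margins = np.asarray(class_margins, dtype=float)
    class_counts = np.asarray(class_counts, dtype=float)
    threshold = class_counts.sum() / len(class_counts)  # 1/K of the total
    majority = class_counts > threshold
    return class_margins[majority].mean() - class_margins[~majority].mean()

# Illustrative values only: three majority classes and two minority classes.
print(margin_gap([2.3, 2.1, 1.9, 0.4, 0.1], [5000, 4000, 3000, 300, 100]))
```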
A.1.1 Spearman's Rank Order Correlation. We present the results of the analysis using Spearman's rank order correlation in Fig. A1. We observe a negative rank order correlation between validation accuracy and the margin gap γgap, as our definition of the margin gap reflects the trend that the better a model generalizes to the minority classes, the lower its margin gap is; that is, better models produce smaller margin gaps between majority and minority classes. As seen in Fig. A1, Spearman's rank order correlation is −0.820, showing that although it is sometimes noisy, γgap is in general a good indicator of top-1 validation accuracy. We discuss the noisy cases in the next subsection.
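The correlation itself can be computed with SciPy as in the sketch below; the (margin gap, accuracy) pairs are placeholders rather than the numbers behind Fig. A1:

```python
from scipy.stats import spearmanr

# One (margin gap, top-1 validation accuracy) pair per trained model.
# Illustrative placeholders, not the values used in Fig. A1.
gaps = [3.2, 1.5, 0.4, -0.8, -1.1]
top1_acc = [71.3, 76.0, 80.2, 82.3, 81.8]

rho, p_value = spearmanr(gaps, top1_acc)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")  # rho is expected to be negative
```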
A.1.2 Uneven Margin. Given the superior empirical performance of Mixup-based methods, we further analyze them from a margin perspective to demonstrate the effectiveness of our method. First, we establish the baseline margin gap when the model is trained with ERM. Then, we examine the margin-based LDAM work, in which larger margins are enforced for minority classes [12]. As seen in Table A1, the margin gap for ERM is the highest; that is, for deep models trained with ERM, majority classes tend to have larger margins than minority classes, resulting in poor generalizability for the minority classes. LDAM-DRW [12] demonstrates its ability to shrink the margin gap, reducing the generalization error for the minority classes through margin-based softmax training. Moreover, we observe that under long-tailed imbalance, the original Mixup alone yields competitive results, as the margin gaps are similar among the original Mixup, Remix, and our proposed method. This observation is consistent with Remix, for which similar performance is reported in the long-tailed imbalance setting. However, in the step imbalance setting, the superiority of our method is evident: it not only achieves better performance but also shrinks the margin gap more than the original Mixup.
Note that in Table A1, for the long-tailed scenario, the margin gap of Remix-DRW is −1.598 and that of MAMix-DRW is −1.136, yet, as shown in Table 4, their respective validation accuracies are 81.82 and 82.29. This is an example of the noisy behavior mentioned above: Remix-DRW yields a smaller margin gap than MAMix-DRW but poorer validation accuracy, because Remix tends to enforce excessive margins for the minority classes, whereas our method strikes a better trade-off.
To further study why excessive margins for the minority classes do not help validation accuracy, we first decompose the margins into two parts, the γ ≥ 0 part and the γ < 0 part, where the γ < 0 part determines the validation error. The detailed decomposition is in Table A3: for each method, we take all γ < 0 margins and report the average over the majority classes and over the minority classes, and we compute the γ ≥ 0 part in the same way. We observe that the γ < 0 part is generally similar between Remix and our MAMix, which explains the small accuracy difference; the γ ≥ 0 part, however, is generally higher for Remix, as Table A3 shows. The reason Remix has a lower margin gap in this case is therefore that it enforces larger margins in the γ ≥ 0 part of the minority classes: this part is 4.891 for the Remix minority classes versus 4.213 for the MAMix counterpart. From this observation, we identify that Remix appears to enforce excessive margins for the minority classes. Do these excessive margins help? Previous research [29] has indicated that overly optimizing the margin may be overkill and can even hurt performance. We answer this question by examining the difference between the theoretical and practical margin distributions.
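A minimal sketch of this decomposition, assuming the γ ≥ 0 and γ < 0 averages are taken over all samples of the majority (respectively, minority) classes (the helper name and inputs are illustrative):

```python
import numpy as np

def decompose_margins(sample_margins, sample_labels, majority_classes):
    """Average the gamma >= 0 part and the gamma < 0 part of the per-sample
    margins, separately over the majority and minority classes (cf. Table A3)."""
    sample_margins = np.asarray(sample_margins, dtype=float)
    sample_labels = np.asarray(sample_labels)
    is_majority = np.isin(sample_labels, list(majority_classes))

    stats = {}
    for group, mask in (("majority", is_majority), ("minority", ~is_majority)):
        g = sample_margins[mask]
        stats[group] = {
            "gamma>=0": float(g[g >= 0].mean()) if (g >= 0).any() else 0.0,
            "gamma<0": float(g[g < 0].mean()) if (g < 0).any() else 0.0,
        }
    return stats
```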
Recall that LDAM [12] derives a theoretically optimal ratio (1) for the per-class margin distribution, a ratio which hints at the need not to over-push the margins of the minority classes. To analyze how close the practical per-class margin distribution of each method is to the theoretical one, we fit the theoretical margins with the practical margins; since the theoretical margin in (1) contains a constant multiplier C, we use linear regression without an intercept, set C = 1, and compare the fitting (L2) errors in Table A4. As Table A4 shows, our proposed MAMix yields the smallest L2 error, hinting that the per-class margin distribution produced by our method is the closest to the theoretical distribution derived by [12]. The distribution produced by Remix [21] is slightly inferior to ours in terms of this L2 error, owing to the excessive margins for the minority classes shown in the Remix-DRW minority γ ≥ 0 part of Table A3. Moreover, from Table A4 and Table 4, we observe that the closer the practical margins are to the theoretical margins, the higher the validation accuracy. Therefore, we argue that we not only need to enforce larger margins for the minority classes, but also must not over-push the minority margins, which indicates the need for our method to strike a better trade-off.
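A sketch of this fitting procedure under our reading of the protocol, using the LDAM-style theoretical margin C/n_j^{1/4} with C = 1 and a regression through the origin (the released evaluation code may differ in detail):

```python
import numpy as np

def margin_fit_error(practical_margins, class_counts):
    """L2 error between the practical per-class margins and LDAM-style
    theoretical margins gamma_j = C / n_j**0.25 with C = 1, after fitting
    the theoretical margins by the practical ones with a linear regression
    that has no intercept (cf. Table A4)."""
    practical = np.asarray(practical_margins, dtype=float)
    theoretical = 1.0 / np.asarray(class_counts, dtype=float) ** 0.25  # C = 1

    # Least-squares scale of the regression through the origin.
    scale = (practical @ theoretical) / (practical @ practical)
    residual = theoretical - scale * practical
    return float(np.sqrt(np.sum(residual ** 2)))
```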
Note that in Table A2, which corresponds to the extremely imbalanced setting, our method shrinks the margin gap more than Remix does, verifying that our method consistently outperforms Remix.
Therefore, from a margin perspective, we first establish the baseline: when trained with ERM on imbalanced data, the margins of the majority classes are significantly larger than those of the minority classes. Second, the recently proposed LDAM loss indeed shrinks the margin gap significantly, suggesting that the approach is effective. To answer the original question of whether we can achieve uneven margins for class-imbalanced learning through data augmentation, the answer is positive: applying the original Mixup already implicitly closes the gap from a margin perspective and achieves comparable results, and the proposed MAMix further achieves uneven margins explicitly.
A.1.3 Per-Class Accuracy Evaluation. To further demonstrate the effectiveness of the proposed method, Table A5 reports the detailed per-class accuracy. With ERM, the accuracies of the minority classes (i.e., C7, C8, C9) are low, with C8 and C9 at 0.46 and 0.48, respectively. The previous state-of-the-art LDAM-DRW improves these two minority classes to 0.63 and 0.66, whereas our proposed MAMix-DRW further raises the per-class accuracy of C8 and C9 to 0.79 and 0.82, respectively, without sacrificing the performance of the majority classes, providing further evidence of the effectiveness of our algorithm.
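The per-class accuracies in Table A5 can be computed as in the following sketch (the helper name is illustrative):

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, num_classes):
    """Top-1 validation accuracy of each class C0, ..., C{K-1} (cf. Table A5)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.array([
        (y_pred[y_true == c] == c).mean() if (y_true == c).any() else np.nan
        for c in range(num_classes)
    ])
```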
A.1.4 Hyper-parameter ω in Margin-Aware Mixup. As seen in Table A6, in the proposed MAMix we can simply set ω to 0.25, which is consistent with the value suggested for LDAM [12]; moreover, the performance changes little under different settings of ω, demonstrating that the proposed method is easy to tune.
Appendix B Implementation Details
B.1 Implementation Details for CIFAR
We followed [12] for CIFAR-10 and CIFAR-100, including the simple data augmentation described in [30] for training: we first padded 4 pixels on each side, then randomly sampled a 32 × 32 crop from the padded image or its horizontal flip. We used ResNet-32 [30] as our base network and trained the model with a batch size of 128 for 200 epochs, with an initial learning rate of 0.1 decayed by 0.01 at the 160th and 180th epochs. For fair comparison, we also used a linear warm-up learning rate schedule for the first 5 epochs.
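A minimal PyTorch/torchvision sketch of this augmentation and learning-rate schedule; the exact warm-up formula below is an assumption rather than a verbatim excerpt of our released code:

```python
import torchvision.transforms as transforms

# Basic augmentation described above: pad 4 pixels per side,
# random 32x32 crop, and random horizontal flip.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def adjust_learning_rate(optimizer, epoch, base_lr=0.1):
    """Linear warm-up over the first 5 epochs, then multiply the learning
    rate by 0.01 at the 160th and again at the 180th epoch."""
    if epoch <= 5:
        lr = base_lr * epoch / 5
    elif epoch > 180:
        lr = base_lr * 0.01 * 0.01
    elif epoch > 160:
        lr = base_lr * 0.01
    else:
        lr = base_lr
    for group in optimizer.param_groups:
        group["lr"] = lr
```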
B.2 Implementation Details for CINIC
We followed [21] for CINIC-10, using ResNet-18 [30] as our base network. Following the training scheme provided by [21], we trained the model for 200 epochs with a batch size of 128 and an initial learning rate of 0.1, decayed by 0.01 at the 160th and 180th epochs, again with a linear warm-up learning rate schedule. When DRW was deployed, it was deployed at the 160th epoch. When LDAM was used, we enforced the largest margin to be 0.5.
B.3 Implementation Details for SVHN
We followed [31] for SVHN and adopted ResNet-32 [30] as our base network. We trained the model for 200 epochs with an initial learning rate of 0.1 and a batch size of 128, using a linear warm-up schedule and decaying the learning rate by 0.1 at the 160th and 180th epochs. When DRW was deployed, it was deployed at the 160th epoch. When LDAM was used, we enforced the largest margin to be 0.5.
The detailed results for imbalanced SVHN are given in Table B7.
B.4 Implementation Details for Tiny ImageNet
We followed [12] for Tiny ImageNet with 200 classes. For basic data augmentation during training, we first performed simple horizontal flips and then took random crops of size 64 × 64 from images padded by 8 pixels on each side. We adopted ResNet-18 [30] as our base network and used stochastic gradient descent with a momentum of 0.9 and a weight decay of 2·10−4. We trained the model for 300 epochs with an initial learning rate of 0.1 and a batch size of 128, using a linear warm-up schedule and decaying the learning rate by 0.1 at the 150th epoch and by 0.01 at the 250th epoch. When DRW was deployed, it was deployed at the 240th epoch. When LDAM was used, we followed the original paper and enforced the largest margin to be 0.5. Note that we could not reproduce the numbers reported in [12].
The detailed results for imbalanced Tiny ImageNet are given in Table B8.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Cheng, WC., Mai, TH., Lin, HT. (2024). From SMOTE to Mixup for Deep Imbalanced Classification. In: Lee, CY., Lin, CL., Chang, HT. (eds) Technologies and Applications of Artificial Intelligence. TAAI 2023. Communications in Computer and Information Science, vol 2074. Springer, Singapore. https://doi.org/10.1007/978-981-97-1711-8_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-1710-1
Online ISBN: 978-981-97-1711-8