
From SMOTE to Mixup for Deep Imbalanced Classification

Conference paper
In: Technologies and Applications of Artificial Intelligence (TAAI 2023)

Abstract

Given imbalanced data, it is hard to train a good classifier using deep learning because of the poor generalization of minority classes. Traditionally, the well-known synthetic minority oversampling technique (SMOTE), a data augmentation approach from data mining for imbalanced learning, has been used to improve this generalization. However, it is unclear whether SMOTE also benefits deep learning. In this work, we study why the original SMOTE is insufficient for deep learning, and enhance SMOTE using soft labels. Connecting the resulting soft SMOTE with Mixup, a modern data augmentation technique, leads to a unified framework that puts traditional and modern data augmentation techniques under the same umbrella. A careful study within this framework shows that Mixup improves generalization by implicitly achieving uneven margins between majority and minority classes. We then propose a novel margin-aware Mixup technique that achieves uneven margins more explicitly. Extensive experimental results demonstrate that our proposed technique yields state-of-the-art performance on deep imbalanced classification, with especially superior performance on extremely imbalanced data. The code is open-sourced in our package https://github.com/ntucllab/imbalanced-DL to foster future research in this direction.


Availability of Data and Material

Experiments are based on public benchmark data.

Code availability: released open-source at https://github.com/ntucllab/imbalanced-DL.

References

  1. Awoyemi, J.O., Adetunmbi, A.O., Oluwadare, S.A.: Credit card fraud detection using machine learning techniques: a comparative analysis. In: 2017 ICCNI, pp. 1–9 (2017). https://doi.org/10.1109/ICCNI.2017.8123782

  2. Roy, A., Sun, J., Mahoney, R., Alonzi, L., Adams, S., Beling, P.: Deep learning detecting fraud in credit card transactions. In: 2018 SIEDS, pp. 129–134 (2018). https://doi.org/10.1109/SIEDS.2018.8374722

  3. Horn, G.V., Perona, P.: The devil is in the tails: fine-grained classification in the wild. CoRR abs/1709.01450 (2017). https://arxiv.org/abs/1709.01450

  4. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision – ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48


  5. Zhong, Y., et al.: Unequal training for deep face recognition with long-tailed noisy data. In: 2019 CVPR, pp. 7804–7813 (2019). https://doi.org/10.1109/CVPR.2019.00800

  6. Huang, C., Li, Y., Loy, C.C., Tang, X.: Deep imbalanced learning for face recognition and attribute prediction. TPAMI 42(11), 2781–2794 (2020). https://doi.org/10.1109/TPAMI.2019.2914680

  7. Cui, Y., Jia, M., Lin, T.-Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: CVPR (2019)


  8. Chawla, N., Bowyer, K., Hall, L.O., Kegelmeyer, W.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(1), 321–357 (2002)


  9. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks, pp. 1322–1328 (2008). https://doi.org/10.1109/IJCNN.2008.4633969

  10. Huang, C., Li, Y., Loy, C.C., Tang, X.: Learning deep representation for imbalanced classification. In: 2016 CVPR, pp. 5375–5384 (2016). https://doi.org/10.1109/CVPR.2016.580

  11. Liu, X., Zhou, Z.: The influence of class imbalance on cost-sensitive learning: an empirical study. In: ICDM 2006, pp. 970–974 (2006). https://doi.org/10.1109/ICDM.2006.158

  12. Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with label-distribution-aware margin loss. In: NeurIPS (2019)


  13. Bej, S., Davtyan, N., Wolfien, M., Nassar, M., Wolkenhauer, O.: LoRAS: an oversampling approach for imbalanced datasets. CoRR abs/1908.08346 (2019). https://arxiv.org/abs/1908.08346

  14. DeVries, T., Taylor, G.W.: Dataset augmentation in feature space. In: ICLR 2017, Toulon, France, 24–26 April 2017, Workshop Track Proceedings (2017). https://openreview.net/forum?id=HyaF53XYx

  15. Inoue, H.: Data augmentation by pairing samples for images classification (2018). https://openreview.net/forum?id=SJn0sLgRb

  16. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: beyond empirical risk minimization. In: ICLR (2018). https://openreview.net/forum?id=r1Ddp1-Rb

  17. Mathew, J., Luo, M., Pang, C.K., Chan, H.L.: Kernel-based SMOTE for SVM classification of imbalanced datasets. In: IECON 2015, pp. 001127–001132 (2015). https://doi.org/10.1109/IECON.2015.7392251

  18. Dablain, D., Krawczyk, B., Chawla, N.: DeepSMOTE: fusing deep learning and smote for imbalanced data. IEEE Trans. Neural Netw. Learn. Syst., 1–15 (2022). https://doi.org/10.1109/TNNLS.2021.3136503

19. Goodfellow, I.J., et al.: Generative adversarial networks. In: NeurIPS (2014)

20. Kim, J., Jeong, J., Shin, J.: M2m: imbalanced classification via major-to-minor translation. In: 2020 CVPR, pp. 13893–13902 (2020). https://doi.org/10.1109/CVPR42600.2020.01391

21. Chou, H.-P., Chang, S.-C., Pan, J.-Y., Wei, W., Juan, D.-C.: Remix: rebalanced mixup. arXiv preprint arXiv:2007.03943 (2020)

  22. Johnson, J., Khoshgoftaar, T.: Survey on deep learning with class imbalance. J. Big Data 6, 1–54 (2019)


  23. Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018). https://doi.org/10.1016/j.neunet.2018.07.011


  24. Khan, S.H., Hayat, M., Bennamoun, M., Sohel, F.A., Togneri, R.: Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 29(8), 3573–3587 (2018). https://doi.org/10.1109/TNNLS.2017.2732482

  25. Liu, W., Wen, Y., Yu, Z., Yang, M.: Large-margin softmax loss for convolutional neural networks. In: ICML, pp. 507–516 (2016)


  26. Wang, F., Cheng, J., Liu, W., Liu, H.: Additive margin softmax for face verification. IEEE Sig. Process. Lett. 25(7), 926–930 (2018). https://doi.org/10.1109/LSP.2018.2822810

  27. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto (2009)


28. Ye, H.-J., Chen, H.-Y., Zhan, D.-C., Chao, W.-L.: Identifying and compensating for feature deviation in imbalanced deep learning. arXiv preprint arXiv:2001.01385 (2020)

  29. Reyzin, L., Schapire, R.E.: How boosting the margin can also boost classifier complexity. In: Proceedings of the 23rd International Conference on Machine Learning. ICML 2006, pp. 753–760 (2006). https://doi.org/10.1145/1143844.1143939

30. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 CVPR, pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90

  31. Yang, Y., Xu, Z.: Rethinking the value of labels for improving class-imbalanced learning. In: NeurIPS (2020)



Funding

This work was mainly supported by the MOST of Taiwan under grant 107-2628E-002-008-MY3.

Author information


Contributions

Cheng contributed the detailed literature survey, the initial idea of studying Mixup for deep imbalanced classification, the experimental comparison, the code implementation and release, and the initial manuscript; Ha rigorously reviewed the code implementation, addressed issues and bugs, and expanded the scope of the experimental comparison by incorporating additional methods such as DeepSMOTE and M2m; Lin contributed the bigger picture of linking SMOTE and Mixup, the initial idea of the margin-aware extension, and suggestions on the research methodology.

Corresponding author

Correspondence to Hsuan-Tien Lin.


Ethics declarations

Conflicts of interest/Competing interests: n/a

Ethics approval: n/a

Consent to participate: n/a

Consent for publication: n/a

Appendices

Appendix A Margin Statistics Analysis

We discuss Mixup-based approaches [16, 21] and their effects on margin statistics, compared with the margin-based state-of-the-art LDAM [12].

A.1 Margin Perspectives

To better analyze and quantify the effect of different learning algorithms on the majority- and minority-class margins, we define the margin gap metric γgap as:

$$ \gamma_{\mathrm{gap}} = \frac{\sum_{i} n_{i}\,\bar{\gamma}_{i}}{\sum_{i} n_{i}} - \frac{\sum_{j} n_{j}\,\bar{\gamma}_{j}}{\sum_{j} n_{j}} $$
(A1)

where i ranges over the majority classes and j over the minority classes. To decide which classes are majority and which are minority, we set a threshold: if a class's sample count exceeds 1/K of the total training samples, where K is the number of classes, we categorize it as a majority class; the others are treated as minority classes.
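For concreteness, here is a minimal NumPy sketch of this metric; the function and variable names are ours and not part of the released package.

```python
import numpy as np

def margin_gap(mean_margins, class_counts):
    """Margin gap (A1): count-weighted mean margin of majority classes
    minus count-weighted mean margin of minority classes."""
    mean_margins = np.asarray(mean_margins, dtype=float)
    class_counts = np.asarray(class_counts, dtype=float)
    K = len(class_counts)

    # A class is "majority" if it holds more than 1/K of all samples.
    is_majority = class_counts > class_counts.sum() / K

    def weighted_mean(mask):
        return (class_counts[mask] * mean_margins[mask]).sum() / class_counts[mask].sum()

    return weighted_mean(is_majority) - weighted_mean(~is_majority)
```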

A large margin gap thus corresponds to majority classes with larger margins and minority classes with smaller margins, and hence to poor generalizability for the minority classes. We hope to achieve a smaller margin gap given imbalanced classes. Note that this metric can be negative, when the margins of minority classes are larger than those of majority classes. To determine whether the margin gap is a good indicator of top-1 validation accuracy, we further evaluate their correlation with Spearman's rank-order correlation ρ in Fig. A1.

A.1.1 Spearman's Rank-Order Correlation. We demonstrate the results of the analysis using Spearman's rank-order correlation in Fig. A1. We observe a negative rank-order correlation between validation accuracy and margin gap γgap: by our definition of the margin gap, the better the model generalizes to the minority classes, the lower the margin gap. That is, better models produce smaller margin gaps between majority and minority classes. As seen in Fig. A1, Spearman's rank-order correlation is −0.820, showing that although γgap is sometimes noisy, in general it is a good indicator of top-1 validation accuracy. We discuss the noisy cases in the next subsection.
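The correlation itself is straightforward to reproduce with SciPy; the sketch below assumes one (margin gap, accuracy) pair per trained model, with placeholder numbers rather than the actual values behind Fig. A1.

```python
from scipy.stats import spearmanr

# Placeholder data: one (margin gap, top-1 validation accuracy) pair
# per trained model; not the actual values behind Fig. A1.
margin_gaps = [6.2, 3.1, 0.4, -1.6, -1.1]
val_accuracies = [71.0, 77.0, 80.5, 81.8, 82.3]

rho, p_value = spearmanr(margin_gaps, val_accuracies)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")  # negative rho expected
```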

A.1.2 Uneven Margin. Given the superior empirical performance of Mixup-based methods, we further analyze them from a margin perspective to demonstrate the effectiveness of our method. First, we establish the baseline margin gap when the model is trained using ERM. Then, we examine the margin-based LDAM work, in which larger margins are enforced for minority classes [12]. As seen in Table A1, the margin gap for ERM is the highest; that is, for deep models trained using ERM, majority classes tend to have larger margins than minority classes, resulting in poor generalizability for minority classes. LDAM-DRW [12] demonstrates its ability to shrink the margin gap, reducing the generalization error for the minority classes through margin-based softmax training. Moreover, we observe that under long-tailed imbalance, the original Mixup alone yields competitive results, as the margin gaps are similar among the original Mixup, Remix, and our proposed method. This observation is consistent with Remix [21], which reports similar performance in the long-tailed imbalance setting. However, in the step imbalance setting, the superiority of our method is evident: it not only achieves better performance but also shrinks the margin gap more than the original Mixup.

Fig. A1. Relationship between margin gap and validation accuracy for long-tailed imbalanced CIFAR-10 with imbalance ratio ρ = 100 using ResNet32

Table A1. Margin gap on imbalanced CIFAR-10 with ρ = 100 using ResNet32
Table A2. Margin gap for extremely imbalanced CIFAR-10 with ρ = 300 using ResNet32
Table A3. Margin decomposition on long-tailed imbalanced CIFAR-10 with ρ = 100 using ResNet32 (Majority: Class 0 to Class 2; Minority: Class 3 to Class 9)

Note that in Table A1, for the long-tailed scenario, the margin gap of Remix-DRW is −1.598 while that of MAMix-DRW is −1.136; however, as shown in Table 4, their respective validation accuracies are 81.82 and 82.29. This is an example of the noisy cases mentioned above: Remix-DRW yields a smaller margin gap than MAMix-DRW yet poorer validation accuracy, because Remix tends to enforce excessive margins in minority classes, whereas our method strikes a better trade-off.

To further study why excessive margins in minority classes do not help validation accuracy, we decompose the margins into two parts, γ ≥ 0 and γ < 0; only the γ < 0 part determines the validation error. The detailed decomposition is given in Table A3, where for each method we average all γ < 0 margins separately over the majority and minority classes, and compute the γ ≥ 0 part the same way. We observe that the γ < 0 part is generally similar between Remix and our MAMix, which explains why their accuracies differ only slightly, whereas the γ ≥ 0 part is generally higher for Remix. The reason Remix attains a lower margin gap in this case is therefore that it enforces larger margins in the γ ≥ 0 part of the minority classes: 4.891 for Remix versus 4.213 for MAMix. This suggests that Remix produces excessive margins in minority classes. Do these excessive margins help? Previous research [29] indicates that over-optimizing the margin can be overkill and may even hurt performance. We answer this question by examining the difference between the theoretical and practical margin distributions.
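A sketch of this decomposition, assuming per-example margins and labels are available (the helper below is ours, not from the paper's code):

```python
import numpy as np

def decompose_margins(margins, labels, majority_classes):
    """Split per-example margins into the gamma >= 0 and gamma < 0 parts,
    averaged separately over majority and minority classes. Validation
    error comes entirely from the gamma < 0 part."""
    margins = np.asarray(margins, dtype=float)
    labels = np.asarray(labels)
    in_majority = np.isin(labels, list(majority_classes))

    result = {}
    for group, mask in (("majority", in_majority), ("minority", ~in_majority)):
        g = margins[mask]
        result[group] = {
            "gamma>=0": g[g >= 0].mean() if (g >= 0).any() else float("nan"),
            "gamma<0": g[g < 0].mean() if (g < 0).any() else float("nan"),
        }
    return result
```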

Recall that LDAM [12] derives a theoretically optimal ratio (1) for the per-class margin distribution; this ratio hints at the need to not over-push the margins of minority classes. To analyze how close the practical per-class margin distribution of each method is to the theoretical one, we fit the theoretical margins with the practical margins; since the theoretical margin in (1) contains a constant multiplier C, we use linear regression without a bias term, set C = 1, and compare the fitting (L2) errors in Table A4. Our proposed MAMix shows the smallest L2 error, suggesting that its per-class margin distribution is the closest to the theoretical distribution derived in [12]. The distribution produced by Remix [21] is slightly inferior to ours in terms of L2 error, owing to the excessive margins in minority classes shown in the Remix-DRW minority γ ≥ 0 part of Table A3. Moreover, comparing Table A4 with Table 4, we observe that the closer the practical margins are to the theoretical ones, the higher the validation accuracy. We therefore argue that one should not only enforce larger margins for minority classes but also avoid over-pushing them, indicating the need for our method's better trade-off.
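Under one reading of this procedure (our assumption: the no-bias regression fits the theoretical margins, proportional to n_j^(-1/4) with C = 1, against the practical per-class margins), the L2 error can be computed as:

```python
import numpy as np

def margin_fit_error(practical_margins, class_counts):
    """L2 error between the theoretical LDAM margins (C = 1) and their
    best no-bias linear fit from the practical per-class margins."""
    p = np.asarray(practical_margins, dtype=float)
    t = np.asarray(class_counts, dtype=float) ** (-0.25)  # theoretical margins
    a = (p @ t) / (p @ p)  # closed-form no-intercept least squares
    return np.sqrt(((t - a * p) ** 2).sum())
```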

Note that in Table A2, which reports the extremely imbalanced setting, our method shrinks the margin gap more than Remix does, further verifying that it consistently outperforms Remix.

Therefore, from a margin perspective, we first establish the baseline: when trained with ERM for imbalanced learning, the margins for majority classes are significantly larger than those for minority classes. Second, the recently proposed LDAM loss indeed shrinks the margin gap significantly, suggesting that their approach is effective. To answer the original question—Can we achieve uneven margins for class-imbalanced learning through data augmentation?—the answer is positive, as we observe that applying the original Mixup implicitly closes the gap from a margin perspective, achieving comparable results. We further achieve uneven margins explicitly through the proposed MAMix.

Table A4. L2 error on long-tailed imbalanced CIFAR-10 with ρ = 100 using ResNet32
Table A5. Per Class Accuracy in long-tailed imbalanced CIFAR-10 with ρ = 100 using ResNet32

A.1.3 Per-Class Accuracy Evaluation. To further demonstrate the effectiveness of the proposed method, Table A5 reports detailed per-class accuracies. With ERM, the accuracies of the minority classes (i.e., C7, C8, C9) are low, with C8 and C9 at 0.46 and 0.48, respectively. The previous state-of-the-art LDAM-DRW improves these two minority classes to 0.63 and 0.66. Our proposed MAMix-DRW further raises the per-class accuracies of C8 and C9 to 0.79 and 0.82, respectively, without sacrificing the performance of the majority classes, providing further evidence of the effectiveness of our algorithm.

A.1.4 Hyper-Parameter ω in Margin-Aware Mixup. As seen in Table A6, in the proposed MAMix we can simply set ω to 0.25, which is consistent with the value suggested for LDAM [12]; moreover, the performance changes little under different settings of ω, demonstrating that the proposed method is easy to tune.

Appendix B Implementation Details

B.1 Implementation Details for CIFAR

We followed [12] for CIFAR-10 and CIFAR-100, including the simple data augmentation of [30] for training: we first padded 4 pixels on each side, then randomly sampled a 32 × 32 crop from the padded image or its horizontal flip. We used ResNet-32 [30] as our base network and trained the model with a batch size of 128 for 200 epochs, with an initial learning rate of 0.1 decayed by a factor of 0.01 at the 160th and again at the 180th epoch. We also used a linear warm-up learning rate schedule for the first 5 epochs for a fair comparison.
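In PyTorch/torchvision this augmentation pipeline looks roughly as follows; the normalization statistics are the common CIFAR-10 values and are our assumption, not stated in the paper.

```python
import torchvision.transforms as T

# Simple augmentation of [30]: pad 4 pixels per side, take a random
# 32x32 crop, and flip horizontally with probability 0.5.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    # Common CIFAR-10 statistics (our assumption, not from the paper).
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
```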

Table A6. Sensitivity of ω in long-tailed extremely imbalanced CIFAR-10 with ρ = 300 using ResNet32

B.2 Implementation Details for CINIC

We followed [21] for CINIC-10 and used ResNet-18 [30] as our base network. Following the training scheme of [21], we trained the model for 200 epochs with a batch size of 128 and an initial learning rate of 0.1, decayed by a factor of 0.01 at the 160th and 180th epochs, together with a linear warm-up learning rate schedule. When DRW was deployed, it started at the 160th epoch. When LDAM was used, we enforced the largest margin to be 0.5.
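As a rough sketch of the DRW scheduling described above, following the reference implementation of [12] (the class-balanced weights of [7] with β = 0.9999 are that implementation's choice and our assumption here):

```python
import torch
import torch.nn.functional as F

def drw_weights(class_counts, epoch, drw_epoch=160, beta=0.9999):
    """Deferred re-weighting: unweighted loss until drw_epoch, then
    class-balanced weights [7] based on the effective number of samples."""
    n = torch.as_tensor(class_counts, dtype=torch.float)
    if epoch < drw_epoch:
        weights = torch.ones_like(n)
    else:
        weights = (1.0 - beta) / (1.0 - torch.pow(beta, n))
    return weights / weights.sum() * len(n)  # normalize to mean 1

# Inside the training loop (sketch):
# w = drw_weights(counts, epoch).to(logits.device)
# loss = F.cross_entropy(logits, targets, weight=w)
```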

B.3 Implementation Details for SVHN

We followed [31] for SVHN and adopted ResNet-32 [30] as our base network. We trained the model for 200 epochs with an initial learning rate of 0.1 and a batch size of 128, using a linear warm-up schedule and decaying the learning rate by a factor of 0.1 at the 160th and 180th epochs. When DRW was deployed, it started at the 160th epoch. When LDAM was used, we enforced the largest margin to be 0.5.
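For reference, the per-class LDAM margins with the largest margin fixed at 0.5 follow the n_j^(-1/4) rule of [12]; a minimal sketch:

```python
import numpy as np

def ldam_margins(class_counts, max_margin=0.5):
    """Per-class LDAM margins [12]: proportional to n_j^(-1/4), rescaled
    so the largest (rarest-class) margin equals max_margin."""
    n = np.asarray(class_counts, dtype=float)
    m = n ** (-0.25)
    return m * (max_margin / m.max())

# Example: a 3-class step imbalance with counts 5000, 5000, 50.
# The rarest class receives the full 0.5 margin.
print(ldam_margins([5000, 5000, 50]))
```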

The detailed results for imbalanced SVHN are given in Table B7.

B.4 Implementation Details for Tiny ImageNet

We followed [12] for Tiny ImageNet with 200 classes. For basic data augmentation in training, we performed simple horizontal flips, followed by random crops of size 64 × 64 from images padded by 8 pixels on each side. We adopted ResNet-18 [30] as our base network and used stochastic gradient descent with momentum 0.9 and weight decay 2 × 10⁻⁴. We trained the model for 300 epochs with an initial learning rate of 0.1 and a batch size of 128, using a linear warm-up learning rate schedule and decaying the learning rate by a factor of 0.1 at the 150th epoch and 0.01 at the 250th epoch. When DRW was deployed, it started at the 240th epoch. When LDAM was used, we followed the original paper and enforced the largest margin to be 0.5. Note that we could not reproduce the numbers reported in [12].
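A sketch of this optimizer and schedule (the 5-epoch warm-up length mirrors the CIFAR setup and is our assumption here):

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=200)  # Tiny ImageNet backbone
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=2e-4)

def adjust_learning_rate(optimizer, epoch, base_lr=0.1, warmup_epochs=5):
    """Linear warm-up, then decay by 0.1 at epoch 150 and 0.01 at epoch 250."""
    if epoch < warmup_epochs:
        lr = base_lr * (epoch + 1) / warmup_epochs
    elif epoch < 150:
        lr = base_lr
    elif epoch < 250:
        lr = base_lr * 0.1
    else:
        lr = base_lr * 0.01
    for group in optimizer.param_groups:
        group["lr"] = lr
```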

Table B7. Top-1 validation accuracy (mean ± std) on imbalanced SVHN using ResNet32
Table B8. Top-1 validation accuracy (mean ± std) on imbalanced Tiny-ImageNet using ResNet18

The detailed results for imbalanced Tiny-ImageNet are given in Table B8.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Cheng, WC., Mai, TH., Lin, HT. (2024). From SMOTE to Mixup for Deep Imbalanced Classification. In: Lee, CY., Lin, CL., Chang, HT. (eds) Technologies and Applications of Artificial Intelligence. TAAI 2023. Communications in Computer and Information Science, vol 2074. Springer, Singapore. https://doi.org/10.1007/978-981-97-1711-8_6

Download citation

  • DOI: https://doi.org/10.1007/978-981-97-1711-8_6

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-1710-1

  • Online ISBN: 978-981-97-1711-8

  • eBook Packages: Computer Science (R0)
