
Risk-Averse support vector classifier machine via moments penalization

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

Support vector machine (SVM) has long been one of the most successful learning methods, built on the idea of structural risk minimization, which minimizes an upper bound on the generalization error. Recently, a tighter upper bound on the generalization error, the empirical Bernstein bound, was established; it involves the variance of the loss. Based on this result, we propose a novel risk-averse support vector classifier machine (RA-SVCM), which achieves better generalization performance by exploiting second-order statistical information of the loss. It minimizes the empirical first and second moments of the loss, i.e., the mean and variance of the loss, to achieve the "right" bias-variance trade-off for general function classes. The proposed method can be solved by kernel reduction and a Newton-type technique under certain conditions. Empirical studies show that the RA-SVCM achieves the best performance in comparison with other classical and state-of-the-art methods. Additional analysis shows that the proposed method is insensitive to its parameters, so a broad range of parameter values leads to satisfactory performance. The proposed method is a generalization of the standard SVM, so it enriches the related studies of SVM.
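As a rough illustration of the moments-penalization idea described in the abstract (this is a minimal sketch, not the authors' formulation, whose full details appear in the paywalled body), the snippet below penalizes both the empirical mean and the empirical variance of the hinge loss for a linear classifier. The function name, the parameters `lambda1`/`lambda2`, and the choice of plain hinge loss are illustrative assumptions.

```python
import numpy as np

def moments_penalized_objective(w, X, y, lambda1, lambda2):
    """Illustrative objective: regularizer + first moment (mean) + second central
    moment (variance) of the per-sample hinge loss.

    X: (m, d) feature matrix, y: (m,) labels in {-1, +1}.
    lambda1 weights the mean loss, lambda2 weights the variance of the loss.
    """
    margins = y * (X @ w)
    losses = np.maximum(0.0, 1.0 - margins)   # hinge loss per sample
    mean_loss = losses.mean()                 # empirical first moment
    var_loss = losses.var()                   # empirical variance (ddof=0)
    return 0.5 * w @ w + lambda1 * mean_loss + lambda2 * var_loss
```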



Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant 61772020.

Author information

Corresponding author

Correspondence to Shuisheng Zhou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Proof of Remark 1

Theorem 2

Under the condition of Theorem 1, the objective function \(f_1({\varvec{\alpha }})\) in (24) is convex if

$$\begin{aligned} \frac{{{\lambda _1}}}{{{\lambda _2}}} \ge 2 \left(c + \frac{1}{p}\right). \end{aligned}$$
(A1)

Proof

Let \({g_i}({\varvec{\alpha }}) = \frac{1}{p}\log (1 + {e^{p{\varvec{r}_i}}}) = \max \{ {\varvec{r}_i},0\} + \frac{1}{p}\log (1 + {e^{ - p{|\varvec{r}_i |}}})\). Then the objective function of (24) can be written as

$$\begin{aligned} {f_1}(\varvec{\alpha } ) = {\lambda _1}\sum \limits _{i = 1}^m {{g_i}(\varvec{\alpha } )} + {\lambda _2}{g^\top }(\varvec{\alpha } )Qg(\varvec{\alpha } ) + \frac{1}{2}{\varvec{\alpha } ^\top }K\varvec{\alpha } \end{aligned}$$
(A2)

The Hessian matrix of \(f_1(\varvec{\alpha })\) is:

$$\begin{aligned} {\nabla ^2}{f_1}(\varvec{\alpha } )= & {} {\lambda _1}\sum \limits _{i = 1}^m {{\nabla ^2}{g_i}(\varvec{\alpha } )} + 2{\lambda _2}\sum \limits _{i = 1}^m {{\varvec{\delta } _i}{\nabla ^2}{g_i}(\varvec{\alpha } )} \nonumber \\&+ 2{\lambda _2}\nabla g(\varvec{\alpha } )Q\nabla {g^\top }(\varvec{\alpha } ) + K\nonumber \\= & {} \sum \limits _{i = 1}^m {({\lambda _1} + 2{\lambda _2}{\varvec{\delta } _i}){\nabla ^2}} {g_i}(\varvec{\alpha } ) \nonumber \\&+ 2{\lambda _2}\nabla g(\varvec{\alpha } )Q\nabla {g^\top }(\varvec{\alpha } ) + K \end{aligned}$$
(A3)

where \(Q=\varvec{I}-\frac{1}{m}\varvec{e}\varvec{e}^\top\), \(\varvec{\delta }=Qg(\varvec{\alpha })\), and

$$\begin{aligned} {\varvec{\delta } _i}= & {} {g_i}(\varvec{\alpha } ) - \frac{1}{m}{\varvec{e}^\top }g(\varvec{\alpha } ) \nonumber \\= & {} \max \{ {\varvec{r}_i},0\} + \frac{1}{p}\log (1 + {e^{ - p{|\varvec{r}_i |}}}) \nonumber \\&- \frac{1}{m}\sum \limits _{j = 1}^m {\left[ \max \{ {\varvec{r}_j},0\} + \frac{1}{p}\log (1 + {e^{ - p{|\varvec{r}_j |}}})\right] } \nonumber \\\ge & {} - \frac{1}{m}\sum \limits _{j = 1}^m {\max \{ {\varvec{r}_j},0\}} - \frac{1}{{mp}}\sum \limits _{j = 1}^m {\log (1 + {e^{ - p{|\varvec{r}_j |}}})} \nonumber \\\ge & {} - {c} - \frac{1}{{mp}}\sum \limits _{j = 1}^m {\log (1 + {e^{ - p(1+M)}})} \nonumber \\\ge & {} - {c} - \frac{{\log 2}}{p}\nonumber \\\ge & {} - {c} - \frac{1}{p}. \end{aligned}$$
(A4)

To prove that the objective function in Eq. (A2) is convex, it suffices to show that \({\nabla ^2}{f_1}(\varvec{\alpha } ) \succeq 0\) for every \(\varvec{\alpha }\). It is obvious that \(2{\lambda _2}\nabla g(\varvec{\alpha } )Q\nabla {g^\top }(\varvec{\alpha } ) + K\succeq 0\), so it remains to show that \(\sum \nolimits _{i = 1}^m {({\lambda _1} + 2{\lambda _2}{\varvec{\delta } _i}){\nabla ^2}} {g_i}(\varvec{\alpha } ) \succeq 0\). Let \(\varvec{\mu }_i=\lambda _1+2\lambda _2 \varvec{\delta } _i\); since each \({\nabla ^2}{g_i}(\varvec{\alpha })\succeq 0\), it suffices to show \(\varvec{\mu }_i\ge 0\) for all \(i\). By (A4), this holds whenever \(\frac{{{\lambda _1}}}{{{\lambda _2}}} \ge 2(c + \frac{1}{p})\). That is, when \(\frac{{{\lambda _1}}}{{{\lambda _2}}} \ge 2(c + \frac{1}{p})\), we have \({\nabla ^2}{f_1}(\varvec{\alpha } ) \succeq 0\) for every \(\varvec{\alpha }\in R^m\), and therefore the objective function \(f_1(\varvec{\alpha })\) in (24) is convex. \(\square\)

It is obvious that the objective function in Eq. (25) is convex, since \({\varphi _p}(r)\) and \({({\varphi _p}(r))^2}\) are convex functions. When \(\frac{{{\lambda _1}}}{{{\lambda _2}}} \ge 2(c + \frac{1}{p})\), the solution of problem (24) obtained by Algorithm 1 is globally optimal by Theorem 2. In practice, we have rarely encountered non-convergence in a large number of experiments. Of course, a simple rule can also be used to select hyperparameters that satisfy the condition given above.
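The following is a minimal numerical sketch of the objective \(f_1(\varvec{\alpha })\) in (A2) and of the sufficient condition (A1). The definition of the residual \(\varvec{r}_i\) as \(1 - y_i(K\varvec{\alpha })_i\) is an assumption for illustration only, since problem (24) is not reproduced in this appendix; everything else follows the formulas above.

```python
import numpy as np

def f1_objective(alpha, K, y, lambda1, lambda2, p):
    """Sketch of f_1(alpha) in (A2):
        lambda1 * sum_i g_i(alpha) + lambda2 * g(alpha)^T Q g(alpha) + 0.5 * alpha^T K alpha,
    where g_i(alpha) = (1/p) * log(1 + exp(p * r_i)) and Q = I - (1/m) e e^T.

    The residual r = 1 - y * (K @ alpha) is an assumption (the exact form of r in (24)
    is not shown here).
    """
    m = len(y)
    r = 1.0 - y * (K @ alpha)
    # Numerically stable evaluation of (1/p) * log(1 + exp(p * r)),
    # using the identity max(r, 0) + (1/p) * log(1 + exp(-p * |r|)).
    g = np.maximum(r, 0.0) + np.log1p(np.exp(-p * np.abs(r))) / p
    Q = np.eye(m) - np.ones((m, m)) / m      # centering matrix
    return lambda1 * g.sum() + lambda2 * g @ Q @ g + 0.5 * alpha @ K @ alpha

def convexity_condition_holds(lambda1, lambda2, c, p):
    """Sufficient condition (A1) of Theorem 2: lambda1 / lambda2 >= 2 * (c + 1/p)."""
    return lambda1 / lambda2 >= 2.0 * (c + 1.0 / p)
```

Checking `convexity_condition_holds` before running the Newton-type solver is one simple way to restrict the hyperparameter search to the regime where Theorem 2 guarantees a globally optimal solution.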


About this article

Cite this article

Fu, C., Zhou, S., Zhang, J. et al. Risk-Averse support vector classifier machine via moments penalization. Int. J. Mach. Learn. & Cyber. 13, 3341–3358 (2022). https://doi.org/10.1007/s13042-022-01598-4
