Product Length Predictions with Machine Learning: An Integrated Approach Using Extreme Gradient Boosting

  • Original Research
  • SN Computer Science

Abstract

This study introduces a novel machine learning approach for predicting product length from heterogeneous data, drawing on numeric, textual, and categorical features to improve prediction accuracy. The approach combines three techniques: text vectorization with Term Frequency-Inverse Document Frequency (TF-IDF), target encoding of categorical data, and the eXtreme Gradient Boosting (XGBoost) algorithm. The method begins with data preparation, in which outliers are removed and missing values are filled in. Text from product titles, descriptions, and bullet points is then converted into TF-IDF feature vectors, which capture the weighted frequency of words and give the algorithm a numerical input. Training uses RandomizedSearchCV to optimize the XGBoost model's hyperparameters, with the TF-IDF vectors and target-encoded product type IDs as inputs. This allows the model to handle variability and uncertainty in product length prediction. The techniques make the method adaptable and enable accurate prediction of product length in e-commerce, which can support inventory management across diverse products. The same predictions may also help optimize supply chain operations, improve demand forecasting, and aid strategic planning for procurement and stock levels.
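The two feature transformations named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the abstract does not state the TF-IDF variant or the target-encoding formula, so the weighting `tf * (ln(N / df) + 1)`, the `smoothing` parameter, the function names, and the toy data below are all assumptions made for illustration. The resulting feature rows are the kind of input a regressor such as XGBoost would then be trained on.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using tf * (ln(N / df) + 1).

    The exact weighting is an assumption; libraries differ in smoothing
    and normalization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


def target_encode(categories, targets, smoothing=1.0):
    """Map each category to a smoothed mean of the target.

    encoded = (n * category_mean + smoothing * global_mean) / (n + smoothing)
    The smoothing scheme is an assumed, commonly used variant.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter()
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }


# Toy data, invented for illustration only.
titles = ["steel ruler 30 cm", "steel tape measure", "cotton scarf long"]
product_type_ids = ["tools", "tools", "apparel"]
lengths = [30.0, 500.0, 180.0]

vocab, vectors = tfidf_vectors(titles)
encoding = target_encode(product_type_ids, lengths)

# One feature row per product: TF-IDF values followed by the encoded product type.
features = [vec + [encoding[pt]] for vec, pt in zip(vectors, product_type_ids)]
```

Because target encoding uses the value being predicted, the encoding map should be computed on training data only and then applied to validation and test rows; otherwise the evaluation leaks target information.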

Data Availability

The data used in this study have not been made publicly available. Inquiries regarding the data can be directed to the corresponding author, and the data can be made available on request.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Contributions

The corresponding author, Abhishek Thakur, contributed to conceptualization and methodology. Ankit Kumar drafted the initial manuscript and contributed to algorithm design. Sudhansu Kumar Mishra performed validation and formal analysis. Subhendu Kumar Behera and Jagannath Sethi provided resources. Sitanshu Shekhar Sahu supervised the work.

Corresponding author

Correspondence to Abhishek Thakur.

Ethics declarations

Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest. The authors declare that they have no competing interests.

Research Involving Human and/or Animals

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent

All the authors have given their consent to this research article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Thakur, A., Kumar, A., Mishra, S.K. et al. Product Length Predictions with Machine Learning: An Integrated Approach Using Extreme Gradient Boosting. SN COMPUT. SCI. 5, 659 (2024). https://doi.org/10.1007/s42979-024-02999-8

  • DOI: https://doi.org/10.1007/s42979-024-02999-8
