Abstract
This study introduces a novel machine learning approach for predicting product lengths from diverse data types, including numeric, textual, and categorical data, and for extracting information from the dataset that improves prediction accuracy. The approach combines text vectorization using Term Frequency-Inverse Document Frequency (TF-IDF), target encoding of categorical features, and the eXtreme Gradient Boosting (XGBoost) algorithm. The method begins with data preparation, in which outliers are removed and missing values are filled in. Text from the product titles, descriptions, and bullet points in the dataset is then converted into TF-IDF feature vectors, which capture the weighted frequency of words and give the algorithm a numerical input. During training, RandomizedSearchCV optimizes the XGBoost model’s hyperparameters, with the TF-IDF vectors and target-encoded product type IDs as inputs. This allows the model to handle variability and uncertainty in product length predictions. Together, these techniques make the method adaptable and enable accurate prediction of product length in e-commerce, which can support inventory management across diverse products. The predictions can also help optimize supply chain operations, improve demand forecasting across a variety of products, and aid strategic planning for procurement and stock levels.
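The two feature-construction steps named in the abstract, TF-IDF weighting of text and target encoding of product type IDs, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the `smoothing` parameter, and the toy titles, type IDs, and lengths are all assumptions chosen for illustration, and the XGBoost model and RandomizedSearchCV tuning step are omitted.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (assumed smoothed TF-IDF)."""
    # Assumption: lowercase whitespace tokenization, no stop-word removal.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        vec = {}
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def target_encode(categories, targets, smoothing=1.0):
    """Map each category to a mean target shrunk toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = Counter()
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    # `smoothing` (hypothetical parameter) acts like a number of pseudo-rows
    # carrying the global mean, which stabilizes rare categories.
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }


# Hypothetical toy data: titles, product type IDs, and lengths.
titles = ["steel water bottle", "cotton bed sheet", "steel kitchen knife"]
product_type_ids = ["A", "B", "A"]
lengths = [10.0, 90.0, 12.0]

title_vectors = tfidf_vectors(titles)
type_encoding = target_encode(product_type_ids, lengths)
print(title_vectors[0])
print(type_encoding)
```

In this sketch, a word that appears in fewer titles receives a larger weight than one shared across titles, and each product type ID is replaced by a single number derived from the lengths observed for that type. In practice, the encoding would be fitted on training data only, so that target values from held-out rows do not leak into the features.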
Data Availability
The data used in this study have not been made publicly available. Inquiries regarding the data can be directed to the corresponding author, who can make the data available on request.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Contributions
The corresponding author, Abhishek Thakur, contributed to conceptualization and methodology. Ankit Kumar drafted the initial manuscript and contributed to algorithm design. Sudhansu Kumar Mishra performed validation and formal analysis. Subhendu Kumar Behera and Jagannath Sethi provided resources. Sitanshu Shekhar Sahu supervised the work.
Ethics declarations
Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest. The authors declare that they have no competing interests.
Research Involving Human and/or Animals
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed Consent
All the authors have given their consent to this research article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Thakur, A., Kumar, A., Mishra, S.K. et al. Product Length Predictions with Machine Learning: An Integrated Approach Using Extreme Gradient Boosting. SN COMPUT. SCI. 5, 659 (2024). https://doi.org/10.1007/s42979-024-02999-8
DOI: https://doi.org/10.1007/s42979-024-02999-8