
The third party logistics provider freight management problem: a framework and deep reinforcement learning approach

  • Original Research
  • Published:
Annals of Operations Research

Abstract

In many large manufacturing companies, freight management is handled by a third-party logistics (3PL) provider, thus allowing manufacturers and their suppliers to focus on the production of goods rather than managing their delivery. Given their pivotal supply chain role, in this work we propose a general framework for what we term “the 3PL freight management problem” (3PLFMP). Our framework identifies three primary activities involved in 3PL freight management: the assignment of orders to a fleet of vehicles, efficient routing of the fleet, and packing the assigned orders in vehicles. Furthermore, we provide a specific instantiation of the 3PLFMP that considers direct vs. consolidated shipping strategies, one-dimensional packing constraints, and a fixed vehicle routing schedule. We solve this instantiated problem using several Reinforcement Learning (RL) methods, including Q-learning, Double Q-learning, SARSA, Deep Q-learning, and Double Deep Q-learning, comparing against two benchmark methods: a simulated annealing heuristic and a variable neighborhood descent algorithm. We evaluate the performance of these methods on two datasets: one is fully simulated and based on past work, while the other, of our own design, is semi-simulated using real-world automobile manufacturer and part supplier locations. We find that the RL methods vastly outperform the benchmark heuristic methods on both datasets, thus establishing the superiority of RL methods in solving this highly complicated and stochastic problem.


Notes

  1. For ease in relating the definitions to a real-world case of a 3PL company, we intentionally use the terms “customers” and “suppliers”. However, customers can more generally be referred to as destinations, while suppliers can more generally be referred to as origins; more general definitions of C and S are therefore the “destination” and “origin” sets. We will, however, continue using the customer/supplier nomenclature throughout the paper. These C and S sets might also contain intermediate nodes such as consolidation centers or warehouses.

  2. Note that such a consideration may also be defined at the order level, where one package may house multiple products.

References

  • Ali, S., Ramos, A. G., Carravilla, M. A., & Oliveira, J. F. (2024). Heuristics for online three-dimensional packing problems and algorithm selection framework for semi-online with full look-ahead. Applied Soft Computing, 151, 111168.

  • Alipour, M. M., Razavi, S. N., Derakhshi, M. R. F., & Balafar, M. A. (2018). A hybrid algorithm using a genetic algorithm and multiagent reinforcement learning heuristic to solve the traveling salesman problem. Neural Computing and Applications, 30(9), 2935–2951.

  • Arviv, K., Stern, H., & Edan, Y. (2016). Collaborative reinforcement learning for a two-robot job transfer flow-shop scheduling problem. International Journal of Production Research, 54(4), 1196–1209.

  • Asghari, M., & Mirzapour Al-e-hashem, S. M. J. (2021). Green vehicle routing problem: A state-of-the-art review. International Journal of Production Economics, 231, 107899.

  • Automotive News. (2020). Top 150 OEM parts suppliers to North America. https://www.autonews.com/assets/PDF/CA27261020.PDF. ([Online; accessed 27-July-2020])

  • Bausch, D. O., Brown, G. G., & Ronen, D. (1995). Consolidating and dispatching truck shipments of mobil heavy petroleum products. Interfaces, 25(2), 1–17.

  • Baykasoglu, A., & Kaplanoglu, V. (2011). A multi-agent approach to load consolidation in transportation. Advances in Engineering Software, 42(7), 477–490.

  • Bayley, T. A., & Bookbinder, J. H. (2015). The dynamic family assignment heuristic. IFAC-PapersOnLine, 48(3), 1161–1166. (15th IFAC Symposium on Information Control Problems in Manufacturing)

  • Beasley, J. (1984). Fixed routes. Journal of the Operational Research Society, 35(1), 49–55.

  • Bellman, R. (1966). Dynamic programming. Science, 153(3731), 34–37.

  • Bertazzi, L., & Speranza, M. G. (2012). Inventory routing problems: an introduction. EURO Journal on Transportation and Logistics, 1, 307–326.

  • Bertsimas, D., & Tsitsiklis, J. (1993). Simulated annealing. Statistical Science, 8(1), 10–15.

  • Borges, Y. G., Schouery, R. C., & Miyazawa, F. K. (2024). Mathematical models and exact algorithms for the colored bin packing problem. Computers & Operations Research, 106527.

  • Bortfeldt, A., & Yi, J. (2020). The split delivery vehicle routing problem with three-dimensional loading constraints. European Journal of Operational Research, 282(2), 545–558.

  • Brown, G. G., & Ronen, D. (1997). Consolidation of customer orders into truckloads at a large manufacturer. Journal of the Operational Research Society, 48, 779–785.

  • Bélisle, C. J. P. (1992). Convergence theorems for a class of simulated annealing algorithms on \(R^{d}\). Journal of Applied Probability, 29(4), 885–895. https://doi.org/10.2307/3214721

  • Çetinkaya, S., Üster, H., Easwaran, G., & Keskin, B. B. (2009). An integrated outbound logistics model for Frito-Lay: Coordinating aggregate-level production and distribution decisions. INFORMS Journal on Applied Analytics, 39(5), 460–475.

  • Çetinkaya, S. (2005). Coordination of inventory and shipment consolidation decisions: A review of premises, models, and justification. J. Geunes, E. Akçali, P.M. Pardalos, H.E. Romeijn, & Z.-J.M. Shen (Eds.), Applications of supply chain management and e-commerce research (pp. 3–51). Boston, MA: Springer US.

  • Christensen, H. I., Khan, A., Pokutta, S., & Tetali, P. (2017). Approximation and online algorithms for multidimensional bin packing: A survey. Computer Science Review, 24, 63–79.

  • Christofides, N. (1971). Fixed routes and areas for delivery operations. International Journal of Physical Distribution, 1(2), 87–92.

  • Coffman, Jr., E. G., Csirik, J., Galambos, G., Martello, S., & Vigo, D. (2013). Bin packing approximation algorithms: survey and classification. Handbook of combinatorial optimization (pp. 455–531). New York, NY: Springer New York.

  • Cortes, J. D., & Suzuki, Y. (2020). Vehicle routing with shipment consolidation. International Journal of Production Economics, 227, 107622.

  • Côté, J.-F., Haouari, M., & Iori, M. (2021). Combinatorial benders decomposition for the two-dimensional bin packing problem. INFORMS Journal on Computing, 33(3), 963–978.

  • Dell’Amico, M., Díaz, J. C. D., & Iori, M. (2012). The bin packing problem with precedence constraints. Operations Research, 60(6), 1491–1504.

  • Du, T., Wang, F., & Lu, P.-Y. (2007). A real-time vehicle-dispatching system for consolidating milk runs. Transportation Research Part E: Logistics and Transportation Review, 43(5), 565–577.

  • Elhedhli, S., Gzara, F., & Yildiz, B. (2019). Three-dimensional bin packing and mixed-case palletization. INFORMS Journal on Optimization, 1(4), 323–352.

  • Granville, V., Krivanek, M., & Rasson, J.-P. (1994). Simulated annealing: a proof of convergence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6), 652–656. https://doi.org/10.1109/34.295910

  • Guo, W., Atasoy, B., & Negenborn, R. (2022). Global synchromodal shipment matching problem with dynamic and stochastic travel times: a reinforcement learning approach. Annals of Operations Research, 1–32.

  • Gupta, V., & Radovanović, A. (2020). Interior-point-based online stochastic bin packing. Operations Research, 68(5), 1474–1492.

  • Gzara, F., Elhedhli, S., Yildiz, U., & Baloch, G. (2020). Data-driven modeling and optimization of the order consolidation problem in e-warehousing. INFORMS Journal on Optimization, 2(4), 273–296.

  • Hansuwa, S., Velayudhan Kumar, M. R., & Chandrasekharan, R. (2022). Analysis of box and ellipsoidal robust optimization, and attention model based reinforcement learning for a robust vehicle routing problem. Sādhanā, 47(2), 72.

  • Haouari, M., & Mhiri, M. (2024). Lower and upper bounding procedures for the bin packing problem with concave loading cost. European Journal of Operational Research, 312(1), 56–69.

  • Hasselt, H. (2010). Double Q-learning. J. Lafferty, C. Williams, J. Shawe- Taylor, R. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems (Vol. 23). Curran Associates, Inc.

  • Haughton, M. A., & Stenger, A. J. (1998). Modeling the customer service performance of fixed-routes delivery systems under stochastic demand. Journal of Business Logistics, 19(1), 155.

  • Hemmelmayr, V., Doerner, K. F., Hartl, R. F., & Savelsbergh, M. W. (2009). Delivery strategies for blood products supplies. OR Spectrum, 31(4), 707–725.

  • Hildebrandt, F. D., Thomas, B. W., & Ulmer, M. W. (2023). Opportunities for reinforcement learning in stochastic dynamic vehicle routing. Computers & operations research, 150, 106071.

  • Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366. https://doi.org/10.1016/0893-6080(89)90020-8

  • Hosseini, S. D., Shirazi, M. A., & Karimi, B. (2014). Cross-docking and milk run logistics in a consolidation network: A hybrid of harmony search and simulated annealing approach. Journal of Manufacturing Systems, 33, 567–577. https://doi.org/10.1016/j.jmsy.2014.05.004

  • Hu, Y., Yao, Y., & Lee, W. S. (2020). A reinforcement learning approach for optimizing multiple traveling salesman problems over graphs. Knowledge-Based Systems, 204, 106244.

  • Iyer, A. V. (2019). Toyota: supply chain management. McGraw Hill.

  • Ji, B., Zhou, S., Zhang, D., & Yu, S. S. (2024). A branch-and-price-based heuristic for the vehicle routing problem with two-dimensional loading constraints and time windows. International Transactions in Operational Research, 31(2), 658–691.

  • Jiang, Y., Cao, Z., & Zhang, J. (2021). Solving 3d bin packing problem via multimodal deep reinforcement learning. Proceedings of the 20th international conference on autonomous agents and multiagent systems (pp. 1548–1550).

  • Kalatzantonakis, P., Sifaleras, A., & Samaras, N. (2023). A reinforcement learning-variable neighborhood search method for the capacitated vehicle routing problem. Expert Systems with Applications, 213, 118812.

  • Karagul, K., Sahin, Y., Aydemir, E., & Oral, A. (2019). A simulated annealing algorithm based solution method for a green vehicle routing problem with fuel consumption. Lean and green supply chain management (pp. 161–187). Springer.

  • Khamaru, K., Pananjady, A., Ruan, F., Wainwright, M. J., & Jordan, M. I. (2021). Is temporal difference learning optimal? an instance-dependent analysis. SIAM Journal on Mathematics of Data Science, 3(4), 1013–1040. https://doi.org/10.1137/20M1331524

  • Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671–680.

  • Kosanoglu, F., Atmis, M., & Turan, H. H. (2022). A deep reinforcement learning assisted simulated annealing algorithm for a maintenance planning problem. Annals of Operations Research, 1–32.

  • Kovacs, A. A., Golden, B. L., Hartl, R. F., & Parragh, S. N. (2014). Vehicle routing problems in which consistency considerations are important: A survey. Networks, 64(3), 192–213.

  • Kullman, N. D., Froger, A., Mendoza, J. E., & Goodson, J. C. (2021). frvcpy: An open-source solver for the fixed route vehicle charging problem. INFORMS Journal on Computing, 33, 1277–1283. https://doi.org/10.1287/ijoc.2020.1035

  • Kumar, A., Schwarz, L. B., & Ward, J. E. (1995). Risk-pooling along a fixed delivery route using a dynamic inventory-allocation policy. Management Science, 41(2), 344–362.

  • Kuo, F. Y., & Sloan, I. H. (2005). Lifting the curse of dimensionality. Notices of the AMS, 52(11), 1320–1328.

  • Laterre, A., Fu, Y., Jabri, M. K., Cohen, A.-S., Kas, D., Hajjar, K., ... & Beguir, K. (2018). Ranked reward: Enabling self-play reinforcement learning for combinatorial optimization. arXiv preprint arXiv:1807.01672, 1-11.

  • Leung, L. C., Van Hui, Y., Wang, Y., & Chen, G. (2009). A 0–1 LP model for the integration and consolidation of air cargo shipments. Operations Research, 57(2), 402–412.

  • Leung, S. C., Zhou, X., Zhang, D., & Zheng, J. (2011). Extended guided tabu search and a new packing algorithm for the two-dimensional loading vehicle routing problem. Computers & Operations Research, 38(1), 205–215.

  • Li, Y., Soleimani, H., & Zohal, M. (2019). An improved ant colony optimization algorithm for the multi-depot green vehicle routing problem with multiple objectives. Journal of Cleaner Production, 227, 1161–1172.

  • Liao, C. S., Lu, S. H., & Shen, Z. J. M. (2016). The electric vehicle touring problem. Transportation Research Part B: Methodological, 86, 163–180. https://doi.org/10.1016/j.trb.2016.02.002

  • Liu, F., & Zeng, G. (2009). Study of genetic algorithm with reinforcement learning to solve the TSP. Expert Systems with Applications, 36(3), 6995–7001.

  • Malmberg, F., & Marklund, J. (2023). Evaluation and control of inventory distribution systems with quantity based shipment consolidation. Naval Research Logistics (NRL), 70(2), 205–227.

  • Mao, C., & Shen, Z. (2018). A reinforcement learning framework for the adaptive routing problem in stochastic time-dependent network. Transportation Research Part C: Emerging Technologies, 93, 179–197.

  • Mazyavkina, N., Sviridov, S., Ivanov, S., & Burnaev, E. (2021). Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134, 105400.

  • Mendoza, J. E., Rousseau, L.-M., & Villegas, J. G. (2016). A hybrid metaheuristic for the vehicle routing problem with stochastic demand and duration constraints. Journal of Heuristics, 22, 539–566.

  • Miki, S., Yamamoto, D., & Ebara, H. (2018). Applying deep learning and reinforcement learning to traveling salesman problem. 2018 international conference on computing, electronics & communications engineering (iccece) (pp. 65–70).

  • Mladenović, N., & Hansen, P. (1997). Variable neighborhood search. Computers & Operations Research, 24(11), 1097–1100.

  • Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. A. (2013). Playing atari with deep reinforcement learning. CoRR, arXiv:1312.5602

  • Molina, F., Morabito, R., & de Araujo, S. A. (2016). MIP models for production lot sizing problems with distribution costs and cargo arrangement. Journal of the Operational Research Society, 67(11), 1395–1407.

  • Mutlu, F., & Çetinkaya, S. (2010). An integrated model for stock replenishment and shipment scheduling under common carrier dispatch costs. Transportation Research Part E: Logistics and Transportation Review, 46(6), 844–854.

  • Nazari, M., Oroojlooy, A., Snyder, L., & Takac, M. (2018). Reinforcement learning for solving the vehicle routing problem. S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 31). Curran Associates, Inc.

  • Nguyen, C., Dessouky, M., & Toriello, A. (2014). Consolidation strategies for the delivery of perishable products. Transportation Research Part E: Logistics and Transportation Review, 69, 108–121.

  • Nottingham, K., Balakrishnan, A., Deshmukh, J., & Wingate, D. (2021). Using logical specifications of objectives in multi-objective reinforcement learning. International conference on machine learning workshop on human-ai collaboration in sequential decision-making. JMRL.

  • Oyola, J., Arntzen, H., & Woodruff, D. L. (2018). The stochastic vehicle routing problem, a literature review, part i: models. EURO Journal on Transportation and Logistics, 7(3), 193–221. https://doi.org/10.1007/s13676-016-0100-5

  • Pan, W., & Liu, S. Q. (2023). Deep reinforcement learning for the dynamic and uncertain vehicle routing problem. Applied Intelligence, 53(1), 405–422.

  • Paradiso, R., Roberti, R., Laganá, D., & Dullaert, W. (2020). An exact solution framework for multitrip vehicle-routing problems with time windows. Operations Research, 68(1), 180–198.

  • Pollaris, H., Braekers, K., Caris, A., Janssens, G. K., & Limbourg, S. (2015). Vehicle routing problems with loading constraints: state-of-the-art and future directions. OR Spectrum, 37(2), 297–330.

  • Pollaris, H., Braekers, K., Caris, A., Janssens, G. K., & Limbourg, S. (2016). Capacitated vehicle routing problem with sequence-based pallet loading and axle weight constraints. EURO Journal on Transportation and Logistics, 5(2), 231–255.

  • Powell, W. B. (2007). Approximate dynamic programming: Solving the curses of dimensionality (Vol. 703). John Wiley & Sons.

  • Powell, W. B., Bouzaiene-Ayari, B., Berger, J., Boukhtouta, A., & George, A. P. (2011). The effect of robust decisions on the cost of uncertainty in military airlift operations. ACM Transactions on Modeling and Computer Simulation (TOMACS), 22(1), 1–19.

  • Praxedes, R., Bulhões, T., Subramanian, A., & Uchoa, E. (2024). A unified exact approach for a broad class of vehicle routing problems with simultaneous pickup and delivery. Computers & Operations Research, 162, 106467.

  • Puche, A. V., & Lee, S. (2022). Online 3d bin packing reinforcement learning solution with buffer. 2022 ieee/rsj international conference on intelligent robots and systems (iros) (pp. 8902–8909).

  • Qin, H., Zhang, Z., Qi, Z., & Lim, A. (2014). The freight consolidation and containerization problem. European Journal of Operational Research, 234(1), 37–48.

  • Rummery, G. A., & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Cambridge University Engineering Department, 37.

  • Santiyuda, G., Wardoyo, R., Pulungan, R., & Vincent, F. Y. (2024). Multiobjective reinforcement learning for bi-objective time-dependent pickup and delivery problem with late penalties. Engineering Applications of Artificial Intelligence, 128, 107381.

  • Satır, B., Erenay, F. S., & Bookbinder, J. H. (2018). Shipment consolidation with two demand classes: Rationing the dispatch capacity. European Journal of Operational Research, 270(1), 171–184.

  • Singh, S., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3), 287–308.

  • Sluijk, N., Florio, A. M., Kinable, J., Dellaert, N., & Van Woensel, T. (2023). A chance-constrained two-echelon vehicle routing problem with stochastic demands. Transportation Science, 57(1), 252–272. https://doi.org/10.1287/trsc.2022.1162

  • Śniezyński, B., Wojcik, W., Gehrke, J. D., & Wojtusiak, J. (2010). Combining rule induction and reinforcement learning: An agent-based vehicle routing. 2010 ninth international conference on machine learning and applications (pp. 851–856).

  • Song, H., Hsu, V. N., & Cheung, R. K. (2008). Distribution coordination between suppliers and customers with a consolidation center. Operations Research, 56(5), 1264–1277.

  • Statista. (2021). Worldwide number of vehicles produced by Toyota from FY 2007 to FY 2021(in 1,000s). Retrieved 2021-24-05, from https://www.statista.com/statistics/267272/worldwide-vehicleproduction-of-toyota/

  • Subramanyam, A., Repoussis, P. P., & Gounaris, C. E. (2020). Robust optimization of a broad class of heterogeneous vehicle routing problems under demand uncertainty. INFORMS Journal on Computing, 32(3), 661–681.

  • Sun, L., Rangarajan, A., Karwan, M. H., & Pinto, J. M. (2015). Transportation cost allocation on a fixed route. Computers & Industrial Engineering, 83, 61–73.

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

  • The World Bank. (2018). Gdp. (https://data.worldbank.org/indicator/NY.GDP.MKTP.CD)

  • Tian, R., Kang, C., Bi, J., Ma, Z., Liu, Y., Yang, S., & Li, F. (2023). Learning to multi-vehicle cooperative bin packing problem via sequence-to-sequence policy network with deep reinforcement learning model. Computers & Industrial Engineering, 177, 108998.

  • Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double q-learning. Proceedings of the aaai conference on artificial intelligence (Vol. 30).

  • van Hasselt, H.P., Guez, A., Guez, A., Hessel, M., Mnih, V., & Silver, D. (2016). Learning values across many orders of magnitude. D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems (Vol. 29). Curran Associates, Inc.

  • van Heeswijk, W. (2022). Strategic bidding in freight transport using deep reinforcement learning. Annals of Operations Research, 1–38.

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. I. Guyon et al. (Eds.), Advances in neural information processing systems (Vol. 30, p. 11). Curran Associates, Inc.

  • Verma, R., Singhal, A., Khadilkar, H., Basumatary, A., Nayak, S., Singh, H. V., ... & Sinha, R. (2020). A generalized reinforcement learning algorithm for online 3D bin-packing. arXiv preprint arXiv:2007.00463, 1-9.

  • Wang, F., Tao, Y., & Shi, N. (2009). A survey on vehicle routing problem with loading constraints. 2009 international joint conference on computational sciences and optimization (Vol. 2, p. 602-606).

  • Waschneck, B., Reichstaller, A., Belzner, L., Altenmüller, T., Bauernhansl, T., Knapp, A., & Kyek, A. (2018). Deep reinforcement learning for semiconductor production scheduling. 2018 29th annual semi advanced semiconductor manufacturing conference (asmc) (pp. 301–306).

  • Watkins, C. J. (1989). Learning from delayed rewards (Unpublished doctoral dissertation). King’s College, Cambridge United Kingdom.

  • Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.

  • Wei, L., Luo, Z., Baldacci, R., & Lim, A. (2020). A new branch-and-price-and-cut algorithm for one-dimensional bin-packing problems. INFORMS Journal on Computing, 32(2), 428–443.

  • Wen, M., Larsen, J., Clausen, J., Cordeau, J.-F., & Laporte, G. (2009). Vehicle routing with cross-docking. Journal of the Operational Research Society, 60(12), 1708–1718.

  • Wikipedia contributors. (2020). List of automotive assembly plants in the United States — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=List_of_automotive_assembly_plants_in_the_United_States&oldid=1026716433. ([Online; accessed 27-July-2020])

  • Yang, S., Song, S., Chu, S., Song, R., Cheng, J., Li, Y., & Zhang, W. (2023). Heuristics integrated deep reinforcement learning for online 3d bin packing. IEEE Transactions on Automation Science and Engineering.

  • Yu, J. J., Yu, W., & Gu, J. (2019). Online vehicle routing with neural combinatorial optimization and deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 20(10), 3806–3817.

  • Zhang, K., He, F., Zhang, Z., Lin, X., & Li, M. (2020). Multi-vehicle routing problems with soft time windows: A multi-agent reinforcement learning approach. Transportation Research Part C: Emerging Technologies, 121(October), 102861.

  • Zhang, K., Lin, X., & Li, M. (2023). Graph attention reinforcement learning with flexible matching policies for multi-depot vehicle routing problems. Physica A: Statistical Mechanics and its Applications, 611, 128451.

  • Zhang, X., Chen, L., Gendreau, M., & Langevin, A. (2022). Learning-based branch-and-price algorithms for the vehicle routing problem with time windows and two-dimensional loading constraints. INFORMS Journal on Computing, 34(3), 1419–1436.

  • Zhang, Y., Sun, L., Hu, X., & Zhao, C. (2019). Order consolidation for the lastmile split delivery in online retailing. Transportation Research Part E: Logistics and Transportation Review, 122, 309–327.

  • Zhang, Z., Che, Y., & Liang, Z. (2024). Split-demand multi-trip vehicle routing problem with simultaneous pickup and delivery in airport baggage transit. European Journal of Operational Research, 312(3), 996–1010.

  • Zhang, Z., Zheng, L., Li, N., Wang, W., Zhong, S., & Hu, K. (2012). Minimizing mean weighted tardiness in unrelated parallel machine scheduling with reinforcement learning. Computers & Operations Research, 39(7), 1315–1324.

  • Zhao, H., She, Q., Zhu, C., Yang, Y., & Xu, K. (2020). Online 3D bin packing with constrained deep reinforcement learning. arXiv preprint arXiv:2006.14978, 1-9.

  • Zhao, H., She, Q., Zhu, C., Yang, Y., & Xu, K. (2021). Online 3d bin packing with constrained deep reinforcement learning. Proceedings of the aaai conference on artificial intelligence (Vol. 35, pp. 741–749).

  • Zhu, W., Chen, S., Dai, M., & Tao, J. (2024). Solving a 3d bin packing problem with stacking constraints. Computers & Industrial Engineering, 188, 109814.

  • Koç, Ç., & Karaoglan, I. (2016). The green vehicle routing problem: A heuristic based exact solution approach. Applied Soft Computing, 39, 154–164.

  • Koç, Ç., Laporte, G., & Tükenmez, İ. (2020). A review of vehicle routing with simultaneous pickup and delivery. Computers & Operations Research, 122, 104987.

  • Ülkü, M. A. (2012). Dare to care: Shipment consolidation reduces not only costs, but also environmental damage. International Journal of Production Economics, 139(2), 438–446.

Funding

This study received no funding.

Author information

Corresponding author

Correspondence to Michael T. Lash.

Ethics declarations

Conflict of interest

Amin Abbasi-Pooya declares he has no conflicts of interest. Michael T. Lash declares he has no conflicts of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Nomenclature

  • \(C\): Set of customers (indexed by i)
  • \(n_c, n_s, n_p, n_v, n_t, n_{ipj}\): Number of customers, suppliers, parts, vehicles, time periods, and flows
  • \(S\): Set of suppliers (indexed by j)
  • \(P\): Set of products (indexed by p)
  • \(V\): Set of vehicles (indexed by v)
  • \(T\): Set of time periods (indexed by t)
  • \(D\): Demand domain (indexed by d)
  • \(o_{ijp}^{(t)}\): Order of product p from supplier j by customer i at time t
  • \(t_{o^{(t)}_{ijp}}\) (or \(t_o\)): Due date of order \(o_{ijp}^{(t)}\)
  • \(O^{(t)}\): Set of unassigned orders at time t
  • \(\Sigma \): Set of consolidation strategies (indexed by \(\sigma \))
  • \(\Omega \): Set of solution strategies (indexed by \(\omega \))
  • \(\Psi \): Set of problem considerations
  • \(u_i\) (\(u'_i\)): Weight (volume) of customer i's order
  • \(U_v\) (\(U'_v\)): Weight (volume) capacity of vehicle v
  • \({\tilde{O}}^{(t)}\): Set of orders assigned to vehicles
  • \((l_p, w_p, h_p)\): Length, width, and height of product p
  • \((L_v, W_v, H_v)\): Length, width, and height of vehicle v's container
  • \({\hat{O}}\): Data structure containing the arrangements of orders in containers
  • \(f(\cdot )\): Function capturing the overall transportation costs
  • \(TC^{(t)}\): Transportation costs at time t
  • \(R\): Reward function
  • \({\mathcal {X}}\): The set of decision variables
  • \({\mathbb {G}}\): The constraint set
  • \(x_{ov}\): Binary variable that takes the value 1 if order o is assigned to vehicle v, and 0 otherwise
  • \(x_{iv}\) (\(x_{jv}\)): Binary variable that takes the value 1 if vehicle v visits customer i (supplier j) on its route, and 0 otherwise
  • \(t_{vi}\) (\(t_{vj}\)): The time at which vehicle v next visits customer i (supplier j)
  • \(V'\): Feasible vehicle/action set (\(V'\subset V\)) created by removing invalid actions/vehicles
  • \(\theta \): RL method parameters
  • \(\gamma \): Bellman equation discount parameter
  • \(\pi \): Penalty function
  • \(\beta \): Barrier function
  • \(s \in {\mathcal {S}}\): State s and state space \({\mathcal {S}}\)
  • \(a \in {\mathcal {A}}\): Action a and action space \({\mathcal {A}}\)
  • \(\vartheta \): RL policy
  • \({\tilde{v}}\): Direct shipment vehicle

Appendix B An expanded state definition

In real-world applications, defining the elements of an MDP is challenging because of infinite state spaces and computational and data efficiency concerns (Sutton & Barto, 2018). The state representation therefore needs to capture the most salient aspects of the interaction between the agent and the environment: it should give the agent the proper domain knowledge and make generalization possible, but must also necessarily be limited in scope (Sutton & Barto, 2018).

In light of this, we experimented with more complicated state representations that include each vehicle's current weight and/or volume, the day of the week, the chronological time step (expressed as both a scalar and a binary subvector), and various combinations of these. Denote the state representation with additional features as \(s'^{(t)}\) and the set of extended features as \(y^{(t)}\). With this notation, we have \(s'^{(t)} = s^{(t)} \oplus y^{(t)} = (ipjd)^{(t)} \oplus y^{(t)}\), where \(\oplus \) is the concatenation operator and \(y^{(t)}\) is an arbitrary set of extended features, e.g., the vehicles' weights (at time t), the vehicles' volumes (at time t), etc.

We experimented with \(y^{(t)} \in Y= \{ \{\overrightarrow{W}^{(t)}\}, \{\overrightarrow{V}^{(t)}\}, \{t\}, \{\overrightarrow{W}^{(t)} \oplus \overrightarrow{V}^{(t)}\}, \{\overrightarrow{W}^{(t)} \oplus \overrightarrow{V}^{(t)} \oplus t \}, \{t_c\}, \{\overrightarrow{W}^{(t)} \oplus \overrightarrow{V}^{(t)} \oplus t_c \} \}\) where \(\overrightarrow{W}\) and \(\overrightarrow{V}\) are normalized vectors of vehicle weights and volumes, respectively, t is the chronological time step, and \(t_c\) is the day of the week. We did not observe performance or convergence improvements by including any \(y^{(t)} \in Y\) and thus elected to keep the state representation simple by excluding them.
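For concreteness, a minimal sketch of how such an extended state vector can be assembled is given below. The feature names, the normalization, and the build_extended_state helper are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import numpy as np

def build_extended_state(base_state, vehicle_weights, vehicle_volumes,
                         time_step, include=("W", "V", "t")):
    """Concatenate the base (i, p, j, d) state with optional extended features y^(t).

    base_state      : 1-D array encoding the current order's (i, p, j, d) tuple
    vehicle_weights : current loaded weights of the vehicles, normalized to [0, 1]
    vehicle_volumes : current loaded volumes of the vehicles, normalized to [0, 1]
    time_step       : chronological time step t
    """
    parts = [np.asarray(base_state, dtype=float)]
    if "W" in include:
        parts.append(np.asarray(vehicle_weights, dtype=float))
    if "V" in include:
        parts.append(np.asarray(vehicle_volumes, dtype=float))
    if "t" in include:
        parts.append(np.array([float(time_step)]))
    return np.concatenate(parts)  # s'^(t) = s^(t) concatenated with y^(t)
```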

Appendix C Discrete demand scenario details

One could alternatively adopt a discrete state space representation by discretizing the demand, since customer i, part p, and supplier j are already discretely represented. Such a representation would permit tabular RL formulations and produce a less complex state space. While we did not experiment with such representations, we provide details on them here should one choose to adopt them within our framework.

There are two discrete demand representations that one may wish to consider (a worked example of the resulting state-space sizes follows the list):

  • Binary demand satisfaction: The demand of each ipj can only have two levels (\(n_d = 2\)), which is interpreted as either not satisfied or satisfied, i.e. \(d\in \{0,1\}\). The maximum state space size would therefore be \(\vert {\mathcal {S}}\vert = 2*n_{ipj} * n_t\) in this scenario.

  • Ordinal demand bins: The demand of each ipj is generated with a countably finite number of levels, i.e., \(1< n_d < \infty \) and such that the \(n_d\) demand levels coincide with binned continuous demand levels, each having a governing lower and upper limit—e.g., \(d^{(1)}_l < d^{(1)}_u \le d^{(2)}_l,\dots ,d^{(n_d)}_u\). This representation allows the incoming demands, and coinciding stochasticity, to be modeled in a discretized fashion. The maximum state space size is \(\vert {\mathcal {S}} \vert = n_d *n_{ipj} * n_t\) in this scenario.
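As a quick worked example of these state-space sizes, the snippet below evaluates \(\vert {\mathcal {S}}\vert \) for both representations; the chosen problem dimensions are illustrative assumptions, not values from our experiments.

```python
def discrete_state_space_size(n_ipj, n_t, n_d):
    """Maximum number of states when demand is discretized into n_d levels."""
    return n_d * n_ipj * n_t

n_ipj, n_t = 53, 30  # illustrative problem dimensions
print(discrete_state_space_size(n_ipj, n_t, n_d=2))   # binary demand satisfaction: 3180 states
print(discrete_state_space_size(n_ipj, n_t, n_d=10))  # ordinal demand bins with n_d = 10: 15900 states
```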

Appendix D CDF1 3PLFMP algorithms

In this section, we provide the specific details of the ASSIGN procedure for each of our adopted methods: Q-learning, Deep Q-learning, Double Q-learning, Double Deep Q-learning, and SARSA, as well as the benchmark Simulated Annealing (SA) heuristic and Variable Neighborhood Descent (VND) algorithm.

1.1 Q-learning algorithm

Algorithm 2 provides the ASSIGN Q-learning pseudocode for the CDF1 3PLFMP setting.

Algorithm 2: ASSIGN Q-learning Algorithm for the CDF1 3PLFMP.
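Because Algorithm 2 is reproduced as an image, a minimal sketch of the tabular Q-learning loop at the core of ASSIGN is given below. The environment interface (reset, step, get_feasible_vehicles), the state encoding, and the treatment of \(\epsilon \) as the exploration probability are simplifying assumptions, not the exact implementation.

```python
import random
from collections import defaultdict

def q_learning_assign(env, n_episodes, alpha=0.001, gamma=0.8, epsilon=0.9):
    """Tabular Q-learning sketch for assigning incoming orders to vehicles."""
    Q = defaultdict(float)                                    # Q[(state, vehicle)] -> value
    for _ in range(n_episodes):
        state, done = env.reset(), False                      # initial (i, p, j, d) state
        while not done:
            feasible = env.get_feasible_vehicles(state)       # action set V' (invalid vehicles removed)
            if random.random() < epsilon:                     # epsilon-greedy selection
                action = random.choice(feasible)
            else:
                action = max(feasible, key=lambda v: Q[(state, v)])
            next_state, reward, done = env.step(action)       # assign the order, observe the reward
            nxt = [] if done else env.get_feasible_vehicles(next_state)
            target = reward + gamma * max((Q[(next_state, v)] for v in nxt), default=0.0)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```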

1.2 Deep Q-learning algorithm

Algorithm 3 provides the pseudocode for ASSIGN using Deep Q-learning with the CDF1 3PLFMP setting.

Algorithm 3: ASSIGN Deep Q-learning Algorithm for the CDF1 3PLFMP.
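Algorithm 3 likewise appears as an image; the sketch below shows the minibatch target computation and parameter update at its core, using the fixed target network with hard updates and the SGD optimizer described in Appendix E. The network architecture, replay-buffer fields, and feasibility mask are assumptions made for illustration (PyTorch is used here purely as an example framework).

```python
import torch
import torch.nn as nn

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.8):
    """One Deep Q-learning update on a minibatch of (s, a, r, s', done, mask) samples.

    batch["mask"] flags the feasible vehicles V' in each next state so that
    infeasible assignments never enter the bootstrapped target.
    """
    s, a, r = batch["state"], batch["action"], batch["reward"]
    s_next, done, mask = batch["next_state"], batch["done"], batch["mask"]

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)          # Q(s, a) for the taken actions
    with torch.no_grad():
        q_next = target_net(s_next).masked_fill(mask == 0, -1e9)  # exclude infeasible vehicles
        max_next = q_next.max(dim=1).values
        max_next = torch.where(done.bool(), torch.zeros_like(max_next), max_next)
        target = r + gamma * max_next                             # bootstrapped target

    loss = nn.functional.smooth_l1_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                              # SGD step
    return loss.item()

# Hard update of the fixed target network (e.g., every 50 iterations):
# target_net.load_state_dict(q_net.state_dict())
```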

1.3 Double Q-learning algorithm

The full ASSIGN pseudocode of the CDF1 3PLFMP Double Q-learning (DQL) method is shown in Algorithm 4.

Algorithm 4: ASSIGN Double Q-learning Algorithm for the CDF1 3PLFMP.
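As before, Algorithm 4 is an image; the defining element of tabular Double Q-learning is the pair of value tables, one of which selects the next action while the other evaluates it, as in the hedged sketch below (environment interface and table layout assumed as in the Q-learning sketch).

```python
import random

def double_q_update(QA, QB, state, action, reward, next_state, next_feasible,
                    alpha=0.001, gamma=0.8):
    """One Double Q-learning update: action selection and evaluation are decoupled
    across two tables, which reduces the overestimation bias of standard Q-learning."""
    if random.random() < 0.5:
        QA, QB = QB, QA                                               # update each table half of the time
    if next_feasible:
        best = max(next_feasible, key=lambda v: QA[(next_state, v)])  # select argmax with QA
        target = reward + gamma * QB[(next_state, best)]              # evaluate it with QB
    else:
        target = reward
    QA[(state, action)] += alpha * (target - QA[(state, action)])
```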

1.4 Double deep Q-learning algorithm for the CDF1 3PLFMP

Algorithm 5 shows the ASSIGN Double Deep Q-learning pseudocode for the CDF1 3PLFMP setting.

Algorithm 5: ASSIGN Double Deep Q-learning Algorithm for the CDF1 3PLFMP.

1.5 SARSA algorithm

The full pseudocode of SARSA adapted to the CDF1 3PLFMP setting is shown in Algorithm 6.

Algorithm 6: ASSIGN SARSA Algorithm for the CDF1 3PLFMP.
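Algorithm 6 is also reproduced as an image; the only substantive difference from the Q-learning sketch above is the on-policy target, which bootstraps on the action actually chosen in the next state. A minimal sketch of that update follows (table layout assumed as before).

```python
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.001, gamma=0.8, done=False):
    """One on-policy SARSA update: bootstrap on the epsilon-greedy action a' actually
    selected in s', rather than on max_a' Q(s', a') as in Q-learning."""
    target = reward if done else reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```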

1.6 Benchmark simulated annealing and variable neighborhood descent heuristic details and algorithms

Heuristic methods are capable of finding reasonable solutions to large, stochastic optimization problems and have been extensively employed to handle NP-hard problems such as the TSP and BPP. Indeed, as shown in our literature review in Section 2, they are exclusively employed to handle pairings (i.e., two) of the three overarching problems that constitute the 3PLFMP. Such methods are therefore suitable benchmarks against which to compare our proposed RL methods, and they also allow us to assess how well heuristics perform on this specific 3PLFMP.

A popular and well-known heuristic method is Simulated Annealing (SA), introduced by Kirkpatrick et al. (1983). The method is directly motivated by the annealing process in metallurgy, in which a substance is initially heated to a high temperature that is then gradually decreased; this allows the molecules in the metal to reach a crystalline (optimal) form, taking the system from a starting state with a high level of internal energy to a state with minimal energy. The SA heuristic applies this idea to optimization problems by moving from one solution state to another through an arbitrary perturbation algorithm, defined based on the problem at hand, with the goal of reaching an optimal objective function value. To reflect this process, SA has a temperature parameter that is initially set to a high value and decreased over the course of algorithmic execution using a “cooling schedule”. The temperature affects the probability of accepting non-improving solutions and thus controls the exploration-exploitation trade-off; it is initially high to encourage exploration early on. Like the RL methods, the SA algorithm has been shown to converge to optimality under certain conditions (Bertsimas & Tsitsiklis, 1993).

The pseudocode of the SA algorithm for the CDF1 3PLFMP is shown in Algorithm 7. First, the parameters are initialized and the temperature is set to an initial value (line 2). Subsequently, the algorithm iterates over orders (line 3). To obtain an initial solution (state of the system), orders are assigned to vehicles randomly (line 6). Then a swap heuristic (the perturbation mechanism; lines 10 and 11) is used to create a neighborhood of the current solution at each inner iteration. If the new solution does not improve the reward, it may be set as the incumbent solution probabilistically relative to the current temperature (calculated in line 16). In the outer loop, the temperature is reduced using the cooling schedule (line 21; \( T = T_0 \zeta ^ {it_{outer}}, (0< \zeta < 1)\)). This procedure repeats until all orders are assigned.

Algorithm 7: Simulated Annealing for the 3PLFMP.
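A compact sketch of the acceptance rule and geometric cooling schedule described above follows. The swap-based perturbation and the reward evaluation are problem-specific and only stubbed out here; the parameter defaults are illustrative assumptions.

```python
import math
import random

def simulated_annealing_assign(initial_assignment, reward_fn, swap_fn,
                               T0=1000.0, zeta=0.9, n_outer=50, n_inner=20):
    """SA sketch: swap-based neighborhood with geometric cooling T = T0 * zeta^k."""
    current = best = initial_assignment
    T = T0
    for k in range(n_outer):
        for _ in range(n_inner):
            candidate = swap_fn(current)                     # perturb: swap two order-vehicle assignments
            delta = reward_fn(candidate) - reward_fn(current)
            if delta >= 0 or random.random() < math.exp(delta / T):
                current = candidate                          # accept non-improving moves w.p. exp(delta / T)
            if reward_fn(current) > reward_fn(best):
                best = current
        T = T0 * zeta ** (k + 1)                             # cooling schedule
    return best
```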

Another popular optimization algorithm is Variable Neighborhood Descent (VND), first proposed by Mladenović and Hansen (1997). This method works by iteratively exploring different neighborhoods around the current solution, where each neighborhood corresponds to a specific perturbation of an input solution. VND's flexibility in exploring diverse solution spaces and its iterative refinement process make it a powerful method for solving optimization problems. The pseudocode of the VND algorithm for the CDF1 3PLFMP is shown in Algorithm 8 (note that the method is technically variable neighborhood ascent since we are maximizing the reward; this is equivalent to minimizing the negative reward, so we opt to maintain the original “descent” nomenclature). The algorithm begins by initializing the parameters and neighborhood structures (lines 1–2). Subsequently, the algorithm iterates over orders (line 3). To obtain an initial solution, orders are assigned to vehicles randomly (line 6). Swap and shift heuristics (the neighborhood structure functions in our problem; line 11) are then used to create alternative neighborhood solutions at each inner iteration. If there is an improvement in the reward (lines 15–16), the solution with the largest improvement (line 17) is set as the incumbent solution (line 18). This procedure repeats until all orders are assigned.

Algorithm 8: Variable Neighborhood Descent for the 3PLFMP.
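A minimal sketch of the VND loop (here an ascent on the reward) over an ordered set of neighborhood structures is given below; the swap and shift neighborhood generators are assumed and only stubbed.

```python
def variable_neighborhood_descent(initial_assignment, reward_fn, neighborhoods):
    """VND sketch: cycle through neighborhood structures (e.g., swap, shift),
    restarting from the first neighborhood whenever an improving solution is found."""
    current = initial_assignment
    k = 0
    while k < len(neighborhoods):
        candidates = list(neighborhoods[k](current))        # all neighbors in the k-th structure
        best_candidate = max(candidates, key=reward_fn) if candidates else None
        if best_candidate is not None and reward_fn(best_candidate) > reward_fn(current):
            current = best_candidate                        # move to the best improving neighbor...
            k = 0                                           # ...and restart from the first neighborhood
        else:
            k += 1                                          # otherwise try the next structure
    return current
```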

Appendix E Experiment details

In this section we provide some additional experiment details. In particular, we provide visual illustrations of the customers, suppliers, and depots of both datasets, and describe in greater detail how the flows used in our experiments are generated, how orders are generated, and how vehicle routing schedules, weight capacities, and volume capacities are determined. We also provide additional details on the parameterization of the 3PLFMP framework, the MDP formulation, and the RL methods.

1.1 Dataset visualizations

A map showing the geolocated customers, suppliers, and depot of the fully-simulated dataset is shown in Figure 9.

Fig. 9: The locations of the customers, suppliers, and depot of the fully-simulated data

A map showing the geolocated customers, suppliers, and depot of the semi-simulated dataset is shown in Figure 10.

Fig. 10: The locations of the customers, suppliers, and depot of the semi-simulated data

1.2 Flow generation

The ipj flows of an instantiated problem are randomly determined. A user first specifies the number of customers (\(n_c^{\prime } \le n_c\)), parts (\(n_p^{\prime } \le n_p\)), and suppliers (\(n_s^{\prime } \le n_s\)) that should be in a given problem’s logistics network. To create the flows, the following procedure is carried out:

  1. Create set \(J^{'}\): for each supplier \(j=1,\dots ,n_s\), select a random customer \(i^{\prime } \sim {\mathcal {U}}\left( \{1,\dots ,n_c\}\right) \) and a random part \(p^{\prime } \sim {\mathcal {U}}\left( \{1,\dots ,n_p\}\right) \); \(J^{'} \leftarrow J^{'} \cup \{i^{\prime }p^{\prime }j\}\). Set \(J^{'}\) includes all \(n_s\) suppliers paired with random customers and products.

  2. Create set \(P^{'}\): for each part \(p=1,\dots ,n_p\), select a random customer \(i^{\prime } \sim {\mathcal {U}}\left( \{1,\dots ,n_c\}\right) \) and a random supplier \(j^{\prime } \sim {\mathcal {U}}\left( \{1,\dots ,n_s\}\right) \); \(P^{'} \leftarrow P^{'} \cup \{i^{\prime }pj^{\prime }\}\). Set \(P^{'}\) includes all \(n_p\) parts paired with random customers and suppliers.

  3. Create set \(I^{'}\): for each customer \(i=1,\dots ,n_c\), select a random part \(p^{\prime } \sim {\mathcal {U}}\left( \{1,\dots ,n_p\}\right) \) and a random supplier \(j^{\prime } \sim {\mathcal {U}}\left( \{1,\dots ,n_s\}\right) \); \(I^{'} \leftarrow I^{'} \cup \{ip^{\prime }j^{\prime }\}\). Set \(I^{'}\) includes all \(n_c\) customers paired with random parts and suppliers.

  4. \(flows \leftarrow J^{'} \cup P^{'} \cup I^{'}\). The number of elements in this set is \(n_{ipj} = \vert flows \vert \), representing the total number of flows used in the problem instance.

Depending on whether the flows are being generated from the semi- or fully-simulated data, the customer and supplier coordinates are selected from the dataset, starting from the first customer and supplier. This ensures that when the problem size increases, the customers and suppliers in previous (i.e., smaller) instances are also in the larger-sized instance.
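The four steps above can be summarized in a few lines of code; the sketch below follows them directly, with rng draws standing in for \({\mathcal {U}}(\cdot )\) (zero-based indices, seeding, and the helper name are our own assumptions).

```python
import random

def generate_flows(n_c, n_p, n_s, seed=None):
    """Generate the set of ipj flows as the union of J', P', and I' (steps 1-4 above)."""
    rng = random.Random(seed)
    J = {(rng.randrange(n_c), rng.randrange(n_p), j) for j in range(n_s)}  # every supplier covered
    P = {(rng.randrange(n_c), p, rng.randrange(n_s)) for p in range(n_p)}  # every part covered
    I = {(i, rng.randrange(n_p), rng.randrange(n_s)) for i in range(n_c)}  # every customer covered
    flows = J | P | I
    return flows, len(flows)                                               # n_ipj = |flows|
```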

Table 7 provides the number of customers \(n_c^{'}\), suppliers \(n_s^{'}\), and parts \(n_p^{'}\) that produce the total number of flows \(n_{ipj}\) for each problem instance.

1.3 Order generation

To generate the orders for each episode, we assign each \(ipj \in flows\), corresponding to an eligible order \(o_{ipj}\), a random arrival time \(t_o^{'} \sim {\mathcal {U}}(\{1,\dots ,\left\lfloor 0.5 n_t \right\rfloor \})\). Based on this arrival time we then randomly generate an order due date \(t_o \sim {\mathcal {U}}(\{t_o^{'}+1,\dots ,t_o^{'}+\left\lfloor 0.4 n_t \right\rfloor \})\).

We also randomly generate a demand for each ipj at each episode. This demand represents the amount of part p from supplier j that is in demand by customer i at the time \(t_o^{'}\) the ipj order \(o_{ipj}\) is placed. Demand \(d_{ipj}\) is drawn randomly from a normal distribution as follows:

$$\begin{aligned} d_{ipj} \sim {\mathcal {N}}\left( \tau _{ipj},\rho _{ipj}\right) , \end{aligned}$$
(E1)

where \(\tau _{ipj}\) and \(\rho _{ipj}\) are determined by \(\tau _{ipj} \sim {\mathcal {U}}(\{600,\dots ,4000\})\) and \(\rho _{ipj} = \frac{\tau _{ipj}}{6}\). This procedure produces a fair amount of stochasticity in the demand-generating process, since the mean of the demand-producing distribution is drawn from a large and random range of values, and the standard deviation, which depends on the selected mean, is itself large and varying.
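The order-generation logic above can be sketched as follows. The floor factors (0.5 and 0.4) and the demand distribution mirror the description; the helper name, the dictionary layout, and the truncation of negative demand draws at zero are our own assumptions.

```python
import math
import random

def generate_orders(flows, n_t, seed=None):
    """Draw an arrival time, due date, and normally distributed demand for each ipj flow."""
    rng = random.Random(seed)
    orders = []
    for (i, p, j) in flows:
        arrival = rng.randint(1, math.floor(0.5 * n_t))                  # t_o'
        due = rng.randint(arrival + 1, arrival + math.floor(0.4 * n_t))  # t_o
        tau = rng.randint(600, 4000)                                     # mean demand tau_ipj
        rho = tau / 6.0                                                  # std. dev. rho_ipj
        demand = max(0.0, rng.gauss(tau, rho))                           # d_ipj ~ N(tau, rho), truncated at 0
        orders.append({"ipj": (i, p, j), "arrival": arrival, "due": due, "demand": demand})
    return orders
```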

1.4 Vehicle parameterization

For each vehicle \(v \in V\) we need to generate (simulate) weight capacity, volume capacity, days available (in a week), fixed routing schedule, starting location, transportation cost, and transportation time. These properties, however, must be generated such that the final problem instance is feasible. For example, if we set very low capacities the vehicles may not be able to support the generated ipj demands.

Therefore, we set the vehicle volume and weight capacities by summing over the mean ipj demands, denoted \(\tau \), plus one standard deviation, denoted \(\rho \), and dividing by the number of vehicles. Furthermore, we multiply by an expansion factor \(\eta \) (\(\eta \ge 1\)) to alter capacity constraint tightness. Thus, assuming each product has one unit of weight (volume), the weight (volume) capacity for each vehicle is \(U_v (U^{\prime }_v) = \frac{\eta \sum _{ipj \in flows}{(\tau _{ipj} + \rho _{ipj})}}{n_v}\). The weight and volume capacities of the direct shipping vehicle are set to \(\infty \).

The fixed routing schedule for each vehicle is set by selecting a number of suppliers \(n_s^{(v)} \sim {\mathcal {U}}\left( \{2,\dots ,\frac{n_s^{\prime }}{2}\}\right) \) and then randomly selecting this number of suppliers for inclusion on v’s route – i.e., \(\{Route_v \leftarrow Route_v \cup \{j\}: j \sim {\mathcal {U}}\left( \{1,\dots ,n_s^{\prime }\}_{\setminus Route_v}\right) ,1,\dots ,n_s^{(v)}\}\). All customers i associated with any supplier j through an \(ipj \in flows\) are then added to the vehicle’s route to ensure there are no spurious supplier-only visits (i.e., a vehicle would not visit a supplier without there being a corresponding customer to deliver to). Appending the customers to the end of the route also ensures that suppliers are visited before their customers. The direct shipment vehicle is assumed to visit all customers and suppliers. The starting location of all vehicles is initialized to a depot, located at the centroid of the geocoded customers and suppliers for the semi-simulated dataset; the depot is already given in the fully-simulated dataset.
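The capacity and fixed-route construction can be sketched as below: capacity follows \(U_v = \eta \sum _{ipj}(\tau _{ipj}+\rho _{ipj})/n_v\), and each route draws \(n_s^{(v)}\) suppliers before appending their associated customers. The helper names and the ordering of the appended customers are illustrative assumptions.

```python
import random

def vehicle_capacity(order_means, order_stds, n_v, eta=1.0):
    """Per-vehicle weight/volume capacity: eta * sum(tau + rho) / n_v."""
    return eta * sum(m + s for m, s in zip(order_means, order_stds)) / n_v

def fixed_route(flows, n_s_prime, rng):
    """Randomly pick 2..n_s'/2 suppliers, then append every customer tied to them via a flow."""
    k = rng.randint(2, max(2, n_s_prime // 2))
    suppliers = rng.sample(range(n_s_prime), k)
    customers = {i for (i, p, j) in flows if j in set(suppliers)}
    return suppliers + sorted(customers)       # suppliers are visited before their customers
```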

The transportation distance for each leg of a vehicle's tour is calculated using the Euclidean metric. By setting an average speed for each vehicle, the transportation time can also be calculated. It is worth noting that when orders are delivered using direct shipment, the transportation time is shorter, but more vehicles are used, so the total transportation distance (and therefore cost) is higher.

1.5 Parameterization

In this subsection, we provide the details on how our 3PLFMP framework is parameterized, as well as the parameterization of our adopted RL methods.

1.5.1 Framework and MDP parameterizations

We specify a variety of framework- and MDP-specific parameters in this subsection. These specifications encompass both the broader 3PLFMP framework, described in Sect. 3, and the specific implementation we adopt for the CDF1 3PLFMP.

First, we note that we incorporate the constraints \({\mathbb {G}}({S}^{(t)})\) into our reward function through both a barrier term and a penalty term, and also incorporate these into \(\Psi \), although we do not explicitly enforce them. The \(\Psi \) constraints are, however, used to inform vehicle eligibility when assigning orders. We set the exponent of the penalty term (Eq. (13)) to \(\xi =10\). The following are the specific framework component parameterizations:

  • \(\mathbf {ASSIGN_\sigma ^\omega (\cdot )}\): As mentioned, our specific implementation of the 3PLFMP explores the use of an order consolidation strategy, to which \(\sigma \) is set. The solution strategy \(\omega \) is one of the methods compared in this work, i.e., \(\omega \in \{\text {Q-learning}, \text {Deep Q-learning}, \text {Double Q-learning}, \text {Double Deep Q-learning}, \text {SARSA}, \text {SA}, \text {VND}\}\).

  • \(\mathbf {ROUTE_\sigma ^\omega (\cdot )}\): We set \(\sigma =fixed\), \(\omega = \emptyset \), because the vehicle routes are fixed. The procedure therefore returns the fixed routing schedule. This pre-determined routing schedule is one of the elements included in \(\Psi \).

  • \(\mathbf {PACK_\sigma ^\omega (\cdot )}\): We set \(\sigma =1D\), \(\omega = \emptyset \) since we only consider one-dimensional weight and volume constraints. That is, an order is considered packed at the time of assignment so long as the addition of the order does not cause the vehicle's weight and volume capacities to be exceeded. These weight and volume constraints are included in \(\Psi \).

1.5.2 Method parameterization

There are a few parameters associated with each of our methods that must be specified. Therefore, we perform a variety of parameter tuning experiments and report the results in Table 8, below. Table 8 shows the average of average rewards over problem instances 01–07 (see Table 7) for different parameter combinations for each RL method on the semi-simulated data. We observe that parameter setting 0 gives the best reward for all RL algorithms (we thus adopt this parameterization for the fully-simulated data as well). Accordingly, we parameterize the RL algorithms with a future reward discount factor of \(\gamma = 0.8\), a learning rate of \(\alpha = 0.001\), a batch size of 512, and \(\epsilon =0.90\) for \(\epsilon \)-greedy selection.

Table 7 The \(n^{\prime }_c\), \(n^{\prime }_p\), and \(n^{\prime }_s\) values used to generate the \(n_{ipj}\) flows of each problem instance
Table 8 Parameter tuning of RL methods on the semi-simulated data

Further, note that to improve the stability of the RL methods, we employ reward clipping (see, e.g., van Hasselt et al., 2016; Sutton & Barto, 2018). We clip the rewards to \([-8000,0]\) and then normalize the clipped value to the range \([-1,0]\) when computing the loss and updating the parameters. In addition to the above-mentioned parameters for all RL algorithms, when training the two deep RL algorithms we use stochastic gradient descent, due to its efficiency, and a hard update scheme with a fixed target network that is updated every 50 iterations.
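The clipping and normalization step amounts to a two-line transform applied to each reward before the loss is computed; the sketch below states it explicitly with the bounds reported above (the helper name is ours).

```python
def clip_and_normalize_reward(r, low=-8000.0, high=0.0):
    """Clip the raw reward to [low, high], then rescale it to [-1, 0] for the loss/update."""
    clipped = max(low, min(high, r))
    return (clipped - high) / (high - low)   # maps low -> -1 and high -> 0
```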

Fig. 11: 3PLFMP subproblem benchmark results on the semi-simulated dataset (\(n_{ipj} = 53\)). Moving average (window = 50) of the Eq. 12 reward. (a) Baseline, (b) No VRP, (c) No BPP, and (d) No VRP + BPP

Appendix F 3PLFMP subproblem benchmark experiments

In this section we conduct additional experiments that showcase the importance of each component of the 3PLFMP, thus highlighting the inter-related nature of the three combinatorial subproblems we identify. Doing so creates a benchmark between what we propose and past works that consider only a single subproblem or pairings of these subproblems (we review these works in Section 2.2).

To conduct these benchmark experiments we remove certain elements from Eq. 12 to simulate the removal of a certain combinatorial subproblem component of the 3PLFMP. We consider the following cases:

  • Baseline: Original 3PLFMP (no alteration to Eq. 12).

  • No VRP: Remove the routing penalties.

  • No BPP: Remove the packing penalties.

  • No VRP + BPP: Removing both the routing and packing penalties.

For each of the modified reward functions above we then train the various RL methods and plot the results using the original reward function to assess whether the method can still account for the intricacies of the full 3PLFMP problem setting. We opt to use a problem size of \(n_{ipj}=53\) and the semi-simulated dataset.

The results of these benchmark experiments are provided in Fig. 11.

We observe that removing any of the subproblem components (VRP, BPP, VRP+BPP) prevents all RL methods from appropriately accounting for the removed component: the averaged rewards remain flat, and the methods are unable to improve their solution quality. This result showcases the importance of considering all three subproblem elements that make up the 3PLFMP framework.

Appendix G Robustness to demand distribution

Prior research has used various continuous and discrete distributions to model demand uncertainty (see, e.g., the survey by Oyola et al., 2018). As described in Appendix E, our reported results are generated using a normal distribution, which is widely used in the stochastic VRP literature. To test the robustness of our framework, in this section we instead draw demand from a Poisson distribution, a common discrete choice in the VRP literature (e.g., Mendoza et al., 2016; Sluijk et al., 2023), and assess the effect of this change on our results. Note that we set the mean values of the Poisson distribution to match the mean values used for the normal distribution in Appendix E. Table 9 summarizes the percentage of instances in which each algorithm wins, and Table 10 shows the results on the semi- and fully-simulated test instances along with those of the two comparator algorithms. Looking at both tables, we observe that on all of the larger instances the RL algorithms outperform the SA and VND algorithms. The VND algorithm provides the best solution in only one small instance (instance #2) and is dominated by the RL algorithms on all others, while the SA algorithm never outperforms the other algorithms. Thus, the results obtained using a Poisson distribution are comparable to those obtained using a normal distribution.

Table 9 Percentage of semi- and fully-simulated problem instances that each algorithm performs the best based on average rewards (demand is Poisson-distributed)
Table 10 Semi-simulated (first 12 rows) and fully-simulated (second 12 rows) test results for Poisson demand

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Abbasi-Pooya, A., Lash, M.T. The third party logistics provider freight management problem: a framework and deep reinforcement learning approach. Ann Oper Res 339, 965–1024 (2024). https://doi.org/10.1007/s10479-024-05876-y

  • DOI: https://doi.org/10.1007/s10479-024-05876-y
