{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2024,10,30]],"date-time":"2024-10-30T22:13:02Z","timestamp":1730326382497,"version":"3.28.0"},"publisher-location":"New York, NY, USA","reference-count":38,"publisher":"ACM","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,5,8]]},"DOI":"10.1145\/3578356.3592578","type":"proceedings-article","created":{"date-parts":[[2023,5,4]],"date-time":"2023-05-04T19:44:37Z","timestamp":1683229477000},"page":"78-86","update-policy":"http:\/\/dx.doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":7,"title":["Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems"],"prefix":"10.1145","author":[{"ORCID":"http:\/\/orcid.org\/0009-0008-6362-0967","authenticated-orcid":false,"given":"Mehran","family":"Salmani","sequence":"first","affiliation":[{"name":"Iran University of Science and Technology, Tehran, Iran"}]},{"ORCID":"http:\/\/orcid.org\/0000-0003-3799-5702","authenticated-orcid":false,"given":"Saeid","family":"Ghafouri","sequence":"additional","affiliation":[{"name":"Queen Mary University of London, London, United Kingdom"}]},{"ORCID":"http:\/\/orcid.org\/0000-0001-6461-1650","authenticated-orcid":false,"given":"Alireza","family":"Sanaee","sequence":"additional","affiliation":[{"name":"Queen Mary University of London, London, United Kingdom"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-3232-5657","authenticated-orcid":false,"given":"Kamran","family":"Razavi","sequence":"additional","affiliation":[{"name":"Technical University of Darmstadt, Darmstadt, Germany"}]},{"ORCID":"http:\/\/orcid.org\/0000-0003-4713-5327","authenticated-orcid":false,"given":"Max","family":"M\u00fchlh\u00e4user","sequence":"additional","affiliation":[{"name":"Technical University of Darmstadt, Darmstadt, 
Germany"}]},{"ORCID":"http:\/\/orcid.org\/0000-0003-1840-9616","authenticated-orcid":false,"given":"Joseph","family":"Doyle","sequence":"additional","affiliation":[{"name":"Queen Mary University of London, London, United Kingdom"}]},{"ORCID":"http:\/\/orcid.org\/0000-0002-9342-0703","authenticated-orcid":false,"given":"Pooyan","family":"Jamshidi","sequence":"additional","affiliation":[{"name":"University of South Carolina, Columbia, USA"}]},{"ORCID":"http:\/\/orcid.org\/0000-0003-4992-2500","authenticated-orcid":false,"given":"Mohsen","family":"Sharifi","sequence":"additional","affiliation":[{"name":"Iran University of Science and Technology, Tehran, Iran"}]}],"member":"320","published-online":{"date-parts":[[2023,5,8]]},"reference":[{"volume-title":"GPU virtualization in K8S: challenges and state of the art. https:\/\/www.arrikto.com\/blog\/gpu-virtualization-in-k8s-challenges-and-state-of-the-art\/. (Nov","year":"2022","key":"e_1_3_2_1_1_1","unstructured":"2022. GPU virtualization in K8S: challenges and state of the art. https:\/\/www.arrikto.com\/blog\/gpu-virtualization-in-k8s-challenges-and-state-of-the-art\/. (Nov 2022)."},{"volume-title":"Horizontal Pod autoscaling. (Jun","year":"2022","key":"e_1_3_2_1_2_1","unstructured":"2022. Horizontal Pod autoscaling. (Jun 2022). https:\/\/kubernetes.io\/docs\/tasks\/run-application\/horizontal-pod-autoscale\/"},{"key":"e_1_3_2_1_3_1","unstructured":"2022. Inter-op parallelism threads. (2022). https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/config\/threading\/set_inter_op_parallelism_threads"},{"key":"e_1_3_2_1_4_1","unstructured":"2022. Intra-op parallelism threads. (2022). https:\/\/www.tensorflow.org\/api_docs\/python\/tf\/config\/threading\/set_intra_op_parallelism_threads"},{"key":"e_1_3_2_1_5_1","unstructured":"2022. TorchServe parallelism threads. (2022). https:\/\/pytorch.org\/docs\/stable\/notes\/cpu_threading_torchscript_inference.html"},{"key":"e_1_3_2_1_6_1","unstructured":"2023. Kserve. https:\/\/github.com\/kserve\/kserve. (2023)."},{"key":"e_1_3_2_1_7_1","unstructured":"2023. Seldon. https:\/\/github.com\/SeldonIO\/seldon-core. (2023)."},{"key":"e_1_3_2_1_8_1","unstructured":"2023. Triton inference server. https:\/\/github.com\/triton-inference-server\/server. (2023)."},{"key":"e_1_3_2_1_9_1","unstructured":"2023. Vertical Pod autoscaling. https:\/\/github.com\/kubernetes\/autoscaler\/tree\/master\/vertical-pod-autoscaler. (2023)."},{"key":"e_1_3_2_1_10_1","volume-title":"Arnaud Van Looveren, and Clive Cox","author":"Akoush Sherif","year":"2022","unstructured":"Sherif Akoush, Andrei Paleyes, Arnaud Van Looveren, and Clive Cox. 2022. Desiderata for next generation of ML model serving. arXiv preprint arXiv:2210.14665 (2022)."},{"key":"e_1_3_2_1_11_1","volume-title":"AI and compute. https:\/\/openai.com\/blog\/ai-and-compute\/. (Nov","author":"Amodei Dario","year":"2019","unstructured":"Dario Amodei and Danny Hernandez. 2019. AI and compute. https:\/\/openai.com\/blog\/ai-and-compute\/. (Nov 2019)."},{"key":"e_1_3_2_1_12_1","unstructured":"archiveteam. 2021. Archiveteam-twitter-stream-2021-08. https:\/\/archive.org\/details\/archiveteam-twitter-stream-2021-08. (2021)."},{"key":"e_1_3_2_1_13_1","volume-title":"Amazon EC2 ML inference. https:\/\/tinyurl.com\/5n8yb5ub. (Dec","author":"Bar Jeff","year":"2019","unstructured":"Jeff Bar. 2019. Amazon EC2 ML inference. https:\/\/tinyurl.com\/5n8yb5ub. (Dec 2019)."},{"key":"e_1_3_2_1_14_1","volume-title":"Once-for-all: train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791","author":"Cai Han","year":"2019","unstructured":"Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2019. Once-for-all: train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791 (2019)."},{"key":"e_1_3_2_1_15_1","unstructured":"PyTorch Serve Contributors. 2020. Torch serve. https:\/\/pytorch.org\/serve\/. (2020)."},{"volume-title":"14th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 17). 613--627.","author":"Crankshaw Daniel","key":"e_1_3_2_1_16_1","unstructured":"Daniel Crankshaw, Xin Wang, Guilio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. 2017. Clipper: a low-latency online prediction serving system. In 14th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 17). 613--627."},{"key":"e_1_3_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/2382553.2382556"},{"key":"e_1_3_2_1_18_1","doi-asserted-by":"publisher","DOI":"10.1145\/3135974.3135993"},{"key":"e_1_3_2_1_19_1","volume-title":"Serving DNNs like clockwork: performance predictability from the bottom up. arXiv preprint arXiv:2006.02464","author":"Gujarati Arpan","year":"2020","unstructured":"Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, and Jonathan Mace. 2020. Serving DNNs like clockwork: performance predictability from the bottom up. arXiv preprint arXiv:2006.02464 (2020)."},{"key":"e_1_3_2_1_20_1","volume-title":"Prashanth Thinakaran, Bikash Sharma, Mahmut Taylan Kandemir, and Chita R Das.","author":"Gunasekaran Jashwant Raj","year":"2022","unstructured":"Jashwant Raj Gunasekaran, Cyan Subhra Mishra, Prashanth Thinakaran, Bikash Sharma, Mahmut Taylan Kandemir, and Chita R Das. 2022. Cocktail: a multidimensional optimization for model serving in cloud. In USENIX NSDI. 1041--1057."},{"key":"e_1_3_2_1_21_1","unstructured":"Gurobi Optimization LLC. 2023. Gurobi optimizer reference manual. (2023). https:\/\/www.gurobi.com"},{"key":"e_1_3_2_1_22_1","volume-title":"Long short-term memory. Neural computation 9, 8","author":"Hochreiter Sepp","year":"1997","unstructured":"Sepp Hochreiter and J\u00fcrgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735--1780."},{"key":"e_1_3_2_1_23_1","doi-asserted-by":"publisher","DOI":"10.1145\/3472883.3486993"},{"key":"e_1_3_2_1_24_1","volume-title":"Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC '20). USENIX Association.","author":"Keahey Kate","year":"2020","unstructured":"Kate Keahey, Jason Anderson, Zhuo Zhen, Pierre Riteau, Paul Ruth, Dan Stanzione, Mert Cevik, Jacob Colleran, Haryadi S. Gunawi, Cody Hammock, Joe Mambretti, Alexander Barnes, Fran\u00e7ois Halbach, Alex Rocha, and Joe Stubbs. 2020. Lessons learned from the Chameleon testbed. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC '20). USENIX Association."},{"key":"e_1_3_2_1_25_1","volume-title":"AWS to offer Nvidia's T4 GPUs for AI inferencing. 
https:\/\/www.hpcwire.com\/2019\/03\/19\/aws-upgrades-its-gpu-backed-ai-inference-platform\/. (Mar","author":"Leopold George","year":"2019","unstructured":"George Leopold. 2019. AWS to offer Nvidia's T4 GPUs for AI inferencing. https:\/\/www.hpcwire.com\/2019\/03\/19\/aws-upgrades-its-gpu-backed-ai-inference-platform\/. (Mar 2019)."},{"key":"e_1_3_2_1_26_1","volume-title":"2022 IEEE Real-Time Systems Symposium (RTSS). IEEE, 277--290","author":"Nigade Vinod","year":"2022","unstructured":"Vinod Nigade, Pablo Bauszat, Henri Bal, and Lin Wang. 2022. Jellyfish: timely inference serving for dynamic edge networks. In 2022 IEEE Real-Time Systems Symposium (RTSS). IEEE, 277--290."},{"key":"e_1_3_2_1_27_1","volume-title":"Tensorflow-Serving: flexible, high-performance ML serving. arXiv preprint arXiv:1712.06139","author":"Olston Christopher","year":"2017","unstructured":"Christopher Olston, Noah Fiedel, Kiril Gorovoy, Jeremiah Harmsen, Li Lao, Fangwei Li, Vinu Rajashekhar, Sukriti Ramesh, and Jordan Soyke. 2017. Tensorflow-Serving: flexible, high-performance ML serving. arXiv preprint arXiv:1712.06139 (2017)."},{"key":"e_1_3_2_1_28_1","unstructured":"Jongsoo Park, Maxim Naumov, Protonu Basu, Summer Deng, Aravind Kalaiah, Daya Khudia, James Law, Parth Malani, Andrey Malevich, Satish Nadathur, et al. 2018. Deep learning inference in Facebook data centers: characterization, performance optimizations and hardware implications. arXiv preprint arXiv:1811.09886 (2018)."},{"key":"e_1_3_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.1109\/RTAS54340.2022.00020"},{"key":"e_1_3_2_1_30_1","volume-title":"2021 USENIX Annual Technical Conference (USENIX ATC 21)","author":"Romero Francisco","year":"2021","unstructured":"Francisco Romero, Qian Li, Neeraja J Yadwadkar, and Christos Kozyrakis. 2021. {INFaaS}: automated model-less inference serving. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 397--411."},{"key":"e_1_3_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3342195.3387524"},{"key":"e_1_3_2_1_32_1","doi-asserted-by":"crossref","first-page":"423","DOI":"10.1016\/j.procs.2021.01.025","article-title":"Deep learning in image classification using residual network (ResNet) variants for detection of colorectal cancer","volume":"179","author":"Sarwinda Devvi","year":"2021","unstructured":"Devvi Sarwinda, Radifa Hilya Paradisa, Alhadi Bustamam, and Pinkie Anggia. 2021. Deep learning in image classification using residual network (ResNet) variants for detection of colorectal cancer. Procedia Computer Science 179 (2021), 423--431.","journal-title":"Procedia Computer Science"},{"key":"e_1_3_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1145\/3341301.3359658"},{"key":"e_1_3_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Leonid Velikovich, Ian Williams, Justin Scheiner, Petar S Aleksic, Pedro J Moreno, and Michael Riley. 2018. Semantic lattice processing in contextual automatic speech recognition for Google assistant. In Interspeech. 2222--2226.","DOI":"10.21437\/Interspeech.2018-2453"},{"key":"e_1_3_2_1_35_1","doi-asserted-by":"publisher","DOI":"10.1145\/3472883.3486987"},{"key":"e_1_3_2_1_36_1","unstructured":"Chengliang Zhang, Minchen Yu, Wei Wang, and Feng Yan. 2019. MArk: exploiting cloud services for cost-effective SLO-aware machine learning inference serving. In 2019 {USENIX} Annual Technical Conference ({USENIX}{ATC} 19). 1049--1062."},{"key":"e_1_3_2_1_37_1","volume-title":"14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17)","author":"Zhang Haoyu","year":"2017","unstructured":"Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J Freedman. 2017. Live video analytics at scale with approximation and {delay-tolerance}. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 377--392."},{"volume-title":"12th {USENIX} Workshop on Hot Topics in Cloud Computing (HotCloud 20).","author":"Zhang Jeff","key":"e_1_3_2_1_38_1","unstructured":"Jeff Zhang, Sameh Elnikety, Shuayb Zarar, Atul Gupta, and Siddharth Garg. 2020. Model-switching: dealing with fluctuating workloads in machine-learning-as-a-service systems. In 12th {USENIX} Workshop on Hot Topics in Cloud Computing (HotCloud 20)."}],"event":{"name":"EuroMLSys '23: 3rd Workshop on Machine Learning and Systems","sponsor":["SIGOPS ACM Special Interest Group on Operating Systems"],"location":"Rome Italy","acronym":"EuroMLSys '23"},"container-title":["Proceedings of the 3rd Workshop on Machine Learning and Systems"],"original-title":[],"deposited":{"date-parts":[[2023,5,4]],"date-time":"2023-05-04T19:45:38Z","timestamp":1683229538000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3578356.3592578"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,8]]},"references-count":38,"alternative-id":["10.1145\/3578356.3592578","10.1145\/3578356"],"URL":"https:\/\/doi.org\/10.1145\/3578356.3592578","relation":{},"subject":[],"published":{"date-parts":[[2023,5,8]]},"assertion":[{"value":"2023-05-08","order":2,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}