Abstract
Transformer-based models such as BERT have achieved state-of-the-art accuracy on natural language processing (NLP) tasks. Nevertheless, these models are extremely large and exhibit low inference throughput, which is especially challenging for edge inference due to the limited memory and computational power of edge devices. We therefore aim to improve the edge-inference throughput of transformer-based models, which is critical for real-life applications that process multiple independent tasks concurrently on resource-constrained devices to provide a better user experience. Pipelining a deep neural network (DNN) model across heterogeneous processing elements has been shown to significantly improve throughput. However, existing deep learning (DL) frameworks do not support pipelined inference, and previous works dedicated to pipelining lack full support for BERT models. In this work, we propose PipeBERT, a heterogeneous pipelining framework built on TVM that lets BERT models utilize all heterogeneous resources of the ARM big.LITTLE architecture, which is common in modern edge devices. PipeBERT is the first pipelining framework that fully supports BERT operations and improves overall throughput by employing heterogeneous ARM CPU clusters concurrently: it splits a BERT model into subgraphs and maps each subgraph onto either the ARM big or LITTLE cluster. To efficiently find pipeline configurations that balance the workload between the heterogeneous clusters, we propose an improved binary search algorithm that uses hardware performance-metric feedback to find the best split configuration faster. Our search algorithm finds the best split point on average 1.2x and 165x faster than baseline binary search and exhaustive search, respectively. On the HiKey970 embedded platform, PipeBERT achieves on average 48.6% higher inference throughput for BERT models than running on four big cores (i.e., the ARM big CPU cluster), and an average of 61% lower energy-delay product (EDP) than the best homogeneous inference.
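To make the split search concrete, the following is a minimal Python sketch of a latency-balancing binary search over a linearly ordered chain of subgraphs. It is an illustration only, not PipeBERT's actual implementation: the function `measure_stage_latency` is a hypothetical stand-in for profiling a candidate split on real hardware (PipeBERT additionally steers its search with hardware performance-counter feedback), and the sketch assumes stage latencies vary monotonically with the split point.

```python
# A minimal sketch (not PipeBERT's actual code) of binary search for a
# pipeline split point. Subgraphs [0, split) run on the big cluster and
# [split, N) on the LITTLE cluster; pipeline throughput is limited by the
# slower (bottleneck) stage, so we look for the split that balances the two.
from typing import Callable, Tuple

def find_balanced_split(
    num_subgraphs: int,
    measure_stage_latency: Callable[[int], Tuple[float, float]],
) -> int:
    """Return the split index that minimizes the bottleneck stage latency.

    measure_stage_latency(split) -> (big_ms, little_ms) is a hypothetical
    profiling hook: measured latency of subgraphs [0, split) on the big
    cluster and [split, N) on LITTLE. The search assumes big_ms grows and
    little_ms shrinks monotonically as `split` increases.
    """
    lo, hi = 0, num_subgraphs
    best_split, best_bottleneck = 0, float("inf")
    while lo <= hi:
        mid = (lo + hi) // 2
        big_ms, little_ms = measure_stage_latency(mid)
        bottleneck = max(big_ms, little_ms)
        if bottleneck < best_bottleneck:
            best_split, best_bottleneck = mid, bottleneck
        if big_ms < little_ms:
            lo = mid + 1  # big cluster is underloaded: shift work onto it
        elif big_ms > little_ms:
            hi = mid - 1  # big cluster is the bottleneck: shed work
        else:
            break  # stages perfectly balanced
    return best_split
```

Because the pipeline's throughput is bounded by its slowest stage, balancing the two stage latencies maximizes throughput; each probe costs one on-device measurement, which is why reducing the number of probes, as PipeBERT's improved search does, matters in practice.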
Notes
We compare with Pipe-all because it is the only open-source pipelining framework.
Funding
The author(s) received funding from Huawei Canada.
Author information
Contributions
Hung-Yang, Dr. Seyyed Hasan, and Prof. Brett developed the methodologies and planned the experiments. Hung-Yang performed the experimental implementations, and Cheng designed the search algorithm. Hung-Yang and Dr. Seyyed Hasan wrote the manuscript, and all authors provided critical feedback to shape the research, analysis, and manuscript.
Ethics declarations
Conflict of Interest
The authors declare no potential conflicts of interest with respect to the research, authorship, or publication of this article.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chang, HY., Mozafari, S.H., Chen, C. et al. PipeBERT: High-throughput BERT Inference for ARM Big.LITTLE Multi-core Processors. J Sign Process Syst 95, 877–894 (2023). https://doi.org/10.1007/s11265-022-01814-y