Abstract
The General Transit Feed Specification (GTFS) standard for publishing transit data is ubiquitous. With LLMs now in wide use, this research explores the possibility of extracting transit information from GTFS through natural-language instructions. To evaluate the capabilities and limitations of LLMs, we introduce two benchmarks, namely “GTFS Semantics” and “GTFS Retrieval”, which test how well LLMs can “understand” the GTFS standard and retrieve relevant transit information. We benchmark OpenAI’s GPT-3.5 Turbo and GPT-4 LLMs, which serve as backends for the ChatGPT interface. In particular, we use zero-shot, one-shot, chain-of-thought, and program synthesis techniques with prompt engineering. On our multiple-choice questions, GPT-3.5 Turbo answers 59.7% correctly and GPT-4 answers 73.3% correctly, but both do worse when one of the multiple-choice options is replaced by “None of these”. Furthermore, we evaluate how well the LLMs can extract information from a filtered GTFS feed containing four bus routes from the Chicago Transit Authority. Program synthesis techniques outperformed zero-shot approaches, achieving up to 93% (90%) accuracy for simple queries and 61% (41%) for complex ones using GPT-4 (GPT-3.5 Turbo).
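The retrieval queries answered via program synthesis operate on GTFS text files loaded as tables. A minimal sketch of the kind of generated program evaluated here, using hypothetical in-memory tables in place of a real feed's routes.txt and trips.txt (a real feed would be read with pd.read_csv on the unzipped files), might look like:

```python
import pandas as pd

# Hypothetical miniature GTFS tables standing in for routes.txt and
# trips.txt; a real feed would be loaded with pd.read_csv("routes.txt") etc.
routes = pd.DataFrame({
    "route_id": ["R1", "R2"],
    "route_short_name": ["49", "66"],
    "route_type": [3, 3],  # 3 = bus in the GTFS specification
})
trips = pd.DataFrame({
    "trip_id": ["t1", "t2", "t3"],
    "route_id": ["R1", "R1", "R2"],
    "service_id": ["WK", "WK", "WK"],
})

# Simple retrieval query: how many trips serve the route with short name "49"?
route_id = routes.loc[routes["route_short_name"] == "49", "route_id"].iloc[0]
n_trips = int((trips["route_id"] == route_id).sum())
print(n_trips)  # 2
```

A "simple" query in the benchmark's sense touches one or two tables like this; "complex" queries require joining several GTFS files (e.g., stop_times.txt with trips.txt and stops.txt).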
Data Availability
Both ‘GTFS Semantics’ and ‘GTFS Retrieval’ benchmarks along with the filtered GTFS data and questionnaire used in this paper are available at https://github.com/UTEL-UIUC/GTFS_LLM.
Notes
Note that g is not a mathematical or computer function; rather, it stands in for the grading process. Different types of answers are graded in different ways: multiple-choice answers are graded automatically by a script that carries out string-matching, whereas the program synthesis questions require copying the code to run in a Python terminal.
Example Q &A available at https://platform.openai.com/examples/default-qa [Accessed 2023-07-29]
OpenAI Chat Completions API: https://platform.openai.com/docs/api-reference/completions/create [Accessed 2023-07-29]
Open AI Python Library https://github.com/openai/openai-python
The CTA feed is available for download at https://transitfeeds.com/p/chicago-transit-authority/165/20230503
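The automatic string-matching step for grading multiple-choice answers, mentioned in the notes above, can be sketched as follows; grade_mc is a hypothetical helper for illustration, not the paper's actual grading script:

```python
def grade_mc(llm_answer: str, correct_option: str) -> bool:
    """Return True if the model's reply contains the correct option label.

    Hypothetical sketch: normalizes case and surrounding whitespace,
    then checks for the option label via substring matching.
    """
    return correct_option.strip().lower() in llm_answer.strip().lower()

print(grade_mc("The answer is (b) stop_times.txt", "(b)"))  # True
print(grade_mc("I would choose (c).", "(b)"))               # False
```

Program-synthesis answers cannot be graded this way, since the returned code must actually be executed and its output compared against the expected value.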
References
Bai F, Kang J, Stanovsky G, Freitag D, Ritter A (2023) Schema-driven information extraction from heterogeneous tables. https://doi.org/10.48550/arXiv.2305.14336. arXiv:2305.14336
Bommarito II M, Katz DM (2022) GPT takes the bar exam. arXiv:2212.14402v1
Brown TB, Mann B, Ryder N et al. (2020) Language models are few-shot learners. https://doi.org/10.48550/arXiv.2005.14165. arXiv:2005.14165
Chen M, Tworek J, Jun H et al. (2021) Evaluating large language models trained on code. https://doi.org/10.48550/arXiv.2107.03374. arXiv:2107.03374
Conveyal (2024) Conveyal R5 routing engine: rapid realistic routing on real-world and reimagined networks. Conveyal, https://github.com/conveyal/r5
Devunuri S, Lehe L (2024) GTFS segments: a fast and efficient library to generate bus stop spacings. J Open Source Softw 9(95):6306. https://doi.org/10.21105/joss.06306
Devunuri S, Lehe LJ, Qiam S, Pandey A, Monzer D (2024) Bus stop spacing statistics: theory and evidence. J Public Transp 26:100083. https://doi.org/10.1016/j.jpubtr.2024.100083
Gilson A, Safranek CW, Huang T et al. (2023) How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 9(1):e45312. https://doi.org/10.2196/45312
Jain N, Vaidyanath S, Iyer A et al. (2022) Jigsaw: large language models meet program synthesis. In: Proceedings of the 44th international conference on software engineering. Association for Computing Machinery, New York, NY, USA, ICSE ’22, pp 1219–1231, https://doi.org/10.1145/3510003.3510203
Jiang AQ, Sablayrolles A, Mensch A et al. (2023) Mistral 7B. https://doi.org/10.48550/arXiv.2310.06825. arXiv:2310.06825
Katz DM, Bommarito MJ, Gao S et al. (2023) GPT-4 passes the bar exam. https://doi.org/10.2139/ssrn.4389233
Khatry A, Cahoon J, Henkel J et al. (2023) From words to code: harnessing data for program synthesis from natural language. https://doi.org/10.48550/arXiv.2305.01598. arXiv:2305.01598
Kim J, Lee J (2023) How does ChatGPT introduce transport problems and solutions in North America? Findings. https://doi.org/10.32866/001c.72634
Kojima T, Gu SS, Reid M et al. (2023) Large language models are zero-shot reasoners. https://doi.org/10.48550/arXiv.2205.11916
Lee TC, Staller K, Botoman V et al. (2023) ChatGPT answers common patient questions about colonoscopy. Gastroenterology 165(2):509-511.e7. https://doi.org/10.1053/j.gastro.2023.04.033
McHugh B (2013) Pioneering open data standards: the GTFS story. In: Beyond transparency: open data and the future of civic innovation. Code for America Press, San Francisco, pp 125–135. https://beyondtransparency.org/part-2/pioneering-open-data-standards-the-gtfs-story/
McKinney W (2010) Data structures for statistical computing in Python. In: Proceedings of the 9th Python in science conference, pp 56–61
Mumtarin M, Chowdhury MS, Wood J (2023) Large language models in analyzing crash narratives—a comparative study of ChatGPT, BARD and GPT-4. https://doi.org/10.48550/arXiv.2308.13563. arXiv:2308.13563
Pereira RHM, Saraiva M, Herszenhut D et al. (2021) r5r: rapid realistic routing on multimodal transport networks with R5 in R. Findings. https://doi.org/10.32866/001c.21262
Pereira RHM, Andrade PR, Vieira JPB (2023) Exploring the time geography of public transport networks with the gtfs2gps package. J Geogr Syst 25(3):453–466. https://doi.org/10.1007/s10109-022-00400-x
Ray PP, Majumder P (2023) Assessing the accuracy of responses by the language model ChatGPT to questions regarding bariatric surgery: a critical appraisal. Obes Surg 33(8):2588–2589. https://doi.org/10.1007/s11695-023-06664-6
Schrittwieser J, Antonoglou I, Hubert T et al. (2020) Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588(7839):604–609. https://doi.org/10.1038/s41586-020-03051-4
Sobania D, Briesch M, Rothlauf F (2022) Choose your programming copilot: a comparison of the program synthesis performance of github copilot and genetic programming. In: Proceedings of the genetic and evolutionary computation conference. Association for Computing Machinery, New York, NY, USA, GECCO ’22, pp 1019–1027. https://doi.org/10.1145/3512290.3528700
Sobania D, Briesch M, Hanna C et al. (2023a) An analysis of the automatic bug fixing performance of ChatGPT. https://doi.org/10.48550/arXiv.2301.08653
Sobania D, Schweim D, Rothlauf F (2023b) A comprehensive survey on program synthesis with evolutionary algorithms. IEEE Trans Evol Comput 27(1):82–97. https://doi.org/10.1109/TEVC.2022.3162324
Taori R, Gulrajani I, Zhang T et al. (2024) Alpaca: a strong, replicable instruction-following model. Tatsu’s shared repositories, https://github.com/tatsu-lab/stanford_alpaca
Touvron H, Lavril T, Izacard G et al. (2023) LLaMA: open and efficient foundation language models. https://doi.org/10.48550/arXiv.2302.13971. arXiv:2302.13971
Voß S (2023) Bus bunching and bus bridging: what can we learn from generative AI tools like ChatGPT? Sustainability 15(12):9625. https://doi.org/10.3390/su15129625
Voulgaris CT, Begwani C (2023) Predictors of early adoption of the general transit feed specification. Findings. https://doi.org/10.32866/001c.57722
Wei J, Wang X, Schuurmans D et al. (2023) Chain-of-thought prompting elicits reasoning in large language models. arXiv:2201.11903
Whalen D (2024) Remix/partridge. Remix. https://github.com/remix/partridge
Zhao WX, Zhou K, Li J et al. (2023) A survey of large language models. https://doi.org/10.48550/arXiv.2303.18223. arXiv:2303.18223
Zheng O, Abdel-Aty M, Wang D et al. (2023) ChatGPT is on the horizon: could a large language model be suitable for intelligent traffic safety research and applications? https://doi.org/10.48550/arXiv.2303.05382. arXiv:2303.05382
Zhuang Y, Yu Y, Wang K et al. (2023) ToolQA: a dataset for LLM question answering with external tools. https://doi.org/10.48550/arXiv.2306.13304. arXiv:2306.13304
Author information
Contributions
The authors confirm their contribution to the paper as follows: study conception and design: SD; data collection: SD, SQ; analysis and interpretation of results: SD, SQ, LL; draft manuscript preparation: SD, SQ, LL. All authors reviewed the results and approved the final version of the manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Devunuri, S., Qiam, S. & Lehe, L.J. ChatGPT for GTFS: benchmarking LLMs on GTFS semantics and retrieval. Public Transp 16, 333–357 (2024). https://doi.org/10.1007/s12469-024-00354-x