Abstract
The General Transit Feed Specification (GTFS) standard for publishing transit data is ubiquitous. With LLMs now in wide use, this research explores the possibility of extracting transit information from GTFS through natural-language instructions. To evaluate the capabilities and limitations of LLMs, we introduce two benchmarks, namely “GTFS Semantics” and “GTFS Retrieval”, which test how well LLMs can “understand” the GTFS standard and retrieve relevant transit information. We benchmark OpenAI’s GPT-3.5 Turbo and GPT-4 LLMs, which serve as backends for the ChatGPT interface. In particular, we use zero-shot, one-shot, chain-of-thought, and program synthesis techniques with prompt engineering. On our multiple-choice questions, GPT-3.5 Turbo answers 59.7% correctly and GPT-4 answers 73.3% correctly, but both do worse when one of the multiple-choice options is replaced by “None of these”. Furthermore, we evaluate how well the LLMs can extract information from a filtered GTFS feed containing four bus routes from the Chicago Transit Authority. Program synthesis techniques outperformed zero-shot approaches, achieving up to 93% (90%) accuracy for simple queries and 61% (41%) for complex ones using GPT-4 (GPT-3.5 Turbo).
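The retrieval queries answered via program synthesis operate on GTFS text files loaded as tables. A minimal sketch of the kind of generated program evaluated here, using hypothetical in-memory tables in place of a real feed's routes.txt and trips.txt (a real feed would be read with pd.read_csv on the unzipped files), might look like:

```python
import pandas as pd

# Hypothetical miniature GTFS tables standing in for routes.txt and
# trips.txt; a real feed would be loaded with pd.read_csv("routes.txt") etc.
routes = pd.DataFrame({
    "route_id": ["R1", "R2"],
    "route_short_name": ["49", "66"],
    "route_type": [3, 3],  # 3 = bus in the GTFS specification
})
trips = pd.DataFrame({
    "trip_id": ["t1", "t2", "t3"],
    "route_id": ["R1", "R1", "R2"],
    "service_id": ["WK", "WK", "WK"],
})

# Simple retrieval query: how many trips serve the route with short name "49"?
route_id = routes.loc[routes["route_short_name"] == "49", "route_id"].iloc[0]
n_trips = int((trips["route_id"] == route_id).sum())
print(n_trips)  # 2
```

A "simple" query in the benchmark's sense touches one or two tables like this; "complex" queries require joining several GTFS files (e.g., stop_times.txt with trips.txt and stops.txt).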
Data Availability
Both ‘GTFS Semantics’ and ‘GTFS Retrieval’ benchmarks along with the filtered GTFS data and questionnaire used in this paper are available at https://github.com/UTEL-UIUC/GTFS_LLM.
Notes
Note that g is not a mathematical or computer function; rather, it stands in for the grading process. Different types of answers are graded in different ways: multiple-choice answers are graded automatically by a script that carries out string-matching, whereas the program synthesis questions require copying the code to run in a Python terminal.
Example Q &A available at https://platform.openai.com/examples/default-qa [Accessed 2023-07-29]
OpenAI Chat Completions API: https://platform.openai.com/docs/api-reference/completions/create [Accessed 2023-07-29]
Open AI Python Library https://github.com/openai/openai-python
The CTA feed is available for download at https://transitfeeds.com/p/chicago-transit-authority/165/20230503
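The automatic string-matching step for grading multiple-choice answers, mentioned in the notes above, can be sketched as follows; grade_mc is a hypothetical helper for illustration, not the paper's actual grading script:

```python
def grade_mc(llm_answer: str, correct_option: str) -> bool:
    """Return True if the model's reply contains the correct option label.

    Hypothetical sketch: normalizes case and surrounding whitespace,
    then checks for the option label via substring matching.
    """
    return correct_option.strip().lower() in llm_answer.strip().lower()

print(grade_mc("The answer is (b) stop_times.txt", "(b)"))  # True
print(grade_mc("I would choose (c).", "(b)"))               # False
```

Program-synthesis answers cannot be graded this way, since the returned code must actually be executed and its output compared against the expected value.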
References
Bai F, Kang J, Stanovsky G, Freitag D, Ritter A (2023) Schema-driven information extraction from heterogeneous tables. https://doi.org/10.48550/arXiv.2305.14336. arXiv:2305.14336
Bommarito II M, Katz DM (2022) GPT takes the bar exam. arXiv:2212.14402v1
Brown TB, Mann B, Ryder N et al. (2020) Language models are few-shot learners. https://doi.org/10.48550/arXiv.2005.14165. arXiv:2005.14165
Chen M, Tworek J, Jun H et al. (2021) Evaluating large language models trained on code. https://doi.org/10.48550/arXiv.2107.03374. arXiv:2107.03374
Conveyal (2024) Conveyal R5 routing engine: rapid realistic routing on real-world and reimagined networks. Conveyal, https://github.com/conveyal/r5
Devunuri S, Lehe L (2024) GTFS segments: a fast and efficient library to generate bus stop spacings. J Open Source Softw 9(95):6306. https://doi.org/10.21105/joss.06306
Devunuri S, Lehe LJ, Qiam S, Pandey A, Monzer D (2024) Bus stop spacing statistics: theory and evidence. J Public Transp 26:100083. https://doi.org/10.1016/j.jpubtr.2024.100083
Gilson A, Safranek CW, Huang T et al. (2023) How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 9(1):e45312. https://doi.org/10.2196/45312
Jain N, Vaidyanath S, Iyer A et al. (2022) Jigsaw: large language models meet program synthesis. In: Proceedings of the 44th international conference on software engineering. Association for Computing Machinery, New York, NY, USA, ICSE ’22, pp 1219–1231, https://doi.org/10.1145/3510003.3510203
Jiang AQ, Sablayrolles A, Mensch A et al. (2023) Mistral 7B. https://doi.org/10.48550/arXiv.2310.06825. arXiv:2310.06825
Katz DM, Bommarito MJ, Gao S et al. (2023) GPT-4 passes the bar exam. https://doi.org/10.2139/ssrn.4389233
Khatry A, Cahoon J, Henkel J et al. (2023) From words to code: harnessing data for program synthesis from natural language. https://doi.org/10.48550/arXiv.2305.01598. arXiv:2305.01598
Kim J, Lee J (2023) How does ChatGPT introduce transport problems and solutions in North America? Findings. https://doi.org/10.32866/001c.72634
Kojima T, Gu SS, Reid M et al. (2023) Large language models are zero-shot reasoners. https://doi.org/10.48550/arXiv.2205.11916
Lee TC, Staller K, Botoman V et al. (2023) ChatGPT answers common patient questions about colonoscopy. Gastroenterology 165(2):509-511.e7. https://doi.org/10.1053/j.gastro.2023.04.033
McHugh B (2013) Pioneering open data standards: the GTFS story. In: Beyond transparency: open data and the future of civic innovation. Code for America Press, San Francisco, pp 125–135. https://beyondtransparency.org/part-2/pioneering-open-data-standards-the-gtfs-story/
McKinney W (2010) Data structures for statistical computing in Python. In: Proceedings of the 9th Python in science conference, pp 56–61
Mumtarin M, Chowdhury MS, Wood J (2023) Large language models in analyzing crash narratives—a comparative study of ChatGPT, BARD and GPT-4. https://doi.org/10.48550/arXiv.2308.13563. arXiv:2308.13563
Pereira RHM, Saraiva M, Herszenhut D et al. (2021) r5r: rapid realistic routing on multimodal transport networks with R5 in R. Findings. https://doi.org/10.32866/001c.21262
Pereira RHM, Andrade PR, Vieira JPB (2023) Exploring the time geography of public transport networks with the gtfs2gps package. J Geogr Syst 25(3):453–466. https://doi.org/10.1007/s10109-022-00400-x
Ray PP, Majumder P (2023) Assessing the accuracy of responses by the language model ChatGPT to questions regarding bariatric surgery: a critical appraisal. Obes Surg 33(8):2588–2589. https://doi.org/10.1007/s11695-023-06664-6
Schrittwieser J, Antonoglou I, Hubert T et al. (2020) Mastering Atari, Go, chess and shogi by planning with a learned model. Nature 588(7839):604–609. https://doi.org/10.1038/s41586-020-03051-4
Sobania D, Briesch M, Rothlauf F (2022) Choose your programming copilot: a comparison of the program synthesis performance of github copilot and genetic programming. In: Proceedings of the genetic and evolutionary computation conference. Association for Computing Machinery, New York, NY, USA, GECCO ’22, pp 1019–1027. https://doi.org/10.1145/3512290.3528700
Sobania D, Briesch M, Hanna C et al. (2023a) An analysis of the automatic bug fixing performance of ChatGPT. https://doi.org/10.48550/arXiv.2301.08653
Sobania D, Schweim D, Rothlauf F (2023b) A comprehensive survey on program synthesis with evolutionary algorithms. IEEE Trans Evol Comput 27(1):82–97. https://doi.org/10.1109/TEVC.2022.3162324
Taori R, Gulrajani I, Zhang T et al. (2024) Alpaca: a strong, replicable instruction-following model. Tatsu’s shared repositories, https://github.com/tatsu-lab/stanford_alpaca
Touvron H, Lavril T, Izacard G et al. (2023) LLaMA: open and efficient foundation language models. https://doi.org/10.48550/arXiv.2302.13971. arXiv:2302.13971
Voß S (2023) Bus bunching and bus bridging: what can we learn from generative AI tools like ChatGPT? Sustainability 15(12):9625. https://doi.org/10.3390/su15129625
Voulgaris CT, Begwani C (2023) Predictors of early adoption of the general transit feed specification. Findings. https://doi.org/10.32866/001c.57722
Wei J, Wang X, Schuurmans D et al. (2023) Chain-of-thought prompting elicits reasoning in large language models. arXiv:2201.11903
Whalen D (2024) Remix/partridge. Remix. https://github.com/remix/partridge
Zhao WX, Zhou K, Li J et al. (2023) A survey of large language models. https://doi.org/10.48550/arXiv.2303.18223. arXiv:2303.18223
Zheng O, Abdel-Aty M, Wang D et al. (2023) ChatGPT is on the horizon: could a large language model be suitable for intelligent traffic safety research and applications? https://doi.org/10.48550/arXiv.2303.05382. arXiv:2303.05382
Zhuang Y, Yu Y, Wang K et al. (2023) ToolQA: a dataset for LLM question answering with external tools. https://doi.org/10.48550/arXiv.2306.13304. arXiv:2306.13304
Author information
Contributions
The authors confirm their contribution to the paper as follows: study conception and design: SD; data collection: SD, SQ; analysis and interpretation of results: SD, SQ, LL; draft manuscript preparation: SD, SQ, LL. All authors reviewed the results and approved the final version of the manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Devunuri, S., Qiam, S. & Lehe, L.J. ChatGPT for GTFS: benchmarking LLMs on GTFS semantics and retrieval. Public Transp 16, 333–357 (2024). https://doi.org/10.1007/s12469-024-00354-x