Recent research has shown that Large Language Models (LLMs) can utilize external tools to improve their contextual processing abilities, moving away from the pure language modeling paradigm and paving the way toward Artificial General Intelligence. Despite this, there has been a lack of systematic evaluation demonstrating the efficacy of LLMs in using tools to respond to human instructions. This paper presents API-Bank, the first benchmark tailored for Tool-Augmented LLMs. API-Bank includes 53 commonly used API tools, a complete Tool-Augmented LLM workflow, and 264 annotated dialogues that encompass a total of 568 API calls. These resources are designed to thoroughly evaluate LLMs' ability to plan step-by-step API calls, retrieve relevant APIs, and correctly execute API calls to meet human needs. The experimental results show that GPT-3.5 exhibits an emergent ability to use tools relative to GPT-3, while GPT-4 has stronger planning performance. Nevertheless, there remains considerable scope for further improvement when compared to human performance. Additionally, detailed error analysis and case studies demonstrate the feasibility of Tool-Augmented LLMs for daily use, as well as the primary challenges that future research needs to address.