Benchmarking Agentic Workflow Generation

Qiao, Shuofei; Fang, Runnan; Qiu, Zhisong; Wang, Xiaobin; Zhang, Ningyu; Jiang, Yong; Xie, Pengjun; Huang, Fei; Chen, Huajun

arXiv:2410.07869 (cs)

[Submitted on 10 Oct 2024 (v1), last revised 23 Feb 2025 (this version, v3)]

Abstract: Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advances in reasoning and planning, where decomposing complex problems into executable workflows is a crucial step. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorfEval, a systematic evaluation protocol that uses subsequence and subgraph matching algorithms to accurately quantify the workflow generation capabilities of LLM agents. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference. Code and dataset are available at this https URL.
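
The abstract names subsequence matching as one ingredient of the evaluation protocol but does not spell it out. As a rough illustration only, the sketch below scores a predicted chain of workflow steps against a gold chain using the longest common subsequence (LCS). The function names `lcs_length` and `chain_f1`, the exact string matching of steps, and the F1 aggregation are all assumptions made for this example and are not taken from the WorfEval implementation.

```python
from typing import Sequence


def lcs_length(pred: Sequence[str], gold: Sequence[str]) -> int:
    """Length of the longest common subsequence of two step sequences."""
    # dp[i][j] holds the LCS length of pred[:i] and gold[:j].
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, start=1):
        for j, g in enumerate(gold, start=1):
            if p == g:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(pred)][len(gold)]


def chain_f1(pred: Sequence[str], gold: Sequence[str]) -> float:
    """Hypothetical chain score: F1 over the in-order matched steps.

    Assumption: steps match only when their labels are identical strings.
    """
    if not pred or not gold:
        return 0.0
    matched = lcs_length(pred, gold)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)  # share of predicted steps that match
    recall = matched / len(gold)     # share of gold steps that are recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold_steps = ["search flights", "compare prices", "book ticket"]
    pred_steps = ["search flights", "book ticket", "send receipt"]
    print(chain_f1(pred_steps, gold_steps))
```

An order-aware measure such as LCS penalizes a prediction that contains the right steps in the wrong order, which a plain set overlap would not. Scoring graph-structured workflows would need a subgraph-level counterpart that this sketch does not attempt.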

Comments:	ICLR 2025
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Cite as:	arXiv:2410.07869 [cs.CL]
	(or arXiv:2410.07869v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.07869

Submission history

From: Ningyu Zhang
[v1] Thu, 10 Oct 2024 12:41:19 UTC (2,993 KB)
[v2] Wed, 30 Oct 2024 14:49:49 UTC (2,993 KB)
[v3] Sun, 23 Feb 2025 15:16:14 UTC (2,997 KB)
