[2409.15934] Automated test generation to evaluate tool-augmented LLMs as conversational AI agents