[2404.18065] Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model