[2404.04514] Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models