[2304.01603] Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA