[2310.03724] Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer