[2403.04866] A Modular End-to-End Multimodal Learning Method for Structured and Unstructured Data