[2407.09781] Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding