[2407.03188] MuDiT & MuSiT: Alignment with Colloquial Expression in Description-to-Song Generation