[2411.15714] ROOT: VLM based System for Indoor Scene Understanding and Beyond