[2409.15268] Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking