[2411.04368] Measuring short-form factuality in large language models