
Indeed, test data like this constantly leaks into training data, so these leaderboards are not necessarily representative of real-world problems. A better approach is variable evaluation, as in GSM-Symbolic (for evaluating mathematical reasoning), which generates fresh variants of each problem from symbolic templates so that memorized answers don't help: https://arxiv.org/abs/2410.05229
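
To make the idea concrete, here is a rough sketch of what templated evaluation could look like. This is not the GSM-Symbolic code; the template, the names, the number ranges, and the `model` callable are all made up for illustration. The point is only that each run draws new names and numbers, so a model that memorized one fixed question gets no advantage.

```python
import random

# Hypothetical template; wording, names, and ranges are illustrative only,
# not taken from the GSM-Symbolic paper.
TEMPLATE = (
    "{name} has {a} apples and buys {b} bags with {c} apples each. "
    "How many apples does {name} have now?"
)
NAMES = ["Ava", "Ben", "Chen", "Dara"]


def make_variant(rng):
    """Return (question, answer) for one randomly instantiated variant."""
    name = rng.choice(NAMES)
    a, b, c = rng.randint(2, 20), rng.randint(2, 9), rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    answer = a + b * c  # ground truth is computed from the sampled values
    return question, answer


def accuracy(model, n=100, seed=0):
    """Score `model` (a callable: question -> int) on n fresh variants."""
    rng = random.Random(seed)
    variants = [make_variant(rng) for _ in range(n)]
    correct = sum(model(q) == ans for q, ans in variants)
    return correct / n
```

Changing `seed` gives a different test set each time, and comparing accuracy across seeds shows how much the score depends on the specific surface form of the questions.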


