carschno 4 months ago | on: Computer use, a new Claude 3.5 Sonnet, and Claude ...

Indeed, test data like this constantly leaks into the training data, so these leaderboards are not necessarily representative of real-world problems. A better approach is to evaluate on variable problem instances, as GSM-Symbolic does for mathematical reasoning: https://arxiv.org/abs/2410.05229
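
To make "variable" concrete, here is a minimal sketch of the general idea: instead of scoring a model on one fixed question that may already sit in its training data, you sample fresh names and numbers into a template each time. The template, the name list, and the `make_variant` / `accuracy` helpers are all made up for illustration and are not taken from the GSM-Symbolic paper; `model` is assumed to be any callable that maps a question string to an integer answer.

```python
import random

# Hypothetical template and name pool, for illustration only
# (not taken from the GSM-Symbolic paper).
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "How many apples does {name} have now?"
)
NAMES = ["Ava", "Ben", "Chloe", "Dev"]


def make_variant(rng):
    """Return (question, answer) with a freshly sampled name and numbers."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 50)
    return TEMPLATE.format(name=name, a=a, b=b), a + b


def accuracy(model, n=100, seed=0):
    """Score `model` (question -> int) on n freshly sampled variants."""
    rng = random.Random(seed)
    variants = [make_variant(rng) for _ in range(n)]
    correct = sum(model(question) == answer for question, answer in variants)
    return correct / n
```

A model that only memorized one published instance would tend to score well on that instance and poorly here, while a model that actually does the arithmetic should score about the same on every seed.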