Indeed, test data like this constantly leaks into training data, so these leaderboards are not necessarily representative of performance on real-world problems. A better approach is a variable evaluation, where the concrete test instances are regenerated from templates instead of being fixed, such as GSM-Symbolic (for evaluating mathematical reasoning): https://arxiv.org/abs/2410.05229
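
As a rough illustration of what a variable evaluation could look like, here is a minimal sketch: each question is sampled from a template, so the exact wording and numbers are unlikely to have been seen during training. The template, the names, the number ranges, and the helpers `make_variant`, `accuracy`, and `reference_solver` are all made up for this example and are not taken from GSM-Symbolic itself.

```python
import random
import re

# Hypothetical template; GSM-Symbolic derives its templates from GSM8K questions.
TEMPLATE = (
    "{name} has {a} apples and buys {b} bags with {c} apples each. "
    "How many apples does {name} have now?"
)
NAMES = ["Ava", "Ben", "Chen", "Dara"]


def make_variant(rng):
    """Sample one question together with its ground-truth answer."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 50)
    b = rng.randint(2, 9)
    c = rng.randint(2, 12)
    question = TEMPLATE.format(name=name, a=a, b=b, c=c)
    return question, a + b * c


def accuracy(solver, n=100, seed=0):
    """Score a solver (question -> int) on n freshly sampled variants."""
    rng = random.Random(seed)
    variants = [make_variant(rng) for _ in range(n)]
    correct = sum(solver(question) == answer for question, answer in variants)
    return correct / n


def reference_solver(question):
    """Stand-in for a model: reads the three numbers and applies the formula."""
    a, b, c = (int(x) for x in re.findall(r"\d+", question))
    return a + b * c


if __name__ == "__main__":
    print(accuracy(reference_solver, n=100, seed=0))
```

In practice `solver` would wrap a call to the model under test, and running the evaluation with several seeds shows how much the score moves between instantiations of the same template.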