GPT-4.1 is such an amazing model in so many ways. It's still my number one choice for many automation tasks. Even the mini version works quite well, and it has the same massive context window (nearly 8x that of GPT-5). Definitely the best non-reasoning model out there for real-world tasks.
Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them
On large-context inputs (up to 100K tokens in some cases), we found that GPT-5:
a) has worse instruction following; it doesn't follow the system prompt
b) produces very long answers, which resulted in bad UX
c) has a 125K context window, so extreme cases resulted in an error
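For what it's worth, point (c) can be guarded against before the request is sent. A rough sketch, assuming the retrieved chunks arrive already ranked best-first and that about 4 characters per token is a good enough estimate; `estimate_tokens`, `fit_to_budget`, and the 4-characters heuristic are illustrative assumptions, not any vendor's API, and a real tokenizer would be more accurate:

```python
from typing import List


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token.
    return int(len(text) / chars_per_token) + 1


def fit_to_budget(chunks: List[str], max_tokens: int, reserved: int = 0) -> List[str]:
    """Keep the highest-ranked chunks that fit into the context budget.

    `reserved` sets aside room for the system prompt and the answer.
    """
    budget = max_tokens - reserved
    kept: List[str] = []
    used = 0
    for chunk in chunks:  # chunks are assumed to be sorted best-first
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that would overflow
        kept.append(chunk)
        used += cost
    return kept


if __name__ == "__main__":
    ranked = ["a" * 400, "b" * 400, "c" * 400]
    print(len(fit_to_budget(ranked, max_tokens=250)))
```

Dropping the tail of the ranking trades a hard error for slightly less context, which is usually the better failure mode in a RAG pipeline.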
Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?
ChatGPT when using 5 or 5-Thinking doesn’t even follow my “custom instructions” on the web version. It’s a serious downgrade compared to the prior generation of models.
Not the original commenter, but I work in the space, and we have large annotated datasets with "gold" evidence that we want to retrieve, so the evaluation of new models is actually very quantitative.
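To make that concrete: with gold evidence per query, retrieval quality reduces to a number like recall@k. A minimal sketch, assuming each query maps to a set of gold document IDs and each pipeline returns a ranked list of IDs; the function names and IDs below are made up for illustration, not anyone's actual harness:

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], gold: Set[str], k: int) -> float:
    """Fraction of the gold evidence found in the top-k retrieved IDs."""
    if not gold:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & gold) / len(gold)


def mean_recall_at_k(
    runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every query that has gold annotations."""
    scores = [recall_at_k(runs.get(q, []), g, k) for q, g in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical annotations and one pipeline's ranked output.
    gold = {"q1": {"d1", "d4"}, "q2": {"d7"}}
    runs = {"q1": ["d1", "d2", "d4"], "q2": ["d3", "d5", "d7"]}
    print(mean_recall_at_k(runs, gold, k=2))  # 0.25
    print(mean_recall_at_k(runs, gold, k=3))  # 1.0
```

Running the same gold set against two models gives directly comparable numbers, which is what makes the comparison quantitative for the retrieval step at least.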
Ya, the original commenter likely does not work in the space - hence the ask.
> the evaluation of new models is actually very quantitative.
While you may be able to derive a % correct (and hence quantitative), these evaluations are by their nature very much not quantitative. Q&As on written subjects are very much subjective. Example benchmark: https://llm-stats.com/benchmarks/gpqa
Even though there are techniques to reduce overfitting, they don't eliminate it. So it's very much subjective.