Let’s see the code. I’m a bit skeptical that this hasn’t over-complicated something architecturally. Need clearer drawings of the architecture: what prompts exist, what tool calls are made, and what gets updated.
Included here is a bit of the old tried and true: NDCG, MRR, and Precision@k, the metrics you really want for measuring your information retrieval systems.
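For reference, a minimal sketch of those three metrics over a single ranked result list, assuming graded relevance labels (0 = not relevant, higher = more relevant); the function names and example numbers are my own, not from the book:

```python
import math

def precision_at_k(relevances, k):
    """Fraction of the top-k results that are relevant (grade > 0)."""
    top = relevances[:k]
    return sum(1 for r in top if r > 0) / k

def mrr(relevances):
    """Reciprocal rank of the first relevant result; 0 if none found."""
    for i, r in enumerate(relevances, start=1):
        if r > 0:
            return 1.0 / i
    return 0.0

def ndcg_at_k(relevances, k):
    """Normalized discounted cumulative gain over the top-k results."""
    def dcg(rels):
        return sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of each retrieved document, in the order the system ranked them.
ranked = [3, 0, 2, 0, 1]
print(precision_at_k(ranked, 5))  # 0.6
print(mrr(ranked))                # 1.0
print(ndcg_at_k(ranked, 5))       # ~0.92
```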
But we also talk through a bit of the "new": how to use Evals to generate the building blocks for those metrics above. In the end you will want both hand labels and automated Evals to evaluate your system.
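One way that combination could look, as a sketch rather than the book's actual recipe: treat an LLM judge as a fallback labeler and let hand labels win whenever both exist. `llm_judge` here is a hypothetical callable (e.g. a prompt that scores relevance on a 0-3 scale), not a specific API.

```python
def graded_relevances(query, ranked_docs, hand_labels, llm_judge):
    """Build the per-document relevance list the metric functions expect.

    ranked_docs:  list of (doc_id, doc_text) in ranked order
    hand_labels:  {(query, doc_id): grade} from human annotators
    llm_judge:    callable (query, doc_text) -> grade (assumed interface)
    Human labels take precedence; the judge fills the gaps.
    """
    rels = []
    for doc_id, doc_text in ranked_docs:
        key = (query, doc_id)
        rels.append(hand_labels[key] if key in hand_labels else llm_judge(query, doc_text))
    return rels
```

The output plugs straight into `ndcg_at_k`, `mrr`, and `precision_at_k` above, which is the sense in which the Evals are "building blocks" for the classic IR metrics.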
The community is having large debates on whether an LLM can reason outside of its training data. That feels ignored here.