How to evaluate RAG answers before putting them in production

RAG demos are easy to make look good. Production is where the weird cases show up: stale docs, two pages saying different things, an answer that sounds confident but skips the one constraint the user actually needed. For internal tools, I do not trust a single accuracy number anymore. I want a small set of messy questions from real users, expected source docs, citation checks, and a way to mark…
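One way to picture the "expected source docs plus citation checks" part is the minimal sketch below. It is only an illustration: the `EvalCase` and `RagAnswer` shapes, the `check_citations` helper, and the sample question and doc IDs are all made up for this example, and it assumes your RAG system can report which doc IDs it cited.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One real user question plus the docs a good answer should cite (hypothetical shape)."""
    question: str
    expected_doc_ids: set[str]


@dataclass
class RagAnswer:
    """Assumed shape of a RAG response: answer text and the doc IDs it cited."""
    text: str
    cited_doc_ids: set[str] = field(default_factory=set)


def check_citations(case: EvalCase, answer: RagAnswer) -> dict:
    """Compare the cited docs against the expected docs for a single case."""
    missing = case.expected_doc_ids - answer.cited_doc_ids
    unexpected = answer.cited_doc_ids - case.expected_doc_ids
    return {
        "question": case.question,
        "missing": sorted(missing),        # expected docs the answer never cited
        "unexpected": sorted(unexpected),  # cited docs nobody expected (possibly stale)
        "passed": not missing,             # fail only when an expected doc is absent
    }


# Made-up example data, not taken from any real system.
case = EvalCase(
    question="Can contractors expense travel booked outside the portal?",
    expected_doc_ids={"travel-policy-2024"},
)
answer = RagAnswer(
    text="Yes, with manager approval.",
    cited_doc_ids={"travel-policy-2022"},
)
report = check_citations(case, answer)
print(report)
```

In this sketch a case fails only when an expected doc is missing. Unexpected citations are reported rather than failed, so a human can decide whether the extra doc is harmless or a stale page that should not have been retrieved.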