How to evaluate RAG answers before putting them in production

RAG demos are easy to make look good. Production is where the weird cases show up: stale docs, two pages saying different things, an answer that sounds confident but skips the one constraint the user actually needed. For internal tools, I do not trust a single accuracy number anymore. I want a small set of messy questions from real users, expected source docs, citation checks, and a way to mark…
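
A rough sketch of what one of those checks could look like, assuming each RAG answer comes back with a list of cited document IDs. Everything here is made up for illustration (`EvalCase`, `RagAnswer`, `check_case`, the file names), not taken from any library or from a real setup. It covers only the pieces named above: a real question, the docs the answer should cite, and the constraint it must not skip.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One messy question from a real user (hypothetical structure)."""
    question: str
    expected_sources: set[str]                              # doc IDs the answer should cite
    must_mention: list[str] = field(default_factory=list)   # constraints the answer must not skip


@dataclass
class RagAnswer:
    """Assumed shape of the RAG system's output."""
    text: str
    cited_sources: list[str]


def check_case(case: EvalCase, answer: RagAnswer) -> dict:
    """Compare citations and required constraints against expectations."""
    cited = set(answer.cited_sources)
    missing = case.expected_sources - cited      # expected docs that were never cited
    unexpected = cited - case.expected_sources   # cited docs nobody expected (stale page?)
    text = answer.text.lower()
    # Naive substring match; a real check would likely need human review.
    skipped = [c for c in case.must_mention if c.lower() not in text]
    return {
        "missing_sources": sorted(missing),
        "unexpected_sources": sorted(unexpected),
        "skipped_constraints": skipped,
        "passed": not missing and not skipped,
    }


# Usage: a confident answer that cites a stale doc and skips the constraint.
case = EvalCase(
    question="Can contractors expense travel booked outside the portal?",
    expected_sources={"travel-policy-2024.md"},
    must_mention=["pre-approval"],
)
answer = RagAnswer(
    text="Yes, contractors can expense travel.",
    cited_sources=["travel-policy-2019.md"],
)
result = check_case(case, answer)
print(result)
```

Reporting missing sources, unexpected sources, and skipped constraints separately, instead of folding them into one score, keeps the failure reason visible for each question.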
