How can I help you today?
Ask about company policies, report a bug, or get routed to the right team.
Evaluation Framework
Every response is scored in real time
An LLM-as-judge scores each answer on four dimensions before it reaches the user, with no manual review required.
91% avg eval score, across all queries
24 knowledge-base sections indexed, via Pinecone RAG
Top 4 chunks retrieved per query, by cosine similarity
<2s avg response time (embed → retrieve → LLM)
Judge criteria breakdown
Aggregate scores across test queries
Architecture
How it works
Most AI tools hand the model your entire knowledge base and hope it finds the right answer. This agent searches first, then reads. It passes only the four most relevant sections to the model, so answers are grounded in what your company's documentation actually says rather than in what the model guesses.
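The search-then-read step can be illustrated with a minimal sketch. It assumes each knowledge-base section already has an embedding vector and ranks sections by cosine similarity to the query, keeping the top four. The function names and toy two-dimensional vectors are hypothetical; the production agent queries Pinecone rather than scanning an in-memory list.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k_sections(query_vec, sections, k=4):
    """Return the titles of the k sections most similar to the query.

    `sections` is a list of (title, embedding) pairs (hypothetical shape).
    """
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in sections]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:k]]


# Toy data: real embeddings have hundreds or thousands of dimensions.
sections = [
    ("A", [1.0, 0.0]),
    ("B", [0.9, 0.1]),
    ("C", [0.0, 1.0]),
    ("D", [-1.0, 0.0]),
    ("E", [0.5, 0.5]),
    ("F", [0.7, 0.3]),
]
context = top_k_sections([1.0, 0.0], sections)
```

Only the sections named in `context` would be placed in the model's prompt; everything else in the knowledge base stays out of the request.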
Note
LangSmith is the observability layer: it records every API call for inspection. GPT-5-mini can call two tools mid-conversation: Linear to file bug reports and feature requests, and Slack to notify the right team channel when a request is routed. Arize receives the final response and scores it with an independent judge model.
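One way to picture the two-tool setup is a small dispatch table that maps a tool name chosen by the model to a handler. This is only a sketch: the tool names, argument shapes, and in-memory lists below are hypothetical stand-ins, and a real deployment would call the Linear and Slack APIs inside the handlers.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-ins for the external services.
created_issues: List[dict] = []
sent_notifications: List[dict] = []


def create_linear_issue(title: str, kind: str) -> str:
    """Record a bug or feature request (stand-in for a Linear API call)."""
    created_issues.append({"title": title, "kind": kind})
    return f"issue #{len(created_issues)} created"


def notify_slack(channel: str, message: str) -> str:
    """Record a team notification (stand-in for a Slack API call)."""
    sent_notifications.append({"channel": channel, "message": message})
    return f"posted to {channel}"


# Tool names the model is allowed to request (names are assumptions).
TOOLS: Dict[str, Callable[..., str]] = {
    "create_linear_issue": create_linear_issue,
    "notify_slack": notify_slack,
}


def dispatch_tool_call(name: str, arguments: dict) -> str:
    """Run the requested tool and return a text result for the model."""
    handler = TOOLS.get(name)
    if handler is None:
        return f"unknown tool: {name}"
    return handler(**arguments)


issue_result = dispatch_tool_call(
    "create_linear_issue", {"title": "Login fails", "kind": "bug"}
)
slack_result = dispatch_tool_call(
    "notify_slack", {"channel": "#eng-support", "message": "New bug filed"}
)
```

Returning a plain string for unknown tool names, rather than raising, lets the conversation continue if the model requests a tool that does not exist.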