How can I help you today?

Ask about company policies, report a bug, or get routed to the right team.


Evaluation Framework

Every response is scored in real time

An LLM-as-judge evaluates each answer across four dimensions before it reaches the user — no manual review required.

91%

Avg eval score

across all queries

KB sections indexed

via Pinecone RAG

Top 4

Chunks retrieved

per query, by cosine sim

<2s

Avg response time

embed → retrieve → LLM

Judge criteria breakdown

Aggregate scores across test queries

✓ Production ready

Relevance: Answer matches the intent of the query

94%

Accuracy: Information is factually correct per KB

93%

Completeness: All key details are included in the answer

88%

Citation quality: KB sections are correctly cited

90%
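
The four criterion scores are consistent with the 91% average eval score if that headline figure is an unweighted mean. The page does not say how the aggregate is computed, so the equal weighting below is an assumption, and `aggregate_score` is a hypothetical helper name. A minimal sketch of that arithmetic:

```python
# Per-criterion scores as listed in the breakdown (percent).
criteria = {
    "Relevance": 94,
    "Accuracy": 93,
    "Completeness": 88,
    "Citation quality": 90,
}


def aggregate_score(scores):
    """Unweighted mean of the criterion scores (assumed weighting)."""
    return sum(scores.values()) / len(scores)


overall = aggregate_score(criteria)
print(f"{overall:.2f}%")  # 91.25%, which rounds to the 91% headline figure
```

If the judge weights some criteria more heavily, the mean would be replaced by a weighted sum and the headline number would shift accordingly.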

Architecture

How it works

Most AI tools hand the model your entire knowledge base and hope it finds the right answer. This agent searches first, then reads. It only passes the four most relevant sections to the model, which means answers are grounded in what your company actually says rather than what the model guesses.
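
The search-first step can be sketched as ranking sections by cosine similarity against the query embedding and keeping the top four. This is an illustrative sketch only: it uses small in-memory toy vectors in place of a real embedding model and the Pinecone index, and the names `cosine_similarity`, `retrieve_top_k`, and the section keys are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, sections, k=4):
    """Return the names of the k sections most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in sections.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy 3-dimensional "embeddings" (assumption: real ones come from an
# embedding model and are stored in a vector index).
sections = {
    "pto-policy": [1.0, 0.0, 0.0],
    "expenses": [0.9, 0.1, 0.0],
    "bug-reporting": [0.0, 1.0, 0.0],
    "onboarding": [0.5, 0.5, 0.0],
    "security": [0.0, 0.0, 1.0],
    "travel": [0.8, 0.0, 0.2],
}

top_sections = retrieve_top_k([1.0, 0.0, 0.0], sections)
print(top_sections)
```

Only the sections returned by `retrieve_top_k` would be placed in the model's prompt; the rest of the knowledge base never reaches the model for that query.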

Note

LangSmith is the observability layer. It records every API call for inspection. GPT-5-mini can call two tools mid-conversation: Linear for bug and feature requests, and Slack to notify the right team channel when routing. Arize receives the final response and scores it with an independent judge model.

Frequently Asked Questions