Owl

How can I help you today?

Ask about company policies, report a bug, or get routed to the right team.

Evaluation Framework

Every response is scored in real time

An LLM-as-judge evaluates each answer across four dimensions before it reaches the user — no manual review required.

91% avg eval score, across all queries

24 KB sections indexed, via Pinecone RAG

Top 4 chunks retrieved per query, by cosine sim

<2s avg response time (embed → retrieve → LLM)

Judge criteria breakdown

Aggregate scores across test queries

✓ Production ready
Relevance: 94%. Answer matches the intent of the query.
Accuracy: 93%. Information is factually correct per KB.
Completeness: 88%. All key details are included in the answer.
Citation quality: 90%. KB sections are correctly cited.
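
The four criterion scores above are consistent with the 91% headline figure if the overall score is an unweighted mean. The page does not say how the judge combines its dimensions, so the sketch below is only an assumption; the `aggregate` helper and the dictionary keys are illustrative names, not part of the product.

```python
# Criterion scores as listed in the breakdown above (percent).
criteria = {
    "relevance": 94,
    "accuracy": 93,
    "completeness": 88,
    "citation_quality": 90,
}


def aggregate(scores):
    """Unweighted mean of the criterion scores (assumed, not confirmed)."""
    return sum(scores.values()) / len(scores)


overall = aggregate(criteria)
print(f"{overall:.2f}% overall, shown as {round(overall)}%")
```

A weighted judge would give a different number, so treat this as a consistency check rather than a description of the scoring pipeline.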

Architecture

How it works

Most AI tools hand the model your entire knowledge base and hope it finds the right answer. This agent searches first, then reads. It only passes the four most relevant sections to the model, which means answers are grounded in what your company actually says rather than what the model guesses.
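
To make "searches first, then reads" concrete, here is a minimal sketch of top-4 retrieval by cosine similarity. It uses plain Python and toy three-dimensional vectors in place of the real embedding model and Pinecone index; the section names, vectors, and the `top_k_sections` helper are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_sections(query_vec, sections, k=4):
    """Return the names of the k sections most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in sections.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


# Toy stand-ins for embedded knowledge-base sections (hypothetical data).
sections = {
    "pto-policy": [0.9, 0.1, 0.0],
    "expense-policy": [0.1, 0.9, 0.0],
    "remote-work": [0.7, 0.2, 0.1],
    "security": [0.0, 0.1, 0.9],
    "onboarding": [0.5, 0.5, 0.0],
    "travel": [0.2, 0.8, 0.1],
}

query = [1.0, 0.0, 0.0]  # stand-in for an embedded user question
top = top_k_sections(query, sections)
print(top)
```

Only the sections returned here would be passed to the model as context; the rest of the knowledge base never enters the prompt.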

[Architecture diagram]
Employee (CSM, AE, Support, HR, Finance) → query → OpenAI Embeddings API (text-embedding-3-small, converts text to numbers) → embed → Pinecone Vector Search (matches by meaning, 24 policy sections) → top 4 chunks → GPT-5-mini Generation (answer / route / OOS, plus tool calls) → Response (answer + eval score)
If routing a bug or feature request: Linear (bug or feature request) and Slack (notify team channel)
Eval: Arize Online Evaluation
LangSmith (observability): traces every API call

Note

LangSmith is the observability layer. It records every API call for inspection. GPT-5-mini can call two tools mid-conversation: Linear for bug and feature requests, and Slack to notify the right team channel when routing. Arize receives the final response and scores it with an independent judge model.
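
The routing step described in the note can be sketched as a small dispatcher. This is an illustration under stated assumptions, not the agent's actual code: the `Decision` type, the `handle` function, and the two injected callables are hypothetical, and the stubs below stand in for the Linear and Slack integrations rather than calling either service.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical shape of the model's output for one turn."""
    kind: str  # "answer", "route", or "oos" (out of scope)
    text: str
    team_channel: str = ""


def handle(decision, create_ticket, notify_channel):
    """Dispatch one decision; tool callables are injected by the caller."""
    if decision.kind == "route":
        # Assumed order: file the ticket first, then notify the team.
        ticket_id = create_ticket(decision.text)
        notify_channel(
            decision.team_channel,
            f"New ticket {ticket_id}: {decision.text}",
        )
        return f"Filed {ticket_id} and notified {decision.team_channel}."
    if decision.kind == "oos":
        return "That request is outside what this assistant covers."
    return decision.text


# In-memory stubs standing in for the real integrations.
tickets = []
messages = []


def fake_create_ticket(summary):
    tickets.append(summary)
    return f"TICKET-{len(tickets)}"


def fake_notify_channel(channel, message):
    messages.append((channel, message))


reply = handle(
    Decision("route", "Export button crashes", "#support-bugs"),
    fake_create_ticket,
    fake_notify_channel,
)
print(reply)
```

Injecting the tool functions keeps the dispatch logic testable without network access, and a tracing layer could wrap the same callables to record each call.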

Frequently Asked Questions