Getting AI Context Right — Nick Bellistri

The Problem

Orum's parallel dialer connects sales reps to live prospects at high volume — the core product worked. But after every call, SDRs were expected to log notes that would inform the next touch, help sales leaders coach, and give RevOps the account-level data they needed.

The data told a clear story: SDRs were only taking notes 40% of the time, and spending 2–6 minutes on post-call disposition when they did. That's 15–20% of a rep's working day on a task they kept cutting when quota pressure hit.

40% of calls resulted in notes being taken

2–6 min spent dispositioning each call

15–20% of SDR workday on data entry

Without reliable notes, organizations couldn't build the account intelligence to improve their sales motion. The data gap compounded — and Orum's competitors were moving faster on this exact problem.

The Insight

Most AI product failures aren't model failures — they're context failures. The model gets thin inputs and produces thin outputs, and the team blames the technology. The fix isn't a better model. It's better context architecture.

Orum's first attempt at AI meeting summaries generated them directly from raw transcripts, and the results were predictably poor: speaker order was jumbled, filler was included, and there was no signal about what actually happened. Users opened the summaries once and stopped using them.

The problem wasn't the AI. It was that we were asking the model to summarize everything rather than extract specific things. The gap between "summarize this call" and "extract the talk time ratio, key objections, committed next steps, and a coaching signal" is the difference between a feature that fails and one that sticks.

The Approach

I ran discovery with AEs and SDRs across market segments, plus Orum's internal sales team — 25 reps total. Three themes were consistent:

Tools that increased call volume, not just note quality
Automation of routine tasks so they could focus on higher-value conversation work
Better pre-call context so they were prepared when a prospect picked up

The research reframed the product question. The goal wasn't "better notes" — it was saving 2–5 minutes per call so reps could dial again faster. Notes were the mechanism, not the outcome.

This led to the core design: AI-generated summaries structured around specific outputs that reps and managers actually used — not a generic summary of everything that was said.

Building It

Prompt Architecture

I served as the prompt engineer throughout. The structured extraction approach replaced the open-ended summarization with a specific schema: talk time ratio, key objections raised, committed next steps, and a coaching signal based on how the rep handled pivots.
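
To make the schema concrete, here is a minimal Python sketch of what a structured extraction contract could look like. The four field names come from the schema described above; everything else (the `CallExtraction` class, the field descriptions, the prompt wording, and the JSON response format) is a hypothetical illustration, not Orum's production prompt.

```python
import json
from dataclasses import dataclass

# Field names follow the schema in the text; the descriptions are assumptions.
SCHEMA = {
    "talk_time_ratio": "rep's share of total talk time, a number from 0.0 to 1.0",
    "key_objections": "list of objections the prospect raised",
    "committed_next_steps": "list of next steps either party committed to",
    "coaching_signal": "one sentence on how the rep handled pivots",
}


@dataclass
class CallExtraction:
    talk_time_ratio: float
    key_objections: list[str]
    committed_next_steps: list[str]
    coaching_signal: str


def build_prompt(transcript: str) -> str:
    """Ask for specific fields instead of an open-ended summary."""
    field_list = "\n".join(f"- {name}: {desc}" for name, desc in SCHEMA.items())
    return (
        "Extract the following fields from the sales call transcript.\n"
        "Respond with a single JSON object and no other text.\n"
        f"{field_list}\n\nTranscript:\n{transcript}"
    )


def parse_extraction(raw: str) -> CallExtraction:
    """Validate the model's JSON reply; raise ValueError if it breaks the schema."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in SCHEMA if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    ratio = float(data["talk_time_ratio"])
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("talk_time_ratio must be between 0.0 and 1.0")
    return CallExtraction(
        talk_time_ratio=ratio,
        key_objections=[str(item) for item in data["key_objections"]],
        committed_next_steps=[str(item) for item in data["committed_next_steps"]],
        coaching_signal=str(data["coaching_signal"]),
    )
```

Rejecting replies that do not match the schema keeps malformed output from reaching a rep, which is one reason to prefer a fixed schema over free-form summary text.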

Getting this right required iterating against real call transcripts — running them through prompts and comparing output against notes the SDRs had actually saved. We reverse-engineered accuracy from the ground up.
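
The comparison step can be sketched in a few lines of Python. The metric we used is not specified here, so this example assumes a simple token-level sequence similarity from the standard library's `difflib`; the function names and the normalization rule are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.80  # the 80% bar discussed in this write-up


def tokens(text: str) -> list[str]:
    """Lowercase and drop punctuation so formatting differences are ignored."""
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity(ai_summary: str, rep_notes: str) -> float:
    """Return a 0.0-1.0 score; an assumed stand-in for the real metric."""
    matcher = SequenceMatcher(None, tokens(ai_summary), tokens(rep_notes), autojunk=False)
    return matcher.ratio()


def meets_bar(ai_summary: str, rep_notes: str, threshold: float = SIMILARITY_THRESHOLD) -> bool:
    return similarity(ai_summary, rep_notes) >= threshold
```

A lexical metric like this rewards shared wording, not shared meaning, so a production evaluation might reasonably use something richer. The point of the sketch is only the shape of the loop: model output on one side, rep-saved notes on the other, and a numeric score between them.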

The Determinism Trade-off

One unexpected constraint: pushing for higher accuracy by making the output more deterministic made summaries more rigid and less useful for the long tail of unusual call types. We landed on an 80% similarity threshold: accurate enough to be relied on, flexible enough to handle edge cases. Pushing higher degraded output quality in ways that were harder to explain to users than "sometimes it misses something."

Shadow Testing

Before shipping any UI, we released the feature to production without a visible interface. The AI generated summaries in the background; we compared them against what reps actually saved and evaluated similarity nightly. This gave us a real signal on accuracy before any user ever saw the output.
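
A nightly shadow evaluation of this kind might look like the sketch below. The record shape, the report fields, and the Jaccard scorer used in the example are all hypothetical; the only details taken from the process above are that summaries were generated silently, compared with what reps actually saved, and scored for similarity each night.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ShadowRecord:
    call_id: str
    ai_summary: str            # generated in the background, never shown to the rep
    rep_notes: Optional[str]   # None when the rep saved no notes for the call


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a simple placeholder for the real similarity metric."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def nightly_report(
    records: list[ShadowRecord],
    score: Callable[[str, str], float] = jaccard,
    threshold: float = 0.80,
) -> dict:
    """Score only the calls where a rep saved notes to compare against."""
    scores = [score(r.ai_summary, r.rep_notes) for r in records if r.rep_notes]
    compared = len(scores)
    return {
        "compared": compared,
        "skipped": len(records) - compared,
        "mean_similarity": sum(scores) / compared if compared else 0.0,
        "pass_rate": sum(s >= threshold for s in scores) / compared if compared else 0.0,
    }
```

Because reps saved notes on only a minority of calls, a report like this should surface how many calls were skipped, so a high mean over a small sample is not mistaken for broad accuracy.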

Internal Beta

We launched with Orum's internal SDR team, evaluated similarity scores each night, and shipped prompt refinements in response. Similarity improved from 70% to 85% over two weeks, which cleared the bar for external release.

External Launch

We shipped a beta to clients on the top-tier package. Initial interest was strong, with two early challenges that required rapid iteration: UI placement wasn't immediately obvious (slowing adoption in the first week), and enterprise clients raised data privacy questions about call recording. Both were addressed before GA.

Post-launch, we incorporated RLHF and human-in-the-loop processes to keep improving accuracy without requiring a full model retrain.

Results

Within the first three months post-launch:

30% efficiency gain for end users

90% note coverage — up from 40%

The feature became the top-cited reason customers renewed. The efficiency gain wasn't just time saved per call — it was the downstream effect: sales leaders finally had reliable account-level data to work with, and RevOps could actually build on it.

Lessons

01 Context architecture is the product. The model was almost irrelevant. What mattered was the structure we imposed on the input — speaker attribution, call type, schema of desired outputs. That's the work.
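
As an illustration of imposing structure on the input, the Python sketch below turns diarized call turns into a speaker-attributed context block with the call type stated up front. The `Turn` shape, the filler list, the output layout, and the choice to compute talk time in code (so the model does not have to estimate it) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str    # e.g. "rep" or "prospect"
    start_s: float  # turn start, seconds into the call
    end_s: float    # turn end, seconds into the call
    text: str


# A deliberately short, hypothetical filler list.
FILLER = re.compile(r"\b(?:um+|uh+|you know)\b,?\s*", re.IGNORECASE)


def talk_time_ratio(turns: list[Turn]) -> float:
    """Rep's share of total talk time, computed from timestamps."""
    rep = sum(t.end_s - t.start_s for t in turns if t.speaker == "rep")
    total = sum(t.end_s - t.start_s for t in turns)
    return rep / total if total else 0.0


def build_context(call_type: str, turns: list[Turn]) -> str:
    """Order turns by time, strip filler, and label every line with its speaker."""
    ordered = sorted(turns, key=lambda t: t.start_s)
    lines = [
        f"call_type: {call_type}",
        f"rep_talk_time_ratio: {talk_time_ratio(ordered):.2f}",
        "transcript:",
    ]
    for turn in ordered:
        cleaned = FILLER.sub("", turn.text).strip()
        if cleaned:
            lines.append(f"[{turn.speaker}] {cleaned}")
    return "\n".join(lines)
```

Sorting by timestamp and labeling speakers addresses the jumbled speaker order and filler that undermined the first raw-transcript attempt.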

02 Ship to production before shipping to users. Shadow testing — generating output without showing it — gave us real accuracy data before any user saw a bad summary. It's the right pattern for AI features where quality sets the floor.

03 Determinism and quality are in tension. Higher accuracy thresholds made summaries brittle. 80% was the right calibration — not because it sounds good, but because real data showed that pushing higher degraded the cases that matter most.

04 Enterprise users need opt-out on day one. Some SDRs had established note-taking systems. Forcing AI-generated notes created friction with power users. Opt-out should have been in the MVP.