
Day-one of Lawrence-2 in the US — what we learned.

Adoption was clean for two of three users, output quality was strong, one Anthropic outage took the agent off the air mid-stream, and the read tool silently returned an empty PDF on the showcase memo. Cost numbers are inflated by known Langfuse accounting bugs.

Date · 2026-05-13
Window · 2026-05-12 (rollout day)
Env · prd
Prompt · agent/lawrence-2
Users · Georgia, Sabrina, Scott
Traces reviewed · 14
Author · Adolfo
Status · Final

One-liner
14 prod traces across the three users since the vfs-agent flag landed for them on 2026-05-12. Sabrina and Georgia engaged substantively; Scott made one Lawrence-2 turn and hasn't returned. Output quality is strong — multi-section legal memos, statute-cited drafts, scope-guardrail behaviour. The two real defects to fix: (1) the orchestrator leaks a raw MidStreamFallbackError into the trace when Anthropic overloads, and (2) the read tool returned a zero-length base64 envelope for one PDF instead of failing explicitly. reference_count = 0 in 14/14 traces — citations exist in prose, not in the structured channel. All cost figures below are Langfuse-displayed and over-counted by two known accounting issues — see the cost section before reading them as real spend.

14 · Lawrence-2 turns · across 3 users
13 · Completed · incl. 1 silent tool fail
1 · Hard failure · Anthropic overload
0 / 14 · Structured citations · all reference_count = 0

1. Adoption picture

Per-user Lawrence-2 turns since the flag landed on 2026-05-12:

User     L2 turns  Models                       First L2 use
Georgia  2         Opus 4.7                     2026-05-12 20:30 UTC
Sabrina  11        7× Sonnet 4.6 · 4× Opus 4.7  2026-05-12 18:42 UTC
Scott    1         Opus 4.7                     2026-05-12 20:15 UTC

Sabrina and Georgia engaged substantively on day one. Scott made one turn (the showcase memo, below) and hasn't returned to the sidebar in the window — worth a check-in to understand why.

2. Per-user behaviour

Sabrina · 3 sessions, exploratory then production

Session 1 — matter mat_4ew0ubd082rnfksu

18:42–18:51 UTC · 5 turns · Sonnet 4.6 · $0.10–$0.55 / turn

  1. "Hi Lawrence how can you help me?" → intro (no tools)
  2. "How do I get paid?" → read + search → answered with concrete fee numbers ($100 flat, 50% platform fee)
  3. "How long does it take me to get paid…" → web_search, correctly deflected to Lawhive support
  4. "can you recommend follow on work for this client?" → read + search, said it was too early to recommend
  5. "How do i draft a follow on fee agreement?" → no tools, deflected as a platform-ops question

Turns 3 and 5 were platform meta-questions, not legal ones. The first thing US users want from the agent is help learning the platform, which is worth feeding into the Lawhive operations knowledge base.

Session 2 — matter mat_a10lschpo6ny2q5j

18:52–19:47 UTC · 2 turns · Sonnet 4.6

  1. "What is good follow on work opportunity here?" → vector_search + search → identified an open USPTO Office Action as the upsell
  2. "how do you know shirazi filed it?" → FAIL MidStreamFallbackError: Anthropic Overloaded · 5.4s · $0

This is the Claude outage that triggered the temporary Sonnet → Opus swap for the evening.

Session 3 — matter mat_nphxk37yxpjgw0m9 · 23:10–23:12 UTC · 4 turns · Opus 4.7

  1. "what are upsell opportunities here?" → structured opportunity table.
  2. "Can you draft me a lease for this?" → scope challenge: flagged Tenant Starter Pack as complete + drafting excluded, suggested lease termination instead.
  3. "draft me a lease so that sabrina can provide it for her landlord" → load_skill (skill:drafting) + search → flagged again, then proceeded.
  4. "A full residential lease" → search_legislation + create + vector_search → drafted full Florida lease with statutory cross-refs (§83.49(2)(d), §404.056(5), §83.47, lead paint addendum, federal 24 C.F.R. §35), written to documents://mat_nphxk37yxpjgw0m9. 233 events, 86s.

The scope-guardrail behaviour is impressive: it is the kind of thing senior lawyers notice. Worth surfacing in onboarding demos.

Georgia · 1 session, IP-classification advice

Matter mat_42gii0ddf1rfsbd1 · 20:30–20:32 UTC · 2 turns · Opus 4.7

  1. "how would you classify these goods for a filing at the USPTO cigar intelligence platform…" → 3× web_search (TMEP, WIPO Class 42 & 45) + 2× vector_search → structured multi-class recommendation (Class 9 primary, Class 42 SaaS secondary, Class 45 social networking) with concrete USPTO ID wording per class. 93 events, 30s.
  2. "yes, please draft" → 1 iter, no tools → asked an A/B clarifying question before drafting. Sensible, but feels slightly verbose given the original turn had already proposed both options.

Cleanest example of the value proposition: research-grounded advice that produces draft-ready language with citable URLs. Worth saving as a demo trace.

Scott · 1 Lawrence-2 turn, the showcase memo

Matter mat_tr5ot7st7zindn68 · 20:15 UTC · Opus 4.7 · $8.60 (Langfuse-displayed) · 101.5s · 7 iterations · 277 events · 11 tool calls

"Can you draft a memo explaining the next steps in the appeals process for the opposing party, Mr. Averill?"

The agent ran a full research-and-draft pipeline: web search → legislation search → vector search over case docs and legal library → file read on appellate clerk notice → file read on notice of appeal (this one returned empty) → web/vector fallback → memo creation. The final memo, a six-section internal note, cites A.R.S. §12-1177(A) and §33-811(B), the cases Curtis v. Morris, Merrifield v. Merrifield and Olds Bros. Lumber Co. v. Rushing, and ARCAP 15(a)(2).

This single trace exercises almost the entire Lawrence-2 surface in one go. It is the most expensive trace in the dataset by ~5× (see the cost caveat below).

3. Failures & quality issues

Hard failure · MidStream Anthropic overload (Sabrina, trace ac66f580…)

litellm.ServiceUnavailableError: litellm.MidStreamFallbackError: litellm.InternalServerError: AnthropicException - Overloaded
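
A minimal sketch of how the orchestrator boundary could turn this exception into a typed event rather than letting the raw error reach the trace. The `MidStreamFallbackError` class below is a local stand-in for the litellm exception named in the error above, and `ProviderOverloaded`, `TextDelta` and `guard_stream` are hypothetical names used only for illustration; the real stream consumer in packages/agent-definition may be shaped differently.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Union


class MidStreamFallbackError(Exception):
    """Local stand-in for the litellm exception of the same name."""


@dataclass
class TextDelta:
    text: str


@dataclass
class ProviderOverloaded:
    # Hypothetical typed event; field names are illustrative assumptions.
    type: str = "provider_overloaded"
    retryable: bool = True
    message: str = "The model provider is overloaded. Please retry."


Event = Union[TextDelta, ProviderOverloaded]


async def guard_stream(chunks: AsyncIterator[str]) -> AsyncIterator[Event]:
    """Forward text chunks; convert a mid-stream provider failure into a
    single typed event instead of propagating the raw exception."""
    try:
        async for chunk in chunks:
            yield TextDelta(chunk)
    except MidStreamFallbackError:
        yield ProviderOverloaded()


async def failing_stream() -> AsyncIterator[str]:
    # Simulates a provider that dies after one chunk.
    yield "partial answer "
    raise MidStreamFallbackError("AnthropicException - Overloaded")


async def collect(stream: AsyncIterator[str]) -> list:
    return [event async for event in guard_stream(stream)]


events = asyncio.run(collect(failing_stream()))
print([type(e).__name__ for e in events])
```

The text that streamed before the failure is preserved, and the frontend gets one event it can key a retry control on.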

Improvements

Silent failure · Empty PDF read (Scott, read on matfil_1c79x1rn3lacrwd6)

The read tool returned:

{"file": {"file_data": "data:application/pdf;base64,"},
 "name": "2026-05-01 Notice of Appeal",
 "extension": "pdf", ...}

A well-formed envelope with an empty base64 payload. The agent worked around it via vector search and surfaced "the notice of appeal PDF didn't return readable content" to Scott. Right user-facing behaviour from the model, but the underlying failure should not happen.
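
A sketch of the explicit failure the read tool could raise for an envelope like the one above, assuming the `file_data` field is always a base64 data URI. `decode_file_data` and `EmptyFileError` are hypothetical names, not the tool's actual API.

```python
import base64


class EmptyFileError(ValueError):
    """Hypothetical error raised instead of returning an empty envelope."""


def decode_file_data(file_data: str, name: str) -> bytes:
    """Decode a `data:<mime>;base64,<payload>` URI and fail loudly when the
    payload is missing or decodes to zero bytes."""
    header, sep, payload = file_data.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError(f"{name}: not a base64 data URI")
    if not payload.strip():
        raise EmptyFileError(f"{name}: base64 payload is empty")
    raw = base64.b64decode(payload, validate=True)
    if not raw:
        raise EmptyFileError(f"{name}: payload decoded to zero bytes")
    return raw


try:
    decode_file_data("data:application/pdf;base64,", "2026-05-01 Notice of Appeal")
except EmptyFileError as exc:
    print(f"read failed: {exc}")
```

With a check like this the agent would see a tool error on the first read instead of discovering the empty content a few iterations later.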

Investigate

Gap · Citations not structured (all 14 traces, reference_count = 0)

The model output is rich with citations (statutes, case law, regulations, USPTO Manual sections) but the structured references array is always []. This matches the LEX-264 / granular-citations workstream. Concrete impact:

The parser pipeline exists in agent-definition and is wired through to trace metadata. reference_count = 0 means the model isn't emitting the marker syntax yet. Either the Langfuse agent/lawrence-2 prompt isn't instructing it to, or the chat-agent-embed-citations flag isn't on. Confirm both.

Polish · Streaming-text artefact in the saved final response

Multiple traces show final_response_content as a wall of concatenated text: inter-tool narration ("Let me check the relevant Florida statutory framework…", "I have enough research…") is inlined directly before the next narrative paragraph, with no separator. In the live UI this probably renders cleanly, because the FE shows tool-running indicators between segments. The saved transcript is harder to read, which matters for Langfuse review, customer audit trails, and matter-note exports. Consider persisting discrete typed events instead of one flat string.
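
As an illustration of what typed persistence could look like, the sketch below stores each segment with a kind and renders a readable transcript from the list. `TranscriptEvent`, its `kind` values and `render_transcript` are assumptions for this example, not the existing persistence schema.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class TranscriptEvent:
    # Hypothetical event shape; `kind` separates narration from answer text.
    kind: Literal["narration", "tool_call", "answer"]
    content: str


def render_transcript(events: list) -> str:
    """Render saved events as text, one block per event, with narration and
    tool calls visibly marked so they cannot run into the answer."""
    prefix = {"narration": "[narration] ", "tool_call": "[tool] ", "answer": ""}
    return "\n\n".join(prefix[e.kind] + e.content for e in events)


events = [
    TranscriptEvent("narration", "Let me check the relevant Florida statutory framework…"),
    TranscriptEvent("tool_call", "search_legislation"),
    TranscriptEvent("narration", "I have enough research…"),
    TranscriptEvent("answer", "Draft lease text goes here."),
]
print(render_transcript(events))
```

A flat string can still be derived from the event list for consumers that need one, while audit and export views get the boundaries back.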

UX nits

4. Cost / latency

Heads-up: costs below are Langfuse-displayed and known to be over-counted.

Per the "Anthropic Prompt Caching — Investigation & Implementation" Notion doc, two known accounting issues inflate them:

  1. Langfuse #12996 · claude-sonnet-4-6 still has the deprecated Large Context tier ($6/$22.50 per Mtok) for input >200k tokens. Real Anthropic spend ≈ Langfuse displayed × 0.5 on Sonnet long-context traces.
  2. LiteLLM #7790 · agents PR #564 — streaming consumer broke before Anthropic's late usage chunk arrived, so cache_read_input_tokens were missed and the same tokens got billed as fresh input. Verified reductions once captured: cross-turn follow-up $0.58 → $0.05 (91%), multi-iteration loops ~$2 → $0.27 (87%).

Treat the numbers below as upper bounds, not real spend. The Opus-4-7 traces should be checked against per-iteration cache hit ratio once PR #564 lands.
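
To make the second issue concrete, here is a small worked example of how dropping cache_read_input_tokens inflates the displayed input cost. All prices and token counts are illustrative assumptions chosen for round numbers; they are not taken from the traces or from a price list.

```python
def input_cost_usd(fresh_tokens: int, cache_read_tokens: int,
                   input_price_per_mtok: float,
                   cache_read_price_per_mtok: float) -> float:
    """Input-side cost of one call, pricing cached reads separately."""
    return (fresh_tokens * input_price_per_mtok
            + cache_read_tokens * cache_read_price_per_mtok) / 1_000_000


# Illustrative numbers only (assumptions, not measured values).
INPUT_PRICE = 3.00       # USD per Mtok of fresh input
CACHE_READ_PRICE = 0.30  # USD per Mtok of cached input
PROMPT_TOKENS = 100_000
CACHED_TOKENS = 90_000

# Usage chunk missed: every prompt token is counted as fresh input.
displayed = input_cost_usd(PROMPT_TOKENS, 0, INPUT_PRICE, CACHE_READ_PRICE)

# Usage chunk captured: cached tokens are priced at the cache-read rate.
actual = input_cost_usd(PROMPT_TOKENS - CACHED_TOKENS, CACHED_TOKENS,
                        INPUT_PRICE, CACHE_READ_PRICE)

reduction = 1 - actual / displayed
print(f"displayed ${displayed:.3f}, actual ${actual:.3f}, reduction {reduction:.0%}")
```

The size of the gap depends on the cache hit ratio, which is why the per-iteration ratio on the Opus traces is the number to check.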

User     Turns  Cost (USD, Langfuse-displayed)  Median latency (s)  Max latency (s)
Georgia  2      $3.65                           23.9                30.5
Sabrina  11     $5.08                           13.7                86.2
Scott    1      $8.60                           101.5               101.5
Total    14     $17.34

Even with the over-count caveat, the shape is informative: multi-iteration drafting turns are the dominant relative cost driver (Sabrina's lease, Scott's memo). What we don't yet know — and should check once PR #564 is in — is what fraction of the per-iteration input on Scott's 7-iteration trace was cache_read vs fresh. If caching was already working server-side but uncounted by Langfuse, his real spend is likely well under $5.

5. Top recommendations

Prioritised by impact on adoption / quality / cost:

  1. Check in with Scott. One Lawrence-2 turn then silence in the window. Worth understanding whether that's a workload thing or whether something put him off after the showcase memo. The Notice-of-Appeal PDF failure (#2) is a candidate contributor.
  2. Fix the empty-PDF read failure. The cause is either the storage path for matfil_1c79x1rn3lacrwd6 or the read tool's PDF extraction. The user-visible workaround was OK, but working around it cost several iterations.
  3. Handle Anthropic overload gracefully. Wrap MidStreamFallbackError at the orchestrator boundary; verify the cross-provider fallback chain works; expose a "retry" control in the sidebar for mid-stream failures. This is the failure mode that forced last night's outage-driven Sonnet→Opus swap — surviving Anthropic overloads in-place removes the need for emergency prompt-version flips.
  4. Ship structured citations. All 14 traces emit textually-cited output but reference_count = 0. The pipeline exists; the model just isn't emitting the marker syntax yet. Largest remaining production-quality gap.
  5. Land agents PR #564 and bump Langfuse for #13367. Today's displayed costs are inflated by the cache_read drop + the stale Sonnet 4.6 large-context tier. Cost decisions (model routing, budgets) shouldn't be made off Langfuse numbers until both fixes are in. Once they are, revisit per-turn cost budgets.
  6. Persist orchestrator events as structured items instead of one concatenated final_response_content. Helps Langfuse review and matter-side transcripts.
  7. Suppress redundant scope re-flags. Once a user has acknowledged a scope concern in a thread, downgrade subsequent flags to a one-liner.

Was: "confirm Sonnet→Opus was intentional" — confirmed. Outage-driven, since reverted on the Langfuse prompt. No action.

6. Where to look

High-level orientation — see the full Notion write-up for file/line pointers and the long-form analysis.

Request flow

Sidebar · legal-os · Vercel AI SDK
User types in the agent-mode sidebar; frontend smooths the streamed tokens before render.

↓ POST /lawrence/chat

lawrence-api · BFF inside platform-v3
Auth · matter access check · thread persistence · title generation · stream tee for the final-message save.

↓ POST /agents/chat

agent-gateway · FastAPI · agents repo
Thin streaming passthrough to the chat agent.

↓ POST /agents/chat/generate

Chat agent (ChatAgent) · agents repo
The Lawrence-2 orchestrator.
  • Prompt selection from Langfuse: agent/lawrence-2 · agent/lawrence-2-read-only · agent/system (depending on surface, mode and the vfs-agent flag).
  • Iteration loop with tools: search · vector_search · search_legislation · web_search · read · create · edit · stat · load_skill.
  • Streams via packages/agent-definition (litellm → Anthropic / OpenAI).
  • Citation parser on the stream; retrieval via MatterRetriever (CCO sourced from lawrence-engine).

Surface footgun

All 14 traces resolved the write-enabled agent/lawrence-2 prompt — the new sidebar intentionally omits surface: "sidebar" from the request because sending it would route to agent/lawrence-2-read-only and strip the VFS write tools. The field name suggests the opposite, so it's worth a comment in the sidebar submit hook.
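
The routing can be pictured with the sketch below. It is a hypothetical reconstruction, not the actual selection code: it ignores mode entirely, and the assumption that a disabled vfs-agent flag falls back to agent/system is a guess made only to keep the example complete.

```python
from typing import Optional


def select_prompt(surface: Optional[str], vfs_agent_enabled: bool) -> str:
    """Hypothetical prompt routing; the real logic also depends on mode."""
    if not vfs_agent_enabled:
        # Assumption: without the flag the request uses the generic prompt.
        return "agent/system"
    if surface == "sidebar":
        # Sending the surface field routes to the read-only prompt,
        # which strips the VFS write tools.
        return "agent/lawrence-2-read-only"
    return "agent/lawrence-2"


# The new sidebar omits `surface`, so it resolves the write-enabled prompt.
print(select_prompt(None, vfs_agent_enabled=True))
print(select_prompt("sidebar", vfs_agent_enabled=True))
```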

Which area owns each finding

Finding · Where to look

MidStream Anthropic error leaked into trace (#1) · Backend orchestrator stream consumer: the single user-facing error emit in packages/agent-definition. Wrap MidStreamFallbackError there into a typed provider_overloaded event; verify the litellm cross-provider fallback chain reaches a non-Anthropic provider.

Empty PDF read (#2) · VFS read tool + lawrence-api file resolver. Check whether bytes are missing on storage (S3) or the read path is returning an empty envelope; either way the tool should fail explicitly instead of returning data:application/pdf;base64, with no payload.

Citations not structured (#3) · Langfuse agent/lawrence-2 prompt (does it instruct the model to emit the citation marker syntax?) and the chat-agent-embed-citations feature flag. Parser + trace roll-up in agent-definition already exists.

Concatenated narration in saved transcript (#4) · lawrence-api stream adapter: the persisted UIMessage tee collapses tool-call boundaries at save time. Live render is fine; saved / audit form isn't.

Per-turn cost visibility (#5) · Once agents PR #564 lands, revisit the iteration loop in the chat agent: today it has an iteration cap (10) but no cost ceiling. Real-cost figures from corrected Langfuse will dictate whether a soft ceiling is needed.

Redundant scope re-flag (#7) · Langfuse agent/lawrence-2 prompt + the drafting skill content loaded via load_skill. Prompt-side behaviour, not orchestrator.
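
If a soft cost ceiling does turn out to be needed for the per-turn cost finding (#5), it could sit next to the existing iteration cap roughly as sketched here. `run_loop`, `LoopResult` and the $5 default are hypothetical; only the cap of 10 comes from the finding above.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LoopResult:
    iterations: int
    cost_usd: float
    stop_reason: str  # "done", "iteration_cap", or "cost_ceiling"


def run_loop(step: Callable[[int], Tuple[float, bool]],
             max_iterations: int = 10,
             soft_cost_ceiling_usd: float = 5.0) -> LoopResult:
    """Run step(i) until it reports completion, the iteration cap is hit,
    or accumulated cost reaches the soft ceiling.

    step(i) performs one iteration and returns (cost_usd, finished).
    """
    total = 0.0
    for i in range(1, max_iterations + 1):
        cost, finished = step(i)
        total += cost
        if finished:
            return LoopResult(i, total, "done")
        if total >= soft_cost_ceiling_usd:
            # Soft ceiling: stop starting new iterations once crossed.
            return LoopResult(i, total, "cost_ceiling")
    return LoopResult(max_iterations, total, "iteration_cap")


# A turn that never finishes and costs $1.50 per iteration.
runaway = run_loop(lambda i: (1.5, False))
# A turn that finishes on its seventh iteration at $0.50 each.
normal = run_loop(lambda i: (0.5, i == 7))
print(runaway.stop_reason, normal.stop_reason)
```

The ceiling is checked after each iteration, so a turn can overshoot it by at most one iteration's cost; a hard limit would need a pre-call estimate instead.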

When a Lawrence-2 trace surfaces trouble — quick triage