
Day one of Lawrence-2 in the US: what we learned.

Adoption was clean for two of three users, output quality was strong, one Anthropic outage took the agent off the air mid-stream, and the read tool silently returned an empty PDF on the showcase memo. Cost numbers are inflated by known Langfuse accounting bugs.

Date · 2026-05-13
Window · 2026-05-12 (rollout day)
Env · prd
Prompt · agent/lawrence-2
Users · Georgia, Sabrina, Scott
Traces reviewed · 14
Author · Adolfo
Status · Final

One-liner

14 prod traces across the three users since the vfs-agent flag landed for them on 2026-05-12. Sabrina and Georgia engaged substantively; Scott made one Lawrence-2 turn and hasn't returned. Output quality is strong — multi-section legal memos, statute-cited drafts, scope-guardrail behaviour. The two real defects to fix: (1) the orchestrator leaks a raw MidStreamFallbackError into the trace when Anthropic overloads, and (2) the read tool returned a zero-length base64 envelope for one PDF instead of failing explicitly. reference_count = 0 in 14/14 traces — citations exist in prose, not in the structured channel. All cost figures below are Langfuse-displayed and over-counted by two known accounting issues — see the cost section before reading them as real spend.

Lawrence-2 turns · 14 · across 3 users
Completed · 13 · incl. 1 silent tool fail
Hard failure · 1 · Anthropic overload
Structured citations · 0 / 14 · all reference_count = 0

1. Adoption picture

Per-user Lawrence-2 turns since the flag landed on 2026-05-12:

User	L2 turns	Models	First L2 use
Georgia	2	Opus 4.7	2026-05-12 20:30 UTC
Sabrina	11	7× Sonnet 4.6 · 4× Opus 4.7	2026-05-12 18:42 UTC
Scott	1	Opus 4.7	2026-05-12 20:15 UTC

Sabrina and Georgia engaged substantively on day one. Scott made one turn (the showcase memo, below) and hasn't returned to the sidebar in the window — worth a check-in to understand why.

2. Per-user behaviour

Sabrina · 3 sessions, exploratory then production

Session 1 — matter `mat_4ew0ubd082rnfksu`

18:42–18:51 UTC · 5 turns · Sonnet 4.6 · $0.10–$0.55 / turn

"Hi Lawrence how can you help me?" → intro (no tools)
"How do I get paid?" → read + search → answered with concrete fee numbers ($100 flat, 50% platform fee)
"How long does it take me to get paid…" → web_search, correctly deflected to Lawhive support
"can you recommend follow on work for this client?" → read + search, said it was too early to recommend
"How do i draft a follow on fee agreement?" → no tools, deflected as a platform-ops question

Turns 3 and 5 were platform meta-questions, not legal ones. The first thing US users want from the agent is to learn the platform, which is worth feeding into the Lawhive operations knowledge base.

Session 2 — matter `mat_a10lschpo6ny2q5j`

18:52–19:47 UTC · 2 turns · Sonnet 4.6

"What is good follow on work opportunity here?" → vector_search + search → identified an open USPTO Office Action as the upsell
"how do you know shirazi filed it?" → FAIL MidStreamFallbackError: Anthropic Overloaded · 5.4s · $0

This is the Claude outage that triggered the temporary Sonnet → Opus swap for the evening.

Session 3 — matter `mat_nphxk37yxpjgw0m9`

23:10–23:12 UTC · 4 turns · Opus 4.7

"what are upsell opportunities here?" → structured opportunity table.
"Can you draft me a lease for this?" → scope challenge: flagged Tenant Starter Pack as complete + drafting excluded, suggested lease termination instead.
"draft me a lease so that sabrina can provide it for her landlord" → load_skill (skill:drafting) + search → flagged again, then proceeded.
"A full residential lease" → search_legislation + create + vector_search → drafted full Florida lease with statutory cross-refs (§83.49(2)(d), §404.056(5), §83.47, lead paint addendum, federal 24 C.F.R. §35), written to documents://mat_nphxk37yxpjgw0m9. 233 events, 86s.

The scope-guardrail behaviour is genuinely impressive, the kind of thing senior lawyers notice. Worth surfacing in onboarding demos.

Georgia · 1 session, IP-classification advice

Matter `mat_42gii0ddf1rfsbd1` · 20:30–20:32 UTC · 2 turns · Opus 4.7

"how would you classify these goods for a filing at the USPTO cigar intelligence platform…" → 3× web_search (TMEP, WIPO Class 42 & 45) + 2× vector_search → structured multi-class recommendation (Class 9 primary, Class 42 SaaS secondary, Class 45 social networking) with concrete USPTO ID wording per class. 93 events, 30s.
"yes, please draft" → 1 iter, no tools → asked an A/B clarifying question before drafting. Sensible, but feels slightly verbose given the original turn had already proposed both options.

Cleanest example of the value proposition: research-grounded advice that produces draft-ready language with citable URLs. Worth saving as a demo trace.

Scott · 1 Lawrence-2 turn, the showcase memo

Matter `mat_tr5ot7st7zindn68` · 20:15 UTC · Opus 4.7 · $8.60 (Langfuse-displayed) · 101.5s · 7 iterations · 277 events · 11 tool calls

"Can you draft a memo explaining the next steps in the appeals process for the opposing party, Mr. Averill?"

The agent ran a full research-and-draft pipeline: web search → legislation search → vector search over case docs and legal library → file read on the appellate clerk notice → file read on the notice of appeal (which returned empty) → web/vector fallback → memo creation. The final memo, a six-section internal note, cites A.R.S. §12-1177(A) and §33-811(B), ARCAP 15(a)(2), and three cases: Curtis v. Morris, Merrifield v. Merrifield, and Olds Bros. Lumber Co. v. Rushing.

This single trace exercises almost the entire Lawrence-2 surface in one go. It is also the most expensive trace in the dataset by ~5× (see the cost caveat below).

3. Failures & quality issues

Hard failure · MidStream Anthropic overload (Sabrina, trace ac66f580…)

litellm.ServiceUnavailableError: litellm.MidStreamFallbackError: litellm.InternalServerError: AnthropicException - Overloaded

Error originated on orchestrator-call (GENERATION) and propagated up iteration-1 → chat-response.
The stored trace output is {"error": "...", "status": "failed"}, i.e. the raw litellm error string lands in Langfuse. The user-facing emit in packages/agent-definition is the generic "Error generating response" string, so the user did not see the raw payload, but they also got no actionable cue or retry control.
The error name MidStreamFallbackError suggests a fallback chain tried and failed — likely same-provider (Sonnet → Opus or Opus → Sonnet) and Anthropic was overloaded across the board.
This was the trigger for the evening's Sonnet → Opus swap; the Langfuse prompt has since been reverted to Sonnet.

Improvements

Catch MidStreamFallbackError / ServiceUnavailableError at the orchestrator boundary; emit a typed provider_overloaded stream event with a "retry" affordance.
Confirm the litellm fallback chain reaches a provider outside Anthropic's first-party API (for example Anthropic API → Claude on Amazon Bedrock) so an Anthropic API overload is recoverable in-place.
For mid-stream failures, let the user "retry this turn" as a first-class FE control — the user's question is unchanged.
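The first and third improvements can be sketched as a thin wrapper at the orchestrator boundary. This is a minimal sketch under stated assumptions, not the code in packages/agent-definition: the `wrap_stream` helper, the event dictionary shape, and the choice to match exceptions by class name (so the example runs without litellm installed) are all hypothetical.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List

# Exception class names treated as "provider overloaded". Matching by name
# keeps this sketch runnable without litellm; real code would more likely
# catch the litellm exception classes directly.
OVERLOAD_ERROR_NAMES = {"MidStreamFallbackError", "ServiceUnavailableError"}


def is_provider_overload(exc: BaseException) -> bool:
    """True if the exception or any of its base classes has an overload name."""
    return any(cls.__name__ in OVERLOAD_ERROR_NAMES for cls in type(exc).__mro__)


async def wrap_stream(
    events: AsyncIterator[Dict[str, Any]],
) -> AsyncIterator[Dict[str, Any]]:
    """Pass events through; turn an overload error into one typed event."""
    try:
        async for event in events:
            yield event
    except Exception as exc:
        if not is_provider_overload(exc):
            raise
        # Hypothetical event shape: the frontend could key a retry control on it.
        # Only the class name is forwarded, never the raw provider error string.
        yield {
            "type": "provider_overloaded",
            "retryable": True,
            "detail": type(exc).__name__,
        }


class MidStreamFallbackError(Exception):
    """Stand-in for the litellm error, used only to exercise the sketch."""


async def failing_stream() -> AsyncIterator[Dict[str, Any]]:
    yield {"type": "text", "delta": "Based on the filing"}
    raise MidStreamFallbackError("AnthropicException - Overloaded")


async def collect(stream: AsyncIterator[Dict[str, Any]]) -> List[Dict[str, Any]]:
    return [event async for event in stream]


events_seen = asyncio.run(collect(wrap_stream(failing_stream())))
```

With this shape the partial text already streamed is preserved, and the last event tells the sidebar that the same question can be retried as-is.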

Silent failure · Empty PDF read (Scott, `read` on matfil_1c79x1rn3lacrwd6)

The read tool returned:

{"file": {"file_data": "data:application/pdf;base64,"},
 "name": "2026-05-01 Notice of Appeal",
 "extension": "pdf", ...}

A well-formed envelope with an empty base64 payload. The agent worked around it via vector search and surfaced "the notice of appeal PDF didn't return readable content" to Scott. Right user-facing behaviour from the model, but the underlying failure should not happen.

Investigate

Is the file actually present in S3 / blob storage with content, or did the upload-side complete with zero bytes?
If the source has bytes, the read tool's PDF extraction is the suspect — content-type or size bug producing empty file_data.
Either way, the tool should return {"status": "failed", "error": "..."} instead of an envelope with an empty payload — the current shape forces the model to infer failure.
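The last point can be illustrated with a small guard applied to the tool result before it reaches the model. This is a sketch only: `validate_read_result` is a hypothetical helper, the envelope fields are the ones visible in the trace above, and the real `read` tool may structure its result differently.

```python
import base64
from typing import Any, Dict

DATA_URL_PREFIX = "data:application/pdf;base64,"


def validate_read_result(envelope: Dict[str, Any]) -> Dict[str, Any]:
    """Return the envelope unchanged if it carries bytes, else a failure result."""
    file_data = envelope.get("file", {}).get("file_data", "")
    if file_data.startswith(DATA_URL_PREFIX):
        payload = file_data[len(DATA_URL_PREFIX):]
    else:
        payload = ""
    try:
        decoded = base64.b64decode(payload, validate=True)
    except ValueError:  # binascii.Error is a ValueError subclass
        decoded = b""
    if not decoded:
        name = envelope.get("name", "unknown file")
        return {
            "status": "failed",
            "error": f"empty or unreadable PDF payload for {name}",
        }
    return envelope


# The shape observed in Scott's trace: well-formed envelope, no bytes.
empty_envelope = {
    "file": {"file_data": "data:application/pdf;base64,"},
    "name": "2026-05-01 Notice of Appeal",
    "extension": "pdf",
}
result = validate_read_result(empty_envelope)
```

A guard like this only makes the failure explicit; it does not answer whether the bytes are missing in storage or lost during extraction.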

Gap · Citations not structured (all 14 traces, `reference_count = 0`)

The model output is rich with citations (statutes, case law, regulations, USPTO Manual sections) but the structured references array is always []. This matches the LEX-264 / granular-citations workstream. Concrete impact:

Citations in the UI are not clickable — users can't verify or click through to Justia / TMEP / etc.
We cannot programmatically audit hallucination rate per citation.
We cannot easily generate citation indices for export or for client-facing versions of a memo.

The parser pipeline exists in agent-definition and is wired through to trace metadata, so reference_count = 0 most likely means the model isn't emitting the marker syntax yet. Either the Langfuse agent/lawrence-2 prompt isn't instructing it to, or the chat-agent-embed-citations flag isn't on. Confirm both.
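The check described above can be written down as a small decision function. This is a sketch under loud assumptions: the marker regex is purely illustrative (the real marker syntax lives in the parser and is not shown in this write-up), and `diagnose_missing_references` is a hypothetical helper, not an existing function.

```python
import re
from typing import Pattern

# Purely illustrative marker format. The real syntax is defined by the
# citation parser in agent-definition and is NOT known from this report.
HYPOTHETICAL_MARKER = re.compile(r"\[\[cite:[^\]]+\]\]")


def diagnose_missing_references(
    output_text: str,
    reference_count: int,
    flag_enabled: bool,
    marker: Pattern[str] = HYPOTHETICAL_MARKER,
) -> str:
    """Point at the most likely owner of a reference_count = 0 trace."""
    if reference_count > 0:
        return "ok"
    if not flag_enabled:
        return "flag_off"  # chat-agent-embed-citations is not on
    if marker.search(output_text):
        return "parser"  # markers were emitted but not rolled up
    return "prompt"  # the model never emitted the marker syntax


diagnosis = diagnose_missing_references(
    "The memo cites A.R.S. §12-1177(A).", reference_count=0, flag_enabled=True
)
```

Running this over the 14 traces with the real marker pattern would show whether the gap is on the prompt side, the flag, or the parser.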

Polish · Streaming-text artefact in the saved final response

Multiple traces show final_response_content as a wall of concatenated text where inter-tool narration ("Let me check the relevant Florida statutory framework…", "I have enough research…") is inlined directly before the next narrative paragraph, with no separator. Live, this probably renders cleanly because the FE shows tool-running indicators between segments. The saved transcript is harder to read, which matters for Langfuse review, customer audit trails, and matter-note exports. Consider persisting discrete typed events instead of one flat string.
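One possible shape for the persisted form is a list of typed events that is rendered to text only at export time. This is a minimal sketch; the `TranscriptEvent` type, its three kinds, and the rendering format are assumptions for illustration and not the existing persistence schema.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass(frozen=True)
class TranscriptEvent:
    """One persisted segment of a turn (hypothetical schema)."""

    kind: Literal["narration", "tool_call", "answer"]
    text: str


def render_transcript(events: List[TranscriptEvent]) -> str:
    """Render events for review or export, keeping tool boundaries visible."""
    parts = []
    for event in events:
        if event.kind == "tool_call":
            parts.append(f"[tool: {event.text}]")
        else:
            parts.append(event.text)
    return "\n\n".join(parts)


saved_events = [
    TranscriptEvent("narration", "Let me check the relevant Florida statutory framework…"),
    TranscriptEvent("tool_call", "search_legislation"),
    TranscriptEvent("answer", "Draft lease follows."),
]
rendered = render_transcript(saved_events)
```

Because the boundaries survive in storage, the same events could feed Langfuse review, audit trails, and matter-note exports with different renderers.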

UX nits

Georgia's "yes, please draft" produced an A/B clarifying question, but the original turn had already proposed Class 9 primary with secondary 42/45. A confirmation should authorise the proposed plan, not re-elicit scope.
Sabrina's drafting flow had the agent flag scope concerns three times before drafting. Good lawyering, but verbose — once the user has acknowledged a concern in the thread, the next acknowledgement should be a single sentence at the top.

4. Cost / latency

Heads-up: costs below are Langfuse-displayed and known to be over-counted.

Per the "Anthropic Prompt Caching — Investigation & Implementation" Notion doc, two known accounting issues inflate them:

Langfuse #12996: Langfuse's pricing for claude-sonnet-4-6 still applies the deprecated Large Context tier ($6/$22.50 per Mtok) to input >200k tokens. Real Anthropic spend ≈ Langfuse displayed × 0.5 on Sonnet long-context traces.
LiteLLM #7790 · agents PR #564: the streaming consumer exited before Anthropic's late usage chunk arrived, so cache_read_input_tokens was missed and the same tokens were priced as fresh input. Verified reductions once captured: cross-turn follow-up $0.58 → $0.05 (91%), multi-iteration loops ~$2 → $0.27 (87%).

Treat the numbers below as upper bounds, not real spend. The Opus 4.7 traces should be checked against the per-iteration cache hit ratio once PR #564 lands.

User	Turns	Cost (USD, Langfuse-displayed)	Median latency (s)	Max latency (s)
Georgia	2	$3.65	23.9	30.5
Sabrina	11	$5.08	13.7	86.2
Scott	1	$8.60	101.5	101.5
Total	14	$17.34	—	—

Even with the over-count caveat, the shape is informative: multi-iteration drafting turns are the dominant relative cost driver (Sabrina's lease, Scott's memo). What we don't yet know — and should check once PR #564 is in — is what fraction of the per-iteration input on Scott's 7-iteration trace was cache_read vs fresh. If caching was already working server-side but uncounted by Langfuse, his real spend is likely well under $5.
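The effect of the missed cache_read tokens on displayed input cost can be illustrated with a short calculation. The rates and token counts below are placeholders chosen for easy arithmetic, not Anthropic's prices or figures from any trace, and output-token cost is ignored; `displayed_vs_real_input_cost` is a hypothetical helper.

```python
from typing import Tuple


def displayed_vs_real_input_cost(
    fresh_tokens: int,
    cached_tokens: int,
    input_rate_per_mtok: float,
    cache_read_rate_per_mtok: float,
) -> Tuple[float, float]:
    """Return (displayed, real) input cost in USD.

    displayed: every input token priced as fresh input, which is what
               happens when cache_read_input_tokens is never captured.
    real:      cached tokens priced at the cache-read rate.
    """
    displayed = (fresh_tokens + cached_tokens) * input_rate_per_mtok / 1_000_000
    real = (
        fresh_tokens * input_rate_per_mtok
        + cached_tokens * cache_read_rate_per_mtok
    ) / 1_000_000
    return displayed, real


# Placeholder numbers: 20k fresh + 180k cached input tokens,
# $10/Mtok fresh input and $1/Mtok cache read (NOT real prices).
displayed, real = displayed_vs_real_input_cost(20_000, 180_000, 10.0, 1.0)
overcount_ratio = displayed / real
```

The larger the cached share of each iteration's input, the larger the gap, which is why multi-iteration traces are the ones to re-check first.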

5. Top recommendations

Prioritised by impact on adoption / quality / cost:

1. Check in with Scott. One Lawrence-2 turn then silence in the window. Worth understanding whether that's a workload thing or whether something put him off after the showcase memo. The Notice-of-Appeal PDF failure (#2) is a candidate contributor.
2. Fix the empty-PDF read failure. The cause is either the storage path for matfil_1c79x1rn3lacrwd6 or the read tool's PDF extraction. The user-visible workaround was OK, but working around it cost several iterations.
3. Handle Anthropic overload gracefully. Wrap MidStreamFallbackError at the orchestrator boundary; verify the cross-provider fallback chain works; expose a "retry" control in the sidebar for mid-stream failures. This is the failure mode that forced last night's outage-driven Sonnet→Opus swap; surviving Anthropic overloads in-place removes the need for emergency prompt-version flips.
4. Ship structured citations. All 14 traces emit textually-cited output but reference_count = 0. The pipeline exists; the model just isn't emitting the marker syntax yet. Largest remaining production-quality gap.
5. Land agents PR #564 and bump Langfuse for #13367. Today's displayed costs are inflated by the cache_read drop + the stale Sonnet 4.6 large-context tier. Cost decisions (model routing, budgets) shouldn't be made off Langfuse numbers until both fixes are in. Once they are, revisit per-turn cost budgets.
6. Persist orchestrator events as structured items instead of one concatenated final_response_content. Helps Langfuse review and matter-side transcripts.
7. Suppress redundant scope re-flags. Once a user has acknowledged a scope concern in a thread, downgrade subsequent flags to a one-liner.

Was: "confirm Sonnet→Opus was intentional" — confirmed. Outage-driven, since reverted on the Langfuse prompt. No action.

6. Where to look

High-level orientation — see the full Notion write-up for file/line pointers and the long-form analysis.

Request flow

Sidebar · legal-os · Vercel AI SDK

User types in the agent-mode sidebar; frontend smooths the streamed tokens before render.

POST /lawrence/chat

lawrence-api · BFF inside platform-v3

Auth · matter access check · thread persistence · title generation · stream tee for the final-message save.

POST /agents/chat

agent-gateway · FastAPI · agents repo

Thin streaming passthrough to the chat agent.

POST /agents/chat/generate

Chat agent (ChatAgent) · agents repo

The Lawrence-2 orchestrator.

Prompt selection from Langfuse: agent/lawrence-2 · agent/lawrence-2-read-only · agent/system (depending on surface, mode and the vfs-agent flag).
Iteration loop with tools: search · vector_search · search_legislation · web_search · read · create · edit · stat · load_skill.
Streams via packages/agent-definition (litellm → Anthropic / OpenAI).
Citation parser on the stream; retrieval via MatterRetriever (CCO sourced from lawrence-engine).

Surface footgun

All 14 traces resolved the write-enabled agent/lawrence-2 prompt — the new sidebar intentionally omits surface: "sidebar" from the request because sending it would route to agent/lawrence-2-read-only and strip the VFS write tools. The field name suggests the opposite, so it's worth a comment in the sidebar submit hook.
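The footgun is easier to see as a resolution function. This is an illustrative guess at the shape of the logic, not the actual ChatAgent code: `resolve_prompt_name` is hypothetical, it ignores the `mode` input mentioned above, and the assumption that a disabled vfs-agent flag falls back to agent/system is not confirmed by the traces.

```python
from typing import Optional


def resolve_prompt_name(vfs_agent_enabled: bool, surface: Optional[str] = None) -> str:
    """Pick a Langfuse prompt name from the request (hypothetical logic)."""
    if not vfs_agent_enabled:
        # Assumption: without the flag the legacy system prompt is used.
        return "agent/system"
    if surface == "sidebar":
        # Counter-intuitive part: naming the sidebar surface selects the
        # read-only prompt, which strips the VFS write tools.
        return "agent/lawrence-2-read-only"
    return "agent/lawrence-2"


# What the new sidebar does today: omit `surface` to keep write tools.
prompt_for_new_sidebar = resolve_prompt_name(vfs_agent_enabled=True)
```

A one-line comment in the sidebar submit hook stating that `surface` is omitted on purpose would protect against a well-meaning "fix" that adds it back.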

Which area owns each finding

Finding	Where to look
MidStream Anthropic error leaked into trace (#3)	Backend orchestrator stream consumer: the single user-facing error emit in `packages/agent-definition`. Wrap `MidStreamFallbackError` there into a typed `provider_overloaded` event; verify the litellm cross-provider fallback chain reaches a non-Anthropic provider.
Empty PDF read (#2)	VFS `read` tool + lawrence-api file resolver. Check whether bytes are missing on storage (S3) or the read path is returning an empty envelope; either way the tool should fail explicitly instead of returning `data:application/pdf;base64,` with no payload.
Citations not structured (#4)	Langfuse `agent/lawrence-2` prompt (does it instruct the model to emit the citation marker syntax?) and the `chat-agent-embed-citations` feature flag. Parser + trace roll-up in `agent-definition` already exists.
Concatenated narration in saved transcript (#6)	lawrence-api stream adapter: the persisted UIMessage tee collapses tool-call boundaries at save time. Live render is fine; saved / audit form isn't.
Per-turn cost visibility (#5)	Once agents PR #564 lands, revisit the iteration loop in the chat agent — today it has an iteration cap (10) but no cost ceiling. Real-cost figures from corrected Langfuse will dictate whether a soft ceiling is needed.
Redundant scope re-flag (#7)	Langfuse `agent/lawrence-2` prompt + the drafting skill content loaded via `load_skill`. Prompt-side behaviour, not orchestrator.
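For the per-turn cost row, a soft cost ceiling next to the existing iteration cap could look roughly like the loop below. This is a sketch only: `run_loop`, the `step` callback signature, and the result dictionary are hypothetical, and only the cap of 10 comes from the table above.

```python
from typing import Callable, Dict, Tuple, Union

MAX_ITERATIONS = 10  # iteration cap mentioned in the table above


def run_loop(
    step: Callable[[int], Tuple[float, bool]],
    soft_cost_ceiling_usd: float,
) -> Dict[str, Union[int, float, str]]:
    """Run iterations until done, the cost ceiling, or the iteration cap.

    `step(i)` is a hypothetical callback returning (cost_usd, done)
    for iteration i.
    """
    total = 0.0
    for i in range(1, MAX_ITERATIONS + 1):
        cost, done = step(i)
        total += cost
        if done:
            return {"iterations": i, "cost_usd": total, "stopped": "done"}
        if total >= soft_cost_ceiling_usd:
            # Soft ceiling: stop starting new iterations and let the caller
            # decide whether to summarise or ask the user to continue.
            return {"iterations": i, "cost_usd": total, "stopped": "cost_ceiling"}
    return {"iterations": MAX_ITERATIONS, "cost_usd": total, "stopped": "iteration_cap"}


# Example: each iteration costs $1.00 and never finishes on its own.
outcome = run_loop(lambda i: (1.0, False), soft_cost_ceiling_usd=3.0)
```

Whether a ceiling is needed at all, and at what value, depends on the corrected cost figures, so the threshold here is deliberately a parameter.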

When a Lawrence-2 trace surfaces trouble — quick triage

Unexpected tool used / write in a read-only context → check surface + use_vfs_tools resolution and the prompt name that actually got picked.
Missing or duplicated citations → check whether the model emitted the citation marker syntax in its output. If yes, parser side; if no, the prompt.
Laggy stream / wrong batching feel → frontend smoothing hook, backend fine-grained-streaming flag, gateway buffering headers.
Hallucinated content / missing matter context → CCO retrieval failure path silently falls back to empty context; check the fetch-case-context observation.
Generic "Error generating response" → root cause is upstream (LLM provider exception or context-window retry); the route-level OTel span carries the real error.message.