Adoption was clean for two of three users, output quality was strong, one Anthropic outage took the agent off the air mid-stream, and the read tool silently returned an empty PDF on the showcase memo. Cost numbers are inflated by known Langfuse accounting bugs.
The vfs-agent flag landed for all three users on 2026-05-12. Sabrina and Georgia engaged substantively; Scott made one Lawrence-2 turn and hasn't returned. Output quality is strong: multi-section legal memos, statute-cited drafts, scope-guardrail behaviour. The two real defects to fix: (1) the orchestrator leaks a raw MidStreamFallbackError into the trace when Anthropic overloads, and (2) the read tool returned a zero-length base64 envelope for one PDF instead of failing explicitly. In addition, reference_count = 0 in 14/14 traces: citations exist in prose, not in the structured channel.
All cost figures below are Langfuse-displayed and over-counted by two known accounting issues — see the cost section before reading them as real spend.
Per-user Lawrence-2 turns since the flag landed on 2026-05-12:
| User | L2 turns | Models | First L2 use |
|---|---|---|---|
| Georgia | 2 | Opus 4.7 | 2026-05-12 20:30 UTC |
| Sabrina | 11 | 7× Sonnet 4.6 · 4× Opus 4.7 | 2026-05-12 18:42 UTC |
| Scott | 1 | Opus 4.7 | 2026-05-12 20:15 UTC |
Sabrina and Georgia engaged substantively on day one. Scott made one turn (the showcase memo, below) and hasn't returned to the sidebar in the window — worth a check-in to understand why.
mat_4ew0ubd082rnfksu
- read + search → answered with concrete fee numbers ($100 flat, 50% platform fee)
- web_search, correctly deflected to Lawhive support
- read + search, said it was too early to recommend

Turns 3 and 5 were platform meta-questions, not legal. The first thing US users want from the agent is to learn the platform. Worth feeding into the Lawhive operations knowledge base.
mat_a10lschpo6ny2q5j
- vector_search + search → identified an open USPTO Office Action as the upsell
- MidStreamFallbackError: Anthropic Overloaded · 5.4s · $0

This is the Claude outage that triggered the temporary Sonnet → Opus swap for the evening.
mat_nphxk37yxpjgw0m9 · 23:10–23:12 UTC · 4 turns · Opus 4.7
- load_skill (skill:drafting) + search → flagged again, then proceeded.
- search_legislation + create + vector_search → drafted full Florida lease with statutory cross-refs (§83.49(2)(d), §404.056(5), §83.47, lead paint addendum, federal 24 C.F.R. §35), written to documents://mat_nphxk37yxpjgw0m9. 233 events, 86s.

The scope-guardrail behaviour is genuinely impressive, the kind of thing senior lawyers notice. Worth surfacing in onboarding demos.
- web_search (TMEP, WIPO Class 42 & 45) + 2× vector_search → structured multi-class recommendation (Class 9 primary, Class 42 SaaS secondary, Class 45 social networking) with concrete USPTO ID wording per class. 93 events, 30s.

Cleanest example of the value proposition: research-grounded advice that produces draft-ready language with citable URLs. Worth saving as a demo trace.
The agent ran a full research-and-draft pipeline: web search → legislation search → vector search over case docs and legal library → file read on appellate clerk notice → file read on notice of appeal (this one returned empty) → web/vector fallback → memo creation. The final memo, a six-section internal note, cites A.R.S. §12-1177(A) and §33-811(B); Curtis v. Morris, Merrifield v. Merrifield, and Olds Bros. Lumber Co. v. Rushing; and ARCAP 15(a)(2).
Single trace illustrates almost the entire Lawrence-2 surface in one go. Most expensive trace in the dataset by ~5× — see cost caveat below.
- The error surfaced in orchestrator-call (GENERATION) and propagated up iteration-1 → chat-response.
- {"error": "...", "status": "failed"}, i.e. the raw litellm error string lands in Langfuse. The user-facing emit in packages/agent-definition is the generic "Error generating response" string, so the user did not see the raw payload, but they also got no actionable cue or retry control.
- MidStreamFallbackError suggests a fallback chain tried and failed, likely same-provider (Sonnet → Opus or Opus → Sonnet) while Anthropic was overloaded across the board.
- Wrap MidStreamFallbackError / ServiceUnavailableError at the orchestrator boundary; emit a typed provider_overloaded stream event with a "retry" affordance.

Empty PDF read (read on matfil_1c79x1rn3lacrwd6)

The read tool returned:
{"file": {"file_data": "data:application/pdf;base64,"},
"name": "2026-05-01 Notice of Appeal",
"extension": "pdf", ...}
A well-formed envelope with an empty base64 payload. The agent worked around it via vector search and surfaced "the notice of appeal PDF didn't return readable content" to Scott. Right user-facing behaviour from the model, but the underlying failure should not happen.
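The empty envelope above can be caught mechanically before it reaches the model. The following is a minimal sketch, not the read tool's actual code: `validate_read_result` is a hypothetical helper, and the envelope fields (`file.file_data`, `name`) are assumed from the payload shown in the trace.

```python
import base64


def validate_read_result(result: dict) -> dict:
    """Return the envelope unchanged if it carries bytes, else a typed failure.

    Hypothetical helper: the envelope shape is assumed from the trace above.
    """
    file_data = result.get("file", {}).get("file_data", "")
    name = result.get("name", "<unnamed>")

    # In a data URI, everything after the first comma is the base64 payload.
    _, _, payload = file_data.partition(",")
    if not payload.strip():
        return {"status": "failed", "error": f"empty payload for {name!r}"}

    try:
        decoded = base64.b64decode(payload, validate=True)
    except ValueError:  # binascii.Error is a ValueError subclass
        return {"status": "failed", "error": f"undecodable base64 for {name!r}"}

    if not decoded:
        return {"status": "failed", "error": f"zero bytes decoded for {name!r}"}
    return result


# The envelope from the trace is turned into an explicit failure.
empty_envelope = {
    "file": {"file_data": "data:application/pdf;base64,"},
    "name": "2026-05-01 Notice of Appeal",
    "extension": "pdf",
}
print(validate_read_result(empty_envelope)["status"])
```

With a check like this at the tool boundary, the model no longer has to infer failure from an empty string.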
- The read tool's PDF extraction is the suspect: a content-type or size bug producing empty file_data.
- The tool should return {"status": "failed", "error": "..."} instead of an envelope with an empty payload; the current shape forces the model to infer failure.

Citations not structured (reference_count = 0)

The model output is rich with citations (statutes, case law, regulations, USPTO Manual sections) but the structured references array is always []. This matches the LEX-264 / granular-citations workstream. Concrete impact:
The parser pipeline exists in agent-definition and is wired through to trace metadata. reference_count = 0 means the model isn't emitting the marker syntax yet. Either the Langfuse agent/lawrence-2 prompt isn't instructing it to, or the chat-agent-embed-citations flag isn't on. Confirm both.
Multiple traces show final_response_content as a wall of concatenated text where inter-tool narration ("Let me check the relevant Florida statutory framework…", "I have enough research…") is inlined directly before the next narrative paragraph, with no separator. Live, this probably renders cleanly because the FE shows tool-running indicators between segments. But the saved transcript is harder to read, which matters for Langfuse review, customer audit trails, and matter-note exports. Consider persisting as discrete typed events instead of one flat string.
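One way to picture "discrete typed events" is sketched below. The event kinds and the `TranscriptEvent` / `render_transcript` names are assumptions for illustration only; the actual persisted message shape in lawrence-api may differ.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class TranscriptEvent:
    # Hypothetical event kinds; the real schema is not specified in this report.
    kind: Literal["narration", "tool_call", "answer"]
    text: str


def render_transcript(events: list[TranscriptEvent]) -> str:
    """Render saved events with a visible boundary between segments."""
    labels = {"narration": "[narration]", "tool_call": "[tool]", "answer": "[answer]"}
    return "\n\n".join(f"{labels[e.kind]} {e.text}" for e in events)


events = [
    TranscriptEvent("narration", "Let me check the relevant Florida statutory framework…"),
    TranscriptEvent("tool_call", "search_legislation"),
    TranscriptEvent("narration", "I have enough research…"),
    TranscriptEvent("answer", "<final narrative paragraph>"),
]
print(render_transcript(events))
```

Keeping the kind on each event lets an audit export drop narration entirely, or show it, without re-parsing a flat string.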
Per the "Anthropic Prompt Caching — Investigation & Implementation" Notion doc, two known accounting issues inflate the Langfuse-displayed cost figures:
- claude-sonnet-4-6 still has the deprecated Large Context tier ($6/$22.50 per Mtok) for input >200k tokens. Real Anthropic spend ≈ Langfuse displayed × 0.5 on Sonnet long-context traces.
- cache_read_input_tokens were missed and the same tokens got billed as fresh input. Verified reductions once captured: cross-turn follow-up $0.58 → $0.05 (91%), multi-iteration loops ~$2 → $0.27 (87%).

Treat the numbers below as upper bounds, not real spend. The Opus-4-7 traces should be checked against per-iteration cache hit ratio once PR #564 lands.
| User | Turns | Cost (USD, Langfuse-displayed) | Median latency (s) | Max latency (s) |
|---|---|---|---|---|
| Georgia | 2 | $3.65 | 23.9 | 30.5 |
| Sabrina | 11 | $5.08 | 13.7 | 86.2 |
| Scott | 1 | $8.60 | 101.5 | 101.5 |
| Total | 14 | $17.34 | — | — |
Even with the over-count caveat, the shape is informative: multi-iteration drafting turns are the dominant relative cost driver (Sabrina's lease, Scott's memo). What we don't yet know — and should check once PR #564 is in — is what fraction of the per-iteration input on Scott's 7-iteration trace was cache_read vs fresh. If caching was already working server-side but uncounted by Langfuse, his real spend is likely well under $5.
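The arithmetic behind the two caveats can be sketched as follows. The 0.5 factor is the report's own approximation and applies only to Sonnet long-context traces. The function names are hypothetical, and the $1.00 input is an illustrative value, not a figure from the dataset.

```python
def percent_reduction(before_usd: float, after_usd: float) -> float:
    """Percentage drop from the displayed cost to the corrected cost."""
    return (before_usd - after_usd) / before_usd * 100


def corrected_upper_bound(displayed_usd: float, sonnet_long_context: bool) -> float:
    """Apply the report's x0.5 approximation to Sonnet long-context traces only.

    Other traces are returned unchanged; the cache-read over-count cannot be
    corrected from the displayed figure alone.
    """
    return displayed_usd * 0.5 if sonnet_long_context else displayed_usd


# Cross-turn follow-up from the Notion doc: $0.58 -> $0.05.
print(round(percent_reduction(0.58, 0.05)))  # 91

# Illustrative displayed cost of $1.00 on a Sonnet long-context trace.
print(corrected_upper_bound(1.00, sonnet_long_context=True))  # 0.5
```

The per-user table cannot be corrected this way without knowing which turns crossed the 200k-token threshold and what share of input was cache reads.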
Prioritised by impact on adoption / quality / cost:
- Empty PDF read: check matfil_1c79x1rn3lacrwd6 or the read tool's PDF extraction. The user-visible workaround was OK, but working around it cost several iterations.
- MidStream Anthropic error: wrap MidStreamFallbackError at the orchestrator boundary; verify the cross-provider fallback chain works; expose a "retry" control in the sidebar for mid-stream failures. This is the failure mode that forced last night's outage-driven Sonnet→Opus swap; surviving Anthropic overloads in place removes the need for emergency prompt-version flips.
- Citations not structured: reference_count = 0. The pipeline exists; the model just isn't emitting the marker syntax yet. Largest remaining production-quality gap.
- Concatenated narration in saved transcript: final_response_content. Helps Langfuse review and matter-side transcripts.
- Was: "confirm Sonnet→Opus was intentional". Confirmed: outage-driven, since reverted on the Langfuse prompt. No action.
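The mid-stream item above could take roughly the following shape. This is a sketch under stated assumptions: the two exception classes are local stand-ins named after the errors in the trace (they are not imported from litellm), and the event dictionaries, including the `retryable` field, are hypothetical.

```python
from typing import Iterator


class MidStreamFallbackError(Exception):
    """Local stand-in for the error named in the trace; not the litellm class."""


class ServiceUnavailableError(Exception):
    """Local stand-in, same caveat as above."""


def guarded_stream(chunks: Iterator[str]) -> Iterator[dict]:
    """Forward text chunks; convert provider overloads into a typed event."""
    try:
        for chunk in chunks:
            yield {"type": "text", "text": chunk}
    except (MidStreamFallbackError, ServiceUnavailableError):
        # Keep the raw provider string out of the stream and give the
        # client something it can attach a retry control to.
        yield {"type": "provider_overloaded", "retryable": True}


def flaky_provider() -> Iterator[str]:
    """Simulated provider that fails after one chunk."""
    yield "Partial answer"
    raise MidStreamFallbackError("Anthropic Overloaded")


events = list(guarded_stream(flaky_provider()))
print(events[-1]["type"])  # provider_overloaded
```

Partial text already streamed is preserved, and the final event tells the sidebar that a retry is meaningful.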
High-level orientation — see the full Notion write-up for file/line pointers and the long-form analysis.
- agent/lawrence-2 · agent/lawrence-2-read-only · agent/system (depending on surface, mode and the vfs-agent flag).
- search · vector_search · search_legislation · web_search · read · create · edit · stat · load_skill.
- packages/agent-definition (litellm → Anthropic / OpenAI).
- MatterRetriever (CCO sourced from lawrence-engine).

All 14 traces resolved the write-enabled agent/lawrence-2 prompt. The new sidebar intentionally omits surface: "sidebar" from the request because sending it would route to agent/lawrence-2-read-only and strip the VFS write tools. The field name suggests the opposite, so it's worth a comment in the sidebar submit hook.
| Finding | Where to look |
|---|---|
| MidStream Anthropic error leaked into trace (#1) | Backend orchestrator stream consumer — the single user-facing error emit in packages/agent-definition. Wrap MidStreamFallbackError there into a typed provider_overloaded event; verify the litellm cross-provider fallback chain reaches a non-Anthropic provider. |
| Empty PDF read (#2) | VFS read tool + lawrence-api file resolver. Check whether bytes are missing on storage (S3) or the read path is returning an empty envelope; either way the tool should fail explicitly instead of returning data:application/pdf;base64, with no payload. |
| Citations not structured (#3) | Langfuse agent/lawrence-2 prompt (does it instruct the model to emit the citation marker syntax?) and the chat-agent-embed-citations feature flag. Parser + trace roll-up in agent-definition already exists. |
| Concatenated narration in saved transcript (#4) | lawrence-api stream adapter — the persisted UIMessage tee collapses tool-call boundaries at save time. Live render is fine; saved / audit form isn't. |
| Per-turn cost visibility (#5) | Once agents PR #564 lands, revisit the iteration loop in the chat agent — today it has an iteration cap (10) but no cost ceiling. Real-cost figures from corrected Langfuse will dictate whether a soft ceiling is needed. |
| Redundant scope re-flag (#7) | Langfuse agent/lawrence-2 prompt + the drafting skill content loaded via load_skill. Prompt-side behaviour, not orchestrator. |
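For the per-turn cost row, a soft ceiling next to the existing iteration cap might look like the sketch below. Only the cap of 10 comes from this report; the function name, the ceiling value, and the per-iteration costs are hypothetical, and whether a ceiling is needed at all depends on the corrected figures.

```python
MAX_ITERATIONS = 10  # iteration cap cited in the table above


def run_with_soft_ceiling(iteration_costs_usd: list[float], ceiling_usd: float) -> dict:
    """Run iterations until done, the cap is hit, or spend has crossed the ceiling.

    The ceiling is soft: the iteration that crosses it still completes,
    but no further iteration starts.
    """
    spent = 0.0
    used = 0
    stopped_by = "completed"
    for cost in iteration_costs_usd:
        if used >= MAX_ITERATIONS:
            stopped_by = "iteration_cap"
            break
        if spent >= ceiling_usd:
            stopped_by = "cost_ceiling"
            break
        spent += cost
        used += 1
    return {"iterations": used, "spent_usd": spent, "stopped_by": stopped_by}


# Hypothetical per-iteration costs with a $2.00 soft ceiling.
print(run_with_soft_ceiling([1.0, 1.5, 2.0, 2.0], ceiling_usd=2.0))
```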
- surface + use_vfs_tools resolution and the prompt name that actually got picked.
- fetch-case-context observation.
- error.message.