Developer Blog

Training LLMs with LLMs: Building a Debugging Environment with Claude Code × Langfuse


The Daily Langfuse Stare

When you're running LLM-powered products, you end up spending a lot of time staring at Langfuse dashboards.

At Neuradex, we're building a memory platform for AI agents. We use LLMs heavily, so monitoring traces in Langfuse is part of our daily routine.

"This trace seems slow for some reason." "This session went off the rails midway through." "Costs spiked—which prompt is the culprit?"

Find problematic traces, compare inputs and outputs, check the prompts, figure out what went wrong.

Isn't this exactly what LLMs are good at?

So I let Claude Code read Langfuse. And it started roasting its own kind.

How Claude Code × Langfuse Skills Works

Claude Code has a feature called "Skills"—custom commands you can define to execute specific tasks.

I built a set of skills that call the Langfuse API:

/langfuse-trace-list    # List traces
/langfuse-trace-view    # View trace details
/langfuse-session-list  # List sessions
/langfuse-session-view  # View session details
/langfuse-prompt-list   # List prompts
/langfuse-prompt-view   # View prompt details
/langfuse-prompt-upsert # Update prompts
/langfuse-model-list    # List models and costs

Now you can ask Claude Code to "look at the Langfuse data."

The LLM can now read other LLMs' medical records.

In Practice: How to Train Your Fellow LLM

The basic flow looks like this:

  1. Browse the Langfuse Console
  2. Spot a suspicious trace ID or session ID
  3. Toss it to Claude Code
  4. Let it analyze, and if needed, fix things

Example 1: Diagnosing a Suspicious Trace

Found a slow trace in Langfuse. Copy the ID and throw it at Claude Code.

Check out this trace ID: abc123-xxxx-yyyy
It seems slow—any idea why?

Claude Code fetches the trace details via /langfuse-trace-view and responds:

This trace has over 12,000 input tokens. The system prompt has too many examples. Consider reducing the few-shot examples from 3 to 1, or try a different approach.

Instead of staring at JSON thinking "hmm, lots of tokens," you get concrete improvement suggestions.
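Under the hood, a skill like /langfuse-trace-view needs nothing more than the Langfuse public API. A minimal Python sketch of that kind of diagnosis (the endpoint path, basic-auth scheme, and `observations`/`usage` field names reflect my reading of the Langfuse public API; verify them against your Langfuse version):

```python
import base64
import json
import os
import urllib.request

LANGFUSE_HOST = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")


def fetch_trace(trace_id: str) -> dict:
    """Fetch one trace from the Langfuse public API using HTTP basic auth."""
    creds = f'{os.environ["LANGFUSE_PUBLIC_KEY"]}:{os.environ["LANGFUSE_SECRET_KEY"]}'
    req = urllib.request.Request(
        f"{LANGFUSE_HOST}/api/public/traces/{trace_id}",
        headers={"Authorization": "Basic " + base64.b64encode(creds.encode()).decode()},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def total_input_tokens(trace: dict) -> int:
    """Sum input tokens across a trace's observations (usage may be missing)."""
    return sum(
        (obs.get("usage") or {}).get("input", 0) or 0
        for obs in trace.get("observations", [])
    )
```

With the raw numbers in hand, the "12,000 input tokens" observation above is one `total_input_tokens(fetch_trace("abc123-xxxx-yyyy"))` away; the LLM's value-add is turning that number into a concrete suggestion.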

Example 2: Analyzing a Broken Session

While browsing session lists in Langfuse, I spotted a session where the conversation broke down midway. Grab the session ID and hand it over.

Look at this session: session-xyz-789
Looks like the conversation went off track midway

Claude Code retrieves all traces in the session via /langfuse-session-view and analyzes the conversation flow chronologically.

On the 5th turn, the user mixed Japanese and English in their input. The LLM misidentified the language, and subsequent responses shifted to English. You might want to add language handling rules to the system prompt, or introduce a language detection step.

If a human tried the same analysis, they'd need to read through every turn's input/output. The LLM grasps the context instantly.
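What the session skill hands the model is essentially a chronological digest of every turn. A sketch of that preprocessing step, assuming each trace carries `timestamp`, `input`, and `output` fields as the Langfuse trace API returns them (field names are my assumption):

```python
def conversation_timeline(traces: list[dict]) -> list[str]:
    """Order a session's traces chronologically and compress each turn into
    a one-line summary that an LLM can scan in a single pass."""
    ordered = sorted(traces, key=lambda t: t.get("timestamp", ""))
    lines = []
    for i, t in enumerate(ordered, start=1):
        inp = str(t.get("input", ""))[:60]   # truncate long payloads
        out = str(t.get("output", ""))[:60]
        lines.append(f"turn {i}: in={inp!r} -> out={out!r}")
    return lines
```

Feeding the model this digest instead of raw trace JSON is what makes "on the 5th turn, the language shifted" the kind of pattern it can spot at a glance.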

Example 3: Analyzing Tool Call Behavior

Tool call debugging is particularly painful. Errors that crash are at least noticeable—but "no errors, just unexpected behavior" is the hardest to track down.

Check out this session: session-tool-unexpected-123
No errors, but search results are wrong

Claude Code traces tool call flows across multiple traces in the session and responds:

In the 3rd trace, search_documents is called but the query argument is an empty string. Looking at the 2nd trace's extract_keywords output, keywords were extracted correctly, but the format got mangled when passing to query. Either specify the output format in the prompt or review the parsing logic.

When there's no error, you'd normally have to expand the JSON across multiple traces yourself and work out where a value diverged from intent. The LLM looks at the entire session and points out exactly where the values went wrong.

Example 4: Delegating Prompt Improvements

Once you know the issue is in the prompt, you can ask for fixes right there.

Improve this prompt: my-chat-prompt
Make sure that language mixing issue doesn't happen again

Claude Code fetches the current prompt via /langfuse-prompt-view and suggests improvements. If you approve, it can push the changes directly to Langfuse via /langfuse-prompt-upsert.

Update the prompt in your dev environment and verify nothing breaks; if all looks good, hit "Promote to Production" in the Langfuse Console. This entire loop happens through conversation alone.
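The upsert step maps to a single API call: creating a prompt version via POST to `/api/public/v2/prompts` in the Langfuse prompt API. A hedged sketch of the request body (the field names follow my understanding of that endpoint; in this workflow we deliberately leave `labels` empty so promotion stays a manual console step):

```python
def build_prompt_upsert(name: str, prompt_text: str, promote: bool = False) -> dict:
    """Build the request body for creating a new prompt version in Langfuse.

    By default the new version carries no labels, so it does NOT go live;
    promotion to production stays an explicit step in the Langfuse Console.
    """
    body = {
        "name": name,          # e.g. "my-chat-prompt"
        "type": "text",
        "prompt": prompt_text,
        "labels": [],
    }
    if promote:
        body["labels"].append("production")
    return body
```

Keeping `promote=False` as the default is the safety valve: the skill can push new versions all day, but nothing reaches users until a human clicks "Promote to Production."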

Why "Having LLMs Diagnose LLMs" Works

1. Escape from JSON Hell

Langfuse trace data is deeply nested: metadata, input, output, spans... Parsing it by hand is exhausting.

Hand it to an LLM, and it understands the structure and summarizes: "basically, here's what's going on."

2. You Can Say "Something's Off" and It Gets It

"The response is weird." "It feels slow." "Costs seem high."

LLMs interpret these vague descriptions just fine. They translate them into concrete metrics and pinpoint the issue.

3. You Can Dig Deeper Through Conversation

"Tell me more about that." "Compare it with other traces." "So how should I fix it?"

You can explore interactively. Instead of clicking around dashboards, you debug following your train of thought.

Skills Are Surprisingly Easy to Build

The skills I used are basically just wrappers around the Langfuse API.

# /langfuse-trace-view

Receive a trace ID, fetch details from Langfuse API.
Format and return the result as Markdown.

Claude Code Skills let you define tasks in natural language. Claude Code handles the API call details.

Building one skill takes about 30 minutes. Once it's done, your whole team can do /langfuse-trace-view abc123 and get the same capabilities.
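Concretely, a skill is just a Markdown file in your repository. A minimal sketch, assuming the `.claude/skills/<name>/SKILL.md` layout and `name`/`description` frontmatter of Claude Code's skills convention (the instruction body below is illustrative, not our exact file):

```markdown
---
name: langfuse-trace-view
description: Fetch a Langfuse trace by ID and summarize it as Markdown.
---

Given a trace ID as the argument:

1. Call `GET $LANGFUSE_HOST/api/public/traces/{id}` with basic auth
   (`LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY`).
2. Summarize latency, token usage, and cost per observation.
3. Flag anything unusual: empty tool arguments, oversized prompts,
   repeated retries.
```

Because the body is natural language, tweaking what the skill reports is editing a paragraph, not refactoring code.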

Wrap Up

  • Made Langfuse data readable through Claude Code
  • Just toss a suspicious trace/session ID and the LLM diagnoses it
  • Can delegate prompt fixes on the spot
  • The age of training LLMs with LLMs

Having one LLM diagnose another. A bit meta, but using an LLM to understand LLM output actually makes sense. They speak the same language, after all.

At Neuradex, we're applying these LLM operations insights to build a memory foundation that helps AI agents work smarter. It's no longer just about "using" LLMs—it's about "nurturing" them. We're continuing to build that environment.