
Time: 3 minute read

Created: July 12, 2024

Author: Lina Lam

What is LLM Observability and Monitoring?

Building well with LLMs in production is incredibly difficult. You have probably heard the term LLM Observability. But what is it? How does it differ from traditional observability? What exactly is being observed? We have the answers.


The TL;DR

LLM Observability is complete visibility into every layer of an LLM-based software system: the application, the prompt, and the response. LLM Observability goes hand-in-hand with LLM Monitoring. While monitoring tracks application performance metrics, observability is more investigative: it helps you understand why the system behaved the way it did.

|             | LLM Observability | LLM Monitoring |
| ----------- | ----------------- | -------------- |
| Purpose     | Event logging | Collecting metrics |
| Key aspects | Trace the flow of requests to understand system dependencies and interactions | Track application performance metrics, such as usage, cost, latency, and error rates |
| Example     | Correlate different types of data to understand issues and complex behaviors | Set up thresholds for unexpected behaviors |
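To make the monitoring column concrete, here is a minimal sketch of wrapping an LLM call to record latency, token usage, and errors. Note that `call_llm` is a hypothetical stand-in for whatever client library you actually use, and printing stands in for shipping metrics to a real store:

```python
import time

def call_llm(prompt: str) -> dict:
    # Hypothetical stand-in for a real LLM client call.
    return {"text": "...", "prompt_tokens": 120, "completion_tokens": 80}

def monitored_call(prompt: str) -> dict:
    """Wrap an LLM call and record basic monitoring metrics:
    latency, token usage, and errors."""
    start = time.monotonic()
    metrics = {"error": None}
    try:
        response = call_llm(prompt)
        metrics.update(
            prompt_tokens=response["prompt_tokens"],
            completion_tokens=response["completion_tokens"],
        )
        return response
    except Exception as exc:
        metrics["error"] = repr(exc)
        raise
    finally:
        metrics["latency_s"] = time.monotonic() - start
        print(metrics)  # in practice, ship this to your metrics store
```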

LLM vs. Traditional Observability - What’s the Diff?

Traditional development is typically transactional: developers observe how the application handles an HTTP request/response, a database query, or a published message. In contrast, LLM-based systems are far more complex.

Here’s a comparison of the logs:

| Traditional logs | LLM logs |
| ---------------- | -------- |
| Simple, isolated interactions | Indefinitely nested interactions, creating a complex tree structure |
| Clear start and end points | A single trace encompasses multiple interactions |
| Small body size (low KBs of data) | Massive payloads (potentially GBs) |
| Predictable behavior (easy to evaluate) | Unpredictable behavior (difficult to evaluate) |
| Primarily text-based logs and numerical metrics | Multi-modal data (text, image, audio, video) |

Issues with LLMs

Hallucination: An LLM's objective is to predict the next few tokens, not to be accurate, so its responses are not always grounded in facts.

Complex use cases: LLM-based software systems require an increasing number of LLM calls to execute complex tasks (e.g. agentic workflows). Reflexion, a technique engineers use to get an LLM to analyze its own output, is one example: checking for hallucinations means multiple calls nested inside multiple spans (see the sketch below).
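As a rough sketch of why these traces nest, here is a Reflexion-style loop in which one call drafts an answer and further calls critique and revise it, each landing in its own span. Both `span` and `call_llm` are hypothetical placeholders, not any particular SDK:

```python
from contextlib import contextmanager

@contextmanager
def span(name: str):
    # Hypothetical tracing span; a real app would use its tracing SDK here.
    print(f"enter span: {name}")
    try:
        yield
    finally:
        print(f"exit span: {name}")

def call_llm(prompt: str) -> str:
    return "no issues"  # placeholder for a real model call

def reflexion(question: str, max_rounds: int = 2) -> str:
    """One draft call plus critique/revise rounds, each in its own span."""
    with span("reflexion"):
        with span("draft"):
            answer = call_llm(question)
        for i in range(max_rounds):
            with span(f"critique-{i}"):
                critique = call_llm(f"Critique this answer for factual errors:\n{answer}")
            if "no issues" in critique.lower():
                break  # the model is satisfied with its own answer
            with span(f"revise-{i}"):
                answer = call_llm(f"Revise the answer using this critique:\n{critique}\n\n{answer}")
        return answer
```

Each `with span(...)` block is where a tracing tool would record timing and payloads, which is exactly what produces the nested tree structure described in the comparison table above.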

Proprietary data: Managing proprietary data is tricky. You need it to answer specific customer questions, but it can accidentally find its way into responses (a simple leak check is sketched below).
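One common mitigation, assuming you can enumerate sensitive strings or patterns, is to scan responses before they leave your system. The patterns below are purely illustrative:

```python
import re

# Illustrative patterns only; a real deny-list would come from your
# data governance process.
SENSITIVE_PATTERNS = [
    re.compile(r"\bACME-INTERNAL-\d+\b"),  # hypothetical internal document IDs
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-shaped strings
]

def leaks_proprietary_data(response_text: str) -> bool:
    """Return True if the response matches any known sensitive pattern."""
    return any(p.search(response_text) for p in SENSITIVE_PATTERNS)
```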

Quality of response: Is the response in the wrong tone? Is the level of detail appropriate for what your users asked?

Cost (the elephant in the room): As usage grows and your LLM setup becomes more complicated (e.g. adding Reflexion), costs add up quickly (a rough per-request estimate is sketched below).
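Back-of-the-envelope cost tracking is straightforward once you log token counts. The per-token prices below are placeholders, since real rates vary by model and provider:

```python
# Placeholder prices in USD per 1K tokens; substitute your provider's real rates.
PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (prompt_tokens / 1000) * PRICE_PER_1K["prompt"] \
         + (completion_tokens / 1000) * PRICE_PER_1K["completion"]

print(request_cost(1200, 400))      # one plain call: $0.024
print(3 * request_cost(1200, 400))  # same request with two extra Reflexion calls: $0.072
```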

Third-party models: Provider APIs can change, and new models and guardrails can be added, causing your LLM app to behave differently than before.

Limited competitive advantage: LLMs are hard to train and maintain, so chances are you are using the same model as your competitor. Your differentiators become your prompt engineering and your proprietary data.

What LLM Observability Tools Have In Common

Developers working on LLM applications need effective tools to understand and fix bugs and exceptions, and to prevent regressions. They need unique visibility into how these applications behave, including:

  • Real-time monitoring of AI models
  • Detailed error tracking and reporting
  • Insights into user interactions and feedback
  • Performance metrics and trend analysis (see the sketch after this list)
  • Multi-metric correlations
  • Tools for prompt iterations and experimentation
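As an example of the metrics side, here is a minimal sketch of a threshold check over aggregated metrics, matching the "set up thresholds for unexpected behaviors" idea from the table earlier. The threshold values are arbitrary examples, not recommendations:

```python
# Arbitrary example thresholds; tune them to your application's baseline.
THRESHOLDS = {"p95_latency_s": 5.0, "error_rate": 0.02, "cost_per_request_usd": 0.10}

def check_thresholds(aggregates: dict) -> list:
    """Compare aggregated metrics against thresholds and return alert messages."""
    return [
        f"{name} = {aggregates[name]:.3f} exceeds threshold {limit}"
        for name, limit in THRESHOLDS.items()
        if aggregates.get(name, 0.0) > limit
    ]

alerts = check_thresholds(
    {"p95_latency_s": 7.2, "error_rate": 0.01, "cost_per_request_usd": 0.12}
)
print(alerts)  # flags latency and cost, but not error rate
```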

Further Reading

Arize AI wrote an in-depth piece on the Five Pillars of LLM Observability, covering common use cases and issues in LLM apps, why LLM observability matters, and the five pillars (evaluation, traces and spans, retrieval-augmented generation, fine-tuning, and prompt engineering) crucial for making your application reliable.

The author

Aparna Dhinakaran is the Co-Founder and Chief Product Officer at Arize AI, a leader in machine learning observability. She is recognized in Forbes 30 Under 30 and led ML engineering at Uber, Apple, and TubeMogul (Adobe).

What we’ve learned

At Helicone, we've seen the complexities of productizing LLMs first-hand. Effective observability is key to navigating these challenges, and we strive to help our customers build reliable, high-quality LLM applications by making the observability process easier and faster.

What are your thoughts?