Complete Guide to Monitoring Local LLMs with Llama and Open WebUI

The real challenge with running a local LLM like Llama is understanding what's happening inside your system: how accurately your models respond to prompts, how efficiently they use resources, and how your local AI implementation is performing overall.
This tutorial will show you how to gain visibility into your Llama AI local model using Helicone with Open WebUI, providing powerful monitoring capabilities for anyone looking to run local LLM systems effectively.
Open WebUI (formerly Ollama WebUI) is a feature-rich, self-hosted web interface designed for interacting with various AI LLM local implementations including Llama models, OpenAI-compatible APIs, and custom LLM setups.
Table of Contents
- Why Monitoring Is Essential When You Run Local LLM Systems
- Prerequisites
- Step-by-Step Tutorial: Setting Up Helicone to Monitor Your Local LLM with Open WebUI
- How to Get the Most from Your Open WebUI Monitoring
- Advanced Monitoring Techniques for LLMs
- Best Practices for LLM Monitoring and Optimization
- How to Optimize Local LLMs Based on Monitoring Insights
- Conclusion: The Power of Monitoring for Your Local AI Implementation
- Next Steps in Your LLM Monitoring Journey
Why Monitoring Is Essential When You Run Local LLM Systems
- Track request patterns: See which types of prompts are being sent to your Llama AI models
- Measure inference latency: Monitor how long your local LLM takes to generate responses
- Analyze usage patterns: Identify how users interact with your AI LLM local implementation
- Evaluate model performance: Compare different Llama models with the same prompts to optimize your setup
- Test locally before deploying to production: Evaluate your implementation locally, where experimentation is cheap, before moving to production where costs and risks are higher
Prerequisites
- Docker installed
- Ollama or another local model runner
- A Helicone account
Step-by-Step Tutorial: Setting Up Helicone to Monitor Your Local LLM with Open WebUI
In this tutorial, we will create a proxy server that logs every LLM request to Helicone so you can monitor and evaluate how your model responds.
Step 1: Create a Helicone Account and Generate an API Key
- Sign up for a Helicone account
- Navigate to the API Keys section in your Helicone dashboard settings
- Generate a new API key for your Open WebUI integration
Step 2: Set up a proxy server to log the local LLM requests to Helicone
Since Open WebUI doesn't directly support custom Helicone headers, we'll need to set up a small Node.js proxy layer to handle this integration. This allows us to monitor Llama requests made through Open WebUI.
Beyond Open WebUI
We will use Open WebUI as our interface to the local LLM, but you can use any other interface to make LLM requests - as long as you use the proxy server's domain as your base URL like we do below.
- Run your Llama server locally
First, make sure your local model runner - in this case, Ollama - is installed.
Then, in your terminal, run the following command to start the Ollama server locally.
export OLLAMA_HOST=0.0.0.0 && ollama serve
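If Ollama is installed but you haven't downloaded a model yet, ollama pull llama3 will fetch one. Before wiring up the proxy, it can also help to confirm the Ollama server is reachable. Here's a minimal TypeScript check you could run with ts-node; it calls the same /api/version and /api/tags endpoints the proxy below forwards:
// check-ollama.ts - quick sanity check that the Ollama server is up (sketch)
const base = "http://localhost:11434";

async function main() {
  // These are the same endpoints the proxy in the next step forwards to Ollama
  const version = await (await fetch(`${base}/api/version`)).json();
  console.log("Ollama version:", version);

  const tags = await (await fetch(`${base}/api/tags`)).json();
  console.log("Installed models:", tags);
}

main().catch((err) => {
  console.error("Ollama does not appear to be running:", err);
  process.exit(1);
});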
- Build the Helicone proxy server
Let's create a Node.js proxy server that will sit between Open WebUI and your AI implementation. This proxy forms the backbone of your local LLM monitoring solution, logging all requests to Helicone for analysis.
You can find a finished version of the codebase here. Feel free to clone it locally and skip to the last step of this section.
mkdir helicone-ollama-proxy && cd helicone-ollama-proxy
npm init
npm install express dotenv typescript ts-node @types/node @types/express @helicone/helpers
touch .env && touch proxy.ts && touch tsconfig.json
Include your environment variables in your .env file:
# Example
HELICONE_API_KEY=<YOUR_HELICONE_API_KEY>
Define your tsconfig.json file with the following configuration.
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "esModuleInterop": true,
    "strict": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "outDir": "dist",
    "sourceMap": true,
    "resolveJsonModule": true,
    "declaration": true,
    "allowSyntheticDefaultImports": true,
    "removeComments": true,
    "baseUrl": ".",
    "paths": {
      "*": ["node_modules/*"]
    }
  },
  "include": ["**/*.ts"],
  "exclude": ["node_modules", "dist"]
}
Create a simple proxy file (proxy.ts) to handle Helicone logging every time you make an LLM request using your local LLM implementation.
import express, { Request, Response } from "express";
import { HeliconeManualLogger } from "@helicone/helpers";

// Define types for Ollama requests
interface ChatMessage {
  role: string;
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  stream?: boolean;
  options?: Record<string, unknown>;
}

interface GenerateRequest {
  model: string;
  prompt: string;
  stream?: boolean;
  options?: Record<string, unknown>;
}

interface OllamaResponse {
  id: string;
  model: string;
  created_at: string;
  message?: {
    role: string;
    content: string;
  };
  response?: string;
  done: boolean;
}

const app = express();
app.use(express.json());

// Initialize Helicone logger
const logger = new HeliconeManualLogger({
  apiKey: process.env.HELICONE_API_KEY || ""
});

// Get Ollama version
app.get("/api/version", async (req: Request, res: Response) => {
  try {
    const response = await fetch("http://localhost:11434/api/version");
    const data = await response.json();
    res.json(data);
  } catch (error) {
    console.error("Error fetching version:", error);
    res.status(500).json({ error: "Failed to fetch version" });
  }
});

// Get available tags
app.get("/api/tags", async (req: Request, res: Response) => {
  try {
    const response = await fetch("http://localhost:11434/api/tags");
    const data = await response.json();
    res.json(data);
  } catch (error) {
    console.error("Error fetching tags:", error);
    res.status(500).json({ error: "Failed to fetch tags" });
  }
});

// Proxy chat requests to Ollama and log them to Helicone
app.post("/api/chat", async (req: Request, res: Response) => {
  const reqBody = req.body as ChatRequest;
  console.log("Received chat request:", JSON.stringify(reqBody, null, 2));
  console.log("Forwarding to Ollama at: http://localhost:11434/api/chat");

  try {
    const result = await logger.logRequest(reqBody, async (resultRecorder) => {
      const response = await fetch("http://localhost:11434/api/chat", {
        method: "POST",
        headers: {
          "Content-Type": "application/json"
        },
        body: JSON.stringify(reqBody)
      });

      if (!response.ok) {
        const errorText = await response.text();
        console.error("Ollama API error:", errorText);
        throw new Error(`Ollama API error: ${response.status} ${errorText}`);
      }

      // Handle streaming response
      const reader = response.body?.getReader();
      const decoder = new TextDecoder();
      let finalResponse: OllamaResponse | null = null;
      let accumulatedContent = "";

      if (!reader) {
        throw new Error("No response body received from Ollama");
      }

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        const chunk = decoder.decode(value);
        const lines = chunk.split("\n").filter((line) => line.trim());

        for (const line of lines) {
          try {
            const jsonResponse = JSON.parse(line) as OllamaResponse;
            if (jsonResponse.message?.content) {
              accumulatedContent += jsonResponse.message.content;
            }
            if (jsonResponse.done) {
              // Create final response with accumulated content
              finalResponse = {
                ...jsonResponse,
                message: {
                  role: "assistant",
                  content: accumulatedContent
                }
              };
              resultRecorder.appendResults(finalResponse);
              return finalResponse;
            }
          } catch (parseError) {
            console.error("Failed to parse chunk:", line);
          }
        }
      }

      if (!finalResponse) {
        throw new Error("No valid response received from LLM");
      }
      return finalResponse;
    });

    res.json(result);
  } catch (error) {
    console.error("Error:", error);
    res.status(500).json({ error: error instanceof Error ? error.message : "Failed to process request" });
  }
});

// Handle generation requests
app.post("/api/generate", async (req: Request, res: Response) => {
  const reqBody = req.body as GenerateRequest;
  console.log("Received generate request:", JSON.stringify(reqBody, null, 2));
  console.log("Forwarding to Ollama at: http://localhost:11434/api/generate");

  try {
    const result = await logger.logRequest(reqBody, async (resultRecorder) => {
      const response = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: {
          "Content-Type": "application/json"
        },
        body: JSON.stringify(reqBody)
      });

      const resBody = (await response.json()) as OllamaResponse;
      resultRecorder.appendResults(resBody);
      return resBody;
    });

    res.json(result);
  } catch (error) {
    console.error("Error:", error);
    res.status(500).json({ error: "Failed to process request" });
  }
});

const PORT = 3100;
app.listen(PORT, () => {
  console.log(`Ollama proxy with Helicone logging running on port ${PORT}`);
});
In your package.json file, include the following config and start script to run your proxy server locally.
{
  "name": "..",
  ...,
  "main": "dist/proxy.js",
  "type": "module",
  "scripts": {
    "build": "tsc",
    "start": "npm run build && node -r dotenv/config dist/proxy.js",
    "dev": "tsc -w",
    ...
  },
  ...
}
In your terminal, start your proxy server by running the following command.
npm run start
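Optionally, before pointing Open WebUI at the proxy, you can send a quick test request to confirm everything is wired up. The sketch below assumes the proxy is listening on port 3100 and that you have the llama3 model pulled; swap in whichever model you actually have installed.
// smoke-test.ts - verify the proxy forwards chat requests and logs them to Helicone (sketch)
async function main() {
  const res = await fetch("http://localhost:3100/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3", // swap in any model you have pulled locally
      messages: [{ role: "user", content: "Say hello in one sentence." }],
      stream: true // the proxy consumes the stream and returns a single JSON body
    })
  });
  console.log(await res.json());
}

main().catch(console.error);
If this prints a response and a new row shows up in your Helicone Requests tab, the proxy is doing its job.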
- Run your Open WebUI instance pointing to the proxy service as the Ollama base URL
Finally, update your Open WebUI configuration to point to this proxy and run your Open WebUI instance!
# Run Open WebUI using the proxy as the Ollama base URL
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -e OLLAMA_BASE_URL="http://host.docker.internal:3100" -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
Step 3: Verify Your Monitoring Setup
Once your proxy server is running and Open WebUI is configured to use it:
- Open your Helicone dashboard
- Go to the Requests tab
- Make a few test queries in your Open WebUI interface
- Confirm that requests appear in your Helicone dashboard
You should see your Llama model requests being logged, along with their full prompts and responses.
Not using Open WebUI? That's okay!
Once you have your proxy server running, feel free to use the Helicone headers anywhere in your application for enhanced monitoring.
fetch("http://localhost:3100/api/chat", {
method: "POST",
headers: {
"Content-Type": "application/json",
"Helicone-Property-App-Context": "dashboard",
"Helicone-Property-User-Id": "user-123"
},
body: JSON.stringify({
model: "llama3",
messages: [{ role: "user", content: "What is the meaning of life?" }],
stream: true
})
});
These custom headers enable you to pass metadata about your application context and user to Helicone, which can be used to filter and segment your monitoring data in the dashboard. To read more about Helicone's custom headers, check out the documentation.
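One caveat: the proxy above only logs the request body, so for these headers to actually reach Helicone you also need to forward them to the logger. The sketch below is a variant of the /api/generate handler from proxy.ts that collects incoming Helicone-* headers and passes them as the optional additional-headers argument to logRequest; double-check that argument against the @helicone/helpers version you have installed.
// Sketch: forward Helicone-* headers from the incoming request to Helicone.
// Assumes logRequest accepts an optional additional-headers argument
// (verify against your @helicone/helpers version).
app.post("/api/generate", async (req: Request, res: Response) => {
  const reqBody = req.body as GenerateRequest;

  // Pick up any Helicone-Property-*, Helicone-User-Id, etc. headers sent by the client
  const heliconeHeaders: Record<string, string> = {};
  for (const [key, value] of Object.entries(req.headers)) {
    if (key.toLowerCase().startsWith("helicone-") && typeof value === "string") {
      heliconeHeaders[key] = value;
    }
  }

  try {
    const result = await logger.logRequest(
      reqBody,
      async (resultRecorder) => {
        const response = await fetch("http://localhost:11434/api/generate", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify(reqBody)
        });
        const resBody = (await response.json()) as OllamaResponse;
        resultRecorder.appendResults(resBody);
        return resBody;
      },
      heliconeHeaders // forwarded custom properties show up on the Helicone request
    );
    res.json(result);
  } catch (error) {
    console.error("Error:", error);
    res.status(500).json({ error: "Failed to process request" });
  }
});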
Monitor your Llama API costs with Helicone
Track usage, debug model responses, and optimize your local LLM implementation.
How to Get the Most from Your Open WebUI Monitoring
Now that you have the basic local LLM monitoring integration set up, let's explore how to gain actionable insights into your Llama AI implementation.
Tracking Your LLM Interactions
When using Llama AI models through Open WebUI with monitoring enabled, your interaction typically follows this flow:
- Your prompt is sent through the monitored interface to your local LLM
- The system processes your request in your LLM implementation
- The LLM generates a response based on your input
- All interactions are logged for monitoring purposes
To see Llama monitoring in action:
- Create a new chat in Open WebUI
- Select your local Llama model
- Ask a variety of questions to generate diverse data
- Check your Helicone dashboard to analyze both queries and responses
Debugging Llama AI Performance Issues
To understand how your local LLM is performing in detail:
- In Open WebUI, enable detailed logging in settings (if available)
- Run different types of prompts through your system
- Use Helicone to track response times, token usage, and response quality
- Compare different configurations to optimize how you run local LLM systems
Advanced Monitoring Techniques for LLMs
Comprehensive Prompt Tracing
For complex local implementations, tracking the full prompt engineering process is invaluable:
- In your Helicone dashboard, examine the complete prompts sent to your LLM
- Analyze the "system" messages that guide your LLM's behavior
- Measure how different prompt structures affect your LLM's response quality
- Experiment with different prompt engineering techniques and compare results in your prompt experimentation dashboard (a sketch follows this list)
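To make those comparisons easy to filter later, you can tag each prompt variant with a custom Helicone property. Here's a sketch that sends the same question under two different system prompts; it assumes your proxy forwards Helicone-* headers as shown earlier, and the Helicone-Property-Prompt-Variant name is just an example.
// compare-system-prompts.ts - send the same question under two system prompts (sketch)
const systemPrompts = {
  terse: "You are a terse assistant. Answer in one sentence.",
  detailed: "You are a thorough assistant. Explain your reasoning step by step."
};

async function run() {
  for (const [variant, system] of Object.entries(systemPrompts)) {
    const res = await fetch("http://localhost:3100/api/chat", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Example custom property; filter by it in the Helicone dashboard
        "Helicone-Property-Prompt-Variant": variant
      },
      body: JSON.stringify({
        model: "llama3",
        messages: [
          { role: "system", content: system },
          { role: "user", content: "How does HTTP caching work?" }
        ],
        stream: true
      })
    });
    const body = await res.json();
    console.log(`[${variant}]`, body.message?.content);
  }
}

run().catch(console.error);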
Best Practices for LLM Monitoring and Optimization
- Start with fundamentals: Begin with basic request tracking in your Open WebUI monitoring before exploring advanced metrics
- Establish benchmarks: Create a consistent set of test prompts to evaluate your AI performance (see the benchmark sketch after this list)
- Implement regular monitoring cycles: Check your Helicone dashboard regularly, especially after making changes to how you run local LLM systems
- Focus on application-specific needs: Track which types of prompts work best for your particular use case
- Maintain comprehensive documentation: Keep detailed records of all optimizations and their measurable impacts
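A simple way to implement the benchmarking idea above is a small script that replays a fixed prompt set through the proxy and prints rough timings; the prompts and model name below are placeholders you'd replace with your own.
// benchmark.ts - replay a fixed prompt set through the proxy and record latency (sketch)
const testPrompts = [
  "Explain recursion to a beginner.",
  "Write a haiku about observability.",
  "List three uses of a hash map."
];

async function benchmark(model = "llama3") {
  for (const prompt of testPrompts) {
    const start = Date.now();
    const res = await fetch("http://localhost:3100/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }], stream: true })
    });
    await res.json();
    console.log(`${model} | ${Date.now() - start} ms | ${prompt}`);
  }
}

benchmark().catch(console.error);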
How to Optimize Local LLMs Based on Monitoring Insights
After collecting substantial data through your Open WebUI monitoring setup, here are practical adjustments to enhance your AI LLM local implementation:
- Temperature Parameter Optimization: If your AI monitoring shows inconsistent outputs, fine-tune the temperature settings (a sketch follows the comparison table below)
- Context Window Management: If responses are hitting token limits, adjust your prompt engineering approach
- AI Model Selection: Use monitoring data to compare performance across different sizes of LLM models
- System Prompt Engineering: Develop and test specialized system prompts that guide your LLM model's behavior
| | Small Models (7B) | Medium Models (13B-34B) | Large Models (70B+) |
|---|---|---|---|
| Response Speed | Very Fast | Moderate | Slower |
| Resource Usage | Low (4-8GB VRAM) | Medium (12-24GB VRAM) | High (32GB+ VRAM) |
| Response Quality | Good for simple tasks | Great for most tasks | Superior for complex tasks |
| Monitoring Focus | Optimizing for speed | Balancing speed/quality | Maximizing context usage |
| Ideal Use Cases | Quick assistants, basic Q&A | General purpose assistant | Complex reasoning, creation |
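To act on the temperature and context-window points above, note that Ollama accepts generation options in the request body, which the proxy passes through untouched. Here's a sketch that compares two temperature settings; the option names follow Ollama's documented options object, and the values are only starting points.
// tune-temperature.ts - compare outputs at two temperature settings (sketch)
async function ask(temperature: number) {
  const res = await fetch("http://localhost:3100/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      messages: [{ role: "user", content: "Suggest a name for a home lab server." }],
      stream: true,
      // Ollama generation options are passed through the proxy unchanged
      options: { temperature, num_ctx: 4096 }
    })
  });
  const body = await res.json();
  console.log(`temperature=${temperature}:`, body.message?.content);
}

async function main() {
  await ask(0.2); // more deterministic
  await ask(1.0); // more varied
}

main().catch(console.error);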
Conclusion: The Power of Monitoring for Your Local AI Implementation
Implementing comprehensive AI monitoring with Helicone through a custom proxy server provides insights that help optimize performance, improve accuracy, and enhance the overall experience of running your applications over the long term.
This proxy approach lets you monitor all your AI requests, even when direct header support isn't available in your interface of choice (as is the case with Open WebUI). It captures rich metadata about your local LLM's performance, helping you make data-driven decisions about optimizations before deploying to production.
As you continue to refine how you run your local AI implementation, these monitoring tools will help you build a more robust, performant, and effective implementation tailored to your specific needs.
Next Steps in Your LLM Monitoring Journey
- Experiment with different AI model parameter settings to find optimal configurations
- Test various prompt engineering techniques and compare results in your monitoring dashboard
- Establish regular review sessions to analyze your AI monitoring data
- Implement systematic A/B testing with different system prompts to guide your AI models
Remember that the best way to run local LLM systems is through continuous iteration: monitor your Llama AI implementation, make adjustments based on data, and monitor again to progressively improve your system's effectiveness.
Learn more about local LLM implementation:
- Llama 3.3 just dropped - is it better than GPT-4 or Claude-Sonnet-3.5?
- Top Open WebUI Alternatives for Running LLMs Locally
- Should You Build or Buy LLM Observability?
Frequently Asked Questions
How do I monitor multiple Llama models simultaneously?
You can set up multiple proxy instances pointing to different model endpoints and label them with distinct properties in Helicone headers. This allows you to segment your monitoring data by model while using a unified dashboard.
What are the hardware requirements for running local Llama models?
For Llama 3.1 models: (a) 7B models: minimum 8GB RAM, 4GB VRAM. (b) 13B models: minimum 16GB RAM, 8GB VRAM. (c) 70B models: minimum 32GB RAM, 24GB VRAM (or use quantized versions).
Can I use this monitoring setup with other LLMs besides Llama?
Yes, this monitoring architecture works with any local model accessible via API. Simply adjust the proxy endpoints to point to your specific model server, such as Mistral.
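A light-touch way to do this (and to run several proxy instances side by side, as suggested in the first FAQ answer) is to make the upstream URL and port configurable instead of hard-coded. The OLLAMA_UPSTREAM_URL and PROXY_PORT variable names below are our own suggestion, not something the proxy above already reads.
// Sketch: configurable upstream and port so the same proxy code can sit in front of
// different model servers that expose the same Ollama-style API.
// OLLAMA_UPSTREAM_URL and PROXY_PORT are example variable names you would add to .env.
const OLLAMA_UPSTREAM_URL = process.env.OLLAMA_UPSTREAM_URL ?? "http://localhost:11434";
const PORT = Number(process.env.PROXY_PORT ?? 3100);

// Then replace the hard-coded URLs in the handlers, for example:
// const response = await fetch(`${OLLAMA_UPSTREAM_URL}/api/chat`, { ... });
// and start each instance with its own settings:
//   OLLAMA_UPSTREAM_URL=http://localhost:11434 PROXY_PORT=3100 npm run start
//   OLLAMA_UPSTREAM_URL=http://other-host:11434 PROXY_PORT=3101 npm run start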
Questions or feedback?
Is the information out of date? Please raise an issue or contact us; we'd love to hear from you!