Create an LLM Eval Application

A walkthrough of creating an evaluation framework with Shiny for Python

Motivation

When using LLMs in practice:

  1. Evals are the most important thing to get right when working with LLMs
  2. The right evals are very use-case and domain dependent

Because of this it is important to:

  1. Have as much flexibility as possible when working with evals
  2. Be able to optimize your eval workflow

The best way to get both is to build your own eval application. Thanks to Shiny for Python, this isn't as monumental a task as it may sound, and it can be done entirely in Python. This post walks through building a sample eval application to help familiarize you with the process.

The Evals

For this example I am doing a simple task: given an address, I want the LLM to clean and standardize it and return JSON. While this isn't the best LLM use case, it is a great one for having lots of simple, easy evals to test. This lets us focus on building the framework without getting bogged down in the challenges of LLM evals.
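
To make "simple and easy evals" concrete, here is a minimal sketch of what a few of them might look like. The required keys and the individual checks are my assumptions for illustration; the task above only says the LLM should return JSON, so adjust them to whatever schema you actually ask the model for.

```python
import json

# Hypothetical schema: the keys are assumed for illustration only.
REQUIRED_KEYS = {"street", "city", "state", "zip"}


def eval_is_valid_json(output: str) -> bool:
    """Pass if the raw model output parses as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def eval_has_required_keys(output: str) -> bool:
    """Pass if the output is a JSON object containing every required key."""
    if not eval_is_valid_json(output):
        return False
    return REQUIRED_KEYS.issubset(json.loads(output))


def eval_zip_is_five_digits(output: str) -> bool:
    """Pass if the "zip" field is a string of exactly five ASCII digits."""
    if not eval_has_required_keys(output):
        return False
    zip_code = json.loads(output)["zip"]
    return (
        isinstance(zip_code, str)
        and len(zip_code) == 5
        and zip_code.isascii()
        and zip_code.isdigit()
    )


EVALS = [eval_is_valid_json, eval_has_required_keys, eval_zip_is_five_digits]


def run_evals(output: str) -> dict[str, bool]:
    """Run every eval on one model output and map eval name to pass/fail."""
    return {check.__name__: check(output) for check in EVALS}


# Made-up model outputs, one well-formed and one not.
good = '{"street": "123 Main St", "city": "Springfield", "state": "IL", "zip": "62701"}'
bad = "123 main street, springfield"

good_results = run_evals(good)
bad_results = run_evals(bad)
```

Each eval is a plain function from output string to boolean, so adding a new one is just a matter of appending it to the list.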

Storage

You need a well-organized way to store inputs, outputs, and test results so they can be queried. There are many tools and formats for this, but I like to use SQLite. The sqlite-utils library is extremely helpful: it handles things like automatically creating tables on first insert and automatically adding missing columns on insert. It also provides a flexible API for querying in Python.

Database normalization is a skill very few people can actually apply effectively in practice (you'll hear things like "semi-normalized" or "it's mostly normalized," even when it's not normalized at all). If you master it and actually apply it, you will be able to easily delegate tasks, transition off projects, and rely on others for maintenance and debugging in production.

If you don't, you will end up being the only reliable repository of information, and so you will be stuck answering questions and writing queries for other people indefinitely. There is simply no reliable way to transfer expert knowledge of a relational database that was not built on best-practice normalization. You will be forever stuck hoping that others will step up and take over, but they simply won't be able to.
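
For the eval data in this post, one possible normalized layout is sketched below using the standard-library sqlite3 module. The table and column names are hypothetical, not a prescribed schema. Each fact is stored once: an input row, the outputs generated from it, and one row per eval result per output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default

# Hypothetical schema: one table per entity, linked by foreign keys.
conn.executescript(
    """
    CREATE TABLE inputs (
        id INTEGER PRIMARY KEY,
        raw_address TEXT NOT NULL
    );
    CREATE TABLE outputs (
        id INTEGER PRIMARY KEY,
        input_id INTEGER NOT NULL REFERENCES inputs(id),
        model TEXT NOT NULL,
        response TEXT NOT NULL
    );
    CREATE TABLE eval_results (
        output_id INTEGER NOT NULL REFERENCES outputs(id),
        eval_name TEXT NOT NULL,
        passed INTEGER NOT NULL CHECK (passed IN (0, 1)),
        PRIMARY KEY (output_id, eval_name)
    );
    """
)

# Made-up sample data: one input, one model output, two eval results.
input_id = conn.execute(
    "INSERT INTO inputs (raw_address) VALUES (?)",
    ("123 main st springfield il",),
).lastrowid
output_id = conn.execute(
    "INSERT INTO outputs (input_id, model, response) VALUES (?, ?, ?)",
    (input_id, "example-model", '{"zip": "62701"}'),
).lastrowid
conn.executemany(
    "INSERT INTO eval_results (output_id, eval_name, passed) VALUES (?, ?, ?)",
    [(output_id, "valid_json", 1), (output_id, "has_required_keys", 0)],
)

# Which inputs failed which evals? The joins follow the foreign keys.
failed = conn.execute(
    """
    SELECT i.raw_address, e.eval_name
    FROM eval_results AS e
    JOIN outputs AS o ON o.id = e.output_id
    JOIN inputs AS i ON i.id = o.input_id
    WHERE e.passed = 0
    """
).fetchall()
```

Because the relationships are declared in the schema itself, someone new to the project can read the table definitions and write this kind of query without asking the original author how the data fits together.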
