A walkthrough of creating an evaluation framework with Shiny for Python
When using LLMs in practice:
Because of this it is important to:
The best way to get that is to build your own eval application. Thanks to Shiny for Python, this isn't as monumental a task as it may sound, and it can be done completely in Python. This post will walk through building a sample eval application to help familiarize you with the process.
For this example I am doing a simple task: given an address, I want the LLM to clean and standardize it and return JSON. While this isn't the best LLM use case, it is a great one for having lots of simple, easy evals to test. This will let us focus on building the framework and not get bogged down with the challenges of LLM evals.
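To make "simple and easy evals" concrete, here is a minimal sketch of what a few checks on the LLM output might look like. The JSON field names (`street`, `city`, `state`, `zip`), the US-style formats, and the function names are all assumptions made for illustration; the task description above does not fix a schema.

```python
import json
import re

# Hypothetical schema: the field names are assumed for illustration only.
REQUIRED_KEYS = {"street", "city", "state", "zip"}
EVAL_NAMES = ["valid_json", "has_required_keys", "state_is_two_letters", "zip_is_five_digits"]


def parse_output(raw: str):
    """Return the parsed JSON object, or None if raw is not a JSON object."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def run_evals(raw: str) -> dict:
    """Run each simple check and return a mapping of eval name to pass/fail."""
    parsed = parse_output(raw)
    if parsed is None:
        # Nothing else can pass if the output is not a JSON object.
        return {name: False for name in EVAL_NAMES}
    return {
        "valid_json": True,
        "has_required_keys": REQUIRED_KEYS.issubset(parsed),
        "state_is_two_letters": bool(re.fullmatch(r"[A-Z]{2}", str(parsed.get("state", "")))),
        "zip_is_five_digits": bool(re.fullmatch(r"\d{5}", str(parsed.get("zip", "")))),
    }


good = '{"street": "123 Main St", "city": "Springfield", "state": "IL", "zip": "62701"}'
bad = "Sure! Here is the address: 123 Main St"
print(run_evals(good))
print(run_evals(bad))
```

Each check returns a plain boolean, so the results are easy to store and aggregate later.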
You need a good, organized way to store inputs, outputs, and test results so they can be queried. There are many tools and formats for this, but I like to use SQLite. The sqlite-utils library is extremely helpful: it handles things like automatically creating tables on first insert and automatically adding missing columns on insert. It also provides a flexible API for querying in Python.
Database normalization is a skill very few people can actually apply effectively in practice (you'll hear things like "semi-normalized" or "it's mostly normalized," even when it's not normalized at all). If you master it and actually apply it, you will be able to easily delegate tasks, transition off projects, and rely on others for maintenance and debugging in production.
If you don't, you will end up being the only reliable repository of information, and so you will be stuck answering questions and writing queries for other people indefinitely. There is simply no reliable way to transfer expert knowledge of a relational database that was not built on best-practice normalization. You will be forever stuck hoping that others will step up and take over, but they simply won't be able to.
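For an eval application, one possible normalized layout keeps inputs, outputs, and eval results in separate tables linked by foreign keys, so each fact is stored exactly once. The sketch below uses Python's built-in `sqlite3` module rather than sqlite-utils so the schema is explicit; the table and column names are assumptions for illustration, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Hypothetical schema: one table per kind of fact, linked by foreign keys.
conn.executescript("""
CREATE TABLE inputs (
    id INTEGER PRIMARY KEY,
    address TEXT NOT NULL UNIQUE
);
CREATE TABLE outputs (
    id INTEGER PRIMARY KEY,
    input_id INTEGER NOT NULL REFERENCES inputs(id),
    model TEXT NOT NULL,
    raw_output TEXT NOT NULL
);
CREATE TABLE eval_results (
    id INTEGER PRIMARY KEY,
    output_id INTEGER NOT NULL REFERENCES outputs(id),
    eval_name TEXT NOT NULL,
    passed INTEGER NOT NULL CHECK (passed IN (0, 1)),
    UNIQUE (output_id, eval_name)
);
""")

conn.execute("INSERT INTO inputs (id, address) VALUES (?, ?)",
             (1, "123 main street springfield il"))
conn.execute("INSERT INTO outputs (id, input_id, model, raw_output) VALUES (?, ?, ?, ?)",
             (1, 1, "hypothetical-model", '{"zip": "62701"}'))
conn.executemany("INSERT INTO eval_results (output_id, eval_name, passed) VALUES (?, ?, ?)",
                 [(1, "valid_json", 1), (1, "has_required_keys", 0)])

# The address text lives only in inputs; joins reassemble the full picture.
rows = conn.execute("""
    SELECT i.address, o.model, e.eval_name, e.passed
    FROM eval_results AS e
    JOIN outputs AS o ON o.id = e.output_id
    JOIN inputs AS i ON i.id = o.input_id
    ORDER BY e.eval_name
""").fetchall()
for row in rows:
    print(row)
```

Because the address and the model output are never copied into `eval_results`, adding a new eval or re-running a model adds rows without touching existing ones.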