Evidently AI
Open-source evaluations and observability for LLM apps
Evidently is an open-source framework to evaluate, test and monitor AI-powered apps.
📚 100+ built-in checks, from classification to RAG. 🚦 Both offline evals and live monitoring. 🛠 Easily add custom metrics and LLM judges.

Hi Makers!
I'm Elena, a co-founder of Evidently AI. I'm excited to share that our open-source Evidently library is stepping into the world of LLMs! 🚀
Three years ago, we started with testing and monitoring for what's now called "traditional" ML. Think classification, regression, ranking, and recommendation systems. With over 20 million downloads, we're now bringing our toolset to help evaluate and test LLM-powered products.
As you build an LLM-powered app or feature, figuring out if it's "good enough" can be tricky. Evaluating generative AI is different from traditional software and predictive ML. It lacks clear criteria and labeled answers, making quality more subjective and harder to measure. But there is no way around it: to deploy an AI app to production, you need a way to evaluate it.
For instance, you might ask:
- How does the quality compare if I switch from GPT to Claude?
- What will change if I tweak a prompt? Do my previous good answers hold?
- Where is it failing?
- What real-world quality are users experiencing?
It's not just about metrics—it's about the whole quality workflow. You need to define what "good" means for your app, set up offline tests, and monitor live quality.
With Evidently, we provide the complete open-source infrastructure to build and manage these evaluation workflows. Here's what you can do:
📚 Pick from a library of metrics or configure custom LLM judges
📊 Get interactive summary reports or export raw evaluation scores
🚦 Run test suites for regression testing
📈 Deploy a self-hosted monitoring dashboard
⚙️ Integrate it with any adjacent tools and frameworks
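To make that concrete, here is a minimal sketch of an offline evaluation run with the library. It assumes a recent Evidently release and built-in text descriptors like `Sentiment` and `TextLength`; exact import paths and names can shift between versions, so treat it as illustrative and check the docs for your installed version.

```python
import pandas as pd

from evidently.report import Report
from evidently.metrics import TextEvals
from evidently.descriptors import Sentiment, TextLength

# A small evaluation dataset: the responses your LLM app produced.
eval_data = pd.DataFrame({
    "question": ["How do I reset my password?", "What is your refund policy?"],
    "response": [
        "Go to Settings and click 'Reset password'.",
        "We offer refunds within 30 days of purchase.",
    ],
})

# Apply built-in text descriptors to the "response" column.
report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[
        Sentiment(),    # sentiment score for each response
        TextLength(),   # length of each response
    ]),
])

# Run the evaluation and export an interactive HTML summary.
report.run(reference_data=None, current_data=eval_data)
report.save_html("evidently_report.html")
```

The same descriptors plug into a test suite, where you attach pass/fail conditions for regression testing, or into the monitoring dashboard to track scores over time.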
It's open-source under an Apache 2.0 license.
We build it together with the community: I would love to learn how you approach this problem, and we welcome any feedback and feature requests.
Check it out on GitHub: https://github.com/evidentlyai/e..., get started in the docs: http://docs.evidentlyai.com, or join our Discord to chat: https://discord.gg/xZjKRaNp8b.

@elenasamuylova Congrats on bringing your idea to life! Wishing you a smooth and prosperous journey. How can we best support you on this journey?

@kjosephabraham Thanks for the support! We always appreciate any feedback and help in spreading the word. As an open-source tool, it is built together with the community! 🚀

Amazing team + product. Been using Evidently for years now and can confidently say it's one of the best on the market!

@hamza_tahir Thanks for the support ❤️

Hi everyone! I am Emeli, one of the co-founders of Evidently AI.
I'm thrilled to share what we've been working on lately with our open-source Python library. I want to highlight a specific new feature of this launch: LLM judge templates.
LLM as a judge is a popular evaluation method where you use an external LLM to review and score the outputs of LLMs.
However, one thing we learned is that no two LLM apps are alike. Your quality criteria are unique to your use case. Even something seemingly generic like "sentiment" will mean something different each time. While we do have templates (it's always great to have a place to start), our primary goal is to make it easy to create custom LLM-powered evaluations.
Here is how it works:
🏆 Define your grading criteria in plain English. Specify what matters to you, whether it's conciseness, clarity, relevance, or creativity.
💬 Pick a template. Pass your criteria to an Evidently template, and we'll generate a complete evaluation prompt for you, including formatting the output as JSON and asking the LLM to explain its scores.
▶️ Run evals. Apply these evaluations to your datasets or recent traces from your app.
📊 Get results. Once you set a metric, you can use it across the Evidently framework. You can generate visual reports, run conditional test suites, and track metrics over time on a dashboard.
You can track any metric you like - from hallucinations to how well your chatbot follows the brand guidelines.
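For illustration, here is a rough sketch of what a custom judge can look like in code. It uses a binary classification template ("concise" vs. "verbose") and an OpenAI model as the judge; the class and parameter names follow our current docs but may change between releases, so treat this as a sketch and check the documentation for the exact API.

```python
import pandas as pd

from evidently.report import Report
from evidently.metrics import TextEvals
from evidently.descriptors import LLMEval
from evidently.features.llm_judge import BinaryClassificationPromptTemplate

# Plain-English grading criteria: what "concise" means for this app.
conciseness = LLMEval(
    subcolumn="category",
    template=BinaryClassificationPromptTemplate(
        criteria="""A CONCISE response answers the question directly,
without filler or unrequested detail. A VERBOSE response repeats itself,
restates the question, or adds information that was not asked for.""",
        target_category="CONCISE",
        non_target_category="VERBOSE",
        uncertainty="unknown",       # what to return when the judge is unsure
        include_reasoning=True,      # ask the judge to explain its score
    ),
    provider="openai",               # requires OPENAI_API_KEY in the environment
    model="gpt-4o-mini",
    display_name="Conciseness",
)

# Use the judge like any other descriptor, on a column of responses.
eval_data = pd.DataFrame({"response": [
    "Go to Settings and click 'Reset password'.",
    "Thanks so much for reaching out! Passwords are really important, and...",
]})

report = Report(metrics=[
    TextEvals(column_name="response", descriptors=[conciseness]),
])
report.run(reference_data=None, current_data=eval_data)
report.save_html("llm_judge_report.html")
```

Once defined, the same metric can go into a test suite or onto the dashboard, exactly as described above.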
We plan to expand this feature by making it easier to add examples to your prompt and by adding more templates, such as pairwise comparisons.
Let us know what you think! To check it out, visit our GitHub: https://github.com/evidentlyai/e..., the docs: http://docs.evidentlyai.com, or our Discord to chat: https://discord.gg/xZjKRaNp8b.

@hamza_afzal_butt Thank you so much!

This solves so many of my current pain points working with LLMs. I'm developing AI mentors and therapists, and I need a better way to run evals for each update and prompt optimization. Upvoting, bookmarking, and going to try this out!
Thank you, Elena!

@danielwchen Thank you! Let us know how it works for you. We see a lot of usage with healthcare-related apps; these are the use cases where quality is paramount - you can't just ship on vibes!