
In this episode, we’re lucky to be joined by Alexandre Sallinen and Tony O’Halloran from the Laboratory for Intelligent Global Health & Humanitarian Response Technologies to discuss how large language models are assessed, including their Massive Open Online Validation & Evaluation (MOOVE) initiative.
0:25 - Technical wrap: what are agents?
13:20 - What are benchmarks?
18:20 - Automated evaluation
20:10 - Benchmarks
37:45 - Human feedback
44:50 - LLM as judge
Read more about the projects we discuss here:
- Learn about the MOOVE or contact our team if you'd like to be involved
- Listen to the LiGHTCAST, including their recent excellent outline of the HealthBench paper
More details in the show notes on our website.