Why single benchmark scores mislead: interpreting a low Vectara score with high AA-Omniscience
3 key factors when evaluating LLMs beyond a single leaderboard number

Many teams pick a model because it tops a single benchmark