Advancements in Generative AI Tools
While ChatGPT from OpenAI has dominated discussions surrounding generative AI tools, a shift has occurred with the recent emergence of Claude 3 Opus by Anthropic, which has claimed the top spot on a widely recognized crowdsourced leaderboard frequented by AI researchers.
Chatbot Arena Rankings
The rise of Claude 3 Opus to the top of the Chatbot Arena rankings is a milestone: it is the first time OpenAI’s GPT-4, the engine behind ChatGPT Plus, has been displaced since it was first included on the leaderboard in May of the previous year. Chatbot Arena is run by the Large Model Systems Organization (LMSYS ORG), a research organization dedicated to open models that operates as a collaboration among students and faculty at the University of California, Berkeley, UC San Diego, and Carnegie Mellon University.
Unlike conventional AI benchmarks, Chatbot Arena takes a subjective approach: users are shown responses from two unlabeled language models and asked to pick the one they prefer, by whatever criteria they choose. After aggregating many such comparisons, Chatbot Arena ranks the models and updates the leaderboard accordingly. Because it relies on user preference, it is harder for model trainers to game than quantitative benchmarks, where tuning a model to score well on the test is a common strategy. This makes Chatbot Arena a useful qualitative resource for AI researchers.
The platform collects user votes and fits a Bradley-Terry statistical model to estimate the probability that one model beats another in a head-to-head comparison. This yields Elo-style rating estimates, the kind of rating used to assess the skill of chess players, together with confidence intervals for each estimate.
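To make the rating method concrete, here is a minimal Python sketch of fitting a Bradley-Terry model to pairwise votes and mapping the result onto an Elo-style scale. It is an illustration under stated assumptions, not LMSYS's actual pipeline: the function names, the model names, the vote data, and the base rating of 1000 are all hypothetical, ties are ignored, and the confidence intervals are omitted.

```python
import math
from collections import Counter


def fit_bradley_terry(battles, iterations=100):
    """Estimate Bradley-Terry strengths from (winner, loser) vote pairs.

    Uses the classic iterative (minorization-maximization) update.
    Assumes no ties and that every model wins at least one comparison.
    """
    models = sorted({name for pair in battles for name in pair})
    wins = Counter(winner for winner, _ in battles)
    games = Counter(frozenset(pair) for pair in battles)

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            # Sum over opponents j that model i has actually faced.
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and games[frozenset((i, j))] > 0
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Strengths are only defined up to scale, so fix the mean at 1.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def win_probability(strength, a, b):
    """Bradley-Terry probability that model a is preferred over model b."""
    return strength[a] / (strength[a] + strength[b])


def to_elo_scale(strength, base=1000.0):
    """Map strengths to an Elo-like scale (400 points per factor of 10)."""
    return {m: base + 400.0 * math.log10(s) for m, s in strength.items()}


# Hypothetical votes: each tuple is (preferred model, other model).
votes = (
    [("model_a", "model_b")] * 6 + [("model_b", "model_a")] * 4
    + [("model_b", "model_c")] * 7 + [("model_c", "model_b")] * 3
    + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 2
)

strengths = fit_bradley_terry(votes)
for name, rating in sorted(to_elo_scale(strengths).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
print(f"P(model_a beats model_c) = {win_probability(strengths, 'model_a', 'model_c'):.2f}")
```

On this scale, a gap of 400 rating points corresponds to a tenfold difference in strength, which is the usual Elo convention. Only the differences between ratings are meaningful, so the base value is arbitrary.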
Leaderboard Dynamics
Besides Claude 3 Opus’s rise to the top of the leaderboard, Anthropic’s other models also perform well: Claude 3 Sonnet and Claude 3 Haiku currently hold 4th and 6th place, respectively.
The leaderboard also lists several versions of GPT-4, including GPT-4-0314, GPT-4-0613, GPT-4-1106-preview, and GPT-4-0125-preview. Sonnet and Haiku outperform both the original GPT-4 (GPT-4-0314) and the updated version OpenAI released in June 2023 (GPT-4-0613).
Regrettably, open-source Large Language Models (LLMs) are scarce at the top of the leaderboard, with only Qwen making the top 10. Starling 7B and Mixtral 8x7B are the only other open models in the top 20.
Token Capacity and Retrieval Proficiency
A standout feature distinguishing Claude from GPT-4 is its context window and retrieval ability. The public version of Claude 3 Opus can handle prompts of roughly 200,000 tokens, and Anthropic claims that a restricted version can process 1 million tokens with near-flawless retrieval. The larger context window lets Claude take in longer prompts and retain information from them more reliably than GPT-4 Turbo, which is limited to 128,000 tokens and whose retrieval accuracy declines as prompts grow longer.
Given its stronger recall accuracy relative to GPT-4 Turbo, Claude 3 Opus is a formidable contender in the generative AI landscape, which reinforces its status as a top-ranking model.
Emerging Players in AI Assistant Space
Google’s Gemini Advanced has also made significant strides among AI assistants, offering a package that includes 2TB of storage and AI features across Google’s suite of products at a price comparable to a ChatGPT Plus subscription.
Although the free Gemini Pro model performs well, currently holding the 4th position on the leaderboard between GPT-4 Turbo and Claude 3 Sonnet, the premium Gemini Ultra model has not yet been tested and does not appear in the existing rankings, which suggests the leaderboard may shift further.