Evaluating AI Growing Pains: Flaws in Traditional Methods


Challenges in Evaluating AI Systems

The growing dominance of the latest artificial intelligence (AI) systems is pushing established assessment methods to their limits, leaving organizations and governments unsure how best to navigate a rapidly evolving technology landscape. As a wave of new models floods the market, professionals who build, test, and fund AI tools say the flaws in the conventional benchmarks used to measure performance, accuracy, and safety are becoming apparent. They argue that the traditional tools are easy to manipulate and too narrow to capture the intricacies of the latest models.

The frenzied technology race ignited by the launch of OpenAI’s chatbot ChatGPT in 2022, and fueled by heavy investment from venture capitalists and major technology companies such as Microsoft, Google, and Amazon, has made many older benchmarks for assessing AI progress obsolete. Aidan Gomez, founder and CEO of AI startup Cohere, noted that a public benchmark stays useful only until people optimize their models for it or game it outright. What once took years now happens within months.

Intensified Competition in the AI Arena

Companies such as Google, Anthropic, Cohere, and Mistral have recently released AI models in a bid to overtake the Microsoft-backed OpenAI and climb the rankings of large language models (LLMs), the systems that underpin products like ChatGPT. Each new generation of models outperforms existing benchmarks with ease, quickly rendering earlier evaluations outdated. That makes assessing LLMs a pressing concern not only in academic circles but also in corporate boardrooms.

The rise of generative AI has shifted investment priorities, with 70% of global chief executives calling it a top priority, according to a KPMG survey. Shelley McKinley, GitHub’s chief legal officer, stressed that companies must offer trustworthy products, since consumers are unlikely to embrace technology they do not trust.

Addressing AI Risks and Governance

Governments face the challenge of deploying the latest AI models while managing their risks. Last week, the US and UK signed a landmark bilateral agreement on AI safety, building on the AI institutes the two countries established the previous year, in an effort to head off unexpected disruptions from rapid advances in AI. President Joe Biden’s executive order calls for benchmarks to evaluate the risks of AI tools, underscoring the urgency of robust governance and risk-mitigation strategies.

Whether the focus is safety, performance, or efficiency, the organizations tasked with stress-testing AI systems are struggling to keep pace with the cutting edge. Companies increasingly face pivotal decisions about which LLMs to adopt and how to evaluate them, trying to identify the option that best fits their specific needs.

Evolving Evaluation Standards

Rishi Bommasani, who leads a team at the Stanford Center for Research on Foundation Models, has developed the Holistic Evaluation of Language Models (HELM), which assesses reasoning, memorization, resistance to disinformation, and other critical capabilities. Public benchmarks such as Massive Multitask Language Understanding (MMLU) and HumanEval test models across a broad range of subjects and on coding proficiency, respectively.
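
To make concrete what a benchmark of this kind measures, below is a minimal sketch of MMLU-style multiple-choice scoring. It assumes a hypothetical `ask_model` function standing in for a call to whichever model is under test; it is an illustration, not any of these projects' actual harnesses.

```python
# Minimal sketch of MMLU-style multiple-choice scoring (illustrative only).
# `ask_model` is a hypothetical stand-in for a call to the model under test.
from typing import Callable

QUESTIONS = [
    {
        "question": "Which gas makes up most of Earth's atmosphere?",
        "choices": {"A": "Oxygen", "B": "Nitrogen", "C": "Carbon dioxide", "D": "Argon"},
        "answer": "B",
    },
    # A real benchmark would contain thousands of questions across many subjects.
]

def score(ask_model: Callable[[str], str]) -> float:
    """Return the fraction of questions the model answers correctly."""
    correct = 0
    for item in QUESTIONS:
        options = "\n".join(f"{key}. {text}" for key, text in item["choices"].items())
        prompt = f"{item['question']}\n{options}\nAnswer with a single letter (A, B, C, or D)."
        reply = ask_model(prompt).strip().upper()
        if reply[:1] == item["answer"]:
            correct += 1
    return correct / len(QUESTIONS)
```

A single accuracy number like this is also part of why such benchmarks are easy to saturate or game: a model tuned on similar question formats can score well without being broadly capable.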

The harder problem is adapting evaluation methods to the sophistication of contemporary AI models, which can carry out sequences of intricate tasks over extended periods. That complexity makes them difficult to evaluate under controlled conditions, much as human intelligence resists being captured by any single test.

Concerns have also been raised about the integrity of public assessments: a model’s training data may overlap with the questions used to evaluate it, contaminating the results and introducing bias. Given these limitations, the need for a nuanced, comprehensive evaluation framework that can accommodate the multifaceted nature of AI systems is increasingly clear.
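
As a rough illustration of the contamination concern, a coarse screen can flag evaluation questions whose long word n-grams already appear verbatim in a training corpus. Real decontamination pipelines are more elaborate, but the sketch below captures the basic idea under those simplifying assumptions.

```python
# Coarse contamination screen (illustrative): flag an evaluation question if any
# of its long word n-grams appears verbatim in a training document.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}

def is_contaminated(eval_question: str, training_docs: list, n: int = 8) -> bool:
    """True if the question shares an n-gram with any training document."""
    question_grams = ngrams(eval_question, n)
    return any(question_grams & ngrams(doc, n) for doc in training_docs)
```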

Embracing Innovation in Evaluation

Hugging Face, valued at $4.5 billion, provides tools for building AI and has championed newer evaluation approaches such as the LMSys leaderboard, which lets users rank models with bespoke tests based on their own criteria. Cohere’s Gomez advocates internal test sets tailored to individual businesses, arguing that human evaluation remains the most effective measure of performance.
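
In that spirit, an internal test set can be as simple as a list of business-specific prompts plus a routine for recording human preferences between two candidate models. The sketch below is a generic pairwise, human-judged comparison under assumed `model_a`, `model_b`, and `judge` callables; it is not Cohere’s or LMSys’s actual methodology.

```python
# Generic sketch of a pairwise, human-judged internal evaluation: for each
# business-specific prompt, a reviewer picks the better of two model outputs.
import random
from collections import Counter
from typing import Callable

def pairwise_eval(
    prompts: list,
    model_a: Callable[[str], str],
    model_b: Callable[[str], str],
    judge: Callable[[str, str, str], str],  # returns "A", "B", or "tie"
) -> Counter:
    votes = Counter()
    for prompt in prompts:
        out_a, out_b = model_a(prompt), model_b(prompt)
        # Randomize presentation order so the judge cannot favor a fixed position.
        if random.random() < 0.5:
            verdict = judge(prompt, out_a, out_b)
        else:
            flipped = judge(prompt, out_b, out_a)
            verdict = {"A": "B", "B": "A"}.get(flipped, flipped)
        votes[verdict] += 1
    return votes
```

Tallying preferences this way mirrors the head-to-head voting behind crowd-sourced leaderboards, while keeping the prompts, and therefore the test, specific to the business.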

Ultimately, an enterprise’s choice of AI model rests on more than metrics alone; it involves a degree of subjectivity, much like choosing a car on more than horsepower and torque. As AI governance and evaluation practices mature, the task is to build an evaluation ecosystem that goes beyond standardized benchmarks toward a holistic understanding of AI models.

