Tencent has launched ArtifactsBench, a new benchmark designed to address shortcomings in how creative AI models are evaluated. Many users have experienced the frustration of asking an AI to build a simple application, such as a webpage or a chart, only to receive something that functions but is visually unappealing.
Typical examples include improperly placed buttons, clashing colors, and awkward animations. This scenario underscores a significant challenge in AI development: teaching machines good taste.
Traditionally, AI models have been evaluated on their ability to produce functionally correct code. These evaluations, however, often overlook the visual quality and interactivity that define modern user experiences.
ArtifactsBench seeks to close this gap by serving as an automated critic of AI-generated code. The process begins with the AI tackling creative tasks drawn from a library of over 1,800 challenges, ranging from data visualizations to interactive mini-games.
Once the AI produces the code, ArtifactsBench automatically runs it in a secure environment and captures screenshots over time to observe how the application behaves. This includes monitoring animations and feedback during user interactions.
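The article does not publish the harness itself, but the capture step it describes can be sketched in a few lines. The snippet below is a minimal illustration, assuming the artifact is a self-contained HTML file and using Playwright for headless rendering; the function name, viewport, and timing schedule are placeholders rather than ArtifactsBench's actual configuration.

```python
from pathlib import Path
from playwright.sync_api import sync_playwright

def capture_timeline(artifact_html: str, out_dir: str, checkpoints_ms=(0, 1000, 3000)):
    """Render generated HTML in an isolated headless browser and screenshot it
    at several points in time, so animations and delayed rendering show up in
    the evidence rather than only the initial frame."""
    out = Path(out_dir).resolve()
    out.mkdir(parents=True, exist_ok=True)
    page_file = out / "artifact.html"
    page_file.write_text(artifact_html, encoding="utf-8")

    shots = []
    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(page_file.as_uri())
        elapsed = 0
        for t in checkpoints_ms:
            page.wait_for_timeout(t - elapsed)  # advance to the next checkpoint
            elapsed = t
            shot = out / f"t{t}ms.png"
            page.screenshot(path=str(shot))
            shots.append(str(shot))
        browser.close()
    return shots
```

Capturing a sequence of frames, rather than a single screenshot, is what lets a judge see whether an animation actually plays or a chart renders only after a delay.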
ArtifactsBench then assembles the evidence, pairing the original request with the AI's code and the captured visuals, and hands it to a Multimodal LLM (MLLM) acting as judge. Unlike individual human reviewers, the judge works through a detailed per-task checklist, scoring each submission across ten metrics to ensure a fair, consistent, and thorough assessment of functionality, user experience, and aesthetic quality.
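In code, that judging step might look like the sketch below. The ten metric names, the 0-to-10 scale, and the prompt wording are illustrative assumptions; ArtifactsBench defines its own per-task checklists and scoring rubric.

```python
import json
from statistics import mean

# Hypothetical metric names standing in for the benchmark's real checklist.
CHECKLIST = [
    "functionality", "interactivity", "layout", "color_harmony",
    "typography", "animation_quality", "responsiveness",
    "accessibility", "task_fidelity", "overall_polish",
]

def build_judge_prompt(task: str, code: str, screenshots: list[str]) -> str:
    """Bundle the request, the generated code, and the screenshot evidence
    into one structured prompt for the MLLM judge."""
    rubric = json.dumps({m: "<0-10>" for m in CHECKLIST}, indent=2)
    return (
        f"Original task:\n{task}\n\n"
        f"Generated code:\n{code}\n\n"
        f"Attached screenshots: {', '.join(screenshots)}\n\n"
        f"Score each criterion from 0 to 10 and reply with JSON only:\n{rubric}"
    )

def aggregate(judge_reply_json: str) -> float:
    """Average the per-metric scores from the judge's JSON reply."""
    scores = json.loads(judge_reply_json)
    return mean(float(scores[m]) for m in CHECKLIST)
```

Pinning the judge to a fixed rubric and a machine-readable reply is what makes the scoring repeatable across thousands of tasks, where ad hoc human review would drift.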
The benchmark’s automated judging tracks human judgment closely, achieving a 94.4% consistency rate against human evaluations on WebDev Arena. Notably, ArtifactsBench also revealed that generalist models often outperform specialized coding AIs on creative tasks, likely because the generalists bring a more integrated skill set spanning coding, design aesthetics, and reasoning.
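As a purely illustrative reading of that figure, a consistency rate of this kind can be computed as the fraction of head-to-head model comparisons where the automated judge and human voters prefer the same side; the data below is made up for demonstration.

```python
def pairwise_consistency(judge_prefs: list[str], human_prefs: list[str]) -> float:
    """Fraction of model-vs-model matchups where the automated judge's
    preferred side matches the human-preferred side."""
    assert len(judge_prefs) == len(human_prefs) > 0
    agreed = sum(j == h for j, h in zip(judge_prefs, human_prefs))
    return agreed / len(judge_prefs)

# "A"/"B" marks which model won each matchup; toy data, not benchmark results.
print(pairwise_consistency(list("ABAA"), list("ABBA")))  # -> 0.75
```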
Ultimately, Tencent aims for ArtifactsBench to provide a reliable measure of AI’s ability to create engaging and user-friendly applications.