Tracking the Explosive World of Generative AI

Bye-Bye, Mechanical Turk? How ChatGPT Is Making Humans Obsolete

As AI advancements threaten the crowd-worker industry, a recent study reveals ChatGPT consistently outperforms human crowd-workers in classification tasks. Experts predict an uncertain future for crowd-working platforms like Mechanical Turk, which may struggle to stay relevant amidst the surge of AI-powered solutions.

A newly published research report shows how ChatGPT could upend the crowd-working industry for tasks like human data labeling. Photo illustration: Artisana

🧠 Stay Ahead of the Curve

  • Researchers find ChatGPT significantly outperforms human crowd-workers in classification tasks at vastly lower cost, signaling a potential industry shift.

  • The study highlights ChatGPT's superior performance and cost-effectiveness, underscoring the urgency for crowd-working platforms to adapt and stay competitive.

  • With AI models like ChatGPT and GPT-4 advancing rapidly, the future of crowd-working faces uncertainty, potentially impacting millions of workers and machine learning strategies.

By Michael Zhang

April 09, 2023

As the capabilities of large language models and chatbot technology expand, they are beginning to pose a significant threat to the once indispensable crowd-worker industry. This sector, which has long thrived on providing technology companies with simple and repetitive task solutions, may now face an uncertain future.

A recent study by researchers at the University of Zurich found that ChatGPT's gpt-3.5-turbo model consistently outperformed crowd-workers in various classification tasks, with up to three times better performance at roughly one-twentieth the cost. The experiment centered on a corpus of 2,382 tweets related to content moderation, a topic notoriously difficult even for human labelers to parse. Results were compared among ChatGPT, top crowd-workers on Amazon's Mechanical Turk platform, and trained graduate-student annotators.
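The paper's prompts are not reproduced in this article, but the underlying technique is simple: send each tweet to the model with instructions describing the labeling task and record its one-word answer. A minimal sketch, assuming the current OpenAI Python client and an invented prompt and label set (not the study's own wording), might look like this:

```python
# Illustrative sketch only: the prompt text and label set are assumptions,
# not the wording used in the University of Zurich study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def label_tweet(tweet: str) -> str:
    """Ask gpt-3.5-turbo for a zero-shot relevance label on a single tweet."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep outputs as deterministic as possible for labeling
        messages=[
            {"role": "system",
             "content": "You are a content-moderation annotator. "
                        "Answer with exactly one word: RELEVANT or IRRELEVANT."},
            {"role": "user",
             "content": f"Is this tweet about content moderation?\n\nTweet: {tweet}"},
        ],
    )
    return response.choices[0].message.content.strip()


print(label_tweet("Platforms should be forced to explain why they remove posts."))
```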

Crowd-working has grown into a multi-billion-dollar industry, with companies like Mechanical Turk, Figure Eight, and others providing scalable, flexible human workforces for repetitive tasks. One key application of crowd-workers is generating human-annotated data for machine learning algorithms. Newer entrants, such as Labelbox, a platform for managing AI training data, also offer labeling workforces as part of their services.

However, ChatGPT's meteoric rise and rapid advancement (the study employed the gpt-3.5-turbo model, and OpenAI has since released GPT-4) now casts a shadow over the crowd-working industry. The researchers intentionally chose content-moderation tweets as the basis for their tests, given the topic's inherent complexity and difficulty. Their findings indicated that ChatGPT's zero-shot accuracy surpassed crowd-worker performance in four out of five tasks, while its intercoder agreement (the consistency with which it assigned the same label across multiple runs) reached an impressive 95%.
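The article does not spell out exactly how that 95% figure was computed. One simple reading, assuming plain percent agreement between two repeated labeling runs, is sketched below with invented labels:

```python
# Toy illustration of intercoder agreement as percent agreement across runs.
# The run outputs below are invented; the study's own metric may differ in detail.
run_1 = ["RELEVANT", "IRRELEVANT", "RELEVANT", "RELEVANT", "IRRELEVANT"]
run_2 = ["RELEVANT", "IRRELEVANT", "RELEVANT", "IRRELEVANT", "IRRELEVANT"]

matches = sum(a == b for a, b in zip(run_1, run_2))
agreement = matches / len(run_1)
print(f"Intercoder agreement: {agreement:.0%}")  # 80% for this toy example
```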

A figure from the research paper details how ChatGPT performed significantly better than human labelers. Source: arxiv.org

In a complicated task such as stance detection, where the model and crowd-workers were asked to assess whether a tweet supported, opposed, or was neutral toward a piece of US legislation, ChatGPT consistently demonstrated three times greater accuracy than Amazon's most elite Mechanical Turk workers based in the U.S.

These findings were all achieved without the use of GPT-4, which has since demonstrated further gains in comprehending and processing intricate topics and prompts.

One startup CTO, who spoke on the condition of anonymity and whose company relies on services like Labelbox and Mechanical Turk for machine learning labels, envisions a bleak future for the industry. "In a few years," he remarked, "I'm not sure some of these services may be around at all. They'll need to start reinventing themselves now to stay relevant."
