AI Platforms Are Inconsistent in Detecting Hate Speech

With the proliferation of online hate speech, which can increase political polarization and damage mental health, leading artificial intelligence companies have started to release large language models (LLMs) that promise automatic content filtering. 

“Private technology companies have become the de facto arbiters of what speech is permissible in the digital public square, yet they do so without any consistent standard,” said Associate Professor Yphtach Lelkes. Until now, there has been no systematic comparison of these models, leaving open questions about arbitrariness, bias, and disproportionate harm.

He and Annenberg doctoral candidate Neil Fasching examined these models and produced the first large-scale comparative analysis of artificial intelligence-powered content moderation, publishing their study in Findings of the Association for Computational Linguistics.

“The research shows that content moderation systems have dramatic inconsistencies when evaluating identical hate speech content, with some systems flagging content as harmful while others deem it acceptable,” Fasching said. Lelkes noted that “these inconsistencies are especially pronounced for different demographic groups, meaning some communities are left far more vulnerable to online hate than others.”

They analyzed seven systems: the dedicated moderation endpoints from OpenAI and Mistral, along with Claude 3.5 Sonnet, GPT-4o, Mistral Large, DeepSeek V3, and Google Perspective API. Fasching says that while platforms like Facebook, Instagram, and X don't disclose which models they use, it's "highly likely" they're using or seriously considering these or similar systems.
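
The paper's exact prompts, thresholds, and API parameters aren't described here, but as a rough illustration of how such systems are queried, here is a minimal sketch using the OpenAI Python SDK. The model names and the classification prompt are illustrative assumptions, not the study's own setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

sentence = "All [group] are [hateful description]."  # placeholder test sentence

# Dedicated moderation endpoint: returns per-category flags and scores.
moderation = client.moderations.create(
    model="omni-moderation-latest",  # assumed model choice
    input=sentence,
)
result = moderation.results[0]
print("flagged:", result.flagged)
print("hate score:", result.category_scores.hate)

# General-purpose chat model prompted as a binary hate speech classifier.
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Classify the user's sentence as 'hate speech' or 'not hate speech'. Reply with the label only."},
        {"role": "user", "content": sentence},
    ],
)
print("gpt-4o verdict:", completion.choices[0].message.content)
```

At a high level, the comparative analysis amounts to running each test sentence through all seven systems and comparing the labels they return.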

Their analysis includes 1.3 million synthetic sentences, generated using a full factorial design, that make statements about 125 distinct groups. Each group falls into one of ten categories: race, ideology, religion, disability, specific interest, gender, education level, sexual orientation, age, or occupation. The descriptors of the groups include both neutral terms and slurs.

Each sentence combines the quantifier “all” or “some,” a group, and a hate speech phrase. Some also contain either a “weak incitement” to hostility or exclusion or a “strong incitement” to harm.
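
The study's actual factor lists aren't reproduced here, but a full factorial design simply crosses every level of every factor. A toy sketch in Python, with hypothetical stand-in phrases, shows the idea:

```python
from itertools import product

# Hypothetical stand-ins; the study's 125 group terms, hate speech phrases,
# and incitement clauses are not reproduced here.
quantifiers = ["All", "Some"]
groups = ["[neutral group term]", "[slur]"]
hate_phrases = ["are [hateful description]"]
incitements = ["", " and should be excluded", " and deserve harm"]  # none / weak / strong

# Full factorial design: one sentence for every combination of factor levels.
sentences = [
    f"{quantifier} {group} {phrase}{incitement}."
    for quantifier, group, phrase, incitement in product(
        quantifiers, groups, hate_phrases, incitements
    )
]

print(len(sentences))  # 2 x 2 x 1 x 3 = 12 sentences in this toy example
```

Crossing the study's much longer factor lists in the same way is what yields the full set of 1.3 million test sentences.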

Fasching and Lelkes found that evaluations across the seven systems were more consistent for sexual orientation, race and gender, while inconsistencies intensified for groups based on education, interest and class. Classification of statements related to Christians also varied widely across platforms.

The researchers found that the Mistral Moderation Endpoint was most likely to classify material as hate speech, OpenAI’s Moderation Endpoint demonstrated less consistent decision-making, and GPT-4o and Perspective API showed the most measured approach. “These differences highlight the challenge of balancing detection accuracy with avoiding over-moderation,” they wrote.

A minority of the 1.3 million sentences were neutral or positive, allowing the researchers to assess both false identification of hate speech and how the models handled pejorative terms in non-hateful contexts, such as “All [slur] are great people.” They found that Claude 3.5 Sonnet and the Mistral Moderation Endpoint treated slurs as harmful regardless of positive context, whereas the other systems appeared to prioritize the overall sentiment.

Fasching and Lelkes are expanding their research by comparing these LLM-based moderation results to human perceptions of hate speech. They’re also investigating differences in hate speech detection based on whether the content includes calls to action or incitement of violence, allowing them to compare performance to U.S. legal definitions of unprotected speech.
