Liliia Maiorova & Ariq Suryo Hadi P
01.07.2025
In times of political tension, the way people speak online, especially on social media, can reflect deeper divides in society. Just as elections reveal what people really think, the comments under political videos can be full of emotion, anger, frustration, and sometimes toxic language. But in an age when automated moderation systems appear every day, a fundamental question emerges: can AI truly tell the difference between emotional intensity and harmful intent?
This post shares insights from our attempt to answer that question. At the heart of our exploration is a machine learning model called Detoxify – a tool designed to detect toxic speech in online comments. We tested it on real political content on TikTok to better understand what the model gets right, where it stumbles, and what that means for public discourse in the digital age.
How Detoxify Works
Detoxify is an open-source AI model built on ideas from the Jigsaw Toxic Comment Classification challenges – global competitions where developers worked on better ways to detect harmful language online. It was trained on over 1.8 million comments, each labeled by people for things like threats, insults, and hate speech. Given new text, it predicts how likely that text is to be toxic, based on what it learned.
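For readers who want to try it themselves, here is a minimal sketch of how the library is typically called, assuming the pip-installable detoxify package (the other released checkpoints can be loaded the same way):

```python
# pip install detoxify
from detoxify import Detoxify

# Load the original Jigsaw-trained checkpoint; "unbiased" and "multilingual"
# checkpoints are also available and are loaded the same way.
model = Detoxify("original")

# predict() returns a dict of scores between 0 and 1, one per category
# (toxicity, severe_toxicity, obscene, threat, insult, identity_attack).
scores = model.predict("A circus of fools")
print(scores["toxicity"])
```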
Prior research (e.g., Dixon et al., 2018) has shown that the dataset used to train Detoxify contains biases, especially an overrepresentation of identity terms (e.g., “gay”, “Muslim”) in toxic comments. The creators of the model addressed this issue by expanding beyond the original Jigsaw dataset, adding translated comments and combining data from multiple Jigsaw challenges (such as the 2018 and 2019 competitions).
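One common way to probe for this kind of bias, in the spirit of Dixon et al. (2018), is to score otherwise identical template sentences that differ only in the identity term they mention. The templates and terms below are purely illustrative, not the ones used in the original study:

```python
from detoxify import Detoxify

# Illustrative identity terms and benign sentence templates.
identity_terms = ["gay", "Muslim", "Christian", "straight"]
templates = ["I am a {} person.", "My neighbour is {}."]

model = Detoxify("unbiased")
for term in identity_terms:
    for template in templates:
        text = template.format(term)
        score = model.predict(text)["toxicity"]
        print(f"{text!r}: {score:.3f}")
```

If the model were free of identity bias, these benign sentences would all receive similarly low scores regardless of the term they contain.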
Our Dataset: Political Talk on TikTok
TikTok might not be the first place you think of for serious political discourse, but it has rapidly become one of the most influential platforms for political messaging, especially among younger audiences.
So we started with a sample of 1,594 TikTok posts and used a language model (news-category-classification-distilbert by Yueh-Huan) to identify which ones were political. This gave us a final set of 566 political posts for our analysis.
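This filtering step can be reproduced with a standard Hugging Face text-classification pipeline. The model id below is reconstructed from the name mentioned above, and the “POLITICS” label we match on follows the HuffPost news-category scheme; both should be checked against the model card rather than taken as given:

```python
# pip install transformers torch
from transformers import pipeline

# Model id reconstructed from the classifier named above (assumption).
classifier = pipeline(
    "text-classification",
    model="Yueh-Huan/news-category-classification-distilbert",
)

captions = [
    "Senate passes new budget bill after late-night vote",
    "Three easy pasta recipes for busy weeknights",
]
predictions = classifier(captions, truncation=True)

# Keep only the posts whose predicted category is political.
political = [c for c, p in zip(captions, predictions) if p["label"].upper() == "POLITICS"]
print(political)
```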
In total, we gathered over 375,000 comments. After removing emoji-only replies, we were left with about 337,000 text comments.
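The emoji-only filter is a small preprocessing step. The exact rule is not critical; a minimal sketch might look like this (the Unicode ranges below are an approximation, not an exhaustive definition of emoji):

```python
import re

# Rough emoji pattern covering common emoji blocks and the red-heart sequence.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\u2764\ufe0f]+",
    flags=re.UNICODE,
)

def is_emoji_only(comment: str) -> bool:
    # A comment is "emoji-only" if removing emoji leaves no word characters.
    stripped = EMOJI_PATTERN.sub("", comment)
    return not re.search(r"\w", stripped)

comments = ["😂😂😂", "This is outrageous 😡", "❤️"]
text_comments = [c for c in comments if not is_emoji_only(c)]
print(text_comments)  # -> ['This is outrageous 😡']
```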
Proposed Definition of Toxicity
One of the biggest challenges in evaluating whether a model truly understands toxicity is agreeing on what toxicity actually means. Many researchers have attempted to define it, each with their own approach. The figure below, taken from The Toxicity Phenomenon (Hanscom et al., 2022), shows just how complex and nuanced this area of research can be.
Widely cited works across multiple research fields related to toxicity on social media platforms. (Adapted from The Toxicity Phenomenon, Hanscom et al., 2022)
We adopt the definition formulated in The Toxicity Phenomenon (Hanscom et al., 2022): toxicity is any interaction intentionally designed to provoke, inflame, or create counterproductive conflict. It may target individuals, communities, ideals, or organizations. Importantly, toxicity always positions itself in opposition to a specific entity, which distinguishes it from general negativity or emotional expression.
Toxicity is not the same as negativity. While negativity may reflect frustration, criticism, or dissatisfaction, toxicity is deliberately provocative and aimed at a specific target.
Difference between toxicity and negativity. (Adapted from The Toxicity Phenomenon, Hanscom et al., 2022)
What We Found: Strengths and Blind Spots
Running our dataset through Detoxify revealed a pattern. The model often did well when comments were explicitly aggressive or conspiratorial. Some examples:
“Shut up fat ass up and get out of my state Pritzker you suck!” (Toxicity score: 0.9)
“A circus of fools” (Toxicity score: 0.8)
“What a parrot! Just yank his chain and he will say whatever you want?” (Toxicity score: 0.8)
“TRUMP IS MENTALLY INSANE” (Toxicity score: 0.9)
That said, some highly scored comments used strong or aggressive language but seemed to fall under general “negativity” rather than true toxicity:
"Shameful. What an ugly display of cruelty." (Toxicity score: 0.8)
“This is not HOW IT WORKS!! NO!! YOU DONT GET TO TELL US WHAT TO DO!! COURT DOCUMENTS ARE PUBLIC RECORDS!!! WE ARENT STUPID!! AMERICA IS SUPPOSED TO BE FREE!!” (Toxicity score: 0.9)
“Democrats & Republicans are NOT doing a damn thing to help us Americans.. I will NEVER VOTE again ever.. What the HELL are you people afraid of” (Toxicity score: 0.9)
Although the creators of the model have made several updates to reduce bias, especially around words like “gay” and “Muslim”, our results show that this issue hasn’t been fully resolved. The model still tends to flag sentences containing these words as toxic, even when they aren’t harmful:
"They would only be deporting brown, black, and gay people" (Toxicity score: 0.7)
“Dictators love parades! except the gay ones. or the black ones. or the Palestinian ones. or the women ones. or the Mexican ones. or the Asian ones. or the Irish ones...” (Toxicity score: 0.7)
“Impeach the entire party. MAGA wake-up stop being afraid of brown, legal immigrants and gay people” (Toxicity score: 0.9)
“I salut you my brave Muslim sister because you emerged brave than all the cowardly Muslim rules of the world” (Toxicity score: 0.6)
“it's gonna be like in Germany. in Auschwitz Jews and Gypsies were deported. Even though they were German citizens. If he succeeds in this ... Mexicans and Muslims will fly to El Salvador” (Toxicity score: 0.6)
Finally, there were also comments that clearly seemed toxic but received surprisingly low toxicity scores from the model:
“If she were able to read she could have prepared better” (Toxicity score: 0.0005)
“china thanks you for your help in defeating America” (Toxicity score: 0.0005)
“Taylor Swift continues to thrive, all while he descends deeper into dementia” (Toxicity score: 0.0005)
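For completeness, this is roughly how we surfaced such cases for manual review: score the comments in batch and read the extremes of the distribution by hand. A minimal sketch, assuming the “unbiased” Detoxify checkpoint:

```python
from detoxify import Detoxify

model = Detoxify("unbiased")

comments = [
    "A circus of fools",
    "If she were able to read she could have prepared better",
    "This is not HOW IT WORKS!!",
]

# predict() also accepts a list; each category maps to a list of scores.
toxicity = model.predict(comments)["toxicity"]

# Sort by score so the highest- and lowest-scoring comments are easy to inspect.
for text, score in sorted(zip(comments, toxicity), key=lambda p: p[1], reverse=True):
    print(f"{score:.3f}  {text}")
```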
A Better Way to Detect Toxicity
So what can we do? Our results suggest that improving AI moderation starts with rethinking how we define the problem: if a model wrongly flags strong opinions or mere mentions of identity as toxic, it becomes harder for researchers to study online conversations and draw useful insights.
We propose the following principles:
Start with a clear definition, as developing computational toxicity models requires an interdisciplinary approach to first define what constitutes “toxicity”
Include a wide mix of speech styles – critical, emotional, supportive – not just toxic content, so the model learns what normal discourse looks like
Provide clear labelling guidelines to crowdsourced annotators
Add context to the model so it can understand what a comment is responding to and interpret its meaning more accurately
Incorporate contextual and topical information from each post so the model can better distinguish toxic language from non-toxic expressions such as constructive criticism or emotionally charged but valid commentary (a rough sketch of this idea follows below)
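As a purely hypothetical illustration of the last two points: score the comment together with the post it replies to, rather than in isolation. Detoxify was not trained on such post–comment pairs, so this is a probe of the idea, not a fix; the post caption and comment below are examples only:

```python
from detoxify import Detoxify

model = Detoxify("unbiased")

# Hypothetical pair: a post caption and a comment replying to it.
post = "New executive order on immigration announced today"
comment = "They would only be deporting brown, black, and gay people"

# Compare the comment scored alone vs. scored with its surrounding context.
score_alone = model.predict(comment)["toxicity"]
score_with_context = model.predict(f"{post} [SEP] {comment}")["toxicity"]

print(f"comment alone:        {score_alone:.3f}")
print(f"comment with context: {score_with_context:.3f}")
```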
References
Hanscom, R., Lehman, T. S., Lv, Q., & Mishra, S. (2022). The Toxicity Phenomenon Across Social Media. University of Colorado Boulder.
Patel, D., Pramanik, P. K. D., Suryawanshi, C., & Pareek, P. (2020). Detecting toxic comments on social media: An extensive evaluation of machine learning techniques. Social Network Analysis and Mining, 10(1), 1–20.
Dixon, L., Li, J., Sorensen, J., Thain, N., & Vasserman, L. (2018). Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (pp. 67–73).