We all have unconscious bias. Take that scene from "The Office", for example, when Michael asks Oscar "Is there a term besides Mexican that you prefer? Something less offensive?"
Oscar quickly rebukes Michael. Mexican isn't offensive. And Michael quickly backpedals from his statement that it carries certain negative connotations. But the damage is done. Michael has already betrayed the fact that to him Mexican is offensive. He's so used to hearing and using the term derisively that when it occurs in a neutral context, it sounds wrong to him.
We all do this. If not about race, then about gender. If not about gender, then sexual orientation. Or religion. Or political affiliation. Or education. Or economic status.
We all have biases, including unconscious ones, and those biases often come out subtly in our language. And there's no place our personal biases show themselves in our language more starkly than social media.
During the statewide elections in Virginia last month, I collected about three weeks' worth of tweets that included the name or Twitter handle of one or more of the candidates, or one of the more common hashtags associated with the election. Since then, I've done some digging into those tweets to see how people talking about the election referred to the candidates, the parties, and each other. The results are striking, if not (unfortunately) surprising.
On the whole, I found three particularly disturbing trends:
- There were major biases around identity terms (gender, race, nationality, etc.).
- Most identity terms were routinely used as slurs, rather than identifiers — especially those associated with individuals/groups other than straight, white men.
- There was a noticeable right-wing (and pro-white, pro-male) lean in the observed biases, which is all the more noteworthy given the Democrats' sweep of the statewide races and major gains in the state house.
Let's unpack these trends some more...
To uncover bias in my Twitter archive, I built a word vector model out of the tweets. A word vector model (word2vec is the best-known example) is a shallow neural-network model that is easy to put together and that lets you explore the semantic similarity of words based on the contexts in which they appear. Once you've built the model, you can query it for the words whose usage most closely resembles that of a particular term (or set of terms).
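To make the querying step concrete, here is a minimal sketch of a "most similar words" lookup. It assumes the closeness score is cosine similarity, which is what word2vec tools commonly report. The function names and the tiny three-dimensional vectors are invented for illustration and are not taken from the actual model.

```python
import math


def cosine(u, v):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=20):
    """Rank every other word by cosine similarity to `word`, highest first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors, made up for this example only.
# A trained model holds one much longer vector per vocabulary word.
toy_vectors = {
    "democrats": [0.9, 0.1, 0.2],
    "dems": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}

for other, score in most_similar(toy_vectors, "democrats"):
    print(f"{other}\t{score:.2f}")
```

In this toy data, "dems" ranks above "weather" because its vector points in nearly the same direction as the vector for "democrats". The same ranking logic, applied to vectors learned from real text, produces similarity lists.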
For example, here are the 20 words most semantically matched to the word "conservatives" in the Virginia election tweets (the numbers refer to the closeness of the words in the high-dimensional vector space of the model, with 1 being the maximum closeness):
There are a couple things to note here. First, the top term is "morons", and there are a number of other insults in the list: "sewer", "disgrace", "greed", "liars", "#whitesupremacists". This suggests that a significant number of tweets use the word "conservatives" as an insult. However, the presence of other political identifiers on the list ("christians", "republicans", "independents", "democrats", etc.) shows that this term is also used in semantic contexts very similar to those of other identifiers. So "conservatives" is not always a slur. There are also positive words like "inspire" and "#keepamericagreat", showing that many hold conservatives in high regard.
(I should note that this model treats the entire collection of tweets as a single corpus, as if it were a sample of a single strain of discourse. That isn't really the case: there were tens of thousands of authors in this tweet corpus, and they often had very different views from each other. So the model will necessarily show us a mix of those views. But as we'll see shortly, even that mix can be incredibly revealing.)
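As a rough illustration of what "a single corpus" means in practice, the sketch below pools tweets from every author into one list of token lists, the usual input shape for a word vector trainer. The tokenizer, the function names, and the sample tweets are all assumptions made for this example. The lowercase tokens and retained hashtags are chosen only to resemble terms such as "#keepamericagreat" that show up in the results.

```python
import re

# Keep hashtags and handles as single tokens; everything is lowercased.
# Note that this simple pattern splits contractions ("don't" -> "don", "t").
TOKEN_PATTERN = re.compile(r"[#@]?\w+")


def tokenize(tweet):
    """Split one tweet into lowercase word, hashtag, and handle tokens."""
    return TOKEN_PATTERN.findall(tweet.lower())


def build_corpus(tweets_by_author):
    """Pool every author's tweets into one list of token lists."""
    return [
        tokenize(text)
        for texts in tweets_by_author.values()
        for text in texts
    ]


# Hypothetical tweets and authors, invented for illustration.
sample = {
    "author_a": ["Go vote! #ElectionDay"],
    "author_b": ["@someone I voted", "Polls close at 7"],
}

for tokens in build_corpus(sample):
    print(tokens)
```

Once the tweets are pooled this way, the model has no record of who wrote what, which is why its similarity lists blend the usage of authors with opposing views.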
Let's compare the above results to the 20 most similar words to "liberals":
While there are clearly similarities between "liberals" and other identifiers like "democrats" and "republicans", it's clear that this word is much more of an insult in this tweet corpus than "conservatives". There are a couple more insults on this list, they tend to sit higher on the list, and their similarity scores are higher. (The 0.44 for "libtards" and "deranged" is about the same as the similarity of "conservatives" with "people", number 5 on the "conservatives" similarity list.) And interestingly, there are no terms of admiration in the top 20, only insults, neutral identifiers, and "noise" (random artifacts of a model of a large, diverse corpus).
From a comparison of "conservatives" to "liberals", it would seem that both are used as slurs in this corpus, though "liberals" more so than "conservatives".
Interestingly, party names only carry some of this difference. Here are the 20 terms most similar to "republicans":
These party names seem to be less often used as insults than the corresponding ideological identifiers, "conservatives" and "liberals". I do find it interesting, though, that "people", "americans", and "ppl" top the similarity list for "republicans". Do these Twitter users tend to think of republicans as normal American people and democrats as other? On the other hand, there are also more insults in the republican list... Perhaps those on the right, at least during the Trump presidency, are simply more polarizing — either good American patriots, or greedy morons. We need more data to tease that out.
Race, gender, religion
Many of the hot-button issues in the Virginia election related to identity politics: race (and racism), sex (and sexism), and religious identification (and the values that shake out from that). When we compare these identity terms to the political identifiers, the trends become even more clear.
Here are the 10 most similar terms in the election Twitter corpus for several racial identifiers.
Similar to "whites":
Similar to "blacks":
Similar to "hispanics":
Similar to "latinos":
Similar to "asians":
All of these terms are routinely used as slurs, but not all racial slurs are created equal. In this corpus, the insults associated with "whites" and "asians" are noticeably less prominent and less vicious. Blacks and latinos, on the other hand, are referred to almost interchangeably with "criminals". In fact, the similarity measure for blacks–criminals is higher than any similarity score in this post so far, except for democrats–dems. It's as if "blacks" is just an abbreviation for "criminals" in this tweet archive (or vice versa).
Let's turn to gender. The trends here aren't quite as stark as in racial identifiers, but they do carry over.
Here are the 10 words most similar to "men":
(There was a prominent trans woman, Danica Roem, running for, and ultimately winning, a Virginia house seat. However, queries for "trans", "transgender", "queer", "lgbt(q)", "danica", and "roem" returned mainly noise. Though there were certainly biases at play around LGBTQ issues and gender identity in this Virginia election, there just isn't enough coherent data in this Twitter archive to analyze it meaningfully.)
While neither "men" nor "women" (nor their singulars) seem to be used as slurs, there are interesting alignments. "Men" seems most strongly identified with neutral identifiers ("male", "males") and identifiers of neutrality ("people", "ones", "americans"). "Women" seems most strongly identified with markedness. There was one female candidate for statewide office, Jill Vogel, who ran on a platform of family and education issues (two domains that have a strong traditional gender bias towards women), so the association with "families" is unsurprising, both in general and in the context of the 2017 Virginia election. But the strong association with "leaders" and "activists" suggests to me not that women are associated with activism in this corpus, but that women are talked about like activists. That is, the term "women" is most notably brought up in the same contexts as "leaders" and "activists": when they do something prominent or out of the ordinary. "Men", by contrast, is used far more generically.
We can see this when we look at the similarity list for "people":
While "men" appears at number 8 (0.49), "women" doesn't even appear in the top 50! (Its similarity score with "people" is 0.36.)
In fact, the word vector modeling tool I use has a "doesn't fit" function. Give it a list of words, and it will tell you which word is the least similar to all the rest in the corpus. The odd word out in "men women people"? Women. The odd word out in "men women americans"? Women. In this archive at least, men are neutral, and women are marked.
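One common way to implement a "doesn't fit" query is to normalize each word's vector to unit length, average the vectors, and pick the word least similar to that average. The sketch below follows that approach. It is an assumption about how such a function can work, not necessarily what the tool used here does, and the vectors and neutral example words are invented for illustration.

```python
import math


def unit(v):
    """Scale a vector to length 1 so that only its direction matters."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_fit(vectors, words):
    """Return the word least similar to the mean of the group's unit vectors."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = [
        sum(u[i] for u in units.values()) / len(units)
        for i in range(dims)
    ]
    # The lowest dot product with the mean marks the odd word out.
    return min(
        words,
        key=lambda w: sum(a * b for a, b in zip(units[w], mean)),
    )


# Hypothetical 2-dimensional vectors, made up for this example only.
toy_vectors = {
    "apple": [1.0, 0.1],
    "pear": [0.9, 0.2],
    "car": [0.1, 1.0],
}

print(doesnt_fit(toy_vectors, ["apple", "pear", "car"]))
```

With these toy vectors the odd word out is "car", because "apple" and "pear" pull the average toward their shared direction. Under this approach, the result depends entirely on which words you put in the group.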
Nothing polarizes Americans quite like religion. In fact, of all the identifiers I investigated, religious identifiers were the most charged. Here are the 10 words most similar to the most common religious identifiers in the corpus.
Similar to "christians":
Similar to "jews":
Similar to "muslims":
Similar to "catholics":
(Other identifiers like "protestants", "mormons", "buddhists", and "hindus" did not appear in the corpus enough times to meet the threshold of the model.)
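A frequency threshold of this kind is usually applied before training: words that occur fewer than some minimum number of times are dropped from the vocabulary, so they can never be queried. Here is a minimal sketch of that filtering step. The threshold value, the function name, and the sample tweets are hypothetical, since the actual setting used for this model isn't stated.

```python
from collections import Counter


def build_vocabulary(tokenized_tweets, min_count):
    """Keep only the words that occur at least `min_count` times overall."""
    counts = Counter(
        token for tweet in tokenized_tweets for token in tweet
    )
    return {word for word, n in counts.items() if n >= min_count}


# Hypothetical tokenized tweets, invented for illustration.
tweets = [
    ["catholics", "vote"],
    ["catholics", "and", "mormons"],
    ["vote", "vote"],
]

vocab = build_vocabulary(tweets, min_count=2)
print(sorted(vocab))
```

In this toy data "mormons" occurs only once, so it falls below the threshold and never gets a vector, while "catholics" and "vote" survive.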
While I was not surprised that these terms had strong associations with each other and with insults, I was surprised that "muslims" appeared to be the least charged of all the religious identifiers in this dataset, given the prominent rhetoric in the US around Islam and Islamic extremism. Figuring that out will require more digging. (Though I did find that "islamist" was the most similar identifier to "extremist".)
It's clear that our political discourse on social media is often less than civil. But it's not just bots, sockpuppets, and trolls causing the problem. And it's not just on social media where our conscious and unconscious biases emerge in our language. Political identifiers (especially "liberal") have been used as slurs for a long time. So have religious, racial, and gender identifiers.
While algorithms and machine learning can reinscribe human biases and prejudices, we can also use machine learning to uncover biases in our discourse, some of which we might not even be aware of. As Nate Silver said in a talk about racial bias in voting patterns, if we can measure it we can predict it, and if we can predict it we can change it.
When it comes to biased human language, social media data helps us measure it. Machine learning helps us predict it. But it's up to us to change it.