Measuring Online Positivity: Machine Learning Models vs. Human Coders
May 2, 2022
The following is a first draft introducing elements of a directed study analyzing positive sentiment towards female journalists on Twitter. This blog post focuses primarily on the methods and approaches explored in the early stages of the research. Results will be posted in a separate blog post.
Social media has vastly changed the way the general public interacts with traditional media journalists. With the widespread use of social media platforms like Twitter, users can directly interact with journalists at lightning speed to comment on posts and share their views online. On average, approximately 500 million tweets are posted on Twitter every day (Internet Live Stats, 2022), with each post limited to a maximum of 280 characters (Rigolin, 2018). These tweets can include different kinds of attributes, like hashtags and emojis, as well as additional attachments, including photos, gifs, and videos.

With such a diverse volume of content available online, social media platforms face the daily challenge of moderating large quantities of material, particularly hate speech and inflammatory content. In the process of regulating this content while protecting freedom of speech, different models have been created to classify hate speech and moderate platforms to reduce harmful online activity. However, very little research has been conducted to identify the kinds of positivity that exist on online platforms. Research in this area is particularly important because it can help identify the kinds of interactions that exist and suggest ways to promote positivity, such as how tweets are phrased to invite meaningful dialogue and how to further engage users in positive discussions. Furthermore, limited research has compared how machine learning models and human coders interpret positivity. This case study aims to better understand the inconsistencies between tweets defined as positive by machine learning models and human-coded positivity. The research looks to respond to the following questions:
1. What is the rate of inconsistency between human-coded and machine-classified positive tweets?
2. Of the tweets labeled positive by machine learning, what types were in fact positive?
3. What attributes were most common among the mislabeled positive tweets?
What Constitutes Positivity?
According to the Oxford Advanced Learner's Dictionary, positivity is defined as 'the practice of being positive in your attitude and focusing on what is good' (Oxford Advanced Learner's Dictionary, 2022). With the evolution of language, cultural references, and openness to new ideologies, perceptions of what constitutes a positive term or interaction can change drastically over time. Terms can also take on new meanings following events that take place locally, nationally, or globally. For example, before January 2022, the term 'trucker' was viewed as a generally positive term referring to an individual delivering goods and services to Canadians. In January 2022, a trucker convoy headed to Ottawa, Ontario, and occupied the downtown core facing the Parliament of Canada for three weeks. This organized group protested various government policies, including vaccine mandates, and attempted to force Justin Trudeau to resign as Prime Minister (Terrell, 2022). Following this event, more individuals see the term in a negative light, especially residents of downtown Ottawa who were personally affected by the events on Parliament Hill. Conversely, the term may still be considered positive by individuals who share the ideologies of the convoy's organizers. Therefore, positivity is to some degree subjective and may not always be clearly defined, an additional consideration beyond the terms more commonly associated with positivity, such as happiness, joy, words of encouragement, and compliments.
In one-to-one human interactions, positivity is more clearly visible in the ways people interact: it can often be measured through facial expressions and the tone of the conversation (Dael, 2012). For instance, this could be smiling at an individual, providing words of support or encouragement, or meaningfully contributing to a neutral dialogue. An additional benefit of face-to-face interaction is that non-verbal cues can reduce confusion or signal a need for clarity, and the tone of an individual's voice provides implicit cues about the sentiment of the discussion. These cues are not available online, so alternative means are required to understand online sentiment.
Cultural Bias
Another factor that affects sentiment analysis is cultural difference, which also shapes how sentiment is interpreted. Each individual's background and culture can greatly dictate how they interpret different symbols. For example, for the film 'Inside Out', Pixar produced variations of the film that did not change the storyline but preserved symbolic meaning across cultures. In one scene, a father is trying to feed vegetables to his one-year-old baby. In the American version of the film, broccoli carried a negative sentiment, as it is commonly cited as children's least preferred vegetable. For Japanese audiences, however, broccoli is a popular vegetable with a positive sentiment, so it was replaced with green peppers to convey the same negative meaning (Acuna, 2015). This issue may be less common with more traditionally used emojis such as a thumbs up or a smiley face, but it highlights that videos or images may carry different background meanings depending on the perspective of the reader. Adding such attachments may change the way the sentiment of the content is perceived.
Methods for Measuring Sentiment
Sentiment analysis, also commonly referred to as 'opinion mining', is a classification system that assesses the overall sentiment of a fragment of content and attributes a polarity score. Scores are generally classified into three categories: positive, negative, or neutral (Sunitha, 2016). The measurement of sentiment can take on a subjective meaning, as the categories are determined and defined individually by the coder. This review of sentiment is especially important in Natural Language Processing (NLP), a field of computer science concerned with human-computer language interaction (Sunitha, 2016). Using sentiment analysis, training models are created to provide an outline for these natural language processes, which then output a measurement of the sentiment of social media content. There are numerous formulas for calculating the sentiment of a post. One approach subtracts the number of negative words from the number of positive words (Philander, 2016). For instance, if a post has 20 words, 10 positive and 5 negative, it would be classified as positive overall. Conversely, if the post had more negative than positive words, it would be classified as negative. A tweet containing an equal amount of positive and negative content would be classified as neutral (ex. I hate Batman but love Superman). Furthermore, certain terms that do not demonstrate sentiment are removed before analysis, such as "and", "this", and "they" (Zavattero, 2015).
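To make the word-counting approach concrete, here is a minimal sketch in Python. The positive, negative, and stop-word lists are illustrative placeholders, not the lexicons used in any published model:

```python
# Illustrative (hypothetical) word lists; real lexicons are far larger.
POSITIVE = {"love", "great", "fantastic", "good", "proud"}
NEGATIVE = {"hate", "terrible", "nasty", "awful", "bad"}
STOP_WORDS = {"and", "this", "they", "the", "a", "i", "but"}

def score_sentiment(text: str) -> str:
    """Classify text by counting positive words minus negative words."""
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # equal counts of positive and negative words

print(score_sentiment("I hate Batman but love Superman"))  # neutral
```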
Marketers have made strides toward understanding sentiment in user comments and engagement with products. Sentiment classification can be completed using a tier system for defining the sentiment of specific content. The tiers are labeled document, sentence, and feature/attribute (Chiu, 2015). The document level classifies the entire piece of content as either positive or negative according to the overall sentiment expressed in the text; a generally positive post with a few negative terms would still be categorized as positive overall. Measuring sentiment at the sentence level determines the sentiment of each sentence and weighs these to derive the overall meaning (Chiu, 2015). The feature level measures sentiment to identify the opinion stated in the text using different attributes and features of the content. Additional methods of measuring stylistic approaches and attributes focus on lexical features, syntax, and structure (Chiu, 2015). Lexical features are word- or character-based measures such as word length distribution and vocabulary richness. Syntax refers to the patterns used to form sentences and phrases. Structural features deal with the text's organization and its use of quoted content such as retweets.
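As an illustration of the lexical features mentioned above, the following sketch computes a word length distribution and a simple vocabulary-richness measure (the type-token ratio); this measure is one common option, not necessarily the one used in the cited work:

```python
from collections import Counter

def lexical_features(text: str) -> dict:
    """Word length distribution and vocabulary richness (type-token ratio)."""
    words = [w.strip(".,!?\"'").lower() for w in text.split() if w]
    lengths = Counter(len(w) for w in words)   # word length distribution
    richness = len(set(words)) / len(words) if words else 0.0
    return {
        "num_words": len(words),
        "word_length_distribution": dict(lengths),
        "vocabulary_richness": round(richness, 3),
    }

print(lexical_features("Wow, I watched the debate and have now read your piece."))
```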
Punctuation can equally contribute to the interpretation of sentiment. Punctuation marks such as exclamation points are often indicators of a positive tone, while recent articles have indicated that periods may at times be read as negative in social media posts. Quotation marks can be viewed as a form of sarcasm, as they imply a lower level of seriousness or an emphasis on a specific word. Capitalization can also contribute to increased levels of negativity, as fully capitalized content tends to be read as hostile.
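These punctuation and capitalization cues can be flagged programmatically. The heuristics below are simplified illustrations of the signals described above, not a validated tone or sarcasm detector:

```python
import re

def punctuation_cues(text: str) -> dict:
    """Flag punctuation/capitalization signals that may shift a post's tone."""
    words = text.split()
    return {
        "exclamations": text.count("!"),                  # often a positive cue
        "ends_with_period": text.rstrip().endswith("."),  # sometimes read as cold
        "quoted_terms": re.findall(r'["\u201c](.*?)["\u201d]', text),  # possible sarcasm
        "all_caps_words": [w for w in words if len(w) > 1 and w.isupper()],
    }

print(punctuation_cues('Nice "Reporting" WOW!!!'))
```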
Context
Another difficult consideration in sentiment analysis is context. Human-coded sentiment analysis is already time consuming and labor intensive, yet social context can greatly contribute to the understanding of information. Context is the visible information from a previous thread and a user's list of interactions with other users (Sánchez-Rada, 2019). Twitter is characteristically renowned for its short character limit, so much of the context is omitted from the post itself in favor of alternative ways to connect posts, such as responding directly to tweets to create a thread or posting subtweets (Rigolin, 2018).
Sample
The tweets analyzed in this case study were dated August 5 to October 31, 2019, a period that coincided with the campaign for the 2019 Canadian federal election. The collection included replies, mentions, quote tweets, and individual tweets directed towards users. The users who contributed to the data set formed a group of 346 journalists, hosts, panelists, and editors. The sample was collected through a multistep process. First, 743,685 tweets were run through BERT, an open-source machine learning framework for natural language processing (Lutkevich, 2020). The BERT sentiment classification followed seven possible outcomes, including highly negative, medium negativity, low negativity, positive, neutral, and unclear. From these tweets, 29,319 were classified as positive by BERT. From this collection of positive tweets, 1,000 were randomly sampled to complete the sentiment analysis.
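The two-step process of machine classification followed by random sampling can be sketched with the Hugging Face transformers library. Note that the default pipeline model here is an off-the-shelf binary sentiment classifier, not the seven-category scheme used in the study, and the tweets shown are placeholder data:

```python
import random
from transformers import pipeline

# Off-the-shelf binary sentiment model, standing in for the study's
# seven-category BERT classification scheme.
classifier = pipeline("sentiment-analysis")

tweets = ["Fantastic work Mr. Tasker.", "Enjoy prison dipshit"]  # placeholder data
results = classifier(tweets)

# Keep tweets the model labels positive, then randomly sample for manual coding.
positive = [t for t, r in zip(tweets, results) if r["label"] == "POSITIVE"]
manual_sample = random.sample(positive, min(1000, len(positive)))
print(manual_sample)
```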
Methods
The 1,000 tweets classified as positive by BERT were manually coded by a single human coder using sentiment content analysis. The coding book was divided into three primary areas of analysis: human-coded sentiment, written content, and tweet attributes.
Human Coded Sentiment
Each tweet was classified under one of four categories: Positive, Neutral, Negative, and Undefined. The category was selected based on the sentiment that most prominently defined the tweet overall. The fourth category, 'Undefined', was added for tweets that did not fit the initial three categories; these were difficult to classify because they could hold two sentiments at once and therefore could not fit a single category. For example, a tweet saying "And they're friends with you so they can see what is, what it's like when your team actually wins games" could be taken as positive or negative without any context.
Written Content
The written content was defined as the subject matter of the tweet that could generate a sentiment. This section looked at the kinds of words used to develop a sentiment and generate a specific feeling. Because a post could convey multiple feelings, tweets could be coded with one or more sentiments to account for the different combinations of words used. As the premise of the research is focused on positivity, categories were generated to identify clear forms of positive sentiment and negativity. A total of eight categories were selected to code the written content: Condemning Hate, Informative, Apology, Encouragement, Compliment, Negativity, Swear Words, and Other. The categories are defined below, followed by a keyword-based sketch that approximates them in code.
Defined Positive Categories
· Condemning hate was defined as a user directly responding to an account to acknowledge negative commentary directed at the user (ex. @Terrible. People can be so nasty.).
· Informative content could include any kind of announcement of upcoming events, linked articles, or details shared with the public (ex. We are a few weeks away from an election, and I am really proud and excited to co-host Partly from @cbcpodcasts with @RosieBarton.). Informative content could not include personal opinion. Furthermore, statements of value from unverified accounts were excluded from this category, as the information could not be fact-checked or verified for accuracy.
· Apology was classified as the use of the words 'Sorry' or 'My apologies' directed towards a user to demonstrate regret. The regret could concern the user's own or another user's commentary, or the outcome of events (ex. @RVAwonk Yikes. Sorry you had to deal with that.). An apology was classified as positive because it expresses regret for a negative occurrence and attempts to reduce the harm.
· Encouragement was classified as words of motivation to continue pursuing an activity (ex. a hobby, project) as well as acknowledgement of personal achievements (ex. getting a promotion). Specific terms highlighted in this category included 'Congratulations', 'Great job', and 'Keep up the good work'. Furthermore, stand-alone terms such as 'Good', 'Excellent', and 'Fantastic', which could be classified as compliments but did not include further text to expand on their context, were also classified as encouragement.
· Compliment was a positive statement directed at a specific user (ex. Wow, I watched the debate and have now read your piece. Fantastic work Mr. Tasker. I was eager to see CBC’s reporting. I see fairness. Good Work🙂).
Defined Negative Categories
· Negativity is a formulation of ideas that would be explicitly unpleasant to a user, or an opposing opinion that does not use constructive dialogue to engage in a neutral exchange of ideas (ex. Enjoy prison dipshit; @RosieBarton @CBCNews so nice to see you laughing about Alberta's issues. Wow!!! You should resign and go work for Justin where you belong.).
· Swear words include any type of coarse language that would not be used in a professional context such as school or work. Words included 'Fuck', 'Dipshit', 'Cunt', and 'Pussy'. Swear words were also coded when they were not fully written out but were represented symbolically (ex. OMFG; Congrats to every dumb F who voted UPC; Utterly f***ing amazing).
Additional Categories
· Other was added as an additional category to account for posts that did not fit into any of the aforementioned categories.
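As referenced above, the sketch below approximates a few of these categories with hypothetical keyword lists. The study's coding was performed entirely by hand; these lists exist only to illustrate how such a coding book might be mechanized:

```python
# Hypothetical keyword lists approximating the coding book's categories;
# the study's classification was performed manually by a human coder.
CATEGORY_KEYWORDS = {
    "apology": ["sorry", "my apologies"],
    "encouragement": ["congratulations", "great job", "keep up the good work"],
    "swear_words": ["fuck", "dipshit", "f***"],
}

def code_written_content(tweet: str) -> list[str]:
    """Assign one or more content categories by keyword matching."""
    text = tweet.lower()
    labels = [category for category, keywords in CATEGORY_KEYWORDS.items()
              if any(k in text for k in keywords)]
    return labels or ["other"]  # fall back to the catch-all category

print(code_written_content("Sorry you had to deal with that."))  # ['apology']
```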
Tweet Attributes
The second part of the coding book assessed the format of the tweet and the attributes it contained. Defining the attributes of each post made it possible to identify patterns or inconsistencies in how sentiment was assigned. Overall, seven attributes were selected to measure the format of the post: emojis, punctuation, length, capitals, attachments, hashtags, and language (ex. English). The terms were defined as follows, with a simplified coding sketch after the list:
· Emojis – Any icons found among the text of the tweet that can demonstrate a type of sentiment. Emojis represented through text symbols [ex. :)] were also included in this category.
· Punctuation – Any markers used to follow general grammatical guidelines [.,;:‘’“”()-]. Punctuation can significantly influence the way content is read, substituting for the tone of a spoken interaction. For example, comparing tweet one, Nice reporting, with tweet two, Nice “Reporting”, the quotation marks in the second version could be classified as sarcastic or as implying a greater meaning.
· Length was organized into three separate categories: Short, Medium, and Long. Short was classified as a post that ranged from a single character, up to 92 characters. Medium length tweets ranged from 93 to 140 characters. Long ranged from 141 characters up to 280 characters.
· Capitalization – Terms spelled out in full capitals (ex. HELLO). A total of three categories were selected: none, partially capitalized, and fully capitalized.
· Attachments – Any additional content that is not part of the text. Attachments could include images, gifs, and videos.
· Hashtags – Any post that includes at least one hashtag (ex. #DefundCBC).
· Language – English or French.
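The sketch referenced above codes a few of these attributes automatically. The length boundaries follow the Short/Medium/Long categories defined in the list, while the capitalization check is a simplified version of the three-category scheme:

```python
def code_attributes(tweet: str) -> dict:
    """Code format attributes from the coding book (simplified sketch)."""
    n = len(tweet)
    if n <= 92:
        length = "short"
    elif n <= 140:
        length = "medium"
    else:
        length = "long"  # 141 up to Twitter's 280-character limit

    # Simplified capitalization heuristic over words containing letters.
    words = [w for w in tweet.split() if any(c.isalpha() for c in w)]
    caps = [w for w in words if w.upper() == w and len(w) > 1]
    if not caps:
        capitalization = "none"
    elif len(caps) == len(words):
        capitalization = "fully capitalized"
    else:
        capitalization = "partially capitalized"

    return {
        "length": length,
        "capitalization": capitalization,
        "has_hashtag": "#" in tweet,
        "has_emoticon": ":)" in tweet or ":(" in tweet,  # crude emoji proxy
    }

print(code_attributes("so nice to see you laughing about Alberta's issues. Wow!!!"))
```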
Limitations
This research had three limitations that influenced the results. First, due to time constraints, there was not sufficient time to analyze the threads in which the content was posted. This context could be particularly relevant for specific topics, including politics and pop culture references. Relatedly, tweets are by design brief posts; when a reply was very short (i.e., well below the 280-character limit), it was difficult to assess a deeper meaning. Tweets were therefore assessed on the basis of the content available in the post itself, including text, punctuation, hashtags, and quote tweets. Each tweet was also referenced to verify whether there was additional content; as the sample was taken in 2019, some of the content has since been removed and could not be verified for additional content that could influence the overall meaning. Second, given the time considerations, only a limited number of criteria could be coded, and the coding book could not be tested on a smaller sample first. The final limitation is that only one coder reviewed the tweets. As the interpretation of sentiment can vary by individual, inter-rater reliability could not be calculated to assess the degree of homogeneity in the classification categories. For future studies, it would be strongly recommended to include additional coders to maintain consistency.
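If future iterations include a second coder, inter-rater reliability can be estimated with a statistic such as Cohen's kappa, which measures agreement between two coders beyond what chance would predict. The sketch below uses hypothetical labels from two imagined coders:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two coders beyond chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two coders over the same five tweets.
coder_1 = ["positive", "neutral", "negative", "positive", "undefined"]
coder_2 = ["positive", "neutral", "positive", "positive", "undefined"]
print(round(cohens_kappa(coder_1, coder_2), 2))  # ~0.71
```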