Identifying and rating humor and offensiveness of texts


Project Team: William Avery, Satya Boddu, Sai Koukuntla, Jin Lee, Lindia Tjuatja, and Logan Winger

GitHub Repo: https://github.com/satyaboddu26/EE460J-FinalProject [1]

Sentiment analysis is a widely studied technique within the field of natural language processing, specifically for identifying positive and negative emotion. However, more nuanced expressions of emotion, such as humor, pose a challenge due to their range and tendency towards idiosyncrasy. Humorous statements can vary widely in quality, from plays on words and tacky one-liners, to situational irony and “dark” or offensive humor.

Our project focuses on multiple aspects of sentiment analysis with humor. The dataset and main goals for our project come from the SemEval 2021 Shared Task 7 (HaHackathon: Rating Humor and Offense) [2]. The training data consisted of 8,000 short texts (about one to two sentences long) that were labeled as humorous (1) or not humorous (0). If the text was considered humorous, it was given a rating for how humorous (0–5) and how “controversial” its humor label is (0 or 1). A humorous text is controversial if its variance in humor ratings is higher than the median variance of all texts. In addition, there is a rating for how offensive the text is (0–5).

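We focused on the three subtasks from Task 1 of the competition: detecting whether a text is humorous, rating how humorous it is, and detecting whether its humor rating is controversial.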

Since this dataset contains a large proportion of text samples that are both humorous and offensive, it is unusual compared to much larger humor datasets. While we were able to achieve acceptable performance on our first subtask (up to ~0.89 F1 score), the limitations in the data made the second and third subtasks rather difficult.

Exploratory Data Analysis

The data contains 8,000 texts and comes in the following format:

Figure 1: Data format

By plotting a histogram of the data, we can see that almost 5,000 of the 8,000 texts in the training set are considered humorous.

Figure 2: Histogram of is_humor

Of the texts that are humorous, nearly half are also considered controversial.

Figure 3: Histogram of humor_controversy

To look for trends and common words in texts that are considered humorous, we created a word cloud.

Figure 4: Word Cloud of humorous texts

Many of the largest words in the word cloud above are gendered, such as “wife”, “man”, “guy”, and “girl”. This suggests that humorous texts in this dataset often use gender roles as a source of comedy. Sexist jokes are often considered offensive or controversial, so we created another word cloud from texts classified as humorous and controversial.

Figure 5: Word Cloud of controversial texts

We can see several gendered words in this cloud as well, but they are significantly larger, indicating these words are more common in controversial texts.

Data Preprocessing Techniques

The importance of data preprocessing cannot be overstated for machine learning, especially for NLP tasks. As such, we tried a variety of approaches for processing our text data into usable features. We found that some “standard” NLP approaches to preprocessing, such as the removal of stopwords and punctuation, were not suitable for this task, as some information about the tone of the statement was often lost (e.g. exclamation points).


Word2Vec uses a neural network to learn word associations within a corpus of text by representing each distinct word as a vector. Importantly, word vectors are positioned in the vector space such that contextually similar words are closer to each other within the space. Thus, how close two word vectors are in the vector space indicates the level of semantic similarity within the text, allowing for more complex analysis via other models.
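As a rough illustration of what “close in the vector space” means, the sketch below computes cosine similarity, one common closeness measure for word vectors. The three-dimensional vectors are hypothetical values made up for this example; they are not taken from a trained Word2Vec model.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "word vectors", invented for illustration only.
vectors = {
    "wife": [0.9, 0.1, 0.3],
    "husband": [0.8, 0.2, 0.35],
    "banana": [0.1, 0.9, 0.0],
}

sim_related = cosine_similarity(vectors["wife"], vectors["husband"])
sim_unrelated = cosine_similarity(vectors["wife"], vectors["banana"])
```

With these made-up values, the two related words score much closer to 1 than the unrelated pair, which is the property downstream models rely on.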

Paragraph Vector

Paragraph Vector was introduced by Quoc Le and Tomas Mikolov in a paper titled Distributed Representations of Sentences and Documents [3]. Similar to Word2Vec, Paragraph Vector is an unsupervised learning algorithm that captures semantic meaning and represents text data as vectors.

Like Word2Vec, Paragraph Vector learns unique vectors for each word; however, Paragraph Vector introduces a matrix D during training that contains vectors for each paragraph or document. According to Le and Mikolov, the paragraph vectors act “as a memory that remembers what is missing from the current context — or the topic of the paragraph” [3]. In Paragraph Vector, the word vectors persist across documents, while the paragraph vectors change. Figure 6 below provides an example of this difference.

Figure 6: Paragraph Vector (left) versus Word2Vec (right)

After training is complete, the learned word vectors are held fixed and used to infer the appropriate paragraph vector for new documents.

Fortunately, the heavy implementation work has been done by Gensim, which provides a class called Doc2Vec that implements the Paragraph Vector algorithm.


TreebankWordTokenizer uses regular expressions to tokenize words as follows: split standard contractions, treat punctuation as tokens, split off commas and single quotes when followed by whitespace, and separate periods that appear at the end of the line [4]. This assumes that text has already been separated into sentences using sent_tokenize().


CountVectorizer is a method of transforming text to a vector by mapping each word or token to a column of a matrix, and each document to a row of the matrix [5]. If document i contains word j, then location i, j of the matrix will contain the number of times j appears in i, and a zero otherwise.
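The document-by-word count matrix described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not scikit-learn's implementation: it assumes lowercase whitespace tokenization, and the helper name `count_vectorize` is hypothetical.

```python
from collections import Counter

def count_vectorize(documents):
    """Build a document-term count matrix from whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct token, in sorted order.
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    column = {tok: j for j, tok in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for tok, n in Counter(tokens).items():
            row[column[tok]] = n  # entry (i, j) = count of word j in document i
        matrix.append(row)
    return vocabulary, matrix

docs = ["the cat sat on the mat", "the dog sat"]
vocab, counts = count_vectorize(docs)
```

Each row of `counts` corresponds to one document and each column to one entry of `vocab`; words absent from a document stay at zero.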


TF-IDF is a method to vectorize text using frequencies instead of counts. There are two components for each word: term frequency (TF) and inverse document frequency (IDF). Term frequency measures the frequency of a word in a single document, and inverse document frequency indicates how uncommon the word is in all documents. TfidfVectorizer then takes the product of these two values for each word in each document to create a sparse matrix.
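The product of the two components can be sketched as follows. This uses the textbook definitions (term frequency as a within-document proportion, IDF as the log of total documents over documents containing the word); scikit-learn's TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its values will differ. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {token: tf * idf} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each token appears at least once.
    doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            tok: (n / total) * math.log(n_docs / doc_freq[tok])
            for tok, n in counts.items()
        })
    return weights

docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
```

In this toy corpus “the” appears in every document, so its weight is zero, while a word that appears in only one document (such as “dog”) receives the largest weight.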


VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool. VADER computes a compound sentiment score by summing the valence values of each word in the text, and then adjusts the score according to syntactic rules. VADER is specifically attuned to sentiment expressed in social media, so it accounts for capitalization, slang and contractions when calculating the compound score. An example of this is shown in Figure 7 below. VADER also outputs the proportions of the text that are positive, neutral, and negative.

Figure 7: Example of VADER sentiment analysis


SentiWordNet is an opinion mining lexicon, which takes words from WordNet (an online lexical reference system) and gives tags to each word based on its estimated degree of positivity, negativity, and neutrality.

GloVe Embeddings

While Word2Vec uses the local context information of text in the dataset to generate word vectors, GloVe also incorporates global word co-occurrence statistics aggregated over an entire corpus. GloVe embedding is founded on the idea that relevant semantic relationships between words can be derived from the co-occurrence matrix of a text dataset [6].

Given a dataset containing N unique words, the corresponding N x N co-occurrence matrix D contains the number of times word i co-occurred with word j in the same text at D_ij. Figure 8 below is an example of a co-occurrence matrix for the sentence “the cat sat on the mat.”
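A co-occurrence matrix for the example sentence can be built with a short sketch like the one below. It assumes that two words “co-occur” when they fall within a fixed window of each other; the window size of 1 is an assumption made for illustration, and the resulting counts may not match the figure if a different window was used.

```python
def cooccurrence_matrix(tokens, window=1):
    """Count how often each pair of words appears within `window` positions."""
    vocabulary = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    for pos, tok in enumerate(tokens):
        start = max(0, pos - window)
        stop = min(len(tokens), pos + window + 1)
        for other in range(start, stop):
            if other != pos:
                matrix[index[tok]][index[tokens[other]]] += 1
    return vocabulary, matrix

vocab, matrix = cooccurrence_matrix("the cat sat on the mat".split(), window=1)
```

The matrix is symmetric, since if word i appears next to word j then word j also appears next to word i.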

Figure 8: Example of a co-occurrence matrix

In addition to a local co-occurrence matrix for a text dataset, GloveVectorizer uses aggregated global word co-occurrence statistics from an external dictionary and incorporates its pre-trained word vectors in its log-bilinear model with a weighted least-squares objective.


Bidirectional Encoder Representations from Transformers (BERT) is a Google-created NLP model pre-trained on a large corpus of data. Like other models, BERT is used to find language representations from text inputs. However, BERT is special in that it is deeply bidirectional, meaning that the model checks both to the left and right of tokens when determining context. In addition to learning the relationships between words, BERT also learns relationships between sentences.


In the following sections, we will discuss the preprocessing techniques and models used for detecting humor, rating humor, and detecting humor controversy.

Performance Metric

For the binary classification subtasks (detecting humor and controversy), our primary performance metric was the F-score (or F1 score). This metric is the harmonic mean of the precision and recall of the model, where precision is the number of true positives divided by the number of samples classified as positive, and recall is the number of true positives divided by the total number of actual positive samples. F-score is commonly used to evaluate NLP tasks.
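A minimal sketch of how precision, recall, and F1 can be computed from counts of true positives, false positives, and false negatives is shown below. The labels are hypothetical and the helper name is our own; in practice a library routine would normally be used instead.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```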

For the regression subtask (humor rating), RMSE was used. Training scores in development were calculated using a 75/25 train-test split.
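For reference, RMSE and a 75/25 split can be sketched as follows. The shuffling seed and the helper names are assumptions made for this example, and the ratings are made-up numbers.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def train_test_split_indices(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and hold out `test_fraction` of them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]

# 8,000 texts split 75/25 gives 6,000 training and 2,000 held-out samples.
train_idx, test_idx = train_test_split_indices(8000)

# Hypothetical true and predicted humor ratings.
error = rmse([2.0, 3.0, 1.0], [2.5, 2.0, 1.0])
```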

Models to Detect Humor

The humor detection task was the main focal point of our project. Although we tried many different model variations, we can broadly classify our attempts into three categories: naive Bayes models, tree-based models, and transfer learning. The naive Bayes and tree-based models cover all of our attempts that used out-of-the-box models paired with various methods of preprocessing, including feature generation. In contrast, the transfer learning category encapsulates our attempt at implementing a pre-trained humor detection model on our own data.

Naive Bayes Models

Because humor detection is essentially a form of sentiment analysis, a logical approach to solving this binary classification problem is to implement naive Bayes models. Despite being somewhat ‘baseline’, naive Bayes models are typically strong solutions for various sentiment analysis and NLP tasks. Since there are multiple versions of naive Bayes models as well as numerous options for preprocessing text, we tried many different pairings of Bayes models and preprocessing techniques to determine which performed best for humor detection.

To quickly summarize what makes naive Bayes models naive, these models make the assumption that, given the class label, each feature is conditionally independent of all other features. The two main naive Bayes models used for text classification are multinomial naive Bayes and complement naive Bayes. The main difference between the two is that complement naive Bayes tries to correct for some of the more “severe” assumptions made by the standard multinomial naive Bayes model, resulting in better performance in general and greater stability. Additionally, naive Bayes models work well on small training sets, which is helpful since we are limited to only 8,000 texts.
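To make the independence assumption concrete, here is a small from-scratch sketch of a multinomial naive Bayes classifier with additive smoothing. It is a simplified illustration rather than scikit-learn's MultinomialNB or ComplementNB, the class name is hypothetical, and the four training texts are invented for the example.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial naive Bayes over token lists with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.vocabulary = {tok for doc in documents for tok in doc}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.token_counts[label].update(doc)
        return self

    def log_posterior(self, doc, label):
        # log P(label) + sum of log P(token | label). Adding per-token terms
        # is exactly where the conditional independence assumption enters.
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        denom = (sum(self.token_counts[label].values())
                 + self.alpha * len(self.vocabulary))
        for tok in doc:
            if tok in self.vocabulary:  # ignore tokens never seen in training
                count = self.token_counts[label][tok]
                score += math.log((count + self.alpha) / denom)
        return score

    def predict(self, doc):
        return max(self.classes, key=lambda c: self.log_posterior(doc, c))

# Invented toy data: 1 = humorous, 0 = not humorous.
train_docs = [
    "why did the chicken cross the road".split(),
    "knock knock who is there".split(),
    "the meeting is at noon".split(),
    "the report is due on friday".split(),
]
train_labels = [1, 1, 0, 0]

model = TinyMultinomialNB(alpha=0.2).fit(train_docs, train_labels)
prediction = model.predict("knock knock chicken".split())
```

The smoothing parameter `alpha` keeps words unseen in a class from driving the class probability to zero, which matters on a small training set.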

For our first attempts at the humor detection task, we trained a few naive Bayes models without any hyperparameter tuning to establish a baseline for performance. Using TfidfVectorizer as our base preprocessor, we found that scikit-learn’s ComplementNB initially performed better than MultinomialNB. This was expected since ComplementNB is well-suited for imbalanced data sets such as ours, and typically outperforms MultinomialNB in text classification, according to the documentation [7].

With the baseline performance established, we then tried different pairings of preprocessing techniques and naive Bayes model types. One particular regime of preprocessing techniques led to a large improvement in our training score. We used sent_tokenize() and TreebankWordTokenizer to tokenize each text. After tokenizing the texts, we passed the tokens into CountVectorizer with an ngram range of (1,2), and fed the resulting vectors into a ComplementNB model with an alpha (smoothing parameter) of 0.2. This method of preprocessing led to an F1-score of 0.8889 and an accuracy of 0.863 on the test set.
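To show what an ngram range of (1,2) adds to the feature set, the sketch below expands a token list into unigrams and bigrams. The tokens are written by hand in the Treebank style (contraction split, punctuation kept as its own token) rather than produced by NLTK, and the `ngrams` helper is hypothetical.

```python
def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the token list for n in the inclusive range."""
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features

# Hand-written Treebank-style tokens for the text "I don't get it!"
tokens = ["I", "do", "n't", "get", "it", "!"]
features = ngrams(tokens)
```

Bigrams such as “it !” keep punctuation attached to its neighboring word, which preserves some of the tonal information that plain unigram counts lose.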

For completeness, we also tried using the same multistep preprocessing regime again but with a tuned MultinomialNB model instead. We were able to slightly improve our performance on the test set to an F1-score of 0.8903 and an accuracy of 0.864.

We would like to note that LinearSVC and SVC are also good baseline models for text classification; however, they did not perform as well as the naive Bayes models on our data, so we did not explore them further.

Sentiment Feature Engineering with Naive Bayes Models

In addition to representing the text through different embedding methods, we also tried feature engineering by using sentiment analysis tools on the word and sentence level. We stacked these features on top of previous preprocessing steps.

We tried two different methods of sentiment analysis. The first method was NLTK VADER, which was stacked on TF-IDF and ComplementNB. The second method was tokenizing words and passing them to SentiWordNet, then stacking aggregate positive and negative scores with TreebankWordTokenizer, CountVectorizer, and both naive Bayes models. Out of the two, the second method worked better, but did not increase our overall accuracy or F1-score.

Tree-Based Models and Gradient Boosting

For a different approach, we tried pairing GloVe embeddings and various tree-based models. We experimented with GloVe embeddings to generate word vectors, using pre-trained word vector dictionaries extracted from Wikipedia with a 400k vocabulary and vector dimensions of 50, 100, and 300, resulting in three different dictionaries. We then trained RandomForestClassifier, ExtraTreesClassifier, XGBClassifier, and CatBoostClassifier models on vectors generated from each dictionary.

CatBoost appeared to be the best model out of these, with an F1-score of 0.869 and accuracy of 0.834 on the training set vectorized with the dictionary of 300-dimension pre-trained vectors. However, the model only had an accuracy of 0.792 on the public test set. As a result, GloVe embeddings were deemed an unsuccessful approach for vectorizing. We believe the GloVe embedding method’s poor performance is most likely due to the small training set, along with the size and generality of the pre-trained dictionary.

Transfer Learning the ColBERT model

An alternative approach to building our own models for humor detection is to try to transfer learning from other models. Transfer learning is the notion of using pre-trained deep learning models as a starting point for a different but related problem. Even though transfer learning is regarded as a powerful tool in the machine learning community, we struggled to get any meaningful results using our own version of a transfer learning model.

Although the typical recipe for neural network training is to update every layer of the model via backpropagation, transfer learning is different. Instead, we “freeze” all but the last few layers of the pre-trained model, meaning the only weight updates we make during training are to the unfrozen layers. All the frozen layers retain their original pre-trained weights. The rationale behind this is that the pre-trained models have often already been trained on much larger datasets, suggesting that these frozen layers may be able to find latent features that can be generalized to other similar datasets. Thus, we can use these frozen layers to hopefully find useful latent features within our data and then train the unfrozen layers to correctly interpret them.

To implement this idea of transfer learning on our own data, we found a paper titled ColBERT: Using BERT Sentence Embedding for Humor Detection that performs the same humor detection task, albeit on a different dataset [8]. The model specified in the paper was able to achieve an F1-score of 0.982 for humor detection on a dataset composed of 200,000 short texts. Luckily for us, the paper’s authors provided the model and implementation instructions on their GitHub, making the transfer learning implementation possible.

To ensure maximum performance, we applied the preprocessing steps outlined in the paper on our own data. This involved dropping texts that were not between 30 and 100 characters long as well as texts that did not contain between 10 and 18 words. Once these initial requirements were met, we expanded the contractions on the remaining texts using the contractions library. Finally, we applied multiple layers of Python’s capitalize function to ensure that only the first letter of the first word in each text was capitalized. After the preprocessing was completed, we followed the model’s implementation instructions on the authors’ GitHub page and prepared for training.
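The length filters and capitalization step can be sketched as below. This is an approximation of the steps described above, not the paper's code: it assumes the bounds are inclusive and that words are separated by whitespace, it omits the contraction expansion (which relies on a third-party library), and the helper names and sample texts are invented.

```python
def keep_text(text, min_chars=30, max_chars=100, min_words=10, max_words=18):
    """Return True if the text passes both the character and word filters."""
    n_words = len(text.split())
    return (min_chars <= len(text) <= max_chars
            and min_words <= n_words <= max_words)

def normalize_case(text):
    """Capitalize the first character and lowercase the rest."""
    return text.capitalize()

# Invented sample texts, for illustration only.
texts = [
    "Too short to keep.",
    "We went to the market this morning and bought apples, pears, and some bread.",
    "THIS text is long enough in characters and words to pass both of the filters.",
]
kept = [normalize_case(t) for t in texts if keep_text(t)]
```

Filters this strict discard a large share of a small dataset, which is worth keeping in mind when reading the results that follow.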

With the model loaded, we unfroze the last 3 layers of the model and trained on our preprocessed dataset for 10 epochs. Despite ColBERT’s exceptional performance on the paper’s dataset, our transfer-learned version was only able to achieve an F1-score of 0.8385 and an accuracy of 0.7392 on the test data. This result is disappointing since it is significantly lower than our baseline models’ scores.

One reason we believe that there is such a stark difference between the performance of the ColBERT model on the original paper’s dataset and our dataset is a difference in text. The original paper’s dataset sourced most of its 100,000 humorous texts from the r/Jokes and r/CleanJokes subreddits. The jokes on these subreddits typically follow the standard setup then punchline joke structure. In contrast, our dataset had numerous instances of sarcasm or irony-heavy jokes that break this setup/punchline mold. Thus, ColBERT’s latent features might not translate as well to our data. Another potential cause for the performance differences is the size of our dataset. Without preprocessing, the dataset only contains 8,000 texts to train on, a relatively small amount of data in the context of neural network training. Moreover, the preprocessing requirements dropped the amount of data even further, leaving us with only 2,893 texts to train on.

Another indication of the difference in humor between our data and the ColBERT paper’s data is the poor results we got when training our baseline models (the aforementioned naive Bayes and tree-based models) on the paper’s data. In an attempt to improve our scores with a larger training set, we tried training a model on the dataset of 200,000 texts for humor detection that was used in the paper and used this to predict on our data [9]. We used a ComplementNB model with TreebankWordTokenizer and CountVectorizer for preprocessing, and predicted on the training set of 8,000 texts to get an F1-score of 0.8. This was significantly worse than the score we got by training on the competition data. This drastic decrease in score is another example of how the type of humor differs between the ColBERT paper and this competition and why humor classification is such a difficult task.

Models to Rate Humor

If a text is considered humorous, then the next question one might ask is how humorous? The goal of this task is to predict how funny a text is on a scale from 0–5, where a 0 rating means the annotator recognized the text as a joke but did not understand it and 5 indicates the annotator found the text hilarious. Since the final ratings are averages of the annotators’ scores, the 0–5 scale is continuous, making this a regression task.

The best models we found for the previous task were naive Bayes classifiers; however, as this is a regression task, different modeling techniques are needed. As a preprocessing step, we used TreebankWordTokenizer to tokenize and clean the training data. We then used the Doc2Vec algorithm to train 100-dimensional vectors over 30 epochs on the tokenized training data.

Following these preprocessing steps, we implemented two boosting algorithms without any hyperparameter tuning: XGBRegressor from XGBoost and LGBMRegressor from LightGBM. XGBoost slightly outperformed LightGBM in RMSE scores, so it was chosen for more extensive hyperparameter tuning.

To improve the XGBRegressor, a grid search, using 5-fold cross-validation, was performed as seen in Figure 9 below.

Figure 9: Code used for grid search and tuning the XGBRegressor
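The general shape of a grid search with 5-fold cross-validation can be sketched in plain Python as follows. This is not the project's tuning code: the stand-in model (a shrunken mean predictor), its `shrink` parameter, and the toy data are all hypothetical, chosen only so the example runs without extra libraries. In a real pipeline the model would be the XGBRegressor and a library utility such as scikit-learn's GridSearchCV would typically handle the loop.

```python
import itertools
import math

def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def rmse(y_true, y_pred):
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def grid_search(fit_predict, X, y, param_grid, k=5):
    """Return the parameter combination with the lowest mean validation RMSE."""
    best_params, best_score = None, math.inf
    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, val in k_fold_indices(len(y), k):
            preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                                [X[i] for i in val], **params)
            fold_scores.append(rmse([y[i] for i in val], preds))
        score = sum(fold_scores) / len(fold_scores)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in model: predicts a scaled copy of the training mean.
def shrunken_mean(X_train, y_train, X_val, shrink):
    mean = sum(y_train) / len(y_train)
    return [shrink * mean] * len(X_val)

# Toy data, invented for illustration.
X = list(range(20))
y = [2.0 + 0.1 * (i % 3) for i in X]
best_params, best_score = grid_search(
    shrunken_mean, X, y, {"shrink": [0.5, 1.0, 1.5]}
)
```

Each candidate setting is scored on five held-out folds and the averages are compared, so the chosen setting is less likely to reflect the quirks of a single split.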

Hyperparameter tuning slightly improved model performance: the untuned XGBoost model got an RMSE of 0.5842 on the test set, while the tuned model got an RMSE of 0.5796.

On top of tuning the XGBoost model, the Doc2Vec model can also be tuned. For example, the size of the output vectors as well as the number of training epochs can be changed. However, using a 5-fold cross-validation to find the optimal vector size and number of epochs did not help improve the scores on the test set for this problem.

Models to Detect Controversy

This task was to predict if the variance of humor ratings (for the samples that are considered humorous) is greater than the median variance of all humorous samples. Due to the subjective nature of humor and the smaller subset of data we had to work with, this classification task was significantly harder in comparison to humor detection.
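The labeling rule can be sketched as follows: compute the variance of each text's ratings, take the median of those variances, and mark a text as controversial when its variance exceeds the median. The use of population variance and the per-annotator ratings below are assumptions made for illustration; the competition's exact computation may differ.

```python
from statistics import median, pvariance

def controversy_labels(ratings_per_text):
    """Label a text 1 if its rating variance exceeds the median variance."""
    variances = [pvariance(ratings) for ratings in ratings_per_text]
    threshold = median(variances)
    return [1 if v > threshold else 0 for v in variances]

# Hypothetical annotator ratings for three humorous texts.
ratings = [
    [3, 3, 3, 3],  # full agreement, variance 0
    [0, 5, 1, 4],  # strong disagreement, variance 4.25
    [2, 3, 2, 3],  # mild disagreement, variance 0.25
]
labels = controversy_labels(ratings)
```

Only the second text, where the annotators disagree strongly, ends up above the median variance and is labeled controversial.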

Since naive Bayes worked the best for the first classification subtask, we decided to try using similar models to detect controversy. The lowest-performing model we tried used positive/negative sentiment scores from SentiWordNet combined with TreebankWordTokenizer and CountVectorizer in preprocessing. For both ComplementNB and MultinomialNB, this gave an F1-score of ~0.510 on a 75/25 train-test split and ~0.421 on the test data.

Although TF-IDF didn’t perform as well for the humor detection subtask, we found that using it for preprocessing to detect controversy resulted in our best model (albeit, only slightly better than chance). Using TF-IDF and ComplementNB resulted in an accuracy of 0.528 and an F1-score of ~0.459.


Figure 10: Comprehensive performance table

Overall, the MultinomialNB model performed the best when classifying whether a text is humorous or not. This could be due to the smaller size of our training set, which makes it difficult for us to use several deep learning models. For rating humor, we used an XGBRegressor and preprocessed the text data using the Doc2Vec algorithm. Finally, the best performing model for classifying controversy was ComplementNB with TF-IDF.
