A Complete End-to-End Machine Learning project to classify a comment into various levels of toxicity.
Table of contents
- Business Problem
- Use of Machine Learning to solve the problem
- Multi-Class Vs Multi-Label Classification
- Dataset Overview
- Evaluation Metrics
- Exploratory Data Analysis
- Data pre-processing
- Multi-Label Classification Techniques
- Comparison of models
- Future work
Business Problem
Toxic Comment Classification is a Kaggle competition held by the Conversation AI team, a research initiative founded by Jigsaw and Google.
On most online conversation platforms, social media users often face abuse, harassment, and insults from other users. As a result, many users stop expressing their ideas and opinions, and platforms struggle to facilitate conversations effectively.
Use of Machine Learning to solve the problem
With the latest advancements in AI and ML, the task is to build an efficient model capable of detecting various levels/categories of toxicity, such as threats, obscenity, insults, and identity-based hate, for a given comment. Such a model can help online conversations and discussions become more productive and respectful.
The problem at hand is not a multi-class classification problem but a multi-label one. In other words, a given comment may belong to none, one, or many levels of toxicity.
Multi-Class Vs Multi-Label Classification
Multi-Class Classification: “pick only one”
As the name suggests, this is a classification task with more than two classes. The basic assumption in multi-class classification is that every data point in the dataset has one and only one class label among all the available class labels. In probability theory, such events are called “mutually exclusive events”: the probability of two or more of them occurring at the same time is zero.
Multi-Label Classification: “pick all applicable”
Multi-label classification is a generalization of multi-class classification. The basic assumption is that every data point in the dataset may have none, one, or many of the available class labels. In other words, there is no constraint on the number of class labels a data point may belong to.
In multi-label classification the target labels are not mutually exclusive events; there is usually some relationship between them.
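To make the difference concrete, here is a minimal sketch of how the two kinds of targets could be encoded. The label names are the six toxicity labels of this dataset; the helper functions and the example comments are hypothetical and only serve as illustration.

```python
# The six toxicity labels used in this project.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def encode_multiclass(label):
    """Multi-class ("pick only one"): a one-hot vector with exactly one 1."""
    return [1 if name == label else 0 for name in LABELS]


def encode_multilabel(labels):
    """Multi-label ("pick all applicable"): any number of 1s, including none."""
    return [1 if name in labels else 0 for name in LABELS]


# Hypothetical examples, not taken from the dataset.
one_hot = encode_multiclass("insult")
multi_hot = encode_multilabel({"toxic", "obscene", "insult"})
clean = encode_multilabel(set())

print(one_hot)    # exactly one position is set
print(multi_hot)  # several positions are set
print(clean)      # no position is set: a non-toxic comment
```

In the multi-class encoding the vector always sums to 1, while in the multi-label encoding it can sum to anything from 0 to 6.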
Dataset Overview
Disclaimer: the dataset contains text that may be considered profane, vulgar, or offensive.
- The dataset is available on the Kaggle competition page.
- The competition provides 3 files, viz. train, test, and sample submission, all in CSV format.
- The train dataset contains a large number of Wikipedia comments and their corresponding class labels encoded in binary format.
- The levels of toxicity, i.e. the class labels, are toxic, severe_toxic, obscene, threat, insult, and identity_hate.
- As in any machine learning classification task, the test dataset contains only the inputs (comments in this case), for which toxicity probabilities need to be predicted with a model trained on the train dataset.
Evaluation Metrics
Metrics play a vital role in any machine learning problem. Metrics for multi-label classification differ from those for single-label classification: metrics such as Accuracy, Precision, and Recall, as defined for single-label classification, cannot be applied directly to multi-label predictions.
However, there are many other metrics that can be used for multi-label classification. The most commonly used are Hamming Loss (HL) and Exact Match Ratio (EMR). In this project, I have also considered the Multi-Class Log Loss.
Why multi-class log loss?
- For any comment, the model predicts 6 toxicity probability scores.
- As per the business problem, misclassifications should be minimal.
- Multi-class log loss penalizes misclassifications, with a heavier penalty the more confident the wrong prediction is, thus improving the model predictions.
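The penalizing behaviour can be sketched with NumPy. The sketch below assumes that the log loss is computed as the binary log loss of each label column, averaged over all columns; that reading, the function name, and the toy probabilities are assumptions, not the project's original code.

```python
import numpy as np


def mean_columnwise_log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss per label column, averaged over all columns."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_entry = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return per_entry.mean(axis=0).mean()


# Two comments, two labels (toy data).
y_true = [[1, 0], [0, 1]]
confident_and_right = [[0.9, 0.1], [0.1, 0.9]]
confident_and_wrong = [[0.1, 0.9], [0.9, 0.1]]

loss_right = mean_columnwise_log_loss(y_true, confident_and_right)
loss_wrong = mean_columnwise_log_loss(y_true, confident_and_wrong)
print(loss_right, loss_wrong)
```

The confidently wrong predictions receive a much larger loss than the confidently right ones, which is the behaviour the bullets above rely on.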
Exact Match Ratio (EMR):
As the name suggests, EMR only counts the data points for which the predicted label set is exactly the same as the actual label set. It is sometimes also referred to as “Subset Accuracy”.
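EMR is defined as:
$$\mathrm{EMR} = \frac{1}{n} \sum_{i=1}^{n} I(Y_i = Z_i)$$
where $I$ is the indicator function,
$n$ is the total number of data points,
$Y_i$ is the set of actual labels of the i-th data point, and
$Z_i$ is the set of predicted labels of the i-th data point.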
EMR ranges from 0 to 1; the higher the value, the better.
The main disadvantage of EMR is that it is very strict: it gives no credit for partially correct predictions. In multi-label classification it is usually not a good idea to ignore partially correct predictions.
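A minimal NumPy sketch of EMR on toy data shows how strict it is. The function name and the label matrices are illustrative assumptions.

```python
import numpy as np


def exact_match_ratio(y_true, y_pred):
    """Fraction of rows whose predicted label vector equals the true one."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.all(y_true == y_pred, axis=1).mean()


# Four comments, three labels (toy data).
y_true = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1]])

emr = exact_match_ratio(y_true, y_pred)
print(emr)  # rows 1 and 2 match exactly, rows 3 and 4 each miss one label
```

Rows 3 and 4 each get two of their three labels right, yet they contribute nothing to the score.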
Hamming Loss (HL):
Hamming loss is defined as the ratio of incorrectly predicted labels to the total number of labels:
$$\mathrm{HL} = \frac{1}{N L} \sum_{i=1}^{N} \sum_{l=1}^{L} X_{i,l} \oplus Y_{i,l}$$
where $\oplus$ denotes the XOR operation,
$N$ is the total number of data points,
$L$ is the total number of class labels,
$X_{i,l}$ is the l-th label of the i-th row in the true labels, and
$Y_{i,l}$ is the l-th label of the i-th row in the predicted labels.
- Hamming Loss ranges from 0 to 1.
- The smaller the value, the better the model predictions.
- Hamming loss overcomes the disadvantage of EMR by taking partially correct predictions into account.
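As a small illustration, the sketch below computes Hamming loss with NumPy on toy label matrices. The function name and the data are assumptions for the example.

```python
import numpy as np


def hamming_loss(y_true, y_pred):
    """Fraction of individual label entries that are predicted incorrectly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # XOR is 1 exactly where the true and predicted entries differ.
    return np.logical_xor(y_true, y_pred).mean()


# Four comments, three labels (toy data); two rows each have one wrong label.
y_true = np.array([[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 0, 0], [1, 0, 0], [0, 1, 1]])

hl = hamming_loss(y_true, y_pred)
print(hl)  # 2 wrong entries out of 12
```

Only half of the rows match exactly, but just 2 of the 12 individual labels are wrong, so the partially correct rows are still rewarded.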
Note: This is just a brief introduction to the evaluation metrics used in my project. There are many more metrics for multi-label classification, such as multi-label variants of precision, recall, and F1 score. For further reading, please check chapter 7 of A Literature Survey on Algorithms for Multi-label Learning.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis is one of the most important steps in any machine learning project. EDA brings out useful insights and summaries from the data, which help in solving the business problem with a logical and strategic approach.
Let us load the train dataset (train.csv) with the help of pandas and bring out some useful insights.
1. Loading Dataset
2. Shape of the dataset
The train dataset contains 159,571 rows and 8 columns.
Of the 8 columns, 6 are class labels, 1 holds the comment, and 1 is an id that uniquely identifies a row.
3. Check for duplicates and missing values
There are no duplicates or missing values in the dataset.
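The first three steps can be sketched with pandas as follows. To keep the sketch self-contained, a three-row in-memory CSV stands in for train.csv; the rows are made up, and the column name comment_text is an assumption about the file layout.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv (made-up rows, assumed column names).
csv_text = """id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
a1,"Thanks for the helpful edit.",0,0,0,0,0,0
a2,"You are an idiot.",1,0,0,0,1,0
a3,"Please cite a source.",0,0,0,0,0,0
"""

# 1. Load the dataset (with the real file: pd.read_csv("train.csv")).
train = pd.read_csv(io.StringIO(csv_text))

# 2. Shape of the dataset.
n_rows, n_cols = train.shape

# 3. Duplicated comments and missing values.
n_duplicates = int(train.duplicated(subset="comment_text").sum())
n_missing = int(train.isnull().sum().sum())

print(n_rows, n_cols, n_duplicates, n_missing)
```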
4. Individual counts of Toxic and Non-Toxic comments in each class-label
Count of comments that are non-toxic (class 0) and toxic (class 1) under each class label individually, i.e. without considering the relationships between labels.
- From the plot we can see that the dataset is highly imbalanced.
5. Count of Toxic & Non-Toxic comments in complete dataset
Count of comments that are completely non-toxic, i.e. class 0 under all the class labels, versus comments that are toxic under at least one class label.
- The dataset contains 89.83% non-toxic comments and 10.17% toxic comments.
6. Count of comments having multiple class-labels
Count of comments that have multiple labels.
- Only 31 comments are toxic under every class label, which is 0.1910% of the toxic comments.
- Out of 16,225 toxic comments, 6,360 have only one target label, which is 39.198% of the toxic comments.
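The counts in steps 4 to 6 all come from summing the label columns in different directions. Here is a minimal pandas sketch on a made-up five-row label table; the variable names are illustrative, and the numbers it prints are unrelated to the real dataset.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Made-up label rows standing in for the label columns of the train set.
toy = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1, 0],
        [1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 1, 0],
    ],
    columns=LABELS,
)

# Step 4: number of toxic (class 1) comments under each label individually.
per_label_counts = toy[LABELS].sum()

# Step 5: completely non-toxic comments vs. comments toxic under any label.
labels_per_comment = toy[LABELS].sum(axis=1)
n_clean = int((labels_per_comment == 0).sum())
n_toxic = int((labels_per_comment > 0).sum())

# Step 6: how many toxic comments carry 1, 2, 3, ... labels.
multi_label_hist = (
    labels_per_comment[labels_per_comment > 0].value_counts().sort_index()
)

print(per_label_counts, n_clean, n_toxic, multi_label_hist, sep="\n")
```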
7. Venn Diagrams on class-labels
Now let’s plot Venn diagrams to understand the relationships between the class labels.
Note: I am showing only the Venn diagrams for combinations of 2 class labels, from which conclusions are easy to draw. Please check out my repository for the remaining diagrams.
- All Severe_Toxic class comments are also Toxic class comments.
- Most of the Obscene class and Insult class comments are also Toxic class comments.
- In most of the combinations, the Toxic, Insult, and Obscene classes dominate.
8. Correlation between the class-labels
Let us plot the correlation map between the class labels. Let us also investigate whether characters such as ! ? @ # * & help in classification.
- From the correlation map, we can see that toxic is strongly correlated with obscene, insult, and severe_toxic.
- Obscene is even more strongly correlated with the insult and toxic target classes.
- The count of upper-case words is correlated with the counts of dots, new lines, and exclamation marks and with the comment length, all of which are commonly seen in toxic comments. Among the 6 classes, it is slightly more correlated with severe_toxic.
- I didn't find any strong correlation between the newly added features.
9. PDF of no: of words in each class label
So far, we haven’t drawn any conclusions with respect to the input, i.e. the comments. Let us investigate whether the length of the comments helps in solving the problem.
Let us draw the PDF of the number of words for each class label.
- Unfortunately, the length (number of words) of a comment doesn’t help to separate any of the class labels.
10. Word Cloud for Toxic labels in each class
Word Cloud for the most common words in toxic comments per each class label.
Data pre-processing
Data pre-processing is the next important step in solving any machine learning problem. Before the actual training starts, the raw data should be processed, i.e. cleaned and transformed into vectors.
In this project, the input to the machine learning model is a comment, which is text. A comment can contain capital letters, punctuation marks, HTML tags, links, etc. I have used regular expressions (re) to remove all unnecessary characters and sub-text from the comments.
Below are some of the pre-processing functions used to clean the comments, along with examples on random comments that illustrate how the functions work.
1. Convert the comment to lower case
2. Remove HTML tags from the comment
3. Remove links from the comment
4. Remove punctuation marks from the comment
5. Remove digits from the comment
6. Remove “\n” , “\t” , “/” and “\” from the comment
7. Remove repeated characters (more than 2 times) from the comment
Stretching words with repeated characters is very common in abusive online conversations, and because of it existing systems fail to detect toxic words.
8. Remove words with length less than or equal to 2
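Steps 1 to 8 can be sketched with the standard re module as below. The regular expressions, the order of the steps, and the choice to collapse a run of three or more identical characters into a single character are assumptions made for illustration; they are not the exact patterns used in the project.

```python
import re
import string

# Translation table that turns every ASCII punctuation mark into a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_comment(text):
    text = text.lower()                                   # 1. lower case
    text = re.sub(r"<[^>]+>", " ", text)                  # 2. HTML tags
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)  # 3. links
    text = re.sub(r"[\n\t/\\]", " ", text)                # 6. \n, \t, / and \
    text = text.translate(_PUNCT_TABLE)                   # 4. punctuation marks
    text = re.sub(r"\d+", " ", text)                      # 5. digits
    text = re.sub(r"(.)\1{2,}", r"\1", text)              # 7. 3+ repeats -> 1
    words = [w for w in text.split() if len(w) > 2]       # 8. drop short words
    return " ".join(words)


raw = "<b>STUUUPID</b> idea!!! See http://example.com/page 123 times\nok"
cleaned = clean_comment(raw)
print(cleaned)
```

Links are removed before punctuation on purpose: once the punctuation is gone, a URL can no longer be recognised as one.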
9. Expand contractions
Contractions are very common in any text. The code below converts contracted words to their full form. For example, “I’ll do it” is transformed into “I will do it”. To achieve this, I used a library called “contractions”.
While performing de-contraction, I faced errors caused by words from other languages present in the comments. The solution I found was to convert the offending word to lowercase and retry the de-contraction, which worked. The code below does the same thing.
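To show the idea without depending on an external package, the sketch below uses a small hand-written dictionary as a stand-in for the “contractions” library. The dictionary entries and the function name are hypothetical, and lower-casing everything up front is a simplification of the retry-in-lowercase workaround described above.

```python
import re

# Hypothetical mini-dictionary standing in for the "contractions" library.
CONTRACTIONS = {
    "i'll": "i will",
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "you're": "you are",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def expand_contractions(text):
    # Lower-case first and normalise curly apostrophes to straight ones.
    text = text.lower().replace("’", "'")
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(1)], text)


expanded = expand_contractions("I’ll do it, don't worry")
print(expanded)
```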
10. Chunking of named entities
In online conversations, users sometimes mention the names of people, cities, etc. “Chunking” is an NLP technique that segments a text and labels the segments based on parts of speech, entities, and context. In other words, it builds on the part-of-speech tags of a text to identify phrases such as named entities.
As an experiment, I discarded all person names and kept only geographical names, which are needed to detect and classify location-based or racial discrimination.
11. Remove Stop words
Stop words are words that occur most frequently in the dataset and carry little importance in classification tasks. For example, words like “the, i, he, she, we, our, his, her, it, a, an, in, of, one, two, three, four”, etc.
The NLTK and WordCloud libraries offer different sets of stop words. I took the union of the stop words provided by these libraries and added some stop words specific to this case study.
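The union-of-lists idea can be sketched with plain Python sets. The three sets below are small hypothetical stand-ins: the first two imitate the lists shipped with NLTK and WordCloud, and the custom words are invented for the example.

```python
# Hypothetical excerpts standing in for the NLTK and WordCloud stop-word lists.
NLTK_LIKE = {"the", "i", "he", "she", "we", "our", "his", "her", "it", "a", "an", "in", "of"}
WORDCLOUD_LIKE = {"the", "a", "an", "one", "two", "three", "four"}

# Hypothetical case-specific additions.
CUSTOM = {"wikipedia", "article", "page"}

# Final stop-word set: union of all sources.
STOP_WORDS = NLTK_LIKE | WORDCLOUD_LIKE | CUSTOM


def remove_stop_words(text):
    """Drop every whitespace-separated token that is a stop word."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


filtered = remove_stop_words("the editor of the page is a vandal")
print(filtered)
```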
12. Lemmatization
Lemmatization is the process of grouping the different inflected forms of a word under a single base form (the lemma), using the context and morphological analysis of the word.
Stemming and lemmatization are similar, but there is a difference between them. Stemming returns the “stem” of the word without considering context, while lemmatization returns the base word by taking parts of speech into consideration.
Good → Better → Best: all are based on the word “Good”.
Note: In this project I have used only lemmatization.
The code below defines a one-step pre-processing function that performs all the operations discussed above on the complete dataset and returns the preprocessed text.
- Now, with these cleaned comments, we can proceed to convert the text into a vector representation. One more point needs care: data leakage. In order to build a robust model, data leakage should be avoided.
- It is always good practice to split the dataset into train and test sets with stratification. sklearn’s train_test_split function is not designed for stratifying multi-label targets.
- We need to install the “scikit-multilearn” library (pip install scikit-multilearn) to solve multi-label classification tasks.
- The “scikit-multilearn” library provides the iterative_train_test_split function to split the dataset into train and test sets with stratification.
There are various techniques to convert text into a vector representation. In this project I have used TF-IDF weighted average W2V with glove.6B.300d (pre-trained on Wikipedia + Gigaword). The code below loads the words and their corresponding vectors from GloVe.
To calculate the TF-IDF weights, we need to use the TF-IDF Vectorizer.
The code below computes the TF-IDF weighted average W2V using the above w2v_dict, w2v_words, tfidf_dict, and tfidf_words.
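As a rough illustration of that computation, here is a minimal sketch on toy data. It reuses the names w2v_dict, w2v_words, tfidf_dict, and tfidf_words, but fills them with made-up 3-dimensional vectors and IDF values instead of the GloVe vectors and the fitted vectorizer, so treat it as an assumption-laden stand-in rather than the original code.

```python
import numpy as np

# Toy stand-ins: word -> vector (instead of glove.6B.300d) and word -> IDF.
w2v_dict = {
    "you": np.array([1.0, 0.0, 0.0]),
    "idiot": np.array([0.0, 1.0, 0.0]),
    "thanks": np.array([0.0, 0.0, 1.0]),
}
w2v_words = set(w2v_dict)
tfidf_dict = {"you": 1.0, "idiot": 3.0, "thanks": 2.0}
tfidf_words = set(tfidf_dict)


def tfidf_weighted_w2v(sentence, dim=3):
    """Average of the word vectors, weighted by TF * IDF of each word."""
    words = sentence.split()
    vector = np.zeros(dim)
    weight_sum = 0.0
    for word in words:
        if word in w2v_words and word in tfidf_words:
            tf = words.count(word) / len(words)  # term frequency in the sentence
            weight = tf * tfidf_dict[word]
            vector += weight * w2v_dict[word]
            weight_sum += weight
    if weight_sum != 0:
        vector /= weight_sum
    return vector


sentence_vector = tfidf_weighted_w2v("you idiot")
unknown_vector = tfidf_weighted_w2v("zzz qqq")
print(sentence_vector, unknown_vector)
```

Rare, informative words (here “idiot”, with the larger IDF) pull the sentence vector towards their own embedding, while words missing from either dictionary are simply skipped.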
It is always good practice to save the data to disk.
We have now completed the pre-processing of the data, i.e. cleaned the raw data and transformed it into vector representations. The actual training of machine learning models starts now.
Multi-Label Classification Techniques
Solving a multi-label classification problem is not straightforward. Most standard classifiers do not handle multi-label targets out of the box, but we can leverage single-label classification algorithms to solve a multi-label classification problem.
In this section, let’s discuss different approaches to solve a multi-label classification problem, namely:
1) Problem Transformation
2) Ensemble
3) Adapted Algorithm
4) Neural Networks (MLP)
1) Problem Transformation
In this approach, we transform the multi-label classification task into one or more single-label classification tasks using different strategies.
There are 3 different strategies to convert a multi-label problem into single-label problems:
1) Binary Relevance
2) Classifier Chains
3) Label Powerset
The scikit-multilearn library implements all of the above strategies. Each of them needs a base classifier to train, which can be any sklearn classifier object, for example GaussianNB(), LogisticRegression(), or DecisionTreeClassifier().
1) Binary Relevance (BR)
Binary Relevance is the simplest approach to solving a multi-label classification task. BR converts a multi-label classification task with n labels into n separate binary classification tasks.
In the figure below, a 4-label classification task is transformed into 4 different binary classification tasks. The prediction for the multi-label task is the union of the predictions of all the individual binary classifiers.
The main disadvantage of this approach is that it ignores the correlation between the class labels. In other words, it treats each label as independent of the others, which is not true in our case.
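The mechanics of BR can be sketched by hand with scikit-learn: one independent binary classifier per label column. The synthetic data and the helper names below are assumptions made for the example, and this hand-rolled loop only stands in for the scikit-multilearn implementation used in the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 points, 5 features, 3 binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack(
    [X[:, 0] > 0, X[:, 1] > 0, X[:, 0] + X[:, 2] > 0]
).astype(int)


def fit_binary_relevance(X, Y):
    """Train one independent binary classifier per label column."""
    return [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]


def predict_binary_relevance(models, X):
    """Stack the per-label predictions back into one label matrix."""
    return np.column_stack([model.predict(X) for model in models])


br_models = fit_binary_relevance(X[:150], Y[:150])
Y_pred = predict_binary_relevance(br_models, X[150:])
print(Y_pred[:5])
```

No classifier ever sees another label's column, which is exactly why BR cannot exploit correlations between labels.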
Let us implement and check the metrics with BR on our dataset.
- Logistic regression and XGBoost give similar results with small variations.
- SVM is skipped as it is computationally expensive on large datasets like ours.
2) Classifier Chains (CC)
Classifier Chains is another simple technique. Unlike Binary Relevance, a Classifier Chain preserves the relationships between the labels. CC also transforms a multi-label classification task with n labels into n single-label classification tasks.
A classifier chain works as follows:
Classifier 1 takes all the inputs and is fit on the first target label alone. Classifier 2 takes all the inputs together with the first target label and is fit on the 2nd label. Classifier 3 takes all the inputs together with the first and second target labels and is fit on the 3rd target label, and so on.
To generalize: the first classifier is trained on the input data alone, and each subsequent classifier is trained on the input space plus the labels handled by all the previous classifiers in the chain.
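The chaining described above can be sketched by hand with scikit-learn. In this sketch the true labels are appended as features during training and the predicted labels during prediction; the synthetic data and helper names are assumptions, and the code is a stand-in for the scikit-multilearn implementation used in the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data where the third label depends on the first two.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y0 = (X[:, 0] > 0).astype(int)
y1 = (X[:, 1] > 0).astype(int)
y2 = y0 & y1
Y = np.column_stack([y0, y1, y2])


def fit_classifier_chain(X, Y):
    """Link j is trained on the inputs plus the true labels 0..j-1."""
    models = []
    X_ext = X
    for j in range(Y.shape[1]):
        models.append(LogisticRegression().fit(X_ext, Y[:, j]))
        X_ext = np.column_stack([X_ext, Y[:, j]])
    return models


def predict_classifier_chain(models, X):
    """At prediction time each link receives the earlier links' predictions."""
    X_ext = X
    predictions = []
    for model in models:
        pred = model.predict(X_ext)
        predictions.append(pred)
        X_ext = np.column_stack([X_ext, pred])
    return np.column_stack(predictions)


cc_models = fit_classifier_chain(X[:150], Y[:150])
Y_pred = predict_classifier_chain(cc_models, X[150:])
print(Y_pred[:5])
```

The feature space grows by one column per link, so the third classifier here works with 7 inputs instead of 5.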
Training CC is the same as training BR; only the function name needs to change.
- Compared with BR, the metrics improved slightly with CC. CC may be preserving the correlation between the labels, which helps it perform well on our dataset.
- With the Classifier Chain strategy too, the results of the Logistic Regression and XGBoost classifiers are high and very close to each other.
3) Label Powerset (LP)
Label Powerset is an elegant way of transforming a multi-label problem into a multi-class classification problem. LP also preserves the correlation between the labels.
Unlike BR and CC, we don’t need to train n separate binary classifiers. It is enough to train one multi-class classifier after a transformation of the dataset. LP works as follows:
Every unique combination of labels in the dataset is mapped to its own class. For example, in the figure below, (1,1,1,1) is mapped to class 1 for all the instances in the dataset, and (1,0,0,1) is mapped to class 2. Similarly, every other unique label combination is mapped to its corresponding class throughout the dataset.
The only disadvantage is that, as the number of target labels increases, the number of possible label combinations to be mapped increases exponentially.
2 labels allow up to 4 unique combinations
3 labels allow up to 8 unique combinations
4 labels allow up to 16 unique combinations
n labels allow up to 2^n unique combinations
Training such a large multi-class classification problem becomes much more complex, and the predictions tend to be less accurate.
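The mapping step of LP needs nothing more than a dictionary. The sketch below is a pure-Python illustration with hypothetical helper names; class ids start at 1 to match the example above, and any multi-class classifier could then be trained on the resulting ids.

```python
def label_powerset_encode(Y):
    """Map every unique label combination to its own class id (1, 2, ...)."""
    combo_to_class = {}
    classes = []
    for row in Y:
        combo = tuple(row)
        if combo not in combo_to_class:
            combo_to_class[combo] = len(combo_to_class) + 1
        classes.append(combo_to_class[combo])
    return classes, combo_to_class


def label_powerset_decode(classes, combo_to_class):
    """Turn predicted class ids back into label vectors."""
    class_to_combo = {cls: combo for combo, cls in combo_to_class.items()}
    return [list(class_to_combo[cls]) for cls in classes]


Y = [[1, 1, 1, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 0]]
classes, mapping = label_powerset_encode(Y)
max_classes = 2 ** len(Y[0])  # upper bound on the number of classes

print(classes, len(mapping), max_classes)
```

Only the combinations that actually occur in the data become classes, so the real class count is usually far below the 2^n upper bound.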
- LP is a powerful problem transformation technique for multi-label classification problems.
- LP’s results are slightly ahead of BR’s and CC’s, but very close to both. Our dataset contains 41 unique label combinations, and because of this complexity LP’s predictions are not much better than BR’s and CC’s.
- The results of Logistic Regression and XGBoost are again high and close to each other.
2) Ensemble
In machine learning, the core idea of ensemble methods is to obtain better predictive performance by combining multiple algorithms than any single algorithm could achieve alone. There are different types of ensemble techniques, such as Bagging, Boosting, Stacking, and Cascading.
In this project, I have experimented with custom stacking. Below are the steps to implement it.
Steps to implement custom stacking
1) Split the training dataset randomly into two equal parts, say D1 and D2, with stratification.
2) Train the m base classifiers of the stacking model (different base classifiers with LP) on m different sample sets drawn with replacement from D1.
3) After training the m base classifiers, use them to predict the outputs for D2.
4) Create a dataset, say D3, from the predictions of the m base classifiers and the corresponding target labels from D2. D3 is the training dataset for the meta-classifier.
5) Train the meta-classifier of the stacking model on D3.
6) To evaluate the final output, create a dataset D4 from the predictions of the trained m base classifiers on the testing dataset and the corresponding y-labels.
7) Calculate the final metrics using D4.
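The seven steps can be sketched with scikit-learn on synthetic data. Several simplifications are assumptions of this sketch and not the project's setup: the target is a synthetic 3-class stand-in for Label Powerset classes, the split into D1 and D2 is a plain random split rather than a stratified one, all base classifiers are shallow decision trees, and the meta-classifier is a logistic regression fed with the raw predicted class ids.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(600, 6))
# Synthetic 3-class target standing in for Label Powerset classes.
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

# Step 1: split the training set into two equal halves D1 and D2.
indices = rng.permutation(len(X_train))
d1, d2 = indices[:200], indices[200:]

# Step 2: train m base classifiers on bootstrap samples drawn from D1.
m = 5
base_models = []
for k in range(m):
    sample = rng.choice(d1, size=len(d1), replace=True)
    tree = DecisionTreeClassifier(max_depth=3, random_state=k)
    base_models.append(tree.fit(X_train[sample], y_train[sample]))


def base_predictions(features):
    """One column of predicted class ids per base classifier."""
    return np.column_stack([model.predict(features) for model in base_models])


# Steps 3 and 4: predictions on D2 become the meta-classifier's training set D3.
D3 = base_predictions(X_train[d2])

# Step 5: train the meta-classifier on D3 and the true targets of D2.
meta = LogisticRegression(max_iter=1000).fit(D3, y_train[d2])

# Steps 6 and 7: build D4 from the test set and compute a final metric.
D4 = base_predictions(X_test)
stack_accuracy = (meta.predict(D4) == y_test).mean()
print(stack_accuracy)
```

The number of base classifiers m and the size of each bootstrap sample are the two knobs this sketch exposes for tuning.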
To obtain optimal results from stacking, it is recommended to tune the hyperparameters of the base classifiers.
I treated the number of base classifiers and the size of the training sample sets as hyperparameters and experimented with various combinations. Unfortunately, custom stacking didn’t give me considerable improvements in the metrics.
Note: Please check out my repository for the code.
3) Adapted Algorithm
As the name suggests, a few single-label classification algorithms have been adapted to solve multi-label classification directly by changing their loss or decision functions.
Currently the skmultilearn.adapt module implements a few adapted algorithms, such as MLkNN and MLTSVM.
Note: I haven’t experimented with adapted algorithms, as their training time complexity is very high.
4) Neural Networks (MLP)
- Neural Networks (NN) can natively solve multi-label classification problems.
- For multi-class classification, we generally choose categorical cross-entropy as the loss function and softmax as the activation in the output layer.
- For multi-label classification, we just need to choose binary cross-entropy as the loss function and sigmoid activations in the output layer.
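Why the output activation matters can be seen on a single vector of six output-layer scores. The NumPy sketch below uses made-up logits and a made-up target; it only illustrates the two activations and the binary cross-entropy loss, not the actual Keras model.

```python
import numpy as np


def softmax(z):
    """Multi-class activation: outputs compete and sum to 1."""
    exp = np.exp(z - np.max(z))
    return exp / exp.sum()


def sigmoid(z):
    """Multi-label activation: every output is an independent probability."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all output units."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))


# Made-up output-layer scores for the six toxicity labels of one comment.
logits = np.array([2.0, -1.0, 1.5, -3.0, 0.5, -2.0])
y_true = np.array([1, 0, 1, 0, 1, 0])

softmax_probs = softmax(logits)
sigmoid_probs = sigmoid(logits)
predicted_labels = (sigmoid_probs >= 0.5).astype(int)
loss = binary_cross_entropy(y_true, sigmoid_probs)

print(softmax_probs.sum(), sigmoid_probs.sum(), predicted_labels, loss)
```

With softmax the six probabilities are forced to share a total of 1, so several labels can never all be likely at once; with sigmoid each label is thresholded on its own.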
I created a simple MLP with 4 hidden layers, 1 dropout layer, and 1 batch normalization layer. I used the Adam optimizer and binary cross-entropy as the loss. Not all of the required metrics are directly available in TensorFlow and Keras, so I used a custom callback to calculate the metrics after every epoch and save the model whenever the metrics improved over the previous epoch.
Compiling and Fitting the model.
- A slight improvement in EMR, average AUC, and Hamming Loss was observed with the MLP.
- There is no improvement in log loss.
Comparison of models
I have trained various models with different problem transformation techniques, stacking, and an MLP. The table below compares the models’ performance on the test dataset. The MLP model is slightly ahead of the others and stands out as the best.
Kaggle Submission & Scores
I achieved a private score of 0.93639 and a public score of 0.93548 with the best trained model.
Future work
- Deep learning architectures such as LSTMs and Transformers (for example, BERT) can be used to solve the same problem.
- The MEKA project can also be used to solve this problem.
- Clustering, an unsupervised learning approach, can also be tried. The skmultilearn.cluster module provides various clustering algorithms for multi-label classification.
Git-hub repository link
Have a look at the code and the simulations of this project on Github.
Connect with me
Vijay Gadepalli - Senior Consultant - LTI - Larsen & Toubro Infotech | LinkedIn
I hope you learnt something new from this post…! Thanks for reading…!