Cyberbullying Sentiment Analysis with Word2Vec and One-Against-All Support Vector Machine



INTRODUCTION
Cyberbullying refers to bullying carried out through electronic technology such as smartphones and the internet. Being a victim of cyberbullying increases the risk of low self-esteem [1], and low self-esteem can in turn cause anxiety and depression [2]. These effects are consistent with the statistics reported by Broadband Search on the mental health impacts of cyberbullying, in which depression and social anxiety rank as the top two [3]. Unfortunately, one in three young people across 30 countries has been a victim of cyberbullying [4].
Preventing cyberbullying requires detecting it. Detection can be achieved with Natural Language Processing (NLP) techniques, which focus on the interactions between computers and human (natural) languages in order to process text [5]. One such technique is sentiment analysis, whose ultimate task is emotion identification [6]. In this work, sentiment analysis is implemented with a Word Embedding approach, which represents words in a vector space. The embeddings are obtained with Word2Vec using the Continuous Bag-of-Words (CBoW) model architecture, which takes words as input and generates vectors as output. Word2Vec can also capture semantic relationships between words in a sentence [7], and it therefore plays an important role in performing sentiment analysis.
Cyberbullying is detected by combining the Word2Vec representation with Multi-label Classification. Six classes are used: toxic, severe toxic, obscene, threat, insult, and identity hate. A Support Vector Machine (SVM) model is used for classification because it has been reported to perform better in text processing [8]. The One-Against-All (OAA) strategy is then used to enable Multi-label Classification with the SVM.

A. Pre-processing
Pre-processing is an important step that transforms text into a cleaner form in preparation for the next step. The pre-processing steps include [9]:

B. Word Embedding
Word embeddings are a type of word representation in the form of vectors. This approach is widely used in Information Retrieval (IR) and Natural Language Processing (NLP) because it captures semantic and syntactic information about a word, so that the similarity in meaning between words can be measured [10].

C. Word2Vec
Word2Vec is one of the models used to implement Word Embedding. It takes a collection of texts as input and generates a vector for each word. These vectors can be used to measure the proximity of words in the vector space [11]. The model can therefore search all the representations it has learned and return the closest words [12], as shown in Table I.
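The idea of finding the closest words in the vector space can be sketched in a few lines of Python. This is only an illustration: the 3-dimensional vectors below are invented toy values rather than output of a trained Word2Vec model, and the helper names `cosine_similarity` and `closest_words` are hypothetical.

```python
import math

# Toy word vectors. The values are invented for illustration only;
# a trained Word2Vec model would supply 50- or 100-dimensional vectors.
toy_vectors = {
    "angry":   [0.9, 0.1, 0.0],
    "furious": [0.8, 0.2, 0.1],
    "annoyed": [0.85, 0.05, 0.2],
    "hello":   [0.0, 0.9, 0.3],
    "thanks":  [0.1, 0.8, 0.4],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closest_words(word, vectors, top_n=2):
    """Return the top_n words whose vectors are most similar to `word`."""
    target = vectors[word]
    scores = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:top_n]]

print(closest_words("angry", toy_vectors))  # ['furious', 'annoyed']
```

Cosine similarity is used here as the proximity measure because it is a common choice for word vectors; the source does not state which measure was used.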

D. Continuous Bag-of-Words (CBoW)
CBoW is one of the Word2Vec model architectures used to create word embeddings. Its function is to predict a word from its surrounding words [13]. The network model of CBoW is shown in Fig. 1.
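A minimal sketch of a CBoW forward pass is given below, assuming the usual formulation in which the context word embeddings are averaged and passed through a softmax output layer. The vocabulary, the embedding size of 4, and the randomly initialised weights are assumptions for illustration; the network is untrained, so the probabilities it produces carry no meaning.

```python
import math
import random

# Hypothetical toy vocabulary; not taken from the dataset.
vocab = ["you", "are", "so", "kind", "today"]
word_to_id = {word: idx for idx, word in enumerate(vocab)}
embedding_dim = 4  # kept tiny for readability

random.seed(0)
# Input (embedding) matrix: one row per vocabulary word.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(embedding_dim)]
        for _ in vocab]
# Output matrix: maps the hidden layer back to vocabulary scores.
W_out = [[random.uniform(-0.5, 0.5) for _ in range(len(vocab))]
         for _ in range(embedding_dim)]

def cbow_forward(context_words):
    """Return a probability for each vocabulary word being the centre word."""
    ids = [word_to_id[w] for w in context_words]
    # Hidden layer: average of the context word embeddings.
    hidden = [sum(W_in[i][d] for i in ids) / len(ids)
              for d in range(embedding_dim)]
    # One score per vocabulary word.
    scores = [sum(hidden[d] * W_out[d][j] for d in range(embedding_dim))
              for j in range(len(vocab))]
    # Softmax (shifted by the maximum for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Predict the missing centre word of "you are ___ kind today".
probs = cbow_forward(["you", "are", "kind", "today"])
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
```

Training would adjust `W_in` and `W_out` so that the true centre word receives a high probability; the rows of `W_in` are then used as the word vectors.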

E. Support Vector Machine (SVM)
Support vector machine (SVM) was introduced by Vapnik. The objective of this algorithm is to classify data points by using a hyperplane, or separator function, between classes [14]. This algorithm uses four hyperparameters, including:

• Kernel
This parameter determines the type of hyperplane used to separate the data. The Linear kernel uses a linear hyperplane (a straight line in 2-dimensional space). The Radial Basis Function (RBF) and Polynomial kernels use a non-linear hyperplane. An illustration of the kernels can be seen in Fig. 2 [15].

• Regularization
This parameter affects the margin maximization. The smaller the value, the larger the margin that can be formed; conversely, the larger the value, the smaller the margin [15].

ISSN 2355-0082

• Degree
Degree is a parameter that affects the flexibility of the hyperplane that is formed. The larger the value, the more flexible the decision boundary [16], as shown in Fig. 3.

F. One-Against-All (OAA)
One-Against-All is a strategy that trains one binary classifier per available class, each separating its class from all the others. Each sample thus receives a binary value for every class, indicating whether the class applies to the sample or not. OAA has been reported to achieve higher accuracy than One-Against-One and is more suitable for a relatively small number of labels [17].
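The One-Against-All decomposition can be sketched as follows. To keep the example dependency-free, a simple nearest-centroid rule is swapped in as a stand-in for the binary SVM; it is not the classifier used in this work. The class `CentroidBinaryClassifier`, the helper functions, the three-label subset, and the 2-dimensional toy features are all hypothetical.

```python
def mean(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

class CentroidBinaryClassifier:
    """Stand-in for a binary SVM: predicts the class of the nearer centroid."""

    def fit(self, X, y):
        pos = [x for x, t in zip(X, y) if t == 1]
        neg = [x for x, t in zip(X, y) if t == 0]
        self.pos_centroid = mean(pos) if pos else None
        self.neg_centroid = mean(neg) if neg else None
        return self

    def predict_one(self, x):
        if self.pos_centroid is None:  # label never seen in training
            return 0
        if self.neg_centroid is None:  # label present on every sample
            return 1
        closer_to_pos = sq_dist(x, self.pos_centroid) < sq_dist(x, self.neg_centroid)
        return 1 if closer_to_pos else 0

def train_one_against_all(X, label_sets, labels):
    """Train one binary model per label: that label against all others."""
    models = {}
    for label in labels:
        y = [1 if label in labels_of_sample else 0
             for labels_of_sample in label_sets]
        models[label] = CentroidBinaryClassifier().fit(X, y)
    return models

def predict_labels(models, x):
    """Collect every label whose binary model answers 1."""
    return [label for label, model in models.items()
            if model.predict_one(x) == 1]

# Toy data: 2-dimensional features and a subset of the six labels.
labels = ["toxic", "insult", "threat"]
X = [[1.0, 0.0], [0.9, 0.8], [0.0, 0.1], [0.1, 0.9]]
label_sets = [{"toxic"}, {"toxic", "insult"}, set(), {"insult"}]

models = train_one_against_all(X, label_sets, labels)
print(predict_labels(models, [0.8, 0.9]))  # ['toxic', 'insult']
print(predict_labels(models, [0.0, 0.0]))  # []
```

Because each label is decided by its own model, a sample may receive several labels or none, which is what makes the strategy usable for Multi-label Classification.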

G. Micro Averaged F1 Score
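Micro Averaged F1 Score (Micro-f1) is a method of averaging F1 score values. The Micro-f1 is calculated as follows [18]:

$$\text{Micro-f1} = \frac{2 \cdot P_{micro} \cdot R_{micro}}{P_{micro} + R_{micro}} \tag{1}$$

where $P_{micro}$ is the precision value calculated using the micro averaged approach, formulated as follows:

$$P_{micro} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} (TP_k + FP_k)} \tag{2}$$

and $R_{micro}$ is the recall value calculated using the micro averaged approach, formulated as follows:

$$R_{micro} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} (TP_k + FN_k)} \tag{3}$$

The value of $K$ is the total number of classes available. The value of $TP_k$ is the number of true positives in class $k$. The value of $FP_k$ is the number of false positives in class $k$. The value of $FN_k$ is the number of false negatives in class $k$.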
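As a small worked example of the micro averaged computation, the sketch below pools true positives, false positives, and false negatives over all classes before computing precision, recall, and F1. The function name `micro_f1` and the toy label matrices are assumptions for illustration and are not results from this work.

```python
def micro_f1(y_true, y_pred):
    """Micro averaged F1 for multi-label data given as rows of 0/1 values."""
    tp = fp = fn = 0
    # Pool the counts over every sample and every class.
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Three samples, three classes (rows are samples, columns are classes).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

# Pooled counts: TP = 4, FP = 1, FN = 1, so precision = recall = 0.8.
print(round(micro_f1(y_true, y_pred), 4))  # 0.8
```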

H. Hamming Loss
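Hamming Loss is a metric specifically designed for multi-label learning [19]. It measures the fraction of sample-label pairs that are misclassified. Its value ranges from 0 to 1, or from 0 to 100 when expressed as a percentage.
A smaller value of this metric indicates a better classification model. The calculation is carried out using the following equation [20]:

$$\text{Hamming Loss} = \frac{1}{N K} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}\left[ y_{i,k} \neq \hat{y}_{i,k} \right] \tag{4}$$

where $N$ is the number of samples, $K$ is the total number of classes, $y_{i,k}$ is the true value of class $k$ for sample $i$, and $\hat{y}_{i,k}$ is the corresponding predicted value.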
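The metric can be illustrated with a short sketch that counts mismatched sample-label pairs and divides by the total number of pairs. The function name `hamming_loss` and the toy label matrices are assumptions for illustration only.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of sample-label pairs whose prediction differs from the truth."""
    mismatches = 0
    total_pairs = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total_pairs += 1
            if t != p:
                mismatches += 1
    return mismatches / total_pairs

# Three samples, three classes (rows are samples, columns are classes).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

# Two of the nine sample-label pairs are wrong.
print(round(hamming_loss(y_true, y_pred), 4))  # 0.2222
```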

A. Dataset
The dataset used is from the Toxic Comment Classification Challenge, available on Kaggle [21]. It was collected from Wikipedia pages and focuses on negative behavior in online conversations. The dataset provides around 150,000 records of training data and is labeled with 6 classes, namely toxic, severe toxic, obscene, threat, insult, and identity hate. Fig. 5 shows the first two records in the dataset.

B. System Overview
The dataset file is in CSV format and is retrieved in the first process. After retrieval, the dataset goes through the pre-processing step, which includes:
• Generalization, which is the process of converting text into lowercase and removing punctuation.
• Tokenization, which is the process of breaking a text into its smallest units without losing its meaning.
• Stopwords Removal, which is the process of omitting very common words, such as "the", "a", "an", and "in", to give a more accurate result.
• Lemmatization, which is the process of changing words into their root form. For example, the words "liked", "liking", and "likes" are changed to "like".
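The four pre-processing steps above can be sketched as a small pipeline. This is a dependency-free illustration, not the implementation used in this work: the stop word set and the lemma lookup table are tiny hypothetical stand-ins for the full resources a real NLP library would provide, and all function names are assumptions.

```python
import string

# Hypothetical, deliberately small stop word list.
STOPWORDS = {"the", "a", "an", "in", "is", "are", "you", "your", "so", "not"}

# Hypothetical lemma lookup; a real lemmatizer covers the whole vocabulary.
LEMMAS = {"liked": "like", "liking": "like", "likes": "like", "working": "work"}

def generalize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text):
    """Split the text into word tokens on whitespace."""
    return text.split()

def remove_stopwords(tokens):
    """Drop very common words that carry little meaning."""
    return [token for token in tokens if token not in STOPWORDS]

def lemmatize(tokens):
    """Map each token to its root form when the lookup knows it."""
    return [LEMMAS.get(token, token) for token in tokens]

def preprocess(text):
    """Apply the four steps in order."""
    return lemmatize(remove_stopwords(tokenize(generalize(text))))

print(preprocess("She liked the movie, and he likes it too!"))
# ['she', 'like', 'movie', 'and', 'he', 'like', 'it', 'too']
```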
After pre-processing, the dataset is used to train the Word2Vec model. We use the CBoW model architecture since it is faster and is considered the best approach for words that are not unique, that is, words that occur frequently.
After this process, we generalize the data distribution to avoid overfitting. The generalized data is then prepared for use by the SVM. The prepared data first passes through Hyperparameter Tuning, and the best parameters from this process are used for prediction. Finally, the prediction results are evaluated using Micro Averaged F1 Score and Hamming Loss. Fig. 6 shows the main flowchart of the system.

After pre-processing, each type is used to train the Word2Vec model twice, first with 50 features and then with 100 features. The eight types of Word2Vec models that are generated are as follows.

We divided the data so that 70% is used for learning and 30% for testing. Before training begins, we generalized the data distribution to avoid overfitting. The prepared training data is used to tune the OAA SVM model. The tuning process looks for the best combination of three types of parameters, which are as follows.

After the tuning process, the best parameter configuration was found to be as follows.

Table VII shows the precision, recall, micro average, micro-f1, and Hamming Loss of the model with lemmatized words, stop words, and 100 features. On 650 testing data, the micro-f1 percentage is 81.92% and the Hamming Loss percentage is 16.45%.

After obtaining the Micro-f1 and Hamming Loss values of each model, we chose the best model for prediction using OAA SVM. Fig. 7 shows the prediction process used to evaluate whether the comment "Your brain is not working, you are so idiot!" contains cyberbullying or not.

The result of the pre-processing stage for stop words removal is shown in Fig. 9. Common words such as "your", "is", "not", "you", "are", and "so" are removed in this process.

After the pre-processing stage, data preparation is performed so that the data can be used by the SVM model, as shown in Fig. 11.
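A minimal sketch of such a data preparation step is shown below, assuming that each comment is turned into a single fixed-length vector by averaging the vectors of its words. The function name `average_vector`, the 3-dimensional toy vectors, and the choice to skip out-of-vocabulary words and return a zero vector for empty input are assumptions made for illustration.

```python
def average_vector(tokens, vectors, size):
    """Average the vectors of the known tokens into one vector of length `size`."""
    # Assumption: words without a learned vector are skipped.
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        # Assumption: a comment with no known words maps to the zero vector.
        return [0.0] * size
    return [sum(column) / len(known) for column in zip(*known)]

# Toy 3-dimensional vectors; the real models use 50 or 100 features.
toy_vectors = {
    "brain": [1.0, 0.0, 2.0],
    "work":  [3.0, 2.0, 0.0],
}

features = average_vector(["brain", "work", "unseenword"], toy_vectors, size=3)
print(features)  # [2.0, 1.0, 1.0]
```

Whatever the length of the comment, the result has the same number of features as the word vectors, which is what allows it to be passed to the SVM.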
This preparation averages the vectors of all words in the comment, so that each comment has exactly 100 features.

Fig. 13 shows the prediction result in text form. From this result, it can be concluded that the sentence "Your brain is not working, you are so idiot!" contains cyberbullying in the form of insult, obscene, and toxic.

Based on the research that has been conducted, it can be concluded that the Word2Vec and OAA SVM methods can be used to carry out cyberbullying sentiment analysis. The most optimal model found through hyperparameter tuning uses pre-processed words (lemmatized and without stop words) and 100 features in the Word2Vec model, together with a Regularization value of 1, the RBF Kernel, and a Degree value of 3 in the OAA SVM model. The Micro Averaged F1 and Hamming Loss percentages produced by this tuned model are 83.40% and 15.13%, respectively. Since the prediction model still classifies labels independently, no relationship between labels is taken into account. The final result is also still in the form of a model. Therefore, a classifier that can also capture the relationships between labels, such as Classifier Chains, might be considered in future research.