Sentiment Analysis of Cyberbullying in Instagram Comments with the XGBoost Method

Abstract: Technological developments have made social media widely used by the general public, which brings negative impacts, one of which is cyberbullying. Cyberbullying is the act of insulting or humiliating another person on social media. A system that can detect cyberbullying is needed because the amount of information circulating on social media is too large for humans to review manually. One suitable method for this problem is Extreme Gradient Boosting (XGBoost). XGBoost was chosen because it can run 10 times faster than other Gradient Boosting methods. Sentences are converted into vectors using the TF-IDF method, which is known as a simple but effective algorithm for weighting the words in a document. XGBoost accepts the vectors obtained from the TF-IDF process as input. In this research, there are 1452 comments, which are split into training data and testing data. Using the XGBoost and TF-IDF methods, the accuracy is 75.20%, precision is 71%, recall is 87%, and F1-score is 78%.


I. INTRODUCTION
Social media is an online medium in which users can easily participate in, share, and create content for blogs, social networks, wikis, forums, and virtual worlds [1]. Social media can be a place for positive or negative interactions. One negative example is cyberbullying: intentionally sending electronic text messages to other users with the aim of harassing, threatening, or disturbing them [2].
According to Ditch The Label, an anti-bullying charity organization, Instagram is a social media platform that is often used to carry out cyberbullying. This is based on a survey of 10,020 teenagers from England aged 12 to 20, in which 42 percent of respondents claimed to have been victims of cyberbullying on Instagram [3]. Polda Metro Jaya stated that at least 25 cyberbullying cases are reported every day. In 2018 the Indonesian Child Protection Commission stated that the share of children who were victims of bullying reached 22.4%. This high figure was attributed to heavy internet use among children [4].
Previous research conducted in 2019 found that classifying cyberbullying comments with a Support Vector Machine achieved an accuracy of 76.77% [5]. That study suggested that further research on classifying cyberbullying comments could use other classification methods. One such method is XGBoost. The XGBoost algorithm has been reported to have higher accuracy and performance than the Support Vector Machine algorithm [6]. XGBoost can outperform the Support Vector Machine because it is a tree ensemble method that has been optimized at both the system and the algorithm level [7].
A similar study in 2020 implemented XGBoost for sentiment analysis on Facebook and obtained 74.8% accuracy, 50% precision, 48% recall, and a 49% F1-score [8]. In addition, XGBoost with TF-IDF feature extraction has slightly better accuracy and precision than XGBoost with Count-Vectorization feature extraction [9].

A. Cyberbullying
Cyberbullying is a form of violence perpetrated by groups or individuals using electronic media [10]. It takes the form of mocking, insulting, intimidating, or humiliating. Examples of cyberbullying behavior include threatening someone via e-mail, insulting them in the comments section of social media, and posting embarrassing photos of them [11].

B. Text Classification
Text classification is the process of labeling documents based on their contents. It can be done in two ways: manually or automatically. Manual classification takes a lot of time and money, but it gives more accurate results because linguists interpret and categorize the texts themselves. Automatic text classification applies machine learning and natural language processing, making it faster and more cost-effective. In general, automatic text classification is divided into three groups: rule-based, machine learning-based, and hybrid [12].
The first step in machine learning-based text classification is feature extraction, which represents text as numeric vectors. The system then uses these numeric vectors to learn to assign the correct label [13].

C. Sentiment Analysis
Sentiment analysis is the process of determining a person's emotions or opinions as expressed in text, which can be divided into positive and negative emotions [14]. Sentiment analysis draws on natural language processing, computational linguistics, and text mining. It aims to analyze the views, sentiments, evaluations, attitudes, judgments, and emotions of speakers or writers regarding particular topics, products, services, organizations, individuals, or activities [15].

D. Preprocessing
Preprocessing is a data mining technique that transforms raw data into a format that computers can process easily. The data preprocessing step is needed to solve various types of problems, including data noise, data redundancy, and missing data [16]. The stages of preprocessing can be seen in Fig. 1.

Fig. 1. Preprocessing flow
• Case folding is the process of converting all letters in a document or sentence to lowercase. Case folding is used to make searching easier [17].
• Tokenizing is the process of splitting text into its constituent words, with the aim of eliminating all punctuation and symbols that are not letters [3].
• Filtering is the process of removing words that are irrelevant to the sentiment when they stand alone [18].
• Stemming is the process of removing affixes and leaving only the base words [19].
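The four stages in Fig. 1 can be sketched as a small pipeline. This is a minimal illustration only: the stop-word list and the suffix-stripping rule are hypothetical English stand-ins, not the resources used in this study, and a real system would rely on a language-specific stop list and stemmer.

```python
import re

# Hypothetical stop-word list and suffix list, for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "and", "you", "are"}
SUFFIXES = ("ing", "ed", "s")


def case_fold(text):
    """Convert every letter to lowercase."""
    return text.lower()


def tokenize(text):
    """Split case-folded text into words, dropping punctuation, digits, and symbols."""
    return re.findall(r"[a-z]+", text)


def filter_tokens(tokens):
    """Remove stop words that carry no sentiment on their own."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip one known suffix, keeping at least three characters (toy rule)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run case folding, tokenizing, filtering, and stemming in order."""
    tokens = filter_tokens(tokenize(case_fold(text)))
    return [stem(t) for t in tokens]


print(preprocess("You ARE so Annoying!!! Stop posting."))
# ['so', 'annoy', 'stop', 'post']
```

The order matters in this sketch: case folding runs first so that the tokenizer and the stop-word lookup only have to handle lowercase letters.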

E. Term Frequency-Inverse Document Frequency (TF-IDF)
Term Frequency-Inverse Document Frequency (TF-IDF) is the process of assigning a weight to each word in a document. The TF-IDF method ranks words based on how often they appear [3]. Term Frequency emphasizes words that appear often in a document, while Inverse Document Frequency down-weights words that appear in many documents and are therefore considered unimportant, general words [19].
In the TF-IDF weighting, TF(t) is the Term Frequency value of term t, IDF(t) is the Inverse Document Frequency value of term t, and W(t) is the resulting weight of the term.
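As a worked illustration of the weighting, the sketch below computes W(t) for a toy corpus. It assumes the common textbook form W(t) = TF(t) × IDF(t) with IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The exact variant used in this study is not recoverable from the text, and library implementations often apply smoothing or normalization on top of this form.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: W(t) = TF(t) * IDF(t), with TF(t) the raw count of t
    in the document and IDF(t) = log(N / df(t)).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in documents
    ]


# Toy corpus of already preprocessed comments (invented for illustration)
docs = [["you", "ugly"], ["you", "nice"], ["nice", "photo", "nice"]]
weights = tf_idf(docs)
```

In this toy corpus "ugly" appears in only one of three documents, so it receives a larger weight than "you", which appears in two.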

F. Extreme Gradient Boosting
XGBoost is a variant of boosting. It can run 10 times faster than other Gradient Boosting implementations, so many researchers have used it for classification and regression in cases such as sales prediction, customer behavior prediction, ad prediction, and web text prediction [20].
Boosting is an ensemble technique in which new models are added to correct the mistakes made by previous models. Models are added sequentially until there is no further improvement. The technique uses a tree ensemble model, which is a collection of classification and regression trees, and sums the predictions of several trees into one [21]. Each predictor is taken in turn and fitted to the residual error of the previous model. When the dataset is entered, an initial model is first created from the selected dataset. The initial prediction value and the residual error of the initial model are then obtained using Equations 4 and 5. Equation 4 is used to make the initial model, while Equation 5 is used to make subsequent models.

ISSN 2355-0082
Here $h_0(x)$ is the initial predictive value of the first model and Y is the residual error of the initial model. The second model is then fitted to the residual error of the initial model to obtain the predictive value of the second model. The third model is fitted to the residual error remaining after the initial and second models to obtain the predictive value of the third model. This process repeats as many times as the value set for n_estimators. This algorithm is called Gradient Boosting, and it aims to minimize the error each time a new model is created [22].
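The residual-fitting loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure of Equations 4 and 5: it uses a squared-error loss, a single numeric feature, one-split regression stumps as the weak models, and a `learning_rate` parameter that the text does not mention.

```python
def fit_stump(xs, residuals):
    """Find the one-split threshold on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators=10, learning_rate=0.5):
    """Boost regression stumps on residuals (assumes at least two distinct xs)."""
    initial = sum(ys) / len(ys)  # initial model: a constant prediction
    predictions = [initial] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)  # new model fits the remaining error
        stumps.append(stump)
        predictions = [
            p + learning_rate * stump(x) for p, x in zip(predictions, xs)
        ]

    def predict(x):
        return initial + learning_rate * sum(s(x) for s in stumps)

    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data, invented for illustration
toy_x = [1, 2, 3, 4, 5, 6]
toy_y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(toy_x, toy_y, n_estimators=10)
```

On the toy data the training error shrinks as more stumps are added, which is the behavior the sequential correction is meant to produce.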
Just like boosting, XGBoost creates a set of decision trees in which each model depends on the previous one. The first model in XGBoost is a weak learner that initializes the predicted value; the weights are then updated with each new model so that the ensemble produces a strong prediction. The predicted values of the models are summed and entered into Equation 6 to minimize the objective function [23].
Here n is the number of models to be used, $l$ is a function that measures the difference between the target $y$ and the prediction $\hat{y}$, and $f_t(x)$ is the new model being built, while $\Omega$ is a function that helps the model avoid overfitting [23]. Equation 6 is used to compute the overall objective value.
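For reference, the regularized objective minimized at boosting round t is usually written for XGBoost in the form below. This is a sketch of the standard formulation and may differ in indexing from Equation 6: here the sum runs over the N training examples, and the symbols N, T (number of leaves in the tree), w (vector of leaf weights), and γ and λ (regularization strengths) are assumptions of this sketch rather than notation taken from the text.

```latex
\mathcal{L}^{(t)} = \sum_{i=1}^{N} l\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda\,\lVert w \rVert^{2}
```

The first term rewards a new tree $f_t$ that reduces the loss of the running prediction $\hat{y}^{(t-1)}$, while $\Omega$ penalizes trees with many leaves or large leaf weights.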

G. Performance Evaluation
The method used to evaluate the performance of the model is the confusion matrix. Each component of the confusion matrix shows the number of predictions that the model classified correctly or incorrectly [24]. Fig. 2 is an example of a confusion matrix for binary classification.

Fig. 2. Confusion matrix for binary classification [24]

Fig. 2 has four main components: TP (True Positive) is the number of positive data predicted correctly, TN (True Negative) is the number of negative data predicted correctly, FP (False Positive) is the number of negative data predicted incorrectly, and FN (False Negative) is the number of positive data predicted incorrectly. From these four components we can compute Accuracy, Precision, Recall, and F1-score. Equation 7 is the formula for Accuracy, which is the proportion of all predictions that the system makes correctly.
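The four metrics follow directly from the confusion matrix components. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard metrics from binary confusion matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0 so that no division by zero occurs.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, invented for illustration
metrics = classification_metrics(tp=40, tn=30, fp=10, fn=20)
```

Because F1 is the harmonic mean of precision and recall, the reported precision of 71% and recall of 87% give 2 × 0.71 × 0.87 / (0.71 + 0.87) ≈ 0.78, which agrees with the reported F1-score of 78%.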

H. Social Media
Social media is a medium for online interaction that allows people to communicate with each other without being limited by space and time [25]. Users can share information with each other through text, images, video, and audio. Social media can also be used to build a public profile so that the user becomes better known to others [26].

I. Instagram
Instagram is an application for sharing photos and videos. Users can share their moments with the public and comment on photo or video posts from other users. Instagram has become so popular that its users use it as a means of building a public profile [26]. Fig. 3 shows the result of a survey conducted by the Global Web Index, in which Instagram ranks 3rd, used by 86.6% of users aged 16 to 64 [27].

A. Dataset
The dataset used for training and testing is taken from other studies [28] [19]. It contains a total of 1552 comments, consisting of 685 negative comments and 867 positive comments. Fig. 4 shows the process of merging the two datasets. The two datasets are combined into a new DataFrame named df_merge, which stores the comments from DataFrames df and df2 in the Comments column and the labels from df and df2 in the Sentiment column.
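A merge of this kind can be sketched with pandas as follows. The names df, df2, df_merge, Comments, and Sentiment come from the description above; the source column names ("text", "label", "comment", "sentiment") and the sample rows are hypothetical, since the layout of the original files is not given.

```python
import pandas as pd

# Hypothetical stand-ins for the two source datasets
df = pd.DataFrame(
    {"text": ["nice photo", "you are ugly"], "label": ["positive", "negative"]}
)
df2 = pd.DataFrame({"comment": ["so pretty"], "sentiment": ["positive"]})

# Rename both frames to a shared schema, then stack them row-wise
df_merge = pd.concat(
    [
        df.rename(columns={"text": "Comments", "label": "Sentiment"}),
        df2.rename(columns={"comment": "Comments", "sentiment": "Sentiment"}),
    ],
    ignore_index=True,
)
```

Passing `ignore_index=True` gives the combined DataFrame a fresh 0..n-1 index instead of repeating the row labels of the two inputs.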

B. Performance Evaluation
Four scenarios were carried out to determine the best configuration for classifying cyberbullying comments. First, testing train set to test set ratios of 70:30 and 80:20. Second, testing with downsampling. Third, testing with Grid Search Cross Validation. Fourth, testing by comparing the Cyberbullying Instagram Comment dataset and the Cyberbullying Celebrity Comments dataset.
In the first test, the train set and test set with a ratio of 70:30 obtained an F1-score of 61% and a recall of 48%. The train set and test set with a ratio of 80:20 obtained an average F1-score of 66% and a recall of 53%. Based on these results, the 80:20 ratio performs better than the 70:30 ratio. However, the 80:20 split still struggles to avoid classifying comments that are actually bullying as non-bullying, as seen in its recall of 53%.
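The two split ratios can be illustrated with a simple shuffled split. This is a minimal sketch: the function name, the fixed seed, and the toy comments are assumptions, and the study may have used a library routine instead.

```python
import random


def train_test_split(samples, test_ratio, seed=42):
    """Shuffle a copy of the samples and use the first test_ratio share as the test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Toy comments, invented for illustration
comments = [f"comment {i}" for i in range(100)]
train_70, test_30 = train_test_split(comments, test_ratio=0.30)
train_80, test_20 = train_test_split(comments, test_ratio=0.20)
```

With an 80:20 ratio the model sees more training examples, at the cost of a smaller test set on which the scores are estimated.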
In the second test, the comments labeled positive were downsampled because the positive and negative classes were not balanced. The train set and test set with a ratio of 70:30 after downsampling obtained an F1-score of 75% and a recall of 83%.

The train set and test set with a ratio of 80:20 after downsampling obtained an F1-score of 78% and a recall of 87%. In this scenario, the downsampled 80:20 split performs better than the downsampled 70:30 split.
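Downsampling of the majority class can be sketched as below. The sketch assumes that the larger class is trimmed at random to the size of the smaller one; the text does not state the exact target size used in this study, and the function name and seed are hypothetical. The toy data reuse the class sizes of the dataset (867 positive, 685 negative).

```python
import random
from collections import Counter


def downsample(samples, seed=42):
    """Randomly trim every class to the size of the smallest class.

    `samples` is a list of (comment, label) pairs.
    """
    by_label = {}
    for comment, label in samples:
        by_label.setdefault(label, []).append((comment, label))
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid blocks of a single label
    return balanced


# Placeholder comments with the class sizes reported for the dataset
data = [(f"p{i}", "positive") for i in range(867)]
data += [(f"n{i}", "negative") for i in range(685)]
balanced = downsample(data)
counts = Counter(label for _, label in balanced)
```

After this step both labels have the same number of comments, so the classifier is no longer rewarded for favoring the majority label.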
In the third test, the downsampled train set and test set with a ratio of 70:30 obtained an F1-score of 75% and a recall of 81%. The downsampled train set and test set with a ratio of 80:20 obtained an F1-score of 84% and a recall of 77%. The purpose of the third test is to find the hyperparameters.
In the fourth test, using the Cyberbullying Instagram Comments dataset with a train set to test set ratio of 80:20, the average F1-score was 80% and the recall was 82%. Testing on the Cyberbullying Selebgram Comments dataset with the same 80:20 ratio gave an average F1-score of 57% and a recall of 45%.
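The Grid Search Cross Validation used in the third test can be sketched generically: every combination of hyperparameter values is scored by k-fold cross-validation and the best-scoring combination is kept. Everything below is a hypothetical illustration. The `fit` and `score` callables stand in for training and evaluating XGBoost, and the toy threshold "model" ignores its training data so that the search logic stays easy to follow.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size


def grid_search_cv(param_grid, fit, score, xs, ys, k=5):
    """Return the parameter combination with the best mean validation score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, valid in k_fold_indices(len(xs), k):
            model = fit([xs[i] for i in train], [ys[i] for i in train], **params)
            fold_scores.append(
                score(model, [xs[i] for i in valid], [ys[i] for i in valid])
            )
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def fit_threshold(xs, ys, threshold):
    """Toy stand-in for training: predicts 1 when x exceeds the threshold."""
    return lambda x: int(x > threshold)


def accuracy(model, xs, ys):
    """Share of validation samples predicted correctly."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


# Toy data, invented for illustration: the label is 1 when x >= 5
toy_x = [3, 8, 1, 6, 0, 9, 4, 7, 2, 5]
toy_y = [int(x >= 5) for x in toy_x]
best_params, best_score = grid_search_cv(
    {"threshold": [2, 4.5, 7]}, fit_threshold, accuracy, toy_x, toy_y, k=5
)
```

Averaging over folds makes the choice of hyperparameters less dependent on one particular train and validation split.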

IV. CONCLUSION
The Extreme Gradient Boosting algorithm for classifying cyberbullying comments has been implemented in the form of a web application. Using the F1-score and recall for the negative label as the reference for selecting the best model, the best model was obtained with a train set to test set ratio of 80:20, after downsampling, with the default XGBoost parameters. The results obtained are 75.20% accuracy, 71% precision, 87% recall, and a 78% F1-score.
One factor that lowers performance is the presence of non-standard words. This happens because the dataset is a collection of comments from Instagram, so some words are non-standard or contain typing errors. This causes the stemming and filtering processes to run less than optimally.
Based on this research, the following are suggestions for further research. First, use other classification methods such as Decision Trees, Random Forest Classifier, or Naïve Bayes Classification. Second, enlarge the dataset, which is expected to give better performance. Third, use a word embedding method. Word embedding can detect the semantic similarity of words by measuring the distance between their vectors. It is hoped that this will improve performance, because words with similar meanings can be treated as one group.
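The idea of measuring similarity between word vectors can be illustrated with cosine similarity, one common choice of measure. The three-dimensional vectors below are invented for illustration and do not come from any trained embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only
embeddings = {
    "ugly": [0.9, 0.1, 0.0],
    "hideous": [0.8, 0.2, 0.1],
    "beautiful": [-0.7, 0.3, 0.2],
}

similar = cosine_similarity(embeddings["ugly"], embeddings["hideous"])
dissimilar = cosine_similarity(embeddings["ugly"], embeddings["beautiful"])
```

In this toy setup the vectors for "ugly" and "hideous" point in nearly the same direction, so their similarity is higher than that of "ugly" and "beautiful".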