Sentiment Analysis about Indonesian Lawyers Club Television Program Using K-Nearest Neighbor, Naïve Bayes Classifier, and Decision Tree



INTRODUCTION
Sentiment analysis is a process for determining an opinion or response regarding a particular product or topic. Sentiment analysis can be useful for overcoming several problems, one of which is determining how people respond to a television broadcast [1].
One of the most popular TV talk shows in Indonesia is Indonesia Lawyers Club (ILC). ILC is a talk show on TVOne that features dialogues on public phenomena and legal or criminal issues [2]. According to Liputan6.com, ILC has been nominated several times for the Panasonic Gobel Awards. Its most recent win was at the 2018 Panasonic Gobel Awards in the News Talkshow Program category. However, in 2019, ILC only received a nomination at the Panasonic Gobel Awards because it lost to Mata Najwa, a talk show similar to ILC.
The award winner is determined from the rating. The rating is especially important for the survival of a television show, but it does not show the quality level of the show. Viewers often share their opinions on television shows through social media, one of which is Twitter [2]. An opinion is an expression of a belief shared among members of a group or the public about a controversial issue that concerns the public interest. Opinions are not always logical; they are formless, always ambivalent, contradictory, and easy to change [3]. Public opinion on ILC can be derived from tweets taken from Twitter because quite a lot of Twitter users also watch ILC.
This study applies a sentiment analysis approach to measure public opinion on Twitter about ILC and Mata Najwa in 2018 and 2019. It compares the public opinion results for ILC and Mata Najwa and validates them using three classification algorithms, namely K-Nearest Neighbor, Naïve Bayes Classifier, and Decision Tree.

A. Sentiment Analysis
Data mining is the process of discovering knowledge in a database. This knowledge can be interpreted as data patterns or relationships between valid data that were not previously known [4]. Sentiment analysis is usually done to gauge public or customer opinion on a product or service owned by a company, organization, or entity [5]. Sentiment analysis can also be interpreted as the study of the opinions, problems, feelings, or emotions that a person or the public expresses in text when responding to something. A sentiment is determined by scoring some of the words contained in a sentence, document, or text [6]. RapidMiner is software that can be used for data mining. In text mining, RapidMiner can be used to analyze text and find patterns in large datasets by combining various methods from statistics, artificial intelligence, and databases.
ISSN 2355-0082
Several steps need to be taken to get the best results when analyzing sentiment. The steps consist of data collection, data preprocessing, and sentiment classification. Fig. 1 shows the steps of the sentiment analysis in this study.

B. Data Collecting
Data are collected with Python by retrieving tweets from 2018 and 2019. Data collection for ILC uses the keywords "IndonesiaLawyersClub", "ILC", and "ILCtvone", while Mata Najwa uses the keyword "Mata Najwa". After the data are retrieved, tweets that express an opinion are selected and labeled manually by three volunteers, 30 tweets per month for each of 2018 and 2019. Compared with a single reviewer, using three reviewers increases the accuracy of the manual review process [7]. An example of the labeling result can be seen in Fig. 2.
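The source does not state how the labels from the three volunteers were combined into one final label. A common choice, assumed here purely for illustration, is majority voting. The function name, tweet identifiers, and label strings below are hypothetical.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by most annotators, or None when there is a tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no majority, e.g. three different labels
    return counts[0][0]

# Hypothetical votes from three volunteers for two tweets.
votes = {
    "tweet_1": ["positive", "positive", "negative"],
    "tweet_2": ["negative", "negative", "negative"],
}
final_labels = {tweet_id: majority_label(v) for tweet_id, v in votes.items()}
```

With three annotators and two classes a majority always exists; the tie branch only matters if more classes (for example, a neutral label) are introduced.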

C. Preprocessing
After collecting and labeling the data, the next step is to preprocess the data. As can be seen in Fig. 3, preprocessing consists of several processes, from cleansing to word weighting. The sub-operators of the Document Process Operator can be seen in Fig. 4.
• Cleansing is the step that removes duplicate data, URLs, symbols, numbers, and unneeded punctuation such as exclamation marks, question marks, and quotation marks from the text. For example: "Sebuah Program televisi "Kriminal",tidak dicocokan untuk anak! www.kriminalberita.com" becomes "Sebuah Program televisi Kriminal tidak dicocokan untuk anak".
• Case folding is the step that converts a sentence to all uppercase or all lowercase letters. In this study, all letters are converted to lowercase to facilitate the next process. For example: "Sebuah Program televisi Kriminal tidak dicocokan untuk anak" becomes "sebuah program televisi kriminal tidak dicocokan untuk anak".
• Tokenization is the step that breaks a sentence or document into several parts or words called tokens. There are three types of tokens, namely unigram, bigram, and trigram [8]. In this research, the type of token used is
• Filtering is the step that eliminates words that appear often but are not needed or carry no meaning. Words that appear in large numbers and are considered to carry no meaning are called stopwords. For example: "sebuah, program, televisi, kriminal, tidak, dicocokan, untuk, anak" becomes "program, televisi, kriminal, tidak, dicocokan, anak".
• Stemming is the step that reduces every word carrying a prefix or suffix to its base word according to the correct Indonesian rules. Stemming is done by removing each prefix and suffix from the word. For example: "program, televisi, kriminal, tidak, dicocokan, anak" becomes "program, televisi, kriminal, tidak, cocok, anak".
• Word weighting is the final preprocessing step, in which a score is calculated from the frequency with which words occur in a document or text. One method for weighting words is term frequency-inverse document frequency (TF-IDF). TF-IDF is a weighting method that combines two concepts, namely term frequency and document frequency. Term frequency weights a term by how often it occurs in a document or text; the lower the frequency of occurrence, the lower the value. Documents usually differ in length, so a word may appear more often in a long document than in a short one. Term frequency is therefore usually divided by the length of the document, that is, the number of words in it. Document frequency is the number of documents or texts in which a word appears. Term frequency alone treats every word as equally important, so it is necessary to calculate TF-IDF, whose scores can be obtained using an equation [9].

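The preprocessing steps above can be sketched in plain Python. This is a minimal illustration and not the RapidMiner process used in the study. Several points are assumptions: the two-word stopword list is hypothetical and only covers the worked example, tokens are taken to be unigrams, duplicate removal and stemming are omitted (stemming needs an Indonesian dictionary), and the weight uses the common form tf × log(N / df), which may differ from the equation in [9].

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, just large enough for the worked example.
STOPWORDS = {"sebuah", "untuk"}

def cleanse(text):
    """Remove URLs, then replace symbols, digits, and punctuation with spaces."""
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return " ".join(text.split())  # collapse repeated whitespace

def preprocess(text):
    """Cleansing, case folding, unigram tokenization, then stopword filtering."""
    tokens = cleanse(text).lower().split()
    return [t for t in tokens if t not in STOPWORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Length-normalized term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

tweet = 'Sebuah Program televisi "Kriminal",tidak dicocokan untuk anak! www.kriminalberita.com'
tokens = preprocess(tweet)
# tokens == ['program', 'televisi', 'kriminal', 'tidak', 'dicocokan', 'anak']
```

With this weighting, a term that occurs in every document receives a weight of zero, which is how the inverse document frequency discounts words that do not help distinguish one document from another.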
D. Classification
Sentiment classification is the step in which all preprocessed data are further processed with a classification algorithm. This study uses three classification algorithms, namely Decision Tree, K-Nearest Neighbor, and Naïve Bayes Classifier. Fig. 5 shows the sentiment classification operator, which contains several sub-operators and sub-sub-operators. There are 2 sub-operators, namely Cross-Validation and Apply Model (2). Cross-Validation contains 5 sub-sub-operators, which are divided into 2 parts, training and testing. The training part contains the algorithms, namely K-NN, Decision Tree, and Naïve Bayes. The testing part contains Apply Model and Performance.
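The split into a training part and a testing part can be illustrated with a plain k-fold cross-validation loop. This is only a sketch of the general technique. The source does not state the number of folds or the sampling type configured in RapidMiner, so `k=5` and the unshuffled contiguous folds are assumptions, and `train_fn` and `predict_fn` are hypothetical stand-ins for any of the three algorithms.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validate(samples, labels, train_fn, predict_fn, k=5):
    """Average accuracy over k folds; requires k <= len(samples)."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        # Training part: fit the model on the training folds only.
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # Testing part: apply the model and measure its performance.
        hits = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k
```

Each sample is used for testing exactly once, so the averaged accuracy reflects performance on data the model did not see during training.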
Fig. 6 illustrates K-NN with two classes, namely circles and triangles. A new data point whose class is not yet known is marked with a red square. To determine the class of the square, a decision rule is needed [4]. In this example, K = 3, which means that the new point is classified according to its 3 nearest neighbors. With K = 3, 2 of the nearest neighbors are circles and 1 is a triangle. Because circles outnumber triangles, the square is assigned to the circle class.
Fig. 6. K-Nearest Neighbor method [4]
As a method known for using conditional probabilities, the Naïve Bayes Classifier is formulated as follows [4].
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)} \qquad (1)$$

In Equation 1, X is the evidence (the data) and H is the hypothesis. P(H | X) is the probability that hypothesis H is true given evidence X. P(X | H) is the probability of observing evidence X given that hypothesis H is true. P(H) is the prior probability that hypothesis H is true for any data object, regardless of the values of its attributes, while P(X) is the prior probability of the data object X.
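To show how this rule turns into a classifier, the sketch below picks the class H with the largest P(X | H) P(H), dropping P(X) because it is the same for every class. It is an illustration under stated assumptions, not the RapidMiner operator used in the study: it uses a multinomial model over word counts with add-one (Laplace) smoothing, and the training tweets and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect class counts and per-class word counts from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_counts, word_counts, vocab

def predict_naive_bayes(model, tokens):
    """Return the class H that maximizes log P(H) + sum of log P(word | H)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log P(H)
        for token in tokens:
            # Add-one smoothing avoids log(0) for words unseen in this class.
            p = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(p)  # log P(x_i | H), words assumed independent
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical, already preprocessed training tweets.
docs = [["acara", "bagus"], ["bagus", "menarik"],
        ["acara", "buruk"], ["buruk", "membosankan"]]
labels = ["positive", "positive", "negative", "negative"]
model = train_naive_bayes(docs, labels)
prediction = predict_naive_bayes(model, ["bagus"])  # "positive"
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.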
Decision Tree is a hierarchical model in which local regions are identified through a sequence of recursive splits, each made by a decision node that applies a test function. Decisions in a decision tree are mostly made through logical tests. As shown in Fig. 7, a Decision Tree is a tree-shaped flowchart structure in which each internal node (a node that is not a leaf) tests an attribute, each branch represents a test result, and each leaf node (or terminal node) indicates a class label. The node at the top of the decision tree is the root node.
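Of the three classifiers described in this section, K-NN is the simplest to express in code. The sketch below reproduces the K = 3 majority vote illustrated in Fig. 6. The coordinates are hypothetical, and Euclidean distance is an assumption, since the source does not name the distance measure.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D coordinates in the spirit of Fig. 6.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 2.5), (5.0, 5.0), (6.0, 5.5)]
labels = ["circle", "circle", "triangle", "triangle", "triangle"]

# The three nearest neighbors of the query are two circles and one triangle.
predicted = knn_predict(points, labels, query=(2.0, 2.0), k=3)  # "circle"
```

An odd value of K avoids tied votes when there are only two classes.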

III. RESULTS
This section explains the results of the analysis process.
The comparison between ILC and Mata Najwa in each year can be seen in Fig. 10 and Fig. 11. In 2018, the ILC trend line tends to decrease while that of Mata Najwa tends to rise. However, the difference is still small compared with 2019, where the ILC trend line is far below that of Mata Najwa and continues to fall, while the Mata Najwa trend line is far above and tends to rise. In 2019, ILC lost to Mata Najwa.
In March 2019, the number of positive sentiments towards ILC fell to one. Further analysis of the tweets crawled in March 2019 reveals the reason for this dramatic decline, which can be seen in Fig. 12. The decline stems from disappointment with ILC because one of its speakers, Rocky Gerung, was not presented; this is highlighted in yellow. Fig. 13 shows the community's objections to the topic of an ILC discussion, which they asked to be revised, as well as their longing for Rocky Gerung, while Fig. 14 shows the reason Rocky Gerung is no longer present on ILC.
Fig. 14. News about the reason Rocky Gerung is no longer present at ILC [11]
This manual analysis shows that public opinion on Twitter can help explain the Panasonic Gobel Awards outcome for ILC and Mata Najwa. In 2018, ILC outperformed Mata Najwa, which matches the public opinion results: ILC is above Mata Najwa before October 2018. Afterward, Mata Najwa receives more positive opinions in 2019 and comes out as the winner that year.

B. Analysis Using RapidMiner
The analysis using RapidMiner is applied to validate the manual labeling results by predicting them with several algorithms, using the algorithms and operators available in RapidMiner. The main operators can be seen in Fig. 15. Data are retrieved from Twitter and then processed by a series of operators built in the RapidMiner application. The series starts with Read Dataset Training and continues through Preprocessing to Sentiment Classification.
The results are given in Tables III and IV and in Tables V and VI. By using the three algorithms, it can be stated that the number of True Positive in 2019 is more than in 2019.
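The True Positive counts and accuracy values discussed in this section come from comparing the manual labels with the labels predicted by each algorithm. A minimal sketch of that comparison follows. The label lists are hypothetical and are not taken from the study's data, and treating "positive" as the positive class is an assumption.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Count TP, FP, FN, and TN, treating `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(a == positive and p == positive for a, p in pairs),
        "FP": sum(a != positive and p == positive for a, p in pairs),
        "FN": sum(a == positive and p != positive for a, p in pairs),
        "TN": sum(a != positive and p != positive for a, p in pairs),
    }

def accuracy(counts):
    """Share of predictions that agree with the manual labels."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

# Hypothetical manual labels and model predictions for five tweets.
manual = ["positive", "positive", "negative", "negative", "positive"]
model = ["positive", "negative", "negative", "positive", "positive"]
counts = confusion_counts(manual, model)
# counts == {"TP": 2, "FP": 1, "FN": 1, "TN": 1}; accuracy(counts) == 0.6
```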

As shown in Table VII, Naïve Bayes was the best algorithm in 2018, while K-NN was the best algorithm in 2019. This means that no algorithm is always at the top. The choice of algorithm depends on the content of the data, and the level of accuracy changes from one dataset to another. In 2018, the performance of both Mata Najwa and ILC fluctuated, so K-NN had more difficulty finding the nearest neighbors. In such a situation, the conditional probability approach of Naïve Bayes performs better. In 2019, by contrast, public opinion of Mata Najwa is clearly above that of ILC, so K-NN can easily separate them by nearest neighbor. The best accuracy is found for ILC in 2019 using K-NN because its positive public opinion in 2018 is above that in 2019, which makes it easier for K-NN to separate them by nearest neighbor.

IV. CONCLUSION
Public opinion on Twitter can be successfully applied to confirm the winner of the Panasonic Gobel Award. According to the manual analysis, ILC wins in 2018, when its number of positive sentiments is slightly higher than that of Mata Najwa. However, ILC lost in 2019 because its number of positive sentiments declined while Mata Najwa was clearly above ILC. In addition, from 2018 to 2019, the number of positive sentiments for ILC decreased dramatically, while that of Mata Najwa fluctuated.
Three algorithms are applied to validate the manual labeling results, with the highest accuracy, 76.94%, achieved by K-NN. However, no single algorithm shows the best performance on all data. In 2018, Naïve Bayes was the best algorithm, while in 2019, K-NN was the best algorithm.