Recommendation for Classification of News Categories Using Support Vector Machine Algorithm with SVD

Online news is a digital information medium with a very easy and flexible updating process. News document grouping is implemented in several stages of text mining: text preprocessing (tokenizing, stopword removal, stemming and word merging), TF-IDF weighting, and evaluation with a confusion matrix. Among text mining techniques, the Support Vector Machine (SVM) algorithm is the most frequently used for news document classification. SVM has the advantage of identifying a separating hyperplane that maximizes the margin between classes. The selection of features significantly affects the classification accuracy of the SVM algorithm. Therefore, this study combines SVM with Singular Value Decomposition (SVD) in order to increase accuracy and reduce the classification time of the Support Vector Machine. This research produces a text classification into the categories Entertainment, Health, Politics and Technology. With the Support Vector Machine algorithm alone, an accuracy of 81% was obtained with 360 training data and 120 testing data. After adding Singular Value Decomposition with a K-Rank value of 50%, accuracy increased to 94% and the processing time of the algorithm was shorter.


I. INTRODUCTION
Information has become one of the necessities of everyday human life. Information can be interpreted as knowledge gained from learning, experience or instruction. In some cases it is knowledge about events or situations that has been collected or received through communication, such as gathering news about certain events [1]. In the modern era of sophisticated technology, the community needs speed and accuracy in obtaining news or information. Newspapers are the people's choice for obtaining fast and accurate news and information.
In general, the news conveyed on news portals consists of various categories such as political news, technology news, entertainment news, health news and others, for example on the Indonesian news websites www.kompas.com, www.waspada.com and www.vivanews.com [2]. However, the distribution of news into categories is still done manually by the author: before uploading, writers must first know the content of the news so that it can be placed in the correct category. Therefore, a system is needed that can automatically classify news into the existing categories based on the title and content of the news, so that it can assist news uploaders.
The SVM algorithm can be explained simply as an attempt to find the best hyperplane that separates two classes in the input space. The SVM method is rooted in statistical learning theory and is very promising in providing better results than other similar methods [3]. To obtain higher accuracy than the Support Vector Machine algorithm alone, an experiment is carried out by adding the Singular Value Decomposition (SVD) method. SVD is a matrix decomposition technique that facilitates data processing. It was chosen because it improves processing time on large-scale datasets and because it can decompose a term-document matrix, producing a matrix of much smaller dimensions that still stores the important information.
The purpose of this research is to build a system that can automatically classify news into its actual category, to find out the difference in the final results of news classification between the Support Vector Machine algorithm alone and with the addition of the Singular Value Decomposition method, and to make it easier for news uploaders to upload news with an automatic news categorization system.

A. Text Mining
Text mining is a technique in computer science that can be used to solve a large number of information problems by combining techniques from data mining, machine learning, natural language processing, information retrieval and knowledge management [4]. Text mining seeks to extract useful information from data sources by identifying and exploring interesting patterns. In text mining the data source is a collection of documents, which means that the data can be in the form of newspapers, magazines, articles, letters, or research reports such as journals or theses.

B. Stages of Text Mining
An explanation of each of the text preprocessing processes is as follows: [5]
1). Tokenizing.
Text is unstructured data that must first be made structured before further analysis. The text entered into the application is stored in a one-dimensional array. The text is first divided into sentences, and each sentence is then divided into words based on spaces.
2). Stopword removal.
After the tokenizing process, each word stands on its own and is no longer tied to other words. As a result of this separation, some words have no relevant meaning for determining the characteristics of the tokenized document, such as "this", "that", "and", "or" and many other words of the same kind. Words that have no relevant meaning are called stopwords.
3). Stemming.
Stemming is the process of mapping the variant forms of a word to its basic word form. In Indonesian documents the stemming process is necessary before entering the text mining process, because Indonesian has prefixes, suffixes, infixes and confixes that can change a basic word into many forms and so make word searches difficult. The following are examples and meanings of affixes in Indonesian.
a. Suffixes (akhiran) are affixes added to the end of a word, such as "-an", "-kan" and "-i".
b. Prefixes (awalan) are affixes added to the beginning of a root word or the basic form of a word, such as "per-" and "mem-".
c. Confixes (a combination of prefix and suffix) are single affixes formed from two separate elements, such as "ke-...-an".
4). Word weighting.
The TF-IDF method calculates the weight of each word and is the weighting method most commonly used in information retrieval. It is also known to be efficient, easy to use and accurate [6]. With this method, the data that has gone through the preprocessing stage is converted into numeric form. The Term Frequency Inverse Document Frequency (TF-IDF) method is commonly used to determine how strongly a word (term) is connected to a document by giving a weight to each word. The method combines two concepts: the frequency of occurrence of a word in a document and the inverse frequency of the documents containing that word [7]. In calculating the weight with TF-IDF, the TF value of each word is calculated first, with each occurrence of a word counting as 1. TF (Term Frequency) states how many times a word appears in a document. DF (Document Frequency) states how many documents contain the word. The weight is calculated using the following formula:
$$\mathrm{TF\text{-}IDF}(w,d) = \mathrm{TF}(w,d) \times \mathrm{IDF}(w), \qquad \mathrm{IDF}(w) = \log\frac{TD}{DF(w)}$$
Explanation:
• TF-IDF(w,d): the weight of word w in document d.
• TF(w,d): the number of times word w appears in document d.
• IDF(w): the IDF value of the word being weighted.
• TD: the total number of existing documents.
• DF(w): the number of documents that contain word w.
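The word weighting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: it assumes raw term counts for TF, an unsmoothed IDF of log(TD / DF) with a base-10 logarithm, and documents that are already tokenized. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    total_docs = len(documents)  # TD: total number of documents
    # DF: number of documents that contain each term at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        term_freq = Counter(doc)  # TF: raw count of each term in this document
        weights.append({
            term: count * math.log10(total_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

# Hypothetical, already preprocessed documents (illustration only)
docs = [["harga", "saham", "naik"], ["saham", "turun"], ["film", "baru", "rilis"]]
weights = tf_idf(docs)
```

A term that occurs in fewer documents ("harga", one document) receives a larger weight than a term shared by several documents ("saham", two documents), and a term present in every document would receive a weight of zero under this unsmoothed variant.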

C. Confusion Matrix
The confusion matrix is the method used to calculate accuracy. In testing, the classification results are evaluated in terms of recall, precision and accuracy. Precision evaluates the ability of the system to return the most relevant results and can be defined as the percentage of retrieved documents that are truly relevant to the query. Recall evaluates the system's ability to find all relevant items in the collection and can be defined as the percentage of relevant documents that are retrieved. Accuracy is the ratio between the number of cases that are correctly identified and the total number of existing cases. [6]
ISSN 2085-4552

D. Support Vector Machine
Support Vector Machine (SVM) is a technique that is relatively new compared with other existing techniques, but it performs much better in various application fields such as bioinformatics, handwriting recognition, text classification and so on [3]. SVM is a technique for making predictions, both for classification and for regression. In principle SVM is a linear classifier, that is, it handles cases that are linearly separable, but it has been developed to work on non-linear problems by applying the kernel concept in a high-dimensional feature space. The kernel function maps the data from a lower initial dimension to a relatively higher dimension. Kinds of kernel functions include: [10]
Formula: Support Vector Machine kernel functions.
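To make the kernel idea concrete, the sketch below implements four kernel functions that are commonly listed for SVM: linear, polynomial, radial basis function (RBF) and sigmoid. Whether these are exactly the kernels given in [10] is an assumption, and the function names and default parameter values (`degree`, `coef0`, `gamma`) are chosen here only for illustration.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    # K(x, y) = x . y
    return dot(x, y)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    # K(x, y) = (x . y + coef0) ^ degree
    return (dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, y, gamma=0.5, coef0=0.0):
    # K(x, y) = tanh(gamma * x . y + coef0)
    return math.tanh(gamma * dot(x, y) + coef0)
```

Each function returns a similarity score for a pair of feature vectors (for example two TF-IDF vectors), so the classifier can work with an implicit higher-dimensional mapping without computing that mapping explicitly.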

E. Singular Value Decomposition (SVD)
Singular Value Decomposition is one of many matrix processing techniques from linear algebra and was introduced by Beltrami in 1873. SVD is one of the stages of the Latent Semantic Analysis (LSA) method. It is commonly used as a mathematical tool to represent a matrix and to perform various matrix analyses and computations [11]. SVD decomposes a matrix into three new matrices: the orthogonal matrix U, the diagonal matrix S, and the transpose of the orthogonal matrix V. It can be formulated as follows:
$$A_{m \times n} = U \, S \, V^{T}$$
Explanation:
• A: a matrix of size m x n.
• U: the left singular vectors of the matrix A; these vectors are orthonormal.
• S: the diagonal matrix whose entries are the singular values of the corresponding singular vectors.
• V^T: the transposed right singular vectors of the matrix A; these vectors are also orthonormal.
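For dimensionality reduction, the decomposition is usually truncated to the k largest singular values. The block below sketches that standard step for a term-document matrix with m terms and n documents. The symbols k, U_k, S_k, V_k and \hat{d} are introduced here for illustration, and how the study maps its K-Rank percentage to a concrete k is not specified, so that mapping is left open.

```latex
% Rank-k approximation: keep only the k largest singular values
A \approx A_k = U_k \, S_k \, V_k^{T},
\qquad U_k \in \mathbb{R}^{m \times k},\;
S_k \in \mathbb{R}^{k \times k},\;
V_k^{T} \in \mathbb{R}^{k \times n},\;
k \le \operatorname{rank}(A)

% Projection of a document vector d (one column of A, length m)
% into the reduced k-dimensional space
\hat{d} = S_k^{-1} \, U_k^{T} \, d
```

The classifier is then trained on the k-dimensional vectors \hat{d} instead of the original m-dimensional TF-IDF vectors, which is where the reduction in processing time comes from.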
F. F1-Score
F1, or F-Measure, is the harmonic mean of precision and recall, that is, a single score that balances the two. The value of F1 ranges from 0 to 1. The equation is as follows: [11]
$$F_1 = \frac{2 \cdot P \cdot R}{P + R}$$
Explanation:
• F1: F-Measure or F-Score
• P: Precision
• R: Recall
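The evaluation measures can be computed directly from a multi-class confusion matrix, as in the following sketch. It assumes that rows hold the true class and columns the predicted class. The function name `per_class_metrics` is hypothetical and the counts are made up for illustration; they are not the results of this study.

```python
def per_class_metrics(matrix, labels):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    report = {"accuracy": correct / total}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: all predicted as class i
        actual = sum(matrix[i])                    # row sum: all truly in class i
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Made-up counts for illustration only; these are not the study's results.
labels = ["Entertainment", "Health", "Politics", "Technology"]
matrix = [
    [8, 1, 1, 0],
    [0, 9, 0, 1],
    [2, 0, 8, 0],
    [0, 0, 0, 10],
]
report = per_class_metrics(matrix, labels)
```

With these example counts, 35 of 40 samples lie on the diagonal, so the overall accuracy is 0.875, while precision and recall differ per category depending on the column and row sums.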

III. RESEARCH METHODOLOGY
This chapter explains the stages of the system design carried out by the author, which include Data, System Description, Analysis Model and Interface Design. The explanation is as follows.

A. Development Method
The system development method used in this research is the Waterfall method, or structured approach. In general, the Waterfall method is often used to analyze systems. The essence of this method is that the work on a system is carried out sequentially through several stages of activities.

B. Block Diagram
The system process is shown in the following block diagram: Figure 2. Block Diagram.
The classification process starts from input data in the form of news titles. It continues with the text operations, which consist of several stages: tokenizing, which separates the text into words; stopword removal, which deletes words that carry no meaning; stemming, which reduces affixed words to their base form; and TF-IDF weighting, which assigns a weight based on the frequency of each word that remains after stemming. The data then enters the word merging (synonym) process: if different words have the same meaning, the system combines them and adds their frequencies together. The next stage is the Support Vector Machine and Singular Value Decomposition. Next is the testing phase using the confusion matrix, in which the number of correct predictions is divided by the total number of data. The last stage is the classification of new text data, in the form of the title and content of a news item, to determine the category of the news.
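The word merging (synonym) step can be illustrated with a small sketch that folds the frequencies of synonymous terms into one canonical term. The synonym map, the function name `merge_synonyms` and the example counts are all hypothetical; the dictionary actually used by the system is not given here.

```python
from collections import Counter

# Hypothetical synonym map: each variant points to one canonical term.
SYNONYMS = {"hp": "ponsel", "handphone": "ponsel", "telepon": "ponsel"}

def merge_synonyms(term_counts, synonyms=SYNONYMS):
    """Fold the counts of synonymous terms into their canonical term."""
    merged = Counter()
    for term, count in term_counts.items():
        # Terms without an entry in the map are kept unchanged.
        merged[synonyms.get(term, term)] += count
    return merged

counts = Counter({"hp": 2, "ponsel": 3, "handphone": 1, "harga": 4})
merged = merge_synonyms(counts)
```

After merging, the three variants are represented by a single term whose frequency is the sum of the individual frequencies, which reduces the number of distinct features passed to the classifier.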

C. Tokenizing
In the tokenizing process, sentences are broken into individual words. The words are also converted from uppercase to lowercase, and special characters that are not part of a word are removed.
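A minimal sketch of this tokenizing step is shown below. It assumes that only the letters a to z are kept, so digits and punctuation are discarded; the function name `tokenize` and the sample headline are illustrative.

```python
import re

def tokenize(text):
    """Lowercase the text, blank out non-letter characters, and split on whitespace."""
    lowered = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return letters_only.split()

tokens = tokenize("Harga BBM Naik, Pemerintah Siapkan Subsidi!")
# tokens == ["harga", "bbm", "naik", "pemerintah", "siapkan", "subsidi"]
```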

D. Stopword Removal
The next process is stopword removal, a filtering process in which unimportant words are discarded from the text. The system checks each word against the stopword list dictionary; if the word exists in the list, it is deleted.
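The dictionary check can be sketched as a set lookup. The stopword list below is a tiny hypothetical sample; a real system would load a complete Indonesian stopword dictionary.

```python
# Hypothetical, very small stopword list used only for illustration.
STOPWORDS = {"yang", "dan", "di", "ini", "itu", "atau"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token not in stopwords]

filtered = remove_stopwords(["pemerintah", "dan", "dpr", "bahas", "anggaran", "ini"])
# filtered == ["pemerintah", "dpr", "bahas", "anggaran"]
```

Using a set keeps each lookup at constant time on average, which matters once the dictionary contains several hundred entries.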

E. Stemming
After the stopword removal stage, the next process is stemming, in which the system takes each word from the news text and converts it into its basic word form.
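The idea of stemming can be shown with a deliberately naive affix stripper. This is not the stemmer used in the study: practical Indonesian stemmers check candidates against a root-word dictionary, whereas this sketch only removes at most one prefix and one suffix from short hypothetical lists and will over-stem some words.

```python
# Hypothetical affix lists, far from complete.
PREFIXES = ("mem", "per", "ke")
SUFFIXES = ("kan", "an", "i")
MIN_ROOT_LENGTH = 3  # never strip an affix if fewer than 3 letters would remain

def naive_stem(word):
    """Strip at most one listed prefix and one listed suffix from the word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_ROOT_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_ROOT_LENGTH:
            word = word[:-len(suffix)]
            break
    return word

stems = [naive_stem(w) for w in ["membaca", "makanan", "keadaan", "ikan"]]
# stems == ["baca", "makan", "ada", "ikan"]
```

The length guard is what keeps a short root such as "ikan" intact; without a dictionary, however, a word such as "beli" would still lose its final "i", which is why dictionary-based checks are used in practice.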

IV. RESULT AND DISCUSSION
This chapter describes the stages of implementation and testing of the system that has been built.

A. User Interface
The image below shows the Process Data display of the program that was created.

B. News Classification
The image below displays the classification of new news items. In this stage, news categories are classified based on the title and content of the news. There are 480 news documents (title and content), divided into 360 training data and 120 testing data. The following results are obtained: the TF-IDF stage yields a total of 4092 terms from the 480 news documents. In this process the system carries out training using only one algorithm, the Support Vector Machine, and calculates accuracy with the confusion matrix.
Figure 5. Source Code SVM
Figure 6. Confusion Matrix SVM Algorithm.
The results are read from Figure 5 and Figure 7. The average scores are:
• The average accuracy is 94%.
• The average precision is 90%.
From the classification results in Figure 8, the accuracy of each category is known. The accuracy of each category is obtained from the number of words/terms that remain after the text preprocessing stages. The existing words/terms are checked against the training data and testing data, and the total number of words/terms in each news category is then calculated. The Support Vector Machine and Singular Value Decomposition algorithms in this study were tested with only a small amount of data; testing with more data would be preferable.