Text Mining an Automatic Short Answer Grading (ASAG), Comparison of Three Methods of Cosine Similarity, Jaccard Similarity and Dice's Coefficient

This study examines how well Automatic Short Answer Grading (ASAG) scores correlate with teacher grades by comparing three methods, Cosine Similarity, Jaccard Similarity and Dice's Coefficient, using one reference answer per question. The similarity scores were computed in Python and the data were processed in a spreadsheet. The Dice's Coefficient method obtained the highest average correlation, 0.76, followed by Cosine Similarity with an average correlation of 0.76, and the Jaccard method obtained the lowest average correlation, 0.69. The contribution of this study is the use of three methods on one dataset, whereas previous research applied only one or two methods to a dataset. The study therefore offers a more complete comparison of the accuracy of the methods.


Introduction
During the current COVID-19 pandemic, the Indonesian government has adopted several policies, one of which limits social interaction. This policy has had a significant impact on education: learning that originally took place on campus now takes place at home, so learning activities are conducted online. This has encouraged educational institutions in Indonesia to start developing e-learning systems. E-learning is a learning method in which the learning process, the teaching process and even the assessment process are conducted electronically over the internet. With e-learning, learning outcomes can be assessed automatically by an automatic grading system, which has the advantage of scoring answers quickly and objectively.
There are three kinds of assessment model: multiple choice, true/false, and essay (description) [2]. In higher education institutions, most lecturers give essay questions. An essay question does not provide answer choices, so the student must answer in sentences. Essay answers are therefore not as easy to grade as multiple-choice or true/false answers, and they require natural language processing. Essay questions are an appropriate way to assess learning outcomes because answering them draws on the student's ability to recall and express ideas. One problem in assessing essay answers is subjectivity: the assessment of one lecturer may differ from that of another. Another problem is that a lecturer may grade inconsistently, for example by giving different scores to the same answer.
Automatic Short Answer Grading (ASAG) assigns a score based on the similarity between the student's answer and the lecturer's answer. It differs from automated essay scoring in the length of the answer: in ASAG the answer length ranges from two words to one paragraph [2]. Other researchers limited the answer length to twenty words so that the results would be more relevant and yield a better correlation [4] [5]. The method often used to score such answers is string-based similarity, which is divided into character-based and term-based measures. Character-based similarity scores an answer by calculating the similarity of characters, while term-based similarity calculates similarity based on terms. Much ASAG research has been done, but most of the datasets used contain questions and answers in English.
The purpose of this study is to score the similarity of short answers written in Indonesian by comparing the Cosine Similarity, Jaccard Similarity and Dice's Coefficient methods.

Text Mining
According to Firdaus [7], text mining is the process of analyzing text to extract useful information for a specific purpose. Text mining is a branch of data mining. The difference between the two lies in the form of the data: data mining works on structured data, whereas text mining works on unstructured data [8]. Research on text mining has covered word matching [9], document compaction [10], plagiarism detection [11], sentiment analysis [12] and automated assessment of essay answers [13] [14]. Research on automated assessment using natural language processing techniques was first conducted by Page [15], and many other researchers have since developed it further.

Preprocessing Techniques on Automatic Short Answer Grading (ASAG)
An Automatic Short Answer Grading system grades short answers automatically by comparing student answers with lecturer answers. Table 1 lists the five categories of ASAG preprocessing techniques [2].

Term-based Similarity Measures Cosine Similarity, Jaccard Similarity and Dice Coefficient
According to Hall and Dowling [16], string-based similarity measures are divided into fourteen algorithms, seven of them character-based and the other seven term-based. This study focuses on three term-based algorithms: Cosine Similarity, Jaccard Similarity and Dice's Coefficient.

Cosine Coefficient Method
The Cosine Similarity method calculates the similarity between two objects. In general, the calculation is based on a vector space similarity measure: the two objects (e.g., documents D1 and D2) are expressed as two vectors, using the keywords of the documents as the vector components.
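The calculation can be sketched in Python as follows. This is a minimal illustration, assuming each document is represented by the term counts of its tokens; the function name `cosine_similarity` and the sample tokens are hypothetical and not taken from the study.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the term-count vectors of two token lists."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the terms that both documents contain.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical, already preprocessed tokens for documents D1 and D2.
d1 = ["business", "activity", "electronic", "internet"]
d2 = ["business", "transaction", "internet"]
print(round(cosine_similarity(d1, d2), 4))  # 0.5774
```

Two of the terms are shared, so the score is 2 / (2 * sqrt(3)); identical documents score 1 and documents with no shared terms score 0.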

Jaccard Coefficient
The Jaccard Coefficient is another method for calculating the similarity between two objects (items). As with cosine distance and the matching coefficient, the calculation is in general based on a vector space similarity measure.
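A minimal Python sketch of the Jaccard coefficient, assuming each answer is reduced to its set of unique tokens; the function name `jaccard_similarity` and the sample tokens are hypothetical.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection divided by the size of the union of two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # both answers are empty
    return len(set_a & set_b) / len(union)


# Hypothetical, already preprocessed tokens.
reference = ["business", "activity", "electronic", "internet"]
student = ["business", "transaction", "internet"]
print(jaccard_similarity(reference, student))  # 0.4
```

Here two tokens are shared out of five distinct tokens overall, which gives 2 / 5.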

Dice's Coefficient
Dice's coefficient is a method for comparing the similarity of two text samples. It is a semimetric version of the Jaccard coefficient. The method maintains accuracy on diverse datasets and gives less weight to datasets containing unrelated features [19].
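A minimal Python sketch of Dice's coefficient on sets of unique tokens, together with a check of its relation to the Jaccard coefficient J, namely Dice = 2J / (1 + J). The function names and the sample tokens are hypothetical.

```python
def dice_coefficient(tokens_a, tokens_b):
    """Twice the number of shared tokens divided by the total size of both token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    total = len(set_a) + len(set_b)
    if total == 0:
        return 0.0  # both answers are empty
    return 2 * len(set_a & set_b) / total


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient, included only to show the relation to Dice."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Hypothetical, already preprocessed tokens.
reference = ["business", "activity", "electronic", "internet"]
student = ["business", "transaction", "internet"]

dice = dice_coefficient(reference, student)
j = jaccard(reference, student)
print(round(dice, 4))            # 0.5714
print(round(2 * j / (1 + j), 4))  # 0.5714, the same value derived from Jaccard
```

Because of this relation, Dice's coefficient is never smaller than the Jaccard coefficient for the same pair of answers.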

Research Method
This study used the Cosine Similarity, Jaccard Similarity and Dice's Coefficient methods. Similarity scores were calculated with Python 3.8, and correlation and MAE values were calculated with Microsoft Office Excel. The data were questions and answers from an E-business course quiz at Amikom Purwokerto University, limited to questions with definition answers. Four questions were used, each answered by thirty-one students. Table 3 shows an example of the questions and answers.

Pre-processing data
Raw data usually contain parts that are meaningless for text mining, such as stop words. Before the data can be processed, they must go through a preprocessing stage. Data preprocessing is the process of preparing raw data before the next process is carried out [20]. The preprocessing steps in this study are:
1. Case folding: converts the entire text to lowercase [21]. This makes searching easier, because text documents are not always consistent in their use of letter case.
2. Tokenization: divides text from a sentence or paragraph into specific parts [22]. For example, an Indonesian answer meaning "direct trade process" generates five tokens, which translate word for word as "process", "sell", "buy", "in", "direct". Tokens are separated by spaces and punctuation.
3. Stop word removal (filtering): omits words that appear frequently but carry no meaning [23]. Prepositions and conjunctions are also treated as stop words.
4. Stemming: converts a word form into its base word [24].

ISSN 2579-7069
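The four preprocessing steps can be sketched in Python as follows. This is only an illustration on an English sentence: the stop word set `STOP_WORDS` and the lookup table `STEMS` are tiny hypothetical stand-ins for a real stop word list and a real stemmer, and the function name `preprocess` is an assumption.

```python
import re

# Hypothetical stand-ins for illustration only; not a real stop word list or stemmer.
STOP_WORDS = {"the", "using", "by", "with", "that"}
STEMS = {"activities": "activity", "conducted": "conduct", "electronically": "electronic"}


def preprocess(text):
    """Apply case folding, tokenization, stop word removal and stemming, in that order."""
    text = text.lower()                                   # 1. case folding
    tokens = re.findall(r"[a-z0-9]+", text)               # 2. tokenization; spaces and punctuation separate tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. stop word removal
    return [STEMS.get(t, t) for t in tokens]              # 4. stemming by table lookup (toy)


print(preprocess("Business activities conducted electronically using the internet."))
# ['business', 'activity', 'conduct', 'electronic', 'internet']
```

In practice the lookup table would be replaced by a proper stemming algorithm for the language of the answers.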

Cosine Similarity, Jaccard Similarity and Dice Coefficient
The methods used to measure sentence similarity are word overlap methods: Cosine Similarity, Jaccard Similarity and Dice's Coefficient. These methods count only the words that the two sentences have in common. The formulas of the three methods are shown in Table 4.
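For reference, the three measures are commonly written in the following set-based form. The symbols A and B, denoting the sets of unique tokens in the teacher's answer and the student's answer, are an assumption made here, and the notation may differ from that of Table 4.

```latex
\begin{align}
\mathrm{Cosine}(A, B)  &= \frac{|A \cap B|}{\sqrt{|A|\,|B|}} \\
\mathrm{Jaccard}(A, B) &= \frac{|A \cap B|}{|A \cup B|} \\
\mathrm{Dice}(A, B)    &= \frac{2\,|A \cap B|}{|A| + |B|}
\end{align}
```

All three share the numerator |A ∩ B|, the number of overlapping words, and differ only in how they normalize it.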

Correlation and MAE
To measure the correlation between the teacher's grades and the system's scores, the study used the Pearson correlation. Based on the correlation value, the success of an automatic scoring system falls into one of three categories: excellent, good and bad. The category is excellent if r > 0.75, good if r is in the range 0.40-0.75, and bad if r < 0.40 [26].
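The two evaluation measures and the three categories can be sketched in Python as follows. The study itself computed these values in a spreadsheet; the function names and the sample grades below are hypothetical, and the sketch assumes the system scores are already on the same scale as the teacher grades.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(x, y):
    """Average absolute difference between paired values."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def category(r):
    """Map a correlation value to the three categories described in the text."""
    if r > 0.75:
        return "excellent"
    if r >= 0.40:
        return "good"
    return "bad"


teacher = [4, 3, 2, 4, 1]            # hypothetical teacher grades
system = [3.5, 3.0, 1.0, 4.0, 2.0]   # hypothetical system scores on the same scale

r = pearson(teacher, system)
print(round(r, 4), category(r))
print(mean_absolute_error(teacher, system))  # 0.5
```

A higher r means the system ranks answers more like the teacher does, while a lower MAE means the system's scores are numerically closer to the teacher's grades.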

Discussion
The basic techniques in text mining are tokenization, case folding, stop word removal (filtering) and stemming. In this study, all punctuation marks and symbols were removed. The teacher's and students' answers were split into tokens, only one instance of each unique token was kept, and the tokens were then turned into a vector. After that, all text was converted to lowercase, and stop words were removed (filtering) with reference to the research by Tala [27]. The next stage was stemming with the Sastrawi library (https://github.com/sastrawi/sastrawi), which is based on the Nazief-Adriani algorithm. Table 5 shows an example of the preprocessing steps.

Table 5. Examples of preprocessing techniques
Step | Result
Answer | Business activities conducted electronically using the internet
Case folding | business activities conducted electronically using the internet
Tokenization | "Activity", "business", "that", "done", "by", "electronically", "with", "using", "internet"
Stop word removal | Business activities are carried out electronic internet
Stemming | Enterprising internet business salable electronics

After the preprocessing phase, the next stage measures the similarity of the answers using the Cosine Similarity, Jaccard Similarity and Dice's Coefficient methods. The evaluation metric used in this study is the Pearson correlation, which measures how closely the values produced by the system agree with the values given by the teachers. The answers were graded manually by two teachers so that the assessment would be more objective. The teachers' grades for the students' answers range from 0 to 4. The correlation values produced by each method are then compared to determine which method performs best in assessing the similarity of the answers. In addition to the Pearson correlation, the study uses the Mean Absolute Error (MAE) to measure the error between the teachers' grades and the system's scores.
The following is an example of calculating the similarity of the students' answers to the teacher's answer with the three methods. The test covered four questions, each with thirty-one student answers; the results for the Cosine Coefficient are shown in Table 6. Table 6 shows that Dice's Coefficient method has the same correlation values as the Cosine Similarity method: the highest correlation, 0.76010, is found for answer 2 and the lowest, 0.70448, for answer 3. The highest MAE is found for answer 1 and the lowest for answer 4. The average correlation for this method is 0.73925 and the average MAE is 0.59750. The correlation and MAE results of the three methods are compared in the bar graph in Figure 2. Figure 2 shows that the highest average correlation is obtained by Dice's Coefficient method, 0.73925, and the lowest by the Jaccard Similarity method, 0.69705. The highest MAE is obtained by the Jaccard Similarity method, 0.74735, and the lowest by the Cosine Similarity method, 0.57737.

Conclusion
After testing the Cosine Similarity, Jaccard Similarity and Dice's Coefficient methods, it can be concluded that the highest correlation is obtained by Dice's Coefficient method, although its MAE is greater than that of the Cosine Similarity method. All three methods have a correlation in the range r = 0.40-0.75, so all three are considered to have a good success rate [26].