Cognition-Based Document Matching Within the Chatbot Modeling Framework

This study examines cognitive methods for document matching in a chatbot modeling framework using Euclidean Distance, Cosine Similarity, and BERT. Five primary indicators are used in testing: document matching accuracy, document matching execution time, document search efficiency, consistency of document matching results, and the quality of the document representation in the matrix. Document matching accuracy is evaluated by precision; execution time is measured from the beginning to the end of the document matching process; search efficiency is evaluated from execution time and matching accuracy together; consistency is assessed by comparing each method's results on the same or similar queries; and the quality of document representation is assessed by the method's ability to represent documents in a matrix or vector. The test findings offer a comprehensive understanding of how well the three approaches operate and show their capacity to address the requirements of chatbot users. These results may contribute to the advancement of language technology applications, making it possible for chatbots to deliver pertinent information more rapidly and precisely. The dataset contains 1,755 labeled question samples, split into two sets: 60% for training (1,053 samples) and 40% for testing (702 samples) to evaluate the model's performance. The test results show the accuracy of the three methods based on the five evaluation indicators: Euclidean Distance 0.45, Cosine Similarity 0.59, and BERT 0.91. By clarifying the benefits and drawbacks of each approach, this research contributes to the growth of chatbot systems that better serve user demands and opens the door for the creation of more complex human-machine interactions.


Introduction
Chatbots have become an integral component in various aspects of human interaction with computer systems [1]. The rapid growth in the use of chatbots underscores the importance of efficiency and accuracy in document matching to provide users with relevant and timely answers [2]. Therefore, the concept of cognition-based document matching has emerged as a promising approach within the chatbot modeling framework [3]. Applying cognitive principles, such as natural language understanding and semantic analysis [4], to the document matching process is expected to improve the quality of interaction between users and chatbots. Cognition-based document matching improves chatbot performance in several ways:
Understanding context. Document-based cognition allows chatbots to better understand user context and goals. By understanding the content and topics in documents, chatbots can provide more relevant and appropriate responses. Example: the document "Aromatherapy for Relaxation" provides information about the benefits of aromatherapy in SPA. The chatbot uses this information to understand users' questions regarding aromatherapy and provide relevant answers.
Increased relevance. By using information from cognition-based documents, chatbots can tailor their responses to user needs and preferences. This helps increase the relevance and usefulness of the responses provided by the chatbot. Example: the document "Holistic Approaches to Wellness Retreats" describes a holistic approach to spas. The chatbot adjusts its recommendations based on this information to meet the user's preferences.
Entity and concept recognition. Cognition-based documents contain information about entities and concepts relevant to a particular domain. By analyzing these documents, chatbots can recognize entities and concepts that are important for interacting with users. Example: the document "Innovative SPA [...]
The role of chatbots supported by artificial intelligence is very helpful for customers in getting information in real time [7]. The development of artificial intelligence (AI) has played a central role in advancing chatbot technology [8]. The integration of AI into chatbot development has brought human-machine interactions increasingly close to interactions between humans [9], [10]. The use of natural language processing (NLP) techniques in chatbots has allowed the system to understand and respond to user requests and questions in a more natural and contextual way [11]. One example is a chatbot that uses the Long Short-Term Memory (LSTM) algorithm to respond to inquiries about training registration details at BLK Surabaya [12], [13], [14]; another is a financial services chatbot based on the BERT method [15]. AI technology has also enabled the development of more adaptive and personalized chatbots [16]. Overall, AI developments have opened up new opportunities in the development of chatbot technologies, enabling them to become catalysts for more productive, intuitive, and immersive human-machine interactions [17]. By continuing to leverage innovation in the field of artificial intelligence, chatbots have the potential to keep growing and to provide increasingly rich and meaningful experiences for their users [18].
An important part of chatbots is document similarity matching [19], which concerns the provision of relevant and accurate information to users. By leveraging artificial intelligence techniques such as NLP and semantic analysis, chatbots can effectively match user requests or questions with the most relevant documents or content from existing data sources [20]. By training a model that understands patterns in relevant text data, chatbots can identify the information that best fits a user's request [21]. The cognitive approach to document matching in chatbots represents an interesting evolution in the development of human-machine interaction [22], [23]. By incorporating cognitive elements such as context understanding [25], in-depth semantic analysis, and the ability to recognize user intentions and emotions, chatbots become better able to understand the true purpose of a user's question or request [23], [24], [26]. Through the application of more sophisticated natural language processing techniques, chatbots can distinguish content that is similar in terms of concepts and meanings, not just words or phrases [27].

Document Matching Method
This method aims to identify the documents most relevant to a given query [28]. The primary objective is to provide users with the information that best meets their needs so they can get the answers or solutions they are looking for [29]. It involves gathering, examining, and comparing text from several documents in order to identify the ones that most closely correspond to the user's query or request.

Chatbot Models
The primary objective of implementing a chatbot paradigm is to efficiently and successfully enable human-system interaction [29]. The goal of the chatbot model is to enhance the user experience while engaging with the system by responding to queries or instructions from users in a contextual and appropriate manner [30]. Chatbots, which utilize artificial intelligence and natural language processing, are made to respond to queries, offer information, and carry out other duties in a manner that mimics human communication, all with the goal of improving the user experience [31].

Evaluation Method
Evaluation techniques are employed to quantify and contrast the effectiveness of various document-matching strategies [32]. The goal is to gain an in-depth understanding of the strengths and weaknesses of each method, thereby enabling

Application Development
The goal of application development is to provide a sophisticated and dependable language technology system that facilitates communication between computers and people [35]. The program is designed to satisfy user demands by offering chatbot services that are dependable, responsive, and capable of giving users the information or help they want in a timely and correct manner. To improve user experience and satisfy a range of objectives, the resulting applications must be able to incorporate several language technologies, such as speech recognition, natural language generation, and natural language processing.

Method
In this study, the stages were carried out as shown in Figure 1. The Euclidean Distance, Cosine Similarity, and BERT methods were chosen because each has advantages in understanding and matching documents in a chatbot context. Euclidean Distance is simple and fast, Cosine Similarity is more adaptive to variations in document length, and BERT, with its deep understanding of context, is suited to dealing with semantic complexity. By applying these three methods, the research aims to provide a comprehensive and adaptive document matching solution that gives chatbot users more relevant and responsive answers.

Figure 1. Research Stages (Review, Data Collection, Data Sampling, Method Design, Method Testing, Results)
This literature review was carried out by collecting and reviewing research articles related to document matching systems to explore the latest methods that have been applied [36]. Chatbots use text mining to understand and respond appropriately to user texts [37]. The purpose of this exploration is to gain knowledge about existing document matching methods [38], understand the problems in document matching systems, and identify the gaps that the research will address. In addition to reviewing research articles, a theoretical study of the literature related to document matching systems was carried out at this stage to strengthen the theoretical basis of the research.
The data collection stage identifies which techniques are most suitable for a given type of research, so that the technique best suited to the study can be selected. The dataset contains 1,755 labeled question samples, split into two sets: 60% for training (1,053 samples) and 40% for testing (702 samples) to evaluate the model's performance on SPA Wellness documents. SPA Wellness is part of local wisdom [39].
At the method design stage, the solution method to be used in the research is designed. The design aims to improve the performance of existing document matching methods, considering some of the weaknesses of previous methods. In this study, the solution method was designed using the Euclidean Distance (ED), Cosine Similarity (CS), and Bidirectional Encoder Representations from Transformers (BERT) approaches.
The final stage tests the methods and compares the test results. In this study, tests were carried out to measure the degree of similarity and matching of documents and to measure execution time.
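The 60/40 split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_dataset`, the fixed seed, and the placeholder records are assumptions, and the source does not say whether the split was shuffled or stratified.

```python
import random

def split_dataset(samples, train_percent=60, seed=42):
    """Shuffle a copy of `samples` and split it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

# Placeholder records standing in for the 1,755 labeled questions (hypothetical).
questions = [(f"question {i}", f"label {i % 10}") for i in range(1755)]
train, test = split_dataset(questions)
print(len(train), len(test))  # 1053 702
```

With 1,755 samples, the cut point is 1755 × 60 // 100 = 1053, which leaves 702 samples for testing and matches the counts reported in the text.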

Method Design
The research method is divided into three parts, as shown in Figure 2: input, process, and output. Input: the collected data include initial information such as text documents, conversation transcripts, user data, and conversation context. This data serves as raw material for analysis and modeling to develop a cognition-based document matching algorithm in a chatbot framework. Process: the collected data pass through a filtering stage for cleaning, followed by text processing such as tokenization and stemming. The data are then converted into vectors with TF-IDF. These vectors are compared using Euclidean Distance and Cosine Similarity to measure similarity, and BERT is used for more in-depth analysis. Output: a representation of the document matching results produced by the cognition-based Euclidean Distance, Cosine Similarity, and BERT methods.
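The text-processing part of the process stage can be sketched as below. This is an illustration under stated assumptions: the stop-word list and the suffix rules are hypothetical, and `naive_stem` is a toy stand-in for whatever stemmer the study actually used.

```python
import re

# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(token):
    """Strip one common English suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Case folding -> tokenizing -> stop-word filtering -> stemming."""
    folded = text.lower()                               # case folding
    tokens = re.findall(r"[a-z]+", folded)              # tokenizing; drops digits and punctuation
    kept = [t for t in tokens if t not in STOP_WORDS]   # filtering
    return [naive_stem(t) for t in kept]                # stemming

print(preprocess("Exploring the benefits of Aromatherapy for relaxing"))
# ['explor', 'benefit', 'aromatherapy', 'relax']
```

The resulting token lists are what a TF-IDF step would then turn into vectors.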

Figure 2. Method Design
Euclidean Distance matching begins by representing documents and user queries in a numerical vector space, where each dimension represents a specific feature of the text. The Euclidean distance between the vector representations of the document and the query is then calculated to determine how similar or different the document is to the query. The smaller the Euclidean distance between the document and query vectors, the more similar the two texts are.
Cosine Similarity starts in the same way as Euclidean Distance: documents and queries are represented in vector space. The difference lies in the similarity calculation. Cosine similarity measures the cosine of the angle between two vectors, reflecting the similarity in direction between them. The greater the cosine similarity value, the more similar the document is to the query.
The use of BERT is more complex because it involves a pre-trained language model that has learned language representations from very large text data. First, document and query texts are fed into BERT to obtain vector representations that capture the meaning and context of the text in depth. Next, the similarity between the document and query vectors is evaluated with Cosine Similarity or another measure.
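The relationship between the two vector-space measures can be made explicit. The identity below is a standard result, not taken from the source; the symbols $a$, $b$, and $\theta$ are introduced here for illustration, and the final equality assumes both vectors have been normalized to unit length.

```latex
\lVert a - b \rVert^{2}
  = \lVert a \rVert^{2} + \lVert b \rVert^{2} - 2\, a \cdot b
  = 2 - 2\cos\theta
  \qquad \text{when } \lVert a \rVert = \lVert b \rVert = 1 .
```

For unit-length vectors, ranking documents by increasing Euclidean distance therefore gives the same order as ranking by decreasing cosine similarity. Without normalization the two rankings can differ, because Euclidean distance also reflects vector magnitude (for example, document length).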

Sample Documents
In this study, we conducted a comprehensive data collection from the point of view of data management. Data collection mostly consists of data acquisition, data labeling, and refinement of existing data or models [40]. The goal of data acquisition is to find datasets that can be used to train machine learning models. Text data was collected from [...]. The example documents in Table 1 were selected because the researchers first identified the main topics they wanted to investigate in the field of SPA wellness, such as aromatherapy, spa technology, nutrition and healthy food, and art and creativity in spas. Next, the researchers searched for documents relevant to these topics using trusted sources, such as scientific journals, books, research reports, and official spa and health websites, and conducted interviews with SPA business people and customers. The documents found were then selected and assessed based on criteria such as recency of information, diversity of topics, and quality of content. The most relevant and high-quality documents were selected for inclusion in the research. To ensure topic diversity, the researchers also considered the variety of topics in the SPA wellness field, such as water therapy, nutrition, art, and technology, so that the selected documents represent various aspects and dimensions of SPA wellness.

Calculating the Weight with the TF-IDF Method
Text preprocessing is the initial step in applying the TF-IDF technique to determine weights [41]. It consists of case folding, tokenizing, filtering, and stemming, which select the text data and make it more organized. Case folding is almost always included in text preprocessing because the data are not always structured and consistent in their use of capital letters; this stage also removes numbers and empty characters from the documents to be vectorized. Tokenizing breaks sentences into words, which are often called tokens. Filtering extracts the important words from the tokens by removing stop words, that is, common words that appear frequently and carry no meaning. Stemming reduces the number of distinct index entries for the same word by returning a word with a suffix or prefix to its base form.
After text processing, the next step is to calculate the Term Frequency (TF), the number of occurrences of each word (term). From the illustration in Table 2, the TF is obtained as in Figure 3. After the TF values are generated, the next step is to calculate the Inverse Document Frequency (IDF) with the formula $\mathrm{IDF} = \log_{10}(n / df)$, where $n$ is the number of documents ($d_1, d_2, \ldots, d_i$) and the document frequency ($df$) is the number of documents in which the token appears; here the documents are $d_1$ and $d_2$, so $n = 2$. For example, the "relax" token has counts d1 = 2, d2 = 0, so df = 1 + 0 = 1 and IDF = log(2/1) = 0.3010; the "explor" token has counts d1 = 1, d2 = 1, so df = 1 + 1 = 2 and IDF = log(2/2) = 0. Note that if a token appears more than once in a document, it is counted only once toward df, as with the "relax" token in d1. The next step is to find the weight of each token using the formula $W = TF \times IDF$. For example, the weights of the "relax" token are $W_{relax}(d_1) = 2 \times 0.3010 = 0.6020$ and $W_{relax}(d_2) = 0 \times 0.3010 = 0$, and so on until all tokens have a weight.
The results of the document stemming process become the dataset on which TF-IDF is calculated [42], [43]. TF-IDF is used to analyze how important a word is in a document. For TF, the more frequently a word appears in a document, the higher its weight. IDF works in the opposite direction: the more documents a word appears in, the lower its weight. The next stage is weighting with the following formula:
$W_{dt} = TF_{dt} \times IDF_{t}$ (1)
Term weighting is determined by two factors, TF and IDF. The Term Frequency (tf) factor counts how many times a term appears in a document and contributes that count to the term's weight [44]. Thus, the greater the term frequency in a document, the greater the weight of that term for the document. IDF, in contrast, reduces the dominance of terms that appear in many documents: such terms are considered common, so their value is low.
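The weighting described in this section can be sketched as follows. This is a minimal illustration that assumes the convention used in the worked example (base-10 logarithm, no smoothing); the function name `tf_idf_weights` and the two token lists are hypothetical and chosen only to reproduce the "relax" and "explor" counts from the text.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return one {term: weight} dict per document, with W = TF * log10(n / df)."""
    n = len(docs)
    term_counts = [Counter(tokens) for tokens in docs]   # TF per document
    df = Counter()                                       # number of documents containing each term
    for counts in term_counts:
        df.update(counts.keys())                         # each document is counted once per term
    idf = {term: math.log10(n / df[term]) for term in df}
    vocab = sorted(df)
    return [{term: counts[term] * idf[term] for term in vocab} for counts in term_counts]

# Hypothetical token lists: "relax" appears twice in d1 and never in d2,
# "explor" appears once in each document.
d1 = ["relax", "relax", "explor"]
d2 = ["explor"]
w1, w2 = tf_idf_weights([d1, d2])
print(round(w1["relax"], 4), w2["relax"], w1["explor"])  # 0.6021 0.0 0.0
```

At full precision the "relax" weight in d1 is about 0.6021; the value 0.6020 in the text comes from rounding the IDF to 0.3010 before multiplying. Library implementations often use a smoothed, natural-log IDF, so their numbers would differ from this sketch.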

Euclidean Distance
Euclidean distance is a metric used to measure the distance or difference between two points in Euclidean space [45].
In two dimensions (for example, on a plane), the Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is calculated using the formula:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
For higher dimensions, the Euclidean distance between two points is calculated using the formula:
$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
where $(p_1, p_2, \ldots, p_n)$ represents the coordinates of the first point, $(q_1, q_2, \ldots, q_n)$ represents the coordinates of the second point, and $(q_i - p_i)$ represents the difference in the $i$-th coordinate of the two points.
In clustering, Euclidean distance determines cluster membership: points that lie at the same distance from a centroid can fall into the same cluster whether they lie in the same or in opposite directions from it.
Euclidean distance achieves very good results when data sets are organized into compact clusters. Although it performs well in clustering, it has a drawback: when the attributes of two entities are not on a uniform scale, their distance may differ from that of other pairs of entities with similar attribute values.
Document matching begins with text processing, where the document is converted into a numerical representation. Next, the TF-IDF method is used to evaluate the importance of words in the document. Finally, the Euclidean distance is calculated between the document vectors to determine their similarity. This enables efficient and accurate document matching. For example, the value in Figure 2 is calculated using Euclidean distance with the help of the Python program as in Figure 4. The results are as in Figure 5.

Figure 4. Example of a Euclidean Distance Calculation Program
From the Euclidean Distance calculation in Figure 4, it can be seen that document q is closer to document d2 than to document d1. The Euclidean distance between documents q and d1 is 4.242641, while the distance between documents q and d2 is 4.123106. Because a smaller distance means greater similarity, document q is more similar to document d2 than to document d1 under the Euclidean Distance metric. However, both distances are still fairly large, indicating that document q differs considerably from both documents. This evaluation provides insight into the extent to which document q matches the documents in the dataset based on the difference in Euclidean distance. Although document q is closer to document d2, further analysis is still needed to understand the deeper context and relevance of these documents. One limitation arises when the data does not have a uniform distribution or when the features are not on the same scale. For example, if a document is high in dimensionality or has attributes that differ greatly in scale, Euclidean distance may no longer be an ideal metric because the attributes with the largest scale may dominate the contribution of the others. Potential strategies to mitigate this limitation include normalizing the data before calculating the Euclidean distance, or using other methods that are more suitable for complex data, such as non-linear mapping methods or the use of kernels in classifier algorithms.
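The distance calculation and the normalization strategy mentioned above can be sketched as follows. This is not the program from Figure 4: the vectors are hypothetical term-weight vectors, and L2 normalization is used here as one possible choice of normalization.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def l2_normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

# Hypothetical term-weight vectors (not the values behind the figures).
q = [1.0, 0.0, 2.0]
d1 = [4.0, 0.0, 8.0]   # same direction as q, larger magnitude
d2 = [2.0, 1.0, 1.0]   # different direction, similar magnitude

# Raw distances: the larger magnitude of d1 dominates, so d2 looks closer.
print(euclidean_distance(q, d1), euclidean_distance(q, d2))

# After L2 normalization, d1 (same direction as q) becomes the closer document.
nq, nd1, nd2 = l2_normalize(q), l2_normalize(d1), l2_normalize(d2)
print(euclidean_distance(nq, nd1), euclidean_distance(nq, nd2))
```

In this toy case the ranking flips after normalization, which illustrates how differences in scale can dominate an unnormalized Euclidean distance.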

Cosine Similarity
After the data is weighted, the documents are compared using the cosine similarity method [46], [47]. Cosine similarity measures how similar two vectors are in a multi-dimensional space; it is obtained from the dot product of the two vectors being compared, divided by the product of their lengths. Because the cosine of 0° is 1, two vectors are considered most similar when their cosine similarity value is 1. Documents with known weights are compared using the cosine formula, which is normalized by vector length:
$\cos(A, B) = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$
For example, the value in Figure 2 is calculated using Cosine Similarity, with the results shown in Figure 6.

Figure 6. Cosine Similarity Calculation
Calculated with cosine similarity, the document matching results are as in Figure 6. The cosine similarity value between documents d1 and d2 is 0.156. This shows that the two documents have a fairly low level of similarity, which could mean that the content or topics discussed in the two documents are relatively different. Still, this value is higher than the cosine similarity between documents d1 and q (0.075) or between documents d2 and q (0.154). In general, the cosine similarity values recorded in the table indicate that there are quite significant differences between the vector representations of the documents in question. This can be caused by variations in topic or focus of discussion in each document, so that the level of similarity between these documents varies.
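The cosine comparison and the ranking of documents against a query can be sketched as follows. This is a minimal illustration: the vectors and the helper name `rank_documents` are hypothetical and do not reproduce the values in Figure 6.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query, docs):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical term-weight vectors.
docs = {"d1": [4.0, 0.0, 8.0], "d2": [2.0, 1.0, 1.0]}
q = [1.0, 0.0, 2.0]
for name, score in rank_documents(q, docs):
    print(name, round(score, 3))
```

Because cosine similarity ignores vector length, d1 ranks first here even though it is much longer than q; only the direction of the vectors matters.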

Bidirectional Encoder Representations from Transformers (BERT)
BERT is a system by which Google's algorithm uses pattern recognition to better understand how human beings communicate so that it can return more relevant results for users [48]. The development of Indonesian NLP technology is still minimal compared with English NLP [49]. This poses a challenge for developing an Indonesian NLP model [50].
On the other hand, the development of NLP in recent years has been very rapid, especially since the attention-based approach was introduced [51]. This approach underlies the new deep learning method in NLP, namely Transformers, which has displaced the traditional LSTM approach [52]. There are now many variants of the Transformers model architecture. One of the best known is BERT [29]. Figure 7 shows the Python source code for the BERT algorithm.
The BERT vector representation of a text $t$ can be obtained by taking the output of a particular layer in the BERT transformer architecture. For example, we can use the output of the final layer before the SoftMax layer for the [CLS] token in the text, which is generally used for classification tasks. When using the BERT model, the term [CLS] refers to a special token that is added at the beginning of each text input. Let us assume that the BERT vector representation of the text $t$ is given by $\mathrm{BERT}(t) = v_t$, where $v_t$ is a vector of length $d$ and $d$ is the dimension of the BERT vector representation. BERT is a language model that uses a transformer architecture to better understand the context of words in sentences.
Compared with previous models, BERT is able to take into account word context from both directions in a sentence, resulting in richer and more accurate word representations. This makes BERT a breakthrough in NLP and improves the ability of chatbot systems to understand and respond to user questions.
BERT makes a significant contribution to document matching within a chatbot framework by enabling deeper context understanding, complex language handling, better text representation, and adaptability to new contexts.With these capabilities, BERT enhances chatbots' ability to provide more relevant and meaningful answers to users, opening up new opportunities in the development of more sophisticated and effective chatbots in supporting more dynamic and contextual human-machine interactions.
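The [CLS]-based representation and the subsequent matching step can be sketched as below. This is not the code from Figure 7. It assumes the Hugging Face `transformers` library with PyTorch, and the checkpoint name in the usage comment (`bert-base-uncased`) is an assumption, since the source does not state which BERT model was used. The names `cls_embedding`, `cosine`, and `best_match`, and the toy vectors, are hypothetical.

```python
import math

def cls_embedding(text, tokenizer, model):
    """Return the final-layer [CLS] vector v_t for `text` as a list of floats."""
    import torch  # imported lazily so the helpers below work without PyTorch
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Position 0 of the last hidden state corresponds to the [CLS] token.
    return outputs.last_hidden_state[0, 0, :].tolist()

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_match(query_vec, doc_vecs):
    """Return the name of the document whose vector is most similar to the query."""
    return max(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]))

# Usage sketch (requires `transformers` and `torch`; downloads the checkpoint):
#   from transformers import AutoModel, AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
#   model = AutoModel.from_pretrained("bert-base-uncased").eval()
#   v_q = cls_embedding("benefits of aromatherapy", tokenizer, model)

# Toy 3-dimensional vectors standing in for real d-dimensional BERT outputs.
toy_docs = {"d1": [0.9, 0.1, 0.0], "d2": [0.0, 0.2, 0.9]}
print(best_match([1.0, 0.0, 0.1], toy_docs))  # d1
```

Mean pooling over all token vectors is a common alternative to the [CLS] vector for sentence-level similarity; the sketch follows the [CLS] choice described in the text.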

Result and Discussion
Using an SPA wellness document, the test measures word similarity between 1,500 and 2,000 words, search accuracy (on a scale of 0 to 1, 1 being the best), execution duration in seconds, and Euclidean distance (on a similarity scale of 0 to 1, 1 being the most similar). The five testing indicators are document matching accuracy, document matching execution time, efficiency in document search, consistency in document matching results, and quality of document representation in the matrix. Each indicator is tested with different variables, as in Table 3.
BERT: The highest search accuracy value was obtained from the BERT method, with a value of 0.86. Even though it shows high accuracy, BERT requires a longer execution time, namely 30 seconds. This is due to the greater complexity of the BERT model and a more complicated computational process.
Cosine Similarity: The cosine similarity method provides fast results with an execution time of only 5 seconds. However, although the execution time is shorter, the search accuracy is slightly lower compared to BERT, with a value of 0.91. Cosine similarity is a simple method that measures the angular similarity between text representation vectors, but its ability to understand the context and meaning of the text may be limited.
Euclidean Distance: Euclidean distance shows potential as a fast method, with an execution time of 7 seconds. However, the accuracy and relevance values are not given in this table, making it difficult to directly evaluate how good this method is at document matching.
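Two of the indicators above, precision and execution time, can be measured along the following lines. This is a sketch under stated assumptions: `toy_matcher`, the document IDs, and the relevance labels are hypothetical, and the source does not describe its exact measurement code.

```python
import time

def precision(retrieved, relevant):
    """Fraction of retrieved document IDs that are relevant (0.0 if nothing was retrieved)."""
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    return hits / len(retrieved)

def timed(match_fn, *args):
    """Run `match_fn(*args)` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = match_fn(*args)
    return result, time.perf_counter() - start

def toy_matcher(query):
    """Hypothetical matcher that always returns the same three document IDs."""
    return ["d2", "d5", "d7"]

retrieved, seconds = timed(toy_matcher, "hydrotherapy benefits")
print(precision(retrieved, relevant={"d2", "d7"}))  # 2 of the 3 retrieved IDs are relevant
print(f"{seconds:.6f} s")
```

Running the same timing wrapper around each of the three matching methods on the same queries would give directly comparable execution times.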
Table 4 represents a similarity matrix, comparing different data points (labeled as "d1" through "d10") and a query ("q").Each value indicates the similarity score between pairs of documents or between a document and the query.

Conclusion
In the context of document matching utilizing the BERT, cosine similarity, and Euclidean distance approaches, the BERT method exhibits good search accuracy, with a value of 0.91, even with a longer execution time, specifically 29
In Equation (1): d = the d-th document; t = the t-th word of the keyword; W = the weight of the t-th word in the d-th document; TF = the number of occurrences of the word in the document; IDF = inverse document frequency, based on how many documents contain the searched word.
Vol. 5, No. 2, May 2024, pp. 613-627, ISSN 2723-6471

In the cosine similarity formula: A = vector A, which will be compared for similarity; B = vector B, which will be compared for similarity; Ai = weight of term (word) i in vector A; Bi = weight of term (word) i in vector B; i = index of the term (word) in the documents/sentences; n = number of terms in the vectors; D = documents.

Figure 8. Graph comparison. The graph compares three different metrics: Euclidean Distance, Cosine Similarity, and BERT. Each metric is represented by a distinct line with a different color and pattern. The x-axis ranges from "d1" to "d10," while the y-axis values range from 0 to 1.2. Metric interpretation: Euclidean Distance (orange line) measures the straight-line distance between two points in a multi-dimensional space; lower values indicate greater similarity. Cosine Similarity (gray line) measures the cosine of the angle between two vectors; higher values indicate greater similarity. BERT (blue line) refers to Bidirectional Encoder Representations from Transformers, a powerful language model for natural language understanding; its line represents a similarity score. Observations: each metric's line fluctuates, showing variations across different "d" values. The graph provides insights into how these metrics perform relative to each other.

[...] to the topic of spa wellness. Table 1 shows examples of documents used in the SPA Wellness document.

Table 1. Sample SPA Wellness Documents
This document focuses on the therapeutic benefits of hydrotherapy in spa settings. It covers various water-based treatments, such as hydro-massage, hot and cold plunge pools, and underwater jet massages, to promote relaxation and healing.
Mindful Eating for Wellness (d5): Exploring the connection between nutrition and wellness, this document discusses the implementation of mindful eating practices within spa environments. It emphasizes the role of balanced nutrition in supporting overall well-being.
Art and Creativity Workshops in Spa Retreats (d6): This document explores the integration of art and creativity workshops into spa retreats to enhance the overall wellness experience. It discusses the therapeutic benefits of artistic expression in promoting relaxation and stress reduction.
Nature-Inspired Spa Designs for Tranquility (d7): Focusing on architectural and interior design aspects, this document showcases spa designs inspired by nature. It discusses how elements such as natural light, water features, and botanical aesthetics contribute to a tranquil spa atmosphere.

Table 4. Cosine Similarity Measurement Results

Table 5 compares different similarity measures for a set of data points (d1 to d10). Euclidean Distance quantifies the straight-line distance between two points in a multi-dimensional space. Cosine Similarity assesses the similarity between two vectors by measuring the cosine of the angle between them; higher values indicate greater similarity. BERT (Bidirectional Encoder Representations from Transformers) captures semantic context from text, and its similarity values also indicate how similar the data points are. Each measure serves a different purpose: Euclidean distance focuses on geometric distance, cosine similarity on direction, and BERT on semantic context.

Table 5. Comparison Results of Three Methods

In Table 6, BERT, Cosine Similarity, and Euclidean Distance are compared based on execution time, search accuracy, and Euclidean distance scores. BERT takes the longest execution time (29 units) but achieves the highest accuracy (0.91). Meanwhile, Cosine Similarity has the shortest execution time (5 units) but lower accuracy (0.59). The Euclidean Distance scores vary across methods.

Table 6. Measurement Results