COREFERENCE SOLUTION USING THE CLUSTERING METHOD
Published: 2024-03-29
Section: Articles
Article language: Kazakh
Keywords: coreference, clustering, Tomita-parser, reference, anaphora
Abstract
Natural language processing, including the processing of referential relations, is currently one of the most challenging and actively studied topics. One such processing task is the extraction of referential relations within a sentence.
Reference resolution is the task of resolving references to earlier or later elements in the discourse. It is an active field of research with applications in text search, text summarization, dialogue interpretation, information extraction, and similar tasks.
In linguistics, reference is the relation between a textual expression and some non-linguistic object or event in the real or an abstract world. Traditional linguistics distinguishes two main classes of referring expressions: full lexical forms (nominal phrases, etc.) and shortened forms (e.g., pronouns, reference pronouns, classificatory pronouns, personal pronouns). The task of reference resolution is to link a specific textual reference to a particular non-linguistic object with the other references to that object in the same text. Referential relations have long been studied for other languages, but there are still very few studies for the Kazakh language. With this in mind, we set the goal of resolving referential relations in the Kazakh language.
In this article, we consider the resolution of coreference relations in the Kazakh language using the clustering method. The purpose of the system under study is to resolve coreference relations in Kazakh, that is, to cluster personal names that refer to persons (entities of type Person). In other words, the task is to group together all parts of a name that occur in the text (the title, first name, last name, and patronymic of each person mentioned).
To achieve this goal, we used Tomita-parser, a keyword dictionary, a grammar for extracting full names, a grammar for extracting vocabulary names, clustering, a pairwise model, a feature vector, and a pairwise weight vector.
Our algorithm consists of two stages. In the first stage, grammars are written for Tomita-parser to extract named entities. In the second stage, clustering is used to group the named entities by their value (the architecture of the work is shown in Figure 1).
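The second stage can be illustrated with a minimal sketch of pairwise clustering over extracted name mentions. Everything in it is an assumption made for illustration: the mention fields, the fictional names, the feature definition, the weights, and the threshold are hypothetical and are not taken from the article, and the mentions are written by hand rather than produced by Tomita-parser.

```python
from itertools import combinations

# Hypothetical mentions, shaped as a name grammar might emit them.
# Field names and the people named here are illustrative assumptions.
mentions = [
    {"first": "Askar", "last": "Nurlanov", "title": "Minister"},
    {"first": None, "last": "Nurlanov", "title": None},
    {"first": "Askar", "last": None, "title": "Minister"},
    {"first": "Dana", "last": "Akhmetova", "title": "Deputy"},
]

FIELDS = ("first", "last", "title")
# Illustrative weight vector, not values from the article.
WEIGHTS = {"first": 1.0, "last": 1.5, "title": 0.5}
THRESHOLD = 1.0


def pair_features(a, b):
    """Feature vector for a pair of mentions.

    +1 if a field matches, -1 if both sides have it and they differ,
    0 if the field is missing on either side.
    """
    feats = {}
    for field in FIELDS:
        if a[field] is None or b[field] is None:
            feats[field] = 0
        elif a[field] == b[field]:
            feats[field] = 1
        else:
            feats[field] = -1
    return feats


def pair_score(a, b):
    """Weighted sum of the pairwise features."""
    feats = pair_features(a, b)
    return sum(WEIGHTS[field] * value for field, value in feats.items())


def cluster(items, threshold=THRESHOLD):
    """Link every pair scoring at or above the threshold, then return
    the connected components as sorted lists of mention indices."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if pair_score(items[i], items[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


clusters = cluster(mentions)
```

In this sketch the bare surname and the "title plus first name" mention do not match each other directly, but both link to the full mention, so all three end up in one cluster. Linking through connected components is only one simple choice; it can over-merge when a single spurious link joins two people.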
To test the algorithm, we used a collection of news from Tengrinews.kz as the test data set.
Algorithm performance was evaluated using traditional evaluation metrics, with Tomita-parser and the clustering algorithm evaluated separately. Compared with other methods, the Tomita-parser algorithm reached 0.87% and the clustering algorithm 0.81%; the results are tabulated in Table 3 and Table 4.
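The abstract does not name the metrics used. Assuming that "traditional evaluation metrics" means precision, recall, and F1, they could be computed from counts of correct and incorrect decisions as in the sketch below. The function name and the counts are hypothetical and are not results from the article.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall and F1 from raw counts.

    Returns 0.0 for any metric whose denominator is zero.
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only.
precision, recall, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
```

For the extraction stage the counts would be over extracted names; for the clustering stage they could be over pairs of mentions placed in the same cluster, which is again an assumption about how the evaluation was set up.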
License
Copyright (c) 2024 ШҚТУ Хабаршысы
This work is licensed under a Creative Commons Attribution 4.0 International License.