RETRIEVING DATE FIELDS IN MULTIPLE LANGUAGES FOR AUTOMATIC SEARCHING OF CERTIFIED TEACHER DOCUMENTS

Birzhan Sapuanov

Authors

Name	Affiliation
Birzhan Sapuanov	EKTU Serikbayev

Published:

2025-03-28

Issue:

No. 1 (2025): "Вестник ВКТУ им.Д.Серикбаева"

Section:

Information and communication technologies

Article language:

Russian

Keywords:

Date-based indexing, academic certificates, date detection, date extraction, teacher certification procedure, multilingual documents, artificial intelligence.

Abstract

Schools run teacher certification processes to promote, reinforce, and renew teachers' qualifications according to their own rules and procedures. At the Nazarbayev Intellectual Schools (NIS) in Kazakhstan, this procedure is followed without exception. Its most important step is compiling the teacher's portfolio, which contains numerous scanned documents that serve as evidence. Some of the documents a teacher provides may have expired and no longer be valid during the certification period. A system based on text recognition techniques is therefore needed to speed up the verification of teachers' documents. In this article, deep learning and other artificial intelligence methods were applied to improve the computer vision and text recognition steps, which made the verification process more efficient and practical.

Each document carries key information, such as the name of the certificate and the holder's family name and first name. The purpose of this article is to present a system for automatically extracting date fields from multilingual written documents (Kazakh, English, and Russian). The date is one of the most important pieces of information and can be used in many automated applications, such as date-based document indexing and retrieval. The system first identifies the script of a document. For each text line in the identified script, word units are classified into month and non-month classes using word-level feature extraction and classification. Non-month words are then segmented into individual components, and each component is labeled as a digit, text, contraction, or punctuation. The labeled components are then searched for possible date patterns. Regular expressions over numeric and semi-numeric patterns are used to extract the date. Month and non-month words are classified using Dynamic Time Warping (DTW) and profile-feature-based approaches. Digits and punctuation marks are recognized using gradient-based features and a Support Vector Machine (SVM) classifier. Experiments on Kazakh, English, and Russian document datasets show promising results, indicating the effectiveness of the proposed approach.
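
The final pattern-search step can be illustrated with a minimal Python sketch. It is not the article's implementation: it works on already-recognized text rather than on labeled image components, and the month-name table, the day-first ordering, and the names `MONTHS`, `NUMERIC`, `SEMI_NUMERIC`, and `extract_dates` are assumptions made for illustration only.

```python
import re
from datetime import date

# Assumed month-name table (illustrative; inflected Kazakh forms are not covered).
_EN = ["january", "february", "march", "april", "may", "june", "july",
       "august", "september", "october", "november", "december"]
_RU = ["января", "февраля", "марта", "апреля", "мая", "июня", "июля",
       "августа", "сентября", "октября", "ноября", "декабря"]
_KK = ["қаңтар", "ақпан", "наурыз", "сәуір", "мамыр", "маусым", "шілде",
       "тамыз", "қыркүйек", "қазан", "қараша", "желтоқсан"]
MONTHS = {name: i + 1 for names in (_EN, _RU, _KK) for i, name in enumerate(names)}

# Numeric pattern: day, month, year separated by '.', '/' or '-' (day-first is assumed).
NUMERIC = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")
# Semi-numeric pattern: day, month word, year (e.g. "3 мая 2019").
SEMI_NUMERIC = re.compile(r"\b(\d{1,2})\s+([^\W\d_]+)\s+(\d{4})\b")


def _safe_date(year, month, day):
    """Return a date, or None when the numbers do not form a valid calendar date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def extract_dates(text):
    """Return all dates found in text, ordered by their position in the text."""
    found = []
    for m in NUMERIC.finditer(text):
        day, month, year = (int(g) for g in m.groups())
        parsed = _safe_date(year, month, day)
        if parsed is not None:
            found.append((m.start(), parsed))
    for m in SEMI_NUMERIC.finditer(text):
        day, word, year = m.groups()
        month = MONTHS.get(word.lower())  # non-month words are skipped
        if month is not None:
            parsed = _safe_date(int(year), month, int(day))
            if parsed is not None:
                found.append((m.start(), parsed))
    return [d for _, d in sorted(found)]


sample = ("Issued 15.03.2021; берілген күні 7 қаңтар 2020; "
          "выдан 3 мая 2019 г.; valid until 28 February 2025")
print(extract_dates(sample))
```

In this sketch the dictionary lookup stands in for the month versus non-month classification that the article performs with DTW and profile features, and the two regular expressions correspond to the numeric and semi-numeric date patterns. The extracted dates could then be compared with the certification period to flag expired documents.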

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Download Citation

Sapuanov, B. (2025). АВТОМАТИЧЕСКИЙ ПОИСК МУЛЬТИЯЗЫЧНЫХ ДОКУМЕНТОВ АТТЕСТУЮЩИХСЯ УЧИТЕЛЕЙ ПОСРЕДСТВОМ ПОЛЕЙ ДАТЫ: RETRIEVING DATE FIELDS IN MULTIPLE LANGUAGES FOR AUTOMATIC SEARCHING OF CERTIFIED TEACHER DOCUMENTS. Вестник ВКТУ, (1). Retrieved from https://vestnik.ektu.kz/index.php/vestnik/article/view/1038