Article view


Journal issue: 2018.4

Article title: Space character identification in Russian language texts with unknown encoding

Abstract

The paper considers two criteria, and their joint use, for identifying the space character in Russian-language texts with unknown encoding. Both criteria are based on comparing the word length distribution in a text glossary with the Poisson distribution. The first criterion estimates the difference between the expected value and the variance of the sample distribution. The second criterion calculates the ratio of two areas of the distribution, called the word length index. Measurements were made to estimate the criteria statistics for texts of various sizes and to determine their critical values and conditions of applicability. The accuracy of each criterion with the determined critical values was calculated for the space identification problem, and statistics of type I and type II errors were obtained.
Based on the statistics obtained for texts of various sizes, the conditions for using the criteria jointly are determined, taking into account a possible change in the rank of the space character in the frequency ordering of text characters. The accuracy of the joint use of the criteria was calculated, and statistics of type I and type II errors were obtained.
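
The idea behind the first criterion can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of distinct tokens as the "glossary", the absolute difference as the statistic, and the choice of the five most frequent characters as candidates are all assumptions made for illustration, and the critical values determined in the paper are not reproduced here. The sketch relies only on the fact that a Poisson distribution has equal mean and variance.

```python
from collections import Counter
from statistics import fmean, pvariance


def glossary_lengths(text, separator):
    """Lengths of the distinct non-empty tokens obtained by splitting on `separator`."""
    return [len(word) for word in set(text.split(separator)) if word]


def mean_variance_gap(text, separator):
    """Absolute difference between the mean and variance of glossary word lengths.

    For a Poisson distribution the two are equal, so a small gap is
    consistent with `separator` acting as the word delimiter.
    (Hypothetical statistic; the paper's exact formula is not given here.)
    """
    lengths = glossary_lengths(text, separator)
    if len(lengths) < 2:
        return float("inf")  # too few words to estimate a variance
    return abs(fmean(lengths) - pvariance(lengths))


def rank_candidates(text, top=5):
    """Score the `top` most frequent characters as candidate separators.

    Returns (gap, character) pairs sorted by increasing gap.
    """
    candidates = [ch for ch, _ in Counter(text).most_common(top)]
    return sorted((mean_variance_gap(text, ch), ch) for ch in candidates)
```

Calling `rank_candidates(text)` on a decoded or symbol-substituted string lists the most frequent characters in order of how closely the resulting word lengths match the mean-equals-variance property. Deciding whether the best candidate really is the space would still require a critical value such as those estimated in the paper.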

Authors

Yu. Kotov, O. Sanina

Keywords

space character, identification, Poisson distribution, word length index, frequency
