Journal issue: 2020.1

Article title: Space character identification in English-language texts with unknown encoding


The paper considers three criteria for identifying the space character in English-language texts with an unknown character encoding. Two criteria estimate how far the word length distribution deviates from the Poisson distribution, and the third uses the frequency of spaces in a text. The paper reports the statistics of these criteria for English-language texts of various sizes. Critical values of the proposed criteria are determined, and the numbers of type I and type II errors are obtained experimentally. Based on these findings, the conditions for using the criteria jointly are determined. The accuracy of space character identification with the proposed criteria is confirmed on a validation sample.
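The general idea behind the criteria can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not give the exact statistics or critical values, so the deviation measure used here (the sum of absolute differences between the empirical word length frequencies and a Poisson distribution with the same mean) and the names `poisson_deviation` and `space_candidates` are assumptions made for illustration only.

```python
import math
from collections import Counter


def poisson_deviation(lengths):
    """Sum of |empirical frequency - Poisson pmf| over all lengths k >= 0.

    Assumption: this measure stands in for the paper's criteria, whose
    exact form is not stated in the abstract.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    n = len(lengths)
    lam = sum(lengths) / n
    if lam == 0:
        # Every length is 0, which matches Poisson(0) exactly.
        return 0.0
    counts = Counter(lengths)
    deviation = 0.0
    cumulative = 0.0
    for k in range(max(lengths) + 1):
        # The pmf is computed in log space to avoid overflow for large k.
        pmf = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        cumulative += pmf
        deviation += abs(counts.get(k, 0) / n - pmf)
    # Poisson mass beyond the longest observed length has no empirical match.
    deviation += max(0.0, 1.0 - cumulative)
    return deviation


def space_candidates(text, top_n=5):
    """Return (char, relative frequency, Poisson deviation) for the
    top_n most frequent characters, most frequent first."""
    results = []
    for ch, count in Counter(text).most_common(top_n):
        # Treat ch as the separator and measure the resulting "word" lengths.
        lengths = [len(part) for part in text.split(ch)]
        results.append((ch, count / len(text), poisson_deviation(lengths)))
    return results


if __name__ == "__main__":
    plain = ("the quick brown fox jumps over the lazy dog and then "
             "runs back to the old barn near the river")
    # Simulate an unknown encoding by shifting every code point.
    encoded = "".join(chr(ord(c) + 1000) for c in plain)
    for ch, freq, dev in space_candidates(encoded):
        print(f"U+{ord(ch):04X}  frequency={freq:.3f}  deviation={dev:.3f}")
```

In this sketch, a character is a plausible space when it is both very frequent and yields segment lengths close to a Poisson shape. Choosing actual thresholds would require critical values such as those the paper determines experimentally.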


Yu. Kotov, O. Sanina

Keywords

space character, identification, Poisson distribution, words’ length index, frequency

