The Herald of the Siberian State University of Telecommunications and Information Science

Space character identification in Russian-language texts with unknown encoding

Abstract

The paper considers two criteria, and their joint use, for identifying the space character in Russian-language texts with unknown encoding. Both criteria compare the word length distribution in the text's glossary with the Poisson distribution. The first criterion estimates the difference between the expected value and the variance of the sample distribution. The second criterion computes the ratio of two areas of the distribution, called the word length index. Measurements were made to estimate the statistics of the criteria for texts of various sizes and to determine their critical values and conditions of applicability. The accuracy of space identification using the criteria with these critical values was calculated, and statistics of type I and type II errors were obtained. Based on these statistics for texts of various sizes, the conditions for the joint use of the criteria were determined, taking into account possible changes in the position of the space in the frequency ordering of the text's characters. The accuracy of the joint use of the criteria was calculated, and the corresponding statistics of type I and type II errors were obtained.
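
The idea behind the first criterion can be illustrated with a minimal sketch. A Poisson distribution has equal mean and variance, so if a candidate character really is the space, the lengths of the distinct words it delimits should give a comparatively small gap between sample mean and sample variance. The code below is an illustration under assumptions, not the authors' implementation: the function names, the use of the population variance, the choice to test only the most frequent characters, and the `top_n` parameter are all hypothetical, and the critical values determined in the paper are not reproduced here.

```python
from collections import Counter
from statistics import fmean, pvariance


def glossary_lengths(text: str, candidate: str) -> list[int]:
    """Lengths of the distinct non-empty 'words' obtained by splitting
    the text on the candidate space character (the text glossary)."""
    words = {w for w in text.split(candidate) if w}
    return [len(w) for w in words]


def mean_variance_gap(lengths: list[int]) -> float:
    """|mean - variance| of the word lengths; zero for an ideal Poisson
    sample. Returns infinity when there are too few words to judge."""
    if len(lengths) < 2:
        return float("inf")
    return abs(fmean(lengths) - pvariance(lengths))


def rank_space_candidates(text: str, top_n: int = 5) -> list[tuple[str, float]]:
    """Score the top_n most frequent characters (an assumed shortlist)
    and return them ordered from smallest to largest gap."""
    frequent = [ch for ch, _ in Counter(text).most_common(top_n)]
    scored = [(ch, mean_variance_gap(glossary_lengths(text, ch)))
              for ch in frequent]
    return sorted(scored, key=lambda pair: pair[1])
```

Calling `rank_space_candidates(text)` on a text in an unknown single-character encoding returns the shortlisted characters with their gaps. Deciding whether the best-ranked character is accepted as the space would require a critical value such as those the paper determines for different text sizes.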

About the Authors

Yu. A. Kotov
Novosibirsk State Technical University (NSTU)
Russian Federation


O. V. Sanina
Novosibirsk State Technical University (NSTU)
Russian Federation


For citations:


Kotov Yu. A., Sanina O. V. Space character identification in Russian-language texts with unknown encoding. The Herald of the Siberian State University of Telecommunications and Information Science. 2018;(4):48-60. (In Russ.)

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1998-6920 (Print)