The Herald of the Siberian State University of Telecommunications and Information Science

Space character identification in English-language texts with unknown encoding

Abstract

The paper considers three criteria for identifying the space character in English-language texts with an unknown character encoding. Two criteria estimate the deviation of the word-length distribution from the Poisson distribution; the third uses the frequency of spaces in the text. Statistics of the criteria are given for English-language texts of various sizes. Critical values of the proposed criteria are determined, and the numbers of type I and type II errors are obtained experimentally. Based on these findings, the conditions for using the criteria jointly are determined. The accuracy of space character identification with the proposed criteria is confirmed on a validation sample.
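As a rough illustration of the two kinds of criteria named in the abstract, the sketch below computes a space-frequency score and a word-length deviation from a Poisson distribution for a candidate character. The abstract does not give the paper's actual statistics or critical values, so everything here is an assumption made for illustration: the function names, the use of total variation distance as the deviation measure, the unshifted Poisson fit with the sample mean, and the rule of testing only the few most frequent characters.

```python
import math
from collections import Counter


def space_frequency(text, candidate):
    """Share of positions in `text` occupied by `candidate`."""
    return text.count(candidate) / len(text) if text else 0.0


def poisson_deviation(text, candidate):
    """Total variation distance between the empirical length distribution of
    the chunks obtained by splitting on `candidate` and a Poisson
    distribution with the same mean (assumed statistic, not the paper's)."""
    lengths = [len(chunk) for chunk in text.split(candidate) if chunk]
    if not lengths:
        return 1.0  # no chunks at all: treat as maximal deviation
    n = len(lengths)
    mean = sum(lengths) / n  # >= 1, because empty chunks were dropped
    counts = Counter(lengths)
    deviation = 0.0
    poisson_mass = 0.0
    for k in range(max(lengths) + 1):
        # Poisson pmf computed in log space to avoid overflow for large k
        p = math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
        poisson_mass += p
        deviation += abs(counts.get(k, 0) / n - p)
    # Poisson probability beyond the longest observed chunk
    deviation += max(0.0, 1.0 - poisson_mass)
    return deviation / 2


def guess_space(text, top=3):
    """Among the `top` most frequent characters, return the one whose
    chunk lengths deviate least from a Poisson distribution."""
    candidates = [ch for ch, _ in Counter(text).most_common(top)]
    return min(candidates, key=lambda ch: poisson_deviation(text, ch))
```

A caller would pass the text in its unknown encoding as a string of opaque symbols and compare `guess_space(text)` against the symbol with the highest `space_frequency`. Combining the two scores through thresholds would need critical values, which are not stated here.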

About the Authors

Yu. .. Kotov
Novosibirsk State Technical University (NSTU)
Russian Federation


O. .. Sanina
Novosibirsk State Technical University (NSTU)
Russian Federation



For citations:


Kotov Yu..., Sanina O... Space character identification in English-language texts with unknown encoding. The Herald of the Siberian State University of Telecommunications and Information Science. 2020;(1):60-72. (In Russ.)


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1998-6920 (Print)