Text Representation and Similarity Measure for Text Clustering Based on Semantic Strings: A Case Study on Uyghur Language

Turdi Tohti; Xing Tan; Jimmy Huang; Askar Hamdulla

doi:10.6180/jase.202106_24(3).0009

Computer Science and Information Engineering

Turdi Tohti¹, Xing Tan², Jimmy Huang², and Askar Hamdulla¹

¹School of Information Science and Engineering, Xinjiang University, China
²Information Retrieval & Knowledge Management Research Lab, York University, Canada

Received: June 14, 2020
Accepted: December 5, 2020
Publication Date: June 1, 2021

Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.

Citation: https://doi.org/10.6180/jase.202106_24(3).0009

ABSTRACT

In Uyghur, words delimited by inter-word spaces serve poorly as features for text representation, which lowers the efficiency of text processing. How to use language units that extend beyond word boundaries as features, and thereby improve that efficiency, remains an open research question. This paper proposes a semantic string extraction approach, a method for extracting language units beyond word boundaries. It also proposes methods for text representation and similarity measurement, and verifies their effectiveness on Uyghur text clustering tasks. Specifically, a combination of string expansion and language rules is applied to identify the trusted frequent patterns (TFP) in the text set. Next, semantic strings are evaluated and selected from the text set. For similarity measurement, each text is represented as a weighted set of semantic strings, and a set-based text similarity measure is presented. Finally, these ideas and approaches are applied to Uyghur text clustering, and the corresponding clustering algorithms are proposed and verified through a series of experiments on a large-scale text corpus. The experimental results show that semantic string-based text representation is, in general, very useful for processing Uyghur text.

Keywords: Uyghur language; Frequent pattern discovery; Semantic string extraction; Text representation; Text clustering
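
To make the pipeline in the abstract concrete, the following is a minimal sketch under stated assumptions, not the authors' algorithm. It grows frequent word strings one word at a time (a plain level-wise expansion with a frequency threshold, without the paper's language rules or its trust and selection criteria), represents each text as a weighted set of those strings, and compares two texts with weighted Jaccard similarity. The function names, the `min_support` and `max_len` parameters, the use of in-text frequency as the weight, and the choice of weighted Jaccard are all illustrative assumptions; the abstract does not specify them.

```python
from collections import Counter


def frequent_strings(texts, min_support=2, max_len=4):
    """Level-wise string expansion (illustrative, not the paper's method).

    Grows word strings one word at a time and keeps only multi-word strings
    that occur at least `min_support` times across the whole text set.
    """
    tokenized = [t.split() for t in texts]
    # Seed the expansion with frequent single words.
    counts = Counter((w,) for doc in tokenized for w in doc)
    current = {s for s, c in counts.items() if c >= min_support}
    frequent = {}
    n = 1
    while current and n < max_len:
        n += 1
        counts = Counter()
        for doc in tokenized:
            for i in range(len(doc) - n + 1):
                gram = tuple(doc[i:i + n])
                # Extend only strings whose prefix is already frequent.
                if gram[:-1] in current:
                    counts[gram] += 1
        current = {s for s, c in counts.items() if c >= min_support}
        frequent.update({s: counts[s] for s in current})
    return frequent


def represent(text, strings):
    """Map a text to {string: weight}; the weight is the in-text frequency
    (an assumed weighting scheme)."""
    words = text.split()
    weights = Counter()
    for s in strings:
        n = len(s)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == s:
                weights[s] += 1
    return dict(weights)


def weighted_jaccard(a, b):
    """Set-based similarity: sum of minimum weights over sum of maximum weights."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den else 0.0


# Placeholder tokens stand in for Uyghur words.
corpus = ["a b c d", "a b c e", "x a b"]
strings = frequent_strings(corpus)
doc0 = represent(corpus[0], strings)
doc2 = represent(corpus[2], strings)
similarity = weighted_jaccard(doc0, doc2)
```

A pairwise similarity of this kind could then feed any standard clustering procedure; the clustering algorithms the paper itself proposes are not reproduced here.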
