Persian-English Cross-Lingual Sentence Similarity Using Deep Learning

Article type: Persian research article

Authors

1 Artificial Intelligence and Robotics, School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran.

2 School of Computer Engineering, Iran University of Science and Technology

Abstract

Semantic textual similarity is a subfield of natural language processing that has attracted extensive research in recent years. Measuring semantic similarity between words or terms, sentences, paragraphs, and documents plays an important role in natural language processing and computational linguistics. It is used in question answering systems, plagiarism detection, machine translation, information retrieval, and similar applications. Semantic textual similarity means computing the degree of semantic similarity between two text documents, paragraphs, or sentences, and it arises in both monolingual and multilingual settings. In this paper, using the Mizan parallel corpus, we present for the first time a cross-lingual semantic similarity model for Persian-English sentences, and we then evaluate and compare our model against multilingual BERT. The results indicate that parallel corpora can be used to improve the quality of sentence embeddings across two different languages. With the proposed method, the Pearson correlation based on the cosine similarity between the semantic vectors obtained from multilingual BERT increased from 65 percent to 73.77 percent. The proposed method was also tested on the Arabic-English language pair, and the results show that it outperforms multilingual BERT.
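To make the evaluation metric concrete, the following is a minimal sketch of how a Pearson correlation over cosine similarities can be computed. It is not the authors' implementation: the function names, the three-dimensional toy vectors, and the gold similarity scores are all hypothetical stand-ins, and in practice the vectors would be sentence embeddings produced by the model for each Persian-English pair.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: each pair stands in for the embedding of a Persian
# sentence and the embedding of an English sentence (assumed values only).
embedding_pairs = [
    ([1.0, 0.0, 1.0], [0.9, 0.1, 1.1]),
    ([1.0, 0.0, 0.0], [0.5, 0.5, 0.0]),
    ([0.0, 1.0, 0.0], [1.0, 0.0, 0.2]),
]
# Hypothetical human similarity judgments for the same three pairs.
gold_scores = [4.8, 2.5, 0.3]

# Predicted similarity of each pair is the cosine of its two embeddings.
predicted = [cosine_similarity(fa, en) for fa, en in embedding_pairs]

# The reported metric is the correlation between predictions and gold scores.
score = pearson(predicted, gold_scores)
print(f"Pearson correlation: {score:.4f}")
```

A figure such as 73.77 percent in the abstract corresponds to this correlation multiplied by 100, computed over a full test set rather than three toy pairs.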
