Investigating the Impact of Ensemble Machine Learning Methods on Spam Review Detection Based on Behavioral Features

Document Type: Persian Original Article

Authors

Department of Industrial Engineering, K.N.Toosi University of Technology, Tehran, Iran

Abstract

Consumer reviews, in which buyers share their experience of a product with prospective customers, are among the most influential content on the Internet. Interested parties exploit this channel by posting inaccurate experiences to unfairly promote or demote a particular product or service; such reviews are classed as spam reviews. Identifying them with machine learning techniques and ensemble learners has therefore become an active research topic. This study investigates the impact of ensemble machine learning methods on detecting spam reviews from behavioral features. Recent studies have shown that, when combined with text-based features, the ensemble methods used in this study impose additional computational cost without improving on the performance of the best base learners. Here, in addition to identifying the best base and ensemble learners for behavioral features, we examine whether combining these features with ensemble learners yields higher accuracy or a significant change in model performance. To this end, seven base learners and four ensemble learners (Bagging, Boosting, Random Forest, and Extra Tree) were evaluated, and the results were compared with those obtained using text-based features. Our evaluations show that a decision tree base learner yields the best results when combined with boosting on the unbalanced dataset and with bagging on the balanced dataset, and that ensemble learners change the performance of the best base algorithms more noticeably with behavioral features than with text-based features.
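As a rough illustration of the two ensemble schemes highlighted above, the following sketch applies bagging and AdaBoost-style boosting to one-split decision trees (stumps) on synthetic data. It is not the study's implementation: the two behavioral features (reviews per day and rating deviation), the value ranges, the sample size, and the helper names are all assumptions chosen for the example, and stumps stand in for full decision trees.

```python
import math
import random

def fit_stump(X, y, w):
    """Fit a one-split decision stump that minimises weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for pol in (1, -1):
                # Predict `pol` when the feature exceeds t, otherwise `-pol`.
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (pol if row[j] > t else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, pol)
    return best  # (weighted error, feature index, threshold, polarity)

def stump_predict(stump, row):
    _, j, t, pol = stump
    return pol if row[j] > t else -pol

def bagging(X, y, n_rounds, rng):
    """Bagging: one stump per bootstrap resample, combined by majority vote."""
    n = len(X)
    stumps = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([X[i] for i in idx],
                                [y[i] for i in idx],
                                [1.0 / n] * n))
    return lambda row: 1 if sum(stump_predict(s, row) for s in stumps) >= 0 else -1

def adaboost(X, y, n_rounds):
    """AdaBoost: reweight examples so later stumps focus on earlier mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda row: 1 if sum(a * stump_predict(s, row) for a, s in model) >= 0 else -1

def accuracy(predict, X, y):
    return sum(predict(row) == yi for row, yi in zip(X, y)) / len(y)

# Synthetic, hypothetical behavioral features: (reviews per day, rating deviation).
# Label +1 marks a spam review, -1 a genuine one.
rng = random.Random(0)
spam = [(rng.uniform(3.0, 8.0), rng.uniform(1.5, 4.0)) for _ in range(20)]
genuine = [(rng.uniform(0.0, 2.5), rng.uniform(0.0, 1.4)) for _ in range(20)]
X = spam + genuine
y = [1] * 20 + [-1] * 20

bag_acc = accuracy(bagging(X, y, 25, rng), X, y)
boost_acc = accuracy(adaboost(X, y, 10), X, y)
print("bagging training accuracy:", bag_acc)
print("boosting training accuracy:", boost_acc)
```

The toy classes are cleanly separable, so the accuracies printed here only confirm that both schemes run; they say nothing about the relative performance reported in the study.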

Keywords

