Proposing a process to integrate and identify repetitions to improve the quality of data

Document Type: Persian Original Article

Authors

1 Faculty of Computer and Information Technology Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran.

2 Faculty of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran, Iran.

Abstract

Information now plays a major role in the workplace and in decision-making. Because decisions depend on data, ensuring data quality is essential, and data cleaning methods can improve it. In this research, we propose a process that discovers duplicate and contradictory records, then integrates records and identifies duplicates to improve data quality. The process consists of the following activities: coding records, clustering with the expectation-maximization algorithm, generating tokens for records, integrating the record-coding and token-generation methods, and extracting association rules with the FP-growth algorithm. Test results show that the proposed process achieves, on average, 96% recall, 99% precision, 95% accuracy, and a 95% F-score. The proposed process is compared with a duplication and error detection method; the results indicate an increase of 13% in recall, 1% in accuracy, and 6% in F-score for the proposed process.
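As a minimal sketch of how the four reported measures are conventionally computed for duplicate detection, the following Python function derives precision, recall, accuracy, and F-score from confusion counts. The function name and the example counts are hypothetical and are not taken from the study; the abstract does not state how its averages were formed.

```python
def duplicate_detection_metrics(tp, fp, fn, tn):
    """Compute standard evaluation measures from confusion counts.

    tp: duplicate records correctly flagged as duplicates
    fp: distinct records wrongly flagged as duplicates
    fn: duplicate records that were missed
    tn: distinct records correctly left unflagged
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + recall
    # F-score here is the harmonic mean of precision and recall (F1).
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_score": f_score,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = duplicate_detection_metrics(tp=90, fp=10, fn=30, tn=870)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Averaging per-test F-scores, as opposed to computing a single F-score from averaged precision and recall, generally gives a different number, so the averaged figures need not satisfy the harmonic-mean relation exactly.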

Keywords

