A Process for Integration and Duplicate Detection to Improve Data Quality

Article type: Persian research article

Authors

1 Faculty of Electrical Engineering and Information Technology, Qazvin Branch, Islamic Azad University, Qazvin, Iran.

2 Faculty of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran, Iran.

Abstract

Information plays an essential role in today's work environments and in decision making. Given the importance of decision making, ensuring the quality of the available data is essential. Data cleaning methods can improve data quality. This paper presents a process for discovering the various kinds of duplicate and inconsistent records, integrating data, and detecting duplicates in order to improve data quality. The proposed process consists of data coding and clustering with the expectation-maximization (EM) algorithm, token construction for records, a combination of the data-coding and token-construction methods, and the generation of association rules with the FP-growth algorithm. Experimental results show that the proposed process achieves, on average, 96% recall, 99% accuracy, 95% precision, and a 95% F-score. Compared with an existing method for duplicate and error detection, the proposed process shows an increase of 13% in recall, 1% in accuracy, and 6% in F-score.
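The four evaluation measures named in the abstract can be sketched as follows. This is a minimal illustration that assumes the standard confusion-matrix definitions for a duplicate detector evaluated on labelled record pairs; the abstract does not state how the authors computed them. The `Confusion` class, the function names, and the counts in the usage lines are hypothetical and are not taken from the paper's experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Outcome counts for a duplicate detector (hypothetical layout)."""
    tp: int  # duplicate pairs correctly flagged as duplicates
    fp: int  # distinct pairs wrongly flagged as duplicates
    fn: int  # duplicate pairs the detector missed
    tn: int  # distinct pairs correctly left unflagged


def recall(c: Confusion) -> float:
    """Share of true duplicates that were found."""
    found_or_missed = c.tp + c.fn
    return c.tp / found_or_missed if found_or_missed else 0.0


def precision(c: Confusion) -> float:
    """Share of flagged pairs that really are duplicates."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def accuracy(c: Confusion) -> float:
    """Share of all pairs that were classified correctly."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def f_score(c: Confusion) -> float:
    """Harmonic mean of precision and recall (F1)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Toy counts chosen only to make the arithmetic easy to follow.
toy = Confusion(tp=8, fp=2, fn=2, tn=88)
print(recall(toy), precision(toy), accuracy(toy), f_score(toy))
```

With many more distinct pairs than duplicate pairs, as in the toy counts, accuracy is dominated by the true negatives, which is one reason recall, precision, and F-score are usually reported alongside it.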

Keywords

