Abstract
This article emphasizes the importance of data quality and argues that a generalized class of data cleaning tools should emerge, introducing sample models. Data quality is critical in every area of information management systems, yet data consistency problems, ranging from spelling errors to deep logical contradictions, pervade applications of all kinds. Data cleaning aims to detect and eliminate dirty data and thereby improve data quality. This article outlines common data quality issues and recent findings in data cleaning research, and introduces a standardized data cleaning architecture, oriented around a user model, for existing management information systems. In short, most companies waste resources cleaning their data; a generalized tool that handles every data type would be of great use to companies concerned about data quality.
Introduction
In a word, the quality of the data determines the quality of the system. Without insightful analysis of data problems, a data warehouse is bound to fail, causing great economic loss and faulty decisions. This view, however, overlooks the fact that data cleaning is an iterative process, customized to the needs and semantics of a particular study. The principle of data cleaning is to find and correct mistakes and inconsistencies; it is a necessary condition for discovering knowledge and constructing data warehouses. Traditionally, data cleaning procedures, sequences of transformations such as deduplication or outlier elimination that translate raw data into a format suitable for analysis, have been regarded as static components of data integration or Extract-Transform-Load (ETL) pipelines, executed as new data enters the system [1] (Sanjay Krishnan et al., 2016). Recently, many tools have been developed that enable the iterative design and optimization of data cleaning workflows. Our claim is that a universal data cleaning tool is needed for every data type. In this paper, we discuss current data cleaning tools and how their advantages can be merged into a best practice [2].
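The static transformations mentioned above can be illustrated with a minimal sketch. The function below is a hypothetical example, not any particular tool's implementation: it deduplicates records and then drops outliers using a median-absolute-deviation rule (a robust variant of the usual standard-deviation cutoff); the field names are invented for illustration.

```python
from statistics import median

def clean(rows, key):
    """Two common static cleaning transformations: deduplication, then outlier removal."""
    # Deduplicate while preserving order.
    seen, deduped = set(), []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            deduped.append(row)
    # Drop rows whose value deviates from the median by more than 3x the
    # median absolute deviation (robust against the outlier inflating the scale).
    values = [r[key] for r in deduped]
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0
    return [r for r in deduped if abs(r[key] - med) <= 3 * mad]

raw = [{"id": 1, "amount": 10.0}, {"id": 1, "amount": 10.0},
       {"id": 2, "amount": 12.0}, {"id": 3, "amount": 11.0},
       {"id": 4, "amount": 9000.0}]
cleaned = clean(raw, "amount")  # duplicate of id 1 and the 9000.0 outlier are removed
```

In an ETL pipeline, such a function would run as a fixed stage on each incoming batch, which is exactly the static view of cleaning that iterative tools try to move beyond.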
Background and Related Work
Data cleaning has been a primary field of database research over the past two decades [1]. Several commercial data cleaning tools are currently available, and the academic community has developed research tools with the same purpose. Most commercial data cleaning applications are built to support data discovery, cleaning, and conversion. They can be classified into two types: tools for detecting anomalous data and tools for data conversion (i.e., for cleaning anomalous data) [5]. Data cleaning is profoundly domain specific: data quality problems range from the trivial to the complex and incoherent, and there is no common international reference standard. Cleaning algorithms can only be designed according to the specific requirements of each domain, yet criteria still exist for judging such algorithms (Huang Yu Nanjing et al., 2015). Since Kandel et al., there have been major developments in data cleaning, such as the growing adoption of machine learning in the enterprise and the emergence of in-memory, low-latency data processing systems such as Apache Spark. In this context, we examine an example generalized system model below.
The system is highly flexible and satisfies the needs of different enterprise users, as shown in Fig. 1.
In Fig. 1, the system accesses the database objects to be cleaned through universal access controls, so the details of accessing different DBMSs are transparent to users.
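One plausible way to realize such universal access is an adapter interface that every DBMS backend implements; the cleaning engine then depends only on the interface. The sketch below is an assumption about how this layer might look (the names `DataSource`, `SQLiteSource`, and `count_rows` are invented for illustration), shown with SQLite as the single concrete backend.

```python
import sqlite3
from typing import Iterable, Protocol

class DataSource(Protocol):
    """Hypothetical universal access contract: one interface for every DBMS."""
    def read_rows(self, table: str) -> Iterable[tuple]: ...

class SQLiteSource:
    """One concrete adapter; adapters for other DBMSs would expose the same method."""
    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)

    def read_rows(self, table: str) -> Iterable[tuple]:
        return self.conn.execute(f"SELECT * FROM {table}").fetchall()

def count_rows(source: DataSource, table: str) -> int:
    # The cleaning engine calls only the interface, never a concrete driver.
    return sum(1 for _ in source.read_rows(table))

src = SQLiteSource(":memory:")
src.conn.execute("CREATE TABLE customers (name TEXT)")
src.conn.executemany("INSERT INTO customers VALUES (?)", [("alice",), ("bob",)])
n = count_rows(src, "customers")
```

Because cleaning logic is written against `DataSource`, swapping the underlying DBMS requires only a new adapter, which is what makes the data access transparent to users.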
The user model definition module defines a universal model and structure that the system can easily interpret, stored in the XML files of the user model. The quality model specification module reads these XML files, retrieves the user model, and specifies the user model's quality standard, including the interpretation of data quality rules and the concepts of the constraint language. The data cleaning tools comprise a data cleaning engine, an auto-run module, and a module for human intervention [3].
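To make the interaction between the two modules concrete, here is a sketch of what a user-model XML file and its quality rules might look like. The schema (element and attribute names such as `field`, `rule`, `kind`) is an assumption for illustration, not the paper's actual format.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-model file: field definitions plus declarative quality rules.
MODEL = """
<user_model>
  <field name="email" type="string">
    <rule kind="not_null"/>
    <rule kind="regex" pattern="[^@]+@[^@]+"/>
  </field>
  <field name="age" type="int">
    <rule kind="range" min="0" max="130"/>
  </field>
</user_model>
"""

def load_rules(xml_text: str) -> dict:
    """Parse the user model and return each field's quality rules."""
    root = ET.fromstring(xml_text)
    return {field.get("name"): [rule.attrib for rule in field.findall("rule")]
            for field in root.findall("field")}

rules = load_rules(MODEL)  # e.g. rules["age"] -> [{"kind": "range", "min": "0", "max": "130"}]
```

The cleaning engine would iterate over these parsed rules and dispatch each `kind` to a corresponding check, which is how a declarative quality model stays independent of the engine that enforces it.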
Confirming Kandel et al.'s results, we observed that many participants described data cleaning as an engaging, iterative, and self-learning process. In other words, users first load a sample data set so that those performing the cleaning can become comfortable with it. Learning methods running in the background then determine the interface according to the type of data and automatically present this interface to the user.
As shown in Fig. 1, an interface shaped by the data types entered into the system will be convenient for employees of companies that struggle with data cleaning.
First, data is entered into the system. The machine learning module analyzes the entered data and determines the data types. The identified data types are sent to the next module, which constructs the interface. From this initial interface, the user removes the parts that are unnecessary, producing a customized user interface. This customized interface provides convenience to business intelligence employees who cannot write code or scripts.
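The pipeline described above can be sketched in miniature. The type inference below is a deliberately crude stand-in for the machine learning module, and the widget names are invented; the point is only to show how inferred column types could drive the generated interface.

```python
def infer_type(values):
    """Crude stand-in for the ML module: guess a column's type from its values."""
    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False
    if all(v.isdigit() for v in values):
        return "integer"
    if all(is_float(v) for v in values):
        return "float"
    if all(v.lower() in ("true", "false") for v in values):
        return "boolean"
    return "text"

# Hypothetical mapping from inferred type to the widget the interface would show.
WIDGETS = {"integer": "numeric range filter",
           "float": "numeric range filter",
           "boolean": "checkbox filter",
           "text": "free-text deduplication panel"}

def build_interface(columns):
    """Construct the initial interface; the user would then prune unwanted parts."""
    return {name: WIDGETS[infer_type(vals)] for name, vals in columns.items()}

ui = build_interface({"age": ["31", "45"],
                      "active": ["true", "false"],
                      "city": ["Ankara", "Izmir"]})
```

A real system would replace `infer_type` with a trained classifier and let the user delete entries from `ui`, yielding the customized interface the text describes.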
Conclusion
High data quality is a general prerequisite of modern information system design. This paper proposes a data cleaning architecture oriented around a user model, in accordance with the specifications of existing information systems. The framework, suitable for different management information systems, provides uniform interfaces to the universal user model and the data quality model. Human participation is included in the data cleaning process to deliver versatile functionality and to satisfy information technology requirements. With the customized interface, employees who are unable to write code will be able to clean data without extra effort. This customizable tool will play an important role in improving data quality. However, the customized framework and interface should not be finalized without user consent at the last step, immediately before the interface is determined.
References
[1] HILDA '16, June 26, 2016, San Francisco, CA, USA. ACM. ISBN 978-1-4503-4207-0/16/06. DOI: http://dx.doi.org/10.1145/2939502.2939511
[2] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Enterprise data analysis and visualization: An interview study. VAST, 2012.
[3] Richard Y. Wang. A Product Perspective on Total Data Quality Management. Communications of the ACM, 41(2): 58-65.
[4] Huang Yu Nanjing. A Universal Data Cleaning Framework Based on User Model. ISECS International Colloquium on Computing, Communication, Control, and Management.
[5] A. Ebaid, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, J.-A. Quiane-Ruiz, N. Tang, and S. Yin. NADEEF: A generalized data cleaning system. VLDB, 2013.