Data Cleaning Framework for NoSQL Document Databases

Abstract

Civilization's relationship with data and information grows ever tighter, and the use of data has spread from basic human needs into many other fields. Because data is so important, many organizations store it electronically for future use, especially for decision making. The higher frequency of data generation and the heterogeneity of data sources have degraded data purity. A new database type, NoSQL, has been proposed as a solution, but it has not yet proven itself to be a proper solution for maintaining data purity. Decisions based on impure data can be questionable or inefficient and may waste time and money. Most industry experts claim that data scientists spend 80% to 90% of their working time just cleaning data, with the remaining 10% to 20% spent on the other activities of a data science project [1]. This points to the need for a data cleaning mechanism for NoSQL. Based on an available framework for RDBMS (Relational Database Management System), this paper presents a data cleaning framework for NoSQL document databases. The proposed framework is evaluated using MongoDB to assess its capabilities and limitations.

2016
H.P.U Wijayantha

Sri Lanka Institute of Information Technology New Kandy Rd Malabe 10115, Sri Lanka upulwijayantha@gmail.com

Prasanna S. Haddela

Sri Lanka Institute of Information Technology New Kandy Rd Malabe 10115, Sri Lanka prasanna@sliit.lk