Implementing rules for detecting and cleansing incorrect data is almost always done by BI specialists and not by business users themselves. Is that going to change with the introduction of self-service data preparation tools?
Most tools for data profiling and data cleansing are designed for BI specialists. They can be quite technical, even though business users have to be involved in preparing the data: their input is needed to determine what’s correct and what’s not.
But, as we all know, the world of BI is changing. Business users want to do more themselves. Just consider the ever-increasing popularity of self-service data visualization tools, such as QlikView, Spotfire, and Tableau. Today, even self-service data integration tools exist.
In addition, business users are becoming more and more interested in data sources available on the internet, such as social media data and open data. If they want to integrate this external data with internal data stored in the data warehouse environment, they have to ask the BI specialists to help them out. The specialists are then responsible for defining rules and developing ETL logic to clean the external data and integrate it with the internal data. This all takes quite some time. It may take so long that analyzing the data becomes irrelevant because the business opportunity has passed.
Recently, products such as Paxata, Informatica’s Springbok, and Trifacta have been introduced that allow business users to prepare the data themselves. All three have been developed specifically for business users. This goes beyond having an easy-to-use interface: the tools include functionality that guides business users through the data preparation process. For example, if users are looking at a column with city names, Paxata automatically suggests which names look similar and should probably be treated as identical. Next, it proposes a fix that automatically changes the incorrectly spelled names to the correct one. Various machine learning algorithms are deployed for this, and the product is packed with such features. Note that Paxata doesn’t correct the original data itself. The preparation rules defined by the business users are implemented as filters. Users can let the tool create an extract with the corrected data, which can then be used for analytics and reporting.
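To make the city-name example concrete, here is a minimal Python sketch of the general idea. It is not Paxata's algorithm, which is not described here. It assumes that the most frequent spelling is the correct one, and it uses a simple string-similarity ratio from the standard library in place of machine learning. The function names, the threshold of 0.8, and the sample rows are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def suggest_corrections(values, threshold=0.8):
    """Suggest a mapping from rare spellings to similar, more frequent ones.

    Assumption: the more frequent spelling is the correct one.
    """
    counts = Counter(values)
    # Most frequent spellings first; they become the candidate correct names.
    ranked = [name for name, _ in counts.most_common()]
    canonical = []
    mapping = {}
    for name in ranked:
        for target in canonical:
            ratio = SequenceMatcher(None, name.lower(), target.lower()).ratio()
            if ratio >= threshold:
                mapping[name] = target  # looks like a misspelling of target
                break
        else:
            canonical.append(name)  # no close match, treat as a name of its own
    return mapping


def apply_rules(rows, column, mapping):
    """Act as a filter: yield corrected copies and leave the source rows as-is."""
    for row in rows:
        fixed = dict(row)
        fixed[column] = mapping.get(row[column], row[column])
        yield fixed


# Hypothetical sample data with two misspelled city names.
source = [
    {"city": "Amsterdam"}, {"city": "Amsterdam"}, {"city": "Amsterdm"},
    {"city": "Rotterdam"}, {"city": "Rotterdam"}, {"city": "Roterdam"},
]

rules = suggest_corrections(row["city"] for row in source)
extract = list(apply_rules(source, "city", rules))
```

In this sketch the rules are only suggestions until they are applied, and applying them produces a separate extract. The source rows keep their original spelling, which mirrors the filter-plus-extract approach described above.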
Making data preparation available to business users speeds up the use of external data sources and improves the time to market for certain reports and analytical exercises. The challenge that remains is how this user-oriented data preparation can coexist with the more traditional data warehouse environment.