Home > Ask the Data Management Experts > Data integration Questions & Answers > The ETL process and MySQL
Ask The Data Management Expert: Questions & Answers
EMAIL THIS

The ETL process and MySQL

Mark Whitehorn EXPERT RESPONSE FROM: Mark Whitehorn

Pose a Question
Other Data Management Categories
Meet all Data Management Experts
Become an Expert for this site


Tips, expert advice and sample chapters
Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us    Add to Google


>
QUESTION POSED ON: 27 November 2007
I am looking for help regarding the ETL process. I have raw data and I want to transform it in MySQL to remove data errors. What is the process for doing this?

>
EXPERT RESPONSE
If you Google for extract, transform and load (ETL) you'll find a huge number of references to ETL tools. The reason that all of these tools have been developed is simple -- the ETL process is so complex that a tool is usually the best choice. And even though they are usually expensive, they are cost-effective. So, my default answer would be to first advise you to take a serious look at using an ETL tool. I hold zero shares in any ETL company (actually, I hold zero in any IT-related company), so I don't have any ax to grind here.

However, you may well have considered and rejected a tool so, where do you start with ETL using MySQL?

Extraction

MySQL is essentially a RDBMS engine, not an ETL tool. It doesn't have specific extraction tools, so you'll probably have to push the data from the source systems out as XML or CSV files. These will have to be imported into MySQL tables.

Transformation/Cleansing

Cleansing the data is often more complex. I can't give you a series of MySQL statements to run on your data, but I can, hopefully, give you some pointers.

Dirty data comes in many flavors, but it often consists of the outliers in the data. For example, consider a column that stores exam marks allocated as percentages. It should contain values between 0 and 100 (to put that more formally, the domain of acceptable values in the set 0 - 100). So, run a Min and Max query against the column and that will immediately identify whether there are unacceptable values.

Next, we know that marks are not evenly distributed across the domain, so run a query that groups the values into, say, 0-10, 11-20, 21-30 and so on. Eye-balling the figures will give an idea of whether they are the expected skewed distribution.

Now consider a column called Gender. You expect it to contain two values: Male and Female. If you run your eye down the column they may look alright, but there may be 60 million rows. However, a simple GROUPBY query will show you each discrete value in the column once. It helps to perform a count on the primary key as well and if, for example, you see:

COUNT OF Primary KeyGender
31,234,542Male
28,438,765Female
42Mail
563F
766M

Then you know you have a problem. You can then construct an UPDATE query to fix it.

Of course, once you have worked out how to do the cleansing you still have to automate it in a script.

Loading

The final step is to load the data: move it from the transformation area to the core data warehouse. This operation is often functionally an INSERT operation but, as a general rule rather than a MySQL specific one, we usually try to avoid the SQL INSERT operation here because it is way too slow if there is a significant data volume. Instead we tend to use whatever bulk inset option the RDBMS offers.

Overall

As I write this answer I do keep on coming back to the fact that all of this is easier with an ETL tool. You can still use MySQL to write and run the SQL but use the ETL tool to schedule the SQL and control the process overall.

More information on ETL tools:

  • ETL tools: What you do and don't want
  • Extract, transform and load tutorial: Five-minute audio podcast
  • ETL tools and EDR tools: What's the difference?


  • Sound Off! -   


    Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us    Add to Google


    RELATED CONTENT
    DBMS and data warehousing
    Software developer career growth
    How to evaluate data warehouse software in five steps
    Data warehousing, data mining and data querying: Terms and definitions
    The difference between data definition language (DDL) and data manipulation language (DML)
    What is an operational data store vs. a data warehouse?
    Can a dimension table be a fact table for another data mart?
    Top three database management system (DBMS) trends
    Can I have two data warehouses?
    Database administrator job roles: Organizing the DBAs
    Data modeling for data warehouse projects

    Data integration
    SOA governance best practices
    Data integration certifications: Finding the value
    Customer data integration and data warehouses for the healthcare sector
    ERP reporting tools' advantages and disadvantages
    ETL tools and EDR tools: What's the difference?
    Data-as-a-service, explained and defined
    ETL tools defined
    Building a customer data integration business case: Project specifications
    ETL tools: What you do and don't want
    SOA starts with the data

    Data warehousing/ETL
    ETL tools and EDR tools: What's the difference?
    ETL and CDI, the fundamental differences
    Extract, transform and load (ETL) tools transforming, study finds
    The data warehouse ETL toolkit
    Can ETL tools handle complicated business logic?

    RELATED RESOURCES
    2020software.com, trial software downloads for accounting software, ERP software, CRM software and business software systems
    Search Bitpipe.com for the latest white papers and business webcasts
    Whatis.com, the online computer dictionary



    Search and Browse the Expert Answer Center
    Search and browse more than 25,000 question and answer pairs from more than 250 TechTarget industry experts.
    Browse our Expert Advice

    About Us  |  Contact Us  |  For Advertisers  |  For Business Partners  |  Site Index  |  RSS
    SEARCH 
    TechTarget provides enterprise IT professionals with the information they need to perform their jobs - from developing strategy, to making cost-effective IT purchase decisions and managing their organizations' IT projects - with its network of technology-specific Web sites, events and magazines.

    TechTarget Corporate Web Site  |  Media Kits  |  Reprints  |  Site Map




    All Rights Reserved, Copyright 2005 - 2008, TechTarget | Read our Privacy Policy
      TechTarget - The IT Media ROI Experts