Home > Ask the Data management / BI Experts > Integration and SOA data services Questions & Answers > The ETL process and MySQL
Ask The Data Management Expert: Questions & Answers
EMAIL THIS

The ETL process and MySQL

Mark Whitehorn EXPERT RESPONSE FROM: Mark Whitehorn

Pose a Question
Other Data Management Categories
Meet all Data Management Experts
Become an Expert for this site


Tips, expert advice and sample chapters
Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us    Add to Google


>
QUESTION POSED ON: 27 November 2007
I am looking for help regarding the ETL process. I have raw data and I want to transform it in MySQL to remove data errors. What is the process for doing this?


Digg This!    StumbleUpon Toolbar StumbleUpon    Bookmark with Delicious Del.icio.us    Add to Google



RELATED CONTENT
DBMS and data warehousing
Definition of primary, super, foreign and candidate key in the DBMS
What is the difference between a logical and physical warehouse design?
What are some emerging data warehouse and DBMS trends?
How to get data/database independence with a three-tier architecture
How to select an MPP database: DB2 vs. Teradata
What comes first — the data mart or the data warehouse?
What are the top database management systems (DBMS)?
What is the role of DBMS in RDBMS?
Is an Inmon-modeled BI system, like Madison, the future of data warehousing?
What are the benefits and disadvantages of a RDBMS?

Integration and SOA data services
What are some SOA risks and challenges and how can we mitigate risks?
How to estimate SOA costs and design an SOA strategy for integration projects
What's the difference between SOA and Web services?
What are the components of service-oriented architecture (SOA)?
SOA governance best practices
Data integration certifications: Finding the value
ERP reporting tools' advantages and disadvantages
ETL tools and EDR tools: What's the difference?
Data-as-a-service, explained and defined
ETL tools defined

Data warehousing/ETL
ETL tools and EDR tools: What's the difference?
ETL and CDI, the fundamental differences
Extract, transform and load (ETL) tools transforming, study finds
The data warehouse ETL toolkit
Can ETL tools handle complicated business logic?

RELATED RESOURCES
2020software.com, trial software downloads for accounting software, ERP software, CRM software and business software systems
Search Bitpipe.com for the latest white papers and business webcasts
Whatis.com, the online computer dictionary


If you Google for extract, transform and load (ETL) you'll find a huge number of references to ETL tools. The reason that all of these tools have been developed is simple -- the ETL process is so complex that a tool is usually the best choice. And even though they are usually expensive, they are cost-effective. So, my default answer would be to first advise you to take a serious look at using an ETL tool. I hold zero shares in any ETL company (actually, I hold zero in any IT-related company), so I don't have any ax to grind here.

However, you may well have considered and rejected a tool so, where do you start with ETL using MySQL?

Extraction

MySQL is essentially a RDBMS engine, not an ETL tool. It doesn't have specific extraction tools, so you'll probably have to push the data from the source systems out as XML or CSV files. These will have to be imported into MySQL tables.

Transformation/Cleansing

Cleansing the data is often more complex. I can't give you a series of MySQL statements to run on your data, but I can, hopefully, give you some pointers.

Dirty data comes in many flavors, but it often consists of the outliers in the data. For example, consider a column that stores exam marks allocated as percentages. It should contain values between 0 and 100 (to put that more formally, the domain of acceptable values in the set 0 - 100). So, run a Min and Max query against the column and that will immediately identify whether there are unacceptable values.

Next, we know that marks are not evenly distributed across the domain, so run a query that groups the values into, say, 0-10, 11-20, 21-30 and so on. Eye-balling the figures will give an idea of whether they are the expected skewed distribution.

Now consider a column called Gender. You expect it to contain two values: Male and Female. If you run your eye down the column they may look alright, but there may be 60 million rows. However, a simple GROUPBY query will show you each discrete value in the column once. It helps to perform a count on the primary key as well and if, for example, you see:

COUNT OF Primary KeyGender
31,234,542Male
28,438,765Female
42Mail
563F
766M

Then you know you have a problem. You can then construct an UPDATE query to fix it.

Of course, once you have worked out how to do the cleansing you still have to automate it in a script.

Loading

The final step is to load the data: move it from the transformation area to the core data warehouse. This operation is often functionally an INSERT operation but, as a general rule rather than a MySQL specific one, we usually try to avoid the SQL INSERT operation here because it is way too slow if there is a significant data volume. Instead we tend to use whatever bulk inset option the RDBMS offers.

Overall

As I write this answer I do keep on coming back to the fact that all of this is easier with an ETL tool. You can still use MySQL to write and run the SQL but use the ETL tool to schedule the SQL and control the process overall.

More information on ETL tools:

  • ETL tools: What you do and don't want
  • Extract, transform and load tutorial: Five-minute audio podcast
  • ETL tools and EDR tools: What's the difference?




  • Search and Browse the Expert Answer Center
    Search and browse more than 25,000 question and answer pairs from more than 250 TechTarget industry experts.
    Browse our Expert Advice

    About Us  |  Contact Us  |  For Advertisers  |  For Business Partners  |  Site Index  |  RSS
    SEARCH 
    TechTarget provides technology professionals with the information they need to perform their jobs - from developing strategy, to making cost-effective purchase decisions and managing their organizations' technology projects - with its network of technology-specific websites, events and online magazines.

    TechTarget Corporate Web Site  |  Media Kits  |  Site Map




    All Rights Reserved, Copyright 2005 - 2009, TechTarget | Read our Privacy Policy
      TechTarget - The IT Media ROI Experts