Simple data mining examples and datasets

12 May 2009 | Written by: Chakrabarti et al.


This excerpt from Data Mining: Know it All includes examples that show how algorithms, datasets and data mining work. Learn how companies can make data-related decisions based on set rules.

Table of contents:

An introduction to data mining
Simple data mining examples and datasets
Fielded applications of data mining and machine learning
The difference between machine learning and statistics in data mining
Information and examples on data mining and ethics
Data acquisition and integration techniques
What is a data rollup?
Calculating mode in data mining projects
Using data merging and concatenation techniques to integrate data

1.2 Simple Examples: The Weather Problem and Others

We use a lot of examples in this book, which seems particularly appropriate considering that the book is all about learning from examples! There are several standard datasets that we will come back to repeatedly. Different datasets tend to expose new issues and challenges, and it is interesting and instructive to have in mind a variety of problems when considering learning methods. In fact, the need to work with different datasets is so important that a corpus containing around 100 example problems has been gathered together so that different algorithms can be tested and compared on the same set of problems.




The illustrations used here are all unrealistically simple. Serious application of data mining involves thousands, hundreds of thousands, or even millions of individual cases. But when explaining what algorithms do and how they work, we need simple examples that capture the essence of the problem but are small enough to be comprehensible in every detail. The illustrations we will be working with are intended to be "academic" in the sense that they will help us to understand what is going on. Some actual fielded applications of learning techniques are discussed in Section 1.3, and many more are covered in the books mentioned in the Further Reading section at the end of the chapter.

Another problem with actual real-life datasets is that they are often proprietary. No corporation is going to share its customer and product choice database with you so that you can understand the details of its data mining application and how it works. Corporate data is a valuable asset, one whose value has increased enormously with the development of data mining techniques such as those described in this book. Yet we are concerned here with understanding how the methods used for data mining work and understanding the details of these methods so that we can trace their operation on actual data. That is why our illustrations are simple ones. But they are not simplistic: they exhibit the features of real datasets.

1.2.1 The Weather Problem
The weather problem is a tiny dataset that we will use repeatedly to illustrate machine learning methods. Entirely fictitious, it supposedly concerns the conditions that are suitable for playing some unspecified game. In general, instances in a dataset are characterized by the values of features, or attributes, that measure different aspects of the instance. In this case there are four attributes: outlook, temperature, humidity, and windy. The outcome is whether or not to play.

In its simplest form, shown in Table 1.2, all four attributes have values that are symbolic categories rather than numbers. Outlook can be sunny, overcast, or rainy; temperature can be hot, mild, or cool; humidity can be high or normal; and windy can be true or false. This creates 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of input examples.

A set of rules learned from this information — not necessarily a very good one — might look as follows:

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

These rules are meant to be interpreted in order: the first one; then, if it doesn't apply, the second; and so on.

A set of rules intended to be interpreted in sequence is called a decision list. Interpreted as a decision list, the rules correctly classify all of the examples in the table, whereas taken individually, out of context, some of the rules are incorrect. For example, the rule if humidity = normal, then play = yes gets one of the examples wrong (check which one). The meaning of a set of rules depends on how it is interpreted — not surprisingly!
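The difference between reading the rules as a decision list and reading one rule in isolation can be checked mechanically. The following is a minimal Python sketch, not part of the original text: the data is transcribed from Table 1.2, and the names WEATHER and decision_list are illustrative choices.

```python
# The 14 examples of Table 1.2: (outlook, temperature, humidity, windy, play).
WEATHER = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]


def decision_list(outlook, temperature, humidity, windy):
    """Apply the five rules in order; the first rule that fires decides."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return "yes"  # "none of the above"


# Examples misclassified when the rules are interpreted in sequence.
list_errors = [row for row in WEATHER if decision_list(*row[:4]) != row[4]]

# Examples misclassified by the fourth rule taken on its own, out of context.
rule4_errors = [row for row in WEATHER if row[2] == "normal" and row[4] != "yes"]

print("decision list errors:", len(list_errors))
print("fourth rule alone gets wrong:", rule4_errors)
```

Running the sketch lists the single example that the humidity rule misclassifies when it is read without the rules that precede it.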

In the slightly more complex form shown in Table 1.3, two of the attributes — temperature and humidity — have numeric values. This means that any learning method must create inequalities involving these attributes rather than simple equality tests, as in the former case. This is called a numeric-attribute problem — in this case, a mixed-attribute problem because not all attributes are numeric.

Now the first rule given earlier might take the following form:

If outlook = sunny and humidity > 83 then play = no

A slightly more complex process is required to come up with rules that involve numeric tests.
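As a rough illustration of what a numeric test involves, the sketch below applies the rule above to the data of Table 1.3 and then looks for places where a humidity threshold could separate the sunny examples. The midpoint heuristic is an assumption made for this sketch, not the procedure used in the book, and the names WEATHER_NUMERIC, covered and boundaries are illustrative.

```python
# The 14 examples of Table 1.3: (outlook, temperature, humidity, windy, play).
WEATHER_NUMERIC = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]

# Examples covered by: if outlook = sunny and humidity > 83 then play = no
covered = [row for row in WEATHER_NUMERIC if row[0] == "sunny" and row[2] > 83]
print(len(covered), "covered,", sum(row[4] == "no" for row in covered), "correct")

# Assumed heuristic: sort the sunny examples by humidity and propose a
# threshold halfway between neighbours whose class differs.
sunny = sorted((row[2], row[4]) for row in WEATHER_NUMERIC if row[0] == "sunny")
boundaries = [
    (low[0] + high[0]) / 2
    for low, high in zip(sunny, sunny[1:])
    if low[1] != high[1]
]
print("candidate humidity thresholds:", boundaries)
```

On this data any threshold between the largest "yes" humidity and the smallest "no" humidity among the sunny examples separates them, so the value 83 in the rule is one of several workable choices.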

Table 1.2 The Weather Data
Outlook    Temperature  Humidity  Windy  Play
Sunny      Hot          High      False  No
Sunny      Hot          High      True   No
Overcast   Hot          High      False  Yes
Rainy      Mild         High      False  Yes
Rainy      Cool         Normal    False  Yes
Rainy      Cool         Normal    True   No
Overcast   Cool         Normal    True   Yes
Sunny      Mild         High      False  No
Sunny      Cool         Normal    False  Yes
Rainy      Mild         Normal    False  Yes
Sunny      Mild         Normal    True   Yes
Overcast   Mild         High      True   Yes
Overcast   Hot          Normal    False  Yes
Rainy      Mild         High      True   No

The rules we have seen so far are classification rules: they predict the classification of the example in terms of whether or not to play. It is equally possible to disregard the classification and just look for any rules that strongly associate different attribute values. These are called association rules. Many association rules can be derived from the weather data in Table 1.2. Some good ones are as follows:

If temperature = cool then humidity = normal
If humidity = normal and windy = false then play = yes
If outlook = sunny and play = no then humidity = high
If windy = false and play = no then outlook = sunny
   and humidity = high

Table 1.3 Weather Data with Some Numeric Attributes
Outlook    Temperature  Humidity  Windy  Play
Sunny      85           85        False  No
Sunny      80           90        True   No
Overcast   83           86        False  Yes
Rainy      70           96        False  Yes
Rainy      68           80        False  Yes
Rainy      65           70        True   No
Overcast   64           65        True   Yes
Sunny      72           95        False  No
Sunny      69           70        False  Yes
Rainy      75           80        False  Yes
Sunny      75           70        True   Yes
Overcast   72           90        True   Yes
Overcast   81           75        False  Yes
Rainy      71           91        True   No

All these rules are 100 percent correct on the given data; they make no false predictions. The first two apply to four examples in the dataset, the third to three examples, and the fourth to two examples. There are many other rules: in fact, nearly 60 association rules can be found that apply to two or more examples of the weather data and are completely correct on this data. If you look for rules that are less than 100 percent correct, then you will find many more. There are so many because unlike classification rules, association rules can "predict" any of the attributes, not just a specified class, and can even predict more than one thing. For example, the fourth rule predicts both that outlook will be sunny and that humidity will be high.
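The coverage and accuracy figures quoted for the four association rules can be recomputed from Table 1.2. The sketch below is illustrative only: each rule is written as a pair of dictionaries (conditions, predictions), and the helper name evaluate is an assumption of this example rather than anything defined in the book.

```python
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")
RAW = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]
ROWS = [dict(zip(FIELDS, values)) for values in RAW]

# The four association rules from the text: (conditions, predictions).
ASSOCIATION_RULES = [
    ({"temperature": "cool"}, {"humidity": "normal"}),
    ({"humidity": "normal", "windy": "false"}, {"play": "yes"}),
    ({"outlook": "sunny", "play": "no"}, {"humidity": "high"}),
    ({"windy": "false", "play": "no"}, {"outlook": "sunny", "humidity": "high"}),
]


def evaluate(conditions, predictions):
    """Return (examples the rule applies to, examples it predicts correctly)."""
    covered = [r for r in ROWS if all(r[k] == v for k, v in conditions.items())]
    correct = [r for r in covered if all(r[k] == v for k, v in predictions.items())]
    return len(covered), len(correct)


results = [evaluate(cond, pred) for cond, pred in ASSOCIATION_RULES]
for (cond, pred), (covered, correct) in zip(ASSOCIATION_RULES, results):
    print(cond, "=>", pred, f"covers {covered}, correct on {correct}")
```

Note that the fourth rule has two entries in its prediction dictionary, which is what distinguishes association rules from classification rules with a single fixed class attribute.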

1.2.2 Contact Lenses: An Idealized Problem
The contact lens data introduced earlier tells you the kind of contact lens to prescribe, given certain information about a patient. Note that this example is intended for illustration only: it grossly oversimplifies the problem and should certainly not be used for diagnostic purposes!

The first column of Table 1.1 gives the age of the patient. In case you're wondering, presbyopia is a form of longsightedness that accompanies the onset of middle age. The second gives the spectacle prescription: myope means shortsighted and hypermetrope means longsighted. The third shows whether the patient is astigmatic, and the fourth relates to the rate of tear production, which is important in this context because tears lubricate contact lenses. The final column shows which kind of lenses to prescribe: hard, soft, or none. All possible combinations of the attribute values are represented in the table.

A sample set of rules learned from this information is shown in Figure 1.1. This is a large set of rules, but they do correctly classify all the examples. These rules are complete and deterministic: they give a unique prescription for every conceivable example. Generally, this is not the case. Sometimes there are situations in which no rule applies; other times more than one rule may apply, resulting in conflicting recommendations. Sometimes probabilities or weights may be associated with the rules themselves to indicate that some are more important, or more reliable, than others.

You might be wondering whether there is a smaller rule set that performs as well. If so, would you be better off using the smaller rule set and, if so, why? These are exactly the kinds of questions that will occupy us in this book. Because the examples form a complete set for the problem space, the rules do no more than summarize all the information that is given, expressing it in a different and more concise way. Even though it involves no generalization, this is often a useful thing to do! People frequently use machine learning techniques to gain insight into the structure of their data rather than to make predictions for new cases. In fact, a prominent and successful line of research in machine learning began as an attempt to compress a huge database of possible chess endgames and their outcomes into a data structure of reasonable size. The data structure chosen for this enterprise was not a set of rules, but a decision tree.

Figure 1.1. Rules for the contact lenses data
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and
   tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and
   tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and
   astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and
   tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and
   tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and
   tear production rate = normal then recommendation = hard
If age = pre-presbyopic and
   spectacle prescription = hypermetrope and astigmatic = yes
   then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
   and astigmatic = yes then recommendation = none

Figure 1.2. Decision tree for the contact lenses data

Figure 1.2 presents a structural description for the contact lens data in the form of a decision tree, which for many purposes is a more concise and perspicuous representation of the rules and has the advantage that it can be visualized more easily. (However, this decision tree — in contrast to the rule set given in Figure 1.1 — classifies two examples incorrectly.) The tree calls first for a test on tear production rate, and the first two branches correspond to the two possible outcomes. If tear production rate is reduced (the left branch), the outcome is none. If it is normal (the right branch), a second test is made, this time on astigmatism. Eventually, whatever the outcome of the tests, a leaf of the tree is reached that dictates the contact lens recommendation for that case.

1.2.3 Irises: A Classic Numeric Dataset
The iris dataset, which dates back to seminal work by the eminent statistician R. A. Fisher in the mid-1930s and is arguably the most famous dataset used in data mining, contains 50 examples each of three types of plant: Iris setosa, Iris versicolor, and Iris virginica. It is excerpted in Table 1.4. There are four attributes: sepal length, sepal width, petal length, and petal width (all measured in centimeters). Unlike previous datasets, all attributes have numeric values.

Table 1.4 The Iris Data
      Sepal Length (cm)  Sepal Width (cm)  Petal Length (cm)  Petal Width (cm)  Type
1     5.1                3.5               1.4                0.2               Iris setosa
2     4.9                3.0               1.4                0.2               Iris setosa
3     4.7                3.2               1.3                0.2               Iris setosa
4     4.6                3.1               1.5                0.2               Iris setosa
5     5.0                3.6               1.4                0.2               Iris setosa
51    7.0                3.2               4.7                1.4               Iris versicolor
52    6.4                3.2               4.5                1.5               Iris versicolor
53    6.9                3.1               4.9                1.5               Iris versicolor
54    5.5                2.3               4.0                1.3               Iris versicolor
55    6.5                2.8               4.6                1.5               Iris versicolor
101   6.3                3.3               6.0                2.5               Iris virginica
102   5.8                2.7               5.1                1.9               Iris virginica
103   7.1                3.0               5.9                2.1               Iris virginica
104   6.3                2.9               5.6                1.8               Iris virginica
105   6.5                3.0               5.8                2.2               Iris virginica

The following set of rules might be learned from this dataset:

If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
If sepal width < 2.45 and petal length < 4.55 then Iris versicolor
If sepal width < 2.95 and petal width < 1.35 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.45 then Iris versicolor
If sepal length ≥ 5.85 and petal length < 4.75 then Iris versicolor
If sepal width < 2.55 and petal length < 4.95 and
   petal width < 1.55 then Iris versicolor
If petal length ≥ 2.45 and petal length < 4.95 and
   petal width < 1.55 then Iris versicolor
If sepal length ≥ 6.55 and petal length < 5.05 then Iris versicolor
If sepal width < 2.75 and petal width < 1.65 and
   sepal length < 6.05 then Iris versicolor
If sepal length ≥ 5.85 and sepal length < 5.95 and
   petal length < 4.85 then Iris versicolor
If petal length ≥ 5.15 then Iris virginica
If petal width ≥ 1.85 then Iris virginica
If petal width ≥ 1.75 and sepal width < 3.05 then Iris virginica
If petal length ≥ 4.95 and petal width < 1.55 then Iris virginica

These rules are very cumbersome; more compact rules can be expressed that convey the same information.
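To see how such numeric rules behave on concrete measurements, the sketch below encodes the fifteen rules and applies them to the fifteen rows excerpted in Table 1.4. Two assumptions are made here that the text does not state: the rules are tried in the order listed and the first one that fires decides, and an example matched by no rule is left unclassified. Only the excerpt is checked, not the full 150-example dataset.

```python
# Excerpt of Table 1.4: (sepal length, sepal width, petal length, petal width, type).
IRIS_EXCERPT = [
    (5.1, 3.5, 1.4, 0.2, "Iris setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris setosa"),
    (4.7, 3.2, 1.3, 0.2, "Iris setosa"),
    (4.6, 3.1, 1.5, 0.2, "Iris setosa"),
    (5.0, 3.6, 1.4, 0.2, "Iris setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris versicolor"),
    (6.9, 3.1, 4.9, 1.5, "Iris versicolor"),
    (5.5, 2.3, 4.0, 1.3, "Iris versicolor"),
    (6.5, 2.8, 4.6, 1.5, "Iris versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris virginica"),
    (5.8, 2.7, 5.1, 1.9, "Iris virginica"),
    (7.1, 3.0, 5.9, 2.1, "Iris virginica"),
    (6.3, 2.9, 5.6, 1.8, "Iris virginica"),
    (6.5, 3.0, 5.8, 2.2, "Iris virginica"),
]

VERSICOLOR, VIRGINICA = "Iris versicolor", "Iris virginica"

# Each rule is (test, predicted type); sl, sw, pl, pw are in centimeters.
RULES = [
    (lambda sl, sw, pl, pw: pl < 2.45, "Iris setosa"),
    (lambda sl, sw, pl, pw: sw < 2.10, VERSICOLOR),
    (lambda sl, sw, pl, pw: sw < 2.45 and pl < 4.55, VERSICOLOR),
    (lambda sl, sw, pl, pw: sw < 2.95 and pw < 1.35, VERSICOLOR),
    (lambda sl, sw, pl, pw: 2.45 <= pl < 4.45, VERSICOLOR),
    (lambda sl, sw, pl, pw: sl >= 5.85 and pl < 4.75, VERSICOLOR),
    (lambda sl, sw, pl, pw: sw < 2.55 and pl < 4.95 and pw < 1.55, VERSICOLOR),
    (lambda sl, sw, pl, pw: 2.45 <= pl < 4.95 and pw < 1.55, VERSICOLOR),
    (lambda sl, sw, pl, pw: sl >= 6.55 and pl < 5.05, VERSICOLOR),
    (lambda sl, sw, pl, pw: sw < 2.75 and pw < 1.65 and sl < 6.05, VERSICOLOR),
    (lambda sl, sw, pl, pw: 5.85 <= sl < 5.95 and pl < 4.85, VERSICOLOR),
    (lambda sl, sw, pl, pw: pl >= 5.15, VIRGINICA),
    (lambda sl, sw, pl, pw: pw >= 1.85, VIRGINICA),
    (lambda sl, sw, pl, pw: pw >= 1.75 and sw < 3.05, VIRGINICA),
    (lambda sl, sw, pl, pw: pl >= 4.95 and pw < 1.55, VIRGINICA),
]


def classify(sl, sw, pl, pw):
    """Return the type predicted by the first matching rule, or None."""
    for test, label in RULES:
        if test(sl, sw, pl, pw):
            return label
    return None


mistakes = [row for row in IRIS_EXCERPT if classify(*row[:4]) != row[4]]
print("misclassified rows in the excerpt:", mistakes)
```

Writing the rules out this way also makes their redundancy visible: many of the versicolor tests overlap, which is consistent with the remark that a more compact rule set could convey the same information.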

1.2.4 CPU Performance: Introducing Numeric Prediction
Although the iris dataset involves numeric attributes, the outcome — the type of iris — is a category, not a numeric value. Table 1.5 shows some data for which the outcome and the attributes are numeric. It concerns the relative performance of computer processing power on the basis of a number of relevant attributes; each row represents 1 of 209 different computer configurations.

The classic way of dealing with continuous prediction is to write the outcome as a linear sum of the attribute values with appropriate weights, for example:

PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX + 0.6410 CACH
      - 0.2700 CHMIN + 1.480 CHMAX

Table 1.5 The CPU Performance Data
      Cycle Time (ns)  Main Memory Min (KB)  Main Memory Max (KB)  Cache (KB)  Channels Min  Channels Max  Performance
      MYCT             MMIN                  MMAX                  CACH        CHMIN         CHMAX         PRP
1     125              256                   6000                  256         16            128           198
2     29               8000                  32000                 32          8             32            269
3     29               8000                  32000                 32          8             32            220
4     29               8000                  32000                 32          8             32            172
5     29               8000                  16000                 32          8             16            132
207   126              2000                  8000                  0           2             14            52
208   480              512                   8000                  32          0             0             67
209   480              1000                  4000                  0           0             0             45

(The abbreviated variable names are given in the second row of the table.) This is called a regression equation, and the process of determining the weights is called regression, a well-known procedure in statistics. However, the basic regression method is incapable of discovering nonlinear relationships (although variants do exist).
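Evaluating the regression equation is simply a weighted sum. The sketch below plugs the first two rows of Table 1.5 into the equation as printed; the names WEIGHTS, INTERCEPT and predict_prp are illustrative, and no claim is made here about how the weights were fitted.

```python
# Weights and intercept copied from the regression equation in the text.
INTERCEPT = -55.9
WEIGHTS = {
    "MYCT": 0.0489,
    "MMIN": 0.0153,
    "MMAX": 0.0056,
    "CACH": 0.6410,
    "CHMIN": -0.2700,
    "CHMAX": 1.480,
}


def predict_prp(config):
    """Linear sum of the attribute values with the weights above."""
    return INTERCEPT + sum(WEIGHTS[name] * config[name] for name in WEIGHTS)


# Rows 1 and 2 of Table 1.5, with the PRP value the table records for each.
row1 = {"MYCT": 125, "MMIN": 256, "MMAX": 6000, "CACH": 256, "CHMIN": 16, "CHMAX": 128}
row2 = {"MYCT": 29, "MMIN": 8000, "MMAX": 32000, "CACH": 32, "CHMIN": 8, "CHMAX": 32}

for config, actual in ((row1, 198), (row2, 269)):
    print(f"predicted {predict_prp(config):.1f}, table value {actual}")
```

The predictions differ noticeably from the recorded values for these two rows, which is a useful reminder that a linear equation fitted to all 209 configurations is an approximation for any single one.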

In the iris and central processing unit (CPU) performance data, all the attributes have numeric values. Practical situations frequently present a mixture of numeric and nonnumeric attributes.

1.2.5 Labor Negotiations: A More Realistic Example
The labor negotiations dataset in Table 1.6 summarizes the outcome of Canadian contract negotiations in 1987 and 1988. It includes all collective agreements reached in the business and personal services sector for organizations with at least 500 members (teachers, nurses, university staff, police, etc.). Each case concerns one contract, and the outcome is whether the contract is deemed acceptable or unacceptable. The acceptable contracts are ones in which agreements were accepted by both labor and management. The unacceptable ones are either known offers that fell through because one party would not accept them or acceptable contracts that had been significantly perturbed to the extent that, in the view of experts, they would not have been accepted.

Table 1.6 The Labor Negotiations Data
Attribute                        Type                          1      2      3      40
Duration                         Years                         1      2      3      2
Wage increase first year         Percentage                    2%     4%     4.3%   4.5%
Wage increase second year        Percentage                    ?      5%     4.4%   4.0%
Wage increase third year         Percentage                    ?      ?      ?      ?
Cost of living adjustment        [none, tcf, tc]               None   TCF    ?      None
Working hours per week           Hours                         28     35     38     40
Pension                          [none, ret-allw, empl-cntr]   None   ?      ?      ?
Standby pay                      Percentage                    ?      13%    ?      ?
Shift-work supplement            Percentage                    ?      5%     4%     4%
Education allowance              [yes, no]                     Yes    ?      ?      ?
Statutory holidays               Days                          11     15     12     12
Vacation                         [below-avg, avg, gen]         Avg    Gen    Gen    Avg
Long-term disability insurance   [yes, no]                     No     ?      ?      Yes
Dental plan contribution         [none, half, full]            None   ?      Full   Full
Bereavement assistance           [yes, no]                     No     ?      ?      Yes
Health plan contribution         [none, half, full]            None   ?      Full   Half
Acceptability of contract        [good, bad]                   Bad    Good   Good   Good

There are 40 examples in the dataset (plus another 17 that are normally reserved for test purposes). Unlike the other tables here, Table 1.6 presents the examples as columns rather than as rows; otherwise, it would have to be stretched over several pages. Many of the values are unknown or missing, as indicated by question marks.

This is a much more realistic dataset than the others we have seen. It contains many missing values, and it seems unlikely that an exact classification can be obtained.

Figure 1.3 shows two decision trees that represent the dataset. Figure 1.3(a) is simple and approximate: it doesn't represent the data exactly. For example, it will predict bad for some contracts that are actually marked good. But it does make intuitive sense: a contract is bad (for the employee!) if the wage increase in the first year is too small (less than 2.5 percent). If the first-year wage increase is larger than this, it is good if there are lots of statutory holidays (more than 10 days). Even if there are fewer statutory holidays, it is good if the first-year wage increase is large enough (more than 4 percent).
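The simple tree of Figure 1.3(a), as described in the paragraph above, amounts to three nested tests. The sketch below follows that verbal description; how the tree treats values exactly on a threshold and how it handles missing values are not stated in the text, so the strict inequalities and the absence of missing-value handling are assumptions of this example.

```python
def simple_tree(wage_increase_first_year, statutory_holidays):
    """Classify a contract using the tests described for Figure 1.3(a)."""
    if wage_increase_first_year < 2.5:   # first-year increase too small
        return "bad"
    if statutory_holidays > 10:          # lots of statutory holidays
        return "good"
    if wage_increase_first_year > 4.0:   # few holidays, but a large increase
        return "good"
    return "bad"


# The four contracts shown in Table 1.6:
# column number -> (first-year wage increase %, statutory holidays, recorded class).
CONTRACTS = {
    1: (2.0, 11, "bad"),
    2: (4.0, 15, "good"),
    3: (4.3, 12, "good"),
    40: (4.5, 12, "good"),
}

predictions = {col: simple_tree(wage, days) for col, (wage, days, _) in CONTRACTS.items()}
for col, (wage, days, actual) in CONTRACTS.items():
    print(f"contract {col}: predicted {predictions[col]}, recorded {actual}")
```

The tree agrees with the recorded class on these four contracts, although, as noted above, it does not represent the whole dataset exactly.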

Figure 1.3(b) is a more complex decision tree that represents the same dataset. In fact, this is a more accurate representation of the actual dataset that was used to create the tree. But it is not necessarily a more accurate representation of the underlying concept of good versus bad contracts. Look down the left branch. It doesn't seem to make sense intuitively that, if the working hours exceed 36, a contract is bad if there is no health-plan contribution or a full health-plan contribution but is good if there is a half health-plan contribution. It is certainly reasonable that the health-plan contribution plays a role in the decision but not if half is good and both full and none are bad. It seems likely that this is an artifact of the particular values used to create the decision tree rather than a genuine feature of the good versus bad distinction.

The tree in Figure 1.3(b) is more accurate on the data that was used to train the classifier but will probably perform less well on an independent set of test data. It is "overfitted" to the training data — it follows it too slavishly. The tree in Figure 1.3(a) is obtained from the one in Figure 1.3(b) by a process of pruning.

1.2.6 Soybean Classification: A Classic Machine Learning Success
An often-quoted early success story in the application of machine learning to practical problems is the identification of rules for diagnosing soybean diseases.
The data is taken from questionnaires describing plant diseases. There are about 680 examples, each representing a diseased plant. Plants were measured on 35 attributes, each one having a small set of possible values. Examples are labeled with the diagnosis of an expert in plant biology: there are 19 disease categories altogether — horrible-sounding diseases, such as diaporthe stem canker, rhizoctonia root rot, and bacterial blight, to mention just a few.

Table 1.7 gives the attributes, the number of different values that each can have, and a sample record for one particular plant. The attributes are placed into different categories just to make them easier to read.

Here are two example rules, learned from this data:

If      [leaf condition is normal and
        stem condition is abnormal and
        stem cankers is below soil line and
        canker lesion color is brown]
then
        diagnosis is rhizoctonia root rot

If      [leaf malformation is absent and
        stem condition is abnormal and
        stem cankers is below soil line and
        canker lesion color is brown]
then
        diagnosis is rhizoctonia root rot

These rules nicely illustrate the potential role of prior knowledge — often called domain knowledge — in machine learning, because the only difference between the two descriptions is leaf condition is normal versus leaf malformation is absent. In this domain, if the leaf condition is normal, then leaf malformation is necessarily absent, so one of these conditions happens to be a special case of the other. Thus, if the first rule is true, the second is necessarily true as well. The only time the second rule comes into play is when leaf malformation is absent but leaf condition is not normal — that is, when something other than malformation is wrong with the leaf. This is certainly not apparent from a casual reading of the rules.
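The role of the domain constraint can be made explicit by enumerating the leaf states it allows. The sketch below is illustrative: the value names "abnormal" and "present" are assumed as the counterparts of "normal" and "absent" (Table 1.7 gives two values for each attribute), and only the leaf conditions of the two rules are modeled, since their remaining conditions are identical.

```python
from itertools import product

LEAF_CONDITION = ("normal", "abnormal")
LEAF_MALFORMATION = ("absent", "present")


def allowed(condition, malformation):
    """Domain knowledge: a normal leaf cannot be malformed."""
    return not (condition == "normal" and malformation == "present")


def rule1_leaf_test(condition, malformation):
    return condition == "normal"


def rule2_leaf_test(condition, malformation):
    return malformation == "absent"


valid_states = [s for s in product(LEAF_CONDITION, LEAF_MALFORMATION) if allowed(*s)]

# Whenever the first rule's leaf test holds, does the second rule's hold too?
rule1_implies_rule2 = all(rule2_leaf_test(*s) for s in valid_states if rule1_leaf_test(*s))

# Leaf states in which only the second rule can fire.
only_rule2 = [s for s in valid_states if rule2_leaf_test(*s) and not rule1_leaf_test(*s)]

print("first rule is a special case of the second:", rule1_implies_rule2)
print("states where only the second rule applies:", only_rule2)
```

Without the allowed check, the enumeration alone would still show the implication here, but the constraint is what a learning system would need to be told in order to recognize that the two rules overlap.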

Table 1.7 The Soybean Data
Category      Attribute                  Number of Values  Sample Value
Environment   Time of occurrence         7                 July
              Precipitation              3                 Above normal
              Temperature                3                 Normal
              Cropping history           4                 Same as last year
              Hail damage                2                 Yes
              Damaged area               4                 Scattered
              Severity                   3                 Severe
              Plant height               2                 Normal
              Plant growth               2                 Abnormal
              Seed treatment             3                 Fungicide
              Germination                3                 Less than 80%
Seed          Condition                  2                 Normal
              Mold growth                2                 Absent
              Discoloration              2                 Absent
              Size                       2                 Normal
              Shriveling                 2                 Absent
Fruit         Condition of fruit pods    3                 Normal
              Fruit spots                5                 ---
Leaf          Condition                  2                 Abnormal
              Leaf spot size             3                 ---
              Yellow leaf spot halo      3                 Absent
              Leaf spot margins          3                 ---
              Shredding                  2                 Absent
              Leaf malformation          2                 Absent
              Leaf mildew growth         3                 Absent
Stem          Condition                  2                 Abnormal
              Stem lodging               2                 Yes
              Stem cankers               4                 Above soil line
              Canker lesion color        3                 ---
              Fruiting bodies on stems   2                 Present
              External decay of stem     3                 Firm and dry
              Mycelium on stem           2                 Absent
              Internal discoloration     3                 None
              Sclerotia                  2                 Absent
Root          Condition                  3                 Normal
Diagnosis                                19                Diaporthe stem canker

Research on this problem in the late 1970s found that these diagnostic rules could be generated by a machine learning algorithm, along with rules for every other disease category, from about 300 training examples. The examples were carefully selected from the corpus of cases as being quite different from one another — "far apart" in the example space. At the same time, the plant pathologist who had produced the diagnoses was interviewed, and his expertise was translated into diagnostic rules. Surprisingly, the computer-generated rules outperformed the expert's rules on the remaining test examples. They gave the correct disease top ranking 97.5 percent of the time compared with only 72 percent for the expert-derived rules. Furthermore, not only did the learning algorithm find rules that outperformed those of the expert collaborator, but the same expert was so impressed that he allegedly adopted the discovered rules in place of his own!





