Business intelligence and related technologies
Business intelligence and text mining
1.2 Simple Examples: The Weather Problem and Others
We use a lot of examples in this book, which seems particularly appropriate considering that the book is all about learning from examples! There are several standard datasets that we will come back to repeatedly. Different datasets tend to expose new issues and challenges, and it is interesting and instructive to have in mind a variety of problems when considering learning methods. In fact, the need to work with different datasets is so important that a corpus containing around 100 example problems has been gathered together so that different algorithms can be tested and compared on the same set of problems.
The illustrations used here are all unrealistically simple. Serious application of data mining involves thousands, hundreds of thousands, or even millions of individual cases. But when explaining what algorithms do and how they work, we need simple examples that capture the essence of the problem but are small enough to be comprehensible in every detail. The illustrations we will be working with are intended to be "academic" in the sense that they will help us to understand what is going on. Some actual fielded applications of learning techniques are discussed in Section 1.3, and many more are covered in the books mentioned in the Further Reading section at the end of the chapter.
Another problem with actual real-life datasets is that they are often proprietary. No corporation is going to share its customer and product choice database with you so that you can understand the details of its data mining application and how it works. Corporate data is a valuable asset, one whose value has increased enormously with the development of data mining techniques such as those described in this book. Yet we are concerned here with understanding how the methods used for data mining work and understanding the details of these methods so that we can trace their operation on actual data. That is why our illustrations are simple ones. But they are not simplistic: they exhibit the features of real datasets.
1.2.1 The Weather Problem
In its simplest form, shown in Table 1.2, all four attributes have values that are symbolic categories rather than numbers. Outlook can be sunny, overcast, or rainy; temperature can be hot, mild, or cool; humidity can be high or normal; and windy can be true or false. This creates 36 possible combinations (3 × 3 × 2 × 2 = 36), of which 14 are present in the set of input examples. A set of rules learned from this information (not necessarily a very good one) might look as follows:
These rules are meant to be interpreted in order: the first one; then, if it doesn't apply, the second; and so on. A set of rules intended to be interpreted in sequence is called a decision list. Interpreted as a decision list, the rules correctly classify all of the examples in the table, whereas taken individually, out of context, some of the rules are incorrect. For example, the rule "if humidity = normal then play = yes" gets one of the examples wrong (check which one). The meaning of a set of rules depends on how it is interpreted, not surprisingly!
In the slightly more complex form shown in Table 1.3, two of the attributes, temperature and humidity, have numeric values. This means that any learning method must create inequalities involving these attributes rather than simple equality tests, as in the former case. This is called a numeric-attribute problem; in this case, a mixed-attribute problem because not all attributes are numeric. Now the first rule given earlier might take the following form:
If outlook = sunny and humidity > 83 then play = no
A slightly more complex process is required to come up with rules that involve numeric tests.
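The decision-list interpretation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the book's own listing: neither the table nor the rule set is reproduced in this excerpt, so the 14 rows below follow the widely distributed nominal form of the weather data, and the five rules are a hypothetical decision list chosen to be consistent with the two rules the text quotes. The names `Day`, `WEATHER`, `RULES`, and `classify` are invented for the example.

```python
from collections import namedtuple
from itertools import product

Day = namedtuple("Day", "outlook temperature humidity windy play")

# ASSUMPTION: rows follow the commonly distributed nominal weather data;
# the table itself is not reproduced in this excerpt.
WEATHER = [Day(*row) for row in [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]]

# ASSUMPTION: a hypothetical decision list, tried strictly in order.
RULES = [
    (lambda d: d.outlook == "sunny" and d.humidity == "high", "no"),
    (lambda d: d.outlook == "rainy" and d.windy, "no"),
    (lambda d: d.outlook == "overcast", "yes"),
    (lambda d: d.humidity == "normal", "yes"),
    (lambda d: True, "yes"),  # default when nothing above applies
]

def classify(day, rules=RULES):
    """Return the outcome of the first rule whose condition holds."""
    for condition, outcome in rules:
        if condition(day):
            return outcome
    raise ValueError("no rule applies")

# Errors made by the whole list, read in order.
list_errors = [d for d in WEATHER if classify(d) != d.play]

# Errors made by the humidity rule taken alone, out of context.
solo_errors = [d for d in WEATHER if d.humidity == "normal" and d.play != "yes"]

# Size of the full attribute space: 3 x 3 x 2 x 2.
n_combinations = len(list(product(
    ["sunny", "overcast", "rainy"], ["hot", "mild", "cool"],
    ["high", "normal"], [True, False])))

print(len(list_errors), len(solo_errors), n_combinations)
```

Read in order, the list makes no errors on these rows, while the humidity rule on its own misclassifies exactly one of them, because the earlier windy rule no longer catches that case first.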
Table 1.2 The Weather Data
The rules we have seen so far are classification rules: they predict the classification of the example in terms of whether or not to play. It is equally possible to disregard the classification and just look for any rules that strongly associate different attribute values. These are called association rules. Many association rules can be derived from the weather data in Table 1.2. Some good ones are as follows:
Table 1.3 Weather Data with Some Numeric Attributes
All these rules are 100 percent correct on the given data; they make no false predictions. The first two apply to four examples in the dataset, the third to three examples, and the fourth to two examples. There are many other rules: in fact, nearly 60 association rules can be found that apply to two or more examples of the weather data and are completely correct on this data. If you look for rules that are less than 100 percent correct, then you will find many more. There are so many because unlike classification rules, association rules can "predict" any of the attributes, not just a specified class, and can even predict more than one thing. For example, the fourth rule predicts both that outlook will be sunny and that humidity will be high.
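To make the counts in this paragraph concrete, here is a minimal sketch that measures how many examples an association rule applies to and how often it is correct. Both the data and the rules are assumptions: the table and the rule listing are missing from this excerpt, so the rows follow the widely distributed nominal weather data, and the four rules are illustrative choices that happen to match the coverage figures quoted above (four, four, three, and two examples). `evaluate` is a hypothetical helper name.

```python
from collections import namedtuple

Day = namedtuple("Day", "outlook temperature humidity windy play")

# ASSUMPTION: rows follow the commonly distributed nominal weather data.
WEATHER = [Day(*row) for row in [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]]

def matches(day, tests):
    """True if every attribute = value test in `tests` holds for `day`."""
    return all(getattr(day, name) == value for name, value in tests.items())

def evaluate(antecedent, consequent, data=WEATHER):
    """Return (examples the rule applies to, fraction of those it gets right)."""
    covered = [d for d in data if matches(d, antecedent)]
    correct = [d for d in covered if matches(d, consequent)]
    accuracy = len(correct) / len(covered) if covered else None
    return len(covered), accuracy

# ASSUMPTION: illustrative rules; any attribute may appear on the right-hand
# side, and the last rule predicts two attributes at once.
RULES = [
    ({"temperature": "cool"}, {"humidity": "normal"}),
    ({"humidity": "normal", "windy": False}, {"play": "yes"}),
    ({"outlook": "sunny", "play": "no"}, {"humidity": "high"}),
    ({"windy": False, "play": "no"}, {"outlook": "sunny", "humidity": "high"}),
]

results = [evaluate(antecedent, consequent) for antecedent, consequent in RULES]
for (antecedent, consequent), (covered, accuracy) in zip(RULES, results):
    print(antecedent, "=>", consequent, "| applies to", covered, "| accuracy", accuracy)
```

Because the consequent is just another set of attribute tests, the same function handles rules that predict the class, rules that predict some other attribute, and rules that predict several attributes together.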
1.2.2 Contact Lenses: An Idealized Problem
The first column of Table 1.1 gives the age of the patient. In case you're wondering, presbyopia is a form of longsightedness that accompanies the onset of middle age. The second gives the spectacle prescription: myope means shortsighted and hypermetrope means longsighted. The third shows whether the patient is astigmatic, and the fourth relates to the rate of tear production, which is important in this context because tears lubricate contact lenses. The final column shows which kind of lenses to prescribe: hard, soft, or none. All possible combinations of the attribute values are represented in the table.
A sample set of rules learned from this information is shown in Figure 1.1. This is a large set of rules, but they do correctly classify all the examples. These rules are complete and deterministic: they give a unique prescription for every conceivable example. Generally, this is not the case. Sometimes there are situations in which no rule applies; other times more than one rule may apply, resulting in conflicting recommendations. Sometimes probabilities or weights may be associated with the rules themselves to indicate that some are more important, or more reliable, than others. You might be wondering whether there is a smaller rule set that performs as well. If so, would you be better off using the smaller rule set and, if so, why? These are exactly the kinds of questions that will occupy us in this book.
Because the examples form a complete set for the problem space, the rules do no more than summarize all the information that is given, expressing it in a different and more concise way. Even though it involves no generalization, this is often a useful thing to do! People frequently use machine learning techniques to gain insight into the structure of their data rather than to make predictions for new cases. In fact, a prominent and successful line of research in machine learning began as an attempt to compress a huge database of possible chess endgames and their outcomes into a data structure of reasonable size. The data structure chosen for this enterprise was not a set of rules, but a decision tree.
Figure 1.1. Rules for the contact lenses data
Figure 1.2. Decision tree for the contact lenses data
Figure 1.2 presents a structural description for the contact lens data in the form of a decision tree, which for many purposes is a more concise and perspicuous representation of the rules and has the advantage that it can be visualized more easily. (However, this decision tree, in contrast to the rule set given in Figure 1.1, classifies two examples incorrectly.) The tree calls first for a test on tear production rate, and the first two branches correspond to the two possible outcomes. If tear production rate is reduced (the left branch), the outcome is none. If it is normal (the right branch), a second test is made, this time on astigmatism. Eventually, whatever the outcome of the tests, a leaf of the tree is reached that dictates the contact lens recommendation for that case.
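One way to picture how such a tree is used is as a nested data structure that is walked from the root until a leaf is reached. In the sketch below only the top two tests (tear production rate, then astigmatism) and the "reduced gives none" leaf come from the description above; the figure is not reproduced in this excerpt, so the lower branches and the three age values are assumptions added to make the example runnable. `TREE`, `classify`, and `ATTRIBUTE_VALUES` are hypothetical names.

```python
from itertools import product

# A node is (attribute, {value: subtree or leaf}); a leaf is a plain string.
TREE = ("tear_production_rate", {
    "reduced": "none",                          # stated in the text
    "normal": ("astigmatism", {                 # stated in the text
        "no": "soft",                           # ASSUMPTION
        "yes": ("spectacle_prescription", {     # ASSUMPTION
            "myope": "hard",
            "hypermetrope": "none",
        }),
    }),
})

def classify(example, node=TREE):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

# ASSUMPTION: the age categories are illustrative; the other values are
# the ones named in the text.
ATTRIBUTE_VALUES = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "spectacle_prescription": ["myope", "hypermetrope"],
    "astigmatism": ["no", "yes"],
    "tear_production_rate": ["reduced", "normal"],
}

# Enumerate every combination of attribute values and classify each one.
all_examples = [dict(zip(ATTRIBUTE_VALUES, combo))
                for combo in product(*ATTRIBUTE_VALUES.values())]
recommendations = [classify(example) for example in all_examples]

print(len(all_examples), "combinations,",
      recommendations.count("none"), "of them mapped to 'none'")
```

Every combination reaches exactly one leaf, which is what makes a tree complete and deterministic by construction; whether each leaf is the right recommendation is a separate question, as the two misclassified examples mentioned above show.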
1.2.3 Irises: A Classic Numeric Dataset
Table 1.4 The Iris Data
The following set of rules might be learned from this dataset:
If petal length < 2.45 then Iris setosa
These rules are very cumbersome; more compact rules can be expressed that convey the same information.
1.2.4 CPU Performance: Introducing Numeric Prediction
The classic way of dealing with continuous prediction is to write the outcome as a linear sum of the attribute values with appropriate weights, for example:
Table 1.5 The CPU Performance Data
(The abbreviated variable names are given in the second row of the table.) This is called a regression equation, and the process of determining the weights is called regression, a well-known procedure in statistics. However, the basic regression method is incapable of discovering nonlinear relationships (although variants do exist).
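As a rough illustration of what "a linear sum of the attribute values with appropriate weights" means in practice, the sketch below fits a one-attribute least-squares line and evaluates a weighted sum. It does not use the CPU performance data or the book's weights, neither of which appears in this excerpt; the numbers are synthetic and the function names `fit_line` and `predict` are invented for the example.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single numeric attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(intercept, weights, attributes):
    """Regression equation: intercept plus the weighted sum of attribute values."""
    return intercept + sum(w * a for w, a in zip(weights, attributes))

# SYNTHETIC data generated from y = 1 + 2x, so the fit recovers those weights.
linear_fit = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# SYNTHETIC data from y = x**2 on a symmetric range: the best straight line
# is flat, so the relationship is invisible to a purely linear model.
curved_fit = fit_line([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(linear_fit, curved_fit, predict(linear_fit[0], [linear_fit[1]], [5]))
```

The second fit has a slope of zero even though the outcome depends entirely on the attribute, which is a small-scale picture of why basic regression cannot discover nonlinear relationships.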
In the iris and central processing unit (CPU) performance data, all the attributes have numeric values. Practical situations frequently present a mixture of numeric and nonnumeric attributes.
1.2.5 Labor Negotiations: A More Realistic Example
Table 1.6 The Labor Negotiations Data
There are 40 examples in the dataset (plus another 17 that are normally reserved for test purposes). Unlike the other tables here, Table 1.6 presents the examples as columns rather than as rows; otherwise, it would have to be stretched over several pages. Many of the values are unknown or missing, as indicated by question marks.
This is a much more realistic dataset than the others we have seen. It contains many missing values, and it seems unlikely that an exact classification can be obtained.
Figure 1.3 shows two decision trees that represent the dataset. Figure 1.3(a) is simple and approximate: it doesn't represent the data exactly. For example, it will predict bad for some contracts that are actually marked good. But it does make intuitive sense: a contract is bad (for the employee!) if the wage increase in the first year is too small (less than 2.5 percent). If the first-year wage increase is larger than this, it is good if there are lots of statutory holidays (more than 10 days). Even if there are fewer statutory holidays, it is good if the first-year wage increase is large enough (more than 4 percent).
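The simple tree of Figure 1.3(a), as described in the paragraph above, translates directly into nested tests. The sketch below is one reading of that description: the thresholds come from the text, while the handling of values exactly on a threshold, the "bad" outcome for the remaining case, and the function and parameter names are assumptions. Missing values, which the text says are common in this dataset, are not handled.

```python
def classify_contract(wage_increase_first_year, statutory_holidays):
    """Classify a contract as 'good' or 'bad' following Figure 1.3(a) as described.

    wage_increase_first_year is a percentage; statutory_holidays is in days.
    """
    # Too small a first-year wage increase makes the contract bad outright.
    # ASSUMPTION: exactly 2.5 percent is treated as too small.
    if wage_increase_first_year <= 2.5:
        return "bad"
    # With an adequate increase, lots of statutory holidays make it good.
    if statutory_holidays > 10:
        return "good"
    # Fewer holidays can still be offset by a large enough increase.
    if wage_increase_first_year > 4:
        return "good"
    # ASSUMPTION: the remaining case is classified as bad.
    return "bad"

# Hypothetical contracts: (first-year wage increase in percent, holiday days)
examples = [(2.0, 12), (3.0, 11), (4.5, 9), (3.0, 9)]
labels = [classify_contract(wage, holidays) for wage, holidays in examples]
print(labels)
```

Note that the first-year wage increase is tested twice along one path, which is normal for numeric attributes in a decision tree.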
Figure 1.3(b) is a more complex decision tree that represents the same dataset. In fact, this is a more accurate representation of the actual dataset that was used to create the tree. But it is not necessarily a more accurate representation of the underlying concept of good versus bad contracts. Look down the left branch. It doesn't seem to make sense intuitively that, if the working hours exceed 36, a contract is bad if there is no health-plan contribution or a full health-plan contribution but is good if there is a half health-plan contribution. It is certainly reasonable that the health-plan contribution plays a role in the decision but not if half is good and both full and none are bad. It seems likely that this is an artifact of the particular values used to create the decision tree rather than a genuine feature of the good versus bad distinction.
The tree in Figure 1.3(b) is more accurate on the data that was used to train the classifier but will probably perform less well on an independent set of test data. It is "overfitted" to the training data — it follows it too slavishly. The tree in Figure 1.3(a) is obtained from the one in Figure 1.3(b) by a process of pruning.
1.2.6 Soybean Classification: A Classic Machine Learning Success
Table 1.7 gives the attributes, the number of different values that each can have, and a sample record for one particular plant. The attributes are placed into different categories just to make them easier to read.
Here are two example rules, learned from this data:
If [leaf condition is normal and
These rules nicely illustrate the potential role of prior knowledge (often called domain knowledge) in machine learning, because the only difference between the two descriptions is leaf condition is normal versus leaf malformation is absent. In this domain, if the leaf condition is normal, then leaf malformation is necessarily absent, so one of these conditions happens to be a special case of the other. Thus, if the first rule is true, the second is necessarily true as well. The only time the second rule comes into play is when leaf malformation is absent but leaf condition is not normal, that is, when something other than malformation is wrong with the leaf. This is certainly not apparent from a casual reading of the rules.
Table 1.7 The Soybean Data
Research on this problem in the late 1970s found that these diagnostic rules could be generated by a machine learning algorithm, along with rules for every other disease category, from about 300 training examples. The examples were carefully selected from the corpus of cases as being quite different from one another, that is, "far apart" in the example space. At the same time, the plant pathologist who had produced the diagnoses was interviewed, and his expertise was translated into diagnostic rules. Surprisingly, the computer-generated rules outperformed the expert's rules on the remaining test examples. They gave the correct disease top ranking 97.5 percent of the time compared with only 72 percent for the expert-derived rules. Furthermore, not only did the learning algorithm find rules that outperformed those of the expert collaborator, but the same expert was so impressed that he allegedly adopted the discovered rules in place of his own!