|
|
||||||||||||||||||||
| Home > Data Management All-in-One Guides > Business intelligence > Business intelligence and related technologies > Business intelligence and text mining > Using data merging and concatenation techniques to integrate data | |
| All-in-One Guides: Business intelligence: |
|
|||||||
|
||||||||
Business intelligence and related technologies
![]() Business intelligence and text mining
|
||
Table of contents:
2.7 Data Integration The data necessary to compile the mining view usually comes from many different tables. The rollup and summarization operations described in the last two sections can be performed on the data coming from each of these data sources independently. Finally, we would be required to assemble all these segments in one mining view. The most used assembly operations are merging and concatenation . Merging is used to collect data for the same key variable (e.g., customer ID) from different sources. Concatenation is used to assemble different portions of the same data fields for different segments of the key variable. It is most useful when preparing the scoring view with a very large number of observations (many millions). In this case, it is more efficient to partition the data into smaller segments, prepare each segment, and finally concatenate them together.
2.7.1 Merging Because of the requirement that the mining view have a unique record per category of key variable, most merging operations required to integrate different pieces of the mining view are of the type called match-merge with nonmatched observations. We demonstrate this type of merging with a simple example.
We start with two datasets, Left and Right, as shown in Table 2.6 . The two tables can be joined using the MERGE – BY commands within a DATA step operation as follows:
DATA Left;
DATA Right;
INPUT ID Balance Status $;
Table 2.6 Two Sample Tables: Left and Right
Table 2.7 Result of Merging: Dataset Both
DATA Both; MERGE Left Right; BY ID; RUN; PROC PRINT DATA = Both; RUN; The result of the merging is the dataset Both given in Table 2.7, which shows that the MERGE-BY commands did merge the two datasets as needed using ID as the key variable. We also notice that the common file Status was overwritten by values from the Right dataset. Therefore, we have to be careful about this possible side effect. In most practical cases, common fields should have identical values. In our case, where the variable represented some customer designation status (Gold or Silver), the customer should have had the same status in different datasets. Therefore, checking these status values should be one of the data integrity tests to be performed before performing the merging. Merging datasets using this technique is very efficient. It can be used with more than two datasets as long as all the datasets in the MERGE statement have the common variable used in the BY statement. The only possible difficulty is that SAS requires that all the datasets be sorted by the BY variable. Sorting very large datasets can sometimes be slow. You have probably realized by now that writing a general macro to merge a list of datasets using an ID variable is a simple task. Assuming that all the datasets have been sorted using ID before attempting to merge them, the macro would simply be given as follows:
%macro MergeDS(List, IDVar, ALL); Finally, calling this macro to merge the two datasets in Table 2.6 would simply be as follows:
%let List = Left Right;
2.7.2 Concatenation Performing concatenation in SAS is straightforward. We list the datasets to be concatenated in a SET statement within the destination dataset. This is illustrated in the following example.
Example 2.2 We then use the following code to implement the concatenation of the two datasets into a new dataset:
DATA TOP;
Table 2.8 Table: Top
Table 2.9 Table: Bottom
DATA BOTTOM; The resulting dataset is shown in Table 2.10 . As in the case of merging datasets, we may include a list of several datasets in the SET statement to concatenate. The resulting dataset will contain all the records of the contributing datasets in the same order in which they appear in the SET statement.
Table 2.10 Table: Both
The preceding process can be packed into the following macro:
%macro ConcatDS(List, ALL); To use this macro to achieve the same result as in the previous example, we use the following calling code:
%let List = TOP BOTTOM; More on data mining:
|
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| About Us | Contact Us | For Advertisers | For Business Partners | Site Index | RSS |
|
|
|
|||||||