Unknown data is data that is buried in all of the operational data available to organizational decision-makers.
Significant data often goes undetected because most data is captured and maintained by a single department. Data that seems irrelevant or uninteresting at the department level may yield insights and reveal patterns that matter at the organizational level, such as customer behavior and purchasing patterns, the effectiveness of sales promotions, signs of fraud, levels of risk, and the quality of service or products.
Consolidating and Cleansing Data
Data is often located on several different systems, in different formats and structures, and may even be redundant. This data provides no real value to business users without some method of accessing it. The data warehouse is a resource for consolidating and cleansing data, and it supports analysis much more effectively than regular flat files or operational databases.
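The consolidation and cleansing described above can be illustrated with a minimal sketch. The two departmental layouts, the field names, and the cleansing rules (trimming whitespace, unifying letter case, dropping duplicate customer IDs) are all assumptions made for illustration, not a description of any particular warehouse product.

```python
# Hypothetical records from two departmental systems with different layouts.
sales_records = [
    {"cust_id": "C001", "name": "Ann Lee ", "email": "ANN@EXAMPLE.COM"},
    {"cust_id": "C002", "name": "Bob Ray", "email": "bob@example.com"},
]
support_records = [
    {"customer": "c001", "full_name": "ann lee", "mail": "ann@example.com"},
    {"customer": "C003", "full_name": "Cy Day", "mail": "cy@example.com"},
]


def normalize(cust_id, name, email):
    """Cleanse one record: trim whitespace and unify letter case."""
    return {
        "cust_id": cust_id.strip().upper(),
        "name": name.strip().title(),
        "email": email.strip().lower(),
    }


def consolidate(sales, support):
    """Map both layouts onto one schema and drop redundant customers."""
    merged = {}
    for r in sales:
        rec = normalize(r["cust_id"], r["name"], r["email"])
        merged.setdefault(rec["cust_id"], rec)  # keep the first copy seen
    for r in support:
        rec = normalize(r["customer"], r["full_name"], r["mail"])
        merged.setdefault(rec["cust_id"], rec)
    return sorted(merged.values(), key=lambda rec: rec["cust_id"])


warehouse = consolidate(sales_records, support_records)
for row in warehouse:
    print(row)
```

Customer C001 appears in both systems under different spellings; after cleansing, the two copies collapse into one consolidated record.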
Three steps are needed to identify and use hidden information:
- Captured data must be incorporated into other data
- Information must be specially organized to simplify decision-making
- Data must be analyzed or mined for valuable information.
Data Mining Techniques
Several analysis methodologies are used in data mining operations, including:
- Classification
- Association
- Sequence-based
- Clustering
- Estimation.
Classification is perhaps the most often employed data mining technique. It uses a set of pre-classified examples to develop a model that can classify the population of records at large. A classification algorithm begins with a sample set of pre-classified example transactions and determines the set of parameters required for proper identification. Once an effective classifier is developed, it is used in a predictive model to classify new records automatically into these same predefined classes.
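As a rough illustration of the train-then-classify flow described above, the sketch below uses a nearest-centroid classifier, chosen here only because it is short; the text does not name a specific algorithm. The feature meanings (monthly spend, years as a customer) and the class labels are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(examples):
    """Learn one mean feature vector (centroid) per predefined class."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(centroids, features):
    """Assign a new record to the class whose centroid is closest."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical pre-classified sample: (monthly spend, years as customer).
training = [
    ((5.0, 1.0), "low_value"),
    ((6.0, 2.0), "low_value"),
    ((50.0, 10.0), "high_value"),
    ((60.0, 12.0), "high_value"),
]

model = train_centroids(training)
print(classify(model, (8.0, 2.0)))
print(classify(model, (45.0, 9.0)))
```

The learned centroids play the role of the "set of parameters" in the text: once they are computed from the sample, new records are classified without further reference to the training data.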
Association is an operation performed against a set of records: a collection of items and a set of transactions, each of which contains some number of items from the collection. The operation returns the ‘affinities’ that exist among the items. ‘Market basket’ analysis is a common application, used by retailers to determine affinities among the items that shoppers buy together. Association tools discover rules based on items that occur together in a given event or transaction.
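A minimal market-basket sketch follows, assuming the common support and confidence measures for pairwise rules; the thresholds, the function name pair_rules, and the sample baskets are illustrative assumptions. Production association tools typically handle larger item sets and use more efficient algorithms.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) rules for item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: of the baskets with lhs, how many also hold rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical transactions, one list of items per basket.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
]

rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

In this sample every basket containing butter also contains bread, so the rule "butter -> bread" has a confidence of 1.0, while the pair (eggs, milk) falls below the support threshold and yields no rule.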
Sequence-based analysis is often used as a variation of the association technique when there is additional information to tie together a sequence of purchases. An account number, a credit card, or a frequent shopper number, for example, can all be used to track multiple purchases in a time series. Sequence-based mining can be used to detect the set of customers associated with frequent buying patterns.
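The idea can be sketched by grouping purchases by account number, ordering each account's purchases in time, and recording which accounts exhibit each "A then B" pattern. The tuple layout (account, day, item), the restriction to consecutive purchases, and the min_customers threshold are assumptions made to keep the example small.

```python
from collections import defaultdict


def frequent_sequences(purchases, min_customers=2):
    """Map each (first, second) purchase pattern to the accounts showing it."""
    history = defaultdict(list)
    for account, day, item in purchases:
        history[account].append((day, item))

    supporters = defaultdict(set)
    for account, events in history.items():
        ordered = [item for _, item in sorted(events)]  # time order
        for first, second in zip(ordered, ordered[1:]):
            supporters[(first, second)].add(account)

    # Keep only patterns shared by enough distinct customers.
    return {
        pattern: accounts
        for pattern, accounts in supporters.items()
        if len(accounts) >= min_customers
    }


# Hypothetical purchase log: (account number, day, item).
purchases = [
    ("A1", 1, "printer"), ("A1", 5, "ink"),
    ("A2", 2, "printer"), ("A2", 9, "ink"), ("A2", 12, "paper"),
    ("A3", 3, "ink"), ("A3", 4, "printer"),
]

patterns = frequent_sequences(purchases)
print(patterns)
```

Unlike plain association, order matters here: account A3 bought the same two items in the reverse order and therefore does not support the "printer then ink" pattern.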
Clustering segments a database into different groups. The goal is to find groups that differ from one another and whose members are similar to each other. The clustering approach assigns records with a large number of attributes into a relatively small set of groups, or “segments”. This assignment process is performed automatically by clustering algorithms that identify the distinguishing characteristics of the data set and then partition the space defined by the data set attributes along natural “boundaries.” There is no need to identify in advance the groupings desired or the attributes that should be used to segment the data set.
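One widely used clustering algorithm is k-means, sketched below as an example; the text does not prescribe it. Note that k-means still requires the number of segments k to be chosen up front, although the groupings themselves are discovered automatically. Seeding the centroids with the first k points is a simplification made here so the result is deterministic; the sample points are hypothetical.

```python
import math


def k_means(points, k, iterations=20):
    """Partition points into k segments; return (assignment, centroids)."""
    # Simplified, deterministic seeding: use the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each record to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Hypothetical records with two numeric attributes each.
points = [
    (1.0, 1.0), (9.0, 9.0), (1.5, 2.0),
    (8.0, 8.5), (1.0, 0.5), (9.5, 8.0),
]

assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

No class labels are supplied: the two segments emerge from the attribute values alone, which is the key difference from classification.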
Estimation is a variation on the classification technique. It involves the generation of scores along various dimensions in the data. Rather than employing a binary classifier to determine whether a loan applicant, for instance, is approved or classified as a risk, the estimation approach generates a credit-worthiness ‘score’ based on a pre-scored sample set of transactions. That is, sample data (complete records of approved and risk applicants) serve as the reference for determining the worthiness of all records in a data set.
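A minimal way to sketch score generation from a pre-scored sample is to average the scores of the most similar samples (a k-nearest-neighbor estimate). This is one possible technique, not necessarily the one the text has in mind; the feature meanings (income in thousands, debt ratio in percent) and the score values are hypothetical.

```python
import math


def estimate_score(scored_samples, applicant, k=3):
    """Average the scores of the k pre-scored samples nearest the applicant."""
    nearest = sorted(
        scored_samples,
        key=lambda sample: math.dist(sample[0], applicant),
    )[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Hypothetical pre-scored sample: ((income in thousands, debt ratio %), score).
samples = [
    ((30, 40), 520),
    ((35, 35), 560),
    ((60, 20), 700),
    ((65, 15), 740),
    ((90, 10), 800),
]

score = estimate_score(samples, (62, 18), k=2)
print(score)
```

The output is a continuous score rather than an approve-or-risk label; a decision threshold can still be applied to the score afterward if a binary outcome is needed.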
These data mining techniques may be used individually or together, and often two or three methods are combined to refine and analyze data more completely.