
Data Mining - Data Reduction, Study notes of Data Mining

This document covers Cluster or Stratified Sampling; Sampling with or without Replacement; Data Reduction Method (4): Sampling; Data Reduction Method (3): Clustering; Regression Analysis and Log-Linear Models; and Multiple Regression.


CHAPTER 2: Data Preprocessing

• Why preprocess the data?
• Data cleaning
• Data integration and transformation
• Data reduction
• Discretization and concept hierarchy generation
• Summary


Data Reduction Strategies

• Why data reduction?
  • A database/data warehouse may store terabytes of data
  • Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
  • Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Data reduction strategies
  • Data cube aggregation
  • Dimensionality reduction, e.g., remove unimportant attributes
  • Data compression
  • Numerosity reduction, e.g., fit data into models
  • Discretization and concept hierarchy generation

Attribute Subset Selection

• Feature selection (i.e., attribute subset selection):
  • Select a minimum set of features such that the probability distribution of the different classes, given the values for those features, is as close as possible to the original distribution given the values of all features
  • Reduces the number of attributes appearing in the discovered patterns, making the patterns easier to understand
• Heuristic methods (needed because of the exponential number of choices; a minimal sketch follows this list):
  • Step-wise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set; each subsequent step adds the best of the remaining original attributes to the set.


• Step-wise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
• Combining forward selection and backward elimination: A combination of the two approaches above, so that at each step the procedure selects the best attribute and removes the worst from among the remaining attributes.
• Decision-tree induction: Constructs a flowchart-like structure in which each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. When decision-tree induction is used for attribute subset selection, a tree is constructed from the given data; all attributes that do not appear in the tree are assumed to be irrelevant, and the set of attributes appearing in the tree forms the reduced subset of attributes.
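
A minimal sketch of step-wise forward selection, assuming scikit-learn is available; the classifier, scorer, and stopping rule here are illustrative choices, not part of the original notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, max_features):
    """Greedily grow the reduced set, adding the best remaining attribute."""
    remaining = list(range(X.shape[1]))
    selected = []
    while remaining and len(selected) < max_features:
        # Score each candidate attribute when added to the current reduced set
        scores = {j: cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = load_iris(return_X_y=True)
print(forward_selection(X, y, max_features=2))  # indices of the chosen attributes
```

Step-wise backward elimination would run the same loop in reverse: start from the full attribute set and repeatedly drop the attribute whose removal hurts the score least.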

Heuristic Feature Selection Methods

• There are 2^d possible sub-features of d features
• Several heuristic feature selection methods:
  • Best single features under the feature independence assumption: choose by significance tests (see the sketch after this list)
  • Best step-wise feature selection:
    • The best single feature is picked first
    • Then the next best feature conditioned on the first, ...
  • Step-wise feature elimination:
    • Repeatedly eliminate the worst feature
  • Best combined feature selection and elimination
  • Optimal branch and bound:
    • Use feature elimination and backtracking
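
A hedged sketch of the significance-test approach, ranking each feature independently with scikit-learn's ANOVA F-test; the data set and k = 2 are arbitrary illustrations:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(selector.get_support(indices=True))  # indices of the 2 highest-scoring features
print(selector.scores_)                    # per-feature F statistics
```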


Data Compression

• String compression (see the sketch after this list)
  • There are extensive theories and well-tuned algorithms
  • Typically lossless
  • But only limited manipulation is possible without expansion
• Audio/video compression
  • Typically lossy compression, with progressive refinement
  • Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
• Time sequences are not audio
  • Typically short, and vary slowly with time
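
A minimal sketch of lossless string compression using Python's standard zlib module; the sample text and output size are illustrative:

```python
import zlib

text = b"data reduction " * 100
packed = zlib.compress(text)

assert zlib.decompress(packed) == text        # lossless: exact round trip
print(len(text), "->", len(packed), "bytes")  # e.g., 1500 -> roughly 40 bytes
```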

Dimensionality Reduction: Wavelet Transformation

In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. Two effective methods of lossy dimensionality reduction are:

• Principal components analysis
• Discrete wavelet transform (DWT): linear signal processing, multi-resolutional analysis
  • Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
  • Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space

[Figure: Haar-2 and Daubechies wavelet basis functions]


• Method: The general procedure for applying a discrete wavelet transform uses a hierarchical pyramid algorithm that halves the data at each iteration, resulting in fast computational speed. The method is as follows:
  • The length L of the input data vector must be an integer power of 2 (pad with 0's when necessary)
  • Each transform has two functions: smoothing and difference
  • The functions are applied to pairs of data points, resulting in two sets of data of length L/2
  • The two functions are applied recursively until the desired length is reached
• Application: Compression of fingerprint images, computer vision, analysis of time-series data
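
A hedged sketch of the pyramid algorithm for the Haar wavelet, with normalization constants omitted for clarity; the input vector is an arbitrary example:

```python
def haar_dwt(data):
    """data length must be a power of 2; returns the full coefficient vector."""
    coeffs = []
    while len(data) > 1:
        # Smoothing (pairwise averages) and difference (pairwise half-differences)
        smooth = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
        detail = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])]
        coeffs = detail + coeffs  # keep the difference coefficients
        data = smooth             # recurse on the smoothed half (length L/2)
    return data + coeffs          # overall average followed by all details

print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
# Compressed approximation: keep only the largest-magnitude coefficients.
```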


Dimensionality Reduction: Principal Component Analysis (PCA)

• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (the principal components) that can best be used to represent the data
• Steps
  • Normalize the input data: each attribute falls within the same range
  • Compute k orthonormal (unit) vectors, i.e., the principal components
  • Each input data vector is a linear combination of the k principal component vectors
  • The principal components are sorted in order of decreasing "significance" or strength
  • Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only
• Used when the number of dimensions is large
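
A hedged numpy-only sketch of the steps above, using an eigendecomposition of the covariance matrix (an SVD would work equally well); the random data and k = 2 are illustrative:

```python
import numpy as np

def pca_reduce(X, k):
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # attribute covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]       # sort by decreasing "strength"
    top_k = eigvecs[:, order[:k]]           # keep the k strongest components
    return Xc @ top_k                       # project onto them: N x k output

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca_reduce(X, k=2).shape)             # (100, 2)
```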

[Figure: scatter plot of 2-D data with its principal component axes, illustrating principal component analysis]

Data Reduction Method (1): Regression and Log-Linear Models

• Linear regression: Data are modeled to fit a straight line
  • Often uses the least-squares method to fit the line
• Multiple regression: Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear model: Approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models

• Linear regression: Y = w X + b
  • Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
  • Apply the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
• Multiple regression: Y = b0 + b1 X1 + b2 X2
  • Many nonlinear functions can be transformed into the above
• Log-linear models:
  • The multi-way table of joint probabilities is approximated by a product of lower-order tables
  • Probability: p(a, b, c, d) = αab βac χad δbcd
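
A minimal sketch of a least-squares fit of Y = w X + b with numpy; the five data points are synthetic illustrations:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# polyfit with deg=1 solves for the least-squares line through the points
w, b = np.polyfit(x, y, deg=1)
print(f"Y = {w:.2f} X + {b:.2f}")  # Y = 1.95 X + 0.15
```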

Data Reduction Method (3): Clustering

• Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter)
• Can be very effective if the data is clustered, but not if the data is "smeared"
• Clustering can be hierarchical and stored in multi-dimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms
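
A hedged sketch of clustering-based reduction with scikit-learn's KMeans: 1,000 points are replaced by three centroids plus a per-cluster diameter; the data and cluster count are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
for c in range(3):
    members = X[km.labels_ == c]
    centroid = km.cluster_centers_[c]
    # Approximate the diameter as twice the farthest member's distance
    diameter = 2 * np.linalg.norm(members - centroid, axis=1).max()
    print(f"cluster {c}: centroid={centroid.round(2)}, diameter={diameter:.2f}")
```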


Data Reduction Method (4): Sampling

• Sampling: obtaining a small sample s to represent the whole data set N
• Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
• Choose a representative subset of the data
  • Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods
  • Stratified sampling:
    • Approximate the percentage of each class (or subpopulation of interest) in the overall database
    • Used in conjunction with skewed data
• Note: Sampling may not reduce database I/Os (data is read a page at a time)
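
A hedged sketch contrasting simple random sampling without replacement with stratified sampling on a skewed, labeled data set; the class names and sizes are invented for the illustration:

```python
import random
from collections import Counter

random.seed(0)
data = [("rare", i) for i in range(50)] + [("common", i) for i in range(950)]

# Simple random sampling without replacement: the rare class may be under-drawn
srs = random.sample(data, 100)
print("SRSWOR:    ", Counter(label for label, _ in srs))

# Stratified sampling: sample each class in proportion to its share of the data
strata = {}
for label, item in data:
    strata.setdefault(label, []).append((label, item))
stratified = [rec for rows in strata.values()
              for rec in random.sample(rows, round(100 * len(rows) / len(data)))]
print("stratified:", Counter(label for label, _ in stratified))
```

Sampling with replacement would use random.choices(data, k=100) instead of random.sample, so the same record can be drawn more than once.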