Data Preprocessing Techniques, Lecture notes of Data Mining

Jaypee University of Engineering & Technology Data Mining

Lecture Notes with Special Points on Data Preprocessing

Typology: Lecture notes

2018/2019

Uploaded on 03/05/2019

gajalu 🇮🇳


January 18, 2019 Lecture-3

Data Preprocessing

Why preprocess the data?

Descriptive data summarization

Data cleaning

Data integration and transformation

Data reduction

Discretization and concept hierarchy generation

Summary


Why Data Preprocessing?

• Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    e.g., occupation=“ ”
  - noisy: containing errors or outliers
    e.g., Salary=“-10”
  - inconsistent: containing discrepancies in codes or names
    e.g., Age=“42” Birthday=“03/07/1997”
    e.g., Was rating “1,2,3”, now rating “A, B, C”
    e.g., discrepancy between duplicate records
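The three kinds of dirtiness can be checked mechanically. The following is a minimal sketch, assuming each record is a plain Python dictionary. The field names, the reference date, and the `find_problems` helper are hypothetical and only mirror the slide's examples; the birthday "03/07/1997" is read here as March 7, although the slide's date format is ambiguous.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    # Incomplete: an attribute value is missing or blank.
    if not (record.get("occupation") or "").strip():
        problems.append("incomplete: occupation is empty")
    # Noisy: a value that cannot be right, such as a negative salary.
    if record.get("salary", 0) < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: the stated age disagrees with the age implied by the birthday.
    birthday = record.get("birthday")
    if birthday is not None and "age" in record:
        had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
        implied_age = today.year - birthday.year - (0 if had_birthday else 1)
        if implied_age != record["age"]:
            problems.append(
                f"inconsistent: age {record['age']} but birthday implies {implied_age}"
            )
    return problems

# Hypothetical record combining the slide's three examples.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 3, 7)}
problems = find_problems(record, today=date(2019, 1, 18))
for p in problems:
    print(p)
```

Each check only flags a problem; deciding how to repair it (fill in, correct, or discard) is the job of the data cleaning step.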

Why Is Data Preprocessing Important?

• No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse
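A small illustration of how duplicate and missing data distort a statistic. This is a sketch with made-up salary records; the assumption that a missing salary was stored as 0 is hypothetical and chosen only to make the effect visible.

```python
from statistics import mean

# Hypothetical (employee_id, salary) records. Employee 3 was loaded twice,
# and employee 4 has a missing salary that was stored as 0.
records = [(1, 40000), (2, 50000), (3, 120000), (3, 120000), (4, 0)]

# Naive statistic computed directly on the dirty data.
naive_mean = mean(salary for _, salary in records)

# Cleaned statistic: one record per employee, placeholder for missing value dropped.
unique = dict(records)  # a repeated id keeps a single entry
clean_salaries = [s for s in unique.values() if s > 0]
clean_mean = mean(clean_salaries)

print(naive_mean, clean_mean)
```

Here the naive mean is 66000 while the mean over the cleaned records is 70000, so the same source data supports two different answers depending on whether it was cleaned first.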

Multi-Dimensional Measure of Data Quality

• A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
• Broad categories:
  - Intrinsic, contextual, representational, and accessibility

Forms of Data Preprocessing


Measuring the Central Tendency

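• Mean (algebraic measure) (sample vs. population):
    sample: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$    population: $\mu = \frac{\sum x}{N}$
  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Trimmed mean: the mean after chopping extreme values
• Median: a holistic measure
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - Estimated by interpolation (for grouped data): $\text{median} = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{\text{median}}} \right) c$
    where $L_1$ is the lower boundary of the median interval, $n$ the number of values, $(\sum f)_l$ the sum of the frequencies of all intervals below the median interval, $f_{\text{median}}$ the frequency of the median interval, and $c$ the interval width
• Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula (for moderately skewed unimodal data): $\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$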
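The measures of central tendency can be computed with Python's standard `statistics` module plus a few short helpers. This is a sketch, not a reference implementation: the helper names (`weighted_mean`, `trimmed_mean`, `grouped_median`) and all data values are hypothetical, and `grouped_median` assumes equal-width intervals.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values at each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return mean(ordered[k:len(ordered) - k])

def grouped_median(lower_bounds, freqs, width):
    """Median estimated by interpolation inside the median interval."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("no data")
    below = 0  # sum of frequencies of intervals below the current one
    for lower, f in zip(lower_bounds, freqs):
        if below + f >= n / 2:
            return lower + (n / 2 - below) / f * width
        below += f

# Hypothetical salaries in thousands.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries))               # arithmetic mean
print(median(salaries))             # average of the two middle values
print(multimode(salaries))          # two modes, so the data is bimodal
print(trimmed_mean(salaries, 0.1))  # drops the single lowest and highest value
print(weighted_mean([70, 80, 90], [1, 2, 1]))
print(grouped_median([0, 10, 20, 30], [5, 10, 20, 15], width=10))
```

The extreme value 110 pulls the mean (58) above the median (54); trimming one value from each end brings the mean down to 55.6.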

Symmetric vs. Skewed Data

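• Median, mean and mode of symmetric, positively and negatively skewed data
  - Symmetric: mean, median and mode coincide
  - Positively skewed (long right tail): typically mode < median < mean
  - Negatively skewed (long left tail): typically mean < median < mode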

Properties of Normal Distribution Curve

• The normal (distribution) curve
  - From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
  - From μ−2σ to μ+2σ: contains about 95% of them
  - From μ−3σ to μ+3σ: contains about 99.7% of them
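These percentages follow from the normal cumulative distribution: the fraction within k standard deviations of the mean is erf(k / √2). A short check using only the standard library (the function name `fraction_within` is illustrative):

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.4f}")
```

The computed fractions are about 0.6827, 0.9545 and 0.9973, which round to the 68%, 95% and 99.7% quoted on the slide.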

Boxplot Analysis

• Five-number summary of a distribution: Minimum, Q1, M (median), Q3, Maximum
• Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the interquartile range, IQR = Q3 − Q1
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extend to Minimum and Maximum
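The numbers a boxplot draws can be computed directly. A minimal sketch using `statistics.quantiles` from the standard library; the "inclusive" quartile method is an assumption, since the slides do not say which quartile convention they use and different conventions give slightly different Q1 and Q3. The data is made up.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, m, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, m, q3, max(values)

values = [7, 1, 9, 3, 5, 2, 8, 4, 6]
minimum, q1, m, q3, maximum = five_number_summary(values)
iqr = q3 - q1  # height of the box

print(minimum, q1, m, q3, maximum, iqr)
```

For this data the box runs from 3 to 7 with the median line at 5, and the whiskers reach 1 and 9.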

Histogram Analysis

• Graph displays of basic statistical class descriptions
  - Frequency histograms
    - A univariate graphical method
    - Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
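The counts behind a frequency histogram can be sketched as follows, assuming equal-width classes. The `frequency_histogram` helper and the price data are hypothetical, and classes with no values are simply omitted from the result.

```python
from collections import Counter

def frequency_histogram(values, width):
    """Count how many values fall in each class [k*width, (k+1)*width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical unit prices.
prices = [42, 47, 51, 55, 58, 63, 64, 69, 71, 88]
hist = frequency_histogram(prices, width=10)

# Text rendering: one row per class, bar length equals the class frequency.
for lower, count in hist.items():
    print(f"[{lower}, {lower + 10}) {'#' * count}")
```

Each printed bar plays the role of one rectangle in the histogram.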

Quantile Plot

• Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
• Plots quantile information
  - For data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i\%$ of the data are below or equal to the value $x_i$
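The points of a quantile plot can be generated as below. The slide does not say how f_i is computed; this sketch assumes the common convention f_i = (i − 0.5) / n for the i-th smallest of n values, and the helper name `quantile_points` is illustrative.

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs with x_i in increasing order."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed convention: f_i = (i - 0.5) / n, for i = 1..n.
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_points([30, 10, 20, 40])
for f, x in points:
    print(f"f = {f:.3f}  x = {x}")
```

Plotting x against f gives the quantile plot; every data value appears as one point.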

Scatter plot

• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Loess Curve

• Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
• Loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
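To make the two parameters concrete, here is a simplified loess sketch in pure Python. It is an illustration under stated assumptions rather than a standard implementation: the polynomial degree is fixed at 1 (local straight lines), the smoothing parameter `span` is the fraction of points used in each local fit, tricube weights are assumed, and no robustness iterations are performed.

```python
import math

def loess(xs, ys, span=0.5):
    """Locally weighted linear regression evaluated at each x in xs."""
    n = len(xs)
    q = max(2, math.ceil(span * n))  # number of points in each neighbourhood
    fitted = []
    for x0 in xs:
        # The distance to the q-th nearest point sets the local bandwidth.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[q - 1] or 1.0
        # Tricube weights: nearby points count most, points at or beyond h count zero.
        ws = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        # Weighted least-squares straight line through the neighbourhood.
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Hypothetical data: a straight line plus alternating noise.
xs = list(range(10))
ys = [2 * x + 1 + (1 if x % 2 else -1) for x in xs]
smooth = loess(xs, ys, span=0.6)
for x, y, s in zip(xs, ys, smooth):
    print(x, y, round(s, 2))
```

A larger `span` uses more points per local fit and gives a smoother curve; a smaller one follows the data more closely.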
