



































































Lecture Notes with Special Points on Data Preprocessing
Typology: Lecture notes
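Incomplete: missing attribute values, e.g., occupation=“ ”
Noisy: errors or outliers, e.g., Salary=“-10”
Inconsistent: discrepancies in codes or names
e.g., Age=“42” Birthday=“03/07/1997”
e.g., Was rating “1,2,3”, now rating “A, B, C”
e.g., discrepancy between duplicate records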
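The three kinds of problem in the examples above (a blank occupation, a negative salary, an age that disagrees with the birthday) can each be caught by a simple rule. The following is only a sketch: the record layout, the function name `find_problems`, the reference date, and reading “03/07/1997” as 3 July 1997 are all assumptions made for illustration.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []
    # Missing value: the attribute is absent or blank
    if not str(record.get("occupation", "")).strip():
        problems.append("occupation is blank")
    # Implausible value: a salary cannot be negative
    if record.get("salary") is not None and record["salary"] < 0:
        problems.append("salary is negative")
    # Contradiction: stated age differs from the age implied by the birthday
    born = record.get("birthday")
    if born is not None and record.get("age") is not None:
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        implied_age = today.year - born.year - (0 if had_birthday else 1)
        if implied_age != record["age"]:
            problems.append("age does not match birthday")
    return problems

# Hypothetical record built from the examples in the notes
dirty = {"occupation": " ", "salary": -10, "age": 42, "birthday": date(1997, 7, 3)}
clean = {"occupation": "nurse", "salary": 3000, "age": 26, "birthday": date(1997, 7, 3)}

dirty_problems = find_problems(dirty, today=date(2024, 1, 1))
clean_problems = find_problems(clean, today=date(2024, 1, 1))
```

Rules like these only flag suspect records; deciding whether to correct, fill in, or drop them is a separate cleaning step.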
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or even misleading statistics.
Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse
A well-accepted multidimensional view:
Accuracy
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories: intrinsic, contextual, representational, and accessibility
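Mean (algebraic measure) (sample vs. population):
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$
Weighted arithmetic mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Trimmed mean: chopping extreme values
Median: A holistic measure
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
$$\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$$
where $L_1$ is the lower boundary of the median interval, $n$ the number of values, $(\sum f)_l$ the sum of the frequencies of all intervals below the median interval, $f_{\text{median}}$ the frequency of the median interval, and $c$ its width
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$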
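As a rough illustration of the measures of central tendency listed above, the sketch below computes each one with Python's standard library. The helper names, the trimming rule (drop the same fraction from each end), and all the numbers are assumptions made for the example, not part of the notes.

```python
import statistics

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, fraction):
    """Drop `fraction` of the values at each end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def grouped_median(lower, width, n, freq_below, freq_median):
    """Interpolated median for grouped data: L1 + ((n/2 - freq_below) / freq_median) * c."""
    return lower + ((n / 2 - freq_below) / freq_median) * width

data = [2, 4, 4, 5, 7, 9, 40]            # hypothetical sample with one extreme value

mean = statistics.mean(data)             # pulled upward by the value 40
median = statistics.median(data)         # middle value of the sorted data
mode = statistics.mode(data)             # most frequent value
trimmed = trimmed_mean(data, 0.15)       # drops the smallest and largest value here
wmean = weighted_mean([1, 2, 3], [1, 1, 2])

# Hypothetical grouped data: 100 values, median interval [20, 30) with frequency 25,
# and 40 values in the intervals below it
gmedian = grouped_median(lower=20, width=10, n=100, freq_below=40, freq_median=25)
```

On this sample the mean is noticeably larger than the median and the trimmed mean, which is the usual reason for preferring the latter two when the data contain extreme values.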
[Figure: median, mean and mode of symmetric, positively skewed and negatively skewed data]
The normal (distribution) curve
From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
From μ–2σ to μ+2σ: contains about 95% of the measurements
From μ–3σ to μ+3σ: contains about 99.7% of the measurements
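The three percentages quoted for the normal curve can be checked numerically. For a normal distribution, the probability of falling within k standard deviations of the mean is erf(k/√2); the short sketch below evaluates it with the standard library (the function name `normal_coverage` is chosen here for illustration).

```python
import math

def normal_coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# Coverage, in percent, for one, two and three standard deviations
coverage = {k: 100 * normal_coverage(k) for k in (1, 2, 3)}
```

The values come out close to 68%, 95% and 99.7%, matching the figures above.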
Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third quartiles, i.e., the height of the box is the interquartile range (IQR = Q3 - Q1)
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum
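A minimal sketch of the five-number summary, using `statistics.quantiles` from the standard library. Several quartile conventions exist; the "inclusive" method used here is one choice among them, and the sample data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

data = [9, 1, 8, 2, 7, 3, 6, 4, 5]       # hypothetical sample
summary = five_number_summary(data)
iqr = summary[3] - summary[1]            # height of the box in a boxplot
```

These five numbers are exactly what a boxplot draws: the box spans Q1 to Q3, the line inside it marks the median, and the whiskers reach the minimum and maximum.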
Graph displays of basic statistical class descriptions
Frequency histograms
A univariate graphical method
Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
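The counts behind a frequency histogram can be sketched without any plotting library. The example assumes equal-width classes and values that all lie within the covered range; the function name, bin layout and data are hypothetical.

```python
def histogram_counts(values, low, width, bins):
    """Count how many values fall into each of `bins` equal-width classes starting at `low`."""
    counts = [0] * bins
    for v in values:
        index = int((v - low) // width)
        index = min(index, bins - 1)     # a value on the upper edge goes in the last class
        counts[index] += 1
    return counts

data = [1, 2, 2, 3, 5, 6, 6, 6, 9, 10]   # hypothetical sample
counts = histogram_counts(data, low=0, width=5, bins=2)

# A crude text rendering: one rectangle (row of '#') per class
rows = ["#" * c for c in counts]
```

Each entry of `counts` is the height of one rectangle in the histogram.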
Quantile plot
Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
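To make the quantile information concrete, the sketch below pairs each sorted value x_i with a fraction f_i. The notes do not say how f_i is computed; the formula f_i = (i - 0.5) / n used here is one common convention and should be read as an assumption.

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs for a quantile plot, with f_i = (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_points([7, 1, 5, 3])   # hypothetical sample
```

Plotting x_i against f_i gives the quantile plot: for instance, the point with f_i = 0.625 says that roughly 62.5% of the data are at or below the corresponding value.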
Scatter plot
Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Loess curve
Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
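The role of the smoothing parameter can be illustrated with a simplified local regression. This is a sketch, not a full loess implementation: the polynomial degree is fixed at 1 (local straight lines), there are no robustness iterations, and the name `loess_sketch`, the tricube weighting, and the meaning of `span` (fraction of points used in each local fit) are assumptions chosen for the example.

```python
def loess_sketch(xs, ys, span=0.5):
    """Smooth ys against xs with locally weighted straight-line fits."""
    n = len(xs)
    k = max(2, int(round(span * n)))     # number of neighbours in each local fit
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0
        # Tricube weights: close points count most, points at distance >= h count zero
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Value of the local weighted regression line at x0
        fitted.append(my + slope * (x0 - mx))
    return fitted

xs = list(range(10))
line = [2 * x + 1 for x in xs]                 # exactly linear data
wiggly = [y + (1 if x % 2 else -1) for x, y in zip(xs, line)]  # linear plus zig-zag

smooth_line = loess_sketch(xs, line, span=0.5)
smooth_wiggly = loess_sketch(xs, wiggly, span=0.8)
```

A larger `span` uses more neighbours per fit and gives a smoother curve; a smaller one follows the points more closely. Exactly linear data pass through unchanged, which is a useful sanity check for the local fit.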