






























































An introduction to data mining, covering fundamental concepts and techniques. It explores data objects and attributes, delving into various types of data sets, including records, relational records, data matrices, document data, transaction data, and more. The document also discusses important characteristics of structured data, such as dimensionality, sparsity, resolution, and distribution. It further examines different types of attributes, including nominal, binary, ordinal, interval-scaled, and ratio-scaled attributes. The document concludes with a discussion of basic statistical descriptions of data, including measures of central tendency and dispersion.
Typology: Cheat Sheet
Data Objects and Attributes
Important Characteristics of Structured Data
Resolution
Data Objects
- Data sets are made up of data objects. A data object represents an entity.
- Examples:
  - sales database: customers, store items, sales
  - medical database: patients, treatments
  - university database: students, professors, courses
- Data objects are also called samples, examples, instances, data points, objects, or tuples.
- Data objects are described by attributes.
- Database rows -> data objects; columns -> attributes.
Attributes
- Attribute (also called a dimension in data warehousing, a feature in machine learning, or a variable in statistics): a data field representing a characteristic or feature of a data object.
  - Ex.: customer ID, name, and address of a Customer object
- Attribute vector (or feature vector): the set of attributes used to describe an object
- Observations: the observed values for a given attribute
- Univariate: a distribution of data involving one attribute
- Bivariate: a distribution involving two attributes
- Types of attributes:
  - Nominal
  - Binary
  - Ordinal
  - Numeric (quantitative)
    - Interval-scaled
    - Ratio-scaled
Numeric Attribute
- Interval-scaled
  - Measured on a scale of equal-sized units; values have an order
  - Only differences (addition or subtraction) are meaningful
    - Ex.: yesterday's temperature was 10° lower than today's
  - Ex.: temperature in °C or °F, calendar dates
  - No true zero point
- Ratio-scaled
  - Inherent zero point
  - Both differences and ratios between values can be computed
    - Ex.: 10 K is twice as high as 5 K; the prices of items A and B are in the ratio 2 : 5
  - Ex.: temperature in Kelvin, length, counts, monetary quantities
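The difference between the two scales can be checked numerically. The sketch below is only an illustration: the helper names and the two temperature readings are assumptions, not part of the slides. It shows that a ratio of Celsius values changes when the unit changes, while differences stay consistent.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0


def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to Kelvin."""
    return c + 273.15


# Hypothetical readings, chosen only for illustration.
today_c, yesterday_c = 20.0, 10.0

# Interval scale: the "ratio" depends on the unit, so it carries no meaning.
ratio_c = today_c / yesterday_c
ratio_f = celsius_to_fahrenheit(today_c) / celsius_to_fahrenheit(yesterday_c)

# Ratio scale: Kelvin has a true zero, so this ratio is physically meaningful.
ratio_k = celsius_to_kelvin(today_c) / celsius_to_kelvin(yesterday_c)

# Differences are meaningful on both scales (Celsius and Kelvin share a unit size).
diff_c = today_c - yesterday_c
diff_k = celsius_to_kelvin(today_c) - celsius_to_kelvin(yesterday_c)

print(f"ratio in C: {ratio_c:.2f}, ratio in F: {ratio_f:.2f}, ratio in K: {ratio_k:.4f}")
print(f"difference in C: {diff_c:.1f}, difference in K: {diff_k:.1f}")
```

The Celsius ratio (2.00) and the Fahrenheit ratio (1.36) disagree, which is why "twice as hot" is not a valid statement on an interval scale.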
Numeric Attribute (in Classification Algorithms)
- Quantitative (integer or real-valued)
Mining Descriptive Characteristics
- Goal: to better understand the data in terms of central tendency, variation, and spread
- Descriptive Data Summarisation (DDS)
  - Data preprocessing requires an overall picture of the data
  - DDS is used to identify the properties of the data and to decide which values should be treated as noise or outliers
- Three statistical descriptions of data:
  - Central tendency: locates the center or middle of the data distribution
  - Dispersion: describes how the data are spread out
  - Visualization: visual inspection of the data
- Central tendency measures: mean, median, mode, midrange (the average of the max and min)
- Data dispersion measures: range, quartiles, five-number summary, boxplot, variance, standard deviation
- Data visualization: quantile plots, Q-Q plots, histograms, scatter plots
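As a rough illustration of the measures listed above, the following sketch computes them with Python's standard `statistics` module. The sample list is hypothetical, and the choice of population variance (dividing by n) rather than sample variance is an assumption, since the slides do not say which is intended.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [4, 8, 8, 10, 12, 15, 20]

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion
value_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles (default "exclusive" method)
five_number = (min(data), q1, q2, q3, max(data))
variance = statistics.pvariance(data)          # population variance (assumed)
std_dev = statistics.pstdev(data)              # population standard deviation (assumed)

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range)
print("five-number summary:", five_number)
print("variance:", round(variance, 3), "std dev:", round(std_dev, 3))
```

Different quartile conventions exist, so other tools may report slightly different values for Q1 and Q3 on the same data.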
Measures
- Distributive measure: a measure that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the original (entire) data set.
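The partition-and-merge idea described above (such measures are commonly called distributive) can be sketched as follows. The `partition` helper and the sample values are hypothetical. Sum and count are computed per subset and merged; the mean is then derived from those two merged results.

```python
def partition(data, k):
    """Split data into at most k consecutive, non-overlapping chunks."""
    size = -(-len(data) // k)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


# Hypothetical values, chosen only for illustration.
values = [3, 7, 1, 9, 4, 6, 2, 8]
chunks = partition(values, 3)

# Compute the measure on each subset independently ...
partial_sums = [sum(chunk) for chunk in chunks]
partial_counts = [len(chunk) for chunk in chunks]

# ... then merge the partial results.
total_sum = sum(partial_sums)
total_count = sum(partial_counts)

# The mean is obtained by combining the two merged measures.
mean_from_parts = total_sum / total_count

print(chunks)
print(total_sum, total_count, mean_from_parts)
```

The merged sum and count match what a single pass over the whole list would give, which is what makes this approach suitable for data that is processed in pieces.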
Measuring the Central Tendency
- Mean (algebraic measure):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

- Weighted arithmetic mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Weighted Arithmetic Mean
Compute the weighted arithmetic mean of the following data:

X:  82  76  73  76  65  60
W:   2   3   6   7   3   7
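One way to work this exercise is sketched below. It assumes that each weight in W applies to the value of X in the same position; the function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


x = [82, 76, 73, 76, 65, 60]
w = [2, 3, 6, 7, 3, 7]

result = weighted_mean(x, w)
print(round(result, 2))  # 1977 / 28, about 70.61
```

The unweighted mean of X is 72, so the heavier weights on the lower values pull the weighted result down.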
Symmetric vs. Skewed Data
- In a unimodal frequency curve, the relative positions of the mean, median, and mode differ for symmetric, positively skewed, and negatively skewed data.
[Figure: frequency curves for symmetric, positively skewed, and negatively skewed data]
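The three cases can be illustrated with small hypothetical samples (the lists below are invented for this sketch). In a symmetric unimodal sample the three measures coincide; in a positively skewed one the long right tail typically pulls the mean above the median and mode, and the order reverses for negative skew.

```python
import statistics


def summarize(sample):
    """Return (mode, median, mean) for a list of numbers."""
    return (
        statistics.mode(sample),
        statistics.median(sample),
        statistics.mean(sample),
    )


# Hypothetical samples, chosen only for illustration.
symmetric = [1, 2, 3, 3, 3, 4, 5]
positively_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]     # long right tail
negatively_skewed = [-v for v in positively_skewed]      # mirrored: long left tail

sym_mode, sym_median, sym_mean = summarize(symmetric)
pos_mode, pos_median, pos_mean = summarize(positively_skewed)
neg_mode, neg_median, neg_mean = summarize(negatively_skewed)

print("symmetric:        ", sym_mode, sym_median, sym_mean)
print("positively skewed:", pos_mode, pos_median, pos_mean)
print("negatively skewed:", neg_mode, neg_median, neg_mean)
```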
Median Problems
- Find the median of the following set of points in a game: 15, 14, 10, 8, 12, 8, 16
- Find the median of the following set of points: 23, 29, 20, 32, 23, 21, 33, 25
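A minimal sketch for checking both exercises is given below. The function name `median` and the variable names are illustrative. For an odd number of values it returns the middle value of the sorted list; for an even number it averages the two middle values.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of two middle values


game_points = [15, 14, 10, 8, 12, 8, 16]          # 7 values (odd)
points = [23, 29, 20, 32, 23, 21, 33, 25]         # 8 values (even)

median_game = median(game_points)   # sorted: 8, 8, 10, 12, 14, 15, 16
median_points = median(points)      # sorted: 20, 21, 23, 23, 25, 29, 32, 33

print(median_game, median_points)
```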