Data Mining: Concepts and Techniques - Introduction to Data Objects and Attributes, Cheat Sheet of Mathematics

An introduction to data mining, covering fundamental concepts and techniques. It explores data objects and attributes, delving into various types of data sets, including records, relational records, data matrices, document data, transaction data, and more. The document also discusses important characteristics of structured data, such as dimensionality, sparsity, resolution, and distribution. It further examines different types of attributes, including nominal, binary, ordinal, interval-scaled, and ratio-scaled attributes. The document concludes with a discussion of basic statistical descriptions of data, including measures of central tendency and dispersion.

Typology: Cheat Sheet

2024/2025

Uploaded on 11/11/2024

the-super-world

Data Mining: Concepts and Techniques

Knowing Data

Data Objects and Attributes

Important Characteristics of Structured Data

• Dimensionality: the curse of dimensionality
• Sparsity: only presence counts, so most values are 0
• Resolution: patterns depend on the scale
• Distribution: centrality and dispersion

Resolution

• Pattern emergence depends on scale and resolution.
• Temporal variability: a frequency pattern emerges at a certain temporal scale, below which it is noise and above which it is background (e.g., climate change).

Data Objects

• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
  • Sales database: customers, store items, sales
  • Medical database: patients, treatments
  • University database: students, professors, courses
• Also called samples, examples, instances, data points, objects, or tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.

Attributes

• Attribute (also called a dimension in data warehousing, a feature in machine learning, or a variable in statistics): a data field representing a characteristic or feature of a data object. Ex.: customer ID, name, and address of a Customer object
• Attribute vector or feature vector: the set of attributes used to describe an object
• Observations: the observed values for a given attribute
• Univariate: a distribution of data involving one attribute
• Bivariate: a distribution involving two attributes
• Types of attributes:
  • Nominal
  • Binary
  • Numeric (quantitative)
    • Interval-scaled
    • Ratio-scaled

Numeric Attribute

• Interval-scaled
  • Measured on a scale of equal-sized units; values have an order
  • Only differences (addition or subtraction) are meaningful
  • Ex.: yesterday's temperature was 10° lower than today's
  • Ex.: temperature in °C or °F, calendar dates
  • No true zero point
• Ratio-scaled
  • Inherent zero point
  • Both differences and ratios between values can be computed
  • Ex.: 10 K is twice as high as 5 K; the prices of items A and B are in the ratio 2 : 5
  • Ex.: temperature in Kelvin, length, counts, monetary quantities
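The distinction can be checked numerically. The sketch below is only an illustration: the two readings (10 °C and 20 °C) and the helper `celsius_to_kelvin` are assumptions, not part of the slides. It shows that a difference survives the change of scale, while a ratio computed on the interval scale does not.

```python
# Hypothetical temperature readings, chosen only for illustration.
yesterday_c = 10.0
today_c = 20.0


def celsius_to_kelvin(c):
    """Convert a Celsius reading to Kelvin (a ratio scale with a true zero)."""
    return c + 273.15


# Differences agree on both scales, so they are meaningful for interval data.
diff_c = today_c - yesterday_c
diff_k = celsius_to_kelvin(today_c) - celsius_to_kelvin(yesterday_c)

# Ratios disagree: 20 °C is not "twice as hot" as 10 °C.
ratio_c = today_c / yesterday_c
ratio_k = celsius_to_kelvin(today_c) / celsius_to_kelvin(yesterday_c)

print("difference:", diff_c, "vs", round(diff_k, 6))
print("ratio on Celsius:", ratio_c)
print("ratio on Kelvin:", round(ratio_k, 4))
```

Only the Kelvin ratio reflects a physical quantity, because only the Kelvin scale has an inherent zero point.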

Numeric Attribute (In Classification Algorithms)

• Quantitative (integer or real-valued)
• Discrete attribute
  • Has only a finite or countably infinite set of values
  • E.g., zip codes, profession, or the set of words in a collection of documents
  • Sometimes represented as integer variables
  • Binary attributes are a special case of discrete attributes and can take the values {0, 1}
• Continuous attribute
  • Has real numbers as attribute values
  • E.g., temperature, height, or weight
  • In practice, real values can only be measured and represented using a finite number of digits
  • Continuous attributes are typically represented as floating-point variables

Mining Descriptive Characteristics

• To better understand the data: central tendency, variation, and spread
• Descriptive Data Summarisation (DDS)
  • For data preprocessing, it is essential to have an overall picture of the data
  • DDS is used to identify the properties of the data and to decide which values should be treated as noise or outliers
• 3 statistical descriptions of data:
  • Central tendency: locates the center or middle of the data distribution
  • Dispersion: how the data are spread out
  • Visualization: visual inspection of the data
• Central tendency measures: mean, median, mode, max, min
• Data dispersion measures: range, quartiles, five-number summary, boxplot, variance, standard deviation
• Data visualization: quantile plots, Q-Q plots, histograms, scatter plots

Measures

• Distributive measure: a measure that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the original (entire) data set.
• Ex.: sum() and count()
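A minimal sketch of the partition-and-merge idea, assuming the six values from the averaging exercise in these slides as data. The helper `partition` and the chunk size of 4 are hypothetical choices made for the illustration.

```python
def partition(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


data = [56, 41, 59, 52, 42, 44]

# Compute the measure on each subset independently.
chunks = partition(data, 4)
partials = [(sum(chunk), len(chunk)) for chunk in chunks]

# Merge the partial results to recover the value for the whole data set.
total_sum = sum(s for s, _ in partials)
total_count = sum(n for _, n in partials)

print(total_sum, total_count)  # 294 6
```

The merged sum and count match what a single pass over the full data set would give, which is what makes both measures distributive. The mean can then be derived from the two merged results as `total_sum / total_count`.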

Measuring the Central Tendency

• Mean (algebraic measure):
  • $x_1, x_2, \ldots, x_n$: a set of $n$ values or observations, e.g., age
  • Arithmetic mean:

    $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

    Find the average of 56, 41, 59, 52, 42 and 44.
  • Weighted arithmetic mean (weights reflect significance or occurrence frequency):

    $$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

  • Trimmed mean: chop off extreme values to offset their effect on the mean
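The averaging exercise above can be worked through as follows. This is a sketch, not a prescribed solution: the function `trimmed_mean` and its parameter `k` (the number of values dropped from each end) are assumptions introduced for the illustration, since the slides do not say how much to trim.

```python
values = [56, 41, 59, 52, 42, 44]

# Arithmetic mean: sum of the observations divided by their count.
arithmetic_mean = sum(values) / len(values)
print(arithmetic_mean)  # 49.0


def trimmed_mean(xs, k):
    """Average after dropping the k smallest and k largest values.

    `k` is a hypothetical parameter; it must leave at least one value.
    """
    if k < 0 or 2 * k >= len(xs):
        raise ValueError("k must leave at least one value")
    ordered = sorted(xs)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Dropping one value from each end removes 41 and 59.
print(trimmed_mean(values, 1))  # 48.5
```

With `k = 0` the trimmed mean reduces to the ordinary arithmetic mean.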

Weighted Arithmetic Mean

Compute the weighted arithmetic mean of the following data:

X   82   76   73   76   65   60
W    2    3    6    7    3    7
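One way to work this exercise, assuming X holds the values and W the corresponding weights as in the table above. The function name `weighted_mean` is a hypothetical helper, not something defined in the slides.

```python
x = [82, 76, 73, 76, 65, 60]
w = [2, 3, 6, 7, 3, 7]


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(wi * xi for wi, xi in zip(weights, values)) / total_weight


result = weighted_mean(x, w)
print(round(result, 2))  # 70.61
```

The numerator is 1977 and the weights sum to 28, so the weighted mean is 1977 / 28, about 70.61.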

Symmetric vs. Skewed Data

• Mean, median, and mode of symmetric, positively skewed, and negatively skewed data in a unimodal frequency curve

[Figure: three unimodal frequency curves, labeled symmetric, positively skewed, and negatively skewed]

Median Problems

• Find the median of the following set of points in a game: 15, 14, 10, 8, 12, 8, 16
• Find the median of the following set of points: 23, 29, 20, 32, 23, 21, 33, 25
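Both problems can be checked with a short sketch. It assumes the usual convention that, for an even number of values, the median is the average of the two middle values; the function name `median` is a hypothetical helper.

```python
def median(xs):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    if not xs:
        raise ValueError("median of empty data")
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


game_points = [15, 14, 10, 8, 12, 8, 16]
other_points = [23, 29, 20, 32, 23, 21, 33, 25]

print(median(game_points))   # 12
print(median(other_points))  # 24.0
```

The first set has seven values, so the median is the fourth value after sorting (8, 8, 10, 12, 14, 15, 16). The second has eight values, so the median is the average of the fourth and fifth sorted values, 23 and 25.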