Data Mining: Concepts and Techniques - Introduction to Data Objects and Attributes, Cheat Sheet of Mathematics

MIT - World Peace University Mathematics

An introduction to data mining, covering fundamental concepts and techniques. It explores data objects and attributes, delving into various types of data sets, including records, relational records, data matrices, document data, transaction data, and more. The document also discusses important characteristics of structured data, such as dimensionality, sparsity, resolution, and distribution. It further examines different types of attributes, including nominal, binary, ordinal, interval-scaled, and ratio-scaled attributes. The document concludes with a discussion of basic statistical descriptions of data, including measures of central tendency and dispersion.

Typology: Cheat Sheet

2024/2025

Uploaded on 11/11/2024

the-super-world 🇮🇳


Data Mining: Concepts and Techniques

Knowing Data

Data Objects and Attributes

Important Characteristics of Structured Data

- Dimensionality: curse of dimensionality
- Sparsity: only presence counts (lots of 0s)
- Resolution: patterns depend on the scale
- Distribution: centrality and dispersion

Resolution

- Pattern emergence depends on scale and resolution.
- Temporal variability: a frequency pattern emerges at a certain temporal scale; below that scale it is noise, and above it, it is background (e.g. climate change).

Data Objects

- Data sets are made up of data objects.
- A data object represents an entity.
- Examples:
  - sales database: customers, store items, sales
  - medical database: patients, treatments
  - university database: students, professors, courses
- Also called samples, examples, instances, data points, objects, or tuples.
- Data objects are described by attributes.
- Database rows -> data objects; columns -> attributes.
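The rows-to-objects and columns-to-attributes mapping can be pictured with a tiny in-memory table. This is only an illustrative sketch: the `students` rows, their field names, and their values are made up and do not come from the slides.

```python
# Hypothetical rows from a university database (names and values are invented).
students = [
    {"student_id": 1, "name": "Asha", "course": "Data Mining"},
    {"student_id": 2, "name": "Ravi", "course": "Statistics"},
]

# Each row (dict) is one data object; each key is one attribute.
attributes = list(students[0].keys())
num_objects = len(students)

print(attributes)   # ['student_id', 'name', 'course']
print(num_objects)  # 2
```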

Attributes

- Attribute (also called a dimension in data warehousing, a feature in machine learning, or a variable in statistics): a data field representing a characteristic or feature of a data object. Ex.: customer ID, name, and address of a Customer object.
- Attribute vector or feature vector: the set of attributes used to describe an object
- Observations: the observed values for a given attribute
- Univariate: a distribution of data involving one attribute
- Bivariate: a distribution involving two attributes
- Types of attributes:
  - Nominal
  - Binary
  - Numeric (quantitative)
    - Interval-scaled
    - Ratio-scaled

Numeric Attribute

- Interval-scaled
  - Measured on a scale of equal-sized units; values have order
  - Only differences (addition or subtraction) are meaningful
  - Ex.: yesterday's temperature was 10° lower than today's
  - Ex.: temperature in °C or °F, calendar dates
  - No true zero point
- Ratio-scaled
  - Inherent zero point
  - Both differences and ratios between values can be computed
  - Ex.: 10 K is twice as high as 5 K; the prices of items A and B are in the ratio 2 : 5
  - Ex.: temperature in Kelvin, length, counts, monetary quantities
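A short numeric check may make the interval versus ratio distinction concrete. The sketch below uses made-up temperatures and a hypothetical helper `celsius_to_kelvin`. It shows that a difference is the same on both scales, while a ratio computed on the Celsius scale does not carry over to Kelvin.

```python
import math

def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to Kelvin (shifts the zero point)."""
    return c + 273.15

# Illustrative temperatures, not taken from the slides.
today_c, yesterday_c = 20.0, 10.0

# Differences agree on both scales, so they are meaningful for interval data.
diff_c = today_c - yesterday_c
diff_k = celsius_to_kelvin(today_c) - celsius_to_kelvin(yesterday_c)
print(math.isclose(diff_c, diff_k))  # True

# A ratio on the Celsius scale depends on the arbitrary zero point...
naive_ratio_c = today_c / yesterday_c
# ...while Kelvin has a true zero, so this ratio is physically meaningful.
ratio_k = celsius_to_kelvin(today_c) / celsius_to_kelvin(yesterday_c)

print(naive_ratio_c)      # 2.0
print(round(ratio_k, 3))  # 1.035
```

20 °C is therefore not "twice as hot" as 10 °C, even though the Celsius numbers have a ratio of 2.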

Numeric Attribute (In Classification Algorithms)

- Quantitative (integer or real-valued)
- Discrete attribute
  - Has only a finite or countably infinite set of values
  - E.g., zip codes, profession, or the set of words in a collection of documents
  - Sometimes represented as integer variables
  - Binary attributes are a special case of discrete attributes and can take the values {0, 1}
- Continuous attribute
  - Has real numbers as attribute values
  - E.g., temperature, height, or weight
  - In practice, real values can only be measured and represented using a finite number of digits
  - Continuous attributes are typically represented as floating-point variables

Mining Descriptive Characteristics

- To better understand the data: central tendency, variation, and spread
- Descriptive data summarisation (DDS)
  - Data preprocessing needs an overall picture of the data
  - DDS is used to identify the properties of the data and to decide which values to treat as noise or outliers
- Three statistical descriptions of data:
  - Central tendency: locates the center or middle of the data distribution
  - Dispersion: how the data is spread out
  - Visualization: visual inspection of the data
- Central tendency measures: mean, median, mode, max, min
- Data dispersion measures: range, quartiles, five-point summary, boxplot, variance, standard deviation
- Data visualization: quantile plots, Q-Q plots, histograms, scatter plots
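Several of the dispersion measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on an illustrative data set. Quartile conventions differ between textbooks and tools, so the `method="inclusive"` choice here is an assumption, not the slides' definition.

```python
import statistics

# Illustrative observations, already sorted for readability.
data = [41, 42, 44, 52, 56, 59]

# Range: distance between the largest and smallest value.
value_range = max(data) - min(data)

# Population variance and standard deviation (divide by n, not n - 1).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)

# Quartiles Q1, Q2 (median), Q3; the interpolation method is an assumption.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Five-point summary: minimum, Q1, median, Q3, maximum.
five_number = (min(data), q1, q2, q3, max(data))

print(value_range)  # 18
print(variance, std_dev)
print(five_number)
```

A boxplot is a drawing of this five-point summary, which is why the two usually appear together.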

Measures

Distributive measure: a measure that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the original (entire) data set.

Ex.: sum() and count()
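The partition-then-merge idea can be sketched as follows. The three-way split is arbitrary and chosen only for illustration. The final line shows how the mean can then be derived from the two distributive measures.

```python
# Observations split into arbitrary partitions (the split is illustrative).
data = [56, 41, 59, 52, 42, 44]
partitions = [data[:2], data[2:4], data[4:]]

# Compute the measure on each partition independently...
partial_sums = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]

# ...then merge the partial results to get the value for the whole data set.
total_sum = sum(partial_sums)
total_count = sum(partial_counts)

print(total_sum == sum(data))    # True
print(total_count == len(data))  # True

# The mean is obtained from the two merged distributive measures.
mean = total_sum / total_count
print(mean)  # 49.0
```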

Measuring the Central Tendency

- Mean (algebraic measure):
  - $x_1, x_2, \ldots, x_n$: a set of $n$ values or observations, e.g. age
  - Arithmetic mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
    - Find the average of 56, 41, 59, 52, 42 and 44.
  - Weighted arithmetic mean (weights reflect significance or occurrence frequency): $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Trimmed mean: chopping extreme values (to offset the effect caused by extreme values)

Weighted Arithmetic Mean

Compute the weighted arithmetic mean of the following data:

X:  82  76  73  76  65  60
W:   2   3   6   7   3   7
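One way to work the mean exercises is sketched below. The function names are hypothetical. Trimming exactly one value from each end is an assumption made for illustration, since the slides do not say how many values to chop.

```python
def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k is too large for this data set")
    kept = sorted(values)[k:len(values) - k]
    return sum(kept) / len(kept)

# Arithmetic mean exercise from the slides.
scores = [56, 41, 59, 52, 42, 44]
print(arithmetic_mean(scores))  # 49.0

# Weighted mean exercise from the slides: 1977 / 28.
x = [82, 76, 73, 76, 65, 60]
w = [2, 3, 6, 7, 3, 7]
print(round(weighted_mean(x, w), 2))  # 70.61

# Trimmed mean of the first data set with one value cut from each end (assumed k).
print(trimmed_mean(scores, 1))  # 48.5
```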

Symmetric vs. Skewed Data

Mean, median, and mode of symmetric, positively skewed, and negatively skewed data in a unimodal frequency curve.

[Figure: frequency curves for symmetric, positively skewed, and negatively skewed data]
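The effect of skew on the three measures can be checked numerically. The samples below are invented for illustration. The ordering they show (mode, then median, then mean for positive skew, and the reverse for negative skew) is a common rule of thumb for unimodal data, not a guarantee for every data set.

```python
import statistics

# Invented sample with a long right tail (positively skewed).
positive = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image of the same sample, giving a long left tail (negatively skewed).
negative = [16 - v for v in positive]

def summarize(values):
    """Return (mode, median, mean) for a list of numbers."""
    return (
        statistics.mode(values),
        statistics.median(values),
        statistics.mean(values),
    )

pos_mode, pos_median, pos_mean = summarize(positive)
neg_mode, neg_median, neg_mean = summarize(negative)

# The tail pulls the mean furthest, the median less, and the mode not at all.
print(pos_mode < pos_median < pos_mean)  # True
print(neg_mode > neg_median > neg_mean)  # True
```

For perfectly symmetric unimodal data, the three values coincide.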

Median Problems

- Find the median of the following set of points in a game: 15, 14, 10, 8, 12, 8, 16
- Find the median of the following set of points: 23, 29, 20, 32, 23, 21, 33, 25
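Both median problems can be worked by sorting the values and taking the middle one, or the average of the two middle values when the count is even. The helper name `median` below is a hypothetical choice. The result is cross-checked against `statistics.median` from the standard library.

```python
import statistics

def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Odd number of values: sorted is 8, 8, 10, 12, 14, 15, 16.
game_points = [15, 14, 10, 8, 12, 8, 16]
print(median(game_points))  # 12

# Even number of values: sorted is 20, 21, 23, 23, 25, 29, 32, 33.
points = [23, 29, 20, 32, 23, 21, 33, 25]
print(median(points))  # 24.0

# Cross-check against the standard library.
print(median(points) == statistics.median(points))  # True
```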
