
Descriptive and Inferential Statistics: A Comprehensive Guide, Study notes of Software Development

A thorough overview of descriptive and inferential statistics, covering key concepts such as mean, median, mode, variance, standard deviation, and hypothesis testing. It explores various probability distributions, including the normal, binomial, and Poisson distributions, and covers hypothesis testing methods such as t-tests, chi-square tests, and ANOVA. The guide also includes practical examples and visualizations to aid understanding.

Typology: Study notes

2024/2025

Available from 04/18/2025

sarofan

Statistics

Statistics is the science of collecting, organizing and analyzing data.

Data = facts or pieces of information.

Eg:
· Heights of students in a classroom
· IQ of students
· Weight of people

Statistics is divided into 2 parts:

Types of statistics
1. Descriptive stats
2. Inferential stats

Descriptive stats
defn: It consists of organizing and summarizing data.
· Measure of central tendency [mean, median, mode]
· Measure of dispersion [variance, standard deviation]
· Different types of distribution of data. Eg: histogram, pdf, pmf, cdf
Eg: "What is the average height of students in the class?"

Inferential stats
defn: It consists of using data you have measured to form conclusions.
· Conclusions about the population are drawn from a sample (sample → population)
· Hypothesis testing: z-test, t-test, chi-square test, ANOVA, F-test
· CI, p-value
· CLT (Central Limit Theorem)
Eg: "Are the heights of students in the class similar to what you expect in a college?"

Scales of measurement of data

1. Nominal scale data
2. Ordinal scale data
3. Interval scale data
4. Ratio scale data

① Nominal scale data
Here we mainly measure qualitative/categorical data.
Eg: Colours, Gender, Labels.
Order does not matter.
Eg: Favourite colour (10 people in total)
· Red: 5 (50 %)
· Blue: 3 (30 %)
· Orange: 2 (20 %)

② Ordinal scale data
Eg: Categorical data
· Best: 1
· Good: 2
· Bad: 3
Eg: Race
· 1st: 3 min
· 2nd and 3rd: slower times (values illegible in the source)
Ranking and order matter.
The difference cannot be measured from this data alone. In the race example, based only on the ranks 1, 2, 3 we cannot measure how far apart the runners were. Only when extra information is provided (time taken) can we measure it.

③ Interval scale data
Eg: Temperature (30 F, 60 F, 90 F, 120 F)
· Order matters
· Difference can be measured
· Ratio cannot be measured
· No zero starting point
Room 1: 30 F, Room 2: 60 F. The ratio 60 : 30 = 2 : 1 is not meaningful here: we can't say that Room 2's temperature is twice that of Room 1. Fahrenheit can also take negative values, hence it has no zero starting point.

④ Ratio scale data
· Order matters [we can sort the numbers]
· Differences are measurable, including ratios
· Contains a zero starting point
Eg: Students' marks in class: 30, 60, 90, 95, 99

Example:
· Marital status [Nominal]
· Favourite food based on gender [Nominal]
· IQ measurement [Ratio scale data in these notes; strictly, IQ has no true zero and is usually treated as interval scale data]. We can also convert it to ordinal data.

Descriptive statistics

Measure of central tendency: mean, median, mode

Mean:
X = {1, 1, 2, 2, 3, 3, 4, 5, 5, 6}
where N = population size, n = sample size

Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

mean = (1 + 1 + 2 + 2 + 3 + 3 + 4 + 5 + 5 + 6) / 10 = 32 / 10 = 3.2

Median:
X = {…, 5, 2, 3, 2, …}
Steps:
1. Sort the random variable.
2. Count the number of elements.
3. If the count is even, take the 2 center elements and take their average. If the count is odd, the middle element will be the median.
Note: The median is used to find the central tendency when an outlier is present.

Why median?
It is because of outliers. Note: the mean is affected by outliers.
Ex:
X = {1, 2, 3, 4, 5}: mean = 3, median = 3
X = {1, 2, 3, 4, 5, 100}: mean ≈ 19, median = 3.5
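The effect of the outlier can be checked with a minimal sketch using Python's standard `statistics` module. The variable names are illustrative and not from the notes.

```python
from statistics import mean, median

clean = [1, 2, 3, 4, 5]
with_outlier = clean + [100]  # same data plus one extreme value

clean_mean, clean_median = mean(clean), median(clean)
outlier_mean, outlier_median = mean(with_outlier), median(with_outlier)

# The mean jumps from 3 to about 19, while the median only moves from 3 to 3.5.
print(clean_mean, clean_median)
print(round(outlier_mean, 2), outlier_median)
```

The median barely moves because it depends only on the middle positions of the sorted data, not on how extreme the largest value is.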

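Variance :

Sample variance:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

For example (here x and y are two data sets): $s_x^2 = 2.5$ and $s_y^2 = 7.5$.

For x = {1, 2, 3, 4, 5} with $\bar{x} = 3$:

x_i | x̄ | (x_i − x̄)²
1 | 3 | 4
2 | 3 | 1
3 | 3 | 0
4 | 3 | 1
5 | 3 | 4

$\sum (x_i - \bar{x})^2 = 10$, so $s^2 = 10 / (5 - 1) = 2.5$.

[Figure: the respective graphs for these variances]

When the variance is less, the spread is less.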

Standard deviation :
It is nothing but the square root of the variance.
"How far the value is away from the mean."

Ex: X = {1, 2, 3, 4, 5}, $\bar{x} = 3$

Population standard deviation: $\sigma = \sqrt{\sigma^2}$ (root of the population variance)
Sample standard deviation: $s = \sqrt{s^2} = \sqrt{2.5} = 1.581$

How to show this spread:
[Figure: number line centred on the mean 3, with marks at 4.581, 6.162 and 7.743, i.e. 1, 2 and 3 standard deviations above the mean]

To get how far a value is away from the mean, we use the standard deviation. Therefore the variance is the standard deviation squared. In short, we are talking about how well the data is spread.

Ex: 6.162 is 2 standard deviations away from the mean.
Note: 4.581 is one standard deviation away from the mean.

[Figure: curves comparing the variance of one distribution with another]
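The numbers in this example can be reproduced with a short sketch, assuming the data set X = {1, 2, 3, 4, 5} used above. `variance` and `stdev` divide by n − 1 (sample formulas), while `pvariance` and `pstdev` divide by N (population formulas).

```python
from statistics import pstdev, pvariance, stdev, variance

x = [1, 2, 3, 4, 5]
mean_x = 3

sample_var = variance(x)  # sum of squared deviations / (n - 1)
sample_sd = stdev(x)      # square root of the sample variance
pop_var = pvariance(x)    # sum of squared deviations / N
pop_sd = pstdev(x)        # square root of the population variance

# Points lying 1, 2 and 3 sample standard deviations above the mean.
marks = [mean_x + k * sample_sd for k in (1, 2, 3)]

print(sample_var, round(sample_sd, 3))
print(pop_var, round(pop_sd, 3))
print([round(m, 3) for m in marks])
```

The rounded marks match the 4.581, 6.162 and 7.743 shown on the number line.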

Random variables

A random variable is a process of mapping the output of a random process or experiment to a number.

Eg: Tossing a coin [Head, Tail]
X = a number assigned to each outcome (for example 0 for Head and 1 for Tail)
⇒ "The outcome of a random process is converted to a number."
"In every toss the value can change, i.e. it is not fixed."

Ex: Rolling a dice [1, 2, 3, 4, 5, 6]
Y = {sum of rolling a dice 7 times}

What can we do from this?
→ We can find probabilities such as Pr(Y > 15), Pr(Y < 10).
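Probabilities like these can be computed exactly by building the distribution of the sum one roll at a time. This is a minimal sketch: the number of rolls (7) and the thresholds (15 and 10) are assumptions read from a partly illegible source, and the helper names are hypothetical.

```python
from fractions import Fraction


def dice_sum_distribution(rolls, sides=6):
    """Return {total: probability} for the sum of `rolls` fair dice."""
    dist = {0: Fraction(1)}
    face_prob = Fraction(1, sides)
    for _ in range(rolls):
        nxt = {}
        for total, prob in dist.items():
            for face in range(1, sides + 1):
                key = total + face
                nxt[key] = nxt.get(key, Fraction(0)) + prob * face_prob
        dist = nxt
    return dist


def probability(dist, event):
    """Add up the probabilities of all totals for which event(total) is true."""
    return sum(p for total, p in dist.items() if event(total))


Y = dice_sum_distribution(rolls=7)  # assumed: 7 rolls
p_gt_15 = probability(Y, lambda s: s > 15)
p_lt_10 = probability(Y, lambda s: s < 10)
print(float(p_gt_15), float(p_lt_10))
```

Using `Fraction` keeps every probability exact, so the values sum to exactly 1.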

Histogram and Skewness

Whenever we talk about frequency, the best visualization diagram we can use is a histogram.

Ex: Ages = {…, 12, 18, 24, 26, 30, 35, 36, 37, …, 41, 42, 43, 50, 51}

If we want to map the frequency of the elements falling within ranges and visualize it, then we can specifically use histograms.

Bin size = 5, number of bins (buckets) = 10

[Figure: histogram of Ages, x-axis from 0 to 50 in steps of 5, y-axis = count, with a smoothed probability density curve drawn over the bars]

Probability density function (pdf): the smoothed curve over the histogram. The kernel density estimator is responsible for the smoothening.
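The bucketing behind a histogram can be sketched without a plotting library. The list below contains only the ages that are legible in the source (two entries are missing), and `bin_counts` is a hypothetical helper.

```python
from collections import Counter

# Only the legible ages from the example; two values are unreadable in the source.
ages = [12, 18, 24, 26, 30, 35, 36, 37, 41, 42, 43, 50, 51]
BIN_SIZE = 5


def bin_counts(values, bin_size):
    """Count how many values fall in each bin [start, start + bin_size)."""
    counts = Counter((v // bin_size) * bin_size for v in values)
    return dict(sorted(counts.items()))


hist = bin_counts(ages, BIN_SIZE)
for start, count in hist.items():
    # One '#' per observation gives a rough text histogram.
    print(f"{start:>2}-{start + BIN_SIZE - 1:<2} | {'#' * count}")
```

A plotting library would draw the same counts as bars; a kernel density estimate then smooths them into a curve.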

Skewness :
It is a measure of the distortion of a symmetrical distribution, or the asymmetry in a data set.

① Normal distribution (zero skewed)
Symmetrical distribution ⇒ the center element specifies median = mean = mode.
In a symmetrical distribution there is no skewness.

② Right skewed (positive skewed)
The RHS is elongated. An example of such a distribution is the "Log Normal Distribution".
mean ≥ median ≥ mode

③ Left skewed (negative skewed)
The LHS is elongated.
mean ≤ median ≤ mode

Quartiles :
· Q1 → 25th percentile
· Q2 → Median (50th percentile)
· Q3 → 75th percentile

5 Number summary
1. Minimum
2. First Quartile (25th percentile) (Q1)
3. Median (Q2)
4. Third Quartile (75th percentile) (Q3)
5. Maximum (100th percentile)

Let's see how we can remove outliers using this technique.

X = {1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 29}

Using the 5 number summary we calculate 2 important things: the lower fence and the higher fence. This is a range (border):
· Below the lower fence everything is an outlier.
· Above the higher fence everything is an outlier.

Formula to calculate the lower fence:
Lower fence = Q1 − 1.5 (IQR), where IQR = Inter Quartile Range = Q3 − Q1

Formula to calculate the higher fence:
Higher fence = Q3 + 1.5 (IQR)

With n = 19 values:
Q1 = 25th percentile = (25 / 100) × (n + 1) = 0.25 × 20 = 5th element = 3
Q3 = 75th percentile = (75 / 100) × (n + 1) = 0.75 × 20 = 15th element = 7
IQR = Q3 − Q1 = 7 − 3 = 4

Lower fence = Q1 − 1.5 (IQR) = 3 − 1.5 (4) = −3
Higher fence = Q3 + 1.5 (IQR) = 7 + 1.5 (4) = 13

The lower fence and higher fence give the range [−3, 13]. 29 lies outside it, so 29 is the outlier.
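The fence calculation can be sketched as follows. The percentile rule, position = pct / 100 × (n + 1), follows the notes; averaging the two neighbouring elements when the position is fractional is the rule the notes use in the assignment. The first data value (1) is an assumption taken from the box plot's stated minimum, and the helper names are hypothetical.

```python
def percentile_by_position(sorted_values, pct):
    """Percentile via the notes' rule: 1-indexed position = pct/100 * (n + 1)."""
    n = len(sorted_values)
    pos = pct / 100 * (n + 1)
    lower = min(max(int(pos), 1), n)  # clamp to a valid 1-indexed position
    upper = min(lower + 1, n)
    if pos == int(pos):
        return sorted_values[lower - 1]
    # Fractional position: average the two neighbouring elements.
    return (sorted_values[lower - 1] + sorted_values[upper - 1]) / 2


def iqr_outliers(values):
    """Return Q1, Q3, IQR, both fences and the values outside the fences."""
    data = sorted(values)
    q1 = percentile_by_position(data, 25)
    q3 = percentile_by_position(data, 75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    higher_fence = q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower_fence or v > higher_fence]
    return q1, q3, iqr, lower_fence, higher_fence, outliers


x = [1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 29]
q1, q3, iqr, lower_fence, higher_fence, outliers = iqr_outliers(x)
print(q1, q3, iqr, lower_fence, higher_fence, outliers)
```

Library routines such as NumPy's percentile functions use different interpolation rules by default, so their quartiles can differ slightly from this hand method.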

Box plot :
A box plot (available in seaborn) is a diagram which helps us to visualize outliers.

For the data above:
· Minimum value = 1
· Q1 = 3
· Median (Q2) = 5
· Q3 = 7
· Maximum = 9 (do not consider the outlier)
· Outlier = 29

[Figure: box plot with whiskers from min = 1 to max = 9, box from Q1 = 3 to Q3 = 7, median line at 5, and the outlier 29 drawn as a separate point]

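Interview Question :
Draw a box plot for right skewed data.

[Figure: box plot marked min, Q1, Q2, Q3, max, with the right-hand side stretched]

For right skewed data: mean > median > mode, and Q3 − Q2 > Q2 − Q1.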

Assignment

y = {−13, …, …, …, 3, 4, 5, 6, 7, 7, 8, 10, 10, 11, 24, 55}   (n = 16)

Q1 = 25th percentile = (25 / 100) × (17) = 4.25th element (hence take the average of the 4th and 5th elements) = −1
Q3 = 75th percentile = (75 / 100) × (17) = 12.75th element (average of the 12th and 13th elements) = 10

IQR = Q3 − Q1 = 10 − (−1) = 11

Lower fence = Q1 − 1.5 (IQR) = −1 − 1.5 (11) = −17.5
Higher fence = Q3 + 1.5 (IQR) = 10 + 1.5 (11) = 26.5

Range: [−17.5, 26.5]

Hence "55" is an outlier.

Covariance

$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

While using this formula:
· if X ↑ and Y ↑, or X ↓ and Y ↓: +ve covariance
· if X ↑ and Y ↓, or X ↓ and Y ↑: −ve covariance

Ex: for the worked table of x and y values (not legible in the source), the sum of the products $(x_i - \bar{x})(y_i - \bar{y})$ is positive. The covariance is +ve, so x and y have a positive covariance.

Advantages
· It captures the relationship between X and Y.

Disadvantages
· Covariance does not have a specific limit value (it can range anywhere from −∞ to +∞).
· It also cannot tell us which pair is more strongly related, for example between (x, y) and (x, z).

These are the disadvantages of covariance, so to overcome them we use the Pearson correlation coefficient.
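A minimal sketch of the sample covariance formula, using small made-up data sets (the values are illustrative assumptions, since the worked table in the notes is illegible). It also shows why the magnitude is hard to interpret: rescaling the data changes the covariance.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum((x - x_bar) * (y - y_bar)) / (n - 1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


x = [2, 4, 6]
y_up = [3, 5, 7]    # rises with x -> positive covariance
y_down = [7, 5, 3]  # falls as x rises -> negative covariance

cov_up = sample_covariance(x, y_up)
cov_down = sample_covariance(x, y_down)

# Same relationship measured in units 100 times larger: the covariance
# grows by a factor of 100 * 100, so it has no fixed range.
cov_scaled = sample_covariance([v * 100 for v in x], [v * 100 for v in y_up])

print(cov_up, cov_down, cov_scaled)
```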

PEARSON CORRELATION CO-EFFICIENT

$\rho_{x,y} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \, \sigma_y}$

The outcome of the Pearson correlation coefficient will range between −1 and +1.
· The more the value is towards +1, the more positively (+) correlated it is.
· The more the value is towards −1, the more negatively (−) correlated it is.

Ex:
· x, y → 0.6
· x, z → 0.7
Now we can say that x and z are more highly correlated than x and y, by using the Pearson correlation coefficient.

Note: Only when we restrict the value to a fixed range (−1 to +1) are we able to compare it.

[Figure: examples of scatter diagrams with different values of the correlation coefficient]

Disadvantages
· The Pearson correlation can only capture linear relationships.
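The bounded range can be seen in a small sketch. `pearson` is a hypothetical helper written from the formula above; the n − 1 factors in the covariance and the standard deviations cancel, so they are omitted. The data values are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: Cov(x, y) / (sigma_x * sigma_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)


x = [2, 4, 6]
r_up = pearson(x, [3, 5, 7])    # perfectly increasing line
r_down = pearson(x, [7, 5, 3])  # perfectly decreasing line

# Rescaling the data does not change the coefficient, unlike the covariance.
r_scaled = pearson([v * 100 for v in x], [300, 500, 700])

print(r_up, r_down, r_scaled)
```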

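SPEARMAN'S RANK CORRELATION COEFFICIENT

A non-linear relationship can be captured by Spearman's rank correlation.

$r_s = \frac{\mathrm{Cov}(R(x), R(y))}{\sigma_{R(x)} \, \sigma_{R(y)}}$

where R(x) and R(y) are the ranks of x and y.

[Table: columns x, y, R(x), R(y); the values are not legible in the source]

Let's see where we specifically use this: Feature Selection

Ex: With respect to house price
· Features: size of house, no. of rooms, location, no. of people staying, haunted
· Target: price
Features with a strong +ve or −ve correlation with the price are kept. A feature that does not play a very big role can be dropped.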
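A minimal sketch of the idea: Spearman's coefficient is the Pearson coefficient computed on ranks. The `ranks` helper is hypothetical and assumes there are no tied values (ties would need averaged ranks). The cubic data is an illustrative assumption chosen to show a non-linear but monotonic relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: Cov(x, y) / (sigma_x * sigma_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Rank 1 = smallest value. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman's rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but not linear

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Here the ranks of x and y agree exactly, so Spearman's coefficient is 1 while Pearson's stays below 1.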

Why study these distributions?
Note: The datasets we will be working on will follow one or the other distribution.

Different types of distribution:
1. Normal/Gaussian distribution → pdf
2. Standard Normal distribution → pdf
3. Log Normal distribution → pdf
4. Power Law distribution → pdf
5. Bernoulli distribution → pmf
6. Binomial distribution → pmf
7. Poisson distribution → pmf
8. Uniform distribution → pmf (discrete), pdf (continuous)
9. Exponential distribution → pdf
10. Chi Square distribution → pdf
11. F distribution → pdf

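Normal/Gaussian distribution
· Continuous random variable
· mean = mode = median
· Bell curve

NOTE: Most of the datasets that are available in the universe follow a Gaussian distribution.
Eg: Height, Weight, Age, IRIS dataset

Empirical Rule: 68 - 95 - 99.7 % Rule
[Figure: bell curve marked at μ − 3σ, μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ, μ + 3σ]

Notation: $X \sim N(\mu, \sigma^2)$
Parameters: μ = mean, σ² = variance

Probability density function:

$\mathrm{PDF} = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$

Whenever you find a distribution that follows the bell curve, it is a Gaussian distribution.
[Figure: normal density curves; the red curve is the standard normal distribution]

Cumulative distribution function:

$\mathrm{CDF} = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{x - \mu}{\sigma \sqrt{2}} \right) \right]$

The total area under the curve is 100 %.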
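The empirical rule follows directly from the CDF above, and can be checked with a short sketch using `math.erf`. The function names are illustrative.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Pr(X <= x) for X ~ N(mu, sigma^2), written with the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


def within_k_sigma(k, mu=0.0, sigma=1.0):
    """Probability of landing within k standard deviations of the mean."""
    return normal_cdf(mu + k * sigma, mu, sigma) - normal_cdf(mu - k * sigma, mu, sigma)


for k in (1, 2, 3):
    print(k, round(within_k_sigma(k) * 100, 1), "%")
```

The result does not depend on μ or σ, which is why the rule holds for every normal distribution.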

Standard Normal distribution

Let's consider a random variable $X \sim N(\mu, \sigma^2)$. Let μ = 3 and σ = 1.

Standard normal transformation:
X with μ = 3, σ = 1 → Y ~ standard normal distribution with μ = 0, σ = 1

[Figure: the bell curve centred at 3 is shifted so that it is centred at 0]

This can be converted into the standard normal distribution by the Z-score:

$Z\text{-score} = \frac{x_i - \mu}{\sigma}$

The Z-score tells us, about a value, how many standard deviations it is away from the mean.

Ex: for x = 1, Z = (1 − 3) / 1 = −2. The other values are converted similarly.
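The transformation can be sketched in a few lines. The data values {1, 2, 3, 4, 5} are an assumption for illustration; μ = 3 and σ = 1 are the values used in the notes.

```python
def z_scores(values, mu, sigma):
    """Convert each value to the number of standard deviations from the mean."""
    return [(v - mu) / sigma for v in values]


x = [1, 2, 3, 4, 5]  # assumed sample values
z = z_scores(x, mu=3, sigma=1)
print(z)

# After the transformation the values are centred on 0.
z_mean = sum(z) / len(z)
print(z_mean)
```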

Log Normal Distribution (continuous random variable)

[Figures: probability density function and cumulative distribution function of the log-normal distribution]

In probability theory, a log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution. Equivalently, if Y has a normal distribution, then the exponential function of Y, X = exp(Y), has a log-normal distribution.

$X \sim \text{Lognormal}(\mu, \sigma^2)$
$Y = \ln(X) \sim \text{Normal}(\mu, \sigma^2)$
ln = natural logarithm ($\log_e$)

Why do we make such transformations?
Because whenever our data follows a right skewed distribution and we transform it into a normal distribution, the model will get trained efficiently.

Transformation:
· Log Normal distribution → Normal distribution: y = ln(x)
· NOTE: For vice versa, i.e. Normal → Log Normal: x = exp(y)

NOTE: The notes state that in linear regression your independent features should follow a Gaussian/normal distribution. Strictly, the normality assumption of linear regression applies to the residuals, although transforming heavily skewed features often helps in practice.
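The two directions of the transformation can be sketched with simulated data. The parameters μ = 0 and σ = 0.5, the sample size and the seed are all arbitrary assumptions made for the illustration.

```python
import math
import random
from statistics import mean, median

rng = random.Random(42)  # fixed seed so the run is repeatable
MU, SIGMA = 0.0, 0.5     # assumed parameters of the underlying normal

y = [rng.gauss(MU, SIGMA) for _ in range(10_000)]  # Y ~ Normal(MU, SIGMA^2)
x = [math.exp(v) for v in y]                       # X = exp(Y) is log-normal
back = [math.log(v) for v in x]                    # ln(X) recovers Y

# Right skew pulls the mean above the median in the raw data.
print(round(mean(x), 3), round(median(x), 3))
# After the log transform, mean and median are close together again.
print(round(mean(back), 3), round(median(back), 3))
```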

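If X is a random variable with a Pareto (Type I) distribution, then the probability that X is greater than some number x, i.e. the survival function (also called the tail function), is given by

$\bar{F}(x) = \Pr(X > x) = \begin{cases} \left( \frac{x_m}{x} \right)^{\alpha} & x \ge x_m \\ 1 & x < x_m \end{cases}$

The power law is associated with the 80-20 % rule.

Q. Can we convert a Pareto distribution to a Normal distribution?
→ Yes, by applying a transformation.

Exponential distribution [continuous random variable]

The height of the curve is varied by λ (lambda).

$\mathrm{PDF} = f(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$

$\mathrm{CDF} = F(x; \lambda) = \begin{cases} 1 - e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$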
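The two piecewise formulas for the exponential distribution translate directly into code. This is a minimal sketch with illustrative function names, where `lam` stands for λ.

```python
from math import exp


def exponential_pdf(x, lam):
    """f(x; lam) = lam * exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return lam * exp(-lam * x) if x >= 0 else 0.0


def exponential_cdf(x, lam):
    """F(x; lam) = 1 - exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0


# A larger lambda gives a taller curve at x = 0 that decays faster.
for lam in (0.5, 1.0, 2.0):
    print(lam, exponential_pdf(0, lam), round(exponential_cdf(1, lam), 4))
```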

5 Bernoulli distribution (discrete random variable)

Note: The outcome of the process is binary {1, 0} (success, failure).

Eg: Tossing a coin
Pr(H) = 0.5 = p
Pr(T) = 0.5 = 1 − p = q

$\mathrm{PMF} = \begin{cases} q = 1 - p & \text{if } k = 0 \\ p & \text{if } k = 1 \end{cases}$

It can also be written as $\mathrm{pmf} = p^k (1 - p)^{1 - k}$, where k ∈ {1, 0}.

$\mathrm{CDF} = \begin{cases} 0 & \text{if } k < 0 \\ 1 - p & \text{if } 0 \le k < 1 \\ 1 & \text{if } k \ge 1 \end{cases}$

6 Binomial distribution

The binomial distribution is n repetitions of the Bernoulli distribution (a combination of multiple independent Bernoulli trials with the same p).
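The link between the two distributions can be sketched as follows. The binomial pmf, C(n, k) p^k (1 − p)^(n − k), is the standard formula and is not written out in the notes; the function names are illustrative.

```python
from math import comb


def bernoulli_pmf(k, p):
    """p^k * (1 - p)^(1 - k) for k in {0, 1}; 0 for any other k."""
    if k not in (0, 1):
        return 0.0
    return p ** k * (1 - p) ** (1 - k)


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# A single trial (n = 1) reduces the binomial to the Bernoulli distribution.
print(bernoulli_pmf(1, 0.3), binomial_pmf(1, 1, 0.3))

# Probability of exactly 5 heads in 10 fair coin tosses.
p_five_heads = binomial_pmf(5, 10, 0.5)
print(p_five_heads)
```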

7 Poisson distribution
· Discrete random variable (pmf)
· Describes the number of events occurring in a fixed interval of time

Ex: Number of people visiting a bank every hour

Here we take a parameter λ (lambda):
λ = 3 ⇒ expected number of people to come in that specific time interval

[Figure: bar chart of the pmf for x = 1, 2, 3, 4, 5, 6]

pmf: $\Pr(X = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}$

Ex: λ = 3

$\Pr(X = 5) = \frac{e^{-3} \cdot 3^{5}}{5!} = 0.101$

Note: Pr(X = 5 or X = 6) = Pr(X = 5) + Pr(X = 6)
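The worked value can be verified with a small sketch of the pmf. `poisson_pmf` is an illustrative name, and `lam` stands for λ.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Pr(X = k) = exp(-lam) * lam^k / k! for a Poisson(lam) variable."""
    return exp(-lam) * lam ** k / factorial(k)


LAMBDA = 3  # expected number of visitors per hour in the example

p5 = poisson_pmf(5, LAMBDA)
p6 = poisson_pmf(6, LAMBDA)

# The events X = 5 and X = 6 cannot both happen, so their probabilities add.
p5_or_6 = p5 + p6

print(round(p5, 3), round(p5_or_6, 3))
```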

8 Uniform Distribution
1. Continuous Uniform Distribution (pdf)
2. Discrete Uniform Distribution (pmf)

① Continuous Uniform Distribution (continuous random variable)

Eg: The number of candies sold daily at a shop is uniformly distributed.
Here we will be able to find a range, i.e. [15 - 20].
So whenever we have these kinds of datasets, which have some specific range, that will be called a Continuous Uniform Distribution. Here we will have a min and a max value: [min, max] → interval.

[Figures: probability density function and cumulative distribution function]

Notation: U(a, b)
Parameters: −∞ < a < b < ∞

$\mathrm{pdf} = \begin{cases} \frac{1}{b - a} & \text{for } x \in [a, b] \\ 0 & \text{otherwise} \end{cases}$

Eg: The number of candies sold daily at a shop is uniformly distributed with a maximum of 40 and a minimum of 10.

i) Probability of daily sales to fall between 15 and 30
x1 = 15, x2 = 30. We should find the area under this rectangle:

$\Pr(15 \le x \le 30) = (x_2 - x_1) \times \frac{1}{b - a} = (30 - 15) \times \frac{1}{40 - 10} = 15 \times \frac{1}{30} = 0.5$

[Figure: rectangle of height 1/(b − a) over the x-axis from 10 to 40, with the area between 15 and 30 shaded]

ii) Probability of sales above 20
x1 = 20, x2 = 40

$\Pr(20 \le x \le 40) = (40 - 20) \times \frac{1}{40 - 10} = 20 \times \frac{1}{30} = 0.66$
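Both probabilities are areas of rectangles, which a short sketch makes explicit. The function names are illustrative; a = 10 and b = 40 are the minimum and maximum from the candy example.

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): 1 / (b - a) inside [a, b], 0 outside."""
    return 1 / (b - a) if a <= x <= b else 0.0


def uniform_prob(x1, x2, a, b):
    """Pr(x1 <= X <= x2) for X ~ U(a, b): width of the overlap times the height."""
    lo = max(x1, a)
    hi = min(x2, b)
    return max(hi - lo, 0) / (b - a)


A, B = 10, 40  # minimum and maximum daily candy sales

p_between_15_and_30 = uniform_prob(15, 30, A, B)
p_above_20 = uniform_prob(20, B, A, B)

print(p_between_15_and_30, round(p_above_20, 2))
```

Clipping the interval to [a, b] means a query range that extends past the minimum or maximum still returns a valid probability.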