




























A thorough overview of descriptive and inferential statistics, covering key concepts such as mean, median, mode, variance, standard deviation, and hypothesis testing. It explores various probability distributions, including normal, binomial, and Poisson distributions, and delves into hypothesis testing methods like t-tests, chi-square tests, and ANOVA. The guide also includes practical examples and visualizations to aid understanding.
Typology: Study notes
Statistics
Statistics is a science of collecting, organizing and analyzing data.
Data = "facts or pieces of information"
Eg:
· Heights of students in classroom
· IQ of students
· Weight of people
Entire statistics is divided into two parts:
Types of statistics
① Descriptive Stats
defn: It consists of organizing and summarizing data.
① Measure of central tendency [mean, median, mode]
② Measure of dispersion [variance, standard deviation]
③ Different types of distribution of data. Eg: Histogram, pmf, cdf
Eg: "What is the average height of students in class?"

② Inferential Stats
defn: It consists of using data you have measured to form conclusions (sample → population).
Hypothesis testing:
① Z-test
② t-test
③ Chi-square test
Also: CI, Central Limit Theorem
Eg: "Is the height of students in class similar to what you expect in a college?"
Scales of measurement of data
① Nominal scale data
② Ordinal scale data
③ Interval scale data
④ Ratio scale data

① Nominal scale data
Here we mainly measure qualitative/categorical data.
Eg: Gender, labels.
Order does not matter.
Eg: Favourite color (10 responses): Red 5 → 50%, ..., Orange 2 → 20%

② Ordinal scale data
Ranking and order matters.
Difference cannot be measured.
Eg (categorical data): Best, Good, Bad
Eg: Race: 1st, 2nd, 3rd. We cannot measure the difference based on only this data, i.e. from the positions 1, 2, 3 alone. Only when extra information (time taken) is provided can we measure it.

③ Interval scale data
· Order matters
· Difference can be measured
· Ratio cannot be measured
· No zero starting point
Eg: Temperature: 30 F, 60 F, 90 F, 120 F
60 : 30 = 2 : 1, but we cannot say that one room's temperature is twice that of another room. Fahrenheit can also be negative, hence it has no zero starting point.

④ Ratio scale data
· Order matters [numbers]
· Differences are measurable, including ratios
· Contains a zero starting point
Eg: Students marks in class: 30, 60, 90, 95, 99
Example
:
[Nominal]
we
food
gene
IQ
measurement /Ratio scale data]
↓
can convert
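Descriptive statistics
① Measure of central tendency
① Mean:
Population size: N; sample size: n
X = {1, 1, 2, 2, 3, 3, 4, 5, 5, 6}
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
mean = (1 + 1 + 2 + 2 + 3 + 3 + 4 + 5 + 5 + 6) / 10 = 3.2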
② Median:
X = {..., 5, 2, 3, 2, ...}
Steps:
① Sort the random variable.
② Count the number of elements.
③ If the count is even, take the two center elements and take the average. If the count is odd, the center element will be the median.
Note: Median is used to find central tendency when outliers are present.
Why? Because the mean is affected by outliers.
Ex:
X = {1, 2, 3, 4, 5}: mean = 3, median = 3
X = {1, 2, 3, 4, 5, 100}: mean ≈ 19, median = 3.5
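The effect of an outlier on the mean versus the median can be checked with a minimal sketch using Python's standard `statistics` module. The variable names are illustrative; the two data sets are the ones from the example in these notes.

```python
from statistics import mean, median

# Data sets from the notes: the second adds the outlier 100.
without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 5, 100]

mean_a, median_a = mean(without_outlier), median(without_outlier)
mean_b, median_b = mean(with_outlier), median(with_outlier)

print(mean_a, median_a)            # 3 3
print(round(mean_b, 2), median_b)  # 19.17 3.5
```

The mean jumps from 3 to about 19 while the median only moves from 3 to 3.5, which is why the median is preferred when outliers are present.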
Ex: two data sets with variances 2.5 and 7.5.
For X = {1, 2, 3, 4, 5}: $\bar{x} = 3$ and $\sum_i (x_i - \bar{x})^2 = 10$, so the sample variance is $10 / 4 = 2.5$.
[Figure: the respective graphs for these variances]
If the variance is less, then the spread is less.
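② Standard deviation:
It is nothing but the root of variance: "How far the elements are away from the mean".
Ex: X = {1, 2, 3, 4, 5}, $\bar{x} = 3$
Population standard deviation: $\sigma = \sqrt{\text{variance}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{2.5} \approx 1.58$
Therefore variance is the standard deviation squared.
How to show this spread? With standard deviation we are talking about how far a value is away from the mean, as well as how large the spread is.
ex: 6.16 is 2 standard deviations away from the mean (3 + 2 × 1.58).
Note: 4.58 is one standard deviation away from the mean (3 + 1.58).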
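As a hedged check of the variance and standard deviation definitions, the sketch below uses the standard `statistics` module on the data set {1, 2, 3, 4, 5} from these notes. `pvariance`/`pstdev` divide by N (population) and `variance`/`stdev` divide by n - 1 (sample).

```python
from statistics import pstdev, pvariance, stdev, variance

data = [1, 2, 3, 4, 5]

pop_var = pvariance(data)   # sum of squared deviations / N     = 10 / 5
samp_var = variance(data)   # sum of squared deviations / (n-1) = 10 / 4
pop_std = pstdev(data)      # square root of the population variance
samp_std = stdev(data)      # square root of the sample variance

print(pop_var, samp_var)                      # 2 2.5
print(round(pop_std, 3), round(samp_std, 3))  # 1.414 1.581
```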
M
L
A
variance error
A
A
S
E
x
Linahebra
Seeterraciables
11
Random variable is a process of mapping the output of a random process or experiment to a number.
Eg: Tossing a coin [Head, Tail]
X maps each outcome of this random process to a number.
"In every toss the value can change, i.e. it is not fixed."
Ex: Rolling a dice [1, 2, 3, 4, 5, 6]
Y = {sum of rolling of dice n times}
What can we find from this?
-> We can find probabilities such as Pr(Y > 15), Pr(Y < 10).
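To make the "sum of dice" random variable concrete, here is a small sketch that builds the exact distribution of the sum by repeated convolution. The number of rolls (2) and the threshold (10) are assumptions chosen for illustration, since the values in the notes are not legible; `sum_distribution` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction

def sum_distribution(n_rolls):
    """Exact distribution of the sum of n_rolls fair six-sided dice."""
    dist = Counter({0: Fraction(1)})
    for _ in range(n_rolls):
        nxt = Counter()
        for total, prob in dist.items():
            for face in range(1, 7):
                nxt[total + face] += prob * Fraction(1, 6)
        dist = nxt
    return dist

# Assumption: 2 rolls, so Y ranges over 2..12.
dist = sum_distribution(2)
pr_gt_10 = sum(p for total, p in dist.items() if total > 10)
print(pr_gt_10)  # 1/12
```

Each roll changes the value of Y, but the probabilities of events such as Y > 10 are fixed and can be computed from the distribution.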
Histogram and Skewness
Whenever we talk about frequency, the best diagram we can use is a histogram.
Ex: Ages = {..., 12, 18, 24, 26, 30, 35, 36, 37, ..., 41, 42, 43, 50, 51}
If we want to find the frequency of the elements between the ranges, and we want to visualize it in diagrams, then we can specifically use histograms.
Bin size = 5, No. of bins = 10 (buckets)
[Figure: histogram of Ages with count on the vertical axis, overlaid with a smoothed curve showing the probability density (pdf, probability density function)]
Kernel density estimator is responsible for smoothening the histogram into the pdf curve.
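The bucketing behind a histogram can be sketched in a few lines of plain Python. The list below contains only the legible ages from the example, and the bin width of 10 is an assumption chosen to keep the output short; neither is claimed to match the original figure.

```python
from collections import Counter

# Only the ages that are legible in the notes (an assumption for illustration).
ages = [12, 18, 24, 26, 30, 35, 36, 37, 41, 42, 43, 50, 51]
bin_width = 10  # assumed bin width

# Map every age to the lower edge of its bin and count the members of each bin.
counts = Counter((age // bin_width) * bin_width for age in ages)

for start in sorted(counts):
    print(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
```

The length of each bar is the frequency of the values falling in that range, which is exactly what the histogram visualizes.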
Skewness: is a measure of the distortion or asymmetry in a distribution or data set.
① Symmetrical distribution => Normal distribution (zero skewed)
In this the center element is specifying median = mean = mode. In a symmetrical distribution there is no skewness.
② Right skewed: positive skewed (the RHS is elongated). This distribution is called as "Log Normal Distribution".
mean ≥ median ≥ mode
③ Left skewed: negative skewed (the LHS is elongated).
mode ≥ median ≥ mean
5 number summary
① Minimum
② First Quartile (25 percentile) (Q1)
③ Median
④ Third Quartile (75 percentile) (Q3)
⑤ Maximum

Let's see how we can remove outliers using this technique.
X = {1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 29}
Using 5 numbers summary we define 2 things: Lower fence and Higher fence.
· Below Lower fence everything is an outlier
· Above Higher fence everything is an outlier
Formula to calculate Lower fence:
Lower fence = Q1 - 1.5 (IQR)
Formula to calculate Higher fence:
Higher fence = Q3 + 1.5 (IQR)
IQR = Inter Quartile Range = Q3 - Q1

Q1 = 25/100 × (n + 1) = 25/100 × 20 = 5th element = 3
Q3 = 75/100 × (n + 1) = 75/100 × 20 = 15th element = 7
IQR = Q3 - Q1 = 7 - 3 = 4
Lower fence = 3 - 1.5(4) = -3
Higher fence = 7 + 1.5(4) = 13
Values in the range between -3 and 13 are kept; 29 is an outlier.
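The fence calculation can be reproduced with a short sketch. It follows the position rule used in these notes, percentile/100 × (n + 1), averaging the two neighbouring elements when the position is fractional; `value_at_percentile` is a hypothetical helper, and positions that fall outside 1..n are not handled.

```python
def value_at_percentile(sorted_data, percentile):
    """Element at 1-based position percentile/100 * (n + 1).

    A fractional position is resolved by averaging the two neighbouring
    elements, as done in the notes. Positions outside 1..n are not handled.
    """
    pos = percentile / 100 * (len(sorted_data) + 1)
    lower = int(pos)
    if pos == lower:
        return sorted_data[lower - 1]
    return (sorted_data[lower - 1] + sorted_data[lower]) / 2

data = sorted([1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 8, 8, 9, 29])

q1 = value_at_percentile(data, 25)   # 5th element
q3 = value_at_percentile(data, 75)   # 15th element
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
higher_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > higher_fence]

print(q1, q3, iqr)                # 3 7 4
print(lower_fence, higher_fence)  # -3.0 13.0
print(outliers)                   # [29]
```

Library routines such as `statistics.quantiles` use slightly different interpolation rules, so their quartiles can differ from this hand method on other data sets.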
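BOX PLOT:
Box plot diagram helps us to visualize outliers.
Minimum value = 1
Q1 = 3
Median = 5
Q3 = 7
Maximum = 9 (the value 29 is drawn separately as an outlier)
[Figure: box plot of the data above, with the box from Q1 to Q3, the median line at 5, whiskers at the minimum and maximum, and the outlier marked beyond the higher fence]
Exercise: Draw a box plot for a right skewed distribution (mark min, max, mean, median, mode).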
Internal Assignment
y = {-13, ..., ..., ..., 3, 4, 5, 6, 7, 7, 8, 10, 10, 11, 24, 55}
Q1 = 25 percentile = 25/100 × (17) = 4.25th element (hence average of 4th & 5th element)
Q3 = 75 percentile = 75/100 × (17) = 12.75th element (hence average of 12th & 13th element) = 10
IQR = Q3 - Q1 = 11
Lower fence = Q1 - 1.5 (IQR) = -1 - 1.5(11) = -17.5
Higher fence = Q3 + 1.5 (IQR) = 10 + 1.5(11) = 26.5
55 is an outlier.
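Covariance
$\mathrm{Cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
While using this formula:
· if X ↑ and Y ↑ (or X ↓ and Y ↓): +ve covariance
· if X ↑ and Y ↓ (or X ↓ and Y ↑): -ve covariance
Ex: for a table of x and y values, $\mathrm{Cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n-1} = ...$
The covariance is +ve, so x and y are having a positive covariance.
Advantages:
· It shows the relationship b/w X & Y
Disadvantages:
· Covariance does not have a limit value (it can range anywhere from -∞ to +∞)
· It also cannot tell us which pair is more highly covariant, e.g. when comparing Cov(x, y) = 100 with Cov(x, z)
So, because of these disadvantages, we use Pearson correlation to overcome this.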
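A minimal sketch of the sample covariance (dividing by n - 1) illustrates both the sign rule and the "no limit value" disadvantage. The data are made up for illustration and `sample_covariance` is a hypothetical helper name.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of (x - x_bar) * (y - y_bar), divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

xs = [2, 4, 6]            # illustrative data
ys_up = [3, 5, 7]         # y increases with x
ys_down = [7, 5, 3]       # y decreases as x increases

cov_up = sample_covariance(xs, ys_up)
cov_down = sample_covariance(xs, ys_down)
cov_scaled = sample_covariance([100 * x for x in xs], ys_up)

print(cov_up, cov_down)  # 4.0 -4.0
print(cov_scaled)        # 400.0
```

Rescaling x by 100 multiplies the covariance by 100 even though the relationship is unchanged, so the magnitude alone cannot be used to compare pairs of variables.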
PEARSON CORRELATION CO-EFFICIENT
$\rho_{x,y} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y}$
The outcome of Pearson correlation is ranging between -1 to +1.
· The more the value is towards +1, the more positively correlated it is.
· The more the value is towards -1, the more negatively correlated it is.
Ex: given two coefficients, both within -1 to +1 (one of them 0.7), we can now say which pair is more highly co-related by using the Pearson coefficient, because the values can be compared directly.
[Figure: examples of scatter diagrams for different values of the correlation co-efficient]
Disadvantages:
· With Pearson correlation it can only capture linear properties.
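The normalisation by the two standard deviations can be sketched as follows. The data are illustrative, and `pearson` is a hypothetical helper built on the sample covariance and `statistics.stdev`; it is a sketch, not a reference implementation.

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson coefficient: sample covariance divided by both sample std devs."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

xs = [2, 4, 6]   # illustrative data
ys = [3, 5, 7]

r = pearson(xs, ys)
r_scaled = pearson([100 * x for x in xs], ys)    # rescaling x does not change r
r_negative = pearson(xs, [7, 5, 3])

print(round(r, 6), round(r_scaled, 6), round(r_negative, 6))
```

Unlike covariance, the coefficient stays the same when a variable is rescaled, which is what makes values for different pairs comparable.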
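③ SPEARMAN'S RANK CORRELATION
A non-linear (monotonic) relationship can be captured by
$r_s = \frac{\mathrm{Cov}(R(x), R(y))}{\sigma_{R(x)}\,\sigma_{R(y)}}$
where R(x) and R(y) are the ranks of x and y.
Table columns: x, y, R(x), R(y)
Let's see where we specifically use this:
With respect to the price of a house, the features are: size of house, no. of rooms, location, no. of ppl staying.
A feature that does not play a very big role with respect to price can be dropped (hence drop the feature).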
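Spearman's coefficient is the Pearson coefficient applied to ranks, which the sketch below illustrates on made-up data where y = x³. The helpers `ranks`, `pearson` and `spearman` are hypothetical names, and the ranking assumes there are no tied values.

```python
from statistics import mean, stdev

def ranks(values):
    """Rank 1 for the smallest value; assumes no ties (an assumption of this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def pearson(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

def spearman(xs, ys):
    """Pearson coefficient computed on the ranks R(x), R(y)."""
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]   # monotonic but non-linear relationship

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because the ranks of x and y are identical here, the Spearman value is 1 (up to rounding) while the Pearson value stays below 1.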
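Different types of distribution:
Note: The datasets we will be working on will be following one or the other of these distributions.
1. Normal/Gaussian distribution (pdf)
2. Standard Normal distribution
3. Log Normal distribution (pdf)
4. Power Law distribution
5. Bernoulli distribution (pmf)
6. Binomial distribution (pmf)
7. Poisson distribution (pmf)
8. Uniform distribution (pmf)
9. Exponential distribution (pdf)
10. Chi square distribution (pdf)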
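① Normal/Gaussian distribution [continuous random variable]
NOTE: Most of the datasets that are available in the universe follow a Gaussian distribution.
· mean = mode = median
· Bell curve
· Empirical rule: 68-95-99.7% rule, for the intervals μ ± σ, μ ± 2σ, μ ± 3σ
Eg: Height, ..., IRIS dataset
Notation: $N(\mu, \sigma^2)$, with parameters mean $\mu$ and variance $\sigma^2$
Probability density function: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
Cumulative distribution function: $\mathrm{CDF} = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]$
Whenever we find a bell-shaped curve, the distribution it follows is Gaussian.
[Figure: normal pdf curves; the red curve is the standard normal distribution]

② Standard Normal distribution
Let X be a random variable.
Normal: μ = 3 and σ = 1
Standard Normal: μ = 0 and σ = 1
X can be converted into a standard normal distribution by the Z-score:
Z-score = $\frac{x_i - \mu}{\sigma}$
The Z-score tells us, about a value, how many standard deviations away from the mean it is.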
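The conversion to a standard normal scale can be sketched with the standard `statistics` module. The data set is illustrative (an assumption, not taken from the notes); the point is that after applying the Z-score the values have mean 0 and standard deviation 1.

```python
from statistics import mean, pstdev

data = [1, 2, 3, 4, 5]   # illustrative values

mu = mean(data)          # 3
sigma = pstdev(data)     # population standard deviation

# Z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - mu) / sigma for x in data]

print([round(z, 3) for z in z_scores])
print(round(mean(z_scores), 6), round(pstdev(z_scores), 6))
```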
③ Log Normal Distribution [continuous random variable, probability density function]
In probability theory, a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if X is log-normally distributed, then Y = ln(X) has a normal distribution. Equivalently, if Y has a normal distribution, then the exponential function of Y, X = exp(Y), has a log-normal distribution.
$X \sim \text{LogNormal}(\mu, \sigma^2)$
$Y = \ln(X) \sim N(\mu, \sigma^2)$
ln = natural logarithm ($\log_e$)
Why do we do such transformations?
Because whenever our data follows a right skewed (+ve) distribution, after this transformation my model will get trained efficiently. Linear regression says that your independent features should follow a Gaussian/normal distribution.
NOTE: For vice versa, i.e. Normal -> Log Normal, use x = exp(y).
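The effect of the log transformation on right skewed data can be illustrated with simulated values. The parameters (μ = 0, σ = 1), the sample size and the seed are assumptions chosen for the sketch; `random.Random.lognormvariate` draws the log-normal samples.

```python
import math
import random
from statistics import mean, median

rng = random.Random(0)   # fixed seed so the sketch is repeatable

# Log-normal samples: right skewed, so the mean sits well above the median.
x = [rng.lognormvariate(0.0, 1.0) for _ in range(10_000)]

# Y = ln(X) should be approximately normal: mean and median nearly coincide.
y = [math.log(value) for value in x]

print(round(mean(x), 2), round(median(x), 2))
print(round(mean(y), 2), round(median(y), 2))
```

Going the other way, applying `math.exp` to normally distributed values produces log-normal values again.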
If X is a random variable with a Pareto (Type I) distribution, then the probability that X is greater than some number x, i.e. the survival function (also called tail function), is
$\bar{F}(x) = \Pr(X > x) = \left(\frac{x_m}{x}\right)^{\alpha}$ for $x \ge x_m$, and $1$ for $x < x_m$
Eg: 80-20% rule
Q. Can we convert Pareto distribution to Normal distribution?
Ans: by applying a transformation.

⑨ Exponential distribution [continuous random variable]
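Parameter: λ (lambda)
pdf: $f(x; \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$, and $0$ for $x < 0$
CDF: $F(x; \lambda) = 1 - e^{-\lambda x}$ for $x \ge 0$, and $0$ for $x < 0$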
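The exponential pdf and CDF translate directly into two small functions. This is a sketch with hypothetical function names and an illustrative rate λ = 2.

```python
import math

def exp_pdf(x, lam):
    """pdf of the exponential distribution: lam * exp(-lam * x) for x >= 0, else 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exp_cdf(x, lam):
    """CDF of the exponential distribution: 1 - exp(-lam * x) for x >= 0, else 0."""
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

lam = 2.0   # illustrative rate
print(exp_pdf(0, lam))               # 2.0
print(round(exp_cdf(1, lam), 4))     # 0.8647
print(exp_cdf(-1, lam))              # 0.0
```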
⑤ Bernoulli distribution [discrete random variable]
Note: The outcomes are binary: success or failure.
Eg: Tossing a coin
Pr(H) = 0.5 = p
Pr(T) = 0.5 = 1 - p = q
p + q = 1
PMF = q = 1 - p if k = 0; p if k = 1
CDF = 0 if k < 0; 1 - p if 0 ≤ k < 1; 1 if k ≥ 1
It can also be derived by
pmf = $p^{k}(1-p)^{1-k}$, $k \in \{0, 1\}$
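The compact form of the Bernoulli pmf can be checked against the two-case definition with a minimal sketch; `bernoulli_pmf` is a hypothetical helper name.

```python
def bernoulli_pmf(k, p):
    """pmf p**k * (1 - p)**(1 - k) for k in {0, 1}."""
    if k not in (0, 1):
        raise ValueError("k must be 0 or 1")
    return p ** k * (1 - p) ** (1 - k)

# Fair coin: success (k = 1) and failure (k = 0) both have probability 0.5.
print(bernoulli_pmf(1, 0.5), bernoulli_pmf(0, 0.5))  # 0.5 0.5

# The two probabilities always add up to 1 (p + q = 1).
total = bernoulli_pmf(0, 0.3) + bernoulli_pmf(1, 0.3)
```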
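⑥ Binomial distribution
Binomial distribution is n-times of Bernoulli distribution (combination of multiple Bernoulli trials).

⑦ Poisson distribution [discrete random variable (pmf)]
Describes the number of events occurring in a fixed interval of time.
Here we take a λ -> λ = 3 => expected number of events (people) to come at that time interval.
Eg: Number of people visiting a bank every hour.
pmf: $\Pr(X = k) = \frac{e^{-\lambda}\lambda^{k}}{k!}$
With λ = 3: $\Pr(X = 5) = \frac{e^{-3}\, 3^{5}}{5!} \approx 0.101$
Note: If X = 5 or X = 6: Pr(X = 5 or X = 6) = Pr(X = 5) + Pr(X = 6)
[Figure: Poisson pmf plotted for X = 1, 2, 3, 4, 5, 6]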
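The worked Poisson value can be reproduced with a short sketch; `poisson_pmf` is a hypothetical helper, and λ = 3 is the rate used in the example of these notes.

```python
import math

def poisson_pmf(k, lam):
    """Poisson pmf: exp(-lam) * lam**k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3
p5 = poisson_pmf(5, lam)
p6 = poisson_pmf(6, lam)

print(round(p5, 3))        # 0.101
print(round(p5 + p6, 3))   # 0.151, i.e. Pr(X = 5 or X = 6)
```

The events X = 5 and X = 6 cannot happen together, so their probabilities are simply added.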
⑧ Uniform Distribution
Continuous & Discrete (pmf)
① Continuous Uniform Distribution [continuous random variable]
Here we will be able to define a range, i.e. a set of values which will have some range; that will be called a Continuous Uniform Distribution. Here we will have a min and max value, [min, max] -> interval.
Notation: $U(a, b)$, with $-\infty < a < b < \infty$
pdf = $\frac{1}{b-a}$ for $x \in [a, b]$, and $0$ otherwise
Cumulative distribution function: $0$ for $x < a$; $\frac{x-a}{b-a}$ for $x \in [a, b)$; $1$ for $x \ge b$

Eg: The number of candies sold daily at a shop is uniformly distributed with a maximum of 40 and a minimum of 10.
i) Probability of daily sales to fall between 15 and 30
x1 = 15, x2 = 30. To find Pr(15 ≤ x ≤ 30) we should find the area under the rectangle between x1 and x2:
Pr(15 ≤ x ≤ 30) = (x2 - x1) × 1/(b - a) = (30 - 15) × 1/30 = 0.5
ii) Probability of sales above 20
x1 = 20, x2 = 40
Pr(20 ≤ x ≤ 40) = (40 - 20) × 1/(b - a) = 20 × 1/30 ≈ 0.66
[Figure: uniform pdf on the interval 10 to 40, with the areas between 15 and 30 and between 20 and 40 shaded]
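Both candy-shop probabilities are areas of rectangles under the flat pdf, which the following sketch computes. `uniform_prob` is a hypothetical helper that clips the requested interval to [a, b] before measuring its width.

```python
def uniform_prob(x1, x2, a, b):
    """Pr(x1 <= X <= x2) for X ~ U(a, b): overlap width times the height 1 / (b - a)."""
    low = max(x1, a)
    high = min(x2, b)
    return max(high - low, 0) / (b - a)

a, b = 10, 40   # minimum and maximum daily sales from the example

p_between = uniform_prob(15, 30, a, b)
p_above_20 = uniform_prob(20, 40, a, b)

print(p_between)             # 0.5
print(round(p_above_20, 2))  # 0.67
```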