






Data Science Cheat Sheet
Last Updated August 13, 2018
What is Data Science?
Multi-disciplinary field that brings together concepts from computer science, statistics/machine learning, and data analysis to understand and extract insights from the ever-increasing amounts of data.
Two paradigms of data research: hypothesis-driven (given a problem, what kind of data do we need to solve it?) and data-driven (given some data, what interesting problems can be solved with it?).
The heart of data science is to always ask questions. Always be curious about the world.
Types of Data
Structured: data that has predefined structures, e.g. tables, spreadsheets, or relational databases.
Unstructured Data: data with no predefined structure; comes in any size or form and cannot be easily stored in tables, e.g. blobs of text, images, audio.
Quantitative Data: numerical, e.g. height, weight.
Categorical Data: data that can be labeled or divided into groups, e.g. race, sex, hair color.
Big Data: massive datasets, or data with greater variety arriving in increasing volumes and with ever-higher velocity (the 3 Vs); cannot fit in the memory of a single machine.
Data Sources/Formats
Most Common Data Formats: CSV, XML, SQL, JSON, Protocol Buffers
Data Sources: companies/proprietary data, APIs, government, academic, web scraping/crawling
Main Types of Problems
Two problems arise repeatedly in data science.
Classification: assigning something to a discrete set of possibilities, e.g. spam or non-spam, Democrat or Republican, blood type (A, B, AB, O).
Regression: predicting a numerical value, e.g. someone's income, next year's GDP, a stock price.
Probability Overview
Probability theory provides a framework for reasoning about the likelihood of events.
Terminology
Experiment: procedure that yields one of a possible set of outcomes, e.g. repeatedly tossing a die or coin.
Sample Space S: set of possible outcomes of an experiment, e.g. if tossing a die, S = {1, 2, 3, 4, 5, 6}.
Event E: set of outcomes of an experiment, e.g. the event that a roll is 5, or the event that the sum of 2 rolls is 7.
Probability of an Outcome s, P(s): a number that satisfies two properties: 0 ≤ p(s) ≤ 1 for each outcome s, and ∑s∈S p(s) = 1.
Probability of Event E: sum of the probabilities of the outcomes in E: p(E) = ∑s∈E p(s).
Random Variable V: numerical function on the outcomes of a probability space.
Expected Value of Random Variable V: E(V) = ∑s∈S p(s) · V(s).
Independence, Conditional, Compound
Independent Events: A and B are independent iff P(A ∩ B) = P(A)P(B), P(A|B) = P(A), P(B|A) = P(B).
Conditional Probability: P(A|B) = P(A,B)/P(B)
Bayes' Theorem: P(A|B) = P(B|A)P(A)/P(B)
Joint Probability: P(A,B) = P(B|A)P(A)
Marginal Probability: P(A)
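As a quick numerical check of these identities, here is a minimal Python sketch; the probabilities are made up for illustration:

# Hypothetical numbers: P(A) = 0.01, P(B|A) = 0.95, P(B|not A) = 0.05
p_a = 0.01
p_b_given_a = 0.95
p_b_given_not_a = 0.05

# Marginal probability P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A|B) = P(B|A)P(A)/P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))   # ~0.161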
Probability Distributions
Probability Density Function (PDF): gives the probability that a random variable takes on the value x: pX(x) = P(X = x).
Cumulative Distribution Function (CDF): gives the probability that a random variable is less than or equal to x: FX(x) = P(X ≤ x).
Note: the PDF and the CDF of a given random variable contain exactly the same information.
Descriptive Statistics
Descriptive statistics provide a way of capturing a given data set or sample. There are two main types: centrality and variability measures.
Centrality
Arithmetic Mean: useful for characterizing symmetric distributions without outliers: μX = (1/n) ∑ xi.
Geometric Mean: useful for averaging ratios; always less than or equal to the arithmetic mean: (a1 a2 ... an)^(1/n).
Median: exact middle value of a dataset; useful for skewed distributions or data with outliers.
Mode: most frequent element in a dataset.
Variability
Standard Deviation: measures the spread of the data, based on the squared differences between the individual elements and the mean: σ = √( ∑i=1..N (xi − x̄)^2 / (N − 1) ).
Variance: V = σ^2.
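A small Python sketch of these measures using the standard library statistics module (the sample is made up; geometric_mean requires Python 3.8+):

import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]                 # made-up sample

print(statistics.mean(data))                    # arithmetic mean: 5
print(statistics.geometric_mean(data))          # geometric mean (<= arithmetic mean)
print(statistics.median(data))                  # median: 4.5
print(statistics.mode(data))                    # mode: 4
print(statistics.stdev(data))                   # sample standard deviation (divides by N-1)
print(statistics.variance(data))                # sample variance = stdev^2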
Interpreting Variance
Variance is an inherent part of the universe. It is impossible to obtain the same results after repeated observations of the same event due to random noise/error. Sometimes the variance can be explained away by attributing it to sampling or measurement error; other times, it is due to the random fluctuations of the universe.
Correlation Analysis
The correlation coefficient r(X,Y) is a statistic that measures the degree to which Y is a function of X, and vice versa. Correlation values range from -1 to 1, where 1 means fully (positively) correlated, -1 means fully negatively correlated, and 0 means no correlation.
Pearson Coefficient: measures the degree of the relationship between linearly related variables: r = Cov(X,Y) / (σ(X) σ(Y)).
Spearman Rank Coefficient: computed on ranks; depicts monotonic relationships.
Note: Correlation does not imply causation!
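A minimal sketch of computing the Pearson and Spearman coefficients, assuming SciPy and NumPy are available (the data is made up):

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])         # made-up data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])         # roughly linear in x

r, p_value = pearsonr(x, y)                     # linear relationship
rho, _ = spearmanr(x, y)                        # monotonic (rank-based) relationship
print(r, rho)                                   # both close to 1 for this data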
Data Cleaning
Data cleaning is the process of turning raw data into a clean and analyzable data set. "Garbage in, garbage out." Make sure garbage doesn't get put in.
Errors vs. Artifacts
Errors: information that is lost or corrupted during acquisition and can never be recovered, e.g. data dropped during a power outage or server crash.
Artifacts: systematic problems introduced by the processing/cleaning pipeline; they can be corrected, but must first be detected.
Data Compatibility
Data compatibility problems arise when merging datasets. Make sure you are comparing "apples to apples" and not "apples to oranges". The main types of conversions/unifications concern units, number and text representations, names/identifiers, time/date formats, and currencies.
Data Imputation
Process of dealing with missing values. The proper methods depend on the type of data we are working with. General methods include dropping rows/columns with missing values, filling in missing values with the mean/median/mode, or predicting missing values with a model.
Outlier Detection
Outliers can interfere with analysis and often arise from mistakes during data collection. It makes sense to run a "sanity check".
Miscellaneous Lowercasing, removing non-alphanumeric, repairing, unidecode, removing unknown characters
Note: When cleaning data, always maintain both the raw data and the cleaned version(s). The raw data should be kept intact and preserved for future use. Any type of data cleaning/analysis should be done on a copy of the raw data.
Feature Engineering
Feature engineering is the process of using domain knowledge to create features or input variables that help machine learning algorithms perform better. Done correctly, it can help increase the predictive power of your models. Feature engineering is more of an art than a science. FE is one of the most important steps in creating a good model. As Andrew Ng puts it:
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
Continuous Data
Raw Measures: data that hasn't been transformed yet.
Rounding: sometimes precision is noise; round to nearest integer, decimal, etc.
Scaling: log, z-score, minmax scale.
Imputation: fill in missing values using mean, median, model output, etc.
Binning: transforming numeric features into categorical ones (or binned), e.g. values between 1-10 belong to A, between 10-20 belong to B, etc.
Interactions: interactions between features, e.g. subtraction, addition, multiplication, statistical tests.
Statistical: log/power transform (helps make skewed distributions more normal), Box-Cox.
Row Statistics: number of NaNs, 0s, negative values, max, min, etc.
Dimensionality Reduction: using PCA, clustering, factor analysis, etc.
Discrete Data
Encoding: since some ML algorithms cannot work on categorical data, we need to turn categorical data into numerical data or vectors.
Ordinal Values: convert each distinct feature value into an integer (e.g. [r, g, b] becomes [1, 2, 3]).
One-Hot Encoding: each of the m distinct values becomes a vector of length m containing a single 1 (e.g. [r, g, b] becomes [[1,0,0], [0,1,0], [0,0,1]]).
Feature Hashing Scheme: turns arbitrary features into indices in a vector or matrix.
Embeddings: if using words, convert words to vectors (word embeddings).
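A brief sketch of ordinal and one-hot encoding with pandas (assuming pandas is available; the column and mapping are made up):

import pandas as pd

df = pd.DataFrame({"color": ["r", "g", "b", "g"]})            # made-up categorical feature

# Ordinal encoding: map each category to an integer
df["color_ordinal"] = df["color"].map({"r": 1, "g": 2, "b": 3})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)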
Statistical Analysis
Process of statistical reasoning: there is an underlying population of possible things we can potentially observe, and only a small subset of them are actually sampled (ideally at random). Probability theory describes what properties our sample should have given the properties of the population, but statistical inference allows us to deduce what the full population is like after analyzing the sample.
Sampling From Distributions
Inverse Transform Sampling: sampling points from a given probability distribution is sometimes necessary, e.g. to run simulations or to test whether your data fits a particular distribution. The general technique is called inverse transform sampling or the Smirnov transform. First draw a random number p between [0,1]. Then compute the value x such that the CDF equals p: FX(x) = p. Use x as the random value drawn from the distribution described by FX.
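A minimal sketch of inverse transform sampling for an Exponential(λ) distribution, whose CDF F(x) = 1 − e^(−λx) can be inverted by hand (λ = 2 is an arbitrary choice):

import math
import random

def sample_exponential(lam):
    # F(x) = 1 - exp(-lam * x)  =>  x = -ln(1 - p) / lam
    p = random.random()                  # uniform draw from [0, 1)
    return -math.log(1 - p) / lam

samples = [sample_exponential(2.0) for _ in range(10000)]
print(sum(samples) / len(samples))       # should be close to the true mean 1/lam = 0.5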
Monte Carlo Sampling In higher dimensions, correctly sampling from a given distribution becomes more tricky. Generally want to use Monte Carlo methods, which typically follow these rules: define a domain of possible inputs, generate random inputs from a probability distribution over the domain, perform a deterministic calculation, and analyze the results.
Modeling- Taxonomy
There are many different types of models. It is important to understand the trade-offs and when to use a certain type of model.
Parametric vs. Nonparametric
Parametric: makes an assumption about the functional form (shape) of f and has a fixed number of parameters, e.g. linear regression.
Nonparametric: makes no explicit assumption about the form of f; model complexity can grow with the amount of data, e.g. k-nearest neighbors, decision trees.
Modeling- Evaluation Metrics
We need to determine how good our model is. The best way to assess models is with out-of-sample predictions (data points your model has never seen).
Classification
              Predicted Yes          Predicted No
Actual Yes    True Positives (TP)    False Negatives (FN)
Actual No     False Positives (FP)   True Negatives (TN)
Accuracy: ratio of correct predictions over total predictions; misleading when class sizes are substantially different. accuracy = (TP + TN) / (TP + TN + FN + FP)
Precision: how often the classifier is correct when it predicts positive: precision = TP / (TP + FP)
Recall: how often the classifier is correct for all positive instances: recall = TP / (TP + FN)
F-Score: single measurement to describe performance: F = 2 · (precision · recall) / (precision + recall)
ROC Curves: plot the true positive rate against the false positive rate for various thresholds, where the threshold is the point at which the model decides whether a data point is positive or negative (e.g. if the predicted probability is above 0.8, classify as positive). The best possible area under the ROC curve (AUC) is 1, while random guessing gives 0.5, the main diagonal line.
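A minimal sketch of these metrics from raw confusion-matrix counts (the counts are made up):

TP, FP, FN, TN = 40, 10, 5, 45          # hypothetical counts

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f_score = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_score)   # 0.85, 0.8, ~0.889, ~0.842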
Regression
Errors are defined as the difference between a prediction y′ and the actual result y.
Absolute Error: Δ = y′ − y
Squared Error: Δ^2 = (y′ − y)^2
Mean Squared Error: MSE = (1/n) ∑i=1..n (y′i − yi)^2
Root Mean Squared Error: RMSD = √MSE
Absolute Error Distribution: plot the absolute error distribution: it should be symmetric, centered around 0, bell-shaped, and contain only rare extreme outliers.
Modeling- Evaluation Environment
Evaluation metrics provide us with the tools to estimate errors, but what should be the process for obtaining the best estimate? Resampling involves repeatedly drawing samples from a training set and refitting a model to each sample, which provides us with additional information compared to fitting the model once, such as a better estimate of the test error.
Key Concepts
Training Data: data used to fit your models, or the set used for learning.
Validation Data: data used to tune the parameters of a model.
Test Data: data used to evaluate how good your model is. Ideally your model should never touch this data until final testing/evaluation.
Cross Validation
Class of methods that estimate test error by holding out a subset of training data from the fitting process.
Validation Set: split the data into a training set and a validation set. Train the model on the training set and estimate the test error using the validation set, e.g. an 80-20 split.
Leave-One-Out CV (LOOCV): split the data into a training set and a validation set, but the validation set consists of a single observation. Repeat until every observation has been used as validation. The test error is the average of these n test error estimates.
k-Fold CV: randomly divide the data into k groups (folds) of approximately equal size. The first fold is used as validation and the rest as training. Repeat k times and average the k estimates.
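A short sketch of k-fold CV with scikit-learn (assuming it is installed; the iris dataset is just a stand-in):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: the test-error estimate is the average score over the 5 folds
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())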
Bootstrapping
Methods that rely on random sampling with replacement. Bootstrapping helps with quantifying the uncertainty associated with a given estimate or model.
Amplifying Small Data Sets
What can we do if we don't have enough data?
Linear Regression
Linear regression is a simple and useful tool for predicting a quantitative response. The relationship between input variables X = (X1, X2, ..., Xp) and output variable Y takes the form:
Y ≈ β0 + β1X1 + ... + βpXp + ε
β0...βp are the unknown coefficients (parameters) which we are trying to determine. The best coefficients will lead us to the best "fit", which can be found by minimizing the residual sum of squares (RSS), i.e. the sum of the squared differences between the actual ith value and the predicted ith value:
RSS = ∑i=1..n ei^2, where ei = yi − ŷi
How to find the best fit?
Matrix Form: we can solve the closed-form equation for the coefficient vector w: w = (X^T X)^(-1) X^T Y, where X represents the input data and Y represents the output data. This method is used for smaller matrices, since inverting a matrix is computationally expensive.
Gradient Descent: first-order optimization algorithm. We can find the minimum of a convex function by starting at an arbitrary point and repeatedly taking steps in the downward direction, which can be found by taking the negative direction of the gradient. After several iterations, we will eventually converge to the minimum. In our case, the minimum corresponds to the coefficients with the minimum error, or the best line of fit. The learning rate α determines the size of the steps we take in the downward direction.
Gradient descent algorithm in two dimensions. Repeat until convergence.
For non-convex functions, gradient descent no longer guarantees an optimal solution, since there may be local minima. Instead, we should run the algorithm from different starting points and use the best local minimum we find as the solution.
Stochastic Gradient Descent: instead of taking a step after processing the entire training set, we take a small batch of training data at random to determine our next step. Computationally more efficient and may lead to faster convergence.
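A minimal NumPy sketch of gradient descent for simple linear regression, with the closed-form solution for comparison (the data is synthetic and the learning rate/iteration count are arbitrary choices):

import numpy as np

# Synthetic data generated from y = 4 + 3x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, size=100)

Xb = np.c_[np.ones(len(X)), X]       # prepend a column of 1s for the intercept
w = np.zeros(2)                      # coefficients [beta0, beta1]
alpha = 0.1                          # learning rate

for _ in range(1000):
    gradient = 2 / len(y) * Xb.T @ (Xb @ w - y)   # gradient of MSE w.r.t. w
    w -= alpha * gradient                         # step in the downward direction

print(w)                                          # roughly [4, 3]

# Closed-form (normal equation): w = (X^T X)^-1 X^T y
print(np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y)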
Linear Regression II
Improving Linear Regression
Subset/Feature Selection: identify a subset of the p predictors that we believe to be best related to the response, then fit the model using the reduced set of variables.
Shrinkage/Regularization: keep all p predictors, but shrink the estimated coefficients towards zero to reduce variance. Lasso (L1 penalty): ∑j=1..p |βj|. Ridge (L2 penalty): ∑j=1..p βj^2.
Dimension Reduction: project the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations of the variables. Can use PCA.
Miscellaneous: removing outliers, feature scaling, removing multicollinearity (correlated variables).
Evaluating Model Accuracy
Residual Standard Error (RSE): RSE = √(RSS / (n − 2)). Generally, the smaller the better.
R^2: measure of fit that represents the proportion of variance explained, or the variability in Y that can be explained using X. It takes on a value between 0 and 1; generally, the higher the better. R^2 = 1 − RSS/TSS, where the Total Sum of Squares TSS = ∑(yi − ȳ)^2.
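A small NumPy sketch of RSE and R^2 (the actual and predicted values are made up):

import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])          # actual values (made up)
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])     # model predictions (made up)

rss = np.sum((y - y_pred) ** 2)                   # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)                 # total sum of squares

rse = np.sqrt(rss / (len(y) - 2))                 # residual standard error
r2 = 1 - rss / tss                                # proportion of variance explained
print(rse, r2)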
Evaluating Coefficient Estimates
The Standard Error (SE) of the coefficients can be used to perform hypothesis tests on the coefficients: H0: no relationship between X and Y; Ha: some relationship exists. A p-value can be obtained and interpreted as follows: a small p-value indicates that a relationship between the predictor (X) and the response (Y) exists. Typical p-value cutoffs are around 5% or 1%.
Logistic Regression
Logistic regression is used for classification, where the response variable is categorical rather than numerical.
The model works by predicting the probability that Y belongs to a particular category by first fitting the data to a linear regression model, which is then passed to the logistic function (below). The logistic function always produces an S-shaped curve, so regardless of X, we always obtain a sensible answer (between 0 and 1). If the probability is above a certain predetermined threshold (e.g. P(Yes) > 0.5), then the model will predict Yes.
p(X) = e^(β0 + β1X1 + ... + βpXp) / (1 + e^(β0 + β1X1 + ... + βpXp))
How to find the best coefficients?
Maximum Likelihood: the coefficients β0...βp are unknown and must be estimated from the training data. We seek estimates for β0...βp such that the predicted probability p̂(xi) of each observation is close to one if it is observed to be in a certain class and close to zero otherwise. This is done by maximizing the likelihood function:
l(β0, β1) = ∏i:yi=1 p(xi) · ∏i′:yi′=0 (1 − p(xi′))
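A short scikit-learn sketch (assuming it is installed; the breast cancer dataset is just a stand-in) — the coefficients are fit by (regularized) maximum likelihood under the hood:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

print(clf.predict_proba(X_test[:3]))   # P(class) for the first 3 test rows
print(clf.score(X_test, y_test))       # accuracy on held-out data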
Potential Issues
Imbalanced Classes: an imbalance between classes in the training data leads to poor classifiers; it can result in many false positives and leaves few training examples for the minority class. Solutions include forcing balanced data by removing observations from the larger class, replicating data from the smaller class, or re-weighting the training examples to counteract the imbalance.
Multi-Class Classification: the more classes you try to predict, the harder it will be for the classifier to be effective. It is possible with logistic regression, but another approach, such as Linear Discriminant Analysis (LDA), may prove better.
Machine Learning Part I
Comparing ML Algorithms
Power and Expressibility: ML methods differ in complexity. Linear regression fits linear functions, while NNs define piecewise-linear separation boundaries. More complex models can be more accurate, but at the risk of overfitting.
Interpretability: some models are more transparent and understandable than others (white box vs. black box models).
Ease of Use: some models feature few parameters/decisions (linear regression/NN), while others require more decision making to optimize (SVMs).
Training Speed: models differ in how fast they fit the necessary parameters.
Prediction Speed: models differ in how fast they make predictions given a query.
Naive Bayes
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features.
Problem: suppose we need to classify a vector X = x1...xn into one of m classes, C1...Cm. We compute the probability of each possible class given X, then assign X the label of the class with the highest probability. We can calculate each probability using Bayes' Theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Where P(Ci) is the prior probability of class Ci, P(X|Ci) is the likelihood of observing X given class Ci (which the "naive" assumption factorizes into a product over the individual features), and P(X) is a normalizing constant that does not depend on the class.
The prediction model will formally look like:
C(X) = argmax Ci∈classes P(X|Ci) P(Ci) / P(X)
where C(X) is the prediction returned for input X.
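A minimal scikit-learn sketch of a (Gaussian) Naive Bayes classifier, assuming scikit-learn is installed and using iris as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Assumes features are conditionally independent given the class,
# with each P(xj|Ci) modeled as a normal distribution
clf = GaussianNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))       # prediction is argmax of P(X|Ci)P(Ci)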
Machine Learning Part II
Decision Trees
Binary branching structure used to classify an arbitrary input vector X. Each node in the tree contains a simple feature comparison against some field (e.g. xi > 42?). The result of each comparison is either true or false, which determines whether we proceed to the left or right child of the given node. Also sometimes called classification and regression trees (CART).
Advantages: non-linearity, support for categorical variables, easy to interpret, applicable to regression.
Disadvantages: prone to overfitting, unstable (not robust to noise), high variance, low bias.
Note: rarely do models just use one decision tree. Instead, we aggregate many decision trees using methods like ensembling, bagging, and boosting.
Ensembles, Bagging, Random Forests, Boosting
Ensemble learning is the strategy of combining many different classifiers/models into one predictive model. It revolves around the idea of voting: a so-called "wisdom of crowds" approach. The most predicted class will be the final prediction.
Bagging: ensemble method that works by taking B bootstrapped subsamples of the training data and constructing B trees, each tree trained on a distinct subsample; the trees' predictions are then aggregated (averaged or majority-voted).
Random Forests: builds on bagging by decorrelating the trees. We do everything the same as in bagging, but when we build the trees, every time we consider a split, a random sample of the p predictors is chosen as split candidates, not the full set (typically m ≈ √p). When m = p, we are just doing bagging.
Boosting: the main idea is to improve our model where it is not performing well by using information from previously constructed classifiers. A slow learner. Has 3 tuning parameters: number of classifiers B, learning parameter λ, and interaction depth d (controls the interaction order of the model).
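A brief scikit-learn sketch of a random forest (assuming scikit-learn is installed; the wine dataset and hyperparameters are arbitrary stand-ins):

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Bagged, decorrelated trees: each split considers a random subset of features
# (max_features="sqrt" corresponds to m ~ sqrt(p))
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())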
Machine Learning Part III
Support Vector Machines
Work by constructing a hyperplane that separates points between two classes. The hyperplane is determined using the maximal margin hyperplane, which is the hyperplane that is the maximum distance from the training observations. This distance is called the margin. Points that fall on one side of the hyperplane are classified as -1 and the others as +1.
Principal Component Analysis (PCA)
Principal components allow us to summarize a set of correlated variables with a smaller set of variables that collectively explain most of the variability in the original set. Essentially, we are "dropping" the least important feature variables.
Principal Component Analysis is the process by which principal components are calculated and then used to analyze and understand the data. PCA is an unsupervised approach and is used for dimensionality reduction, feature extraction, and data visualization. Variables (components) after performing PCA are uncorrelated. Scaling variables is also important when performing PCA.
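A short scikit-learn sketch of PCA with scaling (iris is just a stand-in dataset; 2 components is an arbitrary choice):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)      # scaling matters for PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)           # project 4 features onto 2 components

print(pca.explained_variance_ratio_)              # variance explained by each component
print(X_reduced.shape)                            # (150, 2)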
Machine Learning Part IV
ML Terminology and Concepts
Features: input data/variables used by the ML model.
Feature Engineering: transforming input features to be more useful for the models, e.g. mapping categories to buckets, normalizing between -1 and 1, removing nulls.
Train/Eval/Test: training data is used to optimize the model, evaluation data is used to assess the model on new data during training, test data is used to provide the final result.
Classification/Regression: regression is predicting a number (e.g. housing price), classification is predicting from a set of categories (e.g. predicting red/blue/green).
Linear Regression: predicts an output by multiplying and summing input features with weights and biases.
Logistic Regression: similar to linear regression but predicts a probability.
Overfitting: model performs great on the input data but poorly on the test data (combat with dropout, early stopping, or reducing the number of nodes or layers).
Bias/Variance: how much the output is determined by the features. More variance often means overfitting; more bias can mean a bad model.
Regularization: variety of approaches to reduce overfitting, including adding the weights to the loss function and randomly dropping layers (dropout).
Ensemble Learning: training multiple models with different parameters to solve the same problem.
A/B Testing: statistical way of comparing 2+ techniques to determine which performs better and whether the difference is statistically significant.
Baseline Model: simple model/heuristic used as a reference point for comparing how well a model is performing.
Bias: prejudice or favoritism towards some things, people, or groups over others that can affect collection/sampling and interpretation of data, the design of a system, and how users interact with a system.
Dynamic Model: model that is trained online in a continuously updating fashion.
Static Model: model that is trained offline.
Normalization: process of converting an actual range of values into a standard range of values, typically -1 to +1.
Independently and Identically Distributed (i.i.d.): data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on previously drawn values; ideal but rarely found in real life.
Hyperparameters: the "knobs" that you tweak during successive runs of training a model.
Generalization: refers to a model's ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model.
Cross-Entropy: quantifies the difference between two probability distributions.
Deep Learning Part I
What is Deep Learning?
Deep learning is a subset of machine learning. One popular DL technique is based on Neural Networks (NN), which loosely mimic the human brain; the code structures are arranged in layers. Each layer's input is the previous layer's output, which yields progressively higher-level features and defines a hierarchy. A Deep Neural Network is just a NN that has more than 1 hidden layer.
Recall that statistical learning is all about approximating f(X). Neural networks are known as universal approximators, meaning no matter how complex a function is, there exists a NN that can (approximately) do the job. We can increase the approximation (or complexity) by adding more hidden layers and neurons.
Popular Architectures There are different kinds of NNs that are suitable for certain problems, which depend on the NN’s architecture.
Linear Classifier: takes input features and combines them with weights and biases to predict an output value.
DNN: deep neural network; contains intermediate layers of nodes that represent "hidden features" and activation functions to represent non-linearity.
CNN: convolutional neural network; has a combination of convolutional, pooling, and dense layers. Popular for image classification.
Transfer Learning: use existing trained models as starting points and add additional layers for the specific use case. The idea is that highly trained existing models know general features that serve as a good starting point for training a small network on specific examples.
RNN: recurrent neural network; designed for handling a sequence of inputs that have "memory" of the sequence. LSTMs are a fancier version of RNNs, popular for NLP.
GAN: generative adversarial network; one model creates fake examples, and another model is served both fake and real examples and is asked to distinguish between them.
Wide and Deep: combines linear classifiers with deep neural net classifiers; the "wide" linear parts represent memorizing specific examples and the "deep" parts represent understanding high-level features.
Deep Learning Part II
Tensorflow
Tensorflow is an open source software library for numerical computation using data flow graphs. Everything in TF is a graph, where nodes represent operations on data and edges represent the data. Phase 1 of TF is building up a computation graph and phase 2 is executing it. It is also distributed, meaning it can run on either a cluster of machines or just a single machine. TF is extremely popular/suitable for working with Neural Networks, since the way TF sets up the computational graph pretty much resembles a NN.
Tensors
In a graph, tensors are the edges: multidimensional data arrays that flow through the graph. They are the central unit of data in TF and consist of a set of primitive values shaped into an array of any number of dimensions. A tensor is characterized by its rank (number of dimensions), shape (number of dimensions and size of each dimension), and data type (data type of each element in the tensor).
Placeholders and Variables
Variables: the best way to represent shared, persistent state manipulated by your program. These are the parameters of the ML model, which are altered/trained during the training process (training variables).
Placeholders: a way to specify inputs into a graph; they hold the place for a Tensor that will be fed at runtime. They are assigned once and do not change afterwards (input nodes).
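A minimal sketch of the graph/session workflow described above, assuming the TensorFlow 1.x API (placeholders and sessions were removed in TF 2.x):

import tensorflow as tf        # assumes TensorFlow 1.x

# Phase 1: build the computation graph
x = tf.placeholder(tf.float32, shape=[None])      # input node, fed at runtime
w = tf.Variable(2.0)                              # trainable parameter (persistent state)
y = w * x                                         # operation node; edges carry tensors

# Phase 2: execute the graph in a session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [1.0, 2.0, 3.0]}))   # [2. 4. 6.]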
SQL Part I
Structured Query Language (SQL) is a declarative language used to access & manipulate data in databases. Usually the database is a Relational Database Management System (RDBMS), which stores data arranged in relational database tables. A table is arranged in columns and rows, where columns represent characteristics of stored data and rows represent actual data entries.
Basic Queries
Useful Keywords for SELECT
DISTINCT - return unique results
BETWEEN a AND b - limit the range; the values can be numbers, text, or dates
LIKE - pattern search within the column text
IN (a, b, c) - check if the value is contained among the given values
Data Modification
Joins The JOIN clause is used to combine rows from two or more tables, based on a related column between them.
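A small end-to-end sketch using Python's built-in sqlite3 module (the tables, rows, and column names are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")                # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
cur.execute("INSERT INTO users VALUES (1, 'Ada'), (2, 'Bob')")
cur.execute("INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 4.50), (3, 2, 20.00)")

# JOIN combines rows from the two tables on the related column
cur.execute("""
    SELECT users.name, orders.total
    FROM orders
    JOIN users ON users.id = orders.user_id
    WHERE orders.total BETWEEN 1 AND 10
    ORDER BY orders.total DESC
""")
print(cur.fetchall())                             # [('Ada', 9.99), ('Ada', 4.5)]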
Python- Data Structures
Data structures are a way of storing and manipulating data, and each data structure has its own strengths and weaknesses. Combined with algorithms, data structures allow us to efficiently solve problems. It is important to know the main types of data structures that you will need to solve problems efficiently.
Lists: or arrays, ordered sequences of objects, mutable
l = [42, 3.14, "hello","world"]
Tuples: like lists, but immutable
t = (42, 3.14, "hello","world")
Dictionaries: hash tables, key-value pairs, unsorted
d = {"life": 42, "pi": 3.14}
Sets: mutable, unordered sequence of unique elements. frozensets are just immutable sets
s = set([42, 3.14, "hello","world"])
Collections Module
deque: double-ended queue, a generalization of stacks and queues; supports append, appendleft, pop, rotate, etc.
from collections import deque
s = deque([42, 3.14, "hello", "world"])
Counter: dict subclass; unordered collection where elements are stored as keys and their counts are stored as values.
from collections import Counter
c = Counter('apple')
print(c)   # Counter({'p': 2, 'a': 1, 'l': 1, 'e': 1})
heapq Module
Heap Queue: priority queue; heaps are binary trees for which every parent node has a value less than or equal to any of its children (min-heap), so the smallest element is always at the root; supports push, pop, pushpop, heapify, and replace functionality.
from heapq import heappush

data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]   # example input (hypothetical)
heap = []
for n in data:
    heappush(heap, n)
print(heap)   # [0, 1, 2, 6, 3, 5, 4, 7, 8, 9] -- a valid min-heap
Recommended Resources