



Machine learning, data processes, concepts and the math
Typology: Schemes and Mind Maps
Uploaded on 06/11/2021
Types
Regression: A supervised problem in which the outputs are continuous rather than discrete.
Classification: Inputs are divided into two or more classes, and the learner must produce a model that assigns unseen inputs to one or more (multi-label classification) of these classes. This is typically tackled in a supervised way.
Clustering: A set of inputs is to be divided into groups. Unlike in classification, the groups are not known beforehand, making this typically an unsupervised task.
Density Estimation: Finds the distribution of inputs in some space.
Dimensionality Reduction: Simplifies inputs by mapping them into a lower-dimensional space.

Kind
Parametric:
Step 1: Making an assumption about the functional form or shape of our function (f), i.e.: f is linear, thus we will select a linear model.
Step 2: Selecting a procedure to fit or train our model. This means estimating the Beta parameters in the linear function. A common approach is (ordinary) least squares, amongst others.
Non-Parametric: When we do not make assumptions about the form of our function (f). However, since these methods do not reduce the problem of estimating f to a small number of parameters, a large number of observations is required in order to obtain an accurate estimate for f. An example would be the thin-plate spline model.

Categories
Supervised: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs.
Unsupervised: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).
Reinforcement Learning: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent).
The program is provided feedback in terms of rewards and punishments as it navigates its problem space.

Approaches
Decision tree learning, association rule learning, artificial neural networks, deep learning, inductive logic programming, support vector machines, clustering, Bayesian networks, reinforcement learning, representation learning, similarity and metric learning, sparse dictionary learning, genetic algorithms, rule-based machine learning, learning classifier systems.

Taxonomy
Generative Methods: Model class-conditional pdfs and prior probabilities. "Generative" since sampling can generate synthetic data points.
Popular models: Gaussians, Naïve Bayes, mixtures of multinomials, mixtures of Gaussians, mixtures of experts, Hidden Markov Models (HMM), sigmoidal belief networks, Bayesian networks, Markov random fields.
Discriminative Methods: Directly estimate posterior probabilities. No attempt to model underlying probability distributions. Computational resources are focused on the given task, which gives better performance.
Popular models: Logistic regression, SVMs, traditional neural networks, nearest neighbor, Conditional Random Fields (CRF).

Selection Criteria
Prediction Accuracy vs Model Interpretability: There is an inherent tradeoff between prediction accuracy and model interpretability; that is to say, as the model gets more flexible in the way the function (f) is selected, it becomes more obscure and harder to interpret. Inflexible methods are better for inference, and flexible methods are preferable for prediction.
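To make the description of generative methods above concrete, here is a minimal sketch, assuming one-dimensional inputs and Gaussian class-conditional densities. The function names and the toy data are hypothetical and chosen only for illustration. It estimates a prior and a class-conditional pdf per class, applies Bayes' rule to get posteriors, and draws a synthetic point, which is what makes the method "generative".

```python
import math
import random
import statistics


def fit_generative(samples_by_class):
    """Estimate a prior and a 1-D Gaussian class-conditional pdf per class."""
    total = sum(len(xs) for xs in samples_by_class.values())
    return {
        label: {
            "prior": len(xs) / total,
            "mean": statistics.fmean(xs),
            "std": statistics.stdev(xs),
        }
        for label, xs in samples_by_class.items()
    }


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(model, x):
    """Bayes' rule: P(class | x) is proportional to p(x | class) * P(class)."""
    joint = {
        label: p["prior"] * gaussian_pdf(x, p["mean"], p["std"])
        for label, p in model.items()
    }
    evidence = sum(joint.values())
    return {label: value / evidence for label, value in joint.items()}


def sample(model, rng):
    """Draw a class from the prior, then a value from that class's pdf."""
    labels = list(model)
    weights = [model[label]["prior"] for label in labels]
    label = rng.choices(labels, weights=weights)[0]
    return label, rng.gauss(model[label]["mean"], model[label]["std"])


# Hypothetical toy data: two well-separated classes.
toy_data = {"a": [1.0, 1.2, 0.8, 1.1], "b": [3.0, 3.3, 2.9, 3.1, 2.7]}
toy_model = fit_generative(toy_data)
toy_posterior = posterior(toy_model, 1.0)
synthetic_label, synthetic_x = sample(toy_model, random.Random(0))
```

A discriminative method would skip the pdf and prior estimates and fit the posterior directly, for example with logistic regression.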
Libraries
Python
Numpy: Adds support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays.
Pandas: Offers data structures and operations for manipulating numerical tables and time series.
Scikit-Learn: Features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Tensorflow: Does lazy evaluation. You need to build the graph, and then run it in a session.
MXNet: Is a modern open-source deep learning framework used to train and deploy deep neural networks. The MXNet library is portable and can scale to multiple GPUs and multiple machines. MXNet is supported by major public cloud providers including AWS and Azure. Amazon has chosen MXNet as its deep learning framework of choice at AWS.
Keras: Is an open source neural network library written in Python. It is capable of running on top of MXNet, Deeplearning4j, Tensorflow, CNTK or Theano. Designed to enable fast experimentation with deep neural networks, it focuses on being minimal, modular and extensible.
Torch: Is an open source machine learning library, a scientific computing framework, and a script language based on the Lua programming language. It provides a wide range of algorithms for deep machine learning, and uses the scripting language LuaJIT and an underlying C implementation.
Microsoft Cognitive Toolkit: Previously known as CNTK and sometimes styled as The Microsoft Cognitive Toolkit, it is a deep learning framework developed by Microsoft Research. Microsoft Cognitive Toolkit describes neural networks as a series of computational steps via a directed graph.
Tuning
Cross-validation: One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
Methods: Leave-p-out cross-validation, leave-one-out cross-validation, k-fold cross-validation, holdout method, repeated random sub-sampling validation.

Hyperparameters
Grid Search: The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search algorithm must be guided by some performance metric, typically measured by cross-validation on the training set or evaluation on a held-out validation set.
Random Search: Since grid searching is an exhaustive and therefore potentially expensive method, several alternatives have been proposed. In particular, a randomized search that simply samples parameter settings a fixed number of times has been found to be more effective in high-dimensional spaces than exhaustive search.
Gradient-based optimization: For specific learning algorithms, it is possible to compute the gradient with respect to hyperparameters and then optimize the hyperparameters using gradient descent. The first usage of these techniques was focused on neural networks. Since then, these methods have been extended to other models such as support vector machines or logistic regression.
Early Stopping (Regularization): Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit, and stop the algorithm then.
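The k-fold cross-validation and grid search ideas described above can be sketched as follows. This is only an illustration under stated assumptions: the model is a hypothetical one-parameter ridge regression through the origin, the data are synthetic, and all function names are made up for this example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_ridge_1d(xs, ys, penalty):
    """Closed-form ridge fit of y = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def mse(w, xs, ys):
    """Mean squared error of the predictions w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_mse(xs, ys, penalty, k=5):
    """Average validation MSE over k rounds, each holding out one fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], penalty)
        scores.append(mse(w, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return sum(scores) / len(scores)


def grid_search(xs, ys, penalties, k=5):
    """Exhaustively score each candidate penalty and return the best one."""
    results = {p: cross_val_mse(xs, ys, p, k) for p in penalties}
    best = min(results, key=results.get)
    return best, results


# Synthetic data: y = 2x plus a little Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(1, 41)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]
best_penalty, cv_results = grid_search(xs, ys, penalties=[0.0, 1.0, 100.0, 10000.0])
```

A random search would replace the fixed `penalties` list with values sampled a fixed number of times from a chosen range, keeping the same cross-validated scoring.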
Overfitting: When a given method yields a small training MSE (or cost) but a large test MSE (or cost), we are said to be overfitting the data. This happens because our statistical learning procedure is trying too hard to find patterns in the data that might be due to random chance rather than a property of our function. In other words, the algorithm may be learning the training data too well. If the model overfits, try removing some features, decreasing degrees of freedom, or adding more data.
Underfitting: Opposite of overfitting. Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. It occurs when the model or algorithm does not fit the data enough. Underfitting occurs if the model or algorithm shows low variance but high bias (in contrast to overfitting, which comes from high variance and low bias). It is often a result of an excessively simple model.
Bootstrap: Test that applies random sampling with replacement of the available data, and assigns measures of accuracy (bias, variance, etc.) to sample estimates.
Bagging: An approach to ensemble learning that is based on bootstrapping. In short, given a training set, we produce multiple different training sets (called bootstrap samples) by sampling with replacement from the original dataset. Then, for each bootstrap sample, we build a model. This results in an ensemble of models, where each model votes with equal weight. Typically, the goal of this procedure is to reduce the variance of the model of interest (e.g. decision trees).

Performance Analysis
Confusion Matrix
Accuracy: Fraction of correct predictions. Not reliable, as it is skewed when the data set is unbalanced (that is, when the number of samples in different classes varies greatly).
F1 score: Combines precision and recall into a single number.
Precision: Out of all the examples the classifier labeled as positive, what fraction were correct?
Recall: Out of all the positive examples there were, what fraction did the classifier pick up?
F1 score formula: Harmonic mean of precision p and recall r: 2 * p * r / (p + r)
ROC Curve (Receiver Operating Characteristics): True Positive Rate (Recall / Sensitivity) vs False Positive Rate (1 - Specificity).
Bias-Variance Tradeoff: Bias refers to the amount of error that is introduced by approximating a real-life problem, which may be extremely complicated, by a simple model. If bias is high, and/or if the algorithm performs poorly even on your training data, try adding more features or a more flexible model. Variance is the amount our model's prediction would change when using a different training data set. If variance is high, remove features or obtain more data.
Goodness of Fit = R^2: 1.0 - sum_of_squared_errors / total_sum_of_squares(y)
Mean Squared Error (MSE): The mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors or deviations, that is, the difference between the estimator and what is estimated.
Error Rate: The proportion of mistakes made if we apply our estimated model function to the training observations in a classification setting.

Motivation
Prediction: When we are interested mainly in the predicted variable as a result of the inputs, but not in the way each of the inputs affects the prediction. In a real estate example, prediction would answer the question: Is my house over- or under-valued? Non-linear models are very good at this sort of prediction, but not great for inference because the models are much less interpretable.
Inference: When we are interested in the way each one of the inputs affects the prediction. In a real estate example, inference would answer the question: How much would my house cost if it had a view of the sea? Linear models are more suited for inference because the models themselves are easier to understand than their non-linear counterparts.
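The performance measures listed above (accuracy, precision, recall, F1, MSE and R^2) can be computed directly from their definitions. The sketch below assumes binary labels encoded as 0/1; the helper names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def classification_report(y_true, y_pred):
    """Accuracy, precision, recall and F1 (harmonic mean of p and r)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1.0 - sum_of_squared_errors / total_sum_of_squares(y)."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - sse / tss


# Toy classification labels: 2 true positives, 1 false positive, 2 false negatives.
report = classification_report([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 0, 0, 1, 0, 0, 1])

# Toy regression targets and predictions.
targets = [1.0, 2.0, 3.0, 4.0]
predictions = [1.1, 1.9, 3.2, 3.8]
fit_mse = mean_squared_error(targets, predictions)
fit_r2 = r_squared(targets, predictions)
```

The helpers do not guard against division by zero (for example, a classifier that never predicts the positive class), which a production version would need to handle.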
Machine Learning Mathematics

Cost/Loss (Min) and Objective (Max) Functions
Maximum Likelihood Estimation (MLE): Many cost functions are the result of applying maximum likelihood. For instance, the least squares cost function can be obtained via maximum likelihood. Cross-entropy is another example. The likelihood of a parameter value θ (or vector of parameter values), given outcomes x, is equal to the probability (density) assumed for those observed outcomes given those parameter values, that is, L(θ | x) = P(x | θ). The natural logarithm of the likelihood function, called the log-likelihood, is more convenient to work with. Because the logarithm is a monotonically increasing function, the logarithm of a function achieves its maximum value at the same points as the function itself, and hence the log-likelihood can be used in place of the likelihood in maximum likelihood estimation and related techniques. In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the "agreement" of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution. Maximum-likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems.
Cross-Entropy: Cross entropy can be used to define the loss function in machine learning and optimization: H(p, q) = -sum_i p_i * log(q_i). The true probability p_i is the true label, and the given distribution q_i is the predicted value of the current model.
Cross-entropy error function and logistic regression
Logistic: For an intended output t = ±1 and a classifier score y, the logistic loss function is defined (up to a constant scaling factor) as: log(1 + exp(-t * y))
Quadratic: The use of a quadratic loss function is common, for example when using least squares techniques.
It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t, then a quadratic loss function is: L(x) = C * (t - x)^2, for some constant C.
0-1 Loss: In statistics and decision theory, a frequently used loss function is the 0-1 loss function, which is 0 when the prediction equals the true value and 1 otherwise.
Hinge Loss: The hinge loss is a loss function used for training classifiers. For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as: max(0, 1 - t * y)
Exponential
Hellinger Distance: It is used to quantify the similarity between two probability distributions. It is a type of f-divergence. To define the Hellinger distance in terms of measure theory, let P and Q denote two probability measures that are absolutely continuous with respect to a third probability measure λ. The square of the Hellinger distance between P and Q is defined as the quantity: H^2(P, Q) = 1/2 * integral of (sqrt(dP/dλ) - sqrt(dQ/dλ))^2 dλ
Kullback-Leibler Divergence: Is a measure of how one probability distribution diverges from a second expected probability distribution. Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference.
Discrete: D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i))
Continuous: D_KL(P || Q) = integral of p(x) * log(p(x) / q(x)) dx
Itakura–Saito distance: Is a measure of the difference between an original spectrum P(ω) and an approximation P^(ω) of that spectrum. Although it is not a perceptual measure, it is intended to reflect perceptual (dis)similarity.
https://stats.stackexchange.com/questions/154879/a-list-of-cost-functions-used-in-neural-networks-alongside-applications
https://en.wikipedia.org/wiki/Loss_functions_for_classification

Probability Concepts
Frequentist vs Bayesian Probability
Frequentist: Basic notion of probability: # Results / # Attempts
Bayesian: The probability is not a number, but a distribution itself.
http://www.behind-the-enemy-lines.com/2008/01/are-you-bayesian-or-frequentist-or.html
Random Variable: In probability and statistics, a random variable, random quantity, aleatory variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e., randomness, in a mathematical sense). A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability, in contrast to other mathematical variables.
Expectation (Expected Value) of a Random Variable: E[X] = sum_i x_i * p_i. Same for continuous variables: E[X] = integral of x * f(x) dx
Independence: Two events are independent, statistically independent, or stochastically independent if the occurrence of one does not affect the probability of the other.
Conditionality: P(A | B) = P(A and B) / P(B)
Bayes Theorem (rule, law)
Simple Form: P(A | B) = P(B | A) * P(A) / P(B)
With Law of Total Probability: P(A_i | B) = P(B | A_i) * P(A_i) / sum_j P(B | A_j) * P(A_j)
Marginalisation: The marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables.
Continuous: p_X(x) = integral of p(x, y) dy
Discrete: P(X = x) = sum_y P(X = x, Y = y)
Law of Total Probability: Is a fundamental rule relating marginal probabilities to conditional probabilities. It expresses the total probability of an outcome which can be realized via several distinct events, hence the name.
Chain Rule: Permits the calculation of any member of the joint distribution of a set of random variables using only conditional probabilities.
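As a small worked example of Bayes' theorem combined with the law of total probability, and of the expected value of a discrete random variable, consider the sketch below. The diagnostic-test numbers (1% prevalence, 95% sensitivity, 5% false-positive rate) and the function names are hypothetical and chosen only to illustrate the formulas.

```python
def bayes_posterior(priors, likelihoods):
    """Posterior P(H | E) for each hypothesis H.

    priors:      mapping hypothesis -> P(H)
    likelihoods: mapping hypothesis -> P(E | H)
    The denominator P(E) comes from the law of total probability:
    P(E) = sum over H of P(E | H) * P(H).
    """
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}


def expected_value(distribution):
    """E[X] = sum of x * P(X = x) for a discrete random variable."""
    return sum(x * p for x, p in distribution.items())


# Hypothetical test: P(disease) = 0.01, P(positive | disease) = 0.95,
# P(positive | healthy) = 0.05. What is P(disease | positive)?
posterior_given_positive = bayes_posterior(
    priors={"disease": 0.01, "healthy": 0.99},
    likelihoods={"disease": 0.95, "healthy": 0.05},
)

# Expected value of a fair six-sided die.
die_mean = expected_value({face: 1 / 6 for face in range(1, 7)})
```

With these numbers the posterior probability of disease after a positive test is 0.0095 / 0.059, roughly 0.16, because the low prior dominates the result.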
Bayesian Inference: Bayesian inference derives the posterior probability as a consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem. It can be applied iteratively to update the confidence in our hypothesis.

Distributions
Definition: Is a table or an equation that links each outcome of a statistical experiment with the probability of occurrence. When continuous, it is described by the Probability Density Function.
Types (Density Function): Normal (Gaussian), Poisson, Uniform, Bernoulli, Gamma, Binomial.
Cumulative Distribution Function (CDF)

Information Theory
Entropy: Entropy is a measure of unpredictability of information content. To evaluate a language model, we should measure how much surprise it gives us for real sequences in that language. For each real word encountered, the language model will give a probability p, and we use -log(p) to quantify the surprise. We then average the total surprise over a long enough sequence. So, in the case of a 1000-letter sequence with 500 A and 500 B, the surprise given by the 1/3-2/3 model will be:
[-500 * log(1/3) - 500 * log(2/3)] / 1000 = 1/2 * log(9/2)
While the correct 1/2-1/2 model will give:
[-500 * log(1/2) - 500 * log(1/2)] / 1000 = 1/2 * log(8/2)
So we can see that the 1/3-2/3 model gives more surprise, which indicates it is worse than the correct model. Only when the sequence is long enough will the average effect mimic the expectation over the 1/2-1/2 distribution. If the sequence is short, it won't give a convincing result.
Cross Entropy: Cross entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set, if a coding scheme is used that is optimized for an "unnatural" probability distribution q, rather than the "true" distribution p.
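The 1000-letter example above can be checked numerically. The sketch below uses natural logarithms (so the results are in nats rather than bits) and hypothetical function names; it also shows that the average surprise equals the cross entropy between the empirical distribution and the model, and that the gap to the true entropy is the Kullback-Leibler divergence.

```python
import math


def average_surprise(counts, model):
    """Mean of -log(model probability) over every symbol in the sequence."""
    total = sum(counts.values())
    return sum(-n * math.log(model[symbol]) for symbol, n in counts.items()) / total


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i); terms with p_i = 0 contribute nothing."""
    return -sum(p_i * math.log(q[symbol]) for symbol, p_i in p.items() if p_i > 0)


def entropy(p):
    """H(p) is the cross entropy of p with itself."""
    return cross_entropy(p, p)


def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p), the extra surprise from using q."""
    return cross_entropy(p, q) - entropy(p)


# The sequence from the text: 500 A's and 500 B's.
counts = {"A": 500, "B": 500}
true_dist = {"A": 1 / 2, "B": 1 / 2}
skewed_model = {"A": 1 / 3, "B": 2 / 3}

surprise_skewed = average_surprise(counts, skewed_model)  # 1/2 * log(9/2)
surprise_true = average_surprise(counts, true_dist)       # 1/2 * log(8/2) = log(2)
extra_cost = kl_divergence(true_dist, skewed_model)
```

Switching `math.log` to `math.log2` would express the same quantities in bits, matching the "average number of bits" wording in the cross entropy definition.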
Joint Entropy
Conditional Entropy
Mutual Information
Kullback-Leibler Divergence

Density Estimation
Mostly non-parametric. Parametric makes assumptions about the data/random variables, for instance, that they are normally distributed. Non-parametric does not. The methods are generally intended for description rather than formal inference.
Methods
Kernel Density Estimation: Non-parametric. Calculates kernel distributions for every sample point, and then adds all the distributions. The kernel is a type of PDF that is non-negative, real-valued and symmetric, and whose integral over the function is equal to 1. Kernels: Uniform, Triangle, Quartic, Triweight, Gaussian, Cosine, others...
Cubic Spline: A cubic spline is a function created from cubic polynomials on each between-knot interval by pasting them together so that the result is twice continuously differentiable at the knots.

Regularization
L1 norm (Manhattan Distance): The L1-norm is also known as least absolute deviations (LAD) or least absolute errors (LAE). It is basically minimizing the sum of the absolute differences (S) between the target value and the estimated values.
L2 norm (Euclidean Distance): The L2-norm is also known as least squares. It is basically minimizing the sum of the squares of the differences (S) between the target value and the estimated values.
Early Stopping: Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit, and stop the algorithm then.
Dropout: Is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.
Sparse regularizer on columns: This regularizer defines an L2 norm on each column and an L1 norm over all columns. It can be solved by proximal methods.
Nuclear norm regularization
Mean-constrained regularization: This regularizer constrains the functions learned for each task to be similar to the overall average of the functions across all tasks. This is useful for expressing prior information that each task is expected to share similarities with each other task. An example is predicting blood iron levels measured at different times of the day, where each task represents a different person.
Clustered mean-constrained regularization: This regularizer is similar to the mean-constrained regularizer, but instead enforces similarity between tasks within the same cluster. This can capture more complex prior information. This technique has been used to predict Netflix recommendations.
Graph-based similarity: More general than above, similarity between tasks can be defined by a function. The regularizer encourages the model to learn similar functions for similar tasks.

Optimization
Gradient Descent: Is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.
Stochastic Gradient Descent (SGD): Gradient descent uses the total gradient over all examples per update; SGD updates after only 1 example.
Mini-batch Stochastic Gradient Descent (SGD): Gradient descent uses the total gradient over all examples per update; mini-batch SGD updates after only a few examples.
Momentum: Idea: add a fraction v of the previous update to the current one. When the gradient keeps pointing in the same direction, this will increase the size of the steps taken towards the minimum.
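The gradient descent, momentum and (mini-batch) stochastic gradient descent updates described above can be sketched in a few lines. The objective functions, learning rates and function names below are assumptions chosen for illustration: a one-dimensional quadratic for the momentum variant, and a noise-free fit of y = w * x for the mini-batch variant.

```python
import random


def gradient_descent(grad, w0, learning_rate=0.1, momentum=0.0, steps=500):
    """Minimize a 1-D function given its gradient.

    With momentum > 0, a fraction of the previous update is added to the
    current one, so steps grow while the gradient keeps its direction.
    """
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity
    return w


def minibatch_sgd(xs, ys, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x by least squares, updating after each mini-batch.

    batch_size=1 is plain SGD; batch_size=len(xs) is full-batch gradient descent.
    """
    rng = random.Random(seed)
    order = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # Gradient of the mean squared error over the current batch.
            g = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * g
    return w


def quadratic_grad(w):
    """Gradient of f(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


w_plain = gradient_descent(quadratic_grad, w0=0.0)
w_momentum = gradient_descent(quadratic_grad, w0=0.0, momentum=0.9)

# Noise-free data generated from y = 2x, so every variant should recover w = 2.
xs = [0.1 * i for i in range(1, 21)]
ys = [2.0 * x for x in xs]
w_sgd = minibatch_sgd(xs, ys, batch_size=1)
w_minibatch = minibatch_sgd(xs, ys, batch_size=5)
w_full_batch = minibatch_sgd(xs, ys, batch_size=len(xs))
```

On noisy data the single-example and mini-batch variants would fluctuate around the minimum instead of converging exactly, which is why the learning rate is usually decayed in practice.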
Adagrad: Adaptive learning rates for each parameter.

Statistics
Measures of Central Tendency
Mean
Median: Value in the middle of an ordered list, or the average of the two in the middle.
Mode: Most frequent value.
Quantile: Division of probability distributions based on contiguous intervals with equal probabilities. In short: dividing the observations in a sample list equally.

Dispersion
Range
Mean Absolute Deviation (MAD): The average of the absolute value of the deviation of each value from the mean.
Inter-quartile Range (IQR): Three quartiles divide the data into approximately four equally divided parts.
Variance: The average of the squared differences from the mean. Formally, it is the expectation of the squared deviation of a random variable from its mean, and it informally measures how far a set of (random) numbers are spread out from their mean.
Types: Continuous, Discrete
Standard Deviation: sqrt(variance)
z-score/value/factor: The signed number of standard deviations an observation or datum is above the mean.

Relationship
Covariance: A measure of how much two random variables change together: dot(de_mean(x), de_mean(y)) / (n - 1)
http://stats.stackexchange.com/questions/18058/how-would-you-explain-covariance-to-someone-who-understands-only-the-mean
Correlation
Pearson: Benchmarks linear relationship; most appropriate for measurements taken from an interval scale. It is a measure of the linear dependence between two variables.
Spearman: Benchmarks monotonic relationship (whether linear or not). Spearman's coefficient is appropriate for both continuous and discrete variables, including ordinal variables.
Kendall: Is a statistic used to measure the ordinal association between two measured quantities. Contrary to the Spearman correlation, the Kendall correlation is not affected by how far from each other ranks are but only by whether the ranks between observations are equal or not, and is thus only appropriate for discrete variables but not defined for continuous variables.
Co-occurrence: The results are presented in a matrix format, where the cross tabulation of two fields is a cell value. The cell value represents the percentage of times that the two fields exist in the same events.

Techniques
Null Hypothesis: Is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups. The null hypothesis is generally assumed to be true until evidence indicates otherwise.
p-value: In this method, as part of experimental design, before performing the experiment, one first chooses a model (the null hypothesis) and a threshold value for p, called the significance level of the test, traditionally 5% or 1% and denoted as α. If the p-value is less than the chosen significance level (α), that suggests that the observed data is sufficiently inconsistent with the null hypothesis that the null hypothesis may be rejected. However, that does not prove that the tested hypothesis is true. For typical analysis, using the standard α = 0.05 cutoff, the null hypothesis is rejected when p < .05 and not rejected when p > .05. The p-value does not, in itself, support reasoning about the probabilities of hypotheses but is only a tool for deciding whether to reject the null hypothesis.
Example (five heads in a row): Suppose a researcher flips a coin five times in a row and assumes a null hypothesis that the coin is fair. The test statistic of "total number of heads" can be one-tailed or two-tailed: a one-tailed test corresponds to seeing if the coin is biased towards heads, but a two-tailed test corresponds to seeing if the coin is biased either way. The researcher flips the coin five times and observes heads each time (HHHHH), yielding a test statistic of 5.
In a one-tailed test, this is the upper extreme of all possible outcomes, and yields a p-value of (1/2)^5 = 1/32 ≈ 0.03. If the researcher assumed a significance level of 0.05, this result would be deemed significant and the hypothesis that the coin is fair would be rejected. In a two-tailed test, a test statistic of zero heads (TTTTT) is just as extreme and thus the data of HHHHH would yield a p-value of 2 × (1/2)^5 = 1/16 ≈ 0.06, which is not significant at the 0.05 level. This demonstrates that specifying a direction (on a symmetric test statistic) halves the p-value (increases the significance) and can mean the difference between data being considered significant or not.
p-hacking: The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that might show a correlation. Conventional tests of statistical significance are based on the probability that an observation arose by chance, and necessarily accept some risk of mistaken test results, called the significance.
Central Limit Theorem: States that a random variable defined as the average of a large number of independent and identically distributed random variables is itself approximately normally distributed.
http://blog.vctr.me/posts/central-limit-theorem.html

Linear Algebra
Basic Operations: Addition, Multiplication, Transposition
Transformations: Trace, Rank, Determinant, Inverse
Matrices: Almost all Machine Learning algorithms use matrix algebra in one way or another. This is a broad subject, too large to be included here in its full length.
Here's a start: https://en.wikipedia.org/wiki/Matrix_(mathematics)
Eigenvectors and Eigenvalues: In linear algebra, an eigenvector or characteristic vector of a linear transformation T from a vector space V over a field F into itself is a non-zero vector that does not change its direction when that linear transformation is applied to it.
http://setosa.io/ev/eigenvectors-and-eigenvalues/

Derivatives
Chain Rule
Rule: (f(g(x)))' = f'(g(x)) * g'(x)
Leibniz Notation: dz/dx = dz/dy * dy/dx
Jacobian Matrix: The matrix of all first-order partial derivatives of a vector-valued function. When the matrix is a square matrix, both the matrix and its determinant are referred to as the Jacobian in literature.
Gradient: The gradient is a multi-variable generalization of the derivative. The gradient is a vector-valued function, as opposed to a derivative, which is scalar-valued.
Tensors: For Machine Learning purposes, a Tensor can be described as a multidimensional matrix. Depending on the dimensions, the Tensor can be a Scalar, a Vector, a Matrix, or a multidimensional matrix. For example, when measuring the forces applied to an infinitesimal cube, one can store the force values in a multidimensional matrix.
Curse of Dimensionality: When the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality.
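The exponential growth behind the curse of dimensionality can be illustrated with two small calculations. The sketch below assumes data spread uniformly over a unit hypercube; the function names are hypothetical.

```python
def edge_length_for_fraction(fraction, dims):
    """Edge of a sub-cube that covers `fraction` of a unit hypercube's volume.

    A cube with edge e has volume e ** dims, so e = fraction ** (1 / dims).
    """
    return fraction ** (1.0 / dims)


def samples_for_grid(points_per_axis, dims):
    """Number of samples needed to keep the same grid density in every axis."""
    return points_per_axis ** dims


# To capture 1% of uniformly spread data, a neighbourhood must span this
# share of each axis as the number of dimensions grows.
edges = {dims: edge_length_for_fraction(0.01, dims) for dims in (1, 2, 10, 100)}

# Keeping 10 sample points per axis needs exponentially many samples.
grid_sizes = {dims: samples_for_grid(10, dims) for dims in (1, 2, 3, 6)}
```

In one dimension a neighbourhood covering 1% of the data spans 1% of the axis, but in ten dimensions it must span about 63% of every axis, so "local" neighbourhoods stop being local.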