So you want to get started with Machine Learning and are not entirely sure where to turn? At Google, we run everyone through the Machine Learning Crash Course, which can kick-start your learning with a range of tutorials and content. The course we have released is a great resource for anyone who wants to get started in the field.

I’m sharing my notes on the course below. **You should not in any way think these are an official representation of the course.** They are just my notes, which can help you get started.

I’m a huge fan of open sourcing notes and learning through everything I’ve done (see here), so when I’m doing a course and discover it looks long and daunting, I take the liberty of taking notes, summarizing it and allowing others to improve on it. If you are keen, you can check out a Google Doc version of this blog post here.

Medium has a funky lack of styling — so the blog post has some quirks in that regard but hopefully this can serve as a starting point for many, or as a companion doc to the great content that Google has open-sourced in the crash course.

Without further ado, here are my Google Machine Learning Crash Course notes summarized.

### Table of Contents

Terminology

Regression

Training and Loss

Typical Flow

Synthetic Features

Generalization

Training, Validation and Test Sets: Splitting Data

Representation & Feature Engineering

Correlation

Feature Crosses

Regularization

Logistic Regression

Classification

ROC & AUC

Prediction Bias

Regularization

Neural Networks

Risk of Neural Networks

Optimizers

Multiclass Classifiers

One vs. All (OvA) / Logit Layer

Softmax (Multi-class classification)

Embeddings

Convolutional Neural Networks

Great Links

### Terminology

**Epoch**— When the entire data set is passed forward and backward through a model once.

**Gradient Clipping**— Capping gradients to ensure that they don’t explode when back-propagating.

*— In Deep networks or RNN — error gradients can become very large. Some common ways to resolve this are:*

*— — Rectified Linear Unit (reLU)*

*— — Gradient Clipping*

*— — Use LSTM*

**Cost Function (aka Loss or Error)**— A measure of how wrong the model is in terms of its ability to estimate the relationship between X and y. Usually the difference between the predicted value and the actual value.

**Saddle Points**— Stationary points (i.e. where the gradient is zero) that are neither a local minimum nor a local maximum.

**Batch Size**— Total number of training examples {features, label} in a single batch.

**Steps**— Total number of training iterations. One step calculates the loss from one batch and uses that value to modify the model’s weights once.

**Iteration / Number of batches**— Total number of batches needed to complete one epoch.

**Total Trained Examples**— Batch size * steps

**Periods**— Control the granularity of reporting. The number of training examples in each period is batch size * steps / periods.

**Labels**— The thing we are predicting, i.e. the y in a simple linear regression.

**Information theory**— The more one knows about a topic, the less new information one is apt to get about it.

**Cross-Entropy**— Log loss; measures the performance of a **classification model** whose output is a probability value between 0 and 1.

*— So a predicted value of 0.012 when the actual observation is 1 would result in a high log loss. A perfect model would have 0 log loss.*

**Entropy**— Refers to disorder or uncertainty.

**Learning Rate**— The size of the update steps in Gradient Descent. With a higher learning rate, we can cover more ground in each step but risk overshooting the lowest point on the slope.

*— Lower learning rate* — More precise, but needs many more steps and is more expensive.

*— Higher learning rate* — Not as expensive, but a higher probability of overshooting the minimum.

**Features**— An input variable — the X variable in a simple linear regression.

*— Words in email text*

*— Senders address*

*— Time of day the email was sent*

*— Email contains the phrase “one weird trick”*

**Examples**— Labeled vs unlabelled data

*— Labelled* — Includes both features and the label

*— Unlabelled* — Includes only the features and NOT the label

**Models**— Defines the relationship between features and label.

*— Training* — means creating or learning the model

*— Inference* — applying the trained model to unlabeled examples.

**Regression**— Model predicts continuous values

*— What is the value of a house in CA?*

*— What is the probability that user will click an ad?*

**Classification**— Predicts discrete values

*— Is a given email message spam or not spam?*

*— Is this an image of a dog, a cat or a hamster?*

**Goldilocks Principle**— The “just right” model fit to the data

**Hessian Matrix**— Matrix of second-order partial derivatives that allows us to determine whether a stationary point is a minimum, a maximum or a saddle point.

### Regression

**L2 Loss** — Square of the difference between prediction and label. The goal is to minimize this loss across the entire data set.

*— (observation - prediction)²*

*— (y - y’)²*

**Example**:

Almost any example for regression involves the linear equation $y' = b + w_1 x_1$, where:

- $y'$ is the predicted label (desired output)
- $b$ is the bias (y-intercept), sometimes written $w_0$
- $w_1$ is the weight of feature 1, or the “slope”
- $x_1$ is a feature (input)

### Training and Loss

**Training**— Training a model simply means learning (determining) good values for all the weights and the bias from **labeled examples**.

**Empirical Risk Minimization**— Finding a model that minimizes loss.

**Penalty**— Loss is the penalty for a bad prediction.

**Squared Loss**— Mean squared error (MSE)

*— Average squared loss per example over the whole dataset.*

*— Sum up all the squared losses for individual examples and then divide by the number of examples:*

$$MSE = \frac{1}{N} \sum_{(x, y) \in D} \left(y - \text{prediction}(x)\right)^2$$

*— Details:*

*— — x is the set of features that the model uses to make predictions*

*— — y is the example’s label*

*— — prediction(x) is a function of the weights and bias in combination with the set of features x*

*— — D is a data set containing many labeled examples, which are (x, y) pairs*

*— — N is the number of examples in D*
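To make the squared-loss definition concrete, here is a minimal sketch in plain Python. It assumes a one-feature linear model; the function names `predict` and `mean_squared_error` and the toy data set are made up for illustration and are not part of the course material.

```python
def predict(x, weight, bias):
    """Linear prediction y' = bias + weight * x for a single feature."""
    return bias + weight * x


def mean_squared_error(examples, weight, bias):
    """Average of (y - prediction(x))^2 over all (x, y) pairs in the data set."""
    squared_losses = [(y - predict(x, weight, bias)) ** 2 for x, y in examples]
    return sum(squared_losses) / len(examples)


# Hypothetical toy data set D of (x, y) pairs, generated from y = 2x + 1.
D = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

perfect = mean_squared_error(D, weight=2.0, bias=1.0)  # model matches the data
worse = mean_squared_error(D, weight=1.0, bias=1.0)    # slope is too small
print(perfect, worse)
```

The model whose weight and bias match the data gets a loss of zero, while the mis-specified model is penalized more the further its predictions drift from the labels.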

**Reducing loss**:

**— Hyper-parameters — **The configuration settings used to tune how the model is trained.

**— Derivative of Squared Loss (L2 Loss) **— Tell us how the loss changes for a given example

*— — Simple to compute and convex*

*— — Repeatedly take these steps in the direction that minimizes loss*

**— — Gradient Descent** — The negative gradient tells us the direction in which to head to reduce the loss; how far we move in that direction is controlled by hyperparameters.

**— — — Learning Rate** — Scales the size of each step taken along the negative gradient.

*— — — If it’s small — We take very small steps, so training needs many more steps to converge.*

*— — — If it’s large — We take large steps and can potentially overshoot the minimum.*

**Gradient Descent**

**— Batch Gradient Descent (BGD)** — *A batch is the total number of examples you use to calculate the gradient in a single iteration. In BGD, you process the entire data set at once, which can take a very long time.*

**— Stochastic Gradient Descent (SGD)** — *Rather than calculating the gradient over the entire data set, SGD randomly chooses a SINGLE example per iteration and uses it to estimate the gradient direction towards the local minimum.*

**— Mini-Batch Gradient Descent (MBGD)** — *Typically between 10 and 1,000 examples chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full batch.*

**— Redundancy** — *Redundancy becomes more likely as the batch size grows. Some redundancy can be useful to smooth out noisy gradients, but huge batches don’t provide much more predictive value than large ones.*
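Here is a rough sketch of mini-batch gradient descent in plain Python, fitting a one-feature linear model with squared loss. The function name `minibatch_sgd`, the hyperparameter values and the toy data are all assumptions chosen for illustration; a real training loop would use a library such as TensorFlow.

```python
import random


def minibatch_sgd(examples, learning_rate=0.05, batch_size=2, steps=2000, seed=0):
    """Fit y' = bias + weight * x by mini-batch gradient descent on squared loss."""
    rng = random.Random(seed)
    weight, bias = 0.0, 0.0
    for _ in range(steps):
        batch = rng.sample(examples, batch_size)  # pick a random mini-batch
        # Gradient of the mean squared loss over the batch.
        grad_w = sum(-2 * x * (y - (bias + weight * x)) for x, y in batch) / batch_size
        grad_b = sum(-2 * (y - (bias + weight * x)) for x, y in batch) / batch_size
        # Step in the direction of the negative gradient, scaled by the learning rate.
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


# Hypothetical noiseless data generated from y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = minibatch_sgd(data)
print(round(w, 3), round(b, 3))
```

Setting `batch_size=1` turns this into plain SGD, and `batch_size=len(data)` turns it into full-batch gradient descent, which is a handy way to compare the three variants on the same data.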

### Typical Flow

**Load the data using**

*— var = pandas.read_csv("data")*

**Randomize the data**

*— var = var.reindex(np.random.permutation(var.index))*

**Establish what you want to be the label (also called a target)**

*— What are the features ?*

*— — Establish what the features are*

**Define the Features and Configure the Columns**

*— I.e. my_feature = california_housing_dataframe[["total_rooms"]]*

*— I.e. feature_columns = [tf.feature_column.numeric_column("total_rooms")]*

**Define the targets (i.e. label)**

*— targets = california_housing_dataframe["median_house_value"]*

**Training the model:**

*— my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)*

*— Use gradient clipping for SGD:*

*— — my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)*

**Configure the regressor:**

*— linear_regressor = tf.estimator.LinearRegressor(feature_columns=feature_columns, optimizer=my_optimizer)*

**Define an input function:**

*— This instructs TensorFlow on how to preprocess data*

- Train the model inside a loop so that we can assess the loss at regular intervals.

### Synthetic Features

- A synthetic feature is a feature added to the data that is derived from a transformation or combination of one or more other features.

*— I.e.* california_housing_dataframe["rooms_per_person"] = (california_housing_dataframe["total_rooms"] / california_housing_dataframe["population"])

### Generalization

**Definition —** Refers to your model’s ability to make correct predictions on new, previously unseen data as opposed to the data used to train the model.

**Risk of Overfitting —** An overfit model gets a low loss during training but does a terrible job of predicting new data.

*— Cause — making a model more complex than necessary.*

*— Occam’s Razor — “one should select the answer that makes the fewest assumptions”*

**Critical Assumptions:**

*— IID — Examples are drawn independently and identically (i.i.d.) at random from the distribution (that is, examples do not influence one another)*

*— Stationary — The distribution does not change over time (a violation would be retail data in November compared to June)*

*— Same distribution — Including training, validation and test sets*

### Training, Validation and Test Sets: Splitting Data

**Training Set**— A subset used to train a model.

**Validation Set**— A subset that allows you to adjust your hyper-parameters (learning rate, batch size etc).

**Test Set**— A completely impartial, held-out data set used to evaluate the trained model.

**Primary aspects:**

*— It’s large enough to yield statistically meaningful results*

*— It’s representative of the data set as a whole.*

*— The goal is a model that generalizes well, performing well on both the training data and the test (i.e. new) data*

**Batch Size Influence**— In most runs, increasing Batch size does NOT influence Training or test loss significantly.
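A minimal sketch of a three-way split in plain Python is shown below. The 70/15/15 proportions, the seed and the function name `split_dataset` are assumptions for illustration; the course does not prescribe specific ratios.

```python
import random


def split_dataset(examples, train_frac=0.7, validation_frac=0.15, seed=42):
    """Shuffle the examples, then slice them into training, validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # randomize before splitting
    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains is the test set
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Shuffling before slicing matters: if the source data is sorted (by date or location, say), an unshuffled split would give the three sets different distributions.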

### Representation & Feature Engineering

**See how feature engineering helped**— Medium Post

*— See Also — This medium post*

**Apache Spark**— https://spark.apache.org/docs/latest/ml-features.html

*— Feature Engineering —* https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/

**High Cardinality**— There is a significant degree of variation in the values of a feature, i.e. it has many distinct values.

**Common methodologies in feature engineering**

**— Non-zero** — Columns should have mostly non-zero values to be useful

**— “One hot encoding”** for strings

*— — Represents a categorical feature with m possible values as m binary values, exactly one of which is set to 1.*

*— — Each value is transformed into a sparse vector*

*— — Units should reflect a reasonable source (i.e. age in years not in seconds)*

*— — Shouldn’t have crazy outlier data*

**— Constant semantics** — “stationary” aspect of the data

**— Feature hashing** — Hashes categorical values into vectors with a fixed length

*— — Typically has lower sparsity and higher compression compared to OHE*

*— — — “sparse_column_with_hash_bucket”*

**— Bucketization or “Binning”**

*— — In order to reduce computational load, we can put data into large buckets and then represent them as integers*

**— — Fixed-width binning** — *Each bin has a specific fixed width, usually pre-defined against a range of values.*

**— — Q-Binning** — *q-Quantiles provide more flexibility than fixed-width binning: with fixed widths you can end up with very unevenly populated bins, whereas quantile bins each hold roughly the same number of examples.*

**— — Time Binning** — *Apply binning on time data to make it categorical and more general*

*— — Utilize* **tf.feature_column.bucketized_column** *to get features into bucketized columns*

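```python
import numpy as np
import tensorflow as tf  # TF 1.x-style feature-column API, as used in the course


def get_quantile_based_boundaries(feature_values, num_buckets):
    # Evenly spaced quantile levels, e.g. 1/7, 2/7, ..., 6/7 for 7 buckets.
    boundaries = np.arange(1.0, num_buckets) / num_buckets
    quantiles = feature_values.quantile(boundaries)
    return [quantiles[q] for q in quantiles.keys()]


# Divide households into 7 buckets.
# california_housing_dataframe is the pandas DataFrame loaded earlier in the flow.
households = tf.feature_column.numeric_column("households")
bucketized_households = tf.feature_column.bucketized_column(
    households,
    boundaries=get_quantile_based_boundaries(
        california_housing_dataframe["households"], 7))
```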

**Scaling Feature Values**— Converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1).

*— Reasons this is critical:*

*— — Helps gradient descent converge more quickly*

*— — Helps avoid the NaN trap which becomes an issue during backpropagation where all values can become NaN*

*— — Without scaling, the model will pay too much attention to features with a wider range*

**— Z-Score** — The Z score is the number of standard deviations a value lies away from the mean. In other words:

*— — scaled_value = (value - mean) / stddev*

*— — — I.e. if mean = 100, stddev = 20 and original value = 130*

*— — — scaled_value = (130 - 100) / 20*

*— — — scaled_value = 1.5*

**— Log Transformation**

*— — Compresses the large numbers and expands the range of the smaller numbers, BUT can still leave a long tail when there are extreme outliers.*

**— Clipping** — Clipping allows us to cap or “clip” feature values at some maximum (or minimum) value.
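The three scaling techniques can be sketched with nothing but the standard library. The helper names below (`z_score`, `log_scale`, `clip`) are hypothetical, and using `log1p` for the log transformation is my own choice so that a value of 0 stays at 0.

```python
import math


def z_score(value, mean, stddev):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / stddev


def log_scale(value):
    """Compress large values; log1p(x) = log(1 + x), so 0 maps to 0."""
    return math.log1p(value)


def clip(value, minimum, maximum):
    """Cap a value to the range [minimum, maximum]."""
    return max(minimum, min(value, maximum))


print(z_score(130, mean=100, stddev=20))  # 1.5
print(clip(9000, 0, 4000))                # 4000
print(log_scale(0))                       # 0.0
```

In practice the mean, standard deviation and clipping bounds should be computed on the training set only and then reused unchanged for validation and test data.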

**Rules of thumb:**

**— Avoid rarely used discrete feature values (often a symptom of high cardinality)** — *Good feature values should appear more than 5 or so times in the data set. This enables the model to learn how the feature value relates to the label.*

**— Prefer clear and obvious meanings, particularly in data types** — *Each feature should have a clear and obvious meaning to anyone.*

*— — I.e. years should just be listed as years not as seconds*

**— Clean inconsistent values and outlier values** — *If there is data that is out of bounds, it should either be transformed and fixed and/or there should be floor and ceiling restrictions. Typical problems include:*

*— — Omitted values - For instance, a person forgot to enter a value for a house’s age.*

*— — Duplicate examples - For example, a server mistakenly uploaded the same logs twice.*

*— — Bad labels - For instance, a person mislabeled a picture of an oak tree as a maple.*

*— — Bad feature values - For example, someone typed in an extra digit, or a thermometer was left out in the sun.*

### Correlation

**Pearson correlation coefficient —** A measure of the linear correlation between two variables, ranging from -1 to 1.

*— dataframe.corr()*

**Primarily looking for:**

*— Features that are NOT correlated strongly with each other — that way they can add independent information*

*— Features that are strongly correlated to the target — We ideally only want to use features that are strongly correlated to the target.*
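For intuition about what `dataframe.corr()` computes for each pair of columns, here is a hand-rolled Pearson correlation in plain Python. The function name `pearson` and the three toy columns are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


rooms = [1, 2, 3, 4, 5]
price = [10, 20, 30, 40, 50]   # perfectly linear in rooms
other = [3, 1, 4, 1, 5]        # hypothetical, only loosely related column

strong = pearson(rooms, price)
weak = pearson(rooms, other)
print(strong, weak)
```

A value near 1 or -1 against the target suggests a useful feature, while a value near 1 or -1 between two features suggests they carry largely the same information.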

### Feature Crosses

**Synthetic feature**— Formed by multiplying two or more features. Crossing combinations of features can provide predictive abilities beyond what features can provide individually.

*— Using feature crosses + massive data is an extremely efficient way for learning highly complex models.*

*— Encodes nonlinearity in the feature space by multiplying two or more input features together.*

**Cross One-Hot Vectors**— These are essentially logical conjunctions

*— A one-hot encoding of each generates vectors with binary features that can be interpreted as AND statements which can be INCREDIBLY useful when bucketing these:*

*— — For example:*

*— — — Single Features*

*— — — Cross One-Hot Vectors*

*— — — Consider also a satisfied dog owner*

*— Behaviour type (barking, crying, snuggling etc)*

*— Time of day*

*— Feature cross from both of these:*

*[behaviour type x time of day]*

*— This provides a much better predictor of owner satisfaction:*

*— — Dog cries at 5pm when the owner returns from work*

*— — Dog cries at 3am when owner is sleeping*

**Can get higher order crosses**

— X1, X2, X1X2, X1², x2³ ….. Etc

— But remember Occam’s Razor!
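The dog example can be sketched as a cross of two one-hot vectors. The vocabularies and the helper names `one_hot` and `cross` below are assumptions for illustration; in the course this is handled by TensorFlow's feature columns rather than hand-written code.

```python
def one_hot(value, vocabulary):
    """One-hot encode `value` against an ordered vocabulary."""
    return [1 if value == item else 0 for item in vocabulary]


def cross(vec_a, vec_b):
    """Cross two one-hot vectors: each output slot is the AND of one slot from each."""
    return [a * b for a in vec_a for b in vec_b]


behaviours = ["barking", "crying", "snuggling"]
times = ["3am", "5pm"]

crossed = cross(one_hot("crying", behaviours), one_hot("5pm", times))
print(crossed)  # [0, 0, 0, 1, 0, 0]
```

The crossed vector has one slot per (behaviour, time) pair, so a linear model can learn a separate weight for "crying at 5pm" and "crying at 3am", which neither feature can express on its own.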

### Regularization

**Regularization is a strategy to avoid overfitting by adding a term to the loss that penalizes model complexity.**

*— Structural Risk Minimization *— Minimize loss + complexity

**L1 Regularization**

*Penalizes weights in proportion to the sum of absolute values of the weights.*

*Drives the weights of irrelevant features to 0 particularly in sparse vectors.*

**L2 Regularization**

*Penalizes weights in proportion to the sum of the squares of the weights.*

*Helps drive **outlier weights closer** to 0 but not quite 0.*

**Generalize**— We need the model to generalize, so that what it learns from the training data also holds up on the validation and test data.

*— Penalizing model complexity:*

*— — Model complexity as a function of the weights of all the features in the model*

*— — Model complexity as a function of the total number of features with nonzero weights*

**L2 Regularization (Ridge Regularization)**

*— Penalizes weights in proportion to the sum of the squares of the weights.*

*— It helps drive outlier weights (high positive or low negative values) closer to 0 but not exactly 0. This improves generalization in linear models.*

*— Consider the following example:*

— Weights close to zero have little effect on model complexity while outlier weights can have a huge impact per the above example.
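As a small illustration of how a single outlier weight dominates the L2 term, here is a sketch in plain Python. The weight values are hypothetical and chosen only to make the effect visible; they are not taken from these notes.

```python
def l2_penalty(weights):
    """L2 regularization term: the sum of the squared weights."""
    return sum(w ** 2 for w in weights)


# Hypothetical weights; only the third one is an outlier.
weights = [0.2, 0.5, 5.0, 1.0, 0.25, 0.75]

total = l2_penalty(weights)
outlier_share = 5.0 ** 2 / total  # fraction of the penalty due to the outlier
print(total, outlier_share)
```

With these numbers the one large weight accounts for more than 90% of the penalty, which is why L2 regularization pushes hardest on outlier weights and barely touches weights that are already near zero.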

**Lambda values**

*— Strike a balance between simplicity and training-data fit*

**— — Too high** — *If your lambda value is too high, your model will be simple but you risk underfitting the data.*

*I.e. your model won’t learn enough about the training data to make useful predictions.*

**— — Too low** — *Your model will be more complex and you run the risk of overfitting your data.*

*I.e. your model will be too specific to the training data and won’t be able to generalize to new data.*

**Dropout Regularization [for NN only]**

*— Works by randomly “dropping out” units in a network for a single gradient step*

*The more you drop out, the stronger the regularization:*

*— — 0 = no dropout regularization*

*— — 1 = drop out everything, learns nothing*

### Logistic Regression

**Summary**— Utilizes a sigmoid and is bounded between 0 and 1.

*— Unlike linear regression, this means that the predictions are mapped to values between 0 and 1, OR can be converted to a binary value by thresholding (i.e. above 0.6 vs. below)*

*Two important strategies for logistic regression:*

**— L2 regularization** — *penalizes huge weights*

**— Early stopping** — *limiting the number of training steps or the learning rate*

**— Reasoning:**

*Without regularization, the model keeps trying to drive loss to zero on all examples and never gets there, driving the weights towards +infinity or -infinity.*
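Here is a minimal sketch of the two building blocks of logistic regression, the sigmoid and log loss, in plain Python. The function names are my own, and the 0.012 probability simply reuses the cross-entropy example from the Terminology section.

```python
import math


def sigmoid(z):
    """Squash any real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(label, probability):
    """Cross-entropy for one example with a 0/1 label."""
    return -(label * math.log(probability) + (1 - label) * math.log(1 - probability))


print(sigmoid(0.0))  # 0.5

confident_and_wrong = log_loss(1, 0.012)
confident_and_right = log_loss(1, 0.9)
print(confident_and_wrong > confident_and_right)  # True
```

Because the sigmoid only reaches exactly 0 or 1 in the limit, log loss can always be reduced a little further by growing the weights, which is the reason regularization or early stopping is needed.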

### Classification

**Classification —** Logistic regression returns a probability, which can be compared against a threshold to produce a classification outcome.

*— I.e. if probability > 0.7 — spam etc*

**Critical metrics for classification tasks:**

**— Accuracy** — *Fraction of predictions that our model got correct (i.e. correct predictions / total number of predictions)*

**— Precision** — *How often the model was correct when predicting the* **positive class** *(True Positives / (True Positives + False Positives))*

**— Recall** — *Out of all the possible positive labels, how many did the model correctly identify (True Positives / (True Positives + False Negatives))*

**Classification/Decision Threshold**— The score above which a logistic regression value is mapped to the positive class of a binary category.

**Positives & Negatives:**

**— True positive** — *an outcome where the model correctly predicts the positive class.*

**— True negative** — *an outcome where the model correctly predicts the negative class.*

**— False positive** — *an outcome where the model incorrectly predicts the positive class.*

**— False negative** — *an outcome where the model incorrectly predicts the negative class.*

**Confusion matrix**— An NxN table that summarises how well a classification model has performed using True Positives, False Positives, False Negatives and True Negatives.

*Row — Actual label*

*Column — The label that the model predicted*

**Accuracy on this model is 91% — this looks good right ?**

*— NO! 8 out of 9 malignant tumors are incorrectly diagnosed, which is terrible.*

*— A classifier that ONLY ever classifies tumors as benign would reach the same accuracy (FP + TN)*

**Instead use** *Precision* and *Recall*:

**Precision** — *What proportion of positive identifications were actually correct?*

*— TP / (TP + FP) = 1 / 2 = 0.5*

*— When it predicts a tumor is malignant, it is correct 50% of the time.*

**Recall** — *What proportion of ACTUAL positives were identified correctly?*

*— TP / (TP + FN) = 1 / 9 = 0.11*

*— It correctly identifies only 11% of malignant tumors*
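The tumor example can be reproduced in a few lines of Python. The confusion matrix itself is not reproduced in these notes, so the counts below are an assumption (100 tumors in total: TP=1, FP=1, FN=8, TN=90), chosen because they are consistent with the 91% accuracy, 1/2 precision and 1/9 recall quoted in the text.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Of everything predicted positive, how much really was positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything actually positive, how much the model found."""
    return tp / (tp + fn)


# Assumed counts for the tumor example (positive class = malignant).
tp, fp, fn, tn = 1, 1, 8, 90

print(accuracy(tp, fp, fn, tn))   # 0.91
print(precision(tp, fp))          # 0.5
print(round(recall(tp, fn), 2))   # 0.11
```

Under the same assumed counts, a model that always predicts benign gets 91 of the 100 tumors right as well, which shows why accuracy alone is misleading on a class-imbalanced data set.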

**Tug of War between Precision and Recall**

**— INCREASING the classification threshold** — **fewer total positives** *(Raising the classification threshold classifies fewer items as positive; this increases both the false negative rate and the true negative rate)*

False positives decrease, but false negatives increase.

Precision usually increases, recall decreases or stays the same.

**— DECREASING the classification threshold** — **more total positives** *(Lowering the classification threshold classifies more items as positive; this increases both the false positive rate and the true positive rate)*

False positives increase, false negatives decrease.

Precision usually decreases, recall increases or stays the same.

### ROC & AUC

**ROC Curve**— Shows the performance of a classification model at all classification thresholds.

*The curve plots:*

True positive rate (same as recall) — TP / (TP + FN)

False positive rate — FP / (FP + TN)

**AUC**— Area under the ROC curve; it measures the entire two-dimensional area underneath the curve.

*— Definition* — the probability that the model ranks a random positive example more highly than a random negative example.

**Positives of AUC**:

*Scale-invariant* — Measures how well predictions are ranked, rather than their absolute values

*I.e. you can double the values of all the predictions and it won’t impact the AUC*

*Classification-threshold-invariant* — Measures the quality of the model’s predictions irrespective of what classification threshold is chosen.
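The ranking definition of AUC translates directly into code: compare every positive with every negative and count how often the positive scores higher. This is a brute-force sketch for small inputs; the function name `auc`, the tie handling and the sample scores are assumptions for illustration.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # count ties as half a win
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical model outputs

print(auc(labels, scores))                   # 0.75
print(auc(labels, [s / 2 for s in scores]))  # 0.75
```

Halving every score leaves the result unchanged because only the ordering of the predictions matters, which is the scale invariance described in the notes.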

### Prediction Bias

**Logistic Regression SHOULD be unbiased:**

*“Average of predictions” *= *“average of observations”*

*A good model will have near-zero bias — that is, the “average prediction” should compare to the “average observation”*

**Prediction Bias**— a quantity that measures how far apart those two averages are.

*Prediction bias = average of predictions - average of labels in data set*

**Root causes:**

*— Incomplete feature set*

*— Noisy data set*

*— Buggy pipeline*

*— Biased training sample*

*— Overly strong regularization*

**Bucketing**

*You cannot accurately determine the prediction bias based on only one example — you must examine the prediction bias on a bucket of examples.*

### Regularization

**When to regularize?** When we are overfitting.

**Sparse Feature Crosses**:

*— If we are crossing large feature spaces — we will potentially have large and “noisy” data*

*— Particularly if those features are sparse*

**Comparison of L1 vs L2:**

**— L1 Regularization**

*Penalizes weights in proportion to the sum of absolute values of the weights.*

*Drives the weights of irrelevant features to 0, particularly in sparse vectors.*

*Reduces the total number of features by moving them to 0.*

**— L2 Regularization**

*Penalizes weights in proportion to the sum of the squares of the weights.*

*Helps drive **outlier weights closer** to 0 but not quite 0.*

*Does not reduce the total number of features, just reduces their impact.*

**L1 vs L2 Regularization**

*L2 and L1 penalize weights differently:*

*— L2 penalizes weight²*

*— — Derivative: 2*weight*

*— L1 penalizes |weight| (absolute)*

*— — Derivative: K (constant, value is independent of weight)*
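The difference between the two derivatives is easiest to see by applying only the regularization step, repeatedly, to a single weight. This sketch ignores the data loss entirely, and the clamp at zero in `l1_step` is my own simplification to stop the weight from oscillating around zero.

```python
def l2_step(weight, rate):
    """Gradient step on weight**2: the derivative is 2 * weight."""
    return weight - rate * 2 * weight


def l1_step(weight, rate):
    """Gradient step on |weight|: the derivative is a constant with the weight's sign."""
    if abs(weight) <= rate:
        return 0.0  # clamp at zero instead of stepping past it
    return weight - rate if weight > 0 else weight + rate


w_l1 = w_l2 = 1.0
for _ in range(20):
    w_l1 = l1_step(w_l1, rate=0.1)
    w_l2 = l2_step(w_l2, rate=0.1)
print(w_l1, w_l2)
```

L2 removes a fixed percentage of the weight each step, so the weight shrinks but never reaches zero; L1 subtracts a fixed amount each step, so the weight hits exactly zero after a finite number of steps.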

**Risks**:

**— L1 regularization can also drive the weights of informative features to 0:**

*Weakly informative features.*

*Strongly informative features on different scales.*

*Informative features strongly correlated with other similarly informative features.*

**— How does increasing the L1 regularization rate (lambda) influence the learned weights?**

*Increasing the L1 regularization rate generally dampens the learned weights; however, if the regularization rate goes too high, the model can’t converge and losses are very high.*

### Neural Networks

**Why?** Some datasets cannot be solved with a linear model. To model nonlinearity, we can add hidden layers with nonlinear activation functions.

**Hidden Layers**— A hidden layer takes the weighted sum of its inputs and applies a nonlinear transformation function to it before passing the result on to the weighted sums of the next layer.

**Activation Function**— The nonlinear function applied in each hidden layer is the activation function.

*— Sigmoid activation function* — converts the weighted sum to a value between 0 and 1.

*— ReLU* — The rectified linear unit often works a little better than a smooth function like a sigmoid, and is also cheaper to compute.

**What is a neural network?**

*— A set of nodes analogous to neurons*

*— A set of weights representing the connections between each neural network layer and the layer beneath it.*

*— A set of biases, one for each node.*

*— An activation function that transforms the output of each node in a layer. Different layers can have different activation functions.*

**Backpropagation**

*— Determines how much to update each weight of the network after comparing the predicted output with the desired output.*

**— Error decreases** — If the error goes down when the weight increases, increase the weight.

**— Error increases** — If the error goes up when the weight increases, decrease the weight.

**How backpropagation works:**

**Check out this AMAZING FLOW DIAGRAM**

*Utilizes the chain rule to decide how the error changes with respect to each weight.*

*Also computes:*

*— How the error changes with respect to:*

*— — The total input of the node*

*— — The total output of the node.*

### Risk of Neural Networks

**Vanishing Gradients**— Gradients for lower layers (closer to the input) can become very small, so these layers train very slowly or not at all.

*— ReLU helps here*

**Exploding Gradients**— If the weights in a network are very large, then the gradients for the lower layers involve products of many large terms. Gradients that get too large for training to converge EXPLODE (i.e. lots of NaN results)

*To reduce this problem:*

*— Batch normalization can help prevent exploding gradients*

*— Lowering learning rate can also help*

**Dead ReLU Units**

*— Once the weighted sum for a ReLU unit falls below 0, the unit can get stuck.*

*It outputs 0 activation and gradients can no longer flow backwards.*

*— Lower the learning rate*

*This can help resolve the “dead ReLU” problem.*

### Optimizers

**Overview of Gradient Descent Optimizers**— See here

**Adagrad optimizer**— This optimizer adapts the learning rate to the parameters, using a different learning rate for every parameter at every time step t.

*— Smaller updates (i.e. lower learning rates)* — For parameters associated with frequently occurring features

*— Larger updates (i.e. higher learning rates)* — For parameters associated with infrequent features.

*— Benefits*:

*— — Well suited when dealing with sparse data.*

*— — Eliminates the need to manually tune the learning rate.*

*— Weakness:*

*— — Diminishing learning rates — The squared gradients accumulate in the denominator. Because every added term is positive, the accumulated sum keeps growing during training, which causes the learning rate to shrink and eventually become too small for the model to learn anything new.*
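The diminishing learning rate is easy to see in a single-parameter sketch of the Adagrad update. The function name `adagrad_update`, the default values and the constant gradient of 1.0 are assumptions for illustration, and implementations differ slightly in where they place the small `epsilon` term.

```python
import math


def adagrad_update(param, grad, accumulated, learning_rate=0.1, epsilon=1e-8):
    """One Adagrad step for a single parameter.

    Returns the new parameter and the updated sum of squared gradients.
    """
    accumulated += grad ** 2  # every added term is positive, so this only grows
    effective_rate = learning_rate / (math.sqrt(accumulated) + epsilon)
    return param - effective_rate * grad, accumulated


param, accumulated = 1.0, 0.0
step_sizes = []
for _ in range(5):
    new_param, accumulated = adagrad_update(param, grad=1.0, accumulated=accumulated)
    step_sizes.append(param - new_param)
    param = new_param
print(step_sizes)
```

Even though the gradient is identical at every step, each update is smaller than the previous one, because the accumulated sum in the denominator keeps growing.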

**Adaptive Moment Estimation (ADAM) —** Computes adaptive learning rates for each parameter.

*— How does it work? It stores an exponentially decaying average of past squared gradients, plus an exponentially decaying average of past gradients (similar to momentum).*

*— Adam behaves like “a heavy ball with friction” and prefers flat minima in the error surface.*

**Best algorithm? Go with Adam for now.**

*— Right now, probably ADAM — achieves the best result with the default value particularly if the data is sparse.*

— Many others agree with this conclusion.

### Multiclass Classifiers

#### One vs. All (OvA) / Logit Layer

**One vs All**— Leverages binary classification to solve a problem with N possible solutions: a one-vs-all solution consists of N separate binary classifiers, one for each possible outcome.

**— When to use?**

*When the total number of classes is small.*

#### Softmax (Multi-class classification)

**Category Distribution**— Assigns probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0

This additional constraint helps training converge more quickly than it otherwise would.

**Options**:

**— Full Softmax** — Calculates a probability for every possible class.

**— Candidate Sampling** — Softmax calculates a probability for all the positive labels but only for a random sample of negative labels.* (I.e. if we are interested in determining whether an input image is a beagle or a bloodhound, we don’t have to provide probabilities for every non-doggy example.)*

**Single vs Many Labels**

*— Softmax assumes that each example is a member of exactly one class. Some examples can be members of multiple classes.*

*— In these scenarios:*

*You should not use Softmax*

*You must rely on multiple logistic regressions.*
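A minimal full-softmax sketch in plain Python is shown below. The function name `softmax` and the three logits are made up for illustration; subtracting the maximum logit first is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits for three classes
print(probs, sum(probs))
```

Because the outputs are forced to sum to 1, raising the probability of one class necessarily lowers the others, which is why softmax only suits problems where each example belongs to exactly one class.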

### Embeddings

**Embeddings**— A mapping from discrete objects (words etc) to vectors of real numbers. They translate large sparse vectors into a lower-dimensional space that preserves semantic relationships.

*— Assumes interests can be roughly explained by d aspects*

*— Each embedding is a d-dimensional point where the value in the dimension d represents how much the movie fits that aspect.*

*— Embedding Layer — Is just a hidden layer*

**Translating to a Lower-Dimensional Space**

— You can solve the core problems of sparse input data by mapping your high-dimensional data into a lower-dimensional space.

*— Lookup tables:*

**Matrix** — *An embedding is a matrix in which each column is the vector that corresponds to an item in your vocabulary.*

*If the sparse vector contains count of the vocabulary items, you could multiply each embedding by the count of its corresponding item before adding it to the sum.*
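The lookup-and-sum idea can be sketched as follows. For readability this stores the embedding as a dictionary from vocabulary item to vector rather than as a matrix, and the function name `embed`, the vocabulary and the 2-dimensional vectors are all hypothetical.

```python
def embed(sparse_counts, embedding_table):
    """Sum each item's embedding vector, weighted by how often the item occurs."""
    dims = len(next(iter(embedding_table.values())))
    result = [0.0] * dims
    for item, count in sparse_counts.items():
        vector = embedding_table[item]
        for d in range(dims):
            result[d] += count * vector[d]
    return result


# Hypothetical 2-dimensional embeddings for a three-word vocabulary.
embedding_table = {
    "dog": [1.0, 0.0],
    "cat": [0.5, 0.5],
    "vet": [0.0, 1.0],
}

# Sparse input: only the words that actually occur, with their counts.
dense = embed({"dog": 2, "vet": 1}, embedding_table)
print(dense)  # [2.0, 1.0]
```

This is equivalent to multiplying the sparse count vector by the embedding matrix, but it only touches the entries for items that are present, which is what makes lookups cheap for large vocabularies.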

**PCA (Principal Component Analysis)**

— Combines predictors that are important and drops eigenvectors that are not important *(dimensionality reduction — i.e. dropping features)*

*Brings together:*

**Covariance Matrix** — *A measure of how each variable is associated with one another.*

**Eigenvectors** — *The directions in which our data are dispersed.*

**Eigenvalues** — *The relative importance of these different directions.*

**Word2Vec**

— Distributional Hypothesis — Words that have the same neighboring words tend to be semantically similar.

— I.e. both “dog” and “cat” appear close to the word “vet”

### Convolutional Neural Networks (CNN)

- A type of neural network that is extremely effective for image recognition, classification and segmentation.

**Channels** — All images are made up of pixels, which can be represented as a matrix of integers in the range 0 to 255. The number of channels an image has reflects the amount of color in it: a greyscale image only has a single channel and therefore is a single 2D matrix. A color image in the RGB space has 3 channels, i.e. three 2D matrices stacked on top of each other.

**Filter or Kernel** — A matrix that slides over the image, computing the dot product at each position to produce “Feature Maps” or “Convolved Features”. Different filter values will produce different feature maps for the same image.
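To show what "sliding a filter over the image" means, here is a naive sketch in plain Python with no padding and a stride of 1. The function name `convolve2d`, the 4x4 image and the edge filter are assumptions for illustration. Like most CNN libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), taking a dot product at each spot."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 single-channel image with a vertical edge down the middle.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
vertical_edge = [[-1, 1], [-1, 1]]  # responds where brightness rises from left to right

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # [[0, 510, 0], [0, 510, 0], [0, 510, 0]]
```

The feature map is zero wherever the image is flat and large exactly where the edge sits, which is the sense in which a filter "detects" a feature.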

— There are four main operations that typically form the major blocks of every CNN.

- Convolution
- Non linearity
- Pooling (sub-sampling)
- Classification

— The best place to dive deeper into CNNs is Ujjwal Karn’s blog post here.

— Another awesome resource is the CS231N notes below:

http://cs231n.github.io/convolutional-networks/

### Overall Great Links

**Overall learning —** http://deeplearning.stanford.edu/tutorial/

**Python Tricks —** https://chrisalbon.com/

**Deep Learning —** http://www.deeplearningbook.org/

**Feature Engineering —** https://www.amazon.com/Feature-Engineering-Machine-Learning-Principles/dp/1491953241/

**How feature engineering helps —** https://medium.com/unstructured/how-feature-engineering-can-help-you-do-well-in-a-kaggle-competition-part-i-9cc9a883514d

**Calculus Brush Up —** https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/the-gradient