Gradient Descent#

Orientation#

Where is Gradient Descent used in the Data Science Lifecycle?

06 PREDICTIVE MODELING:

  • select a ML algorithm

  • train the ML model

  • evaluate the performance

  • make predictions

Recap of Optimization of Linear Regression with OLS#




Model: \(y=b_{0}+b_{1}\cdot x+e\)

Residuals: \(e_{i}=y_{i}-\hat{y}_{i}\)

OLS:
Minimize the sum of squared residuals

\(\sum_{i=1}^{n}e_{i}^{2}\)

results in Normal Equation:

\(b=\left(X^{T}X\right)^{-1}X^{T}y\)

→ closed-form solution

Using OLS (normal equation), the best parameters were found at: 
            b₀ = 0.237 
            b₁ = 0.049
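The normal equation above can be computed directly with NumPy. This is a minimal sketch on hypothetical data (the original coefficients b₀ = 0.237 and b₁ = 0.049 came from a dataset not shown here, so the numbers below differ):

```python
import numpy as np

# Hypothetical example data; the true relationship is y = 0.25 + 0.05*x plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 0.25 + 0.05 * x + rng.normal(0, 0.1, size=50)

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x), x])

# Normal equation: b = (X^T X)^{-1} X^T y
b = np.linalg.inv(X.T @ X) @ X.T @ y
b0, b1 = b
```

Because this is a closed-form solution, no iteration is needed: one matrix computation yields the best parameters.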

Definition#

Gradient Descent is an iterative optimization algorithm to find the minimum of a function.

→ e.g. of a cost function

Note: Unlike the normal equation, Gradient Descent has no closed-form solution.

Find the parameters of a model by gradually minimizing your cost#

Go downhill in the direction of the steepest slope.

→ endpoint is dependent on the initial starting point

Gradient Descent in a nutshell#

  • Have some function \(J(b)\)

  • Want min \(J(b)\):

    • start with some random parameter \(b\)

    • keep changing \(b\) to reduce \(J(b)\) until we hopefully get to a minimum
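The nutshell above can be sketched in a few lines. Here \(J(b)\) is a simple convex function chosen purely for illustration:

```python
# Illustrative cost function with its minimum at b = 3
def J(b):
    return (b - 3.0) ** 2

def dJ(b):
    return 2 * (b - 3.0)   # derivative of J

b = 0.0                    # start with some (here: arbitrary) parameter value
alpha = 0.1                # learning rate
for _ in range(200):
    b = b - alpha * dJ(b)  # keep changing b to reduce J(b)
```

After the loop, `b` has moved very close to the minimizer 3.0.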

When to use Gradient Descent?#

  • GD is a simple optimization procedure that can be used for many machine learning algorithms (e.g. linear regression, logistic regression, SVM)

  • used for “Backpropagation” in Neural Nets (variations of GD)

  • can be used for online learning (update as soon as new data comes in)

  • gives faster results for problems with many features (computationally simpler)

First Step: Define Model#


Example:
least squares with one feature


\[\hat{y}=b_0+b_1\cdot x\]

Second Step: Define Cost Function#

Mean Squared Error with a small cosmetic change: the 2 in the denominator cancels later when taking derivatives

\[ J(b_{0},b_{1})\;=\;\frac{1}{2n}\sum_{i=1}^{n}{(\hat{y}_{i}-y_{i})}^{2}\,\]


\[min(J(b_{0},b_{1}))\]
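The cost function above translates directly into code. A minimal sketch, using small hypothetical arrays as data:

```python
import numpy as np

# Cost function J(b0, b1): mean squared error with the extra factor 2
# in the denominator (it cancels when taking derivatives)
def cost(b0, b1, x, y):
    y_hat = b0 + b1 * x
    n = len(x)
    return np.sum((y_hat - y) ** 2) / (2 * n)

# Hypothetical data where y = x exactly
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 4.0])
```

With `b0 = 0` and `b1 = 1` the fit is perfect and the cost is 0; any other parameters yield a positive cost.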

Third Step: Initialize Gradient Descent#

Deliberately set random starting values for \(b\)

../_images/2459b7e89e9f6d08330ec3677bac30880af7bd3ca0ee121bb816cafc82d6d95d.png

Fourth Step: Derivatives with respect to the parameters#



\[ J(b_{0},b_{1})\,=\,\frac{1}{2n}\sum_{i=1}^{n}\left(\hat{y}_{i}-y_{i}\right)^{2}\,=\,\frac{1}{2n}\sum_{i=1}^{n}\left(b_{0}+b_{1}x_{i}-y_{i}\right)^{2}\]

The chain rule gives us:

\[\begin{split}\begin{align} \frac{\partial J}{\partial b_{0}}&=\frac{1}{n}\sum_{i=1}^{n}(b_{0}+b_{1}x_{i}-y_{i})\cdot 1 \\\\ \frac{\partial J}{\partial b_{1}}&=\frac{1}{n}\sum_{i=1}^{n}(b_{0}+b_{1}x_{i}-y_{i})\cdot x_{i} \end{align}\end{split}\]
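The two partial derivatives can be written as a small helper function. A sketch matching the chain-rule result above:

```python
import numpy as np

# Partial derivatives of J with respect to b0 and b1
def gradients(b0, b1, x, y):
    n = len(x)
    error = b0 + b1 * x - y            # (b0 + b1*x_i - y_i)
    dJ_db0 = np.sum(error) / n         # dJ/db0
    dJ_db1 = np.sum(error * x) / n     # dJ/db1
    return dJ_db0, dJ_db1
```

At the perfect fit both gradients are zero, which is exactly the condition for a minimum.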

Fourth Step: Gradient Descent#

Start descent:

  • Take derivatives with respect to your parameters \(b\)

  • Set your learning rate (step-size)

  • Adjust your parameters (step)

\[\text{slope of }b_1\text{ at start point }=-6.8\]
../_images/d316686a766c727fb236a63be4b225c454c6f0d6a14c4a57862335283a60c51b.png

Fourth Step: Gradient Descent#

Start descent:

  • Take derivatives with respect to your parameters \(b\)

  • Set your learning rate (step-size)

  • Adjust your parameters (step)

\[\begin{split}\begin{align} \alpha&=0.15 \\ slope&=-6.8 \end{align}\end{split}\]
../_images/d316686a766c727fb236a63be4b225c454c6f0d6a14c4a57862335283a60c51b.png

Fourth Step: Gradient Descent#

Start descent:

  • Take derivatives with respect to parameters \(b\)

  • Set your learning rate (step-size)

  • Adjust your parameters (step)

\[\begin{split}\begin{align} step&=\alpha \cdot slope \\ b_{1}^{new}&=b_{1}^{old} - step \\ b_{1}^{new}&=0.6-(0.15*(-6.8)) \\ b_{1}^{new}&=1.62 \\ \end{align}\end{split}\]
../_images/a3c2d33714986ec6c6c9699752c471d26ef9f80708b4c470e26b6061f5cfab70.png
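The single update step from the slide can be verified by hand:

```python
# Reproducing the update step shown above
alpha = 0.15
slope = -6.8
b1_old = 0.6

step = alpha * slope        # 0.15 * (-6.8) = -1.02
b1_new = b1_old - step      # 0.6 - (-1.02) = 1.62
```

Note that because the slope is negative, subtracting the step moves \(b_1\) to the right, i.e. downhill toward the minimum.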

Fourth Step: Gradient Descent#

Start descent:

  • Take derivatives with respect to your parameters \(b\)

  • Set your learning rate (step-size)

  • Adjust your parameters (step)

Repeat until there is no further improvement

\[\begin{split}\begin{align} step&=\alpha \cdot slope \\ b_{1}^{new}&=b_{1}^{old} - step \end{align}\end{split}\]
../_images/3fde81eac693586223d578b48836579be3f1c3a100bead81e560c988af5b4da2.png
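Putting the pieces together gives the full descent loop. A minimal sketch on hypothetical noise-free data, stopping once the gradients are negligible:

```python
import numpy as np

# Hypothetical data: y = 0.5 + 2.0 * x exactly (true b0 = 0.5, b1 = 2.0)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 0.5 + 2.0 * x

b0, b1 = 0.0, 0.0           # starting values
alpha = 0.05                # learning rate
n = len(x)

for _ in range(20000):
    error = b0 + b1 * x - y
    grad0 = error.sum() / n             # dJ/db0
    grad1 = (error * x).sum() / n       # dJ/db1
    b0, b1 = b0 - alpha * grad0, b1 - alpha * grad1
    if max(abs(grad0), abs(grad1)) < 1e-10:   # no further improvement
        break
```

The loop recovers the true parameters without ever solving the normal equation.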

Gradient Descent summed up#


  1. Define model

  2. Define the cost function

  3. Deliberately set some starting values

  4. Start descent:

    • Take derivatives with respect to parameters

    • Set your learning rate (step-size)

    • Adjust your parameters (step)

  5. Repeat step 4 until there is no further improvement

Learning Rate#

It is important to find a good value for the learning rate!



../_images/8f06c54a4a79ae64c4ee8f528bbc61bf715ec8a311d1c27477b613ad0f2ffc6b.png

Learning Rate#

It is important to find a good value for the learning rate!

Learning Rate too small → will take long to find optimum

Learning Rate too large → not able to find optimum

../_images/d8418eefa76e6538e77f648afc4fd3a41add12ec1a53f2354254ff22f976324b.png
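The effect of the learning rate can be seen on the simple illustrative cost \(J(b)=(b-3)^2\):

```python
# Run gradient descent on J(b) = (b - 3)^2 with a given learning rate
def descend(alpha, steps=50):
    b = 0.0
    for _ in range(steps):
        b = b - alpha * 2 * (b - 3.0)   # gradient of (b - 3)^2 is 2(b - 3)
    return b

b_small = descend(0.001)   # too small: after 50 steps, still far from b = 3
b_good  = descend(0.1)     # reasonable: close to the optimum
b_large = descend(1.5)     # too large: each step overshoots and diverges
```

With `alpha = 1.5`, every step doubles the distance to the minimum, so the iterates explode instead of converging.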

Two main challenges of Gradient Descent#

The MSE cost function for linear regression is convex, so it has exactly one global minimum. This is not always the case. Typical problems are:

  • local minima

  • plateaus

Convex: If you pick any two points on the curve, the line segment connecting them will not cross the curve.

../_images/742f7d68f139207643552fba70dbc98963e6c8d7ef947f28464d8dee48b31637.png

Unscaled Data#

Without feature scaling, Gradient Descent needs more iterations to find the minimum

../_images/f4dabb61cacedc8d9a9c991ebac8892025541eecaddb061e8496f3cf13384ddd.png
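A common remedy is standardization (zero mean, unit variance). This is a minimal hand-rolled sketch of what, for example, scikit-learn's `StandardScaler` does, on a hypothetical raw feature:

```python
import numpy as np

# Hypothetical raw feature on a large scale
x = np.array([100.0, 200.0, 300.0, 400.0, 500.0])

# Standardize: subtract the mean, divide by the standard deviation
x_scaled = (x - x.mean()) / x.std()
```

After scaling, all features live on a comparable scale, so the cost surface is less elongated and Gradient Descent can take more direct steps toward the minimum.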

Variants of Gradient Descent#

Which route to take downhill?

Batch gradient descent#

All instances are used to calculate the gradient (step).

For problems with many features the algorithm is faster than the Normal Equation, but it is still slow because every step processes the whole training set.

Stochastic gradient descent#

Only one random instance is used to calculate the gradient (step).

Much faster as there is little data to process at each update.

Can be used for very large data sets → online training possible

Cost decreases only on average and bounces back and forth → this can help escape local minima

Optimum will be close but never reached.
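A minimal SGD sketch: one shuffled instance per update. The data here is hypothetical and noise-free so the loop converges cleanly; on real noisy data the parameters would keep bouncing around the optimum, as described above:

```python
import numpy as np

# Hypothetical noise-free data: y = 1.0 + 3.0 * x (true b0 = 1, b1 = 3)
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
y = 1.0 + 3.0 * x

b0, b1 = 0.0, 0.0
alpha = 0.05
for epoch in range(200):
    for i in rng.permutation(len(x)):   # shuffle, then one instance at a time
        error = b0 + b1 * x[i] - y[i]   # gradient from a single instance
        b0 -= alpha * error
        b1 -= alpha * error * x[i]
```

Because each update touches only one instance, new data can be folded in as it arrives, which is what makes online learning possible.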

Mini-Batch gradient descent#

Trains on small random subsets (mini-batches) instead of the full dataset or single instances.

Performs better than SGD because it exploits hardware-optimized matrix operations

Drawback: compared with SGD, it is more likely to get stuck in local minima
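A mini-batch sketch, again on hypothetical noise-free data; the batch gradient is an average over a small random subset rather than a single instance or the full set:

```python
import numpy as np

# Hypothetical noise-free data: y = 1.0 + 3.0 * x (true b0 = 1, b1 = 3)
rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=200)
y = 1.0 + 3.0 * x

b0, b1 = 0.0, 0.0
alpha, batch_size = 0.1, 16
for epoch in range(500):
    idx = rng.permutation(len(x))                # reshuffle every epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        error = b0 + b1 * x[batch] - y[batch]
        b0 -= alpha * error.mean()               # gradient averaged over the batch
        b1 -= alpha * (error * x[batch]).mean()
```

The vectorized per-batch computation is what lets mini-batch GD exploit optimized matrix operations.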

Gradient Descent Comparison#

../_images/01d041d920082c0dd3f3972cb9749777a0ebd2252d00cfb8b4c4b9e8125d6e87.png
../_images/8cf91484bf58ef07a6025f692ca4e8cfc199cfcf0bbbce2e5a02210af88f76c7.png

References#

Gradient Descent Step by Step