{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "ul-YGtvEgwd1" }, "source": [ "# What is Machine Learning?\n", " \n", "According to Tom Mitchell in his seminal book:\n", "\n", "> \"A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.\"\n", "\n", "Example: playing checkers.\n", "\n", " * $E$ = the experience of playing many games of checkers\n", " * $T$ = the task of playing checkers.\n", " * $P$ = the probability that the program will win the next game.\n", "\n", "In general, any machine learning problem can be assigned to one of two broad classifications:\n", "\n", " * Supervised learning (this module) \n", " * Unsupervised learning (next module)\n", "\n", "## Supervised Learning\n", "\n", "In supervised learning, we are given a data set and, for **this** data set (named training data set), we **know** what our correct output should look like. The objective is to learn the relationship between the input and the output.\n", "\n", "Supervised learning problems are categorized into \"regression\" and \"classification\" problems: \n", " * In a **regression** problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. \n", " * In a **classification** problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.\n", "\n", "### Example 1:\n", "\n", " * Given data about the size of houses on the real estate market, try to predict their price. 
Price as a function of size is a continuous output, so this is a **regression** problem.\n", "\n", " * We could turn this example into a classification problem by instead making our output about whether the house \"sells for more or less than the asking price.\" Here we are **classifying** the houses based on price into two discrete categories.\n", "\n", "### Example 2:\n", "\n", " * **Regression** - Given a picture of a person, we have to predict their age from the picture.\n", " * **Classification** - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign based on an x-ray image.\n", "\n", "## Linear regression\n", "\n", "The goal of linear regression is to find a line that fits our data as well as possible. Unfortunately, we don't have time to look at the mathematical details, but you can find them in [this book](https://web.stanford.edu/~hastie/ElemStatLearn/). We'll try to convey the intuition behind this method.\n", "\n", "Let's assume that we have a set of pairs $(x_i,y_i)$, where $x$ is the independent variable and $y$ the dependent variable. Normally, $x$ is an $m$-dimensional vector, where we have $m$ variables describing the instance (e.g., the features computed from the molecules by some feature descriptor). However, for simplicity, we assume for now that $x$ is unidimensional. The dependent variable is the one we want to predict (e.g., the energy of the molecule). The picture below illustrates this: \n", "\n", "\n", "\n", "\n", "The yellow line in the figure corresponds to a model that can be used to predict the value of $y$ from $x$. How do we measure the quality of this model? We need a way to measure the **difference** between real and predicted values (an **error measure**, also known as cost or loss function). Then, we can transform the problem of fitting the best line to the data into a problem of minimizing the error between real values and predictions. 
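\n", "\n", "As a small illustration with made-up numbers (they are not taken from any data set in this notebook), suppose three instances have real values $y = (1, 2, 3)$ and a candidate line predicts $pred = (1.5, 2, 2)$. The differences are:\n", "\n", "$$ pred_1 - y_1 = 0.5, \\quad pred_2 - y_2 = 0, \\quad pred_3 - y_3 = -1 $$\n", "\n", "Simply adding them gives $-0.5$, which understates how far the line is from the data, because positive and negative differences partly cancel. An error measure therefore has to combine the differences in a way that avoids this cancellation. 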
\n", "\n", "A common error measure is the **squared loss**. Let $pred_i$ be the value predicted for instance $i$. The squared loss can be computed as:\n", "\n", "$$ E = \\sum_{i=1}^n (pred_i - y_i)^2 $$\n", "\n", "This function first computes the difference between the prediction and the real value for each instance:\n", "\n", "![](https://media.giphy.com/media/VbnQM59vjG6fYPmBIl/giphy.gif)\n", "\n", "It then squares these values and sums them over all the training data:\n", "\n", "![](https://media.giphy.com/media/YlHI3bh3u6hf8oADOq/giphy.gif)\n", "\n", "The reasons for squaring are threefold:\n", "\n", " 1. It eliminates the negative signs. Both positive and negative errors should count toward the loss, and squaring turns a negative error into a positive one.\n", " 1. It amplifies the errors with greater magnitude, so larger errors count more than small ones.\n", " 1. The function is continuous and differentiable (we will see the importance of this when studying the gradient descent method). \n", "\n", "## Least Squares \n", "\n", "Given an error (loss) function, how do we find the line that minimizes it? Recall that we can represent a line by $\\theta_0 + \\theta_1 x$ (alternative notations are $b + mx$ or $b+ ax$), where $\\theta_0$ is the intercept and $\\theta_1$ the slope. Thus, our objective is to find the line \n", "\n", "$$ pred_i = \\theta_0 + \\theta_1 x_i$$\n", "\n", "that minimizes $E$.\n", "\n", "Let's introduce an algebra trick to simplify the math. Let $\\mathbf{x_i} = [1, x_i]$, where we added a dummy variable whose value is always 1, and $\\mathbf{\\theta} = [\\theta_0, \\theta_1]$. 
Thus, we can write the prediction line as:\n", "\n", "$$ pred_i = \\mathbf{\\theta}^\\intercal \\mathbf{x_i} = \\begin{bmatrix} \\theta_0 & \\theta_1 \\end{bmatrix} \\begin{bmatrix} 1 \\\\ x_i \\end{bmatrix} = \\theta_0\\cdot 1 + \\theta_1 x_i = \\theta_0 + \\theta_1 x_i$$\n", "\n", "Stacking a set of instances, we can form a matrix $X$ where each row $i$ contains a vector $\\mathbf{x_i}$. Then, we can compute all predictions at once by computing $X\\mathbf{\\theta}$. The error function can then be computed as:\n", "\n", "$$ E(\\mathbf{\\theta}) = (X\\mathbf{\\theta} -y)^\\intercal(X\\mathbf{\\theta} -y)$$\n", "\n", "To minimize the error we can compute its derivative:\n", "\n", "$$ \\frac{\\partial E(\\mathbf{\\theta})}{ \\partial \\mathbf{\\theta}} = 2X^\\intercal X \\mathbf{\\theta} - 2X^\\intercal y$$\n", "\n", "By taking $\\frac{\\partial E}{ \\partial \\mathbf{\\theta}} = 0$ and solving for $\\mathbf{\\theta}$, we get\n", "\n", "$$ \\mathbf{\\theta} = (X^\\intercal X)^{-1} \\cdot X^\\intercal y$$\n", "\n", "This is known as the normal equation of linear regression. Although we have described it in terms of a single variable, it can be easily extended to multidimensional data.\n", "\n", "The code sequence below illustrates the use of the Python package sklearn to compute a linear regression for the [diabetes data set](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). 
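\n", "\n", "Before running it, here is a sketch of the intermediate step behind the derivative shown above, using only the symbols already defined. Expanding the product gives:\n", "\n", "$$ E(\\mathbf{\\theta}) = \\mathbf{\\theta}^\\intercal X^\\intercal X \\mathbf{\\theta} - 2\\mathbf{\\theta}^\\intercal X^\\intercal y + y^\\intercal y $$\n", "\n", "Differentiating term by term with respect to $\\mathbf{\\theta}$ yields $2X^\\intercal X\\mathbf{\\theta}$ for the first term, $-2X^\\intercal y$ for the second, and $0$ for the third, which matches the derivative above. Setting the sum to zero gives $X^\\intercal X \\mathbf{\\theta} = X^\\intercal y$, and multiplying both sides by $(X^\\intercal X)^{-1}$ (assuming the inverse exists) recovers the normal equation. 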
\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": {}, "colab_type": "code", "id": "yXrXbSLGCwua" }, "outputs": [], "source": [ "## First, we load the packages \n", "\n", "# plotting\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# data manipulation\n", "import numpy as np\n", "\n", "# dataset and algorithm \n", "from sklearn import datasets, linear_model\n", "\n", "# quality measures\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": {}, "colab_type": "code", "id": "_OfeNTZ9EaDZ" }, "outputs": [], "source": [ "# Then, we load the diabetes dataset\n", "# This is a sample data set provided by sklearn\n", "# You can use your own, as long as X is a matrix and y a vector with the same number of rows as X\n", "diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)\n", "\n", "# Use only one feature (this is not necessary, just for plotting)\n", "diabetes_X = diabetes_X[:, np.newaxis, 2]\n", "\n", "# Split the data into training/testing sets (we will discuss this later on)\n", "diabetes_X_train = diabetes_X[:-20]\n", "diabetes_X_test = diabetes_X[-20:]\n", "\n", "# Split the targets into training/testing sets (we will discuss this later on)\n", "diabetes_y_train = diabetes_y[:-20]\n", "diabetes_y_test = diabetes_y[-20:]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": {}, "colab_type": "code", "id": "0lphca4FElXL" }, "outputs": [], "source": [ "# Create linear regression object\n", "regr = linear_model.LinearRegression()\n", "\n", "# Train the model using the training sets\n", "regr.fit(diabetes_X_train, diabetes_y_train)\n", "\n", "# Make predictions using the testing set\n", "diabetes_y_pred = regr.predict(diabetes_X_test)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "InpZNvsXejqs" }, "source": [ "One common measure of 
predictive performance of the algorithm is obtained by comparing the predictions, $pred_i$, to the true values $y_i$. A commonly used measure for this is the mean squared error (MSE) on the test set:\n", " \n", "$$ MSE= \\frac{1}{N_\\mathrm{test}} \\sum_{i=1}^{N_\\mathrm{test}}(y_i - pred_i)^2 $$\n", "\n", "Another common measure is the mean absolute error (MAE):\n", "\n", "$$ MAE= \\frac{1}{N_\\mathrm{test}} \\sum_{i=1}^{N_\\mathrm{test}}|y_i - pred_i| $$\n", "\n", "A measure that is independent of the scale of the data is the [coefficient of determination](https://pt.wikipedia.org/wiki/Coeficiente_de_determina%C3%A7%C3%A3o), $R^2$. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y (the mean of y), disregarding the input features, would get an $R^2$ score of 0.0.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 104 }, "colab_type": "code", "id": "zX2MyDz9EuqI", "outputId": "3ba050c9-66d8-4383-ec36-76d12d3f787f" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Intercept: 152.92 \n", " Coefficients: 938.24 \n", "\n", "Mean squared error: 2548.07\n", "Coefficient of determination: 0.47\n" ] } ], "source": [ "# The intercept and coefficients\n", "print('Intercept: {0:.2f} \\n Coefficients: {1:.2f} \\n'.format(regr.intercept_, regr.coef_[0]))\n", "\n", "# The mean squared error\n", "print('Mean squared error: %.2f'\n", " % mean_squared_error(diabetes_y_test, diabetes_y_pred))\n", "\n", "# The coefficient of determination: 1 is perfect prediction\n", "print('Coefficient of determination: %.2f'\n", " % r2_score(diabetes_y_test, diabetes_y_pred))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 252 }, "colab_type": "code", "id": "Cg54EcTKE5te", "outputId": "b94ffb91-4242-4a0f-8be0-426d444ed7f0" }, 
"outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAADrCAYAAABXYUzjAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAQOElEQVR4nO3dfawcVf3H8c9sH7QLtBYKaiw7g8RKLYJArcZfVHzC538MauJaY3zYGAIhklAjm2g0WWL1LyD406XGGO/8oxJNxJiUWokx0WgrJBahhMjuLRpMW0HabC992PGP4969vffuzky7s2fmzPuV9A+G0+bbXPjkm+85c8aLokgAgOmr2C4AAMqKAAYASwhgALCEAAYASwhgALCEAAYAS1amWbxhw4YoCIKMSgEAN+3fv/9IFEWXLn6eKoCDINC+ffsmVxUAlIDned3lnjOCAABLCGAAsIQABgBLCGAAsIQABgBLCGAATgvDUEEQqFKpKAgChWFou6R5qY6hAUCRhGGoRqOhXq8nSep2u2o0GpKker1uszRJdMAAHNZsNufDd6DX66nZbFqq6GwEMABnzc7Opno+bQQwAGfVarVUz6eNAAbgrFarpWq1etazarWqVqtlqaKzEcAAnFWv19Vut+X7vjzPk+/7arfbudiAkyQvzUc5t27dGnEZDwCk43ne/iiKti5+TgcMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwACc9dxz0nXXSZ5nfn3/+7YrOhsBDCCXwjBUEASqVCoKgkBhGCb+vb/8pQncV79aeuyx4fMvfSmDQs/DStsFAMBiYRiq0Wio1+tJkrrdrhqNhiSpXq8v+3tOnpRuuUX6wQ9G/7n33DPxUs+LF0VR4sVbt26N9u3bl2E5ACAFQaBut7vkue/76nQ6Zz178knpbW+Tnn9+9J935ZXS3r1SrTbhQhPyPG9/FEVbFz9nBAEgd2ZnZ2Off+97ZsywefPo8L3zTun0aenpp+2F7ziMIADkTq1WW7YD3rhxi266SXr44fG//5FHpHe+M5vaJokOGEDutFotVavVBU/+T1KkQ4f+OjJ83/1u0wlHUTHCV6IDBpBD9Xpd/b6nL3xhi06evHbs2vvuk269dUqFTRgBDCBXnnhCesMbJOlTI9esXSv94Q+DdcXFCAJALnzjG2ZTbVyofvaz0tyc9J//FD98JTpgABYdPy5t2CC99NL4dd/6lvSVr0ynpmmiAwZy6HzeAiuC3/zGdLsXXTQ+fA8eNJtqLoavRAADuTN4C6zb7SqKovm3wIoewlEkfeITJnjf+97R697xDunMGbN+06bp1WcDb8IBOZPmLbAi+Mc/pI0b49f99KfSzTdnX48NvAkHFESSt8CKYNcu0+3Ghe+RI6bbdTV8xyGAgZypjXhndtTzPDl1SrrqKhO8X/zi6HW33GJCN4qkSy6ZXn15QwADObP0LTCpWq2q1WpZqijeo4+a0F292mycjfLHP5rQvf/+6dWWZwQwkDP1el3tdlu+78vzPPm+r3a7PfIaRpvuvNME7/XXj15Tq5mzu1EkveUt06utCNiEA5DKCy9I69fHr7v3Xum227KvpwhGbcLxIgaARB56SProR+PXPfOMFASZl+MERhAARooi6YMfNGOGceH74Q9L/b5ZT/gmRwcMYIlOR7riivh1Dz1kw
hfnhg4YwLx77zXdblz4vvCC6XYJ3/NDAAMld/z48LPtt98+et2OHcOzu+vWTa8+lxHAQEn9+MfDC3HGefRRE7o7d06nrjJhBgyUzKpV5kOV42zZYoJ31arp1FRWdMBACTzzzHDMMC58d+0y3e6BA4TvNBDAgMPuuMOE7mtfO37dgQMmeD//+enUBYMRBOCY06eTd6/9vglo2EEHDDjikUdMmMaF7z33DE8zEL520QEDBbdtm/TnP8evO3Kk3Fc/5hEBDBTQ889LF18cv+7aa6XHHsu+HpwbRhBAgXz3u2ZsEBe+e/aYEQPhm290wEDORZFUSdgqnTolreT/6sKgAwZy6oknTLcbF7633TbcVCN8i4UfF5AzV1xhbiOL8/TT0pVXZl4OMkQAAzlw4oS06DNwI6X4iA1yjhEEYNFgUy0ufH/4w+GYAe6gAwYsSPoCxNGjyY6boZjogBcJw1BBEKhSqSgIAoVhaLskOKLTGV6IE2fQ7RK+biOAFwjDUI1GQ91uV1EUqdvtqtFoEMI4L5/8ZLKvTPziF4wZyobP0i8QBIG63e6S577vq5NkWxr4nzRnd0+fllasyLYe2DXqs/R0wAvMzs6meg43nc8YavfuZGd3P/CBYbdL+JYXm3AL1Gq1ZTvgWq1moRrYMBhD9Xo9SZofQ0lSvV4f+fvWrJHm5uL//IMHpU2bJlIqHEAHvECr1VJ10XmgarWqVqtlqSJMW7PZnA/fgV6vp2azuWTtiy8ON9XiwnfQ7RK+WIgAXqBer6vdbsv3fXmeJ9/31W63x3Y+cEuSMdTdd5vQjfsy8M6dbKphPAJ4kXq9rk6no36/r06nQ/iWzKhxU61Wm+92l2mGz3LsmAndHTsyKDADHL20hwAGFlg6hrpKUqRutzP2973iFcNu98ILs6xwsjh6aRfH0IBFwjDU5z63WSdPXh+7du9e6V3vmkJRGeHo5XSMOobGKQjgf4Yfs4wfO7nyMUuOXtrFCAKld//9yT5muX27ex+zHDfzRvbogFFaSUN0dla6/PJsa7Gl1Wqdde5Z4ujlNNEBo1T++c/0F+K4Gr4SRy9tI4BRCh/5iAnd17xm/Lqvfa18Z3c5emkPIwg4LemYodczrxMD00QHDOf8/OfpxwyEL2ygA4Yzkna7u3dL73tftrUASRDAKLReT7rggmRryzTXRTEwgkAhNRqm440LX98v36YaioMOGIWSdMzw97/HfwIIsI0OGLn3+OPpN9UIXxQBAYzcGoTu1VePX/fVrzJmQDERwBZxD+tSg3sWknS7L71k1t99d/Z1AVkggC3hHtazffvbyT5mKQ273dWrs68LyBL3AVvCPaxG0k21PXuk97wn21qArHAfcM6U+R7Ww4elyy5Ltpa5LlzGCMKSMt7D+sY3mo43Lnxf+Uo21VAOBLAlS7895u49rINNtQMHxq979lkTus89N526ANsIYEtcv4d1z570Z3fjrooEXMMmHCYq6abaXXdJDjb7wLLYhENmhh+zTLZ2xYps6wGKghEEztkddyT7mKU0HDMQvsAQHTBSSzpm+N3vpLe/PdtagCIjgJFIp5P8ghuOjwHJMILAWNddZzreuPDdto2zu0BadMBYVtIxw7//La1fn20tgKvogDHv179Of3aX8AXOHQGM+dD90IfiVm6X7weamSnnjW3ApDGCKKm5ueSfYl+z5gKdONGTJHW7UqPRkCRn3toDbKEDLpkvf9l0u3Hhu369GTH4fjAfvgO9Xk/NZjPDKoFyoAMuiaSbagcPSps2Df+5zNdmAlmjA3bYU0+l31RbGL5SOa/NBKaFAHbQJZeY0H3968evu/32+LO7Zbo2E5g2RhCOiKJk31OTpBMnpJe/PNnawUZbs9nU7OysarWaWq0WG3DABHAdZcHNzEjbtydby1tqgB1cR+mYpJtqv/pVkvO9AGxgBlwQYRiqVrs69aYa4QvkFwFcAG99a1ef/nRdhw6N/6jaNddwIQ5QJIwgcmzY6fpj1
x06JG3cmHk5ACaMDjhn9u9PfnbX8yqKIsIXKCoCOCcGobt1yT7pYndJ8iR5vAwBFBwjCIv6/eTfSFuzZq1OnDg2/8+8DAEUHx2wBbt3m243SfgONtUeeOD/5fu+PM+T7/tqt9u8DAEUHAE8RS97mQne979//Lrf/37paYZ6va5Op6N+v69Op0P4xgjDUEEQqFKpKAgChSF3GCN/GEFk7MUXpXXrkq3l+NhkhGGoRqOhXm9wh3GXO4yRS3TAGWm1TLcbF77f+Q5ndyet2WzOh+8Adxgjj+iAJyzpK8LHjkkXXphtLWXFHcYoCjrgCfjb35Kd3b344mG3S/hmhzuMURQE8Hm48UYTulu2jF+3d68J3aNHp1LWxBVtQ4s7jFEUjCBSOn1aWrUq2dp+P/lIIq+KuKHFHcYoCu4DTuhnP5M+/vH4dZ/5jPSjH2Vfz7QEQaBut7vkue/76nQ60y8IKCDuAz5HSTtYVy/EYUMLyA4z4GUcPpz+Y5Yuhq/EhhaQJQJ4gQceMKF72WXj1+3aVZ6zu2xoAdlhBKHkY4a5OfM6cZmwoQVkp7SbcP/6l/SqV8Wv27zZnPMFgHM1ahOudCOImRnT8caF78GDZsSQt/At2plcAKOVYgRx5oy0bZv0l7/Er83zXLeIZ3IBjOZ0B/z446bbXblyfPjOzNjdVEva1XLJDOAWJzvgr39d+uY3x6/ZsEGanZXWrJlOTaOk6Wo5kwu4xZkO+PhxafVq0/GOC9+dO02ne/iw/fCV0nW1nMkF3FL4AH74YRO6F10knTo1et1TT5ng3bFjerUlkaar5Uwu4JZCBnAUSTffbIL3pptGr7vxRrMBF0XS6143tfJSSdPV1ut1tdttvg0HOKJQAfzssyZ0KxXpwQdHr3vwQRO6v/2tWZtnabtavg0HuCPn8WS02yZ4L798/LqjR03wfuxj06lrEuhqgfLK9Ztwc3PxG2W33irdd9906gGAc1HI6yh/8pPR/+5Pf5Le/Obp1QIAk5brAH7Tm6S1a82n3SUpCKQnnyzfhTgA3JTrAL7mGvOyxMmT0qWX2q4GACYr1wEsSevW2a4AALJRiFMQAOAiAhgALCl1AHO3LgCbcj8Dzgp36wKwrbQdMHfrArCttAHM3boAbCttAHO3bnExu4crShvArtytW7YwGszuu92uoiian927/veGo6IoSvzrhhtuiFwyMzMT+b4feZ4X+b4fzczM2C4plZmZmaharUaS5n9Vq9Wxf4+i/5193z/r7zv45fu+7dKAkSTti5bJ1FzfhobxgiBQt9td8tz3fXU6nSXPF5/8kEzXX6TrLyuVipb7b9bzPPX7fQsVAfFG3YZW2hGEC9JuJLpw8oPZPVxCABdY2jBy4eSHK7N7QCKACy1tGLnQPfIFEbiEAC6wtGHkSvfId/HgikIEcNmOWqWRJozoHoF8yf0pCBd27gGUW2FPQbiwcw8Ay8l9ALuwcw8Ay8l9ALuwcw8Ay8l9ALuycw8Ai+U6gMMwnJ8Br1ixQpLYuS8JTr6gDHL7RYzFpx/OnDkz3/kSvm7jayUoi9weQ0t70Qzcwc8erincMTROP5QXP3uURW4DmNMP5cXPHmWR2wDm9EN58bNHWeQ2gLm3oLz42aMscrsJBwCuKNwmHAC4jgAGAEsIYACwhAAGAEsIYACwJNUpCM/zDkta+o4oAGAcP4qiSxc/TBXAAIDJYQQBAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJYQwABgCQEMAJb8F4FKY8Ec3TGwAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot outputs\n", "plt.scatter(diabetes_X_test, diabetes_y_test, color='black')\n", "plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)\n", "\n", "plt.xticks(())\n", "plt.yticks(())\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "QkT3wj6vTM2T" }, "source": [ "## Exercise\n", "\n", "In this exercise, we will apply a linear regression to a materials science problem. \n", "\n", "\n", "\n", "The first part (creating a representation) can be performed in different ways, as we studied in the previous module. Today we will see the second part\n", "\n", "Let's open a representation of a data set that we generated like yesterday and build a linear model for predicting some numerical property. The representation is the eigen values of the Coulomb matrix." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
" <thead>\n",
" <tr><th></th><th>feature_0</th><th>feature_1</th><th>feature_2</th><th>feature_3</th><th>feature_4</th><th>feature_5</th><th>feature_6</th><th>feature_7</th><th>feature_8</th><th>feature_9</th><th>...</th><th>homo</th><th>lumo</th><th>gap</th><th>r2</th><th>zpve</th><th>U0</th><th>U</th><th>H</th><th>G</th><th>Cv</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>O3NC4H5</th><td>134.383049</td><td>73.277431</td><td>65.565220</td><td>41.205339</td><td>37.971360</td><td>28.562170</td><td>22.148475</td><td>19.541456</td><td>0.346183</td><td>0.246778</td><td>...</td><td>-0.2528</td><td>-0.0077</td><td>0.2451</td><td>1012.1864</td><td>0.095653</td><td>-435.725372</td><td>-435.718189</td><td>-435.717245</td><td>-435.757266</td><td>26.019</td></tr>\n",
" <tr><th>FONC6H4</th><td>135.762122</td><td>85.192934</td><td>61.375654</td><td>38.789150</td><td>36.309109</td><td>26.515954</td><td>23.003822</td><td>20.931525</td><td>18.664159</td><td>0.345873</td><td>...</td><td>-0.2604</td><td>-0.0789</td><td>0.1815</td><td>1188.6441</td><td>0.089534</td><td>-460.774434</td><td>-460.767413</td><td>-460.766468</td><td>-460.806377</td><td>25.848</td></tr>\n",
" <tr><th>O2N2C5H4</th><td>139.214113</td><td>74.417269</td><td>49.828916</td><td>45.674170</td><td>37.000238</td><td>26.233309</td><td>24.539204</td><td>23.079536</td><td>19.039277</td><td>0.343824</td><td>...</td><td>-0.2324</td><td>-0.0468</td><td>0.1856</td><td>945.7988</td><td>0.092456</td><td>-452.744414</td><td>-452.738132</td><td>-452.737188</td><td>-452.775716</td><td>23.374</td></tr>\n",
" <tr><th>O3C6H4</th><td>137.466887</td><td>73.565364</td><td>61.518383</td><td>46.473325</td><td>32.476771</td><td>28.005426</td><td>23.118138</td><td>20.294645</td><td>19.772588</td><td>0.349248</td><td>...</td><td>-0.2670</td><td>-0.0823</td><td>0.1848</td><td>944.0426</td><td>0.090391</td><td>-456.613010</td><td>-456.606412</td><td>-456.605468</td><td>-456.644392</td><td>24.902</td></tr>\n",
" <tr><th>FON3C4H4</th><td>146.431978</td><td>83.548771</td><td>59.494132</td><td>50.055661</td><td>43.923242</td><td>31.137620</td><td>23.795791</td><td>22.151400</td><td>19.093672</td><td>0.331518</td><td>...</td><td>-0.2192</td><td>-0.0309</td><td>0.1882</td><td>1074.3078</td><td>0.090835</td><td>-494.100913</td><td>-494.093709</td><td>-494.092765</td><td>-494.132471</td><td>27.708</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 29 columns</p>\n",
"</div>
" ], "text/plain": [ " feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 \\\n", "O3NC4H5 134.383049 73.277431 65.565220 41.205339 37.971360 28.562170 \n", "FONC6H4 135.762122 85.192934 61.375654 38.789150 36.309109 26.515954 \n", "O2N2C5H4 139.214113 74.417269 49.828916 45.674170 37.000238 26.233309 \n", "O3C6H4 137.466887 73.565364 61.518383 46.473325 32.476771 28.005426 \n", "FON3C4H4 146.431978 83.548771 59.494132 50.055661 43.923242 31.137620 \n", "\n", " feature_6 feature_7 feature_8 feature_9 ... homo lumo \\\n", "O3NC4H5 22.148475 19.541456 0.346183 0.246778 ... -0.2528 -0.0077 \n", "FONC6H4 23.003822 20.931525 18.664159 0.345873 ... -0.2604 -0.0789 \n", "O2N2C5H4 24.539204 23.079536 19.039277 0.343824 ... -0.2324 -0.0468 \n", "O3C6H4 23.118138 20.294645 19.772588 0.349248 ... -0.2670 -0.0823 \n", "FON3C4H4 23.795791 22.151400 19.093672 0.331518 ... -0.2192 -0.0309 \n", "\n", " gap r2 zpve U0 U H \\\n", "O3NC4H5 0.2451 1012.1864 0.095653 -435.725372 -435.718189 -435.717245 \n", "FONC6H4 0.1815 1188.6441 0.089534 -460.774434 -460.767413 -460.766468 \n", "O2N2C5H4 0.1856 945.7988 0.092456 -452.744414 -452.738132 -452.737188 \n", "O3C6H4 0.1848 944.0426 0.090391 -456.613010 -456.606412 -456.605468 \n", "FON3C4H4 0.1882 1074.3078 0.090835 -494.100913 -494.093709 -494.092765 \n", "\n", " G Cv \n", "O3NC4H5 -435.757266 26.019 \n", "FONC6H4 -460.806377 25.848 \n", "O2N2C5H4 -452.775716 23.374 \n", "O3C6H4 -456.644392 24.902 \n", "FON3C4H4 -494.132471 27.708 \n", "\n", "[5 rows x 29 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## The features in this data set are the eingen values of the Coulomb matr\n", "\n", "import pandas as pd\n", "data = pd.read_csv(\"data.csv\",index_col=0)\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Index(['feature_0', 'feature_1', 'feature_2', 'feature_3', 
'feature_4',\n", " 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9',\n", " 'feature_10', 'feature_11', 'feature_12'],\n", " dtype='object')\n" ] } ], "source": [ "# The first 13 columns hold the features (feature_0 ... feature_12)\n", "feature_names = data.columns[0:13]\n", "print(feature_names)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Inputs are the 13 features; the target is the 'gap' column (lumo - homo in the table above)\n", "molecules_X = data[feature_names]\n", "molecules_y = data['gap']\n", "\n", "\n", "# Split the data into training/testing sets (we will discuss this later on)\n", "molecules_X_train = molecules_X[:-2000]\n", "molecules_X_test = molecules_X[-2000:]\n", "\n", "# Split the targets into training/testing sets (we will discuss this later on)\n", "molecules_y_train = molecules_y[:-2000]\n", "molecules_y_test = molecules_y[-2000:]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# dataset and algorithm \n", "from sklearn import datasets, linear_model\n", "\n", "# quality measures\n", "from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score\n", "\n", "# Create linear regression object\n", "regr = linear_model.LinearRegression()\n", "\n", "# Train the model using the training sets\n", "regr.fit(molecules_X_train, molecules_y_train)\n", "\n", "# Make predictions using the testing set\n", "molecules_y_pred = regr.predict(molecules_X_test)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "## Lets make a scatter plot of the true versus predicted values\n", "\n", "plt.scatter(molecules_y_test,molecules_y_pred)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([508., 484., 364., 285., 175., 108., 48., 19., 5., 4.]),\n", " array([6.57534459e-05, 1.15205127e-02, 2.29752720e-02, 3.44300312e-02,\n", " 4.58847905e-02, 5.73395498e-02, 6.87943090e-02, 8.02490683e-02,\n", " 9.17038276e-02, 1.03158587e-01, 1.14613346e-01]),\n", " )" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAD4CAYAAAAKA1qZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAP6klEQVR4nO3de4yc1XnH8e9TlktKGsxlQcR2u0RxpEDEJdoQJKqqxU0COIqRAilNWxxqyX+USkFpGkwSqaHqHyaqSlOporJKGhOlxZQUYcUoKTHQmwphzS0xlLA4Lt4a4QUMCUEhMXn6xxyTYb32zO47szPL+X6k0Zz3vGfmfY5n1r99LzMbmYkkqU6/NOgCJEmDYwhIUsUMAUmqmCEgSRUzBCSpYiODLgDgpJNOyrGxsUGXIUmLyvbt25/LzNEmzzEUITA2NsbExMSgy5CkRSUi/rfpc3g4SJIqZghIUsUMAUmqmCEgSRUzBCSpYoaAJFXMEJCkihkCklQxQ0CSKtbVJ4YjYhfwI+A1YH9mjkfECcBmYAzYBXwsM/dFRABfAi4GXgE+kZkP9r70lrH1W/v11B3t2rBqYNuWpF6Yy57Ab2Xm2Zk5XpbXA9sycwWwrSwDXASsKLd1wI29KlaS1FtNDgetBjaV9ibgkrb+m7PlPmBJRJzaYDuSpD7pNgQS+NeI2B4R60rfKZn5DEC5P7n0LwV2tz12qvS9QUSsi4iJiJiYnp6eX/WSpEa6/RbR8zNzT0ScDNwVEf9zmLExS99Bf80+MzcCGwHGx8f9a/eSNABd7Qlk5p5yvxe4HTgXePbAYZ5yv7cMnwKWtz18GbCnVwVLknqnYwhExLER8SsH2sAHge8BW4A1Zdga4I7S3gJcES3nAS8dOGwkSRou3RwOOgW4vXXlJyPAP2bmNyPiAeDWiFgLPA1cVsbfSevy0Elal4he2fOqJUk90TEEMnMncNYs/c8DK2fpT+CqnlQnSeqrofjzkovVoD6o5ofUJPWKXxshSRUzBCSpYoaAJFXMEJCkihkCklQxQ0CSKmYISFLFDAFJqpghIEkVMwQkqWKGgCRVzBCQpIoZApJUMUNAkipmCEhSxQwBSaqYISBJFTMEJKlihoAkVcwQkKSKGQKSVDFDQJIqZghIUsUMAUmqmCEgSRUzBCSpYoaAJFXMEJCkihkCklQxQ0CSKmYISFLFug6BiDgiIh6KiG+U5dMi4v6IeDIiNkfEUaX/6LI8WdaP9ad0SVJTc9kT+CTweNvy9cANmbk
C2AesLf1rgX2Z+U7ghjJOkjSEugqBiFgGrAL+viwHcAFwWxmyCbiktFeXZcr6lWW8JGnIdLsn8NfAZ4Cfl+UTgRczc39ZngKWlvZSYDdAWf9SGf8GEbEuIiYiYmJ6enqe5UuSmugYAhHxYWBvZm5v755laHax7hcdmRszczwzx0dHR7sqVpLUWyNdjDkf+EhEXAwcA7yN1p7BkogYKb/tLwP2lPFTwHJgKiJGgOOAF3peuSSpsY57Apl5bWYuy8wx4HLg7sz8PeAe4NIybA1wR2lvKcuU9Xdn5kF7ApKkwWvyOYFrgE9FxCStY/43lf6bgBNL/6eA9c1KlCT1SzeHg16XmfcC95b2TuDcWcb8BLisB7VJkvrMTwxLUsUMAUmqmCEgSRUzBCSpYoaAJFXMEJCkihkCklQxQ0CSKjanD4tpOIyt3zqwbe/asGpg25bUe+4JSFLFDAFJqpghIEkVMwQkqWKGgCRVzBCQpIoZApJUMUNAkipmCEhSxQwBSaqYISBJFTMEJKlihoAkVcwQkKSKGQKSVDFDQJIqZghIUsUMAUmqmCEgSRUzBCSpYoaAJFXMEJCkinUMgYg4JiK+ExGPRMSOiLiu9J8WEfdHxJMRsTkijir9R5flybJ+rL9TkCTNVzd7Aq8CF2TmWcDZwIURcR5wPXBDZq4A9gFry/i1wL7MfCdwQxknSRpCHUMgW14ui0eWWwIXALeV/k3AJaW9uixT1q+MiOhZxZKknunqnEBEHBERDwN7gbuAp4AXM3N/GTIFLC3tpcBugLL+JeDEXhYtSeqNrkIgM1/LzLOBZcC5wLtnG1buZ/utP2d2RMS6iJiIiInp6elu65Uk9dCcrg7KzBeBe4HzgCURMVJWLQP2lPYUsBygrD8OeGGW59qYmeOZOT46Ojq/6iVJjXRzddBoRCwp7bcAvw08DtwDXFqGrQHuKO0tZZmy/u7MPGhPQJI0eCOdh3AqsCkijqAVGrdm5jci4jHgloj4C+Ah4KYy/ibgqxExSWsP4PI+1C1J6oGOIZCZjwLnzNK/k9b5gZn9PwEu60l1kqS+8hPDklSxbg4HSa8bW791INvdtWHVQLYrvdm5JyBJFTMEJKlihoAkVcwQkKSKGQKSVDFDQJIqZghIUsUMAUmqmCEgSRUzBCSpYoaAJFXMEJCkihkCklQxQ0CSKmYISFLFDAFJqpghIEkVMwQkqWKGgCRVzBCQpIoZApJUMUNAkipmCEhSxQwBSaqYISBJFTMEJKlihoAkVcwQkKSKGQKSVDFDQJIq1jEEImJ5RNwTEY9HxI6I+GTpPyEi7oqIJ8v98aU/IuJvImIyIh6NiPf2exKSpPnpZk9gP/Anmflu4Dzgqog4HVgPbMvMFcC2sgxwEbCi3NYBN/a8aklST3QMgcx8JjMfLO0fAY8DS4HVwKYybBNwSWmvBm7OlvuAJRFxas8rlyQ1NqdzAhExBpwD3A+ckpnPQCsogJPLsKXA7raHTZW+mc+1LiImImJienp67pVLkhrrOgQi4q3A14GrM/OHhxs6S18e1JG5MTPHM3N8dHS02zIkST3UVQhExJG0AuBrmfkvpfvZA4d5yv3e0j8FLG97+DJgT2/KlST1UjdXBwVwE/B4Zv5V26otwJrSXgPc0dZ/RblK6DzgpQOHjSRJw2WkizHnA38AfDciHi59nwU2ALdGxFrgaeCysu5O4GJgEngFuLKnFUuSeqZjCGTmfzL7cX6AlbOMT+CqhnVJkhaAnxiWpIoZApJUMUNAkipmCEhSxbq5OkgauLH1Wwey3V0bVg1ku9JCcU9AkipmCEhSxQwBSaqYISBJFTMEJKlihoAkVcwQkKSKGQKSVDFDQJIqZghIUsUMAUmqmCEgSRUzBCSpYoaAJFXMEJCkihkCklQxQ0CSKmYISFLFDAFJqpghIEkVMwQkqWKGgCRVzBCQpIoZApJUMUNAkipmCEhSxQwBSarYSKcBEfFl4MPA3sx8T+k7AdgMjAG
7gI9l5r6ICOBLwMXAK8AnMvPB/pQu9d/Y+q0D2/auDasGtm3Vo5s9ga8AF87oWw9sy8wVwLayDHARsKLc1gE39qZMSVI/dAyBzPx34IUZ3auBTaW9Cbikrf/mbLkPWBIRp/aqWElSb833nMApmfkMQLk/ufQvBXa3jZsqfQeJiHURMRERE9PT0/MsQ5LURK9PDMcsfTnbwMzcmJnjmTk+Ojra4zIkSd2Ybwg8e+AwT7nfW/qngOVt45YBe+ZfniSpn+YbAluANaW9Brijrf+KaDkPeOnAYSNJ0vDp5hLRfwJ+EzgpIqaAPwM2ALdGxFrgaeCyMvxOWpeHTtK6RPTKPtQsSeqRjiGQmb97iFUrZxmbwFVNi5IkLQw/MSxJFTMEJKlihoAkVcwQkKSKGQKSVDFDQJIqZghIUsUMAUmqmCEgSRUzBCSpYoaAJFXMEJCkihkCklQxQ0CSKmYISFLFOv49AUmDMbZ+60C2u2vDqoFsV4PhnoAkVcwQkKSKGQKSVDFDQJIqZghIUsUMAUmqmCEgSRUzBCSpYoaAJFXMEJCkihkCklQxvztI0hsM6juLwO8tGgT3BCSpYoaAJFXMEJCkihkCklSxvoRARFwYEU9ExGRErO/HNiRJzfX86qCIOAL4W+ADwBTwQERsyczHer0tSW8ug7wyaVAGfUVUP/YEzgUmM3NnZv4UuAVY3YftSJIa6sfnBJYCu9uWp4D3zxwUEeuAdWXx5Yh4Yp7bOwl4bp6PHVbOafF4M87LOS2guH7eDz0J+LWm2+9HCMQsfXlQR+ZGYGPjjUVMZOZ40+cZJs5p8Xgzzss5LQ5lTmNNn6cfh4OmgOVty8uAPX3YjiSpoX6EwAPAiog4LSKOAi4HtvRhO5Kkhnp+OCgz90fEHwPfAo4AvpyZO3q9nTaNDykNIee0eLwZ5+WcFoeezCkyDzpcL0mqhJ8YlqSKGQKSVLGhDoFOXz8REUdHxOay/v6IGGtbd23pfyIiPrSQdR/OfOcUER+IiO0R8d1yf8FC134oTV6nsv5XI+LliPj0QtXcScP33pkR8d8RsaO8XscsZO2H0uC9d2REbCpzeTwirl3o2g+lizn9RkQ8GBH7I+LSGevWRMST5bZm4ao+vPnOKSLObnvfPRoRv9PVBjNzKG+0Tio/BbwDOAp4BDh9xpg/Av6utC8HNpf26WX80cBp5XmOWORzOgd4e2m/B/i/Qc+n6Zza1n8d+Gfg04OeTw9epxHgUeCssnzim+C993HgltL+ZWAXMLZI5jQGnAncDFza1n8CsLPcH1/axy/yOb0LWFHabweeAZZ02uYw7wl08/UTq4FNpX0bsDIiovTfkpmvZuYPgMnyfIM27zll5kOZeeDzFjuAYyLi6AWp+vCavE5ExCW0fgD7eQXZXDWZ0weBRzPzEYDMfD4zX1ugug+nyZwSODYiRoC3AD8FfrgwZR9Wxzll5q7MfBT4+YzHfgi4KzNfyMx9wF3AhQtRdAfznlNmfj8znyztPcBeYLTTBoc5BGb7+omlhxqTmfuBl2j95tXNYwehyZzafRR4KDNf7VOdczHvOUXEscA1wHULUOdcNHmd3gVkRHyr7LJ/ZgHq7UaTOd0G/JjWb5ZPA3+ZmS/0u+AuNPk5X8z/R3QUEefS2pN4qtPYYf4bw918/cShxnT11RUD0GROrZURZwDX0/qNcxg0mdN1wA2Z+XLZMRgWTeY0Avw68D7gFWBbRGzPzG29LXHOmszpXOA1WocYjgf+IyK+nZk7e1vinDX5OV/M/0cc/gkiTgW+CqzJzJl7QAcZ5j2Bbr5+4vUxZVf1OOCFLh87CE3mREQsA24HrsjMjgm/QJrM6f3AFyNiF3A18NnyQcNBa/re+7fMfC4zXwHuBN7b94o7azKnjwPfzMyfZeZe4L+AYfgeniY/54v5/4hDioi3AVuBz2fmfV09aNAnQg5zgmSE1rHi0/jFCZIzZoy5ijeeyLq1tM/gjSeGdzIcJ+eazGl
JGf/RQc+jV3OaMeYLDM+J4Sav0/HAg7ROoI4A3wZWLfI5XQP8A63fUo8FHgPOXAxzahv7FQ4+MfyD8nodX9onLPI5HQVsA66e0zYHPekO/yAXA9+ndVzrc6Xvz4GPlPYxtK4qmQS+A7yj7bGfK497Arho0HNpOifg87SOyz7cdjt50PNp+jq1PccXGJIQ6MF77/dpnej+HvDFQc+lB++9t5b+HbQC4E8HPZc5zOl9tH67/jHwPLCj7bF/WOY6CVw56Lk0nVN53/1sxv8RZ3fanl8bIUkVG+ZzApKkPjMEJKlihoAkVcwQkKSKGQKSVDFDQJIqZghIUsX+H22SkG/1SI3eAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "## and a histogram of the magnitude of the error\n", "plt.hist(abs(molecules_y_test- molecules_y_pred))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Intercept: 0.25 \n", " Coefficients: [ 8.12849803e-04 -1.03028728e-03 -9.85332103e-05 1.19988936e-03\n", " -4.89037868e-04 -4.57045254e-04 -2.04281125e-03 -1.43264786e-03\n", " -7.22503668e-04 -4.02626465e-02 -1.11831216e-01 2.96729799e-02\n", " 1.21237161e-01] \n", "\n", "Mean squared error: 0.00115\n", "Coefficient of determination: 0.2940\n" ] } ], "source": [ "# The intercept and coefficients\n", "print('Intercept: {0:.2f} \\n Coefficients: {1} \\n'.format(regr.intercept_, regr.coef_))\n", "\n", "# The mean squared error\n", "print('Mean squared error: %.5f'\n", " % mean_squared_error(molecules_y_test, molecules_y_pred))\n", "\n", "# The coefficient of determination: 1 is perfect prediction\n", "print('Coefficient of determination: %.4f'\n", " % r2_score(molecules_y_test, molecules_y_pred))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "oU8GH-H0CyNF" }, "source": [ "## Gradient Descent \n", "\n", "Although we can compute the coefficients $\\mathbf{\\theta}$ for a given data set, this approach has some restrictions:\n", "\n", " * To compute $(X^\\intercal X)^{-1}$, the matrix $(X^\\intercal X)$ has to be non-singular. \n", " * Even if we can compute $(X^\\intercal X)^{-1}$, this is computationally cost for large data sets. If we have $10.000$ data points for instance, $(X^\\intercal X)^{-1}$ will have a dimension of $10.000 \\times 10.000$, requiring a large ammount of memory to store it, and a very large computational time for its inversion. \n", "\n", "The gradient descent is a computational technique that can be used to minimize functions interativelly. The idea is as follow:\n", "\n", " 1. 
Guess an initial value for $\\mathbf{\\theta}$.\n", " 1. Compute the gradient of the error function at $\\mathbf{\\theta}$.\n", " 1. Adjust the value of $\\mathbf{\\theta}$ by taking a \"small step\" in the opposite direction of the gradient.\n", " 1. Repeat 2 and 3 until convergence.\n", "\n", "This process is illustrated in the picture below:\n", "\n", "![](https://miro.medium.com/proxy/1*wsBakfF2Geh1zgY4HJbwFQ.gif)\n", "\n", "The size of the step in 3 is a parameter of the algorithm, and is known as the **learning rate**. More formally, the update rule is:\n", "\n", "$$ \\mathbf{\\theta}_{n+1} = \\mathbf{\\theta}_{n} - \\alpha \\nabla_\\theta E(\\mathbf{\\theta}_{n})$$\n", "\n", "The learning rate $\\alpha$ should be chosen wisely:\n", " * A very small learning rate may delay convergence, requiring more iterations to reach the minimum.\n", " * A very large learning rate can make the method diverge instead of converge, so the minimum is never reached.\n", "\n", "The picture below illustrates the influence of the learning rate when minimizing a quadratic function. The first two cases converge to the minimum, although the case in the middle converges more quickly. On the other hand, the third graph has a learning rate that is too high, and gradient descent diverges from the minimum. \n", "\n", "![](https://miro.medium.com/proxy/1*Q-2Wh0Xcy6fsGkbPFJvMhQ.gif)\n", "\n", "In general, the setting of the learning rate requires some experimentation. 
There are some approaches that use a variable learning rate, aiming to speed up learning while avoiding divergence and local minima (you can find more details [here](https://physics.bu.edu/~pankajm/ML-Notebooks/HTML/NB2_CIV-gradient_descent.html)).\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": {}, "colab_type": "code", "id": "UTZ3i9C-ypKm" }, "outputs": [], "source": [ "# Create a gradient descent linear regression model\n", "# Note that the learning rate is named eta0 in sklearn\n", "# You can also choose the learning rate schedule (here it is kept constant)\n", "regr = linear_model.SGDRegressor(learning_rate='constant', eta0 = 0.001, max_iter=2000)\n", "\n", "# Train the model using the training sets\n", "regr.fit(diabetes_X_train, diabetes_y_train)\n", "\n", "# Make predictions using the testing set\n", "diabetes_y_pred = regr.predict(diabetes_X_test)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 104 }, "colab_type": "code", "id": "0X_pBSodzRu7", "outputId": "79105116-a571-4fc6-a7f5-165307c60785" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Intercept: 152.81 \n", " Coefficients: 660.64 \n", "\n", "Mean squared error: 3024.76\n", "Coefficient of determination: 0.37\n" ] } ], "source": [ "# The intercept and coefficients\n", "# They are a bit different from ordinary linear regression due to some numerical issues and regularization (discussed later)\n", "print('Intercept: {0:.2f} \\n Coefficients: {1:.2f} \\n'.format(regr.intercept_[0], regr.coef_[0]))\n", "\n", "# The mean squared error\n", "print('Mean squared error: %.2f'\n", " % mean_squared_error(diabetes_y_test, diabetes_y_pred))\n", "\n", "# The coefficient of determination: 1 is perfect prediction\n", "print('Coefficient of determination: %.2f'\n", " % r2_score(diabetes_y_test, diabetes_y_pred))" ] }, { "cell_type": "code", 
"execution_count": 15, "metadata": { "colab": {
"base_uri": "https://localhost:8080/", "height": 252 }, "colab_type": "code", "id": "F8aHgWT46Nte", "outputId": "8d6fb844-f25d-4025-9349-677656785e23" }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAWAAAADrCAYAAABXYUzjAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAPL0lEQVR4nO3dXYhc5R3H8d+ZxGiOWoumJV645wgiLUWK7Cq+kor2qha13jlGhdKxvVLx7eKoNzK0FxoVWqpzIagzikUxQa9bakoaYSJSLCZV0plVqMFI1JCJMe48vRhnz87uJjlnds485zzn+4FFPMyT/NfVXx7/z8vxjDECAExfxXYBAFBWBDAAWEIAA4AlBDAAWEIAA4AlBDAAWLI+zYc3bdpkwjDMqBQAcNOePXsOGmN+sPx5qgAOw1DtdntyVQFACXie113tOS0IALCEAAYASwhgALCEAAYASwhgALCEAAbgtFarpTAMValUFIahWq2W7ZIWpdqGBgBF0mq1VKvV1Ov1JEndble1Wk2SVK1WbZYmiRkwAIdFUbQYvkO9Xk9RFFmqaBQBDMBZ8/PzqZ5PGwEMwFkzMzOpnk8bAQzAWfV6Xb7vjzzzfV/1et1SRaMIYADOqlarajQaCoJAnucpCAI1Go1cLMBJkpfmpZxzc3OGy3gAIB3P8/YYY+aWP2cGDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDACWEMAAYAkBDCCXWq2WwjBUpVJRGIZqtVq2S5q49bYLAIDlWq2WarWaer2eJKnb7apWq0mSqtWqzdImihkwgNyJomgxfId6vZ6iKLJUUTYIYAC5Mz8/n+p5URHAAHJnZmYm1fOiIoAB5E69Xpfv+yPPfN9XvV63VFE2CGAAuVOtVtVoNBQEgTzPUxAEajQaTi3ASZJnjEn84bm5OdNutzMsBwDc43neHmPM3PLnzIABwBICGAAsIYCBHCrDKTBwEg7InbKcAgMzYCB3ynIKDAQwkDtlOQUGAhjInbKcAgMBDOROWU6BgQAGcqcsp8DASTgAjltYkHbskMJQuvRSyfOmXwMn4QCUxoED0l13DcJ2/Xrp1lul2Vmp0bBd2Sj2AQNwwl//Kv3ud9J//nPiz+zfP716kmAGDKCQvv5aqtcHs1zPk66//uThe/rp0iOPTK++JAhgAIXx0UfSL34xCNyNG5MF6hNPSMeODQL77LOzrzENWhAAcssY6bHHpCeflI4eTTbmpz+V/vQn6eqrs61tEghgALly4IC0ZYu0b1/yMb/9rfT449KmTdnVlQUCGIB1O3ZIN9+c/PO+L/35z9Ltt0uVAjdSC1w6gKI6fly64454AS1J+N5wg/Tvfw/aEkeODMYXOXwlZsAApmTv3kF/9ptvko+55hrprbekc87Jri6bCv7nB4A8++Mf41nuj3+cLHyfeWYwyzVG2rnT3fCVmAEDmKAvv5RuvFH6xz+Sj9mwQXrvvUFAlw0zYABr8vbb8Sz3+99PFr5btw725hoz+GsZw1cigAGk1O9LDzwQh+6
WLcnGvfFG3Fp48cXBzLfsaEEAOKVPPpGuukr6+OPkYy6+WPr736XNm7Orq+iYAQNY1auvxrPcCy5IFr6PPjqYIRszOEhB+J4cM2AAkgZ3JWzdKr32Wrpx//yndMUV2dTkOmbAQIm9+WY8y924MVn4/vzn0uHDcT+X8B0fM2CgRIwZHG7YtSvduEZD+s1vsqmpzAhgwHHd7uB1PGmcc47UbksXXZRJSfgOLQjAQU88EbcWkoZvEAzuaDBG+uILwncaCOBlWq2WwjBUpVJRGIZqtVq2SwJO6fhx6bTT4tB98MFk437/+7iX2+kM3p+G6eEf9xKtVku1Wk29Xk+S1O12VavVJIlXgiN3du+Wrrwy/bj9+6ULL5x8PUiPGfASURQthu9Qr9dTFEWWKgJGDd/063nJw/fyy+O9ucYQvnlCAC8xPz+f6jnclKc21KFDceB6nvTCC8nGvfZaHLjvvDMYi/whgJeYmZlJ9RzuGbahut2ujDGLbahphvDSN/2ee27ycV98EYfurbdmVx8mhwBeol6vy/f9kWe+76ter1uqCNNmow1lzOgsN+mr02u1OHCNcfveXFcRwEtUq1U1Gg0FQSDP8xQEgRqNBgtwJTKtNlS7HQdumtfqtNtx4D733ERLggUE8DLValWdTkf9fl+dTofwLZks21C//GUcupddlnzccG+uMdLs7JrLWCFPPe+yIYCBJSbZhjpyZLS18NZbycbdffdoayHLvbl56HmXGQEMLLHWNtTLL8eBe9ZZyX/fffviwH322TGLHwNbL+3yjDGJPzw3N2fa7XaG5QDFc+aZ0rIMSyTFf3qZqVQqWi0DPM9Tv9+3UJGbPM/bY4yZW/6cGTCQUrc72lpIGr6NxmhrIQ/YemkXAQwk8NBD6S+3kQYHKYaBm8frHNl6aRd3QQCrWFgYb/HrmmuknTsnX09Whr3tKIo0Pz+vmZkZ1et1dv9MCT1g4Dvbt0u33JJ+3NtvS9deO/l64I4T9YCZAaPUxr0j4dtvpXXrJlsLyoceMErls89GF9CSevjh0QU0wheTQADDeVEUB+4Pf5h83N69ceD+4Q/Z1YfyogUB5xiT7n6F5WOBaWEGDCfs3Dne5TbbtuVvby7KgxkwCut735MOH04/7quvpLPPnnw9QFrMgFEYR4+OLqAlDd/Nm0dnuYQv8oIARq5t2xYH7rIDWye1fXscuP/7X3b1AWtBAFvEPayrWzrLvf/+5OMWFuLQvemm7OoDJoUAtoR7WGN79463N/e660ZbC+PufABs4SiyJWEYqtvtrngeBIE6nc70C5qyiy+WPvww/bgPP5Quumjy9QBZ4ihyzkzr3WN5wd5cYCX+p82SMtzD+sor4+3Nvf9+9uaiHJgBW1Kv11Wr1UZeB+PCPazjXm5z+HC6V/gALmAGbMla3z2WF4cOjbeAJo3OcglflBGLcEjt17+Wnn8+/bg335RuvHHy9QB5xyIc1mTc1kK/P/5YwHW0ILCq998fr7Vw1VWjrQXCFzgxZsBYNDsrvftu+nH//W+6F1UCGCCAS2zcF09KbA8DJoEWRMm8/HLcVkgTvo0Ge3OBSWMGXALj9mGPHZM2bJhsLQBizIAdNO6LJy+8cHSWS/gC2SKAHXH33eO9eLLdjgN3//7VP8O1mUA2aEEU2LithTQ93OG1mcMj08NrMyUV7tQekDfMgAtk9+7xWgv33DP+AloURSP3VUhSr9dTFEXpfiEAKzADzrnzz5c+/TT9uIMHpfPOW/vvX7ZrM4FpYgacM8eOjc5y04Tv0lnuJMJXKse1mYAtBHAOvPpqHLhnnJF83F/+kv3e3Hq9Ln/Z2zBduDYTyANaEJZceqn03nvpxy0sTPfdZ8OFtiiKND8/r5mZGdXrdRbggAngOsop+fxzadOm9OOuvFLatWvy9QCYnhNdR0kLIkPPPBO3FtKE7wcfxG0FwhdwFwE8QcPrF4df996bbNy55472cn/0o5Wf4TAE4B4CeI2W3pubpjf
7xhtx4H7++ck/OzwM0e12ZYxZPAxBCAPFRg94DLffLo2TfUePptvlMBSGobrd7ornQRCo0+mk/wUBTBWvJFqDr7+WNm5MP+6228YL6uU4DAG4iRbECezYEbcW0oTvv/4VtxYm1SHgMATgJgJ4ia1b49C9+ebk4/r9OHQvuWTydXEYAnBTqQP4yJHRXQvNZrJxTz013RdPVqtVNRoNBUEgz/MUBIEajQaHIYCCK10Av/NOHLhnnZV83MGDceAm3V42SdVqVZ1OR/1+X51Oh/A9BbbtoQicD2BjpMcei0P3iiuSjbvllmwut0H22LaHonByG9qBA9KWLdK+fenGvfvu4I4GFBvb9pA3zh9FXrprYfPmZOF7332Dy22Gs1zC1w1s20NRFDaAjx+X7rgj/a6Fv/0tDtxt26Z7sximg217KIpCxc/evdLppw8Cd8MG6aWXTj3m6qulQ4fi0P3ZzzIv0zlFW9Bi2x4KwxiT+Gt2dtZM265dS5fCkn09/fTUy3RWs9k0vu8bSYtfvu+bZrNpu7STajabJggC43meCYIg9/XCbZLaZpVMzfUi3IsvSnfeeerPrV8/uNz8Jz/JvqayYUELWLtCLsKdrMVw222D96cZM+gHE77ZYEELyE6uA3j79tG/f/310XsWNmywU1eZsKAFZCfXAXzmmaPd3V/9ynZF5cOCFpCdXAcw7OMeCiA7uV6EAwAXFHIRDisVbU8ugBPjjRgFMrxkptfrSdLiJTOSaAkABcQMOAeSzmqjKFoM36Fer6coiqZRJoAJYwZsWZpZLXtyAbcwA7YszayWPbmAWwhgy9LMatmTC7iFALYszayWPbmAWwhgy9LOank3HOAOAtgyZrVAeXESDgAyxkk4AMgZAhgALCGAAcASAhgALCGAAcASAhgALCl1AHO3LgCbSnsbGnfrArCttDNg7tYFYFtpA5i7dQHYVtoA5m7d4qJ3D1eUNoBduVu3bGE07N13u10ZYxZ7965/33CUMSbx1+zsrHFJs9k0QRAYz/NMEASm2WzaLimVZrNpfN83kha/fN8/6fdR9O85CIKR73f4FQSB7dKAE5LUNqtkKrehFVgYhup2uyueB0GgTqez4vnynR/SYNZfpOsvK5WKVvt31vM89ft9CxUBp8ZtaA5Ku5Dows4PevdwCQFcYGnDyIWdH6707gGJAC60tGHkwuyRN4jAJQRwgaUNI1dmj7wXD64oRACXbatVGmnCiNkjkC+53wXhwso9gHIr7C4IF1buAWA1uQ9gF1buAWA1uQ9gF1buAWA1uQ9gV1buAWC5XAdwq9Va7AGvW7dOkli5Lwl2vqAMcvtGjOW7HxYWFhZnvoSv23hbCcoit9vQ0l40A3fws4drCrcNjd0P5cXPHmWR2wBm90N58bNHWeQ2gNn9UF787FEWuQ1g7i0oL372KIvcLsIBgCsKtwgHAK4jgAHAEgIYACwhgAHAEgIYACxJtQvC87zPJK08IwoAOJnAGPOD5Q9TBTAAYHJoQQCAJQQwAFhCAAOAJQQwAFhCAAOAJQQwAFhCAAOAJQQwAFhCAAOAJf8HalCvD7PAM9YAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot outputs\n", "# The graph is similar to linear regression using ordinal least squares\n", "plt.scatter(diabetes_X_test, diabetes_y_test, color='black')\n", "plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)\n", "\n", "plt.xticks(())\n", "plt.yticks(())\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "oTMndk2HDXJD" }, "source": [ "## Polinomial Regression\n", "\n", "Our model function need not be linear (a straight line) if that does not fit the data well.\n", "\n", "We can change the behavior or curve of our model function by making it a quadratic, cubic or square root function (or other polinomial forms).\n", "\n", "Recall that a linear model is $pred_\\theta(x) = \\theta_0 + \\theta_1 x$. We can create additional features based on $x$ to add higher order polimonial terms. For instance, if we want to include a quadratic term, then we can add an extra feature $x_i^2$ an run the least squares algorithm in the augmented data set to compute the quadratic model function $pred_\\theta(x_i) = \\theta_0 + \\theta_1 x_i + \\theta_2 x_i^2$. Similarly, if we want to also add a cubic function, we can add another extra feature $x_i^3$ so that the model function is now $pred_\\theta(x) = \\theta_0 + \\theta_1 x_i + \\theta_2 x_i^2 + \\theta_3 x_i^3$.\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 265 }, "colab_type": "code", "id": "SXaXlEA2bNPI", "outputId": "b03779d3-1cac-4a3a-98c0-209f2d4937bf" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "from sklearn.pipeline import make_pipeline\n", "\n", "\n", "# The, we load the linnedrud dataset\n", "# This is a sample data set provided by sklearn\n", "# You can use our own, as long as X is a matrix and y a vector with the same number of rows as X\n", "linnerud_X, linnerud_y = datasets.load_linnerud(return_X_y=True)\n", "\n", "# Use only one feature (this is not necessary, just for plotting)\n", "linnerud_X = linnerud_X[:, np.newaxis, 0]\n", "# Use only one feature (this is not necessary, just for plotting)\n", "linnerud_y = linnerud_y[:, np.newaxis, 0]\n", "\n", "\n", "# Split the data into training/testing sets (we will discuss this latter on)\n", "linnerud_X_train = linnerud_X[-10:]\n", "linnerud_X_test = linnerud_X[:-10]\n", "\n", "# Split the targets into training/testing sets (we will discuss this latter on)\n", "linnerud_y_train = linnerud_y[-10:]\n", "linnerud_y_test = linnerud_y[:-10]\n", "\n", "\n", "regr = linear_model.LinearRegression()\n", "\n", "x_plot = np.array([np.arange(1,20,0.1)]).transpose()\n", "\n", "plt.scatter(linnerud_X_test, linnerud_y_test, color='black')\n", "\n", "\n", "for count, degree in enumerate([1, 2, 3, 4, 5]):\n", " model = make_pipeline(PolynomialFeatures(degree), regr)\n", " model.fit(linnerud_X_train, linnerud_y_train)\n", " y_plot = model.predict(x_plot)\n", " plt.plot(x_plot, y_plot, linewidth=2,\n", " label=\"degree %d\" % degree)\n", " \n", "\n", "plt.legend(loc='lower center')\n", "\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean squared error: 0.00110\n", "Coefficient of determination: 0.33\n" ] } ], "source": [ "regr = linear_model.LinearRegression()\n", "\n", "model = make_pipeline(PolynomialFeatures(2), regr)\n", "model.fit(molecules_X_train, 
molecules_y_train)\n", "\n", "molecules_y_pred = model.predict(molecules_X_test)\n", "\n", "# The mean squared error\n", "print('Mean squared error: %.5f'\n", " % mean_squared_error(molecules_y_test, molecules_y_pred))\n", "\n", "# The coefficient of determination: 1 is perfect prediction\n", "print('Coefficient of determination: %.2f'\n", " % r2_score(molecules_y_test, molecules_y_pred))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "## Lets make a scatter plot of the true versus predicted values\n", "\n", "plt.scatter(molecules_y_test,molecules_y_pred)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([926., 613., 288., 132., 32., 5., 3., 0., 0., 1.]),\n", " array([3.04527283e-06, 1.93773367e-02, 3.87516281e-02, 5.81259195e-02,\n", " 7.75002109e-02, 9.68745023e-02, 1.16248794e-01, 1.35623085e-01,\n", " 1.54997377e-01, 1.74371668e-01, 1.93745959e-01]),\n", " )" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAD4CAYAAAAEhuazAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8QZhcZAAAQZ0lEQVR4nO3de4xc5XnH8e8Djk0hAhtYIrCdrGmcthAqoBsgiRIVnCgB2hgJaEhJsKgltwltk1KpmJAqUqVKRq0KQa1AVlwwbS5QmgirJK0ol0qVYpo190uJF0NhwYUlXJpAQ3B5+se8C8N67Z2dO7zfjzSac97znnOePXP4zfF7ZobITCRJddln0AVIkvrP8JekChn+klQhw1+SKmT4S1KFFgy6AIBDDz00R0dHB12GJL2lbNu27dnMHGln3aEI/9HRUcbHxwddhiS9pUTEf7W7rsM+klQhw1+SKmT4S1KFDH9JqpDhL0kVMvwlqUKGvyRVyPCXpAoZ/pJUoaH4hm8nRtffNLB9P7bh9IHtW5I64ZW/JFXI8JekChn+klQhw1+SKmT4S1KFDH9JqpDhL0kVMvwlqUKGvyRVyPCXpAoZ/pJUIcNfkipk+EtShQx/SaqQ4S9JFTL8JalChr8kVcjwl6QKGf6SVCHDX5IqZPhLUoUMf0mqkOEvSRUy/CWpQi2Ff0T8UUQ8EBH3R8S3ImK/iFgREXdExPaIuC4iFpa+i8r8RFk+2ss/QJI0f3OGf0QsBf4QGMvM9wP7AucAlwKXZeZK4HlgbVllLfB8Zr4XuKz0kyQNkVaHfRYAvxARC4D9gZ3AKcANZflm4IwyvbrMU5aviojoTrmSpG6YM/wz80ngL4HHaYT+i8A24IXM3FW6TQJLy/RS4Imy7q7S/5CZ242IdRExHhHjU1NTnf4dkqR5aGXYZwmNq/kVwBHAAcCps3TN6VX2suyNhsyNmTmWmWMjIyOtVyxJ6lgrwz4fAx7NzKnMfBX4DvAhYHEZBgJYBjxVpieB5QBl+UHAc12tWpLUkVbC/3HgpIjYv4zdrwIeBG4Dzip91gA3luktZZ6y/NbM3O3KX5I0OK2M+d9B48btncB9ZZ2NwEXAhRExQWNMf1NZZRNwSGm/EFjfg7olSR1YMHcXyMyvAl+d0bwDOGGWvj8Dzu68NElSr/gNX0mqkOEvSRUy/CWpQoa/JFXI8JekChn+klQhw1+SKmT4S1KFDH9JqpDhL0kVMvwlqUKGvyRVyPCXpAoZ/pJUIcNfkipk+EtShQx/
SaqQ4S9JFTL8JalChr8kVcjwl6QKGf6SVCHDX5IqZPhLUoUMf0mqkOEvSRUy/CWpQoa/JFXI8JekChn+klQhw1+SKmT4S1KFDH9JqpDhL0kVMvwlqUIthX9ELI6IGyLiPyPioYj4YEQcHBE3R8T28ryk9I2IuCIiJiLi3og4vrd/giRpvha02O9rwD9n5lkRsRDYH/gycEtmboiI9cB64CLgVGBleZwIXFme33ZG1980kP0+tuH0gexX0tvHnFf+EXEg8FFgE0Bm/jwzXwBWA5tLt83AGWV6NXBtNmwFFkfE4V2vXJLUtlaGfY4EpoCrI+KuiPh6RBwAvCszdwKU58NK/6XAE03rT5Y2SdKQaCX8FwDHA1dm5nHASzSGePYkZmnL3TpFrIuI8YgYn5qaaqlYSVJ3tBL+k8BkZt5R5m+g8Wbw9PRwTnl+pqn/8qb1lwFPzdxoZm7MzLHMHBsZGWm3fklSG+YM/8z8b+CJiPil0rQKeBDYAqwpbWuAG8v0FuC88qmfk4AXp4eHJEnDodVP+/wB8I3ySZ8dwPk03jiuj4i1wOPA2aXv94DTgAng5dJXkjREWgr/zLwbGJtl0apZ+iZwQYd1SZJ6yG/4SlKFDH9JqpDhL0kVMvwlqUKGvyRVyPCXpAoZ/pJUIcNfkipk+EtShQx/SaqQ4S9JFTL8JalChr8kVcjwl6QKGf6SVCHDX5IqZPhLUoUMf0mqkOEvSRUy/CWpQoa/JFXI8JekChn+klQhw1+SKmT4S1KFDH9JqpDhL0kVMvwlqUKGvyRVyPCXpAoZ/pJUIcNfkipk+EtShQx/SaqQ4S9JFTL8JalCLYd/ROwbEXdFxD+V+RURcUdEbI+I6yJiYWlfVOYnyvLR3pQuSWrXfK78vwg81DR/KXBZZq4EngfWlva1wPOZ+V7gstJPkjREWgr/iFgGnA58vcwHcApwQ+myGTijTK8u85Tlq0p/SdKQaPXK/3LgT4DXyvwhwAuZuavMTwJLy/RS4AmAsvzF0v9NImJdRIxHxPjU1FSb5UuS2jFn+EfEbwDPZOa25uZZumYLy95oyNyYmWOZOTYyMtJSsZKk7ljQQp8PA5+KiNOA/YADafxLYHFELChX98uAp0r/SWA5MBkRC4CDgOe6XrkkqW1zXvln5sWZuSwzR4FzgFsz81zgNuCs0m0NcGOZ3lLmKctvzczdrvwlSYPTyef8LwIujIgJGmP6m0r7JuCQ0n4hsL6zEiVJ3dbKsM/rMvN24PYyvQM4YZY+PwPO7kJtkqQe8Ru+klQhw1+SKmT4S1KFDH9JqpDhL0kVMvwlqUKGvyRVyPCXpAoZ/pJUIcNfkipk+EtShQx/SaqQ4S9JFTL8JalChr8kVcjwl6QKGf6SVCHDX5IqZPhLUoXm9f/w1XAYXX/TwPb92IbTB7ZvSd3jlb8kVcjwl6QKGf6SVCHDX5IqZPhLUoUMf0mqkOEvSRUy/CWpQoa/JFXI8JekChn+klQhw1+SKmT4S1KFDH9JqpDhL0kVMvwlqUJzhn9ELI+I2yLioYh4ICK+WNoPjoibI2J7eV5S2iMiroiIiYi4NyKO7/UfIUman1au/HcBf5yZvwKcBFwQEUcB64FbMnMlcEuZBzgVWFke64Aru161JKkjc4Z/Zu7MzDvL9E+Ah4ClwGpgc+m2GTijTK8Grs2GrcDiiDi865VLkto2rzH/iBgFjgPuAN6VmTuh8QYBHFa6LQWeaFptsrTN3Na6iBiPiPGpqan5Vy5JalvL4R8R7wT+EfhSZv7P3rrO0pa7NWRuzMyxzBwbGRlptQxJUhe0FP4R8Q4awf+NzPxOaX56ejinPD9T2ieB5U2rLwOe6k65kqRuaOXTPgFsAh7KzL9qWrQFWFOm1wA3NrWfVz71cxLw4vTwkCRpOCxooc+Hgc8B90XE3aXty8AG4PqIWAs8Dpxdln0POA2YAF4Gzu9qxZKkjs0Z/pn578w+jg+wapb+CVzQYV2SpB7yG76S
VCHDX5IqZPhLUoUMf0mqkOEvSRUy/CWpQoa/JFXI8JekChn+klShVn7eQXrd6PqbBrLfxzacPpD9Sm9XXvlLUoUMf0mqkOEvSRUy/CWpQoa/JFXI8JekChn+klQhw1+SKmT4S1KFDH9JqpDhL0kVMvwlqUKGvyRVyPCXpAoZ/pJUIcNfkipk+EtShQx/SaqQ4S9JFTL8JalChr8kVcjwl6QKLRh0AVIrRtffNJD9Prbh9IHsV+o1r/wlqUKGvyRVyPCXpAr1JPwj4pMR8XBETETE+l7sQ5LUvq6Hf0TsC/wNcCpwFPCZiDiq2/uRJLWvF5/2OQGYyMwdABHxbWA18GAP9iX11KA+ZTRIfsKpDr0I/6XAE03zk8CJMztFxDpgXZn9aUQ83Ob+DgWebXPdfhjm+qytPcNcG3RYX1zaxUp2N8zH7q1Y23va3WAvwj9macvdGjI3Ahs73lnEeGaOdbqdXhnm+qytPcNcGwx3fdbWnl7U1osbvpPA8qb5ZcBTPdiPJKlNvQj/HwIrI2JFRCwEzgG29GA/kqQ2dX3YJzN3RcTvA/8C7Av8bWY+0O39NOl46KjHhrk+a2vPMNcGw12ftbWn67VF5m7D8ZKktzm/4StJFTL8JalCQxf+c/00REQsiojryvI7ImK0adnFpf3hiPhEq9vsdW0R8fGI2BYR95XnU5rWub1s8+7yOKzPtY1GxP827f+qpnV+rdQ8ERFXRMRsH+PtZW3nNtV1d0S8FhHHlmVdOW4t1vfRiLgzInZFxFkzlq2JiO3lsaapvV/HbtbaIuLYiPhBRDwQEfdGxKebll0TEY82Hbtj+1lbWfZ/Tfvf0tS+opwD28s5sbCd2jqpLyJOnnHe/SwizijL+nXsLoyIB8trd0tEvKdpWXfOucwcmgeNG8SPAEcCC4F7gKNm9PkCcFWZPge4rkwfVfovAlaU7ezbyjb7UNtxwBFl+v3Ak03r3A6MDfC4jQL372G7/wF8kMZ3N74PnNrP2mb0OQbY0c3jNo/6RoFfBa4FzmpqPxjYUZ6XlOklfT52e6rtfcDKMn0EsBNYXOavae7b7+NWlv10D9u9HjinTF8FfH4Q9c14jZ8D9u/zsTu5aZ+f543/Xrt2zg3blf/rPw2RmT8Hpn8aotlqYHOZvgFYVd7hVgPfzsxXMvNRYKJsr5Vt9rS2zLwrM6e/6/AAsF9ELGqjhq7XtqcNRsThwIGZ+YNsnFnXAmcMsLbPAN9qY/8d15eZj2XmvcBrM9b9BHBzZj6Xmc8DNwOf7Oex21NtmfmjzNxepp8CngFG2qih67XtSXnNT6FxDkDjnGjnuHWzvrOA72fmy23W0W5ttzXtcyuN70tBF8+5YQv/2X4aYume+mTmLuBF4JC9rNvKNntdW7Mzgbsy85WmtqvLPyH/tM3hgU5rWxERd0XEv0XER5r6T86xzX7UNu3T7B7+nR63Vuub77r9PHZziogTaFxhPtLU/OdlSOGyNi9EOq1tv4gYj4it00MqNF7zF8o50M42u1nftHPY/bzr97FbS+NKfm/rzvucG7bwb+WnIfbUZ77t89VJbY2FEUcDlwK/27T83Mw8BvhIeXyuz7XtBN6dmccBFwLfjIgDW9xmr2trLIw4EXg5M+9vWt6N49ZqffNdt5/Hbu8baFwR/h1wfmZOX+FeDPwy8AEawwcXDaC2d2fj5wp+G7g8In6xC9ts1q1jdwyN7yxN6+uxi4jPAmPAX8yx7rz/3mEL/1Z+GuL1PhGxADiIxpjcntbt1s9NdFIbEbEM+C5wXma+fgWWmU+W558A36TxT8K+1VaGyX5cathG4+rwfaX/sqb1B3Lcit2uvrp03Fqtb77r9vPY7VF5E78J+Epmbp1uz8yd2fAKcDW9O+f2aHoYNBu//ns7jftizwKLyzkw7212s77it4DvZuar0w39PHYR8THgEuBTTSMF3TvnOrlx0e0HjW8c76Bxw3b6RsjRM/pcwJtvDl5fpo/mzTd8d9C4
sTLnNvtQ2+LS/8xZtnlomX4HjbHO3+tzbSPAvmX6SOBJ4OAy/0PgJN64gXRaP2sr8/vQOLGP7PZxa7W+pr7XsPsN30dp3HhbUqb7euz2UttC4BbgS7P0Pbw8B3A5sKHPtS0BFpXpQ4HtlBuewD/w5hu+X+j369rUvhU4eRDHjsab4SOUm/a9OOfmfVB7/QBOA35U/vBLStuf0Xj3A9ivnCATNO5uN4fCJWW9h2m60z3bNvtZG/AV4CXg7qbHYcABwDbgXho3gr9GCeI+1nZm2fc9wJ3AbzZtcwy4v2zzrynfCO/za/rrwNYZ2+vacWuxvg/QeAN6Cfgx8EDTur9T6p6gMbTS72M3a23AZ4FXZ5xzx5ZltwL3lfr+Hnhnn2v7UNn/PeV5bdM2jyznwEQ5JxYN6HUdpXEhtM+Mbfbr2P0r8HTTa7el2+ecP+8gSRUatjF/SVIfGP6SVCHDX5IqZPhLUoUMf0mqkOEvSRUy/CWpQv8PCLP/PonQwgkAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "## and a histogram of the magnitude of the error\n", "plt.hist(abs(molecules_y_test- molecules_y_pred))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "rc3nkPWlDQSf" }, "source": [ "## Regularization\n", "\n", "The more features we have, the more freedom the algorithm has to fit the data. For instance, when doing polinomial regression, the higher the degree of polinomious we add, the more closely the algorithm can build a curve the pass through the data. \n", "\n", "Unfortonatelly, this flexibility cames with a price: the model can learn very well the training set, but may fail in predicting correctly out of the sample data points. This process is known as **overfitting**. \n", "\n", "However, reducing the number of features (e.g., reducing the degree of the polonima in the polinomial regression), we may have very simple models, which are not able to fit the data. This process is known as **underfitting**.\n", "\n", "\n", "There is a trade-off between making the model more specific (capturing more information from the training set) or more geneal (aiming to generalize to out-of-the sample data). \n", "\n", "\n", "![](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/04/Screen-Shot-2018-04-03-at-7.52.01-PM-e1522832332857.png)\n", "\n", "Regularization is a way to try to control this trade-off, adding a penalty to the parameters of the model" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Tb79pVNGZR__" }, "source": [ "## Ridge Regression\n", "\n", "In Ridge-Regression, the regularization penalty is taken to be the L2-norm of the coefficients. The idea is not to remove features beforehand, but adding a penalty to the magnitude of coefficients of the regression. This make the coefficients \"compete\" to each other: if a coefficient increase, other have to decrease so that their sum does not increase to much. 
\n", "\n", "$$ \\mathbf{\\theta} = (X^\\intercal X - \\alpha I)^{-1} \\cdot X^\\intercal y$$\n", "\n", "Where $\\alpha$ is a parameter of the algorithm that controls the strenght of the regularization.\n", "\n", "The example below experiments with different values of $\\alpha$ for Rigid Regression\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean squared error for alpha 0.001 : 0.000924 \n", "Mean squared error for alpha 0.010 : 0.000940 \n", "Mean squared error for alpha 0.100 : 0.000945 \n", "Mean squared error for alpha 1.000 : 0.000952 \n", "Mean squared error for alpha 10.000 : 0.000969 \n", "Mean squared error for alpha 100.000 : 0.000986 \n" ] } ], "source": [ "from sklearn.linear_model import Ridge\n", "\n", "for alpha in [0.001, 0.01, 0.1, 1, 10, 100]:\n", " \n", " regr = Ridge(alpha)\n", " model = make_pipeline(PolynomialFeatures(2), regr)\n", " \n", " model.fit(molecules_X_train, molecules_y_train)\n", "\n", " molecules_y_pred = model.predict(molecules_X_test)\n", "\n", " print('Mean squared error for alpha %.3f : %.6f '\n", " % (alpha, mean_squared_error(molecules_y_test, molecules_y_pred)))\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "5B_KPSPMZgF0" }, "source": [ "## Kernels\n", "\n", "The models we have seen so far are linear or ponomial. This restriction limits their usage for data sets which do not fit in this family of functions. \n", "\n", "In machine learning, a \"kernel\" usually refers to the kernel trick, a method of using a linear classifier to solve a non-linear problem. It entails transforming linearly inseparable data like to (hopefully) linearly separable ones. \n", "\n", "\n", "\n", "\n", "The RBF kernel is one exemple of a kernel used in machine learning. 
The idea is to place a Gaussian centered on each point of the training set and evaluate this Gaussian at the other points:\n", "\n", "$$ k(x, x') = \\exp\\left(-\\frac{\\|x - x'\\|^2}{2\\sigma^2}\\right) $$\n", "\n", "Examples which are closer to the reference example are mapped to a higher value, whereas examples which are far away are mapped to a low value. \n", "\n", "\n", "\n", "The parameter $\\sigma$ controls the radius of the neighborhood around the reference example.\n", "\n", "A larger $\\sigma$ takes distant examples into account\n", "\n", "\n", "\n", "while a small $\\sigma$ takes only close examples into account.\n", "\n", "\n", "\n", "\n", "The next example shows the use of Kernel Ridge Regression." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": {}, "colab_type": "code", "id": "RyNK9VqMlGwM" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.kernel_ridge import KernelRidge\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.gaussian_process import GaussianProcessRegressor\n", "from sklearn.gaussian_process.kernels import WhiteKernel, ExpSineSquared\n", "\n", "\n", "###############################################################################\n", "# Generate sample data\n", "X = np.sort(5 * np.random.rand(40, 1), axis=0)\n", "y = np.sin(X).ravel()\n", "\n", "###############################################################################\n", "# Add noise to targets\n", "y[::5] += 3 * (0.5 - np.random.rand(8))\n", "\n", "\n", "plt.scatter(X, y, c='k', label='data')\n", "\n", "for gamma in [0.01,0.1,1,10]:\n", " kr = KernelRidge(kernel='rbf', gamma=gamma)\n", " kr.fit(X, y)\n", " kr_pred = kr.predict(X)\n", " plt.plot(X, kr_pred, label='RBF model - gamma '+str(gamma))\n", " \n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you have notice, the performance is very different depending on the value of $\\gamma$. Tools like the [TPOP](https://github.com/EpistasisLab/tpot) may be used to select the right parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Neural Networks\n", "\n", "Neural networks are modular models which pieces are connected according to some criteria. Pieces are called neurons and connections synapsis, due to the inspiration to the brain.\n", "\n", "The simplest neural network consists of only one neuron and is called a perceptron, as shown in the figure below:\n", "\n", "\n", "\n", "A perceptron has one input layer and one neuron. Input layer is responsible for receiving the inputs. The number of nodes in the input layer is equal to the number of features in the input dataset. 
Each input is multiplied by a weight (which is typically initialized with some random value) and the results are added together. \n", "\n", "The sum is then passed through an activation function. The activation function of a perceptron plays a role similar to the nucleus of a neuron in the human nervous system: it processes the information and yields an output. In the case of a perceptron, this output is the final outcome. \n", "\n", "\n", "### Multilayer Perceptron\n", "\n", "In the case of multilayer perceptrons, the output from the neurons in one layer serves as the input to the neurons of the next layer. Multilayer perceptrons, more commonly referred to as MLPs, are therefore a combination of multiple neurons connected in the form of a network. An artificial neural network has an input layer, one or more hidden layers, and an output layer. This is shown in the image below:\n", "\n", "\n", "\n", "A neural network executes in two phases: Feed-Forward and Back Propagation. In the feed-forward phase, the input is presented to the input layer and the signal is then propagated through the network to the output layer. In the back-propagation phase, which is used during training, the output is compared with the correct value, and the weights of the different neurons are updated so that the difference between the desired and predicted output becomes as small as possible.\n", "\n", "Let's try a neural net for predicting some properties of the molecules. MLP networks allow us to predict more than one output at once, so let's try that." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
feature_0feature_1feature_2feature_3feature_4feature_5feature_6feature_7feature_8feature_9...homolumogapr2zpveU0UHGCv
O3NC4H5134.38304973.27743165.56522041.20533937.97136028.56217022.14847519.5414560.3461830.246778...-0.2528-0.00770.24511012.18640.095653-435.725372-435.718189-435.717245-435.75726626.019
FONC6H4135.76212285.19293461.37565438.78915036.30910926.51595423.00382220.93152518.6641590.345873...-0.2604-0.07890.18151188.64410.089534-460.774434-460.767413-460.766468-460.80637725.848
O2N2C5H4139.21411374.41726949.82891645.67417037.00023826.23330924.53920423.07953619.0392770.343824...-0.2324-0.04680.1856945.79880.092456-452.744414-452.738132-452.737188-452.77571623.374
O3C6H4137.46688773.56536461.51838346.47332532.47677128.00542623.11813820.29464519.7725880.349248...-0.2670-0.08230.1848944.04260.090391-456.613010-456.606412-456.605468-456.64439224.902
FON3C4H4146.43197883.54877159.49413250.05566143.92324231.13762023.79579122.15140019.0936720.331518...-0.2192-0.03090.18821074.30780.090835-494.100913-494.093709-494.092765-494.13247127.708
\n", "

5 rows × 29 columns

\n", "
" ], "text/plain": [ " feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 \\\n", "O3NC4H5 134.383049 73.277431 65.565220 41.205339 37.971360 28.562170 \n", "FONC6H4 135.762122 85.192934 61.375654 38.789150 36.309109 26.515954 \n", "O2N2C5H4 139.214113 74.417269 49.828916 45.674170 37.000238 26.233309 \n", "O3C6H4 137.466887 73.565364 61.518383 46.473325 32.476771 28.005426 \n", "FON3C4H4 146.431978 83.548771 59.494132 50.055661 43.923242 31.137620 \n", "\n", " feature_6 feature_7 feature_8 feature_9 ... homo lumo \\\n", "O3NC4H5 22.148475 19.541456 0.346183 0.246778 ... -0.2528 -0.0077 \n", "FONC6H4 23.003822 20.931525 18.664159 0.345873 ... -0.2604 -0.0789 \n", "O2N2C5H4 24.539204 23.079536 19.039277 0.343824 ... -0.2324 -0.0468 \n", "O3C6H4 23.118138 20.294645 19.772588 0.349248 ... -0.2670 -0.0823 \n", "FON3C4H4 23.795791 22.151400 19.093672 0.331518 ... -0.2192 -0.0309 \n", "\n", " gap r2 zpve U0 U H \\\n", "O3NC4H5 0.2451 1012.1864 0.095653 -435.725372 -435.718189 -435.717245 \n", "FONC6H4 0.1815 1188.6441 0.089534 -460.774434 -460.767413 -460.766468 \n", "O2N2C5H4 0.1856 945.7988 0.092456 -452.744414 -452.738132 -452.737188 \n", "O3C6H4 0.1848 944.0426 0.090391 -456.613010 -456.606412 -456.605468 \n", "FON3C4H4 0.1882 1074.3078 0.090835 -494.100913 -494.093709 -494.092765 \n", "\n", " G Cv \n", "O3NC4H5 -435.757266 26.019 \n", "FONC6H4 -460.806377 25.848 \n", "O2N2C5H4 -452.775716 23.374 \n", "O3C6H4 -456.644392 24.902 \n", "FON3C4H4 -494.132471 27.708 \n", "\n", "[5 rows x 29 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "## variables to be predicted\n", "outputs = ['homo','lumo','gap']\n", "\n", "molecules_X = data[feature_names]\n", "molecules_y = np.array(data[outputs])\n", "\n", "\n", "# Split the data into training/testing sets (we will discuss this latter on)\n", "molecules_X_train 
= molecules_X[:-2000]\n", "molecules_X_test = molecules_X[-2000:]\n", "\n", "# Split the targets into training/testing sets (we will discuss this later on)\n", "molecules_y_train = molecules_y[:-2000]\n", "molecules_y_test = molecules_y[-2000:]" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.neural_network import MLPRegressor\n", "\n", "## two hidden layers, with 20 neurons each\n", "mlp = MLPRegressor(hidden_layer_sizes=(20,20),max_iter=1000)\n", "\n", "mlp.fit(molecules_X_train,molecules_y_train)\n", "\n", "molecules_y_pred = mlp.predict(molecules_X_test)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(2,3,figsize=(15,10))\n", "for i in range(3):\n", " ax[0,i].scatter(molecules_y_test[:,i],molecules_y_pred[:,i])\n", " ax[1,i].hist(abs(molecules_y_test[:,i]-molecules_y_pred[:,i]))\n", " ax[0,i].set_title(outputs[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification\n", "\n", "In classification, the target output is categorical instead of numerical. For example, we may predict a tumor as benign or malignant, or a piece of news in sports, economy, politics. In this part, we will use a neural network to predict the class of a flower named iris.\n", "\n", " " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "\n", "X_iris,y_iris = datasets.load_iris(return_X_y=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we split the data directly selecting a contiguous subset, we will likely choose a bised sample. Some classes may not be present in the train or test sets. So we will use s different spliting schema. 
" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(X_iris, y_iris, test_size = 0.20)\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "from sklearn.neural_network import MLPClassifier\n", "mlp = MLPClassifier(hidden_layer_sizes=(10, 10, 10), max_iter=1000)\n", "mlp.fit(X_iris_train, y_iris_train)\n", "\n", "y_iris_pred = mlp.predict(X_iris_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to evaluated the prediction, we could tabulate the correct and incorrect classification in a confusion matrix:\n", "\n", "\n", "\n", "where\n", "\n", "- True Positive:\n", "\n", " Interpretation: You predicted positive and it’s true.\n", " You predicted that a woman is pregnant and she actually is.\n", "\n", "- True Negative:\n", "\n", " Interpretation: You predicted negative and it’s true.\n", " You predicted that a man is not pregnant and he actually is not.\n", "\n", "- False Positive: (Type I Error)\n", "\n", " Interpretation: You predicted positive and it’s false.\n", " You predicted that a man is pregnant but he actually is not.\n", "\n", "- False Negative: (Type II Error)\n", "\n", " Interpretation: You predicted negative and it’s false.\n", " You predicted that a woman is not pregnant but she actually is." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[13 0 0]\n", " [ 0 5 1]\n", " [ 0 0 11]]\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import confusion_matrix, plot_confusion_matrix\n", "cm = confusion_matrix(y_iris_test,y_iris_pred)\n", "print(cm)\n", "plot_confusion_matrix(mlp, X_iris_test, y_iris_test,\n", " display_labels=['setosa','versicolor','virginica'],\n", " cmap=plt.cm.Blues)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the matrix, we can compute several performance measures, such as:\n", "\n", "- Precision: Out of all the positive classes we have predicted correctly, how many are actually positive.\n", "- Recall: Out of all the positive classes, how much we predicted correctly. It should be high as possible\n", "- Accuracy: Out of all the classes, how much we predicted correctly\n", "- F1: Harmonic mean of the Precision and Recall" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 1.00 1.00 1.00 13\n", " 1 1.00 0.83 0.91 6\n", " 2 0.92 1.00 0.96 11\n", "\n", " accuracy 0.97 30\n", " macro avg 0.97 0.94 0.96 30\n", "weighted avg 0.97 0.97 0.97 30\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report\n", "\n", "print(classification_report(y_iris_test,y_iris_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cross validation\n", "\n", "Using a single train/test split may have some problems dangers — what if the split we make isn’t random? What if one subset of our data has only people from a certain state, employees with a certain income level but not other income levels, only women or only people at a certain age? (imagine a file ordered by one of these). This will result in overfitting, even though we’re trying to avoid it! 
This is where cross validation comes in.\n", "\n", "The idea is to replace the single train/test split with a series of splits: in k-fold cross validation we split our data into k different subsets (or folds). We train the model on k-1 folds and hold out the remaining fold as test data, repeating this k times so that each fold is used for testing exactly once. We then average the scores obtained across the folds to estimate the model's performance.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_validate\n", "\n", "results = cross_validate(mlp,X_iris,y_iris,cv=5,scoring=['f1_macro','accuracy'])" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean accuracy: 0.967, Std accuracy: 0.030\n", "Mean F1: 0.967, Std F1: 0.030\n" ] } ], "source": [ "print(\"Mean accuracy: {0:.3f}, Std accuracy: {1:.3f}\".format(np.mean(results['test_accuracy']),\n", " np.std(results['test_accuracy'])))\n", "\n", "print(\"Mean F1: {0:.3f}, Std F1: {1:.3f}\".format(np.mean(results['test_f1_macro']),\n", " np.std(results['test_f1_macro'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learn more\n", "\n", "- [Machine learning for physicists (review)](https://arxiv.org/abs/1803.08823)\n", "- [Machine learning for physicists (online course)](https://machine-learning-for-physicists.org/)\n", "- [Sklearn](https://scikit-learn.org/)\n", "- [TensorFlow Playground](https://playground.tensorflow.org/)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "5. 
Supervised.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }