{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple Regression example\n", "Let's explore implementation of Multiple regression!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "\"\"\"\n", "Multiple regression model: yi = α + β1xi1 + … + βkxik + εi\n", "\n", "our independent variable x will be a list of vectors, each of which looks like this:\n", "[1, # constant term\n", " 49, # number of friends\n", " 4, # work hours per day\n", " 0] # doesn't have PhD\n", "\"\"\"\n", "from scratch.linear_algebra import dot, Vector\n", "\n", "def predict(x: Vector, beta: Vector) -> float:\n", " \"\"\"assumes that the first element of x is 1\"\"\"\n", " return dot(x, beta)\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Fitting the model\n", "We first implement error and squared error and then compute the betta using gradient decent algorithm." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from typing import List\n", "\n", "def error(x: Vector, y: float, beta: Vector) -> float:\n", " return predict(x, beta) - y\n", "\n", "def squared_error(x: Vector, y: float, beta: Vector) -> float:\n", " return error(x, y, beta) ** 2\n", "\n", "x = [1, 2, 3]\n", "y = 30\n", "beta = [4, 4, 4] # so prediction = 4 + 8 + 12 = 24\n", "\n", "assert error(x, y, beta) == -6\n", "assert squared_error(x, y, beta) == 36" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def sqerror_gradient(x: Vector, y: float, beta: Vector) -> Vector:\n", " err = error(x, y, beta)\n", " return [2 * err * x_i for x_i in x]\n", "\n", "assert sqerror_gradient(x, y, beta) == [-12, -24, -36]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The general least_squares_fit function\n", "At this point, we’re ready to find the optimal beta using gradient descent. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### The general least_squares_fit function\n", "At this point, we’re ready to find the optimal beta using gradient descent. Let’s first write out a least_squares_fit function that can work with any dataset." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import random\n", "import tqdm\n", "from scratch.linear_algebra import vector_mean\n", "from scratch.gradient_descent import gradient_step\n", "\n", "\n", "def least_squares_fit(xs: List[Vector],\n", " ys: List[float],\n", " learning_rate: float = 0.001,\n", " num_steps: int = 1000,\n", " batch_size: int = 1) -> Vector:\n", " \"\"\"\n", " Find the beta that minimizes the sum of squared errors\n", " assuming the model y = dot(x, beta).\n", " \"\"\"\n", " # Start with a random guess\n", " guess = [random.random() for _ in xs[0]]\n", "\n", " for _ in tqdm.trange(num_steps, desc=\"least squares fit\"):\n", " for start in range(0, len(xs), batch_size):\n", " batch_xs = xs[start:start+batch_size]\n", " batch_ys = ys[start:start+batch_size]\n", "\n", " gradient = vector_mean([sqerror_gradient(x, y, guess)\n", " for x, y in zip(batch_xs, batch_ys)])\n", " guess = gradient_step(guess, gradient, -learning_rate)\n", "\n", " return guess" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apply to data\n", "Now let's apply that model to our data." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "least squares fit: 100%|██████████| 5000/5000 [00:04<00:00, 1019.63it/s]\n" ] } ], "source": [ "from scratch.statistics import daily_minutes_good\n", "\n", "random.seed(0)\n", "# I used trial and error to choose num_steps and learning_rate.\n", "# This will run for a while.\n", "learning_rate = 0.001\n", "\n", "inputs: List[List[float]] = [[1.,49,4,0],[1,41,9,0],[1,40,8,0],[1,25,6,0],[1,21,1,0],[1,21,0,0],[1,19,3,0],[1,19,0,0],[1,18,9,0],[1,18,8,0],[1,16,4,0],[1,15,3,0],[1,15,0,0],[1,15,2,0],[1,15,7,0],[1,14,0,0],[1,14,1,0],[1,13,1,0],[1,13,7,0],[1,13,4,0],[1,13,2,0],[1,12,5,0],[1,12,0,0],[1,11,9,0],[1,10,9,0],[1,10,1,0],[1,10,1,0],[1,10,7,0],[1,10,9,0],[1,10,1,0],[1,10,6,0],[1,10,6,0],[1,10,8,0],[1,10,10,0],[1,10,6,0],[1,10,0,0],[1,10,5,0],[1,10,3,0],[1,10,4,0],[1,9,9,0],[1,9,9,0],[1,9,0,0],[1,9,0,0],[1,9,6,0],[1,9,10,0],[1,9,8,0],[1,9,5,0],[1,9,2,0],[1,9,9,0],[1,9,10,0],[1,9,7,0],[1,9,2,0],[1,9,0,0],[1,9,4,0],[1,9,6,0],[1,9,4,0],[1,9,7,0],[1,8,3,0],[1,8,2,0],[1,8,4,0],[1,8,9,0],[1,8,2,0],[1,8,3,0],[1,8,5,0],[1,8,8,0],[1,8,0,0],[1,8,9,0],[1,8,10,0],[1,8,5,0],[1,8,5,0],[1,7,5,0],[1,7,5,0],[1,7,0,0],[1,7,2,0],[1,7,8,0],[1,7,10,0],[1,7,5,0],[1,7,3,0],[1,7,3,0],[1,7,6,0],[1,7,7,0],[1,7,7,0],[1,7,9,0],[1,7,3,0],[1,7,8,0],[1,6,4,0],[1,6,6,0],[1,6,4,0],[1,6,9,0],[1,6,0,0],[1,6,1,0],[1,6,4,0],[1,6,1,0],[1,6,0,0],[1,6,7,0],[1,6,0,0],[1,6,8,0],[1,6,4,0],[1,6,2,1],[1,6,1,1],[1,6,3,1],[1,6,6,1],[1,6,4,1],[1,6,4,1],[1,6,1,1],[1,6,3,1],[1,6,4,1],[1,5,1,1],[1,5,9,1],[1,5,4,1],[1,5,6,1],[1,5,4,1],[1,5,4,1],[1,5,10,1],[1,5,5,1],[1,5,2,1],[1,5,4,1],[1,5,4,1],[1,5,9,1],[1,5,3,1],[1,5,10,1],[1,5,2,1],[1,5,2,1],[1,5,9,1],[1,4,8,1],[1,4,6,1],[1,4,0,1],[1,4,10,1],[1,4,5,1],[1,4,10,1],[1,4,9,1],[1,4,1,1],[1,4,4,1],[1,4,4,1],[1,4,0,1],[1,4,3,1],[1,4,1,1],[1,4,3,1],[1,4,2,1],[1,4,4,1],[1,4,4,1],[1,4,8,1],[1,4,2,1],[1,4,4,1],[1,3,2,1],[1,3,6,1],[1,3,4,1],[1,3,7,1],[1,3,4,1],[1,3,1,1],[1,3,10,1],[1,3,3,1],[1,3,4,1],[1,3,7,1],[1,3,5,1],[1,3,6,1],[1,3,1,1],[1,3,6,1],[1,3,10,1],[1,3,2,1],[1,3,4,1],[1,3,2,1],[1,3,1,1],[1,3,5,1],[1,2,4,1],[1,2,2,1],[1,2,8,1],[1,2,3,1],[1,2,1,1],[1,2,9,1],[1,2,10,1],[1,2,9,1],[1,2,4,1],[1,2,5,1],[1,2,0,1],[1,2,9,1],[1,2,9,1],[1,2,0,1],[1,2,1,1],[1,2,1,1],[1,2,4,1],[1,1,0,1],[1,1,2,1],[1,1,2,1],[1,1,5,1],[1,1,3,1],[1,1,10,1],[1,1,6,1],[1,1,0,1],[1,1,8,1],[1,1,6,1],[1,1,4,1],[1,1,9,1],[1,1,9,1],[1,1,4,1],[1,1,2,1],[1,1,9,1],[1,1,0,1],[1,1,8,1],[1,1,6,1],[1,1,1,1],[1,1,1,1],[1,1,5,1]]\n", "\n", "beta = least_squares_fit(inputs, daily_minutes_good, learning_rate, 5000, 25)\n", "assert 30.50 < beta[0] < 30.70 # constant\n", "assert 0.96 < beta[1] < 1.00 # num friends\n", "assert -1.89 < beta[2] < -1.85 # work hours per day\n", "assert 0.91 < beta[3] < 0.94 # has PhD" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "minutes=30.514795945185586 + 0.9748274277323267 friends -1.8506912934343662 workhours + 0.91407780744768 phd\n" ] } ], "source": [ "# Print out the fitted formula\n", "print(\"minutes=\"+str(beta[0])+\" + \"+str(beta[1])+\" friends \"+str(beta[2])+\" workhours + \"+str(beta[3])+\" phd\")\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### How good is your model?\n", "Again, we can use R-squared, the fraction of variation in the output that is captured by the model, to measure how well the model fits the data.\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from scratch.simple_linear_regression import total_sum_of_squares\n", "\n", "def multiple_r_squared(xs: List[Vector], ys: Vector, beta: Vector) -> float:\n", " sum_of_squared_errors = sum(error(x, y, beta) ** 2\n", " for x, y in zip(xs, ys))\n", " return 1.0 - sum_of_squared_errors / total_sum_of_squares(ys)\n", "\n", "assert 0.67 < multiple_r_squared(inputs, daily_minutes_good, beta) < 0.68" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regularization\n", "In practice, you’d often like to apply linear regression to datasets with large numbers of variables. This creates a couple of extra wrinkles.
\n", "First, the more variables you use, the more likely you are to overfit your model to the training set.
\n", "Second, the more nonzero coefficients you have, the harder it is to make sense of them.
\n", "Regularization is an approach in which we add to the error term a penalty that gets larger as beta gets larger. We then minimize the combined error and penalty. The more importance we place on the penalty term, the more we discourage large coefficients." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# alpha is a *hyperparameter* controlling how harsh the penalty is.\n", "# Sometimes it's called \"lambda\" but that already means something in Python.\n", "def ridge_penalty(beta: Vector, alpha: float) -> float:\n", " return alpha * dot(beta[1:], beta[1:])\n", "\n", "def squared_error_ridge(x: Vector,\n", " y: float,\n", " beta: Vector,\n", " alpha: float) -> float:\n", " \"\"\"estimate error plus ridge penalty on beta\"\"\"\n", " return error(x, y, beta) ** 2 + ridge_penalty(beta, alpha)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We can then plug this into gradient descent in the usual way:\n", "from scratch.linear_algebra import add\n", "\n", "def ridge_penalty_gradient(beta: Vector, alpha: float) -> Vector:\n", " \"\"\"gradient of just the ridge penalty\"\"\"\n", " return [0.] + [2 * alpha * beta_j for beta_j in beta[1:]]\n", "\n", "def sqerror_ridge_gradient(x: Vector,\n", " y: float,\n", " beta: Vector,\n", " alpha: float) -> Vector:\n", " \"\"\"\n", " the gradient corresponding to the ith squared error term\n", " including the ridge penalty\n", " \"\"\"\n", " return add(sqerror_gradient(x, y, beta),\n", " ridge_penalty_gradient(beta, alpha))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### In-class exercise 1\n", "Now we want to modify the least_squares_fit function to use the sqerror_ridge_gradient instead of sqerror_gradient. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "### In-class exercise 1\n", "Now we want to modify the least_squares_fit function to use sqerror_ridge_gradient instead of sqerror_gradient. Please implement the function least_squares_fit_ridge." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def least_squares_fit_ridge(xs: List[Vector],\n", " ys: List[float],\n", " alpha: float,\n", " learning_rate: float = 0.001,\n", " num_steps: int = 1000,\n", " batch_size: int = 1) -> Vector:\n", " \"\"\"\n", " Find the beta that minimizes the sum of squared errors\n", " plus the ridge penalty, assuming the model y = dot(x, beta).\n", " \"\"\"\n", " pass\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "'NoneType' object is not subscriptable", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[1;32m 4\u001b[0m learning_rate, 5000, 25)\n\u001b[1;32m 5\u001b[0m \u001b[0;31m# [30.51, 0.97, -1.85, 0.91]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0;32massert\u001b[0m \u001b[0;36m5\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mdot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbeta_0\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbeta_0\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0;36m6\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 7\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0;36m0.67\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mmultiple_r_squared\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minputs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdaily_minutes_good\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mbeta_0\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0;36m0.69\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mTypeError\u001b[0m: 'NoneType' object is not subscriptable" ] } ], "source": [ "# With alpha set to 0 there’s no penalty at all, and we should get the same results as before\n", "random.seed(0)\n", "beta_0 = least_squares_fit_ridge(inputs, daily_minutes_good, 0.0, # alpha\n", " learning_rate, 5000, 25)\n", "# [30.51, 0.97, -1.85, 0.91]\n", "assert 5 < dot(beta_0[1:], beta_0[1:]) < 6\n", "assert 0.67 < multiple_r_squared(inputs, daily_minutes_good, beta_0) < 0.69" ] },
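{ "cell_type": "markdown", "metadata": {}, "source": [ "The TypeError above is what happens while least_squares_fit_ridge is still left as pass. One possible solution sketch is given below: it reuses the least_squares_fit loop and simply swaps in sqerror_ridge_gradient. Once a working version is defined (this one or your own), the asserts above and the cells below should pass." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def least_squares_fit_ridge(xs: List[Vector],\n", " ys: List[float],\n", " alpha: float,\n", " learning_rate: float = 0.001,\n", " num_steps: int = 1000,\n", " batch_size: int = 1) -> Vector:\n", " \"\"\"\n", " Find the beta that minimizes the sum of squared errors\n", " plus the ridge penalty, assuming the model y = dot(x, beta).\n", " \"\"\"\n", " # Start with a random guess, exactly as in least_squares_fit\n", " guess = [random.random() for _ in xs[0]]\n", "\n", " for _ in tqdm.trange(num_steps, desc=\"least squares fit\"):\n", " for start in range(0, len(xs), batch_size):\n", " batch_xs = xs[start:start+batch_size]\n", " batch_ys = ys[start:start+batch_size]\n", "\n", " # the only change: the gradient now includes the ridge penalty\n", " gradient = vector_mean([sqerror_ridge_gradient(x, y, guess, alpha)\n", " for x, y in zip(batch_xs, batch_ys)])\n", " guess = gradient_step(guess, gradient, -learning_rate)\n", "\n", " return guess" ] },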
{ "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "least squares fit: 100%|██████████| 5000/5000 [00:06<00:00, 722.64it/s]\n", "least squares fit: 100%|██████████| 5000/5000 [00:06<00:00, 746.34it/s]\n", "least squares fit: 100%|██████████| 5000/5000 [00:06<00:00, 745.49it/s]\n" ] } ], "source": [ "# As we increase alpha, the goodness of fit gets worse, but the size of beta gets smaller\n", "\n", "beta_0_1 = least_squares_fit_ridge(inputs, daily_minutes_good, 0.1, # alpha\n", " learning_rate, 5000, 25)\n", "# [30.8, 0.95, -1.83, 0.54]\n", "assert 4 < dot(beta_0_1[1:], beta_0_1[1:]) < 5\n", "assert 0.67 < multiple_r_squared(inputs, daily_minutes_good, beta_0_1) < 0.69\n", "\n", "beta_1 = least_squares_fit_ridge(inputs, daily_minutes_good, 1, # alpha\n", " learning_rate, 5000, 25)\n", "# [30.6, 0.90, -1.68, 0.10]\n", "assert 3 < dot(beta_1[1:], beta_1[1:]) < 4\n", "assert 0.67 < multiple_r_squared(inputs, daily_minutes_good, beta_1) < 0.69\n", "\n", "beta_10 = least_squares_fit_ridge(inputs, daily_minutes_good, 10, # alpha\n", " learning_rate, 5000, 25)\n", "# [28.3, 0.67, -0.90, -0.01]\n", "assert 1 < dot(beta_10[1:], beta_10[1:]) < 2\n", "assert 0.5 < multiple_r_squared(inputs, daily_minutes_good, beta_10) < 0.6" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### In-class exercise 2\n", "Another approach is lasso regression, which uses the penalty shown below. Please explore this penalty and evaluate its performance.\n" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def lasso_penalty(beta: Vector, alpha: float) -> float:\n", " return alpha * sum(abs(beta_i) for beta_i in beta[1:])" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### In-class exercise 3\n", "scikit-learn has a linear_model module that provides a LinearRegression model similar to ours, as well as ridge regression, lasso regression, and other types of regularization. Please explore that module and apply it to our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }