{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c9c70eb0",
   "metadata": {},
   "source": [
    "# Implementing Linear Regression with `sklearn`\n",
    "\n",
    "## Objective:\n",
    "- Learn to implement simple and multiple linear regression models using `sklearn`.\n",
    "- Understand how to interpret the model coefficients and evaluate model performance.\n",
    "- Apply these concepts to the California Housing dataset to predict median house values.\n",
    "\n",
    "## Tools and Libraries Needed:\n",
    "- Python\n",
    "- Jupyter Notebook or any Python IDE\n",
    "- `sklearn` library\n",
    "- `pandas` for data manipulation\n",
    "- `matplotlib` and `seaborn` for data visualization\n",
    "\n",
    "## Dataset:\n",
    "- California Housing dataset (accessible directly via `sklearn.datasets`)\n",
    "\n",
    "## Duration:\n",
    "1 Hour\n",
    "\n",
    "---\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ef86d3e",
   "metadata": {},
   "source": [
    "### Part 1: Setting Up Your Environment\n",
    "1. Ensure Python and the necessary libraries (`sklearn`, `pandas`, `matplotlib`, `seaborn`) are installed.\n",
    "2. Start a new Jupyter Notebook.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a95f5ec",
   "metadata": {},
   "source": [
    "### Part 2: Loading and Exploring the Dataset\n",
    "1. Import necessary libraries:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "803fec41",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.datasets import fetch_california_housing\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cbd79d48",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "cdba39d7",
   "metadata": {},
   "source": [
    "2. Load the California Housing dataset and convert it into a DataFrame:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5684418f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "california = fetch_california_housing()\n",
    "df = pd.DataFrame(california.data, columns=california.feature_names)\n",
    "df['MedHouseVal'] = california.target"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64e5d6fc",
   "metadata": {},
   "source": [
    "3. Explore the dataset using `df.describe()` and `df.info()`.\n",
    "\n",
    "4. Visualize the relationship between `MedInc` (median income in the block group) and `MedHouseVal` (median house value in the block group, in units of 100,000 USD) using a scatter plot.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "dc3edc9e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \\\n",
      "0      8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   \n",
      "1      8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   \n",
      "2      7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   \n",
      "3      5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   \n",
      "4      3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   \n",
      "...       ...       ...       ...        ...         ...       ...       ...   \n",
      "20635  1.5603      25.0  5.045455   1.133333       845.0  2.560606     39.48   \n",
      "20636  2.5568      18.0  6.114035   1.315789       356.0  3.122807     39.49   \n",
      "20637  1.7000      17.0  5.205543   1.120092      1007.0  2.325635     39.43   \n",
      "20638  1.8672      18.0  5.329513   1.171920       741.0  2.123209     39.43   \n",
      "20639  2.3886      16.0  5.254717   1.162264      1387.0  2.616981     39.37   \n",
      "\n",
      "       Longitude  MedHouseVal  \n",
      "0        -122.23        4.526  \n",
      "1        -122.22        3.585  \n",
      "2        -122.24        3.521  \n",
      "3        -122.25        3.413  \n",
      "4        -122.25        3.422  \n",
      "...          ...          ...  \n",
      "20635    -121.09        0.781  \n",
      "20636    -121.21        0.771  \n",
      "20637    -121.22        0.923  \n",
      "20638    -121.32        0.847  \n",
      "20639    -121.24        0.894  \n",
      "\n",
      "[20640 rows x 9 columns]\n"
     ]
    }
   ],
   "source": [
    "print(df)\n"
   ]
  },
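  {
   "cell_type": "markdown",
   "id": "a7e1b2c3",
   "metadata": {},
   "source": [
    "Steps 3 and 4 could be sketched as follows. This is a minimal sketch that assumes `df`, `plt`, and `sns` from the cells above; the figure size, marker size, and transparency are arbitrary choices to reduce overplotting, not requirements of the lab.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8f2c3d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary statistics and column types\n",
    "print(df.describe())\n",
    "df.info()\n",
    "\n",
    "# Scatter plot of median income against median house value\n",
    "fig, ax = plt.subplots(figsize=(8, 5))\n",
    "sns.scatterplot(data=df, x='MedInc', y='MedHouseVal', s=10, alpha=0.3, ax=ax)\n",
    "ax.set_xlabel('MedInc (median income)')\n",
    "ax.set_ylabel('MedHouseVal (units of 100,000 USD)')\n",
    "ax.set_title('Median income vs. median house value')\n",
    "plt.show()"
   ]
  },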
  {
   "cell_type": "markdown",
   "id": "5fc121fe",
   "metadata": {},
   "source": [
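    "### Part 3: Simple Linear Regression\n",
    "1. Select `MedInc` as the single feature and `MedHouseVal` as the target.\n",
    "2. Split the data into training and test sets using `train_test_split`:\n"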
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "46628117",
   "metadata": {},
   "outputs": [],
   "source": [
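    "from sklearn.model_selection import train_test_split\n",
    "X = df[['MedInc']]  # Feature matrix (2D, one column)\n",
    "y = df['MedHouseVal']  # Target variable\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # assumed 80/20 split, fixed seed"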
   ]
  },
  {
   "cell_type": "markdown",
   "id": "01ae5c02",
   "metadata": {},
   "source": [
    "3. Train a simple linear regression model:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c252063f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# X_train and y_train are the training portion produced by train_test_split in step 2."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "bc965d45",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "lm = LinearRegression()\n",
    "lm.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6047ee38",
   "metadata": {},
   "source": [
    "4. Predict and evaluate the model using the test set:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00deb356",
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred = lm.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "478284ce",
   "metadata": {},
   "source": [
    "5. Plot the regression line over the test data.\n"
   ]
  },
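  {
   "cell_type": "markdown",
   "id": "c9a3d4e5",
   "metadata": {},
   "source": [
    "One possible way to draw the fitted line, assuming `X_test`, `y_test`, and `y_pred` exist from the previous steps. Because the model has a single feature, its predictions all lie on one straight line, so plotting `y_pred` against `MedInc` traces the regression line. The styling values are illustrative.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d0b4e5f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(8, 5))\n",
    "# Held-out observations\n",
    "ax.scatter(X_test['MedInc'], y_test, s=10, alpha=0.3, label='Test data')\n",
    "# Predictions of a one-feature linear model fall on a single line\n",
    "ax.plot(X_test['MedInc'], y_pred, color='red', linewidth=2, label='Fitted line')\n",
    "ax.set_xlabel('MedInc')\n",
    "ax.set_ylabel('MedHouseVal')\n",
    "ax.legend()\n",
    "plt.show()"
   ]
  },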
  {
   "cell_type": "markdown",
   "id": "952ed6bd",
   "metadata": {},
   "source": [
    "### Part 4: Multiple Linear Regression\n",
    "1. For multiple linear regression, select additional features (e.g., `HouseAge`, `AveRooms`) and repeat the process of splitting, training, and evaluating the model.\n"
   ]
  },
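  {
   "cell_type": "markdown",
   "id": "e1c5f6a7",
   "metadata": {},
   "source": [
    "A sketch of how this could look, assuming `df` from Part 2. The feature list, the 80/20 split, and `random_state=42` are illustrative choices, and the `_multi` variable names are hypothetical, chosen so the simple-regression variables are not overwritten.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2d6a7b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.metrics import mean_squared_error, r2_score\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "features = ['MedInc', 'HouseAge', 'AveRooms']  # illustrative choice\n",
    "X_multi = df[features]\n",
    "y_multi = df['MedHouseVal']\n",
    "\n",
    "X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(\n",
    "    X_multi, y_multi, test_size=0.2, random_state=42)\n",
    "\n",
    "lm_multi = LinearRegression()\n",
    "lm_multi.fit(X_train_multi, y_train_multi)\n",
    "y_pred_multi = lm_multi.predict(X_test_multi)\n",
    "\n",
    "# Each coefficient is the expected change in MedHouseVal per unit change\n",
    "# in that feature, holding the other features fixed.\n",
    "for name, coef in zip(features, lm_multi.coef_):\n",
    "    print(f'{name}: {coef:.4f}')\n",
    "print(f'Intercept: {lm_multi.intercept_:.4f}')\n",
    "print(f'MSE: {mean_squared_error(y_test_multi, y_pred_multi):.4f}')\n",
    "print(f'R-squared: {r2_score(y_test_multi, y_pred_multi):.4f}')"
   ]
  },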
  {
   "cell_type": "markdown",
   "id": "81802eac",
   "metadata": {},
   "source": [
    "### Part 5: Model Evaluation\n",
    "1. Evaluate the model performance using metrics such as Mean Squared Error (MSE) and R-squared:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ea50185",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_squared_error, r2_score\n",
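    "mse = mean_squared_error(y_test, y_pred)  # average squared error on the test set\n",
    "r2 = r2_score(y_test, y_pred)  # fraction of target variance explained\n",
    "print(f'MSE: {mse:.4f}')\n",
    "print(f'R-squared: {r2:.4f}')"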
   ]
  },
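  {
   "cell_type": "markdown",
   "id": "a3e7b8c9",
   "metadata": {},
   "source": [
    "To make these numbers easier to discuss, the following sketch converts the MSE to a root mean squared error (RMSE) in the target's own units and prints the fitted slope and intercept. It assumes `lm` and `mse` from the cells above. `MedHouseVal` is expressed in units of 100,000 USD, so the RMSE is scaled accordingly.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4f8c9d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rmse = np.sqrt(mse)  # same units as MedHouseVal\n",
    "print(f'RMSE: {rmse:.3f} (about {rmse * 100_000:,.0f} USD)')\n",
    "\n",
    "# Slope: expected change in MedHouseVal per one-unit increase in MedInc\n",
    "print(f'Slope: {lm.coef_[0]:.3f}')\n",
    "# Intercept: predicted MedHouseVal when MedInc is zero\n",
    "print(f'Intercept: {lm.intercept_:.3f}')"
   ]
  },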
  {
   "cell_type": "markdown",
   "id": "21513cff",
   "metadata": {},
   "source": [
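    "2. Discuss what these metrics imply: MSE is the average squared prediction error (in squared target units, lower is better), and R-squared is the fraction of the variance in `MedHouseVal` that the model explains.\n",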
    "\n",
    "---\n",
    "\n",
    "## Deliverable:\n",
    "- Students will submit a Jupyter notebook containing their code, visualizations, and a brief analysis of the model's performance.\n",
    "\n",
    "## Assessment:\n",
    "- Successful implementation of simple and multiple linear regression models.\n",
    "- Ability to interpret model outputs and evaluate model performance.\n",
    "- Engagement in class discussions about the challenges faced during the lab and potential solutions.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c832933a",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}