{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Decision Tree example\n", "Let's explore implementation of Decision Tree!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from typing import List\n", "import math\n", "\n", "def entropy(class_probabilities: List[float]) -> float:\n", " \"\"\"Given a list of class probabilities, compute the entropy\"\"\"\n", " return sum(-p * math.log(p, 2)\n", " for p in class_probabilities\n", " if p > 0) # ignore zero probabilities\n", "\n", "assert entropy([1.0]) == 0\n", "assert entropy([0.5, 0.5]) == 1\n", "assert 0.81 < entropy([0.25, 0.75]) < 0.82\n", "\n", "from typing import Any\n", "from collections import Counter\n", "\n", "def class_probabilities(labels: List[Any]) -> List[float]:\n", " total_count = len(labels)\n", " return [count / total_count\n", " for count in Counter(labels).values()]\n", "\n", "def data_entropy(labels: List[Any]) -> float:\n", " return entropy(class_probabilities(labels))\n", "\n", "assert data_entropy(['a']) == 0\n", "assert data_entropy([True, False]) == 1\n", "assert data_entropy([3, 4, 4, 4]) == entropy([0.25, 0.75])\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Entropy of a partition\n", "One problem with this approach is that partitioning by an attribute with many different values will result in a very low entropy due to overfitting." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def partition_entropy(subsets: List[List[Any]]) -> float:\n", " \"\"\"Returns the entropy from this partition of data into subsets\"\"\"\n", " total_count = sum(len(subset) for subset in subsets)\n", "\n", " return sum(data_entropy(subset) * len(subset) / total_count\n", " for subset in subsets)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating a decision tree\n", "The interviewee data, consisting of (per your specification) a NamedTuple of the relevant attributes for each candidate—her level, her preferred language, whether she is active on Twitter, whether she has a PhD, and whether she interviewed well." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from typing import NamedTuple, Optional\n", "\n", "class Candidate(NamedTuple):\n", " level: str\n", " lang: str\n", " tweets: bool\n", " phd: bool\n", " did_well: Optional[bool] = None # allow unlabeled data\n", "\n", " # level lang tweets phd did_well\n", "inputs = [Candidate('Senior', 'Java', False, False, False),\n", " Candidate('Senior', 'Java', False, True, False),\n", " Candidate('Mid', 'Python', False, False, True),\n", " Candidate('Junior', 'Python', False, False, True),\n", " Candidate('Junior', 'R', True, False, True),\n", " Candidate('Junior', 'R', True, True, False),\n", " Candidate('Mid', 'R', True, True, True),\n", " Candidate('Senior', 'Python', False, False, False),\n", " Candidate('Senior', 'R', True, False, True),\n", " Candidate('Junior', 'Python', True, False, True),\n", " Candidate('Senior', 'Python', True, True, True),\n", " Candidate('Mid', 'Python', False, True, True),\n", " Candidate('Mid', 'Java', True, False, True),\n", " Candidate('Junior', 'Python', False, True, False)\n", " ]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "defaultdict(, {'Senior': [Candidate(level='Senior', lang='Java', tweets=False, phd=False, did_well=False), Candidate(level='Senior', lang='Java', tweets=False, phd=True, did_well=False), Candidate(level='Senior', lang='Python', tweets=False, phd=False, did_well=False), Candidate(level='Senior', lang='R', tweets=True, phd=False, did_well=True), Candidate(level='Senior', lang='Python', tweets=True, phd=True, did_well=True)], 'Mid': [Candidate(level='Mid', lang='Python', tweets=False, phd=False, did_well=True), Candidate(level='Mid', lang='R', tweets=True, phd=True, did_well=True), Candidate(level='Mid', lang='Python', tweets=False, phd=True, did_well=True), Candidate(level='Mid', lang='Java', tweets=True, phd=False, did_well=True)], 'Junior': [Candidate(level='Junior', lang='Python', tweets=False, phd=False, did_well=True), Candidate(level='Junior', lang='R', tweets=True, phd=False, did_well=True), Candidate(level='Junior', lang='R', tweets=True, phd=True, did_well=False), Candidate(level='Junior', lang='Python', tweets=True, phd=False, did_well=True), Candidate(level='Junior', lang='Python', tweets=False, phd=True, did_well=False)]})\n", "level 0.6935361388961919\n", "lang 0.8601317128547441\n", "tweets 0.7884504573082896\n", "phd 0.8921589282623617\n" ] } ], "source": [ "from typing import Dict, TypeVar\n", "from collections import defaultdict\n", "\n", "T = TypeVar('T') # generic type for inputs\n", "\n", "def partition_by(inputs: List[T], attribute: str) -> Dict[Any, List[T]]:\n", " \"\"\"Partition the inputs into lists based on the specified attribute.\"\"\"\n", " partitions: Dict[Any, List[T]] = defaultdict(list)\n", " for input in inputs:\n", " key = getattr(input, attribute) # value of the specified attribute\n", " partitions[key].append(input) # add input to the correct partition\n", " return partitions\n", "\n", "print(partition_by(inputs, 'level'))\n", "\n", "def partition_entropy_by(inputs: List[Any],\n", " attribute: str,\n", " label_attribute: str) -> float:\n", " \"\"\"Compute the entropy corresponding to the given partition\"\"\"\n", " # partitions consist of our inputs\n", " partitions = partition_by(inputs, attribute)\n", "\n", " # but partition_entropy needs just the class labels\n", " labels = [[getattr(input, 
label_attribute) for input in partition]\n", " for partition in partitions.values()]\n", "\n", " return partition_entropy(labels)\n", "\n", "for key in ['level','lang','tweets','phd']:\n", " print(key, partition_entropy_by(inputs, key, 'did_well'))\n", "\n", "assert 0.69 < partition_entropy_by(inputs, 'level', 'did_well') < 0.70\n", "assert 0.86 < partition_entropy_by(inputs, 'lang', 'did_well') < 0.87\n", "assert 0.78 < partition_entropy_by(inputs, 'tweets', 'did_well') < 0.79\n", "assert 0.89 < partition_entropy_by(inputs, 'phd', 'did_well') < 0.90\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 'level' gave the lowest entropy in the previous step, so our first split is on level.\n", "# Now we need to build a subtree for each possible level value.\n", "senior_inputs = [input for input in inputs if input.level == 'Senior']\n", "\n", "assert 0.4 == partition_entropy_by(senior_inputs, 'lang', 'did_well')\n", "assert 0.0 == partition_entropy_by(senior_inputs, 'tweets', 'did_well')\n", "assert 0.95 < partition_entropy_by(senior_inputs, 'phd', 'did_well') < 0.96\n", "\n", "# So which attribute should we pick now? 'tweets', with zero entropy.\n", "# We split on tweets and keep building the tree the same way." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<img src=\"https://learning.oreilly.com/library/view/data-science-from/9781492041122/assets/dsf2_1703.png\"/>" ], "text/plain": [ "<IPython.core.display.Image object>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import Image\n", "Image(url=\"https://learning.oreilly.com/library/view/data-science-from/9781492041122/assets/dsf2_1703.png\")\n", "# this is the decision tree you finally end up with" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Putting It All Together\n", "Now that we’ve seen how the algorithm works, we would like to implement it more generally. \n", "
We define a tree to be either:\n", "
* a Leaf (that predicts a single value), or\n", "
* a Split (containing an attribute to split on, subtrees for specific values of that attribute, and possibly a default value to use if we see an unknown value)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from typing import NamedTuple, Union, Any\n", "\n", "class Leaf(NamedTuple):\n", " value: Any\n", "\n", "class Split(NamedTuple):\n", " attribute: str\n", " subtrees: dict\n", " default_value: Any = None\n", "\n", "DecisionTree = Union[Leaf, Split]\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# With this representation, our hiring tree would look like:\n", "\n", "hiring_tree = Split('level', { # First, consider \"level\".\n", " 'Junior': Split('phd', { # if level is \"Junior\", next look at \"phd\"\n", " False: Leaf(True), # if \"phd\" is False, predict True\n", " True: Leaf(False) # if \"phd\" is True, predict False\n", " }),\n", " 'Mid': Leaf(True), # if level is \"Mid\", just predict True\n", " 'Senior': Split('tweets', { # if level is \"Senior\", look at \"tweets\"\n", " False: Leaf(False), # if \"tweets\" is False, predict False\n", " True: Leaf(True) # if \"tweets\" is True, predict True\n", " })\n", "})\n", "\n", "# There’s still the question of what to do if we encounter an unexpected (or missing) attribute value.\n", "# What should our hiring tree do if it encounters a candidate whose level is Intern?\n", "# We’ll handle this case by populating the default_value attribute with the most common label." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classify and build tree\n", "\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def classify(tree: DecisionTree, input: Any) -> Any:\n", " \"\"\"Classify the input using the given decision tree.\"\"\"\n", "\n", " # If this is a leaf node, return its value\n", " if isinstance(tree, Leaf):\n", " return tree.value\n", "\n", " # Otherwise this tree consists of an attribute to split on\n", " # and a dictionary whose keys are values of that attribute\n", " # and whose values are subtrees to consider next\n", " subtree_key = getattr(input, tree.attribute)\n", "\n", " if subtree_key not in tree.subtrees: # If no subtree for key,\n", " return tree.default_value # return the default value.\n", "\n", " subtree = tree.subtrees[subtree_key] # Choose the appropriate subtree\n", " return classify(subtree, input) # and use it to classify the input."
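] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see `classify` in action on the hand-built `hiring_tree` (a minimal sketch; the candidates below are illustrative, not from the original text): known levels are routed down the matching subtree, while an unseen level such as \"Intern\" falls back to `default_value`, which is `None` here because we never set one." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative candidates (hypothetical, not part of the interviewee data)\n", "\n", "# Junior without a PhD -> the 'phd' subtree predicts True\n", "assert classify(hiring_tree, Candidate('Junior', 'Python', True, False))\n", "\n", "# Senior who tweets -> the 'tweets' subtree predicts True\n", "assert classify(hiring_tree, Candidate('Senior', 'R', True, False))\n", "\n", "# An unseen level has no subtree, so classify returns default_value (None here)\n", "assert classify(hiring_tree, Candidate('Intern', 'Java', True, True)) is None"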
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def build_tree_id3(inputs: List[Any],\n", " split_attributes: List[str],\n", " target_attribute: str) -> DecisionTree:\n", " # Count target labels\n", " label_counts = Counter(getattr(input, target_attribute)\n", " for input in inputs)\n", " most_common_label = label_counts.most_common(1)[0][0]\n", "\n", " # If there's a unique label, predict it\n", " if len(label_counts) == 1:\n", " return Leaf(most_common_label)\n", "\n", " # If no split attributes left, return the majority label\n", " if not split_attributes:\n", " return Leaf(most_common_label)\n", "\n", " # Otherwise split by the best attribute\n", "\n", " def split_entropy(attribute: str) -> float:\n", " \"\"\"Helper function for finding the best attribute\"\"\"\n", " return partition_entropy_by(inputs, attribute, target_attribute)\n", "\n", " best_attribute = min(split_attributes, key=split_entropy)\n", "\n", " partitions = partition_by(inputs, best_attribute)\n", " new_attributes = [a for a in split_attributes if a != best_attribute]\n", "\n", " # recursively build the subtrees\n", " subtrees = {attribute_value : build_tree_id3(subset,\n", " new_attributes,\n", " target_attribute)\n", " for attribute_value, subset in partitions.items()}\n", "\n", " return Split(best_attribute, subtrees, default_value=most_common_label)\n", "\n", "tree = build_tree_id3(inputs,\n", " ['level', 'lang', 'tweets', 'phd'],\n", " 'did_well')\n", "\n", "# Should predict True\n", "assert classify(tree, Candidate(\"Junior\", \"Java\", True, False))\n", "\n", "\n", "flowerTree(\"short\", \"medium\", \"short\",\"short\")\n", "\n", "# Should predict False\n", "assert not classify(tree, Candidate(\"Junior\", \"Java\", True, True))\n", "\n", "# data with unexpected values\n", "# Should predict True\n", "assert classify(tree, Candidate(\"Intern\", \"Java\", True, True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Home work practice in pair\n", "1. Please explore and understand the decision tree theory and code
\n", "2. (8 points)We explored the iris dataset (you could load it from sklearn) using several machine learning model, now could you please use the code we learned today about decision tree to train and make predictions for the following: \n", "
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n", "
[6.0, 3.0, 5.0, 0.6]\n", "
[6.0, 3.0, 5.0, 1.6]\n", "
[6.0, 3.0, 5.0, 2.6]\n", "
Hint: the attributes in the iris dataset are floating-point numbers, but you can always convert them into classes by binning. For example, petal width could be split into three classes: [0, 0.8], (0.8, 1.75], and (1.75, infinity). \n", "3. (2 points) Sklearn has a great tutorial about decision trees (https://scikit-learn.org/stable/modules/tree.html). Could you train a model on the iris dataset and make the same predictions as in the previous question? \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }