





























"metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Homework 2: Desperately Seeking Silver" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due Thursday, Oct 3, 11:59 PM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
"\n", "The data for this homework can be found at [this link] (https://www.dropbox.com/s/vng5x10b837ahnc/hw2_data.zip). Download it to the same folder where you are running this notebook, and uncompress it. You should find the following files there:\n", "\n", "1. us-states.json\n", "2. electoral_votes.csv\n", "3. predictwise.csv\n", "4. g12.csv\n", "5. g08.csv\n", "6. 2008results.csv\n", "7. nat.csv\n", "8. p04.csv\n", "9. 2012results.csv\n", "10. cleaned-state_data2012.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Setup and Plotting code" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "from collections import defaultdict\n", "import json\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "from matplotlib import rcParams\n", "import matplotlib.cm as cm\n", "import matplotlib as mpl\n", "\n", "#colorbrewer2 Dark2 qualitative color table\n", "dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),\n", " (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),\n", " (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),\n", " (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),\n", " (0.4, 0.6509803921568628, 0.11764705882352941),\n", " (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),\n", " (0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]\n", "\n", "rcParams['figure.figsize'] = (10, 6)\n", "rcParams['figure.dpi'] = 150\n", "rcParams['axes.color_cycle'] = dark2_colors\n", "rcParams['lines.linewidth'] = 2\n", "rcParams['axes.facecolor'] = 'white'\n", "rcParams['font.size'] = 14\n", "rcParams['patch.edgecolor'] = 'white'\n", "rcParams['patch.facecolor'] = dark2_colors[0]\n",
" 'IN': 'Indiana',\n", " 'KS': 'Kansas',\n", " 'KY': 'Kentucky',\n", " 'LA': 'Louisiana',\n", " 'MA': 'Massachusetts',\n", " 'MD': 'Maryland',\n", " 'ME': 'Maine',\n", " 'MI': 'Michigan',\n", " 'MN': 'Minnesota',\n", " 'MO': 'Missouri',\n", " 'MP': 'Northern Mariana Islands',\n", " 'MS': 'Mississippi',\n", " 'MT': 'Montana',\n", " 'NA': 'National',\n", " 'NC': 'North Carolina',\n", " 'ND': 'North Dakota',\n", " 'NE': 'Nebraska',\n", " 'NH': 'New Hampshire',\n", " 'NJ': 'New Jersey',\n", " 'NM': 'New Mexico',\n", " 'NV': 'Nevada',\n", " 'NY': 'New York',\n", " 'OH': 'Ohio',\n", " 'OK': 'Oklahoma',\n", " 'OR': 'Oregon',\n", " 'PA': 'Pennsylvania',\n", " 'PR': 'Puerto Rico',\n", " 'RI': 'Rhode Island',\n", " 'SC': 'South Carolina',\n", " 'SD': 'South Dakota',\n", " 'TN': 'Tennessee',\n", " 'TX': 'Texas',\n", " 'UT': 'Utah',\n", " 'VA': 'Virginia',\n", " 'VI': 'Virgin Islands',\n", " 'VT': 'Vermont',\n", " 'WA': 'Washington',\n", " 'WI': 'Wisconsin',\n", " 'WV': 'West Virginia',\n", " 'WY': 'Wyoming'\n", "}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is some code to plot [State Chloropleth] (http://en.wikipedia.org/wiki/Choropleth_map) maps in matplotlib. make_map
is the function you will use." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#adapted from https://github.com/dataiap/dataiap/blob/master/resources/util/map_util.py\n", "\n", "#load in state geometry\n",
"state2poly = defaultdict(list)\n", "\n", "data = json.load(file("data/us-states.json"))\n", "for f in data['features']:\n", " state = states_abbrev[f['id']]\n", " geo = f['geometry']\n", " if geo['type'] == 'Polygon':\n", " for coords in geo['coordinates']:\n", " state2poly[state].append(coords)\n", " elif geo['type'] == 'MultiPolygon':\n", " for polygon in geo['coordinates']:\n", " state2poly[state].extend(polygon)\n", "\n", " \n", "def draw_state(plot, stateid, **kwargs):\n", " """\n", " draw_state(plot, stateid, color=..., *kwargs)\n", " \n", " Automatically draws a filled shape representing the state in\n", " subplot.\n", " The color keyword argument specifies the fill color. It accepts keyword\n", " arguments that plot() accepts\n", " """\n", " for polygon in state2poly[stateid]:\n", " xs, ys = zip(polygon)\n", " plot.fill(xs, ys, **kwargs)\n", "\n", " \n", "def make_map(states, label):\n", " """\n", " Draw a cloropleth map, that maps data onto the United States\n", " \n", " Inputs\n", " -------\n", " states : Column of a DataFrame\n", " The value for each state, to display on a map\n", " label : str\n", " Label of the color bar\n", "\n", " Returns\n", " --------\n", " The map\n", " """\n", " fig = plt.figure(figsize=(12, 9))\n", " ax = plt.gca()\n", "\n", " if states.max() < 2: # colormap for election probabilities \n", " cmap = cm.RdBu\n", " vmin, vmax = 0, 1\n", " else: # colormap for electoral votes\n", " cmap = cm.binary\n", " vmin, vmax = 0, states.max()\n", " norm = mpl.colors.Normalize(vmin=vmin, vmax=vmax)\n", " \n", " skip = set(['National', 'District of Columbia', 'Guam', 'Puerto Rico',\n", " 'Virgin Islands', 'American Samoa', 'Northern Mariana Islands'])\n", " for state in states_abbrev.values():\n", " if state in skip:\n", " continue\n", " color = cmap(norm(states.ix[state]))\n",
winner of the most votes in Maine and Nebraska gets ALL the electoral college votes there.)

Here is the electoral vote breakdown by state:

As a matter of convention, we will index all our dataframes by the state name.

```python
electoral_votes = pd.read_csv("data/electoral_votes.csv").set_index('State')
electoral_votes.head()
```

To illustrate the use of `make_map`, we plot the Electoral College:

```python
make_map(electoral_votes.Votes, "Electoral Votes");
```

## Question 1: Simulating elections

#### The PredictWise Baseline
"cell_type": "markdown", "metadata": {}, "source": [ "We will start by examining a successful forecast that [PredictWise] (http://www.predictwise.com/results/2012/president) made on October 2, 2012. This will give us a point of comparison for our own forecast models.\n", "\n", "PredictWise aggregated polling data and, for each state, estimated the probability that the Obama or Romney would win. Here are those estimated probabilities:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "predictwise = pd.read_csv('data/predictwise.csv').set_index('States')\n", "predictwise.head()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.1 Each row is the probability predicted by Predictwise that Romney or Obama would win a state. The votes column lists the number of electoral college votes in that state. Use make_map
to plot a map of the probability that Obama wins each state, according to this prediction." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Later on in this homework we will explore some approaches to estimating probabilities like these and quatifying our uncertainty about them. But for the time being, we will focus on how to make a prediction assuming these probabilities are known.\n", "\n", "Even when we assume the win probabilities in each state are known, there is still uncertainty left in the election. We will use simulations from a simple probabilistic model to characterize this uncertainty. From these simulations, we will be able to make a prediction about the expected outcome of the election, and make a statement about how sure we are about it.\n", "\n", "1.2 We will assume that the outcome in each state is the result of an independent coin flip whose probability of coming up Obama is given by a Dataframe of state-wise win probabilities. *Write a function that uses this
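The rest of this function specification is cut off in this excerpt. As a rough sketch of the kind of function being asked for -- assuming, as the later cells suggest, that the model DataFrame carries an `Obama` win-probability column and a `Votes` electoral-vote column, and that the function returns one Obama electoral-vote total per simulated election -- something like the following would work:

```python
def simulate_election(model, n_sim):
    """
    Simulate n_sim elections, treating each state as an independent coin flip
    that comes up Obama with probability model.Obama.

    Returns a NumPy array of length n_sim holding the number of electoral
    college votes Obama wins in each simulated election.
    """
    # one uniform draw per (state, simulation) pair; Obama carries a state
    # whenever the draw falls below that state's win probability
    draws = np.random.uniform(size=(len(model), n_sim))
    obama_wins = draws < model.Obama.values.reshape(-1, 1)
    # weight each win by the state's electoral votes and sum over states
    return (obama_wins * model.Votes.values.reshape(-1, 1)).sum(axis=0)

# assumed usage, based on the cells that follow:
# result = simulate_election(predictwise, 10000)
# print (result >= 269).mean()   # fraction of simulations in which Obama wins
```

Treating 269 votes as the victory threshold follows the description of `plot_simulation` below.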
"collapsed": false, "input": [ "#compute the probability of an Obama win, given this simulation\n", "#Your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.3 Now, write a function called plot_simulation
to visualize the simulation. This function should:\n", "\n", "* Build a histogram from the result of simulate_election\n", "* Overplot the "victory threshold" of 269 votes as a vertical black line (hint: use axvline)\n", "* Overplot the result (Obama winning 332 votes) as a vertical red line\n", "* Compute the number of votes at the 5th and 95th quantiles, and display the difference (this is an estimate of the outcome's uncertainty)\n", "* Display the probability of an Obama victory \n", " " ] }, { "cell_type": "code", "collapsed": false, "input": [ """"\n", "Function\n", "--------\n", "plot_simulation\n", "\n", "Inputs\n", "------\n", "simulation: Numpy array with n_sim (see simulate_election) elements\n", " Each element stores the number of electoral college votes Obama wins in each simulation.\n", " \n", "Returns\n", "-------\n", "Nothing \n", """"\n", "\n", "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets plot the result of the Predictwise simulation. Your plot should look something like this:\n", "\n", "<img src="http://i.imgur.com/uCOFXHp.png">"
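A minimal sketch of one possible `plot_simulation` implementation, following the bullet points above (the bin edges and label text are illustrative choices, not part of the assignment):

```python
def plot_simulation(simulation):
    """Histogram of simulated Obama electoral-vote totals, annotated as described above."""
    plt.hist(simulation, bins=np.arange(200, 540, 2), align='left', label='simulations')
    plt.axvline(269, color='k', label='Victory threshold (269)')
    plt.axvline(332, color='r', label='Actual outcome (332)')
    p05, p95 = np.percentile(simulation, [5, 95])
    spread = p95 - p05
    p_win = (simulation >= 269).mean()
    plt.title("Chance of Obama victory: %.1f%%, spread: %d votes" % (100 * p_win, spread))
    plt.xlabel("Obama electoral college votes")
    plt.ylabel("Number of simulations")
    plt.legend(frameon=False, loc='upper left')
```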
"cell_type": "code", "collapsed": false, "input": [ "plot_simulation(result)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Evaluating and Validating our Forecast\n", "\n", "The point of creating a probabilistic predictive model is to simultaneously make a forecast and give an estimate of how certain we are about it. \n", "\n", "However, in order to trust our prediction or our reported level of uncertainty, the model needs to be correct. We say a model is correct if it honestly accounts for all of the mechanisms of variation in the system we're forecasting.\n", "\n", "In this section, we evaluate our prediction to get a sense of how useful it is, and we validate the predictive model by comparing it to real data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.4 Suppose that we believe the model is correct. Under this assumption, we can evaluate our prediction by characterizing its accuracy and precision (see [here] (http://celebrating200years.noaa.gov/magazine/tct/accuracy_vs_precision_556.jpg) for an illustration of these ideas). What does the above plot reveal about the accuracy and precision of the PredictWise model?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your Answer Here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.5 Unfortunately, we can never be absolutely sure that a model is correct, just as we can never be absolutely sure that the sun will rise tomorrow. But we can test a model by making predictions assuming that it is true and comparing it to real events -- this constitutes a hypothesis test. After testing a large number of predictions, if we find no evidence that says the model is wrong, we can have some degree of confidence that the model is right (the same reason we're still quite confident about the sun being here tomorrow). We call this process model checking, and use it to validate our
will vote for the majority party. Implement this simple forecast.

```python
"""
Function
--------
simple_gallup_model

A simple forecast that predicts an Obama (Democratic) victory with
0 or 100% probability, depending on whether a state
leans Republican or Democrat.

Inputs
------
gallup : DataFrame
    The Gallup dataframe above

Returns
-------
model : DataFrame
    A dataframe with the following column
     * Obama: probability that the state votes for Obama. All values should be 0 or 1
    model.index should be set to gallup.index (that is, it should be indexed by state name)

Examples
---------
>>> simple_gallup_model(gallup_2012).ix['Florida']
Obama    1
Name: Florida, dtype: float64
>>> simple_gallup_model(gallup_2012).ix['Arizona']
Obama    0
Name: Arizona, dtype: float64
"""

#your code here
```

Now, we run the simulation with this model, and plot it.

```python
model = simple_gallup_model(gallup_2012)
model = model.join(electoral_votes)
prediction = simulate_election(model, 10000)

plot_simulation(prediction)
plt.show()
make_map(model.Obama, "P(Obama): Simple Model")
```

1.7 Attempt to validate the predictive model using the above simulation histogram. Does the evidence contradict the predictive model?

Your answer here

#### Adding Polling Uncertainty to the Predictive Model

The model above is brittle -- it includes no accounting for uncertainty, and thus makes predictions with 100% confidence. This is clearly wrong -- there are numerous sources of uncertainty in estimating election outcomes from a poll of affiliations.

The most obvious source of error in the Gallup data is the finite sample size -- Gallup did not poll everybody in America, and thus the party affiliations are subject to sampling errors. How much uncertainty does this introduce?

On their [webpage](http://www.gallup.com/poll/156437/heavily-democratic-states-concentrated-east.aspx#2) discussing these data, Gallup notes that the sampling error for the states is between 3 and 6%, with it being 3% for most states. (The calculation of the sampling error itself is an exercise in statistics. It's fun to think of how you could arrive at the sampling error if it was not given to you. One way to do it would be to assume this was a two-choice situation and use the binomial sampling error for the non-unknown answers, and further model the error for those who answered 'Unknown'.)

1.8 Use Gallup's estimate of 3% to build a Gallup model with some uncertainty. Assume that the `Dem_Adv` column represents the mean of a Gaussian whose standard deviation is 3%. Build the model in the function `uncertain_gallup_model`. Return a forecast where the probability of an Obama victory is given by the probability that a sample from the `Dem_Adv` Gaussian is positive.

Hint
The probability that a sample from a Gaussian with mean $\mu$ and standard deviation $\sigma$ exceeds a threshold $z$ can be found using the Cumulative Distribution Function of a Gaussian:

$$
CDF(z) = \frac{1}{2}\left(1 + {\rm erf}\left(\frac{z - \mu}{\sqrt{2 \sigma^2}}\right)\right)
$$

The probability of exceeding the threshold $z$ is then $1 - CDF(z)$.
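The docstring cell for `uncertain_gallup_model` is not part of this excerpt, but a hedged sketch using the CDF above (via `scipy.special.erf`) might look like the following; the probability that the `Dem_Adv` Gaussian is positive is `1 - CDF(0)`:

```python
from scipy.special import erf

def uncertain_gallup_model(gallup, sigma=3.):
    """
    Treat each state's Dem_Adv as the mean of a Gaussian with standard
    deviation sigma, and report P(Obama) as the probability that a draw
    from that Gaussian is positive.
    """
    # P(X > 0) for X ~ N(mu, sigma) is 1 - CDF(0) = 0.5 * (1 + erf(mu / sqrt(2 sigma^2)))
    prob = 0.5 * (1 + erf(gallup.Dem_Adv / np.sqrt(2 * sigma ** 2)))
    return pd.DataFrame(dict(Obama=prob), index=gallup.index)

# assumed usage, mirroring the plotting cell that follows:
# model = uncertain_gallup_model(gallup_2012).join(electoral_votes)
```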
"cell_type": "code", "collapsed": false, "input": [ "make_map(model.Obama, "P(Obama): Gallup + Uncertainty")\n", "plt.show()\n", "prediction = simulate_election(model, 10000)\n", "plot_simulation(prediction)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.9 Attempt to validate the above model using the histogram. Does the predictive distribution appear to be consistent with the real data? Comment on the accuracy and precision of the prediction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your answers here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Biases\n", "\n", "While accounting for uncertainty is one important part of making predictions, we also want to avoid systematic errors. We call systematic over- or under-estimation of an unknown quantity bias. In the case of this forecast, our predictions would be biased if the estimates from this poll systematically over- or under-estimate vote proportions on election day. There are several reasons this might happen:\n", "\n", "1. Gallup is wrong. The poll may systematically over- or under- estimate party affiliation. This could happen if the people who answer Gallup phone interviews might not be a representative sample of people who actually vote, Gallup's methodology is flawed, or if people lie during a Gallup poll.\n", "1. Our assumption about party affiliation is wrong. Party affiliation may systematically over- or under-estimate vote proportions. This could happen if people identify with one party, but strongly prefer the candidate from the other party, or if undecided voters do not end up splitting evenly between Democrats and Republicans on election day.\n", "1. Our assumption about equilibrium is wrong. This poll was released in August, with more than two months left for the elections. If there is a trend in the way people change their affiliations during this time period (for example, because one candidate is much worse at televised debates), an estimate in August could systematically miss the true value in November.\n", "\n", "One way to account for bias is to calibrate our model by estimating the bias and adjusting for it. Before we do this, let's explore how sensitive our prediction is to bias." ] }, { "cell_type": "markdown",
"metadata": {}, "source": [ "1.10 Implement a biased_gallup
forecast, which assumes the vote share for the Democrat on election day will be equal to Dem_Adv
shifted by a fixed negative amount. We will call this shift the "bias", so a bias of 1% means that the expected vote share on election day is Dem_Adv
-1.\n", "\n", "Hint You can do this by wrapping the uncertain_gallup_model
in a function that modifies its inputs." ] }, { "cell_type": "code", "collapsed": false, "input": [ """"\n", "Function\n", "--------\n", "biased_gallup_poll\n", "\n", "Subtracts a fixed amount from Dem_Adv, beofore computing the uncertain_gallup_model.\n", "This simulates correcting a hypothetical bias towards Democrats\n", "in the original Gallup data.\n", "\n", "Inputs\n", "-------\n", "gallup : DataFrame\n", " The Gallup party affiliation data frame above\n", "bias : float\n", " The amount by which to shift each prediction\n", " \n", "Examples\n", "--------\n", ">>> model = biased_gallup(gallup, 1.)\n", ">>> model.ix['Flordia']\n", ">>> .460172\n", """"\n", "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.11 Simulate elections assuming a bias of 1% and 5%, and plot histograms for each one." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 },
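A minimal sketch of this wrapper, plus the two biased simulations asked for in 1.11 (the shorter name `biased_gallup` follows the usage example in the docstring above):

```python
def biased_gallup(gallup, bias):
    """Subtract a fixed bias from Dem_Adv, then apply uncertain_gallup_model."""
    g = gallup.copy()
    g.Dem_Adv = g.Dem_Adv - bias
    return uncertain_gallup_model(g)

# 1.11, sketched: rerun the election simulation under each assumed bias
for bias in (1., 5.):
    model = biased_gallup(gallup_2012, bias).join(electoral_votes)
    plot_simulation(simulate_election(model, 10000))
    plt.show()
```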
"The np.polyfit
function can compute linear fits, as can sklearn.linear_model.LinearModel
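The question this hint belongs to is truncated in this excerpt, but judging from the cells below it asks for a plot of the 2008 Gallup advantage against the actual 2008 Democratic win margin, with a linear fit. A hedged sketch using `np.polyfit` and the `prediction_08` frame referenced below (column names assumed from the print statement further down):

```python
# scatter the 2008 Gallup Democratic advantage against the actual 2008 win margin,
# and overlay a least-squares line fit with np.polyfit
slope, intercept = np.polyfit(prediction_08.Dem_Adv, prediction_08.Dem_Win, 1)
xs = np.linspace(prediction_08.Dem_Adv.min(), prediction_08.Dem_Adv.max(), 10)

plt.scatter(prediction_08.Dem_Adv, prediction_08.Dem_Win)
plt.plot(xs, slope * xs + intercept, '--k', label='linear fit')
plt.xlabel("2008 Gallup Democratic advantage (Dem_Adv)")
plt.ylabel("2008 Democratic win margin (Dem_Win)")
plt.legend(frameon=False)
```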
" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that a lot of states in which Gallup reported a Democratic affiliation, the results were strongly in the opposite direction. Why might that be? You can read more about the reasons for this [here] (http://www.gallup.com/poll/114016/state-states-political-party- affiliation.aspx#1)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick look at the graph will show you a number of states where Gallup showed a Democratic advantage, but where the elections were lost by the democrats. Use Pandas to list these states." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We compute the average difference between the Democrat advantages in the election and Gallup poll" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print (prediction_08.Dem_Adv - prediction_08.Dem_Win).mean()" ], "language": "python", "metadata": {}, "outputs": [],
"prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.13 * Calibrate your forecast of the 2012 election using the estimated bias from 2008. Validate the resulting model against the real 2012 outcome. Did the calibration help or hurt your prediction?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 26 }, { "cell_type": "markdown", "metadata": {}, "source": [ "1.14* Finally, given that we know the actual outcome of the 2012 race, and what you saw from the 2008 race would you trust the results of the an election forecast based on the 2012 Gallup party affiliation poll?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your answer here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Question 2: Logistic Considerations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous forecast, we used the strategy of taking some side- information about an election (the partisan affiliation poll) and relating that to the predicted outcome of the election. We tied these two quantities together using a very simplistic assumption, namely that the vote outcome is deterministically related to estimated partisan affiliation.\n", "\n", "In this section, we use a more sophisticated approach to link side information -- usually called features or predictors -- to our