First off, Happy Halloween!
Second, a word on this blog. Sometimes, we’ll write about our progress, or something interesting in astronomy or physics. Sometimes we’ll write something a bit more technical. This is the latter. If that’s not your cup o’tea, don’t worry. The book, itself, isn’t technical, but occasionally, there’s a bit of math that I just need to get out of my head, and Emily Joy can only be so indulgent.
Third, like just about every numerically-minded person these days, I am obsessed with the latest polling data. I’m also in the midst of writing the first draft of our chapter on randomness in the universe. So I started thinking about daily tracking polls recently, and I noticed something odd.
Bear with me here, because I don’t get to be numerical in the book, so I don’t get to do this example there. Let’s think about a poll in which N random people are asked their preference between two candidates, and for convenience (and because of my own preferences), we’ll focus on Obama’s share.
If the underlying fraction p of people supporting Obama is fixed, then the 1-sigma (random) error on a polling measurement of p is sqrt(p(1-p)/N). For cases like elections, where the polls are close enough to 50/50, the fractional errors are ~0.5/sqrt(N).
So for a poll of 1000 people, this means we can pin down the fraction voting for Obama to within about 1.5%, 68% of the time. Most polls actually quote the 2-sigma errors (95% confidence), which is why a 1000-person poll reports roughly 3% errors.
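If you want to check that arithmetic yourself, here’s the formula above as a quick Python sketch (the function name is just mine, not anything standard):

```python
import math

def sampling_sigma(p, n):
    """1-sigma sampling error on a proportion p measured from n respondents."""
    return math.sqrt(p * (1 - p) / n)

n = 1000
sigma1 = sampling_sigma(0.5, n)              # 1-sigma error for a 50/50 race
print(f"1-sigma: {100 * sigma1:.1f}%")       # ~1.6%, the ~1.5% quoted above
print(f"2-sigma: {100 * 2 * sigma1:.1f}%")   # ~3.2%, the quoted ~3% margin
```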
Of course, this is only true for random errors, not systematic ones. It may turn out that we’re asking the wrong people, or that our turnout model is very wrong; but presumably, if we want to see trends in the polling, we can look at how a given poll changes from day to day.
Let’s put this in context. First, let’s ignore undecided voters (or assume their fraction is fixed), and simply take the fraction of voters who go to McCain to be 100%-p, so if Obama gets 52%, McCain gets 48%. The margin is then d = 2p - 1, and since d is a linear function of p, its uncertainty is twice that of p: simply 1/sqrt(N) (1-sigma errors, always!). This is the real figure of merit if you’re trying to figure out who’s actually winning according to a particular poll.
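The factor of two is easy to verify in Python; again, the helper name is just for illustration:

```python
import math

def margin_sigma(p, n):
    """1-sigma error on the margin d = p - (1 - p) = 2p - 1.
    Since d is linear in p, sigma_d = 2 * sigma_p."""
    return 2 * math.sqrt(p * (1 - p) / n)

print(f"{100 * margin_sigma(0.5, 1000):.1f}%")  # ~3.2%, i.e. 1/sqrt(N) for N = 1000
```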
But what about a tracking poll with a 3-day baseline? The number reported on day 3 averages the samples from days 1, 2, and 3; the number reported on day 4 averages days 2, 3, and 4. Days 2 and 3 are common to both averages, so in the difference between the day-4 and day-3 results, you’re really only measuring the change in polling between the 4th and 1st days (diluted by the factor of 3 in the average).
So skipping a few steps of math, if we asked the question:
“Given an advantage x_2 today, and advantage x_1 yesterday, is there any reason to believe that poll has really changed?”
the distribution of x_2 - x_1 (for a 3-day tracking poll with N total respondents) has a standard deviation of sqrt(2/(3N)).
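For anyone who doesn’t trust the steps I skipped, here’s a quick Monte Carlo sketch in Python: each day contributes N/3 fresh respondents, and successive 3-day averages share two days, so their difference is (d_4 - d_1)/3.

```python
import math
import random

def tracking_diff_sigma(n_total):
    """Analytic 1-sigma spread of the day-over-day change in a 3-day
    tracking poll with n_total respondents (n_total/3 fresh per day)."""
    return math.sqrt(2.0 / (3.0 * n_total))

# Monte Carlo check: daily margins d_i scatter about a fixed true value
# with sigma = 1/sqrt(n_day); successive 3-day averages differ by (d_4 - d_1)/3.
random.seed(1)
n_total = 1000
n_day = n_total // 3
sigma_day = 1.0 / math.sqrt(n_day)
diffs = []
for _ in range(50000):
    d = [random.gauss(0.0, sigma_day) for _ in range(4)]
    diffs.append(sum(d[1:4]) / 3 - sum(d[0:3]) / 3)
mean = sum(diffs) / len(diffs)
sd = math.sqrt(sum((x - mean) ** 2 for x in diffs) / len(diffs))
print(f"analytic: {100 * tracking_diff_sigma(n_total):.2f}%   "
      f"simulated: {100 * sd:.2f}%")
```

Both the formula and the simulation land on about 2.6% for N = 1000, which is the ~2.5% used below.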
For a 3-day tracking poll with a total of 1000 people (like the Daily Kos poll), we find the standard deviation of the difference between two successive days is about 2.5%. In other words, on a typical day the poll should vary by as much as 2 or 3 points, and occasionally even more; that shouldn’t be uncommon at all. In fact, this is a minimum scatter, since if there is genuine change in the electorate (which is likely over the last week), the standard deviation should be even larger.
But over the last week, the Daily Kos poll has given an advantage to Obama of:
12 12 11 8 7 6 5 6
And therefore daily changes of:
0 -1 -3 -1 -1 -1 +1
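Here’s a quick Python check of the scatter in those daily changes. I’m taking the RMS rather than a mean-subtracted standard deviation, since the prediction above is for the spread about zero change:

```python
import math

# day-over-day changes in the Obama margin quoted above, in points
changes = [0, -1, -3, -1, -1, -1, 1]
rms = math.sqrt(sum(c * c for c in changes) / len(changes))
print(f"RMS daily change: {rms:.2f} points")  # 1.41 points
```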
This pattern definitely gives credence to the idea that the race is narrowing, but remember, genuine change should increase the standard deviation of the differences. Instead, the measured scatter of these daily changes is about 1.5%, far smaller than we predicted. So my question is, “Why aren’t the polls noisier?”
I think I have an answer. In many polls, including the Daily Kos, Rasmussen, and Gallup trackers and most of the big ones, a constant fraction of Democrats and Republicans is included in the sample. Whether those fractions are correct is immaterial. The point is that since partisans vote fairly consistently (~85-90%) for the candidate of their party, we’re essentially averaging in a constant each time, which significantly reduces the effective uncertainty.
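To see how much pinning down the party mix can damp the noise, here’s a toy Monte Carlo in Python. The 40/40/20 party split and the 90/10/50 support rates are made-up illustrative numbers, not any pollster’s actual weighting scheme:

```python
import random

random.seed(2)
N = 1000
PARTIES = ["D", "R", "I"]
WEIGHTS = [0.40, 0.40, 0.20]              # assumed party-ID split (illustrative)
VOTE = {"D": 0.90, "R": 0.10, "I": 0.50}  # assumed Obama support by party (illustrative)

def poll(fixed_composition):
    """One simulated poll; returns the measured Obama fraction."""
    if fixed_composition:
        # pollster forces the sample to exactly 40% D, 40% R, 20% I
        sample = ["D"] * 400 + ["R"] * 400 + ["I"] * 200
    else:
        # the party make-up of the sample fluctuates randomly
        sample = random.choices(PARTIES, WEIGHTS, k=N)
    return sum(random.random() < VOTE[p] for p in sample) / N

def stdev(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

trials = 5000
sd_random = stdev([poll(False) for _ in range(trials)])
sd_fixed = stdev([poll(True) for _ in range(trials)])
print(f"random party mix: {100 * sd_random:.2f}%   "
      f"fixed party mix: {100 * sd_fixed:.2f}%")
```

In this toy model the poll-to-poll scatter drops by roughly a third (about 1.1% versus about 1.6%) once the party composition is held fixed, which is exactly the flavor of effect that could make a tracker quieter than its quoted margin of error.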
Now, there may not be anything wrong with this, but a natural result is that if the effective error on a poll is (much) smaller than the official sampling error the poll cites, then when pundits say that two polls “are within the margin of error of each other,” or that “Obama and McCain are within the margin of error,” this is nonsense!
The bottom line is that while I think this election is over, any surprises aren’t going to come from sampling noise. It’s clearly a game of systematic errors.
Thanks for letting me get that off my chest.