Fitting an elephant with John Black

That John Black: he’s a genius. Today in the Australian he examines the recent Queensland state election, delving deep and long into demographic data and coming up with nuggets of high-carat wisdom. He can tell you that the poor voters who swung to Labor pay ‘relatively low rents and mortgages’ on ‘two or three-bedroom’ homes, and that they have ‘no internet connection.’ In general, it is “public housing tenants, Polynesians and persons speaking languages other than English” who comprise Queensland Labor’s vote, along with “single parents with young kids, female public servants, male and female transport workers, women aged 30-34 years and persons who were actively chasing jobs in latter part of 2013”. Why didn’t inner city Brisbane seats swing towards the ALP as much as others did? Well, the ALP didn’t appeal to “South African migrants” who are paying back mortgage debt, obviously.

But though any of these insights—which most of us mere mortals could spend a lifetime trying to figure out on our own—would be enough surely for one man, Black doesn’t stop there. He can also figure out how much of a particular candidate’s vote was due to demographic factors and how much was their ‘personal’ vote:

When we look at the personal vote scores in our report, we see that the predicted ALP 2PP vote in Newman’s seat of Ashgrove was 48.6 per cent. In other words, Newman should have won the seat with at least 51.4 per cent of the vote but he polled 5.5 per cent less than this figure. We can assume in a two-horse race Labor’s candidate Kate Jones was responsible for some of this lift in the Labor, but the more realistic conclusion is that Newman cost every LNP candidate, including himself, between 5 per cent and 6 per cent of the vote, and that’s why he lost the election.

And take a look at these numbers, will you?

One fact we are reasonably confident about in all the inferential statistics is that our election model — which explained 84 per cent of the variation in the ALP vote — predicted that the ALP should have won Clayfield with 53.4 per cent of the vote, whereas the LNP sitting member pulled Labor’s predicted vote down by 11.5 per cent.

That sounds very impressive indeed: his model explains 84% of the variation in the ALP vote! That’s pretty nifty! So how, dear reader, does he do it? I know there’s a lot to be said for allowing a magician to retain the mystery of his work, but I think in this case the scientific interest outweighs this concern. In his report, Black refers to a demographic database that has 650 variables that he ‘correlates’ to voting tallies in each Queensland seat. In another report, he makes clear that he selects from this database only the choice cuts: using a technique called stepwise multiple linear regression, he eliminates variables that don’t explain anything about voting until he arrives at his personal selection to present to you, the reader. In his 2015 election report, he nominated some 57 variables that might explain Labor’s two-party preferred vote (he calls these ‘stereotypes’, and though it’s not entirely clear whether all of them are used in his regressions, it certainly seems likely). He puts these through his sausage machine, and out come estimated coefficients that tell you, for example, that an increase of (say) 1 per cent in the number of Mormons in a seat will lead to an increase of (say) 0.46 per cent in the two-party preferred vote for the ALP. Any variation in the TPP vote that he cannot explain with his selected variables, he attributes to a candidate having a personal vote.
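For the curious, here is a minimal sketch of what "coefficients plus a personal vote" amounts to mechanically. Everything in it is an assumption for illustration: the data are made up, the names `demographics`, `alp_tpp` and `personal_vote` are hypothetical, and the 0.46 simply echoes the "(say)" figure above. It is not Black's model, only an ordinary least squares fit whose leftovers get relabelled.

```python
import numpy as np

rng = np.random.default_rng(1)
n_seats = 89

# Hypothetical inputs: two made-up demographic shares per seat (in per cent)
# and a made-up ALP two-party-preferred vote that depends on the first one.
demographics = rng.uniform(0, 20, size=(n_seats, 2))
alp_tpp = 45 + 0.46 * demographics[:, 0] + rng.normal(0, 3, n_seats)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n_seats), demographics])
coef, *_ = np.linalg.lstsq(A, alp_tpp, rcond=None)

# The "personal vote" is nothing more than the residual:
# whatever the fitted line fails to explain in each seat.
predicted = A @ coef
personal_vote = alp_tpp - predicted

print("estimated coefficients:", np.round(coef, 2))
print("largest 'personal vote':", round(personal_vote.max(), 1))
```

Note that the residual soaks up everything the chosen variables miss, noise included, so calling it a personal vote is an interpretation layered on top of the arithmetic.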

As it happens, I too have a database of 650 demographic variables that I will now use to show you how this magic trick is being performed. I too have used stepwise multiple linear regression to whittle down my database until I have only the variables I need. I too will explain about 80-90 per cent of the variation in the ALP TPP vote. And then I, too, will tell you exactly who won the Queensland election for (we now presume) the Labor Party. Is it hockey fans? Dog-owning pilots with two houses? People who once sniffed Lady Flo’s eau de toilette between the ages of 70 and 74? Let’s run our regression and see:

Screen Shot 2015-02-07 at 3.45.53 pm

The estimate column gives you the impact of an increase in that particular variable on the TPP vote for the ALP. And here I seem to have done slightly better than even Mr Black: I have 16 variables only, and yet I also explain more than 80 per cent of the variation in Labor’s TPP vote in Queensland. (Maybe Mr Black would like to pay me for my perhaps superior dataset!) Thinking about this visually, consider the following graph: it shows the relationship between what my demographic model of the Queensland electorate predicts and what actually happened on Election Day. The red line shows where the dots would sit if my model predicted the vote perfectly, down to the last decimal place. As it happens, it’s a pretty good fit, no?

Screen Shot 2015-02-07 at 4.28.12 pm

What about those personal votes? Like Mr Black, I can calculate those by finding out the variation in the vote that can’t be explained by the demographic model:

Screen Shot 2015-02-07 at 4.48.06 pm

Screen Shot 2015-02-07 at 4.50.13 pm

Look at that gigantic personal vote for Kate Jones in Ashgrove: just as Black says, Campbell Newman was personal poison.

But haven’t I forgotten one thing? What are those magic variables—the 16 keys to electoral bliss, the 16 secret talismans of Annastacia Palaszczuk? Is it men who wear felt hats who saw the ALP return to power, or Samoans who are also Wiccans? Did obese Scientologists throw Campbell Newman from office?

Well, actually…all 650 variables in my database are independent and identically distributed normal variables with mean zero and standard deviation 1, produced by a random fucking number generator. The numerical information contained within the variables that were eventually selected by stepwise regression is utterly meaningless, and yet with the same procedure that John Black used, I can produce results of exactly the same level of explanatory power. John Black has had 650 variables to choose from to find enough variables to fit only 89 data points (I have slightly fewer since I don’t get TPP estimates for all seats for Labor). “With four parameters I can fit an elephant,” John von Neumann is supposed to have said, “and with five I can make him wiggle his trunk.” And John Black, with his 650 demographic variables, can make him vote!
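If you want to try the trick at home, the following is a rough sketch under stated assumptions rather than a reproduction of my run or of Black's. It uses a simple greedy forward selection (one common flavour of stepwise regression, and my choice here for brevity), a made-up vote series in place of the real TPP figures, and hypothetical helper names `r_squared` and `forward_stepwise`. The only numbers taken from the post are 650 variables, 89 seats and 16 survivors.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / tss

def forward_stepwise(X, y, n_keep):
    """Greedily add the column that raises R^2 the most, n_keep times."""
    chosen = []
    for _ in range(n_keep):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        best = max(remaining, key=lambda j: r_squared(X[:, chosen + [j]], y))
        chosen.append(best)
    return chosen, r_squared(X[:, chosen], y)

rng = np.random.default_rng(0)
n_seats, n_vars, n_keep = 89, 650, 16

# 650 columns of pure noise standing in for the "demographic database".
X = rng.standard_normal((n_seats, n_vars))
# A made-up vote series (assumption), generated independently of X.
y = 50 + 10 * rng.standard_normal(n_seats)

chosen, r2 = forward_stepwise(X, y, n_keep)
print(f"selected noise columns: {chosen}")
print(f"R^2 from {n_keep} hand-picked noise variables: {r2:.2f}")
```

Because the selection step gets to shop among 650 candidates for only 89 observations, the reported R² is flattering by construction. Refitting the same 16 columns on a fresh draw of `y` would be a fairer test, and the fit should largely evaporate.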


