Samir's Blog

Do Democrats like guns too? This and other thoughts after building a voter prediction model

[Figure: 2016 election results]

For my third project at Metis, I decided to build a classification model that could predict which political party a person will vote for.

Why would this be useful? The obvious answer is that political campaigns need this information to develop solid campaign strategies. For instance, a candidate may ask, “Where should I spend my time campaigning?” For a presidential campaign, some data-driven answers to this question are already public knowledge:

  • The United States uses an electoral college voting system, so it does not make sense for candidates to campaign in certain states.
  • For instance, it wouldn’t make sense for a Democratic presidential candidate to spend a lot of time campaigning in California, because the Democrat almost always wins California.
  • There are also states where every candidate should spend time regardless of party, because the races are usually very close. These are the “swing states,” such as Michigan, Ohio, and Pennsylvania. Keep in mind, though, that which states count as swing states can change from election to election.

So now that we’ve covered the basics, how do we decide how much time a candidate should spend in Michigan versus Ohio? This is where it gets more interesting. If we know our candidate has a huge lead in Michigan but is behind by a small margin in Ohio, it makes sense to spend more time in Ohio. To make that call, we first need a model that predicts how people will vote.

To that end, I found a dataset on Kaggle from the Show of Hands polling app that asked about 6000 people three types of questions:

  1. What party they intend to vote for (e.g. Democrat or Republican)
  2. ~100 personal questions (e.g. Are you an idealist or are you a pragmatist?)
  3. Demographic information (e.g. age, income, education level, marital status)

The interesting part of this dataset was the personal questions. The link between demographics and voter tendencies is better known; however, the sheer number of personal questions asked of each respondent gives us a chance to classify voters based on personality. Of course, a quick disclaimer: Kaggle does not provide many details about the dataset beyond the app that was used to collect the data. That means we cannot use the results of this project to make any generalizations about voters. That is also not the point of this post. The post is simply meant to describe what I learned while building a voter prediction model, and I hope others will find that interesting.

Model Selection and Feature Importance:

After trying out several models, I moved forward with a random forest because its ROC (receiver operating characteristic) curve had the highest area under the curve (AUC), at 68.10%. Roughly speaking, this means that if you pick one random Democratic voter and one random Republican voter, the model ranks them correctly 68.10% of the time. (Note that AUC measures ranking quality, not the fraction of correct predictions.)

Is an AUC of 68.10% “good,” though? That depends on the problem I’m solving. If I were trying to beat Vegas oddsmakers and had a model that beat them 68% of the time, that would be considered world class. For this domain, however, I don’t know the answer. Let’s come back to this.
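The evaluation step can be sketched as follows. Since the Show of Hands survey isn’t reproduced here, synthetic data stands in for the real answers; the model and metric calls are the same ones I used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ~6000 survey respondents and ~100 questions
X, y = make_classification(n_samples=2000, n_features=100,
                           n_informative=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# AUC is computed from predicted probabilities, not hard class labels
probs = rf.predict_proba(X_test)[:, 1]
print(f"ROC AUC: {roc_auc_score(y_test, probs):.2%}")
```

An AUC of 0.5 is coin-flip ranking and 1.0 is perfect, which is why 68.10% is meaningfully better than chance without being impressive in absolute terms.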

Once I had my model, I took a look at the feature importances to see which questions were most influential in the model’s prediction. Here are the top 13 features ordered from most impact to least:

  1. Are you Feminist?
  2. What year were you born?
  3. Do you personally own a gun?
  4. Do you meditate or pray on a daily basis?
  5. Are you a male or female?
  6. Does your life have a purpose?
  7. Which parent “wore the pants” in your household?
  8. Would you say most of the hardship in your life has been the result of circumstances beyond your own control, or has it been mostly the result of your own decisions and actions?
  9. Are you married with kids?
  10. Did your parents spank you as a form of discipline/punishment?
  11. Would you rather be happy or right?
  12. Are you more of an idealist or a pragmatist?
  13. Do you live within 20 miles of a major metropolitan area?

[Figure: feature importances]
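Ranking questions this way comes straight from the fitted model’s `feature_importances_` attribute. A minimal sketch, using hypothetical question names in place of the real survey columns:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical question labels standing in for the ~100 survey questions
feature_names = [f"Q{i}" for i in range(20)]
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort highest-impact first
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```

One caveat worth knowing: impurity-based importances can favor high-cardinality features, so permutation importance is a common sanity check.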

What would be driving this? Looking “under the hood,” you can see why some of these features rank so highly:

  • For people who said “yes” to the question “Are you a feminist?”, 80% also said they were voting Democrat.
  • For people who said “yes” to the question “Do you personally own a gun?”, 61% also said they were voting Republican. That means, in this sample, 39% of gun owners intended to vote Democratic, which certainly goes against conventional wisdom!

Those are the two questions that divided the respondents the best. Ultimately, that is what feature importances show us: which answers best separate the group into Democrats and Republicans. As you go down the list, splitting the respondents by how they answered each question produces results closer and closer to a 50-50 split.
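These per-answer vote shares are easy to compute with a row-normalized crosstab. A toy example with made-up responses (the real survey answers aren’t reproduced here):

```python
import pandas as pd

# Toy responses illustrating the split, not the actual survey data
df = pd.DataFrame({
    "gun_owner": ["yes", "yes", "yes", "no", "no", "yes", "no", "no", "yes", "no"],
    "party":     ["R",   "R",   "D",   "D",  "D",  "R",   "D",  "R",  "D",   "D"],
})

# normalize="index" gives, within each answer group, the share voting each way
shares = pd.crosstab(df["gun_owner"], df["party"], normalize="index")
print(shares)
```

A question is a strong splitter exactly when one of these rows is far from 50-50, which is what the feature importances above are picking up.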

And why might this be helpful?

Again, I’d like to stress that this model should not be used to generalize anything about the U.S. population. However, let’s imagine for a second that we collected this information ourselves from a random sample of the population and that we are advising a Democratic candidate. This might tell us a few things:

  • Democratic voters care a lot about women’s rights. Let’s make sure this is a big part of our campaign.
  • A good portion of Democratic voters own a gun. Perhaps we should tread carefully when proposing certain firearm legislation.

Also, it’s interesting to note that because of how lopsided the answers to the “Are you a feminist?” question were, my model is more confident predicting Democratic voters than Republican ones. That’s also why these results offer more advice to a Democratic candidate.

Context matters:

The random forest model also had the highest F1 score, at 67%. But is that the right metric to judge the model on? Or should we aim to improve precision or recall specifically?

Let’s go back to the question we asked at the beginning: where should the candidate spend their time campaigning? Say you find out the candidate has only a 2% lead in Ohio and a 6% lead in Michigan, so you tell them to spend more time in Ohio and not to worry about Michigan. If the model has too many false positives (voters you think will vote for you but who are actually voting for your opponent), you may actually be losing Ohio, and the race in Michigan may be much closer than you expected; you could lose Ohio or even both states. If the model instead has a lot of false negatives, you might be winning both states comfortably and, as a result, neglect a third state where you thought the candidate had no chance to win.

So how do we answer this question? How do we know whether a 68% AUC is a good score for a model? Context and benchmarking. This is where domain expertise comes into play. A candidate’s data science team would know what accuracy benchmarks had been achieved in the past. They should also work out which error (Type I or Type II) is more costly. Most likely, they would build a confidence interval (as you see in most polls) and could decide, for instance, that any state where the candidate sits between a 5% deficit and a 5% lead is worth spending time in.
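The precision/recall trade-off described above is concrete with scikit-learn’s metrics. A toy example, where label 1 means “will vote for our candidate”:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for ten voters: 1 = votes for our candidate
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

# Precision penalizes false positives (voters we wrongly count as ours);
# recall penalizes false negatives (supporters the model misses);
# F1 is their harmonic mean
print("precision:", precision_score(y_true, y_pred))  # 3 TP / 4 predicted positive = 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 3 TP / 5 actual positive = 0.6
print("f1:       ", f1_score(y_true, y_pred))
```

Which of the two matters more is exactly the Type I vs. Type II question: a campaign worried about phantom leads should weight precision, one worried about missed supporters should weight recall.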

Next Steps/Future Work:

So now what? A 67% F1 score shows that there is some predictive power to the model! This MVP has shown that there’s promise in the concept, but we’d want to improve it. How might we do that? To start, we’d want to add voter location data to this dataset. I spent the whole post discussing states, but we don’t even know where these voters live! Race/ethnicity is also missing from the demographic information. Finally, we’d want to keep building domain expertise for this problem, because ultimately the model can only improve as much as our knowledge and intuition improve.

How is the public coping emotionally with Covid-19?

Covid-19 needs no introduction. As a country, we are over two months into a scary pandemic, but the virus has potentially been spreading worldwide since December 2019. Luckily, my family and I have been spared the worst of the disease and we’re all healthy. Still, as the pandemic progressed, I felt strong emotions and worried about the future. Once shutdowns started across the country, I felt a lack of control and complete uncertainty about what would happen next. My experience is anecdotal, of course, but it made me wonder how others were coping mentally and emotionally with Covid-19.

To complete this analysis, I scraped Twitter using the GetOldTweets API and got 75,000 tweets matching the search query “coronavirus”. From there, I did various preprocessing: removing URLs and punctuation, lowercasing, lemmatization, and removing stop words. Then I used TF-IDF (term frequency–inverse document frequency) to vectorize the words, and finally NMF (non-negative matrix factorization) to do topic modeling.

Below are the top 10 topics, with the top words in each (in no particular order):

  • Initial Chinese Outbreak: china, outside, outside china, wuhan china, travel, sars, hubei, flu, pneumonia
  • Trump’s response: trump, president, response, american, trump administration, trump response
  • Italy’s Covid-19 Outbreak: report, italy, china report, bring, italy report, report bring, break, hubei
  • Stopping the spread: spread, stop, stop spread, prevent, prevent spread, country, slow, slow spread, cdc
  • Anger: a whole lot of curse words
  • Working from home: work, employee, worker, hard, school, work together, company, office
  • Cruises and Quarantines: quarantine, cruise, ship, lockdown, self, cruise ship, city, quarantinelife, princess
  • Wearing a mask: mask, face, wear, face mask, wear mask, protect, wear face mask, hand, use
  • White House Briefings: house, white, white house, bill, relief, force, package, stimulus, task, democrat
  • Second Wave Warnings: bad, good, flu, cdc, second, wave, second wave, warns, winter, cure

These all look like pretty normal topics, and based on the top words you can probably guess what each one is about. The one interesting, but probably not surprising, topic is the one I’m calling “Anger”. This was essentially a collection of tweets that used a lot of curse words. I won’t share them here, as they are rather inappropriate for a blog post like this, but when you read through some of the tweets you can surmise that folks are angry at the situation.

Ok, so now we have our topics. But to get a better idea of how people were coping with Covid-19, we should look at how the topics changed over time. To do that, I first looked at the document-topic matrix my NMF model produced. The document-topic matrix is essentially a table where each document (e.g. a tweet) is a row and each topic is a column; the values represent how strongly each topic is represented in each document. To gauge how much a topic was covered across the whole corpus of tweets, I summed each topic column. Next, I compared the topics’ totals to understand their relative importance. Finally, I split the data up by month and plotted it to show how topic “importance” changed on a month-by-month basis.
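The month-by-month aggregation described above can be sketched with pandas. The matrix below is a toy stand-in for the real NMF output, with hypothetical topic names:

```python
import pandas as pd

# Toy document-topic matrix: one row per tweet, one column per topic
doc_topic = pd.DataFrame(
    [[0.8, 0.1], [0.2, 0.6], [0.5, 0.5], [0.0, 0.9]],
    columns=["Initial Chinese Outbreak", "Wearing a mask"],
)
doc_topic["month"] = ["Jan", "Jan", "Feb", "Feb"]

# Sum topic weight within each month, then normalize each month's row
# so topics can be compared as shares of that month's discussion
monthly = doc_topic.groupby("month").sum()
monthly = monthly.div(monthly.sum(axis=1), axis=0)
print(monthly)
```

Normalizing within each month matters here: tweet volume grew enormously from December to March, so raw column sums would mostly reflect volume rather than what people were talking about.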

[Figure: topic importance over time]

And voilà! You can see that the main discussion on Twitter in December and January was about the initial outbreak in Wuhan, China, which checks out. After that, though, several topics sit near the top and things get murkier. To get a better picture, let’s dive into each month to understand what changes and what that means.

December/January

In December, there were only 300 tweets with the search term “coronavirus”. Near the beginning of January, you start to see tweets about a “mystery illness” or “pneumonia” popping up. You also see tweets quoting the WHO, which said at the time that Covid-19 was not spreading. We now know that the disease spreads very quickly, potentially through asymptomatic people, but in January that wasn’t clear.

February

[Figure: February topics]

By February, the Covid-19 outbreak was quite bad in China, so the “Initial Chinese Outbreak” topic is still the biggest. However, in February the disease also spread to Italy, and you see that topic peaking. By this point it was becoming clear that the virus spreads very fast, so topics about stopping the spread peak as well. This is also when global events started to get cancelled to help reduce the spread.

March

[Figure: March topics]

In March, the anger topic peaked. For context, I’m only looking at tweets in English, and this is the month when Covid-19 took hold in large English-speaking countries like the U.S. and the UK. In the United States, this is when life started to be affected for almost everyone, so it makes sense that this is when the anger rose up. Also, as the U.S. began dealing with Covid-19, you start to see tweets discussing President Trump’s response to the contagion.

April/May

[Figure: April/May topics]

In April and May, the anger topic drops way down in relative importance. From a quick glance, it seems people took their anger and, depending on their political affiliation, focused it on either criticizing or defending President Trump’s response to the disease. Topics related to the new normal, like wearing a mask and working from home, were also popular in these months.

Are people grieving the loss of a normal life?

So what did we learn from all of that? First of all, there were no topics in the top 15 about solutions to this massive problem we’re all facing right now; by that I mean there was little discussion of vaccines or epidemiology. This makes sense, given that most of the general public (especially me) knows next to nothing about these topics. More importantly, though, this data shows that people were dealing with something very difficult beyond just the physical effects of the disease.

Lots of people had a lot of anger in March, but the “anger” topic is slowly fading. That reminds me of the 5 stages of grief. For those who don’t know, the 5 stages of grief is a framework describing the stages someone goes through to get over the loss of a loved one: Denial, Anger, Bargaining, Depression, and Acceptance. So perhaps people are grieving the loss of their normal lives, and this is showing up on Twitter. Right now we can no longer go to restaurants and bars. We don’t get to see friends and loved ones in person. Covid-19 has also introduced a lot of uncertainty into life. Perhaps you got laid off and don’t know when the next paycheck is coming in. Perhaps you’re worried you may get laid off if things don’t improve. Perhaps you were planning on attending school in the fall and now that’s in jeopardy. And on top of all this, you had no control over any of it; the director of this drama is an invisible enemy called “coronavirus”. All of this can take an incredible emotional toll on us, and some of us may need help to reach the final stage: acceptance. So the question becomes: how can we help people recover emotionally from the Covid-19 outbreak, especially given the likelihood that it continues for a while or that there is a significant second wave?