How will AI change the Social Sciences?
Three ways current progress in machine learning will change the future of Social Science
By Jeppe Johansen
In the last couple of years, a lot of hype has built up around the term AI. This was largely due to a string of impressive results from research institutions, such as algorithms that learn to play computer games, or image recognition systems with very strong performance. The term AI should probably be replaced with machine learning, since essentially all of the interesting results have come from this sub-branch of AI. However, exactly how the hype around these tools will map onto the social sciences is still an unsettled matter. Below, I outline three trends that I believe could have a sizeable impact on how these modern tools are integrated into the social sciences. To make this post accessible to a wide audience, I have relaxed mathematical rigor and emphasized intuition.
For machine learning tools to be successfully applied in social science, I believe they should fulfill two criteria: they should be useful for investigating causal relationships, and they should be easy to use. Since the social sciences, or at least economics, primarily focus on causal effects, the methods that will gain traction are the ones that help investigate such questions. An example: if the government raises taxes, how does this causally impact the labor supply? If these new tools do not help address such questions, their adoption will probably be limited. Furthermore, using these tools should be possible for an above-average university student who has taken some introductory statistics courses, and possibly also a course in machine learning. If more than this is needed, such tools will not attract a critical mass of users and be broadly applied. Of course, highly specialized use cases of ML will still exist, but this post speculates about what will find a broad range of applications. For the interested reader, I have tried to include links so one can try these tools out for oneself.
How machine learning differs from the tools already used by quantitative social scientists
The workhorse tool of social scientists is linear regression. The idea is to generate a straight line [1] that captures the patterns in the data by keeping the values of the dataset as close as possible to the line. One drawback of this method is that the line generated by linear regression needs to be – as implied by the name – linear [2]. Of course, social scientists have other tools for modeling the data-generating process – that is, researchers can construct models that capture certain aspects of the real-world mechanisms that generated the data set – but these tools primarily hinge on the researcher specifying the structure of the model. One such case is a binary outcome (getting sick or not getting sick, for example), where the researcher transforms the linear structure through an S-shaped curve so that predictions fall on a scale from 0 to 1.
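To make this concrete, here is a minimal sketch of the two models just described, using scikit-learn and synthetic data (all variable names and numbers are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))  # a single explanatory variable

# Continuous outcome: a straight line plus noise -> linear regression.
y_cont = 2.0 + 3.0 * x[:, 0] + rng.normal(scale=0.5, size=200)
lin = LinearRegression().fit(x, y_cont)
print(lin.intercept_, lin.coef_)  # recovers roughly (2, 3)

# Binary outcome (sick / not sick): the linear index is pushed through
# an S-shaped curve, so predicted probabilities land between 0 and 1.
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))
y_bin = (rng.uniform(size=200) < p).astype(int)
logit = LogisticRegression().fit(x, y_bin)
print(logit.predict_proba(x[:5]))  # probabilities between 0 and 1
```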
Now, contrast this to machine learning (ML), which improves the fit of the model by imposing as little structure on the data-generating process as possible. An example could be what determines a person's salary. Economists would usually use a specification due to Jacob Mincer, which describes how individual variables such as experience and education map into the earnings a person receives. A machine learning model, by contrast, would assume very little about how experience and education map to salary – we include the variables and let the algorithm optimize how to model the relationship to best fit the actual data.
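In its textbook form, the Mincer specification looks roughly like this (a standard formulation, stated from memory rather than taken from a particular source):

    log(wage) = b0 + b1 · schooling + b2 · experience + b3 · experience² + error

So the researcher fixes the shape of the relationship up front – including the squared experience term – and only the b coefficients are estimated from the data.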
Such ML algorithms could be a neural network or a random forest – two of the most popular types of machine learning models. Both suffer from being hard to interpret, but they will often be better predictive models than linear regression. So far, machine learning has primarily been introduced into the social sciences in cases where prediction is the objective. Examples could be: How likely is a given student to drop out of school? Or, how likely is a given person to commit a crime? In other words, these are problems where we do not say anything about how the person would fare if we intervened, only about what the expected outcome is. Once we consider intervention, we introduce counterfactuals – how a scenario would look in an alternative reality – which prediction alone cannot answer.
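For a hedged illustration of that prediction gap, the sketch below fits both models on synthetic salary data with a deliberately non-linear wage equation (everything here is invented; the point is only the comparison):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
experience = rng.uniform(0, 40, 1000)
education = rng.uniform(8, 20, 1000)
# A "true" wage equation with curvature in experience.
wage = (20 + 2 * education + 1.5 * experience - 0.03 * experience**2
        + rng.normal(scale=3, size=1000))

X = np.column_stack([experience, education])
X_tr, X_te, y_tr, y_te = train_test_split(X, wage, random_state=0)

# Linear regression on the raw inputs misses the curvature...
print(LinearRegression().fit(X_tr, y_tr).score(X_te, y_te))
# ...while the random forest can pick it up without being told,
# typically yielding a higher out-of-sample R^2 here.
print(RandomForestRegressor(random_state=0).fit(X_tr, y_tr).score(X_te, y_te))
```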
So, if better predictive models exist, why is linear regression so popular? Setting aside institutional conservatism, I suspect three things: ease of interpretation, well-established statistical guarantees about how the model behaves, and a great ecosystem for analysis. The interpretation of linear models is usually straightforward: when you go 1 unit to the right on the x-axis, you move b units up the y-axis. This is a nice property when doing social science. Again, consider the tax reform example from the introduction: if we raise the tax by one unit, how many hours less (on average) will people work? Second, the asymptotic properties of linear regression allow for statistical tests – that is, we know the distribution of our parameter of interest, and we can assign a probability to that parameter taking a certain value. Essentially, social scientists love to claim that an effect is statistically significant [3], which can be understood as: we believe it is unlikely that the results are generated by pure noise rather than a real signal. Finally, linear regression is taught in every introductory statistics course, every piece of statistical software ships with a good implementation, and everybody is capable of reading the output. For these reasons, I believe the introduction of AI/ML should complement these features of modern quantitative social science.
So, taking into consideration the strengths of linear regression outlined above, the three trends that I believe will gain traction in the social sciences address the following needs:

1) Converting unstructured data into structured data sets where classical statistical tools can be used.

2) Assuming minimal structure on a problem while still getting statistical guarantees on models and being able to consider counterfactuals.

3) Integrating domain knowledge of a problem into models with flexible model components.
Turning unstructured data into structured data using ML
I believe this trend will be by far the most dominant of the three. Fundamentally, much of social science has been limited to already-curated data. Usually, official statistics collected by some government body in traditional databases or spreadsheets have been the object of analysis. But, at this point in time, the vast majority of data collected is unstructured. Images, text, video, and audio can, with the help of modern ML tools, be transformed into curated data sets so that traditional statistical tools can be used.
Say you are an economic historian who wants to track whether catastrophes are associated with worse treatment of minorities throughout history. It would probably be hard to get access to well-curated datasets on this topic. Consider instead that the researcher text-mines (uses a machine learning tool to read and annotate) all books in Europe from the beginning of the Middle Ages until the end of the 19th century for all instances where minority groups are mentioned. The researcher then uses a tool for sentiment analysis – that is, annotating the opinion expressed in the text – to see if minorities are described positively or negatively. That way, you can construct an extensive structured data set where all the classical tools from econometrics and statistics can be used to test whether catastrophes do cause worse treatment of minorities.
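For a flavor of the sentiment step, here is a minimal sketch using the Hugging Face transformers library; the passages, and the idea of scoring mentions of minority groups this way, are illustrative assumptions rather than the pipeline of any actual study:

```python
from transformers import pipeline

# Loads a default pretrained sentiment model on first use.
sentiment = pipeline("sentiment-analysis")

# In practice these would be passages mined from digitized books.
passages = [
    "The community welcomed the newcomers and praised their craftsmanship.",
    "The group was blamed for the famine and driven from the town.",
]

for text in passages:
    result = sentiment(text)[0]  # e.g. {'label': 'NEGATIVE', 'score': 0.99}
    print(result["label"], round(result["score"], 3))

# Aggregating such labels by year and region yields a structured data
# set ready for classical econometric analysis.
```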
A great example, though still a draft, is the paper Bouncing With The Joneses. The author investigates whether there are peer effects in the purchase of durable goods – more concretely, if your neighbor buys a trampoline, are you more likely to buy one yourself? He investigates the question by sweeping through aerial photos with an image classification algorithm, identifying which houses own a trampoline (and how this changes over time). Using the data generated from this classification task, he applies classical statistical/econometric tools to gauge whether these peer effects are causal.
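For a taste of what such a classification step can look like, here is a rough transfer-learning sketch with PyTorch/torchvision: a pretrained network whose final layer is swapped for a binary "trampoline / no trampoline" head. The batch, the labels, and the single training step are hypothetical stand-ins; the actual paper's pipeline will differ:

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a network pretrained on ImageNet and repurpose it.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes

# One fake batch of 224x224 aerial-photo crops, one crop per parcel.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))  # 1 = trampoline visible

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(loss.item())  # one fine-tuning step on the fake batch
```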
Assuming minimal structure and still getting statistical guarantees – the case for semi-parametric methods
As mentioned at the beginning of this post, linear regression has been the standard tool of statistical modeling in the social sciences. That approach assumes a linear structure for the problem the analyst is modeling. If you want some functional flexibility, you need to hand-code it – that implies various transformations of the variables. In practice, this could mean a logarithmic transformation, squaring, or adding cross-products between two variables to correctly model the relationship between the outcome, treatment, and controls [4]. Researchers in semi-parametric inference have, since at least the 1980s, studied how you can relax the linearity assumption – so that non-linearities do not need to be hand-coded – while still getting guarantees on the "parameter of interest".

More concretely, imagine a researcher investigating a new drug who wants to figure out whether he can say, with a reasonable degree of certainty, that the drug really does have a positive effect. In this case, should he need to impose more structure than necessary? Should he be concerned with the functional form (linear, quadratic, etc.) in which age and gender enter as variables when testing the efficacy of the drug, or should he just be able to enter age and let the algorithm figure out for itself how it matters? As a modeler, he would probably have two concerns.
First, is the drug administered randomly, or are the characteristics of the subjects in the trial a determinant of being treated or not? Maybe the people administering the drug will not give it to young people, for example. Second, when controlling for characteristics, he wants to assume as little as possible about the causal structure of what he is investigating – which is the reason machine learning methods are appealing here. An example of these tools is double/debiased machine learning. The paper is primarily theoretical, but it also contains some empirical examples, one of them being how a cash bonus impacts unemployment. The primary takeaway is that the authors can estimate parameters (the effect of the cash bonus on unemployment length) using all sorts of fancy machine learning methods such as neural networks, LASSO, and random forests, and still map the result to an output that social scientists like – a table with effect sizes and standard errors [5]. Not only is this easier to interpret, it might also help lessen some of the institutional resistance to changing methodologies.
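To give a feel for the core idea, here is a simplified, hand-rolled version of the partialling-out trick behind double/debiased machine learning, on synthetic data with scikit-learn. The real method adds cross-fitting refinements and formal guarantees that this sketch glosses over:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 5))                               # controls
d = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(size=n)  # treatment depends on controls
y = 1.5 * d + np.cos(X[:, 0]) + rng.normal(size=n)        # true effect of d is 1.5

# Flexible ML models predict outcome and treatment from the controls;
# out-of-fold predictions give a basic form of cross-fitting.
y_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, y, cv=5)
d_hat = cross_val_predict(RandomForestRegressor(random_state=0), X, d, cv=5)

# The effect is recovered by regressing residual on residual.
res_y, res_d = y - y_hat, d - d_hat
theta = res_d @ res_y / (res_d @ res_d)
se = np.sqrt(np.sum((res_d * (res_y - theta * res_d)) ** 2)) / (res_d @ res_d)
print(f"estimated effect: {theta:.2f} (true 1.5), standard error: {se:.3f}")
```

The punchline is exactly the output format mentioned above: an effect size with a standard error, even though random forests did the heavy lifting of controlling for the other characteristics.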
Integrating domain knowledge into the estimation procedure – the case for probabilistic programming languages
To understand why I believe probabilistic programming languages (PPLs) will gain popularity, a little historical background helps. The social sciences have primarily used an approach called frequentism, while a more niche approach referred to as Bayesianism has existed alongside it. Without getting bogged down in detail: the Bayesian approach never really became mainstream, and my best guess as to why is that, historically, these methods have been hard to use. Even though they can offer more flexibility on a range of problems, implementation has been hard. They have required researchers to program their estimators from scratch, in stark contrast to the standard linear regression tool that ships with every piece of statistical software. Fundamentally, these Bayesian estimation methods work via a method called Markov Chain Monte Carlo, where the parameters of the model are successively sampled [6]. You don't need to understand what this means in practice – but historically, researchers have been forced to implement these samplers themselves. That is, deriving them mathematically – which is not trivial – and programming them from scratch. In other words, an applied researcher would need to invest a lot of time to use these methods.
And that is not even considering the biggest hurdle of them all – the computational power the estimation method often needs if it is not to run for a very long time. However, in the last 10 years, Bayesian methods have become easier to use, due to the introduction of new approaches and tools. In particular, the introduction of the PPL Stan has made it possible to write up a model and estimate it using Bayesian methods without having to derive and implement the necessary sampler. Newer approaches to estimation [7] have been integrated into these languages as well. This innovation, in conjunction with deep learning frameworks [8], has allowed for fast inference and extremely flexible modeling tools.
An example of such a tool is the PPL Pyro, which integrates the PyTorch deep learning framework, making it very fast at estimating neural networks in a Bayesian setting. Why is this development interesting for the social sciences? Because it allows us to include domain knowledge in our analysis while still allowing for a lot of flexibility – you can even use Bayesian methods in conjunction with a neural network at this point. An example of a paper that embodies this combination of flexibility and domain knowledge is Causal Effect Inference with Deep Latent-Variable Models. Fundamentally, the problem the authors tackle is this: if there is some underlying, unobserved (latent) variable that influences whether or not you get a treatment, and that can impact the outcome, can we infer this latent variable from other data? Concretely, the example used is summarized in a graph in their paper.
In essence, the method allows all the proxies (education length, income, etc.) of the underlying, unobserved variable – socio-economic status – to be modeled using a deep neural network (the fanciest of fancy AI approaches). In other words, the researchers leverage their domain knowledge of a problem (in this case, that education and income proxy socio-economic status) to estimate a causal effect! I believe these approaches could gain a lot of traction as undergraduates from the social sciences are introduced to them.
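As a small taste of what encoding domain knowledge as priors looks like in Pyro, here is a minimal Bayesian regression sketch; the model, priors, and data are invented for illustration and are not the model from the cited paper:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

def model(education, wage=None):
    # Domain knowledge enters as priors: we believe the return to
    # education is positive and moderate, centered at 0.1.
    beta = pyro.sample("beta", dist.Normal(0.1, 0.05))
    alpha = pyro.sample("alpha", dist.Normal(0.0, 1.0))
    sigma = pyro.sample("sigma", dist.HalfNormal(1.0))
    with pyro.plate("data", education.shape[0]):
        pyro.sample("obs", dist.Normal(alpha + beta * education, sigma), obs=wage)

# Fake data roughly consistent with the prior.
education = torch.rand(100) * 10
wage = 0.5 + 0.12 * education + 0.3 * torch.randn(100)

# Pyro derives and runs the sampler; no hand-coded MCMC needed.
mcmc = MCMC(NUTS(model), num_samples=300, warmup_steps=200)
mcmc.run(education, wage)
print(mcmc.get_samples()["beta"].mean())  # posterior mean near 0.12
```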
Honorable mentions
All the concepts above have been selected for how I believe they map to the everyday challenges applied social science researchers face when analyzing data, while still being approachable for a driven university student. The ecosystem around these concepts is not necessarily mature enough yet to be used by the average researcher, but my bet is that it will become so! Additionally, other cool things are happening at the intersection of the social sciences and machine learning that I think deserve a mention without going into detail. First, transformers seem to still be gaining popularity and proving useful in basically all sorts of problems; I would be surprised if they do not start popping up in economics papers within the next 3 or 4 years. Second, deep reinforcement learning has been extremely effective at learning to navigate dynamic problems – these methods could begin to compete with the traditional dynamic programming approaches usually seen in economics!
Jeppe Johansen is a regular writer at Unreasonable Doubt, where he writes about aliens, economics, the integrity of institutions, and everything in between – if anything, really. Jeppe is a Ph.D. fellow at the Center for Social Data Science at the University of Copenhagen.
[1] Or, in the multi-dimensional case, a hyperplane.

[2] That is not entirely true, but as a simplification for motivation, it suffices.

[3] This way of thinking about significance is not necessarily the healthiest way to do econometrics and statistics.

[4] Outcome, treatment, and controls are terms describing how researchers generally think about cause and effect. The outcome is the variable one is interested in studying, the treatment is the variable that is administered and might impact the outcome, and the controls are other variables that might influence the outcome but which the researcher is not directly interested in; they are primarily added as adjustments.

[5] Standard errors are a measure of the uncertainty of the estimates.

[6] If you want a good introduction to this concept, this article describes it without math.

[7] Variational inference is the primary innovation that has allowed for estimating neural networks using Bayesian methods.

[8] Deep learning frameworks are tools for specifying neural networks and are optimized for performance. These tools have been instrumental in the widespread adoption of neural networks in all sorts of use cases, from self-driving cars to Google image recognition.