The Gender Pay Gap and the (reverse) Epistemic Bait and Switch
How some people make it easier for themselves by asking harder questions.
By Jeppe Johansen
In the last weeks of 2022, the gender pay gap discussion reemerged in Denmark. It arose when some commentators pointed to research that claimed[1] that the gender pay gap disappears when controlling for occupation, implying that the pay gap was not driven by discrimination but by occupational preference. Note that "controlling" is a statistical procedure where you remove the effects of other factors that might also impact the outcome – in this case, pay levels. This aggravated another set of commentators, who claimed that the analysis "controlled away the treatment". They pointed out that if discrimination affects the occupational choices of men and women, we should not control for choice of occupation. Put more generally, discrimination happens at an earlier point in the causal chain, and studies that control for occupation should therefore be dismissed. That is, according to the critics, the studies did not correctly address the possible discrimination because they were looking in the wrong place.
Initially, when I saw this reasoning, I agreed with the objections. I believed that controlling for occupation was a good example of mediator bias[2] – the case where an intermediate variable blocks the channel of discrimination – and that the commentators raised valid objections. However, I have come to believe that a certain bias drives part of such criticism. The best term to describe this bias is what I will call the reverse epistemic bait and switch.
The epistemic bait and switch is a term I first stumbled upon in Philip Tetlock's book Superforecasting. It describes how people often mentally rephrase a question so that they answer an easier one instead of the harder question that was originally posed. He gives the following example of a bait and switch, referring to a forecasting competition in which participants were asked to reason about the cause of death of Yasser Arafat. The objective of the competition was to estimate the probability of French or Swiss government bodies finding elevated levels of polonium in the remains of Yasser Arafat. Instead of answering this difficult question, people would answer the different and much simpler question of whether Israel killed Yasser Arafat. This led to answers of a more simplistic nature, e.g. "Israel would never do that!" or "Of course Israel did it!". This replacement of a hard question with an easy one is the essence of the bait and switch.
I believe the opposite (or reverse) is happening in the gender pay gap discussion. People who believe the gender pay gap is due to discrimination replace the original easy question (are women paid the same as men for doing the same job?) with a harder one (does discrimination play an important role in explaining the gender pay gap?). And by making the question harder, they can epistemically undermine any approach that yields inconvenient answers.
The Statistical Issues
To understand the statistical issue at hand, let's consider a stylized setup that can give credence to the critics' objections about controlling for occupation. Consider the causal diagram below: gender points to discrimination (arrow A) and directly to occupation (arrow D), discrimination points to occupation (arrow B) and to wage (arrow C), and occupation points to wage. If an arrow goes from one node to another, there is a causal relationship between them in the direction of the arrow. The diagram captures the two distinct (stylized) hypotheses that attempt to explain the gender pay gap[3]. The dashed lines represent causal relationships we are unsure about.
The preference hypothesis suggests that the differences between men and women that lead to differences in wages are primarily a function of preferences and gender norms (which can be either biological or cultural), not discrimination. For example, women tend to like working with humans and are usually employed in the public sector in Denmark. In the causal diagram, this corresponds to arrow D existing and arrows A, B, and C not existing. One could argue that an additional arrow should go directly from gender to wage, but for simplicity, I omit this.
The discrimination hypothesis is more complicated. It could take three distinct forms:
(A, B) This case could be described as "gender-stereotypical jobs". The story would be that while discrimination in the work environment is negligible, the ideas that culture enforces on us about differences in male and female preferences determine why women sort into lower-paying jobs. That is, women will choose certain jobs because of discrimination. One example could be a guidance counselor who suggests becoming an accountant to boys and becoming a nurse to girls, even though the boy and the girl initially have the same preferences.
(A, C) This case would be the opposite of the above. That is, gender does not really impact how we choose an occupation; however, discrimination in the workplace has an impact on how men are compensated compared to women.
(A, B, C) Under this scenario, discrimination impacts both occupational choice and wage. This is the most extreme of the three discrimination hypotheses.
Note that the discrimination hypothesis and the preference hypothesis are not mutually exclusive: small differences that stem from deeply ingrained gender norms might be exaggerated by discrimination.
Now, the report that was criticized by commentators for controlling for occupation investigated (among other things) the second and third discrimination cases outlined above. That is, it investigated whether arrow C was empirically present. And it found no substantial evidence of a C in the causal story.
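To make the difference between these worlds concrete, here is a minimal simulation sketch in Python. It is not based on the report or on real data – the two-occupation setup and all effect sizes are invented for illustration. It simulates the (A, B) world, where discrimination only sorts women into lower-paying jobs, and the (A, C) world, where discrimination only lowers women's pay within a job, and then computes the raw gender pay gap and the gap after "controlling" for occupation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
female = rng.integers(0, 2, n)  # gender; arrow A feeds it into the rest of the graph

def simulate(b, c):
    """Simulate occupations and wages.

    b: strength of arrow B (discrimination sorts women out of high-paying jobs)
    c: strength of arrow C (discrimination lowers women's wage within a job)
    """
    # Occupation: True = high-paying sector, False = low-paying sector
    high_occ = rng.random(n) < (0.5 - b * female)
    wage = 30 + 10 * high_occ - c * female + rng.normal(0, 2, n)
    return high_occ, wage

for label, b, c in [("(A, B) occupational sorting only", 0.3, 0.0),
                    ("(A, C) within-job discrimination only", 0.0, 3.0)]:
    occ, wage = simulate(b, c)
    raw = wage[female == 0].mean() - wage[female == 1].mean()
    # "Controlling" for occupation: the male-female gap within each job,
    # averaged over the two jobs (unweighted, for simplicity)
    within = np.mean([wage[(female == 0) & (occ == j)].mean()
                      - wage[(female == 1) & (occ == j)].mean()
                      for j in (False, True)])
    print(f"{label}: raw gap {raw:.2f}, within-occupation gap {within:.2f}")
```

In the (A, B) world the raw gap is large but the within-occupation gap is essentially zero, even though discrimination drives the whole thing – exactly the critics' worry. In the (A, C) world the within-occupation comparison recovers the discrimination, which is what the report was testing for.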
The Validity of the Criticism
The critics reasoned that this was bad practice for measuring discrimination. Specifically, they claimed that the impact of discrimination was obscured (or controlled out) because an important mediator of the discrimination – namely, occupation – was removed.
This criticism is a valid statistical argument. Overcontrolling is a serious issue, and it is easy to end up finding zero effect of the treatment (in this case, discrimination) if you simply include all possible variables as controls. However – and this is the central point of this blog post – it also becomes a very easy attack on anything that does not conform to one's own worldview. It is reasonable to investigate whether men and women are paid the same when they work the same job. And it is not good practice to assume that this kind of analysis adds no value just because it does not answer every question about discrimination, as outlined above. We should not use the report to claim that discrimination has no impact on the wage level (which, by the way, I don't believe it claims), but we can use it to argue that the effect is probably limited if you narrow discrimination to the case where a manager in a given job discriminates between the sexes as measured by wage[4].
More importantly, should people who believe in the discrimination hypothesis not also update their prior beliefs about discrimination based on research like this report? In the Danish debate, believers in the discrimination hypothesis don't do this, and I feel that this is the greatest issue. All research designs are flawed. The perfect design does not exist! And rejecting anything inconvenient because the design is not perfect is a telltale sign of someone who only updates their worldview when new information confirms their preconceived ideas. Put differently, I suspect that the insistence on a perfect study design allows believers in the discrimination hypothesis to ignore any evidence that does not reinforce already preconceived notions of discrimination. However, the same people are often all too happy to rely on imperfect studies when they support their worldview. Take the famous paper investigating "blind" auditions in classical orchestras, which informed the belief that discrimination is a real problem in the labor market. Here is a quote from its conclusion:
The question is whether hard evidence can support an impact of discrimination on hiring. Our analysis of the audition and roster data indicates that it can, although we mention various caveats before we summarize the reasons. Even though our sample size is large, we identify the coefficients of interest from a much smaller sample. Some of our coefficients of interest, therefore, do not pass standard tests of statistical significance and there is, in addition, one persistent result that goes in the opposite direction. The weight of the evidence, however, is what we find most persuasive and what we have emphasized. The point estimates, moreover, are almost all economically significant.
People who are very certain about the strong effects of discrimination in the labor market often use this example. I have even used it myself multiple times. However, when reading the conclusion, you can easily see how a person on the opposite side of the political aisle would be able to reject these findings due to the weak statistical results. Specifically, "Some of our coefficients of interest, therefore, do not pass standard tests of statistical significance and there is, in addition, one persistent result that goes in the opposite direction" is not something you would associate with a knockdown argument for discrimination.
These analyses complement each other in the sense that both investigate discrimination in the labor market, and neither presents a knockdown argument about whether discrimination is happening. If we reject the gender pay gap analysis on the basis of its design, should the paper on orchestral auditions not also be rejected due to its weak findings? Or, alternatively, should both pieces of evidence update our prior beliefs about discrimination in the labor market and inform our worldview?
The (reverse) Epistemic Bait and Switch
As I have alluded to in this post, I believe this reverse bait and switch is very common in hot-button political discussions, expressed as one-sided criticisms and defenses of facts and research. Consider topics like gender, race, climate change, gun control, drug legalization, or things as boring as progressive tax rates. Proponents on either side of these topics will very often cite some studies and dismiss others in perfect alignment with their political compass – and for seemingly good reasons: the criticisms are usually valid and point to real limitations in the literature. The problem is the double standard and the aspiration for knockdown arguments in public discourse. Knockdown arguments just don't exist, with very few exceptions. Most of the free lunches have been eaten, and we now need to reason about thorny problems where the answers are rarely clear. But at least this framework might help us spot whether pundits (and other public figures) are reasoning honestly, by identifying whether they answer slightly different and harder questions than the ones we need answered to update our beliefs. More importantly, we should as a society be skeptical of people in the public debate who claim all the evidence points in one direction, and be more willing to listen to people who admit the evidence is messy and scattered and who are able to make "on average" claims.
Jeppe Johansen is a regular writer at Unreasonable Doubt, where he writes about aliens, economics, the integrity of institutions, and everything in between – if anything really. Jeppe is a Ph.D. fellow at the Center for Social Data Science at the University of Copenhagen.
[1] I believe he is referring to this report from 2020, which is summarized in this short article. The specific claim is that only 2% of the wage gap between men and women is left when controlling for other factors. I will also add that the report is not nearly as adamant in its language as the surrounding Twitter debate would make you think.
[2] Mediator bias occurs when some intermediate variable blocks the causal pathway of the treatment. A simple example could be investigating the effects of doping on performance. Say doping increases both the amount of training you can do and your muscle mass. If you do a statistical analysis where you control for muscle mass and the number of hours of training, you will not find any effect of doping on performance. But that would be a bad interpretation. The problem is that muscle mass and hours of training are precisely the channels that carry the effect of the treatment: doped individuals have larger muscles and train more, which is why they perform better.
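Here is a minimal simulation sketch of this doping example, with made-up effect sizes, using only numpy:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
doped = rng.integers(0, 2, n)

# Doping works only through the two mediators: muscle mass and training hours
muscle = 50 + 5 * doped + rng.normal(0, 3, n)
training = 10 + 2 * doped + rng.normal(0, 1, n)
performance = 0.6 * muscle + 1.5 * training + rng.normal(0, 2, n)

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# No controls: the doping coefficient recovers the true total effect (about 6)
print(ols(performance, doped)[1])
# Controlling for the mediators: the doping coefficient collapses to about 0,
# even though doping genuinely causes better performance
print(ols(performance, doped, muscle, training)[1])
```

The zero in the second regression is not evidence that doping does nothing; it is exactly the mediator bias described above.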
[3] Some critics also mentioned the potential for "collider bias" issues. This is trickier to explain, so in the interest of time, I have omitted a full treatment. You can read an example here in Scott Cunningham's book. Collider bias would also render the interpretation of the "zero" effect when controlling for occupation wrong. However, I believe this is a less likely issue, because the effect goes to (basically) zero. If you disagree, please contact me – I am ready to change my mind.
[4] And yes, again I ignore the possible "collider bias" issue because the effect of gender on wage goes towards zero when controlling for occupation. This might be a future blog post of its own. But to the people who strongly believe that collider bias is a good model for this problem: you need to ask yourself what the probability is of the graph not being faithful – that is, of a spurious collider path cancelling out a real discrimination effect almost exactly, so that the estimate lands at zero by coincidence. A small sketch of the collider story follows below.
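To give the collider story some flavor, here is a minimal hypothetical sketch (toy numbers of my own choosing, not a model of the Danish data). Suppose an unobserved factor – call it ability – affects both occupation and wage, and gender affects occupation but has no direct effect on wage at all. Occupation is then a collider, and conditioning on it manufactures a gender-wage association out of nothing:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
female = rng.integers(0, 2, n)
ability = rng.normal(0, 1, n)  # unobserved; drives both occupation and wage

# Occupation is a collider: both gender and ability feed into it
high_occ = (0.8 * ability - 0.5 * female + rng.normal(0, 1, n)) > 0
# Wage has NO direct gender effect at all
wage = 30 + 5 * high_occ + 3 * ability + rng.normal(0, 1, n)

# Conditioning on the collider: women in high-paying jobs needed higher
# ability to get in, so a spurious within-occupation gap appears
men = wage[high_occ & (female == 0)].mean()
women = wage[high_occ & (female == 1)].mean()
print(f"spurious within-occupation gap: {men - women:.2f}")
```

Note that the bias shows up as a clearly nonzero gap. For a collider structure to produce the near-zero estimate the report found, the spurious path would have to cancel a real discrimination effect almost perfectly – which is the faithfulness question raised above.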