Racial Bias in Italian Football Revisited

Author

Dimiter Toshkov, @DToshkov

Published

April 4, 2023

Introduction

A recent article in the journal Sociology claims that ‘darker-skinned [football] players receive more foul calls and more cards than lighter-skinned players’. The authors - Beatrice Magistro and Morgan Wack - collect data on 2205 football players in the Italian Serie A for the period 2009-2021. The dataset includes information on the skin color of the player (from the Football Manager videogame), number of attempted tackles, fouls committed, yellow and red cards received, minutes played, position, nationality, and club. Analyzing this data, the authors find ‘evidence of systematic racial bias against dark-skinned players’. They conclude that ‘the effects of racial biases in the Italian Serie A are striking’ and ‘[D]arker-skinned players are judged to have been guilty of both minor and major infractions at higher rates than their lighter-skinned teammates’. These are clearly important results on a normatively and politically salient topic, so it is not surprising that the article received coverage in the popular press (e.g. in The Guardian, see also the Twitter thread).

I am always interested in how we can establish bias and discrimination. The methodological challenges involved are inevitably considerable, and this article presents a carefully-argued and executed approach. The authors conduct a number of robustness checks, make many sensible statistical modeling choices, and openly discuss the limitations and uncertainties of their work. Best of all, they have made their data and analysis scripts public and openly accessible. Kudos for that! With the data being available, I decided to take a deep dive and see what patterns I find for myself.

Overall, my conclusions are rather different that the ones advanced in the article. I find that there are no significant associations between skin color and fouls/cards, when we only include appropriate controls. In fact, the number of fouls/cards vary in curvilinear fashion with skin color, so that players in the middle of the skin color distribution received the most fouls and cards. When potentially bad controls are included, for the players with Italian nationality, the number of fouls/cards increases significantly with skin color, but the effect is driven almost entirely by a couple of players - namely, Mario Balotelli and Stefano Okaka. Once we remove these two players from the data, there is no significant effect of skin color on fouls/cards for the Italian players as well.

In sum, it is very hard to interpret the evidence in racial terms. If it is about race, how come darker skin is associated with more negative referee decisions only for the Italian nationals (if we keep Balloteli and Okaka in the sample and conditions on variables such as minutes played, which - as I explain below - are possible colliders), but not for all the foreigners in the league? Why is the link between skin color and referee decisions more likely to be found for players with the same nationality than for players with different nationalities, given that race (whatever it means) varies much more strongly between than within nationalities? Moreover, given that the number of fouls/cards peak in the middle of the skin color distribution, what race corresponds to that?

Finally, I see no evidence that the effects of skin color on fouls/cards were significantly different during the 2020/21 season, which was played at empty stadiums due to COVID-19. This goes against the claim of the article that the public is responsible for a significant part of the racial bias in referee decisions.

Does this mean that there is no racial bias in the Italian Serie A? Of course not! Absence of evidence is not the same as evidence of absence. But - in my interpretation - the available data does not justify the strong claims about racial bias advanced in the article.

I present my analysis below. Comments and disagreements are welcome. I would like to thank again the authors for making their data and code available, which makes this re-evaluation of their arguments possible.

Data Analysis I: Descriptive analysis

To start with, let’s load and prepare the original dataset (following the script of Magistro and Wack, and adding a couple of additional variables).

library(tidyverse)
#load the final dataset
Total_Data <- read_csv(file='https://raw.githubusercontent.com/beatricemagistro/Racial_Bias_SerieA/main/Final_Merged.csv')

# create some new variables
Total_Data$COVID <- ifelse(Total_Data$Season.x=="2020-21", 1,0) # dummy for COVID
Total_Data_subset<- subset(Total_Data, Position != "GK") #exclude GKs because practically no fouls and no tackles

Total_Data_subset$Position <- as.factor(Total_Data_subset$Position) #transform to factor
Total_Data_subset$Season.x <- as.factor(Total_Data_subset$Season.x) #transform to factor
Total_Data_subset$Squad.x <- as.factor(Total_Data_subset$Squad.x) #transform to factor

# additional variables (DT)
Total_Data_subset$it = ifelse(Total_Data_subset$Nation=='ITA', 1, 0) #dummy for Italian nationality
Total_Data_subset$Italian= as.factor(Total_Data_subset$it) #make it a factor
Total_Data_subset$cards = Total_Data_subset$yellow_cards_overall + 2 * Total_Data_subset$red_cards_overall #total number of cards, with reds counting double

There are 5990 observations in the dataset (after removing the goalkeepers), but they correspond to 2034 unique players, because the unit of observation is a player-during-a-season.The skin color variable ranges between 1 and 20, with a median of 5 and a standard deviation of 4.4. The interquartile range is of length 4, between 4 and 8. Here are photos of some of the players with skin color values of 1, 5, 10/11 and 19/20.

Total_Data_subset$Player.x[which(Total_Data_subset$Skin==min(Total_Data_subset$Skin, na.rm=T))]
Total_Data_subset$Player.x[which(Total_Data_subset$Skin==median(Total_Data_subset$Skin, na.rm=T))]
Total_Data_subset$Player.x[which(Total_Data_subset$Skin==10)]
Total_Data_subset$Player.x[which(Total_Data_subset$Skin==max(Total_Data_subset$Skin, na.rm=T))]

Wesley Snyder, NED (1)

Aleksandar Kolarov, SRB (5)

Maya Yoshida, JAP (10)

Musa Barrow, GAM (20)

Raffaele Maiello, ITA (1)

Leonardo Bonucci, ITA (5)

Giuseppe Sculli, ITA (11)

Stefano Okaka, ITA (19)

As we can see, there is considerable range of variation in the skin color of Italian nationals as well. A total of 783 players, or 39% from the total of 2034 players included in the dataset, have Italian nationality. 23 Italian nationals have skin color greater than or equal to 10.

To start the analysis, we can plot the average number of foul calls and cards per skin color value. I combine the number of yellow and red cards into a single measure (red cards count double), because these bookings are not independent events. If someone is sent off with a direct red card, he will record zero yellow cards for the game, but he would have been booked by the referee. So someone who gets only red cards will appear as favored by the referee in the analysis of yellow cards.

means <- Total_Data_subset %>%
  group_by(Skin) %>%
  summarise (m_fouls = mean(Fouls, na.rm=T),
             sd_fouls = sd(Fouls, na.rm=T),
             m_cards = mean(cards, na.rm=T),
             sd_cards = sd(cards, na.rm=T))

library(mgcv)
library(tidymv)
library(ggplot2)
library(ggpubr)

gam_fouls <- mgcv::gam(Fouls ~ s(Skin), data=Total_Data_subset)
model_p <- predict_gam(gam_fouls)

p1 <- model_p %>%
  ggplot(aes(Skin, fit)) +
  geom_line(col='red', linewidth = 1.1) +
  geom_line(aes(y = fit+2*se.fit), color = "black", linetype = "dotted") +
  geom_line(aes(y = fit-2*se.fit), color = "black", linetype = "dotted") +
  geom_point(data=means, aes(x = Skin, y = m_fouls), col='red') +
  labs(x = '\nSkin color', y = 'Number of fouls\n') + my_theme_3() +
  theme(panel.grid.major = element_line(colour="lightgrey", size=0.2)) +
  coord_cartesian(ylim=c(0,30)) + 
  scale_x_continuous(breaks = c(1,5,10,15,20)) +  
  scale_y_continuous(breaks = seq(-10, 50, 10)) 

gam_cards <- mgcv::gam(cards ~ s(Skin), data=Total_Data_subset)
model_p2 <- predict_gam(gam_cards)

p2 <- model_p2 %>%
  ggplot(aes(Skin, fit)) +
  geom_line(col='red', linewidth = 1.1) +
  geom_line(aes(y = fit+2*se.fit), color = "black", linetype = "dotted") +
  geom_line(aes(y = fit-2*se.fit), color = "black", linetype = "dotted") +
  geom_point(data=means, aes(x = Skin, y = m_cards), col='red') +
  labs(x = '\nSkin color', y = 'Number of cards\n') + my_theme_3() +
  theme(panel.grid.major = element_line(colour="lightgrey", size=0.2)) +
  coord_cartesian(ylim=c(0,5)) + 
  scale_x_continuous(breaks = c(1,5,10,15,20)) +  
  scale_y_continuous(breaks = seq(-10, 50, 1)) 

plot1 = ggarrange(p1, p2, ncol = 2, labels = c("A. All foul calls", "B. Yellow and red cards"))
plot1

The red dots in the figure above show the observed average number of foul calls (left) and cards (right) for each of the 20 values of the skin color variable. The red lines are based on estimates from Generalized Additive Models (GAMs), with skin color entered as a smooth term based on a thin plate regression spline. The black dotted lines show the limits of the 95% confidence intervals of the fits.

As we can see from the figure above, there is a lot of random variation in the averages of foul calls and cards per skin color value. The smoothed lines are very wiggly, but even they fail to follow closely the observed means (the dots). If anything, it seems that the average number of fouls and cards is highest around skin color values of 8-9.

In light of this figure, the claim from the article’s abstract that ‘darker-skinned players receive more foul calls and more cards than lighter-skinned players’ could only be valid once we control for additional variables. Such controls might be appropriate, but they also might not be appropriate, depending on the goal of the research and the assumptions about the causal structure underlying the problem.

Compare the figure above with the text from The Guardian story: ‘Referees in Italy’s top football league give more yellow and red cards to Black and darker-skinned players than to their light-skinned teammates, research shows. Officials in Serie A awarded an average of 20% more fouls per season against darker-skinned players from 2009 to 2019, with 11% more yellow cards and 16% more red cards.’ None of these statements in The Guardian are actually true. The journalists fail to mention that these estimates are not what one sees in the data, but what is extrapolated from statistical models based on a particular set of assumptions and modeling choices. Next, we turn to the details of these models.

Data Analysis II: Generalized additive models

The patterns in the figure above are purely observational. It could be that there are variables that confound the relationships between skin color and fouls/cards, so that different patterns will emerge once we control for the right confounders. For now, I will follow the modeling approach and choice of covariates to include in the models of Magistro and Wack. Later, I will examine these choices in more detail. To keep the same presentation setup as above, I will first regress the number of fouls and cards on the covariates, but without including skin color. Then I will plot the output from the GAM models of the residuals from these models as a function of skin color.

lm_fouls1 <- lm(Fouls~Mins + Position + TotalAttemptedTackles + 
                  Season.x + Squad.x + Nation, data=Total_Data_subset)
Total_Data_subset$fouls_res <- lm_fouls1$residuals

gam_fouls_res <- mgcv::gam(fouls_res ~ s(Skin), data=Total_Data_subset)
model_p_res <- predict_gam(gam_fouls_res)

p1res<-model_p_res %>%
  ggplot(aes(Skin, fit)) +
  geom_line(col='red', linewidth = 1.5) +
  geom_line(aes(y = fit+2*se.fit), color = "black", linetype = "dotted") +
  geom_line(aes(y = fit-2*se.fit), color = "black", linetype = "dotted") +
  labs(x = '\nSkin color', y = 'Fit to model residuals\n') + my_theme_3() +
  theme(panel.grid.major = element_line(colour="lightgrey", size=0.2)) +
  coord_cartesian(ylim=c(-2,2)) + 
  scale_x_continuous(breaks = c(1,5,10,15,20)) +  
  scale_y_continuous(breaks = seq(-10, 50, 1)) 


lm_cards1 <- lm(cards~Mins + Position + TotalAttemptedTackles + 
                  Season.x + Squad.x + Nation, data=Total_Data_subset)
Total_Data_subset$cards_res <- lm_cards1$residuals

gam_cards_res <- mgcv::gam(cards_res ~ s(Skin), data=Total_Data_subset)
model_p2_res <- predict_gam(gam_cards_res)

p2res<-model_p2_res %>%
  ggplot(aes(Skin, fit)) +
  geom_line(col='red', linewidth = 1.5) +
  geom_line(aes(y = fit+2*se.fit), color = "black", linetype = "dotted") +
  geom_line(aes(y = fit-2*se.fit), color = "black", linetype = "dotted") +
  labs(x = '\nSkin color', y = 'Fit to model residuals\n') + my_theme_3() +
  theme(panel.grid.major = element_line(colour="lightgrey", size=0.2)) +
  coord_cartesian(ylim=c(-1,1)) + 
  scale_x_continuous(breaks = c(1,5,10,15,20)) +  
  scale_y_continuous(breaks = seq(-10, 50, 1)) 

plot2 = ggarrange(p1res, p2res, ncol = 2, labels = c("A. All foul calls", "B. Yellow and red cards"))
plot2

In the figure above, there is some evidence for a positive and more-or-less linear effect of skin color on the number of foul calls (left panel). There is no evidence for an effect on the number of cards (right panel). There are approximately two additional fouls predicted for someone with very dark skin compared to someone with very light skin, per season. The size of the estimated effect, however, is smaller than the one reported in the article, which is due to the linearity assumption imposed on the effect of skin color.

Note that these results are not observational: their goal is to identify the causal effect of skin color. To do that, the choice of covariates to control for is crucial. In general, it is not always the case that the more controls are included in a model, the more credible the estimated causal effect of interest. The decision whether to control or not for a certain variable needs to be considered carefully, because it can have a big impact on the results. We turn to this task now.

Choice of covariates

Minutes played

It is time to tackle the choice of covariates to include in the models. First, let’s consider the number of minutes played. It seems natural to include this variable as a covariate as the more time players spend on the field, the more time they have to accumulate foul calls and cards. Including this variable in the model makes a huge difference. The unconditional linear effects of skin color on foul calls and cards are actually negative (and in the case of cards, significant). But once we condition on minutes played, they switch signs, increase dramatically in magnitude, and gain statistical significance.

library(sandwich)
library(lmtest)

# make a function to extract the skin color effect information
skin_effects <- function (model, data) {
  t <- coeftest(model,vcov = vcovCL, cluster = ~ data$Player.x)[,] %>% as.data.frame() %>% tibble::rownames_to_column(var = "term")
  print(paste0("Coef: ", round(t$Estimate[t$term=='Skin'],2), " (", round(t$`Std. Error`[t$term=='Skin'],2), "), p=", round(t$`Pr(>|t|)`[t$term=='Skin'],3)))
}

# Unconditional effects of skin color on foul calls and cards
model0a =  lm(Fouls~Skin, data=Total_Data_subset)
model0b =  lm(cards~Skin, data=Total_Data_subset)
skin_effects(model0a, Total_Data_subset)
[1] "Coef: -0.1 (0.08), p=0.211"
skin_effects(model0b, Total_Data_subset)
[1] "Coef: -0.04 (0.02), p=0.02"
# Add minutes played
model1a =  lm(Fouls~Skin  + Mins, data=Total_Data_subset)
model1b =  lm(cards~Skin  + Mins, data=Total_Data_subset)
skin_effects(model1a, Total_Data_subset)
[1] "Coef: 0.18 (0.06), p=0.002"
skin_effects(model1a, Total_Data_subset)
[1] "Coef: 0.18 (0.06), p=0.002"
# Bivariate correlations
round(cor(Total_Data_subset$Skin, Total_Data_subset$Mins, use='complete'),2)
[1] -0.09
round(cor(Total_Data_subset$Fouls, Total_Data_subset$Mins, use='complete'),2)
[1] 0.78
round(cor(Total_Data_subset$cards, Total_Data_subset$Mins, use='complete'),2)
[1] 0.66

What’s going on? Minutes played are not very strongly correlated with skin color (-0.09) [darker players spend, on average, less time on the field], but they are very strongly correlated with foul calls (0.78) and cards (0.66). So the product of these two correlations is enough to flip the sign of the effect of skin color. But the important question is, should we condition on minutes played? Not really, because minutes played are a plausible collider. The time spent on the field can be influenced by skin color (e.g. because darker players tend to be younger or be more likely to play in particular positions). And it can be influenced by the number of foul calls and cards received: (1) players are more likely to be substituted after committing fouls and getting cards, (2) accumulating cards leads to missing games, and (3) players who commit too many fouls and get too many cards are likely to exit the league. If both skin color and foul calls/cards influence minutes played, the variable is a collider, and colliders should be not be included as ‘controls’, as this leads to bias in the estimate of the causal relationship of interest (between skin color and foul calls/cards in our case).

Number of attempted tackles

Next, let’s consider the inclusion of the number of attempted tackles. . Making the estimates for the effects of skin color conditional on the number of attempted tackles would be appropriate if the attempted tackles were a confounder. If they were a predictor of the number of fouls that is not otherwise related to skin color, their inclusion in the model might do no harm and improve precision of the estimate of the effect of skin color. But there are scenarios under which the number of attempted tackles is a ‘bad control’ and should be kept out. For example, if it is a mediator of the effect of skin color, it should not be controlled for. This is plausible, if darker players anticipate referee reactions and attempt fewer tackles, but the ones they do are too late. Another problematic situation is if some players make many fouls when they attempt to tackle, and as a result of that make fewer attempts. In this case, attempted tackles will be an effect (descendant) of the outcome variable. The most problematic, and quite plausible, scenario is if the number of attempted tackles is a collider; that is, if it is influenced by both skin color (for example, darker players being less likely to tackle because of the position they play) and the number of fouls the player has already made (for example, because they fear getting booked with a card). Conditioning on a collider can introduce bias in the estimation of the effect of interest and should be avoided.

Something that was not obvious to me from the start was that attempted tackles are only one of the ways in which players can make a foul or get a card. There are many other reasons. In the dataset, there are many players with more fouls than attempted tackles. Gianvito Plasmati has 18 fouls and 1 attempted tackle. Michele Paolucci has 38 fouls and 2 cards for only 5 attempted tackles. Therefore, interpreting the number of attempted tackles as a measure of the player’s aggressiveness (as the article does) is problematic. According to the authors, the use of tackle data allows to control for ‘propensity towards violent conduct as discussed in Miguel et al.’s (2011) excellent article’. But the referenced article uses the number of yellow and red cards as a proxy for propensity towards violent conduct, and not the number of attempted tackles. Indeed, having a high number of cards relative to the number of tackles is a better measure of aggressiveness than the number of attempted tackles themselves, which are a normal part of the game. In fact, some of the most aggressive behavior that players can exhibit will result in fouls and cards that are not related to attempted tackles. For all these reasons, I don’t think that this variable is an appropriate control neither.

Position on the field

There is a good reason why the position of the player (defender, midfielder or forward) should be included, as it is strongly correlated with the outcomes of interest. Yet, the player’s position could also be a mediator on a causal path from skin color to fouls/cards, via speed, for example. This would be the case if darker players are systematically faster, which makes them more likely to play as forwards, which makes them less likely to receive fouls and cards. Including a mediator in the list of covariates would not be appropriate if we want to estimate the total effect of skin color. But - all things considered - the benefit of increased precision is probably worth a small bias from including a possible mediator in this case. Having dummies for the year (football season) also seems like a reasonable choice.

Nationality

What about nationality? The models in the article include a categorical variable for the player’s nationality. But why? Including this variable makes the estimates of the effect of skin color conditional on the players having the same nationality. I see no reason whatsoever why we would want that. If there is bias against darker players, why would that bias be relative to the average skin color of the country whose nationality the player holds? To see why this could be problematic, consider two countries: one with uniformly light players and the other with uniformly dark players. Imagine that there is significant bias against darker players who are much more likely get foul calls. if we regress the number of fouls on skin color and a categorical variable for the country, we will not register any effect of skin color, because the effect will be picked up by the country dummy. Only the residual within-country variation in skin color will be allowed to influence the regression coefficient of skin color, but if this within-country variation is small or if it is in a small range that is irrelevant for the theoretical mechanism of bias, no effect will be registered even if it exists in reality.

This could explain why including country (nationality) dummies can bias the results against finding an effect of skin color. But could it do the opposite, making an effect appear where there is none? Yes! Consider again a scenario with two countries: one with lighter and one with darker players. Within each of the countries, there is some limited range of variation in skin color, and this happens to be correlated with the foul calls of the players for unsystematic reasons. But, overall, there is no relationship between skin color and fouls: the averages for the players from the lighter and from the darker countries are exactly the same. Nevertheless, if we run a regression with skin color and country dummies, we might register a significant effect of skin color, that is driven entirely from the within-country variation, which might be meaningless in substantive terms. All in all, I don’t think that controlling for the nationality of the player is appropriate, because I cannot imagine why it is the relative skin color of the players (compared to the average of the other players with this nationality) that matters and not the absolute skin color, unconditional on nationality. Think about what this means for racial bias: for the same offence, referees are more likely to sanction a darker colored Dane compared to a lighter-colored Dane, and a darker-colored Ghanaian compared to a lighter-colored Ghanaian, but referees are not more likely to sanction the Ghanaian nationals more than the Danes. That’s the kind of skin color bias that is picked up by a statistical model with nationality as a ‘control’. I don’t see how substantively or theoretically this could make sense. The only situation where this is plausible is comparing Italian nationals vs. foreigners, and this is a situation I explore at length below. On a more technical note, there are 24 nationalities with only one player in the Serie A for the period under study, so the nationality dummies effectively remove these players - many of whom come from African countries - from the sample. And there are many more countries with only a few players and a very limited range of variation in skin color, so estimating within-nationality effects for these countries is suspect as well. All in all, I think that nationality is not an appropriate control and should not be included in the models for estimating the total effect of skin color on referee decisions.

Squad

Similar considerations apply to the inclusion of dummies for the team of the player (the squad). Why would we care whether skin color makes a difference, conditional on the average skin color of the team? Is this something that referees (or the public) calculate when making decisions? If there is bias against dark players, why would that bias be different conditional on the average color of the player’s teammates? Again, including team dummies can bias results against finding effects of skin color, but it could also produce such effects when there are none, if the within-team variation in skin color happens to be correlated with the number of fouls and cards. This could arise if within teams darker players are more likely to play in particular positions, for example. Furthermore, the teams with the highest average values on the skin color scale are some of the richest teams in the league, which might be related to referee favoritism (or bias). Lastly, the way the team dummies are constructed in the models reported in the article is not season-specific, so the skin color of a particular player is made relative to the average skin value of the players who would ever play for this club during the 12 years of the analysis, even after the player might have retired. While this last point is easily fixable, the theoretical argument against included squad dummies remains.

To clarify, the problem is not when within-nationality or within-team variation in skin color is associated with the outcome (fouls, cards). The problem is when only within-nationality or within-team variation is associated with the outcome, but between-nationality or between-team is not.

Back to modeling

So let’s see what happens if we exclude the country and team dummies and number of tackles from the list of covariates (predictors). First we replicate the model of foul calls reported in the article. Then we remove the nationality and squad dummies, and then we remove the number of attempted tackles.

library(sandwich)
library(lmtest)

# replication fouls calls
model1 =  lm(Fouls~Skin + Mins + Position +TotalAttemptedTackles + Season.x + Squad.x + Nation, data=Total_Data_subset)
skin_effects(model1,Total_Data_subset )
[1] "Coef: 0.26 (0.09), p=0.003"
# remove minutes played
model2 = update(model1, . ~ . - Mins)
skin_effects(model2, Total_Data_subset)
[1] "Coef: 0.25 (0.1), p=0.012"
# remove nationality dummy 
model3 = update(model2, . ~ . - Nation)
skin_effects(model3, Total_Data_subset)
[1] "Coef: 0.13 (0.05), p=0.022"
# remove squad dummy
model4 = update(model3, . ~ . - Squad.x)
skin_effects(model4, Total_Data_subset)
[1] "Coef: 0.11 (0.06), p=0.054"
# remove attempted tackles
model5 = update(model4, . ~ . - TotalAttemptedTackles)
skin_effects(model5, Total_Data_subset)
[1] "Coef: -0.11 (0.08), p=0.151"
# add interaction with Italian nationality
model6 = update(model5, . ~ . + it * Skin)
skin_effects(model6, Total_Data_subset)
[1] "Coef: -0.22 (0.09), p=0.015"

Removing the minutes played and the nationality variable from the model reduces the coefficient of skin color by 50% (from 0.26 to .13). Removing the squad dummies brings it further down to 0.11. Removing the number of attempted tackles makes the coefficient of skin color negative.

In the last step, we introduce a dummy for Italian vs. international players and we interact it with skin color, so that the effect of skin color is allowed to vary in these two broad groups of players. Note that both these groups are large enough and have sufficient variation within the groups to avoid the problems with the nationality variable discussed above. Furthermore, there are theoretical reasons to suspect that referees and the public might treat Italian nationals and foreigners in the league differently. Once we include the interaction with Italian nationality, the effect of skin color in the remaining group (thus, all non-Italian nationals) is down to -0.22 and statistically significant at the 0.05 level.

We will have more to say about the differential effects of skin color for Italians and foreigners below. But before we move on, let’s look at how the effects on cards change as well.

# replication fouls calls
model1 =  lm(cards~Skin + Mins + Position +TotalAttemptedTackles + Season.x + Squad.x + Nation, data=Total_Data_subset)
skin_effects(model1,Total_Data_subset )
[1] "Coef: 0.04 (0.02), p=0.032"
# remove minutes played
model2 = update(model1, . ~ . - Mins)
skin_effects(model2, Total_Data_subset)
[1] "Coef: 0.03 (0.02), p=0.063"
# remove nationality dummy 
model3 = update(model2, . ~ . - Nation)
skin_effects(model3, Total_Data_subset)
[1] "Coef: 0.01 (0.01), p=0.258"
# remove squad dummy
model4 = update(model3, . ~ . - Squad.x)
skin_effects(model4, Total_Data_subset)
[1] "Coef: 0.01 (0.01), p=0.416"
# remove attempted tackles
model5 = update(model4, . ~ . - TotalAttemptedTackles)
skin_effects(model5, Total_Data_subset)
[1] "Coef: -0.03 (0.01), p=0.061"
# add interaction with Italian nationality
model6 = update(model5, . ~ . + it * Skin)
skin_effects(model6, Total_Data_subset)
[1] "Coef: -0.04 (0.02), p=0.026"

When yellow and red cards are modeled together (as they should), the effect of skin color with all covariates included is 0.04 and drops to 0.02 as soon as we remove the minutes played or the nationality dummy. It falls further to 0.01 once we remove the squad dummy and turns negative when we remove the number of attempted tackles.

Data Analysis III: The Italians

Let’s explore in more details what happens when we model Italians and non-Italians separately. To do so, we first regress the number of foul calls/cards on the number of minutes played, the player’s position and the season dummies. Then we take the residuals from these models and regress them on skin color interacted with Italian nationality, but using the flexible additive models. Finally, we plot these non-linear effects separately for the two groups of players.

lm_fouls2 <- lm(Fouls~Mins + Position + Season.x, data=Total_Data_subset)
Total_Data_subset$fouls_res2 <- lm_fouls2$residuals

gam_fouls_res2 <- mgcv::gam(fouls_res2 ~ Italian + s(Skin, by=Italian), data=Total_Data_subset)
model_p_res2 <- predict_gam(gam_fouls_res2)

p7 <- model_p_res2 %>%
  ggplot(aes(Skin, fit)) +
  geom_smooth_ci(Italian)+
  labs(x = '\nSkin color', y = 'Fit to model residuals\n') + my_theme_3() +
  theme(panel.grid.major = element_line(colour="lightgrey", size=0.2),
        legend.position=c(.25,.79)) +
  coord_cartesian(ylim=c(-2,20)) + 
  scale_x_continuous(breaks = c(1,5,10,15,20)) +  
  scale_y_continuous(breaks = seq(-10, 50, 5)) 

lm_cards2 <- lm(cards~Mins + Position + Season.x, data=Total_Data_subset)
Total_Data_subset$cards_res2 <- lm_cards2$residuals

gam_cards_res2 <- mgcv::gam(cards_res2 ~ Italian + s(Skin, by=Italian), data=Total_Data_subset)
model_p2_res2 <- predict_gam(gam_cards_res2)

p8 <- model_p2_res2 %>%
  ggplot(aes(Skin, fit)) +
  geom_smooth_ci(Italian)+
  labs(x = '\nSkin color', y = 'Fit to model residuals\n') + my_theme_3() +
   theme(panel.grid.major = element_line(colour="lightgrey", size=0.2),
        legend.position=c(.25,.79)) +
  coord_cartesian(ylim=c(-1,2)) + 
  scale_x_continuous(breaks = c(1,5,10,15,20)) +  
  scale_y_continuous(breaks = seq(-10, 50, 1)) 

plot4 = ggarrange(p7, p8, ncol = 2, labels = c("A. All foul calls", "B. Yellow and red cards"))
plot4

For Italian nationals, the number of fouls shoots up dramatically with skin color, but for the foreigners in the league, there is hardly any difference (when minutes played are included in the model, which - as discussed above - might be the wrong modeling choice). When it comes to cards, for Italians again there is some positive relationship with skin color (although nowhere near as dramatic as was the case for fouls), while for the foreign nationals, there is none. The only significant relationship is between fouls and skin color for Italian nationals. This relationship seems to be driven by observations at the extreme of the skin color distribution, and the effects are suspiciously large given the patterns in the rest of the data. There are not too many observations in that corner of the data space, so it is possible that a small number of outliers are responsible for the line shooting up. Inspection of the standardized residuals suggests that Mario Balotelli and Stefano Okaka play an outsized role in driving up the relationship between skin color and fouls/cards. Balotelli has 232 fouls and 41 cards for a total of 7072 minutes played and 62 attempted tackles. Okaka has 348 fouls and 30 cards for a total of 9486 minutes played and 149 attempted tackles.

Data Analysis IV: Outliers

What happens if we exclude these two observations from the dataset? In short, all significant effects of skin color disappear in the subset of Italian nationals as well, even when minutes played is included as a covariate.

Total_Data_subset2 <- Total_Data_subset %>% filter(Player.x!='Mario Balotelli', Player.x!='Stefano Okaka')


#only Italians, fouls
lm_fouls3.it.s <- lm(Fouls ~ Position  + Mins + Season.x + Skin, data=Total_Data_subset2[Total_Data_subset2$it==1,])
skin_effects (lm_fouls3.it.s , Total_Data_subset2[Total_Data_subset2$it==1,])
[1] "Coef: 0.09 (0.14), p=0.502"
#only Italians, cards
lm_cards3.it.s <- lm(cards ~ Position  + Mins + Season.x + Skin, data=Total_Data_subset2[Total_Data_subset2$it==1,])
skin_effects (lm_cards3.it.s , Total_Data_subset2[Total_Data_subset2$it==1,])
[1] "Coef: 0.04 (0.03), p=0.214"
#entire sample, fouls
lm_fouls3.s <- lm(Fouls ~ Position  + Mins + Season.x + Skin * it, data=Total_Data_subset2)
skin_effects (lm_fouls3.s , Total_Data_subset2)
[1] "Coef: 0.11 (0.06), p=0.077"
#entire sample, cards
lm_cards3.s <- lm(cards ~ Position  + Mins + Season.x + Skin * it, data=Total_Data_subset2)
skin_effects (lm_cards3.s , Total_Data_subset2)
[1] "Coef: 0.02 (0.01), p=0.21"

Data Analysis V: COVID-19

The published article compares the effects of skin color during the season affected by COVID-19 (2020-2021), when the games were played without public at empty stadiums, to the previous seasons. The conclusion is that the effects of skin color are much smaller during the COVID-19 seasons. This is interpreted as strong evidence that the public drives a significant part of the racial bias in referee decisions.

The figure below shows the estimated effects of skin color for international players per season, according to the model specification presented above (controls for player position and minutes played, and an interaction with Italian nationality, Balotelli and Okaka kept in the sample). As we can see from the figure, the effect of skin color on foul calls during the 2020-21 season was small and insignificant indeed. But the effect is not systematically smaller than in the previous seasons. In fact, in season 2012-13 the estimate for the effect was even smaller. In fact, the only season with a ‘significant’ effect of skin color was 2014-15.

u<-unique(Total_Data_subset$Season.x)
years<-data.frame(matrix(NA, nrow=length(u), ncol=5))
names(years) <- c('year','coef','up','down','pvalue')

for (i in 1:length(u)){
  temp <- Total_Data_subset[Total_Data_subset$Season.x==u[i],]
  t <- coeftest(lm(Fouls~Skin * it + Mins + Position, data=temp),
                vcov = vcovCL, cluster = ~ temp$Player.x)[,] %>% as.data.frame() %>% tibble::rownames_to_column(var = "term")
  
  years$year[i] <- round(as.numeric(substr(u[i],1,4)),0)
  years$coef[i] <- round(t$Estimate[t$term=='Skin'],2)
  years$up[i] <- round(t$Estimate[t$term=='Skin'],2) + 2 * round(t$`Std. Error`[t$term=='Skin'],2)
  years$down[i] <- round(t$Estimate[t$term=='Skin'],2) - 2 * round(t$`Std. Error`[t$term=='Skin'],2)
  years$pvalue[i] <- round(t$`Pr(>|t|)`[t$term=='Skin'],3)
}

years %>%
  ggplot(aes(year, coef)) +
  geom_segment(col='black', x=2009:2020, y=years$down, xend=2009:2020, yend=years$up)+
  geom_hline(yintercept=0, linewidth=1.2) +
  geom_point(aes(x=year, y=coef), col='red', size=5) +
  labs(x = '\nYear (start of football season)', y = 'Coefficient size\n') + my_theme_3() +
  theme(panel.grid.major = element_line(colour="lightgrey", size=0.2)) +
  coord_cartesian(ylim=c(-0.2,0.6)) + 
  scale_x_continuous(breaks = seq(2010,2020,2)) +  
  scale_y_continuous(breaks = seq(-0.8, 2.4, 0.2)) 

years
   year  coef   up  down pvalue
1  2009  0.19 0.57 -0.19  0.329
2  2010  0.02 0.36 -0.32  0.928
3  2011  0.11 0.41 -0.19  0.490
4  2012 -0.02 0.22 -0.26  0.892
5  2013  0.09 0.35 -0.17  0.495
6  2014  0.23 0.45  0.01  0.029
7  2015  0.13 0.37 -0.11  0.293
8  2016  0.17 0.37 -0.03  0.082
9  2017  0.11 0.31 -0.09  0.245
10 2018  0.16 0.36 -0.04  0.089
11 2019  0.05 0.23 -0.13  0.583
12 2020  0.01 0.21 -0.19  0.921

Conclusion

The original article by Magistro and Wack makes very strong claims about the presence of systematic racial bias in referee decisions in the Italian Serie A. My reanalysis of the data casts considerable doubt whether such strong claims are warranted. First, I show that in descriptive terms, there is no positive linear association between skin color, on the one hand, and foul calls and cards on the other hand. Second, I argue that a number of covariates included in the original analysis are possible colliders and mediators and, as such, should not be controlled for. In the absence of ‘bad controls’ in the models, the positive effect of skin color disappears, and in some specifications the effects is even negative and statistically significant. Third, I demonstrate that even with some potentially bad controls (minutes played, in particular), a positive relationship between foul calls and skin color is detectable only for players with Italian nationality, but not for the foreigners playing in the league. In addition, I show that even in the Italian subset of the data, the relationship hangs on having two ‘exceptional’ players in the data: Balotelli and Okaka. If we exclude these two, the relationship is gone. Finally, I demonstrate that - contrary to the claims in the article - the 2020/2021 season was not significantly different from the preceding ones in terms of the effect of skin color, which casts doubt on the proposed mechanism of racial bias via the influence of the public.

Where does this leave us? Could there be racial bias that only affects the darker Italian players? Possibly, although it is hard to imagine a mechanism that leads to bias against Italian nationals but not foreigners. More problematic is the observation that - observationally - the number of fouls (net of other covariates) peaks and levels off around the middle of the skin color distribution. This range of skin colors is harder to associate with a particular ‘race’, whatever ‘race’ is supposed to mean in the first place.

Should we exclude Balotelli and Okaka from the sample given their outsized influence on the results? It depends on whether we see their performance, and the reactions to it, as an expression of the same racially-biased mechanisms characteristic of the league as a whole or as aberrant observations who happen to be dark. Both of these players have a very high fouls-to-attempted tackles ratio, which means that many of their fouls are not related to defensive tackles. I don’t follow Italian football that closely, but there is no shortage of compilations of Balotelli’s many fouls and cards on Youtube. These could be attributed to reactions to racial abuse by the public or perceived unfair treatment by the referee. But they could also be individual idiosyncrasies. Overall, the fact that - even with some problematic modeling choices - the results depend on the inclusion or exclusion of two players does not inspire confidence in the general claims about racial bias in Italian football. Furthermore, I find no evidence for one the purported mechanisms of the racial bias - the influence of the public. Contrary to what the article claims, the effect of skin color on fouls and cards during the COVID-19 season (2020-21) was not significantly smaller than during the seasons before that.

Again, it should be reminded that absence of evidence is not the same as evidence of absence. It could very well be that there is racial bias in Italian football. But this particular dataset doesn’t provide strong evidence for such a conclusion. Identifying and estimating bias is inherently difficult: we need to approximate counterfactual situations in which everything but the race of the player is different. This requires that all aspects relevant to a potentially bookable offence are coded. Such information is not available. Using proxies such as the number of attempted tackles of a player during a season is inadequate at best and possibly misleading. In methodological terms, this modeling exercise underscores the importance of making explicit our causal assumptions and mechanisms about the effects of race/skin color before we can decide which variables should be controlled for and which introduce bias. These problems echo similar issues related to the estimation of bias in hiring, promotion and salary decisions, for example whether to control for occupation or not.

Identifying racial bias would be made easier if there was a definition of race provided in the article. One of the authors is quoted in The Guardian criticizing: ‘’A lot of studies have relied on very imperfect categorisations like country of origin, or putting all Europeans as white and South Americans as non-white, which is problematic.” This could be the case, but I don’t see how substituting race for skin color is an improvement, other than the latter is more readily measurable. In the academic article, the authors make the jump from race to ’colorism’ or ‘skin tone stratification’ rather quickly. I honestly don’t know whether every skin tone is supposed to correspond to a different race, or whether the mapping of skin color to race is linear or categorical, one-to-one or overlapping, etc. Is a dark Italian the same race as a light Middle-Easterner, if they share the same skin tone? What about people like me who can change several tones over the course of a year? This issue is related to the general problem of analysing qualities (as discrete properties, such as race) vs. quantities (such as skin color) and the allure of continuism. Intuitively, I guess the idea of race is related in a very complex way to color, but it will be useful to specify this more precisely, so we can have a chance at detecting racial bias.

//The end.