Does the CASE Score have a gender bias?

In our FAIR project, we are working with the University of Cologne on discrimination-free recruiting algorithms. But what is the actual situation with our own CASE Score? Are certain groups given preference in the assessment of their academic performance? The short answer is yes: women, and justifiably so. In detail, it's a bit more complicated. If you don't feel like reading statistics at all, you can skip straight to "Enough with statistics: What does this result mean for HR practice?".

Unadjusted differences between men and women:
In the CASE Score, women score on average about four percentile ranks better than men. As the heading says, this is the unadjusted difference - that is, other differences between men and women that might have an impact on the CASE Score are not factored out. In general, we should ask two questions to better understand such a figure and to decide whether and to what extent men are actually discriminated against here.

(1) Is the difference large?
No, four percentile ranks are not very much, considering that the scale runs from top 1% to top 100% (for statistics nerds: and is uniformly distributed). Moreover, this is the unadjusted effect, which can also reflect other differences between men and women. On its own, this figure is therefore not very meaningful.

(2) Is the difference statistically significant?
Yes, this difference is significant. Significance means that a result like this would be very unlikely to arise by chance, through an unlucky sample, if there were no real difference. Since this result was calculated on the most recent waves of the "Fachkraft 2030" ("Skilled Worker 2030") survey, the sample includes more than 30,000 students. With such large samples, there is little room for statistical chance, and a significant - albeit small - difference is not surprising at this point.
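For those who want to check the orders of magnitude, here is a minimal sketch of such a significance test. It is not the calculation behind the result above: it assumes two equally sized groups of 15,000 students (a hypothetical split of the roughly 30,000), uses the standard deviation of a uniform scale from 1 to 100, and applies a simple two-sample z-test. The function name is made up for illustration.

```python
import math

def two_sample_z_test(mean_diff, sd, n_a, n_b):
    """Two-sided z-test for a difference in means with a common SD."""
    se = sd * math.sqrt(1 / n_a + 1 / n_b)   # standard error of the difference
    z = mean_diff / se
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value under the normal model
    return z, p

# SD of a uniform distribution on the integers 1..100 (assumed scale of the score)
sd_uniform = math.sqrt((100**2 - 1) / 12)

# Assumption: 15,000 women and 15,000 men, raw gap of 4 percentile ranks
z, p = two_sample_z_test(4.0, sd_uniform, 15_000, 15_000)
print(f"z = {z:.2f}, p = {p:.1e}")
```

Under these assumptions the test statistic is around 12, so a gap of four percentile ranks is far outside what chance alone would produce in a sample of this size.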

Adjusted differences between men and women:
The distinction between adjusted and unadjusted differences is most familiar from the discussion of the gender pay gap. When information such as industry, position and experience is taken into account, the unadjusted gender pay gap shrinks from 21% to 6%. Of course, in this case one can argue that differences in variables such as position and experience are themselves the result of discrimination and should not be factored out.

Back to the topic: to get a more meaningful result, we should adjust the difference in the CASE Score for other differences between men and women. If women achieve better grades on average, then of course we should adjust for this effect, as it represents a real difference in academic performance. We can certainly debate the choice of the right model at this point; those interested can find a few more thoughts on this in the statistics digression at the end of this post. Let's ask ourselves the same two questions again:

(1) Is the adjusted difference large?
No, it is actually quite small. Adjusted, the difference is now just under 0.3 percentile ranks - this time even in favour of men. The direction is not relevant, however, because the effect is tiny and we still have to clarify the question of significance.

(2) Is the adjusted difference statistically significant?
No, despite the very large sample, it can no longer be ruled out that there is no difference at all between men and women in the CASE Score. In fact, a sample like ours would be quite likely even if there were no true difference in the CASE Score.

So despite the large sample, there are no significant differences. And here the sample size is really crucial: while small samples can only reliably detect large effects, large samples can also detect small effects as statistically significant.
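The link between sample size and detectable effect size can be made concrete with a rough sketch. It assumes a simple comparison of two equally sized groups on a uniform 1 to 100 scale and a 5% significance level; the adjusted estimate above comes from a regression and has a different standard error, so the numbers are only indicative. The function name and the group sizes are illustrative choices.

```python
import math
from statistics import NormalDist

def smallest_significant_difference(sd, n_per_group, alpha=0.05):
    """Smallest mean difference that a two-sided z-test flags at level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)      # about 1.96 for alpha = 0.05
    return z_crit * sd * math.sqrt(2 / n_per_group)   # critical value times standard error

# SD of a uniform distribution on the integers 1..100 (assumed scale of the score)
sd_uniform = math.sqrt((100**2 - 1) / 12)

# Hypothetical group sizes, from a small study up to half of 30,000 students
thresholds = {n: smallest_significant_difference(sd_uniform, n)
              for n in (50, 500, 15_000)}
for n, diff in thresholds.items():
    print(f"n per group = {n:>6}: difference must exceed {diff:.2f} percentile ranks")
```

With 50 students per group, only a gap of more than about 11 percentile ranks would be significant; with 15,000 per group, the threshold drops to well under one percentile rank.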

The bottom line is that with regard to the CASE Score, there is a small unadjusted difference between men and women. This better performance of women is no longer evident when their better average grades are taken into account - even in our large sample of more than 30,000 students, the adjusted difference is not statistically significant.

However, we should think for a moment about the type of statistical adjustment. If men were to receive lower grades in their studies for the same performance, then by adjusting for grades we would be adjusting the discrimination away, and the adjusted difference would not be very helpful. There is research on this, and its results are not unambiguous. Some studies, especially in the school context, observe discrimination against boys, while in mathematical subjects girls tend to be disadvantaged. None of this proves a general discrimination against men in tertiary education.

And even if there were such discrimination and the CASE Score were to pass it on (the alternative, contrary to our current practice, would be to ask for gender and give men a bonus), the effect would be small: at most the size of the unadjusted difference (see above). Most importantly, this disadvantage would not only be small, it would be smaller than the advantage men receive in the labour market. Which brings us to another question: Is positive discrimination allowed to compensate for other forms of discrimination? I think so, but I would like to point out that this is a complex ethical question.

Enough with statistics: What does this result mean for HR practice?
For one thing, it means that women are more successful in their studies than men - albeit only minimally. This small difference is also found in the (unadjusted) CASE Score. Beyond that, however, there are no differences at all once we correct for the grades achieved. The CASE Score thus reflects academic performance fairly for both sexes.

And this is no coincidence: the problem was considered during development and then tested empirically. This approach should be part of the basics of personnel selection. Selection tools must not only make good predictions (predictive validity), but also be fair to different applicant groups. We would like to see companies give these two criteria more weight when deciding on selection tools.

After all, the status quo in the labour market continues to show that women are massively discriminated against. And this discrimination results almost exclusively from human decisions. One should think about this when saying things like: "In our company, the gut feeling of the personnel managers still counts". Because this subjective gut feeling is not only hard to explain by definition, but also often biased. This does not happen intentionally, but rather subconsciously, as studies show. Most HR managers do not want to discriminate and yet they do. That's why no one wants to abolish HR managers, but rather to support them in their decision-making through good algorithms, good aptitude diagnostics or good sensitivity training. This is how we can fight discrimination in the labour market together.

Because this discrimination is not only a problem for women. According to a 2018 study by the World Bank, every single European would be almost €50,000 richer if the labour market did not discriminate against women. Why? Because it is not only ethically wrong, but also economically inefficient to prefer to fill positions with men.

About the author:
Dr. Philipp Karl Seegers is a "Labour Economist" who focuses on the transition between education and the labour market. Together with Dr. Jan Bergerhoff and Dr. Max Hoyer, Philipp founded the HR-Tech company candidate select GmbH (CASE), which uses large data sets and scientific methods to make educational qualifications comparable. Philipp is project manager of the FAIR ("Fair Artificial Intelligence Recruiting") project funded by the state of NRW and the EU. In addition, as a Research Fellow at Maastricht University and as the initiator of the study series "Fachkraft 2030", Philipp actively researches issues in the field of education economics, psychological diagnostics and the labour market.

Statistics digression: What is the best way to adjust for the gender gap?
In the text above, we adjusted the difference between CASE Scores for the grade point average (final or current) achieved during the degree. That is, we run a multiple regression with the CASE Score as the dependent variable and gender and grade as the independent variables. The gender effect then measures whether there is a difference in the CASE Score for the same grade.
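The regression described above can be sketched on synthetic data. This is only an illustration, not our model or our data: the students are simulated, the grade is a standardised measure where higher means better, the coefficients (a 0.2 grade advantage for women, a slope of 20, noise with SD 10) are invented so that the raw gap comes out at roughly four points, and the small OLS routine is hand-written to avoid external libraries. By construction, gender has no direct effect on the simulated score.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic students (all numbers below are assumptions for illustration)
rng = random.Random(0)
rows, scores = [], []
for _ in range(20_000):
    female = rng.randint(0, 1)
    grade = rng.gauss(0.2 * female, 1.0)              # women slightly better on average
    score = 50 + 20 * grade + rng.gauss(0, 10)        # score depends on grade only
    rows.append([1.0, float(female), grade])
    scores.append(score)

# Unadjusted gap: regress the score on a constant and gender only
unadjusted = ols([[r[0], r[1]] for r in rows], scores)[1]
# Adjusted gap: add the grade as a second independent variable
_, adjusted, grade_slope = ols(rows, scores)

print(f"unadjusted gender gap: {unadjusted:.2f}")
print(f"adjusted gender gap:   {adjusted:.2f}")
```

In this simulation the unadjusted gap is clearly positive, while the adjusted gender coefficient is close to zero, because the whole gap runs through the grade. In practice one would use a statistics package that also reports standard errors and p-values.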

It can now be argued that in addition to the grade, adjustments should also be made for the field of study, the type of degree and the university. These variables should be included because grading differs considerably between subjects, degrees and universities. Once the grade is part of the model, such differences should be modelled as well.

However, it can also be argued that this is not a good idea, because these variables make up a large part of the inputs used to compute the CASE Score. Since the CASE Score does not ask about gender, discrimination can only arise through differences in the input variables - and a correlation of those with gender. This argues for including fewer variables - although it must be noted that the CASE Score does not simply take into account the university or the field of study, but above all the interactions of these variables. These interactions are not captured in the regression, so the argument is not entirely valid.

Statistically, however, it is always good if the results are as robust as possible. Even if we control for further variables in addition to the grade, we find no significant differences. And this is not even the end of the possible control variables: there is a lot of other information in the data set, such as the Abitur grade or psychological measurements, such as a cognitive performance test and a Big Five personality test. Even with these many control variables, the result remains: when corrected for grades, study context and even psychological measures, there is no significant difference between men and women in the CASE Score.