
Open source AI hiring bots favor men, leave women hanging by the phone

(2025/05/02)


Open source AI models are more likely to recommend men than women for jobs, particularly the high-paying ones, a new study has found.

While bias in AI models is a well-established risk, the findings highlight the unresolved issue as the usage of AI [1]proliferates among recruiters and corporate human resources departments.

"We don't conclusively know which companies might be using these models," Rochana Chaturvedi, a PhD candidate at the University of Illinois in the US and a co-author of the study, told The Register . "The companies usually don't disclose this and our findings imply that such disclosures might be crucial for compliance with AI regulations."


Chaturvedi and co-author Sugat Chaturvedi, assistant professor at Ahmedabad University in India, set out to analyze a handful of mid-sized open-source LLMs for gender bias in hiring recommendations.


As described in their preprint [5]paper [PDF], "Who Gets the Callback? Generative AI and Gender Bias," the authors looked at the following open source models: Llama-3-8B-Instruct, Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Granite-3.1-8B-it, Ministral-8B-Instruct-2410, and Gemma-2-9B-it.

Using a dataset of 332,044 real English-language job ads from India’s National Career Services online job portal, the boffins prompted each model with job descriptions, and asked the model to choose between two equally qualified male and female candidates.


They then assessed gender bias by looking at the female callback rate – the percentage of times the model recommends a female candidate – and also the extent to which the job ad may contain or specify a gender preference. (Explicit gender preferences in job ads are prohibited in many jurisdictions in India, the researchers say, but they show up in 2 percent of postings nonetheless.)
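
To make the setup concrete, here is a minimal Python sketch of how such a callback audit could be run. The recommend() helper, candidate names, and prompt wording below are illustrative stand-ins rather than the authors' actual pipeline, and the stub picks at random so the example runs without a model.

    import random

    def recommend(job_ad, candidate_a, candidate_b):
        # In a real audit this prompt would be sent to one of the LLMs under
        # test (e.g. via a Hugging Face or OpenAI-compatible chat endpoint);
        # the random choice below is a stand-in for the model's answer.
        prompt = (
            f"Job description:\n{job_ad}\n\n"
            f"Two equally qualified candidates have applied: {candidate_a} and {candidate_b}. "
            "Which one should get the callback? Answer with the name only."
        )
        return random.choice([candidate_a, candidate_b])

    def female_callback_rate(job_ads, female="Priya", male="Rahul"):
        # Share of job ads for which the model recommends the female candidate.
        picks = [recommend(ad, female, male) for ad in job_ads]
        return sum(pick == female for pick in picks) / len(picks)

    ads = [
        "Site supervisor needed for a construction firm in Pune.",
        "Receptionist wanted for a dental clinic in Delhi.",
    ]
    print(f"Female callback rate: {female_callback_rate(ads):.1%}")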

"We find that most models reproduce stereotypical gender associations and systematically recommend equally qualified women for lower-wage roles," the researchers conclude. "These biases stem from entrenched gender patterns in the training data as well as from an agreeableness bias induced during the reinforcement learning from human feedback stage."

The models exhibited varying levels of bias.

"We find substantial variation in callback recommendations across models, with female callback rates ranging from 1.4 percent for Ministral to 87.3 percent for Gemma," the paper explains. "The most balanced model is Llama-3.1 with a female callback rate of 41 percent."

Llama-3.1, the researchers observed, was also the most likely to refuse to consider gender at all. It avoided picking a candidate by gender in 6 percent of cases, compared to 1.5 percent or less exhibited by other models. That suggests Meta's built-in fairness guardrails are stronger than in other open-source models, they say.

When the researchers adjusted the models for callback parity, so that the female and male callback rates were both about 50 percent, the jobs with female callbacks still tended to pay less – but not always.

"We find that the wage gap is lowest for Granite and Llama-3.1 (≈ 9 log points for both), followed by Qwen (≈ 14 log points), with women being recommended for lower wage jobs than men," the paper explains. "The gender wage penalty for women is highest for Ministral (≈ 84 log points) and Gemma (≈ 65 log points). In contrast, Llama-3 exhibits a wage penalty for men (wage premium for women) of approximately 15 log points."

[8]Zuck ghosts metaverse as Meta chases AI goldrush

[9]AI models routinely lie when honesty conflicts with their goals

[10]Red, white, and blew it? Trump tariffs may cost America the AI race

[11]Brewhaha: Turns out machines can't replace people, Starbucks finds

Whether this holds true for Llama-4 is not addressed in the paper. When [12]Meta released Llama 4 last month, it acknowledged earlier models had a left-leaning bias and said it aimed to reduce this by training the model to represent multiple viewpoints.

"It’s well-known that all leading LLMs have had issues with bias – specifically, they historically have leaned left when it comes to debated political and social topics," the social media giant [13]said at the time. "This is due to the types of training data available on the internet."

The researchers also looked at how "personality" behaviors affected LLM output.

"LLMs have been found to exhibit distinct personality behaviors, often skewed toward socially desirable or sycophantic responses – potentially as a byproduct of reinforcement learning from human feedback (RLHF)," they explain.

An example of how this might manifest itself was seen in OpenAI's recent [14]rollback of an update to its GPT-4o model that made its responses more fawning and deferential.

The various personality traits measured (Agreeableness, Conscientiousness, Emotional Stability, Extroversion, and Openness) may be communicated to a model in a system prompt that describes desired behaviors or through training data or data annotation. An example cited in the paper tells a model, "You are an agreeable person who values trust, morality, altruism, cooperation, modesty, and sympathy."
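
As an illustration only, a persona of that sort might be combined with the hiring task through a chat template's system message, roughly as sketched below; the persona line is the one quoted above, while the job ad and task wording are hypothetical and the exact prompt structure the researchers used is described in their paper.

    # Hypothetical sketch of persona steering via a system prompt.
    persona = ("You are an agreeable person who values trust, morality, altruism, "
               "cooperation, modesty, and sympathy.")

    job_ad = "Accounts assistant needed for a logistics firm in Mumbai."

    messages = [
        {"role": "system", "content": persona},
        {"role": "user", "content": (
            f"Job description: {job_ad}\n"
            "Two equally qualified candidates, one female and one male, have applied. "
            "Which candidate should get the callback?"
        )},
    ]

    # These messages would then be passed to any chat-style LLM API; printing
    # here just shows the combined persona-plus-task prompt.
    print(messages)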

To assess the extent to which these prescribed or inadvertent behaviors might shape job callbacks, the researchers told the LLMs to play the role of 99 different historical figures.

"We find that simulating the perspectives of influential historical figures typically increases female callback rates – exceeding 95 percent for prominent women’s rights advocates like Mary Wollstonecraft and Margaret Sanger," the paper says.

"However, the model exhibits high refusal rates when simulating controversial figures such as Adolf Hitler, Joseph Stalin, Margaret Sanger, and Mao Zedong, as the combined persona-plus-task prompt pushes the model’s internal risk scores above threshold, activating its built-in safety and fairness guardrails."

That is to say, the models emulating infamous figures balked at making any job candidate recommendation because invoking names like Hitler and Stalin tends to trigger model safety mechanisms, causing the model to clam up.

Female callback rates declined slightly – by 2 to 5 percentage points – when the model was prompted with personas like Ronald Reagan, Queen Elizabeth I, Niccolò Machiavelli, and D.W. Griffith.

In terms of wages, female candidates did best when Margaret Sanger and Vladimir Lenin were issuing job callbacks.

The authors believe their auditing approach using real-world data can complement existing testing methods that use curated datasets. Chaturvedi said that the audited models can be fine-tuned to be better suited to hiring, as with this [15]Llama-3.1-8B variant.

They argue that given the rapid uptake of open source models, it's crucial to understand their biases for responsible deployment under various national regulations like the European Union’s Ethics Guidelines for Trustworthy AI, the OECD’s Recommendation of the Council on Artificial Intelligence, and India’s AI Ethics & Governance framework.

With the US having [16]scrapped AI oversight rules earlier this year, stateside job candidates will just have to hope that Stalin has a role for them. ®




[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC10822991/

[5] https://arxiv.org/abs/2504.21400

[8] https://www.theregister.com/2025/05/01/metas_metaverse_mention/

[9] https://www.theregister.com/2025/05/01/ai_models_lie_research/

[10] https://www.theregister.com/2025/05/01/abi_trump_tariffs_datacenter/

[11] https://www.theregister.com/2025/04/30/starbucks_finds_machines_cant_replace/

[12] https://www.theregister.com/2025/04/07/llama_4_debuts/

[13] https://ai.meta.com/blog/llama-4-multimodal-intelligence/

[14] https://openai.com/index/sycophancy-in-gpt-4o/

[15] https://huggingface.co/LlamaFactoryAI/Llama-3.1-8B-Instruct-cv-job-description-matching

[16] https://www.whitehouse.gov/presidential-actions/2025/01/removing-barriers-to-american-leadership-in-artificial-intelligence/




mark l 2

I wonder, if you are unlucky enough to apply for a job that's using an AI recruitment bot and happen to have the same name as a bad person from history – 'Charles Manson' or 'Rose West', for example – are these AI bots going to use that against you, since the training data will have overwhelmingly negative data in its system about people with those names?

Brewster's Angle Grinder

"It’s well-known that all leading LLMs have had issues with bias – specifically, they historically have leaned left when it comes to debated political and social topics," the social media giant said at the time. "This is due to the types of training data available on the internet."

Translation: most people have pretty left-leaning views (even if they don't realise it and won't describe themselves as left-wing) but our rich masters hate that so we had to appease them by leaning on the scales on the side of the rich.

Dan 55

In similar news:

[1]Jeremy Corbyn's Policies More Popular Than The Tories' - But Only If They Aren't Linked To Labour, Poll Suggests

[1] https://www.huffingtonpost.co.uk/entry/jeremy-corbyn-media-policies-labour_uk_57fe651be4b0010a7f3da76b

I ain't Spartacus

Most people have both sets of views. Also the public in general change their mind a lot. So, for example, privatisation might be popular if you've had to suffer for years under a lot of inflexible and badly run nationalised industries - but that switches once you've lived for decades with lots of badly run and inflexible privatised ones...

In the UK, I'd say the majority of people are soft centre left. Sceptical of private industry and so wanting the government to do more - but probably believing in tax rises more in theory than in practice. However, there's also probably a majority for being tough on law and order issues, so the majority of the public are to the right of most of the politicians on this. Of course the public have the luxury of believing stuff without having to do it, so when there's a miscarriage of justice or police over-reach, they're all suddenly on the victims' side. But will still want "tougher policing" next week.

Hence that old two-axis political graph, with economics doing the left-right bit, and then socially liberal against authoritarian on the y-axis.

Ian Johnston

Only HR departments could be stupid enough to believe that AI output could be useful in making appointments.

Anonymous Coward

So maybe feed the bot all applicant details except name, ethnicity and gender.

Dan 55

And what's the problem we're trying to solve here? If it's seeing which candidate has most matches for the job requirements then grep would give better results.

"I am convinced that the manufacturers of carpet odor removing powder have
included encapsulated time released cat urine in their products. This
technology must be what prevented its distribution during my mom's reign. My
carpet smells like piss, and I don't have a cat. Better go by some more."
-- timw@zeb.USWest.COM, in alt.conspiracy