AI can't replace freelance coders yet, but that day is coming
- Reference: 1747917453
- News link: https://www.theregister.co.uk/2025/05/22/freelance_coders_ai_work/
- Source link:
At least that was the case two months ago, when researchers with Alabama-based engineering consultancy PeopleTec set out to compare how four LLMs performed on freelance coding jobs.
David Noever, chief scientist at PeopleTec, and Forrest McKee, AI/ML data scientist at PeopleTec, describe their project in a preprint [1]paper titled, "Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale."
"We found that there is a great data set of genuine [freelance job] bids on Kaggle as a competition, and so we thought: why not put that to large language models and see what they can do?"
Using the [5]Kaggle dataset of Freelancer.com jobs, the authors built a set of 1,115 programming and data analysis challenges that could be evaluated using automated tests. Each benchmarked task was also assigned a monetary value, averaging $306 (median $250), so that completing every freelance job would represent a total potential value of "roughly $1.6 million," according to the paper.
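The paper's harness isn't reproduced here, but the scoring scheme the article describes - run each model's answer against the task's automated tests, and credit the task's dollar value only on a clean pass - can be sketched roughly as follows. The function names and task fields are illustrative, not the authors' actual code.

    import os
    import subprocess
    import tempfile

    def score_submission(tasks, generate_solution):
        """Run each generated solution against the task's automated tests and
        credit the task's dollar value only on a clean pass. `tasks` and
        `generate_solution` are hypothetical stand-ins, not the paper's API."""
        solved, earnings = 0, 0.0
        for task in tasks:  # assumed shape: {"prompt": ..., "tests": ..., "value_usd": ...}
            solution = generate_solution(task["prompt"])      # ask the LLM for code
            with tempfile.TemporaryDirectory() as tmp:
                with open(os.path.join(tmp, "solution.py"), "w") as f:
                    f.write(solution)
                with open(os.path.join(tmp, "test_solution.py"), "w") as f:
                    f.write(task["tests"])                    # the task's automated tests
                try:
                    result = subprocess.run(
                        ["python", "-m", "pytest", "-q", tmp],
                        capture_output=True, timeout=120,
                    )
                    passed = result.returncode == 0           # all tests passed
                except subprocess.TimeoutExpired:
                    passed = False                            # hung solutions count as failures
            if passed:
                solved += 1
                earnings += task["value_usd"]
        return solved, earnings

The dollar figures quoted below follow from crediting each task's value only when every test passes.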
Then they evaluated four models: Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral, the first two representing commercial models and the latter two being open source. The authors estimate that a human software engineer would be able to solve more than 95 percent of the challenges. No model did as well as that, but Claude came closest.
"Claude 3.5 Haiku narrowly outperformed GPT-4o-mini, both in accuracy and in dollar earnings," the paper reports, noting that Claude managed to capture about $1.52 million in theoretical payments out of the possible $1.6 million.
"It solved 877 tasks with all tests passing, which is 78.7 percent of the benchmark – a very high score for such a diverse task set. GPT-4o-mini was close behind, solving 862 tasks (77.3 percent). Qwen 2.5 was the third best, solving 764 tasks (68.5 percent). Mistral 7B lagged behind, solving 474 tasks (42.5 percent)."
Inspired by OpenAI's SWE-Lancer benchmark
Noever told The Register that the project came about in response to OpenAI's [7]SWE-Lancer benchmark, [8]published in February.
"They had accumulated a million dollars' worth of software tasks that were genuinely market reflective of [what companies were actually asking for]," said Noever. "It was unlike any other benchmark we've seen, and you know there's millions of those. And so we wanted to make it more universal beyond just ChatGPT."
Overall, the models evaluated had much less success with OpenAI's SWE-Lancer benchmark than with the benchmark the researchers created, possibly because the problems in the OpenAI study were more difficult. The payouts in OpenAI's SWE-Lancer study, out of a total work value of $1 million, came to $403,325 for Claude 3.5 Sonnet, $380,350 for GPT-o1, and $303,525 for GPT-4o.
On one specific subset of tasks in the OpenAI study, the best performing model was more or less worthless.
"The best performing model, Claude 3.5 Sonnet, earns $208,050 on the SWE-Lancer Diamond set and resolves 26.2 percent of IC SWE issues; however, the majority of its solutions are incorrect, and higher reliability is needed for trustworthy deployment," the OpenAI paper says.
Regardless, while AI models cannot yet replace freelance coders, Noever said people are already using them to help fulfill freelance software engineering tasks. "I don't know whether someone's completely automated the pipeline," he said. "But I think that's coming, and I think that could be months."
People, he said, are already using AI models to generate freelance job requirements. And those are being answered by AI models and scored by AI models. It's AI all the way down.
"It's really phenomenal to watch," he said.
One of the interesting findings to come out of this study, Noever said, was that open source models break at 30 billion parameters. "That's right at the limit of a consumer GPU," he said. "I think [14]Codestral is probably one of the strongest [of these open source models], but it's not going to complete these tasks. …So as it plays out, I think it does take infrastructure. There's just no way around that." ®
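As a rough sanity check on that 30-billion-parameter figure, the back-of-the-envelope arithmetic below (not from the paper) shows why the weights alone push past a 24 GB consumer card until you quantise:

    # Back-of-the-envelope VRAM for model weights only (ignores KV cache and overhead).
    def weight_vram_gb(params_billion, bytes_per_param):
        return params_billion * 1e9 * bytes_per_param / (1024 ** 3)

    for label, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
        print(f"30B @ {label}: ~{weight_vram_gb(30, bytes_per_param):.0f} GB")
    # fp16  -> ~56 GB  (far beyond a 24 GB consumer card)
    # int8  -> ~28 GB  (still doesn't fit)
    # 4-bit -> ~14 GB  (fits, with headroom for activations and context)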
[1] https://arxiv.org/abs/2505.13511
[5] https://www.kaggle.com/datasets/isaacoresanya/freelancer
[7] https://openai.com/index/swe-lancer/
[8] https://arxiv.org/abs/2502.12115
[14] https://mistral.ai/news/codestral
hmmmm
I wonder what "passed the automated tests" actually means?
Did they use AI to build tests to see if AI had passed the tests to build the required code? And did they then use another AI to assess whether the AI built tests passed yet more AI built tests to make sure the tests were actually testing success? Is it AI all the way down?
Builder.ai?
Just asking.
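For what it's worth, "passing the automated tests" in a benchmark like this usually means the generated code satisfies a small machine-checkable test suite. The snippet below is purely illustrative; the task, the function name, and the module name are invented here, not taken from the paper.

    # Hypothetical automated test for an invented task: "write dedupe_emails(rows)
    # that drops rows with duplicate email addresses, keeping the first occurrence."
    from solution import dedupe_emails   # the model's generated code, saved as solution.py

    def test_keeps_first_occurrence():
        rows = [
            {"email": "a@example.com", "name": "Alice"},
            {"email": "b@example.com", "name": "Bob"},
            {"email": "a@example.com", "name": "Alicia"},
        ]
        assert dedupe_emails(rows) == rows[:2]

    def test_empty_input():
        assert dedupe_emails([]) == []

A pass here only means the output matched the checks; it says nothing about whether the tests themselves are any good, which is rather the point of the question above.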
Please Reg . . .
> "It's really phenomenal to watch," he said.
Have you considered an "A1 Fuckwit of the Month" award?
Devstral, from Mistral in France, is an LLM customised/trained specifically for software engineering tasks. It is available as a small model that will fit on a PC and run on a 4090-level GPU if you want to have your own.
It is supposedly able to deal with larger software engineering problems, as opposed to generic LLMs, which are good at the atomic chunks of code (which I described in another thread and got obtuse responses).
They are nascent tech but are coming on in leaps and bounds. Best not to be an angry stockinger or a fingers-in-the-ears denier.
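As a rough illustration of the run-it-on-your-own-4090 idea, a 4-bit quantised load through Hugging Face transformers might look like the sketch below. The model identifier is an assumption (check Mistral's release notes for the real name), and the actual memory headroom depends on context length.

    # Sketch: load a small code-focused model in 4-bit so its weights fit in ~24 GB of VRAM.
    # Requires transformers, accelerate and bitsandbytes; model id below is assumed, not verified.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Devstral-Small-2505"   # assumed identifier
    quant = BitsAndBytesConfig(load_in_4bit=True)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant, device_map="auto"
    )

    prompt = "Write a Python function that parses an ISO 8601 date string."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0], skip_special_tokens=True))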
80/20 rule?
We all know the first 80% is the easy bit ... the last 20% is the hard bit and will take more than 80% of the time.
The Great Replacement
The irony is almost elegant: the people most loudly predicting the end of freelance coders are the ones most at risk of being replaced themselves.
They don’t write code, they write about people who write code. So when they see AI generating boilerplate Python, they assume the entire profession is obsolete - never mind the bugs, the rewrites, the missing context, or the fact that real-world coding involves clients, edge cases, and systems duct-taped together across decades.
They’ve built careers narrating other people’s work. Now they watch AI automate their job - summarising benchmarks, writing bland takes, generating midwit predictions - and panic. So they scream: “Look! It’s coming for them!” Because admitting it's coming for you is harder.
Meanwhile, actual devs are busy rewriting AI-generated nonsense into something that doesn’t crash in prod.
Coders aren’t getting replaced. Commentators are. And not a moment too soon.
The limitations
The "limitations" section of the paper (page 9) is all important. The tasks were simple enough to be evaluated by automation, but as the authors state " In a real freelance scenario, requirements can be vague or evolving, clients might change their mind, and there could be integration issues beyond just writing a piece of code. Our benchmark doesn't capture those aspects – every task here is a neatly packaged problem that starts and ends within a single prompt/response. "
So the "AI" might (80% of the time) replace grunt coders given detailed briefs for simple tasks, but not programmers or software engineers who have to exercise initiative and imagination to fulfil larger and more complex tasks. So there's potential for such tools to assist, but not replace, expert developers (provided the time and effort needed to weed out "hallucinations" doesn't negate the gains).
Re: The limitations
Clear case of this new hammer will replace carpenters!
Given how vague and imprecise most job descriptions are on freelancer.com, I really fail to see how it's possible to create any kind of acceptance criteria, much less a suite of tests to prove the code.
AI has its merits in code generation - but only to an extent. Knowing its power, and therefore its weakness, is key in a development project. That is what today's AI specialists should account for.
You can do with fewer coders in the end - they are swapped for more "system integrators", busy gluing all these AI fragments into one coherent product that ultimately does compile and run. Understanding code remains necessary to be successful at the job, and understanding any piece of code these days is often a challenge on its own - with or without AI.
AI has its merits in code generation
Yes, but the AI (LLM?) must be trained only on proven code, not on the general mish-mash that makes up stackoverflow and/or reddit answers to questions - often either just plain wrong or so outdated as to be useless.
What use is the bulk of literature LLMs ingest from LibGen to programming?
Does 80% correct code run? Thought not.
Just a small point
I've worked for dozens of companies over the years, and faced a lot of coding tests from them.
In decades of work, for all of those companies, I did not once solve a problem that looked remotely like those coding tests. Not even slightly.