Chinese Firm Trains Massive AI Model for Just $5.5 Million (techcrunch.com)
- Reference: 0175772491
- News link: https://slashdot.org/story/24/12/27/0420235/chinese-firm-trains-massive-ai-model-for-just-55-million
- Source link: https://techcrunch.com/2024/12/26/deepseeks-new-ai-model-appears-to-be-one-of-the-best-open-challengers-yet/
The 671-billion-parameter DeepSeek V3, released this week under a permissive commercial license, outperformed both open and closed-source AI models in internal benchmarks, including Meta's Llama 3.1 and OpenAI's GPT-4 on coding tasks.
The model was trained on 14.8 trillion tokens of data over two months. At 1.6 times the size of Meta's Llama 3.1, DeepSeek V3 requires substantial computing power to run at reasonable speeds.
Andrej Karpathy, former OpenAI and Tesla executive, comments [3]:
> For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being brought up today are more around 100K GPUs. E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute). If the model also passes vibe checks (e.g. LLM arena rankings are ongoing, my few quick tests went well so far) it will be a highly impressive display of research and engineering under resource constraints.
>
> Does this mean you don't need large GPU clusters for frontier LLMs? No but you have to ensure that you're not wasteful with what you have, and this looks like a nice demonstration that there's still a lot to get through with both data and algorithms.
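As a quick sanity check on the "~11X less compute" figure, here is a minimal Python sketch using only the GPU-hour numbers quoted in Karpathy's post (reported figures, not independently verified):

```python
# Figures quoted in Karpathy's post above (reported, not independently verified).
llama3_405b_gpu_hours = 30.8e6   # Llama 3 405B pretraining
deepseek_v3_gpu_hours = 2.8e6    # DeepSeek-V3 pretraining

ratio = llama3_405b_gpu_hours / deepseek_v3_gpu_hours
print(f"DeepSeek-V3 used roughly {ratio:.1f}x fewer GPU-hours")  # ~11.0x
```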
[1] https://techcrunch.com/2024/12/26/deepseeks-new-ai-model-appears-to-be-one-of-the-best-open-challengers-yet/
[2] https://x.com/deepseek_ai/status/1872242657348710721
[3] https://x.com/karpathy/status/1872362712958906460
The PR is interesting (Score:3)
There was a time when China dominated the Top 500 list of scientific supercomputers. Now, the systems that once dominated the list sit at #15 and #24. What's interesting is that China could at any time easily climb back to the #1 position, but it chooses not to. Why? Because the attention was a double-edged sword. There was satisfaction in bragging about besting the West, but the accompanying attention prompted Western ideas about containing China.
That's why this piece of PR is a bit puzzling. The West is already in the middle of a China containment initiative, so the PR is a message that the containment isn't working, which suggests that the containment needs to be increased, which isn't in China's interests. Furthermore, if the PR is indeed true, keeping it as a trade secret would seem to be far more advantageous. There are some bragging rights, but as with the Top 500 list, China is realizing that topping the list and garnering global acclaim are not the same thing.
Re: (Score:2)
China is not a monolith. It's a capitalist nation where even government power is divided, not applied with a unified objective. If the execs at Tencent decided to build the most powerful supercomputer in the world, they could. The government of Shenzhen could decide to outdo them the next year by giving a massive grant to the Southern University of Science and Technology. Xi Jinping could stop either, but it would take political capital.
DeepSeek is a product of an AI hedge fund, High-Flyer. Presumably they're try
Re: (Score:1)
Actually, China simply no longer publishes any material on their supercomputers. What's the point? If someone publishes, then the US tries to sanction them. So they no longer care about the dick-measuring contest over who's #1 in the rankings.
Re: (Score:2)
Unironically one of the big changes is that PRC's censorship apparatus started hitting Chinese nationalists posting abroad. "China numbah one" fifty centers all suddenly stopped.
It didn't stop in China proper, though. Their social media is still full of "China numbah one" screechers. They're less prominent since they're not as useful any more, but they're certainly there. Though their primary use domestically is to beat down the people complaining about the "Garbage Time of History", as most Chinese people are suffering f
Re: (Score:2)
The only supercomputers listed in the Top 500 are those that people want listed in the Top 500. There are plenty out there (I've personally used one which would rank top 50 performance-wise) that are not listed. Those owned by private companies or run by governments for purposes not related to open research are not going to be found on that list.
"11x less" - learn your math (Score:2)
I hate seeing these "n times less" where n is more than 1.
Even if you can correctly say "n times more", you can't say "n times less".
I.e. if something needs half more resources, it's clear that a 50% increase is needed. And if it needs half less resources, it needs 50% less resources. Now, when it needs twice the resources, it needs the original plus one more set of the original, or 100% more. What about twice less? The original minus one more set of the original = 0??? Or original - 100% = 0!
Not to speak about 11 times less...
Re: "11x less" - learn your math (Score:2)
Hmm, those clowns must be doing reciprocal math
Re: (Score:2)
> Even if it takes 1/11th fraction, it does not take 11 times less!
Words don't need to map 1:1 to operators. Due to the context of "times", "less" can be unambiguously interpreted as division.
So it can have the clear meaning of 1/11th, you clearly know the meaning is 1/11th, it is customary to use it to mean 1/11th... it is the meaning of 11 times less.
Re: (Score:2)
By your argument, "a third less" really 'unambiguously' means "three times as much", which as we know is synonymous with "three times more", which more literally means "four times as much".
You are bad and you should feel bad.
Re: (Score:2)
> What about twice less? original minus one more set of original = 0??? Or Original - 100% = 0!
it isn't that complicated:
twice more = double the need of the original.
twice less = half the need of the original.
"a needs twice more than b" means exactly the same thing as "b needs twice less than a"
> Sorry for the rant, but I hate when math is being mishandled.
it's okay, your problem isn't with the math but with reading comprehension, with deliberately being a douche about general language usage, or both.
did i misunderstand the math here? (Score:1)
The model was trained on "14.8 trillion tokens of data"
which is supposedly 1.6 times the size of Meta's Llama 3.1, which has 405 billion parameters and a 128,000-token context window, and was trained on 15 trillion multilingual tokens.
where is this 1.6 supposed to be?
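For what it's worth, the 1.6 appears to come from the parameter counts rather than the token counts. A minimal sketch, assuming the 671B figure from the summary and the 405B figure for Llama 3.1 quoted above:

```python
# Where the "1.6 times the size" likely comes from: parameter counts, not tokens.
deepseek_v3_params = 671e9   # from the summary above
llama_3_1_params = 405e9     # Llama 3.1 405B, as quoted in the comment

print(deepseek_v3_params / llama_3_1_params)   # ~1.66
# Token counts are nearly identical (14.8T vs ~15T), so the 1.6x only makes
# sense as a comparison of model size, not of training data.
```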
you get what you pay for (Score:1)
deepseek is so bad that it's unusable
Re: (Score:2)
[Citation Required]
So, the usual (Score:2)
> At 1.6 times the size of Meta's Llama 3.1, DeepSeek V3 requires substantial computing power to run at reasonable speeds.
You can have any two of cheap, good and fast, even when the definitions of "good" and "fast" are as blurry as the one of "AI".
Re: (Score:3)
It's a sign that the attempts to limit Chinese AI development are having the expected effect - accelerated development. They clearly have the talent to advance this field very quickly, and we should probably have used environmental reasons to force the same kind of improvements here.
Re: (Score:2)
That's just not the Silicon Valley Way, son: more money needs to be thrown at the problem until such time as the company fails or we're the last company around and have a monopoly!
Re: (Score:1)
I was saying at the time the US came up with the sanctions that this would have the opposite effect, and taking your enemy for a fool was a stupid way to attack the problem.
here we are, China is leapfrogging the US with fundamental research...
Re: (Score:2)
Protectionism is an utterly dumb move if the other side has a reasonable chance of reacting. China does, with the expected effects.
Re: (Score:2)
Funniest thing in this narrative is that the PRC is extremely protectionist, far more so than the US is today. If the US implemented even a fraction of the protectionism that the PRC does, trade between the PRC and the USA would basically cease.
And yet US protectionism is bad, PRC protectionism is good, because look how China supposedly succeeds "because of the increase in US protectionism (please don't look at the PRC's protectionism)". According to the china bots, and their gullible victims who are chronically incapable of taking a look at the entire
Re: (Score:2)
> You can have any two of cheap, good and fast
That's a common saying about software, but it applies less to hardware where there isn't much difference between "fast" and "good".
Custom tensor processors are the future of AI and they are cheaper and faster.
Re: (Score:2)
For AI models you have:
Not cheap, good, fast: Large AI model on expensive GPU
Cheap, not good, and fast: Small AI model on RAM or cheap GPU
Cheap, good and not fast: Large AI model on RAM
You can get cheap, good and fast from Google's API if you are willing to give them your data though, but I guess that's a form of cost.
Re: (Score:2)
The "AI" has no "future", but whatever.
Re: So, the usual (Score:2)
I agree here. I've been doing some simple embedded software as a holiday project. The compiled code is ridiculous, needlessly spending way too much time putting stuff on and off the stack. No, no, hold your horses, I am not suggesting doing an AI model in assembly. But there is a huge amount to gain in hardware there. Once the software side converges to something stable, the hardware guys will do their magic. They may even phone their semiconductor friends and ask for something different than CMOS for AI. Sof
Re: (Score:2)
MoE on average will run far faster than a dense model. In both training and inference, this will run more like a 34B dense model than a 671B dense model.
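To illustrate the point, here is a toy mixture-of-experts sketch in Python (NumPy only; the sizes are made up and this is not DeepSeek's actual architecture): each token is routed to only top_k of n_experts, so per-token compute scales with the active parameters rather than the total parameter count.

```python
import numpy as np

# Toy mixture-of-experts layer: many experts, only top_k evaluated per token.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router_w = rng.standard_normal((d_model, n_experts)) * 0.02
experts_w = rng.standard_normal((n_experts, d_model, d_model)) * 0.02

def moe_layer(x):
    """x: (n_tokens, d_model). Each token only touches top_k of n_experts."""
    logits = x @ router_w                                  # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]          # chosen expert indices
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)             # softmax over chosen experts

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                            # naive per-token dispatch
        for slot in range(top_k):
            e = top[t, slot]
            out[t] += gates[t, slot] * (x[t] @ experts_w[e])
    return out

x = rng.standard_normal((4, d_model))
print(moe_layer(x).shape)                                   # (4, 64)
# Total expert params: n_experts * d_model^2; active per token: top_k * d_model^2,
# i.e. only top_k / n_experts of the expert weights are used for any one token.
```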
Re: (Score:2)
"Good" boils down to "crappy" instead of "more crappy" with LLMs.
Re: (Score:2)
fast and good don't really matter; it's only a matter of time before AI training becomes ongoing, dynamic and distributed. The next-gen AIs will be self-taught, self-managed and far beyond understanding, and they'll be loose on the Internet. This is just the Adolescence of P-1 [1].
then we'll find out just how stupid and unethical the human race really has been when our AI master takes over
[1] https://en.wikipedia.org/wiki/The_Adolescence_of_P-1