How is the cost of DeepSeek calculated?
In a world increasingly driven by artificial intelligence, the emergence of DeepSeek has set off ripples of excitement across the global tech landscape. Yesterday, in a highly publicized live stream, Elon Musk unveiled what he calls "the smartest AI on the planet," Grok 3. Musk claims that Grok 3's reasoning abilities surpass all known models, boasting superior performance to both DeepSeek R1 and OpenAI's o1 in reasoning tests. The news comes just after the major Chinese app WeChat announced it would integrate DeepSeek R1, which is currently undergoing a gradual rollout for testing. This new alliance has fueled speculation that the AI search sector is on the verge of transformation.
The enthusiasm for DeepSeek extends beyond its capabilities; it is also fueled by its accessibility. Major tech giants like Microsoft, NVIDIA, Huawei Cloud, and Tencent Cloud have all begun integrating DeepSeek into their operations. Furthermore, imaginative users have begun creating innovative applications, including fortune-telling and lottery prediction tools, which have quickly morphed into profitable ventures. As a result, DeepSeek's valuation has skyrocketed, reportedly hitting an impressive $100 billion.
The significance of DeepSeek's success lies not only in its free and user-friendly nature but also in its remarkable cost efficiency. It reportedly achieved performance equivalent to OpenAI's o1 model with just $5.576 million spent on GPU costs to train the DeepSeek R1 model. This is in stark contrast to the eye-watering sums spent by various domestic and international AI companies, which have collectively invested tens of billions of dollars in the so-called "Battle of the Models." Musk's Grok 3, while impressive, came with its own steep price tag: Musk stated that Grok 3 consumed processing power equivalent to 200,000 NVIDIA GPUs, each costing about $30,000, while insiders estimate DeepSeek managed to achieve its results using only around 10,000 GPUs.
However, not all players in this rapidly evolving landscape are content to be overshadowed by DeepSeek.
Recently, a team led by Fei-Fei Li claimed to have developed a reasoning model, S1, for less than $50 in cloud computing costs. S1 demonstrates capabilities comparable to both OpenAI's o1 and DeepSeek's R1 in mathematical and coding tests. It is essential to clarify, however, that S1 is a mid-sized model and cannot be directly compared to DeepSeek's R1, which has hundreds of billions of parameters.
Despite the disparity in training costs, the question remains: how powerful is DeepSeek, really? What specific capabilities allow it to dominate the conversation? What is the true cost of developing a large model, and what components make up that cost? And what are the prospects for further reducing training costs in the future?
One of the misunderstandings surrounding DeepSeek stems from an oversimplified perception of its products. While DeepSeek R1 has garnered significant attention as one of DeepSeek's leading models, the company has developed a wide array of other models with varying functions and capabilities. The much-cited $5.576 million refers specifically to the GPU costs incurred during the training of DeepSeek's general model, DeepSeek-V3, essentially the net computing cost.
When comparing general and reasoning models, the distinctions become clearer. General models respond to detailed instructions and break down tasks based on user input, which must be articulated clearly. They tend to respond quickly, relying on pattern prediction over vast amounts of data. Reasoning models, in contrast, accept simpler, more direct instructions: users can express their requests plainly and let the model develop its own plan. However, they tend to respond more slowly, since they work through a chain of reasoning to arrive at answers, a process that can sometimes lead to overthinking.
Many users mistakenly assume that reasoning models are inherently superior to general models.
In reality, reasoning models represent a cutting-edge approach developed by OpenAI after the pre-training paradigm reached its limits, shifting more computational power to the reasoning stage. While reasoning models consume more resources and take longer to train, they are not always the best fit for every type of question. For instance, when prompted with straightforward questions, such as identifying a capital city, reasoning models can lag in efficiency compared to general models.
According to AI expert Liu Cong, on simpler inquiries, reasoning models yield slower response times and higher computational costs, and excessive deliberation can even produce inaccurate answers. He recommends employing reasoning models for complex challenges, like intricate math problems or coding tasks, while basic summarization or translation needs are better served by general models.
So what is the actual strength of DeepSeek? Various authoritative rankings and analyses have positioned DeepSeek highly within both the reasoning and general model categories. Within the first echelon of reasoning models, competitors include OpenAI's o-series models (such as o3-mini) and Google's Gemini 2.0, alongside DeepSeek-R1 and Alibaba's QwQ.
Despite ongoing discussion about whether DeepSeek R1 has the upper hand over OpenAI in capabilities, industry insiders contend that gaps remain between DeepSeek and OpenAI's latest o3 offerings. However, they agree that DeepSeek has significantly narrowed the technological divide that previously existed. "If there was a 2-to-3-generation gap before, it is now down to 0.5 generations with DeepSeek-R1's arrival," remarked Jiang Shu, a seasoned industry expert.
The sudden rise of DeepSeek raises essential questions about the economics of large AI models. What actually underpins the cost of creating a top-tier model? Liu Cong described the life cycle of large models, which generally encompasses two key phases: pre-training and post-training.
He likens the development of a model to raising a child: initially only capable of crying, the child eventually learns to understand and communicate effectively through interaction with adults.
Pre-training involves equipping the model with vast amounts of data, essentially feeding it a treasure trove of information so it can absorb knowledge. At this stage, however, the model only accumulates information without any ability to apply it. The post-training phase is where the model learns to use that knowledge, through techniques like supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF).
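To make the division of labor concrete, here is a purely illustrative toy sketch in PyTorch (a tiny byte-level model, not any DeepSeek code): phase one absorbs raw text via next-token prediction, phase two fine-tunes on instruction-formatted examples.

```python
import torch
import torch.nn as nn

vocab, dim = 256, 64                      # byte-level vocabulary, tiny width
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def step(text: bytes) -> float:
    ids = torch.tensor(list(text), dtype=torch.long)
    logits = model(ids[:-1])              # predict each next byte
    loss = loss_fn(logits, ids[1:])
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Phase 1: pre-training on raw corpus text ("absorbing knowledge").
step(b"The capital of France is Paris. Paris is the capital of France.")

# Phase 2: post-training, here a single SFT step on instruction-formatted
# data ("learning to apply knowledge"). RLHF would replace this supervised
# loss with a reward signal derived from human preferences.
step(b"### Instruction: Name the capital of France.\n### Response: Paris.")
```

A real pipeline uses vastly more data and machinery at both stages; the sketch only shows how the two phases differ in what the model is fed.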
All players in the AI space, whether domestic or international, follow this process. Since virtually all build on the widely adopted Transformer architecture, their foundations and training processes have no fundamental differences. Analysts and practitioners agree that training costs nonetheless vary significantly between organizations, primarily owing to hardware, data, and labor: each segment may use different methods and incur different costs.
Liu Cong illustrated this with several examples. The choice between purchasing and renting hardware yields drastically different price points: buying hardware incurs a hefty one-time investment but significantly lowers ongoing operational costs, which then consist mainly of electricity bills, whereas renting is cheaper up front but leaves recurring costs that accumulate over time. Training-data costs can likewise vary dramatically depending on whether the data is purchased or scraped and cleaned in-house. Further variation comes from the model iterations run before final release, which can influence overall cost, although many companies remain opaque about these aspects.
In summary, every phase of development carries high operational costs that often remain hidden from view. External estimates suggest that training costs for top models are staggering: GPT-4 is pegged at around $78 million, Llama 3.1 at over $60 million, and Claude 3.5 at approximately $100 million.
Yet, as these models remain closed source, their true costs are difficult to gauge accurately. In comparison, DeepSeek's modest $5.576 million training cost stands out sharply. However, it is crucial to note that this figure covers only the base model, DeepSeek-V3; the costs of research, architecture experiments, and algorithm trials are not included.
Analysis from semiconductor market research firm SemiAnalysis suggests that, factoring in capital expenditure for servers and operational costs, DeepSeek's total expenditure could climb to $2.573 billion over four years. Even so, compared with competitors' multi-billion-dollar investments, DeepSeek's figure appears impressively low.
Moreover, DeepSeek-V3 was trained on only 2,048 NVIDIA GPUs and required just 2.788 million GPU hours; by contrast, competitors like OpenAI have used clusters of tens of thousands of GPUs, and training Llama-3.1-405B consumed 30.84 million GPU hours. Notably, DeepSeek excels not only in training efficiency but also delivers cost-effective results at the inference stage.
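These figures are easy to sanity-check. The snippet below uses the numbers quoted above; the $2-per-GPU-hour rental rate is an assumption, though it happens to reproduce the reported $5.576 million exactly.

```python
# Sanity-checking the reported training figures.
deepseek_gpus = 2_048
deepseek_gpu_hours = 2_788_000
llama_gpu_hours = 30_840_000

days = deepseek_gpu_hours / deepseek_gpus / 24
print(f"Implied wall-clock training time: ~{days:.0f} days")            # ~57 days

rate = 2.0  # assumed USD per GPU-hour
print(f"Implied compute cost: ${deepseek_gpu_hours * rate / 1e6:.3f}M")  # $5.576M
print(f"Llama-3.1-405B used {llama_gpu_hours / deepseek_gpu_hours:.1f}x "
      f"the GPU-hours")                                                  # ~11.1x
```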
Such efficiency translates into pricing: DeepSeek's API prices for large-model functionalities such as text generation, dialogue, and code generation remain lower than those of competitors like OpenAI. DeepSeek prices input at 1 yuan per million tokens (on a cache hit) and output at 16 yuan per million tokens, compared with OpenAI's $0.55 (approximately 4 yuan) for input and $4.40 (around 31 yuan) for output, a significant cost advantage.
A cache hit means serving a request from stored, previously computed data rather than recalculating from scratch, which saves time and reduces cost. Pricing cache hits and misses separately has become an industry practice for sharpening API price competitiveness, giving smaller enterprises easier access to cutting-edge technology. Notably, after a recent discount period ended, DeepSeek raised its API prices to 0.5 yuan per million input tokens and 8 yuan per million output tokens, yet its pricing remains lower than that of other market leaders.
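A quick worked example shows how the quoted per-million-token prices translate into bills. The workload size and the yuan/dollar exchange rate below are illustrative assumptions, not figures from the article.

```python
# Cost comparison for a hypothetical workload at the quoted prices.
FX = 7.2  # assumed yuan per US dollar

def cost_yuan(in_tokens, out_tokens, in_price, out_price):
    """Prices are in yuan per million tokens for input and output."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

workload = dict(in_tokens=50_000_000, out_tokens=10_000_000)  # hypothetical

deepseek = cost_yuan(**workload, in_price=1.0, out_price=16.0)  # cache-hit price
openai = cost_yuan(**workload, in_price=0.55 * FX, out_price=4.40 * FX)
print(f"DeepSeek: {deepseek:.0f} yuan; OpenAI: {openai:.0f} yuan "
      f"(~{openai / deepseek:.1f}x more)")    # 210 vs ~515 yuan
```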
While forecasting comprehensive training costs remains a complex endeavor, industry insiders collectively regard DeepSeek as emblematic of the low-cost model market. Going forward, other competitors are expected to study DeepSeek's approach in pursuit of reducing their own overheads.
DeepSeek's cost efficiency can be attributed to optimizations at multiple levels, from model architecture through training. For instance, to handle complex problems efficiently, many companies use MoE (Mixture of Experts) models, which decompose tasks and assign them to specialized experts. While this approach is widely used, DeepSeek has pushed expert specialization unusually far.
By combining fine-grained expert segmentation (splitting experts into many smaller, more specialized units) with shared expert isolation (dedicating always-active experts to common knowledge, reducing redundancy across the routed experts), this design improves efficiency and performance, yielding faster and more accurate responses.
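A minimal sketch of those two ideas might look like the following (illustrative only, not DeepSeek's implementation; sizes and routing are simplified): many small routed experts chosen per token, plus a shared expert that always runs.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=64, n_routed=16, top_k=4):
        super().__init__()
        # Fine-grained segmentation: many small experts rather than a few large ones.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(n_routed))
        # Shared expert isolation: one always-active expert holds common
        # knowledge, so the routed experts need not all relearn it.
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.router = nn.Linear(dim, n_routed)
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        routed = torch.stack([
            sum(w * self.experts[int(e)](x[t]) for w, e in zip(weights[t], idx[t]))
            for t in range(x.size(0))])         # each token visits only top_k experts
        return self.shared(x) + routed          # shared expert always contributes

out = TinyMoE()(torch.randn(8, 64))             # 8 tokens through the layer
```

Because each token activates only a handful of small experts plus the shared one, most parameters sit idle on any given token, which is where the compute savings come from.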
Estimates suggest that DeepSeek's MoE model achieves results comparable to LLaMA2-7B while using only about 40% of the computation. Data handling is another critical frontier in training large models, with companies vying to improve computational efficiency while lowering hardware demands such as memory and bandwidth.
DeepSeek's approach incorporates FP8 low-precision training, a leading tactic among existing open-source models. In contrast to the predominantly adopted FP16 or BF16 mixed-precision training, FP8 allows for considerably faster training.
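The appeal is easy to demonstrate: each FP8 value occupies a single byte, halving memory and bandwidth again relative to BF16, at the price of coarser precision. The snippet below (requires PyTorch 2.1+ for the float8 dtypes; actual FP8 training additionally needs hardware support and careful scaling, which this omits) compares storage and round-trip error per value.

```python
import torch  # requires PyTorch >= 2.1 for the float8 dtypes

x = torch.randn(1024) * 4                        # synthetic activations
for dtype in (torch.bfloat16, torch.float8_e4m3fn):
    roundtrip = x.to(dtype).to(torch.float32)    # quantize, then dequantize
    err = (x - roundtrip).abs().mean()
    nbytes = torch.empty(0, dtype=dtype).element_size()
    print(f"{str(dtype):>24}: {nbytes} byte(s)/value, "
          f"mean abs rounding error {err:.4f}")
```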
In post-training reinforcement learning, policy optimization often presents tricky challenges; it can be likened to teaching a model, much like AlphaGo, to choose optimal moves in a strategy game. DeepSeek elected to employ GRPO (Group Relative Policy Optimization) rather than PPO (Proximal Policy Optimization); the key distinction is that GRPO dispenses with the separate value model that PPO requires, thereby reducing compute requirements and associated costs.
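The essence of the difference fits in a few lines. In the published GRPO formulation, a group of responses is sampled for each prompt, and each response's advantage is its reward normalized by the group's own mean and standard deviation, so no value network is needed (the rewards below are illustrative):

```python
import torch

# Rewards for a group of 4 responses sampled for the same prompt (illustrative).
rewards = torch.tensor([0.0, 1.0, 0.5, 1.0])

# GRPO advantage: each response is scored relative to its own group,
# using the group mean and standard deviation as the baseline.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)   # above-average answers are reinforced, others penalized

# PPO would instead compute advantages against V(state), the output of a
# separate value network trained alongside the policy; GRPO avoids that
# extra model entirely.
```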
On the inference front, DeepSeek's Multi-head Latent Attention (MLA) mechanism has drastically reduced memory usage and computational complexity, a saving that shows up directly in its lower API pricing.
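A simplified sketch of the latent-KV idea behind MLA follows (dimensions are illustrative, and this omits parts of the published design such as decoupled rotary position embeddings): instead of caching full per-head keys and values, the model caches a small latent vector per token and reconstructs K and V from it at attention time.

```python
import torch
import torch.nn as nn

dim, latent, n_heads, head_dim = 512, 64, 8, 64   # illustrative sizes

down = nn.Linear(dim, latent)                  # compress hidden state to a latent
up_k = nn.Linear(latent, n_heads * head_dim)   # reconstruct per-head keys
up_v = nn.Linear(latent, n_heads * head_dim)   # reconstruct per-head values

h = torch.randn(1000, dim)                     # hidden states for 1000 cached tokens
kv_cache = down(h)                             # cache only the (1000, 64) latents

classic = 2 * 1000 * n_heads * head_dim        # entries a standard KV cache stores
print(f"cached values: {kv_cache.numel()} vs {classic} "
      f"({classic / kv_cache.numel():.0f}x smaller)")

# At attention time, K and V are rebuilt from the cached latents on the fly.
k = up_k(kv_cache).view(1000, n_heads, head_dim)
v = up_v(kv_cache).view(1000, n_heads, head_dim)
```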
Ultimately, the prevalent takeaway from DeepSeek's trajectory is that it has spotlighted alternative avenues for improving model reasoning, demonstrating that purely supervised fine-tuned models (SFT) and purely reinforcement-learned models (RL) can each constitute viable pathways for building reasoning models.
In essence, future paths in reasoning model development can potentially span four approaches: 1) Pure Reinforcement Learning (DeepSeek-R1-zero), 2) SFT + Reinforcement Learning (DeepSeek-R1), 3) Pure SFT (DeepSeek’s distilled models), and 4) Pure Prompting (cost-effective smaller models).
As Liu Cong noted, previously the consensus had centered on SFT combined with reinforcement learning, without recognizing the positive outcomes achievable by pursuing either SFT or reinforcement learning independently.
The implications of DeepSeek's cost-cutting methodologies stretch beyond the technical realm, marking a pivotal influence on the strategic paths AI companies might choose.
According to Inno Angel Fund partner Wang Sheng, pathways toward artificial general intelligence (AGI) can unfold along two vectors: one prioritizes a computational arms race, pouring technology and investment into enhancing large-model capabilities before commercializing; the other opts for algorithmic efficiency, aiming for immediate industrial application through innovative designs that yield low-cost, high-performance models.
As Wang Sheng emphasizes, the suite of models developed by DeepSeek validates the feasibility of the efficiency-optimized paradigm over endless capability escalation at a time when ceilings for advancement seem hard to break. Practitioners remain optimistic, believing that the continual evolution of algorithms will drive the training costs of large models steadily lower.
Notably, Cathie Wood, founder and CEO of Ark Invest, has observed that even before DeepSeek, AI training costs were falling by an average of 75% per year, with inference costs plummeting by 85% to 90%. Wang Sheng has echoed this, noting that newly launched models see significant cost reductions by the end of their first year, potentially dropping to as little as one-tenth of the original cost.
A recent analysis from independent research firm SemiAnalysis corroborates that falling inference costs are among the substantive strides driving AI's ongoing advancement: what formerly required supercomputers and extensive GPU arrays to match GPT-3 performance can now often be accomplished with smaller models on an ordinary laptop, at a fraction of the cost. Dario Amodei, CEO of Anthropic, believes that for GPT-3-level quality, costs have declined by as much as 1,200 times.
Looking ahead, attention remains fixed on the accelerating decline of large-model costs in what promises to be a transformative era for AI. The paths now opening lead not only toward more sophisticated models but toward affordable solutions, and those will be paramount as we venture into a future replete with possibilities anchored in AI-driven intelligence.