31 March
DeepSeek Series EP.2
DeepSeek V3 has 671 billion parameters, yet it was trained on a budget of just $5.6 million. That number sounds impossible when you consider that OpenAI reportedly spent over $100 million training GPT-4, a model of comparable scale. So how did DeepSeek pull it off? The answer lies in an architecture called Mixture of Experts (MoE) — a technique that has fundamentally changed the economics of AI. This article is EP.2 of our DeepSeek Series, and it takes a deep dive into what MoE is, how it works, and why it makes DeepSeek up to 10x cheaper than GPT.
Quick Summary — What is Mixture of Experts (MoE)?
- MoE = an architecture that divides a model into many specialized "experts," but activates only a few of them per query
- DeepSeek V3: 671B total parameters, but only 37B are active per token (~5.5%)
- Reduces compute costs by 10-50x compared to a dense model of equivalent size
- The same family of techniques is used by Google (in the Switch Transformer, and reportedly in Gemini) and by Mistral in Mixtral
What is MoE? — A Simple Explanation
Mixture of Experts (MoE) is a neural network architecture that does not use every part of the model to process every input. Instead, a Router (gating network) decides which "experts" should handle each input. The concept was first proposed in 1991 by Robert A. Jacobs, Michael I. Jordan, Steven Nowlan, and Geoffrey Hinton, but it has only been applied at scale in large language models over the past few years.
To make this concrete, imagine a large hospital with 256 specialist doctors — cardiologists, orthopedic surgeons, ophthalmologists, dermatologists, and so on. When a patient arrives, a triage nurse (the Router) evaluates their symptoms and refers them to only the 8 most relevant specialists, rather than having every single doctor examine every case.
| Aspect | Dense Model (Traditional) | MoE Model (e.g., DeepSeek V3) |
|---|---|---|
| Analogy | Every doctor examines every case | Router selects the right specialists |
| Resource Usage | Every parameter activated for every token | Only selected experts are activated |
| Speed | Slow (must compute all parameters) | Fast (computes only selected portion) |
| Cost | High | Low (10-50x reduction) |
| Total Knowledge | Limited by total parameter count | Greater (many params, but only a few used at a time) |
Here is how the MoE mechanism works in summary: an Input (Token) is fed into a Router Network, which computes an "affinity score" for each Expert, then selects the highest-scoring Experts (for example, 8 out of 256). The outputs from each selected Expert are then combined using a Weighted Sum to produce the final output.
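The route-then-combine mechanism just described fits in a few lines of Python. The sketch below is a toy illustration (4 experts, top-2 routing, hand-picked weights), not DeepSeek's actual implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, router_weights, experts, top_k=2):
    """One MoE layer for a single token, at toy scale.

    token          -- input vector (list of floats)
    router_weights -- one weight vector per expert; its dot product with
                      the token is that expert's affinity logit
    experts        -- list of callables, each mapping a vector to a vector
    """
    # Step 1: the Router computes an affinity score per expert
    logits = [sum(t * w for t, w in zip(token, wv)) for wv in router_weights]
    probs = softmax(logits)

    # Step 2: keep only the top-k highest-scoring experts
    top = sorted(range(len(experts)), key=probs.__getitem__, reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)  # renormalise gates over the chosen few

    # Step 3: weighted sum of the selected experts' outputs
    out = [0.0] * len(token)
    for i in top:
        expert_out = experts[i](token)
        gate = probs[i] / norm
        out = [o + gate * y for o, y in zip(out, expert_out)]
    return out, top

# Four toy "experts" that just scale their input by different factors.
experts = [lambda v, f=f: [f * x for x in v] for f in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1, 0], [0, 1], [1, 1], [-1, 0]]
output, chosen = moe_forward([1.0, 0.0], router_weights, experts)
```

Note that the experts the router skips are never evaluated at all; that is exactly where the compute savings come from.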
DeepSeek V3's MoE Architecture — The Numbers in Detail
DeepSeek V3 does not use a standard MoE architecture. Instead, it employs a custom-built system called DeepSeekMoE, which introduces several innovations that push performance well beyond what conventional MoE delivers. Here are the key specifications:
| Specification | Value | Notes |
|---|---|---|
| Total Parameters | 671 billion (671B) | Combined size of the entire model |
| Active Parameters per token | 37 billion (37B) | Only ~5.5% of the total |
| Number of Experts | 256 Routed + 1 Shared | Shared Expert is active for every token |
| Experts Selected per token | 8 Experts | Out of 256 Routed Experts |
| Architecture | DeepSeekMoE | Auxiliary-loss-free load balancing |
| Training Cost | $5.576 million | 2.788 million H800 GPU-hours |
| Training Tokens | 14.8 trillion (14.8T) | Massive dataset, but processed cheaply |
The most critical number here is ~5.5% — meaning that for each token passing through the model, only 37B out of 671B total parameters are used. Think of it as having a team of 257 people (256 Routed + 1 Shared), but assigning only 9 of them (8 Routed + 1 Shared) to any given task. The Shared Expert is a specialist who works on every single token regardless of what the input is — like a general practitioner who sees every patient before referring them onward.
Why is MoE 10x Cheaper Than Dense Models?
The core of the answer lies in compute per token. In a dense model, every token must pass through every parameter. In an MoE model, each token only passes through the selected experts. Here is a side-by-side comparison:
| Aspect | Dense Model (e.g., GPT-4) | MoE Model (e.g., DeepSeek V3) |
|---|---|---|
| Total Parameters | ~1.8T (rumored estimate) | 671B |
| Active Params per token | All params = ~1.8T (if fully dense) | 37B (~5.5%) |
| Compute per token | Very high | Low — tens of times less |
| Training Cost | $100M+ | $5.6M |
| Inference Cost (API) | $15 / 1M tokens | $0.27 / 1M tokens |
| Advantage | Consistent performance across all queries | 10-50x cheaper for both training and inference |
| Disadvantage | Extremely expensive for both training and inference | Requires massive RAM (entire 671B model must be loaded into memory) |
The price difference of $0.27 vs. $15 per million tokens represents a 55x cost reduction, which is why DeepSeek has become accessible to small and mid-sized organizations. Consider an organization that uses AI to process 100,000 pages of documents per month — the cost difference could amount to hundreds of thousands of Thai Baht. However, the key trade-off with MoE is RAM requirements. Although inference is faster and cheaper, the entire 671B model must be loaded into memory, requiring GPUs with extremely high VRAM or multi-server setups.
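The memory trade-off in that last paragraph can be checked with back-of-envelope arithmetic: an MoE model computes with few parameters per token, but it must still hold all of them. A quick sketch (weights only; the KV Cache, activations, and runtime overhead come on top of this):

```python
# Back-of-envelope memory math for a DeepSeek-V3-scale MoE model.
PARAMS_TOTAL = 671e9    # total parameters
PARAMS_ACTIVE = 37e9    # parameters actually used per token

def weights_gb(n_params, bytes_per_param):
    """Memory needed just to store the weights, in GB."""
    return n_params * bytes_per_param / 1e9

print(round(weights_gb(PARAMS_TOTAL, 1)))    # at FP8 (1 byte/param): ~671 GB held
print(round(weights_gb(PARAMS_ACTIVE, 1)))   # ...but only ~37 GB computed with per token
print(round(PARAMS_ACTIVE / PARAMS_TOTAL * 100, 1))  # ~5.5% active
```

The gap between "held in memory" and "computed with" is the whole story: inference is cheap in FLOPs, yet self-hosting still needs enough combined VRAM for every expert.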
DeepSeek's Special Techniques — This Is Not Ordinary MoE
DeepSeek did not simply adopt standard MoE. The team developed several supplementary techniques that push the model's efficiency even further. Here are the four key innovations that make DeepSeek V3 stand out:
1. Multi-head Latent Attention (MLA)
One of the biggest challenges in Transformer models is the KV Cache — data that must be stored in memory during inference to retain the context of previous tokens. The larger the model and the longer the context, the more memory the KV Cache consumes. A model on the scale of 671 billion parameters could require a KV Cache of approximately 200GB for long contexts, necessitating multiple GPUs just for memory.
DeepSeek solved this problem with Multi-head Latent Attention (MLA), a technique that compresses the KV Cache using Low-rank Compression. This reduces the KV Cache from roughly 200GB to just ~20GB — a 10x reduction. The result is significantly cheaper inference and the ability to serve many more concurrent users without sacrificing answer quality.
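The compression idea can be sketched as follows. This is a simplified toy (made-up dimensions, random matrices, and it ignores details such as value reconstruction and RoPE handling): instead of caching full per-head keys and values for every token, only one small latent vector per token is cached, and per-head keys are reconstructed from it on demand.

```python
import random

D_MODEL = 64     # toy hidden size
N_HEADS = 8
D_HEAD = 16      # per-head key/value size
D_LATENT = 8     # size of the compressed latent (the only thing cached)

def rand_matrix(rows, cols, seed):
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

W_down = rand_matrix(D_LATENT, D_MODEL, seed=0)                  # compress: h -> c
W_up_k = [rand_matrix(D_HEAD, D_LATENT, seed=h + 1) for h in range(N_HEADS)]

def cache_entry(hidden):
    """What gets stored per token: just the low-rank latent."""
    return matvec(W_down, hidden)

def key_for_head(latent, head):
    """Reconstruct one head's key from the cached latent at attention time."""
    return matvec(W_up_k[head], latent)

# Cache footprint per token: D_LATENT floats instead of keys+values for all heads.
full_cache = N_HEADS * D_HEAD * 2
print(full_cache // D_LATENT)   # 32x smaller in this toy configuration
```

The reconstruction matrices add a little compute at attention time, but memory, not compute, is the bottleneck the KV Cache creates, so the trade is heavily favorable.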
2. Auxiliary-loss-free Load Balancing
A classic problem in MoE architectures is Load Imbalance — some experts get selected far too often (Popular Experts) while others are never chosen at all (Dead Experts). This wastes resources and degrades model quality.
The traditional fix involves adding an Auxiliary Loss Term to the loss function, forcing the Router to distribute tokens evenly across all experts. But this approach often hurts main task performance because the two loss objectives conflict with each other. DeepSeek solved this with Auxiliary-loss-free Load Balancing, which uses a Bias Term in the Router instead of adding a loss term. This achieves even distribution without compromising model quality — an elegantly simple innovation with remarkably effective results.
3. FP8 Mixed Precision Training
Normally, AI models are trained using FP16 or BF16 (16-bit floating point), which demands substantial memory and compute. DeepSeek was one of the first companies to successfully train a large-scale model using FP8 (8-bit floating point), cutting memory usage in half and significantly accelerating training speed. This is not pure FP8 — it is Mixed Precision, using FP8 for operations that tolerate precision loss (such as Forward/Backward Pass) while maintaining FP32 for critical operations (such as Loss Accumulation).
4. Multi-Token Prediction (MTP)
Standard language models predict one token per step — generating 100 tokens requires 100 steps. DeepSeek V3 adds Multi-Token Prediction (MTP): during training the model learns to predict several future tokens at each position, and at inference the MTP module can be used speculatively, making generation up to 1.8x faster. Roughly speaking, instead of predicting "I", then "love", then "this", then "system" one word at a time, the model proposes "I love this system" in one step and keeps the proposal when verification confirms it. This dramatically reduces the time users spend waiting for responses.
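The speed-up can be illustrated with a toy version of speculative multi-token generation. The "models" below are stand-in functions, not DeepSeek's: propose a block of tokens at once, accept the verified prefix, and fall back to one-token generation on a mismatch.

```python
def generate(main_next, propose_block, prompt, n_tokens, block=4):
    """Greedy generation with block proposals.

    main_next     -- the 'large model': next token given a sequence
    propose_block -- the multi-token predictor: several tokens at once
    In a real system, verifying a whole block happens in one parallel
    forward pass, which is where the speed-up comes from.
    """
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        for tok in propose_block(out, block):
            if len(out) - len(prompt) >= n_tokens:
                break
            if tok == main_next(out):
                out.append(tok)             # proposal verified: accepted for free
            else:
                out.append(main_next(out))  # mismatch: take the main model's token
                break
    return out

# Stand-in "models": the main model counts upward; one proposer is perfect,
# one goes wrong after two tokens. The OUTPUT is identical either way --
# only the number of tokens accepted per step (the speed) differs.
main = lambda seq: seq[-1] + 1
perfect = lambda seq, k: [seq[-1] + i + 1 for i in range(k)]
flaky = lambda seq, k: [seq[-1] + 1, seq[-1] + 2] + [0] * (k - 2)
```

The key property this sketch demonstrates: a bad proposer only costs speed, never correctness, because every accepted token is one the main model would have produced anyway.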
Summary of the 4 Techniques That Make DeepSeek V3 Special:
| Technique | Result | What It Reduces |
|---|---|---|
| MLA | KV Cache reduced by 10x | Memory cost during inference |
| Load Balancing | No dead experts | Wasted resources |
| FP8 Training | Memory halved, faster training | Training cost |
| MTP | Inference 1.8x faster | Response wait time |
MoE Across the AI Industry — Who Else Uses It?
DeepSeek is far from the only company using MoE. In fact, MoE has become the dominant trend in Large Language Model design for 2025-2026, as every major AI company seeks to reduce costs while scaling up capabilities. Here is a look at who is using MoE:
| Model | Company | Architecture | Experts | Status |
|---|---|---|---|---|
| DeepSeek V3 | DeepSeek | DeepSeekMoE | 256 + 1 Shared | Open-source (MIT) |
| Mixtral 8x22B | Mistral | MoE | 8 Experts | Open-source |
| Gemini 1.5 / 2 | Google | MoE (reported) | Undisclosed | Closed-source |
| GPT-4 (rumored) | OpenAI | MoE (rumored) | ~16 (unconfirmed) | Closed-source |
| Grok-1 | xAI | MoE | 8 Experts | Open-source |
As the table shows, DeepSeek V3 has by far the largest number of experts (256+1) compared to Mixtral (8) or Grok-1 (8). The more experts a model has, the more "specialized" it can become across diverse domains — but this also requires more sophisticated routing. The fact that DeepSeek manages to maintain excellent load balancing across 256 experts is an impressive engineering achievement. The MoE technique also serves as the foundation for developing Agentic AI, which demands fast and affordable processing at production scale.
How Does the Router Work? — The Heart of MoE That Most People Overlook
Many people understand MoE as simply "splitting experts and choosing between them," but the reality is that the Router Network is the most difficult component to design and the single biggest factor determining whether the model performs well or poorly. Here is how DeepSeek V3's Router operates:
Step 1: Compute Affinity Scores — When a token arrives, the Router calculates an "affinity score" between that token and every expert using a Linear Projection followed by a normalization step (classic MoE uses Softmax; DeepSeek V3 scores with a sigmoid), producing per-expert scores that indicate which experts the token should be routed to.
Step 2: Top-K Selection — From the probability distribution above, the Router selects the Top-8 Experts with the highest scores out of all 256. The value K=8 that DeepSeek chose represents a sweet spot between quality and efficiency — if K is too small, quality suffers; if K is too large, speed suffers.
Step 3: Weighted Combination — The outputs from the 8 selected experts are combined using a Weighted Sum based on the probabilities the Router computed. Experts with higher scores contribute more weight to the final output, making the result a "consensus opinion" from the most relevant specialists.
Step 4: Shared Expert — In addition to the 8 Routed Experts, there is 1 Shared Expert that processes every single token regardless of the topic. This Shared Expert acts as the model's "foundational knowledge" — covering things like language grammar, basic logic, and common sense that the model should know in every situation.
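The four steps above can be put together in one toy sketch. The experts, scores, and dimensions are made up; gate weights are simply the renormalized router scores of the chosen experts:

```python
def route(scores, k):
    """Steps 1-2: rank experts by affinity score and keep the top-k."""
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    norm = sum(scores[i] for i in top)
    return [(i, scores[i] / norm) for i in top]  # (expert index, gate weight)

def moe_layer(token, scores, routed_experts, shared_expert, k):
    out = [0.0] * len(token)
    # Step 3: weighted combination of the selected routed experts
    for i, gate in route(scores, k):
        for j, y in enumerate(routed_experts[i](token)):
            out[j] += gate * y
    # Step 4: the shared expert sees every token, no routing involved
    for j, s in enumerate(shared_expert(token)):
        out[j] += s
    return out

# Toy setup: 4 routed experts that scale the input, plus 1 shared expert.
routed = [lambda v, f=f: [f * x for x in v] for f in (1.0, 2.0, 3.0, 4.0)]
shared = lambda v: [0.5 * x for x in v]
result = moe_layer([1.0], [4.0, 3.0, 2.0, 1.0], routed, shared, k=2)
# result[0] is 4/7 * 1.0 + 3/7 * 2.0 + 0.5 from the shared expert
```

Notice that the shared expert bypasses the router entirely, which matches its role as "foundational knowledge" applied to every token.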
An Analogy to Visualize the Router:
Imagine a large enterprise call center with 256 specialized agents, each expert in a different area — accounting, procurement, IT support, etc. When a call comes in, the IVR system (the Router) asks one or two questions and then routes the call to the few most relevant agents simultaneously, while a supervisor (the Shared Expert) listens in on every call. MoE works exactly the same way, but at the neural network level, processing billions of tokens per day.
Limitations of MoE — Not a Silver Bullet
While MoE has made AI dramatically cheaper, it comes with limitations that must be understood:
| Limitation | Description | Impact |
|---|---|---|
| High Memory Footprint | The entire model must be loaded into RAM, even though only a fraction is used per token — DeepSeek V3 requires 350GB+ VRAM | Requires multiple servers or expensive GPUs for self-hosting |
| Routing Overhead | The Router must compute affinity scores for every expert before selecting Top-K — the more experts, the longer this computation takes | Adds slight latency, especially in models with 256+ experts |
| Expert Collapse | If the Router is poorly designed, some experts will never be selected (Dead Experts), wasting resources | DeepSeek solved this with Auxiliary-loss-free Load Balancing, but other companies may still face this issue |
| Inconsistency | Similar tokens may be routed to different experts, resulting in slightly inconsistent outputs | Noticeable in tasks requiring deterministic outputs, such as financial calculations |
| Communication Overhead | When running across multiple GPUs, tokens must be transmitted between machines — the more distributed the experts, the more data must be transferred | Requires high-bandwidth interconnects (NVLink/InfiniBand) between GPUs |
These limitations do not mean MoE is flawed — they mean that not every organization should attempt to self-host a large-scale MoE model. For many organizations, using MoE models via API is the better approach, since the infrastructure challenges fall on the service provider, not on you. For more details, see EP.4: Running DeepSeek On-Premise.
The Future of MoE — Another Wave of Change Is Coming
MoE is not the end of the road — it is a starting point. Here are the key trends emerging in 2026-2027 that will make MoE even more powerful:
- Expert Pruning: Removing underutilized experts to make the model smaller without sacrificing quality — potentially reducing memory requirements by an additional 30-50%
- Dynamic Expert Loading: Loading experts into VRAM only when they are needed rather than loading the entire model at once — significantly reducing the memory footprint
- Hierarchical MoE: Multiple layers of experts — first selecting a category-level expert, then choosing sub-specialists within that category — improving routing accuracy
- MoE + Speculative Decoding: Using a small draft model to generate rough predictions first, then having the large model verify them — potentially achieving 3-5x speed improvements
For Thai organizations planning their long-term AI strategy, these trends mean that AI will become even cheaper and faster within the next one to two years. Starting to learn about and experiment with AI today will position your organization to capture the full benefits as the technology matures.
MoE and Thai Organizations — Why Should You Care?
You might wonder why an advanced AI architecture matters to organizations in Thailand. The answer is that it is directly relevant in several important ways:
Cheaper AI = Small Organizations Can Access It
Before MoE, using high-end AI cost hundreds of thousands of Thai Baht per month, putting it out of reach for all but the largest corporations. With MoE driving prices down by 10-50x, Thai SMEs with limited IT budgets can now access GPT-level AI for tasks like sales data analysis, automated report generation, or customer service chatbots. For example, a mid-sized manufacturing factory that needs to analyze product quality (QC) data — what used to cost 50,000 Baht per month in AI services could now cost just 2,000-5,000 Baht per month, making the investment clearly worthwhile.
Self-hosting MoE Requires Caution — High RAM Despite Low Compute
For organizations that want to run DeepSeek on their own servers, it is essential to understand that MoE has a unique limitation: the entire model must be loaded into RAM, even though only a fraction is used per token. DeepSeek V3 requires combined GPU VRAM of more than 350GB, which means a server with 8 or more A100/H100-class GPUs — costing millions of Baht. If your budget does not stretch that far, you can opt for smaller distilled models like DeepSeek-R1-Distill-Qwen-32B, which offer good quality at a fraction of the size, or simply use the API instead.
AI + ERP = Greater Return on Investment
For organizations running ERP systems, cheaper AI means that integrating AI with ERP for deep data analysis, automated reporting, or sales forecasting has become a worthwhile investment. API costs that were once prohibitive are now low enough for daily production use. Here are practical examples of AI-ERP integration that MoE makes economically viable:
- Automated P&L Summaries — Send data from ERP to AI for executive-friendly summaries in natural language
- Anomaly Detection — AI analyzes transactions within the ERP to flag irregularities (fraud detection)
- Internal Chatbot — Employees can ask how to use the ERP system and get instant answers without opening a manual
- SQL Assistant — Convert natural language questions into SQL queries to pull data directly from the ERP database
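As a concrete starting point for the last item, here is a sketch of how an SQL Assistant prompt could be assembled before sending it to any chat-completion API. The table schema and prompt wording are invented for illustration, not taken from a real ERP:

```python
# Hypothetical ERP schema, shortened for the example.
SCHEMA = (
    "CREATE TABLE sales (id INTEGER, customer TEXT, amount REAL, sold_at DATE);\n"
    "CREATE TABLE customers (name TEXT, region TEXT);"
)

def build_sql_prompt(question: str) -> str:
    """Wrap a natural-language question and the schema into one LLM prompt."""
    return (
        "You translate questions into SQL for the schema below.\n"
        "Answer with a single SQL statement and nothing else.\n\n"
        f"Schema:\n{SCHEMA}\n\n"
        f"Question: {question}"
    )

prompt = build_sql_prompt("Total sales per customer last month?")
```

Constraining the reply to a single SQL statement keeps the response easy to validate and execute against the ERP database; in production you would also restrict the model's access to read-only queries.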
If you would like to start from the beginning, read EP.1: What is DeepSeek? For information about security risks, continue to EP.3: Risks of Chinese AI. Or if you want to see how AI works with real ERP tasks, check out EP.5: Can DeepSeek Really Help with ERP?
DeepSeek Series — Read More
DeepSeek Series — 5 Episodes on the Chinese AI Challenger:
- EP.1: What is DeepSeek? — The Chinese AI That Shook the World
- EP.2: Mixture of Experts (MoE) — The Technique That Makes It 10x Cheaper (this article)
- EP.3: Risks of Chinese AI — What Thai Organizations Must Know
- EP.4: Running DeepSeek On-Premise — Is It Worth It?
- EP.5: Can DeepSeek Really Help with ERP?
MoE has proven that great AI does not have to consume massive resources — the era where "cheaper" means "more accessible" has already begun.
— Saeree ERP Team
