deepseek

DeepSeek-V3: an open-source Mixture-of-Experts model with reasoning distilled from DeepSeek-R1

🚀 DeepSeek-V3: Scaling Open-Source AGI with Efficiency

DeepSeek-V3 is a 671B-parameter Mixture-of-Experts (MoE) model with 37B parameters activated per token, designed to push the boundaries of open-source LLMs. It combines Multi-head Latent Attention (MLA) and DeepSeekMoE for efficient training and inference, and pioneers an auxiliary-loss-free load-balancing strategy and a multi-token prediction objective to enhance performance.

AI Research Breakthrough
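
To make the sparse-activation idea concrete, here is a toy top-k routing layer in PyTorch. It is a minimal sketch with invented sizes (8 experts, top-2 routing, 64-dim tokens), not DeepSeek's actual routing code: each token is dispatched only to its top-k experts, which is why only 37B of the 671B parameters participate in any single token's forward pass.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Minimal top-k MoE layer: each token activates only k of n experts."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.SiLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.gate(x)                   # token-to-expert affinities
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = topk_scores.softmax(dim=-1)   # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            for e in idx.unique():              # run each selected expert on its tokens
                mask = idx == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(5, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([5, 64])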

🔧 Optimized Training: FP8 Precision and DualPipe Algorithm

DeepSeek-V3 introduces FP8 mixed-precision training and the DualPipe pipeline-parallelism algorithm, which overlaps computation with communication to hide nearly all cross-node communication overhead and keep training efficiency high. This enables pre-training on 14.8T tokens at a cost of only 2.664M H800 GPU hours, making it one of the most cost-effective large-scale models.

Training Optimization
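
As a rough illustration of the FP8 idea, the sketch below quantizes a tensor to the E4M3 format with a single per-tensor scale and dequantizes it back. It relies on PyTorch's torch.float8_e4m3fn dtype; the actual DeepSeek-V3 recipe uses finer-grained tile- and block-wise scaling plus higher-precision accumulation, so treat this as a conceptual round trip only.

```python
import torch

def fp8_roundtrip(x: torch.Tensor):
    """Quantize to FP8 (E4M3) with one per-tensor scale, then dequantize."""
    E4M3_MAX = 448.0                              # largest finite value in E4M3
    amax = x.abs().max().clamp(min=1e-12)
    scale = E4M3_MAX / amax
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)   # low-precision storage / matmul input
    x_back = x_fp8.to(torch.float32) / scale      # dequantized for higher-precision accumulation
    return x_fp8, x_back

x = torch.randn(4, 4)
_, x_back = fp8_roundtrip(x)
print((x - x_back).abs().max())                   # small quantization error
```

For scale, the reported figures work out to roughly 14.8e12 tokens / 2.664e6 GPU hours, i.e. about 5.6 million tokens processed per H800 GPU hour of pre-training.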

📦 Post-Training: Knowledge Distillation from DeepSeek-R1

DeepSeek-V3 incorporates reasoning capabilities from DeepSeek-R1 through innovative distillation techniques, enhancing its performance in math, coding, and reasoning tasks. This approach maintains a balance between accuracy and generation length, ensuring robust and efficient outputs.

Model Distillation
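
DeepSeek-V3's distillation reportedly works through fine-tuning data generated with R1-style reasoning models rather than direct logit matching. Purely as a generic reference point for what "distillation" means in the narrow sense, the sketch below shows the classic soft-label knowledge-distillation loss, with an assumed temperature; it is not DeepSeek's exact recipe.

```python
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic soft-label KD loss (illustrative only, not the DeepSeek-V3 pipeline)."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), rescaled by t^2 so gradient magnitudes stay comparable.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Toy usage: logits over a 10-token vocabulary for a batch of 4 positions.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
print(soft_label_distillation_loss(student, teacher))
```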

🏛️ Architecture: Multi-head Latent Attention and DeepSeekMoE

DeepSeek-V3's architecture is built on the Transformer framework, featuring Multi-head Latent Attention (MLA) for efficient inference and DeepSeekMoE for economical training. MLA reduces the size of the Key-Value (KV) cache during inference, while DeepSeekMoE employs a novel auxiliary-loss-free load-balancing strategy to keep expert utilization balanced during training.

Architecture Innovation
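
The sketch below illustrates the latent-KV idea behind MLA with invented dimensions (the real layer also compresses queries and handles rotary position embeddings separately): keys and values are reconstructed on the fly from a small cached latent vector, so the per-token cache shrinks from 2 × n_heads × d_head values to just d_latent values.

```python
import torch
import torch.nn as nn

class ToyLatentKV(nn.Module):
    """Caches a low-rank latent instead of full K/V (simplified MLA-style idea)."""
    def __init__(self, d_model=1024, n_heads=8, d_head=128, d_latent=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)            # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # reconstruct keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # reconstruct values
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h):                        # h: (batch, seq, d_model)
        latent = self.down(h)                    # (batch, seq, d_latent) <- only this is cached
        k = self.up_k(latent).unflatten(-1, (self.n_heads, self.d_head))
        v = self.up_v(latent).unflatten(-1, (self.n_heads, self.d_head))
        return latent, k, v

layer = ToyLatentKV()
latent, k, v = layer(torch.randn(1, 16, 1024))
# Cache per token: 64 floats for the latent vs 2 * 8 * 128 = 2048 floats for full K/V.
print(latent.shape, k.shape, v.shape)
```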

🔮 Multi-Token Prediction (MTP)

DeepSeek-V3 introduces Multi-Token Prediction (MTP), a training objective that predicts multiple future tokens at each position. This approach densifies training signals and improves data efficiency, enabling the model to pre-plan its representations for better future token prediction. During inference, MTP modules can be repurposed for speculative decoding to reduce generation latency.

Training Enhancement
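
A toy version of the objective is sketched below, with a GRU standing in for the Transformer trunk and invented sizes; the actual MTP modules are additional sequential Transformer layers that preserve the full causal chain. The point is simply that each position is trained to predict not only the next token but also the one after it, and at inference the extra head's guesses can serve as drafts for speculative decoding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPHeads(nn.Module):
    """Next-token head plus one extra depth predicting the token after next."""
    def __init__(self, d_model=128, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for the trunk
        self.head_next = nn.Linear(d_model, vocab)   # predicts token t+1
        self.head_mtp = nn.Linear(d_model, vocab)    # predicts token t+2 (extra signal)

    def forward(self, tokens):                       # tokens: (batch, seq)
        h, _ = self.backbone(self.embed(tokens))
        return self.head_next(h), self.head_mtp(h)

def mtp_loss(model, tokens, lam=0.3):
    logits1, logits2 = model(tokens)
    # Main loss: position t predicts token t+1.
    loss1 = F.cross_entropy(logits1[:, :-1].flatten(0, 1), tokens[:, 1:].flatten())
    # MTP loss: position t also predicts token t+2, densifying the training signal.
    loss2 = F.cross_entropy(logits2[:, :-2].flatten(0, 1), tokens[:, 2:].flatten())
    return loss1 + lam * loss2

model = ToyMTPHeads()
tokens = torch.randint(0, 1000, (2, 32))
print(mtp_loss(model, tokens))
```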

Released under the MIT License.
