“ Over the past 24 hours (UTC+8 02/27/2025 12:00 PM to 02/28/2025 12:00 PM), the combined peak node occupancy for V3 and R1 inference services reached 278, with an average occupancy of 226.75 nodes (each node contains 8 H800 GPUs). Assuming the leasing cost of one H800 GPU is $2 per hour, the total daily cost amounts to $87,072…
If all tokens were billed at DeepSeek-R1’s pricing (*), the total daily revenue would be $562,027, with a cost profit margin of 545%. However, our actual revenue is substantially lower for the following reasons…”
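The quoted figures are internally consistent; a quick sanity check using only the numbers stated in the post:

```python
# Sanity check of the quoted figures (all inputs come from the post itself).
avg_nodes = 226.75        # average node occupancy over the 24 h window
gpus_per_node = 8         # H800 GPUs per node
gpu_hour_cost = 2.0       # assumed leasing cost, $ per GPU-hour
hours = 24

daily_cost = avg_nodes * gpus_per_node * gpu_hour_cost * hours
print(daily_cost)         # 87072.0, matching the stated $87,072

daily_revenue = 562_027   # theoretical revenue at R1 API pricing
margin = (daily_revenue - daily_cost) / daily_cost
print(round(margin * 100))  # 545, matching the stated 545% margin
```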
That 545% is a gross margin in theory only. They project revenue as if every token were billed at API prices, but not all traffic is API: chat traffic is free, so the figure is hypothetical. This is a response to the raging debate over whether their API pricing implies a negative gross margin. Apparently not, if you can saturate the hardware with traffic.
Interesting that they chose to split prefilling into its own independent service; I hadn't heard of that technique before. I found this paper that studies why disaggregating prefill from decoding can be beneficial: https://arxiv.org/abs/2401.09670v1
There are a lot of shared prefixes across user prompts (system prompts, few-shot templates, earlier turns of a conversation). You can save compute by first checking the cache for the longest cached prefix of a new prompt and reusing its KV cache, so only the remaining tokens need prefilling. Their MLA makes the KV cache particularly compact, which makes caching it especially cheap.
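A minimal sketch of that longest-prefix lookup idea (the class and names here are my own illustration, not DeepSeek's code; production systems typically cache at fixed-size block granularity with hashing rather than a linear scan):

```python
# Toy prefix KV cache: maps a token-id prefix to its cached KV state
# (opaque placeholder here). On a new prompt, reuse the longest cached
# prefix and prefill only the remaining suffix.

class PrefixCache:
    def __init__(self):
        self._store = {}  # tuple of token ids -> cached KV state

    def insert(self, tokens, kv_state):
        self._store[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        """Return (matched_len, kv_state) for the longest cached prefix."""
        for end in range(len(tokens), 0, -1):
            kv = self._store.get(tuple(tokens[:end]))
            if kv is not None:
                return end, kv
        return 0, None

cache = PrefixCache()
cache.insert([1, 2, 3], "kv-for-system-prompt")      # shared system prompt
matched, kv = cache.longest_prefix([1, 2, 3, 4, 5])  # new user prompt
print(matched)  # 3 -> only tokens 4 and 5 still need prefilling
```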