DeepSeek V4 Flash on 2x DGX Spark: 61 tok/s single, 261 tok/s at 16 concurrent (DSpark spec-decode, decode-only)
Published reviews top out at ~40 tok/s for DeepSeek V4 Flash on a dual DGX Spark. On our 2-node cluster, the DeepSeek-V4-Flash-DSpark checkpoint lifts decode throughput to 61 tok/s single-stream and 261 tok/s at 16 concurrent streams. The lever is speculative decoding, not FP4. The honest catch: it crashes under concurrent long-context load.
#software #performance #nvidia #dgx-spark #deepseek #speculative-decoding