Running the model with dense attention
Since the model is not yet supported in llama.cpp, I ran an experiment: DeepSeek V3.2 with dense attention, by disabling the lightning indexer. The model seems to perform just fine, at least based on my limited testing. Do you think the model performance (by this I mean intelligence, not speed) will be affected by doing this?
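To be clear about what I mean by "disabling the lightning indexer", here is a rough sketch in plain NumPy, not llama.cpp or DeepSeek code; the indexer's own scoring is abstracted into a plain array, since the only point is the selected-top-k vs. attend-to-everything difference:

```python
# Sketch only: contrasts indexer-selected sparse attention with plain dense attention.
# Not the actual DeepSeek V3.2 / llama.cpp implementation.
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def dense_attend(q, K, V):
    # Ordinary attention: the query sees every key.
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def sparse_attend(q, K, V, indexer_scores, top_k):
    # "Lightning indexer"-style selection: keep only the top_k highest-scoring keys,
    # mask out the rest, then attend as usual.
    keep = np.argsort(indexer_scores)[-top_k:]
    mask = np.full(K.shape[0], -np.inf)
    mask[keep] = 0.0
    scores = K @ q / np.sqrt(q.shape[-1]) + mask
    return softmax(scores) @ V

# Disabling the indexer simply means always taking the dense_attend() path.
```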
Yes
@Fujikre
I tried quite hard to find a difference in model intelligence. I tested a Q4_K_M quantized DeepSeek V3.2 GGUF with the lightning indexer disabled, in a regular llama.cpp build, by running my lineage-bench logical reasoning benchmark. The result (numbers are mean accuracy per difficulty; there are 4 difficulty levels with 40 quizzes each, and the first `lineage` column is the mean over all four difficulties):
| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|---|---|---|---|---|---|---|
| 1 | deepseek/deepseek-v3.2 | 0.988 | 1.000 | 1.000 | 1.000 | 0.950 |
So DeepSeek V3.2 with dense attention solved almost all quizzes correctly (there were 160 quizzes overall); it got only 2 of the most difficult quizzes wrong. When I tested the original model via the API, the result was:
| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|---|---|---|---|---|---|---|
| 1 | deepseek/deepseek-v3.2 | 0.956 | 1.000 | 1.000 | 0.975 | 0.850 |
So it looks like the model with dense attention actually performed a bit better than the sparse-attention one.
Or perhaps you meant that the model intelligence will be affected, but positively?
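For anyone who wants to sanity-check the numbers, here is a small script of my own (a quick reconstruction, not part of lineage-bench): the per-difficulty accuracies are copied from the tables above, with 40 quizzes per level, and it recomputes the overall mean (matching the `lineage` column up to rounding) and the number of failed quizzes:

```python
# Quick sanity check of the two tables above: 40 quizzes per difficulty level,
# per-difficulty accuracies copied from the tables.
QUIZZES_PER_LEVEL = 40

runs = {
    "dense attention (local GGUF, Q4_K_M)": [1.000, 1.000, 1.000, 0.950],
    "sparse attention (official API)":      [1.000, 1.000, 0.975, 0.850],
}

for name, accuracies in runs.items():
    wrong_per_level = [round((1.0 - a) * QUIZZES_PER_LEVEL) for a in accuracies]
    overall_mean = sum(accuracies) / len(accuracies)
    print(f"{name}:")
    print(f"  overall mean accuracy: {overall_mean:.4f}")
    print(f"  wrong quizzes per level: {wrong_per_level} (total {sum(wrong_per_level)} of 160)")
```

It confirms the dense run missed 2 quizzes in total (both at lineage-192), versus 7 for the API run.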