Running the model with dense attention

#35
by sszymczyk - opened

Since the model is not yet supported in llama.cpp, I ran an experiment: DeepSeek V3.2 with dense attention, obtained by disabling the lightning indexer. The model seems to perform just fine, at least based on my limited testing. Do you think the model's performance (by which I mean intelligence, not speed) will be affected by doing this?
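To make the idea concrete, here is a minimal single-head PyTorch sketch of the difference: with the indexer, each query attends only to the top-k keys that the indexer scores highest; without it, attention falls back to the full dense softmax over all keys. The function names, shapes, and the random stand-in for the indexer scores are illustrative assumptions, not DeepSeek's or llama.cpp's actual implementation (causal masking is also omitted for brevity).

```python
# Illustrative single-head sketch of "disabling the lightning indexer":
# instead of restricting each query to the top-k keys chosen by an indexer
# score, attend densely over all keys. Shapes and names are assumptions,
# not the real DeepSeek V3.2 / llama.cpp code; causal masking is omitted.
import torch
import torch.nn.functional as F

def dense_attention(q, k, v):
    # q, k, v: (seq_len, d) -- full quadratic attention over all positions
    scores = q @ k.T / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def indexer_sparse_attention(q, k, v, index_scores, top_k):
    # index_scores: (seq_len, seq_len) cheap relevance scores from an indexer;
    # each query attends only to its top_k highest-scoring keys.
    scores = q @ k.T / k.shape[-1] ** 0.5
    topk_idx = index_scores.topk(top_k, dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk_idx, 0.0)          # keep only the selected keys
    return F.softmax(scores + mask, dim=-1) @ v

seq_len, d, top_k = 8, 16, 4
q, k, v = (torch.randn(seq_len, d) for _ in range(3))
index_scores = torch.randn(seq_len, seq_len)  # stand-in for the lightning indexer output

out_dense = dense_attention(q, k, v)
out_sparse = indexer_sparse_attention(q, k, v, index_scores, top_k)
print(out_dense.shape, out_sparse.shape)      # both (8, 16)
```

Running with dense attention simply means every query uses the full score matrix, so the quadratic cost comes back but no keys are ever dropped.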

@Fujikre
I tried quite hard to find a difference in the model's intelligence. I tested a Q4_K_M quantized DeepSeek V3.2 GGUF with the lightning indexer disabled in a regular llama.cpp build by running my lineage-bench logical reasoning benchmark. The result (numbers are mean accuracy for a given difficulty; there are 4 difficulty levels with 40 quizzes each):

| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|----|------------|---------|-----------|------------|-------------|-------------|
| 1 | deepseek/deepseek-v3.2 | 0.988 | 1.000 | 1.000 | 1.000 | 0.950 |
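As a quick sanity check on these numbers, assuming the lineage column is just the plain mean of the four per-difficulty accuracies (40 quizzes per difficulty level), the overall score and the number of failed quizzes work out like this:

```python
# Quick arithmetic check, assuming the "lineage" column is the plain mean of
# the four per-difficulty accuracies, with 40 quizzes per difficulty level.
per_difficulty = {"lineage-8": 1.000, "lineage-64": 1.000,
                  "lineage-128": 1.000, "lineage-192": 0.950}
quizzes_per_level = 40

overall = sum(per_difficulty.values()) / len(per_difficulty)
wrong = sum(round((1 - acc) * quizzes_per_level) for acc in per_difficulty.values())

print(f"overall accuracy: {overall:.3f}")   # 0.988, matches the table
print(f"quizzes failed:   {wrong} of {len(per_difficulty) * quizzes_per_level}")  # 2 of 160
```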

So DeepSeek V3.2 with dense attention correctly solved almost all quizzes (160 overall), getting only 2 of the most difficult ones wrong. When I tested the original model via the API, the result was:

| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|----|------------|---------|-----------|------------|-------------|-------------|
| 1 | deepseek/deepseek-v3.2 | 0.956 | 1.000 | 1.000 | 0.975 | 0.850 |

So it looks like the model with dense attention performed even a bit better than the sparse-attention one.
Or perhaps you meant that the model's intelligence will be affected, but positively?
