Running the model with dense attention

#35
by sszymczyk - opened

Since the model is not yet supported in llama.cpp, I ran an experiment: DeepSeek V3.2 with dense attention, obtained by disabling the lightning indexer. The model seems to perform just fine, at least based on my limited testing. Do you think the model's performance (by which I mean intelligence, not speed) will be affected by doing this?
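To make the idea concrete, here is a minimal single-head PyTorch sketch of the difference: with the indexer, each query attends only to the top-k keys that the indexer scores highest; without it, attention falls back to the full dense softmax over all keys. The function names, shapes, and the random stand-in for the indexer scores are illustrative assumptions, not DeepSeek's or llama.cpp's actual implementation (causal masking is also omitted for brevity).

```python
# Illustrative single-head sketch of "disabling the lightning indexer":
# instead of restricting each query to the top-k keys chosen by an indexer
# score, attend densely over all keys. Shapes and names are assumptions,
# not the real DeepSeek V3.2 / llama.cpp code; causal masking is omitted.
import torch
import torch.nn.functional as F

def dense_attention(q, k, v):
    # q, k, v: (seq_len, d) -- full quadratic attention over all positions
    scores = q @ k.T / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def indexer_sparse_attention(q, k, v, index_scores, top_k):
    # index_scores: (seq_len, seq_len) cheap relevance scores from an indexer;
    # each query attends only to its top_k highest-scoring keys.
    scores = q @ k.T / k.shape[-1] ** 0.5
    topk_idx = index_scores.topk(top_k, dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk_idx, 0.0)          # keep only the selected keys
    return F.softmax(scores + mask, dim=-1) @ v

seq_len, d, top_k = 8, 16, 4
q, k, v = (torch.randn(seq_len, d) for _ in range(3))
index_scores = torch.randn(seq_len, seq_len)  # stand-in for the lightning indexer output

out_dense = dense_attention(q, k, v)
out_sparse = indexer_sparse_attention(q, k, v, index_scores, top_k)
print(out_dense.shape, out_sparse.shape)      # both (8, 16)
```

Running with dense attention simply means every query uses the full score matrix, so the quadratic cost comes back but no keys are ever dropped.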

@Fujikre
I tried quite hard to find a difference in the model's intelligence. I tested a Q4_K_M quantized DeepSeek V3.2 GGUF with the lightning indexer disabled in a regular llama.cpp build by running my lineage-bench logical reasoning benchmark. The result (numbers are mean accuracy for a given difficulty; there are 4 difficulty levels with 40 quizzes each):

| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|----|------------|---------|-----------|------------|-------------|-------------|
| 1 | deepseek/deepseek-v3.2 | 0.988 | 1.000 | 1.000 | 1.000 | 0.950 |
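As a quick sanity check on these numbers, assuming the lineage column is just the plain mean of the four per-difficulty accuracies (40 quizzes per difficulty level), the overall score and the number of failed quizzes work out like this:

```python
# Quick arithmetic check, assuming the "lineage" column is the plain mean of
# the four per-difficulty accuracies, with 40 quizzes per difficulty level.
per_difficulty = {"lineage-8": 1.000, "lineage-64": 1.000,
                  "lineage-128": 1.000, "lineage-192": 0.950}
quizzes_per_level = 40

overall = sum(per_difficulty.values()) / len(per_difficulty)
wrong = sum(round((1 - acc) * quizzes_per_level) for acc in per_difficulty.values())

print(f"overall accuracy: {overall:.3f}")   # 0.988, matches the table
print(f"quizzes failed:   {wrong} of {len(per_difficulty) * quizzes_per_level}")  # 2 of 160
```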

So DeepSeek V3.2 with dense attention correctly solved almost all quizzes (160 overall), getting only 2 of the most difficult ones wrong. When I tested the original model via the API, the result was:

| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|----|------------|---------|-----------|------------|-------------|-------------|
| 1 | deepseek/deepseek-v3.2 | 0.956 | 1.000 | 1.000 | 0.975 | 0.850 |

So it looks like the model with dense attention performed even a bit better than the sparse-attention one.
Or perhaps you meant that the model's intelligence will be affected, but positively?
