Super broken, but I didn't expect much

#1
by rombodawg - opened

I tried this on LM Studio with the latest version. The model does not work correctly: it repeats tokens and starts talking nonsense after a certain number of tokens.

Don't use LM Studio; that launcher looks fancy but can't handle most models, and it especially crashes with very big ones like Kimi K2 at Q6 and so on.
Already tested: it works perfectly fine in oobabooga (aka text-generation-webui), except maybe the Continue function, which distracts the model from the topic, but that has been broken there for a long time; it can be avoided with a large token budget, and the model picks the topic back up with every next question/prompt after the end. https://github.com/oobabooga/text-generation-webui/releases
I want to test in vLLM, but it's torture to install for CPU only.
I've tested the Q8 quant, which uses roughly 110-115 GB of RAM+VRAM; in my case I managed to tune it to 93 GB of RAM with the rest offloaded to a 3090.
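
For anyone tuning that split themselves, here is a rough back-of-envelope sketch in Python; the layer count and VRAM budget are assumptions on my side, so check your model card and leave headroom for context:

```python
# Rough estimate of how many GGUF layers fit in VRAM and how much
# spills into system RAM. All numbers below are illustrative guesses.

model_size_gb = 113.0   # total Q8 GGUF size (the ~110-115 GB mentioned above)
n_layers = 61           # hypothetical layer count; check the model card
vram_budget_gb = 20.0   # usable part of a 24 GB 3090 after context/overhead

layer_size_gb = model_size_gb / n_layers
gpu_layers = int(vram_budget_gb // layer_size_gb)   # value to try for gpu-layers
ram_needed_gb = model_size_gb - gpu_layers * layer_size_gb

print(f"~{layer_size_gb:.2f} GB per layer")
print(f"gpu-layers to try: {gpu_layers}")
print(f"system RAM needed: ~{ram_needed_gb:.0f} GB")
```

With those guesses it comes out to roughly 10 GPU layers and ~94 GB of system RAM, which is in the same ballpark as the 93 GB above; lower gpu-layers if the card runs out of memory once the context fills up.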

Guide to using oobabooga:

  1. Download the latest portable release of text-generation-webui (it's distributed as a single package, like ComfyUI for Windows) and unzip it.
  2. Drop your models into the models folder inside user_data (for GGUF, the file size roughly equals the amount of RAM (RAM+VRAM) needed). Super large models like Kimi K2 or DeepSeek Speciale can obviously only be used from an external drive, so for those the path needs to be written into the CMD_FLAGS.txt file (in the user_data folder), e.g. --model-dir /drive/your/model/folder (see the example CMD_FLAGS.txt after this list).
  3. Launch with start_linux (or the script for your OS) and open it in a web browser (do not use high-RAM-consuming browsers like Chrome).
  4. In the Model section, choose your model, then tune the launch settings:
    4.1 gpu-layers: set this if you want to use GPU+CPU, or put 0 for CPU only.
    4.2 ctx-size: important - the context size of the discussed topic; more = more RAM.
    4.3 cpu-moe and streaming-llm: your choice.
    4.4 The other options are important - threads is the number of your CPU cores, threads_batch is the number of CPU threads.
    4.5 batch_size can be played with later; this number affects how quickly the answer to a prompt is produced.
    4.6 no-mmap and numa can be used by some; as I remember, the former avoids using the storage drive for model space and the latter is for non-uniform memory access systems.
    4.7 Many other settings can be played with.
  5. Click the Load button above and wait for confirmation that the model has loaded into RAM (or RAM+VRAM). It's useful to keep a system resources app open to check the used RAM: with very big models usually all RAM is used, leaving only the minimum for the OS itself, so RAM-eating apps need to be closed if the model does not load or is super slow (which is usually when Linux starts using the SSD as swap space for the model).
    The user_data folder with all models/chats/settings can be migrated into any newer version of oobabooga.
    Oobabooga is also distributed as a Docker container, maybe for corporate environments: http://github.com/ashleykleynhans/text-generation-docker
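
If you prefer to set all of this once instead of clicking through the UI, the same settings named in step 4 can also go into CMD_FLAGS.txt in user_data. A minimal sketch with placeholder values (the exact flag spellings can differ slightly between versions, so check them against the Model tab fields):

```
--model-dir /drive/your/model/folder
--gpu-layers 10
--ctx-size 8192
--threads 16
--threads-batch 32
--batch-size 256
--cpu-moe
--no-mmap
```

On Linux, watching used RAM during step 5 can be as simple as running `watch -n 1 free -h` in a terminal alongside the web UI.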
