gpu_layers is not effective #3479

Open
msameer opened this issue Sep 3, 2024 · 0 comments
Labels: bug (Something isn't working), unconfirmed

Comments

msameer commented Sep 3, 2024

LocalAI version:
localai/localai:v2.20.1-cublas-cuda12-core

Environment, CPU architecture, OS, and Version:
Lenovo Legion laptop, AMD Ryzen 7 5800H CPU, 40 GB RAM, NVIDIA RTX 3060 (mobile) with 6 GB of VRAM

Describe the bug
I am running LocalAI using this container:
docker run -p 8090:8080 --rm --gpus all --name local-ai-llava -e DEBUG=true -e MODELS_PATH=/models -v /home/msameer/local-ai-models:/models -ti localai/localai:v2.20.1-cublas-cuda12-core https://gist.githubusercontent.com/msameer/dec4efaf7b1674fbd5be38d8d2b83484/raw/f4943915546ae4013eb6c0220b9ea35783bc2fbd/llava.yaml
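
For completeness, the same setup could also be run with the YAML copied into the mounted models directory instead of being fetched from the gist at startup (a rough sketch, assuming LocalAI picks up *.yaml model configs from the models path; the local file name is my own choice):

# Download the gist once into the host models directory
curl -L -o /home/msameer/local-ai-models/llava.yaml https://gist.githubusercontent.com/msameer/dec4efaf7b1674fbd5be38d8d2b83484/raw/f4943915546ae4013eb6c0220b9ea35783bc2fbd/llava.yaml

# Then start the container without the gist URL argument
docker run -p 8090:8080 --rm --gpus all --name local-ai-llava -e DEBUG=true -e MODELS_PATH=/models -v /home/msameer/local-ai-models:/models -ti localai/localai:v2.20.1-cublas-cuda12-core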

The content of the yaml gist is as follows:

name: llava-1.6-mistral
context_size: 4096
f16: true
threads: 11
gpu_layers: 32
mmap: true
parameters:
  # Reference any HF model or a local file here
  model: llava-v1.6-mistral-7b.gguf
template:

  chat: &template |
    Instruct: {{.Input}}
    Output:
  # Modify the prompt template here ^^^ as per your requirements
  completion: *template
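
To rule out a stale or different remote file, the exact YAML the container downloads can be checked directly (a minimal sketch using the same gist URL as in the docker command):

# Confirm the gpu_layers value in the file the server fetches
curl -sL https://gist.githubusercontent.com/msameer/dec4efaf7b1674fbd5be38d8d2b83484/raw/f4943915546ae4013eb6c0220b9ea35783bc2fbd/llava.yaml | grep gpu_layers
# expected: gpu_layers: 32 (as shown above)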

When I execute this request, it responds successfully after about 19 seconds:

curl -X POST --location "http://localhost:8090/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
          "model": "llava-1.6-mistral",
          "messages": [
            {
              "role": "user",
              "content": "How many pyramids are there in Giza?"
            }
          ],
          "temperature": 0.7
        }'
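
The ~19 second figure comes from the server's latency log below, but it can also be measured client-side (a small sketch using curl's built-in timing; not something I ran originally):

# Same request, printing only the total round-trip time
curl -s -o /dev/null -w "total: %{time_total}s\n" -X POST "http://localhost:8090/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "llava-1.6-mistral", "messages": [{"role": "user", "content": "How many pyramids are there in Giza?"}], "temperature": 0.7}'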

I see the following in the logs regardless of the gpu_layers value I set in the gist, and the response time is always the same, so it looks like the setting is not taking effect:

8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_tensors: ggml ctx size = 0.27 MiB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_tensors: offloading 4 repeating layers to GPU
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_tensors: offloaded 4/33 layers to GPU
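
One way to double-check this from outside the container is to watch VRAM usage while a request is in flight (a rough sketch; it assumes nvidia-smi is available inside the container via the NVIDIA container toolkit, and uses the container name from the docker run command above):

# Poll GPU memory usage on the 3060 once per second during a request
watch -n 1 "docker exec local-ai-llava nvidia-smi --query-gpu=memory.used,memory.total --format=csv"
# With only 4/33 layers offloaded, the CUDA buffers reported in the log add up to roughly 1 GiB.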

Here are some parts of the debug log, since it is too long to include in full:

@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also #288
@@@@@
CPU info:
model name : AMD Ryzen 7 5800H with Radeon Graphics
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
CPU: AVX found OK
CPU: AVX2 found OK
CPU: no AVX512 found
@@@@@
8:36PM INF env file found, loading environment variables from file envFile=.env
8:36PM DBG Setting logging to debug
8:36PM INF Starting LocalAI using 8 threads, with models path: /models
8:36PM INF LocalAI version: v2.20.1 (a9c521e)
8:36PM DBG CPU capabilities: [3dnowprefetch abm adx aes aperfmperf apic arat avic avx avx2 bmi1 bmi2 bpext cat_l3 cdp_l3 clflush clflushopt clwb clzero cmov cmp_legacy constant_tsc cpb cppc cpuid cqm cqm_llc cqm_mbm_local cqm_mbm_total cqm_occup_llc cr8_legacy cx16 cx8 de debug_swap decodeassists erms extapic extd_apicid f16c flushbyasid fma fpu fsgsbase fsrm fxsr fxsr_opt ht hw_pstate ibpb ibrs ibs invpcid irperf lahf_lm lbrv lm mba mca mce misalignsse mmx mmxext monitor movbe msr mtrr mwaitx nonstop_tsc nopl npt nrip_save nx ospke osvw overflow_recov pae pat pausefilter pclmulqdq pdpe1gb perfctr_core perfctr_llc perfctr_nb pfthreshold pge pku pni popcnt pse pse36 rapl rdpid rdpru rdrand rdseed rdt_a rdtscp rep_good sep sha_ni skinit smap smca smep ssbd sse sse2 sse4_1 sse4_2 sse4a ssse3 stibp succor svm_lock syscall tce topoext tsc tsc_scale umip user_shstk v_spec_ctrl v_vmsave_vmload vaes vgif vmcb_clean vme vmmcall vpclmulqdq wbnoinvd wdt xgetbv1 xsave xsavec xsaveerptr xsaveopt xsaves]
8:36PM DBG GPU count: 2
8:36PM DBG GPU: card #1 @0000:05:00.0 -> driver: 'amdgpu' class: 'Display controller' vendor: 'Advanced Micro Devices, Inc. [AMD/ATI]' product: 'Cezanne'
8:36PM DBG GPU: card #2 @0000:01:00.0 -> driver: 'nvidia' class: 'Display controller' vendor: 'NVIDIA Corporation' product: 'GA106M [GeForce RTX 3060 Mobile / Max-Q]'
8:36PM DBG [startup] downloading https://gist.githubusercontent.com/msameer/dec4efaf7b1674fbd5be38d8d2b83484/raw/f4943915546ae4013eb6c0220b9ea35783bc2fbd/llava.yaml
8:36PM DBG guessDefaultsFromFile: template already set name=llava-1.6-mistral
8:36PM DBG guessDefaultsFromFile: not a GGUF file
8:36PM DBG guessDefaultsFromFile: template already set name=gpt-4
8:36PM DBG guessDefaultsFromFile: not a GGUF file
8:36PM DBG guessDefaultsFromFile: not a GGUF file
8:36PM DBG guessDefaultsFromFile: template already set name=gpt-4-vision-preview
8:36PM DBG guessDefaultsFromFile: not a GGUF file
8:36PM DBG guessDefaultsFromFile: not a GGUF file
8:36PM DBG guessDefaultsFromFile: template already set name=llava-1.6-mistral
8:36PM INF Preloading models from /models
8:36PM DBG Checking "ggml-whisper-base.bin" exists and matches SHA
8:36PM DBG File "/models/ggml-whisper-base.bin" already exists and matches the SHA. Skipping download

Model name: whisper-1
...
Model name: stablediffusion
...

8:36PM DBG Checking "llava-v1.6-mistral-7b.Q5_K_M.gguf" exists and matches SHA
8:36PM DBG File "/models/llava-v1.6-mistral-7b.Q5_K_M.gguf" already exists. Skipping download
8:36PM DBG Checking "llava-v1.6-7b-mmproj-f16.gguf" exists and matches SHA
8:36PM DBG File "/models/llava-v1.6-7b-mmproj-f16.gguf" already exists. Skipping download

Model name: gpt-4-vision-preview
...
Model name: jina-reranker-v1-base-en
...
Model name: tts-1
...
Model name: llava-1.6-mistral

Model name: text-embedding-ada-002
...
Model name: gpt-4
...
8:36PM DBG Model: llava-1.6-mistral (config: {PredictionOptions:{Model:llava-v1.6-mistral-7b.gguf Language: Translate:false N:0 TopP:0xc001083fd8 TopK:0xc001083fe0 Temperature:0xc001083fe8 Maxtokens:0xc000556038 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000556030 TypicalP:0xc000556008 Seed:0xc000556050 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llava-1.6-mistral F16:0xc001083fa0 Threads:0xc001083fa8 Debug:0xc000556048 Roles:map[] Embeddings:0xc000556049 Backend: TemplateConfig:{Chat:Instruct: {{.Input}}
Output:
ChatMessage: Completion:Instruct: {{.Input}}
Output:
Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder: SchemaType:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000556000 MirostatTAU:0xc001083ff8 Mirostat:0xc001083ff0 NGPULayers:0xc001083fb0 MMap:0xc001083fb8 MMlock:0xc000556049 LowVRAM:0xc000556049 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc001083f90 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:})
8:36PM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8080
8:36PM DBG Request received: {"model":"llava-1.6-mistral","language":"","translate":false,"n":0,"top_p":null,"top_k":null,"temperature":0.7,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"repeat_last_n":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":"How many pyramids are there in Giza?"}],"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
8:36PM DBG guessDefaultsFromFile: template already set name=llava-1.6-mistral
8:36PM DBG Configuration read: &{PredictionOptions:{Model:llava-v1.6-mistral-7b.gguf Language: Translate:false N:0 TopP:0xc001083fd8 TopK:0xc001083fe0 Temperature:0xc00052c1d8 Maxtokens:0xc000556038 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000556030 TypicalP:0xc000556008 Seed:0xc000556050 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llava-1.6-mistral F16:0xc001083fa0 Threads:0xc001083fa8 Debug:0xc00052c360 Roles:map[] Embeddings:0xc000556049 Backend: TemplateConfig:{Chat:Instruct: {{.Input}}
Output:
ChatMessage: Completion:Instruct: {{.Input}}
Output:
Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder: SchemaType:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000556000 MirostatTAU:0xc001083ff8 Mirostat:0xc001083ff0 NGPULayers:0xc001083fb0 MMap:0xc001083fb8 MMlock:0xc000556049 LowVRAM:0xc000556049 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc001083f90 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:}
8:36PM DBG Parameters: &{PredictionOptions:{Model:llava-v1.6-mistral-7b.gguf Language: Translate:false N:0 TopP:0xc001083fd8 TopK:0xc001083fe0 Temperature:0xc00052c1d8 Maxtokens:0xc000556038 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000556030 TypicalP:0xc000556008 Seed:0xc000556050 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:llava-1.6-mistral F16:0xc001083fa0 Threads:0xc001083fa8 Debug:0xc00052c360 Roles:map[] Embeddings:0xc000556049 Backend: TemplateConfig:{Chat:Instruct: {{.Input}}
Output:
ChatMessage: Completion:Instruct: {{.Input}}
Output:
Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder: SchemaType:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000556000 MirostatTAU:0xc001083ff8 Mirostat:0xc001083ff0 NGPULayers:0xc001083fb0 MMap:0xc001083fb8 MMlock:0xc000556049 LowVRAM:0xc000556049 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc001083f90 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:}
8:36PM DBG Prompt (before templating): How many pyramids are there in Giza?
8:36PM DBG Template found, input modified to: Instruct: How many pyramids are there in Giza?
Output:

8:36PM DBG Prompt (after templating): Instruct: How many pyramids are there in Giza?
Output:

8:36PM DBG Loading from the following backends (in order): [llama-cpp llama-ggml llama-cpp-fallback piper rwkv stablediffusion whisper huggingface bert-embeddings /build/backend/python/mamba/run.sh /build/backend/python/coqui/run.sh /build/backend/python/transformers/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/rerankers/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/vllm/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/bark/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/exllama/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/diffusers/run.sh]
8:36PM INF Trying to load the model 'llava-v1.6-mistral-7b.gguf' with the backend '[llama-cpp llama-ggml llama-cpp-fallback piper rwkv stablediffusion whisper huggingface bert-embeddings /build/backend/python/mamba/run.sh /build/backend/python/coqui/run.sh /build/backend/python/transformers/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/rerankers/run.sh /build/backend/python/parler-tts/run.sh /build/backend/python/openvoice/run.sh /build/backend/python/autogptq/run.sh /build/backend/python/exllama2/run.sh /build/backend/python/vllm/run.sh /build/backend/python/vall-e-x/run.sh /build/backend/python/bark/run.sh /build/backend/python/sentencetransformers/run.sh /build/backend/python/exllama/run.sh /build/backend/python/transformers-musicgen/run.sh /build/backend/python/diffusers/run.sh]'
8:36PM INF [llama-cpp] Attempting to load
8:36PM INF Loading model 'llava-v1.6-mistral-7b.gguf' with backend llama-cpp
8:36PM DBG Loading model in memory from file: /models/llava-v1.6-mistral-7b.gguf
8:36PM DBG Loading Model llava-v1.6-mistral-7b.gguf with gRPC (file: /models/llava-v1.6-mistral-7b.gguf) (backend: llama-cpp): {backendString:llama-cpp model:llava-v1.6-mistral-7b.gguf threads:11 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0001b6488 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh openvoice:/build/backend/python/openvoice/run.sh parler-tts:/build/backend/python/parler-tts/run.sh rerankers:/build/backend/python/rerankers/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
8:36PM DBG AMD GPU device found, no embedded HIPBLAS variant found. You can ignore this message if you are using container with HIPBLAS support
8:36PM DBG Nvidia GPU device found, no embedded CUDA variant found. You can ignore this message if you are using container with CUDA support
8:36PM INF [llama-cpp] attempting to load with AVX2 variant
8:36PM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp-avx2
8:36PM DBG GRPC Service for llava-v1.6-mistral-7b.gguf will be running at: '127.0.0.1:33657'
8:36PM DBG GRPC Service state dir: /tmp/go-processmanager2301374485
8:36PM DBG GRPC Service Started
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr I0000 00:00:1725395782.629042 35 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache, work_serializer_dispatch
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr I0000 00:00:1725395782.629342 35 ev_epoll1_linux.cc:125] grpc epoll fd: 3
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr I0000 00:00:1725395782.629543 35 server_builder.cc:392] Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr I0000 00:00:1725395782.631104 35 ev_epoll1_linux.cc:359] grpc epoll fd: 4
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr I0000 00:00:1725395782.631443 35 tcp_socket_utils.cc:634] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stdout Server listening on 127.0.0.1:33657
8:36PM DBG GRPC Service Ready
8:36PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:llava-v1.6-mistral-7b.gguf ContextSize:4096 Seed:1197245088 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:4 MainGPU: TensorSplit: Threads:11 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/llava-v1.6-mistral-7b.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false}
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /models/llava-v1.6-mistral-7b.gguf (version GGUF V3 (latest))
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 0: general.architecture str = llama
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 1: general.name str = 1.6
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 2: llama.context_length u32 = 32768
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 4: llama.block_count u32 = 32
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 11: general.file_type u32 = 18
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - kv 23: general.quantization_version u32 = 2
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - type f32: 65 tensors
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_model_loader: - type q6_K: 226 tensors
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_vocab: special tokens cache size = 3
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_vocab: token to piece cache size = 0.1637 MB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: format = GGUF V3 (latest)
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: arch = llama
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: vocab type = SPM
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_vocab = 32000
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_merges = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: vocab_only = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_ctx_train = 32768
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_embd = 4096
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_layer = 32
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_head = 32
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_head_kv = 8
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_rot = 128
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_swa = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_embd_head_k = 128
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_embd_head_v = 128
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_gqa = 4
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_embd_k_gqa = 1024
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_embd_v_gqa = 1024
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: f_norm_eps = 0.0e+00
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: f_logit_scale = 0.0e+00
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_ff = 14336
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_expert = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_expert_used = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: causal attn = 1
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: pooling type = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: rope type = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: rope scaling = linear
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: freq_base_train = 1000000.0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: freq_scale_train = 1
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: n_ctx_orig_yarn = 32768
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: rope_finetuned = unknown
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: ssm_d_conv = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: ssm_d_inner = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: ssm_d_state = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: ssm_dt_rank = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: ssm_dt_b_c_rms = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: model type = 7B
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: model ftype = Q6_K
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: model params = 7.24 B
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: model size = 5.53 GiB (6.56 BPW)
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: general.name = 1.6
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: BOS token = 1 '<s>'
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: EOS token = 2 '</s>'
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: UNK token = 0 '<unk>'
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: PAD token = 0 '<unk>'
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: LF token = 13 '<0x0A>'
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_print_meta: max token length = 48
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr ggml_cuda_init: found 1 CUDA devices:
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_tensors: ggml ctx size = 0.27 MiB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_tensors: offloading 4 repeating layers to GPU
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_tensors: offloaded 4/33 layers to GPU
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_tensors: CPU buffer size = 5666.09 MiB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llm_load_tensors: CUDA0 buffer size = 682.62 MiB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr ...................................................................................................
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: n_ctx = 4096
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: n_batch = 512
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: n_ubatch = 512
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: flash_attn = 0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: freq_base = 1000000.0
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: freq_scale = 1
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_kv_cache_init: CUDA_Host KV buffer size = 448.00 MiB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_kv_cache_init: CUDA0 KV buffer size = 64.00 MiB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: CUDA0 compute buffer size = 309.13 MiB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: CUDA_Host compute buffer size = 16.01 MiB
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: graph nodes = 1030
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr llama_new_context_with_model: graph splits = 312
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stdout {"timestamp":1725395791,"level":"INFO","function":"initialize","line":504,"message":"initializing slots","n_slots":1}
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stdout {"timestamp":1725395791,"level":"INFO","function":"initialize","line":513,"message":"new slot","slot_id":0,"n_ctx_slot":4096}
8:36PM INF [llama-cpp] Loads OK
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr sampling:
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stdout {"timestamp":1725395791,"level":"INFO","function":"launch_slot_with_data","line":886,"message":"slot is processing task","slot_id":0,"task_id":0}
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr repeat_last_n = 0, repeat_penalty = 0.000, frequency_penalty = 0.000, presence_penalty = 0.000
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.700
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stderr mirostat = 2, mirostat_lr = 0.100, mirostat_ent = 5.000
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stdout {"timestamp":1725395791,"level":"INFO","function":"update_slots","line":1787,"message":"kv cache rm [p0, end)","slot_id":0,"task_id":0,"p0":0}
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stdout {"timestamp":1725395802,"level":"INFO","function":"print_timings","line":327,"message":"prompt eval time = 8673.90 ms / 19 tokens ( 456.52 ms per token, 2.19 tokens per second)","slot_id":0,"task_id":0,"t_prompt_processing":8673.901,"num_prompt_tokens_processed":19,"t_token":456.5211052631579,"n_tokens_second":2.19047923189347}
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stdout {"timestamp":1725395802,"level":"INFO","function":"print_timings","line":341,"message":"generation eval time = 1968.19 ms / 12 runs ( 164.02 ms per token, 6.10 tokens per second)","slot_id":0,"task_id":0,"t_token_generation":1968.194,"n_decoded":12,"t_token":164.01616666666666,"n_tokens_second":6.0969599541508614}
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stdout {"timestamp":1725395802,"level":"INFO","function":"print_timings","line":351,"message":" total time = 10642.09 ms","slot_id":0,"task_id":0,"t_prompt_processing":8673.901,"t_token_generation":1968.194,"t_total":10642.095}
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stdout {"timestamp":1725395802,"level":"INFO","function":"update_slots","line":1598,"message":"slot released","slot_id":0,"task_id":0,"n_ctx":4096,"n_past":30,"n_system_tokens":0,"n_cache_tokens":31,"truncated":false}
8:36PM DBG GRPC(llava-v1.6-mistral-7b.gguf-127.0.0.1:33657): stdout {"timestamp":1725395802,"level":"INFO","function":"update_slots","line":1551,"message":"all slots are idle and system prompt is empty, clear the KV cache"}
8:36PM DBG Response: {"created":1725395782,"object":"chat.completion","id":"94714ed1-5650-4d3e-9849-1caef2f3de5e","model":"llava-1.6-mistral","choices":[{"index":0,"finish_reason":"stop","message":{"role":"assistant","content":"There are three pyramids in Giza. \u003c/s\u003e"}}],"usage":{"prompt_tokens":19,"completion_tokens":12,"total_tokens":31}}
8:36PM INF Success ip=172.17.0.1 latency=19.54115392s method=POST status=200 url=/v1/chat/completions
8:37PM INF Success ip=127.0.0.1 latency="44.634µs" method=GET status=200 url=/readyz
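
Note that the gRPC load options in the debug log above report NGPULayers:4 even though the YAML sets gpu_layers: 32, which matches the "offloaded 4/33 layers" message. The effective value can be pulled out of the running container's log like this (a minimal sketch):

# Show the effective NGPULayers value passed to the llama-cpp backend
docker logs local-ai-llava 2>&1 | grep -o "NGPULayers:[0-9]*"
# observed: NGPULayers:4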

msameer added the bug (Something isn't working) and unconfirmed labels on Sep 3, 2024