Only use 4 CPU threads in P2P worker cluster #3410

Open
titogrima opened this issue Aug 26, 2024 · 6 comments
Labels: bug (Something isn't working), unconfirmed

Comments

@titogrima

LocalAI version:
v2.20.1-ffmpeg-core docker image for the two workers and latest-aio-cpu for the master

Environment, CPU architecture, OS, and Version:
P2P cluster lab in Docker machines with heterogeneous CPUs (AMD64 and ARM)
Linux clusteria1 6.6.45-0-virt #1-Alpine SMP PREEMPT_DYNAMIC 2024-08-13 08:10:32 aarch64 Linux
8 CPUs, 7 GB RAM, runs one worker
Linux ia 6.1.0-23-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.99-1 (2024-07-15) x86_64 GNU/Linux
12 CPUs, 10 GB RAM, runs the master and one worker

Describe the bug
P2P worker mode works fine, but each node always uses only 4 CPU threads for inference.
I tried LOCALAI_THREADS=12 in the env file and --threads 12 on the 12-CPU node, and LOCALAI_THREADS=7 and --threads 7 on the 8-CPU node.
I also tried the THREADS variable in the env file.
If I run only the master without workers, the THREADS variable works without any problem.

To Reproduce
Launch a P2P worker cluster and set the number of threads to something other than 4.

Expected behavior
Each node uses the number of threads defined.

Logs
Logs from one worker

create_backend: using CPU backend
Starting RPC server on 127.0.0.1:37885, backend memory: 9936 MB
^C@@@@@
Skipping rebuild
@@@@@
If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
CMAKE_ARGS="-DGGML_F16C=OFF -DGGML_AVX512=OFF -DGGML_AVX2=OFF -DGGML_FMA=OFF"
see the documentation at: https://localai.io/basics/build/index.html
Note: See also #288
@@@@@
CPU info:
model name : AMD Ryzen 9 5900X 12-Core Processor
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt lbrv nrip_save tsc_scale vmcb_clean flushbyasid pausefilter pfthreshold v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid fsrm arch_capabilities
CPU: AVX found OK
CPU: AVX2 found OK
CPU: no AVX512 found
@@@@@
9:22PM INF env file found, loading environment variables from file envFile=.env
9:22PM DBG Setting logging to debug
9:22PM DBG Extracting backend assets files to /tmp/localai/backend_data
{"level":"INFO","time":"2024-08-26T21:22:10.977Z","caller":"config/config.go:288","message":"connmanager disabled\n"}
{"level":"INFO","time":"2024-08-26T21:22:10.977Z","caller":"config/config.go:292","message":" go-libp2p resource manager protection disabled"}
9:22PM INF Starting llama-cpp-rpc-server on '127.0.0.1:34015'
{"level":"INFO","time":"2024-08-26T21:22:10.978Z","caller":"node/node.go:118","message":" Starting EdgeVPN network"}
create_backend: using CPU backend
Starting RPC server on 127.0.0.1:34015, backend memory: 9936 MB
2024/08/26 21:22:10 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 7168 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details.
{"level":"INFO","time":"2024-08-26T21:22:10.987Z","caller":"node/node.go:172","message":" Node ID: 12D3KooWFvq7aNHpre5tyQDZN9Gn2tZh84E3Vf9tfBuCmB5ULJSB"}
{"level":"INFO","time":"2024-08-26T21:22:10.987Z","caller":"node/node.go:173","message":" Node Addresses: [/ip4/127.0.0.1/tcp/41065 /ip4/127.0.0.1/udp/43346/quic-v1/webtransport/certhash/uEiA46crpiIhxfL7skSKai7WxlGHkv8mZNXzAYoogm_qhow/certhash/uEiCt91_kaygLCTKWpqX6PEOTzb617BIH7KHDTRrw_eyurw /ip4/127.0.0.1/udp/47629/webrtc-direct/certhash/uEiDbmMPnLfeQJBvFRcfp-zDNXx-_CjljBg0ia3Nr20Xs7g /ip4/127.0.0.1/udp/59911/quic-v1 /ip4/192.168.XX.XX/tcp/41065 /ip4/192.168.XX.XX/udp/43346/quic-v1/webtransport/certhash/uEiA46crpiIhxfL7skSKai7WxlGHkv8mZNXzAYoogm_qhow/certhash/uEiCt91_kaygLCTKWpqX6PEOTzb617BIH7KHDTRrw_eyurw /ip4/192.168.XX.XX/udp/47629/webrtc-direct/certhash/uEiDbmMPnLfeQJBvFRcfp-zDNXx-_CjljBg0ia3Nr20Xs7g /ip4/192.168.XX.XX/udp/59911/quic-v1 /ip6/::1/tcp/33785 /ip6/::1/udp/46892/webrtc-direct/certhash/uEiDbmMPnLfeQJBvFRcfp-zDNXx-_CjljBg0ia3Nr20Xs7g /ip6/::1/udp/49565/quic-v1/webtransport/certhash/uEiA46crpiIhxfL7skSKai7WxlGHkv8mZNXzAYoogm_qhow/certhash/uEiCt91_kaygLCTKWpqX6PEOTzb617BIH7KHDTRrw_eyurw /ip6/::1/udp/59078/quic-v1 /ip6/fda7:761c:127e:4::26/tcp/33785 /ip6/fda7:761c:127e:4::26/udp/46892/webrtc-direct/certhash/uEiDbmMPnLfeQJBvFRcfp-zDNXx-_CjljBg0ia3Nr20Xs7g /ip6/fda7:761c:XXXX:XX::XX/udp/49565/quic-v1/webtransport/certhash/uEiA46crpiIhxfL7skSKai7WxlGHkv8mZNXzAYoogm_qhow/certhash/uEiCt91_kaygLCTKWpqX6PEOTzb617BIH7KHDTRrw_eyurw /ip6/fda7:761c:XXX:XX::XX/udp/59078/quic-v1]"}
{"level":"INFO","time":"2024-08-26T21:22:10.987Z","caller":"discovery/dht.go:104","message":" Bootstrapping DHT"}
Accepted client connection, free_mem=10418868224, total_mem=10418868224
Client connection closed
Accepted client connection, free_mem=10418868224, total_mem=10418868224
Client connection closed
Accepted client connection, free_mem=10418868224, total_mem=10418868224
Client connection closed

Additional context

@titogrima added the bug and unconfirmed labels Aug 26, 2024
Repository owner deleted a comment Aug 27, 2024
@mudler (Owner) commented Aug 27, 2024

Hey @titogrima - LocalAI doesn't set any thread count when running in p2p mode. This sounds more like a bug in llama.cpp, as we just run the vanilla RPC service from the llama.cpp project. Did you check whether there are related bugs upstream?

@titogrima (Author) commented:

Hi!

I checked the llama.cpp repo (https://github.com/ggerganov/llama.cpp) but I don't see any issue about this problem. If LocalAI doesn't set any thread count in p2p mode, maybe it's better to open an issue in the llama.cpp repo.
I'm going to investigate this further, but it's helpful to know that LocalAI doesn't set threads in p2p mode; maybe I can set the threads directly in llama.cpp.

Thanks, and sorry for my English XD!!

@mudler (Owner) commented Aug 27, 2024

It might also be worth noting that you can pass any llama.cpp command-line options from LocalAI with --llama-cpp-args or LOCALAI_EXTRA_LLAMA_CPP_ARGS. From the --help output:

./local-ai worker p2p-llama-cpp-rpc --help                        
Usage: local-ai worker p2p-llama-cpp-rpc [flags]

Starts a LocalAI llama.cpp worker in P2P mode (requires a token)

Flags:
  -h, --help                     Show context-sensitive help.
      --log-level=LOG-LEVEL      Set the level of logs to output [error,warn,info,debug,trace]
                                 ($LOCALAI_LOG_LEVEL)

      --token=STRING             P2P token to use ($LOCALAI_TOKEN, $LOCALAI_P2P_TOKEN, $TOKEN)
      --no-runner                Do not start the llama-cpp-rpc-server ($LOCALAI_NO_RUNNER, $NO_RUNNER)
      --runner-address=STRING    Address of the llama-cpp-rpc-server ($LOCALAI_RUNNER_ADDRESS,
                                 $RUNNER_ADDRESS)
      --runner-port=STRING       Port of the llama-cpp-rpc-server ($LOCALAI_RUNNER_PORT, $RUNNER_PORT)
      --llama-cpp-args=STRING    Extra arguments to pass to llama-cpp-rpc-server
                                 ($LOCALAI_EXTRA_LLAMA_CPP_ARGS, $EXTRA_LLAMA_CPP_ARGS)

@titogrima (Author) commented:

Hi!

I tried LOCALAI_EXTRA_LLAMA_CPP_ARGS=--threads 7
(see https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md#number-of-threads)
but llama-cpp-rpc-server only supports these arguments:

11:12AM INF Starting llama-cpp-rpc-server on '127.0.0.1:35291'
error: unknown argument: --threads 7
Usage: /tmp/localai/backend_data/backend-assets/util/llama-cpp-rpc-server [options]

options:
-h, --help show this help message and exit
-H HOST, --host HOST host to bind to (default: 127.0.0.1)
-p PORT, --port PORT port to bind to (default: 35291)
-m MEM, --mem MEM backend memory size (in MB)

So it fails to boot when the threads option is passed.

@titogrima (Author) commented:

Well,

Reading the llama.cpp code, the "problem" is in the rpc-server code:
https://github.com/ggerganov/llama.cpp/blob/master/examples/rpc/rpc-server.cpp
When the CPU backend is initialized (line 87), it calls the ggml_backend_cpu_init() function in the ggml code:
https://github.com/ggerganov/llama.cpp/blob/20f1789dfb4e535d64ba2f523c64929e7891f428/ggml/src/ggml-backend.c#L869
(line 869), and this function uses the GGML_DEFAULT_N_THREADS constant, which is set to 4 in the ggml header file:
https://github.com/ggerganov/llama.cpp/blob/20f1789dfb4e535d64ba2f523c64929e7891f428/ggml/include/ggml.h#L236
(line 236).
Maybe I can recompile it with GGML_DEFAULT_N_THREADS changed, or something similar.
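
A minimal sketch of the kind of override this points at, assuming rpc-server.cpp's create_backend() is the right hook and that ggml exposes ggml_backend_cpu_set_n_threads() (both taken from the linked sources); the RPC_N_THREADS environment variable below is hypothetical, not an existing option:

// Hypothetical patch to examples/rpc/rpc-server.cpp (create_backend); a sketch, not the upstream code.
#include <cstdio>
#include <cstdlib>
#include "ggml-backend.h"

static ggml_backend_t create_backend() {
    // ggml_backend_cpu_init() sizes its threadpool with GGML_DEFAULT_N_THREADS (4).
    ggml_backend_t backend = ggml_backend_cpu_init();
    if (backend == nullptr) {
        return nullptr;
    }
    // Override the default from a (hypothetical) RPC_N_THREADS environment variable,
    // so the value does not have to be changed at compile time.
    if (const char * env = std::getenv("RPC_N_THREADS")) {
        int n_threads = std::atoi(env);
        if (n_threads > 0) {
            ggml_backend_cpu_set_n_threads(backend, n_threads);
            fprintf(stderr, "rpc-server: using %d CPU threads\n", n_threads);
        }
    }
    return backend;
}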

Thanks for your help!!

@titogrima
Copy link
Author

I recompiled ggml with the GGML_DEFAULT_N_THREADS variable changed in /build/backend/cpp/llama/llama.cpp/ggml/include/ggml.h, and it works.
Obviously it is not the best solution, but it works....
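
For reference, the change amounts to editing the default in that header before rebuilding; the stock value is 4, and the 12 below is just an example matching the 12-CPU node, not a recommended setting:

// /build/backend/cpp/llama/llama.cpp/ggml/include/ggml.h, around line 236
// stock definition:
//   #define GGML_DEFAULT_N_THREADS 4
// changed locally so the RPC worker's CPU backend starts with more threads:
#define GGML_DEFAULT_N_THREADS 12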

Regards!
