I'm setting up a Frankenstein system at the moment. It's a Chinese DDR3 X99 motherboard with a 12-core Xeon v3, 32GB of 1866MT/s RAM, and a 1080 Ti.
I'm shoehorning it back into the Optiplex that donated the RAM, so it's not ready to go yet, but when I had it running on top of the motherboard box as a test I ran the (9B?) gemma4:e4b-it-q4_K_M since it fits entirely in the 11GB of VRAM. It flew, more than 50 tk/s. A model that small isn't useful for coding, but there could be other uses. I'd love to figure out a Wake-on-Use setup and use it as my personal ChatGPT. I'm not sure how that would work... Maybe proxy the LLM through a Pi with a script to Wake-on-LAN the PC? It'll be a fun weekend project someday.
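For what it's worth, the Pi side of that idea could look roughly like the sketch below. It only covers the "wake it up and wait until it answers" half, not the request forwarding. The magic packet format (six 0xFF bytes followed by the MAC repeated 16 times) is standard Wake-on-LAN; everything else is an assumption on my part: the function names are made up, UDP port 9 is just a common convention, and port 11434 assumes the model is served by Ollama on its default port.

```python
import socket
import time


def build_magic_packet(mac: str) -> bytes:
    """Wake-on-LAN payload: six 0xFF bytes, then the target MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a convention, not a requirement)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def ensure_awake(mac: str, host: str, port: int = 11434,
                 max_wait: float = 120.0, poll: float = 3.0) -> bool:
    """Wake the PC if its LLM port is closed, then poll until it opens or we give up.

    The default port assumes an Ollama server; change it for anything else.
    """
    if port_is_open(host, port):
        return True  # already up, nothing to do
    send_magic_packet(mac)
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if port_is_open(host, port):
            return True
        time.sleep(poll)
    return False
```

A proxy on the Pi would call something like `ensure_awake("AA:BB:CC:DD:EE:FF", "192.168.1.50")` (placeholder MAC and IP) before forwarding each request. Wake-on-LAN also has to be enabled in the PC's BIOS and on its network adapter for any of this to work.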
My always-on LLM is the dense Gemma4:31b, which sits not quite half in GPU on a 12GB 2060. It's really slow, but the quality is great and my use case is an automated queue, so I'm not sitting there watching the output. I have another 2060, but unfortunately the PC won't POST with both installed for some reason.
If you have an OpenWrt router this is very easy to do. I have a script on my main working machine that SSHes into the OpenWrt router and turns on the server, and this works well.
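A minimal sketch of that kind of script, in case it helps. The actual script isn't shown here, so this is a guess at the shape: it assumes the `etherwake` package is installed on the router, that key-based SSH login as `root` works, and that `br-lan` is the LAN bridge interface. The function names, router address, and MAC are hypothetical.

```python
import subprocess


def build_wake_command(router: str, mac: str, iface: str = "br-lan",
                       user: str = "root") -> list[str]:
    """Build the ssh invocation that runs etherwake on the router.

    Assumes etherwake is installed on the router and iface is its LAN interface.
    """
    return ["ssh", f"{user}@{router}", "etherwake", "-i", iface, mac]


def wake_via_router(router: str, mac: str, iface: str = "br-lan") -> None:
    """Run the command; raises CalledProcessError if ssh or etherwake fails."""
    subprocess.run(build_wake_command(router, mac, iface), check=True, timeout=15)
```

Calling `wake_via_router("192.168.1.1", "AA:BB:CC:DD:EE:FF")` from the working machine would then send the magic packet from the router's LAN side, which sidesteps broadcast problems when the client is on Wi-Fi or another subnet.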