Selfhost an LLM

Shimitar@downonthestreet.eu · 1 month ago

Selfhost an LLM

hendrik@palaver.p3x.de · 1 month ago

There’s another community for this: [email protected]
Though we mostly discuss the news and specific questions there, beginner questions are a bit more rare.

I think you already got a lot of good answers here, LMStudio, OpenWebUI, LocalAI…
I’d like to add KoboldCpp that’s kind of made for gaming/dialogue, but it can do everything. And from my experience it’s very easy to set up and bundles everything into one program.

badcommandorfilename@lemmy.world · 1 month ago

I just started using https://lemonade-server.ai/

It has so far been pretty effortless and would be good if you are new to selfhosting

sakphul@discuss.tchncs.de · 1 month ago

What’s your experience compared to ollama+openwebui?

vane@lemmy.world · 1 month ago

You can host ollama and open-webui on container. If you want to wire search you can connect open-webui to playwright (also container) and searxng (also container) and llm will search the web for answers

splendoruranium@infosec.pub · 1 month ago

I read about OLLAMA, but it’s all unclear to me.

There’s really nothing more to it than the initial instructions tell you. Literally just a “curl -fsSL https://ollama.com/install.sh | sh”. Then you’re just a “ollama run qwen3:14b” away from having a chat with the model in your terminal.
That’s the “chat with it”-part done.

After that you can make it more involved by serving the model via API, manually adding .gguf quantizations (usually smaller or special-purpose modified bootleg versions of big published models) to your Ollama library with a modelcard, ditching Ollama altogether for a different environment or, the big upgrade, giving your chats a shiny frontend in the form of Open-WebUI.

immobile7801@piefed.social · 1 month ago

If you like videos, I’d highly recommend techno Tims video on how to do this. Its what I used when building mine. Link

billwashere@lemmy.world · 1 month ago

100% agree. TechnoTim is quite good. Also take a look at NetworkChuck. But be aware, these two will send you down rabbit holes of self-hosting ideas. Awesome rabbit holes, but rabbit holes nonetheless. I’ve spent weeks playing with stuff they’ve suggested. N8n and MCP is my latest obsession.

Evotech@lemmy.world · edit-2 30 days ago

N8n and ollama

You can create workflows using your own hosted models so you can have agents you can chat with where you want them, telegram, or discord or whatever. And enable tools etc.

Open webui is an alternative for the front end if you want a simpler approach

Mike Wooskey@lemmy.thewooskeys.com · edit-2 1 month ago

Sounds like you already know what you need to know to host Ollama in a Docker container. Ollama is an LLM “engine” - you can interact with LLM models via a CLI or you can integrate them into other services via an API.

To have a web page chat like ChatGPT or others, I installed OpenWebU. I love it! A friend of mine likes LMStudio, which i think is a desktop app, but I don’t know anything about it.

ikt@aussie.zone · 1 month ago

+1 LM Studio, so easy to use and so powerful

iii@mander.xyz · 1 month ago

One of these projects might be of interest to you:

Do note that CPU inference is quite a lot slower than GPU or the well known SAAS providers. I currently like the quantized deepseek models as the best balance between quality of replies and inference time when not using GPU.

ProperlyProperTea@lemmy.ml · 1 month ago

Indeed, other than being able to get the model running, having decent hardware is the next most important part.

3060 12gb is probably cheapest card to get, 3090 or other 24gb card if you can get it

three@lemmy.world · 1 month ago

LM Studio is a beast.

edit-2 1 month ago

Openwebui is awesome and allows u to use it as an api for all the models u have it hooked up to. Can point it at ollama or any openai api compatible endpoint (like open routers)

thr0w4w4y2@sh.itjust.works · 1 month ago

there’s a good tutorial to host ollama and a vector database here

ragingHungryPanda@piefed.keyboardvagabond.com · 1 month ago

i had to do a particular command to get the AMD GPU properly available in docker. i can’t find that if you need

eleitl@lemmy.zip · 1 month ago

Is Radeon V with 8 GB HBM worth using today?

ragingHungryPanda@piefed.keyboardvagabond.com · 1 month ago

not for LLMs. I have a 16GB and even what I can fit in there just isn’t really enough to be useful. It can still do things and quickly enough, but I can’t fit models that large enough to be useful.

I also don’t know if your GPU is compatible with ROCM or not.

eleitl@lemmy.zip · 29 days ago

The GPU used to but they dropped ROCm support for Radeon V and VII some time ago. Have to look at that Strix Halo/AI Max thing I guess.

ikidd@lemmy.world · 1 month ago

OpenWebUI is pretty much exactly what you’re looking for. It can start up an ollama instance that you can use for your other applications over the network, and chat with it as you see fit. If you have an API key from an outside subscription like OpenRouter or Anthropic, you can enter it and use the models avaialable there if the local ones you’ve downloaded aren’t up to the task.