Sorry for being off topic, but why can't I open this without being logged into GitHub? I thought gists are either completely private or publicly accessible. Are they no longer publicly accessible?
In case anyone's wondering, I tried it again and it worked this time, even without logging in. Maybe because this was my first visit to GitHub in a new country (I'm currently on vacation), I triggered some sort of anti-scraping measure or something.
Weird that the steps are for "Gemma 4 12b", which does not exist, and then switches to 26b midway through.
There's also a step to verify that it doesn't fit on the GPU with ollama ps showing "14%/86% CPU/GPU". Doesn't this mean you'll have really bad performance?
M5 air here with 32gb ram and 10/10 cores. Anyone got some luck with mlx builds on oMLX so far? Not at my machine right now and would love to know if these models already work including tool calling
The latest release v0.3.2 has partial support, generation is supported but not all special tokens are handled. I've done some personal testing to add tool calling and <|channel> thinking support. https://github.com/Yukon/omlx
awesome man, can't wait! And just now checked it out and indeed 0.3.2 does already work for baseline chatting with mlx versions of Gemma 4 … downloading and comparing different variants right now!
Yes, you can use it for local coding. Most harnesses can be pointed at a local endpoint which provides an OpenAI compatible API, though I've had some trouble using recent versions of Codex with llama.cpp due to an API incompatibility (Codex uses the newer "responses" API, but in a way that llama.cpp hasn't fully supported).
I personally prefer Pi as I like the fact that it's minimalist and extensible. But some people just use Claude Code, some OpenCode, there are a ton of options out there and most of them can be used with local models.
It needs to support tool calling, and many of the quantized GGUFs don't, so you have to check.
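For anyone unsure what "pointed at a local endpoint which provides an OpenAI compatible API" means in practice: it's just plain HTTP to a `/v1/chat/completions` route. A minimal sketch with only the standard library (the port and model name are placeholders for whatever your llama-server instance is actually serving):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a chat-completion request for any OpenAI-compatible server."""
    payload = {
        "model": model,  # llama.cpp often ignores this; harnesses send it anyway
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# Usage (with llama-server running on its default port 8080):
#   req = chat_request("http://localhost:8080", "local-model", "Say hello")
#   reply = json.loads(urllib.request.urlopen(req).read())
#   print(reply["choices"][0]["message"]["content"])
```

Any harness that speaks this protocol can be repointed by changing the base URL, which is why the same frontend works against llama.cpp, Ollama, LM Studio, or a hosted provider.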
I've got a workaround for that called petsitter: it sits as a proxy between the harness and the inference engine and emulates additional capabilities through clever prompt engineering and various algorithms.
They're abstractly called "tricks" and you can stack them as you please.
Why is ollama so many people's go-to? Genuinely curious, I've tried it but it feels overly stripped down / dumbed down vs nearly everything else I've used.
Lately I've been playing with Unsloth Studio and think that's probably a much better "give it to a beginner" default.
Ollama is good enough to dabble with, and getting a model is as easy as `ollama pull <model name>`, vs figuring it out by yourself on Hugging Face and trying to make sense of all the goofy letters and numbers across the forty different names of models, and not needing a Hugging Face account to download.
So you start there and eventually you want to get off the happy path, then you need to learn more about the server and it's all so much more complicated than just using ollama. You just want to try models, not learn the intricacies of hosting LLMs.
To be fair, llama.cpp has gotten much easier to use lately with `llama-server -hf <model name>`. That said, the need to compile it yourself is still a pretty big barrier for most people.
I started with ollama and now I'm using llama.cpp/llama-server's Router Mode that allows you to manage multiple models through a single server instance.
One thing I haven't figured out: Subjectively, it feels like ollama's model loading was nearly instant, while I feel like I'm always waiting for llama.cpp to load models, but that doesn't make sense because it's ultimately the same software. Maybe I should try ollama again to convince myself that I'm not crazy and that ollama's model loading wasn't actually instant.
Ollama got some first-mover advantage at the time when actually building and git-pulling llama.cpp was a bit of a moat. The devs' Docker past probably made them overestimate how much mindshare they could lay claim to. However, no one really could have known how quickly things would evolve... Now I mostly recommend LM Studio to people.
LM Studio has been around longer. I've used it for three years. I'd also agree it was, and remains, generally the better beginner choice.
Unsloth Studio is more featureful (well integrated tool calling, web search, and code execution being headline features), and comes from the people consistently making some of the best GGUF quants of all popular models. It also is well documented, easy to setup, and also has good fine-tuning support.
I run Little Snitch[1] on my Mac, and I haven't seen LM Studio make any calls that I feel like it shouldn't be making.
Point it to a local models folder, and you can firewall the entire app if you feel like it.
Digressing, but the issue with open source software is that most OSS projects don't understand UX. UX requires a strong hand and opinionated decision-making about whether or not something belongs front-and-center, and that's something developers struggle with. The only counterexample I can think of is Blender, and it's a rare exception, sadly not the norm.
LM Studio manages the backend well, hides its complexities and serves as a good front-end for downloading/managing models. Since I download the models to a shared common location, if I don't want to deal with the LM Studio UX I can easily use the downloaded models with direct llama.cpp, llama-swap and mlx_lm calls.
Ollama's org had people flood various LLM/programming related Reddits and Discords and elsewhere, claiming it was an 'easy frontend for llama.cpp', and tricked people.
Only way to win is to uninstall it and switch to llama.cpp.
Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.
And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?
> Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.
Used to be an Ollama user. Everything you cite as benefits of Ollama is what I was drawn to in the first place as well; then I moved on to using llama.cpp directly. Ethics aside, the issue is that they try to abstract away a bit too much, especially when LLM quality is highly affected by a bunch of parameters. You can't even tell what quant you're downloading. Can you tell at a glance what size of model was downloaded? Whether it's optimized for your arch? What quant it is?
`ollama pull gemma4`
(Yes, I know you can add parameters etc. but the point stands because this is sold as noob-friendly. If you are going to be adding cli params to tweak this, then just do the same with llama.cpp?)
That became a big issue when DeepSeek R1 came out, because everyone and their mother was making TikToks saying you could run the full-fat model, without explaining that it was a distill, which Ollama had abstracted away. Running `ollama run deepseek-r1` means nothing when the quality ranges from useless to super good.
> And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?
I'd go so far as to say I can *GUARANTEE* you're missing out on performance if you are using Ollama, no matter the size of your GPU VRAM. You can get a significant improvement just by running the underlying llama.cpp directly.
Secondly, it's chock full of dark patterns (like the ones above) and anti-open source behavior. For some examples:
1. It mangles GGUF files so other apps can't use them, and you can't access them either without a bunch of work on your end (had to script a way to unmangle these long sha-hashed file names)
2. Ollama conveniently fails to contribute improvements back to the original codebase (technically they don't have to, thanks to MIT); they didn't bother assisting llama.cpp in developing multimodal capabilities or features such as iSWA.
3. What innovations they do ship are mostly piggybacking off llama.cpp, passed off as their own without contributing back upstream. When new models come out they post "WIP" publicly while twiddling their thumbs waiting for llama.cpp to do the actual work.
It operates in this weird "middle layer" where it is kind of user friendly but it's not as user friendly as LM Studio.
After all this, I just couldn't continue using it. If the benefits it provides you are good, then by all means continue.
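On the sha-hashed file names from point 1: Ollama stores the actual GGUF weights as content-addressed blobs and records which blob is which in a small JSON manifest, so you can recover a usable file without copying anything. A sketch of the idea (the manifest layout shown is what current Ollama versions use under `~/.ollama/models`, but it's an internal format and could change):

```python
import json
from pathlib import Path

def model_blob_path(manifest_text: str, blobs_dir: Path) -> Path:
    """Given an Ollama manifest JSON, return the path of the GGUF weights blob.

    Blobs live under <blobs_dir> with names like 'sha256-<hash>'; the manifest's
    layer whose mediaType ends in 'image.model' is the weights file itself.
    """
    manifest = json.loads(manifest_text)
    for layer in manifest["layers"]:
        if layer["mediaType"].endswith("image.model"):
            return blobs_dir / layer["digest"].replace(":", "-")
    raise ValueError("no model layer found in manifest")

# Usage sketch: symlink the blob to a readable .gguf name so llama.cpp
# (or anything else) can load it directly without re-downloading:
#   root = Path.home() / ".ollama" / "models"
#   manifest = (root / "manifests" / "registry.ollama.ai" / "library"
#               / "gemma4" / "latest").read_text()
#   Path("gemma4.gguf").symlink_to(model_blob_path(manifest, root / "blobs"))
```

The manifest path in the usage comment is illustrative; the exact directory tree depends on the registry and tag you pulled from.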
IMO just finding the optimal parameters for a model and aliasing them in your CLI is a much better experience, especially now that llama.cpp has llama-server, a nice web UI and hot reloading built in.
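Concretely, the aliasing idea is just a couple of lines in your shell rc. The model repos, context sizes and sampling parameters below are made-up examples, not recommendations; the flags (`-hf`, `-c`, `--temp`, `--jinja`) are real llama-server options:

```shell
# Illustrative ~/.bashrc aliases: one tuned launch command per model.
# Repo names and parameter values here are placeholders for your own picks.
alias glm-flash='llama-server -hf some-author/GLM-4.7-Flash-GGUF -c 32768 --jinja'
alias qwen-coder='llama-server -hf some-author/Qwen3.5-Coder-GGUF -c 16384 --temp 0.7 --jinja'
```

Then `glm-flash` starts the server with your known-good settings every time, which sidesteps the bad-defaults problem entirely.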
Ollama has had bad defaults forever (stuck on a default context size of 2048 for like 2 years) and they're typically late to support the latest models vs llama.cpp. Absolutely no reason to use it in 2026.
There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.
Ollama is slower, and they started out as a shameless llama.cpp ripoff without giving credit; now they've "ported" it to Go, which means they're just vibe-code-translating llama.cpp, bugs included.
I really like LM Studio when I can use it under Windows, but for people like me with Intel Macs + AMD GPUs, ollama is the only option because it can leverage the GPU using MoltenVK (aka Vulkan), unofficially. We're still testing it, hoping to get the Vulkan support in the main branch soon. It works perfectly for single GPUs, but some edge cases when using multiple GPUs are unsupported until upstream support from MoltenVK comes through. But yeah, I agree, it wasn't cool to repackage Georgi's work like that.
Yes, they introduced that Golang rewrite precisely to support the visual pipeline and other things that weren't in llama.cpp at the time. But then llama.cpp usually catches up and Ollama is just left stranded with something that's not fully competitive. Right now it seems to have messed up mmap support which stops it from properly streaming model weights from storage when doing inference on CPU with limited RAM, even as faster PCIe 5.0 SSDs are finally making this more practical.
The project is just a bit underwhelming overall, it would be way better if they just focused on polishing good UX and fine-tuning, starting from a reasonably up-to-date version of what llama.cpp provides already.
Do y'all mean backend or the Ollama frontend or both? I find it trivially easy to sub in my local Ollama api thing in virtually all of the interesting frontend things. I'm quite curious about the "why not Ollama" here.
I don't think it does, but llama.cpp does, and can load models off HuggingFace directly (so, not limited to ollama's unofficial model mirror like ollama is).
In some places in the source code they claim sole ownership of the code, when it is highly derivative of llama.cpp (having started its life as a llama.cpp frontend). They do keep the same MIT license, however.
There is no reason to use Ollama as an alternative to llama.cpp, just use the real thing instead.
If it's MIT code derived from MIT code, in what way is its openness "quasi"? Issues of attribution and crediting diminish the karma of the derived project, but I don't see how they diminish the level of openness.
I've benchmarked this on an actual Mac Mini M4 with 24 GB of RAM, and averaged 24.4 t/s on Ollama and 19.45 t/s on LM Studio for the same ~10 GB model (gemma4:e4b), a difference which was repeated across three runs and with both models warmed up beforehand. Unless there is an error in my methodology, which is easy to repeat[1], it means Ollama is a full 25% faster. That's an enormous difference. Try it for yourself before making such claims.
[1] script at: https://pastebin.com/EwcRqLUm but it warms up both and keeps them in memory, so you'll want to close almost all other applications first. Install both Ollama and LM Studio, download the models, and change the path to where you installed the model. Interestingly, I had to go through three different AIs to write this script: ChatGPT (on which I'm a Pro subscriber) thought about doing so then returned nothing (shenanigans since I was benchmarking a competitor?), I had run out of my weekly session limit on Pro Max 20x credits on Claude (wonder why I need a local coding agent!), and then Google rose to the challenge and wrote the benchmark for me. I didn't try writing a benchmark like this locally; I'll try that next and report back.
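For anyone who wants to repeat this kind of comparison without an AI-written script, the core of a tokens-per-second benchmark is small. A hedged sketch using only the standard library: it times one completion against any OpenAI-compatible endpoint and divides by the server-reported completion token count (the port numbers in the usage comment are the usual defaults for Ollama and LM Studio, but check your own setup):

```python
import json
import time
import urllib.request

def rate(completion_tokens: int, elapsed_s: float) -> float:
    # Includes prompt-processing time, so it slightly understates pure
    # generation speed; that's fine when comparing two backends on the
    # same prompt and model.
    return completion_tokens / elapsed_s

def bench(base_url: str, model: str, prompt: str) -> float:
    """One timed completion against an OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        }).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        usage = json.loads(resp.read())["usage"]
    return rate(usage["completion_tokens"], time.perf_counter() - start)

# Usage, with both servers running on their usual default ports and the
# model warmed up first (run each once and discard before measuring):
#   bench("http://localhost:11434", "gemma4:e4b", "Write a haiku")  # Ollama
#   bench("http://localhost:1234", "gemma4:e4b", "Write a haiku")   # LM Studio
```

Averaging several runs, as the parent did, matters: a single run is dominated by cache and scheduling noise.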
It depends on the hardware, backend and options. I've recently tried running some local AIs (Qwen3.5 9B for the numbers here) on an older AMD 8GB VRAM GPU (so vulkan) and found that:
llama.cpp is about 10% faster than LM studio with the same options.
LM Studio is 3x faster than ollama with the same options (~38 t/s vs ~13 t/s), but messes up tool calls.
Ollama ended up slowest on the 9B, the Qwen3.5 35B, and some random other 8B model.
Note that this isn't some rigorous study or performance benchmarking. I just found ollama unacceptably slow and wanted to try out the other options.
In case someone would like to know what these are like on this hardware, I tested Gemma 4 32b (the ~20 GB model, the largest Gemma model Google published) and gemma4:e4b (the ~10 GB model) on this exact setup (Mac Mini M4 with 24 GB of RAM using Ollama), and I livestreamed it.
The ~10 GB model is super speedy, loading in a few seconds and giving responses almost instantly. If you just want to see its performance, it says hello around the 2 minute mark in the video (and fast!) and the ~20 GB model says hello around 5 minutes 45 seconds in the video. You can see the difference in their loading times and speed, which is a substantial difference. I also had each of them complete a difficult coding task, they both got it correct but the 20 GB model was much slower. It's a bit too slow to use on this setup day to day, plus it would take almost all the memory. The 10 GB model could fit comfortably on a Mac Mini 24 GB with plenty of RAM left for everything else, and it seems like you can use it for small-size useful coding tasks.
Just told Claude to sort it out and it ran it. 26 tok/s on the Mac mini I use for my personal claw-type program. Unusable for a local agent, but it's okay.
Isn't 26 tok/s quite usable for a claw-like agent though? You can chat with it on a IM platform and get notified as soon as it replies, you're not dependent on real-time quick interaction.
The article has a few good tips for using Ollama. Perhaps it should note that the Gemma 4 models are not really trained for strong performance with coding agents like OpenCode, Claude Code, pi, etc. The Gemma 4 models are excellent for applications requiring tool use, data extraction to JSON, etc. I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training. This makes sense, and is something that I do: use strong models to build effective applications using small efficient models.
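The "data extraction to JSON" use-case the parent mentions usually needs one small piece of glue, because even tool-capable small models tend to wrap their JSON in markdown fences or chatter. An illustrative, tolerant parser (my own sketch, not from any library):

```python
import json
import re

def extract_json(reply: str) -> dict:
    """Pull the first JSON object out of a model reply, tolerating the
    markdown fences and surrounding prose small models often add.

    Note: the greedy regex spans from the first '{' to the last '}', which
    is fine for replies containing a single object, the common case in
    structured-extraction prompts.
    """
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        raise ValueError("no JSON object in reply")
    return json.loads(match.group(0))
```

Pairing a guard like this with a retry (re-prompting on `ValueError` or `json.JSONDecodeError`) is usually enough to make a small local model reliable for narrow extraction tasks.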
> I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training.
The Gemma models were literally released yesterday. You can't ask LLMs for advice on these topics and get accurate information.
Please don't repeat LLM-sourced answers as canonical information
It's not just LLM sourced though, folks have literally tried this after the release with the 26A4B model and it wasn't very good. Maybe the dense ~31B model is worthwhile though.
I agree with your criticism. I should have simply said that I had good results with Gemma 4 tool use, and agentic coding with Gemma 4 didn't yet work well for me.
I spent two hours doing my own research before asking for Gemini's analysis, which reinforced my own opinion that the Gemma models historically have not been trained and targeted for agentic coding use.
Have you tried using the new Gemma 4 models with agentic coding tools? If you do, you might end up agreeing with me.
I wasn't very clear, sorry. By my "own research" I meant spending 90 minutes experimenting with Gemma 4 models for tool use (good results!) and a half hour using them with pi and OpenCode (I didn't get good results, yet).
Oh yeah, absolute genius. I asked GPT-2 about Claude Opus 4.6 and it said "this is not a recommendation. You might get some benefits from Opus… but this is not what you want". Damn, real wisdom from the OG there. What a legend
If this is your first time using open weight models right after release, know that there are always bugs in the early implementations and even quantizations.
Every project races to have support on launch day so they don't lose users, but the output you get may not be correct. There are already several problems being discovered in tokenizer implementations, and quantizations may have problems too if they use imatrix.
So you're going to see a lot of "I tried it but it sucks because it can't even do tool calls" and other reports in the coming weeks about how the models don't work at all, from people who don't realize they were using broken implementations.
If you want to try cutting-edge open models, you need to be ready to constantly update your inference engine, check your quantization for updates, and re-download when it's changed. The mad rush to support everything on launch day means it all gets shipped as soon as it looks like it can produce output tokens, not when it's tested to be correct.
You seem like you know what you're talking about... what inference engine should I use? (linux, 4090)
I keep having "I tried it but it sucks" issues mostly around tool calling and it's not clear if it's the model or ollama. And not one model in particular, any of them really.
For the specific issue parent is talking about, you really need to give various tools a try yourself, and if you're getting really shit results, assume it's the implementation that is wrong, and either find an existing bug tracker issue or create a new one.
Same thing happened when GPT-OSS launched: a bunch of projects had "day-1" support, but in reality that just meant you could basically load the model. A bunch of them had broken tool calling, some chat prompt templates were broken, and so on. Even llama.cpp, which usually has the most recent support (in my experience), had this issue, and it wasn't until a week or two after launch that GPT-OSS could be fairly evaluated with it. Then Ollama/LM Studio updated their bundled llama.cpp some days after that.
So it's a process thing, not "this software is better than that", and it heavily depends on the model.
After spending the past few weeks playing with different backends and models, I just can't believe how buggy most models are.
It seems to me that most model providers are not running/testing via the most used backends, i.e. llama.cpp, Ollama, etc., because if they were, they would see how broken their release is.
Tool calling is the Achilles' heel where most will fail, unless you either modify the system prompts or run via proxies so you can inject/munge the request/reply.
Like seriously⦠how many billions and billions (actually we saw one >800 billion evaluation last week, so almost a whole trillion) goes into AI development and yet 99.999% of all models from the big names do not work straight out of the box with the most common backends. Blows my mind!
Just since I'm curious, what exact models and quantization are you using? In my own experience, anything smaller than ~32B is basically useless, and any quantization below Q8 absolutely trashes the models.
Sure, for single use-cases you could make use of a ~20B model if you fine-tune and have a very narrow use-case, but at that point there are usually better solutions than LLMs in the first place. For something general, 32B+ at Q8 is probably the bare minimum for local models, even the "SOTA" ones available today.
I've had really good success with LM Studio and GLM 4.7 Flash and the Zed editor, which has a baked-in integration with LM Studio. I am able to one-shot whole projects this way, and it seems to be constantly improving. Some update recently even allowed the agent to ask me if it can do a "research" phase, so it'll actually reach out to websites and read docs and code from GitHub if you allow it. GLM 4.7 Flash has been the most adept at tool calling I've found, but the Qwen 3 and 3.5 models are also fairly good, though they run into more snags than I've seen with GLM 4.7 Flash.
I don't know if any of the engines are fully tested yet.
For new LLMs I've gotten in the habit of building llama.cpp from upstream head and checking for updated quantizations right before I start using it. You can also download llama.cpp CI builds from their release page, but on Linux it's easy to set up a local build.
If you don't want to be a guinea pig for untested work, then the safe option is to wait 2-3 weeks.
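The build-from-head habit is only a few commands. A sketch of the usual Linux flow (the CUDA flag assumes an NVIDIA card like the 4090 asked about above; drop it for CPU-only, and substitute the quant repo you actually want from Hugging Face):

```shell
# Build llama.cpp from upstream head
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON      # omit -DGGML_CUDA=ON for CPU-only builds
cmake --build build --config Release -j

# Serve a fresh quantization straight from Hugging Face
# (replace with the actual quant repo you're tracking for updates):
./build/bin/llama-server -hf <author>/<model>-GGUF
```

Re-running `git pull` plus the two cmake commands picks up the day-to-day fixes that make or break a new model's tool calling.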
Just use OpenRouter or Google AI Studio for the first week till the bugs are ironed out. You still learn the nuances of the model, and then you can switch to local. In addition, you might pick up enough nuance to see if quantization is having any effect.
Kinda crazy that I can run a 26B model on a €1,500 laptop (MacBook Air M5 32GB). Does anyone know how I can actually use this in a productive way?
Slightly off topic, but question for folks.
I'm hoping to replace coding with Claude Sonnet 4.5 with an open source or open weights model. Are any of the models in Ollama.com's cloud offering (https://ollama.com/search?c=cloud) or any of the models on OpenRouter.ai a close replacement? I know that no model right now matches the full performance and capabilities of Claude Sonnet 4.5, but I want to know how close I can get and with which model(s).
If there is a model you say can replace it, talk about how long you have been using it, with what harness (Claude Code, opencode, etc.), and some strengths and weaknesses you have noticed. I'm not interested in what benchmarks say; I want to hear about real-world use from programmers using these models.
In short: no.
Nothing comes close, in my opinion. Sonnet and Opus are still the best models. The Codex variants of the GPT models are also great. I've tried MiniMax, GLM, Qwen and Kimi and for anything even remotely complex these models seriously struggle.
Thank you for the honest answer.
Yes, this is the conclusion I've come to as well. I don't want to continue supporting OpenAI nor Anthropic, but the other models don't seem to be anywhere close yet, despite the hype.
Yes, GLM5 and Kimi K2.5 are pretty close replacements for Sonnet.
What coding harness are you using? What are some example workflows you have used them for? Have you used them only for new/simple projects, or for more complicated refactoring or architecture design?
Haven't really tried GLM5 much but I've used 4.7 quite a bit and it was pretty far from competing with Sonnet at the time, although I saw claims online to the contrary.
Huge Claude user here… can someone help me set some realistic expectations if I bought a Mac mini and spun one up? I use Claude primarily for dev work and Home Lab projects. Are the open models good enough to run locally and replace the Claude workload? Or am I better off with my $20/mo Claude subscription?
They are good for small tasks, but you would not be able to use them the way you use Claude and would most likely be disappointed. But also, I do not know how you use Claude.
There are many online services hosting these models; my advice for anyone thinking about buying hardware to self-host is to try those first. That way you can get an impression of the capabilities and limitations of the models before you commit to buying hardware.
I've been playing with the open models since the original llama leak. They're getting better over time, are useful for tasks of moderate complexity and it's just cool to have a binary blob of knowledge that you can run locally without an internet connection.
However you should manage your expectations. Whatever the benchmarks say, you'll quickly realise they're not at all competing with Sonnet let alone Opus. Even the largest open weights models aren't really doing that.
Best way to find out is to buy $10 of OpenRouter credits and try the models for yourself.
From my experience doing this, they're nowhere close, but it's entertaining to check in once in a while.
So far, I've found gpt-oss-20B to be pretty good agentic-wise, but it's nothing like Claude Code using its paid models.
(I haven't tried the 120B, which I've read is significantly better than 20B.)
I tested briefly with a MacBook Pro M4 with 36 GB. Ran it in LM Studio with OpenCode as the frontend and it failed over and over on tool calls. Switched back to Qwen. Anyone else on a similar setup have better luck?
I failed to run in LM Studio on M5 with 32gb at even half max context. Literally locked up computer and had to reboot.
Ran gemma-4-26B-A4B-it-GGUF:Q4_K_M just fine with llama.cpp though. First time in a long time that I have been impressed by a local model. Both speed (~38t/s) and quality are very nice.
Tool calls failing is a problem with the inference engine's implementation and/or the quant. Update and try again in a few days.
This is how all open weight model launches go.
Haven't had time to try yet, but heard from others that they needed to update both the main and runtime versions for things to work.
Even with the latest version of LM Studio and the latest runtimes I find that tool use fails 100% of the time with the following error: Error rendering prompt with jinja template: "Cannot apply filter "upper" to type: UndefinedValue".
EDIT: The issue is addressed in LM Studio 0.4.9 (build 1), which auto-update wasn't picking up for me for some reason.
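The "Cannot apply filter" error above is an instance of a general class of chat-template bug: piping a possibly-undefined variable through a filter. A minimal Python analogue (hypothetical names, not LM Studio's actual template code) of the broken pattern and the usual fix:

```python
# Hypothetical analogue of the template bug: a chat template pipes a
# possibly-undefined variable through an "upper" filter.
def render_tool_name(ctx: dict) -> str:
    # Broken pattern: assumes "tool_name" is always present, like
    # `{{ tool_name | upper }}` in a Jinja chat template -- it raises
    # (KeyError here, UndefinedValue in the template engine) when absent.
    # return ctx["tool_name"].upper()

    # Fixed pattern: default to an empty string first, like
    # `{{ tool_name | default("") | upper }}`.
    return ctx.get("tool_name", "").upper()

print(render_tool_name({"tool_name": "search"}))  # SEARCH
print(render_tool_name({}))                       # empty string, no crash
```

This is why tool calls can fail 100% of the time while plain chat works: the broken branch of the template only runs when tools are in the request.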
I googled it: a supposedly fixed template
https://github.com/ggml-org/llama.cpp/issues/21347#issuecomm...
Alas, this does not resolve the issue for me.
Sorry for being off topic, but why can't I open this without being logged into GitHub? I thought gists are either completely private or publicly accessible. Are they no longer publicly accessible?
In case anyone's wondering, I tried it again and it worked this time, even without logging in. Maybe because this was my first visit to GitHub in a new country (I'm currently on vacation), I triggered some sort of anti-scraping measure or something.
Weird that the steps are for "Gemma 4 12b", which does not exist, and then switches to 26b midway through.
There's also a step to verify that it doesn't fit on the GPU with ollama ps showing "14%/86% CPU/GPU". Doesn't this mean you'll have really bad performance?
The Mac mini doesn't have different memory for the CPU and GPU, so maybe that's ignorable?
M5 air here with 32gb ram and 10/10 cores. Anyone got some luck with mlx builds on oMLX so far? Not at my machine right now and would love to know if these models already work including tool calling
The latest release v0.3.2 has partial support, generation is supported but not all special tokens are handled. I've done some personal testing to add tool calling and <|channel> thinking support. https://github.com/Yukon/omlx
awesome man, can't wait! And just now checked it out, and indeed 0.3.2 does already work for baseline chatting with MLX versions of Gemma 4… downloading and comparing different variants right now!
I know that someone got Gemma 4 E4B working with MLX [1] but I don't know much more than that.
1: https://github.com/bolyki01/localllm-gemma4-mlx
Which harness (IDE) works with this if any? Can I use it for local coding right now?
Yes, you can use it for local coding. Most harnesses can be pointed at a local endpoint which provides an OpenAI compatible API, though I've had some trouble using recent versions of Codex with llama.cpp due to an API incompatibility (Codex uses the newer "responses" API, but in a way that llama.cpp hasn't fully supported).
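As a sketch of what "pointing a harness at a local endpoint" means in practice, here is a minimal stdlib-only client for an OpenAI-compatible chat completions endpoint. The base URL and model name are assumptions (llama-server defaults to port 8080; LM Studio uses 1234, Ollama 11434):

```python
import json
import urllib.request

# Assumed base URL: llama-server's default. LM Studio typically serves
# at http://localhost:1234/v1 and Ollama at http://localhost:11434/v1.
BASE_URL = "http://localhost:8080/v1"

def build_chat_request(messages, model="local"):
    # Payload shape for POST /v1/chat/completions (OpenAI-compatible).
    return {"model": model, "messages": messages, "stream": False}

def chat(messages, model="local"):
    body = json.dumps(build_chat_request(messages, model)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a running server
        return json.load(resp)["choices"][0]["message"]["content"]
```

Any harness that lets you set an OpenAI base URL (and usually a dummy API key, since local servers don't check it) can target the same endpoint.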
I personally prefer Pi as I like the fact that it's minimalist and extensible. But some people just use Claude Code, some OpenCode, there are a ton of options out there and most of them can be used with local models.
It needs to support tool calling, and many of the quantized GGUFs don't, so you have to check.
I've got a workaround for that called Petsitter: it sits as a proxy between the harness and the inference engine and emulates additional capabilities through clever prompt engineering and various algorithms.
They're abstractly called "tricks" and you can stack them as you please.
https://github.com/day50-dev/Petsitter
You can run the quantized model on ollama, put petsitter in front of it, put the agent harness in front of that and you're good to go
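For flavor, here is the general shape of the trick such a proxy can use (a hypothetical sketch, not Petsitter's actual code): describe the tools in the system prompt, then parse a JSON "tool call" back out of the model's plain-text reply.

```python
import json
import re

def inject_tools(messages, tools):
    # Prepend a system message describing the tools, for models/quants
    # whose native tool-calling support is missing or broken.
    spec = "\n".join(f"- {t['name']}: {t['description']}" for t in tools)
    preamble = (
        "You can call these tools by replying with JSON like "
        '{"tool": "<name>", "arguments": {...}} and nothing else:\n' + spec
    )
    return [{"role": "system", "content": preamble}] + messages

def parse_tool_call(text):
    """Return (name, arguments) if the reply looks like a tool call, else None."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if "tool" in obj:
        return obj["tool"], obj.get("arguments", {})
    return None
```

The proxy then translates any parsed call into an OpenAI-style `tool_calls` response so the harness never knows the backend couldn't do it natively.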
If you have trouble, file bugs. Please!
Thank you
edit: just checked, the ollama version supports everything
so you can just use that.

Are you getting tool calls and multimodal working? I don't see it in the quantized unsloth GGUFs...
Why is ollama so many people's go-to? Genuinely curious; I've tried it but it feels overly stripped down / dumbed down vs nearly everything else I've used.
Lately I've been playing with Unsloth Studio and think that's probably a much better "give it to a beginner" default.
Ollama is good enough to dabble with, and getting a model is as easy as ollama pull <model name> vs figuring it out yourself on Hugging Face and trying to make sense of all the goofy letters and numbers between the forty different names of models, and not needing a Hugging Face account to download.
So you start there, and eventually, when you want to get off the happy path, you need to learn more about the server, and it's all so much more complicated than just using ollama. You just want to try models, not learn the intricacies of hosting LLMs.
to be fair, llama.cpp has gotten much easier to use lately with llama-server -hf <model name>. That said, the need to compile it yourself is still a pretty big barrier for most people.
I started with ollama and now I'm using llama.cpp/llama-server's Router Mode that allows you to manage multiple models through a single server instance.
One thing I haven't figured out: Subjectively, it feels like ollama's model loading was nearly instant, while I feel like I'm always waiting for llama.cpp to load models, but that doesn't make sense because it's ultimately the same software. Maybe I should try ollama again to convince myself that I'm not crazy and that ollama's model loading wasn't actually instant.
Ollama got some first-mover advantage at the time when actually building and git pulling llama.cpp was a bit of a moat. The devs' docker past probably made them overestimate how much they could lay claim to mindshare. However, no one really could have known how quickly things would evolve... Now I mostly recommend LM-studio to people.
What does unsloth-studio bring on top?
LM Studio has been around longer. I've used it for three years. I'd also agree it is generally a better beginner choice, then and now.
Unsloth Studio is more featureful (well-integrated tool calling, web search, and code execution being headline features), and comes from the people consistently making some of the best GGUF quants of all popular models. It is also well documented, easy to set up, and has good fine-tuning support.
LM Studio isn't free/libre/open source software, which misses the point of using open weights and open source LLMs in the first place.
Disagree, there are a lot of reasons to use open source local LLMs that aren't related to free/libre/oss principles. Privacy being a major one.
If you care about privacy making sure the closed source software does not call home is a concern...
I run Little Snitch[1] on my Mac, and I haven't seen LM Studio make any calls that I feel like it shouldn't be making.
Point it to a local models folder, and you can firewall the entire app if you feel like it.
Digressing, but the issue with open source software is that most OSS projects don't understand UX. UX requires a strong hand and opinionated decision making on whether or not something belongs front-and-center, and it's something that developers struggle with. The only counterexample I can think of is Blender, and it's a rare exception, sadly not the norm.
LM Studio manages the backend well, hides its complexities and serves as a good front-end for downloading/managing models. Since I download the models to a shared common location, If I don't want to deal with the LM Studio UX, I then easily use the downloaded models with direct llama.cpp, llama-swap and mlx_lm calls.
[1]: https://obdev.at
What I really don't get is why more people don't talk about LMStudio, I switched to it months ago and it seems like a straight upgrade.
Isnāt LMStudio closed source?
How does LMStudio compare to Unsloth Studio?
Advertising, mostly.
Ollama's org had people flood various LLM/programming related Reddits and Discords and elsewhere, claiming it was an 'easy frontend for llama.cpp', and tricked people.
Only way to win is to uninstall it and switch to llama.cpp.
Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.
And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?
> Ollama user with the opposite question -- why not? What am I missing out on? I'm using it as the backend for playing with other frontend stuff and it seems to work just fine.
Used to be an Ollama user. Everything you cite as benefits of Ollama is what drew me to it in the first place as well; then I moved on to using llama.cpp directly. Apart from behavior I consider unethical, the issue is that they abstract away a bit too much, especially when LLM model quality is highly affected by a bunch of parameters. Hell, you can't tell what quant you're downloading. Can you tell at a glance what size of model was downloaded? Whether it's optimized for your arch? What quant it is?
`ollama pull gemma4`
(Yes, I know you can add parameters etc., but the point stands because this is sold as noob-friendly. If you are going to be adding CLI params to tweak things, why not just do the same with llama.cpp?)
That became a big issue when DeepSeek R1 came out, because everyone and their mother was making TikToks saying you could run the full-fat model, without explaining that it was a distill, which Ollama had abstracted away. Running `ollama run deepseek-r1` means nothing when the quality ranges from useless to super good.
> And as someone running at 16gb card, I'm especially curious as to if I'm missing out on better performance?
I'd go so far as to say, I can *GUARANTEE* you're missing out on performance if you are using Ollama, no matter the size of your GPU VRAM. You can get significant improvement if you just run underlying llama.cpp.
Secondly, it's chock full of dark patterns (like the ones above) and anti-open source behavior. For some examples:
1. It mangles GGUF files so other apps can't use them, and you can't access them either without a bunch of work on your end (I had to script a way to unmangle the long sha-hashed file names).

2. Ollama conveniently fails to contribute improvements back to the original codebase (they don't have to, technically, thanks to MIT); they didn't bother assisting llama.cpp in developing multimodal capabilities and features such as iSWA.

3. Any innovation they do ship is just piggybacking off of llama.cpp that they try to pass off as their own without contributing back upstream. When new models come out they post "WIP" publicly while twiddling their thumbs waiting for llama.cpp to do the actual work.
It operates in this weird "middle layer" where it is kind of user friendly, but it's not as user friendly as LM Studio.
After all this, I just couldn't continue using it. If the benefits it provides you are good, then by all means continue.
IMO just finding the most optimal parameters for a model and aliasing them in your CLI would be a much better experience ngl, especially now that we have llama-server, a nice webui, and hot reloading built into llama.cpp.
Ollama has had bad defaults forever (stuck on a default CTX of 2048 for like 2 years) and they typically are late to support the latest models vs llamacpp. Absolutely no reason to use it in 2026.
For me it's just the server. I use openwebui as interface. I don't want it all running on the same machine.
Last night I had to install the v0.20 pre-release of ollama to use this model. So I'm wondering if these instructions are accurate.
There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.
Ollama is slower, and they started out as a shameless llama.cpp ripoff without giving credit, and now they "ported" it to Go, which means they're just vibe-code-translating llama.cpp, bugs included.
I really like LM Studio when I can use it under Windows, but for people like me with Intel Macs + AMD GPUs, ollama is the only option because it can leverage the GPU using MoltenVK, aka Vulkan, unofficially. We're still testing it, hoping to get the Vulkan support in the main branch soon. It works perfectly for single GPUs, but some edge cases when using multiple GPUs are unsupported until upstream support from MoltenVK comes through. But yeah, I agree, it wasn't cool to repackage Georgi's work like that.
LM Studio is closed source.
And didn't Ollama independently ship a vision pipeline for some multimodal models months before llama.cpp supported it?
Yes, they introduced that Golang rewrite precisely to support the visual pipeline and other things that weren't in llama.cpp at the time. But then llama.cpp usually catches up and Ollama is just left stranded with something that's not fully competitive. Right now it seems to have messed up mmap support which stops it from properly streaming model weights from storage when doing inference on CPU with limited RAM, even as faster PCIe 5.0 SSDs are finally making this more practical.
The project is just a bit underwhelming overall, it would be way better if they just focused on polishing good UX and fine-tuning, starting from a reasonably up-to-date version of what llama.cpp provides already.
Do y'all mean backend or the Ollama frontend or both? I find it trivially easy to sub in my local Ollama api thing in virtually all of the interesting frontend things. I'm quite curious about the "why not Ollama" here.
Does LM Studio have an equivalent to the ollama launch command? i.e. `ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4`
I don't think it does, but llama.cpp does, and can load models off HuggingFace directly (so, not limited to ollama's unofficial model mirror like ollama is).
There is no reason to ever use ollama.
> I don't think it does, but llama.cpp does
I just checked their docs and can't see anything like it.
Did you mistake the command to just download and load the model?
-hf ModelName:Q4_K_M
Did you mistake the command to just download and load the model too?
Actually that shouldn't be a question, you clearly did.
Hint: it also opens Claude code configured to use that model
sure there's a reason... it works fine, that's the reason
I feel like the READMEs for these 3 large, popular packages already illustrate the tradeoffs better than a Hacker News argument.
> There is virtually no reason to use Ollama over LM Studio or the myriad of other alternatives.
Hmm, the fact that Ollama is open-source, can run in Docker, etc.?
Ollama is quasi-open source.
In some places in the source code they claim sole ownership of the code, when it is highly derivative of code in llama.cpp (having started its life as a llama.cpp frontend). They do keep the same license, however: MIT.
There is no reason to use Ollama as an alternative to llama.cpp, just use the real thing instead.
If it's MIT code derived from MIT code, in what way is its openness "quasi"? Issues of attribution and crediting diminish the karma of the derived project, but I don't see how they diminish the level of openness.
lm studio is not open source, and you can't use it on a server and connect clients to it?
LM Studio can absolutely run as a server.
IIRC it does so as default too. I have loads of stuff pointing at LM Studio on localhost
>Ollama is slower
I've benchmarked this on an actual Mac Mini M4 with 24 GB of RAM, and averaged 24.4 t/s on Ollama and 19.45 t/s on LM Studio for the same ~10 GB model (gemma4:e4b), a difference which was repeated across three runs and with both models warmed up beforehand. Unless there is an error in my methodology, which is easy to repeat[1], it means Ollama is a full 25% faster. That's an enormous difference. Try it for yourself before making such claims.
[1] script at: https://pastebin.com/EwcRqLUm, but it warms up both and keeps them in memory, so you'll want to close almost all other applications first. Install both Ollama and LM Studio, download the models, and change the path to where you installed the model. Interestingly, I had to go through 3 different AIs to write this script: ChatGPT (on which I'm a Pro subscriber) thought about doing so then returned nothing (shenanigans since I was benchmarking a competitor?), I had run out of my weekly session limit on Pro Max 20x credits on Claude (wonder why I need a local coding agent!), and then Google rose to the challenge and wrote the benchmark for me. I didn't try writing a benchmark like this locally; I'll try that next and report back.
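For anyone wanting to repeat this without the pastebin script, a stdlib-only sketch of the measurement is below. The endpoint URLs and model name are assumptions; both Ollama and LM Studio expose an OpenAI-compatible API that reports token usage.

```python
import json
import time
import urllib.request

def tokens_per_second(completion_tokens, elapsed_s):
    # Guard against a zero-length timing window.
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

def bench(base_url, model, prompt, max_tokens=256):
    # Assumed defaults: Ollama serves "http://localhost:11434/v1",
    # LM Studio "http://localhost:1234/v1".
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions", data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:  # needs a running server
        data = json.load(resp)
    elapsed = time.monotonic() - start
    # OpenAI-format responses report usage.completion_tokens.
    return tokens_per_second(data["usage"]["completion_tokens"], elapsed)
```

Note that this wall-clock measurement lumps prompt processing in with generation, so as the parent says, warm both servers up and average several runs before drawing conclusions.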
It depends on the hardware, backend and options. I've recently tried running some local AIs (Qwen3.5 9B for the numbers here) on an older AMD 8GB VRAM GPU (so vulkan) and found that:
llama.cpp is about 10% faster than LM studio with the same options.
LM Studio is 3x faster than ollama with the same options (~38 t/s vs ~13 t/s), but messes up tool calls.
Ollama ended up slowest on the 9B, Qwen3.5 35B, and some random other 8B model.
Note that this isn't some rigorous study or performance benchmarking. I just found ollama unacceptably slow and wanted to try out the other options.
In case someone would like to know what these are like on this hardware, I tested Gemma 4 32b (the ~20 GB model, the largest Gemma model Google published) and gemma4:e4b (the ~10 GB model) on this exact setup (Mac Mini M4 with 24 GB of RAM using Ollama), and I livestreamed it:
https://www.youtube.com/live/G5OVcKO70ns
The ~10 GB model is super speedy, loading in a few seconds and giving responses almost instantly. If you just want to see its performance, it says hello around the 2 minute mark in the video (and fast!) and the ~20 GB model says hello around 5 minutes 45 seconds in the video. You can see the difference in their loading times and speed, which is a substantial difference. I also had each of them complete a difficult coding task, they both got it correct but the 20 GB model was much slower. It's a bit too slow to use on this setup day to day, plus it would take almost all the memory. The 10 GB model could fit comfortably on a Mac Mini 24 GB with plenty of RAM left for everything else, and it seems like you can use it for small-size useful coding tasks.
how many TPS does a build like this achieve on gemma 4 26b?
Just told Claude to sort it out and it ran it. 26 tok/s on the Mac mini I use for a personal claw-type program. Unusable for a local agent but it's okay.
Isn't 26 tok/s quite usable for a claw-like agent though? You can chat with it on an IM platform and get notified as soon as it replies; you're not dependent on real-time quick interaction.
Why are you using Ollama? Just use llama.cpp
brew install llama.cpp
use the inbuilt CLI, Server or Chat interface. + Hook it up to any other app
For MLX I'd guess.
https://omlx.ai/
Does this have a CLI only interface?
Yes. You could also look at the README.md.
That also comes upstream from llama.cpp https://github.com/ggml-org/llama.cpp/discussions/4345
The article has a few good tips for using Ollama. Perhaps it should note that the Gemma 4 models are not really trained for strong performance with coding agents like OpenCode, Claude Code, pi, etc. The Gemma 4 models are excellent for applications requiring tool use, data extraction to JSON, etc. I asked Gemini Pro about this earlier, and it recommended Qwen 3.5 models specifically for coding, backing that up with interesting material on training. This makes sense, and it is something that I do: use strong models to build effective applications using small, efficient models.
> I asked Gemini Pro about this earlier and Gemini Pro recommended qwen 3.5 models specifically for coding, and backed that up with interesting material on training.
The Gemma models were literally released yesterday. You can't ask LLMs for advice on these topics and get accurate information.
Please don't repeat LLM-sourced answers as canonical information.
It's not just LLM sourced though, folks have literally tried this after the release with the 26A4B model and it wasn't very good. Maybe the dense ~31B model is worthwhile though.
Many Gemma implementations are or were broken on launch day. The first attempts to fix llama.cpp's tokenizer were merged hours ago.
Everyone hated Qwen3.5 at launch too, because so many implementations were broken and couldn't do tool calling.
You need to ignore social media "I tried this and it sucks" echo chambers for new model releases.
I agree with your criticism. I should have simply said that I had good results with Gemma 4 tool use, and that agentic coding with Gemma 4 didn't yet work well for me.
I spent two hours doing my own research before asking for Gemini's analysis, which reinforced my own opinion that the Gemma models historically have not been trained and targeted for agentic coding use.
Have you tried using the new Gemma 4 models with agentic coding tools? If you do, you might end up agreeing with me.
I've found my research on certain topics like this becoming less reliable these days, compared to just trying it out to form an opinion.
I wasn't very clear, sorry. By my "own research" I meant spending 90 minutes experimenting with Gemma 4 models for tool use (good results!) and a half hour using them with pi and OpenCode (I didn't get good results, yet).
LLMs can search the web. Although I don't trust the LLM (or someone repeating its claim) without quotes and URLs to where it got the information.
Oh yeah, absolute genius. I asked GPT-2 about Claude Opus 4.6 and it said "this is not a recommendation. You might get some benefits from Opus… but this is not what you want". Damn, real wisdom from the OG there. What a legend