Qwen 3.5 small models punch way above their weight. The 9B variant is not "cute for a tiny model." It's genuinely useful.
I have been spending more time with local models lately. I think a lot more people should try them, and Qwen 3.5 is the reason I am finally writing this up.
Four reasons to run AI locally:
- Cheap. You pay for the Mac and the electricity. There's no per-token bill, no monthly subscription, no metered API.
- Private. The model runs on hardware you control. Prompts never leave your machine unless you deliberately send them somewhere.
- Increasingly capable. Local models used to feel like a science fair project. That changed fast.
- Low barrier. If you already have Apple silicon sitting around, the setup takes about ten minutes.
The Qwen team framed the small series as "more intelligence, less compute," and after using them daily, I think that's accurate:
- 0.8B and 2B are tiny and fast, built for edge devices
- 4B is a surprisingly strong lightweight model
- 9B is compact, but already pushing into "serious everyday tool" territory
They also released base models alongside the instruction-tuned variants, which is useful if you want
to fine-tune or experiment. For most readers, though, the practical story is simpler: the 4B and 9B
models run comfortably on normal Apple silicon machines and produce results that no longer feel like
a novelty. On the current Ollama Qwen 3.5 library page, the
family spans 0.8b, 2b, 4b, 9b, 27b, 35b, and 122b, all with text-and-image support and
a 256K context window. If you want a rough idea of what your hardware can support,
ModelFit is a useful sanity check before you start downloading random 20 GB
artifacts.
I originally bought a Mac mini for OpenClaw experiments. Then I got frustrated by WhatsApp constantly disconnecting from it. That detour pushed me toward Qwen 3.5, and I am glad it did. The small models deliver an absurd amount of useful AI for a fraction of the compute cost of larger open-weight alternatives.
This guide covers the full path: install Ollama on a Mac, pull qwen3.5:4b or qwen3.5:9b, connect
to it from Cumbersome on that same Mac, then make it available to an iPhone or
another Mac on your local network. I will also cover the simplest privacy hardening so your "local"
setup does not quietly spray logs all over disk.
My Setup
A base M4 Mac mini with 16 GB of RAM. Nothing exotic. Enough to make qwen3.5:4b and qwen3.5:9b
feel practical.
My Mac mini M4 handles both models capably. Not magically. Capably. That distinction matters.
If you are expecting frontier-cloud performance on every hard reasoning task, recalibrate. That's not what this is. But for private writing help, summarization, brainstorming, coding assistance, and general-purpose "think with me" AI on your own hardware, this setup works well. I use it daily.
Step 1: Install Ollama on Your Mac
Ollama has a Quickstart, but here is the Mac version in plain English:
- Go to ollama.com/download and grab the macOS app.
- Open the download and move it into Applications if macOS asks.
- Launch Ollama once. A background service starts automatically.
- Leave it running. Ollama now serves a local API on http://localhost:11434.
Two endpoints matter:
- Ollama's native API: http://localhost:11434/api (docs)
- OpenAI-compatible endpoint: http://localhost:11434/v1 (docs)
Cumbersome uses the second one.
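If you want to see roughly what Cumbersome does under the hood, here is a minimal chat-completions call against that second endpoint. It assumes you have already pulled a model (Step 2); the qwen3.5:9b tag and the prompt are just placeholders.

curl -sS http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}]
  }'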
Step 2: Pull Qwen 3.5
Open Terminal. The quickest way to get started:
ollama run qwen3.5
That resolves to the current default, which right now is the 9B variant on the Qwen 3.5 library page. I prefer being explicit:
ollama pull qwen3.5:4b
ollama pull qwen3.5:9b
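Once the pulls finish, two quick checks confirm everything is usable. ollama list shows what is on disk, and a one-off prompt from the terminal proves a model actually loads and answers (the 9b tag assumes you pulled it above):

ollama list
ollama run qwen3.5:9b "Give me a two-sentence summary of why local models are useful."

Both are plain Ollama CLI; no Cumbersome involved yet.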
Here is how the sizes break down:
- qwen3.5:0.8b and qwen3.5:2b exist for minimal-footprint, maximum-speed use cases.
- qwen3.5:4b is about 3.4 GB on disk. Light, quick, and cheap to run. This is the one I would point most curious local-AI newcomers toward first.
- qwen3.5:9b is about 6.6 GB. My daily driver. Still small enough to feel local, strong enough for real work.
- qwen3.5:27b, 35b, and 122b exist, but they need significantly more RAM and are a different hardware conversation.
The 9B model in particular punches closer to big-model territory than its parameter count has any right to. That's the story of the Qwen 3.5 small series right now, and the reason this guide exists.
Step 3: Connect Cumbersome to Ollama
Download Cumbersome on the Mac where Ollama is running. Then add Ollama as an OpenAI-compatible provider.
Point Cumbersome at http://localhost:11434/v1. For the API key, enter literally anything.
The settings:
- Provider Name: Local or Ollama (your choice)
- Base URL: http://localhost:11434/v1
- API key: any string at all, such as ollama
That API key part sounds absurd, but it's straight from Ollama's OpenAI compatibility docs: the key field is required by many clients, including Cumbersome, but Ollama ignores whatever you put there.
Quick sanity check (Terminal). On the same Mac where Ollama is running, run:
curl -sS http://localhost:11434/v1/models
On that machine, 127.0.0.1 and localhost are interchangeable for this check; whichever host form you use here, use the same one in Cumbersome's base URL.
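A healthy response looks roughly like this. Your model tags and timestamps will differ; what matters is the data array listing the models you pulled:

{
  "object": "list",
  "data": [
    { "id": "qwen3.5:9b", "object": "model", "created": 1760000000, "owned_by": "library" },
    { "id": "qwen3.5:4b", "object": "model", "created": 1760000000, "owned_by": "library" }
  ]
}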
If you don't get JSON like this (including a data array of models), the problem is not Cumbersome or your API key string. Check that Ollama is installed, that the daemon is actually running and listening on port 11434, and that firewall rules or anything else on that Mac are not blocking loopback, then try again.
Save the provider and start a conversation. If everything is working, your local Qwen models show up in the model dropdown.
Cumbersome sees the local models. Set the Title Model to the same model you plan to chat with.
One practical tip: set the Title Model to match your main chat model. In local setups, bouncing between one model for conversation and a different one for auto-generated titles adds latency you do not need. Keep them the same.
Step 4: Turn Thinking Off
This is my strongest Qwen-specific recommendation: leave thinking off.
Qwen 3.5 is smart. It can also be spectacularly overwrought when thinking is enabled. You ask for a single fact and it produces three pages of internal deliberation first.
To find the setting, tap the + button at the bottom of the chat composer. That opens the advanced
features panel:
The AI Reasoning toggle lives behind the + button. Yes, people complain that the plus icon buries
these controls. We are following the same pattern ChatGPT and Claude use: they both hide their
little knobs behind a plus-style menu too.
Thinking should be off by default. Here is a concrete example of why.
I asked qwen3.5:9b a dead-simple prompt: "tell me an interesting fact about flowers."
With thinking off:
Clean, direct answer. Fast. No drama.
With thinking on:
An enormous amount of internal throat-clearing for a one-sentence flower fact.
For everyday local use, thinking off wins:
- faster responses
- less rambling
- no multi-page internal monologues
- better fit for writing, summarization, and utility work
You can always flip it back on for a single hard prompt. I wouldn't leave it on by default.
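If you ever hit Ollama directly instead of going through Cumbersome, newer Ollama releases expose a think parameter on the native chat API for reasoning-capable models. This is a sketch under the assumption that Qwen 3.5 honors that parameter the same way other thinking models do; check the API docs for your Ollama version before relying on it:

curl -sS http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "think": false,
  "stream": false,
  "messages": [{"role": "user", "content": "Tell me an interesting fact about flowers."}]
}'

In Cumbersome itself, the toggle above is all you need.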
Step 5: Expose Ollama to Your Local Network
Running Ollama on the same Mac is useful. Running it from your phone on the couch is better.
Open Ollama's settings and flip the network toggle:
"Expose Ollama to the network" turns your Mac from localhost-only into a local AI server for every device on your LAN.
With this on, other devices on the same Wi-Fi can reach Ollama.
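Under the hood this is essentially binding the server to all interfaces instead of loopback. If you run ollama serve yourself, or are on a build without the toggle, Ollama's FAQ describes the manual equivalent on macOS: set OLLAMA_HOST with launchctl, then restart the app so it picks up the new binding.

launchctl setenv OLLAMA_HOST "0.0.0.0"
# then quit and reopen Ollama

The app toggle is the easier path; this is just what it amounts to.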
Step 6: Find Your Mac's Local IP Address
You need the Mac's LAN IP to point other devices at it.
Two quick ways:
- System Settings → Wi-Fi → click your current network → look for the IP address.
- In Terminal:
ipconfig getifaddr en0
You are looking for something like 192.168.0.108. Build the base URL from that:
http://192.168.0.108:11434/v1
Same Ollama server, addressed over the local network instead of localhost.
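Before touching the phone, confirm the LAN address actually responds. From the second device (or from any terminal, using the LAN IP instead of localhost), substitute your own address:

curl -sS http://192.168.0.108:11434/v1/models

If that returns the same models JSON as the localhost check in Step 3, the network path is fine, and anything that fails afterwards is client configuration.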
Step 7: Use It from Your iPhone or Another Mac
On your iPhone (or a second Mac), add another OpenAI-compatible provider in Cumbersome using the LAN
address instead of localhost:
Same setup, different device. Swap localhost for the Mac's LAN IP and you are running local Qwen
from your phone.
That's it. The Mac does the heavy lifting. The phone or second Mac is just a client, with a much nicer interface than poking at a terminal or a browser tab.
Three things need to be true:
- the Mac running Ollama stays on and awake (there is a one-line helper for this below)
- both devices are on the same network
- Cumbersome points to the LAN URL, not localhost
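For the "on and awake" part, macOS's built-in caffeinate is the lightest-touch option: it holds a sleep assertion for as long as it runs, and you stop it with Ctrl-C. Adjusting the Energy settings works too; this is just the disposable-terminal version. Note that -s only prevents sleep while the Mac is on AC power, which a Mac mini always is.

caffeinate -s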
Privacy Tips: Keep the Good Part Local
The whole point of running local is that your prompts stay on hardware you control. Do not undermine that by leaving a trail of logs behind.
1. Only expose the network when you need it
If you are only using Ollama from the same Mac, leave network exposure off. One machine, one process, no surface area.
When you turn it on for same-network phone or laptop access, understand what that means: the Mac is now serving AI to every device on your LAN. That's still vastly more private than a cloud provider, but it's no longer "this process only talks to itself."
2. Drop routine logs
The simplest privacy move is to stop writing request logs to disk. If you start Ollama manually from a shell, redirect stdout to nowhere and only keep errors:
ollama serve >/dev/null 2>>"$HOME/Library/Logs/ollama-error.log"
If a helper or wrapper insists on writing to a specific log path, you can also symlink it to
/dev/null:
ln -sf /dev/null /path/to/whatever.log
Blunt, but effective. If your prompts are privacy-sensitive, do not casually keep request traces you do not need.
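If you are not sure what is already being written, start by looking at what the app keeps today. On my Mac the Ollama server log lives under ~/.ollama/logs; the exact paths may differ across versions, so treat these as a starting point:

ls -lh ~/.ollama/logs/
tail -n 20 ~/.ollama/logs/server.log

If those files contain request details you would rather not keep, the redirect and symlink tricks above apply.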
3. Watch for cloud features
Ollama now includes cloud models and cloud-based web search. The web-search piece is probably useful for a lot of tasks. I haven't played with it yet.
But if the reason you set all of this up is privacy and keeping everything on your own hardware, leave the cloud features off. The moment you enable cloud models or cloud search, you are back in a hybrid setup. Local means local.
Practical Recommendations
The short version:
- Start with qwen3.5:9b if your Mac has the RAM.
- Use qwen3.5:4b when you want something lighter or faster.
- Set the Title Model to match your main chat model.
- Leave thinking off by default.
- Use localhost on the Ollama machine, LAN IP on everything else.
- Keep logs off disk unless you genuinely need them.
The Tradeoffs
Local AI is not free. You are paying in three currencies: hardware, electricity, and patience when a small model decides to be weird.
But compared with paying cloud token bills in perpetuity, I think this trade is getting more attractive every few months. The small models keep getting better. The hardware keeps getting cheaper.
qwen3.5:9b is not GPT-4o. But for a surprising amount of everyday use, it gets close enough that
the privacy and cost advantages tip the balance.
Part 2: Remote Access
This post covered the simplest version: one Mac, one local network, Cumbersome connecting from the Mac itself, an iPhone, or another Mac nearby.
The next step is making that same setup reachable when you are away from home, without throwing a port open to the public Internet. That's where tools like Tailscale come in.
I will cover that in part 2.
Bless up! 🙏✨