You've got Hermes running on your VPS, answering messages on Telegram, managing your tasks, and generally being helpful. But here's the thing: you're paying for every API call to the cloud, and sometimes you just want to run a quick local model for sensitive data or offline work.
What if you could have both? Cloud models when you need power, local models when you want privacy or cost savings, and the ability to switch between them instantly?
⚡ The Short Version
This guide shows you how to configure Hermes with:
- Venice.ai cloud models: Kimi K2.5, ZAI GLM series, Gemma variants
- Ollama local models: Running on your laptop via SSH reverse tunnel
- Hermes web dashboard: Access via SSH forward tunnel at `localhost:9119`
- Seamless switching: Use `/model` in Telegram to change on the fly
The secret sauce? SSH tunnels that bridge your VPS and laptop in both directions.
Why Mix Cloud and Local?
Each approach has trade-offs:
| Consideration | Cloud (Venice.ai) | Local (Ollama) |
|---|---|---|
| Cost | Per-token pricing | Free after hardware cost |
| Speed | Fast, GPU-backed | Depends on your GPU |
| Privacy | Data leaves your system | Inference stays on your own hardware |
| Availability | Requires internet | Works offline |
| Model variety | Access to cutting-edge | Limited by your VRAM |
The hybrid approach gives you the best of both worlds. Use cloud models for complex reasoning, coding, and when you need the latest capabilities. Drop down to local models for routine tasks, sensitive data, or high-volume queries where per-token costs add up.
The Architecture
Here's how everything connects:
[Architecture diagram; labels: Cloud Models, Port Forward, Ollama API]
The key insight: Hermes on your VPS can't directly access your laptop's Ollama instance because your laptop is behind NAT/firewall. The SSH reverse tunnel fixes this by forwarding a port from your VPS back to your laptop. Meanwhile, a forward tunnel lets you access the Hermes web dashboard from your laptop as if it were running locally.
Part 1: Setting Up Venice.ai Cloud Models
Venice.ai offers several excellent models through their API. Here's how to add them to Hermes.
Step 1: Get Your API Key
Head to venice.ai and generate an API key from your settings. You'll need this for the config file.
Step 2: Configure Hermes Providers
Edit your Hermes config file (usually at ~/.hermes/config.yaml):
# ~/.hermes/config.yaml
# Default model for new conversations
default_model: venice-kimi
# Optional: Fallback if primary fails
# fallback_model:
#   provider: venice-glm-5
#   model: zai-org/glm-5
custom_providers:
  # Venice.ai models - all use the same base URL and key
  - name: venice-kimi
    base_url: https://api.venice.ai/api/v1
    api_key: ${VENICE_API_KEY}
    model: kimi-k2-5
  - name: venice-glm-5
    base_url: https://api.venice.ai/api/v1
    api_key: ${VENICE_API_KEY}
    model: zai-org/glm-5
  - name: venice-glm-flash
    base_url: https://api.venice.ai/api/v1
    api_key: ${VENICE_API_KEY}
    model: zai-org/glm-4.7-flash
  - name: venice-gemma
    base_url: https://api.venice.ai/api/v1
    api_key: ${VENICE_API_KEY}
    model: google/gemma-4-31b-it
💡 Pro Tip: Use Environment Variables
Instead of hardcoding your API key, use ${VENICE_API_KEY} and set it in your shell:
export VENICE_API_KEY="your-key-here"
Add this to your ~/.bashrc or ~/.zshrc to make it persistent.
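A variable exported in ~/.bashrc is only visible to processes started from a shell that has read that file, so it's worth confirming the key is actually present before you restart Hermes. Here's a minimal sketch of such a check; `require_env` is a hypothetical helper name, not something Hermes provides:

```shell
# Fail early if a required environment variable is unset or empty.
# `require_env` is a hypothetical helper for illustration only.
require_env() {
    name=$1
    # Indirect lookup that also works in plain POSIX sh
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "error: $name is not set" >&2
        return 1
    fi
    # Print the length only, never the secret itself
    echo "$name is set (${#value} characters)"
}
```

Run `require_env VENICE_API_KEY` in the same shell (or service environment) that launches Hermes. A non-zero exit status means the `${VENICE_API_KEY}` placeholder in the config would have nothing to expand to.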
Step 3: Test Your Cloud Models
Restart Hermes and test in Telegram:
/model venice-glm-5
You should get a confirmation message. Ask something to verify it's working.
Part 2: Setting Up Ollama on Your Laptop
Now for the fun part: getting local models accessible from your VPS.
Step 1: Install Ollama
On your laptop (not the VPS), install Ollama:
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS: download the app from https://ollama.com/download
Step 2: Pull Some Models
# Good general-purpose model (4-bit quantized, ~5GB)
ollama pull qwen2.5:7b
# Smaller, faster model for quick tasks (~2GB)
ollama pull gemma2:2b
# Larger model for complex tasks (~9GB)
ollama pull qwen2.5:14b
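⚠️ Ollama's Default Listen Address Is Enough
By default, Ollama listens only on localhost (127.0.0.1:11434). That works for this setup, because the SSH tunnel hands the VPS's requests to localhost:11434 on your laptop. Bind to all interfaces only if other machines on your network need direct access, and be aware that this exposes the unauthenticated API to that network:
# macOS: set the environment variable, then restart the Ollama app
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
# Linux: run the server in the foreground with the variable set:
OLLAMA_HOST=0.0.0.0:11434 ollama serve
# Or create a systemd override and add, under [Service]:
#   Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl edit ollama.service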
Step 3: The SSH Reverse Tunnel
This is the magic that makes everything work. From your laptop, run:
ssh -N -R 11434:localhost:11434 user@your-vps-ip
What this does:
- `-N`: Don't execute remote commands (tunnel only)
- `-R 11434:localhost:11434`: Forward remote port 11434 (VPS) to local port 11434 (laptop)
- `user@your-vps-ip`: Your VPS SSH credentials
Leave this terminal open. The tunnel is now active.
Using a Different Port?
If port 11434 is taken on your VPS, use any available port:
ssh -N -R 11435:localhost:11434 user@your-vps-ip
Just remember to update the Hermes config accordingly.
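The two values that have to stay in sync are the VPS-side port in the `-R` argument and the port in the Hermes `base_url`. A small sketch that derives both from one number can help avoid a mismatch; the function names here are hypothetical:

```shell
# Port Ollama listens on locally (its default).
OLLAMA_PORT=11434

# Build the ssh arguments for a reverse tunnel on the chosen VPS-side port.
reverse_tunnel_args() {
    remote_port=$1
    echo "-N -R ${remote_port}:localhost:${OLLAMA_PORT}"
}

# The matching base_url for the Hermes config on the VPS.
hermes_base_url() {
    remote_port=$1
    echo "http://localhost:${remote_port}"
}
```

For example, `ssh $(reverse_tunnel_args 11435) user@your-vps-ip` opens the tunnel, and `hermes_base_url 11435` prints the URL to put in each local provider entry.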
Making the Tunnel Persistent (Optional)
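Use autossh for automatic reconnection:
# Install autossh
sudo apt install autossh # Debian/Ubuntu
brew install autossh # macOS
# Run with auto-reconnect (-M 0 turns off autossh's monitor port and relies on SSH keepalives)
autossh -M 0 -N -o ServerAliveInterval=60 -o ServerAliveCountMax=3 -R 11434:localhost:11434 user@your-vps-ip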
Or add to your SSH config (~/.ssh/config on your laptop):
Host vps-ollama
    HostName your-vps-ip
    User your-username
    RemoteForward 11434 localhost:11434
    LocalForward 9119 localhost:9119
    ServerAliveInterval 60
    ServerAliveCountMax 3
# For both OpenClaw and Hermes dashboards
Host vps-both
    HostName your-vps-ip
    User your-username
    RemoteForward 11434 localhost:11434
    LocalForward 18789 localhost:18789
    LocalForward 9119 localhost:9119
    ServerAliveInterval 60
    ServerAliveCountMax 3
Then run ssh vps-ollama or ssh vps-both to establish the tunnels.
Step 4: Accessing the Hermes Dashboard (Forward Tunnel)
The reverse tunnel lets your VPS reach Ollama on your laptop. But what if you want to access Hermes' web dashboard from your laptop? For that, you need a forward tunnel (the -L flag).
Here's the combined command to access both the Hermes dashboard and keep the Ollama tunnel active:
# Forward tunnel for Hermes dashboard + reverse tunnel for Ollama
ssh -L 9119:localhost:9119 -R 11434:localhost:11434 user@your-vps-ip
What this does:
- `-L 9119:localhost:9119`: Forwards your laptop's port 9119 to the VPS's port 9119 (Hermes dashboard)
- `-R 11434:localhost:11434`: Forwards VPS port 11434 to your laptop's Ollama (as before)
Once connected, open http://localhost:9119 on your laptop to access the Hermes web UI.
Running Both OpenClaw and Hermes?
If you run both agents on the same VPS, you can forward both dashboards in one command:
# Both dashboards + Ollama tunnel
ssh -L 18789:localhost:18789 -L 9119:localhost:9119 -R 11434:localhost:11434 user@your-vps-ip
Then access:
- OpenClaw dashboard: http://localhost:18789
- Hermes dashboard: http://localhost:9119
- Ollama API: Available to both agents via the reverse tunnel
Part 3: Configuring Hermes for Local Models
Now add your local Ollama instance to the Hermes config:
# ~/.hermes/config.yaml
custom_providers:
  # Venice.ai models (from Part 1)
  - name: venice-kimi
    base_url: https://api.venice.ai/api/v1
    api_key: ${VENICE_API_KEY}
    model: kimi-k2-5
  # ... other Venice providers ...
  # Local Ollama via SSH tunnel
  # Note (assumption): if Hermes speaks the OpenAI-compatible API to custom
  # providers, point base_url at Ollama's /v1 endpoint: http://localhost:11434/v1
  - name: ollama-qwen
    base_url: http://localhost:11434
    api_key: "ollama" # Ollama doesn't require auth, but needs a placeholder
    model: qwen2.5:7b
  - name: ollama-gemma
    base_url: http://localhost:11434
    api_key: "ollama"
    model: gemma2:2b
  - name: ollama-local
    base_url: http://localhost:11434
    api_key: "ollama"
    model: llama3.2:latest # Change to whatever you have loaded
🎯 Why localhost:11434?
Because the SSH tunnel makes your laptop's Ollama appear as if it's running on the VPS itself. Hermes just sees localhost:11434 and has no idea it's actually forwarding to your laptop across the internet.
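You can watch this happen without Hermes in the loop by calling Ollama's native `/api/generate` endpoint from the VPS. The sketch below builds the JSON request and sends it through the tunnel; the helper names are hypothetical, and the escaping assumes a single-line prompt (newlines and control characters are not handled):

```shell
# Escape backslashes and double quotes so the text is safe inside a JSON string.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a non-streaming request body for Ollama's /api/generate endpoint.
generate_payload() {
    model=$1
    prompt=$(json_escape "$2")
    printf '{"model":"%s","prompt":"%s","stream":false}' "$model" "$prompt"
}

# Send the request to whatever is listening on localhost:11434.
# On the VPS, that is the SSH tunnel leading back to the laptop.
ask_local() {
    curl -s http://localhost:11434/api/generate -d "$(generate_payload "$1" "$2")"
}
```

Running `ask_local qwen2.5:7b "Say hello"` on the VPS should return a JSON response generated on your laptop, provided the tunnel is up and the model has been pulled.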
Part 4: Switching Models on the Fly
The real power comes from being able to instantly switch. In any Telegram chat with Hermes:
/model venice-kimi # Switch to cloud Kimi K2.5
/model ollama-qwen # Switch to local Qwen 2.5
/model venice-glm-flash # Switch to fast cloud model
/model ollama-gemma # Switch to local lightweight model
Each chat can have its own active model. Your coding discussion can use the heavy cloud model while your casual conversation uses a local one, without affecting each other.
Per-Chat Model Persistence
Hermes remembers which model each chat is using. Set up specialized chats:
- Coding Channel → `/model venice-kimi` (best reasoning)
- Quick Questions → `/model ollama-gemma` (fast, free)
- Sensitive Data → `/model ollama-local` (inference stays on your laptop)
Use Cases and Recommendations
When to Use Cloud (Venice.ai)
- Complex coding tasks: Cloud models have better reasoning and larger context windows
- Long documents: Kimi K2.5 handles 256K tokens; local context is limited by your VRAM
- Tool use: Cloud models are better at function calling and multi-step workflows
- When you're away from your laptop: The tunnel needs to be active for local models
When to Use Local (Ollama)
- Quick lookups: "What's the syntax for..." type questions
- Private data: Financial info, personal documents, proprietary code
- Offline work: On a plane or a spotty connection, by querying Ollama directly on the laptop (the Hermes route still needs the tunnel to be up)
- Cost control: Running 100+ small queries daily adds up in the cloud
My Personal Workflow
Here's what works for me:
- Keep the SSH tunnel running in a tmux session on my laptop
- Default to `ollama-qwen` for most day-to-day chatting
- Switch to `venice-kimi` when I need serious coding help
- Use `venice-glm-flash` when I want speed over depth
- Any sensitive prompts automatically go to local models
Troubleshooting
"Connection refused" when using local models
The SSH tunnel isn't active or Ollama isn't running on your laptop:
# Test from your VPS
curl http://localhost:11434/api/tags
# If this fails, check the tunnel is running
# Then, on your laptop, check Ollama is listening on port 11434
netstat -tln | grep 11434 # Linux; should show 127.0.0.1:11434 (or 0.0.0.0:11434)
lsof -nP -iTCP:11434 -sTCP:LISTEN # macOS equivalent
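curl's exit code usually tells you which half of the chain is broken. As a rough sketch (the function names are hypothetical, and the mapping reflects typical behavior rather than a guarantee): exit code 7 means nothing accepted the connection on the VPS port, while 52 or 56 means the tunnel accepted it but the laptop end closed it.

```shell
# Translate a curl exit code into a likely cause for this setup.
explain_curl_exit() {
    case "$1" in
        0)     echo "ok: tunnel and Ollama are both reachable" ;;
        7)     echo "nothing is listening on the VPS port: the SSH tunnel is probably down" ;;
        52|56) echo "tunnel is up, but the laptop end closed the connection: is Ollama running?" ;;
        28)    echo "timed out: the tunnel may be stalled" ;;
        *)     echo "unexpected curl exit code $1" ;;
    esac
}

# Run on the VPS. Optional first argument: the VPS-side tunnel port.
check_ollama() {
    curl -s -o /dev/null --max-time 5 "http://localhost:${1:-11434}/api/tags"
    explain_curl_exit $?
}
```

Run `check_ollama` on the VPS, or `check_ollama 11435` if you moved the tunnel to another port.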
"Model not found" errors
Make sure you've pulled the model on your laptop:
ollama list # See what's available
ollama pull qwen2.5:7b # Pull if missing
Tunnel disconnects frequently
Use autossh or add keepalive settings:
ssh -N -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
-R 11434:localhost:11434 user@your-vps-ip
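If you'd rather not install autossh, a small reconnect loop does a similar job. This is a sketch under a few assumptions: key-based login (so ssh never prompts), a retry delay that doubles up to 60 seconds, and hypothetical function names. `ExitOnForwardFailure=yes` makes ssh exit instead of lingering when the VPS port is already taken.

```shell
# Double the delay, capped at 60 seconds.
next_delay() {
    doubled=$(( $1 * 2 ))
    if [ "$doubled" -gt 60 ]; then echo 60; else echo "$doubled"; fi
}

# Re-establish the reverse tunnel whenever ssh exits.
# Usage: keep_tunnel_up user@your-vps-ip
keep_tunnel_up() {
    delay=5
    while true; do
        start=$(date +%s)
        ssh -N -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
            -o ExitOnForwardFailure=yes \
            -R 11434:localhost:11434 "$1"
        # A session that lasted more than two minutes resets the backoff.
        if [ $(( $(date +%s) - start )) -gt 120 ]; then delay=5; fi
        echo "tunnel exited; reconnecting in ${delay}s" >&2
        sleep "$delay"
        delay=$(next_delay "$delay")
    done
}
```

Run `keep_tunnel_up user@your-vps-ip` inside tmux and stop it with Ctrl-C.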
Local models are too slow
You're probably CPU-bound. Check GPU usage:
# NVIDIA
nvidia-smi
# macOS (Apple Silicon)
# Activity Monitor → Window → GPU History
If Ollama isn't using your GPU, check the GPU support docs.
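Ollama can also report where a loaded model is running: `ollama ps` lists each loaded model with a PROCESSOR column such as `100% GPU` or `100% CPU`. The filter below is a small sketch that assumes that column layout; `not_fully_on_gpu` is a hypothetical name.

```shell
# Reads `ollama ps` output on stdin and prints the names of loaded models
# that are not running entirely on the GPU.
not_fully_on_gpu() {
    # Skip the header row; keep rows whose PROCESSOR column is not "100% GPU".
    awk 'NR > 1 && $0 !~ /100% GPU/ { print $1 }'
}
```

Send a prompt first so the model is loaded, then run `ollama ps | not_fully_on_gpu` on the laptop. Any model it prints is running partly or wholly on the CPU, which usually means it doesn't fit in VRAM.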
Security Considerations
Lock Down Your Setup
The SSH tunnel exposes Ollama to your VPS. Consider these precautions:
- Ensure your VPS firewall only allows SSH (port 22) from trusted IPs
- Use SSH key authentication, not passwords
- Bind Ollama to localhost only and rely solely on the tunnel; the reverse tunnel connects to localhost on the laptop, so nothing more is needed
- Monitor your VPS with `fail2ban` or similar
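The beauty of this approach is that even if someone compromised your VPS, they'd only see the forwarded port, not your actual laptop or home network. They could still send requests to Ollama through that port, so stop the tunnel if you suspect a breach.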
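One thing worth verifying is that the forwarded port is bound to loopback only. By default, sshd binds remote forwards to the loopback address (the `GatewayPorts no` setting), so the port should not be reachable from the public internet. The sketch below checks that from `ss -tln` output; `loopback_only` is a hypothetical helper name.

```shell
# Reads `ss -tln` output on stdin.
# Succeeds only if the port has at least one listener and every listener
# is bound to a loopback address (127.0.0.1 or [::1]).
loopback_only() {
    awk -v port="$1" '
        {
            for (i = 1; i <= NF; i++) {
                # The first field ending in ":<port>" is the local address.
                if ($i ~ (":" port "$")) {
                    found = 1
                    if ($i !~ /^(127\.0\.0\.1|\[::1\]):/) exposed = 1
                    break
                }
            }
        }
        END { if (found && !exposed) exit 0; exit 1 }
    '
}
```

On the VPS, `ss -tln | loopback_only 11434 && echo "loopback only"` confirms the tunnel port is private. The same check works on a Linux laptop for Ollama itself.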
Final Thoughts
The hybrid cloud/local setup gives you something no single approach can: flexibility. You're not locked into per-token pricing for every query, but you're also not limited by your laptop's hardware when you need serious compute.
The SSH tunnel is the unsung hero here. It's simple, secure, and lets you treat your laptop as an extension of your VPS. Combined with Hermes' per-chat model switching, you get a truly personalized AI experience.
Start with one cloud provider and one local model. Once that works, expand your roster. The config is just YAML: easy to tweak, version control, and share.