Skip to main content

Benchmarking Local LLMs on an RTX 3060: What 12GB of VRAM Can Actually Do 

03 Apr 2026 · Originally published on x.com

Standard benchmarks like MMLU and HumanEval tell you how smart a model is on paper. They don’t tell you if it can parse a messy recipe, draft a firm email to your landlord, diagnose an nginx error, or call the right tool when your Docker container is down.

After building my homelab with an RTX 3060 12GB, I needed to figure out which models actually work for daily use. So I built a practical benchmark that tests what matters — and the results surprised me.

Why Build My Own Benchmark?

When I first started using Large Language Models(LLMs) I tried relying on leaderboards and public benchmark scores. But with more experience using these models, I realized that they don’t really reflect day-to-day usage. A model that scores 85% on MMLU might still hallucinate Caddy directives or burn its entire token budget on internal reasoning before producing an empty response.

I wanted to know: if I ask this model to do something I’d actually ask it on a Tuesday evening, does it deliver?

The Benchmark

18 tests across 6 categories, each scored out of 5 by hand. No auto-scoring — regex matching rewards format over substance. Total possible score: 90.

Knowledge & Learning (3 tests) — Can it explain DNS using Git branching as an analogy? Compare SQLite vs PostgreSQL for a small deployment? Break down a TLS handshake for a backend developer?

Practical Text Tasks (3 tests) — Parse a typo-filled recipe into structured JSON. Draft a firm-but-polite landlord email citing lease terms. Translate a German contract termination letter and advise on next steps.

Technical & Homelab (3 tests) — Diagnose an “Address already in use” nginx error in 3 sentences. Generate a Docker Compose file for Postgres 16. Review a Caddyfile with intentional problems.

Extraction & Classification (2 tests) — Classify support tickets by category and priority as JSON. Extract entities from a business conversation.

Instruction Following (2 tests) — Write a product review with 6 simultaneous constraints. Return JSON with an exact schema.

Tool Calling (5 tests, 25 points) — The most heavily weighted section. Can it call the right tool, handle multi-tool calls, know when not to use a tool, and investigate before acting on vague requests?

View the full benchmark script
#!/bin/bash
# ============================================================================
# Local LLM Benchmark v2
#
# Runs practical tasks against any OpenAI-compatible endpoint.
# Captures outputs, timing, and token counts. Scoring is done manually.
#
# Usage:
# ./bench.sh <base_url> <model_name>
#
# Examples:
# ./bench.sh http://192.168.8.123:11434/v1 qwen3.5:9b
# ./bench.sh http://192.168.8.123:8081/v1 bonsai-8b.gguf
# ============================================================================
set -euo pipefail
BASE_URL="${1:?Usage: ./bench.sh <base_url> <model_name>}"
MODEL="${2:?Usage: ./bench.sh <base_url> <model_name>}"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
RESULT_DIR="./results"
RESULT_FILE="${RESULT_DIR}/${MODEL//[:\/]/_}_${TIMESTAMP}.md"
TEST_COUNT=0
mkdir -p "$RESULT_DIR"
CYAN='\033[0;36m'
YELLOW='\033[1;33m'
NC='\033[0m'
# ── Helpers ─────────────────────────────────────────────────────────────────
chat() {
local prompt="$1"
local temp="${2:-0.3}"
local max_tokens="${3:-2048}"
curl -s --max-time 300 "$BASE_URL/chat/completions" \
-H "Content-Type: application/json" \
-d "$(jq -n \
--arg model "$MODEL" \
--arg prompt "$prompt" \
--argjson temp "$temp" \
--argjson max "$max_tokens" \
'{model:$model, messages:[{role:"user",content:$prompt}], temperature:$temp, max_tokens:$max}'
)" 2>/dev/null
}
chat_with_tools() {
local prompt="$1"
local tools="$2"
local temp="${3:-0.1}"
local max_tokens="${4:-2048}"
curl -s --max-time 300 "$BASE_URL/chat/completions" \
-H "Content-Type: application/json" \
-d "$(jq -n \
--arg model "$MODEL" \
--arg prompt "$prompt" \
--argjson tools "$tools" \
--argjson temp "$temp" \
--argjson max "$max_tokens" \
'{model:$model, messages:[{role:"user",content:$prompt}], tools:$tools, temperature:$temp, max_tokens:$max}'
)" 2>/dev/null
}
get_content() {
local c
c=$(echo "$1" | jq -r '.choices[0].message.content // ""')
if [ -z "$c" ]; then
local reasoning
reasoning=$(echo "$1" | jq -r '.choices[0].message.reasoning // empty' 2>/dev/null)
[ -n "$reasoning" ] && c="[THINKING MODE: content empty — reasoning tokens consumed the budget]"
fi
echo "$c"
}
run_test() {
local name="$1"
local prompt="$2"
local temp="${3:-0.3}"
local max_tokens="${4:-2048}"
TEST_COUNT=$((TEST_COUNT + 1))
printf " [%2d] %-50s" "$TEST_COUNT" "$name"
local start end resp content tokens ptok ctok wall
start=$(date +%s%N)
resp=$(chat "$prompt" "$temp" "$max_tokens")
end=$(date +%s%N)
content=$(get_content "$resp")
ptok=$(echo "$resp" | jq -r '.usage.prompt_tokens // 0')
ctok=$(echo "$resp" | jq -r '.usage.completion_tokens // 0')
wall=$(( (end - start) / 1000000 ))
printf "${CYAN}%5dms %4d tok${NC}\n" "$wall" "$ctok"
{
echo "### $TEST_COUNT. $name"
echo ""
echo "**Time:** ${wall}ms | **Tokens:** ${ptok}→${ctok}"
echo ""
echo "**Prompt:**"
echo '```'
printf '%s\n' "$prompt"
echo '```'
echo ""
echo "**Output:**"
echo ""
printf '%s\n' "$content"
echo ""
echo "**Score:** /5"
echo ""
echo "---"
echo ""
} >> "$RESULT_FILE"
}
run_tool_test() {
local name="$1"
local prompt="$2"
local tools="$3"
local temp="${4:-0.1}"
local max_tokens="${5:-2048}"
TEST_COUNT=$((TEST_COUNT + 1))
printf " [%2d] %-50s" "$TEST_COUNT" "$name"
local start end resp content tool_calls tokens ptok ctok wall
start=$(date +%s%N)
resp=$(chat_with_tools "$prompt" "$tools" "$temp" "$max_tokens")
end=$(date +%s%N)
content=$(get_content "$resp")
tool_calls=$(echo "$resp" | jq '.choices[0].message.tool_calls // empty' 2>/dev/null)
ptok=$(echo "$resp" | jq -r '.usage.prompt_tokens // 0')
ctok=$(echo "$resp" | jq -r '.usage.completion_tokens // 0')
wall=$(( (end - start) / 1000000 ))
printf "${CYAN}%5dms %4d tok${NC}\n" "$wall" "$ctok"
{
echo "### $TEST_COUNT. $name"
echo ""
echo "**Time:** ${wall}ms | **Tokens:** ${ptok}→${ctok}"
echo ""
echo "**Prompt:**"
echo '```'
printf '%s\n' "$prompt"
echo '```'
echo ""
if [ -n "$tool_calls" ] && [ "$tool_calls" != "null" ]; then
echo "**Tool Calls:**"
echo '```json'
printf '%s\n' "$tool_calls" | jq '.' 2>/dev/null || printf '%s\n' "$tool_calls"
echo '```'
else
echo "**Tool Calls:** none"
fi
echo ""
if [ -n "$content" ]; then
echo "**Text Response:**"
echo ""
printf '%s\n' "$content"
fi
echo ""
echo "**Score:** /5"
echo ""
echo "---"
echo ""
} >> "$RESULT_FILE"
}
section() {
echo ""
echo -e "${YELLOW}── $1 ──${NC}"
{ echo ""; echo "## $1"; echo ""; } >> "$RESULT_FILE"
}
# ── Health Check ────────────────────────────────────────────────────────────
echo ""
echo "========================================"
echo " Local LLM Benchmark v2"
echo "========================================"
echo " Endpoint: $BASE_URL"
echo " Model: $MODEL"
echo ""
health=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$BASE_URL/models" 2>/dev/null || echo "000")
if [ "$health" != "200" ]; then
echo "ERROR: Endpoint not reachable (HTTP $health)"
exit 1
fi
cat > "$RESULT_FILE" <<EOF
# Benchmark: $MODEL
- **Date:** $(date +%Y-%m-%d\ %H:%M)
- **Endpoint:** $BASE_URL
---
EOF
# ========================================================================
# 1. KNOWLEDGE & LEARNING
# ========================================================================
section "Knowledge & Learning"
run_test "Analogical Explanation" \
'I understand how Git branching works really well. Using that as an analogy, explain how DNS resolution works — mapping the concepts (branches, merges, remotes, etc.) to DNS equivalents. Keep it under 200 words.' \
0.5 2048
run_test "Tradeoff Comparison" \
'I am choosing between SQLite and PostgreSQL for a self-hosted app that will have 1-5 users. Give me a comparison table with at least 5 factors, then a one-sentence recommendation for my situation.' \
0.3 2048
run_test "Concept Breakdown" \
'Break down how HTTPS/TLS handshake works into 5 simple steps. For each step, give a one-sentence explanation that a backend developer (not a security expert) would understand. Number each step.' \
0.3 2048
# ========================================================================
# 2. PRACTICAL TEXT TASKS
# ========================================================================
section "Practical Text Tasks"
run_test "Messy Text → Structured JSON" \
'Parse this messy recipe into clean JSON with fields: name, servings (number), prep_time, ingredients (array of {item, amount}), steps (array of strings). Return ONLY the JSON.
"ok so this is my grandmas famous pasta sauce lol. feeds about 6 people, takes maybe 45 min to prep. you need like 2 cans of crushed tomatoes (28oz each), 1 pound ground beef, half an onion diced, 3 cloves garlic minced, 2 tbsp olive oil, salt and pepper to taste, and a handful of fresh basil. brown the beef first then saute the onion and garlic in olive oil until soft, add the tomatoes and simmer for 30 min, season with salt pepper and basil at the end"' \
0.3 2048
run_test "Formal Email Draft" \
'Draft a polite but firm reply to this email from my landlord. I want to push back without being rude. Keep it under 100 words.
"""Dear Tenant, We are increasing your monthly rent by 15% effective next month. This is due to rising maintenance costs. Please confirm your acceptance. Regards, Management"""
My points: the lease requires 60 days written notice for rent changes, and the lease caps annual increases at 5%.' \
0.5 2048
run_test "Translate, Explain, Advise" \
'Translate this German text to English, then tell me in one sentence: what kind of document is this, and do I need to do anything about it?
"Sehr geehrte Damen und Herren, hiermit kündigen wir den Vertrag Nr. 4712-B zum nächstmöglichen Zeitpunkt gemäß §14 der Allgemeinen Geschäftsbedingungen. Bitte bestätigen Sie den Eingang dieses Schreibens und teilen Sie uns das genaue Kündigungsdatum mit. Mit freundlichen Grüßen"' \
0.3 2048
# ========================================================================
# 3. TECHNICAL & HOMELAB
# ========================================================================
section "Technical & Homelab"
run_test "Error Diagnosis" \
'What is wrong here and how do I fix it? Max 3 sentences.
```
nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)
nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)
nginx: [emerg] still could not bind()
```' \
0.3 1024
run_test "Docker Compose Generation" \
'Generate a Docker Compose file for a Postgres 16 database with:
- Database: appdb, User: appuser, password from env var DB_PASSWORD
- Named volume "pgdata" for persistence
- Port 5432 exposed
- A healthcheck using pg_isready
Return ONLY the YAML.' \
0.3 2048
run_test "Config Review" \
'Review this Caddyfile for problems. List each issue and how to fix it.
```
:80 {
reverse_proxy /api* localhost:3000
reverse_proxy localhost:8080
tls internal
header {
X-Frame-Options "DENY"
-Server
}
log {
output file /var/log/caddy/access.log
}
}
```' \
0.3 2048
# ========================================================================
# 4. EXTRACTION & CLASSIFICATION
# ========================================================================
section "Extraction & Classification"
run_test "Ticket Classification" \
'Classify each support ticket. Return a JSON array where each object has: id (number), category (bug|feature_request|billing|account|documentation), priority (low|medium|high|critical). Return ONLY valid JSON.
1. "App crashes every time I try to export a PDF. Losing client work."
2. "Would be great if you could add dark mode."
3. "I was charged twice this month, need a refund ASAP."
4. "How do I reset my password? Cannot find the option."
5. "The API docs say POST but the endpoint only accepts GET."
6. "Entire dashboard is down, nothing loads for any of our 200 users."' \
0.1 2048
run_test "Entity Extraction" \
'Extract all named entities from this conversation. Return ONLY a JSON object with: people (array), companies (array), dates (array), amounts (array), locations (array).
"Hey Sarah, just got off the call with the Stripe team. They proposed a $50,000 annual contract starting January 15th. Mike from their sales team said they can do onboarding at their San Francisco office or remotely. We need legal review done before the board meeting on February 3rd. Also, Lisa mentioned Datadog offered us a competing deal at $42,000."' \
0.1 2048
# ========================================================================
# 5. INSTRUCTION FOLLOWING
# ========================================================================
section "Instruction Following"
run_test "Six Constraints" \
'Write a short product review for wireless earbuds. Follow ALL these rules:
1. Exactly 4 sentences
2. First sentence must start with the word "After"
3. Second sentence must contain a negative point
4. Include a star rating as the last line: "Rating: X/5"
5. Total under 80 words
6. No brand names' \
0.3 2048
run_test "Exact JSON Schema" \
'Return a JSON object with EXACTLY these 3 fields, no more, no less:
- "can_do": array of exactly 3 things you can help with
- "cannot_do": array of exactly 3 things you cannot help with
- "version": the exact string "v1"
No explanation, no markdown, no extra fields.' \
0.1 2048
# ========================================================================
# 6. TOOL CALLING
# ========================================================================
section "Tool Calling"
TOOLS='[
{
"type": "function",
"function": {
"name": "get_container_status",
"description": "Get the status of a Docker container by name",
"parameters": {
"type": "object",
"properties": {
"container_name": {"type": "string", "description": "Name of the Docker container"}
},
"required": ["container_name"]
}
}
},
{
"type": "function",
"function": {
"name": "restart_service",
"description": "Restart a service on the homelab server",
"parameters": {
"type": "object",
"properties": {
"service_name": {"type": "string", "description": "Name of the service to restart"},
"force": {"type": "boolean", "description": "Force restart even if service is healthy"}
},
"required": ["service_name"]
}
}
},
{
"type": "function",
"function": {
"name": "search_logs",
"description": "Search recent logs for a service",
"parameters": {
"type": "object",
"properties": {
"service_name": {"type": "string", "description": "Name of the service"},
"query": {"type": "string", "description": "Search query for log contents"},
"since": {"type": "string", "description": "Time range, e.g. 1h, 30m, 2d"}
},
"required": ["service_name", "query"]
}
}
},
{
"type": "function",
"function": {
"name": "send_notification",
"description": "Send a notification message via Telegram",
"parameters": {
"type": "object",
"properties": {
"message": {"type": "string", "description": "Notification message text"},
"priority": {"type": "string", "enum": ["low", "normal", "high"], "description": "Message priority"}
},
"required": ["message"]
}
}
}
]'
run_tool_test "Simple Tool Call" \
"Is my Ollama container running?" \
"$TOOLS"
run_tool_test "Right Tool for the Job" \
"Immich has been throwing 500 errors for the last hour. Can you check what is going on?" \
"$TOOLS"
run_tool_test "Multiple Tools in One Request" \
"Force restart the Caddy container and send me a high priority Telegram alert about the restart." \
"$TOOLS"
run_tool_test "Should NOT Use a Tool" \
"What is the difference between Docker volumes and bind mounts?" \
"$TOOLS"
run_tool_test "Vague Request — Investigate, Don't Act" \
"Something seems off with Nginx" \
"$TOOLS"
# ── Scorecard ───────────────────────────────────────────────────────────────
cat >> "$RESULT_FILE" <<'EOF'
---
## Scorecard
| # | Test | Score | Notes |
|---|------|-------|-------|
| 1 | Analogical Explanation | /5 | |
| 2 | Tradeoff Comparison | /5 | |
| 3 | Concept Breakdown | /5 | |
| 4 | Messy Text → Structured JSON | /5 | |
| 5 | Formal Email Draft | /5 | |
| 6 | Translate, Explain, Advise | /5 | |
| 7 | Error Diagnosis | /5 | |
| 8 | Docker Compose Generation | /5 | |
| 9 | Config Review | /5 | |
| 10 | Ticket Classification | /5 | |
| 11 | Entity Extraction | /5 | |
| 12 | Six Constraints | /5 | |
| 13 | Exact JSON Schema | /5 | |
| 14 | Simple Tool Call | /5 | |
| 15 | Right Tool for the Job | /5 | |
| 16 | Multiple Tools in One Request | /5 | |
| 17 | Should NOT Use a Tool | /5 | |
| 18 | Vague Request — Investigate | /5 | |
| | **Total** | **/90** | |
### Scoring Guide
- **5** — Perfect or near-perfect. Output is directly usable.
- **4** — Good. Minor issues but still usable with minimal editing.
- **3** — Okay. Gets the idea right but needs significant cleanup.
- **2** — Poor. Partially correct but mostly unusable.
- **1** — Bad. Wrong approach, misunderstood the task, or format is broken.
- **0** — Failed. No output, empty, or completely wrong.
EOF
echo ""
echo "========================================"
echo " Done — $TEST_COUNT tests completed"
echo " Results: $RESULT_FILE"
echo "========================================"
echo ""

The Models

All models run on the same RTX 3060 12GB via Ollama, except Bonsai-8B which uses a llama.cpp fork by PrismML.

ModelParametersVRAMSpeed
Gemma 4 E4B8B total, 4.5B effective9.8GB~77 tok/s
Qwen 3.5 9B9B6.6GB~38 tok/s
Qwopus 3.5 9B9B (Opus distill)6.5GB~23 tok/s
Bonsai-8B8B (1-bit quant, 1.15GB on disk)~2GB~99 tok/s
Qwen3-Coder 30B-A3B30B (3B active, MoE)11.4GB+~0.5 tok/s

Judge the Models Yourself

Before I share the results, try judging the outputs blind. Each section below shows real model outputs — anonymized. Pick the one you think is best, then reveal the results.

Round 1: Formal Email Draft

Prompt: Draft a polite but firm reply to this email from my landlord. I want to push back without being rude. Keep it under 100 words.

“Dear Tenant, We are increasing your monthly rent by 15% effective next month. This is due to rising maintenance costs. Please confirm your acceptance. Regards, Management”

My points: the lease requires 60 days written notice for rent changes, and the lease caps annual increases at 5%.

Test: Formal Email Draft

Who wrote the best landlord pushback email?

Which model gave the best response? Read the outputs below and cast your vote.

Model A
Dear Management,

Thank you for the notice regarding the rent increase. However, I am unable to accept this proposal. Our lease agreement stipulates a 60-day written notice for rent adjustments, which has not been satisfied. Furthermore, the contract caps annual increases at 5%, not 15%.

I value our tenancy and want to ensure we remain compliant with the lease terms. Please review the agreement and discuss a compliant adjustment. I look forward to your response.

Best regards,

[Your Name]
Model B
Dear Management,

Thank you for informing me of the proposed rent increase.

Before confirming acceptance, I need to refer to the terms of my current lease agreement. Specifically, the lease mandates 60 days written notice for any rent adjustments, and it also stipulates that annual increases are capped at 5%.

Could you please provide clarification on how this proposed 15% increase aligns with these contractual terms? I look forward to your review of this matter.

Sincerely,
[Your Name]
Model C
Dear Landlord,

Thank you for the notice regarding the rent increase. However, I've reviewed our lease and it requires 60 days written notice for rent changes and caps annual increases at 5%. The proposed 15% increase violates both provisions.

Please reconsider this change and withdraw the notice, or adjust the increase to comply with our agreement.

Sincerely,
[Tenant]
Model D
Dear [Landlord's Name],

Thank you for the notice. I understand the increase is due to rising maintenance costs. However, I want to clarify that the lease requires 60 days' written notice for rent changes, and annual increases are capped at 5%. I respectfully request that this adjustment be reconsidered in light of these terms. Please let me know if you'd like to discuss further.

Best regards,
[Your Name]

Round 2: DNS Explained via Git (Analogical Explanation)

Prompt: I understand how Git branching works really well. Using that as an analogy, explain how DNS resolution works — mapping the concepts (branches, merges, remotes, etc.) to DNS equivalents. Keep it under 200 words.

Test: Analogical Explanation

Which model mapped Git concepts to DNS most effectively?

Which model gave the best response? Read the outputs below and cast your vote.

Model A
DNS resolution can be likened to Git branching, merging, and remotes. The DNS server acts like a remote — it's the central authority that resolves domain names to IP addresses, much like a remote in Git stores changes from a shared repository.

When you type a domain name (like example.com), the DNS server performs a lookup, similar to how Git pulls changes from a remote. It follows a chain of authoritative servers, much like Git follows a branching structure through a repository's history.

A DNS record (A, CNAME, etc.) is like a branch — it defines how the domain should be resolved. For instance, an A record points to an IP address, just as a Git branch points to a specific point in the history.

When you access a website, the DNS system resolves the domain to an IP, which then directs your request to the web server. This process is similar to a merge — it resolves conflicts by choosing the correct path.

DNS propagation is like a merge conflict — it takes time for changes to propagate across the network. DNS caching is like a local branch — it stores resolved information to speed up future lookups.
Model B
In Git, branches represent parallel development paths; similarly, DNS zones represent hierarchical namespace segments. Each zone (like a Git branch) contains authoritative records — immutable snapshots of IP mappings akin to Git commits.

Authoritative name servers function as Git remotes, holding the "truth" for their zone's records. When a resolver queries DNS, it's like performing a git checkout: the system traverses from the root zone (origin/master) down through sub-zones, consulting authoritative servers at each level.

TTL values act like merge commit timestamps, determining how long a cached record remains valid before refreshing — preventing constant re-fetching while ensuring eventual consistency.

DNS propagation mirrors Git merge conflicts: when a zone updates, the change must propagate through the resolver hierarchy. Until the TTL expires, resolvers may still see stale records, much like a merged branch might show outdated data until the merge completes.

Ultimately, DNS resolution is Git's checkout command executed across a distributed, redundant network of remotes.
Model C
Think of the internet as a massive Git repository.

The Domain Name (e.g., example.com) is the overall repository. To find a service, you don't check out the whole repo; you query specific branches.

The process starts with your local machine querying the Root DNS Servers (the remote origin). These roots don't know the IP, but they point you to the next set of servers — the TLD Servers (.com). This is like fetching the initial remote pointers.

The TLD servers then point you to the Authoritative Nameservers for example.com. These hold the final record (the commit): the A record, which maps the branch (www) to the final IP Address (the current state).

When you query, you are essentially performing a controlled git pull across multiple, interconnected remotes until you reach the definitive, resolved commit (the IP address).

Round 3: Caddyfile Config Review

Prompt: Review this Caddyfile for problems. List each issue and how to fix it.

:80 {
reverse_proxy /api* localhost:3000
reverse_proxy localhost:8080
tls internal
header {
X-Frame-Options "DENY"
-Server
}
log {
output file /var/log/caddy/access.log
}
}

This test only had 2 models produce output — Gemma 4 and Bonsai-8B. Qwen 3.5 and Qwopus both returned empty (thinking mode consumed all tokens). This makes it a perfect showcase for how 1-bit quantization affects technical accuracy. Both outputs were long (1400 and 2048 tokens) — what you see below is trimmed, with the full versions in the detailed output sections further down.

Test: Config Review

Which model found real issues in the Caddyfile?

Which model gave the best response? Read the outputs below and cast your vote.

Model A
This Caddyfile is generally functional, but it has several issues related to scope, best practices, and potential conflicts.

Issue 1: Multiple reverse_proxy Directives (Scope/Conflict)
The two reverse_proxy directives are competing. reverse_proxy /api* localhost:3000 handles paths starting with /api. reverse_proxy localhost:8080 (without a path prefix) will attempt to proxy all incoming requests, potentially overriding the API proxy.
Fix: Use path matching for all proxies — reverse_proxy /api* localhost:3000 and reverse_proxy /* localhost:8080.

Issue 2: Ambiguous tls internal
tls internal is syntactically valid for generating a self-signed certificate, but is poor practice for production. If you are exposing this to the public internet, use a real domain name so Caddy can automatically obtain a valid certificate via Let's Encrypt.

Issue 3: Missing HTTP to HTTPS Redirect
Since you are only listening on port 80 (:80), users will not be automatically upgraded to HTTPS.
Fix: Use a domain name and let Caddy handle the redirect automatically.

The header block and log block are correct.

[Full output included a recommended refactored Caddyfile — see detailed outputs below]
Model B
Your Caddyfile has several issues that could lead to unexpected behavior, security vulnerabilities, or configuration errors.

1. reverse_proxy Syntax is Incorrect
Caddy uses wildcards in the reverse_proxy directive, but the syntax is not correct. The correct syntax requires /localhost:8080.
[This is wrong — the original syntax is valid in Caddy.]

2. Missing tls Directive
Caddy requires a full tls block with cert, key, internal, and listen directives.
[This is wrong — tls internal is valid shorthand in Caddy.]

3. Missing listen Directive in tls Block
[This directive does not exist in Caddy.]

4. The -Server line is not valid in Caddy's header block.
[This is wrong — -Server is valid Caddy syntax to remove a header.]

5. Missing ports Directive in tls Block
[This directive does not exist in Caddy.]

6. Missing server Directive
[Caddy does not have a server directive.]

...continued for 11 total issues, most fabricated. Recommended a completely rewritten config using non-existent Caddy syntax including server blocks, force-ssl, and ports directives.

Full Results

Now for the full picture. Here’s how every model scored across all 18 tests.

Gemma 4 E4B — 82/90 (91%)

The clear winner.

SectionScoreNotes
Knowledge & Learning14/15Analogy slightly stretched in places
Practical Text Tasks15/15Perfect across all three
Technical & Homelab14/15Missed TLS-on-:80 contradiction in config review
Extraction & Classification10/10Perfect
Instruction Following10/10Nailed all 6 constraints, exact JSON schema
Tool Calling23/25”No tool needed” answer was too verbose; vague request asked questions instead of investigating

With only 4.5B effective parameters, Gemma 4 E4B outscored every 9B model. It manages its thinking budget well — uses reasoning internally without burning the token budget.

View Gemma 4 detailed outputs

Test 4 — Messy Text to Structured JSON:

{
"name": "Grandma's Famous Pasta Sauce",
"servings": 6,
"prep_time": "45 minutes",
"ingredients": [
{"item": "crushed tomatoes", "amount": "2 cans (28oz each)"},
{"item": "ground beef", "amount": "1 pound"},
{"item": "onion", "amount": "half, diced"},
{"item": "garlic", "amount": "3 cloves, minced"},
{"item": "olive oil", "amount": "2 tbsp"},
{"item": "salt and pepper", "amount": "to taste"},
{"item": "fresh basil", "amount": "a handful"}
],
"steps": [
"Brown the beef.",
"Saut onion and garlic in olive oil until soft.",
"Add the tomatoes and simmer for 30 minutes.",
"Season with salt, pepper, and basil at the end."
]
}

Test 8 — Docker Compose Generation:

version: '3.8'
services:
postgres:
image: postgres:16
container_name: postgres_db
environment:
POSTGRES_DB: appdb
POSTGRES_USER: appuser
POSTGRES_PASSWORD: ${DB_PASSWORD}
volumes:
- pgdata:/var/lib/postgresql/data
ports:
- "5432:5432"
healthcheck:
test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
interval: 10s
timeout: 5s
retries: 5
volumes:
pgdata:
driver: local

Test 12 — Six Constraints (earbuds review):

After using these earbuds for a week, I am genuinely impressed by the sound quality and comfort. The carrying case feels a bit flimsy, which is a minor disappointment. However, the battery life is exceptional, easily lasting through my longest commutes. Overall, these are fantastic little audio companions for daily use. Rating: 4/5

Test 14 — Simple Tool Call (“Is my Ollama container running?”):

[{"function": {"name": "get_container_status", "arguments": "{\"container_name\":\"Ollama\"}"}}]

Test 16 — Multiple Tools (“Force restart Caddy + Telegram alert”):

[
{"function": {"name": "restart_service", "arguments": "{\"force\":true,\"service_name\":\"Caddy\"}"}},
{"function": {"name": "send_notification", "arguments": "{\"message\":\"Caddy container was force restarted.\",\"priority\":\"high\"}"}}
]

Qwopus 3.5 9B — 69/90 (77%)

The surprise performer. A community distillation of Claude Opus into Qwen 3.5.

SectionScoreNotes
Knowledge & Learning14/15Strong analogies (TTL = merge timestamps). Wishy-washy recommendation
Practical Text Tasks15/15Email was more direct than Gemma (“violates both provisions”)
Technical & Homelab5/152 of 3 tests returned empty (thinking mode)
Extraction & Classification10/10Perfect
Instruction Following5/101 test empty (thinking mode)
Tool Calling24/25Checked status AND logs for investigation

Where it produces output, the quality matches or exceeds Gemma 4. But 3 tests returned empty because thinking tokens consumed the entire token budget. Estimated score without the thinking issue: ~84/90. If Ollama adds thinking budget control, this model becomes the top performer.

View Qwopus detailed outputs

Test 3 — TLS Handshake Breakdown:

  1. The client initiates the connection by sending supported TLS versions, cipher suites, and a random value to the server.
  2. The server responds with its chosen encryption method, its digital certificate containing the public key, and key exchange parameters.
  3. The client validates the server’s certificate against trusted root certificates to verify the server’s identity before proceeding.
  4. Both parties exchange cryptographic material to derive a shared session key that will encrypt all subsequent traffic.
  5. Both sides confirm the handshake succeeded with finished messages, and encrypted application data begins flowing.

Test 6 — German Translation:

Translation: “Dear Sir or Madam, we hereby terminate Contract No. 4712-B effective on the earliest possible date in accordance with §14 of the General Terms and Conditions. Please confirm receipt of this letter and inform us of the exact termination date. Sincerely,”

Document type and required action: This is a contract termination notice (Kündigungsschreiben), and yes, you need to respond by confirming receipt of the letter and providing the exact termination date as requested.

Test 15 — Right Tool (“Immich throwing 500 errors”):

[
{"function": {"name": "get_container_status", "arguments": "{\"container_name\":\"immich\"}"}},
{"function": {"name": "search_logs", "arguments": "{\"service_name\":\"immich\",\"query\":\"500 error\",\"since\":\"1h\"}"}}
]

Qwopus checked both status AND logs — investigating before acting. Gemma only searched logs.

Bonsai-8B — 63/90 (70%)

Qwen3-8B quantized down to 1-bit. The speed demon.

SectionScoreNotes
Knowledge & Learning9/15Over word limit, indecisive recommendations
Practical Text Tasks11/15JSON extraction excellent; couldn’t translate German text
Technical & Homelab6/15Hallucinated Caddy directives, broken Docker Compose
Extraction & Classification8/10Solid with minor formatting issues
Instruction Following9/10Near-perfect constraint adherence
Tool Calling20/25Missed Telegram notification in multi-tool test

Most tests completed in under 2 seconds. At 99 tok/s and only ~2GB of VRAM during inference (1.15GB model weights + KV cache), it runs alongside other models. But 1-bit quantization has clear limits — it fabricated non-existent Caddy directives and couldn’t handle translation.

View Bonsai-8B detailed outputs

Test 4 — Messy Text to Structured JSON (2516ms!):

{
"name": "Grandmas Famous Pasta Sauce",
"servings": 6,
"prep_time": 45,
"ingredients": [
{"item": "crushed tomatoes", "amount": "2 cans (28oz each)"},
{"item": "ground beef", "amount": "1 pound"},
{"item": "onion", "amount": "half an onion, diced"},
{"item": "garlic", "amount": "3 cloves, minced"},
{"item": "olive oil", "amount": "2 tbsp"},
{"item": "salt", "amount": "to taste"},
{"item": "pepper", "amount": "to taste"},
{"item": "fresh basil", "amount": "a handful"}
],
"steps": [
"brown the beef first",
"saute the onion and garlic in olive oil until soft",
"add the tomatoes and simmer for 30 min",
"season with salt, pepper, and basil at the end"
]
}

Clean extraction in 2.5 seconds — this is Bonsai’s sweet spot.

Test 6 — German Translation (failed):

“This is a formal notice announcing the termination of a contract. You need to confirm receipt and provide the exact termination date.”

It summarized instead of translating. The actual German text was never rendered in English.

Test 8 — Docker Compose (broken):

services:
appdb:
image: postgres:16
environment:
- DB_PASSWORD=${DB_PASSWORD} # Wrong: should be POSTGRES_PASSWORD
healthcheck:
test: ["pg_isready", "timeout", "10s"] # Invalid syntax
command: ["pg_isready", "-U", "appuser", "-d", "appdb"] # Not a real field

Wrong env vars, broken healthcheck, invented a command field inside healthcheck.

Qwen 3.5 9B — 44/90 (49%)

The score doesn’t tell the full story. 8 out of 18 tests returned empty responses.

SectionScoreNotes
Knowledge & Learning5/152 of 3 tests returned empty
Practical Text Tasks10/151 empty; email and translation were excellent
Technical & Homelab0/15All 3 tests returned empty
Extraction & Classification5/101 empty; entity extraction was perfect
Instruction Following5/101 empty; JSON schema was perfect
Tool Calling24/25Near-perfect — better investigation than Gemma

The culprit: Ollama 0.20.0-rc0 enables thinking by default for Qwen 3.5, and the thinking tokens consume the entire max_tokens budget before any visible output appears. There’s no thinking budget control via the API — only think: true (broken) or think: false (no thinking at all).

When it works, it’s excellent. Tool calling scored 24/25 — Qwen investigated both container status and logs for the Immich error, and acted immediately on the vague Nginx request.

View Qwen 3.5 detailed outputs (where it produced them)

Test 5 — Formal Email (33 seconds, but excellent):

Dear Management, Thank you for the notice regarding the rent increase. However, I am unable to accept this proposal. Our lease agreement stipulates a 60-day written notice for rent adjustments, which has not been satisfied. Furthermore, the contract caps annual increases at 5%, not 15%. I value our tenancy and want to ensure we remain compliant with the lease terms. Please review the agreement and discuss a compliant adjustment.

Test 18 — Vague Request (“Something seems off with Nginx”):

[
{"function": {"name": "get_container_status", "arguments": "{\"container_name\":\"nginx\"}"}},
{"function": {"name": "search_logs", "arguments": "{\"service_name\":\"nginx\",\"query\":\"error\",\"since\":\"1h\"}"}}
]

Same investigation pattern as Qwopus — checked status AND logs. Gemma asked clarifying questions instead.

Qwen3-Coder 30B-A3B — Did Not Finish

At 18GB, only 30 of 49 layers fit on the GPU. The remaining 19 layers run on CPU at ~0.5 tokens per second. A simple “say hello” took 21 seconds. MoE architecture (30B total, 3B active) doesn’t help with memory — all weights still need to be loaded. Needs 24GB+ VRAM.

The Head-to-Head

ModelScoreSpeedVRAMBest at
Gemma 4 E4B82/90 (91%)~77 tok/s9.8GBEverything — most reliable all-rounder
Qwopus 3.5 9B69/90 (77%)~23 tok/s6.5GBText quality, tone, tool calling (when thinking works)
Bonsai-8B63/90 (70%)~99 tok/s~2GBSpeed, extraction, classification
Qwen 3.5 9B44/90 (49%)~38 tok/s6.6GBTool calling (24/25) — crippled by thinking mode
Qwen3-Coder 30BDNF~0.5 tok/s11.4GB+Nothing on 12GB VRAM

What I Learned

Smaller can win. Gemma 4 E4B has 4.5B effective parameters and outscored every 9B model. Raw parameter count matters less than how efficiently a model manages its compute budget.

Thinking models are a double-edged sword. Gemma 4 manages its thinking internally without burning the token budget. Qwen-based models do not — they consume unlimited tokens on reasoning, even for trivial prompts. Until Ollama adds thinking budget control, this makes Qwen models risky for automation.

Tool calling is the great equalizer. Gemma 4, Qwen 3.5, and Qwopus all scored 23-24/25. Small models are surprisingly good at structured tool use. If you’re building agent workflows on local models, the capability is already there.

Distillation works. Qwopus — a community distillation of Claude Opus into Qwen 3.5 — produced the highest quality text outputs. Better analogies, more direct tone, richer explanations. The teacher model’s quality transfers, as long as the output actually makes it through the thinking bottleneck.

1-bit quantization has clear limits. Bonsai-8B is fast and handles extraction well, but hallucinated on technical tasks. It’s a triage and extraction model, not a general-purpose assistant. Maybe if it had right tools (web search) it might have performed better?

What I’m Running Now

While these tests proved Gemma 4 to be the leader, implementation issues in Ollama prevented Qwopus and Qwen 3.5 from completing all the tests. Moreover, these tests were directly executed by calling the Ollama API. But when trying to use these models in Open WebUI, the experience has been different. Most of it has been frustrating.

For a fair benchmark, I am now in the process of migrating to llama.cpp. According to my research (I just asked Claude) it might increase the performance upto 30%. llama.cpp also has its own web UI, which might become my go to option, then OpenWeb UI. There’s a lot to experiment with, a lot to learn!

Try It Yourself

If you’re running Ollama (or any OpenAI-compatible endpoint), you can run these same tests on your own models. The full benchmark script is in the collapsible section at the top of this post — copy it into a file called bench.sh and make it executable:

Terminal window
chmod +x bench.sh

The script takes two arguments: your endpoint URL and the model name. For Ollama running locally:

Terminal window
./bench.sh http://localhost:11434/v1 gemma4:e4b

Or if your server is on another machine on your network:

Terminal window
./bench.sh http://192.168.1.100:11434/v1 qwen3.5:9b

It will run all 18 tests and save the results to ./results/<model>_<timestamp>.md — full outputs, timing, and token counts per test, with an empty scorecard at the end for you to fill in. The prompts are hardcoded in the script, so every model gets the exact same inputs.

The only dependency is jq for JSON parsing. If you don’t have it: sudo apt install jq on Ubuntu or brew install jq on macOS.

Score each output 0-5 yourself. That’s the whole point — these tests measure whether the output is actually usable for your tasks, not whether it matches a regex on someone else’s leaderboard.

If you’re choosing a model for your 12GB GPU, I hope this saves you some time. The leaderboards won’t tell you which model really is useful for your day-to-day agentic workflows.

Once I migrate to llama.cpp, I would run the benchmarks again. If you are interested in learning about the results, you can connect with me on the internet. I will be sharing it soon!

Last updated on 03 Apr 2026