Leveraging Local LLMs

active
ai local-models llm

Context

The local LLM landscape is shifting. Models like GLM-4.7-Flash (released January 19, 2026) are achieving competitive performance with significantly smaller active parameter counts through MoE architecture, making them viable for consumer hardware.

Key Discoveries

GLM-4.7-Flash (Zhipu AI)

Released: January 19, 2026
Architecture: 30B total parameters, ~3B active per token (MoE)
Performance:

  • SWE-bench Verified: 59.2% (vs Qwen3-Coder 480B at 55.4%)
  • Speed: 43-82 tokens/second on local hardware (M4 Max reports)

Positioning: Free-tier model optimized for high-volume endpoints, UI assistants, batch processing. Designed to run locally on consumer hardware.

Zhipu AI Background

  • Founded: 2019, spin-off from Tsinghua University’s Knowledge Engineering Group
  • Founders: Professors Tang Jie and Li Juanzi
  • Timeline:
    • 2021: GLM base model
    • 2022: GLM-130B (bilingual, open-source)
    • March 2023: ChatGLM (ChatGPT competitor)
    • 2026: Recently went public, pushing into international markets

Market position: Strong in China, less known in Western markets. Academic lineage similar to Anthropic/OpenAI.

Local Testing

GLM-4.7-Flash (bf16)

Hardware: 96GB GPU RAM
Memory usage: 58GB (plenty of headroom)
Quantization: bf16
Performance: ~6 minutes for simple queries ("what do you think of this project?")

Conclusion: bf16 is unusably slow. The reported benchmarks (43-82 tok/s) are likely using quantized versions (int4/int8). Switching to q4_K_M to test speed vs quality tradeoff.

Learning: bf16 is a training format, not an inference format. It keeps the same exponent range as fp32 (so gradient updates don't overflow), and GPUs have dedicated tensor cores for it, which speeds up training. For inference that precision is overkill: q4/q5 quantization gives a 4-8x speed improvement with minimal quality loss.
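The memory arithmetic backs this up. A rough sketch of the weight footprint at different quantizations (the k-quant bits-per-weight values are approximate averages, and KV cache/activations are excluded):

```python
# Back-of-the-envelope weight memory for a 30B-parameter model.
# Excludes KV cache and activations; k-quant bit widths are approximate averages.
PARAMS = 30e9  # total parameters (GLM-4.7-Flash)

def weights_gib(params: float, bits_per_weight: float) -> float:
    """Weight memory in GiB at a given effective bits-per-weight."""
    return params * bits_per_weight / 8 / 2**30

for name, bits in [("bf16", 16), ("q8_0", 8.5), ("q4_K_M", 4.85)]:
    print(f"{name:>7}: ~{weights_gib(PARAMS, bits):.1f} GiB")
```

bf16 comes out near 56 GiB, close to the observed 58GB, while q4_K_M lands around 17 GiB, roughly a 3.3x reduction in memory traffic per token for a memory-bandwidth-bound workload.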

System Stability Issues

Problem: PC rebooted during testing when switching between models (60GB bf16 → 20GB model).

Root cause: Multiple factors compounding:

  1. Model overlap: Inference server (LM Studio/Ollama) loaded the second model without unloading the first
    • 58GB (GLM-4.7-Flash bf16) + 20GB (second model) = 78GB total in VRAM
    • Both models active during transition created power draw spike
  2. Faulty KVM: Dodgy KVM with ground fault (shocks on contact) introducing voltage instability
    • System tolerates issues during idle/light load
    • High GPU power draw (300-400W) during inference exposes power delivery problems
    • Model switching spike exceeds KVM’s degraded capacity

Key insight: Local inference memory budget ≠ power budget. Having VRAM headroom (78GB/96GB) doesn’t prevent issues if power delivery can’t sustain the draw. Inference servers often cache multiple models for fast switching without considering cumulative power implications.
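One way to watch both budgets at once during a model switch is to poll the GPU; these `nvidia-smi` query fields are standard, and the one-second sampling interval is arbitrary:

```shell
# Log VRAM and power draw once per second while switching models.
nvidia-smi --query-gpu=timestamp,memory.used,power.draw \
           --format=csv -l 1
```

A power-draw spike without a corresponding memory spike would point at the delivery problem rather than the overlap problem.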

Workarounds:

  • Replace faulty KVM (immediate priority)
  • Manually unload models between tests
  • Configure inference server for single model at a time
  • Restart inference server between model switches
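For Ollama specifically, the last three workarounds can be sketched like this (LM Studio exposes similar options in its server settings; `OLLAMA_MAX_LOADED_MODELS`, the `keep_alive` field, and `ollama stop` are documented Ollama features, but the model name here is just a placeholder):

```shell
# Limit the server to one resident model at a time (set before `ollama serve`).
export OLLAMA_MAX_LOADED_MODELS=1

# Unload a model immediately by sending keep_alive: 0 (model name is illustrative).
curl http://localhost:11434/api/generate \
     -d '{"model": "glm-4.7-flash", "keep_alive": 0}'

# Or stop a running model from the CLI.
ollama stop glm-4.7-flash
```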

Questions to Explore

  • What does local inference mean for team AI adoption timelines?
  • Cost model comparison: local vs cloud APIs at scale
  • How does this affect model selection guidance for the team?
  • What workflows benefit from local models vs cloud?
  • Privacy/security implications for enterprise use
  • Quantization sweet spot: where does quality drop become noticeable?
