Every recommendation I make comes from a running system. I operate a production-grade private AI infrastructure — the same stack I'd help you deploy: a large language model running locally, a RAG pipeline over 600,000+ documents, agentic automation, full observability, encrypted storage. No cloud dependency. No per-token billing.
122B parameters, running locally
600K+ documents, searchable in milliseconds
0 bytes sent to external APIs
I got tired of recommending tools I hadn't run myself. So I built the same stack I deploy for clients — on my own hardware, with my own data, at production scale. Everything I recommend, I've already debugged at 2am.
Infrastructure overview
Compute layer
spark1 — Node 1, primary inference · 128 GB RAM · tensor shard A
ConnectX-7 interconnect · 23.3 GB/s
spark2 — Node 2, tensor parallel · 128 GB RAM · tensor shard B
vLLM tensor-parallel · Ray
Model layer
Qwen3.5-122B
FP8 quantized · 262K context window · Inference cost: electricity only
Vision fallback: GLM-4.7-Flash · 80K context · 56 tokens/sec
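In code, the two-node setup above is not exotic. Here is a minimal sketch of distributed serving with vLLM over Ray, assuming both nodes have already joined one Ray cluster; the model path and prompt are placeholders, not my actual configuration.

```python
# Minimal sketch: serving a large model with vLLM tensor parallelism over Ray.
# Assumes `ray start --head` on node 1 and `ray start --address=<head>` on
# node 2 have already joined both machines into one cluster.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/qwen3.5-122b-fp8",    # placeholder local path
    tensor_parallel_size=2,              # one shard per node
    distributed_executor_backend="ray",  # spread shards across the Ray cluster
    quantization="fp8",                  # FP8 weights, as in the stack above
    max_model_len=262_144,               # the 262K context window
)

out = llm.generate(
    ["Summarize the last board meeting in three bullets."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(out[0].outputs[0].text)
```

In production you would more likely run the OpenAI-compatible `vllm serve` endpoint with the same parallelism settings; the offline API above just keeps the sketch self-contained.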
RAG · graph · memory
Knowledge layer
Qdrant: 645K vectors · 4096-dim · cosine
Neo4j: 68K nodes · 106K relationships · person-aware
mem0: 711 stored memories · linked to graph
Not benchmarks. Not demos. These are live systems I use every day.
645,000 emails indexed as vectors. I ask questions in plain language ('what did we agree about X in March?') and get the right answer in under a second, with 100% recall on my evaluation query set.
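Under the hood this is plain RAG: embed the question, pull the nearest email chunks from Qdrant, hand them to the local model. A condensed sketch; the collection name, payload field, and model name are placeholders, and the query embedding is assumed to come from the same model that built the index.

```python
# Condensed RAG sketch: question -> vector search -> grounded answer.
# Collection name, payload field, and model name are placeholders; vLLM
# exposes an OpenAI-compatible API, which is what the client targets.
from qdrant_client import QdrantClient
from openai import OpenAI

qdrant = QdrantClient("localhost", port=6333)
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def answer(question: str, query_vector: list[float]) -> str:
    # Nearest neighbours among the 645K email vectors (cosine distance).
    hits = qdrant.search(collection_name="emails", query_vector=query_vector, limit=5)
    context = "\n---\n".join(h.payload["text"] for h in hits)
    resp = llm.chat.completions.create(
        model="qwen3.5-122b",
        messages=[
            {"role": "system", "content": f"Answer from these emails only:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```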
Speech-to-text in under 300ms. Full conversation with the local model. Text-to-speech reply in 2–5 seconds. Multilingual. Nothing leaves the network.
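The engines behind the voice loop aren't named above, so this sketch stands in faster-whisper for transcription and a hypothetical local HTTP endpoint for synthesis; the shape of the pipeline, with everything on the LAN, is the point.

```python
# Voice-loop sketch: audio in -> local STT -> local LLM -> local TTS.
# faster-whisper is a stand-in STT engine and the TTS endpoint is
# hypothetical; the source doesn't name the actual engines.
import requests
from faster_whisper import WhisperModel
from openai import OpenAI

stt = WhisperModel("large-v3", device="cuda")  # stand-in, multilingual STT
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def voice_turn(wav_path: str) -> bytes:
    # 1. Transcribe locally.
    segments, _ = stt.transcribe(wav_path)
    text = " ".join(s.text for s in segments)
    # 2. One conversational turn with the local model.
    reply = llm.chat.completions.create(
        model="qwen3.5-122b",
        messages=[{"role": "user", "content": text}],
    ).choices[0].message.content
    # 3. Synthesize the reply on a hypothetical local TTS service.
    audio = requests.post("http://localhost:5002/tts", json={"text": reply})
    return audio.content
```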
14 agents handling tasks: web research, memory, image generation, API orchestration, document search. Persistent memory across conversations. All running on local infrastructure.
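The persistent memory is the mem0 layer from the knowledge stack above. A minimal sketch of an agent storing and later recalling a fact; the user id is a placeholder, and the default mem0 config is shown rather than one wired to the local Qdrant and Neo4j services ("linked to graph").

```python
# Minimal mem0 sketch: an agent stores a fact, then recalls it later.
# Default config shown; a production instance would be configured against
# the local Qdrant/Neo4j services rather than mem0's defaults.
from mem0 import Memory

memory = Memory()

# Store a conversational fact against a (placeholder) user id.
memory.add(
    [{"role": "user", "content": "We agreed to move the launch to May."}],
    user_id="demo-user",
)

# Later, any agent can recall it by semantic search.
results = memory.search("When is the launch?", user_id="demo-user")
for hit in results["results"]:
    print(hit["memory"])
```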
A 122B-parameter model distributed across two nodes via a high-speed interconnect. 262,000-token context window. Inference cost: electricity only.
64 monitoring targets, 34 alert rules, 16 dashboards. Every container, every GPU, every model lane tracked in real time. I know when something breaks before it matters.
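Targets, alert rules, and dashboards read as Prometheus and Grafana; assuming that stack, here is a sketch of pulling a live GPU metric over Prometheus's standard HTTP query API. The metric name is illustrative.

```python
# Sketch: reading a live GPU metric from Prometheus's HTTP query API.
# Prometheus is an assumption (the monitoring stack isn't named above),
# and DCGM_FI_DEV_GPU_UTIL is one illustrative exporter metric.
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "DCGM_FI_DEV_GPU_UTIL"},
    timeout=5,
)
for result in resp.json()["data"]["result"]:
    instance = result["metric"].get("instance", "?")
    value = result["value"][1]
    print(f"{instance}: {value}% GPU utilization")
```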
73 documented architecture decisions. 18 Ansible playbooks. OpenTofu infrastructure definitions. Every change reviewed, validated, and committed.
Cost comparison
Running this workload on the OpenAI API: ~€8,000/month
Running it here: ~€130/month (the cost of electricity)
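Those totals are easy to sanity-check against your own workload. A back-of-envelope sketch; every input below is an illustrative placeholder chosen to reproduce the round numbers above, not a measured figure.

```python
# Back-of-envelope cost comparison. All inputs are illustrative
# placeholders -- substitute your own token volume, API prices, and tariff.
tokens_per_month = 2_000_000_000   # placeholder workload: 2B tokens/month
api_price_per_1m_eur = 4.00        # placeholder blended API price, EUR per 1M tokens

power_draw_kw = 1.2                # placeholder average draw of both nodes
electricity_eur_per_kwh = 0.15     # placeholder tariff

api_cost = tokens_per_month / 1_000_000 * api_price_per_1m_eur
local_cost = power_draw_kw * 24 * 30 * electricity_eur_per_kwh

print(f"API:   ~EUR {api_cost:,.0f}/month")    # ~EUR 8,000 with these inputs
print(f"Local: ~EUR {local_cost:,.0f}/month")  # ~EUR 130 with these inputs
```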
For the technically curious. Everything below is running in production as of April 2026.
AI model serving · Knowledge & memory · Compute · Security & operations
I help Danish and EU enterprises deploy private AI infrastructure — from single-model setups to full agentic platforms.