Every recommendation I make comes from a running system. I operate a production-grade private AI infrastructure — the same stack I'd help you deploy: a large language model running locally, a RAG pipeline over 600,000+ documents, agentic automation, full observability, encrypted storage. No cloud dependency. No per-token billing.
122B parameters, running locally
600K+ documents, searchable in milliseconds
0 bytes sent to external APIs
I got tired of recommending tools I hadn't run myself. So I built the same stack I deploy for clients — on my own hardware, with my own data, at production scale. Everything I recommend, I've already debugged at 2am.
Infrastructure overview
Compute layer
spark1 — Node 1, primary inference · 128 GB RAM · tensor shard A
ConnectX-7 interconnect · 23.3 GB/s
spark2 — Node 2, tensor parallel · 128 GB RAM · tensor shard B
vLLM tensor-parallel · Ray
Model layer
Qwen3.5-122B
FP8 quantized · 262K context window · Inference cost: electricity only
Vision fallback: GLM-4.7-Flash · 80K context · 56 tokens/sec
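In code, the two-node setup above is not exotic. Here is a minimal sketch of distributed serving with vLLM over Ray, assuming both nodes have already joined one Ray cluster; the model path and prompt are placeholders, not my actual configuration.

```python
# Minimal sketch: serving a large model with vLLM tensor parallelism over Ray.
# Assumes `ray start --head` on node 1 and `ray start --address=<head>` on
# node 2 have already joined both machines into one cluster.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/qwen3.5-122b-fp8",    # placeholder local path
    tensor_parallel_size=2,              # one shard per node
    distributed_executor_backend="ray",  # spread shards across the Ray cluster
    quantization="fp8",                  # FP8 weights, as in the stack above
    max_model_len=262_144,               # the 262K context window
)

out = llm.generate(
    ["Summarize the last board meeting in three bullets."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(out[0].outputs[0].text)
```

In production you would more likely run the OpenAI-compatible `vllm serve` endpoint with the same parallelism settings; the offline API above just keeps the sketch self-contained.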
RAG · graph · memory
Knowledge layer
Qdrant: 645K vectors · 4096-dim · cosine
Neo4j: 68K nodes · 106K relationships · person-aware
mem0: 711 stored memories · linked to graph
Not benchmarks. Not demos. These are live systems I use every day.
645,000 emails indexed as vectors. I ask questions in plain language ('what did we agree about X in March?') and get the right answer in under a second, with 100% recall on my evaluation query set.
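Under the hood this is plain RAG: embed the question, pull the nearest email chunks from Qdrant, hand them to the local model. A condensed sketch; the collection name, payload field, and model name are placeholders, and the query embedding is assumed to come from the same model that built the index.

```python
# Condensed RAG sketch: question -> vector search -> grounded answer.
# Collection name, payload field, and model name are placeholders; vLLM
# exposes an OpenAI-compatible API, which is what the client targets.
from qdrant_client import QdrantClient
from openai import OpenAI

qdrant = QdrantClient("localhost", port=6333)
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def answer(question: str, query_vector: list[float]) -> str:
    # Nearest neighbours among the 645K email vectors (cosine distance).
    hits = qdrant.search(collection_name="emails", query_vector=query_vector, limit=5)
    context = "\n---\n".join(h.payload["text"] for h in hits)
    resp = llm.chat.completions.create(
        model="qwen3.5-122b",
        messages=[
            {"role": "system", "content": f"Answer from these emails only:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```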
Speech-to-text in under 300ms. Full conversation with the local model. Text-to-speech reply in 2–5 seconds. Multilingual. Nothing leaves the network.
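The engines behind the voice loop aren't named above, so this sketch stands in faster-whisper for transcription and a hypothetical local HTTP endpoint for synthesis; the shape of the pipeline, with everything on the LAN, is the point.

```python
# Voice-loop sketch: audio in -> local STT -> local LLM -> local TTS.
# faster-whisper is a stand-in STT engine and the TTS endpoint is
# hypothetical; the source doesn't name the actual engines.
import requests
from faster_whisper import WhisperModel
from openai import OpenAI

stt = WhisperModel("large-v3", device="cuda")  # stand-in, multilingual STT
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def voice_turn(wav_path: str) -> bytes:
    # 1. Transcribe locally.
    segments, _ = stt.transcribe(wav_path)
    text = " ".join(s.text for s in segments)
    # 2. One conversational turn with the local model.
    reply = llm.chat.completions.create(
        model="qwen3.5-122b",
        messages=[{"role": "user", "content": text}],
    ).choices[0].message.content
    # 3. Synthesize the reply on a hypothetical local TTS service.
    audio = requests.post("http://localhost:5002/tts", json={"text": reply})
    return audio.content
```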
14 agents handling tasks: web research, memory, image generation, API orchestration, document search. Persistent memory across conversations. All running on local infrastructure.
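The persistent memory is the mem0 layer from the knowledge stack above. A minimal sketch of an agent storing and later recalling a fact; the user id is a placeholder, and the default mem0 config is shown rather than one wired to the local Qdrant and Neo4j services ("linked to graph").

```python
# Minimal mem0 sketch: an agent stores a fact, then recalls it later.
# Default config shown; a production instance would be configured against
# the local Qdrant/Neo4j services rather than mem0's defaults.
from mem0 import Memory

memory = Memory()

# Store a conversational fact against a (placeholder) user id.
memory.add(
    [{"role": "user", "content": "We agreed to move the launch to May."}],
    user_id="demo-user",
)

# Later, any agent can recall it by semantic search.
results = memory.search("When is the launch?", user_id="demo-user")
for hit in results["results"]:
    print(hit["memory"])
```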
A 122B-parameter model distributed across two nodes via a high-speed interconnect. 262,000-token context window. Inference cost: electricity only.
64 monitoring targets, 34 alert rules, 16 dashboards. Every container, every GPU, every model lane tracked in real time. I know when something breaks before it matters.
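Targets, alert rules, and dashboards read as Prometheus and Grafana; assuming that stack, here is a sketch of pulling a live GPU metric over Prometheus's standard HTTP query API. The metric name is illustrative.

```python
# Sketch: reading a live GPU metric from Prometheus's HTTP query API.
# Prometheus is an assumption (the monitoring stack isn't named above),
# and DCGM_FI_DEV_GPU_UTIL is one illustrative exporter metric.
import requests

resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "DCGM_FI_DEV_GPU_UTIL"},
    timeout=5,
)
for result in resp.json()["data"]["result"]:
    instance = result["metric"].get("instance", "?")
    value = result["value"][1]
    print(f"{instance}: {value}% GPU utilization")
```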
73 documented architecture decisions. 18 Ansible playbooks. OpenTofu infrastructure definitions. Every change reviewed, validated, and committed.
Cost comparison
Running this workload on the OpenAI API: ~€8,000/month
Running it here: ~€130/month (the cost of electricity)
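Those totals are easy to sanity-check against your own workload. A back-of-envelope sketch; every input below is an illustrative placeholder chosen to reproduce the round numbers above, not a measured figure.

```python
# Back-of-envelope cost comparison. All inputs are illustrative
# placeholders -- substitute your own token volume, API prices, and tariff.
tokens_per_month = 2_000_000_000   # placeholder workload: 2B tokens/month
api_price_per_1m_eur = 4.00        # placeholder blended API price, EUR per 1M tokens

power_draw_kw = 1.2                # placeholder average draw of both nodes
electricity_eur_per_kwh = 0.15     # placeholder tariff

api_cost = tokens_per_month / 1_000_000 * api_price_per_1m_eur
local_cost = power_draw_kw * 24 * 30 * electricity_eur_per_kwh

print(f"API:   ~EUR {api_cost:,.0f}/month")    # ~EUR 8,000 with these inputs
print(f"Local: ~EUR {local_cost:,.0f}/month")  # ~EUR 130 with these inputs
```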
For the technically curious. Everything below is running in production as of April 2026.
AI model serving · Knowledge & memory · Compute · Security & operations
I help Danish and EU enterprises deploy private AI infrastructure — from single-model setups to full agentic platforms.