Before I deploy this for you, I've deployed it for myself.

Every recommendation I make comes from a running system. I operate a production-grade private AI infrastructure — the same stack I'd help you deploy: a large language model running locally, a RAG pipeline over 600,000+ documents, agentic automation, full observability, encrypted storage. No cloud dependency. No per-token billing.

122B parameters, running locally

600K+ documents, searchable in milliseconds

0 bytes sent to external APIs

I got tired of recommending tools I hadn't run myself. So I built the same stack I deploy for clients — on my own hardware, with my own data, at production scale. Everything I recommend, I've already debugged at 2am.

Infrastructure overview

Compute layer

spark1 · Node 1 — primary inference · 128 GB RAM · Tensor shard A

spark2 · Node 2 — tensor parallel · 128 GB RAM · Tensor shard B

Interconnect: ConnectX-7 · 23.3 GB/s

Inference engine: vLLM tensor-parallel · Ray
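For the curious, a two-node vLLM deployment of this shape is typically launched on top of a Ray cluster. The commands below are a sketch only — the model path, port, and hostnames (spark1/spark2) are illustrative, and exact flags depend on your vLLM version:

```shell
# Head node (spark1): start the Ray cluster
ray start --head --port=6379

# Worker node (spark2): join the cluster
ray start --address=spark1:6379

# Launch the model sharded across both nodes
vllm serve Qwen/Qwen3.5-122B \
  --tensor-parallel-size 2 \
  --distributed-executor-backend ray \
  --max-model-len 262144
```

Tensor parallelism splits each weight matrix across the nodes, which is why the interconnect bandwidth (23.3 GB/s here) matters so much for token throughput.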

Model layer

Qwen3.5-122B

FP8 quantized · 262K context window · Inference cost: electricity only

Vision fallback: GLM-4.7-Flash · 80K context · 56 tokens/sec

RAG · graph · memory

Knowledge layer

Qdrant

645K vectors · 4096-dim · cosine

Neo4j

68K nodes · 106K relationships · person-aware

mem0

711 stored memories · linked to graph

What the lab actually does

Not benchmarks. Not demos. These are live systems I use every day.

Email intelligence

645,000 emails indexed as vectors. I ask questions in plain language — 'what did we agree about X in March?' — and get the right answer in under a second. 100% Recall@10 across the 40-query evaluation set.

Voice interface

Speech-to-text in under 300ms. Full conversation with the local model. Text-to-speech reply in 2–5 seconds. Multilingual. Nothing leaves the network.

AI agents

14 agents handling tasks: web research, memory, image generation, API orchestration, document search. Persistent memory across conversations. All running on local infrastructure.

Private LLM serving

A 122B parameter model distributed across two nodes via high-speed interconnect. 262,000 token context window. Inference cost: electricity only.

Full observability

64 monitoring targets, 34 alert rules, 16 dashboards. Every container, every GPU, every model lane tracked in real time. I know when something breaks before it matters.
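An alert rule in this setup looks like an ordinary Prometheus rule file. A hedged sketch — the group, alert name, and metric are hypothetical, since the exact metric names depend on the exporters in use:

```yaml
groups:
  - name: inference-lane
    rules:
      - alert: InferenceLatencyHigh
        # Metric name is illustrative; substitute your serving engine's histogram.
        expr: histogram_quantile(0.99, rate(llm_request_latency_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 inference latency above 2s on {{ $labels.instance }}"
```

Rules like this are what makes it possible to know something is degrading before users notice.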

Everything as code

73 documented architecture decisions. 18 Ansible playbooks. OpenTofu infrastructure definitions. Every change reviewed, validated, and committed.
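"Everything as code" means services are declared, not hand-installed. A minimal illustrative Ansible task — the host group, volume path, and image tag are assumptions, not the actual playbook:

```yaml
- name: Deploy Qdrant as a container (illustrative)
  hosts: knowledge_layer
  become: true
  tasks:
    - name: Run Qdrant
      community.docker.docker_container:
        name: qdrant
        image: qdrant/qdrant:latest
        ports:
          - "6333:6333"
        volumes:
          - /srv/qdrant:/qdrant/storage
        restart_policy: unless-stopped
```

Because the state lives in version control, every change is reviewable and every node is rebuildable.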

Cost comparison

Running this workload on OpenAI API

~€8,000/month

Running it here

~€130/month

Cost of electricity
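The ~€130 figure follows directly from the measured power draw. A quick check, assuming an illustrative Danish rate of €0.15/kWh (the actual tariff varies):

```python
power_kw = 1.2             # average draw under inference load (measured)
hours_per_month = 24 * 30
price_eur_per_kwh = 0.15   # assumed rate, varies with the Danish spot market

monthly_kwh = power_kw * hours_per_month        # 864 kWh
monthly_cost = monthly_kwh * price_eur_per_kwh  # ~130 EUR
print(f"{monthly_cost:.0f} EUR/month")
```

There is no per-token line item: the marginal cost of one more query is effectively zero.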

Technical architecture

For the technically curious. Everything below is running in production as of April 2026.

AI MODEL SERVING

Main LLM: Qwen3.5-122B · distributed across 2 nodes · 262K context · FP8 quantized
Embeddings: Qwen3-Embedding-8B · 147 chunks/sec · 4096-dimensional vectors
Speech-to-text: Whisper Large-v3-Turbo · <300ms latency · CUDA inference
Text-to-speech: Qwen3-TTS-1.7B · 2.5–5s · 9 voices · multilingual
Vision (primary): Qwen3.5-122B · multimodal · 262K context · tensor-parallel
Vision (fallback): GLM-4.7-Flash · 80K context · 56 tokens/sec · lightweight
Reranker: Qwen3-Reranker-0.6B · sub-100ms · used in all RAG pipelines

KNOWLEDGE & MEMORY

Vector store: Qdrant · 645K vectors (4096-dim) · cosine similarity
Graph store: Neo4j · 68K nodes · 106K relationships · person-aware retrieval
Persistent memory: mem0 · 711 stored memories · linked to graph
Recall performance: 100% Recall@10 on 40-query golden evaluation set
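Recall@10 is simple to compute: for each query in the golden set, check whether its known-correct document appears in the top 10 results. A sketch with toy data (the real evaluation runs 40 queries against the live index):

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of queries whose gold document appears in the top-k results."""
    hits = sum(1 for docs, gold in zip(retrieved, relevant) if gold in docs[:k])
    return hits / len(relevant)

# Toy example: three queries, each with one known-correct document id.
retrieved = [[3, 7, 1], [2, 9, 4], [8, 5, 6]]
relevant = [7, 4, 0]
print(recall_at_k(retrieved, relevant, k=3))  # 2 of 3 queries hit -> 0.666...
```

100% Recall@10 means every one of the 40 golden queries surfaced its correct document in the first page of results.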

COMPUTE

Total memory: 494 GB across 4 nodes (distributed inference capable)
GPU interconnect: NVIDIA ConnectX-7 QSFP · 23.3 GB/s NCCL bandwidth
Inference engine: vLLM tensor-parallel · Ray cluster coordination
Containers: 26 LXC + 40 Docker · Proxmox VE hypervisor
Storage: NVMe SSDs · LUKS2 AES-XTS-512 encrypted at rest · ZFS snapshots
Power draw: ~1.2 kW under inference load · ~€130/month at Danish energy prices
Uplink: 10 GbE LAN · Proxmox VE cluster spanning 4 nodes

SECURITY & OPERATIONS

Encryption: LUKS2 AES-XTS-512 · encrypted at rest · zero-touch unlock via NBDE
Authentication: Passkey-only SSO across all services · no passwords by default
Secrets management: Ansible Vault · pre-commit secret scanning on every commit
Backup: 3-layer · nightly snapshots + encrypted database dumps + off-site replication · 24h RPO
Observability: Prometheus · Grafana · Loki · 30-day retention · alerting

Want this for your organisation?

I help Danish and EU enterprises deploy private AI infrastructure — from single-model setups to full agentic platforms.