5-server fleet · all systems operational

Do Cao Hieu

I keep AI products running in production.

Ho Chi Minh City, Vietnam · self-taught in tech since 2016 · DevOps since 2023. No drama. No downtime theater.

// system status live
5
Production servers
3
Products shipped
2016
In tech since
2023
DevOps since
01 — Perspective

AI-native infrastructure is a different discipline.

// cost

LLM inference costs scale with tokens, not requests.
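
As a rough illustration of this point, the sketch below prices two requests that look identical in a per-request view but differ widely in tokens. The per-token prices and the `request_cost` helper are hypothetical placeholders, not real vendor rates.

```python
# Hypothetical prices in USD per 1,000 tokens; real rates vary by model and vendor.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (
        input_tokens / 1000 * PRICE_PER_1K_INPUT
        + output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    )


# Each of these is "one request", yet the token counts differ by two orders of magnitude.
short_chat = request_cost(input_tokens=200, output_tokens=100)
long_rag = request_cost(input_tokens=20_000, output_tokens=1_000)

print(f"short chat: ${short_chat:.4f}")
print(f"long RAG:   ${long_rag:.4f}")
print(f"ratio:      {long_rag / short_chat:.1f}x")
```

Under these assumed prices, a dashboard that only counts requests would show the two calls as equal, while the bill would not.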

// data

Vector databases require tuning the way cache layers did a decade ago.

// orchestration

Agent orchestration is replacing cron jobs.

// observability

Observability now includes prompt regressions, evaluation drift, latency budgets, and model behavior.

I've been building and operating AI-native systems since 2025. The goal isn't deploying a demo — it's keeping production AI systems reliable, observable, and cost-efficient at scale.

02 — Journey

From IT helpdesk to production AI.


2015–2019 · commit e1f3c20

Automotive Engineering

HCMUTE

Built a foundation in systems thinking, diagnostics, reliability, and problem solving.

2019–2023 · commit 7d4e9b1

IT & Technical Support

Furniture Manufacturer

Set up and maintained the company's computers, network, and CNC machine systems. Became the de-facto technical person on site.

2023 · commit c091af2

DevOps Engineer

the pivot · self-taught

Sharpened my English, then taught myself DevOps — moving from manufacturing systems to software systems.

2024 · commit 3e8d5c4

AI Data · Outlier

Remote AI training & evaluation, ranked in the top 5% of contributors globally, alongside contract technical work for Wellington Decorators.

2025–now · HEAD → main

DevOps & AI Infrastructure

Building and operating production AI workloads across a 5-server fleet — inference, observability, automation, reliability. Three products shipped and running.

03 — Capabilities
// what I do
Infrastructure automation · CI/CD pipelines · Observability & monitoring · AI deployment & agents · Self-hosting & edge

The stack I run on.

Cloud & Edge
AWS · Google Cloud · Cloudflare
Containers & Orchestration
Kubernetes · Docker
Infrastructure as Code
Terraform · Ansible
CI/CD
GitHub Actions · Jenkins
Networking & Systems
Nginx · systemd
Databases
PostgreSQL · Redis · pgvector
Observability
Grafana · Prometheus · Loki
Languages
Python · Go · TypeScript · Bash
AI Infrastructure
MCP · Claude Agent SDK · Gemini API
04 — Experience

Where I've operated.

DevOps Engineer & AI Infrastructure

Self-employed · self-hosted
2023 — Present
  • Design and operate 5 production servers running 97 Docker containers across a 5-node WireGuard full-mesh VPN.
  • Full LGTM observability (Prometheus, Grafana, Loki, Tempo); CI/CD with Jenkins, Harbor registry, and Terrakube / Semaphore for IaC.
  • Shipped and maintain IELTS Pocket, Clinic ERP and a self-hosted DevOps Lab Platform.

AI Data Contributor

Outlier
2024 — Present
  • Ranked in the top 5% of contributors globally.
  • Train and evaluate frontier models (Claude, GPT, Gemini) through expert prompt engineering and code review.

IT & Technical Support

Furniture Manufacturer
2019 — 2023
  • Ran IT for 30+ staff — workstations, LAN / Wi-Fi, file server / NAS and backups.
  • Maintained CNC machine systems and the production-floor data pipeline.
05 — Selected Work

Things I've shipped.

/00 · featured

Self-Hosted Infrastructure Fleet

5 production servers · 97 containers · one platform.

● live
INTERNET
Cloudflare · DNS + Proxy
Loadbalancer · Traefik TLS · Singapore edge · 9 containers
WireGuard · 5-node full-mesh VPN · 10.10.0.0/24
server-1 · staging · dev · IELTS Pocket (staging) · 10 containers
server2 · CI/CD · registry · Jenkins · Harbor · Terrakube · 29 containers
server3 · observability · Prometheus · Grafana · Loki · 19 containers
oracle · ARM · apps · platform · GitLab · Supabase · Vault · 30 containers
5 servers · 97 containers · LGTM observability + Beszel on every host
WireGuard · Traefik · Prometheus · Grafana · Loki · Tempo · Harbor · Jenkins · Kopia
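
The numbers in the fleet diagram above can be sanity-checked with a short sketch. It assumes the Singapore edge load balancer is one of the five mesh nodes (the label `edge` is a hypothetical name for it) and uses the per-host container counts exactly as listed.

```python
from itertools import combinations

# Container counts per host as listed in the fleet diagram.
# "edge" is a hypothetical label for the Singapore edge load balancer.
containers = {
    "edge": 9,
    "server-1": 10,
    "server2": 29,
    "server3": 19,
    "oracle": 30,
}

nodes = list(containers)

# In a full mesh every pair of nodes shares one tunnel: n * (n - 1) / 2.
tunnels = list(combinations(nodes, 2))
peers_per_node = len(nodes) - 1

print(f"nodes: {len(nodes)}")
print(f"tunnels: {len(tunnels)}")
print(f"peers per node: {peers_per_node}")
print(f"containers: {sum(containers.values())}")
```

With five nodes this gives 10 tunnels and 4 peers per node, and the per-host counts add up to the 97 containers quoted for the fleet.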
/01 · ● live

DevOps Lab Platform

Self-hosted interactive DevOps learning platform — browser-based, hands-on labs running in real containers.

sysbox · k3s · Docker · Next.js
devops.docaohieu.com →
/02 · ◆ open source

DevOps Automation Toolkits

27 reusable IaC & automation repos — Ansible, Terraform / OpenTofu / Pulumi, ArgoCD / Tekton, CI/CD and monitoring stacks.

Terraform · Ansible · ArgoCD · Helm
github.com/docaohieu2808 →
/03 · ◆ open source

Engram-Mem

Persistent memory for AI agents — dual-memory (episodic vector + semantic graph) over CLI, HTTP, MCP & WebSocket.

Python · Vector DB · MCP · WebSocket
github.com/docaohieu2808/Engram-Mem →
/04 · ● live

IELTS Pocket

AI-powered IELTS preparation platform — automated Speaking & Writing grading, running in production.

Next.js · Gemini API · Text-to-Speech · PostgreSQL
ieltspocket.com →
/05 · ● live

Clinic ERP

ERP & dashboard for a physiotherapy clinic — scheduling, patient records, and billing. Built and running in production for a private client.

Cloudflare Workers · D1 · Next.js
● private client project
/06 · ◆ open source

Floatwave

Floating mini-player for YouTube & YouTube Music on Windows — always-on-top, global hotkeys, queue, favorites, focus & cinema mode.

Electron · Vanilla JS · youtubei.js
github.com/docaohieu2808/floatwave →
06 — Writing

War stories, written down.

Second Edition · March 2026 · 13 chapters

AI Agents in Production

War Stories from a DevOps Engineer

A practical guide to building, deploying and operating AI agents in production — model selection, context engineering, cost management, observability and security. Based on real systems running Claude, GPT, Gemini and open-source models.

Get it on Gumroad
// the shelf · basic → advanced
DevOps Bằng Hình (DevOps in Pictures)
Visual · learn by diagrams
Read PDF →
Docker Compose Toàn Tập (Docker Compose: The Complete Guide)
12 chapters · Compose deep-dive
Read PDF →
DevOps · Level MAX
25 chapters · the handbook
Read PDF →
DevOps Tricks
24 chapters · beginner → advanced
Read PDF →
DevOps Field Manual
848 pages · operations
Read PDF →
07 — Contact

Let's build reliable systems.

Whether it's AI infrastructure, platform engineering, observability, automation, or production operations — I'm always interested in challenging engineering problems.

Email
contact@docaohieu.com
Phone
+84 908 388 911
GitHub
github.com/docaohieu2808
LinkedIn
linkedin.com/in/docaohieu2808
Email me · Download Résumé
Open to DevOps · Platform · SRE roles, remote or Ho Chi Minh City
Do Cao Hieu · DevOps & AI Infrastructure · Built & operated in Ho Chi Minh City · ● all systems operational