ORNL
Case Study
ORNL Leverages Secure RKE2 Infrastructure and RAG-Powered Knowledge Graphs to Accelerate LLM Adoption
Introduction
Oak Ridge National Laboratory (ORNL) is one of the United States’ leading multi-disciplinary research institutions, established in 1943 as part of the Manhattan Project.
Located in Oak Ridge, Tennessee, ORNL employs over 7,000 staff, operates on a $2.6 billion annual budget, and is renowned for its capabilities in high-performance computing, advanced materials, neutron science, nuclear science, and national security research.
ORNL sought to explore generative AI and retrieval-augmented generation (RAG) for large language model (LLM) applications in highly regulated environments (e.g., defense, healthcare).
They required a tightly controlled GPU compute environment, end-to-end monitoring, robust model and data storage, and disaster-recovery mechanisms—while ensuring no exposure to the public internet.

Challenge
Oak Ridge National Laboratory needed to bring large-scale LLM capabilities into its highly regulated research environment, but faced several challenges common to both cloud and on-prem teams:
Regulatory Isolation
All compute and storage must reside on private subnets with no public internet exposure.
Secure GPU Orchestration
NVIDIA GPU clusters need strict scheduling and device-plugin management under RKE2.
Container Security
Every Docker image must be vulnerability-scanned and remediated prior to deployment.
Continuous Delivery
Coordinating model updates, microservice releases, and GitOps-driven deployments without manual steps.
Observability & Compliance
Real-time telemetry across GPUs, pods, and application layers for both operations and audits.
Disaster Recovery
Enterprise-grade backup/restore for cluster state, MinIO buckets, and Postgres metadata.
Solution
To address these challenges, ORNL’s DevOps team collaborated with our architects to deliver the following:
Network-Isolated RKE2 Cluster
Deploy RKE2 on private subnets, secured with FIPS-compliant PKI and RBAC.
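To make the access-control layer concrete, here is a minimal sketch of a least-privilege RBAC grant of the kind enforced on such a cluster, written with the official kubernetes Python client. The namespace, role, and group names are illustrative, not ORNL’s actual configuration.

```python
# Minimal sketch: a least-privilege Role and RoleBinding of the kind used to
# gate access on the isolated cluster. Namespace, role, and group names are
# illustrative.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()
ns = "llm-inference"  # hypothetical namespace

# Read-only access to pods and their logs, scoped to one namespace.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="pod-reader", namespace=ns),
    rules=[client.V1PolicyRule(api_groups=[""],
                               resources=["pods", "pods/log"],
                               verbs=["get", "list", "watch"])],
)
rbac.create_namespaced_role(namespace=ns, body=role)

# Bind the role to a (hypothetical) researcher group.
# Note: older client releases name this class V1Subject, not RbacV1Subject.
binding = client.V1RoleBinding(
    metadata=client.V1ObjectMeta(name="pod-reader-binding", namespace=ns),
    subjects=[client.RbacV1Subject(kind="Group", name="llm-researchers",
                                   api_group="rbac.authorization.k8s.io")],
    role_ref=client.V1RoleRef(kind="Role", name="pod-reader",
                              api_group="rbac.authorization.k8s.io"),
)
rbac.create_namespaced_role_binding(namespace=ns, body=binding)
```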
GPU Workload Management
Use NVIDIA’s device plugin for RKE2 and dedicated node pools to separate training and inference workloads.
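A sketch of how an inference workload lands on the right pool: the NVIDIA device plugin advertises GPUs as the nvidia.com/gpu extended resource, and a node selector pins the pod to the inference pool. The label, namespace, and image below are hypothetical.

```python
# Sketch: pin an inference pod to the dedicated GPU pool. The NVIDIA device
# plugin exposes GPUs as the "nvidia.com/gpu" extended resource; node label,
# namespace, and image are hypothetical.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference-0",
                                 namespace="llm-inference"),
    spec=client.V1PodSpec(
        node_selector={"gpu-pool": "inference"},  # illustrative pool label
        containers=[client.V1Container(
            name="inference",
            image="registry.internal/llm/inference:vetted",  # placeholder
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"},  # one GPU per replica
            ),
        )],
        restart_policy="Never",
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="llm-inference", body=pod)
```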
Container CI/CD with ArgoCD & Trivy
CI pipelines build the Docker images, and Trivy scans them for known vulnerabilities; any findings are automatically triaged and fixed before promotion. ArgoCD continuously reconciles Git repositories to the RKE2 cluster, ensuring only vetted images are ever deployed.
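The promotion gate reduces to a few lines: Trivy’s --exit-code flag makes the scan fail the pipeline step whenever findings at the listed severities are present. A sketch, with a placeholder image tag:

```python
# Sketch of the CI promotion gate: scan the candidate image with Trivy and
# block promotion on HIGH/CRITICAL findings. The image tag is a placeholder;
# --exit-code and --severity are standard Trivy flags.
import subprocess
import sys

IMAGE = "registry.internal/llm/inference:candidate"  # hypothetical

# --exit-code 1 makes Trivy return non-zero when findings at or above the
# listed severities exist, failing this pipeline step.
scan = subprocess.run(["trivy", "image", "--exit-code", "1",
                       "--severity", "HIGH,CRITICAL", IMAGE])
if scan.returncode != 0:
    sys.exit("vulnerabilities found: image blocked from promotion")

# Only vetted images are pushed to the registry ArgoCD deploys from.
subprocess.run(["docker", "push", IMAGE], check=True)
```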
RAG & Knowledge Graph Integration
Ingest ORNL technical publications into a graph database. Provide a RAG frontend powered by OpenAI embeddings and custom retrievers.
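A condensed sketch of the retrieval path, assuming a Neo4j-style graph database with a vector index over publication chunks; the endpoint, index, model, and property names are assumptions, not ORNL’s actual schema.

```python
# Condensed sketch of the retrieval path: embed the question, then pull the
# nearest publication chunks from the graph database. Assumes a Neo4j-style
# vector index; endpoint, index, model, and property names are illustrative.
import os
from neo4j import GraphDatabase
from openai import OpenAI

llm_client = OpenAI()  # base_url can point at an internal gateway if needed
driver = GraphDatabase.driver("bolt://graphdb.internal:7687",
                              auth=("neo4j", os.environ["NEO4J_PASSWORD"]))

def retrieve(question: str, k: int = 5) -> list[str]:
    # 1) Embed the user question.
    emb = llm_client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=question,
    ).data[0].embedding

    # 2) k-nearest-neighbor search over chunk nodes in the graph.
    #    "publication_chunks" is a hypothetical vector index name.
    cypher = ("CALL db.index.vector.queryNodes('publication_chunks', $k, $emb) "
              "YIELD node RETURN node.text AS text")
    with driver.session() as session:
        return [record["text"] for record in session.run(cypher, k=k, emb=emb)]

# The retrieved chunks are injected into the LLM prompt as grounding context.
```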
Model & Data Storage
MinIO for LLM weights, checkpoints, and vector indexes (S3-compatible). Postgres for metadata and session data.
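A sketch of the storage handoff: push a checkpoint to MinIO over its S3-compatible API, then record the artifact in Postgres so sessions can resolve it later. Endpoints, bucket, and table schema are illustrative.

```python
# Sketch: persist a model checkpoint to MinIO and record its metadata in
# Postgres. Endpoints, bucket name, and schema are illustrative.
import os
import psycopg2
from minio import Minio

s3 = Minio("minio.internal:9000",
           access_key=os.environ["MINIO_ACCESS_KEY"],
           secret_key=os.environ["MINIO_SECRET_KEY"],
           secure=True)

# Upload fine-tuned weights to the (hypothetical) "models" bucket.
s3.fput_object("models", "llm/checkpoint-0042.safetensors",
               "/data/checkpoints/checkpoint-0042.safetensors")

# Record the artifact in the metadata database.
conn = psycopg2.connect(host="postgres.internal", dbname="llm_metadata",
                        user="llm", password=os.environ["PGPASSWORD"])
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO model_artifacts (bucket, object_key, created_at)"
        " VALUES (%s, %s, now())",
        ("models", "llm/checkpoint-0042.safetensors"),
    )
```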
Observability
Prometheus + DCGM Exporter for GPU metrics (utilization, memory, temperature). Grafana dashboards for cluster health, pod status, and inference latency.
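A sketch of how that GPU telemetry is consumed: the DCGM exporter publishes gauges such as DCGM_FI_DEV_GPU_UTIL, which Prometheus scrapes and serves over its HTTP API (the same queries back the Grafana dashboards). The endpoint is illustrative.

```python
# Sketch: read current GPU utilization from Prometheus over its HTTP API.
# DCGM_FI_DEV_GPU_UTIL is the DCGM exporter's utilization gauge; the
# Prometheus endpoint is illustrative.
import requests

resp = requests.get("http://prometheus.internal:9090/api/v1/query",
                    params={"query": "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)"},
                    timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    ts, value = series["value"]  # instant vector: [timestamp, "value"]
    print(f"GPU {series['metric'].get('gpu', '?')}: {float(value):.0f}% utilized")
```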
Disaster Recovery
Velero with encrypted backups stored on-premises; nightly full snapshots plus hourly incremental backups.
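That cadence maps onto two Velero schedules. A sketch using real velero subcommands, with illustrative names and namespaces; in practice the schedules would be applied as Schedule custom resources through GitOps.

```python
# Sketch of the backup cadence as Velero CLI calls (real velero subcommands;
# schedule names and namespaces are illustrative).
import subprocess

namespaces = "llm-inference,llm-training"  # hypothetical

# Nightly full backup at 02:00.
subprocess.run(["velero", "schedule", "create", "nightly-full",
                "--schedule", "0 2 * * *",
                "--include-namespaces", namespaces], check=True)

# Hourly backups with a short TTL approximate the incremental cadence.
subprocess.run(["velero", "schedule", "create", "hourly",
                "--schedule", "0 * * * *",
                "--include-namespaces", namespaces,
                "--ttl", "72h0m0s"], check=True)

# A restore drill then looks like:
#   velero restore create --from-backup nightly-full-<timestamp>
```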

Architecture & Technology Overview
Orchestration
RKE2 (private network, FIPS PKI, RBAC)
GPU Scheduling
NVIDIA RKE2 Device Plugin, Dedicated node pools
CI/CD
GitOps with ArgoCD, Image scanning via Trivy
Model Storage
MinIO (S3-compatible)
Metadata DB
PostgreSQL
Monitoring
Prometheus, Grafana
Backup
Velero
Results
Security Hardened
100% of container images scanned and remediated before deployment, eliminating known critical vulnerabilities.
Reliability
Velero-verified recovery with a mean RTO of 12 minutes.
Zero unplanned downtime while running GPU workloads at scale.
RAG Performance
Average query latency dropped from 800 ms to 200 ms, comfortably sub-second.
Compliance
Passed internal and external audits with no major findings on network isolation, encryption, or container supply-chain security.
Operational Efficiency
GPU utilization climbed to 95% for inference tasks. New model rollouts via ArgoCD completed in under 8 minutes end-to-end.