ORNL
Case Study
ORNL Leverages Secure RKE2 Infrastructure and RAG-Powered Knowledge Graphs to Accelerate LLM Adoption
Introduction
Oak Ridge National Laboratory (ORNL) is one of the United States’ leading multi-disciplinary research institutions, established in 1943 as part of the Manhattan Project.
Located in Oak Ridge, Tennessee, ORNL employs over 7,000 staff, operates on a $2.6 billion annual budget, and is renowned for its capabilities in high-performance computing, advanced materials, neutron science, nuclear science, and national security research.
ORNL sought to explore generative AI and retrieval-augmented generation (RAG) for large language model (LLM) applications in highly regulated environments (e.g., defense, healthcare).
They required a tightly controlled GPU compute environment, end-to-end monitoring, robust model and data storage, and disaster-recovery mechanisms—while ensuring no exposure to the public internet.

Challenge
Oak Ridge National Laboratory needed to bring large-scale LLM capabilities into its highly regulated research environment, but faced several challenges common to both cloud and on-prem teams:
Regulatory Isolation
All compute and storage must reside on private subnets with no public internet exposure.
Secure GPU Orchestration
NVIDIA GPU clusters need strict scheduling and device-plugin management under RKE2.
Container Security
Every Docker image must be vulnerability-scanned and remediated prior to deployment.
Continuous Delivery
Coordinating model updates, microservice releases, and GitOps-driven deployments without manual steps.
Observability & Compliance
Real-time telemetry across GPUs, pods, and application layers for both operations and audits.
Disaster Recovery
Enterprise-grade backup/restore for cluster state, MinIO buckets, and Postgres metadata.
Solution
To address these challenges, ORNL’s DevOps team collaborated with our architects to deliver the following:
Network-Isolated RKE2 Cluster
Deploy RKE2 on private subnets, secured with FIPS-compliant PKI and RBAC.
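To make the access-control layer concrete, here is a minimal sketch of a least-privilege RBAC grant of the kind enforced on such a cluster, written with the official kubernetes Python client. The namespace, role, and group names are illustrative, not ORNL’s actual configuration.

```python
# Minimal sketch: a least-privilege Role and RoleBinding of the kind used to
# gate access on the isolated cluster. Namespace, role, and group names are
# illustrative.
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()
ns = "llm-inference"  # hypothetical namespace

# Read-only access to pods and their logs, scoped to one namespace.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="pod-reader", namespace=ns),
    rules=[client.V1PolicyRule(api_groups=[""],
                               resources=["pods", "pods/log"],
                               verbs=["get", "list", "watch"])],
)
rbac.create_namespaced_role(namespace=ns, body=role)

# Bind the role to a (hypothetical) researcher group.
# Note: older client releases name this class V1Subject, not RbacV1Subject.
binding = client.V1RoleBinding(
    metadata=client.V1ObjectMeta(name="pod-reader-binding", namespace=ns),
    subjects=[client.RbacV1Subject(kind="Group", name="llm-researchers",
                                   api_group="rbac.authorization.k8s.io")],
    role_ref=client.V1RoleRef(kind="Role", name="pod-reader",
                              api_group="rbac.authorization.k8s.io"),
)
rbac.create_namespaced_role_binding(namespace=ns, body=binding)
```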
GPU Workload Management
Use NVIDIA’s device plugin for RKE2 and dedicated node pools to separate training and inference workloads.
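A sketch of how an inference workload lands on the right pool: the NVIDIA device plugin advertises GPUs as the nvidia.com/gpu extended resource, and a node selector pins the pod to the inference pool. The label, namespace, and image below are hypothetical.

```python
# Sketch: pin an inference pod to the dedicated GPU pool. The NVIDIA device
# plugin exposes GPUs as the "nvidia.com/gpu" extended resource; node label,
# namespace, and image are hypothetical.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference-0",
                                 namespace="llm-inference"),
    spec=client.V1PodSpec(
        node_selector={"gpu-pool": "inference"},  # illustrative pool label
        containers=[client.V1Container(
            name="inference",
            image="registry.internal/llm/inference:vetted",  # placeholder
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"},  # one GPU per replica
            ),
        )],
        restart_policy="Never",
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="llm-inference", body=pod)
```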
Container CI/CD with ArgoCD & Trivy
CI pipelines build the Docker images, and Trivy scans them for known vulnerabilities; any findings are automatically triaged and fixed before promotion. ArgoCD continuously reconciles Git repositories to the RKE2 cluster, ensuring only vetted images are ever deployed.
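The promotion gate reduces to a few lines: Trivy’s --exit-code flag makes the scan fail the pipeline step whenever findings at the listed severities are present. A sketch, with a placeholder image tag:

```python
# Sketch of the CI promotion gate: scan the candidate image with Trivy and
# block promotion on HIGH/CRITICAL findings. The image tag is a placeholder;
# --exit-code and --severity are standard Trivy flags.
import subprocess
import sys

IMAGE = "registry.internal/llm/inference:candidate"  # hypothetical

# --exit-code 1 makes Trivy return non-zero when findings at or above the
# listed severities exist, failing this pipeline step.
scan = subprocess.run(["trivy", "image", "--exit-code", "1",
                       "--severity", "HIGH,CRITICAL", IMAGE])
if scan.returncode != 0:
    sys.exit("vulnerabilities found: image blocked from promotion")

# Only vetted images are pushed to the registry ArgoCD deploys from.
subprocess.run(["docker", "push", IMAGE], check=True)
```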
RAG & Knowledge Graph Integration
Ingest ORNL technical publications into a graph database. Provide a RAG frontend powered by OpenAI embeddings and custom retrievers.
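A condensed sketch of the retrieval path, assuming a Neo4j-style graph database with a vector index over publication chunks; the endpoint, index, model, and property names are assumptions, not ORNL’s actual schema.

```python
# Condensed sketch of the retrieval path: embed the question, then pull the
# nearest publication chunks from the graph database. Assumes a Neo4j-style
# vector index; endpoint, index, model, and property names are illustrative.
import os
from neo4j import GraphDatabase
from openai import OpenAI

llm_client = OpenAI()  # base_url can point at an internal gateway if needed
driver = GraphDatabase.driver("bolt://graphdb.internal:7687",
                              auth=("neo4j", os.environ["NEO4J_PASSWORD"]))

def retrieve(question: str, k: int = 5) -> list[str]:
    # 1) Embed the user question.
    emb = llm_client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=question,
    ).data[0].embedding

    # 2) k-nearest-neighbor search over chunk nodes in the graph.
    #    "publication_chunks" is a hypothetical vector index name.
    cypher = ("CALL db.index.vector.queryNodes('publication_chunks', $k, $emb) "
              "YIELD node RETURN node.text AS text")
    with driver.session() as session:
        return [record["text"] for record in session.run(cypher, k=k, emb=emb)]

# The retrieved chunks are injected into the LLM prompt as grounding context.
```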
Model & Data Storage
MinIO for LLM weights, checkpoints, and vector indexes (S3-compatible). Postgres for metadata and session data.
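A sketch of the storage handoff: push a checkpoint to MinIO over its S3-compatible API, then record the artifact in Postgres so sessions can resolve it later. Endpoints, bucket, and table schema are illustrative.

```python
# Sketch: persist a model checkpoint to MinIO and record its metadata in
# Postgres. Endpoints, bucket name, and schema are illustrative.
import os
import psycopg2
from minio import Minio

s3 = Minio("minio.internal:9000",
           access_key=os.environ["MINIO_ACCESS_KEY"],
           secret_key=os.environ["MINIO_SECRET_KEY"],
           secure=True)

# Upload fine-tuned weights to the (hypothetical) "models" bucket.
s3.fput_object("models", "llm/checkpoint-0042.safetensors",
               "/data/checkpoints/checkpoint-0042.safetensors")

# Record the artifact in the metadata database.
conn = psycopg2.connect(host="postgres.internal", dbname="llm_metadata",
                        user="llm", password=os.environ["PGPASSWORD"])
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO model_artifacts (bucket, object_key, created_at)"
        " VALUES (%s, %s, now())",
        ("models", "llm/checkpoint-0042.safetensors"),
    )
```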
Observability
Prometheus + DCGM Exporter for GPU metrics (utilization, memory, temperature). Grafana dashboards for cluster health, pod status, and inference latency.
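A sketch of how that GPU telemetry is consumed: the DCGM exporter publishes gauges such as DCGM_FI_DEV_GPU_UTIL, which Prometheus scrapes and serves over its HTTP API (the same queries back the Grafana dashboards). The endpoint is illustrative.

```python
# Sketch: read current GPU utilization from Prometheus over its HTTP API.
# DCGM_FI_DEV_GPU_UTIL is the DCGM exporter's utilization gauge; the
# Prometheus endpoint is illustrative.
import requests

resp = requests.get("http://prometheus.internal:9090/api/v1/query",
                    params={"query": "avg by (gpu) (DCGM_FI_DEV_GPU_UTIL)"},
                    timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    ts, value = series["value"]  # instant vector: [timestamp, "value"]
    print(f"GPU {series['metric'].get('gpu', '?')}: {float(value):.0f}% utilized")
```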
Disaster Recovery
Velero with encrypted backups stored on-premises; nightly full snapshots plus hourly incremental backups.
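That cadence maps onto two Velero schedules. A sketch using real velero subcommands, with illustrative names and namespaces; in practice the schedules would be applied as Schedule custom resources through GitOps.

```python
# Sketch of the backup cadence as Velero CLI calls (real velero subcommands;
# schedule names and namespaces are illustrative).
import subprocess

namespaces = "llm-inference,llm-training"  # hypothetical

# Nightly full backup at 02:00.
subprocess.run(["velero", "schedule", "create", "nightly-full",
                "--schedule", "0 2 * * *",
                "--include-namespaces", namespaces], check=True)

# Hourly backups with a short TTL approximate the incremental cadence.
subprocess.run(["velero", "schedule", "create", "hourly",
                "--schedule", "0 * * * *",
                "--include-namespaces", namespaces,
                "--ttl", "72h0m0s"], check=True)

# A restore drill then looks like:
#   velero restore create --from-backup nightly-full-<timestamp>
```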

Architecture & Technology Overview
Orchestration
RKE2 (private network, FIPS PKI, RBAC)
GPU Scheduling
NVIDIA RKE2 Device Plugin, Dedicated node pools
CI/CD
GitOps with ArgoCD, Image scanning via Trivy
Model Storage
MinIO (S3-compatible)
Metadata DB
PostgreSQL
Monitoring
Prometheus, Grafana
Backup
Velero
Results
Security Hardened
100% of container images scanned and remediated before deployment, eliminating known critical vulnerabilities.
Reliability
Velero-verified recovery with a mean RTO of 12 minutes.
Zero unplanned downtime while running GPU workloads at scale.
RAG Performance
Average query latency dropped from 800 ms to 200 ms, comfortably sub-second.
Compliance
Passed internal and external audits with no major findings on network isolation, encryption, or container supply-chain security.
Operational Efficiency
GPU utilization climbed to 95% for inference tasks. New model rollouts via ArgoCD completed in under 8 minutes end-to-end.