A Python-based sidecar runtime for AI inference engines on Kubernetes. Patio provides a unified interface between workload controllers (such as RoleBasedGroup) and inference engines like SGLang and vLLM.
Patio is a lightweight FastAPI server that runs alongside inference engine containers and provides:
- LoRA Adapter Management — Dynamically load and unload LoRA adapters at runtime without restarting the engine
- Unified Prometheus Metrics — Scrape, normalize, and re-expose inference engine metrics with a
patio:prefix - Distributed Topology Management — Register workers with a central router, heartbeat-based recovery, and graceful shutdown
- Health Check & Readiness — Standard Kubernetes liveness and readiness probes
┌─────────────────────────────────────────────────────────────┐
│ Pod │
│ ┌──────────────────┐ ┌─────────────────────────────────┐│
│ │ Inference Engine │ │ Patio Sidecar (port 9091) ││
│ │ (SGLang/vLLM) │◄──►│ ││
│ │ │ │ - LoRA API ││
│ │ :8000 │ │ - Metrics (/metrics) ││
│ │ │ │ - Topology Client ││
│ │ │ │ - Health (/health) ││
│ └──────────────────┘ └─────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
Patio acts as a proxy and management layer, allowing RBG to control inference engines through a standardized HTTP API.
Dynamically manage LoRA adapters without engine restarts:
# Load a LoRA adapter
curl -X POST http://localhost:9091/load_lora_adapter \
-H "Content-Type: application/json" \
-d '{"lora_name": "my-adapter", "lora_path": "/models/my-adapter"}'
# Unload a LoRA adapter
curl -X POST http://localhost:9091/unload_lora_adapter \
-H "Content-Type: application/json" \
-d '{"lora_name": "my-adapter"}'Patio proxies these requests to the underlying engine (SGLang or vLLM) using their native APIs.
Patio scrapes the inference engine's /metrics endpoint and normalizes metric names:
| Engine Metric | Patio Metric |
|---|---|
sglang:num_running_reqs |
patio:num_requests_running |
vllm:num_requests_running |
patio:num_requests_running |
sglang:num_prompt_tokens_total |
patio:input_tokens_total |
Access metrics at http://localhost:9091/metrics in Prometheus text format.
For multi-node inference deployments, Patio manages worker registration and discovery:
- Worker Registration — Automatically registers with a central router on startup
- Heartbeat — Periodic heartbeats ensure automatic recovery if the router restarts
- Graceful Shutdown — Unregisters from the router on SIGTERM/SIGINT
Configure via environment variables:
env:
- name: TOPO_TYPE
value: "SGLang"
- name: SGL_ROUTER_ROLE_NAME
value: "router"
- name: SGL_ROUTER_PORT
value: "8000"- Python 3.8+
- Kubernetes cluster with RoleBasedGroup installed
- Inference engine (SGLang or vLLM) running in the same pod
Install dependencies:
pip install -r requirements.txtFor development, also install test dependencies:
pip install -r requirements-dev.txtStart the Patio server:
python -m patio.app --host 127.0.0.1 --port 9091| Option | Default | Description |
|---|---|---|
--host |
0.0.0.0 |
Host to listen on |
--port |
9091 |
Port to listen on |
--log-level |
INFO |
Logging level (DEBUG, INFO, WARNING, ERROR) |
--enable-fastapi-docs |
false |
Enable FastAPI documentation endpoints |
--scrape-engine-metrics |
true |
Enable scraping of engine metrics |
| Variable | Description | Default |
|---|---|---|
INFERENCE_ENGINE |
Inference engine type (sglang or vllm) |
sglang |
INFERENCE_ENGINE_VERSION |
Engine version | v0.5.3 |
INFERENCE_ENGINE_ENDPOINT |
Engine endpoint URL | http://localhost:8000 |
TOPO_TYPE |
Topology type (SGLang or None) |
None |
GROUP_NAME |
RBG group name | None |
ROLE_NAME |
RBG role name | None |
ROLE_INDEX |
RBG role index | None |
HEARTBEAT_INTERVAL |
Topology heartbeat interval (seconds) | 30 |
Patio is typically deployed as a sidecar container within a RoleBasedGroup role. Use the ClusterEngineRuntimeProfile CRD (defined in RBG) to inject Patio automatically:
apiVersion: workloads.x-k8s.io/v1alpha2
kind: ClusterEngineRuntimeProfile
metadata:
name: patio-runtime
spec:
containers:
- name: patio
image: rolebasedgroup/rbgs-patio-runtime:latest
ports:
- containerPort: 9091
name: patio
env:
- name: INFERENCE_ENGINE
value: "sglang"
- name: INFERENCE_ENGINE_ENDPOINT
value: "http://localhost:8000"
- name: TOPO_TYPE
value: "SGLang"
resources:
requests:
cpu: "100m"
memory: "128Mi"
---
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
name: my-inference
spec:
roles:
- name: inference
replicas: 2
engineRuntimes:
- profileName: patio-runtime
standalonePattern:
template:
spec:
containers:
- name: sglang
image: lmsysorg/sglang:latest
ports:
- containerPort: 8000| Method | Path | Description |
|---|---|---|
GET |
/ |
Server status |
GET |
/health |
Liveness probe |
GET |
/metrics |
Prometheus metrics |
| Method | Path | Description |
|---|---|---|
POST |
/load_lora_adapter |
Load a LoRA adapter |
POST |
/unload_lora_adapter |
Unload a LoRA adapter |
LoadLoraAdapterRequest:
{
"lora_name": "string",
"lora_path": "string"
}UnLoadLoraAdapterRequest:
{
"lora_name": "string"
}inference-engine-runtime/
├── app.py # Main entry point (FastAPI + uvicorn)
├── config.py # Configuration constants
├── envs.py # Environment variable parsing
├── logger.py # Logging configuration
├── api/ # HTTP API layer
│ ├── server_router.py # Server endpoints
│ ├── lora_router.py # LoRA endpoints
│ └── protocol.py # Request/response models
├── engine/ # Inference engine abstraction
│ ├── base.py # Base engine interface
│ ├── sglang_engine.py # SGLang implementation
│ └── vllm_engine.py # vLLM implementation
├── metrics/ # Prometheus metrics
│ ├── metrics.py # Built-in metrics
│ ├── engine_collector.py # Engine metrics scraper
│ └── standard_rules.py # Metric normalization
├── topo/ # Topology management
│ ├── factory.py # Client/server factories
│ ├── client/ # Topology clients
│ └── server/ # Topology servers
├── tests/ # Unit and E2E tests
└── doc/ # Additional documentation
# Run all unit tests
python -m pytest tests/mock_tests -v
# Run specific test file
python -m pytest tests/mock_tests/test_sglang_engine.py -v
# Run with coverage
python -m pytest tests/mock_tests --cov=patio --cov-report=term-missingdocker build -t inference-engine-runtime:latest .Apache License 2.0. See LICENSE.
Patio was originally developed as part of the RoleBasedGroup project and has been extracted into a standalone project for independent development and deployment.