Inference Engine Runtime (Patio)

A Python-based sidecar runtime for AI inference engines on Kubernetes. Patio provides a unified interface between workload controllers (such as RoleBasedGroup) and inference engines like SGLang and vLLM.

Overview

Patio is a lightweight FastAPI server that runs alongside inference engine containers and provides:

LoRA Adapter Management — Dynamically load and unload LoRA adapters at runtime without restarting the engine
Unified Prometheus Metrics — Scrape, normalize, and re-expose inference engine metrics with a patio: prefix
Distributed Topology Management — Register workers with a central router, heartbeat-based recovery, and graceful shutdown
Health Check & Readiness — Standard Kubernetes liveness and readiness probes

Architecture

┌─────────────────────────────────────────────────────────────────┐
│  Pod                                                            │
│  ┌────────────────────┐    ┌─────────────────────────────────┐  │
│  │ Inference Engine   │    │ Patio Sidecar (port 9091)       │  │
│  │ (SGLang/vLLM)      │◄──►│                                 │  │
│  │                    │    │ - LoRA API                      │  │
│  │ :8000              │    │ - Metrics (/metrics)            │  │
│  │                    │    │ - Topology Client               │  │
│  │                    │    │ - Health (/health)              │  │
│  └────────────────────┘    └─────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Patio acts as a proxy and management layer, allowing RoleBasedGroup (RBG) to control inference engines through a standardized HTTP API.

Features

LoRA Adapter Management

Dynamically manage LoRA adapters without engine restarts:

# Load a LoRA adapter
curl -X POST http://localhost:9091/load_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter", "lora_path": "/models/my-adapter"}'

# Unload a LoRA adapter
curl -X POST http://localhost:9091/unload_lora_adapter \
  -H "Content-Type: application/json" \
  -d '{"lora_name": "my-adapter"}'

Patio proxies these requests to the underlying engine (SGLang or vLLM) through that engine's native API.
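
The same two calls can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_lora_request` and `send` are hypothetical and not part of Patio, and the response body format is not specified in this README, so it is returned as raw text.

```python
import json
import urllib.request
from typing import Optional

PATIO_URL = "http://localhost:9091"  # sidecar address used in the curl examples


def build_lora_request(action: str, lora_name: str,
                       lora_path: Optional[str] = None,
                       base_url: str = PATIO_URL) -> urllib.request.Request:
    """Build the POST request for /load_lora_adapter or /unload_lora_adapter."""
    if action not in ("load", "unload"):
        raise ValueError(f"unknown action: {action!r}")
    payload = {"lora_name": lora_name}
    if action == "load":
        if not lora_path:
            raise ValueError("lora_path is required when loading an adapter")
        payload["lora_path"] = lora_path
    return urllib.request.Request(
        url=f"{base_url}/{action}_lora_adapter",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a sidecar running, `send(build_lora_request("load", "my-adapter", "/models/my-adapter"))` mirrors the first curl command, and `send(build_lora_request("unload", "my-adapter"))` mirrors the second.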

Unified Prometheus Metrics

Patio scrapes the inference engine's /metrics endpoint and normalizes metric names:

Engine Metric	Patio Metric
`sglang:num_running_reqs`	`patio:num_requests_running`
`vllm:num_requests_running`	`patio:num_requests_running`
`sglang:num_prompt_tokens_total`	`patio:input_tokens_total`

Access metrics at http://localhost:9091/metrics in Prometheus text format.
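
To illustrate the renaming, here is a small sketch that rewrites metric names in Prometheus text format using only the three mappings from the table above. It is not the code in metrics/standard_rules.py; the function names are hypothetical, and passing unmapped metrics through unchanged is an assumption, since this README does not say how Patio treats them.

```python
# Engine-to-Patio name mapping, copied from the table above.
RENAMES = {
    "sglang:num_running_reqs": "patio:num_requests_running",
    "vllm:num_requests_running": "patio:num_requests_running",
    "sglang:num_prompt_tokens_total": "patio:input_tokens_total",
}


def normalize_line(line: str) -> str:
    """Rename the metric on one line of Prometheus text exposition format."""
    stripped = line.strip()
    if not stripped:
        return line
    if stripped.startswith("#"):
        # "# HELP <name> <text>" or "# TYPE <name> <type>"
        parts = stripped.split(" ", 3)
        if len(parts) >= 3 and parts[1] in ("HELP", "TYPE") and parts[2] in RENAMES:
            parts[2] = RENAMES[parts[2]]
            return " ".join(parts)
        return line
    # The metric name ends at the first "{" (labels) or space (value).
    end = len(stripped)
    for sep in ("{", " "):
        idx = stripped.find(sep)
        if idx != -1:
            end = min(end, idx)
    name, rest = stripped[:end], stripped[end:]
    # Assumption: metrics without a mapping are passed through unchanged.
    return RENAMES.get(name, name) + rest


def normalize(text: str) -> str:
    """Apply normalize_line to every line of a scraped /metrics payload."""
    return "\n".join(normalize_line(line) for line in text.splitlines())
```

Because both `sglang:num_running_reqs` and `vllm:num_requests_running` map to `patio:num_requests_running`, a dashboard built on the Patio name works regardless of which engine runs in the pod.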

Distributed Topology Management

For multi-node inference deployments, Patio manages worker registration and discovery:

Worker Registration — Automatically registers with a central router on startup
Heartbeat — Periodic heartbeats ensure automatic recovery if the router restarts
Graceful Shutdown — Unregisters from the router on SIGTERM/SIGINT

Configure via environment variables:

env:
  - name: TOPO_TYPE
    value: "SGLang"
  - name: SGL_ROUTER_ROLE_NAME
    value: "router"
  - name: SGL_ROUTER_PORT
    value: "8000"
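
The register, heartbeat, and unregister lifecycle described above can be sketched as follows. This is an illustration, not Patio's topology client: the class name is hypothetical, the `register` and `unregister` callables stand in for the router HTTP calls (which this README does not specify), and treating each heartbeat as a repeated registration is an assumption about how recovery works.

```python
import signal
import threading
from typing import Callable


class TopologyLifecycle:
    """Register on start, re-register on every heartbeat, unregister on stop."""

    def __init__(self, register: Callable[[], None],
                 unregister: Callable[[], None], interval: float = 30.0):
        self._register = register
        self._unregister = unregister
        self._interval = interval  # seconds, cf. HEARTBEAT_INTERVAL
        self._stop = threading.Event()

    def run(self) -> None:
        """Block until stop() is called, then unregister exactly once."""
        try:
            while not self._stop.is_set():
                try:
                    # Re-sending the registration lets a restarted router
                    # rediscover this worker (assumed heartbeat semantics).
                    self._register()
                except Exception as exc:  # router may be briefly unavailable
                    print(f"registration failed, retrying on next beat: {exc}")
                # Sleeps for one interval but wakes immediately on stop().
                self._stop.wait(self._interval)
        finally:
            self._unregister()

    def stop(self, *_signal_args) -> None:
        """Request shutdown; the signature also fits a signal handler."""
        self._stop.set()


def install_signal_handlers(lifecycle: TopologyLifecycle) -> None:
    """Stop the lifecycle on SIGTERM/SIGINT. Must run in the main thread."""
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, lifecycle.stop)
```

Waiting on a `threading.Event` rather than calling `time.sleep` means a SIGTERM ends the loop without waiting out the remainder of the heartbeat interval, which matters when the pod's termination grace period is short.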

Quick Start

Prerequisites

Python 3.8+
Kubernetes cluster with RoleBasedGroup installed
Inference engine (SGLang or vLLM) running in the same pod

Installation

Install dependencies:

pip install -r requirements.txt

For development, also install test dependencies:

pip install -r requirements-dev.txt

Running Locally

Start the Patio server:

python -m patio.app --host 127.0.0.1 --port 9091

Command Line Options

Option	Default	Description
`--host`	`0.0.0.0`	Host to listen on
`--port`	`9091`	Port to listen on
`--log-level`	`INFO`	Logging level (DEBUG, INFO, WARNING, ERROR)
`--enable-fastapi-docs`	`false`	Enable FastAPI documentation endpoints
`--scrape-engine-metrics`	`true`	Enable scraping of engine metrics

Environment Variables

Variable	Description	Default
`INFERENCE_ENGINE`	Inference engine type (`sglang` or `vllm`)	`sglang`
`INFERENCE_ENGINE_VERSION`	Engine version	`v0.5.3`
`INFERENCE_ENGINE_ENDPOINT`	Engine endpoint URL	`http://localhost:8000`
`TOPO_TYPE`	Topology type (`SGLang` or `None`)	`None`
`GROUP_NAME`	RBG group name	`None`
`ROLE_NAME`	RBG role name	`None`
`ROLE_INDEX`	RBG role index	`None`
`HEARTBEAT_INTERVAL`	Topology heartbeat interval (seconds)	`30`
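
For reference, the defaults in this table can be expressed as a small loader. This is a sketch and not the contents of envs.py: the `PatioEnv` and `load_env` names are hypothetical, and the validation rules (rejecting unknown engines, treating the literal string `None` as unset, keeping `ROLE_INDEX` as a string) are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PatioEnv:
    inference_engine: str
    inference_engine_version: str
    inference_engine_endpoint: str
    topo_type: Optional[str]
    group_name: Optional[str]
    role_name: Optional[str]
    role_index: Optional[str]  # kept as a string; its real type is not documented
    heartbeat_interval: int    # seconds


def _optional(environ: Mapping[str, str], name: str) -> Optional[str]:
    """Return the variable's value, or None if unset, empty, or 'None'."""
    value = environ.get(name)
    return None if value in (None, "", "None") else value


def load_env(environ: Mapping[str, str] = os.environ) -> PatioEnv:
    """Read Patio's settings, falling back to the defaults in the table."""
    engine = environ.get("INFERENCE_ENGINE", "sglang")
    if engine not in ("sglang", "vllm"):
        raise ValueError(f"INFERENCE_ENGINE must be 'sglang' or 'vllm', got {engine!r}")
    interval = int(environ.get("HEARTBEAT_INTERVAL", "30"))
    if interval <= 0:
        raise ValueError("HEARTBEAT_INTERVAL must be a positive number of seconds")
    return PatioEnv(
        inference_engine=engine,
        inference_engine_version=environ.get("INFERENCE_ENGINE_VERSION", "v0.5.3"),
        inference_engine_endpoint=environ.get("INFERENCE_ENGINE_ENDPOINT",
                                              "http://localhost:8000"),
        topo_type=_optional(environ, "TOPO_TYPE"),
        group_name=_optional(environ, "GROUP_NAME"),
        role_name=_optional(environ, "ROLE_NAME"),
        role_index=_optional(environ, "ROLE_INDEX"),
        heartbeat_interval=interval,
    )
```

Passing a plain dictionary, for example `load_env({"INFERENCE_ENGINE": "vllm"})`, makes the loader easy to exercise in tests without touching the process environment.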

Usage with RoleBasedGroup

Patio is typically deployed as a sidecar container within a RoleBasedGroup role. Use the ClusterEngineRuntimeProfile CRD (defined in RBG) to inject Patio automatically:

apiVersion: workloads.x-k8s.io/v1alpha2
kind: ClusterEngineRuntimeProfile
metadata:
  name: patio-runtime
spec:
  containers:
    - name: patio
      image: rolebasedgroup/rbgs-patio-runtime:latest
      ports:
        - containerPort: 9091
          name: patio
      env:
        - name: INFERENCE_ENGINE
          value: "sglang"
        - name: INFERENCE_ENGINE_ENDPOINT
          value: "http://localhost:8000"
        - name: TOPO_TYPE
          value: "SGLang"
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
---
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
  name: my-inference
spec:
  roles:
    - name: inference
      replicas: 2
      engineRuntimes:
        - profileName: patio-runtime
      standalonePattern:
        template:
          spec:
            containers:
              - name: sglang
                image: lmsysorg/sglang:latest
                ports:
                  - containerPort: 8000

API Reference

Server Endpoints

Method	Path	Description
`GET`	`/`	Server status
`GET`	`/health`	Liveness probe
`GET`	`/metrics`	Prometheus metrics
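
A script that starts Patio locally may want to wait for `/health` before sending other requests. The sketch below polls the endpoint with the standard library; the function names are hypothetical, and it treats any status from 200 to 399 as healthy, which matches how Kubernetes HTTP probes judge success. The response body of `/health` is not documented here, so it is ignored.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def wait_until_healthy(url: str = "http://localhost:9091/health",
                       attempts: int = 30, delay: float = 1.0,
                       probe: Callable[[str], int] = http_status) -> bool:
    """Poll `url` until it reports success or the attempts run out."""
    for attempt in range(attempts):
        if 200 <= probe(url) < 400:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The `probe` parameter exists so the polling logic can be tested with a stub instead of a live server.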

LoRA Endpoints

Method	Path	Description
`POST`	`/load_lora_adapter`	Load a LoRA adapter
`POST`	`/unload_lora_adapter`	Unload a LoRA adapter

Request Models

LoadLoraAdapterRequest:

{
  "lora_name": "string",
  "lora_path": "string"
}

UnLoadLoraAdapterRequest:

{
  "lora_name": "string"
}
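
The two request shapes can be mirrored in Python as follows. FastAPI projects usually declare such models with Pydantic, and the actual definitions presumably live in api/protocol.py; this sketch uses standard-library dataclasses instead so it runs without dependencies, and the `parse_request` helper is hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadLoraAdapterRequest:
    lora_name: str
    lora_path: str


@dataclass(frozen=True)
class UnLoadLoraAdapterRequest:
    lora_name: str


def parse_request(model, body: str):
    """Parse a JSON body into `model`, rejecting missing or non-string fields."""
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("request body must be a JSON object")
    fields = model.__dataclass_fields__
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    for name in fields:
        if not isinstance(data[name], str):
            raise ValueError(f"{name} must be a string")
    # Unknown keys are ignored here; whether Patio rejects them is not specified.
    return model(**{name: data[name] for name in fields})
```

For example, `parse_request(UnLoadLoraAdapterRequest, '{"lora_name": "my-adapter"}')` returns an instance whose `lora_name` is `"my-adapter"`, while a load request without `lora_path` raises `ValueError`.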

Project Structure

inference-engine-runtime/
├── app.py                  # Main entry point (FastAPI + uvicorn)
├── config.py               # Configuration constants
├── envs.py                 # Environment variable parsing
├── logger.py               # Logging configuration
├── api/                    # HTTP API layer
│   ├── server_router.py    # Server endpoints
│   ├── lora_router.py      # LoRA endpoints
│   └── protocol.py         # Request/response models
├── engine/                 # Inference engine abstraction
│   ├── base.py             # Base engine interface
│   ├── sglang_engine.py    # SGLang implementation
│   └── vllm_engine.py      # vLLM implementation
├── metrics/                # Prometheus metrics
│   ├── metrics.py          # Built-in metrics
│   ├── engine_collector.py # Engine metrics scraper
│   └── standard_rules.py   # Metric normalization
├── topo/                   # Topology management
│   ├── factory.py          # Client/server factories
│   ├── client/             # Topology clients
│   └── server/             # Topology servers
├── tests/                  # Unit and E2E tests
└── doc/                    # Additional documentation

Development

Running Tests

# Run all unit tests
python -m pytest tests/mock_tests -v

# Run specific test file
python -m pytest tests/mock_tests/test_sglang_engine.py -v

# Run with coverage
python -m pytest tests/mock_tests --cov=patio --cov-report=term-missing

Building Docker Image

docker build -t inference-engine-runtime:latest .

License

Apache License 2.0. See LICENSE.

Acknowledgments

Patio was originally developed as part of the RoleBasedGroup project and has been extracted into a standalone project for independent development and deployment.
