AI practice

A specialist AI practice for regulated workloads.

Seven aspects, one operating discipline. From strategy and readiness through AI platforms, LLM inference, retrieval and agentic patterns, MLOps, and AI security and governance — delivered as engineering work, on the platforms your auditors already trust.

Discuss an AI engagement Explore the practice

01 — Overview

AI as engineering work, not slideware.

Many regulated organisations have launched AI programmes that produced demos that never reached production — or worse, reached production without an audit posture. We build AI capability that ships and survives the audit cycle, on the platforms your security and operations teams already trust.

Strategy & readiness

Use-case discovery, build-vs-buy, ROI modelling, AI governance setup, pilot design.

AI platforms

OpenShift AI, NVIDIA AI Enterprise, Kubeflow, GPU operator, multi-tenancy, model registry.

LLM inference

vLLM, Triton, NVIDIA NIM, TGI, Ollama. Quantisation, batching, KV-cache, sizing.

RAG & agentic AI

Vector stores, retrieval evaluation, MCP, multi-agent patterns, guardrails.

MLOps & lifecycle

Feature stores, training pipelines, model registry, continuous evaluation, drift.

Security & governance

Model-risk management, identity-bound AI, audit trails, regulator readiness, AI red teaming.

Engagement archetypes

Engagement type	Typical scope	Duration
AI readiness & strategy	Use-case discovery, build-vs-buy mapping, ROI modelling, AI governance design, pilot definition	4–8 weeks
AI platform stand-up	OpenShift AI or NVIDIA AI Enterprise deployment, tenancy model, GPU operator, model registry, dashboards	8–14 weeks
LLM inference platform	vLLM/Triton/NIM deployment, model serving, quantisation, sizing, multi-model routing, observability	6–10 weeks
RAG application build	Vector store, embedding pipelines, retrieval evaluation, re-ranking, integration with content sources behind identity	10–16 weeks
Agentic application build	MCP servers, multi-agent design, tool wiring, guardrails, evaluation harness, audit trail	12–18 weeks
MLOps practice bring-up	Pipeline, registry, evaluation, monitoring, drift detection, A/B testing, rollback path	8–14 weeks
AI governance & assurance	Model-risk framework, regulator mapping (EU AI Act, NIST AI RMF), red-teaming programme, audit-evidence capture	6–10 weeks

What makes us different

Platform-anchored AI. We deploy AI workloads on the same OpenShift fleets your other regulated workloads already run on — with the same identity boundary, the same supply chain, the same RHACS posture.
Audit posture by design. Every model call and every tool call is traceable to an authenticated identity and captured in an audit trail. Production agentic features survive internal audit because the audit was designed into the build.
Build vs buy, honestly. When a third-party API solves the problem cleanly, we say so. When the answer requires self-hosted inference for data-sovereignty reasons, we build that. We don't pitch every customer the same stack.
Documented handover. The model registry, the evaluation harness, the prompt repository, the runbook set — all owned by your team at the end of the engagement.

03 — AI Platforms

AI workloads on the platform your auditors trust.

We deploy AI platforms on the same OpenShift fleets that host your other regulated workloads — with the same identity boundary, the same GitOps delivery path, and the same security posture. AI is a workload class, not a separate organisation.

Platform options we operate

Platform	Strengths	When we choose it
OpenShift AI	Integrates with existing OpenShift fleets, GitOps-native, multi-tenant by design, vendor-supported, on-prem and disconnected operation	Default for organisations already on OpenShift; required for fully-disconnected and sovereign-data deployments
NVIDIA AI Enterprise	Curated NIM catalog of pre-optimised inference models, vendor-supported, layered cleanly on OpenShift	When the inference catalog or NVIDIA AI Workbench tooling earns its licence cost in your context
Kubeflow	Open-source, pipeline-centric, strong notebook integration, mature multi-tenant model	Training-heavy practices, organisations preferring an open-source-anchored stack
Cloud-managed alternatives	AWS SageMaker / Bedrock, Azure ML / OpenAI on Azure, Google Vertex AI — managed, integrated with hyperscaler IAM	When data-sovereignty allows it and the operating overhead of self-hosting is not justified

GPU operations

NVIDIA GPU Operator — driver, CUDA runtime, container toolkit, MIG configuration, monitoring agent — all managed.
Node Feature Discovery — nodes labelled by GPU model, memory, NVLink topology so workloads schedule onto the right hardware.
MIG partitioning — carve A100 / H100 GPUs into independent slices for smaller workloads.
Time-slicing & sharing — multiple workloads on a single GPU where MIG is not viable.
Scheduling discipline — gang scheduling for distributed training (Volcano, Kueue), priority and pre-emption policies for inference workloads.
Cost & utilisation reporting — per-tenant GPU consumption tied to your FinOps reporting.

Multi-tenancy and platform governance

Tenant boundaries. Namespace-per-tenant with network policy and quota; multi-cluster separation where regulator or risk profile requires.
Model registry. OpenShift AI registry, MLflow, or a custom registry on top of your existing artifact store — with version-controlled lineage, evaluation metadata, and signing.
Notebook tenancy. Per-user or per-team notebook servers with bounded resources, RBAC tied to your identity provider.
Inference tenancy. Multi-model serving with per-tenant rate limits, isolation by GPU slice or by node, observability per tenant.
Cost attribution. Every GPU-second attributable to a tenant, surfaced in FinOps.

What you get at the end of an AI platform engagement

A working AI platform (OpenShift AI, NVIDIA AI Enterprise, or Kubeflow) deployed and integrated with your fleet
GPU operator, NFD, MIG configuration, and scheduling discipline in place
A multi-tenant operating model tied to your identity provider and your FinOps reporting
A model registry with version control, lineage, and signing
Notebook and inference tenancy with documented quotas, RBAC, and isolation
Runbooks for the failure modes that matter: driver mismatches, GPU pre-emption, registry corruption, capacity exhaustion

04 — LLM Inference

Serving LLMs at the throughput, latency, and cost the workload deserves.

Most inference deployments are either dramatically under-utilised or struggling at their throughput ceiling. The difference between the two is engineering — inference stack selection, quantisation discipline, batching strategy, and scheduling rigour. We do that engineering.

Inference stack selection

Stack	Strengths	Typical fit
vLLM	State-of-the-art throughput via PagedAttention and continuous batching, broad model support, OpenAI-compatible API	Default for high-throughput open-weights LLM serving
NVIDIA NIM	Pre-optimised, vendor-supported inference microservices with NVIDIA-tuned models out of the box	When the curated catalog and vendor support earn their licence cost
NVIDIA Triton	Multi-framework (TensorRT-LLM, PyTorch, ONNX), strong batching scheduler, broad model-format support	Mixed-framework deployments; classical-ML and LLM co-tenancy on the same server
Text Generation Inference (TGI)	Hugging-Face native, strong streaming, decent throughput for medium-scale	Hugging-Face-anchored organisations, medium-scale deployments
Ollama	Operationally simple, single-node, GGUF-native, good for prototyping and air-gapped experimentation	Pilots, developer environments, air-gapped sandbox
llama.cpp / GGUF	CPU-friendly, broad hardware support, useful where GPU access is constrained	Edge inference, CPU-only environments, smaller-context workloads

Quantisation, batching, KV cache

Quantisation. GPTQ, AWQ, FP8, INT4, GGUF — selected by model family, deployment target, and acceptable quality loss. Benchmarked against full-precision baseline on your evaluation set, not on generic benchmarks.
Continuous batching. vLLM / Triton continuous batching tuned for your request-arrival pattern. Static batching used only where latency requirements force it.
KV-cache management. PagedAttention configuration, prefix caching for repeated context, cache eviction policy tied to your priority model.
Speculative decoding. Where the model and traffic shape allow, draft-model speculation for latency-bound paths.
Tensor / pipeline parallelism. Multi-GPU serving for models that don't fit on a single device, with documented latency and throughput trade-offs.

Sizing and capacity planning

We size inference clusters the way we size any platform — empirically, against traffic shaped to look like yours, with documented headroom for failure modes:

Workload shaping. Token-in / token-out distribution, request arrival pattern, latency expectation per route.
Throughput benchmarking on your hardware, at your context length, with your quantisation choice.
Failure-mode planning. GPU failure, node failure, model-server crash, traffic spike. Documented capacity headroom for each.
Cost-per-token modelling for build-vs-buy decisions and for transparent FinOps reporting.

Multi-model routing & OpenAI-compatible APIs

Production AI applications rarely depend on a single model. We design and operate the routing layer that maps application calls to the right backend — the right model, the right region, the right tenant:

OpenAI-compatible API gateway in front of vLLM / NIM / Triton (LiteLLM-style or custom).
Per-tenant rate limits, quota, fallback chains, retry semantics.
Observability: tokens-in, tokens-out, latency, error rate, cost, per route and per tenant.
Routing rules that can A/B test models without application changes.

What you get at the end of an LLM inference engagement

An inference platform (vLLM, NIM, Triton, or mix) deployed and tuned for your workload mix
Quantisation, batching, and KV-cache configuration justified against your evaluation set
Documented sizing for current traffic with headroom and failure-mode capacity
An OpenAI-compatible API gateway with per-tenant quotas and observability
Throughput, latency, and cost-per-token dashboards tied to your FinOps reporting
Runbooks for failure modes: model crash, GPU loss, traffic spike, version rollback

07 — AI Security & Governance

AI workloads that survive internal audit and the next regulator visit.

AI workloads expand the attack surface in ways traditional appsec doesn't fully cover — prompt injection, training-data poisoning, tool-misuse via agents, opaque decision paths, model-output leakage. We design the controls that close those gaps and produce the evidence auditors will accept.

Model-risk management

Model classification. Each model categorised by impact (advisory vs decisioning, internal vs customer-facing, regulated vs non-regulated) with the corresponding review and approval path.
Use-case approval gate. Documented sign-off path before any model reaches production — security, audit, business owner, model-risk function.
Periodic re-review. Annual or triggered re-evaluation of production models against drift, regulatory change, and incident learning.
Model inventory. A single, authoritative list of every model in production, with owner, classification, last review date, and live metrics.

AI-specific threats

Threat	Mitigation
Prompt injection	Input sanitisation, system-prompt isolation from untrusted content, tool-call confirmation for high-impact actions, output filtering, monitoring for jailbreak signatures
Indirect / cross-domain injection	Treat retrieved content as untrusted; never grant tool execution authority from retrieved instructions; explicit tool-grant policies per MCP server
Data leakage via output	Output filtering for PII, secrets, regulated content; retrieval-source classification policies; audit log of every model output
Training-data poisoning	Training-data provenance and lineage, signed dataset versions, anomaly detection on training-data updates, restricted ingest sources
Tool misuse via agents	MCP-layer authorisation, step budgets, human-in-the-loop gates for high-impact actions, audit trail per tool call
Membership inference / model extraction	Rate limiting, anomaly detection on query patterns, output randomisation where applicable, differential privacy in training where the use case allows
Hallucination on regulated content	Mandatory citations, RAG-grounded answers, post-generation verification against retrieved sources, human review for high-impact outputs

AI red teaming

Programme structure. Where AI red teaming sits in the SDLC, who runs it (in-house team or external), what findings are blocking for production rollout.
Adversarial-prompt libraries. Maintained over time, with attempted attacks mapped to mitigations.
Automated harness. Re-runs the red-team library against every prompt or tool change, blocking regressions.
Manual deep dives. Quarterly hands-on red-team exercises against production-shaped systems, with findings landing in the engineering backlog.

Regulatory and assurance posture

Framework	Where it bites
EU AI Act	Risk classification, conformity assessments, transparency obligations for high-risk systems, prohibited-use clarity
NIST AI RMF	Govern / map / measure / manage functions, profile-driven control selection
ISO/IEC 42001	AI management system standard, audit-ready certification path
Sector regulators	Central-bank IT guidance on AI-assisted decisions, insurance-regulator guidance on AI underwriting and claims, telecom regulator guidance on AI in customer-facing channels
Privacy regimes	GDPR Article 22 (automated decisions), training-data lineage, data-subject rights against model-derived attributes

Audit-evidence capture

Every model call captured: input, output, model version, retrieval context, latency, cost, user identity.
Every tool call captured: tool, input, output, calling agent identity, downstream effects.
Evaluation runs versioned alongside the model, signed, retained per regulatory requirement.
Model-risk decisions captured in the model registry: classification, approval, conditions of use, expiration.
Incidents captured in the same incident system as the rest of the platform — AI is not a separate failure domain for audit.

What you get at the end of an AI governance engagement

A model-risk management framework mapped to your regulatory profile
An authoritative model inventory with classification, owner, and review cadence
AI-specific threat-mitigation controls deployed and tested
An AI red-teaming programme with library, harness, and the first round of exercises completed
Audit-evidence capture wired through model calls, tool calls, and evaluation runs
A residual-risk register accepted by your audit and model-risk functions

Start an AI engagement

Have an AI programme that needs to survive audit?

Send us a short note describing the use case and the regulatory context. We'll write back with a concrete first-two-weeks scope and a definition of done for the engagement.

A specialist AI practice for regulated workloads.

AI as engineering work, not slideware.

Engagement archetypes

What makes us different

Decide what to build — and what not to.

Use-case discovery

AI governance setup

Pilot design

What you get at the end of a strategy & readiness engagement

AI workloads on the platform your auditors trust.

Platform options we operate

GPU operations

Multi-tenancy and platform governance

What you get at the end of an AI platform engagement

Serving LLMs at the throughput, latency, and cost the workload deserves.

Inference stack selection

Quantisation, batching, KV cache

Sizing and capacity planning

Multi-model routing & OpenAI-compatible APIs

What you get at the end of an LLM inference engagement

Retrieval-grounded, identity-bound, evaluated end-to-end.

RAG architecture

Retrieval evaluation

Agentic AI

Identity-bound AI

What you get at the end of a RAG or agentic engagement

Models in production behave nothing like models on a laptop.

Data & feature pipelines

Training & experimentation

Model registry & deployment

Monitoring & drift

What you get at the end of an MLOps engagement