A specialist AI practice for regulated workloads.
Seven aspects, one operating discipline. From strategy and readiness through AI platforms, LLM inference, retrieval and agentic patterns, MLOps, and AI security and governance — delivered as engineering work, on the platforms your auditors already trust.
AI as engineering work, not slideware.
Many regulated organisations have launched AI programmes whose demos never reached production or, worse, reached production without an audit posture. We build AI capability that ships and survives the audit cycle, on the platforms your security and operations teams already trust.
Strategy & readiness
Use-case discovery, build-vs-buy, ROI modelling, AI governance setup, pilot design.
AI platforms
OpenShift AI, NVIDIA AI Enterprise, Kubeflow, GPU operator, multi-tenancy, model registry.
LLM inference
vLLM, Triton, NVIDIA NIM, TGI, Ollama. Quantisation, batching, KV-cache, sizing.
RAG & agentic AI
Vector stores, retrieval evaluation, MCP, multi-agent patterns, guardrails.
MLOps & lifecycle
Feature stores, training pipelines, model registry, continuous evaluation, drift.
Security & governance
Model-risk management, identity-bound AI, audit trails, regulator readiness, AI red teaming.
Engagement archetypes
| Engagement type | Typical scope | Duration |
|---|---|---|
| AI readiness & strategy | Use-case discovery, build-vs-buy mapping, ROI modelling, AI governance design, pilot definition | 4–8 weeks |
| AI platform stand-up | OpenShift AI or NVIDIA AI Enterprise deployment, tenancy model, GPU operator, model registry, dashboards | 8–14 weeks |
| LLM inference platform | vLLM/Triton/NIM deployment, model serving, quantisation, sizing, multi-model routing, observability | 6–10 weeks |
| RAG application build | Vector store, embedding pipelines, retrieval evaluation, re-ranking, integration with content sources behind identity | 10–16 weeks |
| Agentic application build | MCP servers, multi-agent design, tool wiring, guardrails, evaluation harness, audit trail | 12–18 weeks |
| MLOps practice bring-up | Pipeline, registry, evaluation, monitoring, drift detection, A/B testing, rollback path | 8–14 weeks |
| AI governance & assurance | Model-risk framework, regulator mapping (EU AI Act, NIST AI RMF), red-teaming programme, audit-evidence capture | 6–10 weeks |
What makes us different
- Platform-anchored AI. We deploy AI workloads on the same OpenShift fleets your other regulated workloads already run on — with the same identity boundary, the same supply chain, the same RHACS posture.
- Audit posture by design. Every model call and every tool call is traceable to an authenticated identity and captured in an audit trail. Production agentic features survive internal audit because the audit was designed into the build.
- Build vs buy, honestly. When a third-party API solves the problem cleanly, we say so. When the answer requires self-hosted inference for data-sovereignty reasons, we build that. We don't pitch every customer the same stack.
- Documented handover. The model registry, the evaluation harness, the prompt repository, the runbook set — all owned by your team at the end of the engagement.
Decide what to build — and what not to.
Most regulated organisations have more AI use cases than budget. The hard work is not generating ideas; it's picking the ones with a real ROI, an implementable data model, and an audit posture that survives. We do that work before any model serves its first token.
Use-case discovery
A structured discovery workshop programme that produces a ranked, scoped backlog — not a brainstorm wall:
- Business-process inventory. Where is human time spent today on bounded, knowledge-intensive work? That's the surface where AI usually earns its keep.
- Data-readiness probe. Does the data exist? Is it accessible? Is the lineage clean enough for audit? Many "obvious" use cases die here — better here than after engineering invests.
- Regulatory probe. What does your regulator say about AI-augmented decisions in this workflow? Some use cases are off-limits; some need specific documentation; some are wide open.
- Build-vs-buy mapping. Does a SaaS API solve this cleanly? Does data-sovereignty require self-hosting? Is the workflow generic enough to commodify?
- ROI model. Conservative, risk-adjusted, and explicit about time-to-value, not best-case slideware.
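As a rough illustration of what "risk-adjusted" can mean in an ROI model, the sketch below discounts the expected benefit by a probability of success and by the months that pass before any value lands. The function name, parameters, and the 36-month horizon are illustrative assumptions, not a prescribed model.

```python
def risk_adjusted_roi(annual_benefit, annual_run_cost, build_cost,
                      success_probability, months_to_value, horizon_months=36):
    """Return ROI over a fixed horizon as a ratio (0.5 means +50%).

    All inputs are hypothetical planning estimates. Benefit accrues only
    after months_to_value and is scaled by success_probability; costs are
    counted in full, because they are incurred whether or not the use case works.
    """
    if not 0.0 <= success_probability <= 1.0:
        raise ValueError("success_probability must be between 0 and 1")
    if horizon_months <= 0:
        raise ValueError("horizon_months must be positive")

    productive_months = max(horizon_months - months_to_value, 0)
    expected_benefit = annual_benefit / 12 * productive_months * success_probability
    total_cost = build_cost + annual_run_cost / 12 * horizon_months
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (expected_benefit - total_cost) / total_cost
```

Ranking candidates by this figure rather than by headline benefit tends to push slow or uncertain use cases down the backlog.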
AI governance setup
Before any model reaches production, the governance model needs to exist. We help stand up:
- Model-risk policy. What classes of model are allowed, what review path each requires, who signs off on production rollout.
- Data-governance policy. What data can train what kind of model, what consent posture is required, how training-data lineage is captured.
- Use-case approval workflow. A documented gate — with the security, audit, and business stakeholders explicitly involved — that a use case must clear before engineering invests.
- Vendor-AI policy. When the AI is a third-party SaaS: what data may leave the organisation, under what contractual posture, with what auditability requirements.
- AI red-teaming charter. Where AI red-teaming sits in the SDLC, who runs it, what findings are blocking.
Pilot design
Pilots that produce decisions, not demos. We design pilots to test the riskiest assumption in the use case — usually data quality or model behaviour at the edges — not the showcase-friendly path:
- A bounded scope (single workflow, single user cohort, single quarter).
- Measurable success criteria defined before the pilot starts.
- An explicit rollback path if the criteria are missed.
- A decision moment at the end — ship, iterate, or kill.
What you get at the end of a strategy & readiness engagement
- A ranked, scoped use-case backlog with build-vs-buy decisions per item
- ROI models for the top 3–5 candidates, conservative and risk-adjusted
- An AI-governance policy set: model risk, data, vendor AI, red teaming
- A use-case approval workflow operating in your existing change-management system
- One or more pilot designs with measurable success criteria and rollback paths
- An AI readiness assessment against your regulatory profile
AI workloads on the platform your auditors trust.
We deploy AI platforms on the same OpenShift fleets that host your other regulated workloads — with the same identity boundary, the same GitOps delivery path, and the same security posture. AI is a workload class, not a separate organisation.
Platform options we operate
| Platform | Strengths | When we choose it |
|---|---|---|
| OpenShift AI | Integrates with existing OpenShift fleets, GitOps-native, multi-tenant by design, vendor-supported, on-prem and disconnected operation | Default for organisations already on OpenShift; required for fully-disconnected and sovereign-data deployments |
| NVIDIA AI Enterprise | Curated NIM catalog of pre-optimised inference models, vendor-supported, layered cleanly on OpenShift | When the inference catalog or NVIDIA AI Workbench tooling earns its licence cost in your context |
| Kubeflow | Open-source, pipeline-centric, strong notebook integration, mature multi-tenant model | Training-heavy practices, organisations preferring an open-source-anchored stack |
| Cloud-managed alternatives | AWS SageMaker / Bedrock, Azure ML / Azure OpenAI, Google Vertex AI: managed, integrated with hyperscaler IAM | When data-sovereignty allows it and the operating overhead of self-hosting is not justified |
GPU operations
- NVIDIA GPU Operator — driver, CUDA runtime, container toolkit, MIG configuration, monitoring agent — all managed.
- Node Feature Discovery — nodes labelled by GPU model, memory, NVLink topology so workloads schedule onto the right hardware.
- MIG partitioning — carve A100 / H100 GPUs into independent slices for smaller workloads.
- Time-slicing & sharing — multiple workloads on a single GPU where MIG is not viable.
- Scheduling discipline — gang scheduling for distributed training (Volcano, Kueue), priority and pre-emption policies for inference workloads.
- Cost & utilisation reporting — per-tenant GPU consumption tied to your FinOps reporting.
Multi-tenancy and platform governance
- Tenant boundaries. Namespace-per-tenant with network policy and quota; multi-cluster separation where regulator or risk profile requires.
- Model registry. OpenShift AI registry, MLflow, or a custom registry on top of your existing artifact store — with version-controlled lineage, evaluation metadata, and signing.
- Notebook tenancy. Per-user or per-team notebook servers with bounded resources, RBAC tied to your identity provider.
- Inference tenancy. Multi-model serving with per-tenant rate limits, isolation by GPU slice or by node, observability per tenant.
- Cost attribution. Every GPU-second attributable to a tenant, surfaced in FinOps.
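Per-tenant cost attribution reduces to summing GPU-seconds weighted by the share of the device each workload held. The sketch below assumes a hypothetical usage-record shape (`GpuUsage`) and a simple fractional weighting for GPU slices; a real deployment would source these records from its own metering pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuUsage:
    """One metering record (hypothetical shape)."""
    tenant: str
    gpu_fraction: float  # 1.0 for a whole GPU, less for a slice of one
    seconds: float


def gpu_seconds_by_tenant(records):
    """Sum fraction-weighted GPU-seconds per tenant."""
    totals = defaultdict(float)
    for record in records:
        totals[record.tenant] += record.gpu_fraction * record.seconds
    return dict(totals)
```

Multiplying each total by an internal rate per GPU-second gives the figure that feeds FinOps reporting.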
What you get at the end of an AI platform engagement
- A working AI platform (OpenShift AI, NVIDIA AI Enterprise, or Kubeflow) deployed and integrated with your fleet
- GPU operator, NFD, MIG configuration, and scheduling discipline in place
- A multi-tenant operating model tied to your identity provider and your FinOps reporting
- A model registry with version control, lineage, and signing
- Notebook and inference tenancy with documented quotas, RBAC, and isolation
- Runbooks for the failure modes that matter: driver mismatches, GPU pre-emption, registry corruption, capacity exhaustion
Serving LLMs at the throughput, latency, and cost the workload deserves.
Most inference deployments are either dramatically under-utilised or struggling at their throughput ceiling. The difference between the two is engineering — inference stack selection, quantisation discipline, batching strategy, and scheduling rigour. We do that engineering.
Inference stack selection
| Stack | Strengths | Typical fit |
|---|---|---|
| vLLM | State-of-the-art throughput via PagedAttention and continuous batching, broad model support, OpenAI-compatible API | Default for high-throughput open-weights LLM serving |
| NVIDIA NIM | Pre-optimised, vendor-supported inference microservices with NVIDIA-tuned models out of the box | When the curated catalog and vendor support earn their licence cost |
| NVIDIA Triton | Multi-framework (TensorRT-LLM, PyTorch, ONNX), strong batching scheduler, broad model-format support | Mixed-framework deployments; classical-ML and LLM co-tenancy on the same server |
| Text Generation Inference (TGI) | Hugging-Face native, strong streaming, decent throughput for medium-scale | Hugging-Face-anchored organisations, medium-scale deployments |
| Ollama | Operationally simple, single-node, GGUF-native, good for prototyping and air-gapped experimentation | Pilots, developer environments, air-gapped sandbox |
| llama.cpp / GGUF | CPU-friendly, broad hardware support, useful where GPU access is constrained | Edge inference, CPU-only environments, smaller-context workloads |
Quantisation, batching, KV cache
- Quantisation. GPTQ, AWQ, FP8, INT4, GGUF — selected by model family, deployment target, and acceptable quality loss. Benchmarked against full-precision baseline on your evaluation set, not on generic benchmarks.
- Continuous batching. Continuous batching in vLLM, or in-flight batching in Triton with the TensorRT-LLM backend, tuned for your request-arrival pattern. Static batching used only where latency requirements force it.
- KV-cache management. PagedAttention configuration, prefix caching for repeated context, cache eviction policy tied to your priority model.
- Speculative decoding. Where the model and traffic shape allow, draft-model speculation for latency-bound paths.
- Tensor / pipeline parallelism. Multi-GPU serving for models that don't fit on a single device, with documented latency and throughput trade-offs.
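KV-cache sizing starts from simple arithmetic: each token stores one key and one value vector per layer per KV head. The sketch below assumes a standard transformer attention layout with an unquantised cache and ignores allocator overhead; the function names and the model shape in the usage note are illustrative.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV cache one token occupies.

    The factor 2 covers the key and the value tensor. bytes_per_element is
    2 for FP16/BF16 and 1 for an 8-bit cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(kv_budget_bytes, num_layers, num_kv_heads, head_dim,
                      bytes_per_element=2):
    """How many tokens fit in the memory left over after model weights."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_element
    )
    return kv_budget_bytes // per_token
```

For an illustrative shape of 32 layers, 32 KV heads, and a head dimension of 128 in FP16, each token costs 512 KiB, so a 20 GiB cache budget holds 40,960 tokens across all concurrent requests. Grouped-query attention lowers `num_kv_heads` and the cost per token with it.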
Sizing and capacity planning
We size inference clusters the way we size any platform — empirically, against traffic shaped to look like yours, with documented headroom for failure modes:
- Workload shaping. Token-in / token-out distribution, request arrival pattern, latency expectation per route.
- Throughput benchmarking on your hardware, at your context length, with your quantisation choice.
- Failure-mode planning. GPU failure, node failure, model-server crash, traffic spike. Documented capacity headroom for each.
- Cost-per-token modelling for build-vs-buy decisions and for transparent FinOps reporting.
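Cost-per-token modelling follows directly from benchmarked throughput. The sketch below is a minimal version under stated assumptions: a flat hourly cost per GPU, a sustained tokens-per-second figure measured on your own hardware, and an average utilisation factor. The names and figures are illustrative.

```python
def cost_per_million_tokens(gpu_hourly_cost, gpus, tokens_per_second, utilisation=1.0):
    """Serving cost per one million generated tokens.

    tokens_per_second is the benchmarked aggregate throughput of the whole
    deployment at full load; utilisation scales it to the share of each hour
    the deployment is actually busy.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0.0 < utilisation <= 1.0:
        raise ValueError("utilisation must be in (0, 1]")

    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_hourly_cost * gpus / tokens_per_hour * 1_000_000
```

Halving utilisation doubles the cost per token, which is why an under-utilised cluster can lose a build-vs-buy comparison that the same cluster would win when busy.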
Multi-model routing & OpenAI-compatible APIs
Production AI applications rarely depend on a single model. We design and operate the routing layer that maps application calls to the right backend — the right model, the right region, the right tenant:
- OpenAI-compatible API gateway in front of vLLM / NIM / Triton (LiteLLM-style or custom).
- Per-tenant rate limits, quota, fallback chains, retry semantics.
- Observability: tokens-in, tokens-out, latency, error rate, cost, per route and per tenant.
- Routing rules that can A/B test models without application changes.
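Two of the routing behaviours above, A/B assignment without application changes and fallback chains, can be sketched in a few lines. This is an illustration under assumptions, not a gateway implementation: the function names are hypothetical, and backends are modelled as plain callables rather than HTTP clients.

```python
import hashlib


def ab_bucket(key, candidate_share):
    """Deterministically assign a routing key to 'candidate' or 'incumbent'.

    Hashing the key (for example tenant plus user ID) keeps each caller on
    the same model for the whole experiment.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "candidate" if bucket < round(candidate_share * 10_000) else "incumbent"


def call_with_fallback(backends, prompt):
    """Try each (name, callable) backend in order; return the first success."""
    failed = []
    for name, backend in backends:
        try:
            return name, backend(prompt)
        except Exception:
            # Broad on purpose in this sketch; a real gateway would only
            # fall back on errors it knows to be retryable.
            failed.append(name)
    raise RuntimeError(f"all backends failed: {failed}")
```

Because the assignment lives in the routing layer, changing `candidate_share` shifts traffic between models while the application keeps calling the same endpoint.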
What you get at the end of an LLM inference engagement
- An inference platform (vLLM, NIM, Triton, or mix) deployed and tuned for your workload mix
- Quantisation, batching, and KV-cache configuration justified against your evaluation set
- Documented sizing for current traffic with headroom and failure-mode capacity
- An OpenAI-compatible API gateway with per-tenant quotas and observability
- Throughput, latency, and cost-per-token dashboards tied to your FinOps reporting
- Runbooks for failure modes: model crash, GPU loss, traffic spike, version rollback
Retrieval-grounded, identity-bound, evaluated end-to-end.
The interesting problems in modern AI applications are rarely in the model itself. They are in the retrieval (what context did the model see?), the tool wiring (what could it do?), and the evaluation (did it actually work for users, or only in the demo?). We design those layers as an engineering discipline.
RAG architecture
- Vector stores. Pinecone, Weaviate, Milvus, Qdrant, pgvector — selected by latency, scale, data-sovereignty, and operating-overhead profile. Self-hosted by default for regulated workloads.
- Embedding pipelines. Source ingestion, chunking strategy (size, overlap, semantic boundaries), embedding model selection, batch vs streaming, re-embedding triggers when documents update.
- Hybrid search. Vector + BM25 keyword fusion for the queries where pure vectors lose — named entities, code identifiers, exact phrases.
- Re-ranking. Cross-encoder re-rank of the top-N candidates from first-stage retrieval, where the latency budget allows.
- Metadata filtering. Pre-filter by tenant, document type, date, classification — cheaper than letting the LLM filter in context.
- Citation discipline. Every model answer references the chunks it used, surfaced to the user, retained for audit.
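One common way to fuse a vector ranking with a BM25 ranking is reciprocal rank fusion (RRF), which needs only the rank positions and not comparable scores. The sketch below is illustrative rather than a description of any particular deployment; the function name and the constant `k=60` (a conventional default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A document that both retrievers rank reasonably well beats one that a single retriever ranks first, which is the behaviour that helps on named entities and exact phrases.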
Retrieval evaluation
Retrieval quality is testable and should be tested before retrieval changes ship:
- A labelled evaluation set built with subject-matter experts, not synthesised by an LLM.
- Metrics that match what users actually need: recall@k, MRR, nDCG — with the trade-offs across them documented for your workload.
- An automated evaluation harness that runs on every retrieval-stack change, blocking regressions.
- Production sampling: a portion of real queries logged (with consent), re-scored, used to refresh the eval set.
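The three metrics named above are short enough to implement directly. The sketch below assumes binary relevance labels and retrieved lists without duplicate IDs; the function names are illustrative.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Share of the relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over an iterable of (retrieved, relevant) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved, relevant, k):
    """nDCG@k with binary gains: DCG of the top k over the ideal DCG."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Recall@k rewards finding everything, MRR rewards putting one good result first, and nDCG rewards good ordering overall, so a change can improve one while degrading another.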
Agentic AI
Agents are not magic. They are LLM calls with tools, loops, and state — and every one of those is a place where production behaviour diverges from demo behaviour. We design agentic applications to be observable, bounded, and rollback-able:
- Pattern selection. Single-agent ReAct, multi-agent collaboration, reflexion, plan-execute-replan — matched to the workflow, not to the latest paper.
- Tool wiring via MCP. Each capability the agent uses is an MCP server — separately deployed, audit-logged, identity-bound. A misbehaving tool gets disabled at the MCP layer without rebuilding the agent.
- Guardrails. Input filtering, output filtering, prompt-injection defences, jailbreak detection — with explicit policies for when an agent escalates to a human.
- Bounded autonomy. Step budgets, cost budgets, and explicit human-in-the-loop gates for high-impact actions.
- Evaluation harness. Scripted scenarios that test agent behaviour on the cases that matter, including adversarial inputs. Runs on every prompt or tool change.
- Rollback path. Every agent run can be reverted — either by undoing the tool calls, or by routing the case back to a fully-human flow with the agent's context preserved for the human reviewer.
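Bounded autonomy can be made concrete with a budget object that every step must charge against, plus an approval gate in front of high-impact actions. The sketch below is a simplified illustration: the class and function names are hypothetical, and steps are modelled as plain callables rather than LLM or MCP calls.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a run would exceed its step or cost budget."""


@dataclass
class RunBudget:
    max_steps: int
    max_cost: float
    steps_used: int = 0
    cost_used: float = 0.0

    def charge(self, cost):
        """Reserve one step and its cost, or refuse before anything runs."""
        if self.steps_used + 1 > self.max_steps:
            raise BudgetExceeded("step budget exhausted")
        if self.cost_used + cost > self.max_cost:
            raise BudgetExceeded("cost budget exhausted")
        self.steps_used += 1
        self.cost_used += cost


def run_agent(steps, budget, high_impact, approve):
    """Execute (name, cost, action) steps under a budget.

    Steps named in high_impact run only if approve(name) returns True;
    otherwise the run stops and the step is recorded as escalated.
    """
    results = []
    for name, cost, action in steps:
        if name in high_impact and not approve(name):
            results.append((name, "escalated"))
            break
        budget.charge(cost)
        results.append((name, action()))
    return results
```

Checking the budget before the action runs, not after, is the design choice that matters: a refused step has no side effects to undo.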
Identity-bound AI
Agents that reach backend systems do so under an authenticated identity that mirrors the human user the agent is acting on behalf of — or under a distinct service identity with the same governance treatment:
- Agent identities provisioned via SCIM from the same directory of record as humans.
- OAuth 2.0 token exchange (RFC 8693) when the agent needs to act on a user's behalf.
- Per-tool authorisation policies enforced at the MCP layer.
- Audit trail correlating: user → agent run → tool call → backend record change.
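The correlation chain above depends on every event carrying the same run identifier and both identities involved. The sketch below shows one possible record shape; the field names, the event kinds, and the identity strings used with it are hypothetical, and a production system would write to an append-only store rather than an in-memory list.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    run_id: str        # shared by every event in one agent run
    kind: str          # "agent_run", "tool_call", or "record_change"
    actor: str         # identity the action executed under
    on_behalf_of: str  # human user the agent is acting for
    detail: dict
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json_line(event):
    """Serialise one event as a single JSON line for the audit log."""
    return json.dumps(asdict(event), sort_keys=True)


def trace(events, run_id):
    """Return, in recorded order, every event belonging to one agent run."""
    return [event for event in events if event.run_id == run_id]
```

Given a backend record change, an auditor can look up its `run_id` and walk back through the tool call to the agent run and the user who initiated it.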
What you get at the end of a RAG or agentic engagement
- A production RAG or agentic application integrated with your content sources, behind your identity provider
- Vector store, embedding pipeline, hybrid search, and re-ranking tuned to your queries
- A retrieval-evaluation harness with a labelled eval set, blocking regressions
- MCP servers for every tool the agent uses, each separately audited
- Guardrails, bounded autonomy, and a documented rollback path
- An audit trail correlating user → agent → tool → backend, accepted by your audit function
Models in production behave nothing like models on a laptop.
The MLOps discipline is the difference between a model that shipped and a model that keeps working. We build pipeline, registry, evaluation, monitoring, and rollback into a coherent practice your team can own.
Data & feature pipelines
- Data platform. Apache Spark, lakehouse table formats (Iceberg, Delta), MinIO or S3 for object storage. Built on the same fleet as the model serving so handoffs are simple.
- Feature stores. Feast or platform-native equivalents, with explicit feature lineage and freshness SLAs.
- Streaming features via Kafka and Spark Streaming for use cases where stale features lose accuracy.
- Data contracts. Schema, ownership, breaking-change policy — explicit between upstream producers and the ML team.
- PII handling. Tokenisation, masking, or full removal at ingestion, with the residual-risk decision documented.
Training & experimentation
- Kubeflow Pipelines or Argo Workflows for reproducible training and evaluation runs.
- Distributed training via Ray, PyTorch DDP / FSDP, DeepSpeed where model and dataset scale demand it.
- Experiment tracking. MLflow, Weights & Biases, or Kubeflow-native — with hyperparameters, code commit, dataset version, and metrics tied together.
- Notebook discipline. Per-user notebooks for exploration, with a documented promotion path to checked-in pipeline code.
- Fine-tuning. LoRA, QLoRA, full fine-tunes — chosen by data volume, target hardware, and licence posture.
Model registry & deployment
- Versioned registry. Every production model has a registry entry with version, hash, training data version, evaluation metrics, and signing.
- Promotion path. Dev → staging → production with documented gates — evaluation thresholds, security scan, governance sign-off.
- Canary and shadow deploys. Route a fraction of traffic to the candidate while the incumbent serves, or mirror live traffic to the candidate in shadow mode while real users see only the incumbent's responses. Both with full metric comparison.
- A/B testing. Statistical-power planning, holdout discipline, decision-by-metric — not by gut feel.
- Rollback. One-command rollback to the prior version, runbook-rehearsed.
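A promotion gate can be expressed as a function that returns the reasons a candidate may not advance, so an empty result means "promote". The sketch below assumes a hypothetical registry-entry shape (a dict with `metrics`, `security_scan_passed`, and `governance_signoff` keys) and treats every metric as higher-is-better.

```python
def promotion_blockers(candidate, thresholds):
    """List the reasons a candidate model cannot be promoted.

    candidate: registry entry as a dict (hypothetical shape).
    thresholds: minimum acceptable value per evaluation metric.
    """
    blockers = []
    metrics = candidate.get("metrics", {})
    for metric, minimum in thresholds.items():
        value = metrics.get(metric)
        if value is None:
            blockers.append(f"{metric}: not evaluated")
        elif value < minimum:
            blockers.append(f"{metric}: {value} below threshold {minimum}")
    if not candidate.get("security_scan_passed", False):
        blockers.append("security scan not passed")
    if not candidate.get("governance_signoff", False):
        blockers.append("governance sign-off missing")
    return blockers
```

Returning every blocker, not only the first, gives the model owner one complete list to work through before the next promotion attempt.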
Monitoring & drift
- Operational metrics. Latency, error rate, throughput, GPU utilisation, cost-per-call — in the same dashboards as the rest of the platform.
- Quality metrics. Live evaluation against a held-out set, ongoing labelled-data collection from production sampling, business-outcome metrics where available.
- Data drift. Distribution shift in input features, statistically tested with documented thresholds.
- Concept drift. Shift in the relationship between input features and the target, detected via labelled feedback or downstream signal.
- Cost drift. Cost-per-call trends with alerts on unexpected growth — the metric that most often catches misuse before the bill arrives.
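For a single numeric feature, one option for a statistical data-drift test is the two-sample Kolmogorov-Smirnov test, which compares the empirical distributions of a reference window and a live window. The sketch below uses only the standard library and the large-sample critical value; the function names and the default `alpha` are assumptions, and the approximation is weaker for small samples.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, live):
    """Largest gap between the two empirical cumulative distributions."""
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    for x in ref + cur:
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def has_drifted(reference, live, alpha=0.05):
    """True if the KS statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(live)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, live) > critical
```

Running a test like this across many features inflates the false-alarm rate, so the documented thresholds should account for the number of features monitored.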
What you get at the end of an MLOps engagement
- Reproducible training and evaluation pipelines in Kubeflow or Argo
- A versioned model registry with lineage, evaluation metadata, and signing
- A documented promotion path from dev to production with explicit gates
- Canary, shadow, and A/B-testing patterns running on real traffic
- Operational, quality, and drift monitoring tied to your incident-response loop
- Rollback runbooks rehearsed for every model in production
AI workloads that survive internal audit and the next regulator visit.
AI workloads expand the attack surface in ways traditional appsec doesn't fully cover — prompt injection, training-data poisoning, tool-misuse via agents, opaque decision paths, model-output leakage. We design the controls that close those gaps and produce the evidence auditors will accept.
Model-risk management
- Model classification. Each model categorised by impact (advisory vs decisioning, internal vs customer-facing, regulated vs non-regulated) with the corresponding review and approval path.
- Use-case approval gate. Documented sign-off path before any model reaches production — security, audit, business owner, model-risk function.
- Periodic re-review. Annual or triggered re-evaluation of production models against drift, regulatory change, and incident learning.
- Model inventory. A single, authoritative list of every model in production, with owner, classification, last review date, and live metrics.
AI-specific threats
| Threat | Mitigation |
|---|---|
| Prompt injection | Input sanitisation, system-prompt isolation from untrusted content, tool-call confirmation for high-impact actions, output filtering, monitoring for jailbreak signatures |
| Indirect / cross-domain injection | Treat retrieved content as untrusted; never grant tool execution authority from retrieved instructions; explicit tool-grant policies per MCP server |
| Data leakage via output | Output filtering for PII, secrets, regulated content; retrieval-source classification policies; audit log of every model output |
| Training-data poisoning | Training-data provenance and lineage, signed dataset versions, anomaly detection on training-data updates, restricted ingest sources |
| Tool misuse via agents | MCP-layer authorisation, step budgets, human-in-the-loop gates for high-impact actions, audit trail per tool call |
| Membership inference / model extraction | Rate limiting, anomaly detection on query patterns, output randomisation where applicable, differential privacy in training where the use case allows |
| Hallucination on regulated content | Mandatory citations, RAG-grounded answers, post-generation verification against retrieved sources, human review for high-impact outputs |
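As one small example of the output-filtering mitigations in the table, the sketch below redacts matches for a set of patterns and reports what it found so the event can be audit-logged. The two patterns are illustrative only (the `sk-` key format in particular is hypothetical); pattern matching alone is not a complete PII or secrets control.

```python
import re

# Illustrative patterns only; a real filter needs a maintained, tested set.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),  # hypothetical key format
}


def redact(text):
    """Replace pattern matches with placeholders.

    Returns the redacted text and a list of (label, match_count) findings
    for the audit log.
    """
    findings = []
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[{label} REDACTED]", text)
        if count:
            findings.append((label, count))
    return text, findings
```

Logging the findings alongside the model call makes each redaction visible to reviewers without storing the sensitive value itself.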
AI red teaming
- Programme structure. Where AI red teaming sits in the SDLC, who runs it (in-house team or external), what findings are blocking for production rollout.
- Adversarial-prompt libraries. Maintained over time, with attempted attacks mapped to mitigations.
- Automated harness. Re-runs the red-team library against every prompt or tool change, blocking regressions.
- Manual deep dives. Quarterly hands-on red-team exercises against production-shaped systems, with findings landing in the engineering backlog.
Regulatory and assurance posture
| Framework | Where it bites |
|---|---|
| EU AI Act | Risk classification, conformity assessments, transparency obligations for high-risk systems, prohibited-use clarity |
| NIST AI RMF | Govern / map / measure / manage functions, profile-driven control selection |
| ISO/IEC 42001 | AI management system standard, audit-ready certification path |
| Sector regulators | Central-bank IT guidance on AI-assisted decisions, insurance-regulator guidance on AI underwriting and claims, telecom regulator guidance on AI in customer-facing channels |
| Privacy regimes | GDPR Article 22 (automated decisions), training-data lineage, data-subject rights against model-derived attributes |
Audit-evidence capture
- Every model call captured: input, output, model version, retrieval context, latency, cost, user identity.
- Every tool call captured: tool, input, output, calling agent identity, downstream effects.
- Evaluation runs versioned alongside the model, signed, retained per regulatory requirement.
- Model-risk decisions captured in the model registry: classification, approval, conditions of use, expiration.
- Incidents captured in the same incident system as the rest of the platform — AI is not a separate failure domain for audit.
What you get at the end of an AI governance engagement
- A model-risk management framework mapped to your regulatory profile
- An authoritative model inventory with classification, owner, and review cadence
- AI-specific threat-mitigation controls deployed and tested
- An AI red-teaming programme with library, harness, and the first round of exercises completed
- Audit-evidence capture wired through model calls, tool calls, and evaluation runs
- A residual-risk register accepted by your audit and model-risk functions
Have an AI programme that needs to survive audit?
Send us a short note describing the use case and the regulatory context. We'll write back with a concrete first-two-weeks scope and a definition of done for the engagement.