A specialist data practice for regulated workloads.
Seven aspects, one operating discipline. From data strategy and architecture through lakehouse platforms, real-time streaming, governance and quality, analytics and BI, and data-for-AI — built on the same fleet, same identity, and same audit posture as the rest of your platform.
Data as a platform — not as a project.
Most data programmes deliver dashboards. The interesting work happens earlier — in the architecture that lets every team get the data it needs without re-platforming, in the governance that survives the next regulator visit, and in the operational discipline that keeps quality from quietly decaying. We engineer that work.
Strategy & architecture
Reference architectures (centralised, federated, data-mesh), data-product thinking, build-vs-buy.
Data platforms
Lakehouse (Iceberg/Delta), warehouses (Snowflake/BigQuery/Redshift), lakes (S3/MinIO), Spark, query engines.
Streaming & real-time
Apache Kafka, Flink, Spark Streaming, change-data-capture, event-driven architectures.
Governance & quality
Catalog, lineage, MDM, data contracts, data-quality testing, privacy controls.
Analytics & BI
Semantic layer (dbt, Cube), BI tooling (Looker, Power BI, Tableau, Superset), self-service governance.
Data for AI
Feature stores, training pipelines, ML-ready data, retrieval grounding, audit-grade lineage.
Engagement archetypes
| Engagement type | Typical scope | Duration |
|---|---|---|
| Data strategy & architecture | Current-state assessment, target architecture, build-vs-buy, data-product backlog, roadmap | 4–8 weeks |
| Lakehouse platform stand-up | Object storage, table format (Iceberg/Delta), query engine, Spark, identity, governance, GitOps delivery | 10–16 weeks |
| Streaming platform stand-up | Kafka cluster (or managed), schema registry, streaming jobs, CDC integration, dead-letter handling, observability | 8–14 weeks |
| Data governance bring-up | Catalog, lineage, ownership model, data contracts, quality testing, privacy controls, audit evidence | 8–12 weeks |
| Analytics & BI modernisation | Semantic layer via dbt, BI rollout, self-service governance, deprecation of legacy reports | 10–16 weeks |
| Data-for-AI engagement | Feature store, training pipelines, retrieval/embedding pipelines, lineage for AI workloads | 8–14 weeks |
| Master Data Management (MDM) | Golden-record design, matching/merging, source-system reconciliation, stewardship workflow | 12–20 weeks |
What makes us different
- Platform-anchored data. Data platforms run on the same OpenShift fleets as the rest of your regulated workloads, with the same identity boundary, security posture, and observability stack.
- Data products, not data swamps. Every dataset has an owner, a contract, an SLA, and a documented lineage. Untyped, undocumented, unowned pipelines are a defect.
- Audit posture by default. Schema changes, data movement, access events, and quality results are all captured as evidence at the source, not reconstructed for the next audit.
- AI-ready by design. The data layer is built so that AI workloads ride on the same governed surface as analytics — not on shadow pipelines built by ML engineers in haste.
Decide the shape of the data estate before you platform it.
Most "data platform" projects fail because the architecture was decided implicitly, in the gap between business strategy and engineering capacity. We make that architecture explicit, debated, and decided — before any platform engineering starts.
Architecture patterns we use
| Pattern | Strengths | When we choose it |
|---|---|---|
| Centralised lakehouse | Single source of truth, simple governance, lower operating overhead | Small-to-medium estates, organisations with a central data team, regulated environments where governance simplicity matters |
| Hub-and-spoke (federated) | Central governance and shared dimensions; domain teams own their data products | Mid-sized organisations with multiple business units and a reasonably strong central platform team |
| Data mesh | Domain teams fully own data products; central platform team provides the substrate; governance is federated | Large, multi-domain organisations where centralising the data team becomes the bottleneck |
| Operational data store (ODS) layer | Cleaned, integrated operational data near the source systems | Mainframe-heavy estates, banks, insurance carriers with critical batch cycles |
| Hybrid | Different patterns for different domains in the same enterprise | Most realistic large-enterprise deployments |
Data-product thinking
We treat every analytical dataset as a data product — with an owner, a contract, an SLA, and a lifecycle:
- Owner. A named team (not a person) accountable for the dataset's accuracy, freshness, and availability.
- Contract. Schema, semantics, allowed values, freshness expectation, breaking-change policy — defined and reviewed before consumers depend on it.
- SLA. Documented freshness, availability, and quality targets.
- Discoverability. Surfaced in the catalog with description, lineage, and example queries.
- Versioning. Breaking changes go through a deprecation path; consumers get notice.
- Deprecation. Datasets that no consumer reads are deprecated, archived, deleted — on a schedule, not a whim.
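The data-product attributes above can be expressed as a small, checkable descriptor. The following is a minimal sketch only: the `DataProduct` class, its field names, and the `registration_errors` helper are hypothetical and do not correspond to any particular catalog's API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical catalog entry for one data product."""
    name: str
    owner_team: str            # a named team, not a person
    schema: dict               # column name -> type name (the contract)
    freshness_sla: timedelta   # maximum acceptable data age
    description: str = ""      # surfaced in the catalog for discoverability
    version: str = "1.0.0"     # bumped on breaking changes


def registration_errors(product: DataProduct) -> list:
    """Return the reasons this product would be refused at registration."""
    errors = []
    if not product.owner_team.strip():
        errors.append("missing owner team")
    if not product.schema:
        errors.append("missing schema contract")
    if product.freshness_sla <= timedelta(0):
        errors.append("freshness SLA must be positive")
    if not product.description.strip():
        errors.append("missing catalog description")
    return errors


payments = DataProduct(
    name="payments.settled",
    owner_team="payments-data",
    schema={"payment_id": "string", "amount": "decimal"},
    freshness_sla=timedelta(hours=1),
    description="Settled card payments, one row per payment.",
)
orphan = DataProduct(name="tmp.export", owner_team="", schema={},
                     freshness_sla=timedelta(0))

print(registration_errors(payments))
print(registration_errors(orphan))
```

Running a check like this in CI is one way to keep unowned or undocumented datasets out of the catalog in the first place.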
Build vs buy
We make explicit decisions about which capabilities to self-host, which to take as managed services, and which to consume as SaaS:
- Self-host where data-sovereignty requires it, or where operating-cost dynamics favour it at scale.
- Managed cloud services (BigQuery, Snowflake, Redshift, Databricks, Synapse) where the data-residency posture and operating overhead trade-offs work out.
- Specialist SaaS (Fivetran, dbt Cloud, Hightouch, Atlan, Monte Carlo) for capability areas where building or self-hosting is not where the engineering team should spend its time.
What you get at the end of a strategy & architecture engagement
- A current-state diagnostic of the data estate with named gaps and concrete impact
- A target reference architecture matched to your organisation's size, structure, and regulatory profile
- A data-product backlog with owner, contract, and SLA defined for the first cohort
- Build-vs-buy decisions justified per capability area
- A phased roadmap with an explicit Phase 01 next step for execution
- An ADR set capturing the non-obvious choices
Lakehouse-first. Self-hosted where it earns its keep.
Our default data platform is a lakehouse — open table formats on object storage, with Spark and SQL query engines on top, running on the same OpenShift fleet as the rest of the platform. Managed cloud warehouses are absolutely on the table where they fit; we don't insist on self-hosting for its own sake.
Lakehouse stack
| Layer | Technologies we operate | What we deliver |
|---|---|---|
| Object storage | MinIO (self-hosted), AWS S3, Azure ADLS Gen2, GCS | Bucket design, encryption, lifecycle policies, replication topology, IAM integration |
| Table format | Apache Iceberg, Delta Lake, Apache Hudi | Format selection by workload (write pattern, read pattern, ecosystem fit), partitioning strategy, schema evolution policy |
| Compute & query engines | Apache Spark, Trino, Presto, Dremio, DuckDB (for federated and small-scale) | Cluster sizing, query routing, workload isolation, cost-attribution per tenant |
| Orchestration | Apache Airflow, Argo Workflows, Prefect, Dagster | DAG patterns, retry semantics, alerting, lineage capture, GitOps deployment |
| Transformation | dbt (Core or Cloud), Spark SQL, SQLMesh | Modelling patterns (staging / intermediate / marts), tests, documentation, lineage |
| Ingestion | Airbyte, Fivetran, Spark, custom connectors | Source onboarding, CDC integration, incremental patterns, full-refresh recovery |
Managed cloud warehouses
When sovereignty, scale, or operating overhead points to a managed warehouse, we deliver the same engineering discipline on top:
- Snowflake. Account / role architecture, RBAC patterns, virtual-warehouse sizing, cost attribution, masking and row-access policies.
- Google BigQuery. Dataset and project layout, slot reservations, column-level security, BigQuery ML integration where it fits.
- Amazon Redshift / RA3. Managed storage, workload management, materialised views, integration with AWS Lake Formation.
- Databricks. Workspace design, Unity Catalog, Photon engine tuning, ML and analytics on the same surface.
- Azure Synapse / Fabric. Dedicated vs serverless pools, integration with Azure data services, Fabric workloads.
Why we lean lakehouse for regulated workloads
- Open formats. Storing data as Iceberg or Delta tables on object storage keeps it portable. You are not contractually locked into a single vendor's query engine.
- Cost decoupling. Storage cost decouples from compute cost; query engines can be scaled or replaced independently.
- Co-location with platform. Running on the same OpenShift fleet as your other workloads simplifies identity, networking, security posture, and observability.
- Disconnected-friendly. Self-hosted lakehouse stacks run cleanly in disconnected and air-gapped environments where managed cloud is not an option.
What you get at the end of a data platform engagement
- A working lakehouse (or managed-warehouse) platform deployed to your environment
- Table-format and partitioning strategy justified against your workload
- Query-engine and orchestration layer in place with GitOps delivery
- First data products onboarded end-to-end through the new platform
- Cost-attribution and observability tied to your FinOps and ops stack
- Runbooks for the failure modes that matter: storage partition, schema drift, query-engine restart, replication lag
Events as the spine of regulated systems.
For regulated workloads — payment authorisation, fraud detection, claims intake, network telemetry — "real-time" is not a nice-to-have. We design and operate the streaming spine that those workloads ride on, with explicit guarantees about ordering, durability, and delivery semantics.
Streaming platforms
| Platform | Strengths | Typical fit |
|---|---|---|
| Apache Kafka | Industry-standard distributed log, durable, scalable, mature ecosystem | Default for general event streaming, system-of-record events, integration spine |
| Confluent Platform / Cloud | Kafka-compatible with managed-service simplicity, schema registry, ksqlDB, control plane | Customers willing to consume Kafka as a managed service |
| Red Hat AMQ Streams (Strimzi) | Operator-driven Kafka on OpenShift, fully GitOps-managed, vendor-supported | OpenShift-anchored environments wanting self-hosted Kafka |
| Apache Pulsar | Multi-tenant, geo-replication-native, BookKeeper-backed storage | Multi-tenant SaaS-style platforms, geo-replication-heavy workloads |
| NATS / JetStream | Lightweight, low-latency, good for service-to-service messaging | Internal microservice messaging, edge and IoT scenarios |
Stream processing
- Apache Flink. Stateful stream processing with strong exactly-once guarantees, event-time windowing, large-state support. Default for stateful streaming jobs.
- Spark Structured Streaming. Where the team already operates Spark and the streaming workload fits the micro-batch model.
- Kafka Streams & ksqlDB. For Kafka-native, simpler stream transformations.
- Debezium for CDC. Change-data-capture from operational databases into the streaming spine, with snapshot and incremental modes.
Schema, contracts, and evolution
A streaming platform without schema discipline becomes a swamp very quickly. We design schema management into the platform from day one:
- Schema registry. Confluent Schema Registry, Apicurio, or equivalent — with required schemas per topic and a compatibility policy.
- Schema formats. Avro for compact wire format with backward-compatible evolution; Protobuf where ecosystem fit demands it; JSON Schema for human-readable contracts.
- Compatibility policy. Explicit backward / forward / full compatibility per topic, with breaking-change deprecation paths.
- Topic contracts. Owner, semantics, retention, partitioning, expected throughput, allowed consumers — documented in the catalog.
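To make the compatibility policy concrete, here is a minimal sketch of a backward-compatibility check that could run before a new schema version is registered. It uses a deliberately simplified rule set, which is an assumption and not the full Avro schema-resolution rules or any registry's actual implementation; the dictionary layout and function name are hypothetical.

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> list:
    """List the reasons new_fields cannot read data written with old_fields.

    Each schema maps field name -> {"type": str, "default": optional}.
    Simplified rules (an assumption, not the full Avro resolution rules):
      * a field added in the new schema needs a default
      * a field present in both schemas must keep its type
      * a field dropped from the new schema is fine (readers ignore it)
    """
    problems = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            if "default" not in spec:
                problems.append(f"added field '{name}' has no default")
        elif spec["type"] != old_fields[name]["type"]:
            problems.append(
                f"field '{name}' changed type "
                f"{old_fields[name]['type']} -> {spec['type']}"
            )
    return problems


v1 = {"payment_id": {"type": "string"}, "amount": {"type": "long"}}
v2_ok = {**v1, "currency": {"type": "string", "default": "EUR"}}
v2_bad = {
    "payment_id": {"type": "string"},
    "amount": {"type": "string"},
    "channel": {"type": "string"},
}

print(backward_compatible(v1, v2_ok))
print(backward_compatible(v1, v2_bad))
```

In practice the schema registry enforces the configured compatibility mode itself; a pre-merge check of this kind simply surfaces the failure earlier, in the pull request.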
Operational discipline
- Dead-letter handling. Every consumer has a DLQ pattern with documented retry, alerting, and reprocessing path. Silent loss is treated as a defect.
- Delivery semantics. Each topic and consumer documents its at-most-once / at-least-once / exactly-once posture and the trade-offs that justify it.
- Observability. Lag per consumer group, throughput per topic, schema-violation rate, DLQ depth — all in the same dashboards as the rest of the platform.
- Disaster recovery. Multi-region replication topology (MirrorMaker 2, Confluent Replicator, or equivalent), documented RPO/RTO, regularly drilled.
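The dead-letter pattern can be sketched without any broker at all. The example below is an in-memory illustration under stated assumptions: `consume` and `parse_amount` are hypothetical names, the retry is immediate (no backoff), and a real consumer would publish dead letters to a dedicated topic rather than return them in a list.

```python
def consume(messages, handler, max_attempts=3):
    """Process messages; route poison messages to a dead-letter list.

    Returns (processed_results, dead_letters). Each message is attempted
    up to max_attempts times before being dead-lettered with its last error.
    """
    processed, dead_letters = [], []
    for message in messages:
        last_error = None
        for _attempt in range(max_attempts):
            try:
                processed.append(handler(message))
                break
            except Exception as exc:  # broad on purpose: nothing is dropped silently
                last_error = exc
        else:
            # The loop never hit `break`, so every attempt failed.
            dead_letters.append({
                "message": message,
                "error": repr(last_error),
                "attempts": max_attempts,
            })
    return processed, dead_letters


def parse_amount(raw: str) -> int:
    """Hypothetical handler: parse a payment amount in minor units."""
    return int(raw)


ok, dlq = consume(["100", "abc", "250"], parse_amount)
print(ok)
print(dlq)
```

The point of the sketch is the invariant: every input message ends up either in the processed output or in the dead-letter list, so DLQ depth becomes a meaningful metric to alert on.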
What you get at the end of a streaming engagement
- A production-shaped Kafka (or equivalent) cluster on OpenShift or your chosen cloud
- Schema registry, schema patterns, and per-topic contracts
- Stream-processing jobs onboarded with documented delivery semantics
- CDC integration from operational databases into the streaming spine
- Observability and DLQ handling tied to your incident-response loop
- DR drill evidence with documented RPO/RTO
Governance as engineering — not as a committee.
Data governance fails when it is run as a slide deck overseen by a steering committee. We embed governance into the same engineering surface as the rest of the platform — the catalog updates automatically, the lineage is captured at the source, the data contracts run as CI checks, and the audit evidence is a side effect of normal operation.
Catalog & discoverability
| Tool | Strengths | When we choose it |
|---|---|---|
| Open-source: DataHub, OpenMetadata, Apache Atlas | Self-hosted, customisable, integrates broadly across the data stack | On-prem and disconnected environments, organisations preferring an open-source-anchored stack |
| Commercial: Atlan, Collibra, Alation | Mature stewardship workflows, business-glossary tooling, vendor support | Large enterprises with mature governance functions and willingness to adopt a SaaS catalog |
| Native: Unity Catalog, Snowflake Horizon | Tightly integrated with the host platform's auth, lineage, and access controls | When the data estate is anchored on a single managed platform |
Lineage
- OpenLineage. Standard for emitting lineage events from Spark, Airflow, dbt, and other tools into the catalog.
- Column-level lineage via dbt and supported catalogs — not just table-to-table arrows.
- Cross-system lineage from CDC source through streaming → lake → warehouse → BI tool — tracing a regulatory report all the way back to its operational origin.
- Lineage as audit evidence. A regulator question about "where did this number come from?" answerable in minutes, not weeks.
Data contracts & quality
- Schema and semantics contracts defined per data product, versioned in Git, reviewed in PR.
- Quality tests. dbt tests, Great Expectations, Soda — running in the same pipeline as the transformation, failing the pipeline if the contract breaks.
- Freshness SLAs. Each data product has a freshness target; misses produce alerts and incident tickets.
- Anomaly detection. Monte Carlo, custom heuristics, or platform-native tooling to catch statistical regressions before downstream consumers do.
- Quality dashboards per data product, showing SLA compliance, top failure modes, and active incidents.
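As an illustration of quality tests that fail the pipeline, the sketch below runs not-null, uniqueness, and freshness checks in plain Python and raises on any breach. It is not how dbt tests, Great Expectations, or Soda are configured; `ContractViolation` and `check_contract` are hypothetical names used only to show the control flow.

```python
from datetime import datetime, timedelta, timezone


class ContractViolation(Exception):
    """Raised to fail the pipeline run when a data contract check breaks."""


def check_contract(rows, key, required, loaded_at, freshness_sla, now):
    """Run not-null, uniqueness, and freshness checks; raise on any failure."""
    failures = []
    for column in required:
        nulls = sum(1 for row in rows if row.get(column) is None)
        if nulls:
            failures.append(f"{nulls} null value(s) in '{column}'")
    keys = [row.get(key) for row in rows]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate values in key '{key}'")
    if now - loaded_at > freshness_sla:
        failures.append("freshness SLA missed")
    if failures:
        raise ContractViolation("; ".join(failures))


now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"account_id": "A1", "balance": 10},
    {"account_id": "A1", "balance": None},
]
try:
    check_contract(
        rows,
        key="account_id",
        required=["account_id", "balance"],
        loaded_at=now - timedelta(hours=3),
        freshness_sla=timedelta(hours=1),
        now=now,
    )
    outcome = "passed"
except ContractViolation as violation:
    outcome = str(violation)

print(outcome)
```

Collecting every failure before raising, rather than stopping at the first, gives the owning team one complete incident ticket instead of several partial ones.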
Master Data Management (MDM)
- Domain-by-domain scoping. Customer, product, account, counterparty: we tackle MDM one domain at a time rather than boiling the ocean.
- Golden-record design. Match-merge logic, deterministic and probabilistic patterns, with the survivor rules documented and reviewed.
- Source-system reconciliation. Reconciliation reports per source, with mismatches routed to data stewards.
- Stewardship workflow. Web UI for stewards to review proposed matches, approve merges, and resolve conflicts.
- Operational MDM vs analytical MDM. We are explicit about whether the golden record drives operational systems (writes back) or is purely analytical.
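A golden-record build can be sketched as a deterministic match followed by a documented survivor rule. The example below is a simplified assumption: it matches on normalised email plus date of birth and keeps the most recently updated non-empty value per attribute. The record layout and function names are hypothetical, and probabilistic matching is not shown.

```python
from collections import defaultdict


def match_key(record: dict) -> tuple:
    """Deterministic match key: normalised email plus date of birth."""
    return (record["email"].strip().lower(), record["date_of_birth"])


def merge(records: list) -> dict:
    """Survivor rule: per attribute, the most recently updated non-empty value wins."""
    golden = {}
    # ISO-8601 date strings sort chronologically, so later records overwrite earlier ones.
    for record in sorted(records, key=lambda r: r["updated_at"]):
        for attribute, value in record.items():
            if attribute != "source" and value not in (None, ""):
                golden[attribute] = value
    golden["sources"] = sorted(r["source"] for r in records)
    return golden


def golden_records(records: list) -> list:
    """Group records by match key and merge each group into one golden record."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record)
    return [merge(group) for group in groups.values()]


records = [
    {"source": "crm", "email": "Ana@Example.com ", "date_of_birth": "1990-04-02",
     "name": "Ana Silva", "phone": "", "updated_at": "2024-01-10"},
    {"source": "core-banking", "email": "ana@example.com", "date_of_birth": "1990-04-02",
     "name": "Ana M. Silva", "phone": "+351 900 000 000", "updated_at": "2024-03-05"},
    {"source": "crm", "email": "rui@example.com", "date_of_birth": "1985-11-20",
     "name": "Rui Costa", "phone": "", "updated_at": "2024-02-01"},
]
golden = golden_records(records)
print(golden[0])
```

Keeping the contributing `sources` on the golden record is what makes per-source reconciliation reports possible later.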
Privacy & access control
- PII detection and classification across the data estate, refreshed on schema changes.
- Row-level and column-level security tied to your identity provider, not bolted on per-tool.
- Masking and tokenisation for PII in non-production environments.
- Purpose-based access controls for regulatory regimes (GDPR purpose-limitation, sectoral consent regimes).
- Audit logging of every access event, retained per regulatory requirement.
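One common way to tokenise PII for non-production copies is a keyed hash, so that the same input always yields the same token and joins still work. The sketch below uses Python's standard `hmac` module; the `tok_` prefix, the 16-character truncation, and the hard-coded demo key are illustrative assumptions. A real key would come from a secrets manager, and truncation trades collision resistance for shorter tokens.

```python
import hashlib
import hmac

# Illustrative only: a real key would be fetched from a secrets manager.
DEMO_KEY = b"non-production-demo-key"


def tokenise(value: str, secret: bytes) -> str:
    """Deterministic keyed token: the same input and key give the same token."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"


def mask_row(row: dict, pii_columns: set, secret: bytes) -> dict:
    """Return a copy of row with non-null PII columns replaced by tokens."""
    masked = {}
    for column, value in row.items():
        if column in pii_columns and value is not None:
            masked[column] = tokenise(str(value), secret)
        else:
            masked[column] = value
    return masked


row = {"customer": "Ana Silva", "iban": "PT50000201231234567890154", "amount": 120}
masked = mask_row(row, {"customer", "iban"}, DEMO_KEY)
print(masked)
```

Because the token is keyed, someone with the masked data but without the key cannot confirm a guess by hashing candidate values, which a plain unsalted hash would allow.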
What you get at the end of a governance engagement
- A working catalog with the first cohort of data products onboarded
- End-to-end lineage from source systems through transformations to consumption
- Data contracts running as CI checks, blocking breaking changes
- Quality tests, freshness SLAs, and an incident-response workflow tied to your existing IR loop
- Privacy classifications and access policies tied to your IdP
- An audit-evidence capture path that survives the next regulator visit
A semantic layer your reports can trust.
Most BI estates accumulate hundreds of reports that agree on the headlines and disagree on the numbers. The fix is rarely "buy a new BI tool" — it is putting a semantic layer between the warehouse and the BI tool so that "revenue", "active customer", and "delinquency" mean one thing.
Semantic layer
- dbt as the transformation and semantic backbone for analytical models — tests, documentation, lineage, version control, all in one engineering practice.
- Cube, MetricFlow, or LookML as the consumer-facing semantic layer, depending on the BI tool and the contracting model.
- SQLMesh for organisations needing first-class virtual environments and time-travel-friendly modelling.
- Metrics-as-code. Every metric defined once, versioned in Git, peer-reviewed, with explicit ownership.
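The "defined once" rule behind metrics-as-code can be illustrated with a tiny registry that refuses a second definition of the same metric name. This is a sketch, not the dbt, Cube, MetricFlow, or LookML format; the `Metric` and `MetricRegistry` classes and the SQL expression are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric definition, kept in Git and reviewed like code."""
    name: str
    owner_team: str
    sql: str
    description: str


class MetricRegistry:
    """Holds exactly one definition per metric name."""

    def __init__(self):
        self._metrics = {}

    def register(self, metric: Metric) -> None:
        if metric.name in self._metrics:
            owner = self._metrics[metric.name].owner_team
            raise ValueError(f"metric '{metric.name}' is already defined by {owner}")
        self._metrics[metric.name] = metric

    def sql_for(self, name: str) -> str:
        return self._metrics[name].sql


registry = MetricRegistry()
registry.register(Metric(
    name="active_customers",
    owner_team="customer-analytics",
    sql="COUNT(DISTINCT customer_id)",
    description="Customers with at least one transaction in the reporting window.",
))
print(registry.sql_for("active_customers"))
```

Every BI tool then asks the registry for the expression instead of carrying its own copy, which is what stops two reports from disagreeing on the same metric.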
BI tooling
| Tool | Strengths | Typical fit |
|---|---|---|
| Looker | Strong semantic layer (LookML), governance-friendly, embedded analytics | Mid-to-large enterprises wanting governed self-service |
| Microsoft Power BI | Ubiquitous in Microsoft-anchored estates, strong report-authoring | Microsoft-first enterprises, large business-user populations |
| Tableau | Strong visual-exploration, mature dashboarding | Analyst-heavy organisations, exploratory work |
| Apache Superset | Open-source, self-hosted, container-native | Disconnected environments, organisations preferring an open-source BI tier |
| Metabase / Lightdash / Evidence | Lighter-weight, modern-data-stack-native | Smaller estates, teams already on dbt |
Self-service governance
Self-service analytics fails when the boundary between governed and ungoverned queries is unclear. We design that boundary explicitly:
- Certified marts. Datasets that have passed quality, lineage, and ownership checks. Tagged in the catalog and the BI tool.
- Sandbox tier. Where analysts can explore, model, and prototype — without their work appearing in board-level dashboards by accident.
- Promotion path. Documented route for moving a sandbox dataset into the certified tier — review, tests, ownership, SLA.
- Deprecation path. Reports nobody reads get archived; orphaned datasets get deleted.
Reporting that matters
- Operational reporting tied to the source-of-record system.
- Regulatory reporting with end-to-end lineage from the report back to operational data.
- Embedded analytics in customer-facing applications.
- Executive scorecards that the board actually uses — with explicit definitions and a single source of truth.
What you get at the end of an analytics & BI engagement
- A working semantic layer with metrics defined once and reusable
- A primary BI tool rolled out with governed self-service patterns
- Certified marts onboarded for the first cohort of business areas
- A deprecation backlog of legacy reports replaced by certified equivalents
- An analytics governance model that your data and business stakeholders have both signed off on
AI rides on the governed data layer — or it doesn't ride at all.
Most enterprise AI programmes stall at the data layer. The lesson is consistent across customers: AI workloads built on shadow pipelines fail internal audit; AI workloads built on the governed analytical surface ship. We build the data layer so AI is a first-class consumer.
Feature stores
- Feast as the default open-source feature store on top of the lakehouse — offline store on Iceberg/Delta, online store on Redis or low-latency KV, registry in Git.
- Platform-native stores — Databricks Feature Store, Vertex AI Feature Store, SageMaker Feature Store — where the data estate is anchored on a single platform.
- Feature lineage. Every feature traceable back to its source data, with version history.
- Online/offline consistency. Documented guarantees on the relationship between training-time and serving-time feature values.
- Feature SLAs. Freshness and availability targets per feature, monitored in production.
Training-data pipelines
- Reproducible dataset versions. Every training run pinned to a specific dataset version, with signed manifests.
- PII handling. Tokenisation, masking, or full removal at ingestion. Differential privacy where the use case allows.
- Consent and purpose. Training-data inclusion tied to documented consent and purpose-limitation policies.
- Labelled-data pipelines with human-in-the-loop tooling (Label Studio, Argilla, vendor platforms), version control, inter-annotator agreement.
- Synthetic data for cases where real data is scarce, sensitive, or restricted — with documented validation against real-data behaviour.
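Pinning a training run to a dataset version usually comes down to content hashes plus a signature over them. The sketch below is one possible shape, under stated assumptions: it uses an HMAC from the standard library as a stand-in for whatever signing scheme is actually chosen (an asymmetric signature is a common alternative), and the manifest layout and function names are hypothetical.

```python
import hashlib
import hmac
import json


def build_manifest(files: dict, signing_key: bytes) -> dict:
    """files maps a relative path to its bytes; returns a signed manifest."""
    entries = {path: hashlib.sha256(data).hexdigest()
               for path, data in sorted(files.items())}
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return {
        "files": entries,
        # Version is derived from content, so identical data gives the same version.
        "dataset_version": hashlib.sha256(payload).hexdigest()[:12],
        "signature": hmac.new(signing_key, payload, hashlib.sha256).hexdigest(),
    }


def verify_manifest(manifest: dict, files: dict, signing_key: bytes) -> bool:
    """True only if the file hashes and the signature both match the manifest."""
    expected = build_manifest(files, signing_key)
    return (
        expected["files"] == manifest["files"]
        and hmac.compare_digest(expected["signature"], manifest["signature"])
    )


key = b"demo-signing-key"  # illustrative only, not a real key-management scheme
dataset = {"train/part-0.csv": b"id,label\n1,0\n2,1\n"}
manifest = build_manifest(dataset, key)

tampered = {"train/part-0.csv": b"id,label\n1,1\n2,1\n"}
print(verify_manifest(manifest, dataset, key))
print(verify_manifest(manifest, tampered, key))
```

A training job that records `dataset_version` alongside the model artefact can later prove which exact bytes it was trained on.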
Retrieval grounding for LLMs
RAG is, fundamentally, a data problem. The model is interchangeable; the retrieval is what makes the answer right. We build the retrieval surface as a data product:
- Source-document pipelines. Ingestion from authoritative sources with provenance, version, and last-modified tracking.
- Chunking strategies calibrated to your content: section-based for legal/policy documents, sentence-based for support content, hybrid for mixed corpora.
- Embedding pipelines with embedding-model version, batch vs streaming refresh, re-embedding triggers on document update.
- Vector store choice driven by latency, scale, sovereignty — not by ecosystem hype.
- Retrieval evaluation with a labelled eval set, recall@k metrics, blocking regressions on retrieval changes.
- Audit trail. Every retrieved chunk surfaced to the user with citation, retained for audit.
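Retrieval evaluation with recall@k, and a gate that blocks regressions, can be sketched in a few lines. The labelled eval set, the stub retriever, and the `passes_gate` helper below are hypothetical and exist only for illustration; the recall@k formula itself is the standard one (relevant documents found in the top k, divided by all relevant documents).

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("each eval query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(eval_set: dict, retrieve, k: int) -> float:
    """Average recall@k over a labelled eval set of query -> relevant doc ids."""
    scores = [recall_at_k(retrieve(query), relevant, k)
              for query, relevant in eval_set.items()]
    return sum(scores) / len(scores)


def passes_gate(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """Block a retrieval change if recall drops by more than the tolerance."""
    return candidate + tolerance >= baseline


# Hypothetical labelled eval set and a stub retriever returning ranked doc ids.
eval_set = {
    "What is the complaints deadline?": {"d1", "d2"},
    "Who approves a limit increase?": {"d5"},
}
ranked = {
    "What is the complaints deadline?": ["d1", "d9", "d2"],
    "Who approves a limit increase?": ["d5", "d3"],
}
score = mean_recall_at_k(eval_set, ranked.get, k=2)
print(score)
print(passes_gate(baseline=0.80, candidate=score))
```

In this example the first query finds one of its two relevant documents in the top two results and the second finds its only one, so the mean is 0.75 and the gate rejects the change against a 0.80 baseline.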
Cross-cutting
This work overlaps with our AI practice on inference, agentic patterns, MLOps, and AI governance. The data work is the substrate; the AI work is what runs on top.
What you get at the end of a data-for-AI engagement
- A feature store deployed and integrated with your lakehouse
- Reproducible training-data pipelines with signed dataset versions
- Labelled-data tooling for the use cases that need it
- RAG retrieval surface treated as a versioned data product
- Lineage from operational data → features → model → output, surfaced to audit
- Documented consent, purpose, and PII handling for AI use cases
Have a data programme that needs engineering depth?
Send us a short note describing the problem. We’ll come back with a concrete first-two-weeks scope and a definition of done for the engagement.