Data & Analytics practice

A specialist data practice for regulated workloads.

Six aspects, one operating discipline. From data strategy and architecture through lakehouse platforms, real-time streaming, governance and quality, analytics and BI, and data-for-AI, all built on the same fleet, same identity, and same audit posture as the rest of your platform.

01 — Overview

Data as a platform — not as a project.

Most data programmes deliver dashboards. The interesting work happens earlier — in the architecture that lets every team get the data it needs without re-platforming, in the governance that survives the next regulator visit, and in the operational discipline that keeps quality from quietly decaying. We engineer that work.

Strategy & architecture

Reference architectures (centralised, federated, data-mesh), data-product thinking, build-vs-buy.

Data platforms

Lakehouse (Iceberg/Delta), warehouses (Snowflake/BigQuery/Redshift), lakes (S3/MinIO), Spark, query engines.

Streaming & real-time

Apache Kafka, Flink, Spark Streaming, change-data-capture, event-driven architectures.

Governance & quality

Catalog, lineage, MDM, data contracts, data-quality testing, privacy controls.

Analytics & BI

Semantic layer (dbt, Cube), BI tooling (Looker, Power BI, Tableau, Superset), self-service governance.

Data for AI

Feature stores, training pipelines, ML-ready data, retrieval grounding, audit-grade lineage.

Engagement archetypes

Engagement type	Typical scope	Duration
Data strategy & architecture	Current-state assessment, target architecture, build-vs-buy, data-product backlog, roadmap	4–8 weeks
Lakehouse platform stand-up	Object storage, table format (Iceberg/Delta), query engine, Spark, identity, governance, GitOps delivery	10–16 weeks
Streaming platform stand-up	Kafka cluster (or managed), schema registry, streaming jobs, CDC integration, dead-letter handling, observability	8–14 weeks
Data governance bring-up	Catalog, lineage, ownership model, data contracts, quality testing, privacy controls, audit evidence	8–12 weeks
Analytics & BI modernisation	Semantic layer via dbt, BI rollout, self-service governance, deprecation of legacy reports	10–16 weeks
Data-for-AI engagement	Feature store, training pipelines, retrieval/embedding pipelines, lineage for AI workloads	8–14 weeks
Master Data Management (MDM)	Golden-record design, matching/merging, source-system reconciliation, stewardship workflow	12–20 weeks

What makes us different

Platform-anchored data. Data platforms run on the same OpenShift fleets as the rest of your regulated workloads, with the same identity boundary, security posture, and observability stack.
Data products, not data swamps. Every dataset has an owner, a contract, an SLA, and a documented lineage. Untyped, undocumented, unowned pipelines are a defect.
Audit posture by default. Schema changes, data movement, access events, and quality results are all captured as evidence at the source, not reconstructed for the next audit.
AI-ready by design. The data layer is built so that AI workloads ride on the same governed surface as analytics — not on shadow pipelines built by ML engineers in haste.

02 — Data Strategy & Architecture

Decide the shape of the data estate before you platform it.

Most "data platform" projects fail because the architecture was decided implicitly, in the gap between business strategy and engineering capacity. We make that architecture explicit, debated, and decided — before any platform engineering starts.

Architecture patterns we use

Pattern	Strengths	When we choose it
Centralised lakehouse	Single source of truth, simple governance, lower operating overhead	Small-to-medium estates, organisations with a central data team, regulated environments where governance simplicity matters
Hub-and-spoke (federated)	Central governance and shared dimensions; domain teams own their data products	Mid-sized organisations with multiple business units and a reasonably strong central platform team
Data mesh	Domain teams fully own data products; central platform team provides the substrate; governance is federated	Large, multi-domain organisations where centralising the data team becomes the bottleneck
Operational data store (ODS) layer	Cleaned, integrated operational data near the source systems	Mainframe-heavy estates, banks, insurance carriers with critical batch cycles
Hybrid	Different patterns for different domains in the same enterprise	Most realistic large-enterprise deployments

Data-product thinking

We treat every analytical dataset as a data product — with an owner, a contract, an SLA, and a lifecycle:

Owner. A named team (not a person) accountable for the dataset's accuracy, freshness, and availability.
Contract. Schema, semantics, allowed values, freshness expectation, breaking-change policy — defined and reviewed before consumers depend on it.
SLA. Documented freshness, availability, and quality targets.
Discoverability. Surfaced in the catalog with description, lineage, and example queries.
Versioning. Breaking changes go through a deprecation path; consumers get notice.
Deprecation. Datasets that no consumer reads are deprecated, archived, deleted — on a schedule, not a whim.
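
As a minimal sketch of how a contract like this could be expressed and checked in code: the `DataContract` class, its fields, and the rule that removing or retyping a column counts as breaking are illustrative assumptions, not a standard or a specific tool's format.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract record for one data product."""
    name: str
    owner_team: str               # a team, not a person
    columns: dict[str, str]       # column name -> logical type
    max_staleness: timedelta      # freshness expectation
    version: int = 1


def breaking_changes(old: DataContract, new: DataContract) -> list[str]:
    """List changes that would break existing consumers.

    Assumption: removing a column or changing its type is breaking;
    adding a column is not.
    """
    problems = []
    for column, old_type in old.columns.items():
        if column not in new.columns:
            problems.append(f"column removed: {column}")
        elif new.columns[column] != old_type:
            problems.append(
                f"type changed: {column} {old_type} -> {new.columns[column]}"
            )
    return problems


v1 = DataContract(
    name="customer_accounts",
    owner_team="retail-data",
    columns={"account_id": "string", "balance": "decimal", "opened_on": "date"},
    max_staleness=timedelta(hours=24),
)
v2 = DataContract(
    name="customer_accounts",
    owner_team="retail-data",
    columns={"account_id": "string", "balance": "float", "segment": "string"},
    max_staleness=timedelta(hours=24),
    version=2,
)

for problem in breaking_changes(v1, v2):
    print(problem)
```

A check of this shape could run in review before a new contract version is published, so that a non-empty result sends the change down the deprecation path instead of straight to consumers.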

Build vs buy

We make explicit decisions about which capabilities to self-host, which to take as managed services, and which to consume as SaaS:

Self-host where data-sovereignty requires it, or where operating-cost dynamics favour it at scale.
Managed cloud services (BigQuery, Snowflake, Redshift, Databricks, Synapse) where the data-residency posture and operating overhead trade-offs work out.
Specialist SaaS (Fivetran, dbt Cloud, Hightouch, Atlan, Monte Carlo) for capability areas where building or self-hosting is not where the engineering team should spend its time.

What you get at the end of a strategy & architecture engagement

A current-state diagnostic of the data estate with named gaps and concrete impact
A target reference architecture matched to your organisation's size, structure, and regulatory profile
A data-product backlog with owner, contract, and SLA defined for the first cohort
Build-vs-buy decisions justified per capability area
A phased roadmap with an explicit Phase 01 next step for execution
An ADR set capturing the non-obvious choices

03 — Data Platforms

Lakehouse-first. Self-hosted where it earns its keep.

Our default data platform is a lakehouse — open table formats on object storage, with Spark and SQL query engines on top, running on the same OpenShift fleet as the rest of the platform. Managed cloud warehouses are absolutely on the table where they fit; we don't insist on self-hosting for its own sake.

Lakehouse stack

Layer	Technologies we operate	What we deliver
Object storage	MinIO (self-hosted), AWS S3, Azure ADLS Gen2, GCS	Bucket design, encryption, lifecycle policies, replication topology, IAM integration
Table format	Apache Iceberg, Delta Lake, Apache Hudi	Format selection by workload (write pattern, read pattern, ecosystem fit), partitioning strategy, schema evolution policy
Compute & query engines	Apache Spark, Trino, Presto, Dremio, DuckDB (for federated and small-scale)	Cluster sizing, query routing, workload isolation, cost-attribution per tenant
Orchestration	Apache Airflow, Argo Workflows, Prefect, Dagster	DAG patterns, retry semantics, alerting, lineage capture, GitOps deployment
Transformation	dbt (Core or Cloud), Spark SQL, SQLMesh	Modelling patterns (staging / intermediate / marts), tests, documentation, lineage
Ingestion	Airbyte, Fivetran, Spark, custom connectors	Source onboarding, CDC integration, incremental patterns, full-refresh recovery

Managed cloud warehouses

When sovereignty, scale, or operating overhead points to a managed warehouse, we deliver the same engineering discipline on top:

Snowflake. Account / role architecture, RBAC patterns, virtual-warehouse sizing, cost attribution, masking and row-access policies.
Google BigQuery. Dataset and project layout, slot reservations, column-level security, BigQuery ML integration where it fits.
Amazon Redshift / RA3. Managed storage, workload management, materialised views, integration with AWS Lake Formation.
Databricks. Workspace design, Unity Catalog, Photon engine tuning, ML and analytics on the same surface.
Azure Synapse / Fabric. Dedicated vs serverless pools, integration with Azure data services, Fabric workloads.

Why we lean lakehouse for regulated workloads

Open formats. Iceberg and Delta on object storage means data is portable. You are not contractually locked into a single vendor's query engine.
Cost decoupling. Storage cost decouples from compute cost; query engines can be scaled or replaced independently.
Co-location with platform. Running on the same OpenShift fleet as your other workloads simplifies identity, networking, security posture, and observability.
Disconnected-friendly. Self-hosted lakehouse stacks run cleanly in disconnected and air-gapped environments where managed cloud is not an option.

What you get at the end of a data platform engagement

A working lakehouse (or managed-warehouse) platform deployed to your environment
Table-format and partitioning strategy justified against your workload
Query-engine and orchestration layer in place with GitOps delivery
First data products onboarded end-to-end through the new platform
Cost-attribution and observability tied to your FinOps and ops stack
Runbooks for the failure modes that matter: storage partition, schema drift, query-engine restart, replication lag

04 — Streaming & Real-Time

Events as the spine of regulated systems.

For regulated workloads — payment authorisation, fraud detection, claims intake, network telemetry — "real-time" is not a nice-to-have. We design and operate the streaming spine that those workloads ride on, with explicit guarantees about ordering, durability, and delivery semantics.

Streaming platforms

Platform	Strengths	Typical fit
Apache Kafka	Industry-standard distributed log, durable, scalable, mature ecosystem	Default for general event streaming, system-of-record events, integration spine
Confluent Platform / Cloud	Kafka-compatible with managed-service simplicity, schema registry, ksqlDB, control plane	Customers willing to consume Kafka as a managed service
Red Hat AMQ Streams (Strimzi)	Operator-driven Kafka on OpenShift, fully GitOps-managed, vendor-supported	OpenShift-anchored environments wanting self-hosted Kafka
Apache Pulsar	Multi-tenant, geo-replication-native, BookKeeper-backed storage	Multi-tenant SaaS-style platforms, geo-replication-heavy workloads
NATS / JetStream	Lightweight, low-latency, good for service-to-service messaging	Internal microservice messaging, edge and IoT scenarios

Stream processing

Apache Flink. Stateful stream processing with strong exactly-once guarantees, event-time windowing, large-state support. Default for stateful streaming jobs.
Spark Structured Streaming. Where the team already operates Spark and the streaming workload fits the micro-batch model.
Kafka Streams & ksqlDB. For Kafka-native, simpler stream transformations.
Debezium for CDC. Change-data-capture from operational databases into the streaming spine, with snapshot and incremental modes.

Schema, contracts, and evolution

A streaming platform without schema discipline becomes a swamp very quickly. We design schema management into the platform from day one:

Schema registry. Confluent Schema Registry, Apicurio, or equivalent — with required schemas per topic and a compatibility policy.
Schema formats. Avro for compact wire format with backward-compatible evolution; Protobuf where ecosystem fit demands it; JSON Schema for human-readable contracts.
Compatibility policy. Explicit backward / forward / full compatibility per topic, with breaking-change deprecation paths.
Topic contracts. Owner, semantics, retention, partitioning, expected throughput, allowed consumers — documented in the catalog.
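
To make the backward-compatibility rule concrete, here is a simplified sketch. It assumes a schema is a plain mapping of field name to type and optional default, and it covers only the core rule (a new reader must be able to read old data). It is not Avro's full schema-resolution algorithm and not the API of any schema registry; the function and field names are hypothetical.

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> list[str]:
    """Return reasons the new schema cannot read data written with the old one.

    Each schema is a simplified mapping: field name -> {"type": ..., "default": ...}.
    The "default" key is optional. Type promotion, unions, and aliases are
    deliberately ignored in this sketch.
    """
    violations = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            # Old records lack this field, so the reader needs a default.
            if "default" not in spec:
                violations.append(f"new field without default: {name}")
        elif old_fields[name]["type"] != spec["type"]:
            violations.append(f"type changed: {name}")
    # Fields dropped from the new schema are simply ignored by the reader.
    return violations


payment_v1 = {
    "payment_id": {"type": "string"},
    "amount_cents": {"type": "long"},
}
payment_v2 = {
    "payment_id": {"type": "string"},
    "amount_cents": {"type": "long"},
    "currency": {"type": "string", "default": "EUR"},
}
payment_v3 = {
    "payment_id": {"type": "string"},
    "amount_cents": {"type": "long"},
    "channel": {"type": "string"},
}

print(backward_compatible(payment_v1, payment_v2))
print(backward_compatible(payment_v1, payment_v3))
```

In practice the registry enforces the policy when a schema is registered; a local check like this is only useful as an early signal in a pull request.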

Operational discipline

Dead-letter handling. Every consumer has a DLQ pattern with documented retry, alerting, and reprocessing path. Silent loss is treated as a defect.
Delivery semantics. Each topic and consumer documents its at-most-once / at-least-once / exactly-once posture and the trade-offs that justify it.
Observability. Lag per consumer group, throughput per topic, schema-violation rate, DLQ depth — all in the same dashboards as the rest of the platform.
Disaster recovery. Multi-region replication topology (MirrorMaker 2, Confluent Replicator, or equivalent), documented RPO/RTO, regularly drilled.
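
The dead-letter pattern can be sketched in a few lines. This version is an in-memory illustration under stated assumptions: `consume_with_dlq`, `DeadLetter`, and `parse_claim` are hypothetical names, there is no backoff between attempts, and the dead-letter "queue" is a Python list where a real consumer would publish to a dedicated topic.

```python
from dataclasses import dataclass


@dataclass
class DeadLetter:
    message: dict
    error: str
    attempts: int


def consume_with_dlq(messages, handler, max_attempts=3):
    """Process messages; route repeated failures to a dead-letter list."""
    processed, dead_letters = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(message))
                break
            except Exception as exc:  # broad on purpose: nothing is dropped silently
                if attempt == max_attempts:
                    dead_letters.append(DeadLetter(message, repr(exc), attempt))
    return processed, dead_letters


def parse_claim(message):
    # Hypothetical handler: rejects claims with a missing amount.
    if message.get("amount") is None:
        raise ValueError("missing amount")
    return {"claim_id": message["claim_id"], "amount": message["amount"]}


ok, dlq = consume_with_dlq(
    [{"claim_id": "c1", "amount": 120}, {"claim_id": "c2", "amount": None}],
    parse_claim,
)
print(len(ok), "processed;", len(dlq), "dead-lettered")
```

Every failed message ends up either processed or in the dead-letter list with its error and attempt count, which is what makes alerting and reprocessing possible later.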

What you get at the end of a streaming engagement

A production-shaped Kafka (or equivalent) cluster on OpenShift or your chosen cloud
Schema registry, schema patterns, and per-topic contracts
Stream-processing jobs onboarded with documented delivery semantics
CDC integration from operational databases into the streaming spine
Observability and DLQ handling tied to your incident-response loop
DR drill evidence with documented RPO/RTO

05 — Data Governance & Quality

Governance as engineering — not as a committee.

Data governance fails when it is run as a slide deck overseen by a steering committee. We embed governance into the same engineering surface as the rest of the platform — the catalog updates automatically, the lineage is captured at the source, the data contracts run as CI checks, and the audit evidence is a side effect of normal operation.

Catalog & discoverability

Tool	Strengths	When we choose it
Open-source: DataHub, OpenMetadata, Apache Atlas	Self-hosted, customisable, integrates broadly across the data stack	On-prem and disconnected environments, organisations preferring an open-source-anchored stack
Commercial: Atlan, Collibra, Alation	Mature stewardship workflows, business-glossary tooling, vendor support	Large enterprises with mature governance functions and willingness to adopt a SaaS catalog
Native: Unity Catalog, Snowflake Horizon	Tightly integrated with the host platform's auth, lineage, and access controls	When the data estate is anchored on a single managed platform

Lineage

OpenLineage. Standard for emitting lineage events from Spark, Airflow, dbt, and other tools into the catalog.
Column-level lineage via dbt and supported catalogs — not just table-to-table arrows.
Cross-system lineage from CDC source through streaming → lake → warehouse → BI tool — tracing a regulatory report all the way back to its operational origin.
Lineage as audit evidence. A regulator question about "where did this number come from?" answerable in minutes, not weeks.
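
Answering "where did this number come from?" amounts to walking the lineage graph upstream. The sketch below assumes lineage has already been collected into a simple mapping of dataset to parents; the dataset names are invented for illustration, and this is a generic graph traversal, not the OpenLineage API or any catalog's query interface.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> the datasets it is derived from.
UPSTREAM = {
    "bi.capital_adequacy_report": ["warehouse.risk_exposures"],
    "warehouse.risk_exposures": ["lake.loans_clean", "lake.collateral_clean"],
    "lake.loans_clean": ["kafka.loans_cdc"],
    "lake.collateral_clean": ["kafka.collateral_cdc"],
    "kafka.loans_cdc": ["core_banking.loans"],
    "kafka.collateral_cdc": ["core_banking.collateral"],
}


def operational_origins(dataset: str) -> list[str]:
    """Walk upstream breadth-first and return the datasets with no parents."""
    origins, seen, queue = [], {dataset}, deque([dataset])
    while queue:
        current = queue.popleft()
        parents = UPSTREAM.get(current, [])
        if not parents:
            origins.append(current)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(origins)


print(operational_origins("bi.capital_adequacy_report"))
```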

Data contracts & quality

Schema and semantics contracts defined per data product, versioned in Git, reviewed in PR.
Quality tests. dbt tests, Great Expectations, Soda — running in the same pipeline as the transformation, failing the pipeline if the contract breaks.
Freshness SLAs. Each data product has a freshness target; misses produce alerts and incident tickets.
Anomaly detection. Monte Carlo, custom heuristics, or platform-native tooling to catch statistical regressions before downstream consumers do.
Quality dashboards per data product, showing SLA compliance, top failure modes, and active incidents.
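
A freshness target and two basic contract checks can be sketched as follows. The function names and the row layout are assumptions made for illustration; in a real pipeline the same intent would usually be expressed as dbt tests, Great Expectations suites, or Soda checks instead of hand-written Python.

```python
from datetime import datetime, timedelta, timezone


def freshness_breach(last_loaded_at, max_staleness, now=None):
    """Return how far past the freshness target a dataset is, or None if fresh."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_loaded_at
    return lag - max_staleness if lag > max_staleness else None


def failed_checks(rows, key, required):
    """Run two basic contract checks: unique key and non-null required columns."""
    failures = []
    keys = [row[key] for row in rows]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate values in {key}")
    for column in required:
        if any(row.get(column) is None for row in rows):
            failures.append(f"null values in {column}")
    return failures


now = datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)
loaded = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
rows = [
    {"trade_id": "t1", "notional": 100},
    {"trade_id": "t1", "notional": None},
]

print(freshness_breach(loaded, timedelta(hours=6), now=now))
print(failed_checks(rows, key="trade_id", required=["notional"]))
```

A pipeline step would typically fail when `failed_checks` returns anything, and open an incident when `freshness_breach` returns a value.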

Master Data Management (MDM)

Domain-by-domain scoping. Customer, product, account, counterparty: we tackle MDM one domain at a time rather than trying to boil the ocean.
Golden-record design. Match-merge logic, deterministic and probabilistic patterns, with the survivor rules documented and reviewed.
Source-system reconciliation. Reconciliation reports per source, with mismatches routed to data stewards.
Stewardship workflow. Web UI for stewards to review proposed matches, approve merges, and resolve conflicts.
Operational MDM vs analytical MDM. We are explicit about whether the golden record drives operational systems (writes back) or is purely analytical.
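
A deterministic match-merge with one survivor rule can be sketched like this. The match rule (same normalised email), the survivor rule (newest non-null value wins), and the record layout are all assumptions chosen for illustration; probabilistic matching is out of scope for the sketch.

```python
from collections import defaultdict


def match_key(record):
    """Deterministic match rule (assumed): same normalised email = same customer."""
    return record["email"].strip().lower()


def golden_records(records):
    """Merge matched records; the newest non-null value per attribute survives."""
    clusters = defaultdict(list)
    for record in records:
        clusters[match_key(record)].append(record)

    merged = []
    for key, cluster in clusters.items():
        golden = {"email": key, "sources": sorted(r["source"] for r in cluster)}
        # Oldest first, so later (newer) non-null values overwrite earlier ones.
        for record in sorted(cluster, key=lambda r: r["updated_at"]):
            for attribute in ("name", "phone"):
                if record.get(attribute) is not None:
                    golden[attribute] = record[attribute]
        merged.append(golden)
    return merged


records = [
    {"source": "crm", "email": " Ana@Example.com", "name": "Ana Silva",
     "phone": None, "updated_at": "2024-03-01"},
    {"source": "billing", "email": "ana@example.com", "name": "A. Silva",
     "phone": "+351 555 0100", "updated_at": "2024-01-15"},
]

print(golden_records(records))
```

Here the name comes from the newer CRM record while the phone number survives from billing, because the CRM value is null. Writing the survivor rule down this explicitly is what makes it reviewable.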

Privacy & access control

PII detection and classification across the data estate, refreshed on schema changes.
Row-level and column-level security tied to your identity provider, not bolted on per-tool.
Masking and tokenisation for PII in non-production environments.
Purpose-based access controls for regulatory regimes (GDPR purpose-limitation, sectoral consent regimes).
Audit logging of every access event, retained per regulatory requirement.
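
Masking and tokenisation for non-production copies can be sketched with the Python standard library. The column names, the `tok_` prefix, and the 16-character truncation are illustrative assumptions, and the key shown inline would come from a secrets manager in any real setup.

```python
import hashlib
import hmac


def tokenise(value: str, key: bytes) -> str:
    """Deterministic keyed token: equal inputs map to equal tokens, so joins still work."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"


def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


def scrub(row: dict, key: bytes) -> dict:
    """Return a copy of the row intended for a non-production environment."""
    return {
        **row,
        "national_id": tokenise(row["national_id"], key),
        "email": mask_email(row["email"]),
    }


demo_key = b"demo-only-key"  # assumption: real keys are never hard-coded
row = {"customer_id": 42, "national_id": "123-45-6789", "email": "jane.doe@example.com"}
print(scrub(row, demo_key))
```

Using a keyed HMAC instead of a plain hash means tokens cannot be recomputed by someone who only knows the input format, while the same value still tokenises identically across tables.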

What you get at the end of a governance engagement

A working catalog with the first cohort of data products onboarded
End-to-end lineage from source systems through transformations to consumption
Data contracts running as CI checks, blocking breaking changes
Quality tests, freshness SLAs, and an incident-response workflow tied to your existing IR loop
Privacy classifications and access policies tied to your IdP
An audit-evidence capture path that survives the next regulator visit

06 — Analytics & BI

A semantic layer your reports can trust.

Most BI estates accumulate hundreds of reports that agree on the headlines and disagree on the numbers. The fix is rarely "buy a new BI tool" — it is putting a semantic layer between the warehouse and the BI tool so that "revenue", "active customer", and "delinquency" mean one thing.

Semantic layer

dbt as the transformation and semantic backbone for analytical models — tests, documentation, lineage, version control, all in one engineering practice.
Cube, MetricFlow, or LookML as the consumer-facing semantic layer, depending on the BI tool and the contracting model.
SQLMesh for organisations needing first-class virtual environments and time-travel-friendly modelling.
Metrics-as-code. Every metric defined once, versioned in Git, peer-reviewed, with explicit ownership.
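
The idea of defining a metric once and reusing it everywhere can be sketched as below. The `Metric` class, the table and column names, and the generated SQL shape are hypothetical; this is not the configuration format of dbt, Cube, MetricFlow, or LookML, only an illustration of the principle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    owner_team: str
    expression: str               # SQL aggregate over the source table
    source_table: str
    filters: tuple[str, ...] = ()


def render_sql(metric: Metric, group_by: str) -> str:
    """Render one query for a metric so every report shares the same definition."""
    where = f" WHERE {' AND '.join(metric.filters)}" if metric.filters else ""
    return (
        f"SELECT {group_by}, {metric.expression} AS {metric.name} "
        f"FROM {metric.source_table}{where} GROUP BY {group_by}"
    )


ACTIVE_CUSTOMERS = Metric(
    name="active_customers",
    owner_team="customer-analytics",
    expression="COUNT(DISTINCT customer_id)",
    source_table="marts.customer_activity",
    filters=("is_test_account = FALSE",),
)

print(render_sql(ACTIVE_CUSTOMERS, group_by="region"))
```

Because the definition lives in one versioned object, a change to what "active customer" means is a reviewed diff, and every dashboard that renders the metric picks it up.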

BI tooling

Tool	Strengths	Typical fit
Looker	Strong semantic layer (LookML), governance-friendly, embedded analytics	Mid-to-large enterprises wanting governed self-service
Microsoft Power BI	Ubiquitous in Microsoft-anchored estates, strong report-authoring	Microsoft-first enterprises, large business-user populations
Tableau	Strong visual-exploration, mature dashboarding	Analyst-heavy organisations, exploratory work
Apache Superset	Open-source, self-hosted, container-native	Disconnected environments, organisations preferring an open-source BI tier
Metabase / Lightdash / Evidence	Lighter-weight, modern-data-stack-native	Smaller estates, teams already on dbt

Self-service governance

Self-service analytics fails when the boundary between governed and ungoverned queries is unclear. We design that boundary explicitly:

Certified marts. Datasets that have passed quality, lineage, and ownership checks. Tagged in the catalog and the BI tool.
Sandbox tier. Where analysts can explore, model, and prototype — without their work appearing in board-level dashboards by accident.
Promotion path. Documented route for moving a sandbox dataset into the certified tier — review, tests, ownership, SLA.
Deprecation path. Reports nobody reads get archived; orphaned datasets get deleted.
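
The promotion path can be read as a gate with explicit criteria. A minimal sketch, assuming dataset metadata is available as a dictionary with the hypothetical keys shown, might look like this:

```python
def promotion_blockers(dataset: dict) -> list[str]:
    """List what still stops a sandbox dataset from entering the certified tier."""
    blockers = []
    if not dataset.get("owner_team"):
        blockers.append("no owning team")
    if not dataset.get("reviewed_by"):
        blockers.append("no peer review")
    if not dataset.get("tests_passed", False):
        blockers.append("quality tests missing or failing")
    if dataset.get("freshness_sla_hours") is None:
        blockers.append("no freshness SLA")
    return blockers


candidate = {
    "name": "sandbox.churn_features",
    "owner_team": "retention-analytics",
    "reviewed_by": None,
    "tests_passed": True,
    "freshness_sla_hours": None,
}

print(promotion_blockers(candidate))
```

An empty list means the dataset can be tagged as certified; anything else is the to-do list for its owner.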

Reporting that matters

Operational reporting tied to the source-of-record system.
Regulatory reporting with end-to-end lineage from the report back to operational data.
Embedded analytics in customer-facing applications.
Executive scorecards that the board actually uses — with explicit definitions and a single source of truth.

What you get at the end of an analytics & BI engagement

A working semantic layer with metrics defined once and reusable
A primary BI tool rolled out with governed self-service patterns
Certified marts onboarded for the first cohort of business areas
A deprecation backlog of legacy reports replaced by certified equivalents
An analytics governance model that both your data and business stakeholders have signed off on

Start a data engagement

Have a data programme that needs engineering depth?

Send us a short note describing the problem. We’ll come back with a concrete first-two-weeks scope and a definition of done for the engagement.