
2026-06-02

Why Insurance Data Is Structurally Messier Than Banking Data — and How to Handle It

People who move from banking to insurance analytics expect a steeper learning curve on the business side. What actually breaks them is the data. Banking data is mostly transactional — a debit, a credit, a timestamp, a balance. Even derivatives and FX, with all their complexity, eventually resolve into rows you can reconcile against a ledger. Insurance is a different species. The data carries embedded uncertainty, multi-period state, and product structures that refuse to fit into the clean star schemas that work fine for a retail bank.

After building pipelines for life, pension, and health products across Turkish insurers and reinsurers, I have stopped treating insurance data as "a harder version of banking data." It is structurally different, and most off-the-shelf data quality frameworks quietly fail on it.

What Banking Data Actually Looks Like

Strip a core banking system down and you find a small number of well-behaved patterns:

Reconciliation is the discipline. If your numbers do not match the GL by end of day, you have a bug. The truth is knowable and the truth is now.

What Insurance Data Actually Looks Like

Insurance data violates almost every one of those assumptions.

1. The core fact is a probability, not a transaction. A policy is a contract over a future contingent event. The "value" of that policy at any point in time is the expected present value of cash flows under assumptions that themselves change quarterly — mortality tables, lapse rates, discount curves, morbidity assumptions. There is no GL row that tells you what a 10-year level-term policy is worth today. There is a model output, and it is wrong by construction. You are choosing how it is wrong.

2. State is multi-dimensional and time-layered. A pension contract in Turkey can be active, paid-up, partially surrendered, on premium holiday, in transfer to another company, or in the state-subsidy clawback window — sometimes several of these at once across different fund allocations. A health policy can be in-force at the policy level while a specific insured under it is suspended. Each state has its own valid transitions, effective dates, and accounting consequences.

3. Retroactive changes are normal, not exceptions. A claim reported in March may relate to an incident in November of the prior year. A medical underwriting decision can void coverage backdated to inception. Premium reconciliations from agents and bancassurance partners arrive 30–90 days late and rewrite history. In banking, a backdated entry triggers an audit. In insurance, it is Tuesday.

4. Product structures are nested and conditional. A unit-linked pension contract is a policy that contains contributions, allocated across funds, each with its own unit prices, fees stripped at different frequencies, with state subsidies calculated on a parallel schedule, and a tax treatment that depends on exit reason and tenure. There is no single grain. Pick a grain and you lose information; carry every grain and your warehouse becomes unusable.

5. The same entity is measured under multiple regimes simultaneously. One policy produces numbers under local statutory accounting, IFRS 17, Solvency II (or local equivalent), tax, and management reporting — and they will not agree, nor should they. "Reconciliation" in the banking sense is not the goal; controlled divergence is.
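To make the first point concrete, here is a minimal Python sketch of an expected present value for the death benefit of a level-term policy. Everything in it is an assumption for illustration: the function name, the made-up mortality rates, the flat discount rate, and the decision to ignore premiums, lapses, and expenses. The only point is that the same contract gets a different value as soon as an assumption moves.

```python
def term_benefit_epv(sum_assured, annual_mortality, discount_rate):
    """Expected present value of a death benefit paid at the end of the year of death.

    annual_mortality holds one hypothetical death probability per policy year.
    """
    survival = 1.0  # probability the insured is alive at the start of the year
    epv = 0.0
    for year, q in enumerate(annual_mortality, start=1):
        discount = (1.0 + discount_rate) ** -year
        epv += survival * q * sum_assured * discount
        survival *= 1.0 - q
    return epv


# Illustrative 10-year mortality curve, not taken from any real table.
base_mortality = [0.0010 + 0.0002 * t for t in range(10)]
stressed_mortality = [q * 1.10 for q in base_mortality]

value_base = term_benefit_epv(100_000, base_mortality, 0.03)
value_stressed = term_benefit_epv(100_000, stressed_mortality, 0.03)
value_lower_rate = term_benefit_epv(100_000, base_mortality, 0.02)

print(round(value_base, 2), round(value_stressed, 2), round(value_lower_rate, 2))
```

None of the three numbers is the true value of the policy. Each is the value under one assumption set, which is why the assumption set has to travel with the number.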

Why Standard Data Quality Frameworks Fail Here

Most DQ frameworks — Great Expectations, dbt tests, the usual rule libraries — assume that a record has a correct value and your job is to detect deviation. They work well for completeness, uniqueness, referential integrity. They are largely useless for the failure modes that actually matter in insurance:

Running naive null-checks and range-checks on insurance data produces thousands of false positives that teams learn to ignore — which means the real problems also get ignored.

What the Architecture Actually Needs

This is what I have converged on after enough rebuilds to know what survives contact with an actuary.

1. Bi-temporal everything

Every fact table needs two time dimensions: effective date (when the business event is true) and knowledge date (when the system learned about it). Not as a nice-to-have. As the grain. A claim payment booked today for an accident last October is a different row from a claim payment booked today for an accident today, and your reserving model needs to distinguish them. Banking can mostly get away with one timestamp. Insurance cannot.
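A minimal sketch of the claim example above, assuming a hypothetical ClaimPayment record and a helper named known_as_of (neither comes from a real system). It asks the same question about Q4 claims twice, once with the knowledge available at year-end and once with the knowledge available today.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClaimPayment:
    claim_id: str
    effective_date: date  # when the insured event occurred
    knowledge_date: date  # when the system booked or learned of it
    amount: float


def known_as_of(rows, knowledge_cutoff, effective_from, effective_to):
    """Rows for events in [effective_from, effective_to] known by knowledge_cutoff."""
    return [
        r for r in rows
        if r.knowledge_date <= knowledge_cutoff
        and effective_from <= r.effective_date <= effective_to
    ]


# Illustrative data: both payments are booked on the same day.
payments = [
    ClaimPayment("C-1", date(2025, 10, 14), date(2026, 6, 2), 12_000.0),
    ClaimPayment("C-2", date(2026, 6, 2), date(2026, 6, 2), 3_000.0),
]

q4_start, q4_end = date(2025, 10, 1), date(2025, 12, 31)
q4_at_year_end = sum(
    r.amount for r in known_as_of(payments, date(2025, 12, 31), q4_start, q4_end)
)
q4_as_known_today = sum(
    r.amount for r in known_as_of(payments, date(2026, 6, 2), q4_start, q4_end)
)
```

With a single timestamp, the two answers collapse into one and the year-end view can no longer be reproduced.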

2. Snapshot, do not just stream

CDC and event streams are necessary but insufficient. You need periodic full snapshots of policy state (daily for active books, monthly at a minimum) because the cumulative effect of retroactive changes is not recoverable from event logs alone once enough corrections pile up. Storage is cheap. Re-deriving last year's IFRS 17 CSM from a broken event stream is not.
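One way to use snapshots is as checkpoints against the event stream. The sketch below is an assumption about how such a check could look, not a description of any particular platform: the event shape, the function names, and the tolerance are all hypothetical. It replays contribution events up to each snapshot date and reports where the replayed balance and the stored snapshot disagree.

```python
from datetime import date


def replay_balance(events, as_of):
    """Sum of event amounts booked on or before as_of."""
    return sum(e["amount"] for e in events if e["booked_on"] <= as_of)


def snapshot_drift(snapshots, events, tolerance=0.01):
    """Map snapshot_date -> (replayed - snapshot) where the gap exceeds tolerance."""
    drift = {}
    for snap_date, balance in sorted(snapshots.items()):
        diff = round(replay_balance(events, snap_date) - balance, 2)
        if abs(diff) > tolerance:
            drift[snap_date] = diff
    return drift


# Illustrative case: a 100.00 reversal reached the admin system (and so the
# February snapshot) but never made it into the event stream.
events = [
    {"booked_on": date(2026, 1, 15), "amount": 500.0},
    {"booked_on": date(2026, 2, 15), "amount": 500.0},
]
snapshots = {date(2026, 1, 31): 500.0, date(2026, 2, 28): 900.0}

drift = snapshot_drift(snapshots, events)
```

Without the February snapshot there would be nothing to compare against, and the missing correction would surface only when a downstream number failed to tie out.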

3. Separate the policy administration truth from the actuarial truth

The policy admin system knows what the contract says. The actuarial system knows what the contract is worth under a set of assumptions. These are different facts with different lifecycles. Conflating them in one warehouse layer is the most common architectural mistake I see. Build two layers explicitly, and a reconciliation layer between them that tracks assumption versions as first-class data.

4. Make assumption sets versioned data, not config

Mortality tables, lapse curves, discount rates, expense assumptions — these are not configuration. They are inputs that drive material financial outputs and they change on a schedule. Version them in the warehouse with effective dates, owners, and approval records. When someone asks why the embedded value moved, you need to answer in SQL, not in a meeting.
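As a sketch of what answering in SQL can mean, the example below uses Python's built-in sqlite3 module with an in-memory database. The table name, columns, assumption names, and values are all hypothetical, and a real warehouse would carry full tables and curves rather than a single number per assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE assumption_version (
        assumption TEXT, version INTEGER, effective_from TEXT,
        owner TEXT, approved_by TEXT, value REAL,
        PRIMARY KEY (assumption, version)
    )
""")
conn.executemany(
    "INSERT INTO assumption_version VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("lapse_rate_year1", 1, "2025-01-01", "actuarial", "chief_actuary", 0.080),
        ("lapse_rate_year1", 2, "2026-01-01", "actuarial", "chief_actuary", 0.095),
        ("discount_rate", 1, "2025-01-01", "finance", "cfo", 0.030),
    ],
)


def in_force(conn, valuation_date):
    """Return {assumption: (version, value)} in force on an ISO-formatted date."""
    rows = conn.execute(
        """
        SELECT a.assumption, a.version, a.value
        FROM assumption_version a
        WHERE a.effective_from = (
            SELECT MAX(b.effective_from)
            FROM assumption_version b
            WHERE b.assumption = a.assumption AND b.effective_from <= ?
        )
        """,
        (valuation_date,),
    ).fetchall()
    return {name: (version, value) for name, version, value in rows}


def changed_between(conn, start, end):
    """Assumptions whose in-force version differs between two valuation dates."""
    before, after = in_force(conn, start), in_force(conn, end)
    return {k: (before.get(k), after[k]) for k in after if before.get(k) != after[k]}


changes = changed_between(conn, "2025-12-31", "2026-03-31")
```

ISO date strings sort correctly as text, which is why the comparison works in SQLite without a date type. The owner and approved_by columns are there so the follow-up question (who signed off on the change) is also a query.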

5. Build DQ rules that understand product state

A premium DQ rule that does not know about premium mode, grace period, premium holiday, and paid-up status will generate noise. Encode the state machine of each product line in the DQ layer itself. Yes, this is more work. It is also the difference between a DQ system actuaries trust and one they route around.
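A minimal sketch of the difference, assuming a hypothetical policy record in which due_date has already been derived from the premium mode. The status names, field names, and grace periods are illustrative and not taken from any specific product.

```python
from datetime import date, timedelta

# States in which no premium is expected (hypothetical labels).
NO_PREMIUM_EXPECTED = {"paid_up", "premium_holiday"}


def naive_rule(policy, as_of):
    """Flag every policy with a due date in the past and no premium received."""
    return policy["premium_received"] is None and policy["due_date"] <= as_of


def state_aware_rule(policy, as_of):
    """Flag a missing premium only when the product state says one is overdue."""
    if policy["premium_received"] is not None:
        return False
    if policy["status"] in NO_PREMIUM_EXPECTED:
        return False
    grace_end = policy["due_date"] + timedelta(days=policy["grace_days"])
    return as_of > grace_end


policies = [
    {"id": "P1", "status": "active", "due_date": date(2026, 4, 1),
     "grace_days": 30, "premium_received": None},
    {"id": "P2", "status": "premium_holiday", "due_date": date(2026, 5, 1),
     "grace_days": 30, "premium_received": None},
    {"id": "P3", "status": "active", "due_date": date(2026, 5, 20),
     "grace_days": 30, "premium_received": None},
    {"id": "P4", "status": "paid_up", "due_date": date(2026, 5, 1),
     "grace_days": 30, "premium_received": None},
]

as_of = date(2026, 6, 2)
naive_flags = [p["id"] for p in policies if naive_rule(p, as_of)]
aware_flags = [p["id"] for p in policies if state_aware_rule(p, as_of)]
```

On this toy book the naive rule raises four alerts and the state-aware rule raises one. Only P1 is a real problem, and it is much harder to miss when it is not sitting next to three false positives.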

6. Treat reinsurance as a parallel ledger

Ceded data is not a column on the gross record. It is its own contractual structure with its own retroactive corrections, its own settlement cycles, and its own reporting regime. Model it as a parallel fact stream that joins to gross at well-defined reconciliation points.
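A rough sketch of a reconciliation point, under a deliberately simple assumption: a quota-share treaty where the expected ceded amount is a fixed share of gross, reconciled per treaty and period. The treaty identifier, share, and field names are hypothetical, and real treaties (surplus, excess of loss, sliding scales) need more than one multiplication.

```python
from collections import defaultdict


def reconcile(gross_rows, ceded_rows, cession_share, tolerance=0.01):
    """Map (treaty_id, period) -> (booked ceded - expected ceded) beyond tolerance."""
    gross = defaultdict(float)
    ceded = defaultdict(float)
    for r in gross_rows:
        gross[(r["treaty_id"], r["period"])] += r["amount"]
    for r in ceded_rows:
        ceded[(r["treaty_id"], r["period"])] += r["amount"]

    breaks = {}
    for key in sorted(set(gross) | set(ceded)):
        treaty_id = key[0]
        expected = gross.get(key, 0.0) * cession_share[treaty_id]
        diff = round(ceded.get(key, 0.0) - expected, 2)
        if abs(diff) > tolerance:
            breaks[key] = diff
    return breaks


# Gross and ceded live in separate streams and meet only at (treaty, period).
gross_claims = [
    {"treaty_id": "QS-1", "period": "2026-Q1", "amount": 1000.0},
    {"treaty_id": "QS-1", "period": "2026-Q1", "amount": 500.0},
    {"treaty_id": "QS-1", "period": "2026-Q2", "amount": 2000.0},
]
ceded_recoveries = [
    {"treaty_id": "QS-1", "period": "2026-Q1", "amount": 600.0},
    {"treaty_id": "QS-1", "period": "2026-Q2", "amount": 500.0},
]

breaks = reconcile(gross_claims, ceded_recoveries, {"QS-1": 0.40})
```

The Q2 break is not necessarily an error. It may simply be the settlement cycle lagging the gross booking, which is the reason to keep the two streams separate and explain the difference at the reconciliation point instead of forcing it to zero on the gross record.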

The Underlying Point

Banking data engineering rewards precision against a known truth. Insurance data engineering rewards discipline in the face of an unknowable one. The systems that work are the ones that stop pretending insurance is just banking with longer time horizons, and start treating uncertainty, retroactivity, and assumption-dependence as the core grain — not as exceptions to be cleaned away.

If your insurance warehouse looks like a banking warehouse with more columns, it is already broken. You just have not closed the quarter yet.