My 2 cents on Foreign Keys and Data Warehousing

As a data warehouse architect/designer, we always have to deal with the question: “should I use foreign keys ? Should I enforce them ?” Well, yes! for the first question, probably not for the second. Here is why.

Relationships between fact tables and dimension table are described by foreign keys. For example fact_order.dwh_customer_id relates to dim_customer.dwh_customer_id (Yes, when I am in charge of modeling, I choose to prefix all technical and surrogate dwh keys with dwh_, etl meta data is prefixed with … etl_ such as for etl_modification_time or etl_crc_value). It’s always a good idea to use the same column name in both tables when applicable: this would make understanding the data model easier for IT and the Business for one, and also reporting tools can also rely on this convention.

Now, should this relationship also be described by a foreign key DDL ? Yes! In a BI platform, “meta data” matters ! And a foreign key is additional meta information. It can be used by the query optimizer of the database to speed up queries, by data modeling tools, and least but not last, by the reporting solution! Microstrategy, for instance would read this meta information to automatically associate fact tables and their dimensions on that key information. It also helps these tools for (inner) join culling (although this can usually be forced anyway).

Foreign keys declaration are very valuable and come for free with (almost) all databases. Should they be enforced1 ? Ask this question to a DBA, he/she will say “Yes”. The database needs to maintain referential integrity not just metadata. Ask this question to an ETL developper, he/she will say “No”. Referential integrity is garanteed by the ETL processes, and foreign key enforcement makes data loading slower (this depends on the data volume to load). I agree, with both. I use them for metadata, and usually turn them off. I don’t recommend turning them off during loading and on afterwards, mostly because of continuous data ingestion (data streaming, restfull APIs…) and intra-day batches. If these are in operation on your infrastructure, you will never know for sure if FK are on or off at a particular point in time.

What I do recommand though is to have them tested regularly (after each integration or daily depending on data SLA) by an additional job/script. I have written a simple Python script that read FK information from the database metadata and then trigger templated (Jinja2) queries to test them (thus quite lite on network IO). I also use this script to test my SCD type 2 attributes consistency (only one active record per surrogate keys, proper end dates…). This is particularly relevant for database with no FK enforcing (such as RedShift or Netezza for instance)

  1. Some databases such as Amazon Redshift cannot enforce them at all: “Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift.”(*