Entries from August 2008 ↓

ETL: Null Surrogate Keys

I buy into most of what Kimball has to say on Datawarehousing and ETL. For instance, this site fact tables should have no columns with NULLs.

That means creating a row in every dimension that represents NULL or “Unknown”. The problem is, discount rx you can’t leave the natural_key for your “unknown” row as NULL, buy because in SQL NULL != NULL. This means that your dimension lookup while building your fact tables will never use the null row.

It seems common practice is to create a “magic” natural key value that you know refers to nothing. For a string, perhaps “??” or for a date we could use “Jan 1 4512 BC”. Then we could map NULL to “??” before the dimension lookup.

This seems like a really bad idea. I propose an alternative:

We can use a meta_data table with columns for “table_name” and “null_id”. When building each dimension and generating a null row, we insert a row into the meta-data table to indicate what the null surrogate key id is in that dimension.

When building the fact tables, we ensure that the schema allows NULL values for the dimension surrogate keys. Afterwards we process the meta_data table and generate update statements for each fact table. Something like “UPDATE blah_facts SET some_dimension_id = <TheNullSurrogateFromMetaDataTable> WHERE some_dimension_id IS NULL”.

And then for good measure we can alter the table to prevent NULLs in the dimension keys for all the fact tables.

No need for magic numbers.