ETL: Null Surrogate Keys

I buy into most of what Kimball has to say on Datawarehousing and ETL. For instance, fact tables should have no columns with NULLs.

That means creating a row in every dimension that represents NULL or “Unknown”. The problem is, you can’t leave the natural_key for your “unknown” row as NULL, because in SQL NULL != NULL. This means that your dimension lookup while building your fact tables will never use the null row.

It seems common practice is to create a “magic” natural key value that you know refers to nothing. For a string, perhaps “??” or for a date we could use “Jan 1 4512 BC”. Then we could map NULL to “??” before the dimension lookup.

This seems like a really bad idea. I propose an alternative:

We can use a meta_data table with columns for “table_name” and “null_id”. When building each dimension and generating a null row, we insert a row into the meta-data table to indicate what the null surrogate key id is in that dimension.

When building the fact tables, we ensure that the schema allows NULL values for the dimension surrogate keys. Afterwards we process the meta_data table and generate update statements for each fact table. Something like “UPDATE blah_facts SET some_dimension_id = <TheNullSurrogateFromMetaDataTable> WHERE some_dimension_id IS NULL”.

And then for good measure we can alter the table to prevent NULLs in the dimension keys for all the fact tables.

No need for magic numbers.

6 comments ↓

#1 Jarrod on 09.09.08 at 6:50 am

For DATE and TIMESTAMP sql types I use “the beginning of time” 0001-01-01 and 0001-01-01 12:00:00.000 UTC as my “null” values. For strings I just put the word “NULL” usually with some kind of “salt” to make it unique. Like “%%NULL%%” just in case “NULL” might be a legitimate input value to the field. For numbers I pick something that is outside the range of valid inputs, usually something like INT.MIN_VALUE.
This is way better than yet another table you have to join against. If you make the DEFAULT values self documenting as possible they are no longer “magic numbers”.

#2 Luke Galea on 09.09.08 at 6:57 am

Hmm. I see that those are self documenting, but only to a human.

In our case, we have a data mining tool that is querying against the data, and one way or another we’d need to document what the null value is. It could be only in the data mining tool, but I felt that having it be part of the data itself would be preferable.

A hybrid would be great though. Use self documenting values as you say, but still have a meta-data table that documents what those values are for each table. You can join on the table if you like, or not.. depending on your needs.

#3 George Brennan on 09.12.08 at 11:56 am

Use an integer surrogate key with a standard negative value for the dimension key e.g.(-1, unknown) , the the actual values per type ( ‘unknown’, ‘01/01/00′) don’t matter so much. Also, don’t update the fact tables post process, build the default surrogate key into the processing, then you don’t have to update 10 million facts ;-)

#4 Антон Павлович on 03.23.10 at 11:32 pm

Да, логически правильно…

……

#5 Kylie Batt on 04.15.10 at 9:35 pm

Супер клас!!!…

Бібліотека Російської класики That means creating a row in every dimension that represents NULL or “Unknown”…..

#6 Kylie Batt on 05.03.10 at 7:26 pm

Это просто великолепная фраза…

Козлов В.И., Юфіт Г.А. Проектування НВЧ пристроїв за допомогою ЕОМ ……

You must log in to post a comment.