Handling null values in the dimensional modeling.

Road2Data
3 min readApr 20, 2021

We usually heard the term 'null' when we work with data so what does it mean?

It looks similar to blank or 0 but actually, it is not. Null is a reserved keyword in SQL to represent missing values in the database. Operations on the missing values will always be missing values and 'null' cant be equal to 'null'. e.g SELECT null = null result will be unknown for this query as the database can't decide whether this is true or false.

Result of query 'Select Null = Null'

We have often seen the debates and read a lot of articles in to and fro about whether we should keep the Null values in the data warehouse or not.

So in this article, I am providing you my opinion on this topic. I would separate this topic into 4 parts and discuss it one by one.

  1. Foreign keys in the Fact tables.

If the foreign keys don't match with the surrogate keys in the dimension tables or foreign keys were not know at the time of extraction of the data then in that case what appropriate values should be put as foreign keys in the dimension table.

You should create a dimension entry for unknown values in the dimension table having a surrogate key as any negative number. We choose here negative numbers because for the know dimension entry keys are made of positive numbers.

Unknow value entry in the Dim Table

2. Attributes in the Fact Tables

Apart from foreign keys of the Dim tables, Fact tables used to have other attributes or degenerated dimensions and in my views here we should not replace the 'null' values with 'unknown' or any other strings. Replacing them might be misleading for the analysis.

e.g let's suppose for an order 'x' we receive the null value for its invoice date or null values in the cancellation reason field. Here null values mean that the order is not invoiced yet and it is not canceled by the user at the time when we last extracted the data from the source system.

3. Nulls in Measure values/Facts.

It means that value didn’t exist or we didn't capture the values correctly from the source system. In the above cases, it always better to put the fact values as null.

I have seen the companies where it's always a debate point that would we replace the null values to 0 as null doesn’t make any sense to the business users or keep as it as we are receiving from the source.

In that case, my opinion is at the database level, keep the values as null if they don't exist because for the aggregation like MIN, MAX, AND AVG results can be misleading.

for e.g avg of 2,2,2,0,2 is 1.6 but avg of 2,2,2,null,2 is 2.

Databases, Modern BI visualization tools, and modeling tools can easily handle the ‘Nulls’ during aggregations.

If business users want to see them as 0 instead of null then ifs better to replace the nulls at the presentation layer.

4. Nulls in the attributes of the dimension table.

As we all know that dimensions are made to store and organize the attributed data efficiently. Sometimes it might happen that all the attributes of the Dim have not been captured yet. In that case, putting the 'Null' values may generate the blank fields in the user reports.

As dimensions tables are always exposed to the end-users. So in my opinion replacing the 'Null' with 'NA' or 'Unknow' would make sense.

--

--

Road2Data

A data enthusiast, Having 8 years of experience in Data Warehousing and Business Intelligence field, Currently working as Senior BI developer at Home24 Berlin.