Tuesday, June 14, 2011

Slowly Changing dimension (type - 2) on Pentaho Kettle

Recently a'm in a need to use SCD Type -2 on the datamart design, and kettle spoon as the tool...

SCD management methodologies referred to as Type 0 through 6. Type 6 SCDs are also sometimes called Hybrid SCDs.

The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.
Courtesy : http://en.wikipedia.org/wiki/Slowly_changing_dimension

For example, take an employee table, if the employee moves to other company, the table could look like this, with incremented version numbers to indicate the sequence of changes:

emp_key | name  | emp_code |           company           | version
---------+-------+----------+-----------------------------+---------
       1 | Ashik | emp01    | WhiteHouseBusinessSolutions |       0
       2 | Ashik | emp01    | NetEd Learning Solutions    |       1
       3 | Ashik | emp01    | MindTree Ltd                |       2
       4 | Ashik | emp01    | CollabNet Inc               |       3

In Pentaho, Lets see how.,
I created two tables Employee and the Employee_detail.,

select * from employee;
 emp_key | emp_id
---------+--------
       1 | emp01

select* from employee_detail;
 emp_detail_key | name | code | company | version | date_from | date_to
----------------+------+------+---------+---------+-----------+---------
(0 rows)

Pentaho Will read the emp_id from the eployee table by a TableInput step, and I add the remaining fields as constants for the sake of this sample., and add Dimension Lookup/Update step from kettle to perform SCD -Type2




I just read the emp_id from employee using TableInput step.




and am trying to change the employee company name for every run.,


the Dimension Lookup/Update step used to do the SCD here, this componenet requires the date range fields and the version field on the SCD table. and I had specified emp_id as the lookup field.

and it changes the version accordingly., i.e whenever a change in the company pentaho creates a new version and the older ones remain as versioned.



and I had given the "type of update" as update for name and insert for company., thus name will get update if there any change in same and the company get versioned upon change.


So now the setup is ready to perform SCD., lets run the transformation.

After the first run...

select* from employee_detail;
 emp_detail_key | name  | code  |           company           | version | date_from  |  date_to
----------------+-------+-------+-----------------------------+---------+------------+------------
              0 |       |       |                             |       1 |            |
              1 | Ashik | emp01 | WhiteHouseBusinessSolutions |       1 | 2011-06-17 | 2199-12-31

Again am changing the company name..


now lets see the table..,

select* from employee_detail;
 emp_detail_key | name  | code  |           company           | version | date_from  |  date_to
----------------+-------+-------+-----------------------------+---------+------------+------------
              0 |       |       |                             |       1 |            |
              2 | Ashik | emp01 | NetEd Learning Solutions    |       2 | 2011-06-17 | 2199-12-31
              1 | Ashik | emp01 | WhiteHouseBusinessSolutions |       1 | 2011-06-17 | 2011-06-17

You can see the version update.. on change of company name.. after several runs..

select* from employee_detail;
 emp_detail_key | name  | code  |           company           | version | date_from  |  date_to
----------------+-------+-------+-----------------------------+---------+------------+------------
              0 |       |       |                             |       1 |            |
              1 | Ashik | emp01 | WhiteHouseBusinessSolutions |       1 | 2011-06-17 | 2011-06-17
              3 | Ashik | emp01 | MindTree Ltd                |       3 | 2011-06-17 | 2199-12-31
              2 | Ashik | emp01 | NetEd Learning Solutions    |       2 | 2011-06-17 | 2011-06-17

select* from employee_detail;
 emp_detail_key | name  | code  |           company           | version | date_from  |  date_to
----------------+-------+-------+-----------------------------+---------+------------+------------
              0 |       |       |                             |       1 |            |
              1 | Ashik | emp01 | WhiteHouseBusinessSolutions |       1 | 2011-06-17 | 2011-06-17
              2 | Ashik | emp01 | NetEd Learning Solutions    |       2 | 2011-06-17 | 2011-06-17
              4 | Ashik | emp01 | CollabNet Inc               |       4 | 2011-06-17 | 2199-12-31
              3 | Ashik | emp01 | MindTree Ltd                |       3 | 2011-06-17 | 2011-06-17

Thats it.., to get the latest version or the current record.,

select * from employee_detail where name = 'Ashik' and now() between date_from and date_to;
 emp_detail_key | name  | code  |    company    | version | date_from  |  date_to
----------------+-------+-------+---------------+---------+------------+------------
              4 | Ashik | emp01 | CollabNet Inc |       4 | 2011-06-17 | 2199-12-31