Wednesday, 27 August 2014

SSIS 2008 - Merge Join Transformation

Merge Join Transformation:

Just like Merge and Union All, even this transformation is used for merging two SORTED datasets. Feature that makes this transformation standout from Merge and Union All is the ability to merge two datasets based on a JOIN condition. Even Lookup transformation uses the Join feature using which it provides the matching record from the lookup dataset, but lookup provides only first matching record when it has multiple records matching whereas, Merge Join provides all the matching records from the other dataset which we are merging.


Using the below sample datasets, let’s explore the different JOIN’s provided by MERGE JOIN Transformation:

Dataset 1:
Col A
Col B
1
1.a
2
1.b
3
1.c1
3
1.c2
4
1.d
5
1.e

Dataset 2:
Col A
Col B
1
2.a
2
2.b1
2
2.b2
3
2.c
6
2.d
7
2.e

As shown below, configure and SORT the above datasets using ‘Excel Source’ and ‘SORT Transform’, drag ‘MERGE JOIN’ from Data Flow Transformations. First map the Output of SORT_Dataset_1 to Merge Join and select ‘Input’ as ‘Merge Join Left Input’:

Now, map the output of Sort_Dataset_2 to Merge Join:

Open ‘Merge Join Editor’ to configure it:

As shown above, we can see by default ‘Col A’ is selected as ‘Join Key’, if we want to change it click on the arrow joining the JOIN Keys and press DELETE. Now you can set other column as the Join Key. In this example we will keep the join key as ‘Col A’.

1 Inner Join: This is an equi join in which only the matching records from both datasets are provided as Output.
In the ‘Merge Join Transformation Editor’, from ‘Join Type’ drop-down select ‘Inner join’. ‘Swap Inputs’ button will remain disabled. Select all the columns from both datasets and give them the ‘Output Alias’ as shown below:

Output:

In the above output, it can be seen only those records which are present in both datasets are given as output. Also it can be seen, dataset 1 Col_A found two matches for value 3 in dataset 2 and dataset 2 Col_A found two matches for value 2 in dataset 1, due two which we have two output rows with value 2 and other two with value 3. This is the features in which differentiate Merge Join from lookup in which only first matching row is given as output.

2. Left Outer Join: In this all the records present in Left input and only the matching records from right input are provided as output. In case when there's no matching record in Right input, all the selected Right input columns will hold NULL. This transformation does not provide Right Outer join as Left Outer join can be used as Right Outer join just by clicking button "Swap Inputs".
Open the Merge Join Transformation Editor, from ‘Join type’ select ‘Left outer join’. Now we can see ‘Swap Inputs’ button is been Enabled.

Over here, dataset_1 is ‘Left dataset’ and dataset_2 is ‘Right dataset’. Click ‘OK’ and re-execute the task.

Output:

In the output, it can be seen all the records from Left dataset (dataset 1) and only matching records from Right dataset (dataset 2). For the records where no match was found, Right dataset columns hold NULL.

Using the Left-Outer join as a Right-Outer join:
Open ‘Merge Join Transformation Editor’, click on ‘Swap Inputs’, now you can see both the datasets will be swapped, now Dataset ‘Sort_Dataset_2’ will be Left input and ‘Sort_Dataset_1’ will be Right input:

Click ‘OK’ and re-execute the task.

Output:

In the output, all the records from Dataset 2 (Right table) and only matching records from Dataset 1 (Left table) are given in Output.


3. Full Outer join: In this join all the records present in both datasets are provided as output. In case when the Left dataset does not find any matching record in Right dataset based on the join condition, then in that case in output the right dataset columns will hold NULL and the vice-versa.

Open ‘Merge Join Transformation Editor’, from ‘Join type’ select ‘Full Outer join’.

Click ‘OK’. Re-execute the task.

Output:

In the output, all the records from Dataset 1 and Dataset 2 are given in Output, and where ever match is not found, for that NULL is given in the corresponding Dataset columns.

Like Merge even for this transformation:
Sorted datasets are mandatory.
Join condition columns must matching data types.
Requires two inputs and provides only one output.
Does not support error handling.
it’s a partially blocking Asynchronous transformation.

This transformation does not support cross join, but it’s possible to create the same just by adding a Dummy column in Derived Column Transformation in both the flows with value TRUE or 1. Place this transformation before Sort transformation in both the flows, sort the datasets on this column and use it as join condition as well in the merge join. Output of this will be a Cross join.

For this, in the above created DFT, add ‘Derived Column Transformation’ before SORT in the both the flows:

SORT both datasets on this newly added ‘Dummy_Column’:

Output:

Tuesday, 26 August 2014

SSIS 2008 - Union All Transformation

Union All Transformation:

Like Merge, Union All transformation is also used to merge datasets. But unlike Merge, this transformation can merge more than two datasets and Sorted data is not a mandatory requirement for this one. First Input that is mapped to this transformation becomes the output of this transformation. To map the columns they must have same datatype. This transformation does not support Error handling, neither does it eliminate duplicates.

Using this transformation, we can join ‘Heterogeneous’ sources as well, for e.g. we can join database table, flat file and a excel file in a single Union All transformation. The only mandate for this is, there data types should match.

 In the below example, we will merge three excel datasets:
Dataset 1:
Id
Name
Location
1
Test User 1
Loc1
7
Test user 7
Loc2
3
Test User 3
Loc3

Dataset 2:
Id
Name
2
Test User 2
6
Test User 6
8
Test User 8
4
Test user 4
10
Test User 10

Dataset 3:
Id
Name
Location
5
Test User 5
Loc4
9
Test user 9
Loc5

As shown below, use ‘Excel source’ to configure above:

Now drag, ‘Union All’ transform from Data Flow Transformation, first map the output of ‘Dataset 1’ to Union All, followed by ‘Dataset 2’ and ‘Dataset 3’.

As we mapped ‘Dataset 1’ as the first input to Union All transformation, we can see all the columns of ‘Dataset 1’ are been shown as Output Columns of Union All transformation.

In the editor, we can change the ‘Output Column name’ as well if needed. As ‘Dataset 2’ has only 2 columns, hence in the editor for output column Location under ‘Union All Input 2’ we can see ‘<ignore>’. 

Below shown is the output:

In the data viewer, we can see under column ‘Location’ 5 rows have NULL, all these columns are from ‘Dataset 2’, as in the editor it was been mapped as ‘<ignore>’, under ‘Union All input 2’. Also unlike MERGE transformation, the output data is not sorted.

Sunday, 24 August 2014

SSIS 2008 - Merge Transformation

MERGE Transformation:

This SSIS transformation is used to merge two SORTED datasets. Remember Merge transformation cannot be used on Un-Sorted datasets. To use Merge transformation:

       - Same numbers of columns should be present in both sources.
       - Data-types should match for mapping columns.
       - Data should be sorted.
       - It does not eliminate duplicates.

In the below example, we will merge two excel datasets:

Dataset 1:
Id
Name
1
Test User 1
7
Test user 7
3
Test User 3
5
Test User 5
9
Test user 9

Dataset 2:
Id
Name
2
Test User 2
6
Test User 6
8
Test User 8
4
Test user 4
10
Test User 10

As shown below, use ‘Excel source’ to configure above datasets and sort these datasets using ‘SORT transform’ using ‘ID’ column as the ‘Sort Key’ (Don’t know how to configure SORT transformation? Refer post: SSIS 2008 – SORT Transformation):

Now drag, ‘MERGE’ transform from Data Flow Transformation, map the output of Sort to Merge Transform input ‘Merge Input 1’

Similarly, map the output of other Sort transform to Merge Transform this will be mapped to ‘Merge Input 2’

In Merge Transform Editor, map the ‘Merge Input 1’ and ‘Merge Input 2’ as shown below:

Add a ‘Derived Column’ transform and configure the Data viewer to view the data:

Execute the data flow:

In the data viewer, we can see both datasets are merged in sorted order.

This transformation is almost similar to ‘UNION ALL’ transform with few differences:


MERGE Transformation
UNION ALL Transformation
Sorted data mandatory
Sorted data not a mandatory requirement
Merges only two datasets
Can merge more than two datasets.

Saturday, 23 August 2014

SSIS 2008 - Cache Transformation

Cache Transformation:


A new feature added in SSIS 2008. This transformation creates a Cache file which can be used in Lookup Transformation whereby reducing the time & memory overload to create a Cache file during execution.

If wisely used this transformation can be used to perform lookup on a flat file data without storing it in a database table.

To prove this statement, let’s take an Employee – Department example, in which we will generate a flat file which will hold employee id, employee name and its department id and name. In this, employee details are stored in database table, whereas Department details are stored in a flat file.

Employee data present in Database table:

Department data present in flat file:


First we will create a Cache file for Department flat file, which we will be used in the lookup:
1. Configure the flat file source to read the Department flat file:
As shown in the below screen shot, drag ‘Flat file source transform’ and map the department flat file to it:

2. Configure the Cache Transformation:
Drag the “Cache Transform” from the Data Flow Transformations. Map Flat file Source to ‘Cache Transform’. Open cache transform Editor to configure it.

Click on “New” to create “Cache Connection Manager” which will be used to create the Cache file.

In ‘General’ tab, type in ‘Connection Manager Name’, check ‘Use file cache’ checkbox and click on ‘Browse…’ select the location where you want to store the Cache file. In this case, we have given the Cache file name ‘lkp_Dept.caw’. “caw” is the Cache file extension.
Under ‘Columns’ tab, set ‘Index Position’ as 1 for column ‘Dept_id’. Index Position is sequential positive number. This number indicates the order in which lookup transformation compares rows in the reference dataset to rows in the input data source. The column with the most unique values should have lowest index position (Reference: http://msdn.microsoft.com/en-IN/library/bb895364.aspx)

After configuring the Cache Connection Manager, go back to Cache Transform Editor and click on ‘Mappings’ tab, as shown in the below screen shot map the source columns to its corresponding Cache file column.

3. Load Cache file:
Now we will execute the above created flow to load the cache file with the records present in the Department flat file.

Now if you check the lookup file path you will see Cache file created.

Now that we have created the cache file, we will create a Data flow to create the target file which will load the employee and its department information, department info will be looked up from above created Cache file ‘lkp_Dept.caw’.
As shown below, configure the ‘OLE DB Source’ to read the Employee data from the database table. This will be followed by ‘Lookup Transform’ which will be configured to read data from above created Cache file. Open the lookup transformation editor, in ‘General’ tab select ‘Connection Type’ as ‘Cache Connection Manager’

Under ‘Connection’ tab, select the Cache Connection manager, which was created in earlier DFT to create the cache file:

In ‘Columns’ tab, map Input Dept_Id column with lookup Dept_id:

Click 'OK'.
Now we will drag ‘Flat File Destination’, and map ‘Lookup Match Output’ to it. Create a new flat file connection, which we load the target file.

In mappings tab, map the source column to its corresponding target file column:

Click 'OK'.

Now we will execute the DFT to load the target file.

Data in Target file:


There are few things that about Cache Transformation to be remembered:

1. The Cache Transform writes only unique rows to the Cache connection manager:
Let’s add a duplicate record in the Department flat file and re-run the cache file creation data flow.

Now to view the cache file, we will create a new data flow using ‘Raw File Source’. As shown in the below screen shot, drag ‘Raw file Source’, under ‘connection manager’ select ‘Access mode’ as ‘File name’, click ‘Browse…’ select the above created cache file.

Now drag a Derived Column transformation, map Raw file output to it. Configure data viewer to view the Cache file:

Execute the data flow:

From the above screenshot you can see, department file has 4 rows, but the cache file holds only 3 rows, as the fourth row is duplicate, based on Dept_Id which is been assigned Index_Position 1, in cache transformation.


2. In a single package, only one Cache Transform can write data to the same Cache connection manager. If the package contains multiple Cache Transforms, the first Cache Transform that is called when the package runs, writes the data to the connection manager. The write operations of subsequent Cache Transforms fail.
To test this, we will create a copy of the data flow created earlier to load the cache file. Now while executing the entire package

we can see the last DFT which a copy of first one gets error out. Leading to package failure.

Error:

To avoid this failure, we need to create a new Cache Connection for this DFT.