OK, I Got The Data For Analytics — What Are My Options For ETL?
We have all heard it — A journey of a thousand miles begins with a single step.
And if you have anything to do with data, then you have heard this too: data science, analytics, data mining, statistics, analysis, reporting, insights, or whatever else it is called in your business, the journey always starts with data.
OK, I got the data. What next?
I am going to use the CRISP-DM terminology. So when I get the data I start the Data Understanding phase. Generally the first thing I do is open the data file in an application that can display the raw data, so I can understand the data format: are there column headings, what is the delimiter, are there text qualifiers, are there headings you did not expect, and so on.
As part of Data Understanding I perform Data Exploration to understand the data types, missing data, what transformations are required, and so on. If you want a quick way to produce an MS Word report of your exploration of the data file you have received, I have written an article on TowardsDataScience.com on how to do this: Automated Data Profiling Using Python.
Once I understand how the raw data in the file is structured, I can decide how to proceed to the next step, Data Preparation. This is where the decision about ETL comes in.
What are my options for ETL?
Extraction, Transformation and Loading (ETL) is a key step in any data-related project.
Data Extraction
My preference for Extraction is to simply upload the data to a database table where all columns are of data type string. That is, irrespective of the data types in the data file, load the data into a table where all the corresponding columns are string type. This way I can get the data into a table quickly, where I can perform the further steps of Data Validation and Transformation.
Note that if there are text qualifiers in the data, I remove them from the loaded values, as shown in the sketch below.
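As a concrete illustration of this approach, here is a minimal sketch for the customer file used later in this article; the table name is taken from the examples below, while the column lengths, file path, delimiter and header-row settings are assumptions about the incoming file:
-- Staging table: every column is a string, regardless of the source data type
CREATE TABLE [dbo].[Extract_DIM_Customer] (
    [CustomerNumber] NVARCHAR(255) NULL
    ,[CustomerName] NVARCHAR(255) NULL
    ,[DoB] NVARCHAR(255) NULL
    ,[Address] NVARCHAR(255) NULL
    ,[LastShoppedIn] NVARCHAR(255) NULL
)

-- Load the raw file as-is (path, delimiter and header row are assumptions)
BULK INSERT [dbo].[Extract_DIM_Customer]
FROM 'C:\Data\DIM_Customer.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)

-- Strip double-quote text qualifiers after loading
UPDATE [dbo].[Extract_DIM_Customer]
SET [CustomerName] = REPLACE([CustomerName], '"', '')
    ,[Address] = REPLACE([Address], '"', '')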
Data Validation and Transformation
At the very generic level, once the data is in a table in the database, the options for Data Validation and Transformation are:
- Use SQL
- Use a programming language
- Use an ETL application to perform the validation and transformation
The choice depends mainly on the type and size of the data, the experience and skillset of the person performing the work, the ETL standards followed by the company, the available infrastructure, and so on.
I am not going to focus on big data. I want to focus on what the majority of companies need: processing small to medium data files of a few million rows.
For performing data validation and transformation for few millions of rows of data, my choice is:
- SQL for data validation
- Specialist ETL application (like SSIS) or a programming language (like Python) for orchestrating the ETL steps and final data type conversion
The advantage of using SQL is that I can leverage the powerful database engine to perform the heavy workload rather than using an ETL application. Even though the data validation workload is done using SQL, the orchestration of the validation steps is better done using a specialist ETL application or programming language (hereafter I will generally refer to either as the ETL tool). Also, in my experience, capturing transformation errors is easier using a specialist ETL tool.
I am also not covering error logging during the validation and transformation process as it depends on your choice of ETL tool.
Let us start with the below diagram.
Once the data is loaded into the Extraction table, where all columns are of text/string data type, the data validation SQL steps are as below. Please note that this is generic for this example; each data file may need more or fewer steps depending on your business needs.
Data Validation
In this phase, I do the required data validations.
It is very important to note that the original data extracted into the Extraction table is not changed during the Data Validation phase. Only the additional columns created to track errors, DIM lookup columns, etc. are updated. This way the original data from the data file remains available in the Extraction table for troubleshooting.
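For reference, here is a minimal sketch of those additional tracking columns, using the column names that appear in the steps below; the data types and lengths are assumptions:
-- Add the validation/tracking columns alongside the raw string columns
-- (names match the steps below; types and lengths are assumptions)
ALTER TABLE [dbo].[Extract_DIM_Customer] ADD
    [IsValidRow] CHAR(1) NULL
    ,[FAILED_COMPONENT] NVARCHAR(100) NULL
    ,[FAILED_REASON] NVARCHAR(1000) NULL
    ,[RowChecksum] INT NULL
    ,[SK_LastShoppedIn] INT NULL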
Note that the SQL code displayed has line breaks added to improve readability; these do not affect execution in SQL Server.
1] Check the number of data rows in the Extraction table
In this step I count the number of rows present in the Extraction table and check whether it is greater than a pre-agreed number of expected rows. To keep it simple I use a pre-agreed expected count of rows, but depending on your scenario you may track/log the count of rows for the past several days or weeks and compare the number of rows against the average from those logs. Or you may get a summary file with a row count and compare against that value.
This comparison between the expected row count and the row count in the Extraction table is done in the ETL tool, which then decides what should happen based on the result.
SELECT
COUNT(0) AS sqlvar_ExtractionTableRowCount
FROM [dbo].[Extract_DIM_Customer]
Depending on the number of rows generally expected, your ETL tool can decide whether the validation process should proceed or not, and log the error if applicable. For example, if slowly changing dimension tracking is performed and, as in this example, customers not present in the data file are marked as ‘deleted’ customers in the DIM table, then proceeding with the data validation and transformation when the expected number of data rows is not present may produce invalid data in the production tables.
This is why I have taken the approach of using database backend power to perform the validation work load and using an ETL tool for orchestrating the steps.
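If you prefer to keep this guard inside the database as well, a minimal T-SQL sketch could look like the following; the 100,000-row threshold is purely illustrative:
-- Illustrative guard: abort the batch if the Extraction table holds fewer
-- rows than the pre-agreed minimum (the threshold value is an assumption)
DECLARE @ExpectedMinimumRows INT = 100000;

IF (SELECT COUNT(0) FROM [dbo].[Extract_DIM_Customer]) < @ExpectedMinimumRows
    THROW 50001, 'Extraction table row count is below the expected minimum.', 1;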
2] Generate Checksum for the rows in Extraction table
Again, I use the database backend to generate a checksum for each row, based on either selected columns or all columns.
UPDATE [dbo].[Extract_DIM_Customer]
SET RowChecksum =
    CHECKSUM([CustomerNumber], [CustomerName], [DoB], [Address])
This is required when performing slowly changing dimensions checks.
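To show where the checksum fits in, here is an illustrative sketch of a later slowly changing dimension check; the DIM_Customer table, its integer CustomerNumber business key, and its stored RowChecksum column are assumptions for this sketch:
-- Illustrative sketch: list extracted rows whose checksum differs from the
-- checksum stored against the current DIM row (assumes DIM_Customer keeps a
-- RowChecksum computed over the same columns)
SELECT C.[CustomerNumber]
FROM [dbo].[Extract_DIM_Customer] C
INNER JOIN [dbo].[DIM_Customer] D
    ON D.[CustomerNumber] = TRY_CONVERT(INT, C.[CustomerNumber])
WHERE D.[RowChecksum] <> C.[RowChecksum]
Note that CHECKSUM can produce collisions; if that is a concern for your data, HASHBYTES is a more robust (though more expensive) alternative.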
3] Detect duplicate rows
Before starting to update the invalid rows, I like to set the column IsValidRow to a default value so that I know the flags have been updated correctly during the validation process.
UPDATE [dbo].[Extract_DIM_Customer]
SET [IsValidRow] = 'Z'
I use the [CustomerNumber] column to detect duplicate rows and mark the duplicates as invalid. Using the below approach, I can even use a combination of columns to determine the duplicates if required.
;WITH cte_Customer AS (
SELECT
[CustomerNumber]
,ROW_NUMBER()
OVER(
PARTITION BY
[CustomerNumber]
ORDER BY
[CustomerNumber]
) row_num
,[FAILED_COMPONENT]
,[FAILED_REASON]
,IsValidRow
FROM [dbo].[Extract_DIM_Customer]
)
UPDATE cte_Customer
SET
[FAILED_COMPONENT] = 'Data Validation'
,[FAILED_REASON] = '[Duplicate] '
,IsValidRow = 'N'
WHERE row_num > 1
Alternatively, the Checksum can also be used to detect duplicates and mark them as invalid rows.
As you can see, I can also generate useful error information so that when later I want to understand why a row is marked as invalid it will be easy to do so.
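For completeness, here is a sketch of the checksum-based alternative mentioned above; it assumes the RowChecksum column has already been populated in step 2:
;WITH cte_Checksum AS (
    SELECT
        [RowChecksum]
        ,ROW_NUMBER()
            OVER(
                PARTITION BY
                    [RowChecksum]
                ORDER BY
                    [RowChecksum]
            ) row_num
        ,[FAILED_COMPONENT]
        ,[FAILED_REASON]
        ,IsValidRow
    FROM [dbo].[Extract_DIM_Customer]
)
UPDATE cte_Checksum
SET
    [FAILED_COMPONENT] = 'Data Validation'
    ,[FAILED_REASON] = '[Duplicate checksum] '
    ,IsValidRow = 'N'
WHERE row_num > 1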
4] Check for valid data types and data ranges
I check whether the values in the Extraction table columns can be converted, or cast, into the required data types, as all columns in the Extraction table are of type string. I also check whether the column values are within the expected range, if applicable. If a check fails and it is deemed that the record cannot proceed to the transformation stage, then the row is marked as invalid.
In this example, if the customer name or address is not present I want to capture that information, but still allow the row to proceed to transformation. But if the Date of Birth is invalid, or the customer number is of an invalid data type, then I want to mark the row as invalid.
--* Initialise the required columns
UPDATE [dbo].[Extract_DIM_Customer]
SET [FAILED_REASON] = ''
    ,[IsValidRow] = 'Y'

--* Perform the checks
UPDATE [dbo].[Extract_DIM_Customer]
SET [FAILED_REASON] =
    IIF([CustomerNumber] IS NULL OR [CustomerNumber] = '' OR
        TRY_CONVERT(INT, [CustomerNumber]) IS NULL,
        [FAILED_REASON] + ' [CustomerNumber]',
        [FAILED_REASON]
    )
    ,[IsValidRow] =
    IIF([CustomerNumber] IS NULL OR [CustomerNumber] = '' OR
        TRY_CONVERT(INT, [CustomerNumber]) IS NULL,
        'N',
        [IsValidRow]
    )

UPDATE [dbo].[Extract_DIM_Customer]
SET [FAILED_REASON] =
    IIF([CustomerName] IS NULL OR [CustomerName] = '',
        [FAILED_REASON] + ' [CustomerName]',
        [FAILED_REASON]
    )

UPDATE [dbo].[Extract_DIM_Customer]
SET [FAILED_REASON] =
    IIF([Address] IS NULL OR [Address] = '',
        [FAILED_REASON] + ' [Address]',
        [FAILED_REASON]
    )

UPDATE [dbo].[Extract_DIM_Customer]
SET [FAILED_REASON] =
    IIF([DoB] IS NOT NULL AND [DoB] <> '' AND
        TRY_CONVERT(DATE, [DoB], 103) IS NULL,
        [FAILED_REASON] + ' [DoB]',
        [FAILED_REASON]
    )
    ,[IsValidRow] =
    IIF([DoB] IS NOT NULL AND [DoB] <> '' AND
        TRY_CONVERT(DATE, [DoB], 103) IS NULL,
        'N',
        [IsValidRow]
    )
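The range checks mentioned earlier follow the same pattern. As an illustration, here is a sketch that rejects a date of birth in the future; the rule itself is just an example:
-- Illustrative range check: a date of birth in the future marks the row invalid
UPDATE [dbo].[Extract_DIM_Customer]
SET [FAILED_REASON] =
    IIF(TRY_CONVERT(DATE, [DoB], 103) > CONVERT(DATE, GETDATE()),
        [FAILED_REASON] + ' [DoB out of range]',
        [FAILED_REASON]
    )
    ,[IsValidRow] =
    IIF(TRY_CONVERT(DATE, [DoB], 103) > CONVERT(DATE, GETDATE()),
        'N',
        [IsValidRow]
    )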
You can perform the check for every single column or only for selected important columns. Any columns that are not checked for a valid data type during this validation stage will fail during the data conversion step in the transformation stage, and the error rows can be logged into a table.
Then why perform this validation step in SQL if it can be done in the ETL tool at the transformation phase? I find that, depending on your choice of ETL tool, doing this check in SQL is more compact and efficient than having multiple steps in the ETL tool. (On a 1.5 million row table with 60 columns, the checks take about 3 seconds.)
5] DIM Surrogate Keys Lookup
I prefer to handle the Late Arriving Dimensions (LAD) using SQL. This may differ based on your company’s strategy for LAD, but I use the strategy of inserting the LAD into the respective DIM table with the Business Key and some predetermined values for other columns in the DIM table.
For example, the customer data file has a LastShoppedIn column holding the business key of the store where the customer last shopped. Suppose the DIM_Shops table contains the shops D1, D2 and D3, while the Extraction table's LastShoppedIn column contains the values D1, D2, D3 and D4. In that case, D4 is a Late Arriving Dimension key: it appears in the incoming data but is not yet present in DIM_Shops.
The SQL I use to insert the LAD rows into the DIM table is:
INSERT INTO [dbo].[DIM_Shops]([ShopNumber], [ShopName])
SELECT DISTINCT C.[LastShoppedIn], 'N/A'
FROM [dbo].[DIM_Shops] S
RIGHT OUTER JOIN [dbo].[Extract_DIM_Customer] C
ON S.ShopNumber = C.[LastShoppedIn]
WHERE S.ShopNumber IS NULL
AND C.[LastShoppedIn] <> ''
After the Late Arriving Dimension keys are updated in the respective DIM tables, I can perform the DIM surrogate Key lookup and update the DIM key columns in the Extraction table.
UPDATE C
SET
    [SK_LastShoppedIn] = S.[SK_Shops]
FROM [dbo].[Extract_DIM_Customer] C
INNER JOIN [dbo].[DIM_Shops] S
    ON C.[LastShoppedIn] = S.[ShopNumber]
WHERE C.[LastShoppedIn] <> ''
Now that the Extraction table has the information required for analytics, it is ready to proceed to the Data Transformation stage.
Data Transformation
As shown in the diagram, the next stage is Data Transformation.
Though the diagram shows filtering the valid rows from Extraction table and adding derived columns, default values, etc. as two separate steps, depending on the business case these two steps can be combined.
Again, if complex table joins or lookups are involved in deriving the new columns, then SQL may be the better option.
SELECT
-- Assign default values
IIF(
[CustomerNumber] = '',
'999',
[CustomerNumber]
) as [CustomerNumber]
,IIF(
[CustomerName] = '',
'N/A',
LTRIM(RTRIM([CustomerName]))
) as [CustomerName]
,IIF(
[Address] = '',
'N/A',
LTRIM(RTRIM([Address]))
) as [Address]
--Derived column: [DoB] is stored as a string, so convert it before computing
--the age; rows with an empty [DoB] get a NULL age
,IIF(
    [DoB] = '',
    NULL,
    DATEDIFF(YEAR, TRY_CONVERT(DATE, [DoB], 103), GETDATE())
 ) as [CustomerAge]
FROM [dbo].[Extract_DIM_Customer]
WHERE [IsValidRow] = 'Y'
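The diagram implies that the result of this query is persisted into a Transformation table before the final load. Here is a minimal sketch of that step, assuming a staging table named [dbo].[Transform_DIM_Customer] (the name and the SELECT INTO approach are illustrative) and carrying the looked-up surrogate key through as well:
-- Persist the cleaned, defaulted rows into a Transformation table
-- (the table name and SELECT INTO approach are illustrative assumptions)
SELECT
    IIF([CustomerNumber] = '', '999', [CustomerNumber]) as [CustomerNumber]
    ,IIF([CustomerName] = '', 'N/A', LTRIM(RTRIM([CustomerName]))) as [CustomerName]
    ,IIF([Address] = '', 'N/A', LTRIM(RTRIM([Address]))) as [Address]
    ,[SK_LastShoppedIn]
    ,IIF([DoB] = '', NULL,
         DATEDIFF(YEAR, TRY_CONVERT(DATE, [DoB], 103), GETDATE())) as [CustomerAge]
INTO [dbo].[Transform_DIM_Customer]
FROM [dbo].[Extract_DIM_Customer]
WHERE [IsValidRow] = 'Y'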
The next step is to use the ETL tool of choice to perform the data conversion based on the Transformation table. The reason for using the ETL tool for the data type conversion is that it is easier to capture the conversion or type cast errors and log them into an error table.
For the Data Load process from the Transformation table to the Production table, again a similar approach can be adopted, where the additions, deletions and updates are handled using SQL, as sketched below.
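As an illustration of that load step, here is a minimal MERGE sketch covering the additions and updates; the DIM_Customer column list, the Transform_DIM_Customer table from the earlier sketch, and the Type 1 overwrite behaviour are all assumptions, and deletions or Type 2 history tracking would need additional handling:
-- Illustrative load: insert new customers, overwrite changed attributes (Type 1 style)
MERGE [dbo].[DIM_Customer] AS T
USING [dbo].[Transform_DIM_Customer] AS S
    ON T.[CustomerNumber] = S.[CustomerNumber]
WHEN MATCHED THEN
    UPDATE SET
        T.[CustomerName] = S.[CustomerName]
        ,T.[Address] = S.[Address]
        ,T.[SK_LastShoppedIn] = S.[SK_LastShoppedIn]
        ,T.[CustomerAge] = S.[CustomerAge]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([CustomerNumber], [CustomerName], [Address], [SK_LastShoppedIn], [CustomerAge])
    VALUES (S.[CustomerNumber], S.[CustomerName], S.[Address], S.[SK_LastShoppedIn], S.[CustomerAge]);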
Conclusion
When it comes to data analytics, there are many ETL applications and programming languages you can use to perform the ETL. Since the focus of this article was handling a few million rows of data, my approach is as follows:
- Use SQL for data validation and range checks
- Use an ETL application or programming language for orchestrating the steps
- Use the ETL application or programming language used for orchestration to perform the final data type conversions, so errors can be captured and logged easily