A Simple Example Using Pentaho Data Integration (aka Kettle)
Let me introduce you an old ETL companion: its acronym is PDI, but it’s better known as Kettle and it’s part of the Hitachi Pentaho BI suite.
With Kettle is possible to implement and execute complex ETL operations, building graphically the process, using an included tool called Spoon.
I implemented a lot of things with it, across several years (if I’m not wrong, it was introduced in 2007) and always performed well.
So let me show a small example, just to see it in action.
The simplest way is to download and extract the zip file, from here. For those who want to dare, it’s possible to install it using Maven too.
The only precondition is to have Java installed and, for Linux users, install libwebkitgtk package.
Just launch the spoon.sh/bat and the GUI should appear
In this example, we’re gonna:
- retrieve a folder path string from a table on a database
- check if there are files inside it
- if no, exit otherwise move them to another folder (with the path taken from a properties file)
- check total file sizes and if greater then 100MB, send an email alert, otherwise exit
It’s not a particularly complex example but is barely scratching the surface of what is possible to do with this tool.
Let’s begin, loading the main job!
A Kettle job contains the high level and orchestrating logic of the ETL application, the dependencies and shared resources, using specific entries.
Each entry is connected using a hop, that specifies the order and the condition (can be “unconditional”, “follow when false” and “follow when true” logic).
A job can contain other jobs and/or transformations, that are data flow pipelines organized in steps. This job contains two transformations (we’ll see them in a moment)
Let’s see the entries with some details.
Here we retrieve a variable value (the destination folder) from a file property.
Next, we enter the first transformation, used to retrieve the input folder from a DB and set as a variable to be used in the other part of the process.
The third step will be to check if the target folder is empty.
Then we can continue the process if files are found, moving them…
…checking the size and eventually sending an email or exiting otherwise.
When everything is ready and tested, the job can be launched via shell using kitchen script (and scheduled execution if necessary using cron )
As you can see, is relatively easy to build complex operations, using the “blocks” Kettle makes available.
Moreover, is possible to invoke external scripts too, allowing a greater level of customization.
The major drawback using a tool like this is logic will be scattered across jobs and transformations and could be difficult, at some point, to maintain the “big picture” but, at the same time, it’s an enterprise tool allowing advanced features like parallel execution, task execution engine, detailed logs and the possibility to modify the business logic without being a developer.
As always, choosing a tool over another depends on constraints and objectives but next time you need to do some ETL, give it a try.