Friday, 26 June 2015

Pentaho Data Integration

Pentaho Data Integration, also known as Kettle, consists of a core data integration (ETL) engine and GUI applications that allow the user to define data integration jobs and transformations.
The name Kettle evolved from "KDE ETTL Environment" to "Kettle ETTL Environment" after the plan to develop the software on top of KDE (the K Desktop Environment) was dropped. The tool was open sourced in December 2005 and acquired by Pentaho in early 2006. Matt Casters is the lead developer of Kettle.
ETTL stands for:
  • Data extraction from source databases
  • Transport of the data
  • Data transformation
  • Loading of data into a data warehouse

Kettle

Kettle is a set of tools and applications that allows data manipulation across multiple sources. The main components of Pentaho Data Integration are:
  • Spoon - a graphical tool that makes designing ETTL transformations easy.
  • Pan - an application dedicated to running data transformations designed in Spoon.
  • Chef - a tool for creating jobs that automate the database update process.
  • Kitchen - an application that executes jobs in batch mode, usually on a schedule, which makes it easy to start and control ETL processing.
  • Carte - a web server that allows remote monitoring of running Pentaho Data Integration ETL processes through a web browser.
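Pan, Kitchen and Carte are normally driven from the command line. A minimal sketch of typical invocations, assuming the commands are run from the data-integration folder; the .ktr and .kjb file paths below are hypothetical placeholders:

```shell
# Run a transformation designed in Spoon with Pan.
# The -file path is a placeholder for your own .ktr file.
./pan.sh -file=/path/to/my_transformation.ktr -level=Basic

# Run a job in batch mode with Kitchen, e.g. from a cron schedule.
./kitchen.sh -file=/path/to/my_job.kjb -level=Basic

# Start Carte on a host and port so running ETL processes can be
# monitored remotely through a web browser.
./carte.sh localhost 8080
```

On Windows, the equivalent batch files are Pan.bat, Kitchen.bat and Carte.bat.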

Downloading Pentaho Data Integration


Steps for installation (Windows)

  1. Unzip the downloaded archive.
  2. A folder named data-integration is created. In that folder, open the spoon.bat file (just double-click it).

Steps for installation (Linux)

    Run the spoon.sh file.

Connecting to Progress database using Pentaho

  1. Add the jars base.jar, openedge.jar, pool.jar, spy.jar and util.jar to the \Pentaho-Kettle\data-integration\libext\JDBC folder.
  2. Double-click Transformations in the View tab. A new transformation is created.
  3. To rename the transformation, right-click the newly created transformation, i.e. “Transformation 1”. Select Settings and edit the transformation name.
  4. To connect to the Progress database, right-click Database connections and select New.
  5. Under General, add a connection name (e.g. test). Select Generic database as the connection type.
  6. Custom connection URL: jdbc:datadirect:openedge://hostName:50590;databaseName=dbName;defaultSchema=PUB
  7. Custom driver class name: com.ddtek.jdbc.openedge.OpenEdgeDriver
  8. Username: userName
  9. Password: passWord
  10. Click Test. If the details are correct, it reports that the connection is successful. Click OK.
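For reference, the same connection can be made directly from Java through JDBC. This is only a sketch: hostName, port 50590, dbName and the credentials are the placeholders from the steps above, and the five DataDirect jars must be on the classpath before the commented-out connection call will work.

```java
public class OpenEdgeConnect {
    // Builds the "Custom connection URL" exactly as entered in Spoon (step 6).
    static String buildUrl(String host, int port, String dbName) {
        return "jdbc:datadirect:openedge://" + host + ":" + port
                + ";databaseName=" + dbName + ";defaultSchema=PUB";
    }

    public static void main(String[] args) throws Exception {
        String url = buildUrl("hostName", 50590, "dbName");
        System.out.println(url);
        // With the DataDirect jars on the classpath, the connection itself
        // would be opened using the driver class from step 7:
        // Class.forName("com.ddtek.jdbc.openedge.OpenEdgeDriver");
        // java.sql.Connection c =
        //         java.sql.DriverManager.getConnection(url, "userName", "passWord");
    }
}
```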

Exporting tables as csv files

  1. Switch to the Design tab.
  2. Under Input, select Table input and drag it onto the Transformation 1 window.
  3. Under Output, select Text file output and drag it onto the Transformation 1 window.
  4. Right-click the Text file output icon and select “Edit step”. In the window that opens, use the File tab to either specify a file name or browse for a location to save the file. In the Content tab, specify the separator; to export as a CSV file, set the separator to , (comma). Click OK.
  5. Click Table input, hold Shift with the left mouse button, and drag to Text file output. This creates a hop.
  6. Right-click Table input and select Edit. Specify the SQL query to be executed. Click OK.
  7. Click Run (the green triangle) to watch the execution.
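Conceptually, the Text file output step joins each row of the query result into one line using the chosen separator. A rough Java sketch of that behaviour, assuming a simple quoting rule for fields that contain the separator; the sample row data is made up:

```java
import java.util.List;

public class CsvRow {
    // Joins one result row into a CSV line, the way a text-file output
    // with separator "," would. Fields containing the separator or a
    // quote are enclosed in double quotes, with inner quotes doubled.
    static String toCsvLine(List<String> fields, String separator) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            String f = fields.get(i);
            if (f.contains(separator) || f.contains("\"")) {
                f = "\"" + f.replace("\"", "\"\"") + "\"";
            }
            if (i > 0) line.append(separator);
            line.append(f);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // A hypothetical row from a Table input query.
        System.out.println(toCsvLine(List.of("1", "Smith, John", "NY"), ","));
    }
}
```

Setting the separator to a tab or semicolon in the Content tab changes the delimiter in the same way.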
The importance of water is not known until the stream runs dry!!
