Friday, 26 June 2015

Pentaho Data Integration

Pentaho Data Integration, also known as Kettle, consists of a core data integration (ETL) engine and GUI applications that allow the user to define data integration jobs and transformations.
The name Kettle evolved from "KDE ETTL Environment" to "Kettle ETTL Environment" after the plan to develop the software on top of KDE (the K Desktop Environment) was dropped. The tool was open sourced in December 2005 and acquired by Pentaho in early 2006. Matt Casters is the lead developer of Kettle.
ETTL stands for:
  • Data extraction from source databases
  • Transport of the data
  • Data transformation
  • Loading of data into a data warehouse

Kettle

Kettle is a set of tools and applications that allows data manipulation across multiple sources. The main components of Pentaho Data Integration are:
  • Spoon - a graphical tool that makes designing ETTL transformations easy.
  • Pan - an application dedicated to running data transformations designed in Spoon.
  • Chef - a tool for creating jobs that automate the database update process.
  • Kitchen - an application that executes jobs in batch mode, usually on a schedule, which makes it easy to start and control ETL processing.
  • Carte - a web server that allows remote monitoring of running Pentaho Data Integration ETL processes through a web browser.
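Pan, Kitchen and Carte are normally driven from the command line. A minimal sketch of typical invocations, assuming the commands are run from the data-integration folder; the .ktr and .kjb file paths below are hypothetical placeholders:

```shell
# Run a transformation designed in Spoon with Pan.
# The -file path is a placeholder for your own .ktr file.
./pan.sh -file=/path/to/my_transformation.ktr -level=Basic

# Run a job in batch mode with Kitchen, e.g. from a cron schedule.
./kitchen.sh -file=/path/to/my_job.kjb -level=Basic

# Start Carte on a host and port so running ETL processes can be
# monitored remotely through a web browser.
./carte.sh localhost 8080
```

On Windows, the equivalent batch files are Pan.bat, Kitchen.bat and Carte.bat.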

Downloading Pentaho Data Integration


Steps for installation (Windows)

  1. Unzip the downloaded archive.
  2. A folder named data-integration is created. In that folder, open the spoon.bat file (just double-click it).

Steps for installation (Linux)

    Run the spoon.sh file.

Connecting to Progress database using Pentaho

  1. Add the jars base.jar, openedge.jar, pool.jar, spy.jar and util.jar to the \Pentaho-Kettle\data-integration\libext\JDBC folder.
  2. Double-click Transformations in the View tab. A new transformation is created.
  3. To rename the transformation, right-click the newly created transformation, i.e. “Transformation 1”. Select Settings and edit the transformation name.
  4. To connect to the Progress database, right-click Database connections and select New.
  5. Under General, add a connection name (e.g. test). Select Generic database as the connection type.
  6. Custom connection URL: jdbc:datadirect:openedge://hostName:50590;databaseName=dbName;defaultSchema=PUB
  7. Custom driver class name: com.ddtek.jdbc.openedge.OpenEdgeDriver
  8. Username: userName
  9. Password: passWord
  10. Click Test. If the details are correct, it reports that the connection is successful. Click OK.
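For reference, the same connection can be made directly from Java through JDBC. This is only a sketch: hostName, port 50590, dbName and the credentials are the placeholders from the steps above, and the five DataDirect jars must be on the classpath before the commented-out connection call will work.

```java
public class OpenEdgeConnect {
    // Builds the "Custom connection URL" exactly as entered in Spoon (step 6).
    static String buildUrl(String host, int port, String dbName) {
        return "jdbc:datadirect:openedge://" + host + ":" + port
                + ";databaseName=" + dbName + ";defaultSchema=PUB";
    }

    public static void main(String[] args) throws Exception {
        String url = buildUrl("hostName", 50590, "dbName");
        System.out.println(url);
        // With the DataDirect jars on the classpath, the connection itself
        // would be opened using the driver class from step 7:
        // Class.forName("com.ddtek.jdbc.openedge.OpenEdgeDriver");
        // java.sql.Connection c =
        //         java.sql.DriverManager.getConnection(url, "userName", "passWord");
    }
}
```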

Exporting tables as csv files

  1. Switch to the Design tab.
  2. Under Input, select Table input and drag it onto the Transformation 1 window.
  3. Under Output, select Text file output and drag it onto the Transformation 1 window.
  4. Right-click the Text file output icon and select “Edit step”. In the window that opens, use the File tab to either specify a file name or browse for a location to save the file. In the Content tab, specify the separator; to export as a CSV file, set the separator to , (comma). Click OK.
  5. Click Table input, hold Shift with the left mouse button, and drag to Text file output. This creates a hop.
  6. Right-click Table input and select Edit. Specify the SQL query to be executed. Click OK.
  7. Click Run (the green triangle) to watch the execution.
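Conceptually, the Text file output step joins each row of the query result into one line using the chosen separator. A rough Java sketch of that behaviour, assuming a simple quoting rule for fields that contain the separator; the sample row data is made up:

```java
import java.util.List;

public class CsvRow {
    // Joins one result row into a CSV line, the way a text-file output
    // with separator "," would. Fields containing the separator or a
    // quote are enclosed in double quotes, with inner quotes doubled.
    static String toCsvLine(List<String> fields, String separator) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            String f = fields.get(i);
            if (f.contains(separator) || f.contains("\"")) {
                f = "\"" + f.replace("\"", "\"\"") + "\"";
            }
            if (i > 0) line.append(separator);
            line.append(f);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // A hypothetical row from a Table input query.
        System.out.println(toCsvLine(List.of("1", "Smith, John", "NY"), ","));
    }
}
```

Setting the separator to a tab or semicolon in the Content tab changes the delimiter in the same way.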
The importance of water is not known until the stream runs dry!!
