
JasperETL / Talend

Created By: Nitin Marwal


Contents

JasperETL / Talend
Purpose of the Document
Intended Audience
Technology
Reference Project Name
Contributors
Introduction
    Data Integration
    Data Quality
    Master Data Management
Talend key features
Getting Started with Talend
Installation
1/. Repository
2/. Palette
Creating Job
Adding Meta Data
Adding Components to Job
Mapping of Data
Run the Job
Conclusion
Is this the Work Around or Best Solution?
Document / Product / Component Repository Path
Introduction

JasperETL is powered by Talend and uses Talend's Data Integration and Open Studio features for ETL
purposes.

Talend MDM allows organizations to easily model and master any reference data, in any
domain without constraints. The unified data management platform unites Data Integration,
Data Quality, Master Data and Data Stewardship all through a single Eclipse-based
development environment.

Talend's data management solutions cover three key domains:

 Data Integration
 Data Quality
 Master Data Management

All Talend products are built on a unified Eclipse-based development environment, which
provides users with consistent ergonomics, a fast learning curve and a high level of reusability.
This offers unrivaled benefits in terms of resource optimization, utilization and project
consistency.

Data Integration
Talend's data integration products include:

 Talend Open Studio, the community version, provided under the GPL v2 license and
freely downloadable
 Talend Integration Suite, the enterprise version, provided under a commercial
subscription license. Talend Integration Suite exists in 3 editions: Team Edition,
Professional Edition and Enterprise Edition
 Talend On Demand, the Software as a Service version
 Talend Integration Suite MPx, a massively parallel data integration platform
 Talend Integration Suite RTx, a real-time data integration platform

Data Quality
Talend's data quality products include:

 Talend Open Profiler, an open source data profiling tool provided under the GPL v2
license and freely downloadable
 Talend Data Quality, the enterprise data quality platform that includes data profiling and
data cleansing features
Master Data Management
Talend's master data management products include:

 Talend MDM Community Edition, an open source Master Data Management tool
provided under the GPL v2 license and freely downloadable
 Talend MDM Enterprise Edition, the enterprise version, provided under a commercial
subscription license.

Talend key features:


 Active Data Model - allows organizations to immediately model and master any data
domain without a constraining data model and conditionally drive integration and
synchronization with external systems to reduce system complexity and time to deploy.
Talend MDM permits an iterative definition of the data model to gain alignment from
business users and ensure adoption upon launch.
 Domain Driven Integration- With Talend MDM, the master data drives interactions with
external systems. The solution employs a unique event manager to drive when and where
data is synchronized, augmented or distributed. A graphical tool provides over 400 proven
components and connectors to build and deploy integration jobs with any application,
database or system.
 Master Data Quality - Talend MDM provides features that allow you to validate, resolve,
standardize, cleanse and augment master data. The solution delivers a robust data
profiling tool. It packages native components for name and address standardization, and
provides callouts to external standardization services. Callouts to external sources,
including lookups for hierarchies or other reference codes, can be performed based
on specific data criteria.
 Data Stewardship - The Talend MDM collaborative interface allows you to search and author
hub data, and appropriate stewardship tools help manage the process of updating it.
The Ajax-based interface is dynamically driven by Talend's Active Data Model: all
validations found on the model instantiate themselves as validations on Web-based forms.
Workflow processes are easy to define and provide a strong set of tools for a team to
collaborate on and create a trusted, reliable set of master data.
 Talend Studio- Talend Studio is an intuitive development environment based on Eclipse
that allows building and managing the data model, defining integration jobs, administering
data quality and creating stewardship workflows to support the creation of master data all
in a single interface. It also provides unique functions for creating versions of hub data and
hierarchy management.
Getting Started with Talend

Installation:
Requirements:

- Java 1.5 or later

Download the zip from: http://www.talend.com/download.php#mdm

Extract it on your machine:

Here you will get two products, Talend Server and Talend MDM. To run the application, execute
TMDMCE-win32-x86.exe under Talend MDM.

Create a local repository and a project based on the language (Java / Perl) that suits you.

The Main Screen of Talend MDM is:

Here you can see various windows such as:

1/. Repository

2/. Palette
3/. Other windows such as Component Properties, Run Job, etc.

4/. The middle area is your working zone, where you can create various jobs, business models, etc.

Now let's see these in more detail:

1/. Repository
Repository is the place where all your data is stored, such as your jobs, business models, metadata
information and more.

Here is a screenshot of the same:

Under Job Design you create various jobs according to your data transformation requirements.

Under Metadata you can define and create various connections to your source data, which can be a CSV
file, a database or any other data format.

2/. Palette
Palette provides all the components that you can use while preparing your business model
or job for transferring data from the source data location to the destination data location.
Here is a screenshot of the Palette window:

Here you have many components available for data extraction, transformation and loading into the
target.

Now we will see how to create a new job in the system:

Creating Job:

 Right-click on Job Design under the Repository window and select Create Job; this will open a popup
where you can provide the basic details of the job, such as its name, purpose and description. It will
then create a new job for you and open it in the workspace.
Now you can create various metadata items for your source and destination data.

Adding Meta Data


 Expand Metadata under the Repository; here you can see the various options available to you,
such as DB connections, delimited files, XML file components, etc. You can create any kind of
metadata based on your requirements.
 For this demo we will use the File Delimited component, which reads data from a CSV file as the
source.
 For this, right-click on File Delimited and select "Create File Delimited". This will open a popup
where you can provide the basic details such as name, purpose, etc.

Click on Next and browse to the partner CSV; you will see the data shown below it:
Now click on Next:
Here you can set various parameters for your CSV settings. Now click on Next:
Here you will see the schema description as fields of your CSV file. I have selected
Website as the key because I want partners not to be duplicated, avoiding duplicate records in
the system based on the website. In general you can set any number of columns as keys, as per
your requirements. Now click on Finish and you will get partner_csv under File Delimited as
your source data.
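Conceptually, a File Delimited definition just tells Talend how to cut each line of the file into the schema's fields. A minimal sketch of that splitting step (the separator and sample row below are hypothetical, not taken from the actual partner CSV):

```java
import java.util.Arrays;

// Minimal sketch of what a delimited-file read does: split each line
// into fields on a separator. No quoting or escaping is handled here.
public class DelimitedReader {

    // Parse one delimited line into its fields.
    public static String[] parseLine(String line, String separator) {
        // -1 keeps trailing empty fields instead of dropping them
        return line.split(separator, -1);
    }

    public static void main(String[] args) {
        String row = "Acme Corp;www.acme.com;true"; // hypothetical sample row
        System.out.println(Arrays.toString(parseLine(row, ";")));
    }
}
```

The key column chosen in the wizard (Website here) does not change the parsing; it only marks which field later components use to identify a record.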
 Now we need to set up our destination database; here I am using a PostgreSQL database.
 So right-click on DB Connections under Metadata and select Create Connection; this will open a
popup where you can provide the name of the connection. Click on Next to provide the connection
details:

Here you can select the target database type and provide the connection settings. After filling in the
details, click on Finish and you will get the connection under Metadata > DB Connections:
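Behind the dialog, these settings resolve to a standard PostgreSQL JDBC URL. A small sketch of that URL's shape (host, port and database name below are placeholders, not values from this demo):

```java
// Sketch of the JDBC URL a PostgreSQL connection definition resolves to.
// Host, port, database and credentials here are hypothetical placeholders.
public class PgUrl {

    public static String buildUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        String url = buildUrl("localhost", 5432, "mydb");
        System.out.println(url);
        // A plain JDBC client would then connect with:
        // Connection con = DriverManager.getConnection(url, "user", "password");
    }
}
```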
 Now, to retrieve the table schemas, right-click on your DB connection and select Retrieve Schema:

Here you can select the schema type among TABLE, VIEW or SYNONYM; I have selected only
tables, as I require only tables. You can also use SQL queries to fetch your data. Now click Next.
Here it will show you all the tables present in the database, so select the tables you want
and click Next:
Here you can select the fields you want for your data process and click on Finish.

Now you will get your connection under MetaData:


Adding Components to Job
 Now open the test_job we created before by double-clicking on it:

 Now we need to add our CSV file to the job, so drag the CSV file into the job workspace; it will
open a popup like this:

Select tFileInputDelimited, as we want the CSV as the input source, and click OK; it will then create an
item in the job workspace:
Now, by double-clicking on the component, you can see its properties in the component
window:

Here you can view all the settings for your CSV file, and you can edit them here
as well.

 Now add the destination for the data, which is the table we created in DB Connections: just
drag the table from there into the job workspace. It will open a popup window:
Here select tPostgresqlOutput, as this is going to be the output of the data flow:

In the same way, you can view or edit the properties of the res_partner component in the component
window.

 Now we need to add a tMap component from the Palette window for mapping the input and output
fields in the data flow, so drag the tMap component from the Palette window into the job workspace:

 Now, to filter duplicate records, add a tUniqRow component from the Palette window:

 Now we need to connect the data flow from partner_csv to tMap, which will be the input for tMap.
For this, right-click on partner_csv, select Row > Main and connect it to tMap:
 Now we need to take the output from tMap to tUniqRow. For this, right-click on tMap, select
Row > New Output (Main) and name the output connection.

It will then ask about matching the target schema; click Yes.

 Now we need to take the output from tUniqRow to res_partner. For this, right-click on tUniqRow,
select Row > Uniques and connect it to res_partner.

Mapping of Data:
 Now double-click on the tMap icon to map the source and target data flows; it will open a map like
this:
Now we need to map the input fields to the output fields, so drag the related columns onto the target
columns, or click the Auto Map button at the top right.

You can see that I have used one extra column, active, as it is required for making a partner
active and mandatory in the destination database. It is a Boolean field, so it takes a default value of
TRUE/FALSE; I have therefore written true for the active column, as you can see:

Now that we have matched the columns, you can click on the OK button:
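The tMap step is essentially a per-row transformation: copy the mapped input columns across and set the extra active column to a constant true. A rough sketch, with a hypothetical two-column input row:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of the tMap transformation: pass the mapped input fields
// through and append a constant "active" column. The column layout
// (name, website) is hypothetical.
public class RowMapper {

    public static List<String> mapRow(List<String> input) {
        List<String> output = new ArrayList<>(input);
        output.add("true"); // constant value for the mandatory active column
        return output;
    }

    public static void main(String[] args) {
        System.out.println(mapRow(Arrays.asList("Acme Corp", "www.acme.com")));
    }
}
```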

 Now double-click on the tUniqRow component to view its properties:

Here you can select website as the key attribute so that uniqueness is checked on the basis of
this column.
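What tUniqRow does with that key can be sketched as: keep the first row seen for each distinct value of the key column and drop the rest. A minimal illustration (rows here are hypothetical [name, website] pairs, not the demo data):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of tUniqRow's behavior: keep only the first row seen for
// each value of the key column (website, at index 1 here).
public class UniqRow {

    public static List<String[]> dedupeByWebsite(List<String[]> rows) {
        Set<String> seen = new HashSet<>();
        List<String[]> uniques = new ArrayList<>();
        for (String[] row : rows) {
            if (seen.add(row[1])) { // add() returns false for a duplicate key
                uniques.add(row);
            }
        }
        return uniques;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"Acme", "www.acme.com"},
            new String[] {"Acme Inc", "www.acme.com"}, // duplicate website
            new String[] {"Globex", "www.globex.com"});
        System.out.println(dedupeByWebsite(rows).size()); // prints 2
    }
}
```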

 Now double-click on res_partner to view the properties of this component in the component
window:
Here you can view the basic connection settings. Two fields are important:

1. Action on table: this defines how your connection will treat your table.
2. Action on data: this defines what operation you are going to perform; it can be
insert, update or a combination of both.
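The "Action on data" choice corresponds to the kind of SQL statement executed per row. A rough illustration of the two basic actions; the table and column names (res_partner, name, website, active) follow this example, but the exact statements Talend generates may differ:

```java
// Sketch of the per-row SQL behind "Action on data": insert adds new
// rows, update modifies existing rows matched on the key (website).
public class DataAction {

    public static String insertSql() {
        return "INSERT INTO res_partner (name, website, active) VALUES (?, ?, ?)";
    }

    public static String updateSql() {
        return "UPDATE res_partner SET name = ?, active = ? WHERE website = ?";
    }

    public static void main(String[] args) {
        System.out.println(insertSql());
        System.out.println(updateSql());
    }
}
```

A combined "insert or update" action behaves like trying one statement and falling back to the other when no row matches.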
 Now we are done with all the configuration and ready to run our job. So click on the Run window
near the component window:

 Before running, check the CSV we created to import the partners: it contains some duplicate
records.

Here you can see that we have some duplicate records, which we will filter out using the tUniqRow
component.
Run the Job
 Now we can run our job:

To run the job, click on the Run button:

Here you can see that 6 rows flowed out of partner_csv but only 4 rows flowed to the destination, as
tUniqRow filtered out the duplicate records.

So this is how we use Talend for ETL purposes. For more reference, you can check out the help, which is
available in detail in the Help menu in Talend.
