
Pentaho Data Integration

for
Database Developers

July 2011

© 2011, Pentaho. All Rights Reserved. www.pentaho.com.

Welcome: Agenda
Audience and prerequisites
Learning objectives
Class process
Course outline


Audience and Course Prerequisites


Intended audience
The course is targeted at database administrators and database developers.
Portions of the course assume knowledge of SQL and relational database concepts.
Course prerequisites
There are no Pentaho Training prerequisites for this course.


Learning Objectives
At the end of the course, you should understand:
The basic architecture and features of Pentaho Data Integration.
The concepts and features of Pentaho Data Integration Enterprise Edition.
How PDI supports you in the Agile BI approach.


Learning Objectives
At the end of the course, you should be able to
Load and write data from and to different data sources
Join data from different sources
Use PDI and ETL design patterns (like restartable solutions)
Influence the performance aspects of databases and transformations
Build portable and flexible jobs and transformations
Schedule jobs and transformations
Use logging, monitoring and error handling features of PDI
Load, transform, and create complex XML structures
Use scripting (JavaScript, Formula, Java) in transformations
Apply clustering and partitioning solutions for high volumes


Learning Objectives
What are your objectives?
What are your expectations?


Course Process
Daily schedule:
9:00 am to 5:00 pm
1-hour lunch break at noon
15-minute morning break at 10:30
15-minute afternoon break at 3:30
The course is a combination of lecture, demos and labs.
Feel free to ask questions or to seek clarification!
An online survey will be provided for feedback and suggestions.


Pentaho Data Integration


Overview


Pentaho Data Integration (PDI) Introduction


PDI is the product associated with the KETTLE open source project:
KETTLE is the open source software that makes up the core of PDI Enterprise Edition.
PDI Enterprise Edition is the whole product, adding:
Professional technical support
Maintenance releases
EE-only features, including enterprise security integration, scheduling, and more
Documentation
Member of the Pentaho BI Suite


Use as a BI Platform Component


PDI Jobs and Transformations can be
run in Pentaho BI Platform:
PDI 3.X Jobs/Transforms execute in
Pentaho BI Platform 1.7.X and
beyond.
PDI 4.X Jobs/Transforms execute in
Pentaho BI Platform 3.6.X and
beyond.
For example: Fill a Pentaho Report
with data from a Transformation.
Details in a separate module.


Enterprise Edition (EE) Data Integration Server


Standalone without the BI Platform
PDI Enterprise Edition Architecture:


Enterprise Edition (EE) Data Integration Server


Primary features and functions:
Execution: Executes ETL jobs and transformations using the Pentaho Data
Integration Engine.
Security: Allows you to manage users and roles (default security) or integrate with your existing security provider, such as LDAP or Active Directory.
Content Management: Provides the ability to centrally store and manage your
ETL jobs and transformations. This includes full revision history on content and
features such as sharing and locking for collaborative development
environments.
Scheduling: Provides the services allowing you to schedule and monitor
scheduled activities on the Data Integration Server from within the Spoon design
environment.


The Enterprise Console


Provides a thin client for managing deployments of Pentaho Data
Integration Enterprise Edition including:
Management of Enterprise Edition licenses.
Monitoring and controlling activity on a remote Pentaho Data
Integration server.
Analyzing performance trends of registered Jobs and
Transformations.


The KETTLE Project


What is KETTLE?
A recursive acronym, much like GNU (GNU's Not Unix):
Kettle Extraction, Transformation, Transportation & Loading Environment.
Created out of frustration with other solutions:
Custom-built PL/SQL, C-SQL (embedded SQL) and hacked-together VB solutions
Commercial products:
Oracle Warehouse Builder
Information Builders' iWay
SQL Server DTS
Data Mirror


PDI (KETTLE) History


Early Years: 2001 to 2005
KETTLE project started in 2001 by Matt Casters
Early years focused on ease of use, maintainability and deployability
KETTLE 2.3: December 2005
First LGPL-licensed open source version
Project acquired by Pentaho: April 2006
Provided paid staff of developers
Offered support and services to customers


PDI (KETTLE) History


Pentaho Data Integration 2.4: February 2007
Parallel processing support
Multi-tab interface for developers editing multiple transformations
Integration of transformation design and job execution user
interfaces
Pentaho Data Integration 2.5: May 2007
Enhanced MySQL database support
Transformation Explorer for organizing and accessing transformations


PDI (KETTLE) History


Pentaho Data Integration 3.0: Nov 2007
ETL Developer productivity
Rapid, community-fueled evolution
Performance and scalability
Clean separation of data and metadata, reducing Java object creation
Clustering improvements:
Support for multiple step copies
Data re-partitioning
Dynamic cluster schemas

Faster flat file reading:
Use of non-blocking I/O (NIO) to read large blocks at a time
Parallel file reading
Support for lazy conversion
Simplified algorithms


PDI (KETTLE) History


Pentaho Data Integration 3.1: September 2008
Enterprise Console for remote monitoring and performance trend
analysis
Ease of use improvements, including a consolidated log, execution history and a step performance graph in the results panel
Numerous new steps, job entries and expanded data source support
Pentaho Data Integration 3.2: May 2009
Dynamic Clustering - dynamically distribute execution to available
cluster/cloud nodes
Named Parameters
Usability improvements - annotate hops, more visual feedback
Over 20 new Transformation Steps/Job Entries with numerous
updates to existing steps


PDI (KETTLE) History


Pentaho Data Integration 4.0: June 2010
Data Modeling and Visualization Perspectives (Pentaho Agile BI)
EE Data Integration Server with scheduling, revision history and
more
Usability improvements
New steps/Job entries
Pentaho Data Integration 4.1: November 2010
Hadoop Integration (Enterprise Edition only)
One click disable/enable of steps downstream of a hop
Metadata Injection (experimental first steps)
Many new steps and job entries


PDI Version 4.2 (July 2011)


Graphical performance and progress feedback for transformations
Metadata Injection
Report bursting by the Pentaho Reporting Output step
Automatic Documentation step
Talend Job Execution job entry
Single Threader step for parallel performance tuning of large
transformations
Allow a job to be started at a job entry of your choice
The XML Input Stream (StAX) step to read huge XML files at optimal
performance
The Get ID from Slave Server step allows multi-host or clustered
transformations to get globally unique integer IDs from a slave server


PDI Version 4.2


Carte improvements:
1. Reserve the next value range from a slave sequence service
2. Allow parallel (simultaneous) runs of clustered transformations
3. List (reserved and free) socket reservations service
4. New options in XML for configuring slave sequences
5. Allow time-out of stale objects using the environment variable KETTLE_CARTE_OBJECT_TIMEOUT_MINUTES

Memory tuning of the logging back-end with:
1. KETTLE_MAX_LOGGING_REGISTRY_SIZE
2. KETTLE_MAX_JOB_ENTRIES_LOGGED
3. KETTLE_MAX_JOB_TRACKER_SIZE
allowing for flat memory usage for never-ending ETL in general and jobs specifically.


PDI Version 4.2


Repository Import/Export
1. Export at the repository folder level
2. Export and Import with optional rule-based validations
3. The import command line utility allows for rule-based (optional) import of lists of transformations, jobs and repository export files: http://wiki.pentaho.com/display/EAI/Import+User+Documentation
ETL Metadata Injection
1. Retrieval of rows of data from a step to the metadata
injection step
2. Support for injection into the Excel Input step
3. Support for injection into the Row normaliser step
4. Support for injection into the Row Denormaliser step
And many more new steps and job entries


Why Pentaho Data Integration?


Ease of use:
100% metadata driven (define WHAT you want to do, not HOW to do
it)
No extra code generation means lower complexity
Simple setup, intuitive graphical designers and easy to maintain
Flexibility:
Never forces a certain path on the user
Pluggable architecture for extending functionality


Why Pentaho Data Integration?


Modern Standards-based Architecture
100% Java with broad, cross platform support
Over 100 out-of-the-box mapping objects (steps and job entries)
Enterprise class performance and scalability
Lower total cost of ownership (TCO)
No license fees
Short implementation cycles
Reduced maintenance costs


Pentaho Data Integration Adoption


Wide range of production deployments:
Small and medium-sized companies
Large enterprises
Rapid product evolution
Driven by Pentaho investment
Includes significant community
contributions
Contribution-friendly
architecture
Natural fit for additional data
sources, targets and
transformations


Common Uses
Data warehouse population:
Built-in support for slowly changing
dimensions, junk dimensions and other data
warehouse concepts.
Export of database(s) to text-file(s) or other
databases.
Import of data into databases, ranging from
text-files to Excel spreadsheets.
Data migration between database applications.


Common Uses
Exploration of data in existing databases (tables, views, synonyms, ...).
Information enrichment by looking up data in various information stores (databases, text files, Excel spreadsheets, ...).
Data cleansing by applying complex
conditions in data transformations.
Application integration.


Example
(Diagram slide; footnotes: ... or PeopleSoft, Siebel, JD Edwards, Axapta, Navision, SugarCRM, Compiere, and others; ... or DB2, Teradata, Microsoft, Sybase, MySQL, PostgreSQL, Ingres, etc.)


Agile BI
Modeling and Visualization perspectives



EE Data Integration Server


Enterprise Repository and Content Management:
New repository based on JCR (Content Repository API for Java)
Improved Repository Browser
Enterprise security:
Configurable Authentication including support for LDAP and MSAD
Task permissions to control what actions a user/role can perform
such as read/execute content, create content and administer
security
Full revision history on content allowing you to compare and restore
previous revisions of a Job or Transformation.
Ability to lock Transformations/Jobs for editing.
Recycling bin concept for working with deleted files.


EE Data Integration Server


Scheduling


Since PDI 4.0 & 4.2: Usability


Hover-over menus simplify the connection of steps.
Graphical indicators on hops represent the flow of information between steps.
New activity indicators on jobs and steps (since 4.2) help highlight current activity and bottlenecks during execution.


Since PDI 4.0: Improved Logging


Internal object IDs (small API change)
Logging channels (GUIDs)
Step logging
Sniffing (debugging, data lineage)


Pentaho Data Integration Components


Spoon
Graphical environment for modeling:
Transformations are metadata models describing the flow of data.
Jobs are workflow-like models for coordinating resources, execution and dependencies of ETL activities.
Pan
Command line tool for executing transformations modeled in Spoon.
Kitchen
Command line tool for executing jobs modeled in Spoon.
(Screenshots: the Spoon interface designing a Transformation; a Job example.)


Pentaho Data Integration Components


Carte
Lightweight web/HTTP server for remotely executing Jobs and Transformations.
Carte accepts XML containing the transformation to execute and the execution configuration.
Enables remote monitoring.
Used for executing transformations in a cluster.
Remote servers running Carte are referred to as Slave Servers.


Enterprise Edition (EE) Data Integration Server


Enterprise Edition alternative to Carte providing
Execution and remote monitoring (can act as master/slave similar to Carte)
Integrated scheduling
Enterprise Security options
Enhanced content management including revision history and locking


Repository: the Metadata Store


Kettle can store metadata in
XML files
RDBMS repository
Enterprise repository
Objects stored in repository
Connections
Transformations
Jobs
Schemas
User and profile definitions
Repository supports
collaborative development


How to start the user interface?


Start Spoon.bat (Windows) or Spoon.sh (Linux, Mac OS) in the Kettle folder.
The command launch-designer.bat/.sh is also available in the archive-based installation.
For our trainings we do not use the repository; all training data will be stored in the file system as KTR files (Transformations) or KJB files (Jobs) in XML format.
Another option is to start via Java Web Start:
Latest version loaded via the Internet (usable for configuration management).
JNLP files (Java Network Launching Protocol) are located in the KETTLE/webstart folder.


Transformations
Transformations are a network of logical tasks (Steps):
Read a flat file
Filter it
Sort it
Load it into MySQL


Steps and Job Entries


Hops - within Transformations


Are data pathways that connect steps together
Allow schema metadata to be passed from step to step
Determine the flow of data through the steps
Example: The pathway for all data and the true and false path from a Filter
rows step.


Hops Data Movement: Copy or Distribute?


Specify whether data is copied or distributed between multiple hops leaving a step (right-click on a step and select Data Movement).


Hops - Other Types in Transformations


Info Steps: When data is retrieved (pulled) from another step.
Error handling steps: When error handling is enabled.


Hops - within Jobs


Define the execution sequence for job entries.
There are three types of hops within Jobs (right click on the hop):
Unconditional
Follow when result is true
Follow when result is false


Data Flow, Threading mechanism


All steps are started and run in parallel:
The initialization sequence is not predictable.
PDI takes care of the correct data flow.
Data is pulled and pushed from step to step.
PDI is capable of processing an unlimited number of rows:
Steps vary in execution speed and memory consumption.
Set the threshold for the number of rows that may wait to be processed by the next step:
If this number of waiting rows is reached, the source step waits for room.
When there is room to process, more rows are put into the data stream.
See Transformation Properties / Miscellaneous / Number of rows in rowset.


Threading mechanism

There are additional options that influence the threading mechanism in Transformation Properties / Miscellaneous:
1. Manage thread priorities: If there is not much to do for a thread (step), put the thread to sleep for some milliseconds. This reduces the locking overhead of the buffers and is enabled by default.
2. (Serial) Single Threaded: Allows building thread pools in combination with sub-transformations. It will not work if any step is getting or putting rows from/to more than one step (e.g. the Stream lookup step).
3. KETTLE properties: KETTLE_BATCHING_ROWSET, KETTLE_ROWSET_GET/PUT_TIMEOUT


Values, Metadata and Data


Values are like columns in data rows:
Composed of metadata and data.
PDI version 3.0 separated metadata and data:
Metadata is only transported with the first data row.
All subsequent data rows reference this metadata.
PDI maps database (JDBC) data types to PDI data types:
The implementation can be (and often is) different from database to database.


Values, Metadata and Data: Formatting etc.


Metadata is used for formatting when:
Data is presented, e.g. in a preview.
Data is written to the outside world, e.g. to a text or XML file.
Metadata is NOT used for formatting when:
Data is just loaded from one table and written to another table.
Metadata is used to create SQL statements (field types, lengths, etc.).
Metadata is used to check the data for the right data type.
Note: A change in metadata does not change the data, e.g.:
A modification of the length does not change/truncate the data.
A new formatting does not change the data.
But when you modify the type in the Select Values step, the data is converted to the new data type (e.g. from a String to a Number with the given formatting).


Numeric Data Types


Number (double, floating point)
Double precision can handle only 15 significant digits.
Is sometimes not exact.
To avoid this, use BigNumber, but expect a performance drop.
Integer
Optimal for storing and processing data from a performance viewpoint.
BigNumber
Offers an extremely high precision level but needs more memory and CPU than the other numeric types.
Note: Formatting of Number data types is done by default with the pattern #.#;-#.#, i.e. with only one digit of precision.


Other Data Types


String
Used mainly for CHAR, VARCHAR and MEMO/CLOB fields.
A note on NULL handling: PDI follows Oracle in its use of empty strings and NULLs:
1. They are considered to be the same when they are compared by steps.
2. Empty strings are converted to NULL. The latter causes a lot of problems, e.g. in data migration projects (more details can be found in JIRA feature request PDI-2277).
PDI 4.0 introduced a (system-wide) environment variable to change this behaviour and keep empty strings as empty strings: KETTLE_EMPTY_STRING_DIFFERS_FROM_NULL.
Set it to "Y" in the kettle.properties file to make it work.
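As an aside, this mirrors well-known Oracle behaviour; a minimal sketch (table and column names invented for the example):

CREATE TABLE t (c VARCHAR2(10));
INSERT INTO t VALUES ('');                -- Oracle stores the empty string as NULL
SELECT COUNT(*) FROM t WHERE c IS NULL;   -- returns 1 on Oracle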


Other Data Types


Date
Includes date and time
Note: Formatting Date Data Types is done by default with the
pattern yyyy/MM/dd HH:mm:ss.SSS
Boolean
True or false, representation in the database depends on database
boolean support.
Note: Please see the following option to turn this on when your
database supports this: Database connection/Advanced/Supports
boolean data type
Binary
Can hold any binary data like pictures, used mainly for BLOB data.
Serializable
An object to transfer from/to specific steps (internally only).


Data Types and Silent Conversions


Until version 2.5.x, data types were mostly converted silently.
This could lead to undiscovered type mismatches.
Since PDI 3.0:
Data types are strictly checked.
No silent conversions take place (with some exceptions for compatibility).
Example of a type conflict:
Samples directory: Denormalizer - Simple example.ktr


Lazy Conversion
Lazy conversion is a delayed conversion of data types:
Provides a performance boost.
Conversion takes place only where it is really needed:
Ideally at output steps.
Sometimes not at all, e.g. when reading from a text file and writing back to a text file.
If the output format is the same as the input format, no conversion is needed.
Steps supporting lazy conversion:
Specifically: CSV File Input, Fixed File Input and Table Input.
Other steps support it transparently.


Lazy Conversion
The binary form of data can cause issues:
For example: sorting on the binary form ignores the character set sorting sequence.
A new feature in the Select Values step converts data to and from the binary character form.
Covered in more detail in a separate module.


Handling the User Interface


Main tree
Lists all open Transformations and Jobs and their contents.
Core objects and favorite Steps/Job entries
Core is a toolbox with all the available Steps/Job entries (plug-ins are shown in bold).
Favorites are a static list of the most-used steps.
Notes
Can appear anywhere on the graphical view.
Right-click on the canvas and select Add Note.
Options and settings
Options are valid for the entire PDI environment.
Settings are valid for a particular transformation or job.


Handling the User Interface


Draw new hops
Other methods:
Middle or scroll-wheel button: click on the first step and drag onto the second.
Use SHIFT+Click and drag from one step to another.
Select 2 steps, right-click on one of them and select New hop.
Drag Hops onto the canvas.
Inserting a step (or job entry) between others:
Move the step over the arrow until the arrow is drawn in bold.
Release the mouse button.
Window sizes:
You may have to resize some dialogs to see all parameters.


Handling the User Interface


Right-click on the first column in any dialog table (grid) for a list of all the options.


Handling the User Interface


Right-click on a Step for a list of all the options in the context menu.


Run and Preview


Can execute the entire Transformation or just preview a particular step.
Preview is also possible by selecting a step and pressing the F10 key.
You need at least two steps connected with a hop to run or preview.
Closing a preview (vs. choosing the stop/get more rows options) will leave the Transformation in a paused execution state. If you attempt to restart the Transformation, it will tell you it cannot be started twice.
Note: Preview may be destructive.
Subsequent steps are also initiated (this could cause truncation of a target table).
Rows are passed to subsequent steps; they are not stopped at the previewed step.
You may temporarily disable the hop after the step to preview.


Log View
Shows statistics associated with execution of a Transformation.
Used to understand performance and to check the results.
Logging can be very granular down to the row level if needed.


Safe Mode
Available in the Execute a Transformation/Job window
Used in cases that mix rows from various sources
Makes sure that these rows all have the same layout (metadata).
Forces the Transformation to check the layout of each row.
An error is thrown for any row whose layout differs from that of the first row:
The step and the offending row are reported.
Has performance tradeoffs:
Checking each row slows performance.
The source of an error is found sooner, which is useful in troubleshooting.


Usage of Safe Mode at design time


Performing a union on data with different layouts generates a warning:
The name of field number 1 is not the same as in the first row
received: you're mixing rows with different layout. Field
[ThisIsAStringValue String(10)] does not have the same name as field
[ThisIsANumberValue Number].


Analyzing Errors

2010/05/18 16:38:00 - Generate Rows.0 - ERROR (version 4.0.0-) : Couldn't parse Integer
field [WrongType] with value [abc] -->
org.pentaho.di.core.exception.KettleValueException:

Log entries include:
The step (Generate Rows) generating the error.
Detailed information including the PDI version (useful when submitting cases).
Any stack traces (useful for bug tracking due to program errors).
Error lines are shown in red since 4.0 (easier to find).
Use Show Error Lines to find errors more easily.


Debugging
Introduced in PDI 3.0
Provides conditional breakpoints


Replaying a Transformation
Is implemented for Text File Input and Excel Input
Allows files containing errors to be sent back to source and corrected.
Uses .line file to reprocess file:
ONLY lines that failed are processed during the replay.
Uses the date in the filename of the .line file to match the replay
date.


Special PDI Application Files


User-specific files are found in the .kettle directory of the user's home directory:
kettle.properties: Default properties file for variables
shared.xml: Default shared objects file
db.cache: The database cache for metadata
repositories.xml: The local repositories file
.spoonrc: User interface settings, last opened transformation/job
.languageChoice: User language (delete to revert the language)
Some temporary logs are also stored in the temp folder:
Usually cleaned by the Java Virtual Machine (VM).
May need to be cleaned (deleted) due to defects in the VM.
The directory of temporary logs is determined by the OS and VM (for example: typically C:\Windows\Temp on Windows).


Special PDI Application Files KETTLE_HOME


The HOME directory may change depending on the user who is logged on. As a result, the configuration files that control the behavior of PDI Jobs and Transformations differ from user to user. Setting the KETTLE_HOME variable can be done system-wide by the operating system, or just before starting PDI using a shell script or batch file (for example, with the SET command).
Point KETTLE_HOME to the directory that contains the .kettle directory; the .kettle part is appended by PDI. For example: when you have stored the common files in C:\Pentaho\Kettle\common\.kettle, you need to set the KETTLE_HOME variable to C:\Pentaho\Kettle\common.
When running PDI from the Pentaho BI Platform, please see the Knowledge Base
for setting the variable.


Pentaho Data Integration


The Conceptual Model: any questions?


Pentaho Data Integration - Conceptual model


Connections, Inputs and Outputs


Database Connections
Multiple database connections to different databases can be created.
With a PDI repository:
Defined connections readily available to transformations and jobs.
Connection information for the repository itself is stored in
repositories.xml.
Without a PDI repository:
Connection definition contained in a single Transformation or Job.
Can share connection definitions in subsequent Transformations and
Jobs.


Database Connections
Available database connections appear in the Main Tree.

Choose Share in the context menu of any connection to share it.


Shared connections appear in bold.


Database Connections
General database connection options


Access Via JDBC


Pentaho Data Integration ships with the most suitable JDBC drivers for the listed databases.
Additional drivers can be added to the lib/libext directory.
Use the Generic tab of the connection dialog to use unlisted drivers; this permits connections to non-listed databases.
Existing drivers can be replaced in the lib/libext directory.
Special database issues and experiences with different JDBC drivers can be found in the Pentaho Data Integration Wiki:
http://wiki.pentaho.com/display/EAI/Special+database+issues+and+experiences


Other Access Methods


ODBC connections are possible:
ODBC connections must be defined in Windows.
ODBC connections are made via the ODBC-JDBC bridge.
Some limitations of the SQL syntax.
Generally slower than JDBC due to the additional layer.
Use a JNDI connection to connect to a data source defined in an application server like JBoss or WebSphere.
Plugin-specific access methods are supplied by a specific database driver (like SAP R/3 or PALO connections).


Advanced Database Connections


Pooling
Useful for performance tuning
Can limit number of connections for database client licensing reasons
Driver Specific Options (Options tab)
Pass additional parameters to the drivers
Allows driver specific tuning for performance
Database vendor documentation is available by clicking the help button.


Advanced Database Connections


Clustering
Enables clustering for the database connection and creates connections to the data partitions.
Requires partition ID, name and port of host, and user name and
password
Identifiers (Advanced tab)
Directs how SQL is generated
Post Connection SQL


Quoting
Quoting is used when reserved names or special characters are used.
For example, the field names "sum", "V.A.T.", "overall sales".
PDI has an internal list of reserved names for most of the supported database types.
PDI's automatic quoting can be overridden.
Feedback on quoting is always welcome to improve the quoting algorithms.
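For example, quoting the special field names above in generated SQL might look like the following sketch (ANSI double quotes shown; the actual quoting character depends on the database type, and the table name is invented):

SELECT
  "sum"
, "V.A.T."
, "overall sales"
FROM some_table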


Database Explorer
In the toolbar:
In the connection context menu:


DB Cache: The Metadata Cache


Field metadata for each connection and SQL statement is cached in the internal file db.cache.
The metadata cache must be refreshed:
It is refreshed automatically when the table is changed in the PDI context (from the SQL statement window).
Manual refreshing of the cache may be necessary.
Typically detected by missing fields or mismatches in the Show input/output fields of a step.


Impact Analysis
What impact does the Transformation have on the used databases?


SQL Editor
Creates the needed DDL for the output steps related to a database
table, often CREATE statements for tables or indices.
SQL button in the toolbar creates all needed DDL for tables.
No automatic mechanism to alter tables when the layout changed.
For example: A field type from a source table is changed
DDL can be easily and manually changed.
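A sketch of the kind of DDL the editor generates (table and field names are illustrative, not taken from a real training lab):

CREATE TABLE dim_customer
(
  customer_tk INTEGER
, customername VARCHAR(50)
, city VARCHAR(50)
);
CREATE INDEX idx_dim_customer_tk ON dim_customer (customer_tk);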


Steps Covered In This Section


Input Steps
Output Steps
Special Dimension Steps

Text File Input


Reads a large number of different text files, including CSV files generated by
spreadsheets.
Text File Input Options:
Filename specification options: Files may be added to selected files list.
Accept filenames from previous step: Filename can come from any source.
Content specification: Allows user to specify format of the files being read.
Error handling: Allows user to define how to react to errors.
Filters: Allows user to specify the lines to be skipped in the text file.
Fields: Allows user to define characteristics of the fields.
Formats: Includes formatting of number and date fields.


CSV File Input


Reads a CSV file format only.
Due to the internal processing, this step is much faster.
Options are a subset of Text File Input
NIO buffer size: Set the buffer size used by the Java I/O classes
(NIO, improved performance in the areas of buffer management).
Lazy conversion: This step supports lazy conversion.


Fixed File Input


Reads a fixed file format only.
Due to the internal processing, this step is much faster.
Options are a subset of Text File Input:
NIO buffer size and lazy conversion options are identical to the CSV File Input options.
Run in parallel: Must be checked if the Transformation is executed in a cluster with many workers processing one large file.


Table Input
Reads information from a database, using a connection and SQL.
Table Input options:
Step Name: The name has to be unique in a single Transformation.
Connection: The database connection used to read data from.
SQL: The statement used to read information from the database connection; may be any query.
Insert data from step: The name of the input step the parameters for the SQL come from, if appropriate (see the example after this list).
Limit: Sets the number of lines that are read from the database.
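For example, when Insert data from step is set, ? placeholders in the SQL are filled from the fields of each incoming row (a sketch with invented table and field names):

SELECT *
FROM orders
WHERE customernumber = ?
  AND orderdate >= ?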


Excel Input
Reads information from one or more Excel files.
The options provided by the PDI GUI for Excel input include:
Step Name: Name of the step.
File Tab: To define the filenames, with variable support.
Sheet Tab: To define the sheet(s) to import.
Fields Tab: To specify the fields that need to be read from the
Excel files.
Error handling Tab: Allows user to define how to react when error
is encountered.
Content Tab: Includes the sub options of Header, No empty rows,
Stop on empty rows, Field name, Sheet name field, Row number
field and Limit.


Access Input
Reads information from one or more Access files.
No ODBC connection necessary.
Allows Access files to be read on non-Windows platforms.
Access Input Options:
Step Name: Name of the step.
File Tab: To define the filenames, with variable support.
Content Tab: Specify the table name and the inclusion of the file
name, table name, row number and limit.
Fields Tab: Specify the fields that need to be read from the Access
files.


XBase Input
Reads data from most types of DBF file derivatives, called the XBase family (dBase III/IV, FoxPro, Clipper, ...).
Options:
Step name: Unique name (in transformation) of the step.
Filename: Name of XBase file with variable support.
Limit size: Only read this number of rows; zero means unlimited.
Add rownr?: Adds a field to the output with the specified name that
contains the row number.


Apache Virtual File System (VFS) support


Apache VFS allows references to files from virtually any location.
Apache VFS support is available in the Pentaho Platform and Pentaho Analysis.
All file names are treated as URIs.
file:///somedir/somefile.txt
zip:http://somehost/downloads/somefile.zip
http://myusername@somehost/index.html
sftp://myusername:mypassword@somehost/pub/dl/somefile.tgz
webdav://somehost:8080/dist
Further information: http://commons.apache.org/vfs/filesystems.html


Generate Rows
Outputs a number of rows; by default empty, but optionally containing a number of static fields.
Options
Step name: Unique name (in the Transformation) of the step.
Limit: The number of rows the user wants to output.
Fields: Static fields the user might want to include in the output row.


Get System Info


Retrieves information from the PDI environment.
Options
Step name: Unique name (in transformation) of the step.
Fields: The fields to output.

System Information Types


Date and time information
Run-time Transformation metadata
Command line arguments


LDAP Input
Reads data from an LDAP server.
Options
Host: Hostname or IP address of the LDAP server.
Port: The TCP port to use, typically 389.
User Authentication: Enable to pass authentication credentials to
server.
Username/Password: For authenticating with the LDAP server.
Search base: Location in the directory from which the LDAP search
begins.
Filter String: The filter string for filtering the results.
Fields: Define the return fields and type.


De-serialize/Serialize to/from File


Read and write data and PDI metadata together from and to a file.
Use Cases
Transfer data from one Transformation to another where the memory
is not sufficient to hold the amount of data.
Save data and metadata to be processed at another time.
Transfer data to another user with no need to reanalyze the
metadata (instead of text files).


Text File Output


Exports data to a variety of different text file formats, including CSV.
Options
Extension
Append
Separator
Enclosure
Header/Footer
Zipped
Include step number/date/time in file name
Encoding
Right pad fields
Split every n row(s)


Table Output
Inserts (only) information into a database table.
Options
Target table
Commit size (see the sketch after this list)
Truncate table
Ignore insert errors
Partition data over tables
Use batch update for inserts
Return auto-generated key
Name auto-generated key field
Is the name of the table defined in a field
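A hedged sketch of what the step effectively sends to the database; with a commit size of N, a COMMIT is issued after every N rows (with batch updates enabled, the INSERTs are grouped into JDBC batches):

INSERT INTO target_table (field1, field2) VALUES (?, ?);
-- ... repeated for each incoming row ...
COMMIT;  -- issued after every <commit size> rows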


Insert / Update
Automates simple merge processing (sketched in SQL after the option list):
Look up a row using one or more lookup keys.
If a row is not found, insert the new row.
If found and the targeted fields are identical, do nothing.
If found and the targeted fields are not identical, update the row.
Options
Step name
Connection
Target table
Commit size
Keys
Update fields
Do not perform any updates (if used, operates like Table Output, but without any insert errors caused by duplicate keys).
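A sketch of the merge logic in SQL (invented table and field names; the step issues equivalent statements through JDBC):

SELECT field1, field2 FROM target_table WHERE key1 = ?;
-- no row found:          INSERT INTO target_table (key1, field1, field2) VALUES (?, ?, ?);
-- row found, different:  UPDATE target_table SET field1 = ?, field2 = ? WHERE key1 = ?;
-- row found, identical:  do nothing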


Update
Same as the Insert / Update step except no insert is performed in the
database table.
ONLY updates are performed.


Delete
Same as the Update step, except rows are deleted.


Excel Output
Exports data to an Excel file
Options
Sheet name
Protect sheet with a password
Use a template (e.g. with a preformatted sheet)
Append or override the contents of the template


Access Output
Exports data to an Access file
ODBC not required
Can be used on non-Windows platforms
Options
Database filename (.mdb)
Create database (creates/overrides the file)
Target table
Create table
Commit size


XML Output
Writes rows from any source to one or more XML files
Options
File name
Extension
Include stepnr in file name
Include date in file name
Include time in file name
Split every N rows
Parent XML element
Row XML element
Fields
Zipped
Encoding


Dimension Update / Lookup


Implements slowly changing dimensions (Type 1 and Type 2).
Can be used for updating a dimension table and for looking up values in a dimension (lookup; if not found, then update/insert).
Each entry in the dimension table has the following fields (see the table sketch after this list):
Technical key: The primary (surrogate) key of the dimension
Version field: Version of the dimension entry (a revision number)
Start of date range: Field containing validity starting date
End of date range: Field containing the validity ending date
Keys: Business keys used in source systems; used for lookup
functionality.
Fields: Actual information of a dimension and can be set individually
to update all versions or to add a new version when a new value
appears.
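As a sketch only (PDI generates the actual DDL for you; names follow the dim_product example used later in this course), a Type 2 dimension table with these fields could look like:

CREATE TABLE dim_product
(
  productid INTEGER        -- technical (surrogate) key
, version INTEGER          -- revision number of the entry
, date_from TIMESTAMP      -- start of the validity date range
, date_to TIMESTAMP        -- end of the validity date range
, productcode VARCHAR(50)  -- business key from the source system
, productname VARCHAR(70)  -- dimension attribute ("field")
, buyprice DECIMAL(10,2)   -- dimension attribute ("field")
);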


Combination Update / Lookup


Stores information in a junk-dimension table.
Junk-dimension tables are often comprised of one or more combinations of simple dimension attributes, where each unique combination is the distinct key.
Step activity (sketched in SQL after this list):
Look up the combination of business key field1..fieldn from the input stream in the dimension table.
If this combination of business key fields exists, return its technical key (surrogate id).
If this combination of business keys doesn't exist yet, insert a row with the new key fields and return its (new) technical key.
Put all input fields on the output stream, including the returned technical key, but remove all business key fields if "remove lookup fields" is true.
Will only maintain the key information:
Non-key information in the dimension table needs to be updated separately.
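A sketch of this lookup/insert behaviour in SQL (invented names):

SELECT technical_key FROM junk_dim WHERE field1 = ? AND field2 = ?;
-- found:     put technical_key on the output row
-- not found: INSERT INTO junk_dim (technical_key, field1, field2)
--            VALUES (<next surrogate key>, ?, ?);   -- then return the new key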


Introduction to the Training Data


Introduction to the Training Data


Represents the fictitious company "Steel Wheels":
Buys collectable model cars, trains, trucks, etc. from manufacturers.
Sells to distributors across the globe.
Data adapted from the sample data provided by Eclipse's BIRT project.
The pentaho_oltp database has many tables:
Offices, Employees, Customers, Products, Orders, Orderdetails, Payments


Pentaho Training Data: Tables


Offices
7 offices worldwide (San Francisco, Boston, NYC, Paris, Tokyo,
Sydney, London)
Headquartered in San Francisco, CA
Each office is assigned to a sales territory (APAC, NA, EMEA or
JAPAN)
Employees
23 employees: 6 Execs and 17 Sales Reps
Each assigned to one of the seven offices
Sales Reps also assigned to a number of customers (distributors)
New Sales Reps (that are still in training) don't have assigned customers.


Pentaho Training Data: Tables


Customers
Steel Wheels has 122 customers worldwide.
Approximately 20 of those are new customers without a sales rep or any orders.
Each has a credit limit which determines the maximum outstanding balance.
Products
110 unique models purchased from 13 vendors.
Classified into 7 distinct product lines: Classic Cars, Vintage Cars, Motorcycles, Trucks and Buses, Planes, Ships, Trains.
Additionally, models are classified based on their scale (e.g. 1:18, 1:72, etc.).
Cost paid and MSRP (suggested retail price).
Payments
Customers make payments on average 2-3 weeks after they place an order.
In some cases one payment covers more than one order.


Pentaho Training Data: Tables


Orders
2560 orders, which span the period from 1/1/2000 to 12/31/2007
Each in a given state: In Process, Shipped, Cancelled, Disputed,
Resolved, or On Hold.
OrderDetails
Order line items reflect negotiated price and quantity per product
Training data has 23,640 OrderDetails


Database Schema: OLTP (Source Database)


Database schema is in $PENTAHO_TRAINING/data/models


Database Schema: OLAP (Target Database)


Database schema is in $PENTAHO_TRAINING/data/models


Data Warehouse Steps


Target Database Design (your Data Warehouse)


Prior to defining mappings from sources, the target database must be
designed.
Staging tables and/or file format designs are often identical to source
format to simplify extract processing.
Target star schemas must be designed to meet analytical processing
(OLAP) needs as well as feasibility of load from identified source data.


Source to Target Mapping


Identifies how each target table/column will be populated from the
sources.
Includes details of the following:
Source table/column or how the value is otherwise derived
Data types/lengths and any format Transformation
Special cleansing or Transformation logic
Exception handling
This document will aid in creation of the actual programming
specifications for the ETL developers or in creation of instructions
(technical meta data) for the ETL tool.


Dimensional Design in your Data Warehouse


Best Practice Analytical Database Design (Kimball, et al.)

(Star schema diagram: a central Sales Facts fact table joined to Geography, Employee, Product, Customer and Time dimension tables.)

A fact table contains items that you want to measure. For example:
Revenue
Amount sold
Average price
Metrics are the values you are trying to report.

Dimensions are the ways that you want to look at the data. For example:
By customer
By date
By product
Dimensions provide context in reports (grouping, labels, filters, etc.).

Dimensional models are often called Star-Schemas.


Why Fact- and Dimension tables?


The idea behind this separation is that you have a large fact table with much smaller dimension tables joined to it.
Fact tables are huge compared with the dimensions. They are usually huge compared with anything you have had in an OLTP database (e.g. when you store historical data or add data from external sources).
We aim to store as many facts as possible, using as little space as possible, with speedy access.
Fact and dimension tables should be joined by integers (see the sketch below).
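A sketch of a typical star-schema query, joining the fact table to its dimensions via integer surrogate keys (table and column names are illustrative):

SELECT c.customername, t.year_nr, SUM(f.revenue) AS revenue
FROM sales_facts f
JOIN dim_customer c ON f.customer_tk = c.customer_tk
JOIN dim_time t ON f.time_tk = t.time_tk
GROUP BY c.customername, t.year_nr;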


Slowly Changing Dimensions


In a dimension table the value of an attribute may change over time, and you may need to reference both the old and the new value.
Examples:
You have a product dimension with the price as an attribute. The price changes over time, and you may want to calculate the profit for certain periods using the product price valid at that time. Then you need to keep track of these changes.
You have a customer dimension with the location (e.g. defined by the zip code) as an attribute. Customers can relocate and get another zip code. The geographically organized sales team made budgets for their customers and wants to compare them with the actuals. What happens when a customer relocates into another sales region?
You can solve this easily by storing your dimensions using the concept of Slowly Changing Dimensions.


Slowly Changing Dimensions


Type 1 Dimension:
New information overwrites the old information.
Old information is not saved; it is lost.
Can only be used in applications where maintaining a chronicle of data is not essential; used for updates only.
Type 2 Dimension:
New information is appended to the old information.
Old information is saved; it is effectively versioned.
Can be used in applications where maintaining a chronicle of data is required, so that changes in a data warehouse can be tracked.
Type 3 Dimension:
New information is saved alongside the old information.
Old information is partially saved.
Additional columns are created to show the time from which the new information has taken effect.
Enables viewing facts in both the current state and what-if past states of dimensional values.
(Type 1 and Type 2 are sketched in SQL below.)
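A sketch of Type 1 versus Type 2 handling in SQL, reusing the dim_product sketch from earlier (the values, product code and open-ended end date are illustrative):

-- Type 1: overwrite in place; the old value is lost
UPDATE dim_product SET buyprice = 55.00 WHERE productcode = 'S10_1678';

-- Type 2: close the current version, then append a new one; history is kept
UPDATE dim_product SET date_to = CURRENT_TIMESTAMP
WHERE productcode = 'S10_1678' AND date_to = TIMESTAMP '2199-12-31 23:59:59';
INSERT INTO dim_product (productid, version, date_from, date_to, productcode, buyprice)
VALUES (1235, 2, CURRENT_TIMESTAMP, TIMESTAMP '2199-12-31 23:59:59', 'S10_1678', 55.00);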


Dimension Update / Lookup


Implements slowly changing dimensions: Type 1 and Type 2
Can be used for updating a dimension table and for looking up values in a
dimension. Lookup, if not found, then update/insert.
In the dimension implementation, each entry in the dimension table has the following fields:
Technical key: This is the primary (surrogate) key of the dimension.
Version field: Shows the version of the dimension entry (a revision number).
Start of date range: This is the fieldname containing validity starting date.
End of date range: This is the fieldname containing the validity ending date.
Keys: These are business keys used in source systems such as customer no,
product id, etc. These are used for lookup functionality.
Fields: These fields contain the actual information of a dimension and can be
set individually to update all versions or to add a new version when a new
value appears.


Example for Slowly Changing Dimensions


We create a new dimension for the product from the pentaho_oltp
database into the pentaho_olap database.

(Diagram: from the OLTP source to the OLAP target.)

Example for Slowly Changing Dimensions


The dimension step added the fields:
productid: The technical key
version: The version of the dimension row as a reference
date_from, date_to: The valid date range for the current dimension row.


Example for Slowly Changing Dimensions


We renamed the following fields, adding the dimension name.
This is recommended and useful so that later analyses or reports show which fields belong to a certain dimension:
quantityinstock
buyprice
msrp


Example for Slowly Changing Dimensions


The data within our dim_product table looks like this:

Look at the first line: this is the default row returned when you look up a dimension and the key is not found. This row (for not-found entries) is created automatically by PDI with null values.


Example for Slowly Changing Dimensions


Now we have a price change for two products:


Example for Slowly Changing Dimensions


The price change came from a price file in which the other information (like productname) was missing, so those fields are missing in the new versions:

We can look them up before the dimension load.


Example for Slowly Changing Dimensions


You have to uncheck the Update the dimension? box.

Then enter the fields you want to retrieve (mind the change of the columns, especially Type of return field instead of Type of dimension update):


Example for Slowly Changing Dimensions


The handling of the Stream Datefield:

When the date field is empty, the current date and time is used for lookups and new inserts.
When you have a date field with a valid-from date, you can use it here.


Example for Slowly Changing Dimensions


Example with a different date field used:


Example for Slowly Changing Dimensions


If you do a lookup of a dimension and want to include the effective version and the date ranges of the found row, add them to the fields:


Lookups


Lookups
The Lookup feature of PDI accesses a data source to find values according to defined matching criteria, i.e. a key.
The following steps have lookup functionality in PDI:
Commonly Used
Database Lookup
Stream Lookup
Merge Join
Others

Database Join
Call database procedure
Dimension Update/Lookup
Combination Update/Lookup
HTTP Lookup


Database Lookup
Lookup attributes from a single table based on key-matching criteria
Options for performing database lookup include:
Lookup table: The name of the table where the lookup is done.
Enable cache: This option caches database lookups for the duration of
the Transformation.
Enabling this option can increase performance.
Danger: If other processes are changing values in the table do not
set this option.
Load all data from table: Preload the complete data in memory at the
initialization phase. This can replace a Stream Lookup step in
combination with a Table Input step and is faster.


Database Lookup (cont)

SELECT
ATTRIB1 as FullName
FROM
lookup_table
WHERE
ID = <<value of field in stream>>


Stream Lookup
Allows users to look up data using information coming from other steps in the
transformation.
The data coming from the Source step is first read into memory (cache) and is
then used to look up data for each record in the main stream.
Options provided by Kettle GUI for performing stream lookup include:
Source step: The step from which to obtain the in-memory lookup data
Key(s) to lookup value(s): Allows user to specify names of fields that are used
to lookup values. Values are always searched using the equal comparison.
Fields to retrieve: User can specify the names of the fields to retrieve, as
well as the default value in case the value was not found or a new fieldname
in case the user wishes to change the output stream field name.


Stream Lookup (Example)


The following Transformation adds information coming from a text file to data coming from a database table:

B is the source step. It is where the in-memory lookup stream resides.


Merge Join
Takes TWO sorted streams and performs a traditional JOIN on EQUALITY
INNER = Only output a row when the key is in both streams
LEFT OUTER = Output a row even if there is no matching key in 2nd
Step
RIGHT OUTER = Output a row even if there is no matching key in 1st
Step
FULL OUTER = Output a row regardless of matching
Options provided by PDI GUI for merge join include:
First Step: Step to refer to as the 1st
Second Step: Step to refer to as the 2nd
Keys for 1st: The key fields from the 1st stream
Keys for 2nd: The key fields from the 2nd stream
Join Type: INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER
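As a rough SQL analogy (the table and column names here are invented for illustration), an INNER merge join on a customerid key corresponds to:

SELECT o.orderid, c.customername
FROM orders o
INNER JOIN customers c ON o.customerid = c.customerid
-- both PDI input streams must already be sorted on customerid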


Merge Join (cont)

(Screenshot: a Merge Join step configured with join type FULL OUTER.)

Database Join
Options provided by Kettle GUI for database join procedure include:
SQL: The SQL query to launch towards the database.
Number of rows to return: 0 means all, any other number limits the
number of rows.
Outer join?: When checked, will always return a single record for each
input stream record, even if the query did not return a result.
The parameters to use in the query.
Parameters noted as ? in the query
Order of fields in parameter list must match the order of the ? in the
query.
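For example, a query with two parameters could look like this (table and field names are invented for illustration); the first ? is bound to the first field in the parameter list, the second ? to the second:

SELECT customername, creditlimit
FROM customers
WHERE customerid = ?
  AND country = ?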


Call Database Procedure


Executes a database procedure (or function) and gets the result(s) back.
Options for call database procedure include:
Proc-name: Name of the procedure or function to call.
Enable auto-commit: This can be used to perform updates in the
database using a specified procedure. The user can either have the
changes done using auto-commit or by disabling this. If auto-commit is
disabled, a single commit is performed after the last row is received
by this step.
Result name:
When calling a database function, this field is needed.
When calling a database procedure, this field must not be entered.


Call Database Procedure (cont)


Result type: Type of result of function call. Not used in case of a
procedure.
Parameters: List of parameters that the procedure or function needs:
Field name: Name of the field.
Direction: Can be either IN (input only), OUT (output only), INOUT (value
is changed on the database).
Type: Used for output parameters so that Kettle knows what comes back.


Call Database Procedure (cont)


Other buttons available in this step are:
Find it... button: Searches the specified database connection for available procedures and functions.
Get fields button: Fills in all the fields of the input stream to make the process easier; you then delete the lines that are not needed and reorder the ones that are.


Dimension Lookups
Uses the same dimension step used for updating
Same fields, same setup
The Stream Datefield is matched against the validity range: stream_date between EFFECT and EXPIRE
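Conceptually (a sketch only, reusing the date_from/date_to names from the earlier dim_product example), the lookup resolves to something like:

SELECT productid
FROM dim_product
WHERE product_code = ?                 -- business key from the stream
  AND ? >= date_from AND ? < date_to   -- stream date inside the validity range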


HTTP Lookup
Covered in Web Service module


Field Transformations Part 1


Field Transformations
Field Transformations are steps that operate at the field level within a
stream record.
The step types covered in this section include:
Select Values
Calculator
Add Constants
Null If


Select Values
This step type is used to:
Select/remove fields from the process stream.
Rename fields
Specify/change the length and/or precision of fields.
3 Tabs are provided:
Select and Alter: Specify the exact order and name in which the fields
have to be placed in the output rows.
Remove: Specify the fields that have to be removed from the output
rows.
Meta-data: Change the name, type, length and precision (the metadata) of one or more fields.


Select Values (cont)


Options provided for this step include:
Step name: Name of the step. This name has to be unique in a single
Transformation.
Attributes that can be changed for a given field:
Fieldname: The fieldname to select or change
Rename to: To be left blank if rename not required.
Length: Number has to be entered to specify the length (-1: no length
specified).
Precision: Number has to be entered to specify the precision (-1: no precision).


Calculator
Provides a list of functions that can be executed on field values.
An important advantage Calculator has over custom JavaScript scripts is
that the execution speed of Calculator is many times that of a script.
Besides the arguments (Field A, Field B and Field C) the user also needs
to specify the return type of the function.
You can also opt to remove the field from the result (output) after all
values were calculated. This is useful for removing temporary values.


Calculator (cont)
The list of functions supported by
the calculator includes commonly
used mathematical and date
functions.


Add Constants
Adds constants to a stream.
The use is very simple:
Specify the name
Enter value in the form of a string
Specify the formats to convert the value into the chosen data type.


Null If
If the string representation of a field is equal to a specified value, then the output value is set to null (empty).


Set Transformations


Set Transformations
Set Transformations are steps that operate on the entire set of data
within a stream.
These operations work across all rows, not strictly within a row
The steps covered in the section include:
Filter Rows
Sort Rows
Join Rows
Merge Rows
Unique Rows
Aggregate Rows
Group By


Filter Rows
Filter rows based upon conditions and comparisons with full boolean logic
supported.
Output can be diverted into 2 streams: Records which pass (true) the condition
and records which fail (false).
Often used to:
Identify exceptions that must be written to a bad file
Branch transformation logic if single source has two interpretations
The options provided for this step include:
Send true data to step: Which step receives those rows which pass the
condition.
Send false data to step: Which step receives those rows which fail the
condition.


Sort Rows
Sort rows based upon specified fields, including sub sorts, in ascending
or descending order.
The options provided for this step include:
A list of fields and whether they should be sorted ascending or not.
Sort directory: This is the directory in which the temporary files are stored when needed. The default is the system's standard temporary directory.
Sort size: The more rows you can store in memory, the faster the sort. Eliminating the need for temp files reduces costly disk I/O.
The TMP-file prefix: Choose a recognizable prefix to identify the
files when they show up in the temp directory.


Join Rows
Produces combinations of all rows on the input streams.
The options provided by PDI on this feature include:
Step name: Name of the step; name has to be unique.
Main step to read from: Specifies the step to read the most data from. This step is not cached or spooled to disk; the others are.
The condition: User can enter a complex condition to limit the number of output rows. If empty, the result is a Cartesian product.
Temp directory: Specify the name of the directory where the system
stores temporary files.
Temporary file prefix: This is the prefix of the temporary files that
will be generated.
Max. cache size: The number of rows to cache before the systems
reads data from temporary files.


Merge Join
The Merge Join step performs a classic merge join between data sets
with data coming from two different input steps. Join options include
INNER, LEFT OUTER, RIGHT OUTER, and FULL OUTER.
The options provided by PDI on this feature include:
Step name: Name of the step; name has to be unique.
First Step: Specify the first input step to the merge join.
Second Step: Specify the second input step to the merge join.
Join Type: INNER, LEFT OUTER, RIGHT OUTER, or FULL OUTER
Keys for 1st step: Specify the key fields on which the incoming data
is sorted.
Keys for 2nd step: Specify the key fields on which the incoming data
is sorted.


Sorted Merge
The Sorted Merge step merges rows coming from multiple input steps, provided these rows are themselves sorted on the given key fields.
The options provided by PDI on this feature include:
Fields: Specify the fieldname and sort direction
(ascending/descending).


Merge Rows
Compares and merges two streams of data
Reference Stream
Compare Stream
Mostly used to identify deltas in source data when no timestamp is
available
Reference Stream = The previously loaded data
Compare Stream = The newly extracted data from the source
Usage note: Ensure streams are sorted by comparison key fields
The output row is marked as follows:
identical: The key was found in both streams and the values to
compare were identical.
changed: The key was found in both streams but one or more
values is different.
new: The key was not found in the reference stream.
deleted: The key was not found in the compare stream.


Unique Rows
Removes duplicates from the input stream.
Usage Note: Only consecutive records will be compared for duplicates, thus the
stream must be sorted by comparison fields.
The options provided for this step include:
Add counter to output?: Enable this to know how many duplicates there were for each row in the output.
Counter fields: Name of the numeric field containing the number of
duplicate rows for each output record.
Fieldnames: A list of field names on which the uniqueness is compared. Data
in the other fields of the row is ignored.
Ignore Case Flag: Allows case insensitive matching on string fields.


Aggregate Rows
Generates unique rows and produces aggregate metrics.
The available aggregation types include SUM, AVERAGE, COUNT, MIN,
MAX, FIRST and LAST.
THIS STEP TYPE IS DEPRECATED AND SHOULD NOT BE USED
Use Group By step type instead.


Group By
Calculates aggregated values over a defined group of fields.
Operates much like the group by clause in SQL.
The options provided for this step include:
Aggregates: Specify the fields that need to be aggregated, the method (SUM,
MIN, MAX, etc.) and the name of the resulting new field.
Include all rows: If checked, the output will include both the new aggregate
records and the original detail records. You must also specify the name of
the output field that will be created and hold a flag which tells whether the
row is an aggregate or a detail record.
Very nice feature: The aggregate function Concatenate strings separated by can be used to create a list of keys like 117, 131, 145, ...
The input needs to be sorted; alternatively, use the Memory Group By step, which handles unsorted input.
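As a rough SQL analogy (table and column names are invented for illustration), a Group By step summing a quantity per product corresponds to:

SELECT productid, SUM(quantity) AS total_quantity
FROM fact_orderline
GROUP BY productid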


Pivot Transformations


Pivot Transformations
Pivot Transformations are steps which flip the axis of the data (from rows
to columns and vice-versa).
Steps that are covered in this section:
Row Normalizer
Denormalizer
Row Flattener


Row Normalizer
Normalizes rows of data
For example, the input (one row per week, one column per metric):

weekdate     Miles   Loaded_miles   Empty_miles
2001-01-07   1996    1996
2001-01-28   587     539            48
..           ..      ..             ..

is normalized into one row per week and metric type:

Weekdate     Metric Type    Quantity
2001-01-07   Miles          1996
2001-01-07   Loaded Miles   1996
2001-01-07   Empty Miles
2001-01-28   Miles          587
2001-01-28   Loaded Miles   539
2001-01-28   Empty Miles    48
..           ..             ..

This result transforms column names into row descriptor values.

It is possible to normalize more than one field at a time, whereby groups of columns generate unique rows.


Row Normalizer (cont)


The options provided for this step include:

Step name: Name of the step. This name has to be unique in a single
transformation.

Type field: The name of the type field. (Metric Type in our example)

Fields: A list of fields to normalize.


Fieldname: Name of the fields to normalize (Miles, Loaded_Miles,
Empty_Miles in our example).
Type: Give a string to classify the field (Miles, Loaded Miles, Empty
Miles in our example).
New field: Can give one or more fields where the new value should be transferred to (Quantity in our example)


Row Denormalizer
Denormalizes data by looking up key-value pairs.
For example, the normalized input (one row per week and metric type):

Weekdate     Metric Type    Quantity
2001-01-07   Miles          1996
2001-01-07   Loaded Miles   1996
2001-01-07   Empty Miles
2001-01-28   Miles          587
2001-01-28   Loaded Miles   539
2001-01-28   Empty Miles    48

is denormalized back into one row per week, one column per metric:

weekdate     Miles   Loaded_miles   Empty_miles
2001-01-07   1996    1996
2001-01-28   587     539            48
..           ..      ..             ..

Row Denormalizer (cont)


The options provided for this step include:
Step name: Name of the step. This name has to be unique in a single
transformation.
Key field: The field that defines the key (Metric Type).
Group fields: Specify the fields that make up the grouping
(Weekdate).
Target fields: Specify the fields to de-normalize by specifying the String value for the key field (Quantity -> Miles, Loaded_Miles, Empty_Miles).
Options are provided to convert data types. Most designs use Strings to store values, so this is helpful if the value is really a number or date.
In case there are key-value pair collisions (the key is not unique for the group specified), specify the aggregation method to use to compute the new value.


Row Flattener
Flattens sequentially provided rows

Usage Notes
Rows must be sorted in proper order.
Use denormalizer if Key-Value pair intelligence is required for
flattening.
For example, the input (the field Flatten is flattened two rows at a time):

Field1   Field2   Field3   Flatten
A        B        C        One
A        B        C        Two
D        E        F        Three
D        E        F        Four

becomes:

Field1   Field2   Field3   Target1   Target2
A        B        C        One       Two
D        E        F        Three     Four

Row Flattener (cont)


The options provided for this step include:
Step name - Name of the step. This name has to be unique in a single transformation.
The field to flatten - The field that needs to be flattened into different target fields (e.g. Flatten)
Target fields - The name of the target fields to flatten to (e.g. Target1, Target2)


Closure Generator
This step was created to allow you to generate a Reflexive Transitive
Closure Table for Mondrian.
Technically, this step reads all input rows in memory and calculates all
possible parent-child relationships. It attaches the distance (in levels)
from parent to child.
The options provided for this step include:

Step name- The name that uniquely identifies the step.


Parent ID field - The field name that contains the parent ID of the parent-child relationship.
Child ID field - The field name that contains the child ID of the parent-child relationship.
Distance field name - The name of the distance field that will be added to the output.
Root is zero - Check this box if the root of the parent-child tree is not empty (null) but zero (0).


Field Transformations Part 2


Field Transformations Part 2


Field Transformations are steps that operate at the field level within a
stream record
The step types covered in this section include:
Add Sequence
Regex Evaluation
Split Fields
Value Mapper


Add Sequence
Adds a sequence number to the stream.
A sequence is an ever-changing integer value with a defined start value and increment.
Options provided for this step include:
Name of value - Name of the new field that is added to the stream.
Use DB to get sequence - Option to be enabled when the sequence is to be driven by a database sequence.
Connection name - Choose the name of the connection on which the database sequence resides.
Sequence name - The name of the database sequence.
Use counter to calculate sequence - Enable to have the sequence generated by Kettle. Be careful: Kettle-generated sequences are created anew for each run of the transformation.


Add Sequence (cont)


Add Sequence (db sequence)


PDI sequences aren't persistent
Use database sequences for generating surrogate keys


Regex Evaluation
Evaluates a Regular Expression
Field to Evaluate - the name of the EXISTING field that contains the string you want to perform the evaluation against
Result Fieldname - the name of the NEW field to put the result in. Values: Y/N
Regular Expression - the regular expression to evaluate
Other Options: Case Sensitivity, Encodings, Whitespace, etc.


Regex Evaluation (cont)


Split Fields
Split fields based upon delimiter
Options provided for this step include:
Field to split- The name of the field you want to split.
Delimiter- Delimiter that determines the end of values in the field.
Fields- List of fields to split into.

Original Field:   12/31/2007

Multiple Fields:  12 | 31 | 2007

Split Fields (cont)


Value Mapper
Maps an input value to a new output value based on a mapping table
This is usually done in a data-driven manner with a database table; however, this step allows you to define the mapping table in the step itself
Useful if the mapping table is small and rarely or never changes
For example, if the user wants to replace Gender Types:
Fieldname to use: gender_code
Target fieldname: gender_desc
Default Upon: if the input does not match, use this value (the else case)
Source/Target Mapping: F->Female, M->Male


Value Mapper (cont)


Loading the Time Dimension and the Fact Table


The Time Dimension


Every Data Warehouse needs a Time Dimension. Since everything has to exist in time, it is probably the most general dimension in a data warehouse.
In general it does not have many rows (if your granularity is on a daily basis, you would need about 3,650 rows for ten years).
It covers hierarchies, e.g. weeks, months, quarters, years.
It can also hold special holidays or other calendar-specific information (workdays, fiscal years), so the row can be very wide.
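A minimal sketch of such a table (the DDL and all column names are illustrative, not the course's schema):

CREATE TABLE dim_time (
  time_id    INTEGER,   -- surrogate key, e.g. 20010107
  the_date   DATE,
  week       INTEGER,   -- hierarchy levels
  month      INTEGER,
  quarter    INTEGER,
  year       INTEGER,
  is_workday CHAR(1)    -- calendar-specific flags make the row wider
)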


The Time Dimension


Example:


The Time Dimension


Example Transformation:


Loading the Fact Table


In the majority of cases, the dimension tables are loaded and updated
before the fact table is loaded.
Sometimes you create a dimension out of (new) fact table entries.
Assuming all dimensions are in place, we have to do the following tasks:
Assemble the measures for the facts, e.g. the orders and order
details (decide to do this with or without delta loading).
Lookup the keys with the dimension lookup steps
Optional: do some extra cleanup
Load the fact table


Loading the Fact Table


Examples for the tasks:
Assemble the measures for the facts, e.g. the orders and order details (decide whether to do this with or without delta loading).

Start with the details, like order lines with the product-specific information
Look up the order headers to get some customer specifics and the order date


Loading the Fact Table


Examples for the tasks:
Lookup the keys with the dimension lookup steps

This replaces e.g. the product code with the technical key productid


Loading the Fact Table


Examples for the tasks:
Optional: do some extra cleanup
Load the fact table

We add exception handling and calculate the total price.


Introduction to Jobs


Jobs
Jobs aggregate individual pieces of functionality to implement an entire process
Individual pieces: FTP files, load staging, load warehouse
Job: Nightly Warehouse Update
Think workflow for ETL
The basic composition of a Job is
Job Hops
Job Settings
Job Entries


Job Hops
A job hop is a graphical representation of the link between two job entries
A hop always links two job entries and can be set (depending on the type of originating job entry) to execute the next job entry unconditionally, after successful execution, or after failed execution
The execution order is indicated with an arrow on the graphical view pane
Unconditional
True (Success)
False (Error)


Job Hops (cont.)


Besides the execution order, a job hop also specifies the condition on
which the next job entry will be executed.
Unconditional specifies that the next job entry will be executed
regardless of the result of the originating job entry.
Follow when result is true specifies that the next job entry will
only be executed when result of originating job entry was true.
Follow when result is false specifies that the next job entry will
only be executed when the result of the originating job entry was
false.


Job Settings
Job settings are the options that control the behavior of a job and the method of logging a job's actions.

NOTE: Logging is covered in detail in the Logging module


Job Entry
A job entry is a primary building block of a job
Execute transformations, retrieve files, generate email, etc.
A single job entry can be placed multiple times on the canvas.


Job Entry Types


The following types are covered in this section
Start
Dummy
OK
Error
Transformation
Sub-Job
Shell
eMail
SQL
FTP
Table Exists
File Exists
Evaluation
SFTP
HTTP


Start
Defines the starting point for job execution
Only unconditional job hops are available from a Start job entry.
The start icon also contains basic scheduling functionality.


Dummy
Use the Dummy job entry to do nothing in a job.
This can be useful to make job drawings clearer or for looping.
Dummy performs no evaluation.


Transformation
Execute a transformation.
The options provided for this job entry are:

Name of the job entry- This name has to be unique in a single job. A
job entry can be placed several times on the canvas, however it will
be the same job entry.
Name of transformation- The name of the transformation to start.
Repository directory- The directory in repository where
transformation is located.
Filename- Specify the XML filename of the transformation to start.
Specify log file- Check this if you want to specify a separate logging
file for the execution of this transformation.
Name of log file- The directory and base name of the log file (for
example C:\logs).
Extension of the log file- The filename extension (for example: log or
txt)


Transformation (cont.)
Include time in filename - Adds the system time to the filename.
Logging level - Specifies the logging level for the execution of the transformation.
Copy previous results to arguments - The results from a previous transformation can be sent to this one using the Copy rows to result step.
Arguments - Specify the strings to use as arguments for the transformation.
Execute once for every input row - Support for looping has been added by allowing a transformation to be executed once for every input row.
Clear the list of result rows before execution - Checking this makes sure that the list of result rows is cleared before the transformation is started.
Clear the list of result files before execution - Checking this makes sure that the list of result files is cleared before the transformation is started.


Transformation (example)
Using Repository

Using Files


Job (aka Sub Job)


Executes a job.
The options provided for this job entry are:

Name of the job entry - This name has to be unique in a single job. A job entry can be placed several times, however it will be the same job entry.
Name of job - The name of the job to start.
Repository directory - The directory in the repository where the job is located.
Filename - If you're not working with a repository, specify the filename of the job to start.
Specify log file - Check this if you want to specify a separate logging file for the execution of this job.
Name of log file - The directory and base name of the log file (for example C:\logs)


Job (aka Sub Job) (cont.)

Extension of the log file - The filename extension (for example: log or txt)
Include date in filename - Adds the system date to the filename.
Include time in filename - Adds the system time to the filename.
Logging level - Specifies the logging level for the execution of the job.
Copy previous results to arguments - The results from a previous transformation can be sent to this job using the Copy rows to result step in a transformation.
Arguments - Specify the strings to use as arguments for the job.
Execute once for every input row - This implements looping. If the previous job entry returns a set of result rows, you can have this job executed once for every row found. One row is passed to this job at every execution. For example, you can execute a job for each file found in a directory using this option.


Job (aka Subjob) example


Using Repository

Using Files


Job (aka SubJob) Example


Shell
Executes a shell script on the host where the job is running.
The options provided for this job entry are:

Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.

Filename - The filename of the shell script to start.

Specify log file - Check this if you want to specify a separate logging
file for the execution of this shell script.

Name of log file - The directory and base name of the log file (for
example C:\logs).

Extension of the log file - The filename extension (for example: log or
txt).


Shell (cont.)
Include date in filename - Adds the system date to the filename.
Include time in filename - Adds the system time to the filename.
Logging level - Specifies the logging level for the execution of the shell.
Copy previous results to arguments - The results from a previous transformation can be sent to the shell script using the Copy rows to result step.
Arguments - Specify the strings to use as arguments for the shell script.
Execute once for every input row - This implements looping. If the previous job entry returns a set of result rows, you can have this shell script executed once for every row found. One row is passed to this script at every execution, in combination with the copy previous results to arguments option. The values of the corresponding result row can then be found on command line arguments $1, $2, ... (%1, %2, %3, ... on Windows).


Mail
Send an e-Mail.
The options provided for this job entry are:

Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.

Destination address - The destination for the e-Mail.

Use authentication - Check this if your SMTP server requires you to authenticate yourself.

Authentication user - The user name to authenticate with.

Authentication password- The password to authenticate with.

SMTP server - The mail server to which the mail has to be sent.

Reply address - The reply address for this e-Mail.

Subject - The subject of the e-Mail.


Mail (cont.)
Include date in message - Check this if you want to include date in the e-Mail.
Contact person - The name of the contact person to be placed in the e-Mail.
Contact phone - The contact telephone number to be placed in the e-Mail.
Comment - Additional comment to be placed in the e-Mail.
Attach files to message - Check this if you want to attach files to this message.
Select the result file types to attach - When a transformation (or job) processes files (text, excel, dbf, etc.) an entry is added to the list of files in the result of that transformation or job. Specify the types of result files you want to add.
Zip files into a single archive - Check this if you want to zip all selected files into a single archive.
Zip filename - Specify the name of the zip file that will be placed into the e-mail.


SQL
Execute an SQL script
You can execute more than one SQL statement, provided that they are
separated by semi-colons
The options for this job entry are:

Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.

Connection - The database connection to use.

SQL script - The SQL script to execute.
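For instance, a small script with two statements separated by a semicolon (the table names are invented for illustration):

TRUNCATE TABLE stage_orders;
INSERT INTO load_audit (event_name, event_time)
VALUES ('staging cleared', CURRENT_TIMESTAMP);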


FTP
Retrieve one or more files from an FTP server.
The options provided for this job entry are:

Name of the job entry - The name of the job entry.

FTP server name - The name of the server or the IP address.

User name - The user name to log into the FTP server.

Password - The password to log into the FTP server.

Remote directory - The remote directory on FTP server from which files are taken.

Target directory - The directory on the machine on which Kettle runs in which you want
to place the transferred files

Wildcard - Specify a regular expression here if you want to select multiple files.

Use binary mode? - Check this if the files need to be transferred in binary mode.

Timeout - The FTP server timeout in seconds.

Remove files after retrieval? - Remove the files on the FTP server, but only after all
selected files have been successfully transferred.


Table Exists
Verifies if a certain table exists on a database.
The options provided for this job entry are:

Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry.

Database connection - The database connection to use.

Table name - The name of the database table to check.


File Exists
Verifies if a certain file exists on the server on which PDI runs.
The options provided for this job entry are:

Name of the job entry - The name of the job entry. This name has to
be unique in a single job. A job entry can be placed several times on
the canvas, however it will be the same job entry

Filename - The name and path of the file to check for

Variable - Select the variable to use as filename

Browse - Look for the file on the file system


JavaScript Evaluation
Calculates a boolean expression.
This result can be used to determine which next step will be executed.
The following variables are available for the expression:

errors: number of errors in the previous job entry (long).

lines_input: number of rows read from database or file (long).

lines_output: number of rows written to database or file (long).

lines_updated: number of rows updated in a database table (long).

lines_read: number of rows read from a previous transformation step (long).

lines_written: number of rows written to a next transformation step (long).

files_retrieved: number of files retrieved from an FTP server (long).

exit_status: the exit status of a shell script (integer).

nr (integer): the job entry number. Increments at every next job entry.

is_windows: true if Kettle runs on MS Windows (boolean).
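A sketch of such an expression, using only the variables listed above; the entry evaluates to true (success) when the previous entry had no errors and wrote at least one row:

// follow the "true" hop only on a clean, productive previous entry
errors == 0 && lines_output > 0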


SFTP
Retrieves one or more files from an FTP server using the Secure FTP
protocol.
The options provided for this job entry are:
Name of the job entry - The name of the job entry.
SFTP-server name (IP) - The name of the SFTP server or the IP
address.
SFTP port - The TCP port to use. This is usually 22.
User name - The user name to log into the SFTP server.
Password - The password to log into the SFTP server.
Remote directory - Remote directory on SFTP server from which get
files.
Target directory - The directory on the machine on which Kettle runs
in which you want to place the transferred files.
Wildcard - Specify a regular expression if you want to select multiple
files.
Remove files after retrieval? - Remove the files on the SFTP server, but only after all selected files have been successfully transferred.

HTTP
Gets a file from a web server using the HTTP protocol.
The options provided for this job entry are:
Name of the job entry - The name of the job entry.
URL (HTTP) - The URL to use (example: http://www.kettle.be/index.html).
Run for every result row - Check this if you want to run this job entry for every row that was generated by a previous transformation. Use the Copy rows to result step.
Fieldname to get URL from - The fieldname in the result rows to get the URL from.
Target filename - What to call the downloaded file.
Append to target file - Append to the target file if it already exists.


HTTP (cont.)
Add date and time to target filename - Check this if you want to add the date and time (yyyyMMdd_HHmmss) to the target filename.
Target filename extension - Specify the target filename extension in case you're adding a date and time to the filename.
Username - The username to authenticate with. For Windows domains, put the domain in front of the user, like this: DOMAIN\Username
Password - The password to authenticate with.
Proxy server - The HTTP proxy server name or IP address.
Proxy port - The HTTP proxy port to use (usually 8080).
Ignore proxy for hosts - Specify a regular expression matching the hosts you want to ignore, / separated.


Advanced Job Concepts


Exchanging data between transformations


Excursion - Mapping: Within a Transformation you can have a sub-transformation, called a Mapping:
Example for a main Transformation:

Example for a sub Transformation:

Use case: if you need a part of a transformation to be reused by other transformations.
Due to the mapping specifications you have all fields available at design time.


Exchanging data between transformations


Within a Job you can have two or more Transformations exchanging data.
Rows are exchanged between transformations in memory, for example:


Exchanging data between transformations


If you want to put/get data, use the steps Copy rows to result and Get rows from result within a transformation.
Within a job, uncheck Clear list of result rows before execution for the transformation that should receive data from the previous transformation.

Note: Result rows can be accumulated by rows from subsequent Jobs or Transformations (since Version 2.5).


Exchanging data between transformations


You can also set arguments like a fixed input parameter:

In the transformation you get the argument with the Get System Info
step:


Exchanging data between transformations


Exchanging data via Arguments - Option: Copy previous results to args

When Execute for every input row is not checked, only the first row will be taken (when arguments are used).
Clear the list of result rows must be checked; otherwise this could lead to an infinite loop (because this transformation generates rows, too).


Exchanging data between transformations


Exchanging data via the Result set - Option: Execute for every input row
The transformation or job gets exactly one result row from the preceding result row set (Clear the list of result rows must be checked).

The transformation getting the result row looks like this:


Exchanging data between transformations


Exchanging data via the Result set: Option: Clear the list of result rows
Rule of thumb:
When you use the options: Copy previous results to args or Execute
for every input row you have to check Clear the list of result rows
Otherwise: When Clear the list of result rows is UNchecked, all
rows will be copied to the calling transformation or job.


Exchanging data between transformations


Exchanging data via Files: Instead of exchanging data only in memory, you can materialize the data. Use cases:
The data you want to transfer is too large for memory.
Transfer data to another location
Process data independent of time and location
File bugs or feature requests together with data to reproduce ;-)
With PDI you can Serialize and Deserialize to a file. An advantage over other file formats like text files or Excel is that the PDI metadata is stored together with the data, so you do not need to figure it out.


Command Line Parameters


A job can be called from a .bat /.sh with parameters like this:
call kitchen.bat /file:directory\parajob.kjb test1 test2
These parameters can be transferred to subsequent Transformations.
Attention: Copy previous results to args MUST NOT be checked,
otherwise the parameters are not transferred to the transformation.


Concept for processing files and tables


Process all tables:
Start with a Job that defines all tables to process, e.g.:

Start a Sub-Job to process all tables (result rows):


Concept for processing files and tables


The Sub-Job (processing exactly one row) sets a variable that is accessible from subsequent Transformations or Jobs.


Excursion: Variables and Data Flow


Why do you have to set variables in a different transformation?
To recap:
All steps run in their own thread; that means they are started and run in parallel
The initialization sequence is not predictable, and PDI takes care of the correct data flow (pulling and pushing data from step to step)
=> When you define and use a variable in the same transformation, it is not clear whether the setting of the variable takes place before the use.
=> Thus you have to split the transformation into a setting part and a referencing part for variables.


Concept for processing files and tables


The concept for processing files is similar:

Note: Wildcards are regular expressions. See the example of how to process all files starting with test and ending with .txt


Concept for processing files and tables


If you only use steps that allow rows as an input parameter, you do not need variables (like Text File Input with Accept filenames from previous step, or Table Input).
Use of a variable vs. information from a previous step:


Concept for processing files and tables


Most of the output steps like Text file output or Table Output do not
support this, so you need variables. Here is an example for the Excel
output:


Concept for sending results & alerts by mail

For every file that is written to the file system e.g. by the Text File
Output step or Excel Output step, its filename is stored in a List of
result files.
This list can be processed by a transformation with the step Get files
from result.

You can programmatically create this list by the step Set files in
result:


Concept for sending results & alerts by mail

The list of result files can be used to automatically send all produced
files via mail:


Concept for sending results & alerts by mail

Besides the mail addresses (To, CC, BCC, Reply To, From) and the mail server settings, you can enter the following options:


Concept for sending results & alerts by mail

When you uncheck Only send comment in mail body, you get the following details in the mail message body together with the message comment:
Enclosed you find the latest ....

Job:
----
JobName   : Mail send testfile
Directory : /
JobEntry  : Mail 1

Message date: 2007/10/29 15:35:02.565

Previous results:
-----------------
Job entry Nr       : 1
Errors             : 0
Lines read         : 0
Lines written      : 0
Lines input        : 0
Lines output       : 0
Lines updated      : 0
Script exit status : 0
Result             : true

Path to this job entry:
-----------------------
Mail send testfile
Mail send testfile : : start : Start of job execution (2007/10/29 15:35:01.858)
Mail send testfile : : START : start : Start of job execution (2007/10/29 15:35:01.858)
Mail send testfile : : START : [nr=0, errors=0, exit_status=0, result=true] : Job execution finished (2007/10/29 15:35:01.859)
Mail send testfile : : Generate Testfile : Followed unconditional link : Start of job execution (2007/10/29 15:35:01.860)
Mail send testfile : : Generate Testfile : [nr=1, errors=0, exit_status=0, result=true] : Job execution finished (2007/10/29 15:35:02.437)
Mail send testfile : : Mail 1 : Followed link after success : Start of job execution (2007/10/29 15:35:02.438)


Concept for sending results & alerts by mail

For sending all produced files, you select General:

Select other file types to add logfiles for different detail levels.


Concept for sending results & alerts by mail

If you want to add a log you need to define this in a previous job entry
for a transformation, e.g.:


Concept for sending results & alerts by mail

Example of sending an alert by mail:

When you create more files than you want to send, there is the option to Clear list of result files when running a transformation or sub-job.


Running Job Entries in Parallel


By default, job entries run sequentially only
Even when they are designed this way:

Both transformations do not run in parallel. The sequence depends on the order in which the job entries were created.


Running Job Entries in Parallel


When you select on the context menu of any job entry Launch next
entries in parallel, all following job entries are called in parallel:


Running Job Entries in Parallel


When you want to synchronize the job entries again, you could imagine a construct like the following:

Attention: In this case, the job entry is called two times and does not wait until both entries are finished. (see the following slide)


Running Job Entries in Parallel


When you want to synchronize the job entries again, you need a construct with a wrapper job like this:

The job entry will call the job that includes the parallel tasks.
Now the log entry will be executed when both entries are finished.


Conditions
With conditions you can change the pathway of your job.
Most job entries can give back a result of true or false. E.g. if a transformation fails, the result is false.
More complex conditions can be handled by the JavaScript job entry. This is different from the JavaScript step for transformations; it only evaluates an expression.


Conditions
Within the JavaScript job entry you can use the following variables:
lines_input
lines_output
lines_updated
lines_rejected
lines_read
lines_written
files_retrieved
errors
exit_status
nr
is_windows
Note: All variables beginning with lines_ need some preparation


Conditions
Example for checking the processed lines:

The JavaScript looks like this:
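The script itself is shown as a screenshot; a minimal sketch of such a check (assuming lines_written has been prepared as described on the next slide) could be:

// succeed only when the transformation wrote at least one row
lines_written > 0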


Conditions
You have to enable and define what step of the transformation should
be taken for the variable lines_written. Do this within the
transformation:


Conditions
Example for checking for a specific value within the result lines:

The JavaScript looks like this:
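The script itself is shown as a screenshot. As a sketch only, assuming the previous_result object (the PDI Result of the preceding entry) and RowMetaAndData's getString(name, default) are available as in the PDI samples:

// scan the result rows for an assumed "status" field containing "ERROR"
var rows = previous_result.getRows();
var found = false;
for (var i = 0; i < rows.size() && !found; i++) {
  if (rows.get(i).getString("status", "") == "ERROR") {
    found = true;
  }
}
!found; // true (success) only when the value was not found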


Further advanced job concepts


Further advanced job concepts (like looping within jobs) can be found in
the chapter ETL patterns.


Common Scripting Uses


Classic vs. Modified JavaScript


Before 3.0 there were a JavaScript step and a Modified JavaScript step.
Since 3.0 there is only the Modified JavaScript step, with a compatibility mode:


Classic vs. Modified JavaScript


Why compatibility?
The 3.0 version separated data and metadata, and thus the handling of
PDI values has changed.
The functions listed in the manual for the old step are not valid in the
new step.
Instead, you use the standard Java object methods for the classes:
String: java.lang.String
Number: java.lang.Double
Integer: java.lang.Long
Date: java.util.Date
BigNumber: java.math.BigDecimal
Boolean: java.lang.Boolean
Binary: byte[]
The compatibility mode is slower and should be avoided.
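A quick sketch of the new style (assuming a String field name and a Number field price in the stream):

// fields are plain Java objects, so the standard Java methods apply
var upper = name.toUpperCase();       // java.lang.String method
var cents = Math.round(price * 100);  // price is a java.lang.Double
// in compatibility mode you would need the old Value API instead,
// e.g. name.getString()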

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 256

Test-Script
The Test-Script button creates test data for the input fields; depending
on the required format, this can fail. Example:

Different defaults for testing, depending on the value type:


String: test value test value ...., Numericals: 0, Boolean: true, Date:
current date
You can change the default values and test the script with OK.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 257

Built-in Functions
There are a lot of built-in functions with samples:

Right click on a function and select Sample

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 258

Start and End Scripts


You can define multiple scripts and select whether a script should be
executed additionally at the start or end of the transformation.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 259

Constants: Influence the Processing


With the constants you can skip subsequent rows at a certain point or
set the transformation into an error state.
You must set trans_Status to the desired state:
trans_Status = CONTINUE_TRANSFORMATION;
trans_Status = SKIP_TRANSFORMATION; // skip the current row
trans_Status = ABORT_TRANSFORMATION; // ends normally
trans_Status = ERROR_TRANSFORMATION; // ends with an error
Attention: Make sure to set trans_Status = CONTINUE_TRANSFORMATION
outside of an if section: the JavaScript step determines whether the
status should be changed by analyzing the script, and it does not
reliably find trans_Status within if sections.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 260

Constants: Influence the Processing


Example: Only 5 of 10 rows are processed.

For the usage of getProcessCount() and writeToLog(), look at the
samples from the Special functions.
Get the samples by a right click on a function.
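Since the example is only shown as a screenshot, a minimal sketch of the script (the counter type "r" for rows read is an assumption; check the Special functions samples):

// set the status outside of the if section (see the note above)
trans_Status = CONTINUE_TRANSFORMATION;
if (getProcessCount("r") > 5) {
  trans_Status = SKIP_TRANSFORMATION; // skip this row
  writeToLog("row skipped");
}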

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 261

Internal API objects


You can use the following internal API objects:
_TransformationName_: a String with the current transformation
name
_step_: the current step instance of
org.pentaho.di.trans.steps.scriptvalues_mod.ScriptValuesMod
rowMeta: the current instance of org.pentaho.di.core.row.RowMeta
row: the current row data as Object[]
With these API objects you can access the internal instances. The rich
functionality cannot be covered in full here. To use them in the right
way, you have to look at the classes in the source code.
Some examples are also available in your local directory kettle/samples
(e.g. JavaScript - dialog.ktr) and on the Wiki.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 262

Internal API objects


Example for the _step_ functionality:
Analyze the URL and hostname of an existing connection:
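A sketch of what such a script might look like (the connection name "my_connection" and the exact getter names are assumptions; check the classes in the source code):

// reach the transformation metadata through the step instance
var transMeta = _step_.getTransMeta();
var dbMeta = transMeta.findDatabase("my_connection");
var url = dbMeta.getURL();       // the JDBC URL of the connection
var host = dbMeta.getHostname(); // the configured hostname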

With _step_ you can access almost all information from the context
of your transformation and environment.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 263

Internal API objects


getInputRowMeta(): gets the metadata of the input row, same as rowMeta
getOutputRowMeta(): gets the metadata of the output row, e.g. when you
want to add fields with the JavaScript.
row[]: array of the current row.
Normally all input fields are available in the JavaScript and you can
access them directly. If you need to process your row dynamically, this
array is useful.
If you want to change the size of the output row to add new fields,
use the function: newRow =
createRowCopy(getOutputRowMeta().size());
putRow(): when you want to produce extra rows.
A detailed example for this is on the Wiki:
http://wiki.pentaho.org/display/EAI/Migrating+JavaScript+from+2.5.x+to+3.0.0
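A shorter sketch combining the two functions to emit one extra row per input row (the field index 0 is an assumption):

// create a row sized for the output layout and send it on
var newRow = createRowCopy(getOutputRowMeta().size());
newRow[0] = "extra"; // modify a field in the copy
putRow(newRow);      // produces an additional output row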

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 264

Internal API objects


Example for processing the entire row:
This transformation demonstrates how to process all fields of a row
by JavaScript. Here all 'E' characters will be replaced by 'Z'.
A use case is converting all strings coming from a host system with
wrong conversions of special characters (e.g. German umlauts like
ä, ö, ü).
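A sketch of the field loop (restricting the replacement to non-null String fields is an assumption):

for (var i = 0; i < getInputRowMeta().size(); i++) {
  // only touch String fields
  if (getInputRowMeta().getValueMeta(i).isString() && row[i] != null) {
    row[i] = row[i].replace("E", "Z"); // java.lang.String.replace
  }
}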

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 265

Use Java Classes


You can access all Java classes by prefixing Packages. to the needed
class, for example:
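A minimal sketch using a JDK class:

// format the current date with a JDK class via the Packages prefix
var sdf = new Packages.java.text.SimpleDateFormat("yyyy-MM-dd");
var formatted = sdf.format(new Packages.java.util.Date());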

If you want to add 3rd party libraries, you can add the jar to the
classpath and use them.
Note: Keep in mind that, when using PDI classes, they can change in
newer versions.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 266

Replacing Values
Use case: You want to replace parts of a string.
Actually, it is not possible to change the value of an existing field in
version 3.0.
In the old-style engine, it was possible to change strings to numbers,
etc. The result was a big mess where data types got mixed a lot,
especially Number / Integer mixups, since JS resorts to using doubles
for almost anything.
As such, the developers want to discourage doing exactly that as
much as possible.
For doing a replacement you have to use the compatibility mode.
The developers are actually discussing a solution like this:
Add a column in the "Fields" section: "Replace value? (Y/N)" - in that
situation the step can indeed convert values and verify data types.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 267

More Common Use Cases


Convert dates: use the date functions like date2str(), str2date()
Convert special string formats: use the string functions like indexOf(),
substr()
Working days: use the functions isWorkingDay(), getNextWorkingDay()
Check formats: e.g. isDate(), isNum()
Special: e.g. getDigitsOnly(), resolveIP(), LuhnCheck() [checks credit
card numbers]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 268

JavaScript Language Reference & Libraries


PDI uses Rhino for the JavaScript interpreter
The Rhino project with documentation:
http://www.mozilla.org/rhino/

Since PDI version 3.0 the JavaScript engine is not sealed any more:
Sealing prevented the use of common JavaScript libraries.
This was actually very limiting for experienced users, because there
are some very good JavaScript libraries that contain many useful
functions.
Since PDI version 3.0 you can use them by including them in the
classpath.
http://jslib.mozdev.org/

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 269

Scripting and Performance


In general, the JavaScript step should be avoided when performance
is critical.

Alternatives for scripting:

Formula step (faster than the JavaScript step)

User Defined Java Expression (faster than the Formula step)

User Defined Java Class (can also be used as an alternative to plug-ins)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 270

Formula step
The Formula step is based on the OpenFormula syntax
You can reference values with square brackets: [value]
Get help for every function by clicking on the function
Applying business rules (if / then / else) with more complex logic is
possible

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 271

User Defined Java Expression


This step compiles into pure Java, and you can use the standard Java
object methods for the values
Applying business rules (if / then / else) with more complex logic is
possible with the conditional expression operator, like
(a > b) ? a : b
The condition (a > b) is tested. If it is true, the first value a is
returned. If it is false, the second value b is returned.
Another example testing for a string:
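The expression itself is not reproduced here; a sketch of such a string test (the field name name is an assumption):

name != null && name.startsWith("A") ? "A-customer" : "other"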

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 272

User Defined Java Class


With this step, it is possible to access all internal step logic, like you
can with your own custom-made plug-in. The benefit of doing it with
this step is that the deployment process is simplified.
Further information can be found in the Wiki
http://wiki.pentaho.com/display/EAI/Writing+your+own+Pentaho+Data+Integration+Plug-In

Look at the Code Snippets, e.g. Main to get samples:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 273

Dynamic Transformations

2011, Pentaho. All Rights Reserved. www.pentaho.com.

Dynamic Transformations
Use case: One transformation fits all
You want to use only one master transformation and want to control
the
Input-type (e.g. CSV, Fixed File, Excel)
Preprocessing
Field Mapping
Validation
Enrichment
You can accomplish this easily by the use of sub transformations
(mappings) and variables

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 275

Dynamic Transformations
Use case: One transformation fits all
Here is a sample transformation that calls different sub-transformations,
controlled by variables that are set by a job; because of this, it is
completely flexible.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 276

Dynamic Transformations
Use case: Dynamic field mapping
You get a lot of different input files and need to output this into a
harmonized file structure.
In this case, you can use the ETL Metadata Injection step controlling
this transformation:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 277

Dynamic Transformations
ETL Metadata Injection step: sample controlling transformation:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 278

Dynamic Transformations
With the possibility of controlling your transformation by metadata, you
will be very flexible and can accomplish, e.g., processing invoice data
from many customers or suppliers with very few transformations or
jobs.

More steps will support this powerful feature in the next releases.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 279

Using XML

2011, Pentaho. All Rights Reserved. www.pentaho.com.

XML Steps Overview


Get Data from XML
Powerful step with XPath support and large-file handling capabilities

XML Input Stream (StAX)


This step is capable of processing very large and complex XML files
very fast using the StAX parser

XML Output
Basic XML Output for simple and flat structures

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 281

XML Steps Overview


Add XML
Adds XML to a data stream for more complex XML Outputs

XML Join
The XML Join step allows you to add XML tags from one stream into a
leading XML structure from a second stream (similar to Add XML but
more powerful).

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 282

XML Job Entries Overview


XSL Transformation
The XSL Transformation job entry transforms XML documents (by
applying an XSL document) into other documents (XML or another
format, such as HTML or plain text).

XSD Validator
Validates an XML file against an XML Schema Definition (XSD).

DTD Validator
This entry provides the ability to validate an XML document against a
Document Type Definition (DTD).

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 283

Get Data from XML


Sample:


2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 284

Get Data from XML


Define the XPath

For more information on XPath, see this tutorial:
http://www.w3schools.com/XPath/
You can reference parent elements in the fields like this:
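A sketch of such a parent reference (the element names are assumptions, based on the order example used in this chapter):

Loop XPath:  /orders/order/orderline
Field XPath: ../ordernumber  (reads the order number from the parent order element)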

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 285

Get Data from XML


The result looks like this

You have the order number and order line together


It is also possible to split the header rows and order lines into different
steps and only use one reference from the header within the order
line records (this example is covered in the labs).

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 286

Get Data from XML


Get Data from XML: Use of tokens
This is useful when you want to reference parts from another
hierarchy structure, like in this example with <User>:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 287

Get Data from XML


Get Data from XML: Use of tokens
The definition looks like this:

And the result is:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 288

Get Data from XML


Get Data from XML: further advanced options
Process XML data that is coming from a field: just define the field
and check "XML Source is defined in a field"

Process large XML files: define a Prune path to handle large files

When the prune path is given, the file is processed in a streaming
mode, in chunks of data separated by the prune path. At first glance
this can be almost the same as the "Loop XPath" property, with some
exceptions; details can be found in the documentation.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 289

XML Input Stream (StAX)


XML Input Stream (StAX) vs. Get Data from XML:
The Get Data from XML step is easier to use, but it uses DOM parsers
that need in-memory processing, and even the purging of parts of
the file is not sufficient when these parts are very big.

The XML Input Stream (StAX) step uses a completely different
approach:
Since Kettle has so many steps of its own to process data in different
ways, the processing logic has been moved into the
transformation, and the step itself provides the raw XML data
stream together with additional and helpful processing information.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 290

XML Input Stream (StAX)


Choose this step whenever you have limitations with other steps, or
when you need to parse XML under the following conditions:

Very fast and independent of memory, regardless of the file
size (GBs and more are possible due to the streaming approach)
Very flexible, reading different parts of the XML file in different
ways (and avoiding parsing the file many times)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 291

XML Input Stream (StAX)


XML Sample with different element blocks:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 292

XML Input Stream (StAX)


A preview may look like this (depending on the selected field):

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 293

XML Input Stream (StAX)


You see, you really get almost the original streaming information, with
elements and attributes from the XML file, together with other helpful
fields like the element level.

Since the processing logic of some XML files can sometimes be very
tricky, a good knowledge of the existing Kettle steps is recommended
when using this step. Please see the different samples of this step for
illustrations of the usage.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 294

XML Input Stream (StAX)


Sample for processing the file:
XML Input Stream (StAX) Test 2 - Element Blocks.xml

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 295

XML Input Stream (StAX)


The output looks like this for the Analyzer List block:

The output looks like this for the Products block:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 296

XML Input Stream (StAX)


There are a lot of options in the step to help solve your needs:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 297

XML Output
Basic XML Output for simple and flat structures
The usage is easy: besides the filename, you have to define your
root and row XML element names.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 298

XML Output
A result could look like this:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 299

Add XML
Adds XML to a data stream for more complex XML Outputs
This step allows you to encode the content of a number of fields
in a row in XML. This XML is added to the row in the form of a
String field.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 300

Add XML
Enter the field names and, optionally, the element name, or whether the
field should be included as an attribute.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 301

Add XML
A use case is to build more complex (nested) XML structures.
Here is a basic example:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 302

XML Join
XML Join allows you to add XML tags from one stream into a leading XML
structure from a second stream.
Together with the Add XML step, XML Join is used for building more
complex XML files. It replaces the stream join of fields and
simplifies the creation.


2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 303

Portable Transformations and Jobs

2011, Pentaho. All Rights Reserved. www.pentaho.com.

Use Cases for Portability


Going from Development → Test → Production
Change your database server
Change your database name
Change usernames and passwords
Reuse transformations or jobs with different files or databases
Use relative paths and be independent of the directory structure
and many more

→ All without changing the transformations or jobs

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 305

Set Environment Variables: Variables scope


When you set variables they can be valid in:
The Java Virtual Machine (JVM)
The parent job
The grand-parent job
The root job
Keep in mind that variables valid in the whole JVM can lead to race
conditions.
The preferred scope is the root job. The parent or grand-parent scopes
are only needed if the same variables are used and referenced in
different levels of your jobs.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 306

Set Environment Variables: Properties File


You can set environment variables also from the properties file
kettle.properties.
By default this file is located in your $HOME/.kettle directory (e.g.
C:\Users\jb\.kettle):

You can also point the KETTLE_HOME environment variable to the
directory that contains the .kettle directory. Please see the chapter
PDI Overview for more details.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 307

Set Environment Variables: System wide


You can set environment variables also at the operating system level,
but they are only valid in the JVM when you pass them through (see next
slide).

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 308

Set Environment Variables: JVM wide


You can set environment variables that are valid in the entire Java
Virtual Machine. They can reference operating system variables or be set
to fixed values. Use the -D option in your .bat or .sh files.
Example for .bat: -DDATABASE="%DATABASE%"
Example for .sh: -DDATABASE=$DATABASE
You can also set them to fixed values here, e.g.: -DDATABASE=salesdb

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 309

PDI own Variables


PDI has a lot of variables of its own that can be referenced. Especially
for the use case of flexible directory structures, the following are useful:
Internal.Transformation.Filename.Directory
Internal.Job.Filename.Directory
pentaho.solutionpath (when you run in a Pentaho BI environment)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 310

Referencing Variables
Whenever you see this icon, you can use variables:
Press Ctrl-Space to see a list of available variables.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 311

Referencing Variables
If you want to test your transformation at design time, make sure you
set the variable for test purposes (Edit / Set environment variables).

Spoon automatically detects variables that are referenced but not set
and lists them here.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 312

Referencing Variables
Variables are very useful in:
Flexible file processing:
Flexible table processing:
Table Input:

Table Output:
And flexible database connections:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 313

Named Parameters
Named parameters are a system that allows you to parameterize your
transformations and jobs. On top of the variables system that was
already in place prior to their introduction in version 3.2, named
parameters offer the setting of a description and a default value. That
in turn allows you to list the required parameters for a job or
transformation.
They can be set in the job or transformation properties.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 314

Named Parameters
When starting a job or transformation you can overwrite the default:

Named parameters can also be set within a job entry:

Or from the command line, e.g. for Kitchen:


"-param:MASTER_HOST=192.168.1.3" "-param:MASTER_PORT=8181"

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 315

Shared Objects
A variety of objects can be placed in a shared objects file on the local
machine.
The default location for the shared objects file is
$HOME/.kettle/shared.xml.
Objects that can be shared using this method include:
Database connections
Steps
Slave servers
Partition schemas
Cluster schemas
To share one of these objects, simply right-click on the object in the
tree control on the left and choose share.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 316

Shared Objects
This is especially useful for connections, as we use them in this course:
we do not have to enter the information again for new transformations.
If we want to change one of the connection properties, like the user
name, we can do it once for all transformations.
The same applies when you use slave servers, partition schemas and
cluster schemas.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 317

Shared Objects
If you want to change the location of the shared objects file, you can do
this in the properties of your transformation.
This is recommended when it should be independent of the user's home
directory.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 318

JNDI for database connections


When using JNDI, you create a named
connection. This is used predominantly
inside of an application server container
(Tomcat, JBoss, etc.). In a development
environment, you essentially mimic JNDI.
Each developer will use the same name,
but different connection information.
When you move to a production server,
you will configure a JNDI connection with
the same name on the server.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 319

JNDI
To configure the connection for use at design time, edit the file:
data-integration\simple-jndi\jdbc.properties
This file can be found in the DI server in:
data-integration-server\pentaho-solutions\system\simple-jndi
Here is a sample connection for a shared connection named
SampleData:
SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/url=jdbc:hsqldb:hsql://localhost/sampledata
SampleData/user=pentaho_user
SampleData/password=password

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 320

Logging

2011, Pentaho. All Rights Reserved. www.pentaho.com.

What is Logging?
Summarized Information about the Job or Transformation execution
Number of records Inserted
Total Elapsed Time spent in a Transformation

Detailed information about Job or Transformation execution


Exceptions
Errors
Debugging Information

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 322

Reasons to Enable Logging


Reliability
See if a Job finished with errors
Review what errors were encountered
Headless Operation
Most ETL in production isn't run from the GUI
Need a place to watch initiated job results
Performance Monitoring
Useful information for both current performance problems and
capacity planning

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 323

Two Types of Logging


Log Entries
Traditional logging in the sense of the word
File based approach
Verbose
Database Logging
Summarized results
RDBMS based approach
Concise and structured

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 324

Log Entries: Introduction


ALWAYS contains a timestamp
Usually contains the step name that logged the entry
The rest varies by what the log entry is

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 325

Log Entries: File Locations


Defaults to Spoon_xxx.log in your temporary files folder, e.g.:
C:\Users\Username\AppData\Local\Temp

The location can be set on the command line:
kitchen.sh -logfile=/tmp/mylogfile.log

WARNING
Log files can get BIG, several hundreds of MBs depending on logging
levels.
Place them in a separate directory, and periodically archive or purge.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 326

Log Entries: Logging Levels


Sets the verbosity and the information that is logged
Logging levels are additive
Basic Level = Minimal + Basic Log Entries
You get all the entries from the previous levels PLUS the level you've
selected
Levels
Error
Nothing
Minimal
Basic
Detailed
Debug
Row Level

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 327

Log Entries: Nothing


You'll see log entries like this:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 328

Log Entries: Error & Minimal


Error
Builds on Nothing
Will place log messages if any errors have occurred. If no errors occur
there will be no log output.
Minimal
Builds on Error
Places the bare minimum of log entries. Typically a Transformation Started
and a Transformation Ended.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 329

Log Entries: Basic


Basic
Builds on Minimal
Logs information about individual steps

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 330

Log Entries: Detailed


Detailed
Builds on Basic
Each step provides MORE information about the execution. Whereas the
previous log level only provided a summary for each step, this level
encourages each step to print out additional information.
Database steps provide information about the database connection and
the statements they're executing:

pentaho_oltp - Setting preparedStatement to [SELECT orderdate,
requireddate, shippeddate, status, customernumber FROM orders
WHERE ordernumber = ? ]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 331

Log Entries: Debug & Row Level


Debug
Builds on Detailed
VERBOSE. Log pretty much everything. Useful for developers or for
tracking down obscure issues.
Row Level
Builds on Debug
MOST VERBOSE. Dumps actual values of rows passing through
operators. Useful for tracking down when a data value is causing an
OraException with no helpful error message.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 332

Log Entries: Common Uses


Unknown Database Exception
Using Basic logging for normal operation you get a Database
Exception complaining about invalid characters in a column.
Exception doesn't tell you which VALUEs are at issue, only the column.
Turn on Row Level logging to find the row that is throwing the
exception.
Determining Prepared Statement Syntax
Data Warehouse DBAs want a report of all the SQL your ETL
application will execute.
This SQL will be put through an analysis and tuned so that lookup
indexes are properly implemented.
Turn on Detailed logging and collect the set of SQL being used by
steps.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 333

Database Logging: Introduction


Logs information into database tables in a structured format for reporting
and monitoring.
NOTE: Connection to PDI Logging and DW can be the SAME or Different.

[Diagram: Job1, Transform1 and Transform2 write their log records to the
PDI Logging connection; the Data Warehouse (Staging, Data Marts) can be
the same or a different connection.]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 334

Database Logging: Location and Properties


Database Location that will receive log records. Comprised of:
Connection
NOTE: Best to use a DIFFERENT connection from your actual ETL.
Schema / Table Name
Name of the schema and table to receive the records.
NOTE: Jobs and Transforms log in different formats and different
tables.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 335

Database Logging: Jobs and Transformations


Both Jobs and Transformations log
Jobs log to one table (pdi_log_job)
Transformations log to (pdi_log_trans)
NOTE: Use both in practice

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 336

Transform Log Format: Structure


The table for logging contains the following columns (see Transformation
Settings / Logging for the full description):

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 337

Transform Log Format: Example


The following table shows a transformation executed THREE times
successfully

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 338

Transform Log Format: Step Selection


A transformation can have huge numbers of steps. Which one does PDI
choose to report as its summary for the entire transformation?

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 339

Transform Log Format: Step Selection (cont)


Need to configure PDI to say which STEP row count should be reported at
the transformation summary level.
Select ONE STEP for the following columns
LINES_READ
LINES_INPUT
LINES_WRITTEN
LINES_OUTPUT
LINES_UPDATED
LINES_REJECTED

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 340

Job Log Format: Structure


The table for logging contains the following columns
This is the same structure as the transformation log table but with the
fields
ID_JOB - Primary Key of the job log entry
JOBNAME - Name of the job

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 341

Job Log Format: Example


The following shows the successful execution of two jobs (END) and one
job that was in process at the time the SQL was executed against the
table pdi_log_job.
NOTE: The LINES_* come from the last Transform executed!

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 342

Database Logging: Creating Tables


Create a Connection for the logging
(PDI_LOG_CONNECTION)
Open the Settings for either the
Transformation or Job
Select the connection and type in a
table name
Suggestion: pdi_log_job for Jobs
Suggestion: pdi_log_trans for
Transformations
Hit the SQL button at the bottom
to get the DDL for the table.
Execute.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 343

Database Logging: Indexes


When using job or transformation logging, this information is used in
many ways, for example:
The History in the Spoon Job log table is shown
It is used by the Pentaho Enterprise Console
It is used when you analyze your log tables
Because of this, having the right indexes on the tables helps
performance.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 344

Database Logging: (optional) LOG_FIELD


Big Text column that stores
the contents of the Log
Entries at the end of the run.
WARNING: Increases by at
least an order of magnitude
the space needed for the
logging tables.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 345

Database Transformation Logging: Gotchas


Transformation Name and Job Name are important! If you're using .ktr
and .kjb files, you have to make sure to set these.

Start and End dates are not intuitive.


Start is not when the transformation / job began. It is set to the last
end date of the prior transformation / job. The first date is set to an
"infinite" date in the past, like 1900 (or, precisely, one hour before:
'1899-12-31 23:00:00').
The startdate/enddate columns contain the date range since the last
time the transformation ran without error.
Note: If you want to get hold of this date range, use a "Get
System Info" step and select the options "Start of date range" and "End
of date range".
The logdate is the date that the log record was last written and, as
such, the time the transformation stopped running (the ending date).
The replay date is the date you can use to "replay" this transformation
and is effectively the time that the transformation was executed and
started (the start date).
US and Worldwide: +1 (866) 660-7555 | Slide 346
2011, Pentaho. All Rights Reserved. www.pentaho.com.

Database Logging: Common Uses


Headless Operation
DBA scripts and Pentaho reports can watch the tables to see the state and
progress of remotely executed jobs/transforms.
Running ETL processes
Find the processes that are currently running or stuck.
Determining speed and throughput
LINES_IN / total transform time = AGGREGATE RECORD THROUGHPUT
Chart that at different LINES_IN (X axis) and ELAPSED TIME (Y axis) and
you can see how your ETL scales. BE SCARED OF THE HOCKEY STICK.
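For example, 1,000,000 lines in over 200 seconds of total transform time is an aggregate throughput of 5,000 rows per second; if doubling the input much more than doubles the elapsed time, your curve is bending into that hockey stick.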

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 347

Database Logging Features since 4.0


Internal Object IDs, Logging channels (GUIDs)
Small API change
Solid logging architecture
Log separation (no mixture of transformation and job logs anymore)
Hierarchy: Where is each logging entry coming from?
Step logging
Performance logging

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 348

Step / Job Entry Logging


Very detailed information about every step in a transformation (or every
job entry in a job)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 349

Step / Job Logging


... Continued ...

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 350

Logging Channels
Very detailed information about every channel, e.g. the object type and
name:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 351

Logging Channels
Deep information about the executed transformation: filename or
repository details (Object and Revision) and the logging channel hierarchy
(Parent / Root):
This is also usable for lineage analysis when a report is built out of this
information.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 352

Performance Logging
You need to enable performance Logging and Monitoring:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 353

Performance Logging
You can see the results in the
Execution Results / Performance Graph:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 354

Performance Logging
You can see the results also in the log table for further analysis:
A snapshot was taken every second

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 355

Performance Logging
The snapshot contains the processed rows:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 356

Performance Logging
The snapshot also contains the buffer situation.
This is very useful in analyzing bottlenecks: e.g. when the number of
input buffer rows is often higher than the number of output buffer rows
(the ratio), then this step takes more time for processing and is most
likely the bottleneck.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 357

Real time monitoring: Drill Down and Sniffing


When you need real-time information about your process:
Drill down into running sub-jobs, transformations or sub-transformations
(mappings)
Turn on sniffing
Combine this with debugging

Drill down is possible when the box is blue:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 358

Real time monitoring: Drill Down and Sniffing


When a transformation is running, you can turn on sniffing in the context
menu of any step:

The result shows you the real rows actually being processed. To slow
it down for better visualization, you can add a Delay Row step.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 359

Real time monitoring: Drill Down and Sniffing


To define a start point for your sniff testing, you can use a break point:

The normal preview looks like this (not the sniffing, but similar):
Note: Press the Close button
and not the Stop button, see
next slide....

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 360

Real time monitoring: Drill Down and Sniffing


Activate the sniffing in the paused transformation and then resume it:

You can analyze the detailed row before and after the step or at any
other position within one transformation:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 361

Database Logging and Kettle.properties


A lot of database logging definitions can be set by system variables
There is no longer a need to define these in every job and transformation
The list of variables and their descriptions can be seen in the menu: Edit /
Edit the kettle.properties file:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 362

Scheduling and Monitoring


Logging is tightly linked to monitoring and scheduling. You need to know
whether your scheduled jobs ran successfully or not, how much time they
needed, etc.
This is discussed in more detail in the chapters Scheduling and
Monitoring and Operations Patterns.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 363

Error Handling within Transformations

2011, Pentaho. All Rights Reserved. www.pentaho.com.

Step Error Handling


Within a transformation you can check if a step fails and direct the
problematic rows to another stream.

The entire transformation will not fail and will continue to process your
data.
Note: Not all steps support this error handling feature at this time.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 365

Step Error Handling


To configure error handling, select Define Error handling... from the
context menu of the step.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 366

Step Error Handling


The data stream for the problematic rows looks like this:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 367

Step Error Handling


You can also set thresholds to put the entire transformation into an error
state. Depending on the number of rows or the percentage, you will get
errors like this:

Too many rows were rejected by the error handling, 1 is the
maximum and 2 rows were rejected. This transformation is being
asked to stop.

The maximum percentage of rejected rows of 66 has been reached.
2 rows were rejected out of 3. This transformation is being asked
to stop.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 368

Step Error Handling


You can also combine this with the Abort step in case you want a
different error message when the transformation stops.

With Always log rows, all rows in this data stream will be logged.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 369

Step Error Handling


Note: When you filter your log for the word error, make sure to
have it in the message (e.g. the button Show error lines will not
detect these rows).

You may see CR / LF within the data. If you want to eliminate them, you
can use a JavaScript step.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 370

Step Error Handling


JavaScript example for eliminating CR / LF (using compatibility mode).
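The script itself is only shown as a screenshot; a minimal sketch in compatibility mode (the field name message is an assumption, and replace() is one of the built-in string functions):

// strip carriage returns and line feeds from a String field
var cleaned = replace(replace(message.getString(), "\r", ""), "\n", " ");
message.setValue(cleaned);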

Now the result looks like this:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 371

Step Error Handling


Performance aspects I: This type of error handling can lead to a
performance loss when a lot of errors arise. This also depends on the
involved steps.
When you can check your data before the problematic step, this
is an alternative. The steps to use could be:

Filter Rows step
RegEx Evaluation step
JavaScript step

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 372

Step Error Handling


Performance aspects II: For the Input / Output steps this could be an
alternative.
The technique used in Insert/Update is to first do a lookup and then
perform an insert or an update when needed. For high-performance
situations you can use error handling to speed up the operation. Using
batch inserts and a primary key, you can work up to 3 times faster (with
a low updates-to-inserts ratio).

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 373

ETL Patterns

2011, Pentaho. All Rights Reserved. www.pentaho.com.

Introduction to Patterns
pattern / noun
1) a form or model proposed for imitation
2) something designed or used as a model for making things <a
dressmaker's pattern>
3) an artistic, musical, literary, or mechanical design or form
Informally:

How do I accomplish <<common data scenario x,y>> using PDI?

[Diagram: Common Data X → ETL Pattern in PDI → Common Format Y]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 375

Identify Candidates
Notice data SIMILARITIES, not SPECIFICS
Data Characteristics
Similar data types
Similar source systems
Similar content
Processing Characteristics
Similar algorithms
Similar square pegs and round holes
Housekeeping Characteristics
Similar loading/tracking techniques

[Diagram: square pegs, patterns and round holes illustrating "Similar To..."]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 376

Pattern : Batching
Tag DML with something to identify it as part of one load process
Batch ID: 27, Load Date: 10-Oct-2007
FACT/SCD II records have a column that identifies which batch they
were inserted during

Logical Rollback

DELETE from FACT where batch_id <= 15
Roll back to any point in time (yesterday, 10 days ago, etc.)

Partial Load Rollback

Committing every 1000 records is good for performance, but what
happens when you only get part way through a load?
DELETE from FACT where batch_id = (current batch_id) to clean up

Auditing

During which batch did we insert that fact record?

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 377

Pattern: Batching Overview


Get the BATCH_ID from somewhere
Set the variable ${BATCH_ID}
Use the variable ${BATCH_ID} in all INSERT / UPDATE steps
Your data ends up having an extra attribute and looking like this:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 378

Pattern: Batching Get Batch ID


Get the BATCH_ID wherever you want:

Database table for your own batching
PDI Batch ID
Database Sequence
max(batch_id) on the FACT table
...

Use the Set Variables step

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 379

Pattern: Batching Use Batch ID


To get the BATCH ID at any point
in the ETL process configure a
Get Variables step
This retrieves the variable
PARENT_BATCH_ID and adds it to
the stream, like any other field
Include it as a column with your
INSERT / UPDATE operations in the
database

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 380

Pattern: Batching Combine with Logging


Example: Use the Batch ID from the log table and get the
information about when the transformation ran with which Batch ID.

Note: A combination for keeping data in a consistent state with the
Batch ID is shown in the transactions section later on.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 381

Pattern : Change Data Capture


Find data that has been CRUD-ed since your last ETL run
CRUD: Created, Replaced, Updated, or Deleted
Compare today's data to yesterday's data

Which records are new?
Which records have been deleted?
Which records have been changed?
Which records are identical?

Detect changes

Only process changes for more efficient processing
Slowly Changing Dimension logic
Special ETL for a new customer, deleted customer, etc.

Route changes to their appropriate processing pipeline

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 382

Pattern : Change Data Capture


Compare data you got the last time to the data right now
Route to the appropriate processing

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 383

Pattern : Change Data Capture


Merge Rows

TWO inputs
Yesterday's data (STAGE)
Current data (OLTP)

Keys
Know what to compare on; the data needs to arrive sorted

Values
What to compare

Flagfield
Name of the field to put the new flag in:
"new", "identical", "changed" or "deleted"

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 384

Pattern : Change Data Capture


Write your own Slowly Changing Dimension logic
Deleted

Mark the customer deleted
Move it to a historical table for reporting

Created

Create a summary table for the customer?
Insert a record into an interface table with another system?
Call a webservice to get ZIP information only on NEW customers

Changed

Write your own Slowly Changing Dimension logic
Insert into a HISTORY table in addition to updating a reporting table

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 385

Pattern : Change Data Capture


Steps for simplification
Switch / Case: route directly to the steps for deleted, changed, new,
identical

Synchronize after merge: process the delete, insert, update directly
on the database depending on the flagfield content

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 386

Pattern : State Based Calculations


Calculate things as something moves through a set of states
Process-focused metrics
Business process for a customer order:

Web Site Placed
Warehouse Received
Warehouse Shipped
Customer Received

Great questions about the movement of these states:

What's the average amount of time to go from Warehouse Received to
Warehouse Shipped, by day of the week?
How long does it take to complete an order (start to finish), by
warehouse (East Coast, West Coast, etc.)?
What's the percentage of time spent on things we can control (Web
Site, Warehouse) vs. our service provider (shippers)?

Elapsed time is key

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 387

Pattern : State Based Calculations


Get the data in order
Calculate the previous date (i.e., like SQL LAG(DATE))
Calculate the difference in dates (this time versus previous time)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 388

Pattern : SBC Get Data in Order


Get your data sorted:
Grouping ID (e.g. Customer ID, Order ID)
Time (the state-change timestamp)

[Diagram: rows sorted by Grouping ID, then by Time]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 389

Pattern : SBC Calculate Previous Date


Use the Analytic Query step
Hold your previous row and add the previous value to the stream

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 390

Pattern : SBC Calculate Previous Date


The result looks like this:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 391

Pattern : SBC Calculate Time Difference


Calculate DAYS_BETWEEN_ORDERS (example with the JavaScript step)
This Order Date - Previous Order Date (in days)
(2000/09/27 - 2000/05/20) = 130 DAYS

var DAYS_BETWEEN_ORDERS;
var one_day = 1000*60*60*24;
if ( PREV_ORDER_DATE.getDate() != null ) {
  DAYS_BETWEEN_ORDERS =
    Math.round(Math.abs(orderdate.getDate().getTime() -
      PREV_ORDER_DATE.getDate().getTime()) / one_day);
}
else {
  DAYS_BETWEEN_ORDERS = 0;
}

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 392

Pattern : Create / Update Fields (UPSERT)


Want to update something in the warehouse
Some columns are ONLY touched on INSERT (Created Time, etc)
Some columns are touched on UPDATE (Updated Time, etc)

Auditing Requirements
Track what process made what changes to what data
Easy for downstream ETL
SELECT * FROM ODS_TABLE where UPDATED_TIME >= last time I ran

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 393

Pattern : Create / Update Fields (UPSERT)


Get your regular fields

customernumber
customername

Calculate the values you might use

CREATE_TIME (current time)
CREATE_USER (username)
UPDATE_TIME (current time)
UPDATE_USER (username)

Configure the UPSERT properly

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 394

Pattern : Create / Update fields (UPSERT)


Use the Insert / Update step
Configure fields that are only to be touched on INSERT:

Update = N

Configure fields that are to be touched on UPDATE:

Update = Y

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 395

Pattern : Create / Update fields (UPSERT)


Some fields are ONLY changed on INSERT

CREATE_TIME
CREATE_USER

Some fields are changed on INSERT AND UPDATE

UPDATE_TIME
UPDATE_USER

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 396

Pattern : Create / Update fields (UPSERT)


An upsert (insert and update) can also be accomplished by a
combination of a Table Output step with constraints on the database,
error handling and an Update step.
Depending on the database (time used for constraint checking,
building indexes), this can be faster than the Insert / Update step.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 397

Pattern: Transactions I (DB Transactions)


Transformation-wide DB transactions (e.g. commit, rollback): You want
to ensure that there is no data loss when a transformation is run and
multiple target tables need to be in a consistent state (e.g. child /
parent dependencies like customer invoice headers and details)
Check the option Make the transformation database transactional

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 398

Pattern: Transactions I (DB Transactions)


Additionally: set the commit sizes in all database-related steps that need
to be transactional to an almost infinite number, like 99999999.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 399

Pattern: Transactions (Other Approaches)


Reasons for the use of other patterns than DB transactions:
1. A DB transaction can only be transformation wide and not job wide
2. The amount of data in a transaction could create a heavy workload for the
database (e.g. the DB keeps it in a temporary store, keeps the tables in a
consistent state, handles locking situations)

Other approaches to keep the data in a consistent state (all in
combination with delta loading and not a full load, see the next slides
for more details):

Use a physical-table-centric approach: table rename/delete
Mark the changed records with a Dirty Flag
Use of Batch IDs
Combination with an own status table keeping the latest processed keys

The best choice depends on your data sizes, your transactions and
your database behaviour.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 400

Pattern: Transactions II (Table Centric)


Also useful when target tables can't have partial data
Example: The data load takes 3 hours, but you can't be doing PHYSICAL
inserts into the FACT_TABLE. Users need to see all the data or NONE.
Nothing in between.
Pre Processing

Create temp swap tables that match the structures of the target tables.
When this is combined with delta loading, also copy the original table
to the swap table
When you have referential integrity / foreign keys (which is not good
practice for a data warehouse), this concept can hardly be used

Transformation

Load the swap table(s)

Post Processing

Drop the old target tables
Rename the swap tables to the target tables

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 401

Pattern: Transactions II (Table Centric)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 402

Pattern: Transactions III (Dirty Flag)


The dirty flag
Often a boolean column (please check the database connection settings
to see if your database supports boolean data types)
Used to mark the rows inserted in the current run

Pre Processing

Delete all rows with the dirty flag set in all target tables. Records in this
state mean the last job did not finish successfully, and the partial data is
deleted.

Transformation

Load the table(s) with the dirty flag set

Post Processing

Reset the dirty flag in all target tables

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 403

Pattern: Transactions III (Dirty Flag)


Transformation setting the dirty flag

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 404

Pattern: Transactions III (Dirty Flag)


Job handling the dirty flag

delete from oltp_xxx where dirty_flag is true;

START TRANSACTION;
UPDATE oltp_orders SET dirty_flag=false where dirty_flag is true;
UPDATE oltp_orderdetails SET dirty_flag=false where dirty_flag is true;
COMMIT;

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 405

Pattern: Transactions IV (Batch IDs)


The Batch ID
There are many ways to get a batch ID (see previous slides). In this
pattern, we are using the batch ID from the logging table.
This example does keep the tables in a consistent state but does not
include a reload of failed records. This can be combined with a status table
that holds the processed keys (see next slides).

Pre Processing

Get the batch IDs from the logging table where the logging record indicates
an error. Delete all rows in all target tables for these batch IDs. Records
with these batch IDs mean the last job did not finish successfully, so
the partial data will be deleted.

Transformation

Load the table(s) with the batch ID

Post Processing

No action needed (possibly update indexes, etc.)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 406

Pattern: Transactions IV (Batch IDs)


The transformations add the batch ID passed from the calling job

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 407

Pattern: Transactions IV (Batch IDs)


Job handling the Batch-ID

delete from oltp_xxx
where job_batch_id in
      (select ID_JOB
       from log_test_batch_id
       where ERRORS > 0);

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 408

Pattern: Transactions IV (Batch IDs)


Job handling the Batch-ID
Enable job logging
Tick Pass batch ID? in the job settings

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 409

Pattern: Transactions V (Status Table)


The Status Table
Create a status table that holds, for all jobs/transformations, all tables that
need to be kept in a consistent state. For each table, the last processed keys
(source/target) and the status are saved.

Pre Processing

No action needed (possibly define loops and chunks of data, see next
slides)

Transformation

Load the table(s) starting from the previously saved keys (delta loading) and
save the last processed key; working in chunks is possible

Post Processing

No action needed (possibly update indexes, etc.)
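One way such a status table might look (names and types are illustrative, not from the course):

-- One row per job/table combination: the last key that was
-- processed successfully, plus batch and status information.
CREATE TABLE oltp_job_status (
  job_name        VARCHAR(100),
  table_name      VARCHAR(100),
  last_source_key BIGINT,
  last_target_key BIGINT,
  job_batch_id    BIGINT,
  status          VARCHAR(20)
);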

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 410

Pattern: Transactions V (Status Table)


Example of a transformation with a status table
Blocking Step
Wait for the last record to be
processed.

Add constants
Add the table names (for the key
range).

Insert/Update
Store the last processed key and
batch id for the table.
Note: When the target key is different (e.g. auto generated), both keys
(source & target) would need to be stored.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 411

Pattern: Transactions V (Status Table)


Example to combine a status table with delta loading: Taking the last
processed key into account when loading from source.

Oltp_job_status
Get the last processed key.
Use max() to get at least one row.

Source key null?


Set the key to 0 when this is the first load.

Orders
Add a where clause: WHERE ordernumber > ?
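A sketch of the two queries behind this, following the example's oltp_job_status and orders names (the exact columns are an assumption):

-- 1) Get the last processed key; MAX() returns a row even when
--    the status table is empty (NULL, turned into 0 afterwards).
SELECT MAX(last_source_key) AS last_key
FROM oltp_job_status
WHERE table_name = 'orders';

-- 2) The Orders table input; ? is bound to the key from step 1.
SELECT *
FROM orders
WHERE ordernumber > ?
ORDER BY ordernumber;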

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 412

Pattern: Transactions V (Status Table)


Example combined with delta loading in chunks:
you only want to process e.g. 1000 rows per transformation.
The chunk size can also be set as a named parameter
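A sketch of the chunked variant (LIMIT is MySQL-style syntax; the literal 1000 could come from a named parameter such as ${CHUNK_SIZE}):

-- Read at most one chunk beyond the last processed key.
SELECT *
FROM orders
WHERE ordernumber > ?
ORDER BY ordernumber
LIMIT 1000;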

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 413

Pattern: Transactions V (Status Table)

Results of delta loading in chunks (2560 rows in total, 3 chunks: 1000, 1000,
560 records)
Logging statistics of step orders and content of table oltp_job_status of
first, second and third run:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 414

Pattern: Transactions V (Status Table)


Example to combine with delta loading in chunks and table
dependencies: Keep the tables in a consistent state.
Options to combine: Dirty Flag, Batch IDs (delete the records that are
in an error state, as described before), or you can use the chunk
size as the commit size and set the Make the transformation
transactional flag to use DB transactions. The latter simplifies the
handling a lot, but how practical it is depends on the database and
chunk size.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 415

Pattern: Transactions V (Status Table)


Example to combine with delta loading in chunks and table
dependencies: Keep the tables in a consistent state.
Have multiple keys per table (e.g. for the orderdetails table:
ordernumber_from, ordernumber_to and the orderlinenumber. The
ordernumber_to will be taken from the previous run of the order
headers)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 416

Pattern: Transactions V (Status Table)


Example to combine with delta loading in chunks and table
dependencies: Keep the tables in a consistent state.
Details:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 417

Pattern: Transactions V (Status Table)


Example to combine with delta loading in chunks and table
dependencies: Keep the tables in a consistent state.
Details:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 418

Pattern: Transactions V (Status Table)


Example to combine with delta loading in chunks and table
dependencies: Keep the tables in a consistent state.
Details:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 419

Pattern: Transactions V (Status Table)


Example to combine with delta loading in chunks and table
dependencies: Keep the tables in a consistent state.
Now the job is pretty simple and:
handles all kinds of aborts (database, PDI, etc.)
is restartable
keeps all tables in a consistent state

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 420

Pattern: Loops
In general, loops are allowed in Jobs (in contrast to Transformations,
where this is not possible). An example of a loop is given here: wait
for a file and process it with two transformations. After the processing
of the file is finished, the job should loop and wait for the next file.

X
Don't do it this way: this will work in general, and for a long while,
but for design reasons it will sooner or later lead to a
StackOverflowError, even when the stack size is increased. This can
happen in production after some hours, days or weeks.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 421

Pattern: Loops
Options:
Externalize the loop to the operating system in a shell or batch file.
The errorlevel can also be checked there.
Iterations: stop the job after a certain number of iterations and restart it
in a loop in the operating system (the loop can also be in a shell or batch
file)
Schedule it (e.g. from the DI Server) at an interval, but avoid
overlapping runs (e.g. when a job takes longer than the interval)
Use the interval setting in the Start job entry (if this is suitable for the job)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 422

Pattern: Loops
How to stop a job after a certain number of iterations?

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 423

Pattern: Loops
How to stop a job after a certain number of iterations?
The JavaScript step decrements a variable and checks whether the loop
should continue:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 424

Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 425

Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 426

Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 427

Pattern: Loops
How to loop to load data in chunks until a maximum number of
iterations?
You can try this with a maximum of 2 iterations, since the full load needs
3 when the chunk size is set to 1000. In this case the job processes 2000
rows in two cycles of the transformation. When the job is run a
second time, it processes the remaining 560 rows.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 428

Pattern: Loops
How to avoid overlapping runs of jobs?
You can use a semaphore, e.g. set a file or write an entry in a
specific table, and check it.
You can check the log entry of the job to be executed to see whether it is still running:
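A sketch of such a check against a job logging table (the table name log_job and the STATUS value are assumptions; verify them against your own logging configuration):

-- A log record that still has status 'start' indicates a run
-- in flight; skip the new run if a row comes back.
SELECT COUNT(*) AS running
FROM log_job
WHERE JOBNAME = 'load_warehouse'
  AND STATUS = 'start';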

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 429

Pattern: Loops
How to avoid overlapping runs of jobs?
Similar to before (loading in chunks), but the SQL is different:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 430

Pattern: Loops
How to avoid overlapping runs of jobs?
Similar to before (loading in chunks), but the evaluation is different:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 431

Pattern: Restartable Solutions and Dependencies


With all the patterns described before, you can build a completely
restartable solution.
Step error handling and step validations were not used in these
examples, but can be added when known issues have to be
handled.
Combine the previous samples with variables and you have a
framework for executing jobs that are restartable and reliable.
It is also possible to store dependencies and sequences of jobs and
transformations in a table and run them from a master job that
controls them.
The latter is not described further here; it could also be combined
with scheduling and with monitoring that analyzes log files and the
status of servers.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 432

Pattern: Transactions with Dimensions


Dimension tables normally do not need transactions when they are
filled with the Slowly Changing Dimension (SCD) step. Example: when
a record is loaded a second time and nothing has changed, no new
record or version is added to the dimension table.
When a rollback is needed for an SCD table:
Rolling back a dimension record is very complicated, since you need to
change the valid dates.
The best approach is to use DB transactions or the table-centric
rename.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 433

Enterprise Repository

2011, Pentaho. All Rights Reserved. www.pentaho.com.

The Enterprise Repository


is based on the Security and Content Management modules in
the EE Data Integration Server:
Security allows you to manage users
and roles (default security) or
integrate security to your existing
security provider such as LDAP or
Active Directory
Content Management provides the
ability to centrally store and
manage your ETL jobs and
transformations. This includes full
revision history on content and
features such as sharing and
locking for collaborative
development environments

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 435

Setup
The EE Data Integration Server must be started.
When Spoon starts up, you will be prompted to connect to a
repository (or use the menu: Tools / Repository / Connect)
Add a new Enterprise Repository

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 436

Setup
Enter an ID (server reference) and name (local reference) for
your repository connection

Log on to the Enterprise Repository by


entering the following credentials:
user name = joe, password = password.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 437

Security
The Data Integration Server is configured out
of the box to use the Pentaho
default security provider
This has been pre-populated with a set of
sample users and roles including:
Joe: member of the admin role with full
access and control of content on the Data
Integration Server
Suzy: member of the CEO role with
permission to read and create content, but
not administer security

Note: See the Security Guide available in the Pentaho Knowledge


Base for details about configuring security to work with your
existing security providers such as LDAP or MSAD.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 438

Security Deleting Users and Roles


Note: After deleting a user (or role), the security object still
exists and is still referenced. Example, when the user pdi2000 is
deleted:

Please see the Best Practices for Deleting Users and Roles in the
Pentaho Enterprise Repository in the PDI Administration Guide

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 439

Content Management
Demo

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 440

Content Management
New repository based on JCR (Content Repository API for Java)
Improved Repository Browser
Enterprise Security
Configurable Authentication including support for LDAP and MSAD
Task Permissions defining what actions a user/role can perform such as
read/execute content, create content and administer security
Granular permissions on individual files and folders

Full revision history on content allowing you to compare and restore


previous revisions of a job or transformation
Ability to lock transformations/jobs for editing
'Recycling bin' concept for working with deleted files

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 441

Content Management Team Projects


Have one DI Server that holds the development repository,
security and scheduling
Deploy the DI Clients to the team members
Depending on your environment, the options are:
Have additional DI Servers for test and production
Have dev/test/prod directories below your team project and
change the directory by named variables (see also the following
slides)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 442

Content Management Team Projects


Private and public directories: the proposed way of working
Use your private directory for your own work and tests
Use the public directory for your team projects

The deployment scenario depends on the team size et al.


When you want to work on a transformation or job:
Lock it with your name or
Move it to your private directory
When you finished the work:
Unlock it or
Move it back to the public project directory
Note: When you move it, it can still be referenced when links are
specified with Specify by reference (see next slides)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 443

Content Management Team Projects


When team members still need the part you are working on,
you need to copy it (when the links are entered without
references):
At this time you can only do a Save as for copying
The drawback of Save as is that you lose the version history
Therefore a move to another work place, referencing, or locking is
the best way to proceed when possible

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 444

Content Management - Backup and Deployment


The current backup strategy is to back up the whole folder:
/data-integration-server/pentaho-solutions/

This includes the repository, security and scheduling e.g.


pentaho-solutions/system/jackrabbit/
pentaho-solutions/quartz/

Please remember to stop and start the DI server, more details


can be found in the PDI Admin Guide in the Knowledge Base

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 445

Content Management - Backup and Deployment


Backing up from one server and restoring to another server could
be an option for deployment,
but not when you have different security and scheduling on the
test and production servers
You may omit the pentaho-solutions/quartz/ folder in this case.

Another deployment method is to Export and Import
the repository.
At this time this is a manual task only
The whole repository, including the private folders, is exported and
imported; there is no project-wide export and import

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 446

Content Management - Backup and Deployment


When you choose the option to have dev/test/prod directories
below your team project and change the directory by named
variables:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 447

Content Management Specify by Reference


You can link to a resource (like a sub-job, a transformation, or a sub-transformation) by name or by reference.
The advantage of specifying by name and directory is:
You can use variables for the name and directory

The advantages of specifying the reference are:


You can move the referenced object to another location
You can rename the object
You can rename parts of the directory

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 448

Content Management File Based Repository


The new file based repository stores the jobs and transformations as
.kjb and .ktr files in XML format below a given directory
The main difference is: It can be referenced like a job or
transformation stored in a database or enterprise repository

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 449

Content Management Upgrade from 3.x


How to move from a collection of content files or an old database
repository to a new enterprise repository?
Move from a Database repository:
Export the DB repository (with version 3.x, or upgrade the DB repository
first and export with version 4.x)
Import into the EE repository

Move from content files (.ktr / .kjb):


Connect to your new enterprise repository.
Go to the File menu and click Import from an XML file
Note: Currently there is no quick and easy way to accomplish this process.
If you have any references to other job or transformation files in your
saved jobs, you must update each of those references to point to the new
location in the enterprise repository.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 450

Scheduling and Monitoring

2011, Pentaho. All Rights Reserved. www.pentaho.com.

Scheduling within the DI Server

Data Integration Engine: this is a
Carte instance. Carte is also used
in clustering (see a separate
chapter).
Scheduling: internally the Quartz
scheduler is used, and the tasks
are executed in the Data
Integration Engine.

Note: Scheduling within the DI Server is integrated


with the Enterprise Repository (Content
Management + Security).

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 452

Scheduling Kettle Content Before PDI 4.0

[Diagram: before PDI 4.0, jobs and transformations stored as files or in a DB repository were scheduled via the Pentaho BI Server or a CRON script and executed by Spoon or Carte.]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 453

Scheduling Kettle Content NEW in PDI 4.0

[Diagram: since PDI 4.0, Spoon connects to the Data Integration Server, which schedules and executes Job A and Job B directly from the Enterprise Repository (or from files or a DB repository).]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 454

Scheduling options
Before PDI 4.0:
Via the operating system (e.g. CRON jobs, task scheduler)
Via the BI Suite scheduler (via xActions)
Via the Start-Job-Entry
It was complicated to schedule remote jobs (with Carte)

Scheduling since PDI 4.0:

The same options are still available, but:
Don't schedule via the Start job entry (it will be deprecated soon)
As an additional option: Carte is now integrated into the DI Server
as the Data Integration Engine, and scheduled jobs and
transformations are executed in this Carte instance

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 455

Scheduling Kettle Content


[Diagram: BEFORE 4.0 - jobs and transformations stored as files or in a DB repository are scheduled via the Pentaho BI Server or a CRON script and executed by Spoon or Carte. AFTER 4.0 - the Data Integration Server schedules and executes Job A and Job B directly from the Enterprise Repository.]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 456

Scheduling - Demo

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 457

Scheduling - Demo

Note: Pause/Complete does not mean


the job or transformation is
paused/completed, but the
scheduling.
The Start and Stop buttons refer to the
scheduler and not the transformations
or jobs.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 458

How to schedule remote jobs


This was the proposed approach before 4.0 and is still valid if
you want to schedule jobs on remote Carte servers in addition
to the Data Integration Server.
Since there are no options in Kitchen to run jobs on a remote
Carte server, you need a wrapper job to define the remote Carte
server. [Defining a remote Carte server would need too many
properties, which would lead to overly complex command line options
for Kitchen.]
Define the wrapper job like this:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 459

How to schedule remote jobs


Define the slave server, e.g.:

and the job entry accordingly, e.g.:

Uncheck Wait for the remote job to
finish [unless you actually want to
wait]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 460

How to schedule remote jobs


Note: Jobs scheduled on the DI Server cannot execute a
transformation on a remote Carte server. You may see an
error line like this one when trying to schedule a job to run
on a remote Carte server:
UserRoleListDelegate.ERROR_0001_UNABLE_TO_INITIALIZE_US
ER_ROLE_LIST_WEBSVC!com.sun.xml.ws.client.ClientTranspor
tException: The server sent HTTP status code 401:
Unauthorized

To fix this, follow the instructions in Executing Scheduled


Jobs on a Remote Carte Server in the PDI Administrator's
Guide.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 461

Monitoring - Pentaho Enterprise Console

Monitoring is tightly linked to
scheduling: you need to know whether your
scheduled jobs run successfully,
how much time they take, etc.
Pentaho Enterprise Console (PEC)
provides you with an interface for
monitoring a DI Server / Carte
instance

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 462

Pentaho Enterprise Console - Demo

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 463

Pentaho Enterprise Console - Demo

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 464

Pentaho Enterprise Console - Demo

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 465

Pentaho Enterprise Console - Demo

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 466

Pentaho Enterprise Console - Demo

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 467

Monitoring - Pentaho Enterprise Console


Current limitations of the PEC:
Only one DI Server or Carte server can be monitored
Scheduling and monitoring are not linked together in the UI

The PEC functionality is planned to be fully integrated into


a separate PDI Monitoring perspective in a future release.
Further options to monitor:
Via a Web-Browser to monitor the DI Server PDI status
Within Spoon: Monitoring Slave servers

The next slides give an overview of the monitoring that is


discussed in more detail in the chapter Clustering and
Partitioning

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 468

Monitoring DI Server: PDI Status

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 469

Monitoring DI Server: PDI Status

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 470

Monitoring within Spoon: Slave server Status

Note: When monitoring the DI Server you need to enter


the Web App Name: pentaho-di

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 471

Monitoring within Spoon: Slave server Status

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 472

Agile BI and PDI

2011, Pentaho. All Rights Reserved. www.pentaho.com.

Contrasting Development Processes

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 474

Pentaho's Agile BI
Pentaho's Agile BI initiative seeks to break down the barriers to
expanding your use of Business Intelligence through an iterative
approach to scoping, prototyping, and building complete BI
solutions.

It is an approach that centers on the business needs first,


empowers the business users to get involved at every phase of
development, and prevents projects from going completely off
track from the original business goals.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 475

Agile BI

[Diagram: Agile BI delivers faster time to value by enabling business users to operate with or without IT resources, a continuous, real-time flow of and access to data, and conventional BI apps built and deployed rapidly within a single design environment, on-premise or in the cloud.]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 476

Agile BI Phases
Individual / Departmental

Agile Exploration
Agile Data Transformation
Solution Prototyping
Institutional

Infrastructure Design
Dimensional Modeling
Iterative Solution Development
Operational Deployment

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 477

Agile BI Core Tasks


Core Tasks for Individual/Departmental Phases

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 478

Agile BI Core Tasks


Core Tasks for Institutional Phases

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 479

Pentaho's Agile BI
In support of the Agile BI methodology, Spoon provides an integrated
design environment for performing all tasks related to building a BI
solution, including ETL, reporting and OLAP metadata modeling, and end
user visualization.
Business users will be able to
start interacting with data
build reports with zero knowledge of SQL or MDX
work hand in hand with solution architects to refine the solution.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 480

PDI 4.0: Agile BI - Model


New Modeling and Visualization perspectives

A Data Transformation becomes an Analysis Model

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 481

PDI 4.0: Agile BI - Visualize


Once the Model is created you can
use the Drag-and-Drop Analyze
Reporting tool to drill, slice, dice
and pivot your data.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 482

Data Quality - Example on Finding Issues

The UK is missing its territory in this example.
This can be corrected very quickly to EMEA in the
transformation.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 483

Data Quality - Example on Fixing Issues


There are many ways to accomplish this. Here is an example with the
Formula step:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 484

Limitations
You need to use a Table Output step for the Visualization

Limitations of the 4.0 release: the table needs to contain all fields
that you want to analyze. At this time there is no support for joining
other tables (snowflake or star schema).

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 485

Agile BI - Models

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 486

Agile BI - Models
Create and change Models (like a Mondrian Schema for Analysis or
Ad-Hoc Reporting) on the fly

Support for further functionality (known from Schema
Workbench, e.g. star / snowflake schemas) is coming in
a future release

Easy publishing of your schema to the BI Server

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 487

Visualization with the integrated Analyzer


Pentaho Analyzer is an interactive analysis tool and provides you with a
rich drag-and-drop user interface that makes it easy for you to create
reports quickly based on your exploration of your data.
You can also display Pentaho Analyzer reports in a dashboard (in the BI
suite).
The user can query the data in a database without having to understand
how the database is structured.
The Analyzer presents data multi-dimensionally and lets you select what
dimensions and measures you want to explore.
Use the Analyzer to Drill, Slice, Dice, Pivot, Filter, Chart Data and to
create Calculated Fields.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 488

Visualization with the integrated Analyzer


Integrated in the
context menu of
a step

Integrated in the
database dialog

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 489

Analyzer Report Overview

Drag fields from the field list to this area

Drag and drop Fields here.


The Field Panel
Fields from the
Data Model are
listed here.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 490

The Available Fields Pane


To see the list of fields that are available to you
when you build your report, click the Show
Fields button to display the Available Fields
pane.
By pressing the View button you may organize
the list in three ways:
1. By Category (default)
2. By Type: all number fields (blue) come first,
followed by text fields (orange).
3. A->Z: no grouping.
To change the organization, simply click the
View button at the top of the pane.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 491

Types of Fields
The following types of fields are available:
Text Fields (Names, Types, Categories, etc.): Product Name is an
example of a Text field.
Time Period Fields: Fiscal Year and Order Month are examples of
Time Period fields.
Number Fields: These types of fields are designed for summing,
dividing, creating averages, etc.
Fields are color-coded by type in both the report and the Available
Fields pane.
Text Fields and Time Period Fields: Orange
Number Fields: Blue

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 492

The Field Panel


This panel shows all dimensions with
the fields in their hierarchies.
Fields are dragged onto the report canvas.
Hierarchies cannot be split across axes.
Right-click a field for additional
options

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 493

Dragging Fields to the Canvas


The X Axis (Years): drag the field onto the canvas until you see the horizontal line.

The Y Axis (Territory): drag the field onto the canvas until you see the vertical line.

The Result

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 494

Using Filters
Pentaho Analyzer offers the following options for filtering reports:
Filtering Text Fields: Text fields contain non-numeric information, so
you can choose to include or exclude certain values at will. Time
Periods, Names, Types, and Categories are examples of text field
groups; Product Line is an example of a specific text field.

Filtering Number Fields: Number fields include numeric information.


Sales Revenue is an example of a number field. You can create a
numeric filter using Greater/Less Than or Top Ten.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 495

Types of Filters
Pentaho Analyzer offers the following options for filtering reports:
Filtering Text Fields: Text fields contain non-numeric information, so
you can choose to include or exclude certain values at will.
Selecting from a list of values. Pentaho will display a list of values,
and you choose to include or exclude certain values
Match part of a string. You type in part of the name (string) that the
name Contains or Does not Contain
Filtering Number Fields: Number fields include numeric information.
Greater/Less Than...
Top 10, etc...
You can have only one numeric filter on a report at any given time.
When the report is generated, the numeric filter is applied after
other filters are applied.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 496

How Filters work together


Filters are applied in the following order:
1. Text field filters, such as Product Line = Snow Sports or Time Period =
2006 Q4. (Note that the order among these filters is irrelevant.)
2. The Greater/Less Than component of numeric filters. This filter further
restricts the data.
3. The Top Ten component of numeric filters. This filter restricts the data
even further.
Note: Another way to express this is: all text field filters are applied
first (#1), creating a first "invisible" version of the report. Second,
Greater/Less Than filters are applied on this invisible report (#2), and,
finally, based on this report, the Top 10 filter is applied (#3).

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 497

Methods of Adding a Filter


A filter always acts on a field, so the first step is always to select a
field.
To add a new filter, use one of the following methods:
Method 1: Click on a field in the report, and select Filter from the
menu. (This method assumes you use the field in the report).
Method 2: From the Available Fields pane, find the field you want
to filter and drag the field into the Filters pane OR to the "(+) 0
Filter in Use" area.
Method 3: In the Available Fields pane, find the field you want to
filter and right click on the field, and select Filter.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 498

Calculations, Totals, and Sorting


Once you add fields and filters to your report, you can calculate and
manipulate the data on your report.
There are three primary methods:
1. Changing the way totals are displayed.
For example, display totals as averages.
2. Adding new fields that originate from existing fields.
For example, based on the field Revenue, you create % of Revenue.
3. Creating new numbers on the fly.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 499

Displaying Grand Totals and Subtotals


By default, Grand Totals and Subtotals don't display when you view a
report in table format.
To show or hide Grand Totals or Subtotals, do the following:
In the BI Suite:
Click the More Actions / Set Report Options
In PDI:
Click the Show report options icon

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 500

Displaying Grand Totals and Subtotals


Select or deselect the appropriate checkboxes in the Totals section.

Click the OK button to save your specifications.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 501

Displaying Totals as Averages, Max, Min, etc


Grand totals and subtotals are normally the sum of each individual row
or column value, but you can also choose to summarize the data in
other ways:
Sum (default; displayed as Total in a table report)
Average (see also: More about Averages)
Max
Min
To display these, click the number field (such as "Sales Revenue") in the
report and select Show Average, Max, Min, etc. from the menu.
Note: The various forms of totals (max, min, etc.) will only display if
your report is set to display totals

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 502

Displaying Totals as Averages, Max, Min, etc


Click the number field (such as "Sales Revenue") in the report and
select Show Average, Max, Min, etc from the menu.

Note: The various forms of totals (max, min, etc) will only display if
your report is set to display totals

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 503


Creating New Calculated Fields


Right-click the column header for a number in your report and select
User Defined Number > Calculated Number

Use expression keys to create
the new expression

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 505

Agile BI a Professional Services View

Agile BI is a way to meet simple analytic requirements

Agile BI is a technique for building a product backlog of analytic


requirements

Can be used to enhance an existing DW

Once the ROI of an analytic can be estimated, appropriate investments can be


made

Agile BI is a way to build executive confidence in the development and


project management process

This confidence leads to a long-term sustainable investment strategy

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 506

"What is Agile BI?"


The winner of the "What is Agile BI?" contest (Q2/2010):

Please see:
http://www.pentaho.com/what_is_agile/
http://www.pentaho.com/agile_bi/

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 507

Pre and Post Processing

2011, Pentaho. All Rights Reserved. www.pentaho.com.

Pre and Post Processing


Things done at the start or end of ETL processing
ETL is typically done in BATCH

We can optimize for a BATCH usage pattern

Pre and Post can be at different levels of the process

Can be pre the entire process (email the administrator that the process is starting)
Can be post the entire process (email the administrator that the process has ended)
Can be pre a single phase (prepare the database for bulk inserts)
Can be post a single phase (return the database to regular use after the
inserts)

Logical

Update summary tables

Physical

Usually database focused

Drop and recreate indexes, etc.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 509

Pre and Post Processing (cont)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 510

Constraints
Foreign key constraints can kill performance
Each DML statement requires a lookup to ensure consistency
Usually not a big deal at 10s and 100s of DML statements / sec
When batching, we hope for 1000s and 10000s of DML statements / sec
It is most efficient to do the INSERTs and check consistency once in a batch
system
Pre Processing

Drop or Disable constraints at the beginning of your load routine

Post Processing

(optional) Validate before trying to re-enable the constraints:

SELECT PROD_ID FROM ORDER_DETAILS
WHERE PROD_ID NOT IN (SELECT DISTINCT PROD_ID FROM PRODUCTS);

Recreate or re-enable the constraints
STEPS TO USE:
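A sketch of the pre/post SQL, assuming Oracle-style syntax and an illustrative constraint name:

-- Pre processing: disable the FK check before the bulk load
ALTER TABLE ORDER_DETAILS DISABLE CONSTRAINT FK_ORDERDETAILS_PROD;
-- Post processing: validate (query above), then re-enable
ALTER TABLE ORDER_DETAILS ENABLE CONSTRAINT FK_ORDERDETAILS_PROD;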

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 511

Indexes
Rebuilding indexes while you're doing DML can waste resources
Why do the work 1000 times when you can do it once?
Many indexes will need to be rebuilt entirely if enough DML has occurred
The bitmap index is a popular index type for BI
Pre Processing

Drop Indexes used for BI workload


Leave Indexes used for Dimension / PK Lookups

Post Processing

Create Indexes

STEPS TO USE:
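A sketch of the pre/post SQL (Oracle-style syntax; index and table names are illustrative):

-- Pre processing: drop the indexes that only serve the BI workload
DROP INDEX IDX_FACT_SALES_PROD;
-- Post processing: recreate them after the load
CREATE BITMAP INDEX IDX_FACT_SALES_PROD ON FACT_SALES (PROD_ID);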

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 512

Statistics
Statistics help databases create effective plans for queries
If they're off you'll have poorly performing queries
Some Databases have thresholds that automatically trigger a statistics
collection
You DO want updated statistics at the end of your load
You DO NOT want statistic updating to be occurring throughout your
load
Pre Processing

(optional) Disable automatic statistic triggering in your DB

Post Processing

Gather statistics on the tables you ran DML against, e.g. (Oracle):

exec DBMS_STATS.GATHER_TABLE_STATS('SCHEMA', 'MYTABLE1');
STEPS TO USE:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 513

Summary Tables
For this example, OLAP AGGREGATEs and SUMMARIES are the same thing
Summary tables can improve performance and ease of use of reporting
tools

Tables that contain top level data


Sales by Year, Product and Region
100 000 -> 1 record (good compression)

Pre Processing

(optional) Drop Summary Table

Post Processing

Update summaries by executing long-running queries against the FACTs:

INSERT INTO SUMMARY_TABLE (Year, Product, Sales)
(SELECT YEAR, PRODUCT, SUM(SALES) FROM HUGEFACTTABLE ....)
STEPS TO USE:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 514

Table Exchange Loading


FACT_TABLE is being used by reporting tools and can't have partial data
Data load takes 3 hours but you can't be doing PHYSICAL inserts into
the FACT_TABLE.
Users need to see all the data or NONE. Nothing in between.
Pre Processing

Create table NEW_RECORDS that matches structure of FACT_TABLE

Post Processing

Create a FACT_TABLE_TOSWAP that has the records in FACT_TABLE +
NEW_RECORDS:
INSERT INTO FACT_TABLE_TOSWAP
(SELECT * FROM FACT_TABLE UNION SELECT * FROM NEW_RECORDS)
Truncate/rename FACT_TABLE
Rename FACT_TABLE_TOSWAP to FACT_TABLE
STEPS TO USE:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 515

Clean up Staging
Your staging area (database, files) has scratch data that is unnecessary
after loading DW
Temporary Tables
Files from Source systems no longer necessary
Pre Processing

None

Post Processing

Delete files from ${temp_csv_file_location}


truncate table TEMP_PRODUCT_DATA
...

NOTE: Cleanup CAN occur during Pre processing instead of Post.


STEPS TO USE:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 516

ETL patterns
Most pre and post processing is needed in common situations
E.g. Transactions
Please see also the chapter ETL patterns

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 517

Tuning and DBA Topics

2011, Pentaho. All Rights Reserved. www.pentaho.com.

Tuning Strategy
[Diagram: an iterative tuning cycle. START HERE: 1) instrument and identify tuning candidates, 2) tune individual transforms, jobs, and the database, 3) measure and monitor improvement, then repeat.]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 519

Tuning Strategy (cont)


Identify and Instrument
How long do the various parts of your ETL load take?
Pick the biggest bang for the buck - it will take time to tune a
transform/job

100 Minute Load Process

75 min in the LoadCustomer transformation

5 min each for the remaining 5 transformations

Option 1: Improve one of the 5-minute transformations

Spend (1) day tuning the mapping, achieve a 50% improvement in speed!
Time improved: 5 min - (5 min * 0.5) = 2.5 min
Job time: 100 min - 2.5 min = 97.5 min
2.5% IMPROVEMENT for 1 DAY OF WORK

Option 2: Improve LoadCustomer

Spend (1) day tuning the mapping, achieve a 10% improvement in speed!
Time improved: 75 min - (75 min * 0.9) = 7.5 min
Job time: 100 min - 7.5 min = 92.5 min
7.5% IMPROVEMENT for 1 DAY OF WORK

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 520

Scalability and Planning

Scalability

Important questions for operational planning:
How long does it take to process 100k records?
How long does it take to process 125k records?

Does your ETL scale?
Beware the Hockey Stick

[Chart: hours (0-40) plotted against number of records (100k-600k). A good ETL scales roughly linearly; a bad one shows hockey-stick growth as volumes increase.]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 521

Database Logging
Covered in a different module
Provides the data needed for the tuning strategy and planning
SQL query (pseudo):

select
  TRANSNAME,
  RECORDS,
  ELAPSED_TIME (Start - End)
from PDI_TRANSFORM_LOG

TRANSNAME      RECORDS  ELAPSED_TIME
load_csv_data     2602            18
load_csv_data     5523            35
load_csv_data     1111            15
load_csv_data   100000           225

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 522

Database and SQL Tuning

2011, Pentaho. All Rights Reserved. www.pentaho.com.

ORDER

Approach
1) Identify a tuning candidate (previous slides)
2) Review against basic tuning concepts
   1) Make a change
   2) Measure
   3) Rinse and repeat
3) Tune the SQL and database
   1) Make a change
   2) Measure
   3) Rinse and repeat

US and Worldwide: +1 (866) 660-7555 | Slide 523

Tuning Concept - Disclaimer


Your Mileage May Vary

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 524

Tuning Concept Memory Settings


64-bit - if you have it, use it
64-bit architecture is common, with machines having 8, 16, 32 GB RAM
But: in some cases it may be better to go with 32-bit to reduce
memory usage. In a 64-bit server, every pointer and every integer
takes twice as much space as in a 32-bit server. That overhead can be
significant, depending on your use case.

Steps that can benefit from more memory

Sort
Stream Lookup
Sort
Join Rows
Sort (did we mention this one already?)

Avoid swapping and you can improve your performance by an order of
magnitude

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 525

Tuning Concept SORT


SORT by default
starts swapping rows
at 5000
Configurable (in
Transform)
Location of
SWAPPING also
configurable
NOTE: It's # of ROWs
not the size (MB)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 526

Tuning Concept DB vs PDI Sort


Sorting in the database means no sort is necessary in PDI

Pros of DB Sort
Fast: DBs are usually fast at sorting
Less data moving over the wire (ELT)

Cons of DB Sort
Less metadata and more difficult to read
The database is doing more real work
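A minimal sketch of a DB sort: push the ORDER BY into the Table Input query so no PDI Sort step is needed (table and column names are illustrative):

-- The database returns the rows already sorted.
SELECT customernumber, customername, country
FROM customers
ORDER BY country, customername;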

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 527

Tuning Concepts Database Latency


Executing a query takes time. How much?
JDBC driver preparation
Network to the RDBMS
RDBMS: parse / build the query execution plan
RDBMS: execute
Network back from the RDBMS
JDBC driver result return

The dominant time is usually in the execute phase; the others are negligible in BI
workloads
Unless you are doing the above thousands of times / sec

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 528

Tuning Concepts Database Latency Steps


The following steps will perform
ONE database operation for every row
Database Lookup
Database Join
Call DB Procedure
Table Input
Delete
TWO database operations for every row
Update
Insert / Update
ONE, TWO, or THREE database operations, depending on the case
Dimension Operator
Combination Dimension Operator

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 529

Tuning Concepts Database Latency Numbers


For small-ish loads (1000s and 10000s of rows) this is usually not an issue
In large volumes it adds up

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 530

Tuning Concepts Database Latency Caching


Turn on caching in steps that support it.
It doesn't improve the DML
(INSERT/UPDATE/DELETE) but will speed up the
SELECTs
NOTE: The table must not change for caching to
provide accurate results

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 531

Tuning Concept Stream when Possible


Smaller data sets
Small is a function of your
memory
Rough Guess (10k records)

Slightly slower until all lookup


rows have been loaded
Blazing fast once all lookup rows
have been loaded
When possible use the Merge Join
step. It joins on sorted streams
and is very fast and has more join
options.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 532

Tuning Concept Latency vs Memory


More Streaming requires more memory
Faster Sorting requires more memory
More Caching requires more memory
Tradeoffs

Memory on a non-HA ETL server is usually less expensive than
memory on an HA database
Dial in your memory settings (depending on the row size etc.)
A Stream Lookup of up to 200k rows fits in roughly 1 GB of RAM
Changing a DB Lookup with 500k rows into a Stream Lookup
requires approximately 2.5 GB of RAM

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 533

Tuning Concept Joined Select vs Lookups


Joined Select
Join your two tables using SQL
in the Table Input step

Advantages of Joined Select

PDI memory (the lookup data
doesn't have to reside in
memory)
Latency (no trip back to the
database for a lookup)
Fast(er)

Disadvantages of Joined Select

Increases the load on the DB
Less readable (it's SQL)
Breaks the metadata-driven
principle
Requires knowledge of SQL
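A minimal sketch of a Joined Select in a Table Input step, replacing a separate Database Lookup (table and column names are illustrative):

-- One query returns the already-joined rows, so no per-row
-- lookup round trip is needed.
SELECT o.ordernumber, o.orderdate, c.customername, c.country
FROM orders o
JOIN customers c ON c.customernumber = o.customernumber;

2011, Pentaho. All Rights Reserved. www.pentaho.com.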

US and Worldwide: +1 (866) 660-7555 | Slide 534

Tuning Concept Reduce Cardinality SOONER


Reduce the number of
records you are
processing SOONER
rather than LATER
This reduces the number of

Lookups
Rows passed between steps
....
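One way to sketch this: filter in the source query so fewer rows ever enter the stream (the WHERE conditions are illustrative):

-- Rows dropped here never hit a lookup or a hop.
SELECT ordernumber, orderdate, customernumber
FROM orders
WHERE status = 'SHIPPED'
  AND orderdate >= '2005-01-01';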

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 535

SQL and Database Tuning


Have PDI tell you what SQL it will use for SELECTs/INSERTs/UPDATEs
Set the logging level to Detailed or above to see the SQL, e.g.:
INSERT INTO PDI_TRAINING (C1,C2) VALUES (?, ?)

Use traditional database tuning techniques

Not covered here


Explain Plans
Indexes
Storage Engines
...

Other good database design precepts

Use integers for keys instead of strings if possible


...

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 536

Interpreting Runtime Data

2011, Pentaho. All Rights Reserved. www.pentaho.com.

Runtime Data
Tabular data in Spoon that helps a developer understand information
about the transformation as it runs
Information about Number of Records streaming
Time
Records / Second
Status of Steps
Input / Output records on hops

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 538

Basics

On the Log view when running transformations from Spoon

Basics Columns

Stepname: name of the step (lookup_region, read_orders)
Copynr: if multiple copies are started, which one this is (0, 1, 2, 3, 4, ...)
Read: number of records received from the PREVIOUS step
Written: number of records passed to the NEXT step
Input: number of records read from a file, database, etc.
Output: number of records written to a file, database, etc.
Rejected: number of records rejected
Errors: number of errors
Active: status of the step (Initializing, Running, Finished, etc.)

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 539

Basics Example

[Screenshot callouts: 11468 rows read from the DB; one step copy read 1466 records, another read 1465 records.]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 540

Time and Records


Time

Number of seconds from the start of the step to either:
Now() if still running
The time of the last record if finished

Speed

Number of records / Time
Time follows the above formula, so this shows:
Live throughput while running
Aggregate throughput for the entire run if finished
NOTE: The speed of a step is not solely dependent on itself. The next
slides clarify this.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 541

Input / Output
The Input / Output figures give information about the
# of records on the input hop
# of records on the output hop
Hops can hold a configurable number of rows (0..N)
Example

Step1 has no INPUT and Hop1 as OUTPUT


Step2 has Hop1 as INPUT and Hop2 as OUTPUT
Step3 has Hop2 as INPUT and no OUTPUT

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 542

Input / Output Example


Hop1
Step1 output = 10000
Step2 input = 10000

Hop2
Step2 output = 0
Step3 input = 0

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 543

Input / Output Interpretation


Where rows are sitting on hops gives you information for improving
performance.
Formal

Look for the furthest downstream step with few records on its OUTPUT
and many records on its INPUT

Informal

Look for the first 0 on OUTPUT side and 10000 on the INPUT side

Step3 is Processing rows as fast as Step2 produces


Step1 is producing rows faster than Step2 consumes

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 544

Backlog
Slow steps cause a backlog
Look at all the steps' Input / Output
See what is CAUSING the backlog
(the downstream step with 0 on its Output)
NOT just what IS backed up

[Screenshot: three hops, each showing a backlog of 10000 rows.]

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 545

Clustering and Partitioning

2011, Pentaho. All Rights Reserved. www.pentaho.com.

Clustering and Partitioning


Clustering:
Computer cluster is a group of linked computers, working together
closely so that in many respects they form a single computer. The
components of a cluster are commonly, but not always, connected
to each other through fast local area networks.
Clusters are usually deployed to improve performance and/or
availability over that provided by a single computer, while
typically being much more cost-effective than single computers of
comparable speed or availability.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 547

Clustering and Partitioning

Columbia, the new (2004) supercomputer, built from 20 SGI Altix clusters, a total of
10240 CPUs
Credit: NASA Ames Research Center/Tom Trower

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 548

Clustering and Partitioning


Define Cluster Nodes (Slave/Master Servers)
One of them must be a Master (check the window height in the slave server dialog
to see the option Is the master)
The default username/password is cluster/cluster

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 549

Clustering and Partitioning


Define Cluster Schemas (bunch of Nodes)
Create a new Kettle cluster schema
Select the master and slave servers

Note: When you get an error that the Socket port is already in use, try another
port. This can also happen when an error arises and the port is not closed.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 550

Clustering and Partitioning


Set up Carte on the slave machines
Carte is a simple web server that allows you to execute transformations and jobs
remotely.
The username / password is by default: cluster / cluster
It can be changed in your Pentaho Data Integration distribution in
pwd/kettle.pwd
To encrypt a password, use encr.bat/.sh with a parameter
This command line tool obfuscates a plain text password for use in XML and
password files.

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 551

Clustering and Partitioning


Start a sample transformation clustered

Cx2 means these steps are executed clustered on two slave servers.
All other steps are executed on the master server
To execute the transformation:

2011, Pentaho. All Rights Reserved. www.pentaho.com.

US and Worldwide: +1 (866) 660-7555 | Slide 552

Clustering and Partitioning


Monitoring in Spoon
On the slave / master servers, click Monitor in the context menu.

You will also see a preloaded "Row generator test" transformation in
every Carte instance.


Clustering and Partitioning


Monitoring via a Browser


Clustering and Partitioning


What is happening in the background?


Clustering and Partitioning


Special considerations
When the cluster-nodes are the target of another step, this works.


Clustering and Partitioning


Special considerations
An Info Step (e.g. used by Stream Lookup) is outside the cluster-nodes.


Clustering and Partitioning


Special considerations
An Info Step (e.g. used by Stream Lookup) is outside the cluster-nodes.
There are two Socket Writers from dim_time, one to each of the slave servers.


Clustering and Partitioning


Special considerations
An Info Step (e.g. used by Stream Lookup) is outside the cluster-nodes
Each slave server has a Socket Reader and gets only half of the rows.


Clustering and Partitioning


Special considerations
An Info Step (e.g. used by Stream Lookup) is outside the cluster-nodes
Solution: Change the data movement to "Copy data to next steps".


Clustering and Partitioning


Special considerations
When the Sort step is a bottleneck: Let it run clustered!

But then you need a Sorted Merge step, which merges the pre-sorted streams
coming back from the slaves into one sorted stream. It does the following in the
background, but in a clustered environment:

Clustering and Partitioning


Clustering the databases (Partitioning)
When we look at the first simple example again:

Since data ends up on 2 different servers, with 2 different database connections to
the same database, you can get into trouble (deadlocks) if one server is doing
updates to the same ID as the other one.
This makes it a candidate for partitioning: if we create 2 partitions (0 and 1),
we guarantee that the same ID will always end up on the same server.


Clustering and Partitioning


Partition for clustered databases

When you want to use clustered databases, uncheck "Dynamically create
the schema" so that the defined database clusters are taken into account.


Clustering and Partitioning


Partitioning methods for distributing the data

Round-robin: The standard method when no explicit partitioning is defined.
Mirror to all partitions: Data is copied to all slaves.
Mod partitioner ("hash-partitioned"): Distributes the data by an ID
and guarantees that the same ID will always end up on the same
server.
This method takes an ID (an Integer, or the hash code of a
String, Date, etc.), divides that number by the total number
of partitions and takes the remainder (modulo), as shown in the SQL sketch below.
For example:
id=26, 3 partitions --> 26%3 = partition 2
id=37, 3 partitions --> 37%3 = partition 1
id=39, 3 partitions --> 39%3 = partition 0
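Since this course assumes SQL knowledge, here is the same modulo rule as a minimal
SQL sketch (the customers table and its integer id column are hypothetical; MOD()
may be written as % depending on your database):

-- Assign each row to one of 3 partitions by its ID:
SELECT id,
       MOD(id, 3) AS partition_nr   -- 26 -> 2, 37 -> 1, 39 -> 0
FROM customers;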


Clustering and Partitioning


Write to a clustered database without clustering the transformation


Clustering and Partitioning


Read from a clustered database without clustering the transformation


Clustering and Partitioning


Read from a file in a clustered environment

Remember to check "Running in parallel" (otherwise you get all rows on all
cluster nodes)
The file is divided internally into chunks of data for each cluster-node to process.
The same principle works for the Fixed file input.


Clustering and Partitioning


Dynamic Clusters (available since 3.2.0)
A dynamic cluster is a cluster schema where the slave servers are only known at
runtime.
This occurs where hosts are added or removed at will, such as in cloud
computing settings. It also handles failover situations.
More details on this powerful feature can be found here:
http://wiki.pentaho.com/display/EAI/Dynamic+clusters


Hadoop


The Case for Big Data


Enterprises increasingly need to store, process and maintain
larger and larger volumes of structured and unstructured data
Compliance
Competitive advantage
Challenges associated with big data
Cost of storage and processing power
Timeliness of data processing

[Chart: Google Trends interest for Hadoop]

Why Hadoop?
Low-cost, reliable scale-out architecture for storing massive
amounts of data
Parallel, distributed computing framework for processing data
Proven success in solving big data problems at Fortune 500
companies like Google, Yahoo!, IBM and GE
Vibrant community, exploding interest, strong commercial
investments


Hadoop for Data Integration and BI


Top Use Cases for Hadoop*
1. Mining data for improved business intelligence
2. Reducing the cost of data analysis
3. Log analysis
Top Challenges with Hadoop*
1. Steep technical learning curve
2. Hiring qualified people
3. Availability of appropriate products and tools
Unfortunately, Hadoop was not designed specifically for ETL and BI use cases:
It's not a database
High-latency queries and jobs are not ideal for all BI use cases
Skill-set mismatch for traditional ETL users and BI solution architects
*Based on a survey of 100+ Hadoop users conducted by Karmasphere, Sept. 2010


Pentaho BI Suite for Hadoop


Lowers technical barriers by providing an easy-to-use ETL environment for
managing data in Hadoop
Provides end-to-end BI tools addressing common BI use cases with Hadoop,
including Reporting, Ad Hoc Query and Interactive Analysis
Extreme ETL scalability through integration with Hadoop's MapReduce framework
Workflow integration of Hadoop jobs with external ETL and BI activities
Reduces costs through our subscription-based pricing model, reduced dependency
on highly paid technical resources, and easier maintainability

[Diagram: Log files, DBs and other sources feed Hadoop; PDI ETL jobs move data
into data marts, which serve batch reporting and ad hoc query, interactive
analysis and Agile BI]

Big Data Does Not Replace Data Marts


It's not a database
High latency
Optimized for massive data-crunching
Hadoop's databases are immature
Hadoop's databases are NoSQL


What Hadoop Really is.


Core components
HDFS: a distributed file system allowing massive storage across a cluster of
commodity servers
MapReduce:
Framework for distributed computation; common use cases include
aggregating, sorting, and filtering BIG data sets
The problem is broken up into small fragments of work that can be computed
or recomputed in isolation on any node of the cluster
Related Projects
Hive: a data warehouse infrastructure on top of Hadoop
Implements a SQL-like query language, including a JDBC driver
(see the sketch below)
Allows MapReduce developers to plug in custom mappers and reducers
HBase: "the Hadoop database" (AH HA!)
A variant of NoSQL databases, problematic for traditional BI
Best at storing large amounts of unstructured data
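To give a flavor of Hive's SQL-like language, here is a minimal HiveQL query;
the weblogs table and its columns are hypothetical and would have to be created
and loaded first:

-- Count page hits per URL in a hypothetical weblogs table:
SELECT url, COUNT(*) AS hits
FROM weblogs
GROUP BY url
ORDER BY hits DESC;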


Hadoop and BI?


Instead of this...


Hadoop and BI?


You have to do this in Java...
// Classic org.apache.hadoop.mapred API:
public void map(Text key, Text value,
                OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException

public void reduce(Text key, Iterator<Text> values,
                   OutputCollector<Text, Text> output,
                   Reporter reporter) throws IOException


Pentaho Data Integration


[Diagram: Pentaho Data Integration is used to design, deploy and orchestrate
Hadoop jobs, and to load data marts, data warehouses and analytical applications]

[Diagram: Load data from applications & systems into Hadoop (files/HDFS, Hive);
optimize into RDBMS-based data marts & data warehouses; visualize via the web
tier with reporting, dashboards and analysis]


The Road Ahead


Streaming Data Source Support
In support of near-realtime use cases
Long/always-running data processing jobs
NoSQL Integration
Facilitate BI use cases on top of HBase, possibly others like Cassandra
Contiguous Metadata
Data lineage and impact analysis covering the entire big data
architecture
The End of MapReduce (as a concept ETL users need to understand)
Push-down optimization of transformations that generate native
MapReduce tasks in Hadoop


Operations Patterns


Operations General Overview


Health checks for PDI components
DI Server
Carte and clustered environments
Kitchen/Pan
Detect deadlocks
Health checks for external components
JVM (e.g. memory, used CPUs)
Server (e.g. test with a network ping)
Databases (up and running)
Signal-to-noise detection:
Define what is normal (noise) and notify on exceptions (e.g. set
thresholds, absolute or relative, possibly based on averages)
Define what is unusual and notify on these events
Define events (e.g. actions, notifications & alerts)
Constraint #1: Minimize the footprint and impact of the measurements
on the system


Operations Define Actions by Events


When operations detects an event, an action can be taken, e.g.:
Start/stop/restart a process (job or transformation)
Start/stop/restart a server (DI Server or Carte)
Notify by logging or sending an alert (e.g. mail)

The actions for specific events can be defined in different operational
check routines, and they all call the same event handler.
This way, actions can be added and changed easily and are
processed in one common place.


Operations Define Actions by Events


Examples for logging actions:


Operations Define Actions by Events


Examples for starting
jobs,
transformations and
shell jobs, and for
sending mail:


Pattern: Watchdog
Watchdog:
A watchdog timer is a computer hardware or software timer that
triggers a system reset or other corrective action if the main program,
due to some fault condition such as a hang, neglects to regularly
service the watchdog (writing a service pulse to it, also referred to
as "kicking the dog", "petting the dog", "feeding the watchdog" or
"waking the watchdog").
The intention is to bring the system back from the unresponsive state
into normal operation. [...] (more on Wikipedia)
Most of the PDI health checks can be accomplished with the
watchdog concept.


Pattern: Watchdog
Watchdog timers for multitasking (e.g. many PDI jobs and cluster nodes):
A software crash might go undetected by conventional watchdog strategies.
Success lies in weaving the watchdog into the fabric of all of the system's
tasks, which is much easier than it sounds:
Build a watchdog task.
Create a data structure (database table) that has one entry per task.
When a task starts, it increments its entry in the structure. Tasks that only
start once and stay active forever can increment the appropriate value
each time through their main loops, e.g. every 10,000 rows.
As the job or transformation runs, the count for each task
advances.
Infrequently, but at regular intervals, the watchdog runs.
The watchdog scans the structure, checking that the count stored for each
task is reasonable. One that runs often should have a high count; another
which executes infrequently will produce a smaller value.
If the counts are unreasonable, halt, let the watchdog time out and fire
an event. If everything is OK, set all of the counts to zero and exit.
(A SQL sketch of these two operations follows.)
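As a minimal SQL sketch, assuming the op_watchdog table described on the
following slides (the task id 42 is just an example):

-- Each task increments its own counter, e.g. every 10,000 processed rows:
UPDATE op_watchdog
   SET wd_counter = wd_counter + 1,
       wd_last_run = CURRENT_TIMESTAMP
 WHERE wd_task_id = 42;

-- The watchdog periodically inspects and then resets the counters:
SELECT wd_task_id, wd_counter FROM op_watchdog;
UPDATE op_watchdog SET wd_counter = 0;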


Pattern: Watchdog
An example implementation with PDI:
This is task-oriented, not server-oriented.
This means the check verifies whether the task (a transformation or job) is
running as expected, independently of the environment it runs in (e.g.
clustered or not).

[Diagram: Tasks 1..n increment their counters (1..n); the watchdog reads the
task definitions (1..n) and the counters and fires an event on exceptions]


Pattern: Watchdog
The Watchdog Task Definitions table
Environment variable: ${watchdog_task_table}
Example table name for operations:
op_watchdog_task
Fields & Descriptions
wd_task_id              Unique task ID
wd_task_description     Task description (optional)
wd_task_disabled        1 = do not check the task (the counters are still
                        incremented by the tasks)
wd_task_min_count       When > 0: check if the counter is at least at this
                        value after the cycle time
wd_task_max_count       When > 0: check if the counter is below this value
                        after the cycle time
wd_task_cycle_minutes   Defines the check cycle time in minutes


Pattern: Watchdog
Fields & Descriptions (Watchdog Task Definitions table continued)
wd_task_lenient_count          When an exception is detected, be lenient
                               for x times
wd_task_event_type             In case of an exception, fire this event
wd_task_event_details          See the events section for more details
wd_task_last_reset             When (date & time) was the counter last
                               reset by the watchdog?
wd_task_last_detection_count   When the lenient count is used, the number
                               of detections is logged here by the watchdog


Pattern: Watchdog
The Watchdog table for counting
Environment variable: ${watchdog_table}
Example table name for operations: op_watchdog
Fields & Descriptions
wd_task_id        Unique task ID
wd_hostname       Hostname of the last task (informational only)
wd_ip_address     IP address of the last task (informational only)
wd_slave_server   Slave server of the last task (informational only)
wd_last_run       Date & time of the last task run and counter increment
                  (informational only)
wd_counter        The counter, incremented by every task run. This is
                  checked and reset by the watchdog.
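Since the course assumes relational database knowledge, here is a minimal DDL
sketch for the two watchdog tables; the column types are assumptions and should
be adjusted to your database:

CREATE TABLE op_watchdog_task (
  wd_task_id                   INTEGER PRIMARY KEY,
  wd_task_description          VARCHAR(255),
  wd_task_disabled             SMALLINT,
  wd_task_min_count            INTEGER,
  wd_task_max_count            INTEGER,
  wd_task_cycle_minutes        INTEGER,
  wd_task_lenient_count        INTEGER,
  wd_task_event_type           VARCHAR(50),
  wd_task_event_details        VARCHAR(255),
  wd_task_last_reset           TIMESTAMP,
  wd_task_last_detection_count INTEGER
);

CREATE TABLE op_watchdog (
  wd_task_id      INTEGER PRIMARY KEY,  -- references op_watchdog_task.wd_task_id
  wd_hostname     VARCHAR(255),
  wd_ip_address   VARCHAR(50),
  wd_slave_server VARCHAR(255),
  wd_last_run     TIMESTAMP,
  wd_counter      INTEGER
);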


Pattern: Watchdog
In the sample implementation, an event is fired after the cycle time, when:
The cycle time is reached AND
the counter is zero OR
the counter is below the minimum OR
the counter is above the maximum.
When a lenient count is defined, the watchdog waits to fire an event until it
reaches the defined number of detections. (See the SQL sketch below.)
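As an illustration, the core of the check could be expressed as a single SQL
query over the two tables (a sketch only; the cycle-time condition and the
lenient handling are omitted, and the real logic lives in the watchdog_check
transformation):

-- Find enabled tasks whose counter is missing or out of range:
SELECT d.wd_task_id
FROM   op_watchdog_task d
LEFT JOIN op_watchdog c ON c.wd_task_id = d.wd_task_id
WHERE  d.wd_task_disabled = 0
  AND (c.wd_counter IS NULL OR c.wd_counter = 0
       OR (d.wd_task_min_count > 0 AND c.wd_counter < d.wd_task_min_count)
       OR (d.wd_task_max_count > 0 AND c.wd_counter > d.wd_task_max_count));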
Watchdog Environment Variables
watchdog_task_table   The Watchdog Task Definitions table
watchdog_table        The Watchdog table for counting
watchdog_mode         normal: normal operation
                      disabled: do not increment the counter by the task
                      learn: reserved for future use, not used yet


Pattern: Watchdog
Overview of the sample implementation jobs and transformations:
Create a sample environment
Define a database connection operations_db and share it
Create the test tables with job test_watchdog_create_tables
Fill the watchdog task table with samples by transformation
test_fill_watchdog_task
Samples for incrementing the watchdog counters by tasks
When you want to run the watchdog task within a job, have a look at
test_watchdog_job, which calls transformation watchdog_task to increment
the counter.
When you want to run the watchdog task within a transformation, have a look
at transformation test_watchdog_task_streaming, which calls a
sub-transformation (mapping) watchdog_task_streaming. Make sure to define a
threshold for the number of processed rows after which the counter is
incremented; this is used to avoid performance problems. It is also possible
to call watchdog_task_streaming at the end with a blocking step, or at the
beginning.


Pattern: Watchdog
Sample for the watchdog to check the counter:
Job watchdog_main should be run at an interval, e.g. every minute.
This job calls other jobs and transformations to implement the logic: job
watchdog_check_wrapper, job watchdog_check and transformation
watchdog_check.
When an event is triggered, it calls the job event.
Test run:
When you let watchdog_main run for the first time, you will get the following
log entries for all tasks to check:
Result from watchdog_check - Detection: 0 (0=ok, 1=detection, 2=lenient ok)
[last_date is not valid, looks like the first run: initializing]
After the cycle time is reached and no task was run, you will get:
Result from watchdog_check - Detection: 1 (0=ok, 1=detection, 2=lenient ok)
[wd_counter is null or 0]
And the event is fired (in this case, the logging):
PDI Operations Event - Log ERROR: Task 99 exceeded
Feel free to let the tasks run to increment the counters, change some settings
and watch the different results!


Pattern: Health Check for the JVM


Analyze the available memory over a time period
Send an event on memory shortage (defined by thresholds)
An example implementation to collect the available memory can be found
in transformation JVM_collect_data:
Together with the last log date, hostname, current process identifier
(PID) and memory information, the environment variable
${operations_instance_id} is also logged, to differentiate entries by this id.

You may define thresholds in a separate table, check them similarly to the
other operations patterns, and fire events accordingly, e.g. when the
available memory goes below 20 percent, as sketched below.
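A sketch of such a threshold check in SQL; the op_jvm_log table and its columns
(instance_id, log_date, free_memory_pct) are hypothetical names for the data
collected by JVM_collect_data:

-- Fire an event when the latest sample of an instance is below 20% free memory:
SELECT l.instance_id, l.free_memory_pct
FROM   op_jvm_log l
WHERE  l.free_memory_pct < 20
  AND  l.log_date = (SELECT MAX(log_date)
                     FROM   op_jvm_log
                     WHERE  instance_id = l.instance_id);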


Pattern: Failover of a master server


Controller:
Checks if the main master server is up and running
If it is down, switches to a secondary master server
Master server:
The master server also checks if the controller is up and running
If it is down, it switches to a secondary controller

[Diagram: the Master and the Controller check each other and share a
"Failover Definitions & Status" table]


Pattern: Failover of a master server


An example implementation with PDI:
Supports multiple failover masters and controllers
If the main server comes up again, there is no automatic switch back:
A switch back could be accomplished on a per-process basis, or
when all processes (jobs and transformations) are finished. This
could lead to downtime for specific processes. For this reason,
an automatic switch back is not implemented.
As an option, a controlled, manual switch to another server is
possible
The slave servers are not taken into account, since the master is
capable of handling these with the Dynamic Clustering option
A dedicated DI Server, Carte server or Kitchen instance is needed for
the controller.


Pattern: Failover of a master server


The Failover Master Definitions table
Environment variable: ${failover_master_table}
Example table name for operations:
op_failover_master
Fields & Descriptions
fm_id           Unique failover master ID
fm_description  Description of this server
fm_status_url   The URL of the DI Server (e.g.
                http://localhost:9080/pentaho-di/kettle/status?xml=Y) or
                Carte server (e.g. http://localhost:8084/kettle/status/?xml=Y)
                used to check the status in XML format
fm_user         User for authentication (test values are joe or cluster)
fm_password     Password (encrypted is also possible, see encr.bat/.sh)


Pattern: Failover of a master server


Fields & Descriptions (Failover Master Definitions table continued)
fm_is_disabled     1 = do not check this server
fm_is_controller   1 = this is a controller (otherwise a master)
fm_is_primary      1 = this is the primary master or controller
                   (otherwise secondary, failover)
fm_is_active       1 = this is the currently active server (master or
                   controller) [changed automatically by the controller
                   or master]
fm_failover_order  1..n: order in which to activate failover masters or
                   controllers (since multiple failover servers are
                   possible)
fm_last_check      Date & time of the last status check



Pattern: Failover of a master server


Fields & Descriptions (Failover Master Definitions table continued)
fm_last_status                  1 = online, 0 = offline
fm_last_response_time           Last response time in ms
fm_last_response_message        Last response message: XML (usually on
                                success), HTML or exception text (up to
                                250 chars), depending on the failure
fm_last_nr_jobs                 Current number of running jobs
fm_last_nr_transformations      Current number of running transformations
fm_controlled_switch_to         [not used yet]
fm_controlled_switch_initiated  [not used yet]
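A minimal DDL sketch for this table; the column types are assumptions and
should be adjusted to your database:

CREATE TABLE op_failover_master (
  fm_id                          INTEGER PRIMARY KEY,
  fm_description                 VARCHAR(255),
  fm_status_url                  VARCHAR(255),
  fm_user                        VARCHAR(50),
  fm_password                    VARCHAR(100),
  fm_is_disabled                 SMALLINT,
  fm_is_controller               SMALLINT,
  fm_is_primary                  SMALLINT,
  fm_is_active                   SMALLINT,
  fm_failover_order              INTEGER,
  fm_last_check                  TIMESTAMP,
  fm_last_status                 SMALLINT,
  fm_last_response_time          INTEGER,
  fm_last_response_message       VARCHAR(250),
  fm_last_nr_jobs                INTEGER,
  fm_last_nr_transformations     INTEGER,
  fm_controlled_switch_to        INTEGER,    -- not used yet
  fm_controlled_switch_initiated TIMESTAMP   -- not used yet
);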


Pattern: Failover of a master server


Overview of the sample implementation jobs and transformations:
Create a sample environment
Define a database connection operations_db and share it
Create the test definition table with job test_create_failover_master_table
Fill the definition table with samples by transformation
test_fill_failover_master
Note: To simplify testing, you can disable or enable checks by setting the
field fm_is_disabled accordingly.
Failover of a master server, Environment Variables
failover_master_table     The Failover Master Definitions table
operations_instance_id=1  The instance id of each DI Server or Carte
                          server; this corresponds to its fm_id


Pattern: Failover of a master server


We propose to set the environment variables in the kettle.properties file on
each server.
If you want to set up a test environment with multiple DI or Carte servers on
one machine, you need to modify the KETTLE_HOME variable for each DI or
Carte server instance accordingly (see Knowledge Base for more information
about the KETTLE_HOME variable).
A startup batch script could look like this:
set KETTLE_HOME=C:\Pentaho\Kettle\KETTLE_HOME_3
cd "C:\Pentaho\pdi-ee-4.0.1-GA\data-integration"
start carte.bat 127.0.0.1 8085


Pattern: Failover of a master server


Sample for the failover process:
Job failover_main should be run at an interval, e.g. every five minutes.
It runs on every master and every controller, including the
failover servers.
This job calls other jobs and transformations to implement the
logic:
Job failover_check_status (includes transformation
failover_master_table_type_specific and job
failover_check_online) checks the status of the servers.
When it runs on the active controller, it checks the
master servers, and vice versa: the active master checks the
controller(s). This updates the status in the table.
Then job and transformation failover_change_status are executed
to change the active master (or controller) when it is down.
In the job failover_change_status, it is also possible to fire
events depending on a status change.


Pattern: Failover of a master server


Sample for checking if this is the active server:
Include job failover_am_i_active in your existing jobs. This makes
it possible to use the same scheduler settings on all servers and leave
them all active. (It would also be possible to start and stop the
scheduler, but then you have to ensure that historic jobs are not
fired up again when the scheduler starts.)
A sample implementation is in test_am_i_active; the core lookup is
sketched below.
It would also be possible to set an environment variable (e.g.
stop_processing) by a remote execution job (fired by an event). This
variable can then be checked by jobs or within streaming
transformations.
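A sketch of that lookup in SQL (the sample job failover_am_i_active wraps this
with variables and a job-level decision):

-- Is this instance the currently active master/controller?
SELECT fm_is_active
FROM   op_failover_master
WHERE  fm_id = ${operations_instance_id};
-- Continue processing only when the result is 1.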


Pattern: Failover of a master server


Test run for the failover procedure
Define the master and controller servers accordingly in the Failover Master
Definitions table; you may recreate the definition table with samples using
transformation test_fill_failover_master
Set the environment variable operations_instance_id according to your
definitions for each server instance
Start the DI and/or Carte servers
Execute job failover_main and look at the results in the Failover Master
Definitions table. The behavior depends on the operations_instance_id the
job runs under: if this is the active master, it checks all controllers;
if this is the active controller, it checks all masters.
The following example shows one active and one unreachable server:


Pattern: Failover of a master server


Test run for the failover procedure
You may change the environment variable operations_instance_id
according to your definitions for each server instance to check the
other servers.
Execute job failover_main and look at the results in the Failover
Master Definitions table.
Please check: when your active master server goes into offline
status, the next available master server gets the active flag after
the job is executed.


Pattern: Failover of a master server


Test run for the failover procedure
You may check what happens when you start the job
test_am_i_active on an active and on an inactive server.
If you want to change back to the primary master, you can watch the
number of running jobs and transformations on the failover server and
switch back manually by changing the active flags in the Failover Master
Definitions table.
Another possible extension, given the knowledge of the number of
running jobs and transformations, is to implement a load balancer
based on a threshold of parallel running processes.


Pattern: Health check for clustered environments


Tasks:
Check if the master server is up and running
Check if the slave servers are up and running
Start a transformation on the cluster nodes and see if the response
times are reasonable
You can modify the pattern "Failover of a master" to accomplish this
task: simply add the slave servers to the list of servers, add another type
for slaves to the definition table, and modify the job
failover_check_status to take this new type into account.
It is also possible to auto-register slave servers with a master and get the
list of registered slave servers from the master as part of the Dynamic
Clustering option.
A sample implementation for this pattern is not available at this time.


Pattern: Workload Balancing


Tasks:
Keep a workload queue for jobs and transformations
Gather information about the actual workload on each server
Depending on thresholds for specific measures (number of jobs, number of
transformations, memory usage, CPU usage, response times, etc.), you
can route the next job or transformation from the workload queue to a
specific server
Optionally, you may define specific thresholds for different jobs and
transformations (e.g. a job needs the server exclusively or dedicated), or
fire events when a) resources are not available for a certain amount of
time, or b) the queue gets too big or queue entries are not processed within
a certain amount of time.
You can modify the patterns "Failover of a master" and "Health check for
clustered environments" to accomplish this task and add some additional
logic around them.
A sample implementation for this pattern is not available at this time.


Pattern: Analyzing log entries


Analyze the log entries for:
Unusually short- or long-running jobs (signal-to-noise detection by thresholds)
Deadlock situations
Number of processed rows (min/max)
Assumptions:
Database logging is enabled
When analyzing deadlock situations, the logging interval must be set and must
be below the check cycle time
When analyzing the number of rows, these must be set accordingly in the
logging section of the transformation settings
Performance considerations:
Analyzing log entries can be a performance burden on the database and the
overall system. To minimize this, indexes should be set accordingly. By
design, this pattern limits access to the logging tables to the minimum
needed, e.g. by temporarily storing parts of the log data in a local file.


Pattern: Analyzing log entries


The Analyze Log Definitions table
Environment variable: ${analyze_log_table}
Example table name for operations: op_analyze_log
Fields & Descriptions
al_id             Unique analyze log ID
al_type           1 = job, 2 = transformation
al_name           Job or transformation name for checking the log entries
                  (* = all)
al_is_disabled    1 = do not check this entry
al_cycle_minutes  Defines the check cycle time in minutes
al_last_batch_id  The last completed batch_id (set to 1 at the beginning).
                  Used to limit the number of log entries to check.


Pattern: Analyzing log entries


Fields & Descriptions (Analyze Log Definitions table continued)
al_deadlock_minutes        When > 0: check for new log entries after this
                           time to detect deadlock situations
al_deadlock_event_type     When a deadlock is detected, fire this event
al_deadlock_event_details  See the events section for more details
al_min_minutes             When > 0: check at the end of a process if it
                           took at least x minutes
al_max_minutes             When > 0: check within the process if it already
                           took more than x minutes
al_time_event_type         When a timing issue is detected, fire this event
al_time_event_details      See the events section for more details


Pattern: Analyzing log entries


Fields & Descriptions (Analyze Log Definitions table continued)
al_min_rows                When > 0: check at the end of a process if it
                           processed at least x rows
al_max_rows                When > 0: check within the process if it already
                           processed more than x rows
al_row_event_type          When a number-of-rows issue is detected, fire
                           this event
al_row_event_details       See the events section for more details
al_is_status_check_failed  = 1: check for failed jobs and transformations
al_status_event_type       When a status issue is detected, fire this event
al_status_event_details    See the events section for more details
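A minimal DDL sketch for the definitions table; the column types are
assumptions and should be adjusted to your database:

CREATE TABLE op_analyze_log (
  al_id                     INTEGER PRIMARY KEY,
  al_type                   SMALLINT,      -- 1=job, 2=transformation
  al_name                   VARCHAR(255),  -- * = all
  al_is_disabled            SMALLINT,
  al_cycle_minutes          INTEGER,
  al_last_batch_id          INTEGER,
  al_deadlock_minutes       INTEGER,
  al_deadlock_event_type    VARCHAR(50),
  al_deadlock_event_details VARCHAR(255),
  al_min_minutes            INTEGER,
  al_max_minutes            INTEGER,
  al_time_event_type        VARCHAR(50),
  al_time_event_details     VARCHAR(255),
  al_min_rows               INTEGER,
  al_max_rows               INTEGER,
  al_row_event_type         VARCHAR(50),
  al_row_event_details      VARCHAR(255),
  al_is_status_check_failed SMALLINT,
  al_status_event_type      VARCHAR(50),
  al_status_event_details   VARCHAR(255)
);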


Pattern: Analyzing log entries


The Analyze Log table for checking
Environment variable: ${analyze_log_check_table}
Example table name for operations:
op_analyze_log_check
Fields & Descriptions
al_id                   Unique analyze log ID
al_last_check           Date and time of the last check
al_last_check_batch_id  Corresponding BATCH_ID of the log
al_last_channel_id      Corresponding CHANNEL_ID of the log. This is used
                        together with the al_id to determine the
                        corresponding log entry.
al_last_status          Copy of the last status from the original log


Pattern: Analyzing log entries


Fields & Descriptions (Analyze Log table for checking continued)
al_last_log_crc         The last CRC of the log message from the original
                        log entry, used to detect changes and deadlock
                        situations
al_last_log_crc_change  Date and time of the last CRC change
al_is_finished          = 1 means this entry reflects a finished
                        transformation or job without issues
al_detection            > 0 means this entry reflects an entry with an
                        issue (1=failed, 2=deadlock, 3=min time,
                        4=max time, 5=min rows, 6=max rows)

Note: When al_is_finished = 1 or al_detection > 0, this entry will no
longer be checked, to avoid multiple events for the same log entry.
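And a matching DDL sketch for the check table (again, the column types are
assumptions):

CREATE TABLE op_analyze_log_check (
  al_id                  INTEGER PRIMARY KEY,
  al_last_check          TIMESTAMP,
  al_last_check_batch_id INTEGER,
  al_last_channel_id     VARCHAR(255),
  al_last_status         VARCHAR(50),
  al_last_log_crc        BIGINT,
  al_last_log_crc_change TIMESTAMP,
  al_is_finished         SMALLINT,
  al_detection           SMALLINT
);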


Pattern: Analyzing log entries


In the sample implementation, an event is fired after the cycle time, when:
The cycle time is reached AND
when checking for failed status: the status is "stop" or ERRORS > 0, OR
when checking for deadlocks: no change in the log message field for the
specified amount of time, OR
when max/min rows are defined: the maximum number of rows is reached while
running, or the minimum number of rows is not reached after the process has
finished, OR
when max/min minutes are defined: the maximum number of minutes is reached
while running, or the minimum number of minutes is not reached after the
process has finished.
For details, please see the transformation analyze_log_check; the deadlock
part is sketched below.
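As an illustration of the deadlock check, a sketch in SQL based on the two
operations tables (interval arithmetic syntax varies by database; shown here
in PostgreSQL style):

-- Deadlock suspicion: the log-message CRC has not changed for longer
-- than al_deadlock_minutes while the process is still unfinished:
SELECT c.al_id
FROM   op_analyze_log_check c
JOIN   op_analyze_log d ON d.al_id = c.al_id
WHERE  c.al_is_finished = 0
  AND  c.al_detection = 0
  AND  d.al_deadlock_minutes > 0
  AND  c.al_last_log_crc_change <
       CURRENT_TIMESTAMP - d.al_deadlock_minutes * INTERVAL '1 minute';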
Analyze Log Environment Variables
analyze_log_table        The Analyze Log Definitions table
analyze_log_check_table  The Analyze Log table for checking


Pattern: Analyzing log entries


Overview of the sample implementation jobs and transformations:
Create a sample environment
Define a database connection operations_db and share it
Create the test tables with job test_create_log_tables
Fill the analyze log table with samples by transformation
test_fill_analyze_log_tables
Note: To simplify testing, you can disable or enable checks by setting
the field al_is_disabled accordingly.
Samples for simulating malfunctions
When you want to simulate a deadlock situation, have a look at
transformation and job test_deadlock.
When you want to simulate an error and set a process to failed, have a
look at transformation and job test_failed.


Pattern: Analyzing log entries


Sample for analyzing the log entries:
Job analyze_log_main should be run at an interval, e.g. every five minutes.
This job calls other jobs and transformations to implement the logic: job
analyze_log_check_wrapper, transformation analyze_log_load_temp, job
analyze_log_check and transformation analyze_log_check.
When an event is triggered, it calls the job event.
When multiple exceptions are found for a process, the highest detection
value is used to fire the event (see the definition of field al_detection).
Test run, general test:
When you let analyze_log_main run for the first time, you will get the
following in the log entries:
Result from analyze_log__check - Detection: 0 (0=ok, 1=failed, 2=deadlock,
3=min time, 4=max time, 5=min rows, 6=max rows) [unknown]
This is correct, since there are no log entries to analyze yet.


Pattern: Analyzing log entries


Test run for a failed transformation:
Prepare by filling the analyze log table with samples by transformation
test_fill_analyze_log_tables (set al_is_disabled to 0 for al_id=5)
Execute transformation test_failed
Execute analyze_log_main again; you will get the following in the log entries:
Result from analyze_log__check - Detection: 1 (0=ok, 1=failed, 2=deadlock,
3=min time, 4=max time, 5=min rows, 6=max rows) [Status indicates failed
(=stop) or ERRORS>0.]
And the event: PDI Operations Event - ERROR: Transformation test_failure
failed
Have a look at table op_analyze_log and check that al_last_batch_id has
changed for this log entry
Have a look at table op_analyze_log_check and check that al_detection is 1
for this log entry


Pattern: Analyzing log entries


Test run for a failed job:
Prepare by filling the analyze log table with samples by transformation
test_fill_analyze_log_tables (set al_is_disabled to 0 for al_id=4)
Execute job test_failed
Execute analyze_log_main again; you will get the following in the log entries:
Result from analyze_log__check - Detection: 1 (0=ok, 1=failed, 2=deadlock,
3=min time, 4=max time, 5=min rows, 6=max rows) [Status indicates failed
(=stop) or ERRORS>0.]
And the event: PDI Operations Event - ERROR: Job failed
The definition * was used for this check; see table op_analyze_log
and the test entry al_id=4
Have a look at table op_analyze_log and check that al_last_batch_id has
changed for this log entry
Have a look at table op_analyze_log_check and check that al_detection is 1
for this log entry
Reruns of analyze_log_main will not log this failed job any more.


Pattern: Analyzing log entries


Test run for a deadlocked transformation or job:
Prepare by filling the analyze log table with samples by transformation
test_fill_analyze_log_tables (set al_is_disabled to 0 for al_id=2 for the
transformation or al_id=1 for the job)
Execute transformation or job test_deadlock
Note: Due to http://jira.pentaho.com/browse/PDI-4557, you need to execute
this remotely and not in the same Spoon instance
Execute analyze_log_main again; you will not get anything about a deadlock in
the log entries until the al_deadlock_minutes time is reached. You may see
messages like: [Not finished, yet. Checking next time again.] or [Check cycle
time not reached, yet. Checking next time again.]
Execute analyze_log_main again after the al_deadlock_minutes time is
reached; you will get the following in the log entries:
Result from analyze_log__check - Detection: 2 (0=ok, 1=failed, 2=deadlock,
3=min time, 4=max time, 5=min rows, 6=max rows) [Deadlock detected.]
And the event: PDI Operations Event - ERROR: dead lock in transformation
test_deadlock
As before: reruns of analyze_log_main will not log this deadlock any more.


Further Information


More Resources
Kettle project page:
http://kettle.pentaho.com
Enterprise Edition Documentation, Knowledge Base Articles and more
http://kb.pentaho.com/
Community Documentation (WIKI):
http://wiki.pentaho.com/display/EAI/
For up-to-date information, check the forums:
http://forums.pentaho.org/forumdisplay.php?f=69
Bug and Feature Requests with Road Maps (JIRA):
http://jira.pentaho.com
FAQ for Bug and Feature Requests:
http://wiki.pentaho.com/display/EAI/Bug+Reports+and+Feature+Requests+FAQ


More Resources
Community:
http://community.pentaho.com
Pentaho Open Source Business Intelligence Suite - European User Group
http://xing.com/net/pug
Pentaho Open Source Business Intelligence at LinkedIn
http://www.linkedin.com/groups?gid=105573
Other user groups
http://wiki.pentaho.com/display/COM/Pentaho+User+Groups


