
The Date Generation Transform is similar to the Row Generation Transform, except that the output column is of type date.

Throughput: 1'500'000 rows/sec
CPU: 100%
Memory: none
Disk: none
DoP: singularization point

Transform Usage

Transform Settings

It requires no memory, just CPU, and the throughput is high: 1.5 million rows per second.
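Conceptually it behaves like a simple date-series generator. Here is a minimal Python sketch of that idea - the column and parameter names are made up for illustration, this is not the engine's actual implementation:

```python
from datetime import date, timedelta

def date_generation(start: date, end: date, increment_days: int = 1):
    """Yield one row per date with a single date column.
    Mimics the idea of the Date Generation Transform; names are illustrative."""
    current = start
    while current <= end:
        yield {"DI_GENERATED_DATE": current}
        current += timedelta(days=increment_days)

# Example: generate all days of 2023
rows = list(date_generation(date(2023, 1, 1), date(2023, 12, 31)))
print(len(rows), rows[0], rows[-1])
```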

The Row Generation Transform does nothing other than generate one row after the other, with a single INTEGER column.

Throughput: 1'900'000 rows/sec
CPU: 100%
Memory: none
Disk: none
DoP: singularization point

Transform Usage

Transform Settings

Obviously, it does not require any memory, but it takes as much CPU as it can get and is rather quick: 100'000'000 records in less than a minute, or in other words a throughput of 1.9 million rows per second.
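In spirit the transform is just a counter loop. A rough Python equivalent, purely illustrative (the output column name is an assumption):

```python
import time

def row_generation(row_count: int, start: int = 1):
    """Yield row_count rows, each with a single integer column."""
    for i in range(start, start + row_count):
        yield {"DI_ROW_ID": i}

# Rough throughput check on the generator alone (no downstream transforms)
t0 = time.time()
n = sum(1 for _ in row_generation(1_000_000))
print(f"{n} rows in {time.time() - t0:.1f} secs")
```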

The Case Transform is used to route one row to different outputs depending on conditions. It has options to either output the row just once - to the first valid condition only - or to all outputs with valid conditions. The advantage is mainly usability, as making conditions mutually exclusive is not always straightforward (condition1: a<10; condition2: a>=10; and what about null values, hmmm?). Performance is also slightly better because the number of checks is reduced, but mostly it helps the user get a better understanding of the designed flow.

Throughput: 458'000 rows/sec
CPU: 100%
Memory: none
Disk: none
DoP: supported

Transform Usage

Transform Settings

The Case Transform itself does not have a performance impact, but the expression evaluation inside it does. In the example used, the first condition is valid for the first 10% of the rows, the second for the next 10%, and so on. So for the first 10% of the rows just one comparison has to be made to find the matching output, but for the last 10'000'000 rows nine comparisons have to be made for every single row!

The impact of that can be seen best when sorting the monitor file by execution time: the time for the different paths keeps increasing.

Anyway, by looking at the process monitor it can easily be seen that Case of course does not require any memory, it consumes just CPU. And the performance is still 317'000 rows per second. But actually, we made a mistake when entering the conditions: we typed in the full ranges although we do not have to.

This way the first condition has to be tested for all 100 million rows, the second for 90 million, the third for 80 million and so on - an overall of 550 million condition evaluations. On top of that, each condition is unnecessarily complex. With the checkbox "Can be true for one case only" we just need to make an exclusion test.
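To make the arithmetic concrete, here is a scaled-down Python model - 100 rows standing in for 100 million, 10 cases of 10 rows each. It only illustrates the counting logic, not how the engine evaluates expressions:

```python
# Scaled-down model: 100 rows instead of 100 million, 10 cases of 10 rows each.
ROWS = range(100)
UPPER_BOUNDS = list(range(10, 101, 10))     # 10, 20, ..., 100

def comparisons_with_range_conditions():
    """Conditions as typed in at first: lower <= value AND value < upper."""
    comparisons = 0
    for v in ROWS:
        for upper in UPPER_BOUNDS:
            lower = upper - 10
            comparisons += 2                # both bounds are checked
            if lower <= v < upper:
                break                       # first matching case wins
    return comparisons

def comparisons_with_exclusion_tests():
    """With 'true for one case only' each condition is a single upper-bound test."""
    comparisons = 0
    for v in ROWS:
        for upper in UPPER_BOUNDS:
            comparisons += 1
            if v < upper:
                break
    return comparisons

# 550 condition evaluations either way, but half the comparison work:
print(comparisons_with_range_conditions())   # 1100
print(comparisons_with_exclusion_tests())    # 550
```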

If the value is less than 10'000'000 the first condition is true; otherwise a second test - whether the value is less than 20 million - is performed. There is no need to also test whether the value is greater than 10 million. The performance impact is huge because of the large number of executions: suddenly the entire flow takes only 218 seconds, which is a throughput of 458'000 rows per second. As a comparison, I modified the conditions so that the first one is always valid (132 secs to finish), and in another run so that the last condition is always valid (282 secs). It really is the number of condition executions that impacts the throughput. So we learned to keep the conditions as simple as possible and to put the condition that is valid most often first. On the other hand, even in the worst case we have a throughput of 350'000 rows per second, ten times what the database will be able to process. So actually, there is nothing to worry about. And if there were, we could still parallelize the transform.

Loading a table with surrogate key

To delta load a table which has a surrogate key as its primary key, you simply combine the Table Comparison with the Key Generation Transform. Behind the scenes, some not so obvious things go on. The flow itself is simple.

The secret is the Table Comparison Transform, which performs a lookup on the target table to get the current status. With this lookup it can read, and will output, all the columns, even those that are not part of the input schema - for example the surrogate key of the target table. The Table Comparison gets the logical primary key as input, tries to find the matching row in the comparison (target) table and, if found, picks up all the missing column values as well, such as the corresponding KEY_ID.

The only thing to watch out for is that, if using the row-by-row setting of the transform, the compare table has to have an index on the logical(!!) primary key - CUSTOMER_ID in this example. The Key Generation Transform then receives insert and update rows but will overwrite the KEY_ID column only for rows with an OP code of insert - rows that do not have a KEY_ID yet.

And at the end, the target table loader executes the insert or the update statement. For the update statement it needs to know what the physical primary key is, so make sure the target table was imported into the DI repository with its primary key column(s).
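The interplay of the two transforms can be modelled in a few lines of Python. This is only a sketch of the logic described above, assuming a lookup dictionary keyed by the logical primary key; CUSTOMER_ID and KEY_ID are the column names from the example, everything else is hypothetical:

```python
def table_comparison(input_rows, target_by_logical_key, compare_columns):
    """Rough model of Table Comparison: look up each row by its logical primary
    key (CUSTOMER_ID) and emit an insert or update row, copying back columns
    that exist only in the target - such as the surrogate key KEY_ID."""
    for row in input_rows:
        existing = target_by_logical_key.get(row["CUSTOMER_ID"])
        if existing is None:
            yield {"OP": "insert", **row, "KEY_ID": None}
        elif any(row[c] != existing[c] for c in compare_columns):
            yield {"OP": "update", **row, "KEY_ID": existing["KEY_ID"]}
        # unchanged rows are not output at all

def key_generation(rows, start_key):
    """Rough model of Key Generation: overwrite KEY_ID only for insert rows."""
    next_key = start_key
    for row in rows:
        if row["OP"] == "insert":
            row["KEY_ID"] = next_key
            next_key += 1
        yield row

# Tiny illustration
target = {1: {"CUSTOMER_ID": 1, "NAME": "Ann", "KEY_ID": 100}}
source = [{"CUSTOMER_ID": 1, "NAME": "Anne"}, {"CUSTOMER_ID": 2, "NAME": "Bob"}]
for r in key_generation(table_comparison(source, target, ["NAME"]), start_key=101):
    print(r)
```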

The Validation Transform is a combination of a Query and a Case Transform. First, conditions are tested, e.g. does the column GENDER contain only M, F or ?; then a flag decides whether a failed condition causes the entire row to be marked as failed. And finally all valid rows are routed to one output and all failed rows to another, just like in a Case Transform.

Throughput: 934'000 rows/sec
CPU: 100%
Memory: none
Disk: none
DoP: supported

Transform Usage

Transform Settings

The major advantage is that it is simpler to use and provides some help for the most common validation tests. Performance will be similar to the Query and Case Transforms, which is fast - as long as no complex functions are used, let alone database functions like lookup_ext().
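As a mental model, the transform boils down to "evaluate each rule, collect the violations, route the row". A minimal Python sketch with the GENDER rule from above; the rule structure is invented for illustration:

```python
def validation(rows):
    """Sketch of the Validation Transform idea: test each rule, and if a rule
    flagged as 'send to fail' is violated, route the whole row to Fail."""
    rules = [
        # (rule name, condition, does a violation fail the row?)
        ("GENDER_check", lambda r: r["GENDER"] in ("M", "F", "?"), True),
    ]
    passed, failed = [], []
    for row in rows:
        violations = [name for name, cond, fails_row in rules
                      if not cond(row) and fails_row]
        (failed if violations else passed).append((row, violations))
    return passed, failed

ok, bad = validation([{"GENDER": "M"}, {"GENDER": "X"}])
print(len(ok), len(bad))   # 1 1
```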

The Merge Transform is nothing else than a UNION ALL operator. It takes the rows from all its inputs and copies them to the output. Obviously, all inputs have to be of the same structure, including datatypes and column order. Although it theoretically could sometimes be pushed down to the database, it is not.

Throughput: 1'470'000 rows/sec
CPU: 100%
Memory: none
Disk: none
DoP: supported

Transform Usage

Transform Settings

The Merge Transform consumes no memory and requires almost no CPU, no matter whether it is merging data from one or ten inputs. To prove that, two versions of the dataflow have been created: one from the previous example with the Row Generation Transform plus the Merge Transform, and a second case where 10 Row Generation Transforms each build 1/10th of the data and all are merged together. The performance of both is pretty much the same as with Row Generation only; obviously there is some overhead in having 10 Row Generation Transforms versus just one, but it is marginal.
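A UNION ALL is really just a concatenation of the input streams, which is why the number of inputs barely matters. A tiny Python sketch of the two test setups (one producer versus ten producers of a tenth each):

```python
from itertools import chain

def merge(*inputs):
    """UNION ALL semantics: concatenate all inputs; schemas must match."""
    return chain(*inputs)

# One generator vs. ten generators producing a tenth each - same output volume
one = (i for i in range(100))
ten = [(i for i in range(start, start + 10)) for start in range(0, 100, 10)]
assert sum(1 for _ in merge(one)) == sum(1 for _ in merge(*ten)) == 100
```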

The Map CDC Operation is used to read log tables/files where each change (insert/update/delete) is saved. In the flow we therefore need to map all rows logged as insert to an insert, all updates to an update and the deletes to a delete. This could be done with a Case Transform plus three Map Operations; the Map CDC Operation is just more convenient.

Throughput: 196'000 rows/sec
CPU: 100%
Memory: none
Disk: none
DoP: supported

Transform Usage

Transform Settings

Another aspect of this kind of processing is the order. It would not make a lot of sense to delete the row first and then insert it; the order has to be maintained. Therefore, this transform will always sort the data by the sequencing column first and by the "additional primary key" columns second. In most cases the data will be read sorted already, e.g. because the order by clause of a query is pushed down or because the flat file is read sequentially anyway, so the checkbox "input already sorted" will be checked. If the data is not sorted, uncheck this box and the engine will add a sort operation upfront. In other words, the core transform itself is streamlined; the optimizer adds a sort operation first if required, and that sort is obviously not streamlined. Possible OP codes are:

I ... Insert
B ... Before image of an update
U ... Update
D ... Delete

The performance of this transform is not top: 500 secs execution time, which is still a throughput of 196'000 rows per second.

Okay, for the Map CDC Operation another Query had to be added as well, to define the additionally required columns for the OP code and the sequence, but that is not a big deal.
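Put together, the transform's job is: sort by the sequencing column (if not already sorted) and translate the logged operation into the row's OP code. A simplified Python sketch - the column names CDC_SEQUENCE and CDC_OP are invented, and the sort is reduced to the sequencing column only:

```python
def map_cdc_operation(log_rows, seq_col="CDC_SEQUENCE", op_col="CDC_OP",
                      already_sorted=False):
    """Sketch of the Map CDC Operation idea: sort by the sequencing column if
    needed, then translate the logged operation into the row's OP code."""
    op_map = {"I": "insert", "B": "before_image", "U": "update", "D": "delete"}
    rows = log_rows if already_sorted else sorted(log_rows, key=lambda r: r[seq_col])
    for row in rows:
        yield op_map[row[op_col]], row

log = [
    {"CDC_SEQUENCE": 2, "CDC_OP": "D", "ID": 1},
    {"CDC_SEQUENCE": 1, "CDC_OP": "I", "ID": 1},
]
for op, row in map_cdc_operation(log):
    print(op, row["ID"])     # insert first, then delete - order preserved
```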

The Map Operation is used to deal with the OP codes of the rows. Each row can be marked as normal/insert/update/delete. Most transforms require normal rows and send out normal rows only. But with this transform the OP code can be set or changed, or all rows with a specific OP code can be discarded (see How to delete records? for a use case).

Throughput: 3'300'000 rows/sec
CPU: 100%
Memory: none
Disk: none
DoP: supported

Transform Usage

Transform Settings

In the previous examples we used the Map Operation to discard all normal rows, and the upstream transforms generated normal rows only. So we implicitly know already that it is of high performance and does not require memory. But to actually measure its own cost we can simply compare the Row Generation Transform test case (52 secs) with another version which has an extra Map Operation: 82 secs.
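The transform's logic is a trivial per-row mapping, which explains the high throughput. A hedged Python sketch of the idea (the mapping table and OP code strings are illustrative):

```python
def map_operation(rows, mapping):
    """Sketch of the Map Operation idea: for each incoming OP code decide the
    outgoing OP code, or discard the row entirely."""
    for op, row in rows:
        target = mapping.get(op, "discard")
        if target != "discard":
            yield target, row

stream = [("normal", {"ID": 1}), ("normal", {"ID": 2})]
# e.g. turn every normal row into a delete (a common "how to delete records" trick)
for op, row in map_operation(stream, {"normal": "delete"}):
    print(op, row)
```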

The Pivot Transform is used to pivot multiple columns into multiple rows, e.g. the source has 12 columns for the revenue in Jan, Feb, etc., but in the target table we want just one column called revenue, with 12 rows per source row.

Throughput: 1'000'000 rows/sec
CPU: 100%
Memory: none
Disk: none
DoP: supported

Transform Usage

Transform Settings

Since the transform requires some columns to pivot, the dataflow got an additional Query to map these columns to constants. Executing the flow with the new Query but without the Pivot Transform takes 220 seconds - so that is the amount of time needed for generating the additional columns and for the additional transform. With the Pivot Transform added, the execution time is 1005 secs.

First of all, it is obvious that the Pivot Transform does not require additional memory, only CPU. The throughput is 100'000 rows per second if the input row count is taken as the basis and, since the output in this example is 10 times as big, 1 million rows per second if the target table row count is taken as the basis. It is also difficult to say that the Pivot Transform itself is causing the increase in execution time from 200 to 1000 secs, as very likely the handover of the 1 billion rows to the Map Operation is responsible for much of it. We can test the Pivot Transform itself by reducing the list of pivoted columns inside the transform to just one: 298 secs. So it should be safe to say the Pivot Transform itself does not consume much time; it is more the number of rows being output that is responsible for the additional time. Either way, the performance is good.
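The core operation is a simple unpivot: each listed column of an input row becomes one output row. A small Python sketch of that idea, with three month columns instead of twelve and invented column names:

```python
def pivot(rows, pivot_columns, value_name="REVENUE", header_name="MONTH"):
    """Sketch of the Pivot Transform idea: turn the listed columns of one input
    row into one output row each (non-pivoted columns are repeated)."""
    for row in rows:
        base = {k: v for k, v in row.items() if k not in pivot_columns}
        for col in pivot_columns:
            yield {**base, header_name: col, value_name: row[col]}

months = ["JAN", "FEB", "MAR"]          # 12 in the real example, 3 here
src = [{"CUSTOMER_ID": 1, "JAN": 10, "FEB": 20, "MAR": 30}]
print(list(pivot(src, months)))          # 3 output rows per input row
```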
