
DoP and Partitions

In the previous chapter we said the source table has to be partitioned to allow for parallel reading. Either the table is already partitioned in the database, or we edit the table in the object library and maintain the partition information ourselves. The results are similar but not identical.
If the table is physically partitioned, each reader adds a clause to read just the partition of the given name; in Oracle that would look like "select * from table partition (partitionname)". This has three effects. First, the database reads the data of just that one partition and does not touch the others. Second, it works with hash partitions as well. Third, if the partition information in the DI repository is not current (say somebody added another partition and the table was never re-imported), DI would not read any data from that new partition. Worse, DI would not even know it failed to read the entire table. To minimize the impact, the engine checks whether the partition information is still current and raises a warning(!) if it is not.
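To make the per-partition reads concrete, here is a minimal sketch of the kind of statement each reader instance ends up issuing. The table and partition names are invented for illustration; DI generates the real statements internally.

```python
def partition_reader_sql(table, partitions):
    """Build one Oracle-style SELECT per physical partition,
    one statement per DI reader instance (sketch only)."""
    return [f"select * from {table} partition ({p})" for p in partitions]

# Hypothetical example: a SALES table with one partition per year.
for stmt in partition_reader_sql("sales", ["y2005", "y2006", "y2007"]):
    print(stmt)
```

Each statement touches exactly one partition, which is why a stale partition list in the repository silently leaves a partition unread.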
Another problem with physical partitions is that the data might not be distributed equally across the readers. Imagine a table partitioned by year. If you read the entire table, each partition contributes roughly the same number of rows. But what if I am interested in last year's data only? Then I have ten readers, one per partition, and each reader carries the where clause YEAR >= 2007. Nine of them will hardly return any data, hmm? In that case
it would be a good idea to delete the partition information of that table in the repository and define a different one, e.g. partition by sales region or whatever suits the query.
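A toy calculation shows the skew. With ten year partitions and a filter of YEAR >= 2007, only one reader does any work (the row counts are invented):

```python
# Ten year partitions, one million rows each (invented numbers).
rows_per_partition = {year: 1_000_000 for year in range(1998, 2008)}

def rows_read(year, filter_from=2007):
    """Rows a reader assigned to one year partition returns
    after the where clause YEAR >= filter_from is applied."""
    return rows_per_partition[year] if year >= filter_from else 0

work = {year: rows_read(year) for year in rows_per_partition}
busy = [y for y, n in work.items() if n > 0]
print(busy)  # only the 2007 reader returns rows
```

Nine readers are started, scan their partition metadata, and contribute nothing; the whole dataflow runs at the speed of the single busy reader.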
Something that is not possible yet is having two kinds of partitions at once. In the above example you might have an initial load that reads all years and a delta load that reads just the changed data, most of which is in the current year, obviously. So for the initial load the physical partition information would make sense, for the delta load the manual partitions. That cannot be done yet with DI. On the other hand, a delta load deals with smaller volumes anyway, so one can hope that parallel reading is not that important there and only the transforms like Table Comparison need to be parallelized. The delta-load dataflow would then have DoP set, but not the enable-partitions flag in the reader.
Manual partitions have an impact as well. Each reader has to read distinct data, so each one gets a where clause derived from the partition information. In the worst case, every reader scans the entire table to find the rows matching its where condition. So for ten manually created partitions we get ten readers, each scanning the entire source table. Even if there is an index on the column we used for the manual partitioning, the database optimizer might decide that reading the index plus the table takes longer than simply scanning the table. This is something to be very careful with. In a perfect world, the source table would be partitioned by one criterion we use for the initial load and subpartitioned by another one we can use as the manual partition for the delta load. And to handle the two partitioning schemes independently, the delta load reads from a database view instead of the table, so we have two objects in the object library, each with its own partitioning scheme.
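The where clauses that manual range partitions translate into can be sketched like this (hypothetical column and boundary values; DI builds the real predicates from the repository's partition information):

```python
def manual_partition_predicates(column, boundaries):
    """Turn manual range-partition upper boundaries into the where
    clauses each DI reader instance would append (sketch only)."""
    preds = []
    lo = None
    for hi in boundaries:
        if lo is None:
            preds.append(f"{column} < {hi}")
        else:
            preds.append(f"{column} >= {lo} and {column} < {hi}")
        lo = hi
    preds.append(f"{column} >= {lo}")  # open-ended last partition
    return preds

# Hypothetical manual partitioning of SALES by REGION_ID.
for p in manual_partition_predicates("region_id", [10, 20, 30]):
    print(p)
```

Note that nothing forces the database to use an index for these predicates; without a matching index or physical partition, each of the resulting statements can trigger a full table scan.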
As said, DoP is useful whenever the database throughput is so high that one transform instance in DI cannot process the data fast enough. With simple queries that point is reached at more than a million rows per second; with other transforms, like Table Comparison in row-by-row mode, it is already in the 10'000 rows per second area.
But normally you will find the table loader to be the bottleneck, with all the overhead on the database side: parse the SQL, find empty space to write the row, evaluate constraints, save the data in the redo log, copy the old database block to the rollback segment so that select queries started before the insert can still see the old values, and so on. So when we aim for high-performance loading, very quickly you have no choice other than using the API bulkloaders. They bypass all the SQL overhead, the redo log, everything, and write into the database file directly instead. For that, the table needs to be locked for writing. And how do you support a number of loaders if the table is already locked by the first loader? You can't. The only option is to use API bulkloaders loading multiple physical tables in parallel, and that means loading partitioned tables. Each API bulkloader loads one partition of the table only and hence locks that partition exclusively for writing, but not the entire table. The impact on the DI dataflow is that as soon as enable-partitions is checked on the loader, the optimizer has to redesign the dataflow to make sure each loader gets the data of its partition only.

Each stream of data passes through a Case transform that routes the rows, according to the target table's partition information, into one loader instance. This target table partition obviously has to be a physical partition, and it has to be current, or the API bulkloader will raise an error saying that the physical partition does not exist.
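The routing the Case transform performs can be mimicked in a few lines. This is a sketch under the assumption of range partitioning on one key, with invented column and boundary values; it is not DI's actual implementation:

```python
def route_to_loaders(rows, key, boundaries):
    """Split the row stream so each API bulkloader instance receives
    only the rows of its target partition (Case-transform sketch).
    Range partitioning on `key` with upper `boundaries` is assumed."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        idx = sum(row[key] >= b for b in boundaries)  # find the partition
        buckets[idx].append(row)
    return buckets

rows = [{"year": 2005}, {"year": 2006}, {"year": 2007}, {"year": 2007}]
loads = route_to_loaders(rows, "year", [2006, 2007])
print([len(b) for b in loads])  # rows per loader instance
```

Because every row is assigned to exactly one bucket, each loader can lock and bulk-write its own partition without ever colliding with the others.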
Using enable-partitions on the loader is useful for API bulkloaders only. If regular insert statements are to be generated, the number-of-loaders parameter is probably the better choice.
