Winning ETL battles

By Sergey Alentyev – Senior Database Architect – Cobalt Intelligence Center of Excellence

At the end of another long day of ETL project development I feel pretty tired. Looking at my ETL tool mapping, I no longer see Sources, Transformations, Lookups and Targets connected by many arrows. Instead, my imagination shows me a historical battlefield plan: troop positions, phalanxes, fortresses, cavalry attacks.

Obviously I need some rest.

But this picture didn’t pop up without reason. Just as a military commander maneuvers the right parts of his troops at the right time, destroys the enemy’s forces, and occupies his positions, fortresses or cities, an ETL architect or developer moves and processes a lot of data from the sources to the stage, fact and dimension targets. In both cases it is all about efficient utilization of existing resources at the right time. Let’s try to find analogies in military history that can help us better understand such resource utilization. Perhaps we will be able to draw some useful conclusions. Sounds crazy? Let’s see.

Battle of Thermopylae

A plan for ETL battle

We have a powerful database server, a very strong ETL server, plenty of memory and CPUs. Our ETL mappings and workflows are written by very experienced developers. So why does it take so long to run? First, we ship all this massive data from the source database to the ETL server. On the ETL server we process all our data record by record. Or, in a slightly better case, we run a few parallel threads where each thread still processes the data row by row. And finally, we ship all the results back to the target database, which is the same as the source database. No matter how much more memory and CPU we add, we won’t scale. It reminds me of something…

480 BC. The Persian emperor Xerxes amassed a huge army and set out to conquer all of Greece. The Athenian general Themistocles proposed that the allied Greeks block the advance of the Persian army at the pass of Thermopylae. A relatively small army of Spartans and other Greeks led by King Leonidas blocked the narrow pass. You might have seen the movie.

The Decisive Moment (look familiar?)
(Image from http://en.wikipedia.org/wiki/File:Battle_issus_decisive.gif)

If the Persian army had met the Greeks in an open place they would have smashed them very quickly, simply because they significantly outnumbered them (estimates vary widely by historian: Persians from 70,000 to more than one million warriors, Greeks 5,000 – 7,000). For two days all Persian attacks ended with nothing but Persian casualties. On the third day a traitor showed the Persians a secret path, and the Greeks were surrounded. We know the end from books and movies.

Besides the Greeks’ unquestionable courage, military skill and love for their homeland, there was another reason for the initial failure of the Persian army: well-chosen terrain. The Persians had a long journey to the battlefield, and they could not bring the massive force of their army to bear. They could not process in parallel, so to speak. The pass was 100 meters wide at most. They had to process almost in serial mode.

If we can’t utilize the resources available to us at the right moment, we can lose our ETL battle. Or at least not win it.

Macedonian phalanx

When I see an ETL tool processing row by row, or an INSERT… VALUES… SQL script, or some row-by-row looping like this in PL/SQL:

DECLARE
  CURSOR c IS
    SELECT …
    FROM   …
    WHERE  …;
BEGIN
  FOR r IN c LOOP
    INSERT INTO target_test VALUES (r…);
  END LOOP;
END;

I imagine two enemy armies standing opposite each other on the battlefield. One side starts the attack: one warrior runs towards the enemy line with a battle cry, then another, then another. Not a very powerful attack.

Bulk processing, in PL/SQL and in ETL tools, is better and should make a big difference.
Like a PL/SQL bulk INSERT:

DECLARE
  CURSOR c IS
    SELECT …
    FROM   …
    WHERE  …;
  TYPE t IS TABLE OF c%ROWTYPE;
  l t;
BEGIN
  OPEN c;
  LOOP
    -- fetch and insert 1000 rows at a time
    FETCH c BULK COLLECT INTO l LIMIT 1000;
    FORALL i IN 1..l.COUNT
      INSERT INTO target_test VALUES l(i);
    COMMIT;
    EXIT WHEN c%NOTFOUND;
  END LOOP;
  CLOSE c;
END;
/

Now we send 1,000 warriors to attack together. A scary picture! But when they get close to the enemy line they attack one by one again, just at a faster pace.

INSERT … SELECT … is more effective and faster, closer to the situation where at least the first line attacks at the same time. INSERT … SELECT … in parallel is the situation where many warriors attack together.
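
As a minimal sketch of the set-based version (the source table and column names here are illustrative assumptions, not from the article; target_test is the same target as in the loop examples above), the whole batch is loaded in a single statement, optionally with a direct-path APPEND hint:

-- one set-based, direct-path load instead of a row-by-row loop
INSERT /*+ append */ INTO target_test
SELECT s.col1, s.col2, s.col3
FROM   source_test s
WHERE  s.load_date >= TRUNC(SYSDATE);
COMMIT;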

When I see massive parallel processing of a huge partitioned table, something like:

INSERT /*+ append parallel(t) */ INTO target_table t
SELECT /*+ parallel(s) */ …
FROM   source_partitioned_table s
WHERE  …;

…it resembles a Macedonian phalanx, probably the first known use of parallelism. Each phalangite carried his primary weapon, a sarissa: a double-pointed pike over 6 m (18 ft) in length. The weapons of the first five rows (partitions?) of men all projected beyond the front of the formation, so five men could fight at the same time. The parallel degree is 5! We can rewrite our pseudo-query like this:

INSERT /*+ append parallel(t 5) */ INTO target_table t
SELECT /*+ parallel(s 5) */ …
FROM   source_partitioned_table s
WHERE  …;

The Phalanx – A Powerhouse Parallel Strategy
Image from http://room162c.edublogs.org/2010/04/19/battle-of-thermopylae/

Besides the courage and the military and political talent of Alexander the Great, the Macedonian phalanx was one of the main reasons this young man conquered the known world. It was a very effective use of limited resources. Even though his troops were usually smaller than his enemy’s army, they could apply more force concurrently. They ran in parallel!

Napoleon on the Borodino field

I’m trying to improve performance for big aggregation ETL processes where we do a lot of I/O, joins and sorts. We have a lot of CPU and fast disks, and we have added a bunch of memory. At the moment nothing but our big query is running on the database instance.

But I can see that we’re not using this big memory. All the parameters say we should be able to use much more. Instead, we dump intermediate results to temporary segments and then read them back one or two blocks at a time. Why are we not allowed to use more of the resources of our superb database?
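
One way to check whether those spills to temporary segments are really happening (a hedged diagnostic sketch, assuming an Oracle-style instance where we can read v$sysstat) is to look at the work-area statistics; “onepass” and “multipass” executions are sorts and hash joins that did not fit in memory:

-- how many sort/hash work areas ran fully in memory vs. spilled to temp
SELECT name, value
FROM   v$sysstat
WHERE  name LIKE 'workarea executions%';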

Back to history!

1812. The French army (the best army in the world at that time), supported by many European troops and led by Emperor Napoleon (possibly the best commander of all time), invaded Russia. For several months the Russians avoided a decisive battle, fighting locally and exhausting the French army. Napoleon was thirsting for such a battle. He wanted to destroy the entire enemy force in one huge battle, as he had done many times before with his brilliant military talent. Finally, on September 7, 1812, the biggest and bloodiest single-day action up to that time took place.

About 250,000 troops were involved on both sides, with at least 70,000 casualties. Napoleon’s generals saw victory close at hand and thought that a final push would be enough to decide the fate of the Russian army and of the war. They begged Napoleon to deploy his final and best reserve, the Imperial Guard. But during the battle Napoleon had seen how hard the Russians fought and how dearly his progress had been bought, and he changed his mind.

“I will most definitely not; I do not want to have it blown up. I am certain of winning the battle without its intervention.”

He refused to commit the Guard, his final reserve, so far from France and from reliable logistical support. What if the Russians found fresh resources and attacked tomorrow?

In the end, both sides claimed victory. The French took an empty, burned Moscow, but soon left, without provisions, beaten from all sides by the Russians and by a bitter winter.

Everybody who knows something about history knows the final result. Of the half-million-strong army of French and other Europeans that invaded Russia, only about 22,000 crossed the border back.

But why am I telling this story now? Because Napoleon at Borodino reminds me of how our best-in-the-world database management system (one that was originally designed for OLTP) behaves when it comes to huge queries that need a lot of resources: a lot of I/O, joins, sorts, and so on. We have a lot of memory, but we do not use all of it, or even a big portion of it, automatically. We have to beg Napoleon – our database – to allocate it, using manual memory allocation, hidden parameters and other tricks. But our OLTP-based database says: I can’t allocate such huge resources to one battle (query, report, ETL process). What if other sessions need these resources a second later? “I was always successful with this OLTP approach in my previous campaigns, and I’m going to win this battle anyway, even without using my best reserves.”

Sound familiar? “I will most definitely not; I do not want to have it blown up. I am certain of winning the battle without its intervention.”
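
What that begging looks like in practice is roughly the following sketch (an Oracle example; the sizes are illustrative assumptions, not recommendations): switching the ETL session from automatic PGA management to manually sized work areas before running the big query.

-- switch this session from automatic PGA management to manual work areas
ALTER SESSION SET workarea_size_policy = MANUAL;
-- allow large in-memory sorts and hash joins (example sizes only)
ALTER SESSION SET sort_area_size = 1073741824;
ALTER SESSION SET hash_area_size = 1073741824;
-- the big aggregation query then runs in this session with the larger work areas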

Winning the guerilla war

To win the major battles we switch from our brilliant but OLTP-based database to one of the best MPP databases. Everything we were struggling with before (all the huge analytics queries, big ETL scans, joins and sorts) is flying now. But guess what: all the small queries, online reports, everything that selects and processes small, narrow data sets, now run longer. Even after tuning the SQL for MPP, the average small query runs much slower than before. Instead of doing just a few I/Os, it has to dispatch work to all segments, most likely do full table scans with physical reads, and then probably ship some data between segments (servers). So we pay a minimum MPP price for every operation.

Imagine you are a Roman general who has just won a big battle against Rome’s rival. The enemy’s army is destroyed, and the major cities are occupied by Roman forces. But small rebel units are still fighting for their homeland. They pick at your troops daily, in very small groups, and each time you respond by sending several centuries of heavy infantry to destroy each group of rebels. And each time, you win the individual battle.

But how much time and how many resources do you spend? When many bands attack simultaneously, you send out all your reserves, leaving the cities unprotected. At such moments the rebels can take those cities even with relatively small forces, and you can lose everything that was won with such enormous effort.

Wouldn’t it be better to have small, specialized light-cavalry units that can do the same job much faster and with fewer resources?

What happens to our MPP super-system when we need to run online web applications with hundreds or thousands of really small queries running concurrently? Won’t we quickly exhaust our I/O, CPU and memory? Probably we would need some data marts on OLTP-based databases for such applications.

No doubt the future of data warehouses is Big Data, and the future of ETL is big processing of Big Data. For Big Data processing we need Big Resources. But even when we have them, we need to utilize them right.

I hope this small article helps ETL and data warehouse developers visualize the ETL process and its bulk nature in the new era, and understand the importance of proper resource utilization a little better. I hope it helps you marshal your forces and win your own ETL battles in the future.

Agile Cake

By Bruce Szalwinski – Senior Data Architect – Cobalt Intelligence Center of Excellence

I was kicking back on the couch last Saturday afternoon watching Duff Goldman create another amazing “edible, structural model display piece,” otherwise known as cake.  His replica of Radio City Music Hall was lavishly decorated with lights, a Christmas tree and lots of glitter.  As usual, the cake had been constructed in a week and Duff was delivering the finished product to another satisfied consumer.  A repeatable process that results in satisfied customers?  Where had I heard of this before?  Had I found a way to combine my love of cake with my love of software development?  Could it be that I could finally explain to my neighbor what I do at work in terms that he would understand?  Fortunately for me, it was an Ace of Cakes marathon, so I had plenty of time to explore the relationship between cake and code.

At its heart, the agile process focuses on delivering the highest business value in the shortest amount of time.  Working software is delivered rapidly and repeatedly in increments as short as one week up to one month.  Small teams self-organize to determine the best way to deliver the highest priority features.  Can I justify watching TV all afternoon by finding these three pillars of agile development in a bakery?

On Mondays, the folks at Charm City Cakes gather to discuss the cakes that will be delivered that week.  In the world of agile development, this is a sprint planning meeting.  Mary Alice, acting in the role of the Product Owner, defines the features of each cake such as flavor and general size, sets the delivery date and negotiates a price that will result in profitability for the bakery.  The set of cakes that she presents to the decorators is analogous to the product backlog.  The decorators, serving in the role of the Scrum Team, pull cakes off the backlog that they then commit to complete that week.  They collaborate on how each cake will be constructed, what scaffolding will be required to support the cake and which team members are required for its construction.  These constant one-week intervals produce an even rhythm within the bakery, resulting in an amazing number of cakes being delivered.  It is interesting to note that nobody ever makes a cake that is scheduled for delivery weeks later.  Not only might the requirements change; above all, the cake would be stale.  This is something to remember when adding “extra” features during development.

As the episodes rolled on, I noticed the camera panning over a wall of clipboards.  It turned out that each clipboard represented one of the cakes to be delivered that week.  Each clipboard also contained the customer requirements and any associated artifacts needed to complete the cake.  This maps well to the agile philosophy of preferring working software over comprehensive documentation.  As each cake was completed, its associated clipboard was marked as done.  What a simple and elegant burn-down chart!  The progress for the week is on display for all in the bakery to see.  As folks finish their assigned cake, they can spot other opportunities to assist their fellow decorators so that the whole team is successful.

No agile team would be complete without a Scrum Master and Duff clearly serves that role.  From dealing with the fondant supplier to fixing the air conditioning system, Duff is constantly removing obstacles to ensure that the team is fully functional and productive.  As the owner of the business, he sets the values and practices that are carried throughout the rest of the bakery.  He is constantly encouraging his staff to expand into new roles.  In the “Hard Knock Cake” episode, he models pair programming when he states that “Catherine can pipe anything, it’s time to teach her the guts of cake making”.  He then proceeds to pair her with Ben to expand her skill set.

Having completed the marathon, I was pleased to find such a strong relationship between good cake and good code and look forward to sharing some of these ideas with my scrum team.  The folks at Charm City Cakes clearly embody the Agile Manifesto by valuing individuals, collaboration and of course working cake.  As Duff says in the tag line for the show, “It’s all about the cake!”
