Debugging MapReduce tasks
It took me a while to get my head around how Kettle integrates with MapReduce tasks. When I did, the first thing I noticed is how hard it is to know what's actually happening inside them. Until Matt Casters and friends get the chance to implement PDI-9148, we need to do things manually - as in inspecting logs, etc.
My first approach was writing to text files. I tested direct output to HDFS, but for some reason it didn't work. Writing to the local file system instead means the output ends up spread across all the cluster nodes, since each task writes on whatever node it runs on. This approach generally sucks.
An improved Write To Log step
If it's not there, just do it yourself - the code is open. So I did: I added the ability to specify a limit on the number of rows the step writes to the log. This is very useful for inspecting what the dataset looks like inside a map or reduce task. Once I deployed this change to my cluster, this is how my tasktracker log looks (I ran this with the previous Write To Log version and ended up with a crashed browser and almost half a gigabyte of log files). It shows the first 5 lines of our dataset, with the key and value of each row:
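The core of the change is just a row counter checked before each write. Here's a minimal standalone sketch of that limit logic; the class and method names are illustrative, not the actual Kettle Write To Log step API:

```java
// Sketch of a row-limited logger, the idea behind the Write To Log change.
// Names are hypothetical - the real step plugs into Kettle's processRow().
public class LimitedLogger {
    private final int limit;   // max rows to log; 0 means unlimited
    private long logged = 0;

    public LimitedLogger(int limit) {
        this.limit = limit;
    }

    // Writes the row to the log unless the limit is reached.
    // Returns true if the row was actually logged.
    public boolean logRow(String key, String value) {
        if (limit > 0 && logged >= limit) {
            return false;  // past the limit: stay silent, keep processing rows
        }
        logged++;
        System.out.println("key=" + key + " value=" + value);
        return true;
    }

    public static void main(String[] args) {
        LimitedLogger log = new LimitedLogger(5);
        for (int i = 0; i < 1000; i++) {
            // only the first 5 rows reach the log; the rest are skipped
            log.logRow("k" + i, "v" + i);
        }
    }
}
```

The important design point is that rows past the limit are still passed downstream untouched; only the logging stops, so a big dataset no longer drowns the tasktracker logs.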
I'll work with the Kettle team to get this into the main code line; hopefully it will make it into 4.4.1 and 5.0. This is tracked as PDI-9195.