Wednesday, January 23, 2013

Debugging kettle tasks in MapReduce - Sane WriteToLog

I finally started to play with big data and Pentaho; in my specific case, on Cloudera CDH3u4. At Mozilla we have a few clusters of over 80 machines that we're using to back up a bunch of services.

Debugging mapreduce tasks


It took me a while to get my head around how Kettle integrates with MapReduce tasks. When I did, the first thing I noticed was how hard it is to know what's happening inside them. Until Matt Casters and friends get the chance to implement PDI-9148, we need to do things manually - as in inspecting logs, etc.

My first approach was writing to text files. I tested direct output to HDFS, but for some reason it didn't work. Writing to the local file system instead means that the output gets spread across all the cluster nodes. This approach generally sucks.

I also thought about using some hand-made logic in a JavaScript step, but then looked at the WriteToLog step. This step generally works, but it has one great flaw: there's no way to limit its output. If we have millions of rows, we'll end up with a huge log - and that's not good.


An improved Write To Log step


If it's not there, just do it yourself; the code is open. So I did. I added the ability to specify a limit on the output of the step. This is very useful for inspecting what the dataset looks like inside a map or reduce task. Once I deployed this change to my cluster, the tasktracker log shows just the first 5 lines of our dataset, with the key and value of each row (I ran this with a previous WriteToLog version and ended up with a crashed browser and almost half a gigabyte of log files).


I'll work with the Kettle team to get this into the main code line; hopefully it will be in 4.4.1 and 5.0. This is PDI-9195.
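The idea behind the limit is simple enough to sketch. The snippet below is not the actual WriteToLog patch (the real step lives in Kettle's Java code base); it's a plain JavaScript illustration, and the names makeLimitedLogger, limit and sink are made up for this example.

```javascript
// Hypothetical helper: wraps a logging function so that only the first
// `limit` rows are written; rows after that are counted but not logged.
function makeLimitedLogger(limit, sink) {
  var seen = 0;
  return {
    log: function (row) {
      seen += 1;
      if (seen <= limit) {
        sink(row.join(" | "));
      }
    },
    // Number of rows that were not written because of the limit.
    skipped: function () {
      return Math.max(0, seen - limit);
    }
  };
}

// Usage: feed 1000 fake key/value rows, keep only the first 5 in the log.
var lines = [];
var logger = makeLimitedLogger(5, function (line) { lines.push(line); });
for (var i = 0; i < 1000; i++) {
  logger.log(["key" + i, "value" + i]);
}
console.log(lines.length, logger.skipped());
```

With millions of rows the log stays at a handful of lines, which is all you need to check what the dataset looks like.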



Tuesday, January 8, 2013

CDE / CDF components and templates in Pentaho solution directory

Someone brought to my attention that this is a very useful, though undocumented, feature. TL;DR: you can keep templates and components in the solution directory instead of the plugin directory, which gets wiped on upgrades.


It's a common requirement to develop templates or components that extend the capabilities of the Ctools. The usual place to put them is the solution/system/plugin directory, alongside the others.

However, that has a huge inconvenience - whenever we upgrade / reinstall the plugin we need to copy the resources back, not forgetting to do a backup before.

That hasn't actually been needed for a while now. If we put the resources directly under the pentaho-solution// all the Ctools will know what to do.

Here's a real life example that can serve as reference:


cde
├── components
│   ├── EmailPrpt
│   │   ├── component.xml
│   │   ├── emailPrpt-implementation.js
│   │   └── emailPrpt.xaction
│   └── VideoGallery
│       ├── ceebox-implementation.js
│       ├── component.xml
│       ├── css
│       │   └── ceebox.css
│       ├── images
│       │   ├── cee-close-btn.png
│       │   ├── cee-next-btn.gif
│       │   ├── cee-next-btn.png
│       │   ├── cee-prev-btn.gif
│       │   ├── cee-prev-btn.png
│       │   └── loader.gif
│       ├── jquery.ceebox.js
│       └── jquery.swfobject.js
├── styles
│   └── Clean.html
├── templates
│   └── index.xml
└── widgets
    ├── IncomeStatementDetailTable.cdfde
    ├── IncomeStatementDetailTable.component.xml
    ├── IncomeStatementDetailTable.wcdf
    ├── index.xml
    ├── sample.cda
    ├── sample.cdfde
    └── sample.wcdf
cdf
├── components
│   ├── jfreechart-cda.xaction
│   └── traffic.xaction
├── includes
│   ├── index.xml
│   └── Operations
│       ├── facilityAccount.cda
│       ├── facilityAccount.cdfde
│       ├── facilityAccount.wcdf
│       └── index.xml
└── index.xml

Friday, January 4, 2013

CDF Async Support

Introduction

This is a huge change! Since the beginning, CDF has behaved in a synchronous way. The problem is that, with several components on a dashboard, performance suffers. This change represents a major overhaul of the main CDF code in order to change that. Now, if a priority is specified, components that share it are executed simultaneously, speeding up the rendering of a dashboard.

We tried to maintain backward compatibility. Since the new async behavior requires a new - and simpler - way of defining component interaction, by default old dashboards will still render in a "fake synchronous" mode, by applying a specific heuristic where sequential sets of priorities are assigned to components, emulating old behavior.

This blog post (whose contents are also available in CDF's documentation) is a guide to converting old components and dashboards to the new async style, and developing new ones based on asynchronous querying.

This is currently in dev and will soon make its way into stable releases. CDE support is obviously included.

Rationale

The first step to understanding the changes in the async patch is understanding the CDF component lifecycle. When a component is updated, the basic update lifecycle looks like this:
preExecution -> update -> postExecution
 
Usually, though, there will be a call to a data source, with a subsequent call to postFetch, and only then is the component rendered:
preExecution -> update -> query -> postFetch -> redraw -> postExecution
 
This is a more typical lifecycle, and one that has some important limitations. First, preExecution and postExecution are entirely the responsibility of CDF itself, rather than the component. Because CDF has no control over the contents of the update method, it has no way of ensuring that, should the component execute an asynchronous query, postExecution only runs after redraw. In this case, you're likely to see this instead:
preExecution -> update -> postExecution -> query -> postFetch -> redraw
 
This breaks the contract that postExecution runs after the component is done updating. The solution is for the component itself to take control of postExecution, while keeping the burden of implementing the lifecycle in CDF rather than passing it to the component developer. On a related topic, postFetch has become a de facto standard part of the lifecycle, yet its implementation was left to the component implementers, which leads to a fairly large amount of boilerplate code.
Our objective here was to retool the base component so as to deal with both of these issues, thus allowing queries to be performed asynchronously while reducing the developer effort involved in creating a component.
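The ordering problem is easy to reproduce outside CDF. The sketch below is a toy model, not CDF code: pending, fakeQuery, managedRun and unmanagedRun are invented names, and a small queue stands in for the browser's event loop so the example runs deterministically.

```javascript
// Minimal stand-in for an event loop: async work is queued and run later.
var pending = [];
function runPending() {
  while (pending.length > 0) {
    pending.shift()();
  }
}

// A fake asynchronous query: nothing happens until the queue is flushed.
function fakeQuery(trace, callback) {
  pending.push(function () {
    trace.push("query");
    callback([["1", "Lisbon"]]);
  });
}

// Old style: the framework runs postExecution as soon as update() returns.
function managedRun(trace) {
  trace.push("preExecution");
  trace.push("update");
  fakeQuery(trace, function () {
    trace.push("postFetch");
    trace.push("redraw");
  });
  trace.push("postExecution");
  runPending();
  return trace;
}

// New style: postExecution is triggered from inside the query callback.
function unmanagedRun(trace) {
  trace.push("preExecution");
  trace.push("update");
  fakeQuery(trace, function () {
    trace.push("postFetch");
    trace.push("redraw");
    trace.push("postExecution");
  });
  runPending();
  return trace;
}

var oldTrace = managedRun([]).join(" -> ");
var newTrace = unmanagedRun([]).join(" -> ");
console.log(oldTrace);
console.log(newTrace);
```

The first trace reproduces the broken order shown above; the second restores postExecution to the end of the chain, which is the effect the retooled base component is after.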

Component execution order and Priority

There are no major changes in the way components behave. There is, however, an important caveat - since all components (that have been converted) will be executed simultaneously, we can no longer rely on the order of execution.
There's now an additional property named priority, which sets the priority of component execution and defaults to 5. The lower the number, the higher the priority of the component. Components with the same priority will be executed simultaneously. This is useful where we need to give higher priority to filters or other components that must be executed before other components.
This way there's no longer the need to use dummy parameters and postChange tricks to do, for instance, cascade prompts.
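To make the ordering rule concrete, here's a small sketch of how components could be grouped into batches by priority. It only illustrates the rule as described here; CDF's internal scheduler isn't shown in this post, and groupByPriority and the component names are made up.

```javascript
// Groups components by priority (defaulting to 5) and returns the groups
// in execution order: lowest number first. Components that end up in the
// same group are the ones that would be fired together.
function groupByPriority(components) {
  var groups = {};
  components.forEach(function (c) {
    var p = typeof c.priority === "number" ? c.priority : 5;
    (groups[p] = groups[p] || []).push(c.name);
  });
  return Object.keys(groups)
    .map(Number)
    .sort(function (a, b) { return a - b; })
    .map(function (p) { return { priority: p, names: groups[p] }; });
}

// A cascade prompt: region first, then city, then everything else.
var batches = groupByPriority([
  { name: "salesChart" },
  { name: "regionSelector", priority: 1 },
  { name: "salesTable" },
  { name: "citySelector", priority: 2 }
]);
console.log(JSON.stringify(batches));
```

The two selectors land in their own batches ahead of the chart and the table, which share the default priority of 5.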

Backward compatibility and changes

We made a big effort to maintain backward compatibility, but some care has to be taken. If components have no priority, we assign them sequential values, trying to emulate the old behavior. It's recommended that proper priorities are set in order to take advantage of the new improvements.
If using CDE, please note that if you edit a dashboard and save it, all components will have a default priority of 5. This may break the old behavior. If you need to change a dashboard, make sure you tweak the priorities, if needed.
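As a rough picture of why re-saving in CDE can change behavior, consider the sketch below. The exact heuristic CDF uses isn't spelled out in this post, so assignSequentialPriorities is an assumption made purely for illustration: it gives position-based priorities to components that have none and leaves explicit priorities alone.

```javascript
// Assumed heuristic, for illustration only: components that lack a numeric
// priority get one based on their position, so they run one after another.
function assignSequentialPriorities(components) {
  return components.map(function (component, index) {
    var copy = {};
    Object.keys(component).forEach(function (key) {
      copy[key] = component[key];
    });
    if (typeof copy.priority !== "number") {
      copy.priority = index + 1;
    }
    return copy;
  });
}

function priorities(list) {
  return list.map(function (c) { return c.priority; });
}

// An old dashboard: no priorities, so execution stays sequential.
var legacyPriorities = priorities(assignSequentialPriorities([
  { name: "regionSelector" },
  { name: "citySelector" },
  { name: "salesChart" }
]));

// The same dashboard after being re-saved in CDE: everything is at 5.
var resavedPriorities = priorities(assignSequentialPriorities([
  { name: "regionSelector", priority: 5 },
  { name: "citySelector", priority: 5 },
  { name: "salesChart", priority: 5 }
]));
console.log(legacyPriorities, resavedPriorities);
```

In the second case all three components share a priority and would run together, so any dependency between the selectors and the chart has to be expressed by tweaking the priorities by hand.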

Developing Components

Components desiring to use asynchronous queries should inherit from the new UnmanagedComponent, instead of BaseComponent. The UnmanagedComponent base class provides pre-composed methods that implement the core lifecycle, for a variety of different scenarios:
  • synchronous implements a synchronous lifecycle identical to the core CDF lifecycle.
  • triggerQuery implements a simple interface to a lifecycle built around Query objects.
  • triggerAjax implements a simple interface to a lifecycle built around AJAX calls.
Since all these lifecycle methods expect a callback that handles the actual component rendering, it's conventional style to have that callback as a method of the Component, called redraw. It's also considered standard practice to use Function#bind or _.bind to ensure that, inside the redraw callback, this points to the component itself.
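Here's a quick illustration of why the bind matters. It uses plain Function#bind and a made-up invokeLater helper standing in for any code that calls the callback without a receiver; nothing here is CDF API.

```javascript
var component = {
  htmlObject: "sampleObject",
  redraw: function (data) {
    // `this` has to be the component for this lookup to work.
    return this.htmlObject + ":" + data.length;
  }
};

// Stand-in for code that receives a callback and invokes it on its own,
// without knowing which object the function came from.
function invokeLater(callback) {
  return callback([1, 2, 3]);
}

// Unbound: `this` is no longer the component, so the lookup fails
// (it yields undefined in sloppy mode and throws in strict mode).
var unboundResult;
try {
  unboundResult = invokeLater(component.redraw);
} catch (e) {
  unboundResult = "error";
}

// Bound: `this` is pinned to the component, whoever makes the call.
var boundResult = invokeLater(component.redraw.bind(component));
console.log(boundResult);
```

_.bind(this.redraw, this) does the same job in browsers that lack a native Function#bind.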

Use synchronous if Your Component Doesn't Use External Data

Components that don't use any external data at all can continue subclassing BaseComponent without any change of functionality. However, for the sake of consistency (or because you want querying to be optional -- see the section on alternating between static and query-based data for details), you can subclass UnmanagedComponent and use the synchronous lifecycle method to emulate BaseComponent's behaviour:
update: function() {
  this.synchronous(this.redraw);
}
 
If you want to pass parameters to redraw, you can pass them as an array to synchronous:

update: function() {
  /* Will call this.redraw(1,2,3) */
  this.synchronous(this.redraw, [1,2,3]);
} 
 
 

Use triggerQuery when You Want Your Component To Use CDA/Query Objects

If you're using a CDA data source, you probably want to use triggerQuery to handle the component lifecycle for you. triggerQuery expects at a minimum a query definition and a redraw callback to process the query results. The query definition is an object of the form:

{
  dataAccessId: 'myQuery',
  file: '/path/to/my/datasourceDefinition.cda'
}
 
Typically, if you're using CDE, these properties will be added to either this.queryDefinition or this.chartDefinition, so you can just use this pattern:

update: function() {
 var redraw = _.bind(this.redraw,this);
 this.triggerQuery(this.queryDefinition, redraw);
}
 

Alternating Between Static And Query-Based Data

As the lifecycle methods are completely self-contained, you can switch between them at will, deciding on an appropriate lifecycle at runtime. A common pattern (used e.g. in SelectComponent, and the CccComponent family) is exposing a valuesArray property, and using static data if valuesArray is provided, or a query if it is not. Using UnmanagedComponent, this convention would look like this:

update: function() {
 var redraw = _.bind(this.redraw,this);
 if(this.valuesArray && this.valuesArray.length > 0) {
  this.synchronous(redraw,this.valuesArray);
 } else {
  this.triggerQuery(this.queryDefinition,redraw);
 }
}
 

Rolling Your Own

If you prefer having absolute control over your component, you can eschew the use of any of the lifecycle methods. Instead, you're expected to follow these guidelines:
  • Call this.preExec() as soon as possible, and bail out if it returns false.
  • If this.preExec() returned true, call this.block() before any meaningful amount of work is done.
  • If you called this.block(), make sure to always call this.unblock() as well once all relevant work is done.
  • If you want to use any sort of AJAX, consider using triggerAjax().
  • Call this.postExec() once all processing is done.
  • You can override this.block and this.unblock to implement component specific UI blocking. If you override either, you must override the other as well.
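The guidelines above can be sketched as a hand-written update method. To keep the example runnable on its own, the lifecycle hooks are replaced by stubs that only record the calls; in a real dashboard preExec, block, unblock and postExec come from CDF, and makeStubComponent is a made-up name. Calling postExec before unblock is one possible arrangement, not something this list prescribes.

```javascript
// Stub component: the hooks just record that they were called.
function makeStubComponent(preExecResult) {
  var calls = [];
  return {
    calls: calls,
    preExec: function () { calls.push("preExec"); return preExecResult; },
    block: function () { calls.push("block"); },
    unblock: function () { calls.push("unblock"); },
    postExec: function () { calls.push("postExec"); },
    redraw: function () { calls.push("redraw"); },

    // Hand-rolled lifecycle following the guidelines.
    update: function () {
      if (!this.preExec()) {
        return;              // bail out as early as possible
      }
      this.block();
      try {
        this.redraw();       // the component's actual work
        this.postExec();     // only after all processing is done
      } finally {
        this.unblock();      // always paired with block(), even on errors
      }
    }
  };
}

var ran = makeStubComponent(true);
ran.update();
var vetoed = makeStubComponent(false);
vetoed.update();
console.log(ran.calls.join(","), "/", vetoed.calls.join(","));
```

The try/finally is what guarantees the block/unblock pairing: if redraw throws, the UI still gets unblocked.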

New and Changed Features

Component Cloning

If your component holds any references to other components, you need to override the clone method so as to ensure that you don't accidentally clone the target component. For example, if your component has a property named otherComponent pointing at another component, you should override clone using this general template:

clone: function(parameterRemap,componentRemap,htmlRemap) {
 var other = this.otherComponent;
 delete this.otherComponent;
 var that = this.base(parameterRemap,componentRemap,htmlRemap);
 this.otherComponent = that.otherComponent = other;
 return that;
}
 

New Base Component Class: UnmanagedComponent

UnmanagedComponent is a new base class for components. It provides the base on which all asynchronous components should be built.

Per-Component isManaged Flag

Each component should have a member property named isManaged, indicating whether CDF should manage the component's lifecycle. Components where isManaged is false need to implement their own calls to the lifecycle.

Component Stub

Here's an example of a stub to be used whenever you need to create a new component. Just override the redraw function with what you need and define the component as you'd usually do.

 
ExampleComponent = UnmanagedComponent.extend({

  update: function() {
    var redraw = _.bind(this.redraw,this);
    if(this.valuesArray && this.valuesArray.length > 0) {
      this.synchronous(redraw,{resultset: this.valuesArray});
    } else {
      this.triggerQuery(this.queryDefinition,redraw);
    }
  },

  redraw: function(data){

    /* Specific code goes here */
    if(!this.isInitialized){
      this.compiledStr = Mustache.compile("Got a result set with {{nrRows}} rows and {{nrCols}} columns");
      this.isInitialized = true;
    }
    $("#"+this.htmlObject).html(this.compiledStr({
      nrRows: data.resultset.length||0,
      nrCols: data.resultset[0]?data.resultset[0].length || 0:0
    }));

  }

});

customComponent = 
  {
  name: "regionSelector",
  type: "example",
  parameters:[],
  valuesArray:[["1","Lisbon"],["2","Dusseldorf"]],
  priority: 5,
  htmlObject: "sampleObject",
  executeAtStart: true
};
 

Debugging lifecycle

To make it easier to track the lifecycle of CDF, we added some extra debugging features. If you use Firefox's Firebug or a recent version of Chrome (I believe >= 24) you'll be able to track the execution in a nice way.