Friday, May 25, 2012

CDV - Request For Comments

Community Data Validation 

Motivations

We need a way to do data validation

Use cases

Below are some of the use cases we want to tackle. Emphasized are the ones we think the current spec satisfies
  • Global
    • Is the server running?
    • Is the server running properly?
  • Connectivity
    • Do we have all the access we should? (network / database)
  • Query specific
    • Do we have up to date data?
    • Can we trust the data?
    • How long did the queries take to run?
    • Do we have wrong data? (duplicated users in community)
    • Do we have a big number of 'unknowns'? (tk=1 in DW)
    • Do we have peaks or valleys in the data? (due to double process or no process)
    • Is the data stalled? (eg: number of twitter followers not updating)
    • Did the data format change
    • We need a way to handle known effects (eg: Christmas dip)
    • We need to correlate independent datasources (eg: comparing AUS with Blocklist evolution)
    • We need to connect the long running queries from CDA to CDV
    • Be able to validate big chunks of reprocessing
    • Do we have clearly wrong lines in resultset? (eg: a line there)
  • Dashboards
    • Are the dashboards rendering properly?
      • Do we have all the components?
      • Any layout change?
      • Any js errors?
    • Are the dashboards performing properly?
    • Can CDF talk with CDV to report client-side errors?
    • Alternatively, can CDA talk with CDV to report query errors?
      • Who caused the error?
  • ETL
    • Did the etl run?
    • Are we processing the expected amount of data?
    • Is the etl taking the expected time to run?
    • Did the etl finish before X am?
    • Test etl against tracer bullets?

Work flow

We expect from this system:
  • A central dashboard that allows us to quickly glimpse the overall status of our system.
    • Did all tests pass?
    • Which one failed?
    • Why?
    • When was the last time the tests ran
  • We need simple ways to define the tests (based on existing CDAs)
  • We need to which queries failed and which queries took long time to run
  • We need push notification system by email
  • We need to make sure it can talk to nagios
  • We need an outside test to check if server is up

Logging types

Every test will result in the following levels:
  • Critical
  • Error
  • Warn
  • Ok
Each specific test will be responsible for converting the output of that test (validation function for cda, tbd for kettle) into that status. The object format is:
{
 level: "Critical",
 type: "Missing data",
 description: "Whatever string the user defined" 
}
On each test definition, we need to be able to optionally set a timing threshold for the queries, and that will automatically generate a log with Type 'Duration'

Test types

There are 4 possible types of tests:
  • CDA based query validation
  • ETL monitoring
  • Datawarehouse validation (a specific set of the cda based query validation)
  • Dashboard validation (we may opt to leave this one out for now as we'll try to infer the errors from CDA's 405)

CDA based query

Workflow

We want to select one or more cda / dataAccessId from our system, define the input parameters and select the type of validation we need.
The shape of the function will be: f( [ query, [params] ], validationFunction )
The generic test will be the implementation of the validation function:
validationFunction = function ( [ {metadata: [] , resultset: [[]]} ] ) :  value
That will be freely mapped to the log outputs

ETL monitoring query

The workflow defined here has to match with the previous section. We'll build specific CDA queries that will read the kettle log files. From that point on, specific validations will have to be built for this logs
We'll need, in pentaho, to define which connection refers to the kettle logging tables. Either by defining a special jndi or specifying in the settings.
We'll need to test for:
  • Time
  • Start /end time
  • Amount of data processed

Datawarehouse schema validation

There are some specific tests we can do on the sanity of a datawarehouse.
  • Coherent amount of data on a daily / hourly basis
  • Test the same as before with specific breakdowns
  • Test for the amount of 'unknowns' on dimensions

Invocation and Scheduling

There are 2 ways to call the validations:
  • By url request
  • Scheduled calls
Url will be based on the id / query name (tbd). The schedule calls will be cron based, with the following presets:
  • Every hour
  • Every day
  • Every week
  • Every month
  • Custom cron

User interface

This are the features in the main user interface (this is the ultimate goal, the implementation may be broken into stages):
  • See existing validations
    • Allow to fire a specific validation
    • Get the url of a specific validation / all validations
  • Create / Edit validation
    • Define query name
    • Define queries and parameters
    • Define validation function
    • Chose log alerts (when to throw error / severe / warn / ok)
    • Choose duration thresholds
    • Define error message
    • Define cron
  • Validation status dashboard
  • CDA Query error dashboard (Should this belong to CDA instead?)
    • Query and parameters
    • Error
    • Incidents
  • Duration dashboard to identify slow points in the system
    • Query and parameters
    • Duration
    • Incidents

Technical approach

All the specific information will be stored in solution/cdv/queries/). The files will have the format _queryName.cdv and will internally be a JSON file with the following structure:
{
  type: "query",
  name: "validationName",
  group: "MyGrouping" 
  validation: [ 
    { cdaFile: "/solution/cda/test.cda", dataAccessId: "1" , parameters: [...] },
    { cdaFile: "/solution/cda/test2.cda", dataAccessId: "2" , parameters: [...] }
  ],
  validationType: "custom",
  validationFunction: "function(resultArray,arguments){ return 123 }",
  alerts: {
     /* This functions will be executed from bottom up. As the functions return true, the next one
        will be executed and the last matching level will be thrown.
        The exception to this rule is the optional okAlert(v) function. If this one returns true, no other calls will be made 
     */
     criticalAlert: "function(v){ return v > 10 }",
     errorAlert: undefined,
     warnAlert:  "function(v){ return v > 5 }",
     okAlert: "function(v){return v<3;}",
     alertType: "MissingData",
     alertMessage: "function(level,v){return 'My error message: ' + v)"  /* this can either be a function or a string */
  },
  executionTimeValidation: {
      expected: 5000,
      warnPercentage: 0.30,
      errorPercentage: 0.70,
      errorOnLow: true
  },

  cron: "0 2 * * ? *" 
}

Preset validations

We won't need to manually define all kinds of validations. CDV will support a preset that can also be extended by adding the definitions to solution/cdv/validationFunctions/ . The template for one such Javascript file looks like this:
wd.cdv.validation.register({
  name: "Existence",
  validationArguments: [
    {name: "testAll", type:"boolean", default: true}
  ],

  validationFunction: function(rs, conf) {
    var exists = !!conf.testAll;

    return rs.map(function(r){return r.length > 0}).reduce(function(prev, curr){
      return conf.testAll ? (curr && prev) : (curr || prev);
    });
  },

  alertArguments: {
    {name: "failOnExistence" type: "boolean", default: true},
    {name: "failureLevel", type: "alarmLevel", default: "ERROR"},
    {name: "failureMessage", type: "String", default: "Failed Existence Test: ${result}"}
  },

  alertMapper: function(result, conf) {
    var success = conf.failOnExistence && result,
        level = success ? "OK", conf.failureLevel,
        message = success ? conf.successMessage : conf.failureMessage; 
    return Alarm(level, message, result);
  }
});
The wd.cdb.validation API is defined in the Validation Module.
There are 5 objects there that we need to analyze:
  • validationFunction(rs, conf) - This is the validation function that will be executed after the query runs
  • validationArguments - Definition of the arguments that will be used within the validation function
  • alertArguments - Definition of the arguments that will be sent to the alertMapper
  • alertMapper(result, conf) - Mapping between the validation result and the alerts

Preset validations or custom validations

When we define a query, we can chose which validation function to use and pass the parameters that specific validation requires.
Alternatively, we can use a custom validation function. That validation function has the following format, where all we need is to return the alarm level (this is a spec, may change after implementation)
function(rs, conf) {

    var exists = rs.map(function(r){return r.length > 0}).reduce(function(prev, curr){
      return conf.testAll ? (curr && prev) : (curr || prev);
    });

    return exists ? Alarm.ERROR : Alarm.OK;
}

CDA integration

We need a tight integration between CDA and CDV to report:
  • Errors in CDA queries
  • Long running CDA queries
  • Queries with obvious errors in the structure (eg: missing lines)
It will obviously need to take into account the fact that CDV may not be installed and can't have performance impacts in CDA

External interfaces

We can have several external interfaces supported:
  • Email
  • Http
  • Nagios integration
  • Server up check
The last one is a very specific check. All the other integrations will fail if suddenly the server hangs, and we must be notified of that. On http and nagios integration, we'll be able to get reports on the individual tests and also on the test groups. This will not rerun the tests but get the report on the last status of a test.
On the http case, we can pass a flat to force a test to be rerun.
For nagios, we can have an export of test rules

Settings

We'll be able to define the group rules, mainly for connectivity reasons. So the settings (that later can be converted to an UI), will look like this:
  • Groups
    • Email
      • Threshold
      • Destinations
    • SMS
      • Threshold
      • Destinations
Sms is in here by example but not planned to be supported

Tuesday, May 22, 2012

CDC - Community Distributed Cache

CDC - Community Distributed Cache

One more member for the CTools - and this one is big. Available through the ctools-installer using the -b dev flag, and requires Pentaho 4.5.

About

CDC stands for Community Distributed Cache and allows for high-performance, scalable and distributed memory clustering cache based on Hazelcast for both CDA and Mondrian.
CDC is a pentaho plugin that provides the following features:
  • CDA distributed cache support
  • Mondrian distributed cache support
  • Ability to switch between default and CDC cache for cda and mondrian
  • Gracefully handles adding / removing new cache nodes
  • Allows to selectively clear cache of specific CDE dashboards
  • Allows to selectively clear cache of specific schemas / cubes / dimensions of mondrian cubes
  • Provides an API to clean the cache from the outside (eg: after running etl)
  • Provides a view over cluster status
  • Supports multiple pentaho servers using the same cluster (eg: stage and production)
  • Supports several memory configuration options

Motivation

Performance is a key point not only in business intelligence softwares but generally in any user interface. The goal of CDC is to give a Pentaho implementation based on Mondrian / CDA a distributed caching layer that can prevent as much as possible the database to be hit.
One added functionality is the ability to clear the cache of only specific mondrian cubes. Even though Mondrian has a very complete api to control the member's cache, Pentaho only exposes a clean all functionality that ends up being very limited in production environments.
The cache being able to survive server restarts is a design bonus, and supported by CDA out of the box. It will be supported by Mondrian as soon as MONDRIAN-1107 is fixed.

Requirements

  • Mondrian 3.4 or newer (in Pentaho 4.5)
  • CDA 12.05.15

Usage

It's very simple to configure CDC.
  • Install CDC using either the installer (soon to be available) or ctools-installer. If you do a manual install, be sure to copy the contents of solution/system/cdc/pentaho/lib to server's WEB-INF/lib
  • Download the standalone cache node
  • Execute the standalone cache node in the same machine as pentaho or in the same internal network (launch-hazelcast.sh), optionally editing the file and changing the memory settings (defaults to 1Gb, increase at will). You can launch as many nodes as you want.
  • Launch pentaho and click on the CDC button:
CDC main screen
  • Enable cache usage on CDA and Mondrian
  • Restart pentaho server
  • Check if the settings screen are satisfactory. Usually the defaults work fine.
CDC settings
Open analyzer, jpivot or a CDE dashboard that uses CDA and you should see the cache being populated

Cluster info

Hazelcast has a very good Management Center, so it's outside the scope of CDC to reimplement that kind of features. However, we do support a simple cluster information dashboard gives an overview of the state of the nodes.
CDC cluster info
Note about lite nodes: Pentaho server is itself a cache node. However, it's configured in such a way that doesn't hold data, thus the term lite node

Clean cache

With CDC you can selectively control the contents of the cache, allowing you to clean either specific dashboards or cubes. The business case around this is simple: We need to clear the cache after new data is available (usually as a result of a etl job). CDC allows not only to do that but also to do it from within the etl process.

CDA

CDA cache clean
CDC offers a solution navigator so that we can select a dashboard. When we select that dashboard, all the CDA queries used by that dashboard will be cleaned.
Clicking on the URL button we'll get a url that we can call externally (from an etl job). Be aware that you need to add the user credentials when calling from the outside (eg: &userid=joe&password=password)

Mondrian

Mondrian cache clean
This one is very similar to the previous one, but navigates through the available cubes. One can then either clean the entire schema, a specific cube or even the individual cell cache for a specific dimension (use this latest one with care).

Issues, bugs and feature requests

In order to report bugs, issues or feature requests, please use the Webdetails CDC Project Page

License

CDC is licensed under the MPLv2 license.

Friday, May 18, 2012

New CTools releases: 12.05.15

New release train:


CDA:
  • Fixed [REDMINE CDA038] - CDA does not accurately determine metadata of variable column kettle output
  • Fixed [REDMINE CDA269] - cache lookups can return incorrect data with kettle data sources
  • Fixed [REDMINE CDA577] - ETL metadata injection results in no data
  • Fixed [REDMINE CDA612] - CDA hazelcast cache too prone to fatal failures
  • Fixed [REDMINE CDA613] - Parameter separator should be configured per-parameter instead of in cda.properties
  • Stopped using values in listeners to prevent memory problems (req hazelcast 2)
  • Prevent cache locks on hazelcast failure
  • Timeouts configurable; maps always fetched to prevent issues with original instance being shutdown; stats accessed synch due to unidentified issue
  • KettleDataAccess: extraCacheKey now String
  • Separator as a per-parameter attribute, serialization changed
  • Typos, imports, extraCacheInfo toString
  • Building against pho-3.8-SNAPSHOT, hazelcast 2.0.2
  • Cache bean: cast exception can happen only on return if used inside another plugin

CDF:
  • Added support on tables for urlTemplate, for backward compatibility
  • DashboardContext: get roles via SecurityParameterProvider
  • Can include session attributes in Dashboards.context according to xml
  • [PATCH][FEAT] Experimental feature that allows to hide overcrowded category labels in a categorical axis. 
  • [PATCH][FIX]  Bar valuesAnchor='top' now works.
  • Handle cases where hash doesn't change so browsers won't jump to top of page
  • Changed duplicate component separator to underscore.
  • Add warning for when CDA isn't detected in auto-includes
  • Added autoinclude documentation
  • Solved [Redmine CCC574] - third bar is not removed from barline when having 2 series PATCH FIX Error occurring when hiding 
  • Second axis series - now it does not try to hide at all, but does not throw also.

CDE:
  • Added SiteMap component
  • Added SiteMap component
  • NewMapComponent was using cdw files for rendering cgg markers. Switched to direct js support since CGG no longer supports CDW
  • Added target container property to duplicate component.
  • CDA cache clean folders and dashboards show the names they have in PUC
  • Added getResponse() to ContentGenerator
  • cleanup: unused imports, warnings, some frequent strings
  • first step at centralization/normalization of repository access
  • Solved bug in ie8
  • SyncronizedCdfStructure was hiding stack trace; XmlStructure warnings 
  • ext-editor: +listener, optional button
  • SiteMap passes li to action function.
 CGG:
  • Disabled CDW support due to security considerations. Was not in use anyway.
  • Updated pvc-d1.0.js file to keep in sync with CDF


Get them through http://ctools.webdetails.org or the ctools installer

Tuesday, May 8, 2012

CDG - Community Datawarehouse Generator



CDG - Community Data Generator

About

CDG is a datawarehouse generator and the newest member of the Ctools family. Given the definition of dimensions that we want, CDG will randomize data within certain parameters and output 3 different things:
  • Database and table ddl for the fact table
  • A file with inserts for the fact table
  • Mondrian schema file to be used within pentaho
While most of the documentation mentions the usage within the scope of Pentaho there's absolutely nothing that prevents the resulting database to be used in different contexts.

Motivation

Several times, webdetails had to prepare dummy data to feed dashboards / demonstrations. The traditional approach is always one of the following:
  1. Use data from steel-wheels (the pentaho sample datawarehouse)
  2. Build specific transformation to return static pieces of data
Both have severe disadvantages. With number one, we found out that not only the datawarehouse has severe inconsistencies, worst of all the end customers most of the time can't transpose to their business terms like trains and cars.
For number 2, while we can build specific sets, it ends up being time consuming and since it's not a full datawarehouse we can't show all the abilities of the solution we're demonstrating.
So we decided to build CDG - a datawarehouse generator that we can quickly use to build scenarios where the end consumer can feel comfortable with, written with Kettle

Usage

Version 1 of CDG is called by editing a kettle transformation. In the source you'll find a src directory but that is our "development area". The code you need to use is inside kettle.
Once you open the transformation in kettle/generateDW.ktr you'll see the following:
CDG kettle transformation
There are only 2 things that need to be changed:
  1. In transformation properties you defined the name of the database and fact table
  2. In Dimension Info you configure the transformation parameters
Change the parameters you want (or just run with the default), run the transformation, and 3 files will appear in the output directory: database and table ddl, insert scripts and Mondrian schema file.

Configuring Totals

Inside the Dimension Info step you'll find a mention to the total:
/* SET THE APPROXIMATE TOTAL FOR THE VALUES */
var total = 5000;
This will be approximate total for all the breakdowns. We need to specify something within the order of magnitude of what we're trying to show. CDG will then take that value and randomize it.

Configuring Dimensions

In the same file you configure the dimensions. You can have as much as you want, just paying attention to the fact that if you use a lot of dimensions / high cardinality we can quickly end up with a huge database. While there's nothing particularly wrong with that, it's then up to you to do specific optimizations like indexes or even aggregate tables. That's outside the scope of CDG.
Here's a sample dimension definition:
var countries = [
  {countryName: "Italianos", proportion: 30},  
  {countryName: "Portugueses", proportion: 18},  
  {countryName: "Alemães", proportion: 12},  
  {countryName: "Espanhóis", proportion: 10},  
  {countryName: "Japoneses", proportion: 15},  
  {countryName: "Coreanos", proportion: 8},  
  {countryName: "Chineses", proportion: 3},  
  {countryName: "Outros", proportion: 5}  
];
The sample provided in CDG is in Portuguese to specifically test character encoding support. The generated files are in UTF-8 and we recommend always using utf-8 in the database too
By defining this object, CDG will create a dimension with 8 members and one level called countryName. You could have other properties in there and CDG would create a mondrian schema with different levels. The provided example has only one.
There's a special property in there called proportion. That will be used by CDG to do the breakdown of the total. In the example, roughly 30% of the total will be assigned to Italians and so on. In all aspects of the code there's a random factor in place.

Configuring Date Dimension

The date dimension is always a specific case, since most of the times acts as a snapshot dimension.
Since configuring all possible members of this dimension would be very time consuming, we provide an utility function that generates all the dates between 2000 and 2012 down to the month. This is standard javascript, so feel free to change this function either to change the date range, month names or even adding the day level (be aware that adding the day level will substantially increase the number of values in the fact table)
/* CONFIGURE THE DATE DIMENSION. */

var dateDim = [];

var months = [
    [1,"Jan","Janeiro"],    [2,"Fev","Fevereiro"],    [3,"Mar","Março"],    [4,"Abr","Abril"],
    [5,"Mai","Maio"],    [6,"Jun","Junho"],    [7,"Jul","Julho"],    [8,"Ago","Agosto"],
    [9,"Set","Setembro"],    [10,"Out","Outubro"],    [11,"Nov","Novembro"],  
    [12,"Dec","Dezembro"]
];

range(2000,2012).map(function(year){
    range(0,12).map(function(month){
        var m = months[month];
        dateDim.push(
            {"year":year ,"monthNo": m[0], "monthAbbrev":m[1], "monthDesc": m[2]}
        );
    });
});
;

Final configuration

In the end of the script there's the final configuration that will be used by CDG:
/* MAKE THE FINAL CONFIGURATION. DIMENSIONS CAN EITHER BE SNAPSHOT OR REGULAR BREAKDOWNS */ 

var outputArray = [

    {name: "Date", dimension: dateDim, toBreakdown: false, increment: 0.05 }, 
    {name: "Provices", dimension: provinces, toBreakdown: true },
    {name: "Countries", dimension: countries, toBreakdown: true },
    {name: "Dates", dimension: gender, toBreakdown: true } 

 ]
In here we define the names and types of the dimension. The property of toBreakdown should be true for normal dimensions and false for snapshot dimensions. If it's a snapshot dimension, you need to specify the increment property. This value of 0.05 basically means that we'll have roughly 5% increase each month.
You can then run the transformation.

Output

After running the transformation, we get this output:
cdg/kettle/output
├── cdgsample.ddl
├── cdgsample.mondrian.xml
`── cdgsample.sql
  1. A ddl file to create the database and the table
  2. A file with sql inserts to populate the datawarehouse
  3. A mondrian schema file to use within mondrian / import to pentaho
Schema workbench

Result

The result, after declaring this new datasource and registring the cube in mondrian, is a new cube that we can use.
Saiku

Issues, bugs and feature requests

In order to report bugs, issues or feature requests, please use the Webdetails CDG Project Page

License

CDG is licensed under the MPLv2 license.