Motivations
We need a way to do data validationUse cases
Below are some of the use cases we want to tackle. Emphasized are the ones we think the current spec satisfies- Global
- Is the server running?
- Is the server running properly?
- Connectivity
- Do we have all the access we should? (network / database)
- Query specific
- Do we have up to date data?
- Can we trust the data?
- How long did the queries take to run?
- Do we have wrong data? (duplicated users in community)
- Do we have a big number of 'unknowns'? (tk=1 in DW)
- Do we have peaks or valleys in the data? (due to double process or no process)
- Is the data stalled? (eg: number of twitter followers not updating)
- Did the data format change
- We need a way to handle known effects (eg: Christmas dip)
- We need to correlate independent datasources (eg: comparing AUS with Blocklist evolution)
- We need to connect the long running queries from CDA to CDV
- Be able to validate big chunks of reprocessing
- Do we have clearly wrong lines in resultset? (eg: a
line there)
- Dashboards
- Are the dashboards rendering properly?
- Do we have all the components?
- Any layout change?
- Any js errors?
- Are the dashboards performing properly?
- Can CDF talk with CDV to report client-side errors?
- Alternatively, can CDA talk with CDV to report query errors?
- Who caused the error?
- Are the dashboards rendering properly?
- ETL
- Did the etl run?
- Are we processing the expected amount of data?
- Is the etl taking the expected time to run?
- Did the etl finish before X am?
- Test etl against tracer bullets?
Work flow
We expect from this system:- A central dashboard that allows us to quickly glimpse the overall status of our system.
- Did all tests pass?
- Which one failed?
- Why?
- When was the last time the tests ran
- We need simple ways to define the tests (based on existing CDAs)
- We need to which queries failed and which queries took long time to run
- We need push notification system by email
- We need to make sure it can talk to nagios
- We need an outside test to check if server is up
Logging types
Every test will result in the following levels:- Critical
- Error
- Warn
- Ok
{
level: "Critical",
type: "Missing data",
description: "Whatever string the user defined"
}
On each test definition, we need to be able to optionally set a
timing threshold for the queries, and that will automatically generate a
log with Type 'Duration'Test types
There are 4 possible types of tests:- CDA based query validation
- ETL monitoring
- Datawarehouse validation (a specific set of the cda based query validation)
- Dashboard validation (we may opt to leave this one out for now as we'll try to infer the errors from CDA's 405)
CDA based query
Workflow
We want to select one or more cda / dataAccessId from our system, define the input parameters and select the type of validation we need.The shape of the function will be: f( [ query, [params] ], validationFunction )
The generic test will be the implementation of the validation function:
validationFunction = function ( [ {metadata: [] , resultset: [[]]} ] ) : value
That will be freely mapped to the log outputsETL monitoring query
The workflow defined here has to match with the previous section. We'll build specific CDA queries that will read the kettle log files. From that point on, specific validations will have to be built for this logsWe'll need, in pentaho, to define which connection refers to the kettle logging tables. Either by defining a special jndi or specifying in the settings.
We'll need to test for:
- Time
- Start /end time
- Amount of data processed
Datawarehouse schema validation
There are some specific tests we can do on the sanity of a datawarehouse.- Coherent amount of data on a daily / hourly basis
- Test the same as before with specific breakdowns
- Test for the amount of 'unknowns' on dimensions
Invocation and Scheduling
There are 2 ways to call the validations:- By url request
- Scheduled calls
- Every hour
- Every day
- Every week
- Every month
- Custom cron
User interface
This are the features in the main user interface (this is the ultimate goal, the implementation may be broken into stages):- See existing validations
- Allow to fire a specific validation
- Get the url of a specific validation / all validations
- Create / Edit validation
- Define query name
- Define queries and parameters
- Define validation function
- Chose log alerts (when to throw error / severe / warn / ok)
- Choose duration thresholds
- Define error message
- Define cron
- Validation status dashboard
- CDA Query error dashboard (Should this belong to CDA instead?)
- Query and parameters
- Error
- Incidents
- Duration dashboard to identify slow points in the system
- Query and parameters
- Duration
- Incidents
Technical approach
All the specific information will be stored in solution/cdv/queries/). The files will have the format _queryName.cdv and will internally be a JSON file with the following structure:{
type: "query",
name: "validationName",
group: "MyGrouping"
validation: [
{ cdaFile: "/solution/cda/test.cda", dataAccessId: "1" , parameters: [...] },
{ cdaFile: "/solution/cda/test2.cda", dataAccessId: "2" , parameters: [...] }
],
validationType: "custom",
validationFunction: "function(resultArray,arguments){ return 123 }",
alerts: {
/* This functions will be executed from bottom up. As the functions return true, the next one
will be executed and the last matching level will be thrown.
The exception to this rule is the optional okAlert(v) function. If this one returns true, no other calls will be made
*/
criticalAlert: "function(v){ return v > 10 }",
errorAlert: undefined,
warnAlert: "function(v){ return v > 5 }",
okAlert: "function(v){return v<3;}",
alertType: "MissingData",
alertMessage: "function(level,v){return 'My error message: ' + v)" /* this can either be a function or a string */
},
executionTimeValidation: {
expected: 5000,
warnPercentage: 0.30,
errorPercentage: 0.70,
errorOnLow: true
},
cron: "0 2 * * ? *"
}
Preset validations
We won't need to manually define all kinds of validations. CDV will support a preset that can also be extended by adding the definitions to solution/cdv/validationFunctions/ . The template for one such Javascript file looks like this:wd.cdv.validation.register({
name: "Existence",
validationArguments: [
{name: "testAll", type:"boolean", default: true}
],
validationFunction: function(rs, conf) {
var exists = !!conf.testAll;
return rs.map(function(r){return r.length > 0}).reduce(function(prev, curr){
return conf.testAll ? (curr && prev) : (curr || prev);
});
},
alertArguments: {
{name: "failOnExistence" type: "boolean", default: true},
{name: "failureLevel", type: "alarmLevel", default: "ERROR"},
{name: "failureMessage", type: "String", default: "Failed Existence Test: ${result}"}
},
alertMapper: function(result, conf) {
var success = conf.failOnExistence && result,
level = success ? "OK", conf.failureLevel,
message = success ? conf.successMessage : conf.failureMessage;
return Alarm(level, message, result);
}
});
The wd.cdb.validation API is defined in the Validation Module.There are 5 objects there that we need to analyze:
validationFunction(rs, conf)- This is the validation function that will be executed after the query runsvalidationArguments- Definition of the arguments that will be used within the validation functionalertArguments- Definition of the arguments that will be sent to thealertMapperalertMapper(result, conf)- Mapping between the validation result and the alerts
Preset validations or custom validations
When we define a query, we can chose which validation function to use and pass the parameters that specific validation requires.Alternatively, we can use a custom validation function. That validation function has the following format, where all we need is to return the alarm level (this is a spec, may change after implementation)
function(rs, conf) {
var exists = rs.map(function(r){return r.length > 0}).reduce(function(prev, curr){
return conf.testAll ? (curr && prev) : (curr || prev);
});
return exists ? Alarm.ERROR : Alarm.OK;
}
CDA integration
We need a tight integration between CDA and CDV to report:- Errors in CDA queries
- Long running CDA queries
- Queries with obvious errors in the structure (eg: missing lines)
External interfaces
We can have several external interfaces supported:- Http
- Nagios integration
- Server up check
On the http case, we can pass a flat to force a test to be rerun.
For nagios, we can have an export of test rules
Settings
We'll be able to define the group rules, mainly for connectivity reasons. So the settings (that later can be converted to an UI), will look like this:- Groups
- Email
- Threshold
- Destinations
- SMS
- Threshold
- Destinations
- Email


