Friday, February 17, 2012

New CDA and CCC releases: 12.02.16

Two more releases. Details:


CDA:

Main feature: Change to samples location. Migrated from bi-developers to plugin-samples. All Ctools will now install their samples to this location.

Changelog:

  • Changed samples to plugin-samples.
  • Plugin now compliant with Pentaho Marketplace
  • Added Boolean to AbstractExporter
  • Cda cache entries print
  • Removed some exceptions that were never thrown



CCC:

This is the standalone CCC release, so there's no impact on ctools-installer usage or stable CDF/CDE.

The most relevant feature in this release is the inclusion of a new chart type: Normalized Bar Chart - a 100% stacked/normalized/segmented bar chart.


  • New chart type "NormalizedBarChart" (100% stacked/normalized/segmented bar chart)
  • Added new "selectable" chart option that is false by default. Currently, selection is supported by BarChart, WaterfallChart and HeatGridChart.
  • Fixed Redmine Bug #193: BoxplotChart does not show box tooltips correctly
  • Fixed Redmine Bug #208: BarLine charts break when Independent Line Scale is set to false
  • Fixed Redmine Bug #298: Wrong extension point in ccc documentation
  • Fixed Redmine Bug #316: Tooltips of Dots in line charts do not always appear
  • Fixed Redmine Bug #317: Timeseries axes ticks, in some cases, get drawn far to the left, off the chart
  • Functions 'getZZZScale' now have a 'keyArgs'-like interface. This interface makes it easy to add new arguments to the functions.
  • Refactored pvcPanel.js into separate files.
  • Put classes in pvcCategoricalAbstract.js and pvcDataDimension.js into separate files
  • Fixed bug in value/label code of Categorical charts
  • Added support for value and label distinction, for title and subtitle, on the bullet chart.
  • Fixed ZOrder bug - did not detect correctly a new child with ZOrder already applied
  • Fixed tipsy tooltip issue with dots and lines hiding the tooltip immediately on mouseleave, which is incompatible with the point/unpoint pseudo-events typically used in these cases to attach the tipsy behavior.
  • Fixed pvc.mergeDefaults function, which was copying specified options that had an === undefined value
  • Added logging of received Metadata and resultset to DataEngine.
  • Added logging of received options, in pvc.Base#preRender.
  • DataDimension inherited code from Composite Ordinal Axes related to the conversion of data to tree form and to obtaining labels for its elements. New chart options 'getCategoryLabel' and 'getSeriesLabel'.
  • Click and Double-click in AxisPanel have been refactored to allow usage by other axis types besides the composite axis (like linear and normal-ordinal).
  • Axis value selection is now integrated in the click event (and thus, xAxisClickAction and yAxisClickAction of HeatGrid have been removed).
  • Axis 'clickAction' and 'doubleClickAction' now receive an object argument that contains the properties: 'value', 'label', 'absValue', 'path', 'absLabel'. This breaks the previous interface of these handlers (its .toString method returns the 'value' property, to ensure some backward compatibility).
  • Ordinal axes have changed the type of "datum" that is passed to property functions. This can affect functions given to the following extension points: 'Ticks_', 'Label_', 'Grid_'. The "datum" received is equal to that described for the clickAction event handler.
  • Fixed bug in MetricAbstract that affected drawing of base linear axes.
  • Changed options handling by chart classes. They now all declare their options in a static property called 'defaultOptions', which is handled in the instance constructor by the new function pvc.mergeDefaults. This lets options reach the root base class (previously, difficulties in option handling meant that some options were not passed to base classes), which was required so that the new "getLabel" options are available in the base class by the time the DataEngine is created. This involved collecting all options of every chart class and placing a default for each in the defaultOptions object. Only options declared there are read from the user options.
  • Some refactoring of MultiValueTranslator.
  • Improved lasso selection, to behave better near min. width or height of selection rectangle (method _createSelectionOverlay lost the w and h arguments).
  • Fixed bug in the HeatGrid and Waterfall tooltip calculation code that caused tooltips to be shown delayed by one mark instance. To this end, a new method has been added to pv.Mark that allows defining a local mark property: 'localProperty'. In most situations this is a better choice than using 'def', which isn't evaluated per instance of a Mark, but per build of a Mark.
  • Performance improvement on heatgrid: don't reevaluate tooltip when shapes are re-rendered
  • Solved lasso show delay
  • Fixed some bugs related to the timeSeries option being wrongly read on the panel or on the chart.
  • compositeAxis: issue with labels overlapping with map when aligned to the right
  • Fixed bug related to the last showTooltips change.
  • Normalized showTooltips option and tipsySettings options on Categorical charts.
  • Generalized categorical charts rubber band detection of selected data.
  • Added pvc.CategoricalAbstract#clearSelections method to support Pentaho Analyzer.
  • Solved missing ;
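
To make the 'defaultOptions' change more concrete, here is a minimal sketch of the idea described in the changelog: only declared options are read from the user options, and options explicitly set to undefined fall back to the default. This is not the actual pvc.mergeDefaults source; the function signature and the default value of showTooltips are assumptions made for illustration only.

```javascript
// Hypothetical sketch, NOT the real pvc.mergeDefaults implementation.
// Reads only the options declared in `defaults`, and keeps the default
// whenever the user value is missing or explicitly undefined.
function mergeDefaults(target, defaults, userOptions) {
  Object.keys(defaults).forEach(function (name) {
    var userValue = userOptions ? userOptions[name] : undefined;
    target[name] = userValue !== undefined ? userValue : defaults[name];
  });
  return target;
}

// 'selectable' is false by default (see the changelog above);
// the showTooltips default is an assumption for this example.
var defaultOptions = { selectable: false, showTooltips: true };

var options = mergeDefaults({}, defaultOptions, {
  selectable: true,        // declared: user value wins
  showTooltips: undefined, // declared but undefined: default is kept
  notDeclared: 42          // not declared in defaultOptions: ignored
});
```

With this approach a misspelled option name is silently dropped, which is the trade-off of only reading declared options.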

Monday, February 13, 2012

CDC (Community Distributed Cache) - Almost there

I recently did a post on CDC. In case you're wondering where we are on this development, what better way than to share our own project management tool?


There are some things you can see from this:
  • It's nearly done \o/
  • It's going to be awesome
  • Developing a plugin is really expensive
Some people seem to forget that the fact that a tool doesn't have any acquisition costs doesn't mean that it didn't cost a lot to make. The reported numbers in redmine are true. That's how much CDC is costing us. Remember that the next time you need to implement a project - give us a call ;)


Going back to what really matters, here are some samples of how it is currently looking:



See you soon!

Wednesday, February 1, 2012

CDE2 - Feature requests

If someone started to work on a new version of CDE, what features would you like to see there?


Answer here or in the Ctools section of the Pentaho forums.

Friday, January 27, 2012

CDA release - 12.01.26

Aaaand, since there's been a lot of time without a release, CDA 12.01.26 is here.



  • Solved Redmine Bug #104: CDA Cache Manager -> UI didn't update after deleting a query
  • Implemented Redmine Feature #105: CDA Cache Manager -> Delete all queries belonging to a cda file
  • Fixed a bug where html output would be duplicated in some cases
  • Cache refactor; cache monitor: +removeAll, require admin permissions
  • cachemanager: user feedback for server requests
  • Support for cache plugin bean. Serialization changed
  • Added version info to cachemanager and SelfTest Page
  • Sorted out some images on SelfTest Page

Ctools installer with -b stable will get this version

Thursday, January 26, 2012

CCC release - 12.01.25

New CCC release 12.01.25 (standalone version, soon to be included in the next stable CDF release). It is already available if you're using ctools installer with -b dev.



Changelog:

  • Implemented Redmine Feature #107 - Control number of labels on the linear Axis for categorical charts (show "MinorTicks" option, including 2nd axis)
  • Implemented Redmine Feature #108 - Control number of ticks on the linear Axis for categorical charts ("DesiredTickCount", including 2nd axis with independent scale)
  • Implemented Redmine Feature #109 - Rounded maximum for linear axis in categorical charts ("DomainRoundMode" option, including 2nd axis with independent scale)
  • Solved Redmine Issue #78 - Fix the vertical order in which series are drawn, so that when applicable, they show from top to bottom.
  • Solved Redmine Issue #121 - Tooltips in barcharts do not appear if bars overflow.
  • Solved Redmine Issue #103 - Ordinal axis grids not being drawn
  • MultiValueTranslator: issue when no categories
  • Solved valueFormat receives numeric value, doesn't parse
  • Fixed typo of property name in LegendPanel 
  • Add multi-series barline support
  • useCompositeAxis compatible with flat arrays
  • vml namespace conflict: revert sparkline, declaration in protovis-msie no longer lazy
  • align horizontal text in composite vertical axis towards the chart; revert convention breaks in multiline conditional expressions
  • workaround issue in 16th decimal position in IE9 64bit
  • Fixed regression with bulletcharts being translated 10px down
  • Added new (and some of the missing) documentation to the testZZZ.html files
  • Fixed the drawing of bars and grid lines on the ordinal scale: they were not centered with the tick and label 
  • In linear axis, made minorTicks "extend" (major)ticks, so that visibility (through .visible or .strokeStyle) of the latter affects the former.
  • testZZZ.html files documentation mentioned '{x,y}AxisFullGrid_' instead of the correct value '{x,y}AxisGrid_'.
  • Fixed linear axis grid to show a line on the last tick (as opposed to the ordinal axis, that does not show the last line). When EndLine is active, it is drawn above the last grid line.
  • Fixed bug in the positioning of linear scale labels that revealed itself (don't know why) only on time series charts
  • Fixed bug in time series scale range calculation when used with a second axis
  • Fixed bug in the drawing of minor ticks on time series scales (date arithmetic issues)
  • Fixed regression bug in ScatterCharts (DotChart, LineChart, StackedLineChart and StackedAreaChart) that caused null values to break line drawing. 
  • Fixed the visibility of the first grid line of a time series axis - it did not show because, in this case, the first tick is not on the origin.
  • Fixed compatibility issue between jQuery.sparkline and protovis-msie when in IE8.
  • heatgrid: +scalingType:'discrete' (interval-based, no color interpolation)
  • tipsy w/ followMouse: don't fall out of window
  • Heatgrid: ignore null values in min/max calculations; nullShape not taking correct index into account;
  • solved dangling variable reference
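
As an illustration of what an interval-based ('discrete') scaling means compared to color interpolation, here is a minimal sketch. It is not the CCC heatgrid code: the function name, its arguments and the palette are all hypothetical, and it only shows the general technique of splitting the value range into equal intervals, one per color, while ignoring nulls.

```javascript
// Hypothetical sketch of interval-based (discrete) color scaling.
// The [min, max] range is split into as many equal intervals as there
// are colors; every value in an interval gets exactly the same color.
function discreteColor(value, min, max, colors) {
  if (value === null || value === undefined) {
    return null; // null values get no color (handled elsewhere, e.g. a null shape)
  }
  if (max === min) {
    return colors[0]; // degenerate range: a single interval
  }
  var step = (max - min) / colors.length;
  var index = Math.floor((value - min) / step);
  // Clamp so that value === max falls in the last interval.
  index = Math.max(0, Math.min(colors.length - 1, index));
  return colors[index];
}

var palette = ["green", "yellow", "red"]; // assumed palette
var low = discreteColor(0, 0, 90, palette);
var mid = discreteColor(45, 0, 90, palette);
var high = discreteColor(90, 0, 90, palette);
```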

Great stuff! :)

Tuesday, January 24, 2012

Firefox Telemetry - From adhoc R analysis to CDE Dashboards

Time to put on my analyst hat. No matter what dashboard we're trying to build, it will all fail if the underlying analysis isn't good.

A dashboard is a great way to allow users to get information on a specific subject, considering we know what we're going to show. In this case I had absolutely no idea.

When we're sitting on a bunch of data, we need to go through a discovery phase where we'll actually decide what information can be valuable to the user.


Telemetry Analysis

Telemetry is a project from Mozilla that aims to make its products better - Firefox, Thunderbird and Fennec (codename for Firefox Mobile) - by analyzing performance data sent by users during their real-world activity, and the impact that developer changes have on that performance. The goal is simple - make better, happier, more productive.
 
As one can imagine, we have a bunch of data. All the submissions are primarily stored in HBase and later aggregated into ElasticSearch, allowing more versatile / real time analysis. We were then able to get a dynamic view over the data that summed up all the contributions from the users:


I previously blogged about the techniques that allow us to get data from/to ElasticSearch, and once again they proved invaluable. In this case, due to the huge amount of data, we had to use Kettle's UDJC step, initially developed by Mozilla Metrics' chief engineer Daniel Einspanjer, along with the Jackson JSON processor, to achieve high performance while processing the huge dataset. We also submitted some Kettle improvements along the way.


New questions

This allowed developers to view the impact of their changes and had the best effect a data tool can have while answering some questions - it raised other questions.

Most of the follow-up questions were related to time-based analysis: being able to track the impact of the changes on a specific probe over a period of time. This would have the immediate effect of giving people data to decide whether a specific release channel is ready to move to the next channel in the rapid release cycle, and would answer some of the new questions that the new process brings:

- Is Aurora ready to move to Beta?

- Are we getting the expected performance improvement in Nightly?

As a stretch goal, my personal objective was to implement some kind of system that allowed us to quickly identify regressions in the code without having to manually go through all the probes.


Back to basics - Kimball's DataWarehouse

This required a new approach to the data. Or rather, an old approach. In Business Intelligence, we live in exciting times where we have tons of available technologies that allow us to choose the best tool for the job (I recently did a blog post on the subject). But let's not forget 20 years of knowledge. This specific set of questions required building a standard, Kimball style data warehouse.


Telemetry Evolution

The goal is to have a way to track the improvements on the project's code by tracking, over time, the evolution of some key metrics we chose. Currently, the ones that are being tracked are:
  • Mean
  • Standard Dev.
  • Median
  • Percentiles (25,50,75)
Since telemetry data is stored in buckets on the client side, these values are not statistically accurate: they are not the mean and stddev of a particular probe, e.g. CYCLE_COLLECTOR, but the mean and stddev of the bucketed values after submission. The same goes for all the others. However, while not an absolutely accurate representation of the end user's scenario, they proved to be very effective in quantifying and measuring changes.
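
To show what computing these metrics over bucketed values looks like, here is a minimal sketch. It is not the code used in the data warehouse: the histogram shape (an array of { value, count } objects sorted by bucket value) and the function name are assumptions for illustration. Every sample in a bucket is treated as if it had the bucket's value, which is exactly why the results are approximations.

```javascript
// Hypothetical sketch: approximate statistics from a bucketed histogram.
// `buckets` is assumed to be sorted by `value`, where `value` is the
// bucket's representative value and `count` the number of samples in it.
function bucketStats(buckets) {
  var total = buckets.reduce(function (sum, b) { return sum + b.count; }, 0);
  if (total === 0) {
    return null; // nothing submitted, nothing to compute
  }
  var mean = buckets.reduce(function (sum, b) {
    return sum + b.value * b.count;
  }, 0) / total;
  var variance = buckets.reduce(function (sum, b) {
    return sum + b.count * Math.pow(b.value - mean, 2);
  }, 0) / total;

  // Percentile: value of the first bucket whose cumulative count
  // reaches the requested fraction of all samples.
  function percentile(p) {
    var threshold = p * total;
    var cumulative = 0;
    for (var i = 0; i < buckets.length; i++) {
      cumulative += buckets[i].count;
      if (cumulative >= threshold) {
        return buckets[i].value;
      }
    }
    return buckets[buckets.length - 1].value;
  }

  return {
    count: total,
    mean: mean,
    stddev: Math.sqrt(variance),
    p25: percentile(0.25),
    median: percentile(0.5),
    p75: percentile(0.75)
  };
}

// Made-up histogram, for illustration only.
var stats = bucketStats([
  { value: 10, count: 2 },
  { value: 20, count: 6 },
  { value: 40, count: 2 }
]);
```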


Concepts

We'll consider platform builds for a specific application, version and OS as having the same codebase. So platformBuildID-appName-appVersion-OS is our "primary key": all submissions with the same key are aggregated together and considered to be generated from the same code.


On a daily basis we'll query telemetry for the builds made in the last 7 days (a configurable value). The assumption here is that 7 days is enough for sampling, and that changes to the main KPIs after that period would be due to environmental changes and not to the code. This number is currently being studied to find the best value to use.

We're also discarding, for this datawarehouse, all submissions with fewer than 500 counts, in order to have a good enough sample size.
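
The concepts above can be sketched in a few lines. This is only an illustration, not the actual ETL (which was built with Kettle): the field names on each submission and the sample values are assumptions, and the sketch assumes the 500 threshold applies to the aggregated count per key.

```javascript
// Hypothetical sketch of the aggregation key and the minimum sample size.
// Field names (platformBuildID, appName, appVersion, OS, count) are assumed.
function buildKey(submission) {
  return [
    submission.platformBuildID,
    submission.appName,
    submission.appVersion,
    submission.OS
  ].join("-");
}

// Sums the counts per key and drops keys below the minimum sample size.
function aggregateByKey(submissions, minCount) {
  var totals = {};
  submissions.forEach(function (s) {
    var key = buildKey(s);
    totals[key] = (totals[key] || 0) + s.count;
  });
  return Object.keys(totals)
    .filter(function (key) { return totals[key] >= minCount; })
    .map(function (key) { return { key: key, count: totals[key] }; });
}

// Made-up submissions, for illustration only.
var aggregated = aggregateByKey([
  { platformBuildID: "20120110", appName: "Firefox", appVersion: "12.0a1", OS: "WINNT", count: 300 },
  { platformBuildID: "20120110", appName: "Firefox", appVersion: "12.0a1", OS: "WINNT", count: 250 },
  { platformBuildID: "20120110", appName: "Firefox", appVersion: "12.0a1", OS: "Linux", count: 100 }
], 500);
```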



R Analysis

Spent about two weeks building this datawarehouse. With no guarantees that the results would yield anything decent. So once I had a resultset I could work with, took the opportunity to use R to analyze the data. This has been, for ages, an item on my to-do list.

R is an insanely powerful statistical analysis tool with tons of packages that will guarantee that the bottleneck will be your own mathematical knowledge (or lack of it), making it one of the analysts' favorite tools.

R does wonders when we have the data in a tabular format and want to do ad-hoc analysis, so I picked a resultset and started playing with the data. I used CYCLE_COLLECTOR probe evolution on windows platform and Nightly channel.

The first thing I did was trying to get a feeling of the shape of the data (this, obviously, after a couple of days trying to find my way around R). After a while, it was looking like this:


The initial analysis revealed a relation between the submission counts and the mean / std dev: the higher the count, the lower the mean and standard deviation. This is coherent with something the metrics team already knew - the initial submissions are not representative of the general population, so in this case size really matters.

Also tried for a while to find a statistical model for this data, mostly around fitting a normal distribution and thus trying to get more analysis from the parameters, like the CDF and other density functions. This proved to be a frustrating task, as no decent fit came from it.

Due to all the distinct types of probes in the code, we decided to only take into consideration the mean and standard deviation, and to look at their evolution over time. This is the view we decided to use:


In a single chart we can tell the evolution of CYCLE_COLLECTOR: the position of each point represents the mean, its size represents the standard deviation (not the exact value, but according to the scale) and its color represents the size of the sample.


From R to CDE Dashboard

The next step, after knowing the kind of analysis we need to give to developers, is to build a dashboard that allows users to get this data from the BI system automatically, up to date and with the ability to quickly parametrize it. And obviously skip the need for the consumers of the data to have knowledge of R.

All the Ctools were made with the ability to build virtually *anything* in mind, and replicating an R analysis is a very good challenge. Here's the end result after... 2 days:


With all the live connections to the data, users can freely play with it and change the parameters to quickly see the impact of code changes.



Discovery

One of the biggest advantages of a datawarehouse is that it comes with an astonishing query language, MDX. In our case (and for anyone using Pentaho as a BI server) we're using Mondrian as the ROLAP engine that allows those queries to run.

MDX is very well suited for answering business questions, behaving particularly well on time-based analysis. So the next step was building a table that could compare the average of the last 7 days with the average of the prior 28 days. Big shifts would indicate either improvements or regressions. Here's the resulting table, ordered by default on regressions:


A regression of 1590% was immediately noticed. Clicking on that row allowed us to inspect the actual histogram distribution:



I immediately checked with one of the Firefox developers, who mentioned that there was an error in that specific build that caused the counters of this probe to be totally skewed. Success!
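
The arithmetic behind that table can be sketched as follows. The real comparison was done with MDX on Mondrian; this JavaScript version only illustrates the calculation, and the function names, the input shape (one mean per day, oldest first) and the probe names are assumptions. It also assumes that a higher value means a regression, which is the case for timing probes.

```javascript
// Hypothetical sketch of the "last 7 days vs. prior 28 days" comparison.
function average(values) {
  return values.reduce(function (sum, v) { return sum + v; }, 0) / values.length;
}

// dailyMeans: one value per day, ordered from oldest to newest.
// Returns the shift in percent, or null when it cannot be computed.
function percentShift(dailyMeans) {
  if (dailyMeans.length < 35) {
    return null; // not enough history for both windows
  }
  var last7 = average(dailyMeans.slice(-7));
  var prior28 = average(dailyMeans.slice(-35, -7));
  if (prior28 === 0) {
    return null; // avoid dividing by zero
  }
  return (last7 - prior28) / prior28 * 100;
}

// Orders probes so that the biggest regressions come first.
function rankByRegression(probes) {
  return probes
    .map(function (p) { return { name: p.name, shift: percentShift(p.dailyMeans) }; })
    .filter(function (r) { return r.shift !== null; })
    .sort(function (a, b) { return b.shift - a.shift; });
}

// Helper to build made-up series: 28 days at one level, then 7 at another.
function series(priorValue, recentValue) {
  var days = [];
  for (var i = 0; i < 28; i++) { days.push(priorValue); }
  for (var j = 0; j < 7; j++) { days.push(recentValue); }
  return days;
}

var ranked = rankByRegression([
  { name: "PROBE_IMPROVED", dailyMeans: series(10, 5) },
  { name: "PROBE_REGRESSED", dailyMeans: series(10, 20) }
]);
```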

It's instantly rewarding to find out that the improvements vastly outnumber the regressions. One of my favorites, which shows all the improvements that developers have been putting in the code, is IMAGE_DECODE_ON_DRAW_LATENCY.


This is currently being used by the internal product developers to give them metrics over their code, and the metrics team is working on allowing contributors outside the company to take advantage of these tools.


Help Mozilla help you

This is only possible with the help of users that are willing to submit their performance data back to Mozilla. This is what we do with your data. There's absolutely nothing that can be traced back to you, as privacy is always the number one concern at Mozilla. And here's how you can help:





Wednesday, January 18, 2012

Multiple parameters in CDF / CDE

This tech tip shows how to configure multiple parameters to work in CDE / CDF.

We can use any query, but in this case we'll start with the parameter wizard, located in the datasources panel.

Create a new dashboard and go to the datasources panel. Under "Wizards", select the parameter wizard. Select a cube / dimension and drag it to the rows.


I selected the multiple button component, but any of the other multiple selection components would work too. The multiple button component supports both single selection (the default) and multiple selection. We need to activate multiple selection in the component that was generated by the wizard.


The generated code works, but we may want to define a set of defaults to be preselected. In order to do that, we need to delete our simple parameter and add a custom parameter with the following code:
["[Markets].[EMEA]","[Markets].[NA]"]



This array has to match the ids we use in the parameter. That way, when we preview the dashboard, we get those values selected by default, as we wanted.
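
If the defaults don't show up as selected, a quick way to debug is to compare them against the ids the selector actually offers. This is a minimal sketch in plain JavaScript, not a CDF API: the function name is made up, and the list of available ids is an assumption (in a real dashboard it comes from the query behind the component).

```javascript
// Hypothetical helper: returns the defaults that do not match any id
// offered by the selector (an empty array means everything matches).
function unmatchedDefaults(defaults, availableIds) {
  return defaults.filter(function (id) {
    return availableIds.indexOf(id) === -1;
  });
}

// Assumed ids, as they would be returned by the selector's query.
var availableIds = ["[Markets].[APAC]", "[Markets].[EMEA]", "[Markets].[Japan]", "[Markets].[NA]"];

// The custom parameter value from above.
var defaults = ["[Markets].[EMEA]", "[Markets].[NA]"];

var missing = unmatchedDefaults(defaults, availableIds);
// A default written without the dimension prefix would not match:
var missingShort = unmatchedDefaults(["EMEA"], availableIds);
```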

Have fun!


-pedro