Tuesday, January 24, 2012

Firefox Telemetry - From adhoc R analysis to CDE Dashboards

Time to put on my analyst hat. No matter what dashboard we're trying to build, it will all fail if the underlying analysis isn't good.

A dashboard is a great way to let users get information on a specific subject, provided we know what we're going to show. In this case I had absolutely no idea.

 When we're sitting on a bunch of data, we need to go through a discovery phase where we'll actually decide what information can be valuable to the user.


Telemetry Analysis

Telemetry is a Mozilla project that aims to make its products better (Firefox, Thunderbird and Fennec, the codename for Firefox Mobile) by analyzing performance data sent by users during their real-world activity, and the impact that developer changes have on that performance. The goal is simple - make the products better and their users happier and more productive.
 
As one can imagine, we have a bunch of data. All the submissions are primarily stored in HBase and later aggregated into ElasticSearch, allowing more versatile / real time analysis. We were then able to get a dynamic view over the data that summed up all the contributions from the users:


I previously blogged about the techniques that allow us to get data from/to ElasticSearch, and once again they proved invaluable. In this case, due to the huge amount of data, we had to use Kettle's UDJC (User Defined Java Class) step, initially developed by Mozilla Metrics' chief engineer Daniel Einspanjer, along with the Jackson JSON processor to achieve high performance while processing the dataset. We also submitted some Kettle improvements along the way.


New questions

This allowed developers to view the impact of their changes, and had the best effect a data tool can have while answering some questions: it raised other questions.

Most of the follow-up questions were related to time-based analysis: being able to track, over time, the impact of the changes on a specific probe. This would have the immediate effect of giving people data to decide whether a specific release channel is ready to move to the next channel in the rapid release cycle, and would answer some of the new questions that the new process brings:

- Is Aurora ready to move to Beta?

- Are we getting the expected performance improvement in Nightly?

As a stretch goal, my personal objective was to implement a system that would let us quickly identify regressions in the code without having to manually go through all the probes.


Back to basics - Kimball's DataWarehouse

This required a new approach to the data. Or rather, an old approach. In Business Intelligence, we live in exciting times where we have tons of available technologies that allow us to choose the best tool for the job (I recently did a blog post on the subject). But let's not forget 20 years of knowledge. This specific set of questions required building a standard, Kimball-style data warehouse.


Telemetry Evolution

The goal is to have a way to track the improvements on the project's code by tracking, over time, the evolution of some key metrics we chose. Currently, the ones that are being tracked are:
  • Mean
  • Standard Dev.
  • Median
  • Percentiles (25,50,75)
Since telemetry data is stored in buckets on the client side, these values are not statistically exact: they are not the mean and standard deviation of a particular probe, e.g. CYCLE_COLLECTOR, but the mean and standard deviation of the bucketed values after submission. The same goes for all the others. However, while not an absolutely accurate representation of the end user's scenario, they proved to be very effective in quantifying and measuring changes.
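To make the "statistics over buckets" idea concrete, here is a minimal Python sketch. It assumes a histogram is available as a mapping from a bucket's representative value to the number of samples in that bucket; that layout, and the function names, are illustrative assumptions and not the actual telemetry format or the code used in the warehouse.

```python
import math


def bucketed_stats(histogram):
    """Mean and standard deviation of bucketed values.

    `histogram` maps a bucket's representative value to its sample count
    (an assumed layout, for illustration only).
    """
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    mean = sum(value * count for value, count in histogram.items()) / total
    variance = sum(
        count * (value - mean) ** 2 for value, count in histogram.items()
    ) / total
    return {"count": total, "mean": mean, "stddev": math.sqrt(variance)}


def bucketed_percentile(histogram, pct):
    """Smallest bucket value whose cumulative count reaches pct percent."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("empty histogram")
    threshold = total * pct / 100.0
    running = 0
    for value in sorted(histogram):
        running += histogram[value]
        if running >= threshold:
            return value
    return max(histogram)  # only reached through rounding, kept as a guard


# Made-up numbers, only to show the calls.
example = {0: 10, 10: 30, 50: 40, 100: 20}
stats = bucketed_stats(example)
quartiles = [bucketed_percentile(example, p) for p in (25, 50, 75)]
```

Every sample inside a bucket is treated as if it had the bucket's value, which is exactly why the results describe the bucketed data rather than the raw measurements.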


Concepts

We'll consider platform builds for a specific application, version and OS as having the same codebase. So our key is platformBuildID-appName-appVersion-OS. We treat that as our "primary key": all submissions with the same key are aggregated together and considered to be generated from the same code.


On a daily basis we'll query telemetry for the builds made in the last 7 days (configurable value). The assumption here is that 7 days is enough for sampling, and that changes to the main KPIs after that period would be due to environmental changes rather than code changes. This number is currently being studied to find out the best value to use.

We're also discarding, for this data warehouse, all submissions with fewer than 500 counts, in order to have a good enough sample size.
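The rules above (one key per codebase, a 7-day build window, a 500-count minimum) can be sketched in a few lines of Python. The record field names `buildDate` and `count` are hypothetical, and applying the 500 threshold to the aggregated count per key is my reading of the rule for this sketch, not a description of the actual ETL.

```python
from collections import defaultdict
from datetime import date, timedelta

# The four fields that make up the "primary key" described above.
KEY_FIELDS = ("platformBuildID", "appName", "appVersion", "OS")


def build_key(submission):
    """Tuple identifying submissions generated from the same code."""
    return tuple(submission[field] for field in KEY_FIELDS)


def aggregate_recent_builds(submissions, today, window_days=7, min_count=500):
    """Sum counts per key for recent builds, dropping small samples.

    `buildDate` and `count` are hypothetical field names.
    """
    cutoff = today - timedelta(days=window_days)
    totals = defaultdict(int)
    for submission in submissions:
        if submission["buildDate"] >= cutoff:
            totals[build_key(submission)] += submission["count"]
    return {key: n for key, n in totals.items() if n >= min_count}


def make(build_id, build_date, count):
    """Helper that builds a made-up submission record for the example."""
    return {"platformBuildID": build_id, "appName": "Firefox",
            "appVersion": "12.0a1", "OS": "WINNT",
            "buildDate": build_date, "count": count}


sample = [
    make("20120120", date(2012, 1, 20), 300),
    make("20120120", date(2012, 1, 20), 250),   # same key, aggregated
    make("20120122", date(2012, 1, 22), 100),   # too few counts
    make("20120101", date(2012, 1, 1), 9000),   # build older than 7 days
]
kept = aggregate_recent_builds(sample, today=date(2012, 1, 24))
```

Both `window_days` and `min_count` are parameters here because the post describes the window as configurable and still under study.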



R Analysis

Spent about two weeks building this datawarehouse. With no guarantees that the results would yield anything decent. So once I had a resultset I could work with, took the opportunity to use R to analyze the data. This has been, for ages, an item on my to-do list.

R is an insanely powerful statistical analysis tool with tons of packages, which guarantees that the bottleneck will be your own mathematical knowledge (or lack thereof), making it one of the analysts' favorite tools.

R does wonders when we have the data in a tabular format and want to do ad-hoc analysis, so I picked a resultset and started playing with the data. I used the CYCLE_COLLECTOR probe evolution on the Windows platform and the Nightly channel.

The first thing I did was try to get a feel for the shape of the data (this, obviously, after a couple of days trying to find my way around R). After a while, it was looking like this:


The initial analysis showed a relation between the submission counts and the mean / std dev. The higher the count, the lower the mean and standard deviation. This is consistent with something the metrics team already knew - the initial submissions are not representative of the general population, so in this case size really matters.

I also tried for a while to find a statistical model for this data, mostly around fitting a normal distribution and then getting more analysis out of its parameters, such as the CDF and the density function. This proved to be a frustrating task, as no decent fit came from it.

Due to all the distinct types of probes in the code, we decided to only take into consideration the mean and standard deviation, and to look at their evolution over time. This is the view we decided to use:


In a single chart we can see the evolution of CYCLE_COLLECTOR: the position of each point represents the mean, the size of the point represents the standard deviation (not the exact value, but scaled accordingly), and the color represents the size of the sample.


From R to CDE Dashboard

The next step, after knowing the kind of analysis we need to give to developers, is to build a dashboard that allows users to get this data from the BI system automatically, up to date and with the ability to quickly parametrize it. And obviously to skip the need for the consumers of the data to have any knowledge of R.

All the Ctools were made with the goal of being able to build virtually *anything*, and replicating an R analysis is a very good challenge. Here's the end result after... 2 days

With all the live connections to the data, users can freely play with it and change the parameters to quickly see the impact of code changes.



Discovery

One of the biggest advantages of a data warehouse is that it comes with an astonishing query language, MDX. In our case (and for anyone using Pentaho as a BI server) we're using Mondrian as the ROLAP engine that allows those queries to run.

MDX is very well suited for answering business questions, behaving particularly well on time-based analysis. So the next step was building a table that could compare the average of the last 7 days with the average of the prior 28 days. Big shifts would indicate either improvements or regressions. Here's the resulting table, ordered by default on regressions:


A regression of 1590% was immediately noticed. Clicking on that row allowed us to inspect the actual histogram distribution:



I immediately checked with one of the Firefox developers, who mentioned that there was an error in that specific build that caused the counters of this probe to be totally skewed. Success!
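The comparison behind that table (last 7 days against the prior 28) is easy to sketch outside of MDX. The Python below is only an illustration under assumptions: it takes one mean value per day for each probe, and it treats a higher value as worse, which fits timing probes but not necessarily every probe. The actual dashboard runs MDX queries on Mondrian, and all names here are hypothetical.

```python
from datetime import date, timedelta


def window_mean(daily, start, end):
    """Average of the daily values with start <= day < end, or None."""
    values = [value for day, value in daily.items() if start <= day < end]
    return sum(values) / len(values) if values else None


def percent_shift(daily, end, recent_days=7, baseline_days=28):
    """Percent change of the recent window against the prior baseline.

    `daily` maps a date to that day's mean for one probe (assumed layout);
    `end` is the first day after the recent window.
    """
    recent_start = end - timedelta(days=recent_days)
    baseline_start = recent_start - timedelta(days=baseline_days)
    recent = window_mean(daily, recent_start, end)
    baseline = window_mean(daily, baseline_start, recent_start)
    if recent is None or not baseline:
        return None  # not enough data, or a zero baseline
    return (recent - baseline) / baseline * 100.0


def rank_regressions(probes, end):
    """Probes sorted by shift, largest increase first."""
    shifts = {name: percent_shift(daily, end) for name, daily in probes.items()}
    known = [(name, s) for name, s in shifts.items() if s is not None]
    return sorted(known, key=lambda item: item[1], reverse=True)


# Made-up series: 28 baseline days followed by 7 recent days.
end = date(2012, 1, 24)
days = [end - timedelta(days=n) for n in range(1, 36)]
probes = {
    "PROBE_A": {d: (150.0 if (end - d).days <= 7 else 100.0) for d in days},
    "PROBE_B": {d: 100.0 for d in days},
}
ranking = rank_regressions(probes, end)
```

A shift like the 1590% above would land at the top of such a ranking, which is what makes the table useful as a first filter before looking at the histogram itself.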

It's instantly rewarding to find out that the improvements far outnumber the regressions. One of my favorites, which shows all the improvements that developers have been putting into the code, is IMAGE_DECODE_ON_DRAW_LATENCY:


This is currently being used by the internal product developers to give them metrics on their code, and the metrics team is working on allowing contributors outside the company to take advantage of these tools.


Help Mozilla help you

This is only possible with the help of users who are willing to submit their performance data back to Mozilla. This is what we do with your data. There's absolutely nothing that can be traced back to you, as privacy is always the number one concern at Mozilla. And here's how you can help:




