Data Services Pyramid
While delivering services on Business Intelligence - or any of the
other names you want to make up for it: Business Analytics, Data
Science, whatever (I’ve always been more focused on actual content than
on the wrapping some people like to put around it) - one very important
part that sometimes ends up a bit overlooked is how you hand out the
information.
In the end, that’s what it’s all about. All the discussions about
technology choices, implementation and hardware end up the same way -
giving the information back to the consumers.
This may seem obvious. It’s not, and it has a lot of implications,
especially when we have a huge number of technologies within reach to
choose from.
This is written from the perspective of someone internal to a
company’s metrics / data engineering team. I’m wearing my Mozilla
metrics hat now, thinking about all the challenges we face on a daily
basis.
Information delivery
One may be tempted to think that the best way to deliver information
is the one where the user has the most degrees of freedom when it
comes to interacting with that data. It’s not. With great power comes
great responsibility, and sometimes less is better.
A metrics person has to guarantee not only that the data he delivers is
correct, but also that there’s no possibility of that data being
misunderstood. I find the second part harder than the first, especially
when we’re dealing with people who have been in a specific line of
business for a long time.
Ground zero - Raw data
This is the example taken to the extreme. Raw data is, by far, the
most valuable source of information, and the one we can draw the most
information from. Obviously, it makes no sense to hand this source out
to information consumers. Very few hold the keys to the dungeon, and
for good reasons.
You need to go up in the stack, progressively converting data
language, the zeros and ones, into business terminology, to build your
Data Services Pyramid.
Data Services Pyramid
While growing our information system, we’ll inevitably want to end up
with something like this. But instead of a structured stack, it’s very
easy to end up with a disjointed mess of technology and services.
It should be easy to understand what kind of information deliverables
sit on each level. They obviously depend on the tools in use, but some
examples:
- Raw data
  - Handing out files
  - Direct file system access
- Storage area
  - SQL
  - NoSQL endpoints
  - Hadoop / Hive access
- Ad-hoc layer
  - Metadata based tools
  - OLAP clients
  - MDX
- Preformatted deliverables
  - Dashboards
  - Reports
  - CSV/XLS exports
The Services challenge
We must have a very clear objective in mind - put users as far up the
stack as possible. The goal is not to prevent them from accessing the
data. Much more important, it’s preventing them from misinterpreting
the data. Every time I see someone asking for access to a database or
Hadoop, I start trembling - I know trouble’s coming.
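To make the "as far up the stack as possible" rule concrete, here is a minimal sketch. It assumes the four pyramid levels can be ordered from bottom to top; the names `Level` and `best_level` are made up for illustration and don't come from any real tool.

```python
from enum import IntEnum


class Level(IntEnum):
    """Pyramid levels, ordered from the bottom (0) to the top (3)."""
    RAW_DATA = 0
    STORAGE_AREA = 1
    AD_HOC_LAYER = 2
    PREFORMATTED_DELIVERABLES = 3


def best_level(levels_that_answer):
    """Pick the highest level that is able to answer a request.

    `levels_that_answer` is the (hypothetical) set of levels where the
    request could be served at all.
    """
    if not levels_that_answer:
        raise ValueError("no level can answer this request")
    return max(levels_that_answer)


# A question that both a SQL query and an existing dashboard can answer
# should be served by the dashboard.
choice = best_level({Level.STORAGE_AREA, Level.PREFORMATTED_DELIVERABLES})
```

The point is only the ordering: lower levels stay available, they are just never the first answer.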
As we move up in the stack, we’re converting data language into
business language. This is a crucial point. As we work on this
translation, we need to set in stone our final language as a shared set
of dimensions and well-documented terminology that raises no questions
for anyone whenever they’re used.
And this translation, moving from data terms to business terms, is
hard. The fewer people involved here, the better. That’s why I’ve
identified, in the Data Services Pyramid, a danger zone that we should
be very cautious with, mostly because the aforementioned translation
isn’t done yet.
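One way to picture the translation from data terms to business terms is a single shared glossary that every deliverable goes through. This is only a sketch under assumptions: the raw field names (`dl_cnt`, `cc`), the business names and the helper `to_business_terms` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    business_name: str  # the term consumers see
    description: str    # the agreed, documented meaning


# Hypothetical raw field names mapped to the shared business vocabulary.
GLOSSARY = {
    "dl_cnt": Dimension("Downloads", "Completed downloads per day"),
    "cc": Dimension("Country", "Country of the request, as a two-letter code"),
}


def to_business_terms(raw_row):
    """Rename raw fields to business terms; refuse fields nobody has defined."""
    unknown = sorted(set(raw_row) - set(GLOSSARY))
    if unknown:
        raise KeyError(f"no agreed business term for: {unknown}")
    return {GLOSSARY[field].business_name: value
            for field, value in raw_row.items()}


row = to_business_terms({"dl_cnt": 10, "cc": "PT"})
```

Failing loudly on an undefined field is a deliberate choice here: a field without an agreed meaning is exactly what shouldn’t leak to consumers.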
The Technology challenge
I once read a post about the differences between Google and Amazon
when it came to inter-department interaction, and how Bezos always
insisted that such interaction had to be done through strict APIs and
never in an unstructured way. What initially seemed a huge overhead and
caused a lot of despair soon turned out to be Amazon’s greatest
strength: the services oriented approach. After fine tuning all
inter-department relations, wrapping those up as offerings to the
outside world was easy - and that was at the origin of the cloud
services offering.
For some reason this keeps popping into my mind, and I’m the kind of person who can’t remember what he had for lunch.
This principle should also apply to the technology stack. We’re in a
golden era for IT junkies. Things are moving at the speed of light,
every year or less there’s a new best thing in town, and it’s getting
harder and harder to keep up with everything and separate the hype from
the real deal.
Staying frozen in time is just not an option. The complete opposite -
always trying new things, new technologies and approaches - is equally
dangerous. It gets to a point where too few people are familiar with
the systems, they can easily get deprecated internally, the quality of
the stored information gets progressively harder to validate, and it
all comes with associated hardware costs, probably the cheapest cost of
all.
This has to become a two-step process. Inside a metrics team, every
new technology has to pass through an approval stage. The goal is simple
- there has to be a limited set of approved tools and technologies
involved. Internally, one person may prefer to use R, others Perl,
Python, Excel, etc. That’s perfectly fine and recommended, but when it
comes to the official set of tools, everyone must be familiar with
them. And keep the list short, as there’s only so much one can know.
The second step is the link with my initial story about Amazon. If we
apply the same principles to the technologies, we’re much less
dependent on them, and it gets much easier to swap out and optimize
individual pieces.
If you think about the technologies and tools you already use and the
ones you’re evaluating, they most likely fit into a very specific place
inside the Data Services Pyramid. This is where the services approach
kicks in. Even though we can have more than one choice sitting at each
level of the stack, it’s crucial that each one talks with the layer
immediately below and provides endpoints for the layers above to
connect to. All the data translations that I mentioned before must
happen only once, at a very well determined place. Data and overall
service integrity is at stake here.
It’s now easy to see how changing specific bits gets less painful and
less error-prone - as long as they keep the same API towards the
enclosing layers.
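As a rough illustration of swapping a piece while keeping the same interface, the sketch below has an upper layer that depends only on a small contract, with two interchangeable backends underneath. Everything in it is an assumption made for the example: `StorageLayer`, `InMemoryStorage`, `CsvStorage` and `total` are invented names, not part of any real stack.

```python
import csv
import io
from typing import Iterable, Protocol, Tuple


class StorageLayer(Protocol):
    """Hypothetical contract the layer above relies on."""

    def fetch(self, metric: str) -> Iterable[Tuple[str, int]]:
        ...


class InMemoryStorage:
    """Backend 1: rows held as (metric, dimension, value) tuples."""

    def __init__(self, rows):
        self._rows = rows

    def fetch(self, metric):
        return [(dim, value) for m, dim, value in self._rows if m == metric]


class CsvStorage:
    """Backend 2: the same rows, stored as CSV text."""

    def __init__(self, text):
        self._text = text

    def fetch(self, metric):
        for row in csv.reader(io.StringIO(self._text)):
            if row and row[0] == metric:
                yield row[1], int(row[2])


def total(storage: StorageLayer, metric: str) -> int:
    """Upper layer: knows the interface, not the backend behind it."""
    return sum(value for _, value in storage.fetch(metric))


rows = [("downloads", "PT", 3), ("downloads", "US", 4), ("crashes", "PT", 1)]
text = "downloads,PT,3\ndownloads,US,4\ncrashes,PT,1\n"
from_memory = total(InMemoryStorage(rows), "downloads")
from_csv = total(CsvStorage(text), "downloads")
```

Replacing one backend with the other changes nothing for `total`, which is the property the layered approach is after.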