Data Services Pyramid

While delivering Business Intelligence services - or whatever other name you prefer for it: Business Analytics, Data Science, and so on (I've always been more focused on actual content than on the wrapping some people like to put around it) - one very important part that sometimes gets a bit overlooked is how you hand out the information.
In the end, that's what it's all about. All the discussions about technology choices, implementation, and hardware end up the same way: giving the information back to the consumers.
This may seem obvious. It's not, and it has a lot of implications, especially when we have a huge range of technologies within reach to choose from.
This is written from the perspective of someone internal to a company's metrics / data engineering team. I'm wearing my Mozilla metrics hat now, thinking about all the challenges we face on a daily basis.
Information delivery

One may be tempted to think that the best way to deliver information is the one where the user has the highest degree of freedom when interacting with the data. It's not. With great power comes great responsibility, and sometimes less is better.
A metrics person has to guarantee not only that the data they deliver is correct, but also that there's no possibility of that data being misunderstood. I find the second part harder than the first, especially when dealing with people who have been in a specific line of business for a long time.
Ground zero - Raw data

This is the example taken to the extreme. Raw data is, by far, the most valuable source of information, and the one we can draw the most information from. Obviously, it makes no sense to hand this source out to information consumers. Very few hold the keys to the dungeon, and for good reasons.
You need to go up the stack, progressively converting data language - the zeros and ones - into business terminology, to build your Data Services Pyramid.
[Figure: the Data Services Pyramid]
While growing our information system, we'll inevitably want to end up with something like this. But instead of a structured stack, it's very easy to end up with a disjointed mess of technologies and services.
It should be easy to understand what kind of information deliverables sit at each level. They obviously depend on the tools in use, but some examples:
- Raw data
  - Handing out files
  - Direct file system access
  - Storage area
  - NoSQL endpoints
  - Hadoop / Hive access
- Ad-hoc layer
  - Metadata based tools
  - OLAP clients
- Preformatted deliverables
  - CSV/XLS exports
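To make the levels concrete, here's a minimal sketch (the layer and deliverable names are just the examples above, not any particular tool's API) that models the pyramid as ordered layers, bottom to top, and answers what a consumer placed at a given level can be handed:

```python
# Hypothetical model of the pyramid: ordered (level, deliverables) pairs,
# bottom to top. A consumer placed at some level sees that level and above.
PYRAMID = [
    ("raw data", ["file hand-outs", "direct file system access",
                  "storage area", "NoSQL endpoints", "Hadoop / Hive access"]),
    ("ad-hoc layer", ["metadata based tools", "OLAP clients"]),
    ("preformatted deliverables", ["CSV/XLS exports"]),
]

def deliverables_from(level: str) -> list:
    """Return every deliverable at the given level and above."""
    names = [name for name, _ in PYRAMID]
    start = names.index(level)
    return [d for _, items in PYRAMID[start:] for d in items]
```

The point of the model is the ordering: the further up you place a consumer, the shorter (and safer) the list they get.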
The Services challenge

We must have a very clear objective in mind: put users as far up the stack as possible. The goal is not to prevent them from accessing the data. Much more important, it's preventing them from misinterpreting the data. Every time I see someone asking for access to a database or Hadoop, I start trembling - I know trouble's coming.
As we move up the stack, we're converting data language into business language. This is a crucial point. As we work on this translation, we need to set our final language in stone as a shared set of dimensions and well-documented terminology that raises no questions whenever it's used.
And this translation, from data terms to business terms, is hard. The fewer people involved here, the better. That's why I've identified, in the Data Services Pyramid, a danger zone that we should be very cautious with, mostly because the aforementioned translation isn't done yet.
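As a toy illustration of doing that translation once, in one well-determined place, imagine a small semantic layer that maps raw field names to documented business terms (every name below is hypothetical, invented for the sketch):

```python
# Hypothetical semantic layer: raw field -> (business term, documentation).
# The data-to-business translation lives here and only here; everything
# above this point speaks business terms exclusively.
DIMENSIONS = {
    "usr_geo_cc": ("Country", "ISO country code of the user"),
    "dl_ts": ("Download date", "UTC day the download completed"),
    "chan": ("Channel", "Release channel: release / beta / nightly"),
}

def to_business(record: dict) -> dict:
    """Translate one raw record into the shared business vocabulary.
    Unknown raw fields are dropped, so untranslated data never leaks up."""
    return {DIMENSIONS[k][0]: v for k, v in record.items() if k in DIMENSIONS}
```

For example, `to_business({"usr_geo_cc": "PT", "internal_id": 42})` hands back only the documented `Country` dimension; the internal field never reaches the consumer.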
The Technology challenge

I once read a post about the differences between Google and Amazon when it came to inter-department interaction, and how Bezos always insisted that such interaction had to happen through strict APIs and never in an unstructured way. What initially seemed like a huge overhead and caused a lot of despair - the services-oriented approach - soon turned out to be Amazon's greatest strength. After fine-tuning all inter-department relations, wrapping those up as offerings to the outside world was easy - and that was at the origin of the cloud services offering.
For some reason this keeps popping up in my mind, and I'm the kind of person who can't remember what he had for lunch.
This principle should also apply to the technology stack. We're in a golden era for IT junkies. Things are moving at the speed of light; every year or less there's a new best thing in town, and it's getting harder and harder to keep up with everything and separate the hype from the real deal.
Staying frozen in time is just not an option. The complete opposite - always trying new things, new technologies and approaches - is equally dangerous. It gets to a point where too few people are familiar with the systems, the systems can easily become deprecated internally, the quality of the information stored gets progressively harder to validate, and it all comes with associated hardware costs - probably the cheapest of all.
This has to become a two-step process. Inside a metrics team, every new technology has to pass through an approval stage. The goal is simple: there has to be a limited set of approved tools and technologies. Internally, some may prefer to use R, others Perl, Python, Excel, etc. That's perfectly fine and even recommended, but when it comes to the official set of tools, everyone must be familiar with them. And keep the list short, as there's only so much one person can know.
The second step links back to my initial story about Amazon. If we apply the same principles to the technologies, we're much less dependent on them, and it gets much easier to swap out and optimize individual pieces.
If you think about the technologies and tools you already use and the ones you're evaluating, they most likely fit into a very specific place inside the Data Services Pyramid. This is where the services approach kicks in. Even though we can have more than one choice sitting at each level of the stack, it's crucial that each one talks to the layer immediately below and provides endpoints for the layers above to connect to. All the data translations I mentioned before must happen only once, at a very well-determined place. Data and overall service integrity are at stake here.
It's now easy to see how changing specific bits becomes less painful and error-prone - as long as they maintain the same API towards the enclosing layers.
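A minimal sketch of that idea, with names I made up for illustration: each layer is a small interface, a concrete technology implements it, and the deliverable on top codes only against the interface, so the backend can be swapped without touching it.

```python
from typing import Protocol

class AdHocLayer(Protocol):
    """The contract the preformatted-deliverables layer codes against."""
    def query(self, dimensions: list) -> list: ...

class HiveBackend:
    """One concrete choice sitting at the ad-hoc level (stubbed here)."""
    def query(self, dimensions: list) -> list:
        # A real backend would run a query; the stub returns one empty row.
        return [{d: None for d in dimensions}]

def csv_export(backend: AdHocLayer, dimensions: list) -> str:
    """Top-of-pyramid deliverable: talks only to the layer below it."""
    rows = backend.query(dimensions)
    header = ",".join(dimensions)
    body = "\n".join(",".join(str(r[d]) for d in dimensions) for r in rows)
    return header + "\n" + body
```

Swapping `HiveBackend` for, say, an OLAP-backed implementation leaves `csv_export` untouched - which is exactly the property that makes replacing individual pieces of the stack cheap.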
Feedback appreciated. This is the result of my own experience, and it would be great to be able to brainstorm on this issue.