Friday, September 21, 2012

CBF and Versioning - How to develop Pentaho solutions in a team

This blog post is sponsored by the Antonius BI team

Problem


We're struggling with versioning and deployment. We need some help in managing the development and (especially) deployment process.

This is a very well known problem. While versioning is a relevant issue in all areas, not only development, the correct way to approach it changes drastically from one area to the next, and Pentaho is no different.

We've been working on Pentaho projects for over 5 years. Since day 1 we've had to deal with scenarios such as:
  • Working on different projects locally
  • Working with multiple people on the same project
  • Managing several environments - development, staging, upgrades
  • Managing platform upgrades
There are more reasons, but these are probably the most pressing ones. It's not up to me to convince you that you need a good approach to these problems: you'll know you need it!

I'll write up a collection of my experiences regarding this issue as of today. It can obviously change, since we're always trying to optimize our processes.

I've been asked a specific set of questions, which I'll introduce while laying out the big picture.

Versioning

There are 2 ways to approach this problem:
  1. Simple pentaho-solutions versioning
  2. CBF (Community Build Framework) setup
The first one is a subset of the second. There's no easy way to say it: The first one is a must have. If you're not doing it, you are asking for trouble.

As for the CBF... well, this is something Paul Stoellberger, Saiku author and general BI guru, said on the IRC channel:


< pstoellberger> i really need to get my cbf out again
< pstoellberger> just done hackish installations recently

Everyone doing something serious on Pentaho uses CBF. You may think you don't need it. You may think it's complicated and isn't worth the effort (to take that argument away, we've put up a quickstart bundle for you). You're wrong, and you'll know it once you start using it.

VCS infrastructure

But one thing at a time. Before going through the specific workflows, you need to choose your VCS (Version Control System) tool. Once again there are 2 options:
  • SVN
  • Git

Note: if you think I forgot to include CVS on this list, press Alt-F4 to close your Internet Explorer browser and go back to 1998, we don't want you here!

I'll skip the long arguments about those two. Use Git. It's amazing, handles branches and tags in a very efficient way, allows multiple remote repositories and has great UI tools that will be very handy.

Also bear in mind that Git is not GitHub. While you can definitely host your solutions there, you're not forced to. I'd even say most of us would rather keep our files and implementation very securely locked.

So, starting with infrastructure: Git doesn't even need a "server". Any shared directory can be used as the central repository. I've even used the "poor man's git server", initializing a repository ( git init --bare myproject/ ) in a Dropbox folder. That has proven to be a very error-prone approach, since there's no way to guarantee that the repository won't get damaged by the Dropbox synchronization. So use a proper system. There are 2 options (this is getting a bit repetitive):

  • Use an existing one: GitHub (or something)
  • Install your own: Gitolite (or something)
Choosing GitHub is a very valid and logical option. It doesn't mean your solutions have to be public, as you can get a paid account that allows private repositories. There should also be other alternatives along these lines.

We use Gitolite. Once installed, it's very easy to administer (creating repositories, adding / managing people and permissions) and very secure, as it uses ssh connections.

Regardless of what you chose, I'll now assume you have a proper VCS server available.
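
For a quick local experiment before the real server is in place, the shared-directory idea mentioned earlier can be sketched as follows. This is only a minimal sketch, not a recommended setup: the temporary directory, the alice and bob clones, the README file and the demo identity are all made up for illustration; only git init --bare myproject comes from the text.

```shell
#!/bin/sh
# Sketch: a bare repository in a shared directory used as the central repository.
# BASE, alice, bob, README and the demo identity are illustrative assumptions.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
BASE=$(mktemp -d)
mkdir -p "$BASE/shared"

# The "server": a bare repository (history only, no working copy).
git init --quiet --bare "$BASE/shared/myproject"

# First developer: clone, commit locally, push to the shared repository.
# (stderr is silenced only to hide the "empty repository" warning)
git clone --quiet "$BASE/shared/myproject" "$BASE/alice" 2>/dev/null
cd "$BASE/alice"
echo "my pentaho solution" > README
git add README
git commit --quiet -m "first commit"
git push --quiet origin HEAD

# Second developer: a fresh clone already contains that commit.
git clone --quiet "$BASE/shared/myproject" "$BASE/bob"
ls "$BASE/bob"
```

With Gitolite or GitHub the only thing that changes is the address you clone from.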


Versioning pentaho-solutions directory

Originally we included the entire CBF structure in the same repository, as described in the documentation. SVN allows us to check out a subdirectory of the repository, so we could check out only the project-client/solution/ folder. Git can't clone just a subdirectory, and I could never get my head around submodules, so I simply have 2 different repositories (notice a trend here with the number 2?):


  • project-client
  • project-client-solution

The second has all the BI server solution files. The first one has... everything else, from CBF specific structure to ETL. After we clone the project-client repository we either link to the project-client-solution or clone that one inside the project-client directory.
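
The linking variant can be sketched like this. It is a minimal sketch under stated assumptions: the two server-*.git directories are local stand-ins for whatever your central server hosts, and the temporary working directory is only there to keep the example self-contained.

```shell
#!/bin/sh
# Sketch: clone both repositories side by side and link the solution into the
# CBF structure. The server-*.git stand-ins and WORK are illustrative assumptions.
set -e
WORK=$(mktemp -d)
cd "$WORK"

# Stand-ins for the two repositories on the central server.
git init --quiet --bare server-project-client.git
git init --quiet --bare server-project-client-solution.git

# Clone both next to each other (stderr silenced to hide the empty-repo warning).
git clone --quiet server-project-client.git project-client 2>/dev/null
git clone --quiet server-project-client-solution.git project-client-solution 2>/dev/null

# Link the solution repository into project-client as "solution".
ln -s ../project-client-solution project-client/solution
ls -l project-client/solution
```

The link target is relative to the project-client directory, so the pair of clones can be moved around together without breaking it.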

Here's a real world example of one of our projects:

pedro@arpeggio:~/tex/pentaho/project-client (master)$ d
total 68
24 -rw-r--r-- 1 pedro pedro 20859 Feb 21  2011 build.xml.cbf-3.7
 4 drwxr-xr-x 2 pedro pedro  4096 May 21 16:41 config/
 4 drwxr-xr-x 2 pedro pedro  4096 Feb 17  2011 etl/
 4 -rwxr-xr-x 1 pedro pedro   138 Apr 27  2011 importCache.sh*
 4 -rw-rw-r-- 1 pedro pedro  1057 May 21 14:46 kettle.properties.diogo
 4 -rw-rw-r-- 1 pedro pedro  1118 May 21 14:46 kettle.properties.remote
 8 -rw-rw-r-- 1 pedro pedro  4534 May 21 14:46 kettle.properties.server
 4 drwxr-xr-x 4 pedro pedro  4096 Apr 19 12:57 patches/
 4 drwxr-xr-x 5 pedro pedro  4096 Feb 17  2011 patches-ee/
 4 -rwxrwxr-x 1 pedro pedro   261 May 21 14:46 remote_in.sh*
 0 lrwxrwxrwx 1 pedro pedro    29 Sep  6  2011 solution -> ../project-client-solution/

The project-client-solution is simply the Pentaho solution folder without the system-specific folders: admin, bi-developers, plugin-samples, steel-wheels and system. Tune this exclusion list at will in a file called .gitignore. Here's mine for the project-client-solution:

pedro@arpeggio:~/tex/pentaho/project-stonegate-solution (master)$ cat .gitignore 
admin/
system/
steel-wheels/
bi-developers/
*_tmp*
index*.properties
cde_sample/
.project
plugin-samples/

From this point on we can use the generic VCS techniques. Git can take a while to get used to, but the list of commands we need is very short. I won't focus much on the project-client CBF structure, as there's lots of documentation on the CBF website, but everything there still applies.

Moving on to the list of questions.

FAQ: Frequently Asked Questions

Q: How do I checkout a project with git?
A: $ git clone git@yourserver.com:project-client-solution


This can be done not only for the development sandboxes (both project-client and project-client-solution) but also, for the latter, on the production and staging machines. That will allow us to manage versioning on the server too.

Q: Should several developers be working on the same development box? How to avoid conflicts?
A: I do not recommend this. It's always a good idea to have a local development sandbox. It's doable if the developers are working on different areas, but you'll get into conflicts that are harder to isolate.

Q: How to check in/check out units of work?
A: Once we have our local repository, we can bring it up to date with the latest version on the server with the command:

$ git pull

Q: How to check in my work?

Unlike svn, when you commit work it doesn't get pushed to the remote repository. You should commit early and often (you don't even need internet for it) and then push to the central server. You do that with the following:

$ git add -A                # stage new, changed and deleted files
$ git commit -m 'message'
$ git push

Here's where a visual tool gets useful. Mac users have GitX, Linux users have gitg, and everyone has gitk and git gui. For the commit part I usually use the visual tool.

Q: Are there guidelines on how to package a new version of a dashboard and migrate it from development to test, and from test to production?

The development happens on the main branch, called the master branch. When we're ready to release a certain version, we create a new branch with the name of that version (obviously, feel free to choose what you want). In this example, I'll call it v1.


That branch can then be checked out on the QA server.

dev$ git branch v1
dev$ git push origin v1   # publish the new branch

qa$ git fetch
qa$ git checkout v1

The next step is testing it. If we find a bug that needs fixing, we can fix it on that branch. If appropriate, we can merge the fix back into the development branch.

qa$ git commit -a -m 'fixed bug on v1'
qa$ git push origin v1

dev$ git checkout master   # be sure we change back to master
dev$ git fetch             # get the fix that QA pushed
dev$ git merge origin/v1   # merge the bug fix into master
dev$ git push              # fix integrated


Once we're happy with the solution, we're ready to go to production. This is where tags come in handy: we can create a tag on the release and push that information.

qa$ git tag v1.0
qa$ git push --tags

prod$ git fetch --tags
prod$ git checkout v1.0


That would put you on the correct version. Don't forget to update the solution repository.
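
To see the whole release flow end to end, here is a self-contained sketch that plays all three roles on one machine. The assumptions are loud ones: dev, qa and prod are just local clones standing in for the three servers, central.git stands in for the central repository, and dashboard.txt and the demo identity are invented for the example.

```shell
#!/bin/sh
# Sketch of the dev -> QA -> production flow with local clones standing in for
# the three machines. central.git, dashboard.txt and the identity are assumptions.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
BASE=$(mktemp -d)
cd "$BASE"

# Central repository; pin the main branch name to master, as used in this post.
git init --quiet --bare central.git
git --git-dir=central.git symbolic-ref HEAD refs/heads/master

# dev: first commit on master, then cut and publish the release branch v1.
git clone --quiet central.git dev 2>/dev/null
cd dev
git symbolic-ref HEAD refs/heads/master
echo "dashboard with a bug" > dashboard.txt
git add dashboard.txt
git commit --quiet -m "dashboard ready for release"
git push --quiet origin master
git branch v1
git push --quiet origin v1
cd ..

# qa: check out the release branch, fix a bug on it, then tag the release.
git clone --quiet central.git qa
cd qa
git checkout --quiet v1
echo "dashboard, bug fixed" > dashboard.txt
git commit --quiet -a -m "fixed bug on v1"
git push --quiet origin v1
git tag v1.0
git push --quiet origin v1.0
cd ..

# dev: bring the fix back into master.
cd dev
git fetch --quiet origin
git merge --quiet origin/v1
git push --quiet origin master
cd ..

# prod: deploy exactly what was tagged.
git clone --quiet central.git prod
cd prod
git checkout --quiet v1.0
cat dashboard.txt
```

Production ends up on the tagged commit rather than on a moving branch, which is what makes a later rollback to a previous tag a single checkout.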

Q: I was playing with the solution but I don't want to commit any change, just want to wipe the entire thing and get back to the clean state

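 $ git reset --hard HEAD   # discard all changes to tracked files
 $ git clean -fd           # also delete untracked files and directories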

Q: I changed a single file and I want to have it back / reverted to the last state

 $ git checkout -- path/to/file

Q: How to avoid overwriting each other's work (which happens now if we're not careful)?

There's no git command for this; it basically comes for free. Git rejects a push when the central repository has commits you don't have yet, so you have to pull and merge first, and nobody's work gets silently overwritten. However, if we're working on a bigger change, it's recommended that you create a new branch for it. That way you have the guarantee that a specific feature can be developed independently.

Here's a schematic of how it conceptually works:

The idea is to isolate a specific feature. It's very easy to start:

 $ git checkout -b featureX

This will create the branch, switch to it and start an isolated line of development. You can do regular commits, pushes, etc. When finished, you can merge back to the main branch. You do that by

1) switching to the main branch, usually called master:

 $ git checkout master

2) merging featureX back into master:

 $ git merge featureX

3) If there are any conflicts you'll need to resolve them and commit. After a push, feature X will then be available.
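
The isolation a feature branch gives can be tried out in a throwaway repository. This is a minimal sketch: the file names, commit messages and demo identity are invented, and the two branches deliberately touch different files so that the merge has no conflicts to resolve.

```shell
#!/bin/sh
# Sketch: develop featureX in isolation while master moves on, then merge.
# File names, messages and the demo identity are illustrative assumptions.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
REPO=$(mktemp -d)
cd "$REPO"
git init --quiet
git symbolic-ref HEAD refs/heads/master   # main branch name used in this post

echo "sales report" > report.txt
git add report.txt
git commit --quiet -m "base version"

# Start the feature on its own branch and commit there.
git checkout --quiet -b featureX
echo "new chart" > feature.txt
git add feature.txt
git commit --quiet -m "work on feature X"

# Meanwhile master moves on independently; feature.txt does not exist here.
git checkout --quiet master
echo "small fix" >> report.txt
git commit --quiet -a -m "fix on master"

# Merge the feature back; the branches touched different files, so git creates
# the merge commit on its own (--no-edit keeps the default merge message).
git merge --quiet --no-edit featureX
ls
```

Had both branches changed the same lines of report.txt, the merge would stop and ask for the conflict to be resolved and committed, which is step 3 above.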



Conclusion


This doesn't aim to be a full tutorial about Git. There's tons of great documentation out there, and it's really a powerful tool. But it should provide some good approaches on how to best handle a Pentaho implementation.

Any extra questions, just email / comment here and I'll add them.




Thursday, September 13, 2012

CGG - Putting CCC charts in Pentaho reporting / other tools

This blog post has been long overdue. CGG has been around for ages now, it's even in the Pentaho platform core, and only a few know about it.


CGG stands for Community Graphics Generator. Although it even has its own homepage, I never had the chance to blog about it. It's a somewhat hardcore plugin: it's basically able to execute custom scripts (java / javascript) on the server side that output images that can be used in external systems.


One of the most useful use cases is exporting CCC charts to images (png or svg), either to let a user download them or to include them in Pentaho Reporting or any other tool.


Here's an example of how it works. Imagine you develop for your users a great looking dashboard, almost as good looking as one of the UIs we develop :) :


 (This was the dashboard we developed in a recent Ctools training course in Orlando, Pentaho headquarters)

As you can see, we went to some lengths, exploring CCC capabilities, to fine-tune the charts. Now we want to be able to use that line chart in PRD (Pentaho Report Designer). Even though CGG does not have a UI and doesn't aim to be user friendly, CDE has a way to make the bridge to it, using some hidden features.

Going straight to the point: in CDE, press shift-G. That will open the CGG window (as a side reference, press shift-? to see other very useful keyboard shortcuts):


If you ever used this feature before, you'll notice this screen is a little different. We just added some more features, namely the ability to automatically get the url that generates that image from the outside (even I always struggled to find the right url). This is already available in the dev builds and will be released in the next stable version.

You can set some options there, namely the outputType, which currently can be either png or svg. You may also need to change the server url if you are developing in a sandbox and want to publish to a server. I actually thought svg support was broken in PRD, but I just tested it and that seems to be fixed, so take advantage of that feature.

One other thing to note is that you must take care of authentication. Either you pass the extra arguments &userid=joe&password=password or you allow that url to be accessible with no password, or whatever.

If you save your dashboard and try that url, this is what you'll get:


This is what CGG is all about, and there's tons of engineering work underneath to allow this "simple step" to work with just one keystroke.

Now it gets very obvious what to do. Open PRD, add an image component, and put in that url (don't forget authentication).


You'll also want to check the blog post I did a while back about using CDA datasources in PRD. Meanwhile, in recent versions of PRD (4.5 and above) you don't need to download the CDA datasources; all you need to do is enable the experimental features in Edit -> Preferences -> General -> Enable Experimental Features.

To render this report from the server, you'll need to copy to pentaho/WEB-INF/lib/ the file pentaho-reporting-engine-classic-extensions-cda-*.jar that you'll find in PRD library directory.

There's another feature of CGG. You can pass parameters to the query by adding &paramParameterName=ParameterValue to the url. And that can be exposed from PRD too.

Just create a prompt the usual way. In my sample, the parameter is called month, and since the query is already on the dashboard too, I just need to select it to build the prompt in the prpt.

However, in order to make the call with the new parameter, we can't use the image component anymore; we have to use the image-field instead.

For that, we need to create a formula that will build that url. Here's the sample formula I used:

="http://127.0.0.1:8080/pentaho/content/cgg/Draw?script=/SyncOrlando/Lab12/lineChart.js&outputType=png&userid=joe&password=password&parammonthParameter="&URLENCODE([month])

 The URLENCODE function allows us to be sure that our parameters will reach the server properly. In the end, we should have a report looking like this:
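
To check a parameterised url outside PRD first, the same url can be assembled on the command line. This is only a sketch under assumptions: urlencode and cgg_url are hypothetical helper functions written for this example, the host, script path, credentials and parameter name are the sample values from the formula above, and the encoding is plain percent-encoding of ASCII input, not a byte-for-byte reproduction of what PRD's URLENCODE emits.

```shell
#!/bin/sh
# Sketch: build a CGG Draw url with a percent-encoded parameter value.
# urlencode and cgg_url are hypothetical helpers; the url parts are the
# sample values from the PRD formula. Intended for ASCII input.
set -e

urlencode() {
    # Walk the value byte by byte (as hex). Keep the unreserved characters
    # A-Z a-z 0-9 - . _ ~ as they are and percent-encode everything else.
    for hex in $(printf '%s' "$1" | od -An -v -tx1 | tr 'a-f' 'A-F'); do
        case "$hex" in
            2D|2E|5F|7E|3[0-9]|4[1-9A-F]|5[0-9A]|6[1-9A-F]|7[0-9A])
                # hex -> octal escape -> the original character
                printf "\\$(printf '%03o' "0x$hex")" ;;
            *)
                printf '%%%s' "$hex" ;;
        esac
    done
}

cgg_url() {
    # $1 = value for the dashboard parameter monthParameter
    printf '%s&parammonthParameter=%s\n' \
        "http://127.0.0.1:8080/pentaho/content/cgg/Draw?script=/SyncOrlando/Lab12/lineChart.js&outputType=png&userid=joe&password=password" \
        "$(urlencode "$1")"
}

cgg_url "Jan & Feb 2012"
```

The printed url can be pasted into a browser to confirm that the chart renders with that parameter before building the formula in the report.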


 Advanced topic: If you make reports with a lot of charts (you can even have cgg rendering a chart per line) you'll soon find out that your report starts to take a really long time to render. There's an explanation for it: PRD does a lot of passes to better determine the final layout. Since CGG doesn't support the HEAD request method, PRD won't get the appropriate info regarding cache, resulting in a bunch of requests for the image. Fortunately, Thomas Morgner provided us with a workaround to this issue, by changing a behavior in libloader. In your loader.properties file (located in WEB-INF/classes for the server, or create it under the prd/resources dir for the report designer) add the following lines:

# Controls the minimum time between HEAD requests regardless of
# the date -headers given by the response object.
org.pentaho.reporting.libraries.resourceloader.config.url.FixedCacheDelay=500000

# Fixes the date headers by simply using Date.now() as mod-date.
# This will break the HTTP specs and thus it is disabled by default.
org.pentaho.reporting.libraries.resourceloader.config.url.FixBrokenWebServiceDateHeader=true

This particular feature will be available in Pentaho Reporting 3.9.1 (or you'll have to compile your own).

And this is what it looks like from the Pentaho BI server:


Cheers


-pedro
  



Thursday, September 6, 2012

New Ctools releases: 12.09.05

Here's a new set of releases; get them at the usual place.

CDF 12.09.05 available.

Main features:
  • Support for templates in excel export.
  • Lots of fixes
Full changelog:
  • Fixed [REDMINE 942] - Ensure only one instance of the minification routine is running at any given time
  • send cda dataTables filtering settings on export
  • Change isNumeric predicate in trendArrow addIn.
  • Added support for excel template option in exportData.
  • Fix to dashboardContext and CdfContentGenerator (templating)
  • [PATCH][FIX] Bullet chart bulletTitle and bulletSubtitle options were not showing in the "fixed data" mode
  • Add option to disable url scheme detection in dashboard headers
  • Make bookmark state not generate multiple history entries
  • Make pageStartingAt behave correctly

CGG 12.09.05 available.

Main feature:
  • CPF Integration
Full changelog:
  • Integration with CPF
  • Updated pvc-d1.0.js file to keep in sync with CDF

CDA 12.09.05 available.

Major Upgrades:
  • Better integration with CDV
  • Fixes to CDC cache
  • Supports excel templates when exporting data
Full change log:
  • Implemented [REDMINE 808] - Show URL feature in CDA previewer
  • Implemented [REDMINE 1092] - template support for xls export
  • Implemented [REDMINE 1128] - CSV exporter: extension, headers, escaping
  • Fixed [REDMINE 1019] - Clean unnecessary ERROR messages
  • expose csv exporter enclosure, normalize headers setting
  • infer column types in CalculatedTableModel
  • Fix: Ensured DataFactory is closed even when there is an exception on the query
  • update cache to avoid default hazelcast instance
  • QueryErrorEvent: give cause of queryException if possible
  • expose timeToLive for cache manager
  • using getWithTimeout for cacheStats, WILL FAIL unless patched hazelcast is used; puts are now fire and forget; using last update time in monitor instead of first insertion
  • QueryError: send string array for stack trace
  • robochef update missing from .classpath
  • Fixed bad error log on cache miss
  • refactor olap4j connections
  • Calling setSolutionRepositoryThreadVariable before getting any connection. Fix suggested by wgorman and pstoellberg
  • format tablemodel also on empty result
  • fix classpath for eclipse + ant for running the tests (runtime-lib)
  • change olap4j sample file name
  • refactor use of mondrian role mapper

CDE 12.09.05 available.

Main features:
  • Mostly fixes
Full changelog:
  • Fix [REDMINE 512] - Export to png with size small
  • Fix [REDMINE 511] - Export to png
  • fix to mobile navigation component. Was not getting query string correctly
  • Add editor support for resource reordering. Yay!
  • Add support for removing url scheme detection when rendering dashboard headers
  • Add some more error logging to dashboard rendering
  • Make sure only one minifyPackage routine is running at any given time
  • Make export popup's export size configurable
  • Add missing pre/postChange properties to AutoCompleteBox
  • SiteMap now keeps track of the original entry point with a css class

CDC 12.09.05 available.

Main features:
  • Upgrade to hazelcast 2.2
  • CPF Integration
Full changelog:
  • Update to hazelcast 2.2 to solve the network failure issue
  • Decoupled Hazelcast life cycle from plugin lifecycle
  • Full Integration with CPF

CDV 12.09.05 available.

Main features:
  • Added samples and Documentation
  • UI fixes
Full changelog:
  • Added samples and Documentation
  • Changed orient to 1.1.0 
  • UI fixes