Pedro Alves on Business Intelligence

Pentaho 9.2 is available

2021-08-02T17:00:00.001+01:00

Pentaho 9.2 is available

Short and sweet - I'll make it fast.

Main features

Expand to Microsoft Azure – SQL DB, ADLS, HDInsight
Updated Data Fabric – CDP and MapR (HPE Ezmeral Data Fabric)
Productivity Boost – Logging and Upgrade

Go get EE through the support portal, and CE in the usual place!

Pentaho 9.1 is available!

2020-10-06T10:52:00.001+01:00

Pentaho 9.1 is available

It’s that time of the year! A new release is available!

Go get EE through the support portal, and CE in the usual place!

Main features

· Google Dataproc Support

· Catalog Steps in Spoon

· New Upgrade Utility

· And a bunch of consolidation stuff:

o 20+ Continuous Improvements

o 10+ Platform Updates

o 200+ Performance/quality bugs

Google Dataproc

You can now access and process data from a Google Dataproc cluster in PDI. Google Dataproc is a cloud-native Spark and Hadoop managed service that has built-in integration with other Google Cloud Platform services, such as BigQuery and Cloud Storage. With PDI and Google Dataproc, you can migrate from on-premise to the Google Cloud.

You can use PDI's Google Dataproc driver and named connection feature to access data on your Google Dataproc cluster as you would other Hadoop clusters, like Cloudera and Amazon EMR. See Set up the Pentaho Server to connect to a Hadoop cluster for further instructions.

§ What’s New

‒ New Hadoop driver

‒ AEL-Spark support

§ Version:

‒ Google Dataproc - 1.4 (Ubuntu 18.04 LTS, Hadoop 2.9, Spark 2.4)

§ Benefit

‒ Enables processing large data sets in Google Dataproc clusters

‒ On-premise data movement/migration

§ Hadoop Driver supports the following:

‒ Multi-cluster

‒ HDFS

‒ Hive

‒ PMR Hive

‒ Oozie

‒ Sqoop

‒ Hadoop Job Executor

‒ Pig

‒ Parquet / Avro / ORC

§ VFS support for GCS

§ Hbase is not supported

Lumada Data Catalog steps for PDI

Lumada Data Catalog lets data engineers, data scientists, and business users accelerate metadata discovery and data categorization, and permits data stewards to manage sensitive data. Data Catalog collects metadata for various types of data assets and points to the asset's location in storage. Data assets registered in Data Catalog are known as data resources.

You can use the following four new PDI steps to work with Data Catalog metadata and data resources within your PDI transformations:

· Read Metadata

Search Data Catalog’s existing metadata for specific data resources, including their storage location.

· Write Metadata

Revise the existing Data Catalog tags associated with an existing data resource.

· Catalog Input

Reads the CSV text file types or Parquet data formats of a Data Catalog data resource that is stored in a Hadoop or S3 ecosystem and outputs the data payload in the form of rows to use in a transformation.

· Catalog Output

Encodes CSV text file types or Parquet data formats using the schema defined in PDI to create a new data resource or to replace or update an existing data resource in Data Catalog.

New Upgrade utility

• Current Scope:

‒ Scope only 9.0 to 9.1 (coming later: will extend to 8.3 LTS)

• Reliable upgrades and rollback

‒ An initial environment check detects the installed product components, and only what is there gets upgraded

‒ White list to persist customization

‒ Will persist all plug-ins across upgrade

‒ Automatically whitelist all database driver jars

Compatibility Updates


Other improvements:


§ Data Integration

‒ S3 Multipart Upload now allows configurable part sizes (PDI-16606)

‒ MongoDB Plug-in now allows PLAIN credentials for LDAP integration (PDI-17228)

§ Dashboards / Reporting

‒ 10-100x performance improvement for certain large slices and roll-ups for Mondrian Cubes (JIRA Link)

‒ Option to remove/hide the filter panel when used in a dashboard (ANALYZER-2270)

‒ Count and Count Distinct Summary on currency fields uses the default format (PIR-699)

‒ Admins can now customize the template(s) used for exporting to PDF and Excel (ANALYZER-12)

§ Platform

‒ Passwords stored in the BA Server config files and repository are now encrypted (BISERVER-3497)

‒ Users are now able to change their own password (BISERVER-13699)

-pedro

Pentaho 9.0 is available

2020-02-04T14:30:00.003+00:00

Pentaho 9.0 is available

Without further ado: Get Enterprise Edition here, and get Community Edition here

PDI Multi Cluster Hadoop Integration

Capability

Pentaho connects to Cloudera Distribution for Hadoop (CDH), Hortonworks Data Platform (HDP), and Amazon Elastic MapReduce (EMR). Pentaho also supports many related services such as HDFS, HBase, Oozie, Zookeeper, and Spark.

Before this release, Pentaho Server and the PDI design-time environment, Spoon, could work with only one Hadoop cluster at a time. Working with several Hadoop clusters required multiple transformations, instances, and pipelines. With the 9.0 release, major architecture changes make it easy to configure, connect to and manage multiple Hadoop clusters.

·       Users can access and process data from multiple Hadoop clusters, across different distros and versions, all from a single transformation and instance of Pentaho.

·       Also, within Spoon, users can now set up three distinct cluster configs, all having reference to the specific cluster, without having to restart Spoon. There is also a new configuration UI to easily configure your Hadoop drivers for managing different clusters.

·       Improved cluster configuration experience and secure connection with the new UI

·       Supports the following distros: Hortonworks HDP v3.0, 3.1; Cloudera CDH v6.1, 6.2; Amazon EMR v5.21, 5.24.

Existing single cluster/shim functionality will continue to work.

The following example shows the Multi-cluster implemented in the same data pipeline via connecting to both Hortonworks HDP and Cloudera CDH clusters.

Use Cases and Benefits

· Enables hybrid big data processing support (on-prem or cloud), all within a single pipeline

· Simplifies Pentaho’s integration with Hadoop clusters including enhanced UX of cluster configurations

Key Considerations

· Adaptive Execution Layer Spark isn’t validated to execute pipelines connecting to multiple Hadoop clusters.

· Pentaho Map Reduce isn’t validated to execute pipelines connecting to multiple Hadoop clusters.

Additional Resources

See Adding a new driver for how to add a driver. See Connecting to a Hadoop cluster with the PDI client for how to create a named connection.

· Follow the suggestions in the Big Data issues troubleshooting sections to help resolve common issues when working with Big Data and Hadoop, especially Legacy mode activated when named cluster configuration cannot be located.

PDI AEL-Spark Enhancements

Capability

The Pentaho Adaptive Execution Layer (AEL) is intended to provide flexible and transparent data processing with Spark, in addition to the native Kettle engine. The goal of AEL is to develop complex pipelines visually and then execute in Kettle or Spark based on data volume and SLA requirements. AEL allows PDI users to designate Spark as execution engine for their transformations apart from Kettle.

The v9.0.0 release includes the following performance and flexibility enhancements to AEL-Spark:

·       Step-level, Spark-specific performance tuning options

·       Enhanced logging configuration and information entered into PDI logs

·       Added support for Spark 2.4, with existing 2.3 support

·       Supports the following distros: Hortonworks HDP v3.0, 3.1; Cloudera CDH v6.1, 6.2; Amazon EMR v5.21, 5.24.

The following example showcases Spark App and Spark Tuning on specific steps within a PDI transformation:

Use Cases and Benefits

· Eliminates black box feel with better visibility

· Equips advanced Spark users with tools to improve performance

Key Considerations

Users must be aware of the following additional items related to AEL v9.0.0:

·       Spark v2.2 is not supported.

·       Native HBase steps are only available for CDH and HDP distributions.

·       Spark 2.4 is the highest Spark version currently supported.

Additional Resources

See the following documentation for more details: About Spark Tuning in PDI, Setup Spark Tuning, Configuring Application Tuning Parameters for Spark

Virtual File System (VFS) Enhancements

Capability

The changes to the VFS are in two main areas:

1. We added Amazon S3 and Snowflake Staging as VFS providers to named VFS Connections and introduced the Pentaho VFS (pvfs) that can reference defined VFS Connections and their protocols. In the S3 protocol, we support S3A and Session Tokens in 9.0.

The general format of a Pentaho VFS URL is:
pvfs://VFS_Connection/path (including a namespace, bucket or similar)

2. A new file browsing experience has been added. The enhanced VFS browser allows users to browse any preconfigured VFS locations using named connections, their local filesystem, configured clusters via HDFS, as well as a Pentaho repository, if connected.
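To make the URL format concrete, here is a minimal Python sketch that splits a pvfs URL into its connection name and path, then maps it onto a provider location. It only illustrates the abstraction and is not how PDI resolves connections internally. The `CONNECTIONS` registry, the connection name `sales_data`, the bucket names and both helper functions are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical registry: VFS connection name -> provider root URL.
# In PDI these are defined as named VFS Connections, not in code.
CONNECTIONS = {
    "sales_data": "s3a://acme-sales",
}


def split_pvfs(url):
    """Split pvfs://VFS_Connection/path into (connection, path)."""
    parts = urlsplit(url)
    if parts.scheme != "pvfs" or not parts.netloc:
        raise ValueError(f"not a pvfs URL: {url!r}")
    return parts.netloc, parts.path.lstrip("/")


def resolve(url, connections=CONNECTIONS):
    """Rewrite a pvfs URL against whatever provider the connection points to."""
    name, path = split_pvfs(url)
    return f"{connections[name].rstrip('/')}/{path}"


print(resolve("pvfs://sales_data/2020/orders.csv"))
```

Pointing `sales_data` at a different provider root changes the resolved location while the pvfs URL used in jobs and transformations stays the same.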

Use Cases and Benefits

·       Through the support of Pentaho VFS, you have an abstraction of the protocol. That means, when you want to change your provider in the future, all your jobs and transformations work seamlessly after this change in the VFS Connection. Today, you reference S3. Tomorrow, you want to reference another provider, for example HCP or Google Cloud. Using Pentaho VFS, your maintenance burden in these cases is much lower.

·       VFS Connections also enable you to use different accounts and servers (including namespaces, buckets or similar) within one PDI transformation. Example: You want to process data within one transformation from S3 with different buckets and accounts.

·       Combining named VFS connections with the new file browsing experience provides a convenient way to easily access remote locations and extend the reach of PDI. The new file browser also offers the ability to manage files across those remote locations. For example, a user can easily copy files from Google Cloud into an S3 bucket using the browser's copy and paste capabilities. A user can then easily reference those files using their named connections, in supported steps and job entries.

A user can manage all files, whether they are local or remote, in a central location. For example, there is no need to log in to the Amazon S3 Management Console to create folders, rename, delete, move or copy files. Even a copy between the local filesystem and S3 is possible and you can upload/download files from within Spoon.

The new file browser also offers capabilities such as search, which allows a user to find filenames which match a specified search string. The file browser also remembers a user's most recently accessed jobs and transformations for easy reference.

Key Considerations

As of PDI 9.0, the following protocols are supported: Amazon S3, Snowflake Staging (read only), HCP, Google Cloud Storage

The following steps and job entries have been updated to use the new file open save dialog for 9.0: Avro input, Avro output, Bulk load into MSSQL, Bulk load into MySQL, Bulk load from MySQL, CSV File Input, De-serialize from file, Fixed File Input, Get data from XML, Get file names, Get files rows count, Get subfolder names, Google Analytics, GZip CSV input, Job (job entry), JSON Input, JSON Output, ORC input, ORC output, Parquet Input, Parquet output, Text file output, Transformation (job entry)

The File / Open dialog is still using the old browsing dialog. The new VFS browser for opening jobs and transformations can be reached through the File / Open URL menu entry.

Additional Resources

See Virtual File System connections, Apache Supported File Systems and Open a transformation for more information.

Cobol copybook steps

Capability

PDI now has two transformation steps that can be used to read mainframe records from a file and transform them into PDI rows.

· Copybook input: This step reads the mainframe binary data files that were originally created using the copybook definition file and outputs the converted data to the PDI stream for use in transformations.

· Read metadata from Copybook: This step reads the metadata of a copybook definition file to use with ETL Metadata Injection in PDI.

The Copybook steps also support metadata injection, extended error handling and can work with redefines. Extensive examples for these use cases are available in the PDI samples folder.

Use Cases and Benefits

Pentaho Data Integration supports simplified integration with fixed-length records in mainframe binary data files, so that more users can ingest, integrate, and blend mainframe data as part of their data integration pipelines. This capability is critical if your business relies on massive amounts of customer and transactional datasets generated in mainframes that you want to search and query to create reports.

Key Considerations

This step works with Fixed Length COBOL records only. Variable record types such as VB, VBS, OCCURS DEPENDING ON are not supported.

Additional Resources

For more information about using copybook steps in PDI, see Copybook steps in PDI

Additional Enhancements

New Pentaho Server Upgrade Installer

The Pentaho Server Upgrade Installer is an easy to use graphical user interface that automatically applies the new release version to your archive installation of the Pentaho Server. You can upgrade versions 7.1 and later of the Pentaho Server directly to version 9.0 using this simplified upgrade process via the user interface of the Pentaho Server Upgrade Installer.

See Upgrade the Pentaho Server for instructions.

Snowflake Bulk Loader improvement

The Snowflake Bulk Loader has added support for doing a table preview in PDI 9.0. When connected to Snowflake and on the Output tab, select a table in the drop-down menu. The preview window is populated, showing the columns and data types associated with that table. The user can see the expected column layout and data types to match up with the data file.

For more information, please see the job entry documentation of the Snowflake Bulk Loader.

Redshift IAM security support and Bulk load improvements

With this release, you have more Redshift database connection authentication choices:

·       Standard credentials (default) – user password

·       IAM credentials

·       Profile located on local drive in AWS credentials file

Bulk load into Amazon Redshift enhancements: New Options tab and Columns option in the Output tab of the Bulk load into Amazon Redshift PDI entry. Use the settings on the Options tab to indicate if all the existing data in the database table should be removed before bulk loading. Use the Columns option to preview the column names and associated data types within your selected database table.

See Bulk load into Amazon Redshift for more information.

Improvements in AMQP and UX changes in Kinesis

The AMQP Consumer step now supports binary messages, which allows you, for example, to process Avro-formatted data.

Within the Kinesis Consumer step, users can change the output field names and types.

See the documentation of the AMQP Consumer and Kinesis Consumer steps for more details.

Metadata Injection (MDI) Improvements

In PDI 9.0.0, we continue to enable more steps to support metadata injection (MDI):

·       Split Field to Rows

·       Delete

·       String operations

In the Excel Writer step, the missing MDI step option “Start writing at cell”, has been added. This option can also be injected now.

Additionally, the metadata injection example is now available in the samples folder:
/samples/transformations/metadata-injection-example

See ETL metadata injection for more details.

Excel Writer: Performance improvement

The performance of the Excel Writer has been drastically improved when using templates. A sample test file with 40,000 rows needed about 90 seconds before 9.0 and now processes in about 5 seconds.

For further details, please see PDI-18422.

JMS Consumer changes

In PDI 9.0, we added the following fields to the JMS Consumer step: MessageID, JMS timestamp and JMS Redelivered.

This addition enables restartability and allows duplicate messages to be omitted.
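How a message ID helps with duplicates can be sketched in a few lines of Python. This illustrates the idea only and is not how the step is implemented. Messages are modeled as plain dictionaries, and the key names `MessageID` and `JMSRedelivered` are assumptions standing in for whatever output field names you configure.

```python
def drop_duplicates(messages, seen_ids=None):
    """Yield each message once, keyed on its JMS message ID.

    `seen_ids` can be restored from a previous run so that a restarted
    consumer skips messages it has already processed.
    """
    seen = set() if seen_ids is None else seen_ids
    for msg in messages:
        msg_id = msg["MessageID"]
        if msg_id in seen:
            continue  # already processed, typically a redelivery
        seen.add(msg_id)
        yield msg


incoming = [
    {"MessageID": "ID:1", "JMSRedelivered": False},
    {"MessageID": "ID:2", "JMSRedelivered": False},
    {"MessageID": "ID:1", "JMSRedelivered": True},
]
unique = list(drop_duplicates(incoming))
```

Here `unique` keeps the first `ID:1` and `ID:2` messages and drops the redelivered copy of `ID:1`.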

For further details, please see PDI-18104 and the step documentation.

Text file output: Header support with AEL

You can set up the Text file output step to run on the Spark engine via AEL. The Header option of the Text file output step now works with AEL.

For further details, please see PDI-18083 and the Using the Text File Output step on the Spark engine documentation.

Transformation & Job Executor steps, Transformation & Job entries: UX improvement

Before 9.0, when passing parameters to transformations/jobs, the options "Stream column name" vs. "Value" ("Field to use" vs. "Static input value") were ambiguous and led to hard-to-find issues.

In 9.0, we added behavior which prevents a user from entering values into both fields to avoid these situations.

For further details, please see PDI-17974.

Spoon.sh Exit code improvement

Spoon.sh (that gets called by kitchen.sh or pan.sh) sends the wrong exit status in certain situations.

In 9.0, we added a new environment variable FILTER_GTK_WARNINGS to control this behavior for warnings that affect the exit code. If the variable is set to anything, then a filter is applied to ignore any GTK warnings. If you don’t want to filter any warnings, then unset FILTER_GTK_WARNINGS.
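As a rough sketch, a Python wrapper could toggle the variable before launching a job. Only the variable name FILTER_GTK_WARNINGS comes from the release. The `pdi_env` helper is hypothetical, and the commented-out `kitchen.sh` call uses a placeholder job path and assumes a local PDI installation.

```python
import os


def pdi_env(filter_gtk_warnings, base=None):
    """Return an environment for kitchen.sh/pan.sh with the filter on or off."""
    env = dict(os.environ if base is None else base)
    if filter_gtk_warnings:
        # Any value enables the filter; "1" is an arbitrary choice.
        env["FILTER_GTK_WARNINGS"] = "1"
    else:
        # The filter is off only when the variable is unset.
        env.pop("FILTER_GTK_WARNINGS", None)
    return env


# Example (placeholder job path, requires a PDI installation):
# import subprocess
# result = subprocess.run(["./kitchen.sh", "-file=/path/to/job.kjb"],
#                         env=pdi_env(True))
# print(result.returncode)
```

Building the environment separately keeps the toggle explicit, so a scheduler script can decide per job whether GTK warnings should be filtered.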

For further details, please see PDI-17271.

Dashboard: Option for exporting analyzer report into CSV format.

Now it's possible to export an analyzer report into a CSV format file even when embedded on a dashboard.

In the previous release the export option was available, but without the CSV format.

Previously, the CSV format was available only when using Analyzer outside dashboards. This change provides functional parity between Analyzer standalone charts and charts embedded in dashboards.

For further details, please see PDB-1327.

Analyzer: Use of date picker when selecting ranges for a Fiscal Date level relative filter.

Before 9.0 and for an AnalyzerFiscalDateFormat annotation on a level in a Time dimension, Analyzer did not show the "Select from date picker" link.

Now, relative dates can be looked up from the current date on the Date level, then the date picker can also be used to select the nearest fiscal time period.

For further details, please see ANALYZER-3149.

Mondrian: Option for setting the 'cellBatchSize' default value.

From a default installation the mondrian.properties does not include mondrian.rolap.cellBatchSize as a configurable property.

The purpose of this improvement is to include this property in the mondrian.properties by default in new builds so customers do not run into performance issues due to the default value for this property being set too low. The default value of the property should be clearly indicated in the properties file as well.

The default value has been updated to mondrian.rolap.cellBatchSize=1000000.

This value was chosen because it lets Mondrian run a very large report with a 25M cell space while keeping total server memory usage around 6.7 GB, which is under the 8 GB we list as the minimum memory required on a Pentaho server.
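If you want to confirm what an existing installation is using, a small Python sketch like the following can look the property up. The `read_property` helper and the sample text are hypothetical. The parser is deliberately simple: it handles `key=value` lines and comments only, not line continuations or other Java properties syntax.

```python
def read_property(text, key):
    """Return the value of `key` from Java-properties-style text, or None."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blanks and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None


sample = """
# Hypothetical excerpt of mondrian.properties
mondrian.rolap.cellBatchSize=1000000
"""
batch_size = read_property(sample, "mondrian.rolap.cellBatchSize")
print(batch_size)  # prints 1000000
```

A return value of `None` means the property is not set in the file, which is the situation described above for older default installations.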

For further details, please see MONDRIAN-1713.

Pentaho 8.3 is available!

2019-07-15T12:30:00.007+01:00

Sorry for the delay! The release was late last week but I was travelling. In all fairness, I could have done this Friday night, but then again, it was Friday night so... 🤪🍷

A note of thanks to the Pentaho PM team for the following copy-paste exercise!

So let's start from the beginning and... Download it here!

Pentaho 8.3

The major new product innovations and enhancements included in 8.3 are:

PDI Amazon Kinesis Streaming Integration
PDI Amazon Redshift Bulk Load
PDI + BA Snowflake Connectivity
PDI HCP (Hitachi Content Platform) Integration Enhancements
PDI SAP Connector
BA Viz API 3.0 General Availability

Toward the end of the document, we have also briefly summarized other new capabilities and improvements that are included in 8.3. These additional features include:

PDI AEL Enhancements
PDI + BA Upgrade Utility Enhancements
PDI VFS (Virtual File System) Updates
PDI Lineage Updates
PDI Metadata Injection Updates
PDI Python Executor Step Updates
BA Analyzer Improvements
BA Interactive Reporting Improvements

For links to detailed product documentation on what is new in Pentaho 8.3, see here: https://help.pentaho.com/Documentation/8.3

An exhaustive list of additional product improvements and bug fixes is available in the 8.3 release notes, which can be found on the Support Portal: https://support.pentaho.com/hc/en-us

PDI Amazon Kinesis Streaming Integration

Capability

Amazon Kinesis enables ingestion of real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry data for machine learning, analytics, and other applications. Data can be processed as it becomes available, and Amazon Kinesis is real time, fully managed and scalable. It is important to note that Kinesis is an umbrella of multiple real time technologies, including Kinesis Data Streams, Kinesis Video Streams, Kinesis Firehose, and Kinesis Analytics.

Pentaho version 8.3 introduces supported PDI integration with AWS Kinesis Data Streams, including the following features:

Ability to ingest messages to PDI from AWS Kinesis (via new Kinesis Consumer transformation step) and output messages from PDI to AWS Kinesis (using new Kinesis Producer transformation step)
Micro-batch processing of data originating from AWS Kinesis, leveraging processing windows based on time periods (in milliseconds) or number of messages
Choice of executing micro-batch processing transformations with either the native Pentaho (Kettle) engine on Pentaho Server or the Spark Streaming engine on EMR Spark via Adaptive Execution Layer (AEL)
Support for Avro-typed read and write of data using Kinesis data streams
Support for max concurrent batches to allow transformation-specific scaling in Kettle engine. (Spark Streaming automatically handles this when executed in Spark via AEL).
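The idea of a processing window bounded by a time period or a message count can be sketched in Python as follows. This is a conceptual illustration, not the step's implementation. The `micro_batches` function and the `(timestamp_ms, payload)` record shape are assumptions, and a real streaming engine closes a time window on a wall-clock timer rather than waiting for the next record as this sketch does.

```python
def micro_batches(records, max_messages, max_millis):
    """Group (timestamp_ms, payload) records into micro-batches.

    A batch is closed when it holds `max_messages` records or when the next
    record arrives `max_millis` or more after the batch's first record.
    """
    batch, started = [], None
    for ts, payload in records:
        if batch and ts - started >= max_millis:
            yield batch  # time window elapsed
            batch = []
        if not batch:
            started = ts
        batch.append(payload)
        if len(batch) == max_messages:
            yield batch  # message-count limit reached
            batch = []
    if batch:
        yield batch  # flush whatever is left at the end of the input


records = [(0, "a"), (100, "b"), (1500, "c"), (1600, "d"), (1700, "e"), (1800, "f")]
batches = list(micro_batches(records, max_messages=3, max_millis=1000))
print(batches)  # prints [['a', 'b'], ['c', 'd', 'e'], ['f']]
```

The first batch closes because the time window elapsed, the second because it reached three messages, and the last is flushed at the end of the input.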

Use Cases and Benefits

Kinesis provides real-time data capability in an AWS environment. Pentaho and Kinesis Data Streams can provide filtering and contextual analysis of streaming data even for very high-throughput streaming data sources. Use case examples include shop-floor monitoring, as well as predicting quality risk and recommending optimizations. Through AWS Kinesis and Pentaho stream processing integration, Pentaho enables AWS developers to ingest and process streaming data in a powerful visual environment.

Additional Considerations

Users should be aware of the following items related to stream data processing capability in Pentaho 8.3:

Kinesis steps are only available in Enterprise Edition (EE) version of Pentaho.
Kinesis Producer step doesn’t create a new Stream if one does not exist already.
Avro read/write and max concurrent batch capabilities are not supported for other streaming steps aside from the Kinesis steps.

Additional Resources

Documentation on the Kinesis steps can be found below.

Consumer Step: https://help.pentaho.com/Documentation/8.3/Products/Kinesis_Consumer
Producer Step: https://help.pentaho.com/Documentation/8.3/Products/Kinesis_Producer

PDI Amazon Redshift Bulk Load

Capability

Redshift is a managed, highly scalable data warehouse service in the cloud that is offered by AWS, and Pentaho has provided integration with it for several years. PDI 8.3 includes a new job entry to visually configure bulk loads of files from AWS S3 storage to Redshift database tables. The entry orchestrates an AWS Redshift COPY Command and includes the following tabs:

Input Tab: Allows the user to select the S3 source for the bulk load (which can include multiple files), specify the format and compression of the source data, and configure commonly used properties relevant to the chosen format.
Output Tab: Lets the user specify the Redshift database connection, schema, and target table to load; also shows a preview of the table columns
Parameters Tab: User can input values for additional parameters for the COPY Command, related to error handling, field formats, and other items
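For readers who have not scripted this by hand, the kind of COPY statement the entry orchestrates can be sketched in Python. The statement the entry actually generates may differ. The `build_copy` helper, table name, bucket and role ARN are placeholders, and the function does no quoting or escaping, so it is for illustration only.

```python
def build_copy(table, s3_uri, iam_role, fmt="CSV", gzip=False, extra=()):
    """Assemble a Redshift COPY statement for files under an S3 prefix."""
    parts = [
        f"COPY {table}",
        f"FROM '{s3_uri}'",
        f"IAM_ROLE '{iam_role}'",
        f"FORMAT AS {fmt}",
    ]
    if gzip:
        parts.append("GZIP")
    parts.extend(extra)  # e.g. error-handling options such as MAXERROR 10
    return "\n".join(parts) + ";"


sql = build_copy(
    "public.sales",
    "s3://example-bucket/sales/",
    "arn:aws:iam::123456789012:role/ExampleRole",
    gzip=True,
    extra=["MAXERROR 10"],
)
print(sql)
```

The source, format and compression arguments correspond roughly to the Input tab, the table to the Output tab, and `extra` to the Parameters tab.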

Use Cases and Benefits

Organizations bulk load data into Redshift to support business processes and operations such as:

Populating Redshift data warehouses for regular reporting needs under standard SLAs
Repetitive data onboarding with similar logic across many S3 sources and Redshift table targets
Cleansing and repackaging data for specific customers
Loading many data files at once in parallel for high performance vs other approaches

The new bulk loader job entry helps PDI users avoid repetitive SQL scripting to orchestrate bulk loads. The GUI-based job entry takes advantage of high performance loading features of the COPY command, while supporting greater automation of onboarding to Redshift in PDI jobs.

Key Considerations

Users should be aware of the following when using the Redshift bulk load job entry in PDI:

In conjunction with the introduction of the new entry, the Redshift database connection dialog now includes an S3 Authentication Method section that must be populated to ensure that Redshift has access to the proper AWS S3 data for bulk load
The new job entry can only be used to load Redshift from S3, and does not support loads from other sources (though PDI already enables moving data from other sources to S3)

Additional Resources

For additional information on the PDI Redshift bulk load job entry, see here: https://help.pentaho.com/Documentation/8.3/Products/Bulk_load_into_Amazon_Redshift

For further information on the Redshift COPY Command parameters see here: https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html

PDI + BA Snowflake Database Connectivity

Capability

Snowflake is a cloud-based data warehouse that speaks SQL and combines flexible compute clusters for query processing with a central data repository. Snowflake customers can scale compute and storage independent of each other as they manage data, and the resulting costs are also incurred independently.

Pentaho 8.3 includes supported database connectivity to Snowflake across the platform, including Pentaho Data Integration, Pentaho User Console, Analyzer, Interactive Reporting, Report Designer, Schema Workbench, and Metadata Editor.

A specific ‘Snowflake’ database connection type will now be available in these tools, and optimized SQL generation and dedicated dialects to work with Snowflake were added for Pentaho Data Integration, Analyzer (Mondrian), and Reporting (Metadata).

Use Cases and Benefits

The 8.3 connectivity enables blending, enrichment, and analysis of Snowflake data along with other data sources. It also helps to support customers in their ongoing journey toward cloud and hybrid environments for data management and analytics.

More specifically, customers who use the end-to-end Pentaho platform with on-premise analytic databases, and are planning migrations to Snowflake, can now leverage the different Pentaho tools with Snowflake for data orchestration, reporting, ad hoc analysis, and other use cases. A “lift and shift” approach like this is a popular first step when going to the cloud to take advantage of costs savings, elasticity, and other business benefits without drastically changing data management architecture and processes.

It is also worth noting that Pentaho equips customers to leverage data across multiple cloud platforms, including AWS and Google Cloud, in addition to Snowflake.

Key Considerations

Customers should be aware of the following points as they begin to use the Pentaho connectivity to Snowflake in 8.3:

8.3 ships with Snowflake JDBC drivers for Pentaho Server and PDI. For other client tools, however, the drivers may need to be manually copied into the proper locations. See here for more detail: https://help.pentaho.com/Documentation/8.3/Setup/Install_drivers_with_the_JDBC_distribution_tool
When loading Snowflake with PDI 8.3, it is advised to orchestrate a SQL COPY INTO command rather than using the Table Output step due to slow Snowflake performance with insert statements

Note -- this process should be improved in the near future, as Pentaho anticipates releasing a PDI Bulk Load capability and other Snowflake integration enhancements later in 2019

As of 8.3 release, Aggregation Designer is not supported for Snowflake connectivity
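The COPY INTO advice above can be sketched in Python as a small statement builder whose output you could run from a SQL job entry. The `build_copy_into` helper, the table name and the stage path are hypothetical, and the sketch assumes the files are already in a Snowflake stage and need no special quoting.

```python
def build_copy_into(table, stage_path, file_type="CSV", skip_header=0):
    """Assemble a Snowflake COPY INTO <table> statement for staged files."""
    options = [f"TYPE = {file_type}"]
    if skip_header:
        options.append(f"SKIP_HEADER = {skip_header}")
    return (
        f"COPY INTO {table}\n"
        f"FROM @{stage_path}\n"
        f"FILE_FORMAT = ({' '.join(options)});"
    )


statement = build_copy_into("sales", "my_stage/2019/", skip_header=1)
print(statement)
```

A single statement like this loads all staged files in bulk, instead of issuing one insert per row as the Table Output step does.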

Additional Resources

Information regarding the Snowflake JDBC driver to be used with Pentaho can be found here: https://help.pentaho.com/Documentation/8.3/Setup/JDBC_drivers_reference#Snowflake

Learn more about Snowflake here: https://docs.snowflake.net/manuals/index.html

PDI HCP (Hitachi Content Platform) Integration Enhancements

Capability

Since version 8.2 PDI has been able to connect to Hitachi Content Platform (HCP) via the Virtual File System (VFS) to read, write, update or delete data. In 8.3, it is now possible to read, write and update HCP custom metadata and to query objects with their system metadata.

HCP is a distributed object storage system designed to support large, growing repositories of fixed-content data from simple text files to images and video to multi-gigabyte database images. HCP can be used as a Globally Compliant Retention Platform (GCRP) that meets compliance & legal retention requirements (WORM, SEC 17A-4, CFTC and MSRB). It can also be used as a secure analytics archive. HCP protects data with very high durability (up to fifteen 9s) and availability (up to ten 9s).

Each HCP object permanently associates data HCP receives (for example, a document, an image, or a movie) with information about that data called metadata. In PDI, you can query the metadata to locate and access HCP objects. The HCP object consists of the data (such as an image file), a unique URL to it, system metadata properties, and custom metadata annotations.

In 8.3, PDI has three transformation steps you can use to work with your metadata in HCP:

The Query HCP step locates data objects by searching system and custom metadata annotations. HCP returns the unique URL and system metadata of objects matching your search terms. For example, a radiology practice can search for an X-ray (object) of a specific patient or all X-rays performed by a specified physician.

With the Read metadata from HCP step, you can identify and select an HCP object by its URL path and then select a specific target annotation name to read. The step returns the requested custom metadata from the annotation back to your PDI transformation for downstream processing.

With the Write metadata to HCP step, you can identify and select an HCP object by its URL and then write custom metadata annotations to the object associated with the object URL, enriching and validating the data stored in your HCP repository. For example, a radiology practice could add patient medication data and associated medical conditions to an X-ray (object) or remove invalid diagnosis codes.

Use Cases and Benefits

HCP + PDI can act as a “Staging Data Lake” for semi-structured and unstructured data. PDI provides data quality, enrichment and cleaning before moving to an enterprise data warehouse or big data lake. You can execute data science workflows (for example deep learning for image recognition) against structured/unstructured data in HCP as well.

Nearly every PDI step and job entry that references files can use VFS, and can therefore also access data on HCP. Starting in 8.3, PDI can also add, read and modify custom metadata and query objects.

Key Considerations

In this release we support HCP versions 8.0 and 8.1. There is no support for HCP Anywhere or HCP Cloud Scale.

Additional Resources

Please see the documentation of PDI steps that work with HCP: https://help.pentaho.com/Documentation/8.3/Products/PDI_and_Hitachi_Content_Platform_(HCP)

VFS Documentation with HCP Details: https://help.pentaho.com/Documentation/8.3/Products/Virtual_file_system_browser

HCP Documentation: https://knowledge.hitachivantara.com/Documents/Storage/Content_Platform

PDI SAP Connector

Capability

Note – the steps comprising the SAP Connector have been built by partner IT-Novum and are supported jointly by Hitachi Vantara and IT-Novum. They were first made available between the 8.2 and 8.3 releases of Pentaho. The SAP Connector is available and priced separately from the core Pentaho platform. Please connect with your sales representative for further details.

The Hitachi Data Connector for SAP ERP & Business Warehouse (BW) has the following capabilities:

Flexible and easy to use analysis of SAP data
Query of complex and nested SAP structures (such as cost center groups or structures/data only available at runtime of the SAP system)
Access to BW data using SAP Data Store Objects
Support for Metadata Injection
Cost-saving offload scenarios
Data Blending of SAP Data
Analytical Workloads on Data Warehouse Layer (SAP Infocubes)
High performance data transfer from SAP to Hadoop thanks to integrated server mode

The connector contains the following PDI steps:

The SAP ERP Table Input step allows you to load data from SAP tables for further processing.
The SAP BW/ERP RFC Executor step calls remote functions on SAP Business Warehouse and ERP, and lets you solve special use cases by executing RFCs/BAPIs (Remote Function Calls / Business Application Programming Interfaces).
The SAP BW DSO Input (Data Store Objects) step is capable of loading data from SAP BW – Data Store Objects. It also supports advanced Data Store Objects (A-DSO) with SAP HANA.

For reference, PDI separately offers support for integration to SAP HANA in PDI as a database connection (since PDI 5.4) and ability to bulk load data to SAP HANA (since PDI 6.0).

Use Cases and Benefits

With the Connector, application scenarios and use cases such as onboarding, blending or offloading of SAP ERP and SAP Business Warehouse data can easily be realized.

Furthermore, the steps have extended capabilities that offer full flexibility such as data retrieval and connection to SAP Business Processes (BAPI), calling many existing SAP functions, and write-back into SAP.

Pentaho's PDI enables companies to easily connect data from SAP ERP, SAP BW, and HANA with data from non-SAP systems to create powerful analytics applications.

The connector also helps to secure investments that companies have already made in SAP components such as HANA Views & Tables, Bex Queries, DSO / ODS objects, ABAP reports, SAP queries and SAP extractors.

Key Considerations

The SAP Connector is available and priced separately from the core Pentaho platform. Please connect with your sales representative for further details.

The Connector is a separate plugin and supports PDI Version 7.x and 8.x.

Additional Resources

SAP and Analytics: https://it-novum.com/en/big-data-analytics/sap-and-analytics/

Hitachi Data Connector SAP ERP & BW: https://it-novum.com/en/big-data-analytics/sap-and-analytics/hitachi-data-connector-sap-erp-bw/

BA Viz API 3.0 General Availability

Capability

The Visualization API 3.0 is part of Pentaho’s infrastructure that enables Analyzer, CTools, and Data Explorer (in PDI) to use visualizations in a unified, pluggable way. Further, it provides a simple, powerful, tested, and documented approach for developers to integrate new visualizations to use in these tools and configure properties of visualizations used in them.

Viz API 3.0 was first introduced in Pentaho 7.1 as an evolution of Viz API 2.0. However, it has technically been in a ‘beta’ state (so that it could continue to be improved) until now: in Pentaho 8.3, Viz API 3.0 is generally available and supported. This means that Pentaho provides support for customer usage of the API, as documented, in order to integrate and configure visualizations. Further, customers can be confident that changes to the API will be carefully managed and communicated going forward.

Viz API allows Pentaho tools to leverage visualization libraries and configurations in a unified way

Use Cases and Benefits

Viz API 3.0 makes it easier for developers to integrate 3rd party visualizations and develop new charts for use in Pentaho. This can be achieved by adapting code from 3rd party libraries (such as D3) to be used in Pentaho per documentation provided. Further, Viz API is built on top of other Platform JavaScript APIs that ensure seamless data access and provide visualizations with validation, theming, and other features.

In addition, Viz API 3.0 provides the ability to reuse configurations across Analyzer, CTools, and Data Explorer (PDI).

Key Considerations

The following information should be carefully considered for customers planning to migrate from Viz 2.0 to Viz 3.0.

Please Note – to successfully convert reports from Viz API 2.0 to Viz API 3.0 upon upgrade from an early Pentaho version, some customers may first need to make additional configuration file changes to 8.3. Please contact Support for additional details.
Viz 2.0 is the default Viz API for customers that upgrade from a previous version of Pentaho and were using Viz 2.0 with that version. Viz 2.0 continues to be supported in Pentaho 8.3. However, stock (out of box) visualizations can be converted from 2.0 to 3.0.
The Viz API version for Analyzer can be switched by the administrator in the Analyzer settings.xml configuration file, and after the switch is made, the conversion to Viz 3.0 for any given report is triggered by saving that report.
Before migrating to Viz 3.0, it is advised that customers back up their existing Analyzer reports and then test them with Viz API 3.0 before saving any report.
3.0 visualizations are configured in a global configuration file, whereas 2.0 visualizations are configured in an Analyzer-specific config file. The syntax used for 3.0 is also different from the 2.0 syntax.
Custom visualizations created with Viz 2.0 will always display as Viz 2.0 and cannot be converted to Viz 3.0 via configuration.

Additional Resources

Documentation on Viz API 3.0 can be found at the following links --

Overview of Viz API 3.0: https://help.pentaho.com/Documentation/8.3/Developer_center/Visualization_API
Analyzer Considerations: https://help.pentaho.com/Documentation/8.3/Developer_center/Analyzer_and_the_Visualization_API
Configuration: https://help.pentaho.com/Documentation/8.3/Developer_center/Configuring_a_visualization

Additional Enhancements

The following capabilities have also been introduced in the Pentaho platform as part of the 8.3 release.

PDI AEL Enhancements

The Pentaho Adaptive Execution Layer (AEL) is intended to provide flexible, transparent data processing with Spark in addition to the native Kettle engine. The goal of AEL is to develop complex pipelines visually and then execute in Kettle or Spark based on data volume and SLA requirements. AEL allows PDI users to designate Spark as execution engine for their transformations apart from Kettle.

The 8.3 release includes the following performance and flexibility enhancements to AEL-Spark:

Spark Technology update -- Upgrading underlying data structure from RDD to Dataset
Adding native Spark execution for additional PDI transformation steps such as Switch-case and Merge Rows
Upgraded Hadoop distribution version support for AEL

Users should be aware of the following additional items related to AEL 8.3:

Spark native Hive steps only work with HDP distributions below 3.0.
Native HBase steps are only available for CDH and HDP distributions.
Spark 2.3 is the highest Spark version currently supported.
CDH 6.0 and higher versions are not supported as these versions only package Spark 2.4.

PDI + BA Upgrade Utility Enhancements

In Version 8.2, we introduced the Pentaho Server Upgrade Installer that we continue to deliver in 8.3 for further upgrade paths. It is an easy to use graphical user interface that automatically applies the minor release version to your existing Pentaho installation.

Pentaho versions 8.1 and 8.2 can be upgraded directly to Pentaho 8.3. The installer can also be invoked via command line for automated deployment scenarios. For all other upgrade paths from versions prior to 8.1, the pre-existing manual upgrade process is still the supported method of upgrade.

Further information about the upgrade installer and the download can be found in the customer support portal: https://support.pentaho.com/hc/en-us/categories/Downloads

General upgrade instructions can be found in the following documentation: https://help.pentaho.com/Documentation/8.3/Setup/Pentaho_upgrade

PDI VFS (Virtual File System) Updates

In 8.3 we started to unify the VFS property settings for different VFS providers. In this release we unified the settings for the HCP and Google Cloud Storage VFS providers in order to improve overall usability. Named VFS connections can be reused for multiple steps and job entries, can be easily modified and used in different environments, and most importantly, provide a single place to set VFS properties. In general, we plan to apply the settings unification to other VFS providers, such as Amazon and others, in future releases.

PDI Lineage Updates

In this release, we added custom metaverse analyzers for data lineage tracking with respect to the following transformation steps:

AMQP Consumer & Producer
JMS Consumer & Producer
Kafka Consumer & Producer
MQTT Consumer & Producer

This expands the completeness of data lineage for PDI.

To view the full list of steps and entries with custom data lineage analyzers, see Data Lineage:

https://help.pentaho.com/Documentation/8.3/Products/Data_lineage

Additionally, the following features related to data lineage have been added as beta features that are not currently supported:

Support for IBM IGC (Beta) enables PDI lineage to be integrated with an existing governance solution IBM IGC (Information Governance Catalog)
The Spark lineage feature (Beta) can get run time data lineage for AEL-Spark run jobs.

If you are interested in the above beta features, please contact support or your sales representative.

PDI Metadata Injection Updates

Metadata injection enables the passage of metadata to transformation templates at runtime to drastically increase productivity, reusability, and automation of transformation workflow. This supports use cases like the onboarding of data from many files and tables to data lakes. In addition to existing metadata injection enabled steps, as of 8.3 you now can inject metadata into any field in the following Pentaho Data Integration (PDI) steps:

Table Output (added Connection Field)
Strings cut
Salesforce Input

Learn more about PDI steps supporting metadata injection: https://help.pentaho.com/Documentation/8.3/Products/ETL_metadata_injection
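
If the idea of metadata injection feels abstract, here is a rough Python sketch of the pattern: one generic template whose field definitions are handed in at runtime instead of being hard-coded. This is not PDI code. `FieldSpec` and `run_template` are hypothetical names, and the sample data is made up purely for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FieldSpec:
    """Hypothetical metadata describing one field to extract (not a PDI class)."""
    name: str
    convert: Callable[[str], object]


def run_template(raw_text: str, delimiter: str, fields: List[FieldSpec]) -> List[Dict[str, object]]:
    """Generic 'template': parses delimited text using metadata supplied at runtime."""
    reader = csv.DictReader(io.StringIO(raw_text), delimiter=delimiter)
    return [{f.name: f.convert(row[f.name]) for f in fields} for row in reader]


# The same template handles two differently shaped sources; only the metadata changes.
customers = run_template(
    "id,name\n1,Ana\n2,Rui\n",
    ",",
    [FieldSpec("id", int), FieldSpec("name", str)],
)
sales = run_template(
    "sku;amount\nA-1;9.50\n",
    ";",
    [FieldSpec("sku", str), FieldSpec("amount", float)],
)
print(customers)
print(sales)
```

The point of the design is that onboarding a new file means writing a new list of field specs, not a new parser.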

PDI Python Executor Step Updates

The Python Executor step now features a new Python Data Structure, in addition to the existing Pandas dataframe and NumPy Array selections, named Python List of Dictionaries. The Data structure drop down menu is where you select the List of dictionaries data structure. In addition, the Python.org version that the Python Executor step is compatible with has been updated to include Python 3.7.x. Overall, these enhancements allow for more Python data structure choices and strive to keep compatibility more current for improved usability.
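
For anyone unfamiliar with the term, a "list of dictionaries" is simply one dictionary per row, keyed by field name. The sketch below only illustrates that shape in plain Python. The field names and values are invented, and how the step actually hands rows to your script is not shown here.

```python
# Illustrative rows only: each row is a dict keyed by field name.
rows = [
    {"customer": "Ana", "orders": 3, "total": 120.0},
    {"customer": "Rui", "orders": 1, "total": 35.5},
]

# Row-oriented access needs no extra libraries.
big_spenders = [r["customer"] for r in rows if r["total"] > 100]

# Column-oriented view, similar in spirit to a dataframe column.
totals = [r["total"] for r in rows]
average_total = sum(totals) / len(totals)

print(big_spenders)
print(average_total)
```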

BA Analyzer Improvements

In Pentaho 8.3, there were several improvements related to Analyzer export experience and capabilities. They include:

Un-merge Analyzer cells on Export to Excel: There is now a ‘merge pivot table cells’ checkbox in the Export to Excel flow. When unchecked, the resulting Excel spreadsheet exported will unmerge cells that were merged in the Analyzer online view, such as label cells for higher levels in a hierarchy and label cells across multiple columns. Ultimately, this makes data manipulation of the exported data easier for Excel users. (This update addresses the request in Analyzer-2055 from JIRA)
Change CSV separator for Analyzer Export: In the Export to CSV flow, there is now a ‘Separator’ input box where the user can specify what separator they want to include in the export. This helps to streamline use cases where a separator other than comma is needed, as may come up in regions where comma is normally used as a decimal point. (This update addresses the request in Analyzer-516 from JIRA)
Export Analyzer report via REST API Call: There is a new supported REST endpoint to export Analyzer reports to PDF, CSV, and Excel. It can also be invoked via URL, and is useful for embedded analytics use cases. Developer documentation is forthcoming on this enhancement. (This update addresses the requests in Analyzer-867 and Analyzer-2967 in JIRA)
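
Going back to the CSV separator option in the list above, here is a short Python sketch of why it matters on the consuming side. It uses only the standard `csv` module. The sample values are made up, and this is not Analyzer code: it just shows a semicolon-separated file with decimal commas being read back.

```python
import csv
import io

# Sample export where the decimal mark is a comma, so ';' is used as the separator.
exported = "Region;Sales\nNorte;1234,50\nSul;987,25\n"

reader = csv.DictReader(io.StringIO(exported), delimiter=";")
parsed = [
    {"Region": row["Region"], "Sales": float(row["Sales"].replace(",", "."))}
    for row in reader
]
print(parsed)
```

With a plain comma separator, the same values would have to be quoted or they would be split into extra columns.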

BA Interactive Reporting Improvements

In Business Analytics, there were two notable continuous improvements related to Pentaho Interactive Reporting in 8.3. These enhancements are:

Search Fields: Users of Interactive Reporting can now find fields they are looking for by using a text search box at the top of the Data pane. This accelerates user productivity in scenarios where data sources have many fields. (This update addresses the request from PIR-875 in JIRA)
Disable/Enable Select Distinct Setting: Administrators can now set the ‘Select Distinct’ option in Interactive Reporting Query Settings as disabled or enabled by default for users. This is done within the settings.xml configuration file for Interactive Reporting. The enhancement helps to optimize performance and efficiency of queries across users in the Pentaho deployment. (This update addresses the request from PIR-760 in JIRA)

Additional Resources

These resources include additional useful information regarding Pentaho 8.3:

Detailed Documentation: In-depth documentation on Pentaho 8.3 can be found here: https://help.pentaho.com/Documentation/8.3
Upgrade Information: Documentation on upgrading to Pentaho 8.3 from previous Pentaho versions can be found here: https://help.pentaho.com/Documentation/8.3/Setup/Pentaho_upgrade
Compatibility Update: Details about support for Pentaho 8.3 with different technology components and versions can be found here: https://help.pentaho.com/Documentation/8.3/Setup/Components_Reference
Release Notes: Details on which outstanding bugs and feature requests were resolved in Pentaho version 8.3 can be found on the Support Portal: https://support.pentaho.com/hc/en-us

Pentaho 8.2 is available!

2018-12-04T14:08:00.002+00:00

I've come to accept my inefficiency on keeping up with the technical blog posts. This is the point where one accepts his complete uselessness (and I don't even know if this is a real word!)

Anyway - up to the good things:

Pentaho 8.2 is available!

Get it here!

A really really solid release! A huge list of things that will make a serious impact on the development effort and production releases out there.

Release overview

Here's the release at a glance:

Enhance Eco System Integration

Hitachi Content Platform (HCP) Connector I
MapR DB Support
Google Encryption Support

Improve Edge to Cloud Processing

Enhanced AEL
Streaming AMQP

Better Data Operation

Expanded Lineage
Status Monitoring UX
OpenJDK support

Enable Data Science & Visualization

Python Executor
PDI Data Science Notebook (Jupyter) Integration
Push Streaming

Improve Platform Stability and Usability

JSON Enhancements
BA Chinese Language Localization for PUC
Expanded MDI

Additional Improvements

And now a little bit of detail into each of them:

Ecosystem Integration

Hitachi Content Platform (HCP) Connectivity

HCP is a distributed storage system designed to support large, growing repositories of fixed-content data from simple text files to images, video to multi-gigabyte database images. HCP stores objects that include both data and metadata that describes that data and presents these objects as files in a standard directory structure.

An HCP repository is partitioned into namespaces owned and managed by tenants, providing access to objects through a variety of industry-standard protocols, as well as through various HCP-specific interfaces.

There are many use cases for using HCP in the Enterprise context:

Globally Compliant Retention Platform (GCRP)

Meet Compliance & Legal Retention requirements (WORM, SEC 17A-4, CFTC and MSRB)

Secure Analytics Archive

Big data source/target (land) for secure analytic workflows
Better Data portability
Multi-tenant

Protect data with much higher durability (up to fifteen 9s) and availability (up to ten 9s) with HCP

The PDI+HCP combo brings much more to bear on these use cases. By leveraging PDI's connectivity to a wide variety of data, we can use HCP as a "Staging Data Lake" for semi-structured and unstructured data, and/or use it as an execution environment for data science algorithms against this type of content, like enriching HCP metadata or doing deep learning for image recognition.

In this release we implemented a VFS driver for HCP; future versions will include a deeper, metadata-level integration with HCP's functionality.

MapR DB support

Simple but important improvement: MapR DB is now supported! It's an enterprise-grade, high performance, global NoSQL database management system. It is a multi-model database that converges operations and analytics in real-time, including the HBase API to run HBase applications, even though not all features are compatible.

It's now validated to read/write data from MapR-DB as HBase. In terms of what use case this enables, I'd call out: Operational Data Hub/Real-Time BI, Customer 360 and several IoT related ones.

Google Cloud Encryption

Google CMEK allows data owners to have a multilayered security model that secures data and controls access to the data encryption keys. With this new capability, Pentaho users can use these custom encryption keys to access data in Google Cloud Storage and Google Big Query enhancing the security of the data. And we're very happy to say that we were able to test that it just works with no product change required! Damn, feels good when it happens 😁

Edge to Cloud Processing

Adaptive Execution Layer (AEL) Improvements

AEL is our cluster-version of the "Write Once, Run Everywhere", an abstraction layer where we currently have engines for Kettle (the classic one) and Spark.

Initially available in 7.1, we've been expanding it not only in terms of features but also in terms of vendor support. Here's the current matrix:

The current Spark version supported is 2.3. There are many other point improvements to AEL. I won't go into details on them but here's a small overview:

Support for execution of MDI driven transformation via “ETL Metadata Injection” step
Support for sub-transformation steps Simple Mapping/Mapping (Transformation Executor was already supported)
Native Spark implementation for HBase and Hive
Support for S3 Cloud storage from AEL with native integration

One relevant change though is that starting from 8.2 AEL is available only on the EE version. It's something that from the beginning was being debated, on opening it completely or not (the code was never available as we were actively changing the APIs and couldn't guarantee stability in external contributions). After looking at all the data we made the tough call to pull it to EE land :(

Streaming AMQP

8.0 introduced a new paradigm for handling streaming transformations in a continuous way. A new set of steps works in conjunction with a newly introduced step (“Get Records from Stream”) to process micro-batches of a continuous stream of records.

In the meantime, a few steps were introduced to ingest/produce streaming data from/to Kafka/MQTT/JMS and 8.1 introduced a mechanism to pass all streaming data together to downstream.
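
To picture the micro-batch idea mentioned above, here is a minimal Python sketch that groups an incoming stream into fixed-size batches. It is only an illustration of the concept, not how PDI implements it: `micro_batches` is a hypothetical helper, the events are fake, and only count-based batching is shown.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a (possibly endless) stream into fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a message source such as a Kafka or MQTT topic.
events = ({"sensor": "s1", "reading": n} for n in range(7))

batch_sizes = []
for batch in micro_batches(events, batch_size=3):
    batch_sizes.append(len(batch))
    print(batch)
```

Each batch is handed downstream as a unit, so the last one can be smaller than the others when the stream runs dry.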

And now two steps were added: AMQP Consumer and AMQP Producer. Nuff said

Data Operations

Lineage improvements

Getting to the point:

What’s New

Architecture improvements for 3rd party lineage bridges (like IGC)
Add step and job entry "description" fields to lineage data output
Continued upgrading to Custom Lineage Analyzers for the following steps and job entries: Hadoop File Input & Output, Spark Submit, Mapping (sub-transformation), ETL Metadata Injection step (added relation to the sub-transformation being executed)

Benefits

Better and easier integration of 3rd party lineage bridges also for future partnering
Improved use of lineage information for documentation and compliance use cases
Expand completeness of data lineage steps and job entries

Monitoring Status Page Update

Probably one of the most requested features of the decade. Our "vintage-look" PDI status page has been refreshed, and along with it comes some extra functionality.

OpenJDK support

If you've been waiting for this one as I have, you'll surely scream "FINALLY!!" We now support the OpenJDK 8 JRE in server, AEL and client tools.

Advanced Analytics and Visualizations

There must be a reason why these 2 are connected on the same part - I just don't know why 😅

Python Executor

If you're an EE customer you'll be able to benefit from this refreshed, AEL compliant Python executor. Feature-wise, I'd call out:

Automated ability to Get fields from Python script
Allows for multiple inputs (Row by Row or All Rows)
Ability to Pick a Python environment from one or more installed Python installations, i.e. virtual environments
Each Step gets its own Python session

This can be used in conjunction with the existing R step and Spark Submit Job step to add to the overall Data Science offering and capabilities.

PDI Data Science Notebook (Jupyter) Integration

Not a feature per se but an extremely useful consequence of the python step and all the improvements on data services.

Data scientists develop analytical models to achieve specific business results, and they perform much of their work within Notebooks, such as Jupyter. By using Pentaho Data Integration (PDI) and the new Python Executor step, Data Engineers can create data sets within PDI and make them available to be consumed within a Jupyter Notebook by the Data Scientist, as shown below in a collaborative workflow:

Using PDI and the Python Executor step, the required data set is created using a PDI Data Service (virtual table), which can be consumed in the Jupyter Notebook via a notebook template file. The file can be created programmatically via the Python Executor step in PDI, and pre-filled with the required connection info for the PDI Data Service.

Some technical considerations for the integration solution are as follows:

Pentaho Server needs to be running to host a PDI Data Service
PDI Spoon needs to be connected to the repository to save/deploy/edit the Data Service
PDI Data Service Client Jars need to be made available to be used by Jupyter Notebook
Compatible with Python 2.7.x or Python 3.5.x
Compatible with Jupyter Notebook 5.6.x
Python JDBC package dependencies include JayDeBeApi and jpype
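
To give an idea of what the notebook side could look like, here is a hedged Python sketch using JayDeBeApi. The driver class name and the `jdbc:pdi://` URL layout are assumptions based on the PDI Data Service thin JDBC driver, so check both against the Pentaho documentation for your version. `data_service_url` and `fetch_rows` are hypothetical helper names, and the host and port are placeholders.

```python
from urllib.parse import urlencode

# Assumed values: driver class and URL layout of the PDI Data Service thin JDBC
# driver; verify both against the Pentaho documentation before relying on them.
DRIVER_CLASS = "org.pentaho.di.trans.dataservice.jdbc.ThinDriver"


def data_service_url(host: str, port: int, webapp: str = "pentaho") -> str:
    """Build a JDBC URL for a Data Service hosted on a Pentaho Server."""
    return f"jdbc:pdi://{host}:{port}/kettle?" + urlencode({"webappname": webapp})


def fetch_rows(url: str, user: str, password: str, jars: list, query: str) -> list:
    """Run a query through JayDeBeApi; needs the client jars and a running server."""
    import jaydebeapi  # imported lazily so the URL helper works without it

    conn = jaydebeapi.connect(DRIVER_CLASS, url, [user, password], jars)
    try:
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    finally:
        conn.close()


url = data_service_url("localhost", 8080)
print(url)
```

In a notebook you would then call `fetch_rows` with your own credentials, the path to the Data Service client jars, and a query against the name of your Data Service; all of those are specific to your environment.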

Streaming Visualizations and CTools (Push)

I obviously love this one! We finished the connection between the (really awesome) Streaming Data Services all the way through CTools dashboards. Now the dashboard components that are ready for it (tables and charts are) will receive data as soon as it's ready, versus polling every N seconds.

When using CDE to create push-based streaming dashboards, the ‘Refresh period’ property for components and streaming over data services queries can now be left blank, as data can be continuously received and rendered as soon as new windows are available (see image below).

More information can be accessed here:

https://help.pentaho.com/Documentation/8.2/Products/CTools/Create_Streaming_Service_Dashboard

Platform Updates

JSON Input updates

The JSON Input step now features a new Select Fields window for specifying what fields you want to extract from your source file. The window displays the structure of the source JSON file. Each field in the structure is displayed with a checkbox for you to indicate if it should be extracted from the file. You can also search within the structure for a specific field. Overall, these enhancements provide a drastic improvement in step usability.

BA Chinese Language Localization for PUC

请不要让我失望，谷歌翻译 ("Please don't let me down, Google Translate")

Expanded Metadata Injection support

Metadata injection enables the passage of metadata to transformation templates at runtime in order to drastically increase productivity, reusability, and automation of transformation workflow. This supports use cases like the onboarding of data from many files and tables to data lakes. In addition to existing metadata injection enabled steps, as of 8.2 you now can inject metadata into any field in the following Pentaho Data Integration (PDI) steps:

Get System Data
Execute Row SQL Script
Execute SQL Script
User Defined Java Class
AMQP Consumer
AMQP Producer
JMS Consumer
JMS Producer
Add a Checksum
Set Field Value
Set Field Value to a Constant

BA Analyzer Numeric Level Comparison Filters

In 8.2, users can filter Analyzer reports using numeric level comparison filters, which provide an added degree of flexibility and productivity to Business Analytics customers. Previously, level filters treated all levels as text-based/non-numeric, and as such required filter criteria based on either a picklist or string matching.

The new filters for numeric levels include greater than, less than, greater than or equals, less than or equals, and between. For instance, as seen in the insurance example below, a numeric level representing monthly auto premiums can be filtered according to a numeric range, keeping only records and measure amounts (of individual customers) where the premium level is between $150 and $400 per month.
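
The difference between text-based and numeric level filters can be sketched in a few lines of Python. This is purely illustrative and not how Analyzer or Mondrian evaluates filters. The premium values are invented, and treating "between" as inclusive is an assumption made for the example.

```python
# Invented sample members of a numeric "monthly premium" level.
premium_levels = ["95", "150", "275", "400", "1200"]

# Text-based matching: "starts with 1" picks up 150 and 1200 alike.
text_match = [p for p in premium_levels if p.startswith("1")]

# Numeric comparison: keep members between 150 and 400 (inclusive here by assumption).
between = [p for p in premium_levels if 150 <= float(p) <= 400]

print(text_match)
print(between)
```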

Additional considerations:

The numeric level comparison filters can be parametrized for use with Dashboard Designer
The filters can be applied via the report URL
If you are working with a high cardinality level, it may make sense to optimize performance by adjusting the mondrian.olap.maxConstraints property (ensure joins are handled by the underlying database) and/or rounding your data to manage cardinality

Additional Enhancements

Here are some not-so-minor other improvements that were done on the release:

PDI Step & Job Entry Improvements

User Defined Java Class step: Support of Java 1.8

Allow PDI users to make use of newer Java language features (e.g. enhanced for loops, lambda expressions, varargs, etc.)

Text File Output step: Added support of variables in the "Split every...rows" property

Improve creating of flexible output file sizes controlled by variables.

FTPS job entries: Support "Advanced server protection level"

All FTPS steps have been enhanced by supporting “private protection level”, so the data is secured by integrity and confidentiality.

REST Client step: Allows custom content type headers to be provided.

Many REST servers require custom content types to be sent to them, in particular W3C Semantic compliant data stores such as AllegroGraph and MarkLogic Server.

Text File Input Step: Provide the full stack trace when a file cannot be opened

The full stack trace will provide very valuable debugging information and allow root cause analysis of problems to resolve them more quickly.

Calculator step: Added exceptions when a file is not found.

Instead of providing bad data when a file is not available, the process ends with an error to notify the user of the issue.

BA Improvements

PUC Upload/Download: Users with ‘publish content’ permission can now upload/download files to PUC

No longer need to rely on a few users with complete ‘admin’ rights to move content between environments

Scheduling Access: PUC users without scheduling permissions can no longer see the scheduling perspective

More logical permissions and user experience for BA customers

MDX Performance: MDX optimizations for some scenarios that include subtotals, numeric filters, and percentages

Better performance in some Analyzer/Mondrian query scenarios

Analyzer Business Groups: Global setting option to expand or collapse Analyzer business groups

Long lists of fields can be rolled up by default when a report is opened, reducing scrolling / improving UX

Analyzer Numeric Dimension Filters: (*Stretch Goal*) Comparison filters ( < , > , between, …) on numeric levels (e.g. age, credit score, customer id)

Much greater flexibility to query data with numeric levels (e.g. show me sales for customers between ages of 18 and 30). Previously every distinct level value would have to be manually added to an include filter criteria.

Get it here and Enjoy!!!

Pentaho Community Meeting - PCM18! Bologna, Italy, November 23-25!

2018-07-27T13:15:00.004+01:00

PCM 18!!

If you've been in one, no more words are needed, just go ahead and register! If you don't know what I'm talking about, just go ahead and register as well!

It's the best example of what Pentaho - now part of Hitachi Vantara - is all about. A very passionate group of people that are absolutely world class at what they do and still know how to have a good time!

PCM17 group photo

Now shamelessly copy-pasting the content from it-novum:

Pentaho Community Meeting 2018

Pentaho Community Meeting 2018 will take place in Bologna from November 23-25. It will be organized by Italia Pentaho User Group and by it-novum, the host of PCM17. As always, it will be a three-day event full of presentations, networking and fun and we invite Pentaho users of every kind to participate!

For PCM18 we will meet in the beautiful city of Bologna. The guys of Italia User Group will take care of the venue and the program. With Virgilio Pierini as group representative we not only have a Pentaho enthusiast but also a native of Bologna guiding us to the beautiful corners of the hometown of Europe’s oldest university!

What is Pentaho Community Meeting?

Pentaho Community Meeting is an informal gathering for Pentaho users from around the world. We meet to discuss the latest and greatest in Pentaho products and exciting geek stuff (techie track) as well as best practices of Pentaho implementations and successful projects (business track). Read this summary of Pentaho Community Meeting 2017 to learn more.

PCM18 is open to everyone who does something with Pentaho (development, extensions, implementation) or plans to do data integration, analytics or big data with Pentaho. Several Pentaho folks – architects, designers, product managers – will share their latest developments with us.

The event is community-oriented and open-minded. There’s room for networking and exchanging ideas and experiences. Participants are free to break off into groups and work together.

Call for Papers

For sure, this is intended to be a community event - for the community and by the community. To register your proposal for the agenda, please use the contact form to send a brief description including your name and title in English by September 30th.

Agenda

The agenda will be updated continuously, so stay tuned for updates! All updates will be posted on twitter, too.

Friday, November 23 | Hackathon

We start the three-day PCM with a hackathon, snacks and drinks. After a 2-hour hackathon, a highly esteemed jury will award the most intelligent/awkward/funny hacks.

Saturday, November 24 | Conference Day

Still a lot to be determined! We're still receiving papers

Welcome speech | Stefan Müller and the org team
The future of Pentaho in Hitachi Vantara | Pedro Alves, Hitachi Vantara
What's new in PDI 9.0 | Jens Bleuel, Hitachi Vantara
Useful Kettle plugins | Matt Casters, Neo4j (and founder of Kettle)
IoT and AI: Why innovation is a societal imperative | Wael Elrifai, VP for Solution Engineering - Big Data, IOT & AI, Hitachi Vantara
Pentaho at CERN | Gabriele Thiede, CERN
Pentaho User Group Italia
SSBI (Self Service BI ) - Pentaho Plugin Update | Pranav Lakhani, SPEC INDIA
Scaling Pentaho Server with Kubernetes | Diethard Steiner
Capitalizing on Lambda & Kappa Architectures for IoT with Pentaho | Issam Hizaji, Lead Sales Engineer, Data Analytics & IoT | Emerging & Southern

After the lunch, everybody splits up to join the business or the techie track.

Sunday, November 25 | Social Event

Brunch, sightseeing and... let's see!

----

Anyway, believe me, you want to go! GO REGISTER HERE!

Pentaho 8.1 is available

2018-05-16T17:15:00.000+01:00

Pentaho 8.1 is available

The team has once again over-delivered on a dot release! Below are what I think are the many highlights of Pentaho 8.1, as well as a long list of additional updates.

If you don’t have time to read to the end of my very long blog, just save some time and download it now. Go get your Enterprise Edition or trial version from the usual places

For CE, you can find it on the community home!

Cloud

One of the biggest themes of the release: Increased support for Cloud. A lot of vendors are fighting to become the best providers, and what we do is try to make sure Pentaho users watch all that comfortably sitting on their chairs, having a glass of wine, and really not caring about the outcome. Like in a lot of areas, we want to be agnostic – which is not saying that we’ll leverage the best of each – and really focus on logic and execution.

It’s hard to do this as a one time effort, so we’ve been adding support as needed (and by “as needed” I really mean based on the prioritization given by the market and our customers). A big focus of this release was Google and AWS:

Google Storage (EE)

Google Cloud Storage is a RESTful unified storage for storing and accessing data on Google's infrastructure. PDI support for importing and exporting data to/from Cloud Storage is now provided through a new VFS driver (gs://). You can use it in the several steps that support VFS, as well as browse its contents.

These are the roles required on Google Storage for this to work:

● Storage Admin

● Storage Object Admin

● Storage Object Creator

● Storage Object Viewer

In terms of authentication, you’ll need the following environment variable defined:

GOOGLE_APPLICATION_CREDENTIALS="/opt/Pentaho81BigQuery.json"

From this point on, just treat it as a normal VFS source.
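Since a missing or broken credentials file only shows up later as a confusing VFS error, a quick sanity check before launching Spoon can save some time. This is just a minimal sketch in Python, not something that ships with Pentaho; the function name check_google_credentials is made up for illustration.

```python
import json
import os


def check_google_credentials(environ=os.environ):
    """Return a list of problems with GOOGLE_APPLICATION_CREDENTIALS (empty if none)."""
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    if not os.path.isfile(path):
        return [f"credentials file not found: {path}"]
    try:
        # The key file is expected to be a JSON document.
        with open(path, encoding="utf-8") as handle:
            json.load(handle)
    except ValueError as error:
        return [f"credentials file is not valid JSON: {error}"]
    return []
```

An empty list means the variable is set and points to a readable JSON file; it says nothing about whether the account actually holds the roles listed above.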

Google BigQuery – JDBC Support (EE/CE)

BigQuery is Google's serverless, highly scalable, low cost enterprise data warehouse. Fancy name for a database, and that’s how we treat it.

In order to connect to it first we need the appropriate drivers. Steps here are pretty simple:

1. Download free driver: https://cloud.google.com/bigquery/partners/simba-drivers/

2. Copy google*.* files from Simba driver to /pentaho/design-tools/data-integration/libs folder

Host Name will default to https://www.googleapis.com/bigquery/v2 but your mileage may vary.

Unlike the previous item, authentication here doesn’t use the environment variable that Google VFS relies on. It is done at the JDBC driver level, through a driver option, OAuthPvtKeyPath, set in the Database Connection options, which needs to point to the Google Storage certificate in P12 key format.

The following Google BigQuery roles are required:

1. BigQuery Data Viewer

2. BigQuery User
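To make the driver-level authentication a bit more concrete, here is a rough Python sketch that assembles a connection string with the key path as a driver option. Only OAuthPvtKeyPath and the default host come from the text above; ProjectId, OAuthType and OAuthServiceAcctEmail are my assumptions about the Simba driver's option names, so check the driver documentation before relying on them. In Spoon you would normally type these into the connection's Options tab rather than build a URL by hand.

```python
def bigquery_jdbc_url(project_id, service_account_email, p12_key_path,
                      host="https://www.googleapis.com/bigquery/v2", port=443):
    """Build a Simba-style BigQuery JDBC URL (option names partly assumed)."""
    options = {
        "ProjectId": project_id,                         # assumed option name
        "OAuthType": "0",                                # assumed: service account auth
        "OAuthServiceAcctEmail": service_account_email,  # assumed option name
        "OAuthPvtKeyPath": p12_key_path,                 # named in the post
    }
    tail = ";".join(f"{key}={value}" for key, value in options.items())
    return f"jdbc:bigquery://{host}:{port};{tail}"
```

The project, e-mail and key path are whatever your own Google project uses; nothing here is validated against Google.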

Google BigQuery – Bulk Loader (EE)

While you can use a regular table output to insert data into BigQuery that’s going to be slow as hell (who said hell was slow? This expression makes no sense at all!). So we’ve added a step for that: Google BigQuery Loader.

This step leverages Google’s loading abilities, and is processed on Google, not on PDI. So the data, which has to be in Avro, JSON or CSV, has to be copied to Google Storage beforehand. From that point on it’s pretty straightforward. Authentication is done via the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing to the Google JSON file.

Google Drive (EE/CE)

While Google Storage will probably be seen more frequently in production scenarios, we also added support for Google Drive, a file storage and synchronization service that allows users to store files on their servers, synchronize files across devices, and share files.

This is also done through a VFS driver, but given it’s a per user authentication a few steps need to be fulfilled to leverage this support:

● Copy your Google client_secret.json file into the credentials directory listed below and restart (the Google Drive option will not appear as a Location until you do):

o Spoon: data-integration/plugins/pentaho-googledrive-vfs/credentials directory, and restart spoon.

o Pentaho Server: pentaho-server/pentaho-solutions/system/kettle/plugins/pentaho-googledrive-vfs/credentials directory and restart the server

● Select Google Drive as your Location. You are prompted to login to your Google account.

● Once you have logged in, the Google Drive permission screen displays.

● Click Allow to access your Google Drive Resources.

● A new file called StoredCredential will be added to the same place where you had the client_secret.json file. This file will need to be added to the Pentaho Server credential location and that authentication will be used
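The last step is easy to forget, so here is a small Python sketch that copies the credential files from the Spoon plugin directory to the Pentaho Server one. The two relative paths are the ones listed above; the idea that both installs live under a single install_root, and the function name itself, are assumptions made purely for the example.

```python
import shutil
from pathlib import Path

SPOON_CREDENTIALS = Path(
    "data-integration/plugins/pentaho-googledrive-vfs/credentials")
SERVER_CREDENTIALS = Path(
    "pentaho-server/pentaho-solutions/system/kettle/plugins/"
    "pentaho-googledrive-vfs/credentials")


def publish_drive_credentials(install_root):
    """Copy the Google Drive credential files from Spoon to the server."""
    root = Path(install_root)  # assumed: both installs share this parent
    source_dir = root / SPOON_CREDENTIALS
    target_dir = root / SERVER_CREDENTIALS
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in ("client_secret.json", "StoredCredential"):
        source = source_dir / name
        if source.is_file():
            shutil.copy2(source, target_dir / name)
            copied.append(name)
    return copied
```

The server still needs a restart afterwards, as described in the steps above.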

Analytics over BigQuery (EE/CE, depending on the tool used)

This JDBC connectivity to Google BigQuery, as defined previously for Spoon, can also be used throughout all the other Business Analytics browser and client tools – Analyzer, CTools, PIR, PRD, modeling tools, etc. Some care has to be taken here, though, as BigQuery’s pricing is related to 2 factors:

● Data stored

● Data queried

While the first one is relatively straightforward, the second one is harder to control, as you’re charged according to total data processed in columns selected. For instance, a ‘select *’ query should be avoided if only specific columns are needed. To be absolutely clear, this has nothing to do with Pentaho, these are Google BigQuery pricing rules.

So ultimately, and a bit like we need to do on all databases / data warehouses, we need to be smart and work around the constraints (usually speed and volume, on this case price as well) to leverage best what these technologies have to offer. Some examples are given here:

● By default, there is BigQuery caching and cached queries are free. For instance, if you run a report in Analyzer, clear the Mondrian cache, and then reload the report, you will not be charged (thanks to the BigQuery caching)

● Analyzer: Turn off auto refresh. This way you design your report layout first, including calculations and filtering, without querying the database automatically after each change

● Analyzer: Drag in filters before levels to reduce data queried (i.e. filter on state = California BEFORE dragging city, year, sales, etc. onto canvas)

● Pre-aggregate data in BigQuery tables so they are smaller in size where possible (to avoid queries across all raw data)

● GBQ administrators can set query volume limits by user, project, etc. (quotas)
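To see why ‘select *’ hurts, it helps to do the arithmetic on how much data a query touches. The Python sketch below is only an illustration: the table, its per-column sizes and the function name are invented, and it models nothing more than the "charged per column selected" rule described above (no caching, no actual prices).

```python
GIB = 1024 ** 3


def estimate_scan_bytes(column_bytes, selected_columns=None):
    """Sum the stored size of the selected columns; None behaves like SELECT *."""
    if selected_columns is None:
        return sum(column_bytes.values())
    return sum(column_bytes[name] for name in selected_columns)


# Hypothetical table: one wide free-text column dominates the storage.
sales_table = {
    "state": 2 * GIB,
    "city": 4 * GIB,
    "year": 1 * GIB,
    "sales": 3 * GIB,
    "notes": 90 * GIB,
}

everything = estimate_scan_bytes(sales_table)                # 100 GiB
narrow = estimate_scan_bytes(sales_table, ["state", "sales"])  # 5 GiB
```

With these made-up numbers the narrow query processes 5% of what ‘select *’ would, which is exactly the kind of difference that ends up on the bill.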

AWS S3 Security Improvements (IAM) (EE/CE)

PDI is now able to get IAM security keys from the following places (in this order):

1. Environment Variables

2. Machine’s home directory

3. EC2 instance profile

This added flexibility helps accommodate different AWS security scenarios, such as integration with S3 data via federated SSO from a local workstation, by providing secure PDI read/write access to S3 without making users provide hardcoded credentials.

The IAM user secret key and access key can be stored in one place so they can be leveraged by PDI without repeated hardcoding in Spoon. These are the environment variables that point to them:

● AWS_ACCESS_KEY_ID

● AWS_SECRET_ACCESS_KEY
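The lookup order is easier to reason about when written down as a chain of providers. This Python sketch is not PDI's code; it just mirrors the three steps above. The ~/.aws/credentials location and its key names are my assumption for what "machine's home directory" means, and the instance profile step is left as a callable you pass in rather than a real metadata call.

```python
import configparser
from pathlib import Path


def from_environment(environ):
    """Step 1: the two environment variables named above."""
    access = environ.get("AWS_ACCESS_KEY_ID")
    secret = environ.get("AWS_SECRET_ACCESS_KEY")
    return (access, secret) if access and secret else None


def from_home_directory(home, profile="default"):
    """Step 2: a credentials file under the home directory (assumed layout)."""
    parser = configparser.ConfigParser()
    if not parser.read(Path(home) / ".aws" / "credentials"):
        return None  # file missing or unreadable
    if not parser.has_section(profile):
        return None
    access = parser.get(profile, "aws_access_key_id", fallback=None)
    secret = parser.get(profile, "aws_secret_access_key", fallback=None)
    return (access, secret) if access and secret else None


def resolve_credentials(environ, home, instance_profile_lookup):
    """Try each source in order and return (source name, keys), or None."""
    providers = (
        ("environment", lambda: from_environment(environ)),
        ("home directory", lambda: from_home_directory(home)),
        ("instance profile", instance_profile_lookup),  # step 3, caller-supplied
    )
    for name, provider in providers:
        keys = provider()
        if keys:
            return name, keys
    return None
```

The first source that yields a complete key pair wins, which is why a stale environment variable will shadow an otherwise correct instance profile.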

Big Data / Adaptive Execution Layer (AEL) Improvements

Bigger and Better (EE/CE)

AEL provides spectacular scale out capabilities (or is it scale up? I can’t cope with these terminologies…) by seamlessly allowing a very big transformation to leverage a clustered processing engine.

Currently we have support for Spark through the AEL layer, and throughout the latest releases we’ve been improving it in 3 distinct areas:

● Performance and resource optimizations

o Added Spark Context Reuse that, under certain circumstances, can speed up startup performance by something in the range of 5x, proving especially useful under development conditions

o Spark History Server integration, providing a centralized administration, auditing and performance reviews of the transformations executed in Spark

o Ability to pass customized spark properties down to the cluster, allowing finer-grained control of the execution process

● Increased support for native steps (eg, leveraging the spark specific group by instead of the PDI engine one)

● Adding support for more cloud vendors – and we just did that for EMR 5.9 and MapR 5.2

This is the current support matrix for Cloud Vendors:

Sub Transformation support (EE/CE)

This one is big, as it was the result of a big and important refactor of the kettle engine. AEL now supports executing sub transformations through the Transformation Executor step, a long-standing request since the times of good-old PMR (Pentaho Map Reduce).

Big Data formats: Added support for Orc (EE/CE)

Not directly related to AEL, but in most of the use cases where we want AEL execution we’ll need to input data in a big data specific format. In previous releases we added support for Parquet and Avro, and we have now added support for ORC (Optimized Row Columnar), a format favored by Hortonworks.

Like the others, Orc will be handled natively when transformations are executed in AEL

Worker Nodes (EE)

Jumping from scale-out to scale-up (or the opposite, like I mentioned, I never know), we continue to do lots of improvements on the Worker Nodes project. This is an extremely strategic project for us as we integrate with the larger Hitachi Vantara portfolio.

Worker nodes allow you to execute Pentaho work items, such as PDI jobs and transformations, with parallel processing and dynamic scalability with load balancing in a clustered environment. It operates easily and securely across an elastic architecture, which uses additional machine resources as they are required for processing, operating on premise or in the cloud.

It uses the Hitachi Vantara Foundry project, that leverages popular technologies under the hood such as Docker (Container Platform), Chronos (Scheduler) and Mesos/Marathon (Container Orchestration).

For 8.1 there are several other improvements:

● Improvements in Monitoring, with accurate propagation of Work Items status

● Performance improvements by optimizing the startup times for executing the work items

● Customizations are now externalized from docker build process

● Job clean up functionality

Streaming

In Pentaho 8.0 we introduced a new paradigm to handle streaming datasources. The fact that it’s a permanently running transformation required a different approach: The new streaming steps define the windowing mode and point to a sub transformation that will then be executed on a micro batch approach.

That works not only for ETL within the kettle engine but also in AEL, enabling spark transformations to feed from Kafka sources.
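If the micro batch idea sounds abstract, this is roughly what it boils down to. The Python sketch below is my own simplification, not the kettle implementation: it cuts an endless stream into batches by row count or by elapsed time, and each batch would then be handed to the sub transformation. Timestamps travel with the rows to keep the example deterministic; a real engine would also close a batch on a timer instead of waiting for the next row to arrive.

```python
def micro_batches(records, max_rows, max_millis):
    """Split (timestamp_millis, row) pairs into batches by count or duration."""
    batch, started = [], None
    for timestamp, row in records:
        # Close the current batch if its time window has elapsed.
        if batch and timestamp - started >= max_millis:
            yield batch
            batch, started = [], None
        if started is None:
            started = timestamp
        batch.append(row)
        # Close the current batch if it reached the row limit.
        if len(batch) >= max_rows:
            yield batch
            batch, started = [], None
    if batch:
        yield batch  # whatever is left when the stream ends
```

For example, with max_rows=10 and max_millis=1000, rows arriving at 0 ms, 1200 ms and 1300 ms end up in two batches: the first row alone, then the other two together.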

New Streaming Datasources: MQTT, and JMS (Active MQ / IBM MQ) (EE/CE)

Leveraging on the new streaming approach, there are 2 new steps available – well, one new and one (two, actually) refreshed.

The new one is MQTT – Message Queuing Telemetry Transport - an ISO standard publish-subscribe-based messaging protocol that works on top of the TCP/IP protocol. It is designed for connections with remote locations where a "small code footprint" is required or the network bandwidth is limited. Alternative IoT centric protocols include AMQP, STOMP, XMPP, DDS, OPC UA, WAMP

There are 2 new steps – MQTT Input and MQTT Output, that connect with the broker for consuming and publishing back the results.

Other than this new, IoT centered streaming source, there are 2 new steps, JMS Input and JMS Output. These steps replace the old JMS Consumer/Producer and the IBM Websphere MQ steps, supporting, in the new mode the following message queue platforms:

● ActiveMQ

● IBM MQ

Safe Stop (EE/CE)

This new paradigm to handle streaming sources introduced a new challenge that we never had to face. Usually, when we triggered jobs and transformations, they had a well defined start and end; Our stop functionality was used when we wanted to basically kill a running process because something was not going well.

However, on these streaming use cases, a transformation may never finish. So stopping a transformation the way we’ve always done – by stopping all steps at the same time – could have unwanted results.

So we implemented a different approach – We added a new option to safe stop a transformation implemented within Spoon, Carte and the Abort step, that instead of killing all the step threads, stops the input steps and lets the other steps gracefully finish the processing, so no records currently being processed are lost.

This is especially useful in real-time scenarios (for example reading from a message bus). It’s one of those things that when we look back seems pretty dumb that it wasn’t there from the start. It actually makes a lot of sense, so we went ahead and made this the default behavior.
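Here is the safe stop idea reduced to a toy in Python, with two threads standing in for an input step and a downstream step. It is a sketch of the concept only (names like input_step and run_until_safe_stop are mine, and kettle's row sets work differently): the stop request reaches only the input step, which then signals end of stream so the downstream step can drain whatever is still queued.

```python
import itertools
import queue
import threading
import time

END_OF_STREAM = object()


def input_step(source, rows_out, stop_requested, emitted):
    """Read rows until a safe stop is requested, then signal end of stream."""
    for row in source:
        if stop_requested.is_set():
            break
        rows_out.put(row)
        emitted.append(row)
    rows_out.put(END_OF_STREAM)


def processing_step(rows_in, processed):
    """Keep working until the end-of-stream marker arrives; never killed."""
    while True:
        row = rows_in.get()
        if row is END_OF_STREAM:
            break
        processed.append(row * 2)


def run_until_safe_stop(run_seconds=0.05):
    rows = queue.Queue(maxsize=100)
    stop_requested = threading.Event()
    emitted, processed = [], []
    producer = threading.Thread(
        target=input_step,
        args=(itertools.count(), rows, stop_requested, emitted))
    consumer = threading.Thread(target=processing_step, args=(rows, processed))
    producer.start()
    consumer.start()
    time.sleep(run_seconds)
    stop_requested.set()  # safe stop: only the input step is told to stop
    producer.join()
    consumer.join()
    return emitted, processed
```

However long it runs, every row the input step emitted shows up in the processed list, which is the whole point: no in-flight records are lost.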

Streaming results (EE/CE)

When we launched streaming in Pentaho 8.0 we focused on the processing piece. We could launch the sub transformation but we could not get results back. Now we have the ability to define which step on the sub-transformation will send back the results to follow the rest of the flow.

Why is this important? Because of what comes next…

Streaming Dataservices (EE/CE)

There’s a new option to run a data service in streaming mode. This will allow the consumers (in this case CTools Dashboards) to get streaming data from this dataservice.

Once defined, we can test these options within the test dataservices page and see the results as they come.

This screen exposes the functionality as it would be called from a client. It’s important to know that the windows that we define here are not the same as the ones we defined for the micro batching service. The window properties are the following:

● Window Size – The number of rows that a window will have (row based), or the time frame that we want to capture new rows to a window (time based).

● Every - Number of rows (row based), or milliseconds (time based) that should elapse before creating a new window.

● Limit – Maximum number of milliseconds (row based) or rows (time based) which will be used to wait for a new window to be generated.
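For the row based case, Window Size and Every can be pictured as a sliding buffer. The Python sketch below reflects my reading of those two properties and is not the data service implementation: it keeps the newest Window Size rows and hands out a snapshot each time Every new rows have arrived. Limit is left out, since it only matters when rows stop arriving.

```python
from collections import deque


def row_based_windows(rows, window_size, every):
    """Yield the newest `window_size` rows each time `every` rows arrive."""
    window = deque(maxlen=window_size)  # older rows fall off automatically
    since_last = 0
    for row in rows:
        window.append(row)
        since_last += 1
        if since_last == every:
            yield list(window)
            since_last = 0
```

With rows 1 to 6, a window size of 3 and every set to 2, the consumer sees [1, 2], then [2, 3, 4], then [4, 5, 6]; a size larger than every gives overlapping windows.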

CTools and Streaming Visualizations (EE/CE)

We took a holistic approach to this feature. We want to make sure we can have a real time / streaming dashboard leveraging what was set up before. And this is where the CTools come in. There’s a new datasource in CDE available to connect to streaming dataservices:

Then the configuration of the component will select the kind of query we want – Time or number of records base, window size, frequency and limit. This gives us a good control for a lot of use cases.

This will allow us to then connect to a component the usual way. While this will probably be more relevant for components like tables and charts, ultimately all of them will work.

It is possible to achieve a level of multi-tenancy by passing a user name parameter from the PUC session (via CDE) to the transformation as a data services push-down parameter. This will enable restriction of the data viewed on a user by user basis

One important note is that the CTools streaming visualizations do not yet operate on a ‘push’ paradigm – this is on the current roadmap. In 8.1, the visualizations poll the streaming data service on a constant interval which has a lower refresh limit of 1 second. But then again… if you’re doing a dashboard of this type and need a refresh of 1 second, you’re definitely doing something wrong…

Time Series Visualizations (EE/CE)

One of the biggest use cases for streaming, from a visualization perspective, is time series. We improved CCC support for time series line charts, so data trends over time will now be shown without needing workarounds.

This applies not only to CTools but also to Analyzer

Data Exploration Tool Updates (EE)

We’re keeping on our path of improving our Data Exploration Tool. It’s no secret that we want to make it feature complete so that it can become the standard data analysis tool for the entire portfolio.

This time we worked on adding filters to the Stream view.

We’ll keep improving this. Next on the queue, hopefully, will be filters on the model view and date filters!

Additional Updates

As usual, there were several additional updates that did not make it to my highlights above. So for the sake of your time and not creating a 100 page blog – here are even more updates in Pentaho 8.1.

Additional updates:

● Salesforce connector API update (API version 41)

● Splunk connection updated to version 7

● Mongo version updated to 3.6.3 driver (supporting 3.4 and 3.6)

● Cassandra version updated to support version 3.1 and Datastax 5.1

● PDI repository browser performance updates, including lazy loading

● Improvements on the Text and Hadoop file outputs, including limit and control file handling

● Improved logging by removing auto-refresh from the kettle logging servlet

● Admin can empty trash folder of other users on PUC

● Clear button in PDI step search in spoon

● Override JDBC driver class and URL for a connection

● Suppressed the Pentaho ‘session expired’ pop-up on SSO scenarios, redirecting to the proper login page

● Included the possibility to schedule generation of reports with a timestamp to avoid overwriting content

In summary (and wearing my marketing hat) with Pentaho 8.1 you can:

● Deploy in hybrid and multi-cloud environments with comprehensive support for Google Cloud Platform, Microsoft Azure and AWS for both data integration and analytics

● Connect, process and visualize streaming data, from MQTT, JMS, and IBM MQ message queues and gain insights from time series visualizations

● Get better platform performance and increase user productivity with improved logging, additional lineage information, and faster repository access

Download it

Go get your Enterprise Edition or trial version from the usual places

For CE, you can find it on the community home!

Pedro

Pentaho 8 is now available!

2017-11-15T18:20:00.000+00:00

I recently wrote about everything you needed to know about Pentaho 8. And now it's available! Go get your Enterprise Edition or trial version from the usual places

For CE, you can find it on the new community home!

Enjoy!

-pedro

A new collaboration space

2017-11-09T11:22:00.000+00:00

With the move to Hitachi Vantara we're not letting the community go away - quite the contrary. And one of the first things is trying to give the community a new home, here: http://community.pentaho.com

We're trying to gather people from the forums, user groups, whatever, and give them a better and more modern collaboration space. This space will continue to be open, also because the content is extremely valuable, so the ultimate decision is yours.

Your mission, should you choose/decide to accept it, is to register and try this new home. Counting on your help to make it a better space

See you in http://community.pentaho.com

Cheers!

-pedro

Announcing Pentaho 8.0 - Coming in November to a theater near you!

2017-10-26T15:43:00.001+01:00

Pentaho 8!

The first of a new Era

Wow - time flies... Another Pentaho World this week, and another blog post announcing another release. This time... the best release ever! ;)

This is our first Pentaho product announcement since we became Hitachi Vantara - and you'll see that some synergies are already appearing. And as I said before, again and again... the Community Edition is still around! We're not kidding - we're here to rule the world and we know it's through an open source core strategy that we'll get there :)

Pentaho 8.0 In a nutshell

Ok, let's get on with this cause there's a lot of people at the bar calling me to have a drink. And I know my priorities!

Platform and Scalability

Worker Nodes
New theme

Data Integration

Streaming support!
Run configurations for Jobs
Filters in Data Explorer
New Open / Save experience

Big Data

Improvements on AEL
Big Data File Formats - Avro and Parquet
Big Data Security - Support for Knox
VFS improvements for Hadoop Clusters

Others

Ops Mart for Oracle, MySQL, SQL Server
Platform password security improvements
PDI mavenization
Documentation changes on help.pentaho.com
Feature Removals:

Analyzer on MongoDB
Mobile Plug-in (Deprecated in 7.1)

Is it done? Can I go now? No?.... damn, ok, now on to further details...

Platform and Scalability

Worker Nodes (EE)

This is big. I never liked the way we handled scalability in PDI. Having the ETL designer responsible for manually defining the slave server in advance, having to control the flow of each execution, praying for things not to go down... nah! Also, why ETL only? What about all the other components of the stack?

So a couple of years ago, after getting info from a bunch of people I submitted a design document with a proposal for this:

This was way before I knew the term "worker nodes" was actually not original... but hey, they're nodes, they do work, and I'm bad with names, so there's that... :p

It took time to get to this point, not because we didn't think this was important, but because of the underlying order of execution; We couldn't do this without merging the servers, without changing the way we handle the repository, without having AEL (the Adaptive Execution Layer). Now we got to it!

Fortunately, we have an engineering team that can execute things properly! They took my original design, took a look at it, laughed at me, threw me out of the room and came up with the proper way of doing things. Here's the high-level description:

This is where I mentioned that we are already leveraging Hitachi Vantara resources. We are using Lumada Foundry for worker nodes. Foundry is a platform for rapid development of service-based applications delivering the management of containers, communications, security, and monitoring toward creating enterprise products/applications, leveraging technology like docker, mesos, marathon, etc. More on this later, as it's something we'll be talking a lot more about...

Here's some of the features

Deploy consistently in physical, virtual and cloud environments
Scale and load balance services, helping to deal with peaks and limited time windows by allocating the resources that are needed.
Hybrid deployments can be used to distribute load; even when the on-premise resources are not sufficient, scaling out into the Cloud is possible to provide more resources.

So, how does this work in practice? Once you have a Pentaho Server installed, you can configure it to connect to the cluster of Pentaho Worker nodes. From that point on - things will work! No need to configure access to repositories, accesses, funky stuff. You only need to say "Execute at scale" and if the worker nodes are there, it's where things will be executed. Obviously, the "things will work" will have to obey the normal rules of clustered execution, for instance, don't expect a random node on the cluster to magically find out your file:///c:/my computer/personal files/my mom's excel file.xls.... :/

So what scenarios will this benefit the most? A lot! Now your server will not be bogged down executing a bunch of jobs and transformations as they will be handed out for execution in one of the nodes.

This does require some degree of control, because there may be cases where you don't want remote execution (for instance, a transformation to feed a dashboard). This is where Run Configurations come into play. Also important to note that even though the biggest benefits of this will be ETL work, this concept is for any kind of execution.

This is a major part of the work we're doing with the Hitachi Vantara team; by leveraging Foundry we'll be able to make huge improvements in areas we've been wanting to tackle for a while but were never able to properly address on our own: better monitoring, improving lifecycle management and active-active HA, among others. In 8.0 we leapfrogged in this worker nodes story, and we expect much more going forward!

New Theme - Ruby (EE/CE)

One of the things you'll notice is that we have a new theme that reflects the Hitachi Vantara colors. The new theme is the default on new installations (not for upgrades) and the others are still available

Data Integration

Streaming Support: Kafka (EE/CE)

In Pentaho 8.0 we're introducing proper streaming support in PDI! In case you're thinking "hum... but don't we already have a bunch of steps for streaming datasources? JMS, MQTT, etc?" you're not wrong. But the problem is that PDI is a micro batching engine, and these streaming protocols introduce issues that can't be solved with the current approach. Just think about it - a streaming datasource requires an always running transformation, and in PDI execution all steps run in different threads while the data pipeline is being processed; There are cases, when something goes wrong, where we don't have the ability to do proper error processing. It's simply not as simple as a database query or any other call where we get a finite and well known amount of data.

So we took a different approach - somewhat similar to sub-transformations but not quite... First of all, you'll see a new section in PDI:

Kafka is the one that was prioritized as being the most important for now, but this will actually be something that will be extended for other streaming sources.

The secret here is on the Kafka Consumer step:

The highlighted tabs should be generic for pretty much all the steps, and the Batch is what controls the flow. So what we did was, instead of having an always running transformation at the top level, we break the input data into chunks, either by number of records or by duration, and the second transformation takes that input and the field structure and does a normal execution. Here, the abort step was also improved to give you more control over the flow of this execution. This is actually something that's been a long standing request from the community - we can now specify if we want to abort with error or without, having an extra ability to control the flow of our ETL.

Here's an example of this thing put together:

Now, even more interesting than that is that this also works in AEL (our Adaptive Execution Layer, introduced in Pentaho 7.1), so when you run this on a cluster you'll get spark native kafka support being executed at scale, which is really nice...

Like I mentioned before, moving forward you'll see more developments here, namely:

More streaming steps, and currently MQTT seems the best candidate for the short term
(and my favorite) Developer's documentation with a concrete example so that it's easy for anyone on the community to develop (and hopefully submit) their own implementations without having to worry about the 90% of the stuff that's common to all of them

New Open / Save experience (EE/CE)

In Pentaho 7.0 we merged the servers (no more that nonsense of having a distinct "BA Server" and a "DI Server") and introduced the unified Pentaho Server with a new and great looking experience to connect to it:

but then I clicked on Open file from repository and felt sick... That thing was absolutely horrible and painfully slow. We were finally able to do something about that! Now the experience is ... well... slightly better (as in, I don't feel like throwing up anymore!):

A bit better, no? :) Also with search capabilities and all the kind of stuff that you've been expecting from a dialog like this on the past 10 years! Same for the save experience.

This is another small but IMO always important step in unifying the user experience and work towards a product that gets progressively more pleasant to use. It's a never-ending journey but that's not an excuse not to take it.

Filters in Data Explorer (EE)

Now that I was able to open my transformation, I can show some of the improvements that we did on our Data Explorer experience in PDI. We now support the first set of filters and actions! This one is easy to show but extremely powerful to use.

Here's filters - depending on the data type you'll have a few options, like excluding nulls, equals, greater/lesser than and a few others. Like mentioned, others will come with time.

Also, while previous version only allowed for drill down, we can now do more operations on the visualizations.

Run configuration: Leveraging worker nodes and execute on server (EE/CE)

Now that we are connected to the repository, opened our transformation with a really nice experience and took advantage of these data exploration improvements to make sure our logic is spot on, we are ready to execute it on the server.

Now this is where the run configuration part comes in. I have my transformation, defined it, played with it, and verified that it really works as expected on my box. And now, I will want to make sure it also runs well on the server. What before was a very convoluted process is now much simpler.

What I do is define a new Run Configuration, like described in 7.1 for AEL, but with a little twist: I don't want it to use the spark engine; I want it to use the pentaho engine but on the server, not the one local to spoon:

Now, what happens when I execute this selecting the Pentaho Server run configuration?

Yep, that!! \o/

This screenshot shows PDI triggering the execution and my Pentaho Server console logging its execution.

And if I had worker nodes configured, what I would see would be my Pentaho Server automatically dispatching the execution of my transformation to an available worker node!

This doesn't apply to the immediate execution only; We can now specify the run configuration on the job entry as well, allowing a full control of the flow of our more complex ETL

Big Data

Improvements on AEL (EE/CE apart from the security bits)

As expected, a lot of work was done on AEL. The biggest ones:

Communicates with Pentaho client tools over WebSocket; does NOT require Zookeeper
Uses distro-specific Spark library
Enhanced Kerberos impersonation on client-side

This brings a bunch of benefits:

Reduced number of steps to setup
Enable fail-over, load-balancing
Robust error and status reporting
Customization of Spark jobs (i.e. memory settings)
Client to AEL connection can be secured
Kerberos impersonation from client tool

And not to mention performance improvements... One benchmark I saw that I found particularly impressive is that AEL is practically on par with native spark execution! And this is impressive! Kudos to the team, just spectacular work!

Big Data File Formats - Avro and Parquet (EE/CE)

Big data platforms introduced various data formats to improve performance, compression and interoperability, and we added full support for these very popular big data formats: Avro and Parquet. Orc will come next.

When you run in AEL, these will also be natively interpreted by the engine, which adds a lot to the value of this.

The old steps will still be available on the marketplace but we don't recommend using them.

Big Data Security - Support for Knox

Knox provides perimeter security so that the enterprise can confidently extend Hadoop access to more of those new users while also maintaining compliance with enterprise security policies, and it is used in some Hortonworks deployments. It is now supported in the Hadoop Clusters' definition if you enable the property KETTLE_HADOOP_CLUSTER_GATEWAY_CONNECTION in the kettle.properties file.

VFS improvements for Hadoop Clusters (EE/CE)

In order to simplify the overall lifecycle of jobs and transformations we made the hadoop clusters available through VFS, on the format hc://hadoop_cluster/.
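The nice thing about the hc:// format is that the cluster is referenced by its name and everything after it is just a path. The Python sketch below only illustrates how such a URL breaks apart; the cluster name and file path are made up, and this is standard URL parsing, not anything PDI-specific.

```python
from urllib.parse import urlparse


def split_cluster_url(url):
    """Split an hc:// URL into (named cluster, path inside the cluster)."""
    parsed = urlparse(url)
    if parsed.scheme != "hc":
        raise ValueError(f"not a hadoop cluster URL: {url}")
    return parsed.netloc, parsed.path


# Hypothetical example: a file on the named cluster "hadoop_cluster".
cluster, path = split_cluster_url("hc://hadoop_cluster/data/sales/2017.csv")
```

Because the transformation only ever sees the name, pointing it at a different environment is a matter of changing what the named cluster resolves to, not of editing every file path.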

Others

There are some other generic improvements worth noting

Ops Marts extended support (EE)

Ops Mart now supports Oracle, MySQL and SQL Server. I can't really believe I'm still writing about this thing :(

PDI Mavenization (CE)

Now, this is actually nice! PDI is now fully mavenized. Go to https://github.com/pentaho/pentaho-kettle, do a mvn package and you're done!!!

-----------

Pentaho 8 will be available to download mid-November.

Learn more about Pentaho 8.0 and a webinar here: http://www.pentaho.com/product/version-8-0

Also, you can get a glimpse of PentahoWorld this week watching it live at: http://siliconangle.tv/pentaho-world-2017/

Last but not least: see you in a few weeks at the Pentaho Community Meeting in Mainz! https://it-novum.com/en/pcm17/

That's it - I'm going to the bar!

-pedro

Hello Hitachi Vantara!

2017-09-19T17:01:00.000+01:00

Ok, I admit it – I am one of those people who actually like change and view it as an opportunity. Four years ago, I announced here that Webdetails joined Pentaho. For the ones who don't know, Webdetails was the Portugal-based consulting company that then turned into Pentaho Portugal (and expanded from 20 people at the time to 60+), completely integrated into the Pentaho structure.

Two years ago, we announced that Pentaho was acquired by HDS, becoming a Hitachi Group Company.

We have a new change today - and since I'm lazy (and in Vegas, for the Hitachi Next event, and would rather be at our party at the Mandalay Bay Beach than in my room writing this blog post!), I'll simply steal the same structure I used two years ago (when Pentaho was acquired) and get straight to the point! :p

Big news

An extremely big transformation has been taking place and materialized itself today, September 19, 2017. A new company is born. Meet: Hitachi Vantara

You may be asking yourselves: Can it possibly be a coincidence that the new company is launched on the exact same day I turn 40? Well, actually yes, a complete coincidence... :/

This new company unifies the mission and operations of Pentaho, Hitachi Data Systems and Hitachi Insight Group into a single business. More info in the Pentaho blog: Hitachi Vantara - Here's what it means

What does this mean?

It has always been our goal to provide an offering that would allow customers to build their high value, data driven solutions. We were, I think, successful at doing that! And now we (Hitachi Vantara) want to take it to the next level, which is why this transformation is needed: We're aiming higher - we want not only to be the best at (big) data orchestration and analytics, we want to do so in this new IoT / social innovation ecosystem, aiming to be the biggest player in the market.

And this transformation will allow us to do that!

What will change?

So that it's clear, Pentaho, as a product will continue to exist. Pentaho, as a company, is now Hitachi Vantara.

And for Pentaho as a product, this gives us conditions we've never had to improve the product, focusing on what we need to do best (big data orchestration and analytics) and leveraging other groups in the company in areas that, even though they weren't our core focus, people expect us to have.

We'll also improve the overall portfolio interoperability. While so far we've always tried to be completely agnostic, now we'll keep saying that but add a small detail: we have to work better with our own stuff - because we can make it happen!

Community implications

This one is very easy!!! I'll just copy paste my previous answer - because it didn't change:

Throughout all the talks, our relationship and involvement with the community has always been one of the strong points of Pentaho, and seen with much interest.
The relationship between the community and a commercial company exists because it’s mutually beneficial. In Pentaho’s case, the community gets access to software it otherwise couldn’t, and Pentaho gets access to an insane amount of resources that contribute to the project. Don’t believe me? Check the Pentaho Marketplace for the large number of submissions, Jira for all the bug reports and improvement suggestions we get out of all the real world tests, and discussions on the forums or on the several available email lists.
Is anyone, in his or her right mind, willing to let all this go? Nah.
Plus, not having a community would render my job obsolete, and no one wants that, right? (don’t answer, please!)

The difference? We wanna do this bigger, better and faster!

And things are already moving in that direction. We are moving the Pentaho Community page to the Hitachi Vantara community site with some really cool interactive and social features. You can visit our new home here https://community.hitachivantara.com/community/products-and-solutions/pentaho. I look forward to engaging with all of you on this new site.

Will Hitachi Vantara shut down its Pentaho CE edition / its open source model?

I will, once again, repeat the previous answer:

Just in case the previous answer wasn’t clear enough, lemme spell it out with all the words: There are no plans of changing our opensource strategy or stop providing a CE edition to our community!
Can that change in the future? Oh, absolutely yes! Just like it could have changed in the past. And when could it change? When it stops making sense; when it stops being mutually beneficial. And on that day, I’ll be the first one to suggest a change to our model.

And speaking of which - don't forget to register for PCM17! It's going to be the best ever!

Cheers!

-pedro

Pentaho Community Meeting 2017: exciting use cases & final Call for Papers

2017-09-14T11:41:00.000+01:00

Enjoyed your vacations? Good - now let's get back in business!

The Pentaho Community Meeting 2017 in Mainz, taking place from November 10-12, is approaching and more than 140 participants interested in BI and Big Data are already on board.

Many great speakers from all over the world will present their Pentaho use cases, including data management and analysis at CERN, evaluation of environmental data at the Technical University of Liberec and administration of health information in Mozambique. And of course Matt Casters, Pedro Alves and Jens Bleuel will introduce the latest features in Pentaho.

The 10th jubilee edition features many highlights:

· Hackathon and technical presentations on FRI, Nov 10

· Conference day on SAT, Nov 11

· Dinner on SAT, Nov 11

· Get-together and drinks on SAT, Nov 11

· Social event on SUN, Nov 12

See the complete agenda here, with all presentations of the business and technical tracks on the conference day. Food and drinks will be provided. A special highlight is the CERN use case (you can read a blog post on it here).

And don’t forget: you can participate in the Call for Papers till September 30th! Send your Pentaho project to Jens Bleuel via the contact form.

Some of the speakers:

· Pedro Alves - Aka... me! All about Pentaho 8.0, which is a different way to say "hum, just put some random title, I'll figure out something later"

· Dan Keeley - Data Pipelines - Running PDI on AWS Lambda

· Francesco Corti - Pentaho 8 Reporting for Java Developers

· Pedro Vale - Machine Learning in PDI - What's new in the Marketplace?

· Caio Moreno de Souza - Working with Automated Machine Learning (AutoML) and Pentaho

· Nelson Sousa - 10 WTF moments in Pentaho Data Integration

If you haven't done so, Register Here

We are looking forward to seeing you in Mainz, which can be reached in only 20 minutes by train from Frankfurt airport or main train station!
In the meantime, follow all updates on Twitter.

-pedro, with all the content from this post shamelessly stolen from Ruth and Carolin, the spectacular organizers from IT-Novum

Pentaho Maven repository changed to nexus.pentaho.org

2017-08-14T10:59:00.002+01:00

From a recent (at the time of writing, obviously!) issue in the mondrian project we noticed we failed to notify an important change:

This morning the pentaho maven repository seems to be down.
Each download request during maven build fails with 503 error:

[WARNING] Could not transfer metadata XXX/maven-metadata.xml from/to pentaho-releases (http://repository.pentaho.org/artifactory/repo/): Failed to transfer file: http://repository.pentaho.org/artifactory/repo/XXX/maven-metadata.xml. Return code is: 503 , ReasonPhrase:Service Temporarily Unavailable.

The reason for this is that the maven url is now nexus.pentaho.org/content/groups/omni.

Here's a link to a complete ~/.m2/settings.xml config file: https://github.com/pentaho/maven-parent-poms/blob/master/maven-support-files/settings.xml
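If you have the old URL hard-coded in a pom.xml or in your own settings.xml, a tiny script can swap it for you. This is just a sketch: the function name is made up, and I'm assuming an http:// scheme for the new location, so check it against the linked settings.xml before trusting it.

```python
# Old repository URL, exactly as it shows up in the 503 warning above.
OLD_REPO = "http://repository.pentaho.org/artifactory/repo"
# Assumption: the new location is given without a scheme; http:// is a guess.
NEW_REPO = "http://nexus.pentaho.org/content/groups/omni"


def point_to_new_repo(config_text: str) -> str:
    """Return config_text with every old Pentaho repository URL replaced."""
    return config_text.replace(OLD_REPO, NEW_REPO)


if __name__ == "__main__":
    sample = "<url>http://repository.pentaho.org/artifactory/repo/</url>"
    print(point_to_new_repo(sample))
```

Run it over the text of the file, review the diff, and only then write it back.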

-pedro

PCM17 - Pentaho Community Meeting: November 10-12, Mainz

2017-07-26T21:21:00.002+01:00

PCM17 - 10th Edition

One of my favourite blog posts of the year - Announcing PCM17. And this year, for the 10th edition, we're going back to the beginning - Mainz in Germany.

Location

Location address: Kupferbergterrasse, Kupferbergterrasse 17-19, 55116 Mainz. Close to Frankfurt, Germany

Event

We're maintaining the schedule of the previous years: A meet-up on Friday for drinks preceded by a hackathon; A meet-up on Saturday for drinks preceded by a bunch of presentations of really cool stuff; A meet-up on Sunday for drinks preceded by a city sightseeing! You get the idea.

All the information....

Here: https://it-novum.com/en/pcm17/! IT-Novum is doing spectacular work organizing this event, and you'll find all the information needed, from instructions on how to get there to suggestions for hotels to stay in.

Registration and Call for Presentations

Please go to the #PCM17 website to register and also to send us a presentation proposal!

Cheers!

-pedro

A consulting POV: Stop thinking about Data Warehouses!

2017-06-27T15:22:00.001+01:00

A consulting POV: Stop thinking about Data Warehouses!

What I am writing here is the materialization of a line of thought that started bothering me a couple of years ago. While I implemented project after project, built ETLs, optimized reports, designed dashboards, I couldn’t help thinking that something didn’t quite make sense, but couldn’t quite see what. When I tried to explain it to someone, I just got blank stares…

Eventually things started to make more sense to me (which is far from saying they actually make sense, as I’m fully aware my brain is, hum, let’s just say a little bit messed up!) and I ended up realizing that I’d been looking at the challenges from the wrong perspective. And while this may seem a very small change in mindset (especially if I fail in passing the message, which may very well happen), the implications are huge: not only did it change our methodology on how to implement projects in our services teams, it’s also guiding Pentaho’s product development and vision.

A few years ago, in a blog post far, far away...

A couple of years ago I wrote a blog post called “Kimball is getting old”. It focused on one fundamental point: technology was evolving to a point where just looking at the concept of an enterprise data warehouse (EDW) seemed restrictive. After all, the end users care only about information; they couldn’t care less about what gets the numbers in front of them. So I proposed that we should apply a very critical eye to our problem, and maybe, sometimes, Kimball’s DW, with its star schemas, snowflakes and all that jazz wasn’t the best option and we should choose something else…

But I wasn’t completely right…

I’m still (more than ever?) a huge proponent of the top down approach: focus on usability, focus on the needs of the user, provide him a great experience. All the rest follows. All of that is still spot on.

But I made 2 big mistakes:

1. I confused data modelling with data warehouse

2. I kept seeing data sources conceptually as the unified, monolithic source of every insight

Data Modelling – the semantics behind the data

Kimball was a bloody genius! Actually, my mistake here was due to the fact that he is way smarter than everyone else. Why do I say this? Because he didn’t come up with one, but with two groundbreaking ideas...

First, he realized that the value of data, business-wise, comes when we stop considering it as just zeros and ones and start treating it as business concepts. That’s what Data Modelling does: by adding semantics to raw data, it immediately gives it meaning that makes sense to a wide audience of people. And this is the part that I erroneously dismissed. This is still spot on! All his concepts of dimensions, hierarchies, levels and attributes are relevant first and foremost because that’s how people think.

And then, he immediately went prescriptive and told us how we could map those concepts to database tables and answer the business questions with relational database technology with concepts like star schemas, snowflake, different types of slowly changing dimensions, aggregation techniques, etc.

He did such a good job that he basically shaped how we worked. How many of us were involved in projects where we were told to build data warehouses to give all possible answers when we didn’t even know the questions? I’m betting a lot; I certainly did that. We were taught to provide answers without focusing on understanding the questions.

Project complexity is growing exponentially

Classically, a project implementation was simply around reporting on the past. We can’t do that anymore; If we want our project to succeed, it can’t just report on the past: It also has to describe the present and predict the future.

There’s also the explosion in the amount of data available.

IoT brought us an entire new set of devices that are generating data we can collect.

Social media and behavior analysis brought us closer to our users and customers.

In order to be impactful (regardless of how “impact” is defined), a BI project has to trigger operational actions: schedule maintenances, trigger alerts, prevent failures. So, bring on all those data scientists with their predictive and machine learning algorithms...

On top of that, in the past, we might have been successful at convincing our users that it’s perfectly reasonable to wait a couple of hours for that monthly sales report that processed a couple of gigabytes of data. We all know that’s changed; if they can search the entire internet in less than a second, why would they waste minutes on a “small” report?? And let’s face it, they’re right…

The consequence? It’s getting much more complex to define, architect, implement, manage and support a project that needs more data, more people, more tools.

Am I making all of this sound like a bad thing? On the contrary! This is a great problem to have! In the past, BI systems were confined to delivering analytics. We’re now given the chance to have a much bigger impact in the world! Figuring this out is actually the only way forward for companies like Pentaho: We either succeed and grow, or we become irrelevant. And I certainly don’t want to become irrelevant!

IT’s version of Heisenberg’s Uncertainty Principle: Improving both speed and scalability??

So how do we do this?

My degree is actually in Physics (don’t pity me, took me a while but I eventually moved away from that), and even though I’m a really crappy one, I do know some of the basics…

One of the most well-known principles in physics is Heisenberg’s Uncertainty Principle: you cannot know both the speed (strictly speaking, the momentum) and the location of a (sub-)atomic particle with full precision. But you can have precise knowledge of one to the detriment of the other.

I’m very aware this analogy is a little bit silly (to say the least) but it’s at least vivid enough on my mind to make me realize that we can’t expect in IT to solve both the speed and scalability issue – at least not to a point where we have a one size fits all approach.

There have been spectacular improvements in distributed computing technologies, but all of them have their pros and cons; the days when one database was good for all use cases are long gone.

So what do we do for a project where we effectively need to process a bunch of data and at the same time it has to be blazing fast? What technology do we choose?

Thinking “data sources” slightly differently

When we think about data sources, there are 2 traps most of us fall into:

1. We think of them as a monolithic entity (eg: Sales, Human Resources, etc) that hold all the information relevant to a topic

2. We think of them from a technology perspective

Let me try to explain this through an example. Imagine the following customer requirement, here in the format of a dashboard, but it could very well be any other delivery format (yeah, cause a dashboard, a report, a chart, whatever, is just the way we choose to deliver the information):

Pretty common, hum?

The classical approach

When thinking about this (common) scenario from the classical implementation perspective, the first instinct would be to start designing a data warehouse (doesn’t even need to be an EDW per se, could be Hadoop, a NoSQL source, etc). We would build our ETL process (with PDI or whatever) from the source systems, and there would always be a stage of modelling so we could get to our Sales data source that could answer all kinds of questions.

After that is done, we’d be able to write the necessary queries to generate the numbers our fictitious customer wants.

And after a while, we would implement a solution architecture diagram similar to this, that I’m sure looks very similar to everything we’ve all been doing in consulting:

Our customer gets the numbers he wants; he’s happy and successful. So successful that he expands, does a bunch of acquisitions, gets so much data that our system starts to become slow. The sales “table” never stops growing. It’s a pain to do anything with it… Part of our dashboard takes a while to render… we’re able to optimize part of it, but other areas become slow.

In order to optimize the performance and allow the system to scale, we consider changing the technology. From relational databases to vertical column store databases, to nosql data stores, all the way through Hadoop, in a permanent effort to keep things scaling and fast…

The business’ approach

Let’s take a step back. Looking at our requirements, the main KPI the customer wants to know is:

How much did I sell yesterday and how is that compared to budget?

It’s one number he’s interested in.

Look at the other elements: He wants the top reps for the month. He wants a chart for the MTD sales. How many data points is that? 30 tops? I’m being simplistic on purpose, but the thing is that it is extremely stupid to force ourselves to always go through all the data when the vast majority of the questions aren’t a big data challenge in the first place. Answering them may need big data processing and orchestration, but certainly not at runtime.

So here’s how I’d address this challenge

I would focus on the business question. I would not do a single Sales datasource. Instead, I’d define the following Business Data Sources (sorry, I’m not very good at naming stuff..), and I’d force myself to define them in a way where each of them contains (or outputs) a small set of data (up to a few million at most):

· ActualVsBudgetThisMonth

· CustomerSatByDayAndStore

· SalesByStore

· SalesRepsPerformance

Then I’d implement these however I needed! Materialized, unmaterialized, database or Hadoop, whatever worked. But through this exercise we define a clear separation between where all the data is and the most common questions we need to answer in a very fast way.
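To make the "think of them as an API" idea a bit more concrete, here's a minimal sketch in Python. Everything in it is made up for illustration: the sample rows, the function names and the registry are mine, and only two of the four data sources are shown. The point is that the consumer asks for a business data source by name and never sees where the data lives.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical raw rows: (store, sales_rep, amount). In a real project this
# would be the big store (database, Hadoop, ...), not an in-memory list.
SalesRow = Tuple[str, str, float]
RAW_SALES: List[SalesRow] = [
    ("Lisbon", "Ana", 120.0),
    ("Lisbon", "Rui", 80.0),
    ("Porto", "Ana", 50.0),
]


def sales_by_store(rows: List[SalesRow]) -> Dict[str, float]:
    """Small, pre-aggregated answer: total sales per store."""
    totals: Dict[str, float] = defaultdict(float)
    for store, _rep, amount in rows:
        totals[store] += amount
    return dict(totals)


def sales_reps_performance(rows: List[SalesRow]) -> Dict[str, float]:
    """Small, pre-aggregated answer: total sales per sales rep."""
    totals: Dict[str, float] = defaultdict(float)
    for _store, rep, amount in rows:
        totals[rep] += amount
    return dict(totals)


# The registry is the "API": names map to whatever implementation works today.
BUSINESS_DATA_SOURCES: Dict[str, Callable[[List[SalesRow]], Dict[str, float]]] = {
    "SalesByStore": sales_by_store,
    "SalesRepsPerformance": sales_reps_performance,
}


def query(name: str, rows: List[SalesRow] = RAW_SALES) -> Dict[str, float]:
    """Answer a business question by name, hiding the implementation."""
    return BUSINESS_DATA_SOURCES[name](rows)


if __name__ == "__main__":
    print(query("SalesByStore"))
```

Swapping one of these functions for a materialized table or a Hadoop job later on wouldn't change anything for whoever calls query.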

Does something like this give us all the liberty to answer all the questions? Absolutely not! But at least for me it doesn’t make a lot of sense to optimize a solution to give answers when I don’t even know what the questions are. And the big data store is still there somewhere for the data scientists to play with.

Like I said, while the differences may seem very subtle at first, here are some advantages I found of thinking through solution architecture this way:

· Faster to implement – since our business datasources’ signature is much smaller and well identified, it’s much easier to fill in the blanks

· Easier to validate – since the datasources are smaller, they are easier to validate with the business stakeholders as we lock them down and move to other business data sources

· Technology agnostic – note that at no point did I mention technology choices. Think of these datasources as an API

· Easier to optimize – since we split a big data source into multiple smaller ones, they become easier to maintain, support and optimize

Concluding thoughts

Give it a try – this will seem odd at first, but it forces us to think differently. We spend so much time worrying about the technology that more often than not we forget what we’re here to do in the first place…

-pedro

Pentaho 7.1 is available!

2017-05-22T20:09:00.001+01:00

Pentaho 7.1 is out

Remember when I said at the time of the previous release that Pentaho 7.0 was the best release ever? Well, that was true until today! But not any more, as 7.1 is even better! :p

Why do I say that? It's a big step forward in the direction we've been aiming - consolidating and simplifying our stack, not passing the complexity to the end user.

These are the main features in the release:

Visual Data Experience

Data Exploration (PDI)

Drill Down
New Viz's: Geo map, sunburst, Heat Grid
Tab Persistency
Several other improvements including performance

Viz API 3.0 (Beta)

Viz API 3.0, with documentation
Rollout of consistent visualizations between Analyzer, PDI and Ctools

Enterprise Platform

VCS-friendly features

File / repository abstraction
PDI files properly indented
Repository performance improvements

Reintroducing Ops Mart
New default theme on User Console
Pentaho Mobile deprecation

Big Data Innovation

AEL - Adaptive Execution Layer (via Spark)
Hadoop Security

Kerberos Impersonation (for Hortonworks)
Ranger support

Microsoft Azure HD Insights shim

I'm getting tired just of listing all this stuff... Now into a bit more detail, and I'll jump back and forth in these different topics ordering by the ones that... well, that I like the most :p

Adaptive Execution with Spark

This is huge; We've decoupled the execution engine from PDI so we can plug in other engines. Now we have 2:

Pentaho - the classic Pentaho engine
Spark - you've guessed it...

What's the goal of this? Making sure we treat our ETL development with a pay as you go approach; First, we worry about the logic, then we select the engine that makes most sense.

AEL execution of Spark

One of the things people need to do on other tools (and even on our own tools, that's why I don't like our own approach to Pentaho Map Reduce) is think, right from the start, about the engine and technology they're going to use. But this makes little sense.

Scale as you go

Pentaho’s message is one of future-proofing the IT architecture, leveraging the best of what the different technologies have to offer without imposing a certain configuration or persona as the starting point. The market is moving towards a demand for BA/DI to come together in a single platform. Pentaho has an advantage here, as we have seen with our customers that BI and DI are better together, and that is what sets us apart from the competition. Gartner predicts that BI and Discovery tool vendors will partner to accomplish this. Larger, proprietary vendors will attempt to build these platforms themselves. With this approach from the competition, Pentaho has a unique and early lead in delivering this platform.

A good example is the story we can tell about governed blending. We don’t need to impose on customers any pre-determined configuration; we can start with the simple use of data services and unmaterialized data sets. If it’s fast enough, we’re done. If not, we can materialize the data into a database or even an enterprise data warehouse. If it’s fast enough, we’re done. If not, we can resort to other technologies – NoSQL, Lucene-based engines, etc. If it’s fast enough, we’re done. If everything else fails, we can set up an SDR blueprint, which is the ultimate scalability solution. And throughout this entire journey we never let go of the governed blending message.
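The "if it's fast enough, we're done" ladder is simple enough to write down. Here's a rough sketch in Python, with completely made-up response times and a hypothetical helper name, just to show the shape of the decision: walk from the simplest option to the most complex one and stop at the first that fits the budget.

```python
from typing import Callable, List, Optional, Tuple

# Each rung is (label, function returning the measured response time in seconds).
Rung = Tuple[str, Callable[[], float]]


def first_fast_enough(ladder: List[Rung], budget_seconds: float) -> Optional[str]:
    """Walk the ladder from simplest to most complex and stop at the first
    option whose measured response time fits the budget."""
    for label, measure in ladder:
        if measure() <= budget_seconds:
            return label
    return None  # nothing fit: time to rethink the requirement


# Made-up timings, purely for illustration.
LADDER: List[Rung] = [
    ("data services, unmaterialized", lambda: 12.0),
    ("materialized in a database", lambda: 4.0),
    ("NoSQL / Lucene-based engine", lambda: 0.8),
    ("SDR blueprint", lambda: 0.3),
]

if __name__ == "__main__":
    print(first_fast_enough(LADDER, budget_seconds=5.0))
```

In a real project the lambdas would be actual measurements against the customer's data, not constants.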

This is an insanely powerful and differentiated message: we allow our customers to start simple, and only go down the more complex routes when needed. When going down a single path, a user knows, accepts and sees the value in the extra complexity needed to address scalability.

Adaptive Execution Layer

The strategy described for the “Logical Data Warehouse” is exactly the one we need for the execution environment. A lot of times customers get hung up on a certain technology without even understanding if they actually need it. Countless times we’ve seen customers asking for Spark without a use case that justifies it. We have to challenge that.

We need to move towards a scenario where the customer doesn’t have to think about technology first. We’ll offer one single approach and ways to scale as needed. If a data integration job works on a single Pentaho Server, why bother with other stacks? If it’s not enough, then making the jump to something like Map Reduce or Spark has to be a linear move.

The following diagram shows the Adaptive Execution Layer approach just described

AEL conceptual diagram

Implementation in 7.1 - Spark

For 7.1 we chose Spark as the first engine to implement for AEL. It has seen a lot of adoption, and the fact that it's not restricted to a map reduce paradigm makes it a good candidate to separate business logic and execution.

How to make it work? This high definition conceptual diagram should help me explain it:

An architectural diagram so beautiful it should almost be roughly correct

We start by generating a PDI Driver for Spark from our own PDI instance. This is a very important starting point because using this methodology we ensure that any plugins we may have developed / installed will work when we run the transformation - we couldn't let go of the extensibility capabilities of Pentaho

That driver will be installed on an edge node of the cluster, and that's what will be responsible for executing the transformation. Note that by using Spark we're leveraging all its characteristics: namely, we don't even need a cluster, as we can select whether we want to use Spark standalone or YARN mode, even though I suspect the majority of users will be on YARN mode leveraging the clustering capabilities.

Runtime flow

One of the main capabilities of AEL is that we don't need to think about adapting the business logic to the engine; We develop the transformation first and then we select where we want to execute. This is how this will work from within Spoon:

Creating and selecting a Spark run configuration

We created the concept of a Run Configuration. Once we select a run configuration set up to use Spark as the engine, PDI will send the transformation to the edge node and the driver will then execute it.

All transformation steps in PDI will run in AEL-Spark! This was the thought from the start. And to understand how this works, there are 2 fundamental concepts to understand:

Some steps are safe to run in parallel, while others are not parallelizable or not recommended to run in clustered engines such as Spark. Steps that take one row as input and produce one row as output (calculator, filter, select values, etc.) are all parallelizable. Steps that require access to other rows, or depend on the position and order in the row set, still run on Spark but have to run on the edge node, which implies a collect of the RDDs (Spark's datasets) from the nodes. It is what it is. And how do we know which is which? We simply tell PDI which steps are safe to run in parallel and which are not.
Some steps can leverage Spark's native APIs for performance and optimization. When that's the case, we can pass to PDI a native implementation of the step, greatly increasing the scalability at possible bottleneck points. Examples of these steps are the Hadoop file inputs, HBase lookups, and many more.
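Conceptually, the first point boils down to a lookup: each step is either on the "safe to parallelize" list or it isn't. Here's a toy sketch of that idea in Python. To be clear, this is not PDI's actual API or configuration format; the set, the function and the labels are hypothetical, and "Sort rows" is just my example of a step that depends on row order.

```python
from typing import List, Tuple

# Hypothetical whitelist of row-in / row-out steps that are safe to parallelize.
PARALLEL_SAFE = {"Calculator", "Filter rows", "Select values"}


def plan(steps: List[str]) -> List[Tuple[str, str]]:
    """Label each step with where it would run: spread across the cluster
    executors, or collected and run on the edge node."""
    return [
        (step, "executors" if step in PARALLEL_SAFE else "edge node")
        for step in steps
    ]


if __name__ == "__main__":
    for step, where in plan(["Calculator", "Filter rows", "Sort rows"]):
        print(f"{step}: {where}")
```

Anything not on the list falls back to the edge node, which is the conservative default: slower, but correct.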

Feedback please!

Even though running on secured clusters (and leveraging impersonation) is an EE capability only, AEL is also available in CE. The reason for that is that we want to get help from the community in testing, hardening, nativizing more steps and even writing more engines for AEL. So go and kick the tires of this thing! (and I'll surely do a blog post on this alone)

Visual Data Experience (PDI) Improvements

This is one of my favorite projects. You may be wondering what's the real value of having this improved data experience in PDI, why is this all that exciting... Let me tell you why: This is the first materialization of something that we hope becomes the way to handle data in Pentaho regardless of where we are. So this thing that we're building in PDI will eventually make its way to the server... I'd like to throw away all the technicalities that we expose in our server (analyzer for olap, pir for metadata, prd for dashboards....) into a single content driver approach and usability experience. This is surely starting to sound confusing, so I better stop here :p

In the 7.1 release, Pentaho provides new Data Explorer capabilities to further support the following key use cases more completely:

Data Inspection: During the process of cleansing, preparing, and onboarding data, organizations often need to validate the quality and consistency of data across sources. Data Explorer enables easier identification of these issues, informing how PDI transformations can be adjusted to deliver clean data.
BI Prototyping: As customers deliver analytic ready data to business analysts, Data Explorer reduces the iterations between business and IT. Specifically, it enables the validation of metadata models that are required for using Pentaho BA. Models can be created in PDI and tested in Data Explorer, ensuring data sources are analytics-ready when published to BA.

And how? By adding these improvements:

New visualization: Heatgrid

This chart can display 2 measures (metrics) and 2 attributes (categories) at once. Attributes are displayed on the axes and measures are represented by the size and color of the points on the grid. It is most useful for comparing metrics at the ‘intersection’ of 2 dimensions, as seen in the comparisons of quantity and price across combinations of different territories and years below (did I just define what a heatgrid is?! No wonder it's taking me hours to write this post!):

Look at all those squares!
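If you want to picture the data behind a heat grid, it's one cell per combination of the two attributes, each carrying the two measures. Here's a small Python sketch with made-up rows; summing the measures is my assumption for the example, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Made-up rows: (territory, year, quantity, price).
ROWS: List[Tuple[str, int, int, float]] = [
    ("EMEA", 2003, 10, 5.0),
    ("EMEA", 2003, 5, 7.0),
    ("APAC", 2004, 8, 3.0),
]


def heat_grid_cells(
    rows: List[Tuple[str, int, int, float]],
) -> Dict[Tuple[str, int], Tuple[int, float]]:
    """Aggregate rows into one cell per (territory, year) pair.

    Each cell holds (total quantity, total price): one measure would drive
    the size of the point, the other its color.
    """
    cells: Dict[Tuple[str, int], List[float]] = defaultdict(lambda: [0, 0.0])
    for territory, year, quantity, price in rows:
        cell = cells[(territory, year)]
        cell[0] += quantity
        cell[1] += price
    return {key: (values[0], values[1]) for key, values in cells.items()}


if __name__ == "__main__":
    for key, measures in heat_grid_cells(ROWS).items():
        print(key, measures)
```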

New visualization: Sunburst

A pie chart on steroids that can show hierarchies. Less useless than a normal piechart!

Circles are also pretty!

New visualization: Geo Maps

The geo map uses the same auto-geocoding as Analyzer, with out of box ability to plot latitude and longitude pairs, all countries, all country subdivisions (state/province), major cities in select countries, as well as United States counties and postal codes.

Geo Map visualization

Drill down capabilities

When using dimensions in Data Explorer charts or pivot tables, users can now expand hierarchies in order to see the next level of data. This is done by double clicking a level in the visualization (for instance, double click a ‘country’ bar in a bar chart to drill down to ‘city’ data).

Drill down in the visualizations...

This can be done through the visualizations or through the labels / axis. Once again, look at this as the beginning of a coherent way to handle data exploration!

... or from where it makes more sense

And this is only the first of a new set of actions we'll introduce here...

Analysis persistency

In 7.0 these capabilities were a one-time inspection only. Now we've taken a step further - they get persisted with the transformations. You can now use them to validate the data, get insights right on the spot, and make sure everything is lined up to show to the business users.

Analysis persistency indicator

Viz Api 3.0

Every old timer knows how much disparity we've had throughout the stack in terms of offering a consistent visualization. This is not an easy challenge to solve - the reason they are different is that different parts of our stack were created in completely different times and places, so a lot of different technologies were used. An immediate consequence is that we can't just add a new viz and expect it to be available in several places of the stack.

We're been working on a visualization layer, codenamed VizAPI (for a while, actually, but now we reached a point where we can make it available on beta form), that brings this so needed consistency and consolidation.

Viz API compatible containers

In order to make this effort worthwhile, we needed the following solve order:

Define the VizAPI structure
Implement the VizAPI in several parts of the product
Document and allow users to extend it

And... we did it. We re-implemented all the visualizations in this new VizAPI structure, adapted 3 containers - Analyzer, Ctools and DET (Data Exploration) in PDI, and as a consequence, the look and feel of the visualizations are the same

Analyzer visualizations are now much better looking _and_ usable

One important note though - migration users will still default to the "old" VizAPI (yeah, we called it the same as well, isn't that smart :/ ) not to risk interfering with existing installations. In order for you to test an existing project with the new visualizations you need to change the VizAPI version number in analyzer.properties. New installs will default to the new ones.

In order to allow people to include their own visualization and promote more contributions to Pentaho (I'd love to start seeing more contributions to the marketplace with new and shiny Viz's), we need to really make it easy for people to know how to create them.

And I think we did that! Even though this will require its own blog post, just take a look at the documentation the team prepared for this

Instructions for how to add new visualizations

You'll see this documentation has beta written on it. The reason is simple - we decided to put it out there, collect feedback from the community and implement any changes / fine tunes / etc before 8.0 timeframe, where we'll lock this down, guaranteeing long term support for new visualizations

MS HD Insights

HD Insights (HDI) is a hosted Hadoop cluster that is part of Microsoft’s Azure cloud offering. HDI is based on Hortonworks Data Platform (HDP). One of the major differences between the standard HDP release and HDI’s offering is the storage layer. HDI connects to local cluster storage via HDFS or to Azure Blob Storage (ABS) via a WASB protocol.
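
To make the storage difference a bit more concrete, here is a minimal Python sketch of how the two kinds of locations are typically addressed. The host, container and account names are made up for illustration, and the `wasb://<container>@<account>.blob.core.windows.net/<path>` layout is the commonly documented form of a WASB URI, not something taken from the PDI shim itself.

```python
def hdfs_uri(namenode: str, port: int, path: str) -> str:
    """Location on the cluster's own HDFS storage."""
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"


def wasb_uri(container: str, account: str, path: str, secure: bool = False) -> str:
    """Location in Azure Blob Storage, addressed through the WASB scheme.

    'wasbs' is the TLS variant of the same scheme.
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical names, for illustration only
print(hdfs_uri("headnode0", 8020, "/data/sales"))
print(wasb_uri("rawdata", "mystorageaccount", "/data/sales"))
```

The point is simply that the same logical path can live in two different storage layers, and the URI scheme tells the cluster which one to talk to.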

We now have a shim that allows us to leverage this cloud offering, something we've seen getting more and more interest in the market.

Hortonworks security support

This is a continuation of the previous release, available on the Enterprise Edition (EE)

Added support for Hadoop user impersonation

Earlier releases of PDI introduced enterprise security for Cloudera, specifically, Kerberos Impersonation for authentication and integration with Apache Sentry for authorization.

This release of PDI extends these enterprise-level security features to Hortonworks’s Hadoop distribution as well. Kerberos Impersonation is now supported on Hortonworks’s HDP. For authorization, PDI integrates with Apache Ranger, an alternative OSS component included in the HDP security platform.

Data Processing-Enhanced Spark Submit and SparkSQL JDBC

Earlier PDI and BA/Reporting releases broadened access to Spark for querying and preparing data through a dedicated transformation step, Spark Submit, and Spark SQL JDBC.

This release extends these existing features to additional vendors so that they can be used more widely. Apart from the additional vendors, these features have now been certified with a more up-to-date version of Spark 2.0.

Additional big data infrastructure vendors supported for these functionalities apart from Cloudera and Hortonworks:

Amazon EMR
MapR
Azure HD Insights

VCS Improvements

Repository agnostic transformations and jobs

Currently there are some specific step interfaces (the sub-transformation one being the most impactful) where the ETL dev has to choose, upfront, whether he's using a file on the file system or the repository. This prevents us from abstracting the environment we're working in, so checking things out from git/svn and just importing them is a no-go.

Here's an example of a step that used this:

The classic way to reference dependent objects

In general, we need the linkage to other artifacts (sub-jobs and sub-transformations) to be independent of the repository or file system used.

The linkage needs to work in all environments whether it is a repository (Pentaho, Database, File) or File Based system (kjb and ktr).

The linkage needs to work independently of the execution system: On the Pentaho Server, on a Carte Server (with a repository or file based system), in Map Reduce and future execution systems as part of the Adaptive Execution System (AES)

So we turned this into something much simpler:

The current approach to define dependencies

We just define where the transformation lives. This may seem a "what, just this??" moment, but now we can just work locally, remotely, check into a repository, even automate the promotion and control the lifecycle in between different installation environments. I'm absolutely sure that existing users will value this a lot (as we can deprecate the stupid file-based repository)

KTR / KJB XML format

We did something very simple (in concept), but very useful. While we absolutely don't recommend playing around with the job and transformation files (they are plain old XML files), we guaranteed that they are properly indented. Why? Cause when we use a version control system (git / svn, don't care which as long as you USE one!), you can easily identify what changes happened from version to version
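
To see why indentation matters so much for version control, here is a small Python sketch. The XML is a toy fragment I made up (not a complete .ktr file); it compares what a line-based diff reports when the file is one long line versus properly indented.

```python
import difflib
import xml.dom.minidom

# Toy fragment only: not a complete .ktr file.
OLD = ("<transformation><info><name>load_sales</name></info>"
       "<step><name>Read CSV</name><type>CsvInput</type></step></transformation>")
NEW = OLD.replace("Read CSV", "Read sales CSV")


def indent(xml_text):
    """Return the XML as a list of indented lines, one element per line."""
    pretty = xml.dom.minidom.parseString(xml_text).toprettyxml(indent="  ")
    return [line for line in pretty.splitlines() if line.strip()]


def changed_lines(old_lines, new_lines):
    """Return only the added/removed lines of a unified diff (no headers)."""
    diff = difflib.unified_diff(old_lines, new_lines, lineterm="", n=0)
    return [l for l in diff
            if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]


one_line = changed_lines([OLD], [NEW])            # whole file flagged as changed
indented = changed_lines(indent(OLD), indent(NEW))  # only the renamed step shows up

for line in indented:
    print(line)
```

With the single-line version the diff flags the entire file; with the indented version it points straight at the one element that changed, which is what you want to see in a git or svn history.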

Repository performance improvements

We want you to use the Pentaho Repository. And till now, performance while browsing that repository from Spoon was crap (there's no other way to say it!). We addressed that - it's now about 100x faster to browse and open files from the repository

Operations Mart Updates

Also known as the ops mart, available in EE. Used to work. Then it stopped working. Now it's working again. Yay :/

I'll skip this one. I hate it. We're working on a different way to handle monitoring on our product, and at scale

Other Data Integration Improvements

Apart from all the big new features above, some smaller data integration enhancements have been added to the product to make building data pipelines with Pentaho easier.

Metadata Injection Enhancement

Metadata Injection enables creating generalized ETL transformations whose behavior can be changed at run-time and thus significantly improves data integration developer agility and productivity.

In this release, a new option for injecting constants has been added to Metadata Injection, which helps make steps more dynamic.

Metadata Injection support has also been extended to the Analytic Query and Dimension Lookup/Update steps, making them much more dynamic. This will improve the Data Warehouse & Customer 360 blueprints and similar analytic data pipelines.
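
As a rough mental model only (PDI does this in Java through its own injection step; every name below is made up for illustration), metadata injection boils down to filling the properties of a template step at run-time, either from a row of metadata or from a constant:

```python
from copy import deepcopy

# Hypothetical template: a step whose properties are left empty at design time
TEMPLATE = {"step": "Dimension Lookup/Update",
            "target_table": None, "key_field": None, "commit_size": None}

# Where each property comes from: a metadata field, or a constant value
MAPPING = {
    "target_table": ("field", "table"),
    "key_field": ("field", "key"),
    "commit_size": ("constant", 1000),
}


def inject(template, mapping, metadata_row):
    """Return a configured copy of the template; the template itself is untouched."""
    step = deepcopy(template)
    for prop, (kind, value) in mapping.items():
        if kind == "field":
            step[prop] = metadata_row[value]   # value names a metadata field
        elif kind == "constant":
            step[prop] = value                 # value is used as-is
        else:
            raise ValueError(f"unknown source kind: {kind}")
    return step


METADATA = [{"table": "dim_customer", "key": "customer_id"},
            {"table": "dim_product", "key": "product_id"}]

steps = [inject(TEMPLATE, MAPPING, row) for row in METADATA]
for s in steps:
    print(s)
```

One generic template, two differently configured steps: that is the productivity gain being described.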

Lineage Collection Enhancement

Customers can now configure the location for the lineage output, including writing it to a VFS location. This helps customers maintain lineage in clustered / transient node environments, such as Pentaho MapReduce. Lineage information helps with customers' data compliance and security needs.

XML Input Step Enhancement

The XML Input Stream (StAX) step has been updated to receive XML from a previous step. This makes it easier to develop XML processing in a data pipeline when you are working with XML data.
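
For readers who haven't used a streaming XML parser, here is a small Python analogue of the idea. It is not the PDI step (that one is Java and StAX based); it just shows rows arriving from a "previous step" with an XML payload in one field, parsed event by event. The field and tag names are invented.

```python
import io
import xml.etree.ElementTree as ET


def stream_elements(xml_text, tag):
    """Yield one dict per matching element, parsing the XML incrementally."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield dict(elem.attrib, text=(elem.text or "").strip())
            elem.clear()  # free the element once it has been emitted


def process_rows(rows, xml_field, tag):
    """Expand each incoming row into one output row per XML element found."""
    for row in rows:
        passthrough = {k: v for k, v in row.items() if k != xml_field}
        for item in stream_elements(row[xml_field], tag):
            yield {**passthrough, **item}


# Hypothetical rows coming from a previous step
incoming = [{"id": 1,
             "payload": "<orders><order sku='A'>2</order><order sku='B'>5</order></orders>"}]

for out in process_rows(incoming, "payload", "order"):
    print(out)
```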

New Mobile approach (and the deprecation of Pentaho Mobile)

We used to have a mobile specific plugin, introduced in a previous Pentaho release, that enabled touch gestures to work with analyzer.

But while it sounds good, in fact it didn't work as we'd expected. The fact that we had to develop and maintain a completely separate access to information caused that mobile plugin to become very outdated.

On top of that, the maturity of the browsers on mobile devices and the increased power of tablets make it possible for Pentaho reports and analytic views to be accessed directly, without any specialized mobile interface. Thus, we are deprecating the Pentaho mobile plug-in and investing in the responsive capabilities of the interface.

It sounds bad? Actually it's not - just use your tablet to access your EE pentaho, looks great :)

Pentaho User Console Updates

Sapphire theme in PUC

Starting in Pentaho 7.1, Onyx will be deprecated and removed from the list of available themes in PUC. In addition, a new theme “Sapphire” has been introduced in 7.0. As of Pentaho 7.1, Sapphire will be PUC’s default selected theme. Crystal will be the available alternative.

Moreover, a newly refreshed log-in screen has been implemented in Pentaho 7.1, based on the new Sapphire theme that was introduced in Pentaho 7.0. This is something that was already in 7.0 CE and now it's the default for EE as well.

-----------------------

As usual, you can get EE from here and CE from here

This is a spectacular release! I should be celebrating! But instead, it's 8pm, I'm stuck in the office writing this blog post, and already very very stressed because I have all my 8.0 work stuff already piling up on my inbox... :(

I'm out, have fun!

-pedro

PentahoDay 2017 - Brazil, Curitiba, May 11 and 12

2017-04-19T14:57:00.000+01:00

PentahoDay 2017 - Brazil, Curitiba, May 11 and 12

After a pause to rest in 2016, the biggest Pentaho event organized by the community is back. 2 days, May 11 and 12, dozens of presentations, use cases, even hands-on mini-labs will happen in Curitiba, Brazil.

Pentaho Day speakers

400 attendees or more are expected at this huge event. It's really amazing, so if you're even near South America, be there!

Register here

Building Pentaho Platform from source and debugging it

2017-04-03T11:58:00.000+01:00

After all, if it's open source, it means we can compile it, right?

I'm sure you've guessed by now this is not an original image from me even though I've been told I'm very good at drawing stuff - and I always believe my daughter!

Sure - but sometimes it's not as easy as it seems. However, we're doing a huge consolidation effort to streamline all our build processes. Historically, each project, especially the older ones (kettle, mondrian, prd, ctools), used its own build method, depending on the author's personal stance on them (and boy, there are some heavy opinions in here...)

Personally, I come from the CCLJTMHIWAPMIS school of thought (for the ones not familiar with it, the acronym means Couldn't Care Less Just Tell Me How It Works And Please Make It Simple, very popular especially among lazy Portuguese people).

And we're now doing this, slowly and surely, to all projects, as you can see from browsing through Pentaho's Github.

So let's take a look at an example - building Pentaho Platform from source. Please note that we'll try to make sure the project's README.md contains the correct instructions. Also, this won't work for all versions, as we don't backport these changes; in the case of Pentaho Platform, this works for master and will appear in 7.1. Others will have their own timeline.

Compiling Pentaho Platform

1. Clone it from source

Ok, so step one, clone it from source:

$ git clone https://github.com/pentaho/pentaho-platform.git

(or use git:// if you already have a user)

2. Set up your m2 config right

Before compiling it, you need to set some stuff in your maven settings file. In your home directory, under the .m2 folder, place this settings file. If you already have an m2 settings file, that means you're probably familiar with maven in the first place and will know how to merge the two. Don't ask me, I have no clue.

If you're wondering why we need a specific settings file... I wonder too, but since my laziness is bigger than my curiosity (CCLJTMHIWAPMIS, remember?) I think I zoned out when they were explaining it to me and now I forgot.

3. Build it

This one is easy :)

$ mvn clean install

or the equivalent without the tests:

$ mvn clean package -Dmaven.test.skip=true

If all goes well, you should see

[INFO]
[INFO] --- maven-site-plugin:3.4:attach-descriptor (attach-site-descriptor) @ pentaho-server-ce ---
[INFO]
[INFO] --- maven-assembly-plugin:3.0.0:single (assembly_package) @ pentaho-server-ce ---
[INFO] Building zip: /Users/pedro/tex/pentaho/pentaho-platform-master/assemblies/pentaho-server/target/pentaho-server-ce-7.1-SNAPSHOT.zip
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Pentaho BI Platform Community Edition .............. SUCCESS [ 4.461 s]
[INFO] pentaho-platform-api ............................... SUCCESS [ 10.149 s]
[INFO] pentaho-platform-core .............................. SUCCESS [ 19.819 s]
[INFO] pentaho-platform-repository ........................ SUCCESS [ 2.210 s]
[INFO] pentaho-platform-scheduler ......................... SUCCESS [ 0.172 s]
[INFO] pentaho-platform-build-utils ....................... SUCCESS [ 1.695 s]
[INFO] pentaho-platform-extensions ........................ SUCCESS [01:22 min]
[INFO] pentaho-user-console ............................... SUCCESS [ 19.596 s]
[INFO] Platform assemblies ................................ SUCCESS [ 0.059 s]
[INFO] pentaho-user-console-package ....................... SUCCESS [ 16.399 s]
[INFO] pentaho-samples .................................... SUCCESS [ 1.159 s]
[INFO] pentaho-plugin-samples ............................. SUCCESS [ 11.129 s]
[INFO] pentaho-war ........................................ SUCCESS [ 45.434 s]
[INFO] pentaho-style ...................................... SUCCESS [ 0.742 s]
[INFO] pentaho-data ....................................... SUCCESS [ 0.211 s]
[INFO] pentaho-solutions .................................. SUCCESS [31:31 min]
[INFO] pentaho-server-manual-ce ........................... SUCCESS [01:15 min]
[INFO] pentaho-server-ce .................................. SUCCESS [01:51 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 38:36 min
[INFO] Finished at: 2017-03-31T15:36:43+01:00
[INFO] Final Memory: 102M/1084M
[INFO] ------------------------------------------------------------------------

There you go! In the end you should see a dist file like assemblies/pentaho-server/target/pentaho-server-ce--SNAPSHOT.zip. Unzip it, run it, done.

Debugging / inspecting the code

So the next thing you'd probably want, would be to be able to inspect and debug the code. This is actually pretty simple and common to all java projects. Goes something like this:

1. Open the project in a Java IDE

Since we use maven, it's pretty straightforward to do this - simply navigate to the folder and open the project as a maven project.

In theory, any java IDE would do, but I had some issues with Netbeans given it uses an outdated version of maven and ended up switching to IntelliJ IDEA.

I actually took this screenshot of IntelliJ myself, so no need to give credits to anyone

2. Define a remote run configuration

Now you need to define a remote debug configuration. It works pretty much the same in all IDEs. Make sure you point to the Java Debug Wire Protocol (JDWP) port you'll be using in the application you're attaching to.

Setting up a debug configuration

3. Make sure you start your application with JDWP enabled

This sounds complex, but really isn't. Just make sure your java command includes the following options:

-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8044

For Pentaho Platform it's even easier, as you can simply run start-pentaho-debug.sh.
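
If you start other Java applications from your own scripts, a tiny helper like the one below can assemble that option for you. It's just a sketch in Python: the helper name and the jar name are made up, and it only uses the sub-options already shown above (transport, server, suspend, address).

```python
def jdwp_option(port, suspend=False):
    """Build the JDWP agent option for a JVM that listens for a debugger.

    suspend=True makes the JVM wait for the debugger before running any code,
    which is handy when the breakpoint is in startup logic.
    """
    flag = "y" if suspend else "n"
    return f"-agentlib:jdwp=transport=dt_socket,server=y,suspend={flag},address={port}"


def java_command(jar, port, suspend=False):
    """Return the argument list for launching a jar with remote debugging enabled."""
    return ["java", jdwp_option(port, suspend), "-jar", jar]


# 'my-app.jar' is a placeholder, not a Pentaho artifact
print(" ".join(java_command("my-app.jar", 8044)))
```

Whatever port you pass here is the one your IDE's remote debug configuration needs to point at.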

4. Once the server / application is running, simply attach to it

And from this point on, any breakpoints should be intercepted

Inspecting and debugging the code

Submitting your fixes

Now that you know how to compile and debug code, you're a contributor in the works! Let's imagine you add some new functionality or fix a bug, and you want to send it back to us (you do, right???). Here are the steps you need - they may seem extensive but it's really pretty much the normal stuff:

Create a jira
Clone the repository
Implement the improvement / fixes in your repository
Make sure to include a unit test on it (link to how to write a unit test or a sample would be good)
Separate formatting-only commits from actual commits. So if your commit reformats the java class, you need to have a commit with [CHECKSTYLE] as the commit comment. Your main changes including your test case should be in a single commit.
Get the code formatting style template for your IDE and update the year in the copyright header
Issue a pull request against the project with [JIRA-ID] as the start of the commit comment
For visibility, add that PR to the jira you created, email me, tweet, whatever it takes. Won't promise it will be fast, but I promise we'll look :)
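
The commit comment conventions in the list above are easy to check mechanically before you push. Here's a minimal Python sketch; note that the exact shape of a jira ID (uppercase project key, dash, number) is my assumption, and PROJ-1234 is a made-up ID.

```python
import re

CHECKSTYLE_PREFIX = "[CHECKSTYLE]"
# Assumed jira ID shape: [KEY-123] followed by some text
JIRA_PREFIX = re.compile(r"^\[[A-Z][A-Z0-9]*-\d+\]\s+\S")


def classify_commit(message):
    """Return 'formatting', 'change' or 'invalid' based on the first line."""
    stripped = message.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if first_line.startswith(CHECKSTYLE_PREFIX):
        return "formatting"   # formatting-only commit
    if JIRA_PREFIX.match(first_line):
        return "change"       # real change, tied to a jira
    return "invalid"


for msg in ["[PROJ-1234] Fix NPE when the schedule is empty",
            "[CHECKSTYLE] Reformat SchedulerService",
            "fixed stuff"]:
    print(classify_commit(msg), "<-", msg)
```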

Hope this is useful!

-pedro

[Marketplace Spotlight] BTable 3.x

2017-01-26T17:29:00.002+00:00

Marketplace spotlight time! This time for an amazing contribution by our Italian friends from Biztech.it.

Massimo Bonometto just blogged about the new BTable release, that I shamelessly report here:

Hats off, Massimo!

__________________________________________________

Repost from Massimo's blog post

In January 2017 a new BTable version has been released to Pentaho Community.

As always it is available from Pentaho Marketplace.

Note about BTable version numbering: Pentaho 7.0 uses a newer version of the Spring platform. This is why we are forced to maintain 2 different versions of BTable. BTable 3.0 works with Pentaho 5.x and 6.x while BTable 3.6 is the one for Pentaho 7.x.

What's New?

In the following I'm going to give a brief description of the most important features introduced with this new version.

Styling And Alarms

We introduced the concept of BTable Templates. One template is a JSON file with .bttemplate suffix, that usually lives inside Pentaho Repository, whose structure is composed of 3 sections:

alarmRules: defines the alarm logic for each measure;
inlineCss: contains CSS statements added dynamically to one single BTable;
externalCss: similar to the previous one but uses an external CSS file.
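
To give an idea of the shape of such a file, here is a small Python sketch that writes out a template skeleton. Only the three section names come from the description above; everything inside each section (rule syntax, class names, paths) is invented for illustration and is not the real BTable format.

```python
import json

SECTIONS = ("alarmRules", "inlineCss", "externalCss")  # section names from the post

# Everything inside the three sections below is invented for illustration.
template = {
    "alarmRules": {"Sales": [{"when": "value < 1000", "cssClass": "alarm-low"}]},
    "inlineCss": ".alarm-low { background-color: #f8d7da; }",
    "externalCss": "/public/BTableCustom/alarms.css",
}


def missing_sections(doc):
    """List which of the three sections are absent from a template document."""
    return [name for name in SECTIONS if name not in doc]


# This text is what would be saved in a file with the .bttemplate suffix
text = json.dumps(template, indent=2)
print(text)
print("missing:", missing_sections(json.loads(text)))
```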

Alarm styling is based on CSS and gives developers the opportunity to create very nice results.

The Template is a BTable property and can be set inside CDE or changed in BTable Analyzer; that is developers can create, for example, many templates with different alarm logics and users can dynamically change templates in order to evaluate their effect.

It is possible to drive the default template for all BTables and default template for each Mondrian cube. Just create a new folder named /public/BTableCustom and add:

Default_Mondrian Catalog_Mondrian Cube.bttemplate (For example Default_SteelWheels_SteelWheelsSales): it is used as default for BTables on specific Mondrian cube;
Default.bttemplate: it is used as default when a specific template for cube is not found.
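
The lookup order described above can be sketched in a few lines of Python. This is only my reading of the rule (cube-specific default first, then the generic one); the function name is made up and the set of existing files stands in for the Pentaho repository.

```python
FOLDER = "/public/BTableCustom"


def default_template(catalog, cube, existing_files):
    """Pick the default template for a cube, or None if no default is defined."""
    candidates = (
        f"{FOLDER}/Default_{catalog}_{cube}.bttemplate",  # cube-specific default
        f"{FOLDER}/Default.bttemplate",                   # global fallback
    )
    for path in candidates:
        if path in existing_files:
            return path
    return None


repo = {
    "/public/BTableCustom/Default.bttemplate",
    "/public/BTableCustom/Default_SteelWheels_SteelWheelsSales.bttemplate",
}

print(default_template("SteelWheels", "SteelWheelsSales", repo))
print(default_template("SteelWheels", "SomeOtherCube", repo))
```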

Show Table Option

I'm sure that most of you love to spend time adding filters to CDE dashboards. Well, I really hate it!!! (In particular when a customer asks to add one filter after I finished the dashboard). :-)
This is why I had the idea to use BTable just for filters selection. I find it really tricky.

In the BTable With Templates example I show you how you can add a BTable just for filter selection and then synchronize two other BTables.
The same can be easily done with other components based on MDX query.

Using BTable Filter Panel From External Applications

Sometimes it happens that in your custom application you need to work with dimension members selections (for example for profiling purposes). You can do it working directly on database but I found it very useful to create one way to do it through BTable Filters Panel. Basically you have the opportunity to invoke BTable passing an endpoint as parameter. When the user saves filters selections the endpoint is launched.
If you are curious about this, you can use the comments on this post and I will do my best to explain it in detail in another post.

Filter On Dimension Members

When the user selects one dimension inside the Filter Panel, the dimension members shown are filtered based on the filter selections made for other dimensions. This is the default behaviour but can optionally be changed by users.

Show Toolbar Option

Now it is possible to show one toolbar with the most common actions on top of BTable. The toolbar is active by default when you start from BTable Analyzer and inactive by default in a CDE dashboard.

Users can toggle the toolbar visibility.

History

Since its first version BTable has the command Reset to reload the initial state. Now we also added the Back button in the toolbar that gives the opportunity to move BTable to previous states.
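
Conceptually, Reset and Back are two operations on a stack of states. Here's a minimal Python sketch of that idea; it is not BTable's actual code, and whether Reset also clears the history is my assumption.

```python
class History:
    """Keeps the initial state plus every state applied since."""

    def __init__(self, initial):
        self._initial = initial
        self._states = [initial]

    @property
    def current(self):
        return self._states[-1]

    def apply(self, state):
        """Record a new state, e.g. after a drill-down or a pivot."""
        self._states.append(state)

    def back(self):
        """Go to the previous state; does nothing at the initial state."""
        if len(self._states) > 1:
            self._states.pop()
        return self.current

    def reset(self):
        """Reload the initial state (assumed here to clear the history too)."""
        self._states = [self._initial]
        return self.current


h = History("Sales by Territory")
h.apply("Sales by Territory > EMEA")
h.apply("Sales by Territory > EMEA > France")
print(h.back())   # one step back
print(h.reset())  # straight to the initial state
```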

Show Zeros Option

It is common in the OLAP/MDX world to deal with the NOT NULL option, but it frequently happens that measure fields in fact tables contain zero values.
This option, active by default, deletes rows and columns when all values are nulls or zeros.
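
The rule is simple enough to sketch in Python: a row or column disappears only when every one of its cells is null or zero. This is just an illustration of the behaviour described, with made-up data, not the BTable implementation.

```python
def is_blank(value):
    """A cell counts as blank when it is null (None) or zero."""
    return value is None or value == 0


def hide_blank(row_labels, col_labels, values):
    """Drop rows and columns whose cells are all null or zero."""
    keep_rows = [i for i, row in enumerate(values)
                 if not all(is_blank(v) for v in row)]
    keep_cols = [j for j in range(len(col_labels))
                 if not all(is_blank(row[j]) for row in values)]
    return ([row_labels[i] for i in keep_rows],
            [col_labels[j] for j in keep_cols],
            [[values[i][j] for j in keep_cols] for i in keep_rows])


rows = ["EMEA", "APAC", "NA"]
cols = ["2003", "2004", "2005"]
data = [[10, 0, None],
        [0, None, 0],     # all blank: the APAC row goes away
        [5, 0, 7]]        # the 2004 column is blank everywhere: it goes away too

print(hide_blank(rows, cols, data))
```

Note that a single zero inside an otherwise populated row stays visible; only fully blank rows and columns are removed.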

Performance

We made some improvements in order to speed up BTable rendering. In my tests I was able to list more than 300,000 rows in a reasonable amount of time.

New posts with further details will follow.

Enjoy!!

From 0 to a full blown Pentaho 7 spectacular dashboard in 60m

2017-01-20T17:58:00.000+00:00

CBF2 is awesome? Hell yeah!

I've recently been blogging about CBF2 and talking about how great it is. But I admit that even just by looking at the blog post some people may not take it seriously assuming it's too complex. It's not.

What you'll get - in less than one hour

Today I did a demo on a topic that I'm extremely passionate about, horology. With the help of Miguel Leite, one of our UX wizards here, we did a one day push to build this project (bidder beware, this was a looooong day....).

The result? Absolutely spectacular, completely worth the effort:

And all this fueled by the amazingly powerful dataservices + annotations:

You can get this in even less than one hour; you know, it's just that most of the time is spent downloading stuff, and I kept getting distracted and forgetting to go back to what I was doing. I'm absolutely sure you can do it in much less!

So, let's go!

Pre requisites

Here's what you need:

Any operating system, and a machine with at least 8gb
Docker configured with at least 4gb on it (get it from here)
Git (or any UI for git)
Not being afraid to launch a terminal window...

C'mon, it's not asking much, is it?

Getting it all working in just 6 steps

1. Create a directory for pentaho and CBF

Create a directory called pentaho, open a terminal there and clone CBF2

$ git clone https://github.com/webdetails/cbf2.git

You should have all the directory structure as described in the CBF2 blog post.

2. Download Pentaho 7.0

Under the software directory, create another folder, called 7.0.0.0-25 (I like to use the version / build number) and put pentaho there, CE or EE:

Get CE from sourceforge: pentaho-server-ce-7.0.0.0-25.zip
Get EE from the Pentaho support portal (customers only): paz-plugin-ee-7.0.0.0-25-dist.zip, pdd-plugin-ee-7.0.0.0-25-dist.zip, pentaho-server-ee-7.0.0.0-25-dist.zip, pir-plugin-ee-7.0.0.0-25-dist.zip . If you download patches for 7.0.0.0, they will be automatically applied. In this case you also need to put your license files under the cbf2/licenses/ folder.

3. Get the horlogery-demo project

Clone the horlogery-demo project under the cbf2/projects directory:

$ git clone https://github.com/pmalves/horlogery-demo.git

4. Do the CBF2 magic

Under the cbf2/ folder you have the cbf2.sh magic script, built by pink unicorns. Go to that dir and...

Execute cbf2 and press [A] to add a new image and select the server you downloaded. If you're using EE you'll need to accept the license agreement. A new image should be available
Execute cbf2 and press [C] to create a new project. Select the horlogery-demo project and the image created previously.
There's no 3

5. Start using it!

If everything went as expected, you should be seeing something like this:

pedro@orion:~/tex/pentaho/cbf2/projects/horlogery-demo (master *) $ cbf2

Core Images available:
----------------------
[0] baserver-ce-7.0.0.0-25
[1] baserver-ee-7.0.0.0-25

Core containers available:
--------------------------

Project images available:
-------------------------
[2] pdu-horlogery-demo-baserver-ce-7.0.0.0-25
[3] pdu-horlogery-demo-baserver-ee-7.0.0.0-25

Project containers available:
-----------------------------

> Select an entry number, [A] to add new image or [C] to create new project:

Select the project you want, press [L] to launch it and it will soon be available for you to start exploring!

Pentaho user console should be available at http://127.0.0.1:8080/
That great Ctools dashboard is available at http://127.0.0.1:8081/pentaho/api/repos/:public:horlogery:Horlogery.wcdf/generatedContent

(Note that depending on the operating system, the docker IP may not be 127.0.0.1 though, I can't help there)
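
If your docker host or port is different, the dashboard URL is easy to rebuild. The Python sketch below assumes the dashboard lives at /public/horlogery/Horlogery.wcdf in the repository and that the URL form is the same path with slashes turned into colons, which is what the link above suggests; the function name is made up.

```python
from urllib.parse import quote


def dashboard_url(repo_path, host="127.0.0.1", port=8080):
    """Build the generatedContent URL for a dashboard stored in the repository."""
    # Repository paths appear in the URL with ':' in place of '/'
    encoded = quote(repo_path.replace("/", ":"), safe=":")
    return f"http://{host}:{port}/pentaho/api/repos/{encoded}/generatedContent"


print(dashboard_url("/public/horlogery/Horlogery.wcdf", port=8081))
# Hypothetical docker-machine style address
print(dashboard_url("/public/horlogery/Horlogery.wcdf", host="192.168.99.100", port=8081))
```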

6. Next steps?

From this point on it's you writing your own project and success story! And I'm going to get some sleep, since I had nearly none last night!! :p

Have fun!

-pedro

Doing GeoLocation in PDI - Pentaho Data Integration (Kettle)

2017-01-10T15:54:00.000+00:00

Geo Location

Geo location is something we often need in ETL work. And while we had a step that worked in PDI 5.x and earlier releases, we just noticed it's not currently working.

Until this morning, that is :p

I just forked Matt's initial project and applied the relevant changes to make it compatible with Pentaho 6+

The basics

Well, easy to understand... We have an IP address, we want to know where it comes from!

Geolocation transformation - Let me see if it finds out where I am...

Once I execute this, I get the following result:

Yep, this is where I am...

I am indeed in Porto Salvo, Portugal, so this is right. Can't get any easier than this!

Making it work

So, how to make this work? First, you have to get the plugin from the PDI marketplace

This plugin is available through the marketplace. Just go ahead and install it.

PDI Marketplace - Get your goodies from here

After installing it and restarting PDI, you'll see the GeoIP Lookup step in the lookup folder. Configuring it is straightforward: You point to the stream field containing the IP address, point to the IP database files and specify what fields you want back:
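
If you want a feel for what the step does with those three settings, here's a toy Python version. The lookup table is completely made up (it uses the reserved documentation IP ranges and fake places) and has nothing to do with the MaxMind file format; the real step reads the actual database files.

```python
import ipaddress

# Made-up lookup table using reserved documentation ranges
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), {"country": "Exampleland", "city": "Sample City"}),
    (ipaddress.ip_network("198.51.100.0/24"), {"country": "Testonia", "city": "Demo Town"}),
]


def geoip_lookup(rows, ip_field, wanted_fields):
    """Add the requested location fields to each row, or None when unknown."""
    for row in rows:
        ip = ipaddress.ip_address(row[ip_field])
        match = next((info for net, info in GEO_TABLE if ip in net), {})
        yield {**row, **{name: match.get(name) for name in wanted_fields}}


stream = [{"ip": "192.0.2.17"}, {"ip": "203.0.113.5"}]
for out in geoip_lookup(stream, "ip", ["country", "city"]):
    print(out)
```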

Configuring the step

Getting the IP Database files

You need to get the files from MaxMind, and from my experience these guys do a great job here. They have some great commercial offerings but also a GeoLite database for country and city location. You can get them from here

Getting the GeoIP data files

And you should be done! This even works great in a map reduce job

CBF2 now supporting multiple instances running

2017-01-09T18:49:00.003+00:00

The scenario

Last year I announced CBF2, the biggest, best, coolest way to manage Pentaho projects, available on github. In case you don't recall, it relies on docker to manage the images. Just read about it - really awesome

Since then, we've been using it a lot here. Really helps managing different projects and environments, and it's been put up to test in multiple real world scenarios.

One of the limitations of CBF2 is that it can only run one project per machine - since it exposes the ports on the host machine.

Another immediate consequence is that we can't have a local tomcat running since we'd get port conflicts.

The need

However, sometimes it's useful to have running containers side by side. In this case, we wouldn't be able to run these two projects at the same time:

So I guess I have to run these two projects one at a time?

If you tried to run these two, it would complain about conflicting ports and the like.

The solution

Turns out Kleyson Rios was less clumsy than I am - so he implemented this feature in CBF2: The ability to have containers running side by side by automagically detecting used ports and just moving on to the next one.
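
The "detect used ports and move on to the next one" idea looks roughly like this. It's a Python sketch of the general technique only; CBF2 itself is a shell script and I'm not claiming this is how Kleyson implemented it.

```python
import socket


def is_free(port, host="127.0.0.1"):
    """True if nothing is bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False


def next_free_port(start, host="127.0.0.1", max_tries=100):
    """Return the first free port at or after 'start'."""
    for port in range(start, min(start + max_tries, 65536)):
        if is_free(port, host):
            return port
    raise RuntimeError(f"no free port found starting at {start}")


# If 8080 is already taken by another container, this returns the next free one
print(next_free_port(8080))
```

There's an inherent race here (a port can be grabbed between the check and the container start), which is fine for a developer machine but worth knowing about.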

In here you see that both containers successfully ran:

2 projects running side by side

The result

The end result? Pretty cool, I have to admit! I now can run and test different versions side by side on my machine just by using the correct port :)

2 different versions, 5.4 and 7.1, running on my machine with 2 simple commands

This improvement is already committed, so simply pull the latest version and you're ready to go.

Thanks Kleyson! :)

-pedro

New WEKA releases: 3.6.15, 3.8.1 and 3.9.1

2016-12-20T19:32:00.000+00:00

The Weka team is on fire. New releases available for download from the Weka homepage:

Weka 3.8.1 - stable version.

Weka 3.9.1 - development version

Weka 3.6.15 - stable book 3rd edition version

It is available as ZIP, with Win32 installer, Win32 installer incl. JRE 1.8.0_112, Win64 installer, Win64 installer incl. 64 bit JRE 1.8.0_112 and Mac OS X application with Oracle 64 bit JRE 1.8.0_112.

Stable 3.8 receives bug fixes and new features that do not include breaking API changes and maintain serialized model compatibility. 3.9 (development) receives bug fixes and new features that might include breaking API changes and/or render models serialized using earlier versions incompatible.

NOTE: 3.6.15 is the final release of stable-3-6.

Weka homepage:
http://www.cs.waikato.ac.nz/~ml/weka/

Pentaho data mining community documentation:
http://wiki.pentaho.com/display/Pentaho+Data+Mining+Community+Documentation

Packages for Weka>=3.7.2 can be browsed online at:
http://weka.sourceforge.net/packageMetaData/

What's new in 3.8.1/3.9.1?

Some highlights
---------------

In core weka:

Package manager now handles redirects generated by SourceForge
Package manager now employs a new class loading mechanism that attempts to avoid third-party library clashes by isolating the third-party libraries in each package
new RelationNameModifier, SendToPerspective, WriteWekaLog, Job, StorePropertiesInEnvironment, SetPropertiesFromEnvironment, WriteDataToResult and GetDataFromResult steps in Knowledge Flow
RandomForest now has an option for computing the mean impurity decrease variable importance scores
JRip now prunes redundant numeric attribute-value tests from rules
Knowledge Flow now offers an additional executor service that uses a single worker thread; steps can, if necessary, declare programmatically that they should run in the single-threaded executor.
GUIs with result lists now support multi-entry delete
GUIs now support copying/pasting of array configurations to/from the clipboard

In packages:

Multi-class FLDA in the discriminantAnalysis package
New implementations in the ensemblesOfNestedDichotomies package
distributedWekaBase now includes the latest version of Ted Dunning's t-digest quantile estimator, bringing a factor of 4 speedup over the old implementation
New streamingUnivariateStats package
RPlugin package updated to support the latest version of MLR
New wekaDeepLearning4j package - provides a MLP classifier built using the DL4J library. Can work with either CPU-based or GPU-based native libraries
New logarithmicErrorMetrics package
New RankCorrelation package, courtesy of Quan Sun. Provides rank correlation metrics, Kendall tau and Spearman rho, for evaluating regression schemes
New AffectiveTweets package, courtesy of Felipe Bravom. Provides text filters for sentiment analysis of tweets
New AnalogicalModeling package, courtesy of Nathan Glenn. Provides an exemplar-based approach to modeling
New MultiObjectiveEvolutionaryFuzzyClassifier package, courtesy of Carlos Martinez Cortes. Provides a fuzzy rule-based classifier
New MultiObjectiveEvolutionarySearch package, courtesy of Carlos Martinez Cortes. Provides a search method that uses the ENORA multi-objective evolutionary algorithm

As usual, for a complete list of changes refer to the changelogs.

Oh. btw, Pentaho 7.0 is out!

2016-11-16T15:06:00.002+00:00

#PCM16 group photo - once again I was outside having a drink

I almost forgot! Pentaho 7.0 is out! We chose the amazing Pentaho Community Meeting (#PCM16) as the release date (talk about commitment to the community ;) ), so just go get it at the usual places, both the Enterprise Edition and the Community Edition.

-pedro

PCM16 - Pentaho Community Meeting, coming up Nov 11-13

2016-11-02T12:44:00.000+00:00

2 weeks to go. For the 9th year in a row, this fantastic event will take place in less than 2 weeks in Antwerp. See you soon!!

Now, let me shamelessly copy the blog post from Bart Maertens, the organizer of the event:

Announcing #PCM16, Antwerp, Belgium!!

Register now!
Friday, Nov 11th: PCM16 Hackathon
Saturday, Nov 12: PCM16
Submit your talk proposal!

Use case room: pcm16_biz@know.bi
Tech room: pcm16_tech@know.bi
AGENDA: What's cooking at PCM16? Have a look:
ANTWERP: How to get there and things to do in
After the 2015 edition in London, the eighth yearly Pentaho Community Meeting will be back where it was in 2014: Antwerp, Belgium. The dates for PCM16 will be Friday, November 11th and Saturday, November 12th. The venue for this edition will be the gorgeous medieval hospital and monastery Elzenveld. The views may not be as spectacular as they were in the 2013 Sintra edition, but the location sure will be fine!

As was the case in the 2014 and 2015 editions, we’ll have a hackathon (followed by drinks) on Friday evening and two presentation rooms (business and technical) on Saturday.
As has been the trend in the latest PCM editions, we aim to make this the European Pentaho event of the year for both Enterprise and Community Edition users. After all, no matter which version you use, we’re all just a community of Pentaho users.
The event is free of charge thanks to sponsorship by know.bi and Pentaho; there will only be a small charge (€10) for lunch, which you will kindly be asked to pay in cash when registering on Saturday.
In return for a weekend of your time, you’ll enjoy a couple of days of being submerged in everything Pentaho, (Big) Data, Data Science and the excitement of talking to and working with the community involved in all of this.
Registrations are open now, register on our eventbrite pages for the hackathon and PCM:
Friday, Nov 11th: PCM16 Hackathon
Saturday, Nov 12: PCM16

Friday, November 11th

On the evening of Friday, November 11th, we’ll be hosting a hackathon. People will have to travel to Antwerp, so we won’t be able to start early (8PM-ish) or keep hacking for hours on end. However, as has been shown in previous years, a couple of hours suffice to build and present impressive solutions with PDI, Mondrian or CTools.
As tradition has it, beer is an important part of a pre-PCM Friday evening, and there’s nowhere better to go for beers than Belgium! There are quite a number of pubs in the vicinity of the venue: ‘K. Zeppos’, named after the (at least in Belgium) world-famous sixties TV series ‘Kapitein Zeppos’, and ‘Pallieter’, named after a 1916 novel by Flemish writer Felix Timmermans, just to name a few.
After a couple of hours of hacking, this is the perfect excuse to enjoy some of our famous Belgian beers. Take it easy though, these are not Amstel or Heineken!

Saturday, November 12th

The rooms

Traditionally, Saturday is what a true PCM is all about!
Just like in the last editions, there will be two rooms: business and technical.
The business room will be your go-to place for use case presentations, where Pentaho customers and/or users explain what real-life problems they are solving with the Pentaho suite.
Some of Pentaho’s biggest and most prestigious implementations will be presented here, alongside smaller but no less interesting implementations.
The technical room is for more technical presentations. This is the ‘old school’ Community Meeting room, and your go-to place to find out what’s cooking within Pentaho and the Pentaho Community. In the earliest Pentaho Community Meetings, PowerPoint was forbidden and beaming code on the big screen was mandatory; find out for yourself if this still stands.

Call for Speakers

A list of speakers for both rooms is currently being compiled.
If you’d like to present, please email pcm16_biz@know.bi or pcm16_tech@know.bi with a short description of your presentation proposal and we’ll get back to you asap.

Sunday, November 13th

Again, not intending to change a winning team and sticking to tradition, we’ll have a social activity on the post-PCM Sunday.
An agenda is still being compiled, but we’ll post regular updates here, so stay tuned for more.

How to get there

By Plane
When you're arriving in Belgium through Brussels Airport, there are direct trains from the airport to Antwerp (approximately 30 minutes).
A (limited) number of cities have direct connections to Antwerp Airport, which is just a 15-minute taxi ride from the city center.
By Train
Antwerp has connections to several European cities, including a number of high speed connections. Check out the Belgian Rail website for more details.
By car
Follow your GPS to your hotel's address. Driving in Antwerp is fine, but traffic around the city can be challenging (slow).

Agenda

Technical Room

Tom Barber: Alternative Big Data Devops
Jens Bleuel: What’s new in PDI 7.0
Matt Casters: PDI Unit Testing
know.bi: Loading data to Neo4J using PDI
UbiquisBI: Data models for Hadoop: Kimball without updates
Wael Elrifai: TBD
Pentaho - WebDetails: TBD
Hiromu Hota ## smell of announcement ##

Use Cases

Pentaho 9.0 is available

Without further ado: Get Enterprise Edition here, and get Community Edition here

· Enables hybrid big data processing support (on-prem or cloud), all within a single pipeline

· Simplifies Pentaho’s integration with Hadoop clusters, including enhanced UX of cluster configurations

· Adaptive Execution Layer Spark isn’t validated to execute pipelines connecting to multiple Hadoop clusters.

· Pentaho Map Reduce isn’t validated to execute pipelines connecting to multiple Hadoop clusters.

Capability

Use Cases and Benefits

· Eliminates the black-box feel with better visibility

· Gives advanced Spark users tools to improve performance

Key Considerations

Users must be aware of the following additional items related to AEL v9.0.0:

· Spark v2.2 is not supported.

· Native HBase steps are only available for CDH and HDP distributions.

· Spark 2.4 is the highest Spark version currently supported.

Additional Resources

See the following documentation for more details: About Spark Tuning in PDI, Setup Spark Tuning, Configuring Application Tuning Parameters for Spark

Capability

Use Cases and Benefits

Key Considerations

Additional Resources

See Virtual File System connections, Apache Supported File Systems and Open a transformation for more information.

Capability

Use Cases and Benefits

Key Considerations

This step works with Fixed Length COBOL records only. Variable record types such as VB, VBS, OCCURS DEPENDING ON are not supported.

Additional Resources

For more information about using copybook steps in PDI, see Copybook steps in PDI

New Pentaho Server Upgrade Installer

Snowflake Bulk Loader improvement

Redshift IAM security support and Bulk load improvements

Improvements in AMQP and UX changes in Kinesis

The AMQP Consumer step provides binary message support, which allows, for example, processing Avro-formatted data. Within the Kinesis Consumer step, users can change the output field names and types. See the documentation of the AMQP Consumer and Kinesis Consumer steps for more details.

Metadata Injection (MDI) Improvements

Excel Writer: Performance improvement

The performance of the Excel Writer has been drastically improved when using templates. A sample test file with 40,000 rows needed about 90 seconds before 9.0 and now processes in about 5 seconds. For further details, please see PDI-18422.

JMS Consumer changes

In PDI 9.0, we added the following fields to the JMS Consumer step: MessageID, JMS timestamp and JMS Redelivered. This addition enables restartability and makes it possible to skip duplicate messages. For further details, please see PDI-18104 and the step documentation.
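
As a rough illustration of why a message ID helps with restartability, here is a minimal Python sketch of the idea, not PDI code. The `Message` record and `drop_duplicates` are hypothetical names; the assumption is that the downstream logic remembers the IDs it has already handled (in a real pipeline that set would be persisted somewhere durable) and skips any message whose ID it has seen, which is what a redelivery after a restart looks like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical row shape mirroring the new JMS Consumer output fields."""
    message_id: str
    timestamp: int
    redelivered: bool
    body: str


def drop_duplicates(messages, seen_ids=None):
    """Yield each message once, keyed on message_id.

    `seen_ids` stands in for state persisted across restarts (an assumption);
    it is updated in place so the caller can save it again afterwards.
    """
    seen = set() if seen_ids is None else seen_ids
    for msg in messages:
        if msg.message_id in seen:
            continue  # already processed, e.g. redelivered after a restart
        seen.add(msg.message_id)
        yield msg


if __name__ == "__main__":
    seen = set()
    first_run = [Message("ID:1", 100, False, "a"), Message("ID:2", 101, False, "b")]
    print([m.body for m in drop_duplicates(first_run, seen)])
    # After a restart the broker redelivers ID:2 before sending ID:3.
    second_run = [Message("ID:2", 101, True, "b"), Message("ID:3", 102, False, "c")]
    print([m.body for m in drop_duplicates(second_run, seen)])
```

The redelivered flag is not needed for the filtering itself here, but it is a cheap hint for when the lookup is worth doing.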

Text file output: Header support with AEL

You can set up the Text file output step to run on the Spark engine via AEL. The Header option of the Text file output step now works with AEL. For further details, please see PDI-18083 and the Using the Text File Output step on the Spark engine documentation.

Transformation & Job Executor steps, Transformation & Job entries: UX improvement

Spoon.sh Exit code improvement

Dashboard: Option for exporting analyzer report into CSV format.

Analyzer: Use of date picker when selecting ranges for a Fiscal Date level relative filter.

Mondrian: Option for setting the 'cellBatchSize' default value.

Pentaho 8.3 is available!

Pentaho 8.3

PDI Amazon Kinesis Streaming Integration

Capability

Use Cases and Benefits

Additional Considerations

Additional Resources

PDI Amazon Redshift Bulk Load

Capability

Use Cases and Benefits

Key Considerations

Additional Resources

PDI + BA Snowflake Database Connectivity

Capability

Use Cases and Benefits

Key Considerations

Additional Resources

PDI HCP (Hitachi Content Platform) Integration Enhancements

Capability

Use Cases and Benefits

Key Considerations

Additional Resources

PDI SAP Connector

Capability

Use Cases and Benefits

Key Considerations

Additional Resources

BA Viz API 3.0 General Availability

Capability

Use Cases and Benefits
