Hadoop

If you’re using Hadoop, it’s probably because you’ve got data that is huge, unstructured, nested or all three. The Hadoop Distributed File System (HDFS) and the MapReduce programming model support parallel processing across massive or disparate data sets. Together, these let you work with data that traditional databases find extremely difficult to process, including XML data.

But storing and accessing big and messy data is only part of the problem. You’ve still got to make sense of it. Tableau lets you connect easily to Apache Hadoop from Cloudera and do ad hoc visualization so you can see patterns and outliers in the data stored in your Hadoop cluster. You can’t get value from your data unless you can see what’s inside it.

“Tableau’s solution for Hadoop is one of the most elegant solutions I've seen, and performant,” said Ravi Bandaru, Product Manager of Advanced Analytics & Data Visualization at Nokia. “This obviates any need for us to move huge log data into Relational store before analyzing it with Tableau.”

Speed


The power of Hadoop without the latency

Hadoop’s best-known drawback is its high latency, and ad hoc visualization at interactive speeds depends on fast query responses. When you work with Hadoop and Tableau, you can connect live to your Hadoop cluster and then extract the data into Tableau’s fast in-memory data engine.

Tableau lets you bring your data into its fast, in-memory analytical engine. With this approach you can query an extract of the data without waiting for MapReduce jobs to complete. A single click switches between a live connection and in-memory analytics, so you can quickly analyze samples of data in memory, then reconnect to run a live query.

Easy connection


Connect directly to Apache Hadoop from Cloudera

Getting Hadoop to work with Tableau is easy: point Tableau at your Hadoop cluster as you would with any other data connection. Your cluster does need Hive installed; Hive is a common component that provides a SQL interface to Hadoop. Neither Tableau nor Hadoop requires any special configuration.

In Tableau, Apache Hadoop from Cloudera is simply another data source. You can connect with no programming and drag and drop to visualize your data. Here we have weather data from a set of XML objects, now stored in a Hadoop cluster. Tableau’s powerful visualization capabilities let you create maps, charts and dashboards easily.

XML support


Work with a variety of data, including XML

An important application of Hadoop and Hive together is working with a variety of data, such as XML files. This often means that you need to unpack nested data, perform data transformations and process URLs. Tableau supports a number of new string functions when working with Hive and Hadoop, including URL processing, regular expressions, and hex/binary numeric operators.
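To give a feel for what URL processing and regular expressions do to a raw record, here is a minimal Python sketch. It is not Tableau or Hive code: Hive provides comparable built-ins such as `parse_url` and `regexp_extract`, while this example uses Python's standard library to show the same kind of transformation. The log line format, the `extract_fields` helper and the field names are assumptions made up for illustration.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical log line; the layout is an assumption for illustration only.
LOG_LINE = "2012-03-01 GET http://example.com/weather/forecast?city=Seattle&units=f 200"


def extract_fields(line):
    """Pull the URL out of a log line and split it into relational-style fields."""
    # Regular expression step: find the first http/https URL in the line.
    match = re.search(r"(https?://\S+)", line)
    if match is None:
        return None

    # URL processing step: break the URL into host, path and query parameters.
    parts = urlsplit(match.group(1))
    query = parse_qs(parts.query)
    return {
        "host": parts.netloc,
        "path": parts.path,
        "city": query.get("city", [None])[0],
    }


fields = extract_fields(LOG_LINE)
print(fields)
```

Each key in the resulting dictionary plays the role of a calculated field: a value derived from a raw string that can then be used as a dimension in a chart or map.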

This weather data was stored as a series of XML files that were loaded into Hadoop and unpacked on the fly by the Tableau custom SQL connection. This is true flexibility, almost like on-the-fly ETL. Here we’re using the “XPATH” function to create a City field so that we can represent this data in a more traditional, relational way. XML functions are exposed in the Tableau calculations window when you’re working with Hive/Hadoop data, so you don’t need to do custom programming to work with XML objects.
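The idea of turning nested XML into a relational City field can be sketched in a few lines of Python. This is only an illustration of the concept, not the Tableau calculation or the Hive function itself: the XML layout, the element names (`observation`, `city`, `temp_f`) and the `to_row` helper are assumptions, and the path lookup uses the limited XPath subset in Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical weather records; element names are assumptions for illustration.
RAW_RECORDS = [
    "<observation><city>Seattle</city><temp_f>54</temp_f></observation>",
    "<observation><city>Austin</city><temp_f>88</temp_f></observation>",
]


def to_row(xml_text):
    """Flatten one XML record into a (city, temp_f) tuple."""
    root = ET.fromstring(xml_text)
    # XPath-style lookups relative to the record's root element.
    city = root.findtext("./city")
    temp_f = float(root.findtext("./temp_f"))
    return (city, temp_f)


# Each XML document becomes one row with plain columns.
rows = [to_row(record) for record in RAW_RECORDS]
for city, temp_f in rows:
    print(f"{city}\t{temp_f}")
```

Once every record is reduced to a row like this, the City value behaves like any other column, which is what makes it usable for maps and charts.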