Archives For Hadoop


Visualisations in PowerBI

Here are some charts that we generated with PowerBI. The nice thing is that you can drill down on any bar. This is ideal for exploring a dataset.


You can also easily build an animated chart. In the following example, the delays per airport are shown on a scatter chart, where the total delay is plotted against the likelihood of having a delay. If you ‘play’ the chart, you can see the evolution of the delays on a day-to-day basis. From this animation, it’s clear that Saturday is your best bet if you really don’t like delays.


Like any self-respecting BI tool, PowerBI also offers a Map chart. We’ve experimented with it and we’ve got some beautiful results already.


As I mentioned before, the search feature is also very powerful. For example:

Doing BI becomes as simple as doing a Google search. Well, I guess Microsoft calls it a Bing search, but anyways…

The only thing I really miss in PowerBI is “live queries”. PowerBI retrieves all the data you need from the source and does all calculations on your machine. This doesn’t work well with Big Data. For one, you’re moving your data around, not your processing. That’s a bad smell. You’re limited to the memory and processing power of your machine, and you’ve lost all the advantages of a distributed SQL database or a Hadoop platform. Also, downloading millions and millions of rows puts a heavy load on the network, and it will take a while before you can fire your first query. Typically, you can download only a subset of your data, which obviously restricts you in many ways.

Tableau does offer those live queries. This means it doesn’t try to retrieve the entire dataset. Instead, it fires the right SQL query at the database and only returns the results. You can take full advantage of your powerful cluster, you’re not congesting the network, and you can start querying your dataset immediately. I hope this will be possible in future versions of PowerBI as well.
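To make the difference concrete, here’s a minimal sketch using Python’s built-in sqlite3 as a stand-in for the remote warehouse (the flights table and its numbers are made up for illustration): the “extract” model pulls every row across the wire and aggregates locally, while the “live query” model pushes the aggregation down to the engine and moves only the small result set.

```python
import sqlite3

# Stand-in for a remote warehouse; in the real scenario this would be
# a distributed engine such as Impala reached over the network.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flights(airport TEXT, delay INT);
INSERT INTO flights VALUES ('JFK', 10), ('JFK', 30), ('LAX', 5), ('LAX', 15);
""")

# Extract model (PowerBI-style): fetch every row, aggregate on the client.
rows = conn.execute("SELECT airport, delay FROM flights").fetchall()
totals = {}
for airport, delay in rows:
    totals.setdefault(airport, []).append(delay)
local = {a: sum(d) / len(d) for a, d in totals.items()}

# Live-query model (Tableau-style): push the aggregation to the engine
# and move only one row per airport back.
remote = dict(conn.execute(
    "SELECT airport, AVG(delay) FROM flights GROUP BY airport"))

print("rows moved (extract):", len(rows))
print("rows moved (live query):", len(remote))
assert local == remote  # same answer, far less data moved
```

The results are identical either way; the difference is how many rows cross the network, which is exactly what hurts once the table has millions of rows instead of four.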

Detailed tutorial from Hortonworks

Hortonworks have done pretty much the same thing (obviously minus Impala), and have put together a very detailed tutorial about it. Well worth the read if you want to try this yourself:

Comparing queries from SQL to Pig

SQL to Pig Cheat Sheet

Get the complete list from the Pig Cheat Sheet.
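As a flavour of the kind of mapping such a cheat sheet covers, here is one hedged example: the same aggregation written in SQL and in Pig Latin (the flights data and file name are made up for illustration).

```python
# The same "average delay per airport" query in both languages.
sql_query = """
SELECT airport, AVG(delay)
FROM flights
GROUP BY airport;
"""

# In Pig, the single GROUP BY splits into an explicit GROUP step
# followed by a FOREACH ... GENERATE that applies the aggregate.
pig_script = """
flights    = LOAD 'flights.csv' USING PigStorage(',')
             AS (airport:chararray, delay:int);
by_airport = GROUP flights BY airport;
avg_delay  = FOREACH by_airport GENERATE group AS airport, AVG(flights.delay);
DUMP avg_delay;
"""

print("SQL:", sql_query)
print("Pig:", pig_script)
```

The pattern generalises: SQL’s declarative clauses (WHERE, GROUP BY, JOIN) each map to an explicit Pig relational operator (FILTER, GROUP, JOIN) applied step by step.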

SQL on Hadoop

February 18, 2014


The good thing about running SQL on Hadoop is that SQL is a declarative language, which means that you don’t need to know where the data is: you just ask for it, and the database works out how to get the information you need. However, unless you have a database optimiser, the performance will suck.
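As a toy illustration of that declarative property, using Python’s built-in sqlite3 in place of a distributed engine (the tables here are made up), you only state what you want and the planner works out the access path, including which table to scan and which index to use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flights(airport TEXT, delay INT);
CREATE TABLE airports(code TEXT PRIMARY KEY, city TEXT);
INSERT INTO flights VALUES ('JFK', 10), ('LAX', 5);
INSERT INTO airports VALUES ('JFK', 'New York'), ('LAX', 'Los Angeles');
""")

# We state WHAT we want; the engine's planner decides HOW:
# EXPLAIN QUERY PLAN shows the access path it picked.
plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT airports.city, AVG(flights.delay)
FROM flights JOIN airports ON flights.airport = airports.code
GROUP BY airports.city
""").fetchall()

for step in plan:
    print(step)
```

Even this tiny engine chooses a join strategy for you; the point of the piece is that doing this well over a distributed Hadoop cluster is exactly what Impala’s young optimiser still lacks.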

Now there are various SQL initiatives around, but probably the most advanced is Impala. And in version 1.2, released at the end of December, Cloudera introduced facilities to optimise join order but, while this is a step in the right direction, it hardly constitutes a full-blown optimiser.

However, a couple of related announcements have caught my eye this week. The first was that Calpont has changed its name to the name of its product InfiniDB, it has raised another round of funding and it has announced version 4.5 of its database with an Enterprise Management dashboard. None of which has much to do with Hadoop except that it reminded me that Calpont (as it then was) announced the availability of InfiniDB running on Hadoop back last year, along with an open source license. And, of course, InfiniDB has a grown-up optimiser.

Another product that has an adult optimiser is HP Vertica. And MapR has just announced an early access program (prior to general availability in March) for the HP Vertica Analytics Platform running on the MapR Hadoop distribution.

The truth is that you will get much better performance—orders of magnitude better—from either InfiniDB or Vertica than you will from Impala. So this poses three questions: firstly, will we see more vendors porting their warehouse products onto Hadoop (or HDFS); secondly, how quickly will Cloudera or Hortonworks (with its SQL implementation) be able to produce an optimiser that can compete reasonably well with these intruders into their market; and, thirdly, how much does this matter?

The answer to the first question is yes. I don’t know who or when, but this is the general trend, not just in data warehousing but across a variety of markets. The answer to the second question is not soon: it takes years to develop a good optimiser—probably not as many years as it used to, because there is plenty of experience out there, which was not the case historically—but still a significant period.

Thirdly, yes it matters. You may have to pay a license fee for HP Vertica (or not, in the case of InfiniDB) but the performance advantages you get from having a decent optimiser will mean that you need significantly less hardware in order to get comparable performance, and that should more than offset any such license fees. And that also explains why I expect more vendors to do the same thing as InfiniDB and Vertica, because there is a window of opportunity while Cloudera gets its optimiser up to speed.


Environment: Windows 7 (64-bit) and VirtualBox

VM Image: CDH4 Packages for VirtualBox

1. Create a new Virtual Machine

Create new Cloudera Hadoop virtual machine

2. Enter a name for the new virtual machine and select the type of the guest operating system you plan to install into the virtual machine



3. Select the amount of base memory (RAM)

Setting up Cloudera Hadoop in Windows

4. Select “Use existing hard disk” and navigate to the folder where you downloaded the Cloudera demo VM.
If you don’t have the demo VM, download it from here: CDH4 Packages for VirtualBox

Running Cloudera Hadoop in Windows 7

5. You’re going to create a new virtual Cloudera Hadoop machine on your Windows 7 operating system


6. Now turn on Cloudera Hadoop in Windows and run the demo



7. Starting Cloudera Hadoop in Windows virtual machine


8. Cloudera Hadoop demo is now ready in Windows


Running the VM

Once you launch the VM, you are automatically logged in as the cloudera user.
The account details are:

  • username: cloudera
  • password: cloudera

The cloudera account has sudo privileges in the VM.

To learn more about Hadoop, see the Hadoop Tutorial.

You can access the status pages in your browser at the following URLs:

  • NameNode status (localhost:50070)
  • JobTracker status (localhost:50030)
  • The Hue user interface (localhost:8888)
  • The HBase web UI (localhost:60010)
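If you want a quick way to confirm these UIs are actually up, here is a small sketch; the ports are the defaults listed above and may differ if your VM maps them elsewhere.

```python
from urllib.request import urlopen

# Default web-UI ports from the list above; adjust if your VM differs.
ENDPOINTS = {
    "NameNode": "http://localhost:50070",
    "JobTracker": "http://localhost:50030",
    "Hue": "http://localhost:8888",
    "HBase": "http://localhost:60010",
}

statuses = {}
for name, url in ENDPOINTS.items():
    try:
        with urlopen(url, timeout=2) as resp:
            statuses[name] = f"up (HTTP {resp.status})"
    except OSError:  # covers connection refused, timeouts, DNS errors
        statuses[name] = f"not reachable at {url}"
    print(f"{name}: {statuses[name]}")
```

Run it from the host (or inside the VM) once the machine has booted; a “not reachable” line usually just means that service hasn’t finished starting yet.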



Enjoy your Cloudera Hadoop demo in Windows