Metrics visualization toolkit for Grid5000

This set of scripts allows you, after an experiment on the Grid5000 platform, to fetch per-machine metrics (CPU, network usage, ...) and see them nicely stacked, thanks to Mike Bostock's D3.js. It can also draw histograms from the system counters of different Hadoop experiments. For general use, however, the Hadoop part can be bypassed if you just want to see how resource usage across your cluster evolved during an experiment (see below).

Developed and tested on Firefox. We would welcome any feedback about running it in other browsers.

Pre-flight check

Resource usage is fetched through Grid5000's REST API, which is itself fed by the Ganglia monitoring system. Ganglia is installed and running on the default system image, but if you are using a custom image, check that the daemon is still running after reboot - you may have to run /etc/init.d/ganglia-monitor restart as root on each machine. Once your machines are installed, check that they still appear in Grid5000's per-site Ganglia reports.
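If you script your deployments, the restart step above can be automated. The sketch below only builds the per-node command given in the text; the hostnames and the ssh invocation are illustrative, not part of the toolkit:

```python
# Illustrative sketch: build, for each node of a deployment, the restart
# command described above. The hostnames are placeholders.
def ganglia_restart_cmd(node):
    """Command restarting the Ganglia daemon on `node`, as root."""
    return ["ssh", f"root@{node}", "/etc/init.d/ganglia-monitor", "restart"]

for node in ["node-1", "node-2"]:
    print(" ".join(ganglia_restart_cmd(node)))
```

The command lists can then be passed to subprocess.run() to actually execute them.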

Once you have ensured that Ganglia is still part of your deployment, you only have to copy the grabMetrics.py script so that it is available to your OAR driver script or shell.

Usage

  1. Assuming you run a job by invoking:
    hadoop [my hadoop parameters] &> client.log
    Wait at least 30 seconds after the job terminates, so that Ganglia collects the last metrics.
  2. Let $MASTERS and $SLAVES be the paths to the Hadoop configuration files that contain the master and slave node lists. Invoke:
    ./grabMetrics.py client.log $MASTERS $SLAVES > file.json
  3. You may repeat the operation for as many Hadoop runs as you want. Note that the scripts manage a single JSON file per program execution, but each execution may include many MapReduce jobs.
  4. Let's say you launched 3 Hadoop programs and created file1.json, file2.json and file3.json. Edit the first script block in visu/index.html so that it contains:
    var inputs = [
          "file1.json",
          "file2.json",
          "file3.json"
        ];
  5. Open visu/index.html in a browser. See what happened. Go back to debugging :)
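Steps 1-3 can be wrapped in a small driver. This is only a sketch: the 30-second wait and the grabMetrics.py invocation come from the steps above, everything else (function name, paths) is illustrative:

```python
import subprocess
import time

def grab_metrics(client_log, masters, slaves, out_path):
    """Wait for Ganglia to record the last samples, then dump the metrics JSON."""
    time.sleep(30)  # step 1: give Ganglia time to collect the last metrics
    with open(out_path, "w") as out:
        # step 2: ./grabMetrics.py client.log $MASTERS $SLAVES > file.json
        subprocess.run(["./grabMetrics.py", client_log, masters, slaves],
                       stdout=out, check=True)

# step 3: repeat for each Hadoop run, e.g.
# grab_metrics("client.log", "conf/masters", "conf/slaves", "file1.json")
```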

Notes on the example

All time measurements are in seconds.

The first panel, "Counters", draws histograms from the Hadoop system counter(s) you selected; hover over the "Counters" button to see the list of counters. For programs running more than one MapReduce job, counter names are prefixed by their job ID. If you generated several JSON files, you may check "Filter on XP name", click "Draw", and enter a regular expression: the histograms will then only display the experiments matching the expression.
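The filter behaves like a regular-expression search over experiment names. A quick illustration (the names below are made up, and the real matching is done by the page's JavaScript):

```python
import re

# Hypothetical experiment (JSON file) names.
experiments = ["wordcount-small", "terasort-small", "wordcount-large"]

pattern = "wordcount"  # what you would type in the "Filter on XP name" box
shown = [name for name in experiments if re.search(pattern, name)]
print(shown)  # only the matching experiments are drawn
```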

The second panel, "System metrics", stacks the slaves' resource usage for the selected metric and experiment. Each colored layer corresponds to a single machine; however, only 10 colors are used, so a color may be shared by several machines.
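The color reuse amounts to indexing a fixed palette modulo its size, as D3's 10-color categorical scale does. A sketch, with placeholder color names:

```python
# A 10-entry palette, analogous to D3's category10 scale (names are placeholders).
palette = [f"color-{k}" for k in range(10)]

def machine_color(machine_index):
    """Color assigned to the i-th machine: the palette wraps around."""
    return palette[machine_index % len(palette)]

# Machines 0 and 10 end up sharing the same color.
print(machine_color(0), machine_color(10))
```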

⚠ The "Print" and "Source" features are experimental ⚠

These two buttons are intended to provide a PDF and an SVG export, respectively (tip: ask your browser to print without headers/footers, then use pdfcrop). They are far from perfect because we are limited by browser capabilities...

Adapting it to your needs

Showing more/different metrics: simply edit the metrics list at the bottom of the grabMetrics.py script. You can find the list of available metrics in Grid5000's Ganglia installation.
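As an example, such a metrics list might look like the following. The variable name is illustrative; cpu_idle, bytes_in, bytes_out and mem_free are standard Ganglia metric names, but check your site's Ganglia reports for the identifiers actually available:

```python
# Illustrative only: adapt to the actual list found at the bottom of
# grabMetrics.py. These identifiers are standard Ganglia metrics.
metrics = [
    "cpu_idle",    # percentage of CPU time spent idle
    "bytes_in",    # incoming network traffic (bytes/s)
    "bytes_out",   # outgoing network traffic (bytes/s)
    "mem_free",    # free memory
]
```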

Removing the Hadoop part: to do so, tune the main block at the end of the grabMetrics.py script. Change its arguments so that you provide two timestamps (the start and end of the experiment) and a list of machines. See which fields are required in the generated JSON in visu/non-Hadoop-job.json .
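For the two timestamps, plain Unix timestamps taken around the experiment work. A sketch (the machine names are placeholders, and nothing here is part of grabMetrics.py itself):

```python
import time

# Placeholder hostnames: use your actual deployment's machine list.
machines = ["node-1.nancy.grid5000.fr", "node-2.nancy.grid5000.fr"]

start = int(time.time())  # just before launching the experiment
# ... run the experiment here ...
end = int(time.time())    # just after it finishes
```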