This set of scripts lets you fetch per-machine metrics (CPU, network usage, ...) after an experiment on the Grid5000 platform and see them nicely stacked, thanks to Mike Bostock's D3.js. It can also draw histograms from the system counters of different Hadoop experiments. For general use, though, the Hadoop part can be bypassed if you just want to see how resource usage evolved across your cluster during an experiment (see below).
Developed and tested on Firefox. We would welcome any feedback about running it in other browsers.
Resource usage is fetched through Grid5000's REST API, which is itself fed by the Ganglia monitoring system.
Ganglia is installed and running on the default system image, but if you're using a custom image, check that the daemon is still running after reboot - you may have to call /etc/init.d/ganglia-monitor restart as root on each machine.
Check that your machines, once installed, still appear in Grid5000's per-site Ganglia reports.
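If the web reports are ambiguous, you can also probe a node directly: Ganglia's gmond daemon listens on TCP port 8649 by default. The helper below is our own sketch, not part of grabMetrics.py, and assumes the default port:

```python
import socket

def gmond_reachable(host, port=8649, timeout=2.0):
    """Return True if a TCP connection to the Ganglia gmond port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe every node of your reservation
# for node in open("slaves"):
#     print(node.strip(), gmond_reachable(node.strip()))
```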
Once you have ensured that Ganglia is still part of your deployment, simply copy the grabMetrics.py script so that it is available to your OAR driver script or shell.
Run your job, keeping the client output:
hadoop [my hadoop parameters] &> client.log
Wait at least 30 seconds after the job terminates, while Ganglia fetches the last metrics.
Let $MASTERS and $SLAVES be the paths to the Hadoop configuration files containing the master and slaves node lists. Invoke:
./grabMetrics.py client.log $MASTERS $SLAVES > file.json
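The $MASTERS and $SLAVES files are ordinary Hadoop node lists, one hostname per line. If you need to build such a list yourself (e.g. from $OAR_NODEFILE, which repeats each host once per reserved core), a parse along these lines is enough - the helper name is ours, not taken from grabMetrics.py:

```python
def read_nodes(path):
    """Read a node list file (one hostname per line), skipping blank
    lines and de-duplicating while preserving order, since OAR node
    files list each host once per reserved core."""
    seen = []
    with open(path) as f:
        for line in f:
            host = line.strip()
            if host and host not in seen:
                seen.append(host)
    return seen
```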
Repeat for each of your experiments, producing for instance file1.json, file2.json and file3.json.
In the visu folder, edit the first script block in visu/index.html so that it contains:
var inputs = [ "file1.json", "file2.json", "file3.json" ];
Then open visu/index.html in a browser.
See what happened.
Go back to debugging :)
All time measurements are in seconds.
The first panel, "Counters", draws histograms from the Hadoop system counter(s) you selected; move your mouse over the "Counters" button to see the list of counters. For programs running more than one MapReduce job, counter names are prefixed by their job ID. If you generated many JSON files, you may check "Filter on XP name", click "Draw", and enter a regular expression; the histograms will then only display the experiments matching the expression.
The second panel, "System metrics", stacks the slaves' resource usage for the selected metric and experiment. Each colored layer corresponds to a single machine; however, since we only use 10 colors, a color may be shared by several machines.
⚠ The "Print" and "Source" features are experimental ⚠
These two buttons are intended to provide PDF and SVG exports, respectively (tip: ask your browser to print without headers/footers, then use pdfcrop). They are far from perfect, though, as we are limited by browser capabilities...
Showing more/different metrics: simply edit the metrics list at the bottom of the grabMetrics.py script. You can find the list of available metrics on Grid5000's Ganglia installation.
Removing the Hadoop part: to do so, tune the main block at the end of the grabMetrics.py script. Change its arguments so that you provide two timestamps (start and end of the experiment) and a list of machines. See which fields are required in the generated JSON in visu/non-Hadoop-job.json.
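For the non-Hadoop mode you need the experiment boundaries as Unix timestamps; if you noted them as wall-clock times, a conversion like the one below does the job. This is a generic helper, not part of grabMetrics.py, and it assumes UTC - adjust to your cluster's timezone:

```python
from datetime import datetime, timezone

def to_unix(date_string, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a wall-clock time (assumed UTC here) to a Unix timestamp."""
    dt = datetime.strptime(date_string, fmt).replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

start = to_unix("2015-03-02 14:00:00")
end = to_unix("2015-03-02 14:30:00")
# end - start is 1800 seconds, i.e. a 30-minute experiment
```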