Visualizing data

One of the first tasks when working with new data is to understand what it contains and how it is structured.

Stats

Perhaps the most common way of visualizing some data is to list some basic stats about each of its attributes. For example, let's take a look at the iris dataset (which you can download from MLData.org).

	waffles_plot stats iris.arff
This is one command that you should probably memorize, because you will use it frequently. Here is its output:
	Filename: iris.arff
	Patterns: 150
	Attributes: 5 (Continuous:4, Nominal:1)
	  0) sepallength, Type: Continuous, Mean:5.8433333, Dev:0.82806613, Median:5.8, Min:4.3, Max:7.9, Missing:0
	  1) sepalwidth, Type: Continuous, Mean:3.054, Dev:0.43359431, Median:3, Min:2, Max:4.4, Missing:0
	  2) petallength, Type: Continuous, Mean:3.7586667, Dev:1.7644204, Median:4.35, Min:1, Max:6.9, Missing:0
	  3) petalwidth, Type: Continuous, Mean:1.1986667, Dev:0.76316074, Median:1.3, Min:0.1, Max:2.5, Missing:0
	  4) class, Type: Nominal, Values:3, Most Common:Iris-setosa (33.333333%), Entropy: 1.5849625, Missing:0
	     33.333333% Iris-setosa
	     33.333333% Iris-versicolor
	     33.333333% Iris-virginica
As you can see, it shows some very basic information about each attribute in the dataset.
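
The Entropy value reported for the class attribute can be checked by hand. Here is a minimal sketch, assuming the tool reports Shannon entropy in bits (base 2), which is consistent with the figure above: three equally common classes give log2(3), or about 1.585. The entropy_bits helper is a hypothetical name and is not part of Waffles; it only needs awk. The counts of 50 follow from 150 patterns split evenly across three classes.

```shell
# Hypothetical helper (not a Waffles tool): reads one class count per line
# on stdin and prints the Shannon entropy of the proportions in bits.
entropy_bits() {
  awk '
    { count[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        p = count[i] / total
        if (p > 0) h -= p * log(p) / log(2)   # log base 2 via natural logs
      }
      printf "%.7f\n", h
    }'
}

# 150 iris patterns, one third in each class:
printf '%s\n' 50 50 50 | entropy_bits   # prints 1.5849625
```

A skewed class distribution would give a smaller value, so this number is a quick way to spot unbalanced labels.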

Overview

Another common way to look at data is to plot the correlations between various attributes. The following command will generate such a plot for every pair of attributes.

	waffles_plot overview iris.arff

Each plot at column i, row j, shows how well the value in attribute i can predict the value in attribute j. (So, if the last attribute is a class label, then the bottom row of the correlation plot matrix is usually the one in which you will be most interested.)

As another example, here are the overview plots from a subset of the attributes in the adult-census.arff dataset. (I only show a subset here because there are a lot of attributes in that dataset.)

	waffles_plot overview adult-census.arff

Several trends can be immediately observed in these plots. Most of the people in the census appear to have made less than $50K. Age was somewhat correlated with a greater likelihood of making more than $50K. Those who indicated a marital status of "Married-civ-spouse" were much more likely to make more than $50K than those with other values in this attribute. Those who indicated that their relationship was either "wife" or "husband" were more likely to make more than $50K than those who indicated something else. The significant majority of the people in the census indicated race to be "white", and those who indicated "white" or "Asian-Pac-Islander" were more likely to make more than $50K than those with other values. Males were more likely than females to make more than $50K, and there appear to have been more males than females in this census. Most people seem to have worked approximately 40 hours per week.

Anyway, the point is, you can tell a lot about a dataset just by examining the overview plots.

Histograms

Histograms are a good way to look at the distribution of some data. The following commands will draw a million random values from a gamma distribution, and then plot a histogram of them.

	waffles_generate noise 1000000 -seed 0 -dist gamma 9 2 > gamma.arff
	waffles_plot histogram gamma.arff
Here is the output of that command:

Model space

If you have a model that has been trained on some dataset, you might want to visualize that dataset with the trained model. This example will train a neural network (with no hidden layers) on the iris dataset, and will then create a visualization of that model.

	waffles_learn train -seed 0 iris.arff neuralnet > nn.json
	waffles_plot model nn.json iris.arff 2 3

Equations

You can visualize equations too.

	waffles_plot equation -range -6 0 6 1 "f1(x) = 1/(1+e^(-x))"
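
To sanity-check a plotted curve, it can help to evaluate the equation at a few points. The sketch below uses awk rather than any Waffles tool, and logistic is a hypothetical helper name. It evaluates f1 at x = -6, 0, and 6, which look like the horizontal limits and midpoint of the range in the command above (an assumption based on the numbers, not on the usage text).

```shell
# Hypothetical helper (not a Waffles tool): evaluates 1/(1+e^(-x)) for one x.
logistic() {
  awk -v x="$1" 'BEGIN { printf "%.4f\n", 1 / (1 + exp(-x)) }'
}

for x in -6 0 6; do
  printf 'f1(%s) = %s\n' "$x" "$(logistic "$x")"
done
# prints:
# f1(-6) = 0.0025
# f1(0) = 0.5000
# f1(6) = 0.9975
```

The values stay between 0 and 1, so a vertical range of 0 to 1 would show the whole curve.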

You can plot multiple equations together. Also, our tools let you define helper functions that you can use within your equations. Example:

	waffles_plot equation -range -10 0 10 10 "f1(x)=log(x^2+1)+2;f2(x)=\
	   x^2/g(x)+2;g(m)=10*(cos(m)+pi);f3(x)=sqrt(49-x^2);f4(x)=abs(x)-1"

3D

You can also plot in 3D:

	waffles_generate manifold 3000 -seed 1234 "y1(x1, x2) = sin(2 * t(x1));\
	                      y2(x1, x2) = -2 * cos(t(x1)); y3(x1,x2) = 2 * x2;\
			      t(x) = 3 * pi * x / 2 + pi / 4" > in.arff
	waffles_plot 3d in.arff

Scatter plots

Of all visualization methods, scatter plots may be the most useful.

	waffles_plot scatter mydata.arff
This tool assumes that attribute 0 specifies a position on the horizontal axis. All other attributes specify points that should be plotted. By default, each attribute will be plotted with a unique color. You can also specify the "-lines" option to turn it into a line-graph.

If your data lacks an attribute that specifies position on the horizontal axis, you can add one, and then plot it.

	waffles_transform addindexcolumn mydata.arff > ready_for_scatter_plot.arff
	waffles_plot scatter ready_for_scatter_plot.arff -lines
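
For readers without a suitable file at hand, the following sketch writes a small two-attribute ARFF file and then runs the same two steps on it. The attribute names and data values are made up for illustration, and the Waffles commands only run if the tools are found on the PATH.

```shell
# Hypothetical sample data: two continuous attributes and no index column.
cat > mydata.arff <<'EOF'
@RELATION mydata
@ATTRIBUTE temperature real
@ATTRIBUTE humidity real
@DATA
20.1,0.55
21.4,0.52
23.0,0.47
22.2,0.50
19.8,0.58
EOF

# Run the two steps from the text only if the Waffles tools are installed.
if command -v waffles_transform >/dev/null 2>&1 && command -v waffles_plot >/dev/null 2>&1; then
  # Add a column for the horizontal axis, then plot the file that has it.
  waffles_transform addindexcolumn mydata.arff > ready_for_scatter_plot.arff
  waffles_plot scatter ready_for_scatter_plot.arff -lines
else
  echo "Waffles tools not found on PATH; skipping the plot step." >&2
fi
```

Note that the plot command is given the file produced by the transform step, since that is the one with the added horizontal-axis attribute.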

This tool can handle logarithmic scales, and a plethora of other useful options. To see all available options, take a look at the usage information.

	waffles_plot usage

etc.

There are a few other visualization techniques available. For a complete list, see the usage information for the waffles_plot tool.

