Charts, also called plots or graphs, are illustrations of numerical data. When appropriately used, charts show properties of the data not readily apparent in a list of numbers.
Capable software for generating charts often presents a bewildering array of options, but we think most practitioners can get by with four basic charts. Other charts are variations on these four or are seldom used.
Some things to ask when creating a chart: Which basic chart type is best for the data? Have irrelevant and distracting details been eliminated? Is the chart fully labelled, and does it follow accepted conventions so that others can understand it?
Tables are a general way to display rectangular data.
- bar charts
- horizontal bar charts
- pie charts
- connecting the dots
- additional data sets
- fiddling with the width (or number of bins)
- strip chart (with jitter)
- estimating density distribution
- overlaying two data sets
- additional data sets
- fitting a line
- quantile-quantile plot
The starting point for most charts, and all the charts we describe here, is rectangular data.
Rectangular data is a list of tuples. The tuples have a fixed length and are called rows or records.
A column or variable is a list consisting of the n-th element of each row. The data in each column should have a consistent type.
The order of the rows in the list is a presentation detail and not a property of the data. Re-arranging the rows does not change the data. Strictly speaking, rectangular data should be defined as a set of tuples, but in practice it is implemented as a list of tuples.
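The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the helper names `column` and `is_rectangular` are hypothetical, and the three sample rows are chosen only for the example.

```python
# Three illustrative rows of (place, name, age); each tuple is one record.
rows = [
    (1, "Erik Schulte", 26),
    (2, "Michael Carson", 28),
    (3, "Ruperto Romero", 51),
]

def column(rows, n):
    """Return the n-th element of each row as a list (hypothetical helper)."""
    return [row[n] for row in rows]

def is_rectangular(rows):
    """True if every row has the same length and each column holds one type."""
    if not rows:
        return True
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        return False
    return all(len({type(v) for v in column(rows, n)}) == 1 for n in range(width))

ages = column(rows, 2)

# Row order is a presentation detail: as sets, the two orderings are equal.
same_data = set(rows) == set(reversed(rows))
```

Comparing the rows as sets is one way to express that re-arranging them leaves the data unchanged, while the list keeps a convenient order for display.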
Relational data tables and R data frames are two canonical examples of rectangular data.
Rectangular data can be stored in CSV files, tab-delimited files, or Excel spreadsheets, but these formats do not enforce the constraints that the rows have the same lengths and consistent types.
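Because a CSV file does not enforce row length, it can be worth checking after reading. The sketch below uses Python's standard `csv` module; the function name `ragged_rows` and the sample text are assumptions made for illustration.

```python
import csv
import io

# Hypothetical CSV text; the second data row is missing its last field.
text = "place,name,age\n1,Erik Schulte,26\n2,Michael Carson\n3,Ruperto Romero,51\n"

def ragged_rows(csv_text):
    """Return (line_number, row) pairs whose length differs from the header's."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    return [(reader.line_num, row) for row in reader if len(row) != len(header)]

bad = ragged_rows(text)
```

A similar pass could attempt to convert each column to its expected type and report the rows that fail.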
univariate, bivariate, identifier, index
"On the Theory of Scales of Measurement" (Stevens) classifies the data in a column or variable as belonging to one of four types:
- categorical (called nominal by Stevens)
- ordinal
- interval
- ratio
The type of data governs which charts are appropriate for plotting the data. To illustrate them, consider the following data set:
| Place | Name | M/F | Age | Finish | City | State |
|---|---|---|---|---|---|---|
| 1 | Erik Schulte | M | 26 | 19:46:20 | Altadena | CA |
| 2 | Michael Carson | M | 28 | 20:14:24 | Tempe | AZ |
| 3 | Ruperto Romero | M | 51 | 20:44:49 | Hunington Park | CA |
| 4 | Ashley Nordell | F | 35 | 22:35:38 | Sisters | CA |
| 5 | Kenneth Ringled | M | 34 | 22:49:10 | Simi Valley | CA |
| 6 | Gregory Benson | M | 32 | 22:58:50 | San Francisco | CA |
| 7 | Pete Sercel | M | 49 | 23:14:50 | Pasadena | CA |
| 8 | Jennifer Benna | F | 35 | 23:24:35 | Reno | NV |
The M/F, City, and State columns are categorical variables. Categorical variables are often represented by strings.
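Categorical values can be compared for equality and counted, but not ordered or averaged. As a small sketch, the counts for the M/F and State columns of the race results can be tallied with `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# M/F and State columns copied from the race results above.
sex = ["M", "M", "M", "F", "M", "M", "M", "F"]
state = ["CA", "AZ", "CA", "CA", "CA", "CA", "CA", "NV"]

# Counting occurrences is the basic summary available for a categorical column.
sex_counts = Counter(sex)
state_counts = Counter(state)

print(sex_counts.most_common())
print(state_counts.most_common())
```

Counts like these are the usual input to a bar chart of a categorical variable.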
The data set doesn't include an ordinal variable. Ski resorts classify slopes as easy, moderate, and hard using a green circle, blue square, and black diamond. The values are ordered by difficulty.
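One way to represent an ordinal variable in code is to keep the labels as strings and store their order separately. The following is a sketch under that assumption; the slope names are invented for illustration.

```python
# Difficulty levels in increasing order, as described in the text.
LEVELS = ["easy", "moderate", "hard"]
RANK = {level: i for i, level in enumerate(LEVELS)}

# Hypothetical slopes, invented for this example.
slopes = [("Slope A", "hard"), ("Slope B", "easy"), ("Slope C", "moderate")]

# Ordinal values can be sorted and compared by rank.
by_difficulty = sorted(slopes, key=lambda s: RANK[s[1]])
harder = RANK["hard"] > RANK["easy"]
```

The ranks are used only for comparison. Differences between ranks carry no meaning, since nothing says that the step from easy to moderate equals the step from moderate to hard.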
The data set doesn't include an interval variable. Longitude is an example of an interval variable. Finding the difference between the longitudes of two points on the equator gives the angular distance between them in degrees.
The Age and Finish columns are ratio variables. The difference between a ratio value and an interval value is the nature of the zero value. In the case of longitude, the value zero was assigned to the meridian running through Greenwich, but any other meridian could have been used. In the case of age or finish times, there is an obvious choice for the zero value. It is meaningful to compare the ratios of two ratio values. We can say that someone is twice as old as someone else, or that they took twice as long to finish a race. We can form the same kind of statement about longitude, but because the zero point of longitude is arbitrary, the statement is empty: it changes if a different meridian is chosen as zero.
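The distinction can be checked numerically. In the sketch below, the finish times are taken from the race results, while the two longitudes and the 5-degree shift of the zero meridian are hypothetical values chosen for illustration; `to_seconds` is an illustrative helper.

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS' finish time to a number of seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

# Ratio variable: finish times of the first- and eighth-place runners.
first = to_seconds("19:46:20")
eighth = to_seconds("23:24:35")
ratio = eighth / first  # meaningful: how many times longer the eighth runner took

# Interval variable: two hypothetical longitudes, in degrees east.
lon_a, lon_b = 10.0, 40.0
shift = 5.0  # hypothetical: move the zero meridian 5 degrees east

# Differences survive a change of zero point...
diff_before = lon_b - lon_a
diff_after = (lon_b - shift) - (lon_a - shift)

# ...but ratios do not.
ratio_before = lon_b / lon_a
ratio_after = (lon_b - shift) / (lon_a - shift)
```

The difference is 30 degrees under either choice of zero, while the ratio moves from 4 to 7, which is why ratios of interval values say nothing about the data itself.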