Visual Analysis in R (Part 2): Using Data Visualization to Gain Insight from Data

Background:

Job Collins
5 min read · Jul 11, 2018

In the previous post, we used the Old Faithful Geyser dataset (faithful) to perform visual analysis on one-dimensional and two-dimensional numeric data. In this continuation of the Visual Analysis in R mini-project, we will perform a visual analysis of variable relationships. The datasets in use are mtcars and mpg. Buckle up!

Load the ggplot2 library first, since the mpg dataset ships with it.

library(ggplot2)

Loading the two datasets.

data(mtcars)
data(mpg)

We install the GGally package, which contains extensions to ggplot2 that simplify creating certain plots. More information on this package is available on its CRAN page.

install.packages('GGally')

Now we are set!

Variable Relationships and Scatter Plots

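The relationship between two numeric variables (and a third categorical variable) can be visualized using a scatter plot where the categorical variable controls the size, color, or shape of the markers. A categorical variable takes one of a fixed set of labels rather than a measured quantity, even when the labels are stored as numbers (as with am, coded 0 and 1).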

Here are the column names of the mtcars dataset.

names(mtcars)

Output:
[1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

Goal:

We want to show the relationship between two numeric variables, horsepower (hp) and miles per gallon (mpg), and how that relationship varies with the transmission (am) variable.

The Scatter Plot

plot(mtcars$hp, mtcars$mpg, pch = mtcars$am, xlab = "horsepower", cex = 1.2, ylab = "miles per gallon", main = "mpg vs hp by transmission")
legend("topright", c("automatic", "manual"), pch = c(0,1))
“pch” determines the shape of the marker in the scatter plot based on the value of the am variable: 0 (automatic) draws a square and 1 (manual) draws a circle.

From the graph, we can see that the higher the horsepower, the lower the miles per gallon. In other words, hp and mpg have an inverse (negative) relationship.
Also, we can see that for a given horsepower value, manual transmission cars are generally more fuel efficient than automatic transmission cars.
As indicated by the last two circles on the right, the two cars with the highest horsepower in this dataset are manual.
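
To back the visual reading with numbers, a minimal sketch using base R only could look like the following. The variable names (hp_mpg_cor, mpg_by_am, top_hp) are illustrative and not part of the original walkthrough.

```r
data(mtcars)

# Pearson correlation between horsepower and mpg; a negative value
# matches the downward trend seen in the scatter plot
hp_mpg_cor <- cor(mtcars$hp, mtcars$mpg)

# Mean mpg per transmission type (am: 0 = automatic, 1 = manual)
mpg_by_am <- tapply(mtcars$mpg, mtcars$am, mean)

# The two cars with the highest horsepower and their transmission code
top_hp <- mtcars[order(-mtcars$hp), c("hp", "am")][1:2, ]

print(hp_mpg_cor)
print(mpg_by_am)
print(top_hp)
```

The group means do not hold horsepower fixed, so they only roughly support the "for a given horsepower" comparison; the scatter plot remains the better view for that.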

The Multivariable Scatter Plot

The mpg dataset contains fuel economy and other car attributes. The column names are:

names(mpg)

Output:
[1] "manufacturer" "model" "displ" "year" "cyl"
[6] "trans" "drv" "cty" "hwy" "fl"
[11] "class"

In this part, we’ll change the marker size to visualize the relationship between three numeric variables. The marker size works around the scatter plot’s two-dimensional limitation by encoding the third variable. The example below uses the mtcars dataset again.

qplot(x = wt, y = mpg, data = mtcars, size = cyl, main = "MPG vs Weight(x1000lbs) by Cylinder")

We can see that an inverse relationship exists between MPG and weight: heavier vehicles tend to have lower miles per gallon. Additionally, the larger markers toward the right of the plot show that heavier vehicles tend to have more cylinders.
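
Since qplot() has been deprecated in recent ggplot2 releases, the same plot can be sketched with ggplot() directly. This is an equivalent under the assumption that ggplot2 is installed; the axis and legend labels are my own choices.

```r
library(ggplot2)
data(mtcars)

# Map weight and mpg to the axes and the cylinder count to marker size
p <- ggplot(mtcars, aes(x = wt, y = mpg, size = cyl)) +
  geom_point() +
  labs(
    title = "MPG vs Weight (x1000 lbs) by Cylinder",
    x = "weight (1000 lbs)",
    y = "miles per gallon",
    size = "cylinders"
  )

print(p)
```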

Noisy Data

Noisy data can be hard to draw insight from, even when visualized. For example, we can plot MPG vs Displacement from mtcars…

qplot(disp, mpg, data = mtcars, main = "MPG vs Displacement" )

When data is noisy, it’s wise to add a smooth curve to visualize the underlying trend. To get this trend from the above visualization, we add a LOESS smoother. In current versions of ggplot2, extra arguments for the fitting function such as degree are passed through method.args:

qplot(disp, mpg, data = mtcars, main = "MPG vs Displacement") + stat_smooth(method = "loess", method.args = list(degree = 0), span = 0.2, se = TRUE)
The smoothed trend curve, in blue

We can now see the trend in the MPG vs Displacement scatter plot.
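
To see what the smoother computes, here is a rough sketch that fits LOESS directly with base R's loess() and evaluates it on a grid. The two span values (0.9 and 0.5) and the names grid, fit_wide and fit_narrow are illustrative assumptions, not settings from this post.

```r
data(mtcars)

# Evenly spaced displacement values covering the observed range
grid <- data.frame(disp = seq(min(mtcars$disp), max(mtcars$disp), length.out = 50))

# span is the fraction of points used in each local fit:
# a larger span gives a smoother curve, a smaller one follows the data more closely
fit_wide <- loess(mpg ~ disp, data = mtcars, span = 0.9)
fit_narrow <- loess(mpg ~ disp, data = mtcars, span = 0.5)

# Predicted mpg along the grid for each fit
grid$mpg_wide <- predict(fit_wide, newdata = grid)
grid$mpg_narrow <- predict(fit_narrow, newdata = grid)

print(head(grid))
```

Plotting both columns against disp (for example with lines()) shows how the choice of span changes the shape of the curve.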

Facets

Facets allow us to visualize more than two dimensions. They split a plot into panels, one per level of a categorical variable, so the same two variables can be compared across groups side by side.

Creating Facets

Before we create any facet, we need to add transmission (am) and engine shape (vs) columns with meaningful labels to the mtcars dataframe for the sake of labelling. According to the mtcars documentation, am is coded 0 = automatic, 1 = manual, and vs is coded 0 = V-shaped, 1 = straight.

mtcars$amf[mtcars$am==0] = 'automatic'
mtcars$amf[mtcars$am==1] = 'manual'
mtcars$vsf[mtcars$vs==0] = 'V-shaped'
mtcars$vsf[mtcars$vs==1] = 'straight'
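
A more compact way to build such label columns is factor(), which also fixes the panel order. This is a sketch of an alternative rather than the approach used above; the labels follow the coding in the mtcars help page (am: 0 = automatic, 1 = manual; vs: 0 = V-shaped, 1 = straight).

```r
data(mtcars)

# levels gives the raw codes, labels gives the display names in the same order
mtcars$amf <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
mtcars$vsf <- factor(mtcars$vs, levels = c(0, 1), labels = c("V-shaped", "straight"))

# Cross-tabulate to check how many cars fall in each combination
counts <- table(mtcars$vsf, mtcars$amf)
print(counts)
```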

Now the facets,

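qplot(x = wt, y = mpg, facets = .~amf , data = mtcars, main = "MPG vs Weight by Transmission", color = I('darkred'))
MPG vs weight, faceted by transmission

Here we can compare automatic and manual transmission with ease. The x-axis and y-axis are the same for both panels. Wrapping the color in I() tells qplot to use 'darkred' as a literal color rather than as a variable to map.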

Let’s now compare V-shaped (vs = 0) and straight (vs = 1) engines.

qplot(x = wt, y = mpg, facets = vsf~., data = mtcars, main = "MPG vs Weight by Engine")

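We can clearly see that the heavier cars tend to have V-shaped engines (vs = 0) rather than straight engines (vs = 1), and that cars with V-shaped engines sit at lower miles per gallon than cars with straight engines.
Note how “.~” and “~.” affect the panels: “. ~ amf” places the panels side by side in columns, while “vsf ~ .” stacks them in rows.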
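
The rows ~ columns convention can also be sketched with ggplot() and facet_grid(), which takes the same formula. This assumes ggplot2 is installed; the label columns are rebuilt here so the snippet runs on its own, using the coding from the mtcars help page.

```r
library(ggplot2)
data(mtcars)

# Label columns (am: 0 = automatic, 1 = manual; vs: 0 = V-shaped, 1 = straight)
mtcars$amf <- factor(mtcars$am, levels = c(0, 1), labels = c("automatic", "manual"))
mtcars$vsf <- factor(mtcars$vs, levels = c(0, 1), labels = c("V-shaped", "straight"))

base <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# Left of ~ defines rows, right of ~ defines columns; "." means no split
p_cols <- base + facet_grid(. ~ amf)    # one row, a column per transmission
p_rows <- base + facet_grid(vsf ~ .)    # one column, a row per engine shape
p_both <- base + facet_grid(vsf ~ amf)  # rows by engine shape, columns by transmission

print(p_both)
```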

Now let’s combine the above two facets:

qplot(x = wt, y = mpg, facets = vsf~amf, data = mtcars, main = "MPG vs Weight by Transmission and Engine")
A combination of the above two facets

From this visualization, we can see a number of things more clearly:

1. Heavier automatic vehicles use V-shaped engines (vs = 0). Lighter automatic vehicles use straight engines (vs = 1).
2. V-shaped engines are generally found in the heavier vehicles for both automatic and manual transmissions. Straight engines are found in the lighter vehicles.
3. Manual transmission vehicles are generally lighter than automatic transmission vehicles.
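
These observations can be cross-checked with group means. The following is a minimal sketch in base R that works on the raw codes (vs: 0 = V-shaped, 1 = straight; am: 0 = automatic, 1 = manual); the names group_means and wt_by_am are illustrative.

```r
data(mtcars)

# Mean weight and mpg for each engine shape / transmission combination
group_means <- aggregate(cbind(wt, mpg) ~ vs + am, data = mtcars, FUN = mean)
print(group_means)

# Mean weight by transmission alone
wt_by_am <- tapply(mtcars$wt, mtcars$am, mean)
print(wt_by_am)
```

Each row of group_means corresponds to one panel of the combined facet plot.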

Conclusion

Manual cars with straight engines (vs = 1) have lower weights and are more fuel efficient. Automatic cars with V-shaped engines (vs = 0) are heavier and less fuel efficient.

In this mini-project we have used R to visualize data and gain insight from the visualizations. I hope it was easy to follow and has shown why data visualization matters. This marks the end of the mini-project :).
