Visual Analysis in R (Part 1): Using Data Visualization to Gain Insight from Data
Introduction
The Old Faithful geyser in Yellowstone National Park, Wyoming, USA, is both a tourist attraction and a well-known geothermal phenomenon. In this mini-project, I carry out a visual analysis in R of the faithful dataset, focusing on the eruption time and the waiting time.
Load the Faithful Dataset
The faithful dataset ships with R itself, in the built-in datasets package, so there is no need to download it.
We load the dataset and list its column names as follows:
data(faithful)
names(faithful)
Output: [1] "eruptions" "waiting"
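Before plotting, it can help to confirm the shape of the data. The following is a minimal sketch using only base R; the variable names n_obs and n_vars are my own and not part of the dataset.

```r
# faithful ships with R's built-in 'datasets' package
data(faithful)

n_obs  <- nrow(faithful)   # number of recorded eruptions
n_vars <- ncol(faithful)   # number of columns

str(faithful)      # column types and the first few values
summary(faithful)  # min, quartiles, mean and max of each column
```

Both columns are numeric and measured in minutes, which is why every plot in this post treats them as continuous values.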
Including Plots
Strip Plot of the Data
plot(faithful$eruptions, xlab = "Sample Number", ylab = "Eruption Times(min)", main = "Old Faithful Eruption Times")
From the plot we can conclude that Old Faithful has two typical eruption times: a longer one of around 4.5 minutes and a shorter one of around 2 minutes. The points form clear clusters around these two values.
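Base R also has a dedicated stripchart() function, which draws all the values along a single axis. The sketch below is an alternative view of the same data, not the plot used above. The 3-minute cutoff that separates the two groups is an assumption I read off the plot by eye.

```r
data(faithful)

# One-dimensional strip chart; jitter spreads out overlapping points
stripchart(faithful$eruptions, method = "jitter", pch = 1,
           xlab = "Eruption Times(min)",
           main = "Old Faithful Eruption Times")

# Split at an assumed cutoff of 3 minutes and summarize each group
cutoff <- 3
group <- ifelse(faithful$eruptions < cutoff, "short", "long")
group_counts <- table(group)
group_means <- tapply(faithful$eruptions, group, mean)
print(group_counts)
print(group_means)
```

The group means give a rough numeric location for each cluster, which is easier to quote than a value judged from the plot.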
Histograms
Histograms graph one-dimensional data by dividing the range into bins and counting the number of observations in each bin. The bin width is therefore a critical parameter and must be chosen with care.
A good bin width gives a clear histogram with little noise. A bin width that is too small produces a noisy, jagged histogram, whereas one that is too large smooths away the structure and obscures any insight.
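To see the effect of the bin width without drawing anything, we can count the bins directly. This is a rough sketch using base R's hist() with plot = FALSE. The helper bin_counts() and the widths of 1, 3 and 10 minutes are illustrative choices of mine.

```r
data(faithful)
waiting <- faithful$waiting

# Count the observations per bin for a given bin width (in minutes).
# Breaks are aligned to multiples of the width so they cover the whole range.
bin_counts <- function(x, width) {
  breaks <- seq(floor(min(x) / width) * width,
                ceiling(max(x) / width) * width,
                by = width)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

# Widths of 1, 3 and 10 minutes are illustrative choices
for (w in c(1, 3, 10)) {
  counts <- bin_counts(waiting, w)
  cat("width", w, ":", length(counts), "bins,",
      sum(counts == 0), "empty, largest bin holds", max(counts), "\n")
}
```

A narrow width spreads the same observations over many sparsely filled bins, while a wide width packs them into a handful of bins that can hide the gap between the two groups.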
In the following histogram I used a bin width of 3 minutes.
library(ggplot2)
ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 3, fill = 'darkmagenta') + ggtitle("Waiting Time(min) to Next Eruption")
The y-axis of a histogram usually shows the count of observations in each bin, but it can also show the density (relative frequency per unit of bin width) instead, as below:
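library(ggplot2)
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 3.3.0 or later)
ggplot(faithful, aes(x = waiting, y = after_stat(density))) + geom_histogram(binwidth = 4, fill = 'darkblue') + ggtitle("Waiting Time(min) to Next Eruption")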
From the two histograms we can see that the waiting times have two peaks: one around 50 minutes and one around 80 minutes.
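To make the density scale concrete, here is a small check in base R. The break points (40 to 100 in steps of 4) are my own choice and need not match the bin edges that ggplot picks, so treat it as a sketch of the idea, not a reproduction of the plot.

```r
data(faithful)

# Compute the histogram without drawing it
h <- hist(faithful$waiting, breaks = seq(40, 100, by = 4), plot = FALSE)

bin_width <- diff(h$breaks)                 # 4 minutes for every bin
relative_freq <- h$counts / sum(h$counts)   # share of observations per bin

# Density is relative frequency divided by bin width,
# so the bar areas (height x width) add up to 1
total_area <- sum(h$density * bin_width)
print(total_area)
```

Because the areas sum to 1, a density histogram can be compared directly with a density curve, which a count histogram cannot.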
Smoothed Histogram
A smoothed histogram, also known as a kernel density estimate, can sometimes be the best approximation of the underlying density of the data.
# after_stat(density) and linewidth replace the deprecated ..density.. and size (ggplot2 3.4.0 or later)
ggplot(faithful, aes(x = waiting, y = after_stat(density))) + geom_histogram(alpha = 0.3, fill = "blue") + geom_density(linewidth = 1.5, color = "red") + ggtitle("Waiting Time(min) to Next Eruption")
The value “alpha = 0.3” tells ggplot to make the histogram mostly transparent so that the smoothed curve stays visible, and “linewidth = 1.5” sets the thickness of the smoothed curve. In ggplot2 versions before 3.4.0, the same setting is written as “size = 1.5”.
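The smoothed curve can also be computed directly with base R's density(), which makes it possible to read off where the peaks are. This is a minimal sketch using the default Gaussian kernel and default bandwidth. The peak-finding step is a simple slope-sign check of my own, not a ggplot2 feature.

```r
data(faithful)

# Kernel density estimate with R's default Gaussian kernel and bandwidth
d <- density(faithful$waiting)

# A local maximum is where the slope changes from positive to negative
peak_idx <- which(diff(sign(diff(d$y))) == -2) + 1
peaks <- d$x[peak_idx]
print(round(peaks, 1))

# The curve should integrate to roughly 1 (simple Riemann sum)
area <- sum(d$y) * (d$x[2] - d$x[1])
print(area)
```

The number and position of the peaks depend on the bandwidth, which plays the same role for the smoothed curve that the bin width plays for the histogram.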
Scatter Plot
Scatter plots visualize relationships between two numeric variables.
plot(faithful$waiting, faithful$eruptions, pch = 17, col = 2, cex = 1.2, xlab = "waiting time(min)", ylab = "eruption time(min)")
From the plot, we can see two clear clusters. The first cluster pairs shorter eruptions with shorter waiting times, and the second pairs longer eruptions with longer waiting times.
This is a sensible observation. Because the waiting variable records the time to the next eruption, a plausible reading is that a longer eruption is followed by a longer wait, which fits the geological intuition that the geyser needs more time to refill and build up pressure again after a larger release.
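The strength of the relationship in the scatter plot can also be put into numbers. The sketch below computes the Pearson correlation and compares the average waiting time for short and long eruptions. The 3-minute cutoff is again an assumption based on the gap visible in the plots.

```r
data(faithful)

# Pearson correlation between waiting time and eruption duration
r <- cor(faithful$waiting, faithful$eruptions)
print(round(r, 2))

# Average waiting time for short and long eruptions,
# using an assumed cutoff of 3 minutes
after_short <- mean(faithful$waiting[faithful$eruptions < 3])
after_long  <- mean(faithful$waiting[faithful$eruptions >= 3])
print(c(after_short = after_short, after_long = after_long))
```

A correlation close to 1 confirms the strong positive association, though with two well-separated clusters much of that value comes from the difference between the groups.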
Conclusion
So far we have created graphs that visualize one-dimensional and two-dimensional numeric data. One-dimensional data was visualized with the strip plot and the histograms, while two-dimensional data was visualized with the scatter plot.
In the next part, we will see how to visualize relationships between variables.