Creating Effective Visualizations with ggplot2: A Comprehensive Guide, Study Guides, Projects, Research of Statistics

A step-by-step guide on constructing ggplots using R's ggplot2 library. It covers various aspects such as creating a scatterplot, adjusting axis limits, changing colors, and customizing themes. The document also introduces other types of plots like bubble charts and ordered bar charts.

Typology: Study Guides, Projects, Research

2021/2022

Uploaded on 09/27/2022

blueeyes_11
Data Visualization with ggplot2
1. Understanding the ggplot syntax
The syntax for constructing ggplots can be puzzling if you are a beginner or work primarily
with base graphics. The main difference is that, unlike base graphics, ggplot works with
dataframes and not individual vectors. All the data needed to make the plot is typically
contained within the dataframe supplied to ggplot() itself, or can be supplied to the respective
geoms. More on that later.
The second noticeable feature is that you can keep enhancing the plot by adding more layers (and
themes) to an existing plot created using the ggplot() function.
Let's initialize a basic ggplot based on the midwest dataset.
# Setup
options(scipen=999) # turn off scientific notation like 1e+06
library(ggplot2)
data("midwest", package = "ggplot2") # load the data
# midwest <- read.csv("http://goo.gl/G1K41K") # alt source
# Init Ggplot
ggplot(midwest, aes(x=area, y=poptotal))  # area and poptotal are columns in 'midwest'

A blank ggplot is drawn. Even though x and y are specified, there are no points or lines in it. This is because ggplot doesn't assume that you meant a scatterplot or a line chart to be drawn. I have only told ggplot what dataset to use and which columns should be used for the X and Y axes. I haven't explicitly asked it to draw any points.

Also note that the aes() function is used to specify the X and Y axes. That's because any information that is part of the source dataframe has to be specified inside the aes() function.
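One way to confirm that the blank plot really contains nothing to draw is to inspect the plot object. The sketch below assumes the layers of a ggplot object can be read through its `layers` element; the variable names are illustrative only.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Only the data and the aesthetic mapping are supplied; no geom is added.
p <- ggplot(midwest, aes(x = area, y = poptotal))

# The object carries no layers yet, so nothing is drawn inside the panel.
n_layers <- length(p$layers)

# Adding a geom appends one layer to the same object.
p_points <- p + geom_point()
n_layers_after <- length(p_points$layers)
```

Typing p or p_points at the console draws the corresponding plot.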

2. How to Make a Simple Scatterplot

Let's make a scatterplot on top of the blank ggplot by adding points using a geom layer called

geom_point.

library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) + geom_point()

We got a basic scatterplot, where each point represents a county. However, it lacks some basic components such as the plot title and meaningful axis labels. Moreover, most of the points are concentrated in the bottom portion of the plot, which is not so nice. You will see how to rectify these in upcoming steps.

Like geom_point(), there are many such geom layers, which we will see in a subsequent part of this tutorial series. For now, let's just add a smoothing layer using geom_smooth(method='lm'). Since the method is set as lm (short for linear model), it draws the line of best fit.
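A minimal sketch of the smoothing layer described above, stored as g in the same way the later steps of this guide do. The explicit formula = y ~ x is an addition here; it only spells out the default straight-line fit.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Scatterplot plus a linear-model smoothing layer, stored as g for reuse.
g <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)  # set se = FALSE to hide the confidence band

# print(g), or typing g at the console, draws the plot.
```

Drawing g at this stage uses every row of midwest; no axis limits have been applied yet.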

Method 1: Deleting Points Outside the Limits

g + xlim(0, 0.1) + ylim(0, 1000000)  # deletes points
Warning: Removed 5 rows containing non-finite values (stat_smooth).
Warning: Removed 5 rows containing missing values (geom_point).

In this case, the chart was not built from scratch but rather on top of g. This is because the previous plot was stored as g, a ggplot object, which reproduces the original plot when called. Using ggplot, you can add more layers, themes and other settings on top of this plot.

Did you notice that the line of best fit became more horizontal compared to the original plot? This is because, when using xlim() and ylim(), the points outside the specified range are deleted and are not considered while drawing the line of best fit (using geom_smooth(method='lm')). This feature might come in handy when you wish to know how the line of best fit would change when some extreme values (or outliers) are removed.
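The effect on the line of best fit can be checked outside ggplot as well. The sketch below refits the linear model by hand with lm(), once on all counties and once on only the counties inside the limits; this mimics what xlim() and ylim() do before the smoother runs and is not a call into ggplot2's internals. All variable names are illustrative.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Fit on all counties: what geom_smooth(method = "lm") sees when no limits are set.
fit_all <- lm(poptotal ~ area, data = midwest)

# Fit on the counties inside the limits only, mimicking by hand what
# xlim(0, 0.1) + ylim(0, 1000000) does before the smoother runs.
inside <- subset(midwest, area >= 0 & area <= 0.1 &
                          poptotal >= 0 & poptotal <= 1000000)
fit_inside <- lm(poptotal ~ area, data = inside)

# Number of counties dropped, and the two slopes side by side.
n_dropped <- nrow(midwest) - nrow(inside)
slopes <- c(all = unname(coef(fit_all)["area"]),
            inside = unname(coef(fit_inside)["area"]))
```

Comparing the two entries of slopes shows how much the dropped counties influence the fit.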

Method 2: Zooming In

The other method is to change the X and Y axis limits by zooming in to the region of interest

without deleting the points. This is done using coord_cartesian().

Let's store this plot as g1.

library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point() +
  geom_smooth(method="lm")  # set se=FALSE to turn off confidence bands

# Zoom in without deleting the points outside the limits.
# As a result, the line of best fit is the same as the original plot.
g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000))  # zooms in
plot(g1)

Since all points were considered, the line of best fit did not change.
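To check this claim, the data behind each layer can be pulled out with layer_data() and compared before and after zooming. This is a sketch under the assumption that layer 1 is the point layer and layer 2 the smoother, as in the plot built above.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

g <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
g1 <- g + coord_cartesian(xlim = c(0, 0.1), ylim = c(0, 1000000))

# layer_data() returns the data a layer was built from; layer 1 is geom_point().
points_full   <- layer_data(g, 1)
points_zoomed <- layer_data(g1, 1)

# Zooming keeps every county in the point layer.
same_rows <- nrow(points_zoomed) == nrow(midwest)

# Layer 2 is the fitted line; it is computed from the same points in both plots.
same_fit <- isTRUE(all.equal(layer_data(g, 2)$y, layer_data(g1, 2)$y))
```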

4. How to Change the Title and Axis Labels

I have stored this as g1. Let's add the plot title and the labels for the X and Y axes. This can be done in one go using the labs() function with the title, x and y arguments. Another option is to use ggtitle(), xlab() and ylab().

library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point() +
  geom_smooth(method="lm")  # set se=FALSE to turn off confidence bands
g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000))  # zooms in

# Add Title and Labels
g1 + labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

Excellent! So here is the full function call.

# Full Plot call
library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point() +
  geom_smooth(method="lm") +
  coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) +
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
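The other option mentioned in this section, ggtitle() together with xlab() and ylab(), could look roughly like this. It is a sketch that reuses the same title and label strings; the name p_alt is illustrative.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

g1 <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  coord_cartesian(xlim = c(0, 0.1), ylim = c(0, 1000000))

# Same title and axis labels as the labs() call, set one function at a time.
p_alt <- g1 +
  ggtitle("Area Vs Population", subtitle = "From midwest dataset") +
  xlab("Area") +
  ylab("Population")
```

None of these three helpers sets the caption, so a caption would still go through labs(caption = ...).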

5. How to Change the Color and Size of Points

How to Change the Color and Size To Static?

We can change the aesthetics of a geom layer by modifying the respective geoms. Let's change the color of the points and the line to a static value.

library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point(col="steelblue", size=3) +  # Set static color and size for points
  geom_smooth(method="lm", col="firebrick") +  # change the color of line
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")
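By contrast with the static values above, placing col inside aes() ties the color to a column of the dataframe. The sketch below assumes that the plot object gg used in later snippets of this guide is built this way, as in the base plot shown in section 6.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# col = state sits inside aes(), so the color follows the state column
# and a legend is added; size = 3 stays outside aes() and is static.
gg <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point(aes(col = state), size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, col = "firebrick") +
  coord_cartesian(xlim = c(0, 0.1), ylim = c(0, 1000000)) +
  labs(title = "Area Vs Population", subtitle = "From midwest dataset",
       y = "Population", x = "Area", caption = "Midwest Demographics")
```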

You can also change the color palette entirely.

gg + scale_colour_brewer(palette = "Set1")  # change color palette

6. How to Change the X Axis Texts and Ticks Location

How to Change the X and Y Axis Text and its Location?

Alright, now let's see how to change the X and Y axis text and its location. This involves two

aspects: breaks and labels.

Step 1: Set the breaks

The breaks should be on the same scale as the X axis variable. Note that I am using scale_x_continuous because the X axis variable is a continuous variable. Had it been a date variable, scale_x_date could be used. Like scale_x_continuous(), an equivalent scale_y_continuous() is available for the Y axis.

library(ggplot2)

# Base plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) +
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# Change breaks
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))
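The second aspect named above, the labels, is set through the labels argument of the same scale function, with one label per break. The sketch below uses a two-decimal format purely as an illustration; the format and the variable names are assumptions, not taken from the original guide.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

gg <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point(aes(col = state), size = 3) +
  coord_cartesian(xlim = c(0, 0.1), ylim = c(0, 1000000))

# Breaks are positions on the axis; labels are the text shown at those positions.
x_breaks <- seq(0, 0.1, 0.01)
x_labels <- sprintf("%.2f", x_breaks)  # illustrative format, one label per break

gg_labelled <- gg + scale_x_continuous(breaks = x_breaks, labels = x_labels)
```

The labels vector has to be the same length as the breaks vector.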

library(ggplot2)
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) +
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

# Reverse X Axis Scale
gg + scale_x_reverse()

How to Customize the Entire Theme in One Shot using Pre-Built Themes?

Finally, instead of changing the theme components individually (which I discuss in detail in part 2), we can change the entire theme itself using pre-built themes. The help page ?theme_bw shows all the available built-in themes.

This again is commonly done in a couple of ways:

Use theme_set() to set the theme before drawing the ggplot. Note that this setting will affect all future plots.
Draw the ggplot and then add the overall theme setting (e.g. theme_bw()).

library(ggplot2)

# Base plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
  geom_point(aes(col=state), size=3) +  # Set color to vary based on state categories.
  geom_smooth(method="lm", col="firebrick", size=2) +
  coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
  labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", caption="Midwest Demographics")

gg <- gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))

# method 1: Using theme_set()
theme_set(theme_classic())  # not run
gg

# method 2: Adding theme Layer itself.
gg + theme_bw() + labs(subtitle="BW Theme")
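Because theme_set() affects all future plots, it can be useful to keep hold of the previous theme and put it back afterwards. A small sketch, relying on theme_set() returning the theme that was active before the call; the names old_theme and restored are illustrative.

```r
library(ggplot2)

# theme_set() returns the theme that was active before the call.
old_theme <- theme_set(theme_classic())

# ... plots drawn here use theme_classic() ...

# Restore the previous theme so later plots are unaffected.
theme_set(old_theme)
restored <- identical(theme_get(), old_theme)
```

Adding a theme layer to a single plot (method 2) needs no such cleanup, since it never changes the global setting.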

Bubble plot

While a scatterplot lets you compare the relationship between 2 continuous variables, a bubble chart serves well if you want to understand the relationship within the underlying groups based on:

A categorical variable (by changing the color), and
Another continuous variable (by changing the size of points).

In simpler words, bubble charts are more suitable if you have 4-dimensional data where two of the variables are numeric (X and Y), one is categorical (color) and another is numeric (size).

The bubble chart clearly distinguishes the range of displ between the manufacturers and how the slope of the lines of best fit varies, providing a better visual comparison between the groups.

# load package and data
library(ggplot2)
data(mpg, package="ggplot2")
# mpg <- read.csv("http://goo.gl/uEeRGu")

mpg_select <- mpg[mpg$manufacturer %in% c("audi", "ford", "honda", "hyundai"), ]

# Scatterplot
theme_set(theme_bw())  # pre-set the bw theme.
g <- ggplot(mpg_select, aes(displ, cty)) +
  labs(subtitle="mpg: Displacement vs City Mileage", title="Bubble chart")

g + geom_jitter(aes(col=manufacturer, size=hwy)) +
  geom_smooth(aes(col=manufacturer), method="lm", se=FALSE)

Histogram

By default, if only one variable is supplied, geom_bar() tries to calculate the count. For it to behave like a bar chart, the stat="identity" option has to be set and both x and y values must be provided.
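The two behaviours can be compared side by side. In this sketch the counts per vehicle class are computed once by geom_bar() and once beforehand with table(); the data frame class_counts and the other variable names are illustrative.

```r
library(ggplot2)
data(mpg, package = "ggplot2")

# One variable: geom_bar() counts the rows in each class itself.
p_count <- ggplot(mpg, aes(class)) + geom_bar()

# Pre-computed heights: supply x and y and switch to stat = "identity".
class_counts <- as.data.frame(table(class = mpg$class))  # columns: class, Freq
p_identity <- ggplot(class_counts, aes(x = class, y = Freq)) +
  geom_bar(stat = "identity")

# Bar heights as computed for each plot.
heights_count    <- layer_data(p_count)$count
heights_identity <- layer_data(p_identity)$y
```

geom_col() is a shorthand for the second form: it uses the supplied y values as bar heights without needing stat = "identity".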

Histogram on a continuous variable

A histogram on a continuous variable can be made using either geom_bar() or geom_histogram(). When using geom_histogram(), you can control the number of bars using the bins option. Alternatively, you can set the range covered by each bin using binwidth. The value of binwidth is on the same scale as the continuous variable on which the histogram is built. Since geom_histogram() lets you control both the number of bins and the binwidth, it is the preferred option for creating histograms on continuous variables.

library(ggplot2)
theme_set(theme_classic())

# Histogram on a Continuous (Numeric) Variable
g <- ggplot(mpg, aes(displ)) + scale_fill_brewer(palette = "Spectral")

g + geom_histogram(aes(fill=class), binwidth = .1, col="black", size=.1) +  # change binwidth
  labs(title="Histogram with Auto Binning", subtitle="Engine Displacement across Vehicle Classes")
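The difference between bins and binwidth can be seen by building the same histogram both ways and looking at the computed bins. A sketch; the value of 5 bins is chosen only for illustration, and the variable names are hypothetical.

```r
library(ggplot2)
data(mpg, package = "ggplot2")

h <- ggplot(mpg, aes(displ))

# bins fixes how many bars are drawn; binwidth fixes how wide each bar is,
# in the units of displ.
p_bins  <- h + geom_histogram(bins = 5)
p_width <- h + geom_histogram(binwidth = 0.1)

# One row per bar, with the number of cars falling into it in the count column.
d_bins  <- layer_data(p_bins)
d_width <- layer_data(p_width)
```

Either way, every car lands in exactly one bar, so the counts in each table add up to the number of rows in mpg.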