Tidyverse, dplyr R Cheat Sheet | Cheat Sheet Advanced Computer Programming

R For Data Science Cheat Sheet

Tidyverse for Beginners

Tidyverse

The tidyverse is a collection of R packages for transforming and visualizing data. All packages of the tidyverse share an underlying philosophy and common APIs.

The core packages are:

• ggplot2 implements the grammar of graphics. You can use it to visualize your data.

• dplyr is a grammar of data manipulation. You can use it to solve the most common data manipulation challenges.

• tidyr helps you create tidy data: data where each variable is a column, each observation is a row, and each value is a cell.

• readr is a fast and friendly way to read rectangular data.

• purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.

• tibble is a modern re-imagining of the data frame.

• stringr provides a cohesive set of functions designed to make working with strings as easy as possible.

• forcats provides a suite of useful tools that solve common problems with factors.

You can install the complete tidyverse with:

> install.packages("tidyverse")

Then, load the core tidyverse and make it available in your current R session by running:

> library(tidyverse)

Note: there are many other tidyverse packages with more specialised usage. They are not loaded automatically with library(tidyverse), so you’ll need to load each one with its own call to library().

Useful Functions

> tidyverse_conflicts()    # Conflicts between tidyverse and other packages
> tidyverse_deps()         # List all tidyverse dependencies
> tidyverse_logo()         # Get tidyverse logo, using ASCII or unicode characters
> tidyverse_packages()     # List all tidyverse packages
> tidyverse_update()       # Update tidyverse packages

Loading in the data

> library(datasets)        # Load the datasets package
> library(gapminder)       # Load the gapminder package
> attach(iris)             # Attach iris data to the R search path

dplyr

Filter

filter() allows you to select a subset of rows in a data frame.

> iris %>%                              # Select iris data of species "virginica"
    filter(Species == "virginica")

> iris %>%                              # Select iris data of species "virginica"
    filter(Species == "virginica",      # and sepal length greater than 6
           Sepal.Length > 6)
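As a quick check on how filter() behaves, the sketch below counts rows before and after filtering. The variable names virginica and long_virginica are illustrative only, and the block assumes the dplyr package is installed.

```r
library(dplyr)

# Keep only the rows where Species is "virginica".
virginica <- iris %>%
  filter(Species == "virginica")

# Conditions separated by commas are combined with a logical AND.
long_virginica <- iris %>%
  filter(Species == "virginica", Sepal.Length > 6)

nrow(iris)        # 150 rows in the full data set
nrow(virginica)   # 50 rows

# The comparison is case-sensitive: "Virginica" matches no rows.
iris %>%
  filter(Species == "Virginica") %>%
  nrow()          # 0
```

Note that filter() returns a new data frame and leaves iris itself unchanged.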

Arrange

arrange() sorts the observations in a dataset in ascending or descending order based on one of its variables.

> iris %>%                              # Sort in ascending order of sepal length
    arrange(Sepal.Length)

> iris %>%                              # Sort in descending order of sepal length
    arrange(desc(Sepal.Length))

Combine multiple dplyr verbs in a row with the pipe operator %>%:

> iris %>%                              # Filter for species "virginica",
    filter(Species == "virginica") %>%  # then arrange in descending
    arrange(desc(Sepal.Length))         # order of sepal length
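To make the role of the pipe concrete, the following sketch writes the same filter-then-arrange step twice: once with %>% and once as nested function calls. The names piped and nested are illustrative, and the block assumes dplyr is installed.

```r
library(dplyr)

# Piped form: each verb receives the result of the previous one
# as its first argument, so the steps read top to bottom.
piped <- iris %>%
  filter(Species == "virginica") %>%
  arrange(desc(Sepal.Length))

# Equivalent nested form: the same calls, read from the inside out.
nested <- arrange(filter(iris, Species == "virginica"), desc(Sepal.Length))

identical(piped, nested)   # TRUE
```

The piped version tends to be easier to read once more than two verbs are chained.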

Mutate

mutate() allows you to update or create new columns of a data frame.

> iris %>%                                  # Change Sepal.Length to be in millimeters
    mutate(Sepal.Length = Sepal.Length * 10)

> iris %>%                                  # Create a new column called SLMm
    mutate(SLMm = Sepal.Length * 10)

Combine the verbs filter(), arrange(), and mutate():

> iris %>%
    filter(Species == "virginica") %>%
    mutate(SLMm = Sepal.Length * 10) %>%
    arrange(desc(SLMm))
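One point worth checking when using mutate() is that it returns a modified copy rather than changing the original data frame. The sketch below illustrates this; iris_mm is a hypothetical name chosen for the example, and dplyr is assumed to be installed.

```r
library(dplyr)

# mutate() returns a new data frame; assign it to keep the result.
iris_mm <- iris %>%
  mutate(SLMm = Sepal.Length * 10)

ncol(iris)                 # 5
ncol(iris_mm)              # 6: the original columns plus SLMm
"SLMm" %in% names(iris)    # FALSE, iris itself is untouched
```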

Summarize

summarize() allows you to turn many observations into a single data point.

> iris %>%                                      # Summarize to find the
    summarize(medianSL = median(Sepal.Length))  # median sepal length

> iris %>%                                      # Filter for virginica, then
    filter(Species == "virginica") %>%          # summarize the median
    summarize(medianSL = median(Sepal.Length))  # sepal length

You can also summarize multiple variables at once:

> iris %>%
    filter(Species == "virginica") %>%
    summarize(medianSL = median(Sepal.Length),
              maxSL = max(Sepal.Length))
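The shape of a summarize() result can be inspected directly. In this sketch, virginica_summary is an illustrative name, and n() is dplyr's row-count helper, which the text above does not use; dplyr is assumed to be installed.

```r
library(dplyr)

virginica_summary <- iris %>%
  filter(Species == "virginica") %>%
  summarize(medianSL = median(Sepal.Length),
            maxSL = max(Sepal.Length),
            n = n())          # number of rows that went into the summary

virginica_summary
dim(virginica_summary)        # 1 row, 3 columns
```

Each named argument to summarize() becomes one column of the single-row result.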

group_by() allows you to summarize within groups instead of summarizing the entire dataset:

> iris %>%                                      # Find median and max
    group_by(Species) %>%                       # sepal length of each
    summarize(medianSL = median(Sepal.Length),  # species
              maxSL = max(Sepal.Length))

> iris %>%                                      # Find median and max
    filter(Sepal.Length > 6) %>%                # petal length of each
    group_by(Species) %>%                       # species with sepal
    summarize(medianPL = median(Petal.Length),  # length > 6
              maxPL = max(Petal.Length))
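A grouped summary returns one row per group, which can be cross-checked against a manual calculation for a single group. The names species_summary, grouped, and manual are illustrative only, and dplyr is assumed to be installed.

```r
library(dplyr)

species_summary <- iris %>%
  group_by(Species) %>%
  summarize(n = n(),
            medianSL = median(Sepal.Length))

nrow(species_summary)   # 3, one row per species

# The grouped value for one species matches a manual base-R calculation.
grouped <- species_summary$medianSL[species_summary$Species == "virginica"]
manual <- median(iris$Sepal.Length[iris$Species == "virginica"])
grouped == manual       # TRUE
```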

ggplot2

Scatter plot

Scatter plots allow you to compare two variables within your data. To do this with ggplot2, you use geom_point().

> iris_small <- iris %>%
    filter(Sepal.Length > 5)

> ggplot(iris_small, aes(x = Petal.Length,    # Compare petal
                         y = Petal.Width)) +  # width and length
    geom_point()
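A ggplot2 plot is an ordinary R object, so it can be stored, extended with further layers, and drawn later. The sketch below shows this with the scatter plot; the object name p and the axis labels (which assume the iris measurements are in centimeters) are illustrative choices, and both dplyr and ggplot2 are assumed to be installed.

```r
library(dplyr)
library(ggplot2)

iris_small <- iris %>%
  filter(Sepal.Length > 5)

# Build the plot in steps and store it; nothing is drawn yet.
p <- ggplot(iris_small, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  labs(x = "Petal length (cm)", y = "Petal width (cm)")

# Printing the object draws it on the active graphics device.
print(p)
```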

Additional Aesthetics

• Color

> ggplot(iris_small, aes(x = Petal.Length,
                         y = Petal.Width,
                         color = Species)) +
    geom_point()

• Size

> ggplot(iris_small, aes(x = Petal.Length,
                         y = Petal.Width,
                         color = Species,
                         size = Sepal.Length)) +
    geom_point()
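The aesthetics above are mapped to columns inside aes(). As a contrast, the sketch below sets a fixed color outside aes(); the object names mapped and fixed and the color "steelblue" are illustrative, the full iris data is used to keep the block self-contained, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Mapped: color inside aes() varies with the Species column and adds a legend.
mapped <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()

# Set: color outside aes() is one fixed value applied to every point.
fixed <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(color = "steelblue")

print(mapped)
print(fixed)
```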

Faceting

> ggplot(iris_small, aes(x = Petal.Length,
                         y = Petal.Width)) +
    geom_point() +
    facet_wrap(~Species)

Line Plots

> by_year <- gapminder %>%
    group_by(year) %>%
    summarize(medianGdpPerCap = median(gdpPercap))

> ggplot(by_year, aes(x = year,
                      y = medianGdpPerCap)) +
    geom_line() +
    expand_limits(y = 0)

Bar Plots

> by_species <- iris %>%
    filter(Sepal.Length > 6) %>%
    group_by(Species) %>%
    summarize(medianPL = median(Petal.Length))

> ggplot(by_species, aes(x = Species,
                         y = medianPL)) +
    geom_col()

Histograms

> ggplot(iris_small, aes(x = Petal.Length)) +
    geom_histogram()

Box Plots

> ggplot(iris_small, aes(x = Species,
                         y = Sepal.Width)) +
    geom_boxplot()
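For the histogram example above, geom_histogram() falls back to 30 bins when no bin setting is given and prints a message suggesting a better value. A possible refinement, assuming a bin width of 0.5 is reasonable for these measurements, is sketched below; hist_half is an illustrative name, and dplyr and ggplot2 are assumed to be installed.

```r
library(dplyr)
library(ggplot2)

iris_small <- iris %>%
  filter(Sepal.Length > 5)

# binwidth is expressed in the units of the x variable.
hist_half <- ggplot(iris_small, aes(x = Petal.Length)) +
  geom_histogram(binwidth = 0.5)

print(hist_half)
```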
