












Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Community
Ask the community for help and clear up your study doubts
Discover the best universities in your country according to Docsity users
Free resources
Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors
A guide on how to do data management and analysis in R. It includes sections on R basics and a data restructuring example based on HIV Pediatrics dataset. The guide also introduces more data management skills in the 'Data management syntax' section. instructions on how to install R and RStudio and how to use them. It includes R code for the tasks described in the text introduction and R output after executing the R script. The guide is useful for university students and lifelong learners interested in learning R for data management and analysis.
Typology: Study notes
1 / 20
This page cannot be seen from the preview
Don't miss anything!
1 Introduction 1
1.1 Install R and RStudio........................................ 2 1.2 Create a new R Script........................................ 2 1.3 Load packages............................................ 3 1.4 Import and export dataset...................................... 4 1.5 R Basics................................................ 4 1.6 dplyr.................................................. 5 1.7 Reshape................................................ 6
2 HIV Pediatrics Project 8
2.1 Import dataset from REDCap into R................................ 8 2.2 QC: Until the changes can be made to the REDCap project itself, make updates so that the restructuring will go more smoothly downstream.......................... 12 2.3 Restructure the master REDCap file: Partition the Events and Variables that correspond to each CRF................................................ 16 2.4 The next syntax file “R User Guide 2 - Data Management” performs the following steps:... 20
The goal of this guide is to show how to do data management and analysis in R. The first section is R basics. The 2nd section is an data restructuring example based on HIV Pediatrics dataset, which corresponds to the STATA and SAS on CFAR training website. More data managements skills are introduced in the “Data management syntax”.
My data folders have this structure:
Data
SourceData
2019_09sep_
Stata
SAS
R (This file imports the data from here.)
OutputData (This file write the data to here.)
Figure 1: How to use this guide
There are three fonts in this file (Figure 1):
Text introduction - explains what we are exploring.
R script - The R code for the tasks that are described in the text introduction.
R output - The lines followed with the R scipt and start with “##” are the R output after executing the R scipt. It can be observed when you run code in R/RStudio.
Download the latest version of R from R CRAN (select for Mac, Windows, or Linux). Then install.
Download the most popular R IDE - R Studio (FREE version). Then install.
Then, open R Studio. This program will allow you to have the console, your script file (more on this later), and help panes within the same interface. It also helpfully includes some menus and buttons if you prefer to have a gentler introduction to using statistical software for the first time, particularly if you are coming to R from a program like SPSS. Even for more advanced users, RStudio can be useful in giving you a platform for using knitr, rmarkdown, and shiny.
In R Studio, click File -> New File -> R Script and you will open a blank .R file.
On the top of the R script, please input the title, author, date, and description for this file by using comments (with “#” at the beginning of every line). This will help you organize your R scripts.
_# Title: Data Management in R
# Use "/" instead of "" in the file directories
# .csv dataset (dt_csv is the assigned dataset name in R) dt_csv <- read.csv ("C:/Users/lwang7/OneDrive/Dataset.csv", head=T)
# SAS dataset dt_sas <- read_sav ("C:/Users/lwang7/OneDrive/Dataset.sav")
# STATA dataset dt_stata <- read_dta ("C:/Users/lwang7/OneDrive/stata_dataset.dta")
# Excel spreedsheet dt_xlsx <- read.xlsx ("C:/Users/lwang7/OneDrive/Data/Dataset.xlsx", sheet = "Calculator", startRow = 1, colNames = TRUE, rowNames = FALSE, detectDates = FALSE, skipEmptyRows = TRUE, skipEmptyCols = TRUE, rows = NULL, cols = NULL, check.names = FALSE, namedRegion = NULL, na.strings = "NA", fillMergedCells = FALSE)
# Export dataset into .csv file write.csv (mydata, file = paste ("C:/Users/lwang7/OneDrive/Output/Newdataset", Sys.Date (), ".csv"), na="") # This will create a CSV dataset with the current date in the file name.
# display the current path getwd () # set the destination path setwd ("C:/Users/lwang7/Documents")
# overview of the dataset describe (mydata)
# Table one variable table (mydata $ gender)
# show information of missing values table (mydata $ gender, exclude=NULL)
# crosstable of gender*outcome table (mydata $ gender, mydata $ outcome, exclude=NULL)
# Histogram hist (mydata $ weight)
# check the row number and column number of the dataset dim (dataset)
# [1] 1289 1474 # row and column, respectively
# Display the ID in pcr dataset but not in the main dataset pcr $ ID[ which (pcr $ ID %notin% main $ ID)]
# subset dataset (select rows by ID and then keep 3 columns) pcr2 <- pcr[ which (pcr $ ID %notin% main $ ID) , c ("id", "date", "pcr_value")]
dplyr is a powerful R-package of data management. It provides simple functions that correspond to the most common data manipulation tasks to help you translate your ideas into code. It is very efficient that you spend less time waiting for the computer to process the steps. dplyr uses pipe operator %>% to separate every step, which increased the flexibility of organizing multiple manipulation steps together.
_# syntax
newdata <- olddata %>% # continue arrange (ID, DATE) %>% # continue group_by (ID) # stop and execute one by one
newdata2 <- newdata1 %>% mutate (newvar = age) # stop and execute
# the two steps above canbe combined together as below newdata2<- olddata %>% # continue arrange (ID, DATE) %>% # continue group_by (ID) %>% # continue mutate (newvar = age) # stop and execute one by one
newdata <- olddata %>%
# Create a new variable mutate (ID = substr ( as.character (SubjID), 3, 6) ) %>% mutate (DATE = as.Date (DATE2, format = "%Y-%m-%d")) %>%
# replace 8 with 0 in existing variable mutate (adlq01now = ifelse (adlq01now == 8, 0, adlq01now)) %>%
# group by ID group_by (ID) %>%
# Sort the dataset by ID and DATE arrange (ID, DATE) %>%
# Create a variable for the number of observations of every subject mutate (n2= n ()) %>%
'cm_indication', 'cm_start_date', 'cm_stop_date'), idvar= c ('ptid','visitcode'))
2 HIV Pediatrics Project
The dataset was exported from REDCap. REDCap generated a .csv file (data) and a .r file (data formating code). We need to run the .r program to import and format the dataset first, and then following with data management and analysis.
The following code was copied from the REDCap downloaded .r file. I made two changes and recommend them:
CHANGE 1 - Change file pathway where you saved your downloaded REDCap dataset.
Add parameter in read.csv for importing characters in character format instead of factor format. It is easier to use character variables in data management. Some people say that you have to use factor variables in the statistical models (e.g. linear regression) but it is not necessary because lm() function can transfer character into factor automatically. When you need to use factor format, use as.factor() to transform it.
The R code from RedCap will generate a list of new factor variables. The reason is that RedCap saves charactor variables with their code number and category name, which can be found in the code book. For instance, variable “demographics_complete” has three values “0”, “1”, and “2”, corresponding with “Incomplete”, “Unverified”, and “Complete” (in variable “demographics_complete.factor”), respectively.
#Clear existing data and graphics rm (list= ls ()) graphics.off () #Load Hmisc library library (Hmisc) #Read Data
#################### CHANGE 1 ##################### # add the file path where you saved the .csv setwd ('C:/Users/lwang7/OneDrive - UW/18_CFAR_Training_Guide/Data/') getwd ()
#################### CHANGE 2 ##################### # add stringsAsFactors=FALSE in read.csv data= read.csv ('SourceData/2019_09sep_09/R/CFARBiometricsHIVPra_DATA_2019-09-24_2046.csv', stringsAsFactors=FALSE)
# data=read.csv('CFARBiometricsHIVPra_DATA_2019-09-24_2046.csv') # original from REDCap
#Setting Labels
label (data $ study_id)="Participant Enrollment ID (ends in 9)" label (data $ redcap_event_name)="Event Name" label (data $ demographics_complete)="Complete?" label (data $ ptid_studyid_validation)="Participant Enrollment ID (ends in 9)"
label (data $ randomization_rand_complete)="Complete?" label (data $ crfversion_lr)="Lab Results (LR):CRF Version" label (data $ ptid_lr)="Participant Enrollment ID (ends in 9)" label (data $ visitcode_lr)="Sample collection: Visit Code" label (data $ visitdate_lr)="Sample collection: Visit Date" label (data $ cd4cnt_lr)="CD4 Count (cells/uL)" label (data $ cd8cnt_lr)="CD8 Count (cells/uL)" label (data $ cd4to8ratio_lr)="CD4/CD8 ratio" label (data $ wbc_lr)="WBC (x10^3 cells/uL)" label (data $ rbc_lr)="RBC (x10^6 cells/uL)" label (data $ hb_lr)="Hemoglobin (g/dl)" label (data $ viralload_lr)="Viral load (copies/ml)" label (data $ frmcomplby_lr)="Form completed by: Staff code" label (data $ frmcomplbydt_lr)="Form completed by: Date" label (data $ frmdbaseby_lr)="Form entered into database by: Staff code" label (data $ frmdbasebydt_lr)="Form entered into database by: Date" label (data $ dbasecomment1_lr)="Database comments (1)" label (data $ dbasecomment2_lr)="Database comments (2)" label (data $ dbasecomment3_lr)="Database comments (3)" label (data $ lab_results_lr_complete)="Complete?" label (data $ crfversion_fu)="Followup (FU) Visit: CRF Version" label (data $ ptid_fu)="Participant Enrollment ID (ends in 9)" label (data $ visitcode_fu)="Visit Code" label (data $ visitdate_fu)="Visit Date" label (data $ sympfev_fu)="Fever" label (data $ sympfevdur_fu)="Fever: Duration (days)" label (data $ sympoth1_fu)="Other (1)" label (data $ sympoth1sp_fu)="Other (1): Specify" label (data $ sympoth1dur_fu)="Other (1) : Duration (days)" label (data $ sympoth2_fu)="Other (2)" label (data $ sympoth2sp_fu)="Other (2): Specify" label (data $ sympoth2dur_fu)="Other (2): Duration (days)" label (data $ txmed1_fu)="Medication (1)" label (data $ txmed1name_fu)="Medication (1): Name" label (data $ txmed1ind_fu)="Medication (1): Indication" label (data $ txmed1startdt_fu)="Medication (1): Start date" label (data $ txmed1stopdt_fu)="Medication (1): End date" label (data $ txmed2_fu)="Medication (2)" label (data $ txmed2name_fu)="Medication (2): Name" label (data $ txmed2ind_fu)="Medication (2): Indication" label (data $ txmed2startdt_fu)="Medication (2): Start date" label (data $ txmed2stopdt_fu)="Medication (2): End date" label (data $ weight_fu)="Current weight (kg)" label (data $ height_fu)="Current height (cm)" label (data $ temp_fu)="Body temperature (degrees C)" label (data $ heart_fu)="Heart rate (per minute)" label (data $ bpsystol_fu)="Blood pressure: Systolic (mmHg)" label (data $ bpdiast_fu)="Blood pressure: Diastolic (mmHg)" label (data $ nxtvisitdt_fu)="Next visit: Date" label (data $ frmcomplby_fu)="Form completed by: Staff code" label (data $ frmcomplbydt_fu)="Form completed by: Date" label (data $ frmdbaseby_fu)="Form entered into database by: Staff code" label (data $ frmdbasebydt_fu)="Form entered into database by: Date"
label (data $ dbasecomment1_fu)="Database comments (1)" label (data $ dbasecomment2_fu)="Database comments (2)" label (data $ dbasecomment3_fu)="Database comments (3)" label (data $ followup_fu_complete)="Complete?" #Setting Units
#Setting Factors(will create new variable for factors) data $ redcap_event_name.factor = factor (data $ redcap_event_name,levels = c ("enrollment_arm_1","followup_6mo data $ demographics_complete.factor = factor (data $ demographics_complete,levels= c ("0","1","2")) data $ check_studyid_validation.factor = factor (data $ check_studyid_validation,levels= c ("1")) data $ studyid_validation_complete.factor = factor (data $ studyid_validation_complete,levels= c ("0","1","2")) data $ crfversion_en.factor = factor (data $ crfversion_en,levels= c ("1","2")) data $ sex_en.factor = factor (data $ sex_en,levels= c ("1","2")) data $ sympfev_en.factor = factor (data $ sympfev_en,levels= c ("1","0")) data $ sympoth1_en.factor = factor (data $ sympoth1_en,levels= c ("1","0")) data $ sympoth2_en.factor = factor (data $ sympoth2_en,levels= c ("1","0")) data $ txmed1_en.factor = factor (data $ txmed1_en,levels= c ("1","0")) data $ txmed2_en.factor = factor (data $ txmed2_en,levels= c ("1","0")) data $ frmcomplby_en.factor = factor (data $ frmcomplby_en,levels= c ("1","2")) data $ frmdbaseby_en.factor = factor (data $ frmdbaseby_en,levels= c ("1","2")) data $ enrollment_en_complete.factor = factor (data $ enrollment_en_complete,levels= c ("0","1","2")) data $ crfversion_rand.factor = factor (data $ crfversion_rand,levels= c ("1","2")) data $ randarm_rand.factor = factor (data $ randarm_rand,levels= c ("0","1")) data $ frmcomplby_rand.factor = factor (data $ frmcomplby_rand,levels= c ("1","2")) data $ frmdbaseby_rand.factor = factor (data $ frmdbaseby_rand,levels= c ("1","2")) data $ randomization_rand_complete.factor = factor (data $ randomization_rand_complete,levels= c ("0","1","2")) data $ crfversion_lr.factor = factor (data $ crfversion_lr,levels= c ("1","2")) data $ frmcomplby_lr.factor = factor (data $ frmcomplby_lr,levels= c ("1","2")) data $ frmdbaseby_lr.factor = factor (data $ frmdbaseby_lr,levels= c ("1","2")) data $ lab_results_lr_complete.factor = factor (data $ lab_results_lr_complete,levels= c ("0","1","2")) data $ crfversion_fu.factor = factor (data $ crfversion_fu,levels= c ("1","2")) data $ sympfev_fu.factor = factor (data $ sympfev_fu,levels= c ("1","0")) data $ sympoth1_fu.factor = factor (data $ sympoth1_fu,levels= c ("1","0")) data $ sympoth2_fu.factor = factor (data $ sympoth2_fu,levels= c ("1","0")) data $ txmed1_fu.factor = factor (data $ txmed1_fu,levels= c ("1","0")) data $ txmed2_fu.factor = factor (data $ txmed2_fu,levels= c ("1","0")) data $ frmcomplby_fu.factor = factor (data $ frmcomplby_fu,levels= c ("1","2")) data $ frmdbaseby_fu.factor = factor (data $ frmdbaseby_fu,levels= c ("1","2")) data $ followup_fu_complete.factor = factor (data $ followup_fu_complete,levels= c ("0","1","2"))
levels (data $ redcap_event_name.factor)= c ("Enrollment","Followup: 6-month","Followup: 12-month") levels (data $ demographics_complete.factor)= c ("Incomplete","Unverified","Complete") levels (data $ check_studyid_validation.factor)= c ("Validate Participant ID") levels (data $ studyid_validation_complete.factor)= c ("Incomplete","Unverified","Complete") levels (data $ crfversion_en.factor)= c ("1.0","2.0") levels (data $ sex_en.factor)= c ("Male","Female") levels (data $ sympfev_en.factor)= c ("Yes","No") levels (data $ sympoth1_en.factor)= c ("Yes","No") levels (data $ sympoth2_en.factor)= c ("Yes","No") levels (data $ txmed1_en.factor)= c ("Yes","No") levels (data $ txmed2_en.factor)= c ("Yes","No") levels (data $ frmcomplby_en.factor)= c ("Personnel1","Personnel2") levels (data $ frmdbaseby_en.factor)= c ("Personnel1","Personnel2")
# Study_id is not friendly for review, create a character ID variable for convinence data $ study_id = as.character (data $ study_id)
# create a new dataset with checking variables ck1 = data %>% select (study_id, redcap_event_name, ptid_studyid_validation, ptid_en, ptid_rand, ptid_lr, ptid_fu, redcap_event_name) %>% mutate (check1 = ifelse ( !is.na (ptid_en) & study_id != ptid_en, 1, NA), check2 = ifelse ( !is.na (ptid_lr) & study_id != ptid_lr, 1, NA), check3 = ifelse ( !is.na (ptid_fu) & study_id != ptid_fu, 1, NA), check4 = ifelse ( !is.na (ptid_en) & !is.na (ptid_lr) & ptid_en != ptid_lr, 1, NA))
# These 4 checking variables should be all missing. if ( !all ( is.na (ck1[, c ("check1", "check2", "check3", "check4")]))){ cat ("QC PTIDs FAILED!!! check the data below") print (ck1[ which ( !all ( is.na (ck1[, c ("check1", "check2", "check3", "check4")]))),]) } else { cat ( "QC PTIDs PASSED!")}
# *Now that we've QC'd the PTIDs, we can proceed with just study_id
# display the variable with "date" or "dt" in their name var_list = colnames (data)[ which ( grepl ("date", colnames (data)) | grepl ("dt", colnames (data)))] var_list
# overview these variables describe (data[,var_list])
_# check the format --> date variables were imported as characters (chr)
# format date variables into date formats
# tables <encourage using "exclude=NULL" to display missing values> table (data $ redcap_event_name, data $ visitcode_en, exclude=NULL)
table (data $ redcap_event_name, data $ visitcode_fu, exclude=NULL)
table (data $ redcap_event_name, data $ visitcode_lr, exclude=NULL)
# check for issues ck3 = data %>% mutate (check1 = ifelse ( !is.na (visitcode_en) & !is.na (visitcode_lr) & visitcode_en != visitcode_lr, 1, NA), check2 = ifelse ( !is.na (visitcode_fu) & !is.na (visitcode_lr) & visitcode_fu != visitcode_lr, 1, NA))
if ( !all ( is.na (ck3[, c ("check1", "check2")]))){ cat ("QC VISIT CODES FAILED!!! check the data below") print (ck3[ which ( !all ( is.na (ck3[, c ("check1", "check2")]))),]) } else { cat ( "QC VISIT CODES PASSED!")}
table ( data $ redcap_event_name, data $ crfversion_rand, exclude=NULL)
Randomization0 <- data %>% # KEEP DESIRED EVENT: enrollment filter ( redcap_event_name == "enrollment_arm_1" ) %>% filter ( !is.na ( crfversion_rand) ) %>%
# KEEP VARIABLES FOR THE CRF. select ( study_id, redcap_event_name, crfversion_rand : dbasecomment3_rand, randomization_rand_complete) %>%
# create a new variable mutate (rand_indata=1)
# let's see the result table ( Randomization0 $ redcap_event_name, Randomization0 $ crfversion_rand, exclude=NULL)
# 20 subjects
table ( data $ redcap_event_name, data $ crfversion_en, exclude=NULL)
Visits_Enrollment0 <- data %>% # KEEP DESIRED EVENT: enrollment filter ( redcap_event_name == "enrollment_arm_1" ) %>% filter ( !is.na ( crfversion_en) ) %>%
# KEEP VARIABLES FOR THE CRF. select ( study_id, redcap_event_name, crfversion_en : dbasecomment3_en, enrollment_en_complete) %>%
# 20 subjects with 2*20 observations
table ( data $ redcap_event_name, data $ crfversion_lr, exclude=NULL)
Labs0 <- data %>% # KEEP DESIRED EVENT: FOLLOWUP. mutate ( keepvisit = ifelse ( redcap_event_name %in% c ( "enrollment_arm_1", "followup_6month_arm_1", "followup_12month_arm_1" ), 1, NA )) %>% filter ( keepvisit == 1) %>% select ( - keepvisit ) %>%
filter ( !is.na ( crfversion_lr) ) %>%
# KEEP VARIABLES FOR THE CRF. select ( study_id, redcap_event_name, crfversion_lr : dbasecomment3_lr, lab_results_lr_complete) %>%
# SORT arrange (study_id, visitcode_lr)
# let's see the result table ( Labs0 $ redcap_event_name, Labs0 $ crfversion_lr, exclude=NULL)
# 20 subjects with 3*20 observations & 21 variables
2.4.1 Prepare files for incorporation with either of two files: (1) Main (1 record/participant), (2) Longitudinal (1 record/visit).
Subsequent steps merge these two files together to create additional variables (e.g. 1st pre-HAART viral load), and analysis follows.
2.4.2 Append and merge to create the Main and Longitudinal files.
2.4.3 Create other analysis variables.
2.4.4 Example descriptives, figures, and analyses
Please email lwang7@uw.edu for the source code. Thanks!