Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

R User Guide 1 - Data Restructuring, Study notes of Corporate Finance

A guide on how to do data management and analysis in R. It includes sections on R basics and a data restructuring example based on HIV Pediatrics dataset. The guide also introduces more data management skills in the 'Data management syntax' section. instructions on how to install R and RStudio and how to use them. It includes R code for the tasks described in the text introduction and R output after executing the R script. The guide is useful for university students and lifelong learners interested in learning R for data management and analysis.

Typology: Study notes

2022/2023

Uploaded on 05/11/2023

heathl
heathl 🇺🇸

4.5

(11)

237 documents

1 / 20

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
R User Guide 1 - Data Restructuring
Lei Wang, CFAR Biometics Core, University of Washington
Updated: November 17, 2019
Contents
1 Introduction 1
1.1 InstallRandRStudio ........................................ 2
1.2 CreateanewRScript ........................................ 2
1.3 Loadpackages ............................................ 3
1.4 Importandexportdataset...................................... 4
1.5 RBasics................................................ 4
1.6 dplyr.................................................. 5
1.7 Reshape................................................ 6
2 HIV Pediatrics Project 8
2.1 Import dataset from REDCap into R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 QC: Until the changes can be made to the REDCap project itself, make updates so that the
restructuring will go more smoothly downstream. . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Restructure the master REDCap file: Partition the Events and Variables that correspond to
eachCRF................................................ 16
2.4 The next syntax file “R User Guide 2 - Data Management” performs the following steps: . . . 20
1 Introduction
The goal of this guide is to show how to do data management and analysis in R. The first section is R basics.
The 2nd section is an data restructuring example based on HIV Pediatrics dataset, which corresponds to
the STATA and SAS on CFAR training website. More data managements skills are introduced in the “Data
management syntax”.
My data folders have this structure:
Data
SourceData
2019_09sep_09
Stata
SAS
R (This file imports the data from here.)
OutputData (This file write the data to here.)
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14

Partial preview of the text

Download R User Guide 1 - Data Restructuring and more Study notes Corporate Finance in PDF only on Docsity!

R User Guide 1 - Data Restructuring

Lei Wang, CFAR Biometics Core, University of Washington

Updated: November 17, 2019

Contents

1 Introduction 1

1.1 Install R and RStudio........................................ 2 1.2 Create a new R Script........................................ 2 1.3 Load packages............................................ 3 1.4 Import and export dataset...................................... 4 1.5 R Basics................................................ 4 1.6 dplyr.................................................. 5 1.7 Reshape................................................ 6

2 HIV Pediatrics Project 8

2.1 Import dataset from REDCap into R................................ 8 2.2 QC: Until the changes can be made to the REDCap project itself, make updates so that the restructuring will go more smoothly downstream.......................... 12 2.3 Restructure the master REDCap file: Partition the Events and Variables that correspond to each CRF................................................ 16 2.4 The next syntax file “R User Guide 2 - Data Management” performs the following steps:... 20

1 Introduction

The goal of this guide is to show how to do data management and analysis in R. The first section is R basics. The 2nd section is an data restructuring example based on HIV Pediatrics dataset, which corresponds to the STATA and SAS on CFAR training website. More data managements skills are introduced in the “Data management syntax”.

My data folders have this structure:

Data

SourceData

2019_09sep_

Stata

SAS

R (This file imports the data from here.)

OutputData (This file write the data to here.)

Figure 1: How to use this guide

There are three fonts in this file (Figure 1):

  1. Text introduction - explains what we are exploring.

  2. R script - The R code for the tasks that are described in the text introduction.

  3. R output - The lines followed with the R scipt and start with “##” are the R output after executing the R scipt. It can be observed when you run code in R/RStudio.

1.1 Install R and RStudio

Download the latest version of R from R CRAN (select for Mac, Windows, or Linux). Then install.

Download the most popular R IDE - R Studio (FREE version). Then install.

Then, open R Studio. This program will allow you to have the console, your script file (more on this later), and help panes within the same interface. It also helpfully includes some menus and buttons if you prefer to have a gentler introduction to using statistical software for the first time, particularly if you are coming to R from a program like SPSS. Even for more advanced users, RStudio can be useful in giving you a platform for using knitr, rmarkdown, and shiny.

1.2 Create a new R Script

  1. In R Studio, click File -> New File -> R Script and you will open a blank .R file.

  2. On the top of the R script, please input the title, author, date, and description for this file by using comments (with “#” at the beginning of every line). This will help you organize your R scripts.

_# Title: Data Management in R

Description: Introduce the data management and analysis skills in R_

1.4 Import and export dataset

# Use "/" instead of "" in the file directories

# .csv dataset (dt_csv is the assigned dataset name in R) dt_csv <- read.csv ("C:/Users/lwang7/OneDrive/Dataset.csv", head=T)

# SAS dataset dt_sas <- read_sav ("C:/Users/lwang7/OneDrive/Dataset.sav")

# STATA dataset dt_stata <- read_dta ("C:/Users/lwang7/OneDrive/stata_dataset.dta")

# Excel spreedsheet dt_xlsx <- read.xlsx ("C:/Users/lwang7/OneDrive/Data/Dataset.xlsx", sheet = "Calculator", startRow = 1, colNames = TRUE, rowNames = FALSE, detectDates = FALSE, skipEmptyRows = TRUE, skipEmptyCols = TRUE, rows = NULL, cols = NULL, check.names = FALSE, namedRegion = NULL, na.strings = "NA", fillMergedCells = FALSE)

# Export dataset into .csv file write.csv (mydata, file = paste ("C:/Users/lwang7/OneDrive/Output/Newdataset", Sys.Date (), ".csv"), na="") # This will create a CSV dataset with the current date in the file name.

1.5 R Basics

# display the current path getwd () # set the destination path setwd ("C:/Users/lwang7/Documents")

# overview of the dataset describe (mydata)

# Table one variable table (mydata $ gender)

# show information of missing values table (mydata $ gender, exclude=NULL)

# crosstable of gender*outcome table (mydata $ gender, mydata $ outcome, exclude=NULL)

# Histogram hist (mydata $ weight)

# check the row number and column number of the dataset dim (dataset)

# [1] 1289 1474 # row and column, respectively

# Display the ID in pcr dataset but not in the main dataset pcr $ ID[ which (pcr $ ID %notin% main $ ID)]

# subset dataset (select rows by ID and then keep 3 columns) pcr2 <- pcr[ which (pcr $ ID %notin% main $ ID) , c ("id", "date", "pcr_value")]

1.6 dplyr

dplyr is a powerful R-package of data management. It provides simple functions that correspond to the most common data manipulation tasks to help you translate your ideas into code. It is very efficient that you spend less time waiting for the computer to process the steps. dplyr uses pipe operator %>% to separate every step, which increased the flexibility of organizing multiple manipulation steps together.

_# syntax

dplyr use "%>%" to seperate command

with "%>%", R will execute the next line until a line without %>% at the end_

newdata <- olddata %>% # continue arrange (ID, DATE) %>% # continue group_by (ID) # stop and execute one by one

newdata2 <- newdata1 %>% mutate (newvar = age) # stop and execute

# the two steps above canbe combined together as below newdata2<- olddata %>% # continue arrange (ID, DATE) %>% # continue group_by (ID) %>% # continue mutate (newvar = age) # stop and execute one by one

newdata <- olddata %>%

# Create a new variable mutate (ID = substr ( as.character (SubjID), 3, 6) ) %>% mutate (DATE = as.Date (DATE2, format = "%Y-%m-%d")) %>%

# replace 8 with 0 in existing variable mutate (adlq01now = ifelse (adlq01now == 8, 0, adlq01now)) %>%

# group by ID group_by (ID) %>%

# Sort the dataset by ID and DATE arrange (ID, DATE) %>%

# Create a variable for the number of observations of every subject mutate (n2= n ()) %>%

'cm_indication', 'cm_start_date', 'cm_stop_date'), idvar= c ('ptid','visitcode'))

2 HIV Pediatrics Project

The dataset was exported from REDCap. REDCap generated a .csv file (data) and a .r file (data formating code). We need to run the .r program to import and format the dataset first, and then following with data management and analysis.

2.1 Import dataset from REDCap into R

The following code was copied from the REDCap downloaded .r file. I made two changes and recommend them:

  1. CHANGE 1 - Change file pathway where you saved your downloaded REDCap dataset.

  2. Add parameter in read.csv for importing characters in character format instead of factor format. It is easier to use character variables in data management. Some people say that you have to use factor variables in the statistical models (e.g. linear regression) but it is not necessary because lm() function can transfer character into factor automatically. When you need to use factor format, use as.factor() to transform it.

The R code from RedCap will generate a list of new factor variables. The reason is that RedCap saves charactor variables with their code number and category name, which can be found in the code book. For instance, variable “demographics_complete” has three values “0”, “1”, and “2”, corresponding with “Incomplete”, “Unverified”, and “Complete” (in variable “demographics_complete.factor”), respectively.

#Clear existing data and graphics rm (list= ls ()) graphics.off () #Load Hmisc library library (Hmisc) #Read Data

#################### CHANGE 1 ##################### # add the file path where you saved the .csv setwd ('C:/Users/lwang7/OneDrive - UW/18_CFAR_Training_Guide/Data/') getwd ()

[1] "C:/Users/lwang7/OneDrive - UW/18_CFAR_Training_Guide/Data"

#################### CHANGE 2 ##################### # add stringsAsFactors=FALSE in read.csv data= read.csv ('SourceData/2019_09sep_09/R/CFARBiometricsHIVPra_DATA_2019-09-24_2046.csv', stringsAsFactors=FALSE)

# data=read.csv('CFARBiometricsHIVPra_DATA_2019-09-24_2046.csv') # original from REDCap

#Setting Labels

label (data $ study_id)="Participant Enrollment ID (ends in 9)" label (data $ redcap_event_name)="Event Name" label (data $ demographics_complete)="Complete?" label (data $ ptid_studyid_validation)="Participant Enrollment ID (ends in 9)"

label (data $ randomization_rand_complete)="Complete?" label (data $ crfversion_lr)="Lab Results (LR):CRF Version" label (data $ ptid_lr)="Participant Enrollment ID (ends in 9)" label (data $ visitcode_lr)="Sample collection: Visit Code" label (data $ visitdate_lr)="Sample collection: Visit Date" label (data $ cd4cnt_lr)="CD4 Count (cells/uL)" label (data $ cd8cnt_lr)="CD8 Count (cells/uL)" label (data $ cd4to8ratio_lr)="CD4/CD8 ratio" label (data $ wbc_lr)="WBC (x10^3 cells/uL)" label (data $ rbc_lr)="RBC (x10^6 cells/uL)" label (data $ hb_lr)="Hemoglobin (g/dl)" label (data $ viralload_lr)="Viral load (copies/ml)" label (data $ frmcomplby_lr)="Form completed by: Staff code" label (data $ frmcomplbydt_lr)="Form completed by: Date" label (data $ frmdbaseby_lr)="Form entered into database by: Staff code" label (data $ frmdbasebydt_lr)="Form entered into database by: Date" label (data $ dbasecomment1_lr)="Database comments (1)" label (data $ dbasecomment2_lr)="Database comments (2)" label (data $ dbasecomment3_lr)="Database comments (3)" label (data $ lab_results_lr_complete)="Complete?" label (data $ crfversion_fu)="Followup (FU) Visit: CRF Version" label (data $ ptid_fu)="Participant Enrollment ID (ends in 9)" label (data $ visitcode_fu)="Visit Code" label (data $ visitdate_fu)="Visit Date" label (data $ sympfev_fu)="Fever" label (data $ sympfevdur_fu)="Fever: Duration (days)" label (data $ sympoth1_fu)="Other (1)" label (data $ sympoth1sp_fu)="Other (1): Specify" label (data $ sympoth1dur_fu)="Other (1) : Duration (days)" label (data $ sympoth2_fu)="Other (2)" label (data $ sympoth2sp_fu)="Other (2): Specify" label (data $ sympoth2dur_fu)="Other (2): Duration (days)" label (data $ txmed1_fu)="Medication (1)" label (data $ txmed1name_fu)="Medication (1): Name" label (data $ txmed1ind_fu)="Medication (1): Indication" label (data $ txmed1startdt_fu)="Medication (1): Start date" label (data $ txmed1stopdt_fu)="Medication (1): End date" label (data $ txmed2_fu)="Medication (2)" label (data $ txmed2name_fu)="Medication (2): Name" label (data $ txmed2ind_fu)="Medication (2): Indication" label (data $ txmed2startdt_fu)="Medication (2): Start date" label (data $ txmed2stopdt_fu)="Medication (2): End date" label (data $ weight_fu)="Current weight (kg)" label (data $ height_fu)="Current height (cm)" label (data $ temp_fu)="Body temperature (degrees C)" label (data $ heart_fu)="Heart rate (per minute)" label (data $ bpsystol_fu)="Blood pressure: Systolic (mmHg)" label (data $ bpdiast_fu)="Blood pressure: Diastolic (mmHg)" label (data $ nxtvisitdt_fu)="Next visit: Date" label (data $ frmcomplby_fu)="Form completed by: Staff code" label (data $ frmcomplbydt_fu)="Form completed by: Date" label (data $ frmdbaseby_fu)="Form entered into database by: Staff code" label (data $ frmdbasebydt_fu)="Form entered into database by: Date"

label (data $ dbasecomment1_fu)="Database comments (1)" label (data $ dbasecomment2_fu)="Database comments (2)" label (data $ dbasecomment3_fu)="Database comments (3)" label (data $ followup_fu_complete)="Complete?" #Setting Units

#Setting Factors(will create new variable for factors) data $ redcap_event_name.factor = factor (data $ redcap_event_name,levels = c ("enrollment_arm_1","followup_6mo data $ demographics_complete.factor = factor (data $ demographics_complete,levels= c ("0","1","2")) data $ check_studyid_validation.factor = factor (data $ check_studyid_validation,levels= c ("1")) data $ studyid_validation_complete.factor = factor (data $ studyid_validation_complete,levels= c ("0","1","2")) data $ crfversion_en.factor = factor (data $ crfversion_en,levels= c ("1","2")) data $ sex_en.factor = factor (data $ sex_en,levels= c ("1","2")) data $ sympfev_en.factor = factor (data $ sympfev_en,levels= c ("1","0")) data $ sympoth1_en.factor = factor (data $ sympoth1_en,levels= c ("1","0")) data $ sympoth2_en.factor = factor (data $ sympoth2_en,levels= c ("1","0")) data $ txmed1_en.factor = factor (data $ txmed1_en,levels= c ("1","0")) data $ txmed2_en.factor = factor (data $ txmed2_en,levels= c ("1","0")) data $ frmcomplby_en.factor = factor (data $ frmcomplby_en,levels= c ("1","2")) data $ frmdbaseby_en.factor = factor (data $ frmdbaseby_en,levels= c ("1","2")) data $ enrollment_en_complete.factor = factor (data $ enrollment_en_complete,levels= c ("0","1","2")) data $ crfversion_rand.factor = factor (data $ crfversion_rand,levels= c ("1","2")) data $ randarm_rand.factor = factor (data $ randarm_rand,levels= c ("0","1")) data $ frmcomplby_rand.factor = factor (data $ frmcomplby_rand,levels= c ("1","2")) data $ frmdbaseby_rand.factor = factor (data $ frmdbaseby_rand,levels= c ("1","2")) data $ randomization_rand_complete.factor = factor (data $ randomization_rand_complete,levels= c ("0","1","2")) data $ crfversion_lr.factor = factor (data $ crfversion_lr,levels= c ("1","2")) data $ frmcomplby_lr.factor = factor (data $ frmcomplby_lr,levels= c ("1","2")) data $ frmdbaseby_lr.factor = factor (data $ frmdbaseby_lr,levels= c ("1","2")) data $ lab_results_lr_complete.factor = factor (data $ lab_results_lr_complete,levels= c ("0","1","2")) data $ crfversion_fu.factor = factor (data $ crfversion_fu,levels= c ("1","2")) data $ sympfev_fu.factor = factor (data $ sympfev_fu,levels= c ("1","0")) data $ sympoth1_fu.factor = factor (data $ sympoth1_fu,levels= c ("1","0")) data $ sympoth2_fu.factor = factor (data $ sympoth2_fu,levels= c ("1","0")) data $ txmed1_fu.factor = factor (data $ txmed1_fu,levels= c ("1","0")) data $ txmed2_fu.factor = factor (data $ txmed2_fu,levels= c ("1","0")) data $ frmcomplby_fu.factor = factor (data $ frmcomplby_fu,levels= c ("1","2")) data $ frmdbaseby_fu.factor = factor (data $ frmdbaseby_fu,levels= c ("1","2")) data $ followup_fu_complete.factor = factor (data $ followup_fu_complete,levels= c ("0","1","2"))

levels (data $ redcap_event_name.factor)= c ("Enrollment","Followup: 6-month","Followup: 12-month") levels (data $ demographics_complete.factor)= c ("Incomplete","Unverified","Complete") levels (data $ check_studyid_validation.factor)= c ("Validate Participant ID") levels (data $ studyid_validation_complete.factor)= c ("Incomplete","Unverified","Complete") levels (data $ crfversion_en.factor)= c ("1.0","2.0") levels (data $ sex_en.factor)= c ("Male","Female") levels (data $ sympfev_en.factor)= c ("Yes","No") levels (data $ sympoth1_en.factor)= c ("Yes","No") levels (data $ sympoth2_en.factor)= c ("Yes","No") levels (data $ txmed1_en.factor)= c ("Yes","No") levels (data $ txmed2_en.factor)= c ("Yes","No") levels (data $ frmcomplby_en.factor)= c ("Personnel1","Personnel2") levels (data $ frmdbaseby_en.factor)= c ("Personnel1","Personnel2")

2 9.9e+09 followup_12month_arm_1 NA NA

3 9.9e+09 followup_6month_arm_1 NA NA

4 9.9e+09 enrollment_arm_1 NA 9.9e+

5 9.9e+09 followup_12month_arm_1 NA NA

ptid_rand ptid_lr ptid_fu redcap_event_name.

1 9.9e+09 9.9e+09 NA enrollment_arm_

2 NA 9.9e+09 9.9e+09 followup_12month_arm_

3 NA 9.9e+09 9.9e+09 followup_6month_arm_

4 9.9e+09 9.9e+09 NA enrollment_arm_

5 NA 9.9e+09 9.9e+09 followup_12month_arm_

# Study_id is not friendly for review, create a character ID variable for convinence data $ study_id = as.character (data $ study_id)

# create a new dataset with checking variables ck1 = data %>% select (study_id, redcap_event_name, ptid_studyid_validation, ptid_en, ptid_rand, ptid_lr, ptid_fu, redcap_event_name) %>% mutate (check1 = ifelse ( !is.na (ptid_en) & study_id != ptid_en, 1, NA), check2 = ifelse ( !is.na (ptid_lr) & study_id != ptid_lr, 1, NA), check3 = ifelse ( !is.na (ptid_fu) & study_id != ptid_fu, 1, NA), check4 = ifelse ( !is.na (ptid_en) & !is.na (ptid_lr) & ptid_en != ptid_lr, 1, NA))

# These 4 checking variables should be all missing. if ( !all ( is.na (ck1[, c ("check1", "check2", "check3", "check4")]))){ cat ("QC PTIDs FAILED!!! check the data below") print (ck1[ which ( !all ( is.na (ck1[, c ("check1", "check2", "check3", "check4")]))),]) } else { cat ( "QC PTIDs PASSED!")}

QC PTIDs PASSED!

# *Now that we've QC'd the PTIDs, we can proceed with just study_id

2.2.2 QC DATES

# display the variable with "date" or "dt" in their name var_list = colnames (data)[ which ( grepl ("date", colnames (data)) | grepl ("dt", colnames (data)))] var_list

[1] "visitdate_en" "txmed1startdt_en" "txmed1stopdt_en"

[4] "txmed2startdt_en" "txmed2stopdt_en" "nxtvisitdt_en"

[7] "frmcomplbydt_en" "frmdbasebydt_en" "randdate_rand"

[10] "frmcomplbydt_rand" "frmdbasebydt_rand" "visitdate_lr"

[13] "frmcomplbydt_lr" "frmdbasebydt_lr" "visitdate_fu"

[16] "txmed1startdt_fu" "txmed1stopdt_fu" "txmed2startdt_fu"

[19] "txmed2stopdt_fu" "nxtvisitdt_fu" "frmcomplbydt_fu"

[22] "frmdbasebydt_fu"

# overview these variables describe (data[,var_list])

data[, var_list]

22 Variables 60 Observations

---------------------------------------------------------------------------

visitdate_en : Visit Date

n missing distinct value

20 40 1 2018-01-

Value 2018-01-

Frequency 20

Proportion 1

---------------------------------------------------------------------------

randdate_rand : Randomization: Date

n missing distinct value

20 40 1 2018-01-

Value 2018-01-

Frequency 20

Proportion 1

---------------------------------------------------------------------------

visitdate_lr : Sample collection: Visit Date

n missing distinct

60 0 3

Value 2018-01-01 2018-07-01 2019-01-

Frequency 20 20 20

Proportion 0.333 0.333 0.

---------------------------------------------------------------------------

visitdate_fu : Visit Date

n missing distinct

40 20 2

Value 2018-07-01 2019-01-

Frequency 20 20

Proportion 0.5 0.

---------------------------------------------------------------------------

Variables with all observations missing:

[1] txmed1startdt_en txmed1stopdt_en txmed2startdt_en

[4] txmed2stopdt_en nxtvisitdt_en frmcomplbydt_en

[7] frmdbasebydt_en frmcomplbydt_rand frmdbasebydt_rand

[10] frmcomplbydt_lr frmdbasebydt_lr txmed1startdt_fu

[13] txmed1stopdt_fu txmed2startdt_fu txmed2stopdt_fu

[16] nxtvisitdt_fu frmcomplbydt_fu frmdbasebydt_fu

_# check the format --> date variables were imported as characters (chr)

str(data[,var_list])_

# format date variables into date formats

Proportion 0.5 0.

---------------------------------------------------------------------------

# tables <encourage using "exclude=NULL" to display missing values> table (data $ redcap_event_name, data $ visitcode_en, exclude=NULL)

## 0

enrollment_arm_1 20 0

followup_12month_arm_1 0 20

followup_6month_arm_1 0 20

table (data $ redcap_event_name, data $ visitcode_fu, exclude=NULL)

## 6 12

enrollment_arm_1 0 0 20

followup_12month_arm_1 0 20 0

followup_6month_arm_1 20 0 0

table (data $ redcap_event_name, data $ visitcode_lr, exclude=NULL)

enrollment_arm_1 20 0 0

followup_12month_arm_1 0 0 20

followup_6month_arm_1 0 20 0

# check for issues ck3 = data %>% mutate (check1 = ifelse ( !is.na (visitcode_en) & !is.na (visitcode_lr) & visitcode_en != visitcode_lr, 1, NA), check2 = ifelse ( !is.na (visitcode_fu) & !is.na (visitcode_lr) & visitcode_fu != visitcode_lr, 1, NA))

if ( !all ( is.na (ck3[, c ("check1", "check2")]))){ cat ("QC VISIT CODES FAILED!!! check the data below") print (ck3[ which ( !all ( is.na (ck3[, c ("check1", "check2")]))),]) } else { cat ( "QC VISIT CODES PASSED!")}

## QC VISIT CODES PASSED!

2.3 Restructure the master REDCap file: Partition the Events and Variables

that correspond to each CRF.

2.3.1 RANDOMIZATION: EXTRACT.

table ( data $ redcap_event_name, data $ crfversion_rand, exclude=NULL)

## 1

enrollment_arm_1 20 0

followup_12month_arm_1 0 20

followup_6month_arm_1 0 20

Randomization0 <- data %>% # KEEP DESIRED EVENT: enrollment filter ( redcap_event_name == "enrollment_arm_1" ) %>% filter ( !is.na ( crfversion_rand) ) %>%

# KEEP VARIABLES FOR THE CRF. select ( study_id, redcap_event_name, crfversion_rand : dbasecomment3_rand, randomization_rand_complete) %>%

# create a new variable mutate (rand_indata=1)

# let's see the result table ( Randomization0 $ redcap_event_name, Randomization0 $ crfversion_rand, exclude=NULL)

enrollment_arm_1 20

# 20 subjects

2.3.2 VISITS_ENROLLMENT: EXTRACT.

table ( data $ redcap_event_name, data $ crfversion_en, exclude=NULL)

## 1

enrollment_arm_1 20 0

followup_12month_arm_1 0 20

followup_6month_arm_1 0 20

Visits_Enrollment0 <- data %>% # KEEP DESIRED EVENT: enrollment filter ( redcap_event_name == "enrollment_arm_1" ) %>% filter ( !is.na ( crfversion_en) ) %>%

# KEEP VARIABLES FOR THE CRF. select ( study_id, redcap_event_name, crfversion_en : dbasecomment3_en, enrollment_en_complete) %>%

# 20 subjects with 2*20 observations

2.3.4 LABS (ENROLLMENT+FOLLOWUP): EXTRACT.

table ( data $ redcap_event_name, data $ crfversion_lr, exclude=NULL)

enrollment_arm_1 20

followup_12month_arm_1 20

followup_6month_arm_1 20

Labs0 <- data %>% # KEEP DESIRED EVENT: FOLLOWUP. mutate ( keepvisit = ifelse ( redcap_event_name %in% c ( "enrollment_arm_1", "followup_6month_arm_1", "followup_12month_arm_1" ), 1, NA )) %>% filter ( keepvisit == 1) %>% select ( - keepvisit ) %>%

filter ( !is.na ( crfversion_lr) ) %>%

# KEEP VARIABLES FOR THE CRF. select ( study_id, redcap_event_name, crfversion_lr : dbasecomment3_lr, lab_results_lr_complete) %>%

# SORT arrange (study_id, visitcode_lr)

# let's see the result table ( Labs0 $ redcap_event_name, Labs0 $ crfversion_lr, exclude=NULL)

enrollment_arm_1 20

followup_12month_arm_1 20

followup_6month_arm_1 20

# 20 subjects with 3*20 observations & 21 variables

2.4 The next syntax file “R User Guide 2 - Data Management” performs the

following steps:

2.4.1 Prepare files for incorporation with either of two files: (1) Main (1 record/participant), (2) Longitudinal (1 record/visit).

Subsequent steps merge these two files together to create additional variables (e.g. 1st pre-HAART viral load), and analysis follows.

2.4.2 Append and merge to create the Main and Longitudinal files.

2.4.3 Create other analysis variables.

2.4.4 Example descriptives, figures, and analyses

Please email lwang7@uw.edu for the source code. Thanks!