Prepare for your exams
Get points
Guidelines and tips

Prepare for your exams

Study with the several resources on Docsity

Earn points to download

Earn points by helping other students or get them with a premium plan

Guidelines and tips

Sell on Docsity

Log in Sign up

Prepare for your exams

Study with the several resources on Docsity

Find documents

Prepare for your exams with the study notes shared by other students like you on Docsity

Search Store documents

The best documents sold by students who completed their studies

Search through all study resources

Docsity AINEW

Summarize your documents, ask them questions, convert them into quizzes and concept maps

Explore questions

Clear up your doubts by reading the answers to questions asked by your fellow students

Earn points to download

Earn points by helping other students or get them with a premium plan

Share documents

20 Points

For each uploaded document

Answer questions

5 Points

For each given answer (max 1 per day)

All the ways to get free points

Get points immediately

Choose a premium plan with all the points you need

Study Opportunities

Choose your next study program

Get in touch with the best universities in the world. Search through thousands of universities and official partners

Community

Ask the community

Ask the community for help and clear up your study doubts

University Rankings

Discover the best universities in your country according to Docsity users

Free resources

Our save-the-student-ebooks!

Download our free guides on studying techniques, anxiety management strategies, and thesis advice from Docsity tutors

From our blog

Exams and Study

Go to the blog

R User Guide 1 - Data Restructuring, Study notes of Corporate Finance

University of Washington (UW) - Tacoma Corporate Finance

A guide on how to do data management and analysis in R. It includes sections on R basics and a data restructuring example based on HIV Pediatrics dataset. The guide also introduces more data management skills in the 'Data management syntax' section. instructions on how to install R and RStudio and how to use them. It includes R code for the tasks described in the text introduction and R output after executing the R script. The guide is useful for university students and lifelong learners interested in learning R for data management and analysis.

Typology: Study notes

2022/2023

Uploaded on 05/11/2023

heathl 🇺🇸

4.5

(11)

237 documents

1 / 20

This page cannot be seen from the preview

Don't miss anything!

R User Guide 1 - Data Restructuring

Lei Wang, CFAR Biometics Core, University of Washington

Updated: November 17, 2019

Contents

1 Introduction 1

1.1 InstallRandRStudio ........................................ 2

1.2 CreateanewRScript ........................................ 2

1.3 Loadpackages ............................................ 3

1.4 Importandexportdataset...................................... 4

1.5 RBasics................................................ 4

1.6 dplyr.................................................. 5

1.7 Reshape................................................ 6

2 HIV Pediatrics Project 8

2.1 Import dataset from REDCap into R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2 QC: Until the changes can be made to the REDCap project itself, make updates so that the

restructuring will go more smoothly downstream. . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3 Restructure the master REDCap file: Partition the Events and Variables that correspond to

eachCRF................................................ 16

2.4 The next syntax file “R User Guide 2 - Data Management” performs the following steps: . . . 20

1 Introduction

The goal of this guide is to show how to do data management and analysis in R. The first section is R basics.

The 2nd section is an data restructuring example based on HIV Pediatrics dataset, which corresponds to

the STATA and SAS on CFAR training website. More data managements skills are introduced in the “Data

management syntax”.

My data folders have this structure:

Data

SourceData

2019_09sep_09

Stata

SAS

R (This file imports the data from here.)

OutputData (This file write the data to here.)

1

Partial preview of the text

Download R User Guide 1 - Data Restructuring and more Study notes Corporate Finance in PDF only on Docsity!

R User Guide 1 - Data Restructuring

Lei Wang, CFAR Biometics Core, University of Washington

Updated: November 17, 2019

1.1 Install R and RStudio........................................ 2 1.2 Create a new R Script........................................ 2 1.3 Load packages............................................ 3 1.4 Import and export dataset...................................... 4 1.5 R Basics................................................ 4 1.6 dplyr.................................................. 5 1.7 Reshape................................................ 6

2 HIV Pediatrics Project 8

2.1 Import dataset from REDCap into R................................ 8 2.2 QC: Until the changes can be made to the REDCap project itself, make updates so that the restructuring will go more smoothly downstream.......................... 12 2.3 Restructure the master REDCap file: Partition the Events and Variables that correspond to each CRF................................................ 16 2.4 The next syntax file “R User Guide 2 - Data Management” performs the following steps:... 20

1 Introduction

The goal of this guide is to show how to do data management and analysis in R. The first section is R basics. The 2nd section is an data restructuring example based on HIV Pediatrics dataset, which corresponds to the STATA and SAS on CFAR training website. More data managements skills are introduced in the “Data management syntax”.

My data folders have this structure:

Data

SourceData

2019_09sep_

Stata

SAS

R (This file imports the data from here.)

OutputData (This file write the data to here.)

Figure 1: How to use this guide

There are three fonts in this file (Figure 1):

Text introduction - explains what we are exploring.
R script - The R code for the tasks that are described in the text introduction.
R output - The lines followed with the R scipt and start with “##” are the R output after executing the R scipt. It can be observed when you run code in R/RStudio.

1.1 Install R and RStudio

Download the latest version of R from R CRAN (select for Mac, Windows, or Linux). Then install.

Download the most popular R IDE - R Studio (FREE version). Then install.

Then, open R Studio. This program will allow you to have the console, your script file (more on this later), and help panes within the same interface. It also helpfully includes some menus and buttons if you prefer to have a gentler introduction to using statistical software for the first time, particularly if you are coming to R from a program like SPSS. Even for more advanced users, RStudio can be useful in giving you a platform for using knitr, rmarkdown, and shiny.

1.2 Create a new R Script

In R Studio, click File -> New File -> R Script and you will open a blank .R file.
On the top of the R script, please input the title, author, date, and description for this file by using comments (with “#” at the beginning of every line). This will help you organize your R scripts.

_# Title: Data Management in R

Description: Introduce the data management and analysis skills in R_

1.4 Import and export dataset

# Use "/" instead of "" in the file directories

# .csv dataset (dt_csv is the assigned dataset name in R) dt_csv <- read.csv ("C:/Users/lwang7/OneDrive/Dataset.csv", head=T)

# SAS dataset dt_sas <- read_sav ("C:/Users/lwang7/OneDrive/Dataset.sav")

# STATA dataset dt_stata <- read_dta ("C:/Users/lwang7/OneDrive/stata_dataset.dta")

# Excel spreedsheet dt_xlsx <- read.xlsx ("C:/Users/lwang7/OneDrive/Data/Dataset.xlsx", sheet = "Calculator", startRow = 1, colNames = TRUE, rowNames = FALSE, detectDates = FALSE, skipEmptyRows = TRUE, skipEmptyCols = TRUE, rows = NULL, cols = NULL, check.names = FALSE, namedRegion = NULL, na.strings = "NA", fillMergedCells = FALSE)

# Export dataset into .csv file write.csv (mydata, file = paste ("C:/Users/lwang7/OneDrive/Output/Newdataset", Sys.Date (), ".csv"), na="") # This will create a CSV dataset with the current date in the file name.

1.5 R Basics

# display the current path getwd () # set the destination path setwd ("C:/Users/lwang7/Documents")

# overview of the dataset describe (mydata)

# Table one variable table (mydata $ gender)

# show information of missing values table (mydata $ gender, exclude=NULL)

# crosstable of gender*outcome table (mydata $ gender, mydata $ outcome, exclude=NULL)

# Histogram hist (mydata $ weight)

# check the row number and column number of the dataset dim (dataset)

# [1] 1289 1474 # row and column, respectively

# Display the ID in pcr dataset but not in the main dataset pcr $ ID[ which (pcr $ ID %notin% main $ ID)]

# subset dataset (select rows by ID and then keep 3 columns) pcr2 <- pcr[ which (pcr $ ID %notin% main $ ID) , c ("id", "date", "pcr_value")]

1.6 dplyr

dplyr is a powerful R-package of data management. It provides simple functions that correspond to the most common data manipulation tasks to help you translate your ideas into code. It is very efficient that you spend less time waiting for the computer to process the steps. dplyr uses pipe operator %>% to separate every step, which increased the flexibility of organizing multiple manipulation steps together.

_# syntax

dplyr use "%>%" to seperate command

with "%>%", R will execute the next line until a line without %>% at the end_

newdata <- olddata %>% # continue arrange (ID, DATE) %>% # continue group_by (ID) # stop and execute one by one

newdata2 <- newdata1 %>% mutate (newvar = age) # stop and execute

# the two steps above canbe combined together as below newdata2<- olddata %>% # continue arrange (ID, DATE) %>% # continue group_by (ID) %>% # continue mutate (newvar = age) # stop and execute one by one

newdata <- olddata %>%

# Create a new variable mutate (ID = substr ( as.character (SubjID), 3, 6) ) %>% mutate (DATE = as.Date (DATE2, format = "%Y-%m-%d")) %>%

# replace 8 with 0 in existing variable mutate (adlq01now = ifelse (adlq01now == 8, 0, adlq01now)) %>%

# group by ID group_by (ID) %>%

# Sort the dataset by ID and DATE arrange (ID, DATE) %>%

# Create a variable for the number of observations of every subject mutate (n2= n ()) %>%

'cm_indication', 'cm_start_date', 'cm_stop_date'), idvar= c ('ptid','visitcode'))

2 HIV Pediatrics Project

The dataset was exported from REDCap. REDCap generated a .csv file (data) and a .r file (data formating code). We need to run the .r program to import and format the dataset first, and then following with data management and analysis.

2.1 Import dataset from REDCap into R

The following code was copied from the REDCap downloaded .r file. I made two changes and recommend them:

CHANGE 1 - Change file pathway where you saved your downloaded REDCap dataset.
Add parameter in read.csv for importing characters in character format instead of factor format. It is easier to use character variables in data management. Some people say that you have to use factor variables in the statistical models (e.g. linear regression) but it is not necessary because lm() function can transfer character into factor automatically. When you need to use factor format, use as.factor() to transform it.

The R code from RedCap will generate a list of new factor variables. The reason is that RedCap saves charactor variables with their code number and category name, which can be found in the code book. For instance, variable “demographics_complete” has three values “0”, “1”, and “2”, corresponding with “Incomplete”, “Unverified”, and “Complete” (in variable “demographics_complete.factor”), respectively.

#Clear existing data and graphics rm (list= ls ()) graphics.off () #Load Hmisc library library (Hmisc) #Read Data

#################### CHANGE 1 ##################### # add the file path where you saved the .csv setwd ('C:/Users/lwang7/OneDrive - UW/18_CFAR_Training_Guide/Data/') getwd ()

[1] "C:/Users/lwang7/OneDrive - UW/18_CFAR_Training_Guide/Data"

#################### CHANGE 2 ##################### # add stringsAsFactors=FALSE in read.csv data= read.csv ('SourceData/2019_09sep_09/R/CFARBiometricsHIVPra_DATA_2019-09-24_2046.csv', stringsAsFactors=FALSE)

# data=read.csv('CFARBiometricsHIVPra_DATA_2019-09-24_2046.csv') # original from REDCap

#Setting Labels

label (data $ study_id)="Participant Enrollment ID (ends in 9)" label (data $ redcap_event_name)="Event Name" label (data $ demographics_complete)="Complete?" label (data $ ptid_studyid_validation)="Participant Enrollment ID (ends in 9)"

label (data $ randomization_rand_complete)="Complete?" label (data $ crfversion_lr)="Lab Results (LR):CRF Version" label (data $ ptid_lr)="Participant Enrollment ID (ends in 9)" label (data $ visitcode_lr)="Sample collection: Visit Code" label (data $ visitdate_lr)="Sample collection: Visit Date" label (data $ cd4cnt_lr)="CD4 Count (cells/uL)" label (data $ cd8cnt_lr)="CD8 Count (cells/uL)" label (data $ cd4to8ratio_lr)="CD4/CD8 ratio" label (data $ wbc_lr)="WBC (x10^3 cells/uL)" label (data $ rbc_lr)="RBC (x10^6 cells/uL)" label (data $ hb_lr)="Hemoglobin (g/dl)" label (data $ viralload_lr)="Viral load (copies/ml)" label (data $ frmcomplby_lr)="Form completed by: Staff code" label (data $ frmcomplbydt_lr)="Form completed by: Date" label (data $ frmdbaseby_lr)="Form entered into database by: Staff code" label (data $ frmdbasebydt_lr)="Form entered into database by: Date" label (data $ dbasecomment1_lr)="Database comments (1)" label (data $ dbasecomment2_lr)="Database comments (2)" label (data $ dbasecomment3_lr)="Database comments (3)" label (data $ lab_results_lr_complete)="Complete?" label (data $ crfversion_fu)="Followup (FU) Visit: CRF Version" label (data $ ptid_fu)="Participant Enrollment ID (ends in 9)" label (data $ visitcode_fu)="Visit Code" label (data $ visitdate_fu)="Visit Date" label (data $ sympfev_fu)="Fever" label (data $ sympfevdur_fu)="Fever: Duration (days)" label (data $ sympoth1_fu)="Other (1)" label (data $ sympoth1sp_fu)="Other (1): Specify" label (data $ sympoth1dur_fu)="Other (1) : Duration (days)" label (data $ sympoth2_fu)="Other (2)" label (data $ sympoth2sp_fu)="Other (2): Specify" label (data $ sympoth2dur_fu)="Other (2): Duration (days)" label (data $ txmed1_fu)="Medication (1)" label (data $ txmed1name_fu)="Medication (1): Name" label (data $ txmed1ind_fu)="Medication (1): Indication" label (data $ txmed1startdt_fu)="Medication (1): Start date" label (data $ txmed1stopdt_fu)="Medication (1): End date" label (data $ txmed2_fu)="Medication (2)" label (data $ txmed2name_fu)="Medication (2): Name" label (data $ txmed2ind_fu)="Medication (2): Indication" label (data $ txmed2startdt_fu)="Medication (2): Start date" label (data $ txmed2stopdt_fu)="Medication (2): End date" label (data $ weight_fu)="Current weight (kg)" label (data $ height_fu)="Current height (cm)" label (data $ temp_fu)="Body temperature (degrees C)" label (data $ heart_fu)="Heart rate (per minute)" label (data $ bpsystol_fu)="Blood pressure: Systolic (mmHg)" label (data $ bpdiast_fu)="Blood pressure: Diastolic (mmHg)" label (data $ nxtvisitdt_fu)="Next visit: Date" label (data $ frmcomplby_fu)="Form completed by: Staff code" label (data $ frmcomplbydt_fu)="Form completed by: Date" label (data $ frmdbaseby_fu)="Form entered into database by: Staff code" label (data $ frmdbasebydt_fu)="Form entered into database by: Date"

label (data $ dbasecomment1_fu)="Database comments (1)" label (data $ dbasecomment2_fu)="Database comments (2)" label (data $ dbasecomment3_fu)="Database comments (3)" label (data $ followup_fu_complete)="Complete?" #Setting Units

#Setting Factors(will create new variable for factors) data $ redcap_event_name.factor = factor (data $ redcap_event_name,levels = c ("enrollment_arm_1","followup_6mo data $ demographics_complete.factor = factor (data $ demographics_complete,levels= c ("0","1","2")) data $ check_studyid_validation.factor = factor (data $ check_studyid_validation,levels= c ("1")) data $ studyid_validation_complete.factor = factor (data $ studyid_validation_complete,levels= c ("0","1","2")) data $ crfversion_en.factor = factor (data $ crfversion_en,levels= c ("1","2")) data $ sex_en.factor = factor (data $ sex_en,levels= c ("1","2")) data $ sympfev_en.factor = factor (data $ sympfev_en,levels= c ("1","0")) data $ sympoth1_en.factor = factor (data $ sympoth1_en,levels= c ("1","0")) data $ sympoth2_en.factor = factor (data $ sympoth2_en,levels= c ("1","0")) data $ txmed1_en.factor = factor (data $ txmed1_en,levels= c ("1","0")) data $ txmed2_en.factor = factor (data $ txmed2_en,levels= c ("1","0")) data $ frmcomplby_en.factor = factor (data $ frmcomplby_en,levels= c ("1","2")) data $ frmdbaseby_en.factor = factor (data $ frmdbaseby_en,levels= c ("1","2")) data $ enrollment_en_complete.factor = factor (data $ enrollment_en_complete,levels= c ("0","1","2")) data $ crfversion_rand.factor = factor (data $ crfversion_rand,levels= c ("1","2")) data $ randarm_rand.factor = factor (data $ randarm_rand,levels= c ("0","1")) data $ frmcomplby_rand.factor = factor (data $ frmcomplby_rand,levels= c ("1","2")) data $ frmdbaseby_rand.factor = factor (data $ frmdbaseby_rand,levels= c ("1","2")) data $ randomization_rand_complete.factor = factor (data $ randomization_rand_complete,levels= c ("0","1","2")) data $ crfversion_lr.factor = factor (data $ crfversion_lr,levels= c ("1","2")) data $ frmcomplby_lr.factor = factor (data $ frmcomplby_lr,levels= c ("1","2")) data $ frmdbaseby_lr.factor = factor (data $ frmdbaseby_lr,levels= c ("1","2")) data $ lab_results_lr_complete.factor = factor (data $ lab_results_lr_complete,levels= c ("0","1","2")) data $ crfversion_fu.factor = factor (data $ crfversion_fu,levels= c ("1","2")) data $ sympfev_fu.factor = factor (data $ sympfev_fu,levels= c ("1","0")) data $ sympoth1_fu.factor = factor (data $ sympoth1_fu,levels= c ("1","0")) data $ sympoth2_fu.factor = factor (data $ sympoth2_fu,levels= c ("1","0")) data $ txmed1_fu.factor = factor (data $ txmed1_fu,levels= c ("1","0")) data $ txmed2_fu.factor = factor (data $ txmed2_fu,levels= c ("1","0")) data $ frmcomplby_fu.factor = factor (data $ frmcomplby_fu,levels= c ("1","2")) data $ frmdbaseby_fu.factor = factor (data $ frmdbaseby_fu,levels= c ("1","2")) data $ followup_fu_complete.factor = factor (data $ followup_fu_complete,levels= c ("0","1","2"))

levels (data $ redcap_event_name.factor)= c ("Enrollment","Followup: 6-month","Followup: 12-month") levels (data $ demographics_complete.factor)= c ("Incomplete","Unverified","Complete") levels (data $ check_studyid_validation.factor)= c ("Validate Participant ID") levels (data $ studyid_validation_complete.factor)= c ("Incomplete","Unverified","Complete") levels (data $ crfversion_en.factor)= c ("1.0","2.0") levels (data $ sex_en.factor)= c ("Male","Female") levels (data $ sympfev_en.factor)= c ("Yes","No") levels (data $ sympoth1_en.factor)= c ("Yes","No") levels (data $ sympoth2_en.factor)= c ("Yes","No") levels (data $ txmed1_en.factor)= c ("Yes","No") levels (data $ txmed2_en.factor)= c ("Yes","No") levels (data $ frmcomplby_en.factor)= c ("Personnel1","Personnel2") levels (data $ frmdbaseby_en.factor)= c ("Personnel1","Personnel2")

2 9.9e+09 followup_12month_arm_1 NA NA

3 9.9e+09 followup_6month_arm_1 NA NA

4 9.9e+09 enrollment_arm_1 NA 9.9e+

5 9.9e+09 followup_12month_arm_1 NA NA

ptid_rand ptid_lr ptid_fu redcap_event_name.

1 9.9e+09 9.9e+09 NA enrollment_arm_

2 NA 9.9e+09 9.9e+09 followup_12month_arm_

3 NA 9.9e+09 9.9e+09 followup_6month_arm_

4 9.9e+09 9.9e+09 NA enrollment_arm_

5 NA 9.9e+09 9.9e+09 followup_12month_arm_

# Study_id is not friendly for review, create a character ID variable for convinence data $ study_id = as.character (data $ study_id)

# create a new dataset with checking variables ck1 = data %>% select (study_id, redcap_event_name, ptid_studyid_validation, ptid_en, ptid_rand, ptid_lr, ptid_fu, redcap_event_name) %>% mutate (check1 = ifelse ( !is.na (ptid_en) & study_id != ptid_en, 1, NA), check2 = ifelse ( !is.na (ptid_lr) & study_id != ptid_lr, 1, NA), check3 = ifelse ( !is.na (ptid_fu) & study_id != ptid_fu, 1, NA), check4 = ifelse ( !is.na (ptid_en) & !is.na (ptid_lr) & ptid_en != ptid_lr, 1, NA))

# These 4 checking variables should be all missing. if ( !all ( is.na (ck1[, c ("check1", "check2", "check3", "check4")]))){ cat ("QC PTIDs FAILED!!! check the data below") print (ck1[ which ( !all ( is.na (ck1[, c ("check1", "check2", "check3", "check4")]))),]) } else { cat ( "QC PTIDs PASSED!")}

QC PTIDs PASSED!

# *Now that we've QC'd the PTIDs, we can proceed with just study_id

2.2.2 QC DATES

# display the variable with "date" or "dt" in their name var_list = colnames (data)[ which ( grepl ("date", colnames (data)) | grepl ("dt", colnames (data)))] var_list

[1] "visitdate_en" "txmed1startdt_en" "txmed1stopdt_en"

[4] "txmed2startdt_en" "txmed2stopdt_en" "nxtvisitdt_en"

[7] "frmcomplbydt_en" "frmdbasebydt_en" "randdate_rand"

[10] "frmcomplbydt_rand" "frmdbasebydt_rand" "visitdate_lr"

[13] "frmcomplbydt_lr" "frmdbasebydt_lr" "visitdate_fu"

[16] "txmed1startdt_fu" "txmed1stopdt_fu" "txmed2startdt_fu"

[19] "txmed2stopdt_fu" "nxtvisitdt_fu" "frmcomplbydt_fu"

[22] "frmdbasebydt_fu"

# overview these variables describe (data[,var_list])

data[, var_list]

22 Variables 60 Observations

---------------------------------------------------------------------------

visitdate_en : Visit Date

n missing distinct value

20 40 1 2018-01-

Value 2018-01-

Frequency 20

Proportion 1

---------------------------------------------------------------------------

randdate_rand : Randomization: Date

n missing distinct value

20 40 1 2018-01-

Value 2018-01-

Frequency 20

Proportion 1

---------------------------------------------------------------------------

visitdate_lr : Sample collection: Visit Date

n missing distinct

60 0 3

Value 2018-01-01 2018-07-01 2019-01-

Frequency 20 20 20

Proportion 0.333 0.333 0.

---------------------------------------------------------------------------

visitdate_fu : Visit Date

n missing distinct

40 20 2

Value 2018-07-01 2019-01-

Frequency 20 20

Proportion 0.5 0.

---------------------------------------------------------------------------

Variables with all observations missing:

[1] txmed1startdt_en txmed1stopdt_en txmed2startdt_en

[4] txmed2stopdt_en nxtvisitdt_en frmcomplbydt_en

[7] frmdbasebydt_en frmcomplbydt_rand frmdbasebydt_rand

[10] frmcomplbydt_lr frmdbasebydt_lr txmed1startdt_fu

[13] txmed1stopdt_fu txmed2startdt_fu txmed2stopdt_fu

[16] nxtvisitdt_fu frmcomplbydt_fu frmdbasebydt_fu

_# check the format --> date variables were imported as characters (chr)

str(data[,var_list])_

# format date variables into date formats

Proportion 0.5 0.

---------------------------------------------------------------------------

# tables <encourage using "exclude=NULL" to display missing values> table (data $ redcap_event_name, data $ visitcode_en, exclude=NULL)

## 0

enrollment_arm_1 20 0

followup_12month_arm_1 0 20

followup_6month_arm_1 0 20

table (data $ redcap_event_name, data $ visitcode_fu, exclude=NULL)

## 6 12

enrollment_arm_1 0 0 20

followup_12month_arm_1 0 20 0

followup_6month_arm_1 20 0 0

table (data $ redcap_event_name, data $ visitcode_lr, exclude=NULL)

enrollment_arm_1 20 0 0

followup_12month_arm_1 0 0 20

followup_6month_arm_1 0 20 0

# check for issues ck3 = data %>% mutate (check1 = ifelse ( !is.na (visitcode_en) & !is.na (visitcode_lr) & visitcode_en != visitcode_lr, 1, NA), check2 = ifelse ( !is.na (visitcode_fu) & !is.na (visitcode_lr) & visitcode_fu != visitcode_lr, 1, NA))

if ( !all ( is.na (ck3[, c ("check1", "check2")]))){ cat ("QC VISIT CODES FAILED!!! check the data below") print (ck3[ which ( !all ( is.na (ck3[, c ("check1", "check2")]))),]) } else { cat ( "QC VISIT CODES PASSED!")}

## QC VISIT CODES PASSED!

2.3 Restructure the master REDCap file: Partition the Events and Variables

that correspond to each CRF.

2.3.1 RANDOMIZATION: EXTRACT.

table ( data $ redcap_event_name, data $ crfversion_rand, exclude=NULL)

## 1

enrollment_arm_1 20 0

followup_12month_arm_1 0 20

followup_6month_arm_1 0 20

Randomization0 <- data %>% # KEEP DESIRED EVENT: enrollment filter ( redcap_event_name == "enrollment_arm_1" ) %>% filter ( !is.na ( crfversion_rand) ) %>%

# KEEP VARIABLES FOR THE CRF. select ( study_id, redcap_event_name, crfversion_rand : dbasecomment3_rand, randomization_rand_complete) %>%

# create a new variable mutate (rand_indata=1)

# let's see the result table ( Randomization0 $ redcap_event_name, Randomization0 $ crfversion_rand, exclude=NULL)

enrollment_arm_1 20

# 20 subjects

2.3.2 VISITS_ENROLLMENT: EXTRACT.

table ( data $ redcap_event_name, data $ crfversion_en, exclude=NULL)

## 1

enrollment_arm_1 20 0

followup_12month_arm_1 0 20

followup_6month_arm_1 0 20

Visits_Enrollment0 <- data %>% # KEEP DESIRED EVENT: enrollment filter ( redcap_event_name == "enrollment_arm_1" ) %>% filter ( !is.na ( crfversion_en) ) %>%

# KEEP VARIABLES FOR THE CRF. select ( study_id, redcap_event_name, crfversion_en : dbasecomment3_en, enrollment_en_complete) %>%

# 20 subjects with 2*20 observations

2.3.4 LABS (ENROLLMENT+FOLLOWUP): EXTRACT.

table ( data $ redcap_event_name, data $ crfversion_lr, exclude=NULL)

enrollment_arm_1 20

followup_12month_arm_1 20

followup_6month_arm_1 20

Labs0 <- data %>% # KEEP DESIRED EVENT: FOLLOWUP. mutate ( keepvisit = ifelse ( redcap_event_name %in% c ( "enrollment_arm_1", "followup_6month_arm_1", "followup_12month_arm_1" ), 1, NA )) %>% filter ( keepvisit == 1) %>% select ( - keepvisit ) %>%

filter ( !is.na ( crfversion_lr) ) %>%

# KEEP VARIABLES FOR THE CRF. select ( study_id, redcap_event_name, crfversion_lr : dbasecomment3_lr, lab_results_lr_complete) %>%

# SORT arrange (study_id, visitcode_lr)

# let's see the result table ( Labs0 $ redcap_event_name, Labs0 $ crfversion_lr, exclude=NULL)

enrollment_arm_1 20

followup_12month_arm_1 20

followup_6month_arm_1 20

# 20 subjects with 3*20 observations & 21 variables

2.4 The next syntax file “R User Guide 2 - Data Management” performs the

following steps:

2.4.1 Prepare files for incorporation with either of two files: (1) Main (1 record/participant), (2) Longitudinal (1 record/visit).

Subsequent steps merge these two files together to create additional variables (e.g. 1st pre-HAART viral load), and analysis follows.

2.4.2 Append and merge to create the Main and Longitudinal files.

2.4.3 Create other analysis variables.

2.4.4 Example descriptives, figures, and analyses

Please email lwang7@uw.edu for the source code. Thanks!

R User Guide 1 - Data Restructuring, Study notes of Corporate Finance

Related documents

Partial preview of the text

Download R User Guide 1 - Data Restructuring and more Study notes Corporate Finance in PDF only on Docsity!

R User Guide 1 - Data Restructuring

Lei Wang, CFAR Biometics Core, University of Washington

Updated: November 17, 2019

Contents

1 Introduction

1.1 Install R and RStudio

1.2 Create a new R Script

Description: Introduce the data management and analysis skills in R_

1.4 Import and export dataset

1.5 R Basics

1.6 dplyr

dplyr use "%>%" to seperate command

with "%>%", R will execute the next line until a line without %>% at the end_

2.1 Import dataset from REDCap into R

[1] "C:/Users/lwang7/OneDrive - UW/18_CFAR_Training_Guide/Data"

2 9.9e+09 followup_12month_arm_1 NA NA

3 9.9e+09 followup_6month_arm_1 NA NA

4 9.9e+09 enrollment_arm_1 NA 9.9e+

5 9.9e+09 followup_12month_arm_1 NA NA

ptid_rand ptid_lr ptid_fu redcap_event_name.

1 9.9e+09 9.9e+09 NA enrollment_arm_

2 NA 9.9e+09 9.9e+09 followup_12month_arm_

3 NA 9.9e+09 9.9e+09 followup_6month_arm_

4 9.9e+09 9.9e+09 NA enrollment_arm_

5 NA 9.9e+09 9.9e+09 followup_12month_arm_

QC PTIDs PASSED!

2.2.2 QC DATES

[1] "visitdate_en" "txmed1startdt_en" "txmed1stopdt_en"

[4] "txmed2startdt_en" "txmed2stopdt_en" "nxtvisitdt_en"

[7] "frmcomplbydt_en" "frmdbasebydt_en" "randdate_rand"

[10] "frmcomplbydt_rand" "frmdbasebydt_rand" "visitdate_lr"

[13] "frmcomplbydt_lr" "frmdbasebydt_lr" "visitdate_fu"

[16] "txmed1startdt_fu" "txmed1stopdt_fu" "txmed2startdt_fu"

[19] "txmed2stopdt_fu" "nxtvisitdt_fu" "frmcomplbydt_fu"

[22] "frmdbasebydt_fu"

data[, var_list]

22 Variables 60 Observations

---------------------------------------------------------------------------

visitdate_en : Visit Date

n missing distinct value

20 40 1 2018-01-

Value 2018-01-

Frequency 20

Proportion 1

---------------------------------------------------------------------------

randdate_rand : Randomization: Date

n missing distinct value

20 40 1 2018-01-

Value 2018-01-

Frequency 20

Proportion 1

---------------------------------------------------------------------------

visitdate_lr : Sample collection: Visit Date

n missing distinct

60 0 3

Value 2018-01-01 2018-07-01 2019-01-

Frequency 20 20 20

Proportion 0.333 0.333 0.

---------------------------------------------------------------------------

visitdate_fu : Visit Date

n missing distinct

40 20 2

Value 2018-07-01 2019-01-

Frequency 20 20

Proportion 0.5 0.

---------------------------------------------------------------------------

Variables with all observations missing:

[1] txmed1startdt_en txmed1stopdt_en txmed2startdt_en

[4] txmed2stopdt_en nxtvisitdt_en frmcomplbydt_en

[7] frmdbasebydt_en frmcomplbydt_rand frmdbasebydt_rand

[10] frmcomplbydt_lr frmdbasebydt_lr txmed1startdt_fu

[13] txmed1stopdt_fu txmed2startdt_fu txmed2stopdt_fu

[16] nxtvisitdt_fu frmcomplbydt_fu frmdbasebydt_fu

str(data[,var_list])_

Proportion 0.5 0.

---------------------------------------------------------------------------