Chapter 2 Snippets
Random useful snippets that do not fit anywhere else.
2.1 Useful RStudio functions
2.1.1 Clean restart
Cleans the current environment and restarts the R session (no code can run past this). Indicate objects to leave alone by supplying a character vector to the leave argument.
clean_restart = function(leave = NULL){
  rm(list = setdiff(ls(envir = .GlobalEnv), leave), envir = .GlobalEnv)
  invisible(.rs.restartR())
}
2.1.2 Remove duplicated packages between personal and system libraries
This removes any user library packages that are also present in the system library. Use this to minimise loading conflicts. Change tdrake in the example to your RStudio username.
library(tidyverse)

tdrake_packages = installed.packages() %>% # change 'tdrake' to your rstudio username
  as_tibble() %>%
  filter(LibPath %>% str_detect("tdrake")) %>%
  pull(Package)

system_packages = installed.packages() %>%
  as_tibble() %>%
  filter(LibPath %>% str_detect("local")) %>%
  pull(Package)

to_remove = tdrake_packages[tdrake_packages %in% system_packages]

remove.packages(to_remove)
2.2 Memory and efficient RAM usage
- Restart R often
- Use rm(...) to get rid of big files after using them
- Save large objects that take a while to create but are needed often, compressing with "gz": write_rds(x, "big_file.rds.gz", compress = "gz")
- Run gc() (garbage collect)
- Check memory usage: pryr::mem_used()

Most importantly, if using big files (even if they aren't loaded in your environment!) do the following:

- save.image("temp.rda")
- Restart R
- load("temp.rda")
- Delete "temp.rda" from the Files pane
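The snapshot-and-restore steps above can be sketched in one place; "temp.rda" is just an example file name, and rm(list = ls()) stands in for the restart (which also clears the environment):

```r
x = 1:1e6                # pretend this object took a long time to create
save.image("temp.rda")   # snapshot the global environment to disk
rm(list = ls())          # a restart would also clear the environment
load("temp.rda")         # restore the snapshot in the fresh session
file.remove("temp.rda")  # tidy up once restored
```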
2.3 Exporting tables that work in both PDF and Word
The kable() function from library(knitr) formats tables for all three output formats (PDF, HTML and Word), but it is a bit limited. library(kableExtra) is great for table customisations that work in PDF and HTML. kableExtra functions do not work for Word output, but library(flextable) does work for Word (but not for PDF).
We can define our own table printing function that uses kableExtra or flextable based on output type:
# This makes table resize or continue over multiple pages in all output types
# PDF powered by kableExtra, Word by flextable
mytable = function(x, caption = "", longtable = FALSE, ...){
  library(kableExtra)
  # if not latex or html then else is Word
  if (knitr::is_latex_output() | knitr::is_html_output()){
    knitr::kable(x, row.names = FALSE, align = c("l", "l", "r", "r", "r", "r", "r", "r", "r"),
                 booktabs = TRUE, caption = caption, longtable = longtable,
                 linesep = "", ...) %>%
      kableExtra::kable_styling(latex_options = c("scale_down", "hold_position"))
  } else {
    flextable::flextable(x) %>%
      flextable::autofit() %>%
      flextable::width(j = 1, width = 1.5) %>%
      flextable::height(i = 1, height = 0.5, part = "header")
  }
}

library(dplyr)
cars %>%
  head() %>%
  mytable()
| speed | dist |
|-------|------|
| 4     | 2    |
| 4     | 10   |
| 7     | 4    |
| 7     | 22   |
| 8     | 16   |
| 9     | 10   |
2.3.1 Warning in kableExtra. Please specify format in kable. kableExtra can customize either HTML or LaTeX outputs.
This warning goes away when you load library(kableExtra) in your script, instead of just using kableExtra::function(). Loading library(kableExtra) sets an option that specifies the format for the R session, so the warning goes away. It does mean that the render() function may fail, since it knits in the environment you are in, rather than a new one like the Knit button does.
2.5 Working with CHIs
Here are 4 functions for CHIs that could even be put in a small package. The Community Health Index (CHI) is a population register, which is used in Scotland for health care purposes. The CHI number uniquely identifies a person on the index.
2.5.1 chi_dob()
- Extract date of birth from CHI
Note cutoff_2000. As the CHI includes only a two-digit year, we need to decide whether a year belongs to the 1900s or the 2000s. I don't think there is a formal way of determining this.
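The cutoff behaviour can be illustrated in base R, assuming lubridate's convention that two-digit years up to and including the cutoff are parsed as 2000s:

```r
# Two-digit years from the example CHIs below, with the default cutoff of 20
yy = c(18, 43, 49)
cutoff_2000 = 20
ifelse(yy <= cutoff_2000, 2000 + yy, 1900 + yy)
# 2018 1943 1949: "18" becomes 2018, while 43 and 49 fall back to the 1900s
```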
library(dplyr)

chi = c("1009701234", "1811431232", "1304496368")
# These CHIs are not real.
# The first is invalid, two and three are valid.

# Cut-off: anything before that number is considered 2000s
# i.e. at cutoff_2000 = 20, "18" is considered 2018, rather than 1918.
chi_dob = function(.data, cutoff_2000 = 20){
  .data %>%
    stringr::str_extract(".{6}") %>%
    lubridate::parse_date_time2("dmy", cutoff_2000 = cutoff_2000) %>%
    lubridate::as_date() # Make Date object, rather than POSIXct
}

chi_dob(chi)
## [1] "1970-09-10" "1943-11-18" "1949-04-13"
# From tibble
tibble(chi = chi) %>%
  mutate(
    dob = chi_dob(chi)
  )
## # A tibble: 3 × 2
## chi dob
## <chr> <date>
## 1 1009701234 1970-09-10
## 2 1811431232 1943-11-18
## 3 1304496368 1949-04-13
2.5.2 chi_gender()
- Extract gender from CHI
The ninth digit is odd for men and even for women. A test for even is x %% 2 == 0.
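A quick base R check of that test on the example CHIs:

```r
# Ninth digit of each example CHI, then the even/odd test
chi = c("1009701234", "1811431232", "1304496368")  # not real CHIs
ninth = as.numeric(substr(chi, 9, 9))
ninth %% 2 == 0
# FALSE FALSE TRUE, i.e. Male, Male, Female
```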
chi_gender = function(.data){
  .data %>%
    stringr::str_sub(9, 9) %>%
    as.numeric() %>%
    {ifelse(. %% 2 == 0, "Female", "Male")}
}

chi_gender(chi)
## [1] "Male" "Male" "Female"
# From tibble
tibble(chi = chi) %>%
  mutate(
    dob = chi_dob(chi),
    gender = chi_gender(chi)
  )
## # A tibble: 3 × 3
## chi dob gender
## <chr> <date> <chr>
## 1 1009701234 1970-09-10 Male
## 2 1811431232 1943-11-18 Male
## 3 1304496368 1949-04-13 Female
2.5.3 chi_age()
- Extract age from CHI
Works for a single date or a vector of dates.
chi_age = function(.data, ref_date, cutoff_2000 = 20){
  dob = chi_dob(.data, cutoff_2000 = cutoff_2000)
  lubridate::interval(dob, ref_date) %>%
    as.numeric("years") %>%
    floor()
}

# Today
chi_age(chi, Sys.time())
## [1] 53 80 75
# Single date
library(lubridate)
chi_age(chi, dmy("11/09/2018"))
## [1] 48 74 69
# Vector
dates = dmy("11/09/2018",
            "09/05/2015",
            "10/03/2014")

chi_age(chi, dates)
## [1] 48 71 64
# From tibble
tibble(chi = chi) %>%
  mutate(
    dob = chi_dob(chi),
    gender = chi_gender(chi),
    age = chi_age(chi, Sys.time())
  )
## # A tibble: 3 × 4
## chi dob gender age
## <chr> <date> <chr> <dbl>
## 1 1009701234 1970-09-10 Male 53
## 2 1811431232 1943-11-18 Male 80
## 3 1304496368 1949-04-13 Female 75
2.5.4 chi_valid()
- Logical test for valid CHI
The final digit of the CHI can be used to test that the number is correct via the modulus 11 algorithm.
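A worked example in base R, using one of the valid example CHIs from above, makes the steps explicit:

```r
# Check-digit arithmetic for "1811431232" (not a real CHI)
digits = as.numeric(strsplit("1811431232", "")[[1]])
weighted_sum = sum(digits[1:9] * 10:2)   # weights 10 down to 2 on the first nine digits
weighted_sum                             # 152
11 - weighted_sum %% 11                  # 2, which matches the 10th digit, so valid
```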
chi_valid = function(.data){
  .data %>%
    stringr::str_split("", simplify = TRUE) %>%
    .[, -10] %>%               # Working with matrices hence brackets
    apply(1, as.numeric) %>%   # Convert from string
    {seq(10, 2) %*% .} %>%     # Multiply and sum step
    {. %% 11} %>%              # Modulus 11
    {11 - .} %>%               # Subtract from 11
    dplyr::near(               # Compare result with 10th digit
      stringr::str_sub(.data, 10) %>% as.numeric()
    ) %>%
    as.vector()
}

chi_valid(chi)
## [1] FALSE TRUE TRUE
# From tibble
tibble(chi = chi) %>%
  mutate(
    dob = chi_dob(chi),
    gender = chi_gender(chi),
    age = chi_age(chi, Sys.time()),
    chi_valid = chi_valid(chi)
  )
## # A tibble: 3 × 5
## chi dob gender age chi_valid
## <chr> <date> <chr> <dbl> <lgl>
## 1 1009701234 1970-09-10 Male 53 FALSE
## 2 1811431232 1943-11-18 Male 80 TRUE
## 3 1304496368 1949-04-13 Female 75 TRUE
2.6 Working with dates
2.6.1 Difference between two dates
I always forget how to do this neatly. I often want days as a numeric, not a lubridate type object.
library(lubridate)

date1 = dmy("12/03/2018", "14/05/2017")
date2 = dmy("11/09/2019", "11/04/2019")

interval(date1, date2) %>%
  as.numeric("days")
## [1] 548 697
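For plain Date objects, base R subtraction wrapped in as.numeric() gives the same answer:

```r
# Difference in days without lubridate
date1 = as.Date(c("2018-03-12", "2017-05-14"))
date2 = as.Date(c("2019-09-11", "2019-04-11"))
as.numeric(date2 - date1)
# 548 697
```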
2.6.2 Lags
This is useful for calculating, for instance, the period off medications. Lags are much better than long-to-wide solutions for this.
library(tidyverse)
library(lubridate)

id = c(2, 2, 2, 2, 3, 5)
medication = c("aspirin", "aspirin", "aspirin", "tylenol", "lipitor", "advil")
start.date = c("05/01/2017", "05/30/2017", "07/15/2017", "05/01/2017", "05/06/2017", "05/28/2017")
stop.date = c("05/04/2017", "06/10/2017", "07/27/2017", "05/15/2017", "05/12/2017", "06/13/2017")

df = tibble(id, medication, start.date, stop.date)
df
## # A tibble: 6 × 4
## id medication start.date stop.date
## <dbl> <chr> <chr> <chr>
## 1 2 aspirin 05/01/2017 05/04/2017
## 2 2 aspirin 05/30/2017 06/10/2017
## 3 2 aspirin 07/15/2017 07/27/2017
## 4 2 tylenol 05/01/2017 05/15/2017
## 5 3 lipitor 05/06/2017 05/12/2017
## 6 5 advil 05/28/2017 06/13/2017
df %>%
  mutate_at(c("start.date", "stop.date"), lubridate::mdy) %>% # make a date
  arrange(id, medication, start.date) %>%
  group_by(id, medication) %>%
  mutate(
    start_date_diff = start.date - lag(start.date),
    medication_period = stop.date - start.date
  )
## # A tibble: 6 × 6
## # Groups: id, medication [4]
## id medication start.date stop.date start_date_diff medication_period
## <dbl> <chr> <date> <date> <drtn> <drtn>
## 1 2 aspirin 2017-05-01 2017-05-04 NA days 3 days
## 2 2 aspirin 2017-05-30 2017-06-10 29 days 11 days
## 3 2 aspirin 2017-07-15 2017-07-27 46 days 12 days
## 4 2 tylenol 2017-05-01 2017-05-15 NA days 14 days
## 5 3 lipitor 2017-05-06 2017-05-12 NA days 6 days
## 6 5 advil 2017-05-28 2017-06-13 NA days 16 days
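The lag trick itself needs nothing beyond base R; this sketch reproduces the start_date_diff values for the aspirin rows of id 2:

```r
# dplyr::lag shifts a vector one place, putting NA at the front
starts = as.Date(c("2017-05-01", "2017-05-30", "2017-07-15"))
lagged = c(as.Date(NA), starts[-length(starts)])
as.numeric(starts - lagged)
# NA 29 46, as in the table above
```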
2.6.3 Pulling out “change in status” data
If you have a number of episodes per patient, each with a status and a time, then you need to do this as a starting point for Cox proportional hazards (CPH) analysis.
2.6.4 Guess the Format of Dates
This function helps guess the date format in a dataset you have received. Add in new formats (as per lubridate) if they don't match. Once you work out the format, extract the date with mdy() or dmy() etc. as appropriate.
library(tidyverse)
library(lubridate)
lakers = lakers %>%
  mutate(mm_date = paste0(str_sub(date, start = 5, end = 6), "/",
                          str_sub(date, start = 7, end = 8), "/",
                          str_sub(date, start = 1, end = 4)),
         dd_date = paste0(str_sub(date, start = 7, end = 8), "/",
                          str_sub(date, start = 5, end = 6), "/",
                          str_sub(date, start = 1, end = 4)))
guess_formats(lakers$mm_date, c("dmy", "mdy"), print_matches = TRUE)
guess_formats(lakers$dd_date, c("dmy", "mdy"), print_matches = TRUE)
2.6.4.1 Example data
library(dplyr)
library(lubridate)
library(finalfit)
mydata = tibble(
  id = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5),
  status = c(0,0,0,1,0,0,1,1,0,0,0,0,0,1,1,1,0,0,1,1),
  group = c(rep(0, 8), rep(1, 12)) %>% factor(),
  opdate = rep("2010/01/01", 20) %>% ymd(),
  status_date = c(
    "2010/02/01", "2010/03/01", "2010/04/01", "2010/05/01",
    "2010/02/02", "2010/03/02", "2010/04/02", "2010/05/02",
    "2010/02/03", "2010/03/03", "2010/04/03", "2010/05/03",
    "2010/02/04", "2010/03/04", "2010/04/04", "2010/05/04",
    "2010/02/05", "2010/03/05", "2010/04/05", "2010/05/05"
  ) %>% ymd()
)
mydata
## # A tibble: 20 × 5
## id status group opdate status_date
## <dbl> <dbl> <fct> <date> <date>
## 1 1 0 0 2010-01-01 2010-02-01
## 2 1 0 0 2010-01-01 2010-03-01
## 3 1 0 0 2010-01-01 2010-04-01
## 4 1 1 0 2010-01-01 2010-05-01
## 5 2 0 0 2010-01-01 2010-02-02
## 6 2 0 0 2010-01-01 2010-03-02
## 7 2 1 0 2010-01-01 2010-04-02
## 8 2 1 0 2010-01-01 2010-05-02
## 9 3 0 1 2010-01-01 2010-02-03
## 10 3 0 1 2010-01-01 2010-03-03
## 11 3 0 1 2010-01-01 2010-04-03
## 12 3 0 1 2010-01-01 2010-05-03
## 13 4 0 1 2010-01-01 2010-02-04
## 14 4 1 1 2010-01-01 2010-03-04
## 15 4 1 1 2010-01-01 2010-04-04
## 16 4 1 1 2010-01-01 2010-05-04
## 17 5 0 1 2010-01-01 2010-02-05
## 18 5 0 1 2010-01-01 2010-03-05
## 19 5 1 1 2010-01-01 2010-04-05
## 20 5 1 1 2010-01-01 2010-05-05
2.6.4.2 Compute time from op date to current review
… if necessary
mydata = mydata %>%
  arrange(id, status_date) %>%
  mutate(
    time = interval(opdate, status_date) %>% as.numeric("days")
  )
mydata
## # A tibble: 20 × 6
## id status group opdate status_date time
## <dbl> <dbl> <fct> <date> <date> <dbl>
## 1 1 0 0 2010-01-01 2010-02-01 31
## 2 1 0 0 2010-01-01 2010-03-01 59
## 3 1 0 0 2010-01-01 2010-04-01 90
## 4 1 1 0 2010-01-01 2010-05-01 120
## 5 2 0 0 2010-01-01 2010-02-02 32
## 6 2 0 0 2010-01-01 2010-03-02 60
## 7 2 1 0 2010-01-01 2010-04-02 91
## 8 2 1 0 2010-01-01 2010-05-02 121
## 9 3 0 1 2010-01-01 2010-02-03 33
## 10 3 0 1 2010-01-01 2010-03-03 61
## 11 3 0 1 2010-01-01 2010-04-03 92
## 12 3 0 1 2010-01-01 2010-05-03 122
## 13 4 0 1 2010-01-01 2010-02-04 34
## 14 4 1 1 2010-01-01 2010-03-04 62
## 15 4 1 1 2010-01-01 2010-04-04 93
## 16 4 1 1 2010-01-01 2010-05-04 123
## 17 5 0 1 2010-01-01 2010-02-05 35
## 18 5 0 1 2010-01-01 2010-03-05 63
## 19 5 1 1 2010-01-01 2010-04-05 94
## 20 5 1 1 2010-01-01 2010-05-05 124
2.6.4.3 Pull out “change of status”
mydata = mydata %>%
  group_by(id) %>%
  mutate(
    status_change = status - lag(status) == 1, # Mark TRUE if goes from 0 to 1
    status_nochange = sum(status) == 0, # Mark if no change from 0
    status_nochange_keep = !duplicated(status_nochange, fromLast = TRUE) # Mark most recent "no change" episode
  ) %>%
  filter(status_change | (status_nochange & status_nochange_keep)) %>% # Filter out other episodes
  select(-c(status_change, status_nochange, status_nochange_keep)) # Remove columns not needed

mydata
## # A tibble: 5 × 6
## # Groups: id [5]
## id status group opdate status_date time
## <dbl> <dbl> <fct> <date> <date> <dbl>
## 1 1 1 0 2010-01-01 2010-05-01 120
## 2 2 1 0 2010-01-01 2010-04-02 91
## 3 3 0 1 2010-01-01 2010-05-03 122
## 4 4 1 1 2010-01-01 2010-03-04 62
## 5 5 1 1 2010-01-01 2010-04-05 94
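The least obvious line above is the duplicated(..., fromLast = TRUE) trick; in base R it marks only the last element of a run of identical values:

```r
# For a patient who never changes status, status_nochange is TRUE on every row;
# negating duplicated(fromLast = TRUE) keeps only the most recent one
v = c(TRUE, TRUE, TRUE, TRUE)
!duplicated(v, fromLast = TRUE)
# FALSE FALSE FALSE TRUE
```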
2.6.5 Isoweek and epiweek
Working with weeks of the year to aggregate data is frequently painful: how best to plot, and how to deal with multiple years. The easiest solution I have found (EMH) is to work with the first date of each isoweek: make a look-up table and join.
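Base R can show why the naive isoweek-plus-calendar-year combination breaks at year boundaries (%V is the ISO 8601 week number in strptime formats):

```r
# 31 December 2020 and 1 January 2021 are in the same ISO week
format(as.Date(c("2020-12-31", "2021-01-01")), "%V")
# both "53" - the week belongs to 2020, not to the calendar year of the date
```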
# Generate random set of dates for demonstration purposes
## This is not required when using normally
library(tidyverse)
library(lubridate)
library(finalfit)

colon_s = colon_s %>%
  mutate(
    date = seq(ymd(20200101), ymd(20220601), by = "day") %>%
      sample(929, replace = TRUE)
  )

# Make isoweek start date look-up table
## Make sure it includes the entire range of your dates of interest.
isoweek_lookup = tibble(date = seq(ymd(20200101), ymd(20220601), by = 1)) %>%
  mutate(isoweek = lubridate::isoweek(date),
         year = lubridate::year(date))

isoweek_lookup = isoweek_lookup %>%
  left_join(isoweek_lookup %>%
              group_by(isoweek, year) %>%
              slice(1) %>%
              select(isoweek_date = date, isoweek, year)
  )
## Joining with `by = join_by(isoweek, year)`
# Add isoweek_date (first date of a particular isoweek) and isoweek to analysis dataset.
colon_s %>%
  left_join(isoweek_lookup, by = c("date" = "date")) %>%
  select(date, isoweek, isoweek_date) %>%
  slice(1:6)
## date isoweek isoweek_date
## 1 2020-06-15 25 2020-06-15
## 2 2020-06-20 25 2020-06-15
## 3 2020-02-23 8 2020-02-17
## 4 2020-11-16 47 2020-11-16
## 5 2020-06-27 26 2020-06-22
## 6 2021-06-12 23 2021-06-07
2.7 Labelling tables and figures in markdown chunks
2.8 Join tables and overwrite values (that are often missing)
From here: https://stackoverflow.com/questions/35637135/left-join-two-data-frames-and-overwrite
library(dplyr)

.keys <- c("key1", "key2")

.base_table <- tribble(
  ~key1, ~key2, ~val1, ~val2,
  "A", "a", 0, 0,
  "A", "b", 0, 1,
  "B", "a", 1, 0,
  "B", "b", 1, 1)

.join_table <- tribble(
  ~key1, ~key2, ~val2,
  "A", "b", 100,
  "B", "a", 111)

# This works
df_result <- .base_table %>%
  # Pull off rows from base table that match the join table
  semi_join(.join_table, .keys) %>%
  # Drop cols from base table that are in join table, except for the key columns
  select(-matches(setdiff(names(.join_table), .keys))) %>%
  # Left join on the join table columns
  left_join(.join_table, .keys) %>%
  # Remove the matching rows from the base table, and
  # bind on the newly joined result from above.
  bind_rows(.base_table %>% anti_join(.join_table, .keys))

df_result %>%
  print()
## # A tibble: 4 × 4
## key1 key2 val1 val2
## <chr> <chr> <dbl> <dbl>
## 1 A b 0 100
## 2 B a 1 111
## 3 A a 0 0
## 4 B b 1 1
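Worth knowing: since dplyr 1.0.0, rows_update() covers this pattern directly, overwriting matching rows in place and preserving row order. A self-contained sketch with the same tables:

```r
library(dplyr)

.base_table <- tribble(
  ~key1, ~key2, ~val1, ~val2,
  "A", "a", 0, 0,
  "A", "b", 0, 1,
  "B", "a", 1, 0,
  "B", "b", 1, 1)

.join_table <- tribble(
  ~key1, ~key2, ~val2,
  "A", "b", 100,
  "B", "a", 111)

# Overwrite val2 where the keys match; val1 and row order are preserved
rows_update(.base_table, .join_table, by = c("key1", "key2"))
```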
2.9 Read in multiple files
As of July 2021, library(readr) is happy to read multiple files into the same tibble without you having to use purrr::map_dfr().
library(tidyverse)
exampledata = read_csv(fs::dir_ls(path = "data_folder", glob = "*.csv"))
It’s also happy to automatically uncompress files ending with .zip, .gz or similar.
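If you are on an older readr, a base R fallback does the same job ("data_folder" here is a placeholder path, as above):

```r
# Read every CSV in a folder and stack the rows (base R)
files = list.files("data_folder", pattern = "\\.csv$", full.names = TRUE)
exampledata = do.call(rbind, lapply(files, read.csv))
```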
2.10 Averaging multiple AUCs (say from imputed datasets)
Final method here probably makes most sense:
auc = c(0.5, 0.6, 0.7, 0.8, 0.9, 0.99999)

mean(auc) # Mean
## [1] 0.7499983
median(auc) # Median
## [1] 0.75
prod(auc)^(1/length(auc)) # Geometric mean
## [1] 0.7298908
exp(mean(log(auc))) # Geometric mean again
## [1] 0.7298908
log(auc) # Logs
## [1] -6.931472e-01 -5.108256e-01 -3.566749e-01 -2.231436e-01 -1.053605e-01
## [6] -1.000005e-05
log(auc / (1 - auc)) # Put values on scale from 0 (0.5) to Inf (1)
## [1]  0.0000000  0.4054651  0.8472979  1.3862944  2.1972246 11.5129155
exp(mean(log(auc / (1 - auc)))) /      # Mean on the above scale
  (1 + exp(mean(log(auc / (1 - auc)))))
## [1] 0.9384781
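That last expression is the mean on the logit scale; base R's qlogis()/plogis() express the same thing more compactly, since qlogis(p) = log(p / (1 - p)) and plogis() is its inverse:

```r
auc = c(0.5, 0.6, 0.7, 0.8, 0.9, 0.99999)
plogis(mean(qlogis(auc)))
## [1] 0.9384781
```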
2.11 Formatting thousands automatically
Works in R Markdown and Quarto.
Before: There are 1.2345 × 10^4 patients in this dataset.
inline_hook = function(x) {
  if (is.numeric(x)) {
    format(x, big.mark = ",")
  } else x
}
knitr::knit_hooks$set(inline = inline_hook)
After: There are 12,345 patients in this dataset.
This only works on numbers entered by inline R.
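The hook boils down to a single base R call, which can be checked directly:

```r
# What the hook applies to every inline numeric value
format(12345, big.mark = ",")
## [1] "12,345"
```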
2.12 DataTable function
mydt = function(df, pagelength = 20){
  DT::datatable(df,
                rownames = FALSE,
                extensions = c('Buttons'),
                filter = "top",
                options = list(
                  pageLength = pagelength,
                  autoWidth = TRUE,
                  scrollX = TRUE,
                  searchHighlight = TRUE))
}
Usage:
palmerpenguins::penguins %>%
  slice(1:5) %>%
  mydt()