Data science fundamentals 05: Wrangle data (Oct. 05, 2017)


Recorded Stream



A note on tibbles

We already saw tibbles and how to convert data frames into tibbles (using as.tibble()). Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen.

  # Load the package RCurl
  library(RCurl)
  # Import the HANES data set from GitHub; break the string into two for readability
  # (note how the long URL is broken into two strings purely for readability)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1,URL_text_2, sep="")
  HANES <- read.csv(text=getURL(URL))
  # Rename the GENDER factor for identification
  HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
  # Rename the AGEGROUP factor for identification
  HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
  # Rename the HSQ_1 factor for identification
  HANES$HSQ_1 <- factor(HANES$HSQ_1, labels=c("Excellent","Very Good","Good", "Fair", "Poor"))
  # Rename the DX_DBTS as a factor
  HANES$DX_DBTS <- factor(HANES$DX_DBTS, labels=c("DIAB","DIAB NO_DX","NO DIAB"))
  # Omit all NA from the data frame
  HANES <- na.omit(HANES)
  # Observe the structure
  str(HANES)
## 'data.frame':    1112 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 43 44 53 55 70 84 90 100 107 ...
##  $ GENDER           : Factor w/ 2 levels "M","F": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 28 27 24 30 26 31 32 34 32 ...
##  $ AGEGROUP         : Factor w/ 3 levels "20-39","40-59",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : Factor w/ 5 levels "Excellent","Very Good",..: 2 2 2 1 1 3 1 2 1 3 ...
##  $ UCREATININE      : int  105 53 314 105 163 150 46 36 177 156 ...
##  $ UALBUMIN         : num  0.707 1 8 4 3 2 2 0.707 4 3 ...
##  $ UACR             : num  0.00673 2 3 4 2 ...
##  $ MERCURYU         : num  0.37 0.106 0.487 2.205 0.979 ...
##  $ DX_DBTS          : Factor w/ 3 levels "DIAB","DIAB NO_DX",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 5.2 ...
##  $ CADMIUM          : num  0.2412 0.1732 0.0644 0.0929 0.1202 ...
##  $ LEAD             : num  1.454 1.019 0.863 1.243 0.612 ...
##  $ MERCURYTOTALBLOOD: num  2.34 2.57 1.32 14.66 2.13 ...
##  $ HDL              : int  42 51 42 61 52 50 57 56 42 44 ...
##  $ CHOLESTEROLTOTAL : int  184 157 145 206 120 155 156 235 156 120 ...
##  $ GLUCOSESI        : num  4.61 4.77 5.16 5 5.11 ...
##  $ CREATININESI     : num  74.3 73 80 84.9 66 ...
##  $ CREATININE       : num  0.84 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 1.09 ...
##  $ TRIGLYCERIDE     : int  156 43 108 65 51 29 31 220 82 35 ...
##  $ GLUCOSE          : int  83 86 93 90 92 85 72 87 96 92 ...
##  $ COTININE         : num  31.5918 0.0635 0.035 0.0514 0.035 ...
##  $ LDLESTIMATE      : int  111 97 81 132 58 99 93 135 98 69 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:415] 2 15 16 24 26 28 33 34 35 39 ...
##   .. ..- attr(*, "names")= chr [1:415] "2" "15" "16" "24" ...
  # Load the tidyverse library
  library(tidyverse)

To extract a single variable inside a pipe, we have to use the special placeholder . when working with a tibble.

  # Convert HANES to tibble
  HANES_tibble <- as.tibble(HANES) 
  # Subset A1C variable from the tibble HANES
  HANES_A1C <- HANES_tibble %>% .$A1C 
  # You can set the print options for the 
  # number of lines to be printed
  options(tibble.print_min = 7)
  as.tibble(HANES_A1C)
## # A tibble: 1,112 x 1
##   value
##   <dbl>
## 1   5.0
## 2   5.2
## 3   4.8
## 4   5.1
## 5   4.3
## 6   5.2
## 7   4.8
## # ... with 1,105 more rows

Reading files and parsing

Comma Separated Values (CSV) files are the most common type of data file. We can read a .csv file through the function read_csv() (the base R function read.csv(), used below, works as well).

  # Read the original HANES csv file 
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/HANES.original.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1,URL_text_2, sep="")
  HANES_original <- read.csv(URL)
  # and convert into tibble
  as.tibble(HANES_original)
## # A tibble: 1,527 x 704
##       KEY BOROSTRATUM PSUNEST HHNEST  CAPI_WT  EXAM_WT BLOOD_WT URINE_WT
## *  <fctr>       <int>   <int>  <int>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 134040A           1      80 134040 7124.351 7421.006 8529.362 7811.598
## 2 134240B           1      82 134240 3946.733 4045.217 4696.925 4174.914
## 3 134460B           1      83 134460 3795.282 3859.116 4262.829 3854.439
## 4 134490A           1      83 134490 3946.733 4045.217 4696.925 4174.914
## 5 134620A           1      84 134620 3795.282 3859.116 4262.829 3854.439
## 6 134670A           1      84 134670 3795.282 3859.116 4262.829 3854.439
## 7 135210B           1      86 135210 3946.733 4045.217 4696.925 4174.914
## # ... with 1,520 more rows, and 696 more variables: SALIVA_WT <dbl>,
## #   ACASI_WT <dbl>, WT_CAPI <int>, WT_EXAM <int>, WT_BLOOD <int>,
## #   WT_URINE <int>, WT_SALIVA <int>, WT_ACASI <int>, LAQ1 <int>,
## #   IS_PROXY <int>, GENDER <int>, SPAGE <int>, AGEGROUP <int>,
## #   AGEGRP3C <int>, AGEGRP4C <int>, AGEGRP5C <int>, DMQ_A <int>,
## #   DMQ_1 <int>, EDUCATION <int>, EDU4CAT <int>, EDU3CAT <int>, EDU <int>,
## #   DMQ_2 <int>, MARITAL <int>, MARITALMOD <int>, DMQ_3 <int>,
## #   DMQ_5 <fctr>, DMQ_5A <fctr>, US_BORN <int>, BORN <int>, DMQ_6 <fctr>,
## #   DMQ_7MON <int>, DMQ_7YEAR <int>, DMQ_8MON <int>, DMQ_8YEAR <int>,
## #   DMQ_9MON <int>, DMQ_9YEAR <int>, DMQ_10 <fctr>, DMQ_10A <fctr>,
## #   DMQ_11 <fctr>, DMQ_11A <fctr>, DMQ_12 <int>, DMQ_13_1 <int>,
## #   DMQ_13_2 <int>, DMQ_13_3 <int>, DMQ_13OTH <fctr>, DMQ_14_1 <int>,
## #   DMQ_14_2 <int>, DMQ_14_3 <int>, DMQ_14OTH <fctr>, RACE <int>,
## #   RACECKDEPI <int>, ASIAN <int>, DMQ_15 <int>, DMQ_16_1 <int>,
## #   DMQ_16_2 <int>, DMQ_16OTH <fctr>, HSQ_1 <int>, HSQ_2 <int>,
## #   HSQ_3 <int>, HSQ_4 <int>, HSQ_5 <int>, HSQ_6 <int>, HSQ_7 <int>,
## #   HSQ_8 <int>, OCQ_1 <int>, OCQ_2 <int>, OCQ_3 <int>, OCQ_4 <fctr>,
## #   OCQ_5 <fctr>, OCQ_6 <int>, OCQ_7 <int>, OCQ_8 <int>, OCQ_9 <int>,
## #   OCQ_10 <int>, HIQ_1 <int>, HIQ_2_1 <int>, HIQ_2_2 <int>,
## #   HIQ_2_3 <int>, HIQ_3_1 <int>, HIQ_4 <int>, HIQ_5 <int>, HIQ_6 <int>,
## #   HIQ_7 <int>, HIQ_8_1 <int>, HIQ_8_2 <int>, HIQ_8_3 <int>,
## #   HIQ_8OTH <fctr>, HUQ_1 <int>, HUQ_2 <int>, HUQ_3 <int>, HUQ_4 <int>,
## #   HUQ_5 <int>, HUQ_6 <int>, HUQ_9 <int>, HUQ_10 <int>, HUQ_11 <int>,
## #   HUQ_12 <int>, MEDS_1_HUQ_13A <fctr>, MEDS_1_HUQ_13B <int>, ...
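
For comparison, the readr function read_csv() (loaded with the tidyverse) parses the file and returns a tibble directly, so no separate conversion step is needed. A minimal sketch, reusing the URL object built above (the object name HANES_original_tbl is just illustrative):

  # read_csv() reads the same remote file and returns a tibble
  HANES_original_tbl <- read_csv(URL)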

Parsing a vector can be done with the parse_*() family of functions:

  # Parse a logical vector
  str(parse_logical(c("TRUE", "FALSE", "NA")))
  # Parse an integer vector 
  # Actually EDUCATION is a factor variable 
  str(parse_integer(HANES_original$EDUCATION))
  # Parse a date vector
  str(parse_date(c("2010-01-01", "1979-10-14")))
##  logi [1:3] TRUE FALSE NA
##  int [1:1527] 2 3 3 3 3 3 3 3 3 3 ...
##  Date[1:2], format: "2010-01-01" "1979-10-14"

We can specify which education categories should be treated as NA. For instance, if we wish to denote all high schoolers (category 2) as NA:

  # Replace EDUCATION variable with high schoolers as NA
  as.tibble(parse_integer(HANES_original$EDUCATION, na = "2"))
## # A tibble: 1,527 x 1
##   value
##   <int>
## 1    NA
## 2     3
## 3     3
## 4     3
## 5     3
## 6     3
## 7     3
## # ... with 1,520 more rows

If parsing fails we will get a warning:

  # Parse EDUCATION as a logical when it is a number
  parse_logical(HANES_original$EDUCATION)
## Warning: 1209 parsing failures.
## # A tibble: 5 x 4
##     row   col           expected actual
##   <int> <int>              <chr>  <chr>
## 1     1    NA 1/0/T/F/TRUE/FALSE      2
## 2     2    NA 1/0/T/F/TRUE/FALSE      3
## 3     3    NA 1/0/T/F/TRUE/FALSE      3
## 4     4    NA 1/0/T/F/TRUE/FALSE      3
## 5     5    NA 1/0/T/F/TRUE/FALSE      3
## See problems(...) for more details.

which can be output into a tibble format using the problems() function:

  # Parse EDUCATION as a logical when it is a number
  x <- parse_logical(HANES_original$EDUCATION)
  # and output the result as a nice tibble
  problems(x)
## # A tibble: 1,209 x 4
##     row   col           expected actual
##   <int> <int>              <chr>  <chr>
## 1     1    NA 1/0/T/F/TRUE/FALSE      2
## 2     2    NA 1/0/T/F/TRUE/FALSE      3
## 3     3    NA 1/0/T/F/TRUE/FALSE      3
## 4     4    NA 1/0/T/F/TRUE/FALSE      3
## 5     5    NA 1/0/T/F/TRUE/FALSE      3
## 6     6    NA 1/0/T/F/TRUE/FALSE      3
## 7     7    NA 1/0/T/F/TRUE/FALSE      3
## # ... with 1,202 more rows

Parsing numbers is important because they are written in a variety of formats:

  1. People write numbers differently in different parts of the world. For example, some countries use . in between the integer and fractional parts of a real number, while others use ,.

  2. Numbers are often surrounded by other characters that provide some context, like “$1000” or “10%”.

  3. Numbers often contain “grouping” characters to make them easier to read, like “1,000,000”, and these grouping characters vary around the world.

The first problem is handled by the locale argument of the parse_double() function, the second by the parse_number() function, and the third by a combination of locale and parse_number().

  # Parse the number 1,63 where the decimal mark is written as a `,`
  parse_double("1,63", locale = locale(decimal_mark = ","))
  # Just parse the number for $120 and 60%
  parse_number("$120")
  parse_number("60%")
  # Parse a Swiss-style grouping with a European "." grouping mark -- only the leading 123 is parsed
  parse_number("123'834'297", locale = locale(grouping_mark = "."))
  # With the Swiss "'" grouping mark the full number is parsed
  parse_number("123'834'297", locale = locale(grouping_mark = "'"))
## [1] 1.63
## [1] 120
## [1] 60
## [1] 123
## [1] 123834297

All data sets are stored with an encoding. Each hexadecimal number represents a byte of information: 48 is H, 61 is a, and so on. The mapping from bytes to characters is called the encoding, and for English the standard encoding is ASCII. The function charToRaw() will list the encoding of characters in the ASCII format.

  # The function charToRaw will list the encoding of characters in ASCII format
  charToRaw("This is a nice day")
##  [1] 54 68 69 73 20 69 73 20 61 20 6e 69 63 65 20 64 61 79

However, we need encodings for other languages, and UTF-8 is the standard that can encode just about every character used by humans today, as well as many extra symbols (like emoji!). We can parse such encodings (which appear as gibberish if read with an English/ASCII encoding) with the parse_character() function:

  # Define other encoded strings
  x1 <- "El Ni\xf1o was particularly bad this year"
  x2 <- "\x82\xb1\x82\xf1\x82\xc9\x82\xbf\x82\xcd"
  # Parse those strings
  parse_character(x1, locale = locale(encoding = "Latin1"))
  parse_character(x2, locale = locale(encoding = "Shift-JIS"))
  # We can even guess the encodings, if they are not available
  guess_encoding(charToRaw(x1))
  guess_encoding(charToRaw(x2))
## [1] "El Niño was particularly bad this year"
## [1] "こんにちは"
## # A tibble: 2 x 2
##     encoding confidence
##        <chr>      <dbl>
## 1 ISO-8859-1       0.46
## 2 ISO-8859-9       0.23
## # A tibble: 1 x 2
##   encoding confidence
##      <chr>      <dbl>
## 1   KOI8-R       0.42

We see that KOI8-R is an 8-bit character encoding, designed to cover Russian, which uses a Cyrillic alphabet.

Parsing date and time is very important in health related data sets. For example, the MIMIC3 data has tons of data sets with time information. There are three main functions for parsing dates and times.

  • parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.

  • parse_date() expects a four-digit year, a - or /, the month, a - or /, then the day.

  • parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier.

  # Parse datetime
  parse_datetime("2010-10-01T2010")
  # Parse date
  parse_date("2010-10-01")
  # Parse time
  library(hms)
  parse_time("01:10 am")
## [1] "2010-10-01 20:10:00 UTC"
## [1] "2010-10-01"
## 01:10:00

If the defaults don’t work, you can supply your own format built from the following pieces:

  • Year: %Y (4 digits), %y (2 digits); 00-69 -> 2000-2069, 70-99 -> 1970-1999.

  • Month: %m (2 digits), %b (abbreviated name, like “Jan”), %B (full name, “January”).

  • Day: %d (2 digits), %e (optional leading space).

  • Time: %H 0-23 hour, %I 0-12, must be used with %p, %p AM/PM indicator, %M minutes, %S integer seconds, %OS real seconds, %Z Time zone (as name, e.g. America/Chicago).

  • Non-digits: %. skips one non-digit character, %* skips any number of non-digits.

As an example:

  # Parse date in month/day/year format
  parse_date("01/02/15", "%m/%d/%y")
  # Parse date in day/month/year format
  parse_date("01/02/2015", "%d/%m/%Y")
  # Parse date in year/month/day format
  parse_date("01/02/15", "%y/%m/%d")
  # We can even set the `locale` option to specify different built in languages for the date
  parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))
## [1] "2015-01-02"
## [1] "2015-02-01"
## [1] "2001-02-15"
## [1] "2015-01-01"


Classwork/Homework: Read section 11.4 and 11.5 in the R for Data Science book.


Dealing with dates and times

Dealing with dates and times can be a very tough job. For example, consider the MIMIC3 admissions table which has several time variables.

Note: MIMIC3 says “dates were shifted into the future by a random offset for each individual patient in a consistent manner to preserve intervals, resulting in stays which occur sometime between the years 2100 and 2200.” so don’t freak out looking at the dates.

If we import the data set as-is, the date columns are read in as factors, and parse_date() on its own fails:

  # Load the package RCurl
  library(RCurl)
  # Import the admissions data set in MIMIC3 from GitHub; 
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/MIMIC3/admissions.csv"
  URL <- paste(URL_text_1,URL_text_2, sep="")
  MIMIC3_ADM <- read.csv(text=getURL(URL))
  # Observe the structure
  str(MIMIC3_ADM)
  # Parse the admission time using `parse_date()` function
  head(parse_date(MIMIC3_ADM$admittime, "%Y%m%d%e"))
## 'data.frame':    5770 obs. of  19 variables:
##  $ row_id              : int  83 84 85 86 87 88 89 90 91 92 ...
##  $ subject_id          : int  82 83 84 84 85 85 86 87 88 89 ...
##  $ hadm_id             : int  110641 158569 120969 166401 116630 112077 190243 190659 123010 188646 ...
##  $ admittime           : Factor w/ 5310 levels "2100-07-04","2100-07-09",..: 2665 2240 5030 5037 3331 3607 2458 4778 569 4458 ...
##  $ dischtime           : Factor w/ 5767 levels "2100-07-17 15:00:00",..: 2878 2415 5462 5469 3603 3901 2651 5204 609 4834 ...
##  $ deathtime           : Factor w/ 569 levels "2100-09-13 21:20:00",..: NA NA NA 538 NA NA NA NA NA NA ...
##  $ admission_type      : Factor w/ 4 levels "ELECTIVE","EMERGENCY",..: 3 4 1 2 2 2 1 3 2 3 ...
##  $ admission_location  : Factor w/ 9 levels "** INFO NOT AVAILABLE **",..: 5 6 5 3 2 2 5 5 3 5 ...
##  $ discharge_location  : Factor w/ 16 levels "DEAD/EXPIRED",..: 5 6 5 1 14 16 6 15 5 15 ...
##  $ insurance           : Factor w/ 5 levels "Government","Medicaid",..: 4 3 4 4 3 3 4 4 4 2 ...
##  $ language            : Factor w/ 28 levels "*ARM","*CDI",..: NA NA NA NA 11 11 NA NA NA NA ...
##  $ religion            : Factor w/ 17 levels "7TH DAY ADVENTIST",..: 17 17 13 13 3 3 12 17 NA 17 ...
##  $ marital_status      : Factor w/ 7 levels "DIVORCED","LIFE PARTNER",..: NA 3 3 3 3 3 5 NA NA NA ...
##  $ ethnicity           : Factor w/ 37 levels "AMERICAN INDIAN/ALASKA NATIVE",..: 27 32 33 33 33 33 33 32 10 32 ...
##  $ edregtime           : Factor w/ 3065 levels "2100-07-09 05:56:00",..: NA NA NA 2904 NA 2075 NA NA 316 NA ...
##  $ edouttime           : Factor w/ 3065 levels "2100-07-09 15:55:00",..: NA NA NA 2904 NA 2075 NA NA 316 NA ...
##  $ diagnosis           : Factor w/ 2313 levels " AORTIC ABDOMINAL ANEURYSM/SDA",..: 1505 450 1337 985 242 1642 330 1505 1920 1505 ...
##  $ hospital_expire_flag: int  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_chartevents_data: int  1 1 0 1 1 1 1 1 1 1 ...
## [1] NA NA NA NA NA NA

Another approach is to use the helpers provided by the library lubridate. They automatically work out the format once we specify the order of the components. To use them, we identify the order in which year, month, and day appear in our data, then arrange “y”, “m”, and “d” in the same order. That gives the name of the lubridate function that will parse our date. To use it, we first load the lubridate library.

  # Load the lubridate library
  library(lubridate)
  # Since admission time has the format year-month-day, we use ymd()
  MIMIC3_ADM$admittime <- ymd(MIMIC3_ADM$admittime)
  # View the format
  str(MIMIC3_ADM)
  # Print the first few dates
  head(MIMIC3_ADM$admittime)
## 'data.frame':    5770 obs. of  19 variables:
##  $ row_id              : int  83 84 85 86 87 88 89 90 91 92 ...
##  $ subject_id          : int  82 83 84 84 85 85 86 87 88 89 ...
##  $ hadm_id             : int  110641 158569 120969 166401 116630 112077 190243 190659 123010 188646 ...
##  $ admittime           : Date, format: "2150-06-24" "2142-04-01" ...
##  $ dischtime           : Factor w/ 5767 levels "2100-07-17 15:00:00",..: 2878 2415 5462 5469 3603 3901 2651 5204 609 4834 ...
##  $ deathtime           : Factor w/ 569 levels "2100-09-13 21:20:00",..: NA NA NA 538 NA NA NA NA NA NA ...
##  $ admission_type      : Factor w/ 4 levels "ELECTIVE","EMERGENCY",..: 3 4 1 2 2 2 1 3 2 3 ...
##  $ admission_location  : Factor w/ 9 levels "** INFO NOT AVAILABLE **",..: 5 6 5 3 2 2 5 5 3 5 ...
##  $ discharge_location  : Factor w/ 16 levels "DEAD/EXPIRED",..: 5 6 5 1 14 16 6 15 5 15 ...
##  $ insurance           : Factor w/ 5 levels "Government","Medicaid",..: 4 3 4 4 3 3 4 4 4 2 ...
##  $ language            : Factor w/ 28 levels "*ARM","*CDI",..: NA NA NA NA 11 11 NA NA NA NA ...
##  $ religion            : Factor w/ 17 levels "7TH DAY ADVENTIST",..: 17 17 13 13 3 3 12 17 NA 17 ...
##  $ marital_status      : Factor w/ 7 levels "DIVORCED","LIFE PARTNER",..: NA 3 3 3 3 3 5 NA NA NA ...
##  $ ethnicity           : Factor w/ 37 levels "AMERICAN INDIAN/ALASKA NATIVE",..: 27 32 33 33 33 33 33 32 10 32 ...
##  $ edregtime           : Factor w/ 3065 levels "2100-07-09 05:56:00",..: NA NA NA 2904 NA 2075 NA NA 316 NA ...
##  $ edouttime           : Factor w/ 3065 levels "2100-07-09 15:55:00",..: NA NA NA 2904 NA 2075 NA NA 316 NA ...
##  $ diagnosis           : Factor w/ 2313 levels " AORTIC ABDOMINAL ANEURYSM/SDA",..: 1505 450 1337 985 242 1642 330 1505 1920 1505 ...
##  $ hospital_expire_flag: int  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_chartevents_data: int  1 1 0 1 1 1 1 1 1 1 ...
## [1] "2150-06-24" "2142-04-01" "2196-02-02" "2196-04-14" "2162-03-02"
## [6] "2167-07-25"

We also see that the discharge time has both date and time components, so we can use the parse_datetime() function to convert it from the factor format.

  # Load the lubridate library
  library(lubridate)
  # Since discharge time has both date and time attributes we may parse using datetime function
  MIMIC3_ADM$dischtime <- parse_datetime(MIMIC3_ADM$dischtime)
  # View the format
  str(MIMIC3_ADM)
  # Print the first few dates and times
  head(MIMIC3_ADM$dischtime)
## 'data.frame':    5770 obs. of  19 variables:
##  $ row_id              : int  83 84 85 86 87 88 89 90 91 92 ...
##  $ subject_id          : int  82 83 84 84 85 85 86 87 88 89 ...
##  $ hadm_id             : int  110641 158569 120969 166401 116630 112077 190243 190659 123010 188646 ...
##  $ admittime           : Date, format: "2150-06-24" "2142-04-01" ...
##  $ dischtime           : POSIXct, format: "2150-06-29 15:00:00" "2142-04-08 14:46:00" ...
##  $ deathtime           : Factor w/ 569 levels "2100-09-13 21:20:00",..: NA NA NA 538 NA NA NA NA NA NA ...
##  $ admission_type      : Factor w/ 4 levels "ELECTIVE","EMERGENCY",..: 3 4 1 2 2 2 1 3 2 3 ...
##  $ admission_location  : Factor w/ 9 levels "** INFO NOT AVAILABLE **",..: 5 6 5 3 2 2 5 5 3 5 ...
##  $ discharge_location  : Factor w/ 16 levels "DEAD/EXPIRED",..: 5 6 5 1 14 16 6 15 5 15 ...
##  $ insurance           : Factor w/ 5 levels "Government","Medicaid",..: 4 3 4 4 3 3 4 4 4 2 ...
##  $ language            : Factor w/ 28 levels "*ARM","*CDI",..: NA NA NA NA 11 11 NA NA NA NA ...
##  $ religion            : Factor w/ 17 levels "7TH DAY ADVENTIST",..: 17 17 13 13 3 3 12 17 NA 17 ...
##  $ marital_status      : Factor w/ 7 levels "DIVORCED","LIFE PARTNER",..: NA 3 3 3 3 3 5 NA NA NA ...
##  $ ethnicity           : Factor w/ 37 levels "AMERICAN INDIAN/ALASKA NATIVE",..: 27 32 33 33 33 33 33 32 10 32 ...
##  $ edregtime           : Factor w/ 3065 levels "2100-07-09 05:56:00",..: NA NA NA 2904 NA 2075 NA NA 316 NA ...
##  $ edouttime           : Factor w/ 3065 levels "2100-07-09 15:55:00",..: NA NA NA 2904 NA 2075 NA NA 316 NA ...
##  $ diagnosis           : Factor w/ 2313 levels " AORTIC ABDOMINAL ANEURYSM/SDA",..: 1505 450 1337 985 242 1642 330 1505 1920 1505 ...
##  $ hospital_expire_flag: int  0 0 0 1 0 0 0 0 0 0 ...
##  $ has_chartevents_data: int  1 1 0 1 1 1 1 1 1 1 ...
## [1] "2150-06-29 15:00:00 UTC" "2142-04-08 14:46:00 UTC"
## [3] "2196-02-04 17:48:00 UTC" "2196-04-17 13:42:00 UTC"
## [5] "2162-03-10 13:15:00 UTC" "2167-07-30 15:24:00 UTC"

Sometimes we may just want to work with the dates when a variable is in date-time format. We can use the as_date() function to extract just the date.

  # Load the lubridate library
  library(lubridate)
  # Since discharge time has both date and time attributes we use as_date() to extract the date
  MIMIC3_ADM$dischtime <- as_date(MIMIC3_ADM$dischtime)
  # We can then select only the variables of interest
  MIMIC3_adm_dis <- select(MIMIC3_ADM,hadm_id, admittime,dischtime)
  as.tibble(MIMIC3_adm_dis)
## # A tibble: 5,770 x 3
##   hadm_id  admittime  dischtime
## *   <int>     <date>     <date>
## 1  110641 2150-06-24 2150-06-29
## 2  158569 2142-04-01 2142-04-08
## 3  120969 2196-02-02 2196-02-04
## 4  166401 2196-04-14 2196-04-17
## 5  116630 2162-03-02 2162-03-10
## 6  112077 2167-07-25 2167-07-30
## 7  190243 2146-04-06 2146-04-10
## # ... with 5,763 more rows


Classwork/Homework: Read section 16.3 Date-time components in the book R for Data Science.


Time spans

As we noted above, MIMIC3 dates are shifted into the future (2100-2200) by a random offset. To make the dates easier to reason about, it may be worth converting them to an earlier range, like 1900-2000. To do this we have to shift the dates systematically, and we can make use of time spans for that. Time spans come in three important classes:

  • durations, which represent an exact number of seconds.

  • periods, which represent human units like weeks and months.

  • intervals, which represent a starting and ending point.

In R, when we subtract two dates, we get a difftime object:

  # Load the lubridate library
  library(lubridate)
  # Subtract discharge time minus admission time
  MIMIC3_adm_dis <- MIMIC3_adm_dis %>% 
    mutate(time_spent = ymd(MIMIC3_adm_dis$dischtime) - ymd(MIMIC3_adm_dis$admittime))
  # And display as tibble
  as.tibble(MIMIC3_adm_dis)
## # A tibble: 5,770 x 4
##   hadm_id  admittime  dischtime time_spent
##     <int>     <date>     <date>     <time>
## 1  110641 2150-06-24 2150-06-29     5 days
## 2  158569 2142-04-01 2142-04-08     7 days
## 3  120969 2196-02-02 2196-02-04     2 days
## 4  166401 2196-04-14 2196-04-17     3 days
## 5  116630 2162-03-02 2162-03-10     8 days
## 6  112077 2167-07-25 2167-07-30     5 days
## 7  190243 2146-04-06 2146-04-10     4 days
## # ... with 5,763 more rows

While this may be useful for finding features such as the distribution of the time patients spent in hospital:

  # And plot the frequency distribution of the time spent by the patients
  MIMIC3_adm_dis %>%
  ggplot(aes(time_spent)) + 
    geom_freqpoly(binwidth = 1)

the time difference (difftime) can be a little painful to work with when we are interested in durations. The library lubridate provides an alternative which always uses seconds: the duration. We can convert our variable into a duration using the function as.duration().

  # Convert the time spent as duration (in seconds)
  MIMIC3_adm_dis <- MIMIC3_adm_dis %>%
                    mutate(time_spent_in_seconds = as.duration(time_spent))
  as.tibble(MIMIC3_adm_dis)
## # A tibble: 5,770 x 5
##   hadm_id  admittime  dischtime time_spent time_spent_in_seconds
##     <int>     <date>     <date>     <time>        <S4: Duration>
## 1  110641 2150-06-24 2150-06-29     5 days     432000s (~5 days)
## 2  158569 2142-04-01 2142-04-08     7 days    604800s (~1 weeks)
## 3  120969 2196-02-02 2196-02-04     2 days     172800s (~2 days)
## 4  166401 2196-04-14 2196-04-17     3 days     259200s (~3 days)
## 5  116630 2162-03-02 2162-03-10     8 days 691200s (~1.14 weeks)
## 6  112077 2167-07-25 2167-07-30     5 days     432000s (~5 days)
## 7  190243 2146-04-06 2146-04-10     4 days     345600s (~4 days)
## # ... with 5,763 more rows

Durations come with a bunch of convenient constructors:

  # List the duration of 15s
  dseconds(15)
  # List the duration in 10 minutes
  dminutes(10)
  # List the duration in 12 hrs and 24 hours
  dhours(c(12, 24))
  # List the duration from 0 to 5 days
  ddays(0:5)
  # List the duration in 3 weeks
  dweeks(3)
  # List the duration in a year
  dyears(1)
## [1] "15s"
## [1] "600s (~10 minutes)"
## [1] "43200s (~12 hours)" "86400s (~1 days)"
## [1] "0s"                "86400s (~1 days)"  "172800s (~2 days)"
## [4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
## [1] "1814400s (~3 weeks)"
## [1] "31536000s (~52.14 weeks)"

These constructors could be used to shift our dates by the 200-year offset. Alternatively, we can use periods. Periods are time spans that don’t have a fixed length in seconds; instead they work with “human” units, like days and months. That allows them to work in a more intuitive way. Like durations, periods can be created with a number of friendly constructor functions.

  # Make 15s
  seconds(15)
  # Make 10m
  minutes(10)
  # Make 12 hrs and 24 hrs
  hours(c(12, 24))
  # Make 7 days
  days(7)
  # Make 1-6 months
  months(1:6)
  # Make 3 weeks
  weeks(3)
  # Make 1 year
  years(1)
## [1] "15S"
## [1] "10M 0S"
## [1] "12H 0M 0S" "24H 0M 0S"
## [1] "7d 0H 0M 0S"
## [1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
## [5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
## [1] "21d 0H 0M 0S"
## [1] "1y 0m 0d 0H 0M 0S"

You can add and multiply periods, and add them to dates. Thus we can subtract a 200-year period from our admission and discharge times:

  # Make the dates compatible with a 1900-2000 range by subtracting 200 years
  MIMIC3_adm_dis <- MIMIC3_adm_dis %>%
                    mutate(admission_time_minus_200 = ymd(admittime)-years(200),
                    dischtime_time_minus_200 = ymd(dischtime)-years(200)) %>% 
                    select(admission_time_minus_200,dischtime_time_minus_200, everything())
  as.tibble(MIMIC3_adm_dis)
## # A tibble: 5,770 x 7
##   admission_time_minus_200 dischtime_time_minus_200 time_spent hadm_id
##                     <date>                   <date>     <time>   <int>
## 1               1950-06-24               1950-06-29     5 days  110641
## 2               1942-04-01               1942-04-08     7 days  158569
## 3               1996-02-02               1996-02-04     2 days  120969
## 4               1996-04-14               1996-04-17     3 days  166401
## 5               1962-03-02               1962-03-10     8 days  116630
## 6               1967-07-25               1967-07-30     5 days  112077
## 7               1946-04-06               1946-04-10     4 days  190243
## # ... with 5,763 more rows, and 3 more variables: admittime <date>,
## #   dischtime <date>, time_spent_in_seconds <S4: Duration>
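
Period arithmetic (multiplying periods and adding them to dates, as mentioned above) can also be sketched on its own; the date below is made up purely for illustration:

  # Periods can be combined and multiplied, then added to (or subtracted from) dates
  10 * (months(6) + days(1))
  ymd("1900-01-01") + years(200)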

We can then look at the frequency of discharges over the shifted period, 1900 to the present.

  # Load the above admissions and discharge data
  MIMIC3_adm_dis %>% 
  # And plot the frequency distribution of discharge   
  ggplot(aes(dischtime_time_minus_200)) + 
    geom_freqpoly(binwidth = 100)

Or for a single year, say 1980, distributed over a month:

  # Load the above admissions and discharge data
  MIMIC3_adm_dis %>% 
  # and filter for the discharges in the year 1980
  filter(dischtime_time_minus_200 >= ymd(19800101) & dischtime_time_minus_200 <= ymd(19801231)) %>%
  # And plot the frequency distribution for 30 days
  ggplot(aes(dischtime_time_minus_200)) + 
    geom_freqpoly(binwidth = 30)

What should years(1) / days(1) return? Is it 365 or 366, since some years are leap years and periods are human-interpretable values, unlike durations? R will throw a warning if we try to perform this operation.

  # Find years(1) / days(1)
  years(1) / days(1)
## estimate only: convert to intervals for accuracy
## [1] 365.25

Thus we have to use an interval. An interval is a duration with a starting point: that makes it precise so you can determine exactly how long it is:

  # Find years(1) / days(1) by converting to interval
  next_year <- today() + years(1)
  (today() %--% next_year) / ddays(1)
## [1] 365


Classwork/Homework: Read section 16.5 Date-time components in the book R for Data Science.


Tidying data

The same “kind” of data can be spread across multiple columns that really denote a single variable. For example, consider the MIMIC3 admissions table. There are two variables - edregtime and edouttime (emergency department registration time and emergency department out time, respectively) - that represent one variable: the time stamp in the emergency department, a larger class of variable.

  # Select subject_id, hadm_id, edregtime and edouttime variables
  MIMIC3_ADM_tibble <- MIMIC3_ADM %>% select(subject_id,hadm_id,edregtime,edouttime) %>% na.omit()
  as.tibble(MIMIC3_ADM_tibble)
## # A tibble: 3,065 x 4
##   subject_id hadm_id           edregtime           edouttime
## *      <int>   <int>              <fctr>              <fctr>
## 1         84  166401 2196-04-13 22:23:00 2196-04-14 04:31:00
## 2         85  112077 2167-07-25 16:37:00 2167-07-25 20:46:00
## 3         88  123010 2111-08-29 01:44:00 2111-08-29 02:28:00
## 4         91  121205 2177-04-22 21:02:00 2177-04-23 04:03:00
## 5         94  183686 2176-02-25 10:35:00 2176-02-25 18:14:00
## 6         94  140037 2176-09-02 09:54:00 2176-09-02 17:56:00
## 7         95  160891 2157-12-25 12:24:00 2157-12-25 16:56:00
## # ... with 3,058 more rows

The purpose of the gather() function is to tidy the data by collecting such columns into a new pair of variables. To accomplish this, we need three parameters:

  • The set of columns that represent sub-classification/sub-variables. In this example, those are the columns edregtime and edouttime

  • The name of a single variable that represents the larger class. This is called the key; here it is the time stamp in the emergency department. Let’s call this variable EdTimeStamp.

  • The values that are present in the cells corresponding to the sub-variables. Here these are the time stamps when registering and when leaving the emergency department. Let’s call this variable time.

Once we have these parameters, we can gather data:

  # Gather edregtime and edouttime, to form variable "EdTimeStamp" with
  # value "time"
  MIMIC3_ADM_gather <- MIMIC3_ADM_tibble %>% 
    gather(`edregtime`, `edouttime`, key = "EdTimeStamp", value = "time")
  as.tibble(MIMIC3_ADM_gather)
## # A tibble: 6,130 x 4
##   subject_id hadm_id EdTimeStamp                time
##        <int>   <int>       <chr>               <chr>
## 1         84  166401   edregtime 2196-04-13 22:23:00
## 2         85  112077   edregtime 2167-07-25 16:37:00
## 3         88  123010   edregtime 2111-08-29 01:44:00
## 4         91  121205   edregtime 2177-04-22 21:02:00
## 5         94  183686   edregtime 2176-02-25 10:35:00
## 6         94  140037   edregtime 2176-09-02 09:54:00
## 7         95  160891   edregtime 2157-12-25 12:24:00
## # ... with 6,123 more rows


Classwork/Homework: Find the number of times each patient was registered in emergency department.


Spreading is the opposite of gathering. We use it when an observation is scattered across multiple rows. For example, in the above data set, the patient denoted by subject_id can have several hospital admission ids (hadm_id), depending on how many times they were admitted to the hospital.

To spread this data, we first analyse the representation in a similar way to gather(). This time, however, we only need two parameters:

  • The key column, whose values will become the new column names. Here, it’s subject_id.

  • The values column, which contains the values that fill in the cells. Here it’s hadm_id.

Once we have these two parameters we can spread the data:

  # Spread each patient (subject_id) for the hospital admission id - hadm_id
  MIMIC3_ADM_spread <- MIMIC3_ADM_tibble %>% head() %>%
    spread(key = subject_id, value = hadm_id) %>% select(`edregtime`, `edouttime`, '94', everything())
  as.tibble(MIMIC3_ADM_spread)
## # A tibble: 6 x 7
##             edregtime           edouttime   `94`   `84`   `85`   `88`
## *              <fctr>              <fctr>  <int>  <int>  <int>  <int>
## 1 2111-08-29 01:44:00 2111-08-29 02:28:00     NA     NA     NA 123010
## 2 2167-07-25 16:37:00 2167-07-25 20:46:00     NA     NA 112077     NA
## 3 2176-02-25 10:35:00 2176-02-25 18:14:00 183686     NA     NA     NA
## 4 2176-09-02 09:54:00 2176-09-02 17:56:00 140037     NA     NA     NA
## 5 2177-04-22 21:02:00 2177-04-23 04:03:00     NA     NA     NA     NA
## 6 2196-04-13 22:23:00 2196-04-14 04:31:00     NA 166401     NA     NA
## # ... with 1 more variables: `91` <int>

We immediately see that patient no. 94 had two admissions in the hospital as emergency visits.

separate() pulls apart one column into multiple columns, splitting wherever a separator character appears. For example, consider the admission_location variable. It is made of 9 possible values:

  • EMERGENCY ROOM ADMIT
  • TRANSFER FROM HOSP/EXTRAM
  • TRANSFER FROM OTHER HEALT
  • CLINIC REFERRAL/PREMATURE
  • INFO NOT AVAILABLE
  • TRANSFER FROM SKILLED NUR
  • TRSF WITHIN THIS FACILITY
  • HMO REFERRAL/SICK
  • PHYS REFERRAL/NORMAL DELI
  # Separate admission_location information
  MIMIC3_ADM_LOC_SEP <- MIMIC3_ADM %>% select(admission_location) %>% 
    separate(admission_location, into = c("Info1", "Info2"), sep = "/", convert = FALSE)
  as.tibble(MIMIC3_ADM_LOC_SEP)
## # A tibble: 5,770 x 2
##                  Info1       Info2
## *                <chr>       <chr>
## 1        PHYS REFERRAL NORMAL DELI
## 2   TRANSFER FROM HOSP      EXTRAM
## 3        PHYS REFERRAL NORMAL DELI
## 4 EMERGENCY ROOM ADMIT        <NA>
## 5      CLINIC REFERRAL   PREMATURE
## 6      CLINIC REFERRAL   PREMATURE
## 7        PHYS REFERRAL NORMAL DELI
## # ... with 5,763 more rows

By default separate() coerces the new columns into character vectors, even when we separate numerical variables. To retain the numerical type, we have to set the convert = TRUE argument.
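
A minimal sketch of the effect of convert = TRUE (the toy column below is made up, not taken from the admissions table):

  # With convert = TRUE the split pieces "20" and "35" become integers instead of characters
  tibble(code = c("1/20", "2/35")) %>%
    separate(code, into = c("group", "value"), sep = "/", convert = TRUE)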

unite() is the inverse of separate(): it combines multiple columns into a single column. For example, in the above data set we can unite the hospital admission id hadm_id and the admission type admission_type to capture the reason the patient was admitted. By default unite() joins values with an underscore; this can be overridden with the sep argument. This is done as follows:

  # Unite hadm_id and admission_type
  MIMIC3_ADM_LOC_UNITE <- MIMIC3_ADM %>% select(subject_id, hadm_id, admission_type) %>% 
    unite(reason, hadm_id, admission_type, sep=":")
  as.tibble(MIMIC3_ADM_LOC_UNITE)
## # A tibble: 5,770 x 2
##   subject_id           reason
## *      <int>            <chr>
## 1         82   110641:NEWBORN
## 2         83    158569:URGENT
## 3         84  120969:ELECTIVE
## 4         84 166401:EMERGENCY
## 5         85 116630:EMERGENCY
## 6         85 112077:EMERGENCY
## 7         86  190243:ELECTIVE
## # ... with 5,763 more rows

Missing values

Missing values can be:

  1. Explicit and flagged with NA
  2. Implicit with values absent

Consider this data:

  # Make a tibble reporting the data on people affected with influenza 
  influenza <- tibble(
    year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
    qtr    = c(   1,    2,    3,    4,    2,    3,    4),
    no_sick = c(800, 450, 377,   NA, 900, 1300, 1257)
  )
  influenza
## # A tibble: 7 x 3
##    year   qtr no_sick
##   <dbl> <dbl>   <dbl>
## 1  2015     1     800
## 2  2015     2     450
## 3  2015     3     377
## 4  2015     4      NA
## 5  2016     2     900
## 6  2016     3    1300
## 7  2016     4    1257

There are two missing values in this dataset:

The number of sick people for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA.

The number of sick people for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.

There are several ways we can deal with the missing values.

  1. We can make the implicit missing value explicit by putting years in the columns:
  # Spread the data to make implicit absence explicit 
  influenza %>% 
    spread(year, no_sick)
## # A tibble: 4 x 3
##     qtr `2015` `2016`
## * <dbl>  <dbl>  <dbl>
## 1     1    800     NA
## 2     2    450    900
## 3     3    377   1300
## 4     4     NA   1257
  2. Explicit missing values may not be important in other representations of the data so we can set na.rm = TRUE in gather() to turn explicit missing values implicit:
  # Spread the data to make implicit absence explicit and
  # gather with na.rm = TRUE to turn explicit missing values implicit
  influenza %>% 
    spread(year, no_sick) %>%
    gather(year, no_sick, `2015`:`2016`, na.rm = TRUE)
## # A tibble: 6 x 3
##     qtr  year no_sick
## * <dbl> <chr>   <dbl>
## 1     1  2015     800
## 2     2  2015     450
## 3     3  2015     377
## 4     2  2016     900
## 5     3  2016    1300
## 6     4  2016    1257
  3. We can use complete() which takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary:
  # Complete the data by finding the combinations and replacing NA
  influenza %>% 
    complete(year, qtr) 
## # A tibble: 8 x 3
##    year   qtr no_sick
##   <dbl> <dbl>   <dbl>
## 1  2015     1     800
## 2  2015     2     450
## 3  2015     3     377
## 4  2015     4      NA
## 5  2016     1      NA
## 6  2016     2     900
## 7  2016     3    1300
## 8  2016     4    1257
  4. Or we can use fill() which takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).
  # Fill NA with the most recent value found - 377 in this case
  influenza %>% 
    fill(no_sick) 
## # A tibble: 7 x 3
##    year   qtr no_sick
##   <dbl> <dbl>   <dbl>
## 1  2015     1     800
## 2  2015     2     450
## 3  2015     3     377
## 4  2015     4     377
## 5  2016     2     900
## 6  2016     3    1300
## 7  2016     4    1257

Case study: TB data from WHO

The following case study illustrates the typical process involved in tidying up real-world data. The data comes from the 2014 World Health Organization Global Tuberculosis Report, which can be downloaded from WHO TB Data. The data is part of the tidyverse package, so you don’t have to download it.

Just querying the data gives the following:

  # Query WHO TB data
  who
## # A tibble: 7,240 x 60
##       country  iso2  iso3  year new_sp_m014 new_sp_m1524 new_sp_m2534
##         <chr> <chr> <chr> <int>       <int>        <int>        <int>
## 1 Afghanistan    AF   AFG  1980          NA           NA           NA
## 2 Afghanistan    AF   AFG  1981          NA           NA           NA
## 3 Afghanistan    AF   AFG  1982          NA           NA           NA
## 4 Afghanistan    AF   AFG  1983          NA           NA           NA
## 5 Afghanistan    AF   AFG  1984          NA           NA           NA
## 6 Afghanistan    AF   AFG  1985          NA           NA           NA
## 7 Afghanistan    AF   AFG  1986          NA           NA           NA
## # ... with 7,233 more rows, and 53 more variables: new_sp_m3544 <int>,
## #   new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
## #   new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
## #   new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
## #   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
## #   new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
## #   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
## #   new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
## #   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
## #   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
## #   new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
## #   new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
## #   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
## #   new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
## #   newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
## #   newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
## #   newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
## #   newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
## #   newrel_f65 <int>

It contains redundant columns, odd variable codes, and many missing values.

The variables - country, iso2, and iso3 redundantly specify the country.

We don’t know what the variables new_sp_m014, new_ep_m014, new_ep_f014 mean.

Therefore, we look them up in the given dictionary.

Note: This saves the .csv dictionary file to your computer. You will need to open it, for example with Excel.

iso2 and iso3 are ISO standard country/territory codes. Also the variables that start with new are new cases. In particular,

  1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.

  2. The next two letters describe the type of TB:

    1. rel stands for cases of relapse

    2. ep stands for cases of extrapulmonary TB

    3. sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)

    4. sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)

  3. The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).

  4. The remaining numbers give the age group. The dataset groups cases into seven age groups:

    1. 014 = 0 – 14 years old

    2. 1524 = 15 – 24 years old

    3. 2534 = 25 – 34 years old

    4. 3544 = 35 – 44 years old

    5. 4554 = 45 – 54 years old

    6. 5564 = 55 – 64 years old

    7. 65 = 65 or older

Thus we need to separate them. But first, since each of these variables contains a count of new cases, a natural way to organize them is to gather these variables. We can make a key variable “group” and a value variable “case-count” to reflect this gathering, removing the missing values (NA):

  # Gather the "new_" columns into the key "group" and the value "case-count", removing NA
  who_gathered <- who %>% 
    gather(new_sp_m014:newrel_f65, key = "group", value = "case-count", na.rm = TRUE)
  as.tibble(who_gathered)
## # A tibble: 76,046 x 6
##       country  iso2  iso3  year       group `case-count`
## *       <chr> <chr> <chr> <int>       <chr>        <int>
## 1 Afghanistan    AF   AFG  1997 new_sp_m014            0
## 2 Afghanistan    AF   AFG  1998 new_sp_m014           30
## 3 Afghanistan    AF   AFG  1999 new_sp_m014            8
## 4 Afghanistan    AF   AFG  2000 new_sp_m014           52
## 5 Afghanistan    AF   AFG  2001 new_sp_m014          129
## 6 Afghanistan    AF   AFG  2002 new_sp_m014           90
## 7 Afghanistan    AF   AFG  2003 new_sp_m014          127
## # ... with 7.604e+04 more rows

Also observe that for rel there is no underscore after new, whereas for ep there is. We need to fix this to get consistent variable names:

  # Replace newrel by new_rel
  who_rel_replaced <- who_gathered %>% 
    mutate(group = stringr::str_replace(group, "newrel", "new_rel"))
  as.tibble(who_rel_replaced)
## # A tibble: 76,046 x 6
##       country  iso2  iso3  year       group `case-count`
##         <chr> <chr> <chr> <int>       <chr>        <int>
## 1 Afghanistan    AF   AFG  1997 new_sp_m014            0
## 2 Afghanistan    AF   AFG  1998 new_sp_m014           30
## 3 Afghanistan    AF   AFG  1999 new_sp_m014            8
## 4 Afghanistan    AF   AFG  2000 new_sp_m014           52
## 5 Afghanistan    AF   AFG  2001 new_sp_m014          129
## 6 Afghanistan    AF   AFG  2002 new_sp_m014           90
## 7 Afghanistan    AF   AFG  2003 new_sp_m014          127
## # ... with 7.604e+04 more rows

We can now separate the variables to reflect the type of TB, the gender, and the age group. First we separate by underscore:

  # First we separate by underscore
  who_seperated_by_underscore <- who_rel_replaced %>% 
    separate(group, c("new", "type", "sexage"), sep = "_")
  as.tibble(who_seperated_by_underscore)
## # A tibble: 76,046 x 8
##       country  iso2  iso3  year   new  type sexage `case-count`
## *       <chr> <chr> <chr> <int> <chr> <chr>  <chr>        <int>
## 1 Afghanistan    AF   AFG  1997   new    sp   m014            0
## 2 Afghanistan    AF   AFG  1998   new    sp   m014           30
## 3 Afghanistan    AF   AFG  1999   new    sp   m014            8
## 4 Afghanistan    AF   AFG  2000   new    sp   m014           52
## 5 Afghanistan    AF   AFG  2001   new    sp   m014          129
## 6 Afghanistan    AF   AFG  2002   new    sp   m014           90
## 7 Afghanistan    AF   AFG  2003   new    sp   m014          127
## # ... with 7.604e+04 more rows

And then we split sexage into the gender and the age group, which are run together, using the sep = 1 argument to split after the first character:

  # Then we separate gender and age using sep = 1
  who_seperated_by_gender_age <- who_seperated_by_underscore %>% 
    separate(sexage, c("sex", "age"), sep = 1)
  as.tibble(who_seperated_by_gender_age)
## # A tibble: 76,046 x 9
##       country  iso2  iso3  year   new  type   sex   age `case-count`
## *       <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>        <int>
## 1 Afghanistan    AF   AFG  1997   new    sp     m   014            0
## 2 Afghanistan    AF   AFG  1998   new    sp     m   014           30
## 3 Afghanistan    AF   AFG  1999   new    sp     m   014            8
## 4 Afghanistan    AF   AFG  2000   new    sp     m   014           52
## 5 Afghanistan    AF   AFG  2001   new    sp     m   014          129
## 6 Afghanistan    AF   AFG  2002   new    sp     m   014           90
## 7 Afghanistan    AF   AFG  2003   new    sp     m   014          127
## # ... with 7.604e+04 more rows

We can then drop the unwanted variables, such as new, iso2, and iso3. After this, our data is tidy:

  # Drop "new" and "iso" variables and tidy the data
  tidy_who <- who_seperated_by_gender_age %>% 
    select(-new, -iso2, -iso3)
  as.tibble(tidy_who)
## # A tibble: 76,046 x 6
##       country  year  type   sex   age `case-count`
## *       <chr> <int> <chr> <chr> <chr>        <int>
## 1 Afghanistan  1997    sp     m   014            0
## 2 Afghanistan  1998    sp     m   014           30
## 3 Afghanistan  1999    sp     m   014            8
## 4 Afghanistan  2000    sp     m   014           52
## 5 Afghanistan  2001    sp     m   014          129
## 6 Afghanistan  2002    sp     m   014           90
## 7 Afghanistan  2003    sp     m   014          127
## # ... with 7.604e+04 more rows

Relational data

Usually data in the real world comes in different tables. Multiple tables of data are called relational data because it is the relations, not just the individual datasets, that are important. Combining data from multiple tables is an important part of the wrangling process. Usually these combinations are facilitated by joins. There are three families of joins designed to work with relational data:

  1. Mutating joins: these add new variables to one data frame from matching observations in the other table.

  2. Filtering joins: these filter observations from one data frame based on whether or not they match an observation in the other table.

  3. Set operations: these operations treat observations as if they were set elements (a short sketch follows this list).
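
As a quick illustration of the third family, here is a minimal sketch of the dplyr set operations on two toy tibbles (made up for this example):

  # Set operations treat complete rows as set elements
  t1 <- tibble(id = c(1, 2, 3))
  t2 <- tibble(id = c(2, 3, 4))
  intersect(t1, t2)  # rows present in both tables
  union(t1, t2)      # all distinct rows from either table
  setdiff(t1, t2)    # rows in t1 that are not in t2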

The variables that connect a pair of tables are called keys. There are several types of keys:

  1. A primary key uniquely identifies an observation in its own table.

  2. A foreign key uniquely identifies an observation in another table.

  3. A surrogate key uniquely identifies an observation in the absence of a primary key. Surrogate keys can be created, for instance, using mutate() and row_number(), as sketched below.

Note: A primary key can also be a foreign key.
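
As a sketch, a surrogate key for the admissions table loaded earlier can be created with mutate() and row_number() (the column name surrogate_id is just illustrative):

  # Add a row-number surrogate key and move it to the front
  MIMIC3_ADM %>%
    mutate(surrogate_id = row_number()) %>%
    select(surrogate_id, everything()) %>%
    as.tibble()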

Once we have identified a candidate key variable, one way to verify that it is a primary key is to count() it and look for entries where n is greater than one:

  # Verify if hadm_id is a primary key
  MIMIC3_ADM %>% 
    count(hadm_id) %>%
    filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: hadm_id <int>, n <int>
  # Verify if subject_id is a primary key
  is_subject_id_key <- MIMIC3_ADM %>% 
                        count(subject_id) %>%
                        filter(n > 1)
  as.tibble(is_subject_id_key)
## # A tibble: 729 x 2
##   subject_id     n
##        <int> <int>
## 1         84     2
## 2         85     2
## 3         94     2
## 4        103     2
## 5        105     2
## 6        107     3
## 7        109     8
## # ... with 722 more rows

To describe the joins, we will work with the input events data set from MIMIC3. A high-level description of the table is given as:

Inputs and outputs are extremely useful when studying intensive care unit patients. Inputs are any fluids which have been administered to the patient, such as oral or tube feedings or intravenous solutions containing medications.

A description of the input events table is given here.

  # Load the package RCurl
  library(RCurl)
  # Import the input events data set in MIMIC3 from GitHub; 
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/MIMIC3/inputevents_cv.csv"
  URL <- paste(URL_text_1,URL_text_2, sep="")
  MIMIC3_INE <- read.csv(text=getURL(URL))
  # Observe the structure
  as.tibble(MIMIC3_INE)
## # A tibble: 19,369 x 22
##   row_id subject_id hadm_id icustay_id  charttime itemid amount amountuom
## *  <int>      <int>   <int>      <int>     <fctr>  <int>  <dbl>    <fctr>
## 1  50508      15239  134445     208022 2186-07-19  30056     60        ml
## 2  50509      15239  134445     208022 2186-07-20  30056    120        ml
## 3  50510      15239  134445     208022 2186-07-20  30056    120        ml
## 4  50511      15239  134445     208022 2186-07-21  30056    400        ml
## 5  50512      15239  134445     208022 2186-07-21  30056    240        ml
## 6  50513      15239  134445     208022 2186-07-21  30056    240        ml
## 7  50514      15239  134445     208022 2186-07-21  30056    100        ml
## # ... with 1.936e+04 more rows, and 14 more variables: rate <dbl>,
## #   rateuom <fctr>, storetime <fctr>, cgid <int>, orderid <int>,
## #   linkorderid <int>, stopped <fctr>, newbottle <int>,
## #   originalamount <dbl>, originalamountuom <fctr>, originalroute <fctr>,
## #   originalrate <dbl>, originalrateuom <fctr>, originalsite <fctr>

Inner join (mutating joins)

Mutating inner join can be described by the following figure:



The following code computes the inner join of the admissions and input events (cv) tables:

  # Inner join of admissions table and input events (cv) table using the key "hadm_id"
  MIMIC_ADM_INE_inner_join <- MIMIC3_ADM %>% 
                                inner_join(MIMIC3_INE, by = "hadm_id")
                        
  as.tibble(MIMIC_ADM_INE_inner_join)
## # A tibble: 1,805 x 40
##   row_id.x subject_id.x hadm_id  admittime  dischtime           deathtime
##      <int>        <int>   <int>     <date>     <date>              <fctr>
## 1       93           91  121205 2177-04-23 2177-05-10 2177-05-10 15:16:00
## 2       94           92  142807 2122-12-13 2123-03-04                  NA
## 3       94           92  142807 2122-12-13 2123-03-04                  NA
## 4       94           92  142807 2122-12-13 2123-03-04                  NA
## 5       94           92  142807 2122-12-13 2123-03-04                  NA
## 6       94           92  142807 2122-12-13 2123-03-04                  NA
## 7       94           92  142807 2122-12-13 2123-03-04                  NA
## # ... with 1,798 more rows, and 34 more variables: admission_type <fctr>,
## #   admission_location <fctr>, discharge_location <fctr>,
## #   insurance <fctr>, language <fctr>, religion <fctr>,
## #   marital_status <fctr>, ethnicity <fctr>, edregtime <fctr>,
## #   edouttime <fctr>, diagnosis <fctr>, hospital_expire_flag <int>,
## #   has_chartevents_data <int>, row_id.y <int>, subject_id.y <int>,
## #   icustay_id <int>, charttime <fctr>, itemid <int>, amount <dbl>,
## #   amountuom <fctr>, rate <dbl>, rateuom <fctr>, storetime <fctr>,
## #   cgid <int>, orderid <int>, linkorderid <int>, stopped <fctr>,
## #   newbottle <int>, originalamount <dbl>, originalamountuom <fctr>,
## #   originalroute <fctr>, originalrate <dbl>, originalrateuom <fctr>,
## #   originalsite <fctr>

Outer joins (mutating joins)

There are three types of outer joins:



Left join admissions and input events table:

  # Left join admissions table and input events (cv) table using the key "hadm_id"
  MIMIC_ADM_INE_left_join <- MIMIC3_ADM %>% 
                                left_join(MIMIC3_INE, by = "hadm_id")
                        
  as.tibble(MIMIC_ADM_INE_left_join)
## # A tibble: 7,179 x 40
##   row_id.x subject_id.x hadm_id  admittime  dischtime           deathtime
##      <int>        <int>   <int>     <date>     <date>              <fctr>
## 1       83           82  110641 2150-06-24 2150-06-29                  NA
## 2       84           83  158569 2142-04-01 2142-04-08                  NA
## 3       85           84  120969 2196-02-02 2196-02-04                  NA
## 4       86           84  166401 2196-04-14 2196-04-17 2196-04-17 13:42:00
## 5       87           85  116630 2162-03-02 2162-03-10                  NA
## 6       88           85  112077 2167-07-25 2167-07-30                  NA
## 7       89           86  190243 2146-04-06 2146-04-10                  NA
## # ... with 7,172 more rows, and 34 more variables: admission_type <fctr>,
## #   admission_location <fctr>, discharge_location <fctr>,
## #   insurance <fctr>, language <fctr>, religion <fctr>,
## #   marital_status <fctr>, ethnicity <fctr>, edregtime <fctr>,
## #   edouttime <fctr>, diagnosis <fctr>, hospital_expire_flag <int>,
## #   has_chartevents_data <int>, row_id.y <int>, subject_id.y <int>,
## #   icustay_id <int>, charttime <fctr>, itemid <int>, amount <dbl>,
## #   amountuom <fctr>, rate <dbl>, rateuom <fctr>, storetime <fctr>,
## #   cgid <int>, orderid <int>, linkorderid <int>, stopped <fctr>,
## #   newbottle <int>, originalamount <dbl>, originalamountuom <fctr>,
## #   originalroute <fctr>, originalrate <dbl>, originalrateuom <fctr>,
## #   originalsite <fctr>

Right join admissions and input events table:

  # Right join admissions table and input events (cv) table using the key "hadm_id"
  MIMIC_ADM_INE_right_join <- MIMIC3_ADM %>% 
                                right_join(MIMIC3_INE, by = "hadm_id")
                        
  as.tibble(MIMIC_ADM_INE_right_join)
## # A tibble: 19,369 x 40
##   row_id.x subject_id.x hadm_id admittime dischtime deathtime
##      <int>        <int>   <int>    <date>    <date>    <fctr>
## 1       NA           NA  134445        NA        NA        NA
## 2       NA           NA  134445        NA        NA        NA
## 3       NA           NA  134445        NA        NA        NA
## 4       NA           NA  134445        NA        NA        NA
## 5       NA           NA  134445        NA        NA        NA
## 6       NA           NA  134445        NA        NA        NA
## 7       NA           NA  134445        NA        NA        NA
## # ... with 1.936e+04 more rows, and 34 more variables:
## #   admission_type <fctr>, admission_location <fctr>,
## #   discharge_location <fctr>, insurance <fctr>, language <fctr>,
## #   religion <fctr>, marital_status <fctr>, ethnicity <fctr>,
## #   edregtime <fctr>, edouttime <fctr>, diagnosis <fctr>,
## #   hospital_expire_flag <int>, has_chartevents_data <int>,
## #   row_id.y <int>, subject_id.y <int>, icustay_id <int>,
## #   charttime <fctr>, itemid <int>, amount <dbl>, amountuom <fctr>,
## #   rate <dbl>, rateuom <fctr>, storetime <fctr>, cgid <int>,
## #   orderid <int>, linkorderid <int>, stopped <fctr>, newbottle <int>,
## #   originalamount <dbl>, originalamountuom <fctr>, originalroute <fctr>,
## #   originalrate <dbl>, originalrateuom <fctr>, originalsite <fctr>

Full join admissions and input events table:

  # Full join admissions table and input events (cv) table using the key "hadm_id"
  MIMIC_ADM_INE_full_join <- MIMIC3_ADM %>% 
                                full_join(MIMIC3_INE, by = "hadm_id")
                        
  as.tibble(MIMIC_ADM_INE_full_join)
## # A tibble: 24,743 x 40
##   row_id.x subject_id.x hadm_id  admittime  dischtime           deathtime
##      <int>        <int>   <int>     <date>     <date>              <fctr>
## 1       83           82  110641 2150-06-24 2150-06-29                  NA
## 2       84           83  158569 2142-04-01 2142-04-08                  NA
## 3       85           84  120969 2196-02-02 2196-02-04                  NA
## 4       86           84  166401 2196-04-14 2196-04-17 2196-04-17 13:42:00
## 5       87           85  116630 2162-03-02 2162-03-10                  NA
## 6       88           85  112077 2167-07-25 2167-07-30                  NA
## 7       89           86  190243 2146-04-06 2146-04-10                  NA
## # ... with 2.474e+04 more rows, and 34 more variables:
## #   admission_type <fctr>, admission_location <fctr>,
## #   discharge_location <fctr>, insurance <fctr>, language <fctr>,
## #   religion <fctr>, marital_status <fctr>, ethnicity <fctr>,
## #   edregtime <fctr>, edouttime <fctr>, diagnosis <fctr>,
## #   hospital_expire_flag <int>, has_chartevents_data <int>,
## #   row_id.y <int>, subject_id.y <int>, icustay_id <int>,
## #   charttime <fctr>, itemid <int>, amount <dbl>, amountuom <fctr>,
## #   rate <dbl>, rateuom <fctr>, storetime <fctr>, cgid <int>,
## #   orderid <int>, linkorderid <int>, stopped <fctr>, newbottle <int>,
## #   originalamount <dbl>, originalamountuom <fctr>, originalroute <fctr>,
## #   originalrate <dbl>, originalrateuom <fctr>, originalsite <fctr>

Note: By default, by = NULL, which uses all variables that appear in both tables as the key. This is called a natural join. We can also use by = c("a" = "b"), which matches variable a in the first table to variable b in the second table.
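
To make this concrete, here is a minimal sketch of both forms using the tables loaded above. The natural join is runnable, although both tables also share row_id and subject_id, so very few (if any) rows will pair up; it is shown only to illustrate the default behavior. The named-key form is shown as a commented template, and the table name other_table and the column name adm_id are purely illustrative placeholders, not part of MIMIC-III.

  # Natural join: with by = NULL, dplyr matches on every column name shared by
  # the two tables (here row_id, subject_id and hadm_id) and prints a message
  # listing the columns it used
  MIMIC_ADM_INE_natural_join <- MIMIC3_ADM %>% 
                                  inner_join(MIMIC3_INE)
  # Named key: match "hadm_id" in the first table to a column called "adm_id"
  # in a hypothetical second table (other_table and adm_id are placeholders)
  # MIMIC_ADM_OTHER_join <- MIMIC3_ADM %>% 
  #                           left_join(other_table, by = c("hadm_id" = "adm_id"))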


Semi join (filtering joins)

semi_join(table1, table2, by = "k") keeps all the observations in table 1 that have a match in table 2, identified by the key “k”.



Thus, in our tables:

  # Semi join admissions table and input events (cv) table using the key "hadm_id"
  MIMIC_ADM_INE_semi_join <- MIMIC3_ADM %>% 
                                semi_join(MIMIC3_INE, by = "hadm_id")
                        
  as.tibble(MIMIC_ADM_INE_semi_join)
## # A tibble: 396 x 19
##   row_id subject_id hadm_id  admittime  dischtime           deathtime
##    <int>      <int>   <int>     <date>     <date>              <fctr>
## 1  32008      26169  198816 2167-12-16 2167-12-25                  NA
## 2   2157       1785  138749 2139-12-16 2140-02-06 2140-02-06 15:20:00
## 3  15814      12896  101838 2115-05-22 2115-06-19 2115-06-19 05:30:00
## 4  29172      23870  133506 2183-11-21 2183-11-27                  NA
## 5  24323      19911  180900 2130-10-26 2130-12-07                  NA
## 6  15596      12720  123004 2173-07-01 2173-07-03                  NA
## 7   6307       5183  144799 2196-10-20 2197-01-05                  NA
## # ... with 389 more rows, and 13 more variables: admission_type <fctr>,
## #   admission_location <fctr>, discharge_location <fctr>,
## #   insurance <fctr>, language <fctr>, religion <fctr>,
## #   marital_status <fctr>, ethnicity <fctr>, edregtime <fctr>,
## #   edouttime <fctr>, diagnosis <fctr>, hospital_expire_flag <int>,
## #   has_chartevents_data <int>

This keeps only the observations in the admissions table that have a match in the input events (cv) table. Filtering joins like this are particularly useful when we have a result of interest in one table and want to analyze the related data in the other table.
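
For example, a minimal sketch of that workflow, using the tables loaded above, restricts the input events (cv) table to events belonging to emergency admissions only (the admission_type level "EMERGENCY" is taken from the output shown further below):

  # Keep only the admissions recorded as emergencies, then use a semi join to
  # restrict the input events (cv) table to those admissions
  emergency_adm <- MIMIC3_ADM %>% 
                     filter(admission_type == "EMERGENCY")
  MIMIC3_INE_emergency <- MIMIC3_INE %>% 
                            semi_join(emergency_adm, by = "hadm_id")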

The opposite of semi_join() is anti_join(table1, table2, by = "k"), which retains only the observations in table 1 that do not have a match in table 2, identified by the key “k”.



In our data tables, this pulls out all the entries in the admissions table whose hospital admission id (hadm_id) is not present in the input events (cv) table.

  # Anti join admissions table and input events (cv) table using the key "hadm_id"
  MIMIC_ADM_INE_anti_join <- MIMIC3_ADM %>% 
                                anti_join(MIMIC3_INE, by = "hadm_id")
                        
  as.tibble(MIMIC_ADM_INE_anti_join)
## # A tibble: 5,374 x 19
##   row_id subject_id hadm_id  admittime  dischtime deathtime admission_type
##    <int>      <int>   <int>     <date>     <date>    <fctr>         <fctr>
## 1  55920      90535  151788 2137-11-19 2137-11-22        NA      EMERGENCY
## 2  55918      90533  133492 2178-12-14 2178-12-24        NA       ELECTIVE
## 3  55915      90522  173862 2107-08-08 2107-08-24        NA       ELECTIVE
## 4  55913      90519  152185 2172-04-22 2172-04-29        NA       ELECTIVE
## 5  55911      90508  142580 2195-09-26 2195-10-02        NA      EMERGENCY
## 6  55907      90493  111396 2192-06-04 2192-06-06        NA      EMERGENCY
## 7  55905      90488  102480 2179-10-08 2179-10-21        NA      EMERGENCY
## # ... with 5,367 more rows, and 12 more variables:
## #   admission_location <fctr>, discharge_location <fctr>,
## #   insurance <fctr>, language <fctr>, religion <fctr>,
## #   marital_status <fctr>, ethnicity <fctr>, edregtime <fctr>,
## #   edouttime <fctr>, diagnosis <fctr>, hospital_expire_flag <int>,
## #   has_chartevents_data <int>

Anti-joins are useful for diagnosing join mismatches. In the above example, we might be interested in knowing how many patients do not have input event information.
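
A minimal sketch of such a check, using the anti-join result computed above, counts the distinct patients without input events:

  # Count the distinct patients (subject_id) in the admissions table that have
  # no matching rows in the input events (cv) table
  MIMIC_ADM_INE_anti_join %>% 
    distinct(subject_id) %>% 
    nrow()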


Joining tips

  1. Start by identifying the variables that form the primary key in each table. This is usually done by understanding the data set, not empirically by looking for a combination of variables that gives a unique identifier. If we just look for variables without thinking about what they mean, we might find a combination that is unique in our data even though the relationship does not hold in general.

  2. Having said that, the choice of key also depends on the question of interest. In all of the examples above, we used the hospital admission id (hadm_id) as our key. This yields statistics and analyses at the level of admissions, but not at the level of patients (identified by subject_id). So if we want to derive patient-level statistics, we may want to combine subject_id with hadm_id to form the primary key. A combination of variables can therefore sometimes work and capture the statistics of interest.

  3. Check that none of the variables in the primary key are missing. If a value is missing then it can’t identify an observation!

  4. Check whether the foreign keys match primary keys in another table. The best way to do this is with an anti_join() (see the sketch after this list). It is common for keys not to match because of data entry errors, and fixing these is often a lot of work.

  5. If we have missing keys, we’ll need to be thoughtful about the use of inner vs. outer joins, by carefully considering whether or not we want to drop rows that don’t have a match.
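
A minimal sketch of the checks in tips 3 and 4, using the MIMIC-III tables loaded above (the exact counts depend on the data extract):

  # Tip 3: a primary key should not contain missing values --
  # count admissions with a missing hadm_id (ideally zero)
  MIMIC3_ADM %>% 
    filter(is.na(hadm_id)) %>% 
    nrow()
  # Tip 4: use an anti join to find foreign keys in the input events (cv) table
  # that have no matching admission
  MIMIC3_INE %>% 
    anti_join(MIMIC3_ADM, by = "hadm_id") %>% 
    distinct(hadm_id) %>% 
    nrow()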


Classwork/Homework: Read and work on sections 14 and 15 (Strings and Factors) in the book R for Data Science.


Selected materials and references

R for Data Science - Wrangle Part