R fundamentals 02: Advanced data types and graphics (Sept. 14, 2017)





In this lecture we will consider further data types, namely factors, lists and data frames, as well as graphics.

Factors

Factors represent categorical variables. What are categorical variables?

  • They take a limited number of distinct values.
  • Each value belongs to a certain category.
  • For statistical analysis, R calls them factors.

Creating and manipulating factors

  # Create a blood group vector
  blood_group_vector <- c("AB", "O", "B+", "AB-", "O", "AB", "A", "A", "B", "AB-")
  # Create factors from the vector
  blood_group_factor <- factor(blood_group_vector)
  blood_group_factor
##  [1] AB  O   B+  AB- O   AB  A   A   B   AB-
## Levels: A AB AB- B B+ O

Note

  1. The levels are sorted alphabetically.
  2. The values are printed without quotation marks.

R encodes factors as integers, which makes storage and computation more efficient. The codes follow the alphabetical order of the levels: for example, A is assigned 1, AB is assigned 2, and so on. This can be viewed by invoking the str() function:

  # Show the structure of the blood group factor
  str(blood_group_factor)
##  Factor w/ 6 levels "A","AB","AB-",..: 2 6 5 3 6 2 1 1 4 3
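
The integer codes can also be pulled out directly. The following is a minimal sketch that reuses the blood group vector from above; the names codes, lv and recovered are only chosen for illustration.

```r
# Recreate the factor from the example above
blood_group_vector <- c("AB", "O", "B+", "AB-", "O", "AB", "A", "A", "B", "AB-")
blood_group_factor <- factor(blood_group_vector)

# The integer codes that R stores internally
codes <- as.integer(blood_group_factor)
codes

# The level labels; code i refers to the i-th entry of this vector
lv <- levels(blood_group_factor)
lv

# Indexing the levels by the codes recovers the original strings
recovered <- lv[codes]
identical(recovered, blood_group_vector)
```

So a factor is essentially an integer vector plus a lookup table of labels.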

The default order can be overridden by specifying the levels argument of the factor() function.

  # Define another set of levels, overriding the default order
  blood_group_factor2 <- factor(blood_group_vector, levels = c("A", "B", "B+", "AB", "AB-", "O"))
  str(blood_group_factor2)
##  Factor w/ 6 levels "A","B","B+","AB",..: 4 6 3 5 6 4 1 1 2 5

Factor levels can be renamed using the levels() function.

  # Define blood type
  blood_type <- c("B", "A", "AB", "A", "O")
  # Create the factor
  blood_type_factor <- factor(blood_type)
  blood_type_factor
## [1] B  A  AB A  O 
## Levels: A AB B O
  # Rename the levels
  levels(blood_type_factor) <- c("BT_A", "BT_AB", "BT_B", "BT_O")
  blood_type_factor
## [1] BT_B  BT_A  BT_AB BT_A  BT_O 
## Levels: BT_A BT_AB BT_B BT_O

Note: It is extremely important to list the new names in the same order as the levels R currently reports. Otherwise, the result can be extremely confusing, as the following exercise will show.


Classwork/Homework: Rename the blood_type_factor in the above example as follows:

levels(blood_type_factor) <- c("BT_A", "BT_B", "BT_AB", "BT_O")

and justify the result of outputting blood_type_factor.

If you want to label the levels, it is always best to define the labels along with the levels, like this:

  factor(blood_type, levels=c("A","B","AB","O"), labels=c("BT_A","BT_B","BT_AB","BT_O"))
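
The same pattern is handy when the data arrive as numeric codes. Here is a small sketch on a made-up coded variable; the vector gender_codes and the coding 1 = male, 2 = female are assumptions for illustration only.

```r
# Hypothetical coded variable: 1 = male, 2 = female
gender_codes <- c(2, 1, 1, 2, 1)

# levels lists the values as they appear in the data;
# labels lists the names to display, in the same order
gender_factor <- factor(gender_codes, levels = c(1, 2), labels = c("M", "F"))
gender_factor

# Count the observations per label
gender_counts <- table(gender_factor)
gender_counts
```

Because levels and labels are paired position by position, there is no need to remember the default alphabetical order.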

Nominal vs. Ordinal factors

Nominal factors: Categorical variables that cannot be ordered, like blood group. For example, it doesn’t make sense to say blood group A < blood group B.

Ordinal factors: Those categorical variables that can be ordered. For instance, tumor sizes. We can say T1 (tumor size 2cm or smaller) < T2 (tumor size larger than 2cm but smaller than 5 cm).

R provides a way to impose an order on factors: simply pass the argument “ordered=TRUE” to the factor() function.

  # Specify the tumor size vector
  tumor_size <- c("T1","T1","T2","T3","T1")
  # Set the order by specifying "ordered=TRUE"
  tumor_size_factor <- factor(tumor_size, ordered=TRUE, levels=c("T1","T2","T3"))
  # Print the results
  tumor_size_factor
## [1] T1 T1 T2 T3 T1
## Levels: T1 < T2 < T3
  # Compare the first element with the second (both are T1)
  tumor_size_factor[1] < tumor_size_factor[2]
## [1] FALSE
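
Once the order is imposed, a few more operations become meaningful. The sketch below reuses the tumor size factor from above; the names at_least_t2, largest and smallest are only for illustration.

```r
tumor_size <- c("T1", "T1", "T2", "T3", "T1")
tumor_size_factor <- factor(tumor_size, ordered = TRUE, levels = c("T1", "T2", "T3"))

# Compare every element against a level; the result is a logical vector
at_least_t2 <- tumor_size_factor >= "T2"
at_least_t2

# The largest and smallest observed sizes
largest <- max(tumor_size_factor)
smallest <- min(tumor_size_factor)
largest
smallest

# Sorting follows the level order that we specified
sort(tumor_size_factor, decreasing = TRUE)
```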


Classwork/Homework: Use the inequality operator (< or >) on a nominal category and print the output.


Lists

Recall that vectors and matrices can hold only one data type, like integer or character. Lists can hold multiple R objects of different types, without any coercion.

  # Combine different data types in a vector
  # Note that coercion takes place
  vec <- c("Blood-sugar","High", 140, "Units", "mg/dL")
  vec
## [1] "Blood-sugar" "High"        "140"         "Units"       "mg/dL"
  # And as a list
  lst <- list("Blood sugar","High", 140, "mg/dL")
  lst
## [[1]]
## [1] "Blood sugar"
## 
## [[2]]
## [1] "High"
## 
## [[3]]
## [1] 140
## 
## [[4]]
## [1] "mg/dL"
  # One can use the is.list() function to see if something is a list
  is.list(lst)
## [1] TRUE
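
To see the effect of coercion more directly, one can compare the classes of the elements. This is a minimal sketch along the lines of the example above; vec_class and lst_classes are names chosen for illustration.

```r
vec <- c("Blood-sugar", "High", 140, "mg/dL")
lst <- list("Blood sugar", "High", 140, "mg/dL")

# Every element of the vector was coerced to character
vec_class <- class(vec)
vec_class

# Each list element keeps its own class
lst_classes <- sapply(lst, class)
lst_classes

# So arithmetic still works on the numeric list element
lst[[3]] + 10
```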

Naming a list can be done through the names() function or specifying it in the list itself.

  # Define list
  lst <- list("Blood sugar","High", 140, "mg/dL")
  # Assign names and print
  names(lst) <- c("Fluid","Category","Value","Units")
  lst
## $Fluid
## [1] "Blood sugar"
## 
## $Category
## [1] "High"
## 
## $Value
## [1] 140
## 
## $Units
## [1] "mg/dL"
  # Specify within the list
  list(Fluid="Blood sugar",Category="High", Value=140, Units="mg/dL")
## $Fluid
## [1] "Blood sugar"
## 
## $Category
## [1] "High"
## 
## $Value
## [1] 140
## 
## $Units
## [1] "mg/dL"
  # For compact display use the str() function
  str(lst)
## List of 4
##  $ Fluid   : chr "Blood sugar"
##  $ Category: chr "High"
##  $ Value   : num 140
##  $ Units   : chr "mg/dL"

Note: A list can contain other lists, nested to any depth.

Accessing and extending lists

The difference between [] and [[]] is that [] returns a list (a sub-list of the original), whereas [[]] returns the element stored in the list.

  # Define a list
  blood_test <- list(Fluid="Blood sugar",Category="High", Value=140, Units="mg/dL")
  # Make another list containing this list
  patient <- list(Name="Mike",Age=36,Btest=blood_test)
  # Single brackets: a list containing the first element
  patient[1]
## $Name
## [1] "Mike"
  # Access the third "element" - which is actually a list itself.
  patient[[3]]
## $Fluid
## [1] "Blood sugar"
## 
## $Category
## [1] "High"
## 
## $Value
## [1] 140
## 
## $Units
## [1] "mg/dL"
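
The difference matters as soon as we want to compute with the result. A short sketch with the same patient list follows; single and double are illustrative names.

```r
blood_test <- list(Fluid = "Blood sugar", Category = "High", Value = 140, Units = "mg/dL")
patient <- list(Name = "Mike", Age = 36, Btest = blood_test)

# Single brackets keep the list wrapper
single <- patient[2]
class(single)

# Double brackets return the element itself
double <- patient[[2]]
class(double)

# Only the extracted element can be used in arithmetic;
# single + 1 would stop with an error
double + 1
```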


Classwork/Homework:

  1. What does patient[c(1,3)] give us? Is it a list or elements?
  2. What does patient[[c(1,3)]] give us? Is it a list or elements?
  3. How about patient[[c(3,1)]]? What is the difference? (Hint: patient[[c(1,3)]] is the same as patient[[1]][[3]].)

Subsetting by names is super easy: just supply the name within brackets. For example, patient["Name"] or patient[["Name"]]. Subsetting by logicals works only with single brackets, which return a list. For instance, patient[c(TRUE,FALSE)]. It doesn’t make sense to return a single element through logicals, i.e., patient[[c(TRUE,FALSE)]].

Another cool way to access elements (just the same as using [[]]) is the $ sign. However, to do this, the list should be named. For example, patient$Name will print the patient name.

$ can also be used for extending lists:

  # Extend the list to include gender
  patient$Gender <- "Male"
  # This is the same as using double brackets
  patient[["Gender"]] <- "Male"
  # Extend the blood test list to include the date of the test
  patient$Btest$Date <- "Sept.14"


Classwork/Homework: How do you remove an element from a list?


Data frames

Datasets come in various shapes and sizes. Usually they consist of:

  • Observations (e.g. each row is an observation)
  • Variables (e.g. each column is a variable)

Neither of the structures seen so far fits this well:

  • A matrix can hold only one data type
  • A list is not practical for tabular data

In a data frame the types may differ from column to column, so an observation (row) can mix types, but each variable (column) must hold a single data type. Usually data frames are imported, for example from CSV or Excel files. However, we can create a data frame as well.

  # Create name, age and logical vectors
  name <- c("Anne","James","Mike","Betty")
  age <- c(20,43,27,25)
  cancer <- c(TRUE,FALSE,FALSE,TRUE)
  # Form a data frame
  df <- data.frame(name,age,cancer)
  df
##    name age cancer
## 1  Anne  20   TRUE
## 2 James  43  FALSE
## 3  Mike  27  FALSE
## 4 Betty  25   TRUE
  # Use the names() function (like we did for vectors)
  names(df) <- c("Name","Age","Cancer_Status")
  df
##    Name Age Cancer_Status
## 1  Anne  20          TRUE
## 2 James  43         FALSE
## 3  Mike  27         FALSE
## 4 Betty  25          TRUE
  # Or specify inside data frame
  df <- data.frame(Name=name, Age=age, Cancer_Status=cancer)
  df
##    Name Age Cancer_Status
## 1  Anne  20          TRUE
## 2 James  43         FALSE
## 3  Mike  27         FALSE
## 4 Betty  25          TRUE


Classwork/Homework:

  1. Examine the structure of the data frame.
  2. What happens if one of the vectors has a different length?

Note: In R versions before 4.0.0, data.frame() stores character vectors as factors by default. You can override this as follows: df <- data.frame(Name=name, Age=age, Cancer_Status=cancer, stringsAsFactors = FALSE). Since R 4.0.0, stringsAsFactors = FALSE is the default.
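
To check how the character column is stored, one can compare the column classes under both settings. This is a small sketch in which the argument is set explicitly, so the result does not depend on the R version; df_factor and df_char are illustrative names.

```r
name <- c("Anne", "James", "Mike", "Betty")
age <- c(20, 43, 27, 25)
cancer <- c(TRUE, FALSE, FALSE, TRUE)

# Character columns converted to factors
df_factor <- data.frame(Name = name, Age = age, Cancer_Status = cancer,
                        stringsAsFactors = TRUE)
# Character columns kept as character
df_char <- data.frame(Name = name, Age = age, Cancer_Status = cancer,
                      stringsAsFactors = FALSE)

# Compare the class of each column
classes_factor <- sapply(df_factor, class)
classes_char <- sapply(df_char, class)
classes_factor
classes_char
```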

Manipulating data frames: Subsetting

  # Create name, age and logical vectors
  name <- c("Anne","James","Mike","Betty")
  age <- c(20,43,27,25)
  cancer <- c(TRUE,FALSE,FALSE,TRUE)
  # Form a data frame
  df <- data.frame(name,age,cancer)
  df
##    name age cancer
## 1  Anne  20   TRUE
## 2 James  43  FALSE
## 3  Mike  27  FALSE
## 4 Betty  25   TRUE
  # Subsetting by indices - works just like matrices
  df[1,2]
## [1] 20
  # Subsetting by indices - one can use column names as well
  df[1,"age"]
## [1] 20
  # Get the entire row/column - just like matrices
  # Get the second row
  df[2,]
##    name age cancer
## 2 James  43  FALSE
  # Get the "cancer" column
  df[,"cancer"]
## [1]  TRUE FALSE FALSE  TRUE
  # Get the 2nd and 3rd observations with their "name" and "cancer" status
  df[c(2,3),c("name","cancer")]
##    name cancer
## 2 James  FALSE
## 3  Mike  FALSE

The only difference is when you specify a single number as the index within []. For a matrix you get the element at that linear index, but for a data frame you get the column at that index (returned as a one-column data frame).

  # Print the second column
  df[2]
##   age
## 1  20
## 2  43
## 3  27
## 4  25

This is because a data frame is a list of vectors of equal length. Thus, a single [2] corresponds to the second element in the list, which is the vector of ages.
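
The list nature of a data frame can be checked directly. The following sketch contrasts it with a small matrix; the matrix m is made up for the comparison.

```r
df <- data.frame(name = c("Anne", "James", "Mike", "Betty"),
                 age = c(20, 43, 27, 25),
                 cancer = c(TRUE, FALSE, FALSE, TRUE))
m <- matrix(1:6, nrow = 2)

# A data frame is a list with one element per column
is.list(df)
length(df)   # number of columns, not number of cells

# A single index on a matrix picks one cell (counting down the columns)
m[2]

# A single index on a data frame picks a whole column, kept as a data frame
second <- df[2]
class(second)
```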


Classwork/Homework: Test the list operations (like ["age"] and [["age"]]) on data frames.

Manipulating data frames: Extending

Adding a column is super easy. Since a data frame is a list of vectors, one can just append a vector to the list. For instance, if we have a column of tumor size info like this for each patient: c("T0","T3","T2","T0"), the following code will append the vector.

  # Append tumor size vector
  df$tumor_size <- c("T0","T3","T2","T0")
  df
##    name age cancer tumor_size
## 1  Anne  20   TRUE         T0
## 2 James  43  FALSE         T3
## 3  Mike  27  FALSE         T2
## 4 Betty  25   TRUE         T0


Classwork/Homework:

  1. Use cbind() to append a vector of choice.
  2. What happens if the length of the appending vector is greater than (or less than) row dimensions?

In contrast, adding a row (or observation) is not as straightforward, because an observation may contain different data types. To add observations, make a new data frame and append it:

  # Create a one-row data frame; cancer is a logical value to match the existing column
  tom <- data.frame(name="Tom", age=47,cancer=TRUE,tumor_size="T2")
  # And append
  df <- rbind(df,tom)
  df
##    name age cancer tumor_size
## 1  Anne  20   TRUE         T0
## 2 James  43  FALSE         T3
## 3  Mike  27  FALSE         T2
## 4 Betty  25   TRUE         T0
## 5   Tom  47   TRUE         T2
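
When appending rows it is worth checking that the column types survive. The sketch below uses a shortened version of the data frame and two hypothetical new rows, tom_logical and tom_string, to show what can happen when a logical value is typed as a string.

```r
df <- data.frame(name = c("Anne", "James"), age = c(20, 43),
                 cancer = c(TRUE, FALSE), stringsAsFactors = FALSE)

# New observation with cancer given as a logical value
tom_logical <- data.frame(name = "Tom", age = 47, cancer = TRUE,
                          stringsAsFactors = FALSE)
# New observation with cancer given as a string
tom_string <- data.frame(name = "Tom", age = 47, cancer = "TRUE",
                         stringsAsFactors = FALSE)

# Column classes after appending each version
classes_logical <- sapply(rbind(df, tom_logical), class)
classes_string <- sapply(rbind(df, tom_string), class)
classes_logical
classes_string
```

In the second case the cancer column is no longer logical, even though the printed data frame looks the same.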


Classwork/Homework:

  1. Can you use the list() function instead of the data frame function in the above code?
  2. What happens if you leave out the argument names name=, age= etc. in the above code?

Manipulating data frames: Sorting

We can use the order() function to sort the entire data frame with respect to a particular column.

  # Find the row order that sorts a column, say "age"
  ranks <- order(df$age)
  # Rearrange the rows of the data frame in that order
  df[ranks,]
##    name age cancer tumor_size
## 1  Anne  20   TRUE         T0
## 4 Betty  25   TRUE         T0
## 3  Mike  27  FALSE         T2
## 2 James  43  FALSE         T3
## 5   Tom  47   TRUE         T2
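
order() also accepts several keys, which is useful for breaking ties. Here is a minimal sketch on a copy of the data frame above (without the tumor size column); idx and idx2 are illustrative names.

```r
df <- data.frame(name = c("Anne", "James", "Mike", "Betty", "Tom"),
                 age = c(20, 43, 27, 25, 47),
                 cancer = c(TRUE, FALSE, FALSE, TRUE, TRUE))

# order() returns row positions, not the sorted values
idx <- order(df$age)
idx

# Several keys: first by cancer status (FALSE before TRUE), then by age
idx2 <- order(df$cancer, df$age)
sorted <- df[idx2, ]
sorted
```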


Classwork/Homework:

  1. What does sort(df$age) do and how is it related to ranks?
  2. Sort the entries in descending order of age. (Hint: put a question mark in front of the function name, as in ?order, to read its help page.)

Graphics

R has very strong graphical capabilities, a primary reason why both industry and academia are interested in it.

  • Create plots with code
  • Replication and modification is easy
  • Reproducibility!
  • The graphics package, loaded by default, produces great plots
  • Excellent packages like ggplot2, ggvis and lattice

graphics package

This package has many functions. In particular, plot() and hist() provide the essential functionality.

The plot() function:

  1. Is generic
  2. Gives different plots for different inputs
  3. Can plot several things like vectors, linear models, kernel densities etc.

Before we see how the plot function works, we will first import a public health data set. We will work with the HANES data set, which is New York City’s Health and Nutrition survey data set.

  # Load the RCurl package (install it first with install.packages("RCurl") if needed)
  library(RCurl)
## Loading required package: bitops
  # Import the HANES data set from GitHub; break the string into two for readability
  # (Please note this readability aspect very carefully)
  URL_text_1 <- "https://raw.githubusercontent.com/kannan-kasthuri/kannan-kasthuri.github.io"
  URL_text_2 <- "/master/Datasets/HANES/NYC_HANES_DIAB.csv"
  # Paste it to constitute a single URL 
  URL <- paste(URL_text_1,URL_text_2, sep="")
  HANES <- read.csv(text=getURL(URL))
  # Observe the structure
  str(HANES)
## 'data.frame':    1527 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 32 43 44 53 55 70 84 90 100 ...
##  $ GENDER           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 27 28 27 24 30 26 31 32 34 ...
##  $ AGEGROUP         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : int  2 2 2 2 1 1 3 1 2 1 ...
##  $ UCREATININE      : int  105 296 53 314 105 163 150 46 36 177 ...
##  $ UALBUMIN         : num  0.707 18 1 8 4 3 2 2 0.707 4 ...
##  $ UACR             : num  0.00673 6 2 3 4 ...
##  $ MERCURYU         : num  0.37 NA 0.106 0.487 2.205 ...
##  $ DX_DBTS          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 ...
##  $ CADMIUM          : num  0.2412 0.4336 0.1732 0.0644 0.0929 ...
##  $ LEAD             : num  1.454 0.694 1.019 0.863 1.243 ...
##  $ MERCURYTOTALBLOOD: num  2.34 3.11 2.57 1.32 14.66 ...
##  $ HDL              : int  42 52 51 42 61 52 50 57 56 42 ...
##  $ CHOLESTEROLTOTAL : int  184 117 157 145 206 120 155 156 235 156 ...
##  $ GLUCOSESI        : num  4.61 4.5 4.77 5.16 5 ...
##  $ CREATININESI     : num  74.3 80 73 80 84.9 ...
##  $ CREATININE       : num  0.84 0.91 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 ...
##  $ TRIGLYCERIDE     : int  156 63 43 108 65 51 29 31 220 82 ...
##  $ GLUCOSE          : int  83 81 86 93 90 92 85 72 87 96 ...
##  $ COTININE         : num  31.5918 57.6882 0.0635 0.035 0.0514 ...
##  $ LDLESTIMATE      : int  111 52 97 81 132 58 99 93 135 98 ...

Note that GENDER, AGEGROUP and HSQ_1 are integers but in fact they should be factors! So, we need to convert them to factors.

  # Convert them to factors
  HANES$GENDER <- as.factor(HANES$GENDER)
  HANES$AGEGROUP <- as.factor(HANES$AGEGROUP)
  HANES$HSQ_1 <- as.factor(HANES$HSQ_1)
  # Now observe the structure
  str(HANES)
## 'data.frame':    1527 obs. of  23 variables:
##  $ KEY              : Factor w/ 1527 levels "133370A","133370B",..: 28 32 43 44 53 55 70 84 90 100 ...
##  $ GENDER           : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SPAGE            : int  29 27 28 27 24 30 26 31 32 34 ...
##  $ AGEGROUP         : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ HSQ_1            : Factor w/ 5 levels "1","2","3","4",..: 2 2 2 2 1 1 3 1 2 1 ...
##  $ UCREATININE      : int  105 296 53 314 105 163 150 46 36 177 ...
##  $ UALBUMIN         : num  0.707 18 1 8 4 3 2 2 0.707 4 ...
##  $ UACR             : num  0.00673 6 2 3 4 ...
##  $ MERCURYU         : num  0.37 NA 0.106 0.487 2.205 ...
##  $ DX_DBTS          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ A1C              : num  5 5.5 5.2 4.8 5.1 4.3 5.2 4.8 5.2 4.8 ...
##  $ CADMIUM          : num  0.2412 0.4336 0.1732 0.0644 0.0929 ...
##  $ LEAD             : num  1.454 0.694 1.019 0.863 1.243 ...
##  $ MERCURYTOTALBLOOD: num  2.34 3.11 2.57 1.32 14.66 ...
##  $ HDL              : int  42 52 51 42 61 52 50 57 56 42 ...
##  $ CHOLESTEROLTOTAL : int  184 117 157 145 206 120 155 156 235 156 ...
##  $ GLUCOSESI        : num  4.61 4.5 4.77 5.16 5 ...
##  $ CREATININESI     : num  74.3 80 73 80 84.9 ...
##  $ CREATININE       : num  0.84 0.91 0.83 0.91 0.96 0.75 0.99 0.9 0.84 0.93 ...
##  $ TRIGLYCERIDE     : int  156 63 43 108 65 51 29 31 220 82 ...
##  $ GLUCOSE          : int  83 81 86 93 90 92 85 72 87 96 ...
##  $ COTININE         : num  31.5918 57.6882 0.0635 0.035 0.0514 ...
##  $ LDLESTIMATE      : int  111 52 97 81 132 58 99 93 135 98 ...
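
When many columns need converting, repeating the same line becomes tedious. One possible shortcut is sketched here on a toy data frame that stands in for the survey data (the values and the name survey are made up).

```r
# Toy data frame standing in for the survey data (values are made up)
survey <- data.frame(GENDER = c(1L, 2L, 2L, 1L),
                     AGEGROUP = c(1L, 3L, 2L, 1L),
                     A1C = c(5.0, 5.5, 6.1, 4.8))

# Columns that hold category codes
factor_cols <- c("GENDER", "AGEGROUP")

# Apply as.factor() to each of them and write the result back
survey[factor_cols] <- lapply(survey[factor_cols], as.factor)

# Check the class of every column
col_classes <- sapply(survey, class)
col_classes
```

This works because a data frame is a list of columns, so lapply() visits one column at a time.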

Let’s plot a categorical variable, for instance gender.

  # Plot the factor gender
  plot(HANES$GENDER)


Classwork/Homework:

  1. Is the above plot informative?
  2. What will you do to make it more informative?

Let’s now plot a numerical variable.

  # Plot a numerical variable
  plot(HANES$A1C)

Of course, we can plot two numerical variables:

  # Plot two numerical variables 
  # A1c - Hemoglobin percentage, UACR - Urine Albumin/Creatinine Ratio
  plot(HANES$A1C,HANES$UACR)

Note that R automatically renders them as a scatter plot. However, this plot is uninformative as the data is poorly scaled. One can restrict the range of the y-axis using the “ylim” argument:

  # Plot two numerical variables with appropriate scaling
  plot(HANES$A1C,HANES$UACR, ylim=c(0,10))

Although the scaling is okay now, the relationship is extremely complicated. One transformation that helps in understanding relationships between variables is the log() function. We can apply the logarithm to both variables:

  # Transform the data using the log function and plot the result
  plot(log(HANES$A1C), log(HANES$UACR))

We note that there are two different clusters of patients - one with low UACR values and another with high UACR values, both corresponding to a mean \(log(A1c)\) of about \(1.7\).
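
An alternative to transforming the data is to ask plot() for logarithmic axes through its log argument, which keeps the tick labels in the original units. The sketch below uses simulated skewed data, not the HANES values; x and y are made-up variables.

```r
# Simulated right-skewed data (not the HANES values)
set.seed(1)
x <- rlnorm(200, meanlog = 1.7, sdlog = 0.1)
y <- rlnorm(200, meanlog = 1, sdlog = 1.5)

# Option 1: transform the data; the axes show log values
plot(log(x), log(y))

# Option 2: keep the data and draw both axes on a log scale
plot(x, y, log = "xy")
```

Both plots show the same point pattern; only the axis labelling differs.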

We can also plot two categorical variables. Let us plot GENDER and AGEGROUP factors.

  # Rename the GENDER factor for identification
  HANES$GENDER <- factor(HANES$GENDER, labels=c("M","F"))
  # Rename the AGEGROUP factor for identification
  HANES$AGEGROUP <- factor(HANES$AGEGROUP, labels=c("20-39","40-59","60+"))
  # Plot GENDER vs AGEGROUP
  plot(HANES$GENDER,HANES$AGEGROUP)

  # Swap AGEGROUP vs GENDER
  plot(HANES$AGEGROUP,HANES$GENDER)

Note that R displays proportions in these plots. The first argument is placed on the \(x\)-axis and the second argument on the \(y\)-axis.
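
The proportions shown in such a plot can also be computed as numbers with table() and prop.table(). This is a small sketch on toy factors that stand in for GENDER and AGEGROUP (the values are made up).

```r
# Toy factors standing in for GENDER and AGEGROUP (values are made up)
gender <- factor(c("M", "F", "F", "M", "F", "M", "F", "F"))
agegroup <- factor(c("20-39", "20-39", "40-59", "60+", "60+", "40-59", "20-39", "40-59"))

# Cross-tabulate the two factors
counts <- table(gender, agegroup)
counts

# Proportions within each gender (each row sums to 1)
props <- prop.table(counts, margin = 1)
props
```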

Next we will see the hist() function. hist() is short for histogram. The hist() function:

  • Gives a visual representation of a distribution
  • Bins all values
  • Plots the frequency of each bin

Here is an example that finds the distribution of the A1C variable for the male population.

  # Form a logical vector that is TRUE for the male records
  HANES_MALE <- HANES$GENDER == "M"
  # Select only the records for the male population
  MALES_DF <- HANES[HANES_MALE,]
  # Make a histogram
  hist(MALES_DF$A1C)

Observe that the glycohemoglobin percentage lies between \(5\) and \(6\) for most of the men. Note that R has also chosen the number of bins automatically, \(6\) in this case. You can suggest a larger (or smaller) number of bins using the “breaks” argument. There are other cool tools like barplot(), boxplot() and pairs() in the graphics package.
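
The binning can be inspected without drawing anything by passing plot = FALSE to hist(). The sketch below uses simulated A1C-like values rather than the HANES data; a1c, h_default, h_fine and h_exact are illustrative names.

```r
# Simulated A1C-like values (not the HANES data)
set.seed(42)
a1c <- rnorm(500, mean = 5.5, sd = 0.6)

# Default binning
h_default <- hist(a1c, plot = FALSE)
# Ask for roughly 20 bins; R treats a single number as a suggestion
h_fine <- hist(a1c, breaks = 20, plot = FALSE)

# Number of bins actually used in each case
length(h_default$counts)
length(h_fine$counts)

# Explicit break points are used exactly as given
cuts <- seq(floor(min(a1c)), ceiling(max(a1c)), by = 0.25)
h_exact <- hist(a1c, breaks = cuts, plot = FALSE)
h_exact$breaks
```

If an exact number of bins is required, supplying the break points yourself is the safer route.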


Classwork/Homework:

  1. Find the distribution of A1C for the female population in the above data set. Are they different?
  2. Find the distribution of A1C for three age groups in the above data set. Is there a difference?
  3. Try to find the distribution of one more numeric variable (other than A1C) for the three age-groups.
  4. Increase the number of bins to \(10\) in the above exercise.

Customizing plots

How does this plot look?

  # Plot LDL values vs HDL values (the LDL column is named LDLESTIMATE)
  plot(HANES$LDLESTIMATE, HANES$HDL)

compared to this -

  # Plot GLUCOSE vs GLUCOSESI with parameters
  plot(HANES$GLUCOSE, HANES$GLUCOSESI, xlab= "Plasma Glucose", 
         ylab = "Blood Glucose SI units", main = "Plasma vs Blood Glucose", type = "o", col="blue")

To do good data science, it certainly helps to know the correlations between the variables (in the above figure, we see that blood glucose levels and plasma glucose levels are the same), but how we present the data matters too!

Some plot function characteristics:

  • xlab: Horizontal axis label

  • ylab: Vertical axis label

  • main: Plot title

  • type: Plot type

  • col: Plot color
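
These arguments can be tried out without the HANES data. The following sketch uses made-up glucose readings; the conversion factor of 18 between mg/dL and mmol/L is an approximation, and type = "b" is just one of several plot types.

```r
# Simulated glucose readings in two units (made-up values)
set.seed(7)
glucose_mgdl <- sort(round(rnorm(30, mean = 95, sd = 12)))
glucose_si <- glucose_mgdl / 18   # approximate mg/dL to mmol/L conversion

plot(glucose_mgdl, glucose_si,
     xlab = "Glucose (mg/dL)",        # horizontal axis label
     ylab = "Glucose (mmol/L)",       # vertical axis label
     main = "Glucose in two units",   # plot title
     type = "b",                      # points joined by line segments
     col = "darkgreen")               # color of the points and lines
```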


Classwork/Homework: Change the type to “l” and report the plot type.

Graphical parameters passed to plot() apply only to that call; they are not maintained throughout the session. If you want to maintain graphical parameters, use the par() function. For example,

  # Use par() so that the color red is held for subsequent plots
  par(col="red")
  # Plot LDL vs HDL
  plot(HANES$LDLESTIMATE, HANES$HDL)

  # Plot A1C vs HDL
  plot(HANES$A1C, HANES$HDL)

More graphical parameters:

  • col.main: Color of the main title
  • cex.axis: Size of the axis numbers (values below 1 shrink them, values above 1 enlarge them). Just as “col” has variants such as “col.main”, “cex” has variants too; “cex.axis” is one of them.
  • lty: Specifies the line type - solid, dashed etc. (1 is a full line, 2 is dashed etc.)
  • pch: Plot symbol. More than 35 types of symbols.
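
Here is a quick way to see these parameters in action. It is only a sketch on simulated data; the particular colors, sizes and symbol numbers are arbitrary choices.

```r
# Simulated data for illustration
set.seed(3)
x <- 1:20
y <- x + rnorm(20)

plot(x, y,
     main = "Graphical parameters",
     col.main = "navy",   # title color
     cex.axis = 0.8,      # axis numbers at 80% of the default size
     pch = 17)            # filled triangle symbol

# Add a dashed reference line (lty = 2) of width 2
abline(a = 0, b = 1, lty = 2, lwd = 2)
```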

Multiple graphs

So far we saw single plots of data, with no combinations and layers. It is often useful to show several plots in one figure. We can do this by passing the “mfrow” argument to the par() function.

  # Set the par function with mfrow to 2x2 "grid"
  par(mfrow = c(2,2))
  # Plot LDL vs HDL
  plot(HANES$LDLESTIMATE, HANES$HDL)
  # Plot A1C vs HDL
  plot(HANES$A1C, HANES$HDL)
  # Plot GLUCOSE vs HDL
  plot(HANES$GLUCOSE, HANES$HDL)
  # Plot CHOLESTEROLTOTAL vs HDL
  plot(HANES$CHOLESTEROLTOTAL, HANES$HDL)


Classwork/Homework: Do the above exercise with “mfcol” argument. How does it plot?

To reset the plot to a single figure, use par(mfrow = c(1,1)); that gets us back to normal.

The layout() function

The layout() function facilitates more complex plot arrangements.

  # Create a grid on how our figures should appear
  grid <- matrix(c(1,1,2,3), nrow=2,ncol=2,byrow=TRUE)
  # Pass it to the layout function
  layout(grid)
  # Plot LDL vs HDL
  plot(HANES$LDLESTIMATE, HANES$HDL)
  # Plot GLUCOSE vs HDL
  plot(HANES$GLUCOSE, HANES$HDL)
  # Plot CHOLESTEROLTOTAL vs HDL
  plot(HANES$CHOLESTEROLTOTAL, HANES$HDL)

  # Reset the layout
  layout(1)
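
Before filling a layout with real plots, it can help to preview the panel arrangement. The sketch below uses layout.show() for that and also sets relative row heights; the 2:1 ratio is an arbitrary choice.

```r
# Three panels: a wide one on top, two below
grid <- matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)

# heights makes the top row twice as tall as the bottom row;
# layout() returns the number of panels
n_panels <- layout(grid, heights = c(2, 1))

# Draw the outline of each panel with its number
layout.show(n_panels)

# Reset to a single panel
layout(1)
```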

Tip: Resetting every time might be too tedious. A trick is to assign the old settings to an object and reuse it when necessary:

  # Assign the old parameters to an object
  # (no.readonly = TRUE leaves out the parameters that cannot be set)
  old_parameters <- par(no.readonly = TRUE)
  # Change to new parameters
  par(col="red")
  # Plot LDL vs HDL
  plot(HANES$LDLESTIMATE, HANES$HDL)

  # Reset to old parameters
  par(old_parameters)
  # Test the original settings
  plot(HANES$LDLESTIMATE, HANES$HDL)
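
The same save-and-restore idea can be wrapped inside a function, so that the settings are restored automatically. This is only a sketch; red_scatter is a hypothetical helper, not part of any package.

```r
# Hypothetical helper: draws a red scatter plot without leaking settings
red_scatter <- function(x, y) {
  old <- par(col = "red")   # par() returns the previous value of what it changes
  on.exit(par(old))         # restore it when the function exits
  plot(x, y)
}

before <- par("col")
red_scatter(1:10, (1:10)^2)
after <- par("col")

# The setting is unchanged outside the function
identical(before, after)
```

on.exit() runs even if plot() fails, so the old setting comes back either way.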

Stacking graphical elements is a great way of adding more information to a plot.

  # Plot A1C vs GLUCOSESI
  plot(HANES$A1C,HANES$GLUCOSESI, xlim=c(6,8), ylim=c(3,10))
  # Fit a linear model of GLUCOSESI (y-axis) on A1C (x-axis)
  # Note: `lm()` returns a model object; `coef()` extracts the coefficients of the fit
  lm_glucose_SI <- lm(GLUCOSESI ~ A1C, data = HANES)
  # Stack the fitted line on top of the plot with line width 2 (specified by `lwd` argument)
  abline(coef(lm_glucose_SI),lwd = 2)
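
The direction of the model formula matters when overlaying a fit: the response (left of ~) should be the variable on the y-axis. Below is a self-contained sketch on simulated data with a known slope; d, x and y are made-up names and values.

```r
# Simulated data with a known linear relationship (made-up values)
set.seed(10)
d <- data.frame(x = runif(50, 6, 8))
d$y <- 2 + 1.5 * d$x + rnorm(50, sd = 0.5)

# Response on the left of ~, predictor on the right
fit <- lm(y ~ x, data = d)

# coef() extracts the intercept and slope from the model object
coef(fit)

# The predictor goes on the x-axis so that the line matches the plot
plot(d$x, d$y)
abline(fit, lwd = 2)   # abline() also accepts the fitted model directly
```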



Classwork/Homework: Make a plot and add elements through the functions points(), lines(), segments() and text().

Adding lines may not be visually appealing if you ignore the order. In fact, it can make it worse:

  # Plot A1C vs GLUCOSESI
  plot(HANES$A1C,HANES$GLUCOSESI, xlim=c(6,8), ylim=c(3,10))
  # Fit a linear model of GLUCOSESI (y-axis) on A1C (x-axis)
  # Note: `lm()` returns a model object; `coef()` extracts the coefficients of the fit
  lm_glucose_SI <- lm(GLUCOSESI ~ A1C, data = HANES)
  # Stack the fitted line on top of the plot with line width 2 (specified by `lwd` argument)
  abline(coef(lm_glucose_SI),lwd = 2)
  # Adding lines to the plot (observations are joined in file order)
  lines(HANES$A1C,HANES$GLUCOSESI)

However, if you order your data, the lines can be really informative about how the errors are distributed:

  # Plot A1C vs GLUCOSESI
  plot(HANES$A1C,HANES$GLUCOSESI, xlim=c(6,8), ylim=c(3,10))
  # Fit a linear model of GLUCOSESI (y-axis) on A1C (x-axis)
  # Note: `lm()` returns a model object; `coef()` extracts the coefficients of the fit
  lm_glucose_SI <- lm(GLUCOSESI ~ A1C, data = HANES)
  # Stack the fitted line on top of the plot with line width 2 (specified by `lwd` argument)
  abline(coef(lm_glucose_SI),lwd = 2)
  # Order the observations by A1C, the x-axis variable
  ranks <- order(HANES$A1C)
  # And then add the lines
  lines(HANES$A1C[ranks],HANES$GLUCOSESI[ranks])

Ordering and sorting can be really handy in data manipulation and plotting.


Selected materials and references

An Introduction to R