Programming for Data Analysis

Course Director: Kasthuri Kannan, New York University (NYU)
Coordinates: TRB 737, 227, E.30th St. New York
E-mail at kasthuri.kannan [at] nyumc.org (no personal emails please!)

Teaching Assistant

The TA for this course is Yuhan Hao. Please email Yuhan.Hao at nyumc.org for any clarifications regarding homework or classwork.

Office Hours

Tuesday: 2:30 pm to 4 pm and by appointment

Course Information

Official Course Name: Programming for Data Analysis (BMIN-GA 1005/BMSC-GA 4486)

Meeting Schedule: Every Tuesday and Thursday from Sept. 7 through Nov. 16, 1 pm to 2:30 pm

Meeting Location: Translational Research Building (TRB) @ 227, E.30th St. conference room 618 except:

September 28th: the class will be held in the Smilow Seminar Room on the 1st floor (Yellow Pathway) at Tisch Hospital - 550 1st Avenue.

October 3rd: the class will be held at The Alexandria Center for Life Science in the Alexandria East conference room ERSP 901 - 430 E 29th St.

Important room notice for entry at the Alexandria building: The room is available 9 am to 5 pm. Pre-approved security clearance is needed to access the Alexandria Building. Seating: 18 extra chairs in the room. The room can be split down the middle (this becomes a giant whiteboard). There is a kitchen/pantry inside the room and two exit doors. A point person should arrive 20 minutes early to retrieve a guest pass from the front desk. Guests need to be swiped into our space to gain access, so we suggest the point person swipe other guests in at the elevator lobby level and guide them to the space.
Either I or someone on my behalf will be the point person.

October 24th: the class will be held at 650 First Avenue, 5th Floor, conference room 525.

Learning Objectives

By the end of this course, the student will exhibit an in-depth understanding of data science and analysis methods as well as proficiency in R. The student will produce a portfolio of data analysis projects from the course that demonstrates mastery of analysis and visualization methods, and will be equipped to analyze biomedical and genomic data sets. Another main objective of the course is to communicate statistical results correctly and effectively.

Course Overview

This course is designed to empower students to learn the R programming language and use it to conduct data science. We will study a wide range of topics, including handling and querying databases, exploratory/confirmatory analysis, and visualization in R. We will closely follow the book R for Data Science; however, the emphasis will be on working with biomedical data rather than the datasets illustrated and used in the textbook.
This course does not have any prerequisites.

General Policies

Late/missed work: You must adhere to the due dates for all required submissions. If you miss a deadline, then you will not get credit for that assignment/post.
Incompletes: No "Incompletes" will be assigned for this course unless we are at the very end of the course and you have an emergency.
Responding to Messages: I will check e-mails daily during the week, and I will respond to course related questions within 48 hours.
Announcements: I will make announcements throughout the semester by e-mail.
Make sure that your email address is updated; otherwise you may miss important emails from me.
Safeguards: Always back up your work in a safe place (an electronic file with a backup is recommended) and make a hard copy. Do not wait until the last minute to do your work. Allow time for deadlines.
Plagiarism: Plagiarism, the presentation of someone else's words or ideas as your own, is a serious offense and will not be tolerated in this class. The first time you plagiarize someone else's work, you will receive a zero for that assignment. The second time you plagiarize, you will fail the course with a notation of academic dishonesty on your official record.

Course Assessment (see the project grading rubric below)

Programming Assignments (40%)
Directed Insights (25%)
Final Project (35%)

Recommended Reading

1. R for Data Science by Garrett Grolemund & Hadley Wickham (available here)
2. R in Action by Robert I. Kabacoff
3. Several online tutorials (just type "R tutorial" in Google and follow the lead)

Project Grading Rubric (20% each)

You need to document and demonstrate all aspects of data science foundations discussed in the class.

1. Correctly apply tools and techniques of data preparation and wrangling
    a. Missing data handling, joining, or other transformations, removing outliers etc.
    b. Gathering, spreading data (if needed)

2. Use Exploratory Data Analysis and `dplyr` transformation methods to identify structure and correlations in the data

3. Formulate questions and possible ways of analysis and visualization
    a. Identify appropriate visualization methods for analysis of your data set
    b. Choose the right geoms for the questions at hand

4. Correctly interpret results of analysis (clinical/biological significance)
    a. Demonstrate domain specific knowledge of clinical data
    b. Propose a hypothesis based on visualization and results
    c. Compare the usefulness of the obtained results/conclusions

5. Formulate appropriate plans for validation, further analysis, or collection of additional data as needed.
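
As a rough illustration of rubric items 1 to 3, the sketch below handles a missing value, summarizes by group with `dplyr`, and picks a geom for a simple question. The data frame, its column names (`id`, `group`, `glucose`), and its values are made up for this example and are not taken from the course data sets; it is one possible shape for such an analysis, not a template for the project.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; names and values are invented for illustration.
patients <- data.frame(
  id      = 1:6,
  group   = c("a", "a", "a", "b", "b", "b"),
  glucose = c(90, 110, NA, 130, 150, 140)
)

# Item 1: data preparation, here simply dropping rows with a missing measurement.
clean <- patients %>%
  filter(!is.na(glucose))

# Item 2: a dplyr transformation to look for structure across groups.
by_group <- clean %>%
  group_by(group) %>%
  summarise(mean_glucose = mean(glucose), n = n())

# Item 3: a boxplot is one reasonable geom for comparing a numeric
# variable across categories.
p <- ggplot(clean, aes(x = group, y = glucose)) +
  geom_boxplot()

print(by_group)
```

Dropping incomplete rows is only one way to handle missing data; whether it is appropriate depends on the data set and the question being asked.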

Lectures


Introduction to the course (Sept. 07) PDF Link
R fundamentals 01: Elementary data types (Sept. 12) Presentation Link
R fundamentals 02: Advanced data types and graphics (Sept. 14) Presentation Link
R fundamentals 03: Elements of prog., R Markdown (Sept. 19) Presentation Link
Data science fundamentals 01: Visualize and explore (Sept. 21, 2017) Presentation Link
Data science fundamentals 02: Transform and explore (Sept. 26, 2017) Presentation Link
Data science fundamentals 03: Exploratory data analysis (Sept. 28, 2017) Presentation Link
Data science fundamentals 04: Exploratory data analysis (Oct. 03, 2017) Presentation Link
Data science fundamentals 05: Wrangle data (Oct. 05, 2017) Presentation Link
Mid-term presentation 01 (Oct. 10, 2017)
Mid-term presentation 02 (Oct. 12, 2017)
Database 01: Analysis and modeling (Oct. 17, 2017) PDF Link
Database 02: SQL (Oct. 19, 2017) Datasets PDF
Database 03: SQL workshop (Oct. 24, 2017)
High-performance computing (Oct. 26, 2017) PDF Link
Data science fundamentals 06: Basic inference and linear regression (Oct. 31, 2017) Presentation Link
Data science fundamentals 07: Advanced modeling (Nov. 02, 2017) Presentation Link
Basic modeling workshop (Nov. 07, 2017)
Advanced modeling workshop (Nov. 09, 2017)
Project presentation 01 (Nov. 14, 2017)
Project presentation 02 (Nov. 16, 2017)

Homeworks

General guidelines for homework can be accessed here

Homework #01, assigned - due date Sept. 27, 5pm

Homework #02, assigned - due date Oct. 05, 5pm

Homework #03, assigned - due date Oct. 18, 12am

Homework #04, assigned, this is the mid-term assessment for the final project, due date Oct. 28, 12am

Homework #05, assigned - due date Nov. 16, 5pm


Project Info

The final project is easy to state: obtain directed insights on data sets of your choice (given below) based on the Explore, Wrangle, Model, Program and Communicate paradigm.

You are advised to become familiar with the HANES and MIMIC3 data sets (see below) and their formats right away. The first step in becoming a good data scientist is becoming friendly with the data you are handling. The friendlier you are with the data, the better the patterns you can decipher.


For information regarding HANES click here.
For information regarding MIMIC click here.
Click here for extensive medical informatics data sets you are encouraged to analyze. Several of them require prior permission for access. You are encouraged to approach relevant authorities for access to their data. If you need my permission, please see me.

You will be continuously assessed to make sure you are progressing towards your final submission. Please see the project page for more information.


Datasets

(for downloading CSV, use "Right click -> Save Link As")
Note 1: You don't have to download the data to use it in RStudio. You may use the `RCurl` library on the link.
Note 2: Percentage denotes the percent of data available for analysis.
New York City Health and Nutrition Examination Survey (HANES) Original (SAS format) CSV
New York City Health and Nutrition Examination Survey (HANES), Curated CSV
MIMIC3, Admissions (10%) CSV
MIMIC3, Fluid input events, CareVue (0.1%) CSV
MIMIC3, Fluid input events, MetaVision (0.1%) CSV
MIMIC3, Chart events 1 (0.01%) CSV
MIMIC3, Prescriptions (0.1%) CSV
MIMIC3, Other data sets Link
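
Note 1 above can be sketched roughly as follows. The helper names `read_csv_from_link` and `parse_csv_text` are hypothetical, and no real course link is used: pass the URL of whichever CSV you want to load. The sketch assumes the `RCurl` package is installed, and the offline demonstration uses made-up values rather than HANES or MIMIC3 data.

```r
# Parse CSV content that is already in memory as a character string.
parse_csv_text <- function(csv_text) {
  read.csv(text = csv_text, stringsAsFactors = FALSE)
}

# Fetch a CSV over the network and parse it without saving a file.
# RCurl::getURL() returns the body of the response as a single string.
read_csv_from_link <- function(url) {
  csv_text <- RCurl::getURL(url)
  parse_csv_text(csv_text)
}

# Offline demonstration with invented values (no network access needed).
demo <- parse_csv_text("id,age\n1,34\n2,51\n")
str(demo)
```

In a session with network access, calling `read_csv_from_link()` on a data set link would return a data frame in the same way.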