The TA for this course is Yuhan Hao. Please email Yuhan.Hao at nyumc.org for any clarifications regarding homework or classwork.
Tuesday: 2:30 pm to 4 pm and by appointment
Official Course Name: Programming for Data Analysis (BMIN-GA 1005/BMSC-GA 4486)
Meeting Schedule: Every Tuesday and Thursday from Sept. 7 through Nov. 16, 1 pm to 2:30 pm
Meeting Location: Translational Research Building (TRB), 227 E. 30th St., conference room 618, except:
September 28th: the class will be held in the Smilow Seminar Room on the 1st floor (Yellow Pathway) at Tisch Hospital, 550 1st Avenue.
By the end of this course, the student will exhibit an in-depth understanding of data science and analysis methods as well as proficiency in R. The student will produce a portfolio of data analysis projects from the course that demonstrates mastery of analysis and visualization methods, and will be equipped to analyze biomedical and genomic data sets. Another main objective of the course is to communicate statistical results correctly and effectively.
This course is designed to help students learn the R programming language and use it to conduct data science. We will study a wide range of topics, including handling and querying databases, exploratory and confirmatory analysis, and visualization in R. We will closely follow the book R for Data Science; however, the emphasis will be on working with biomedical data rather than the datasets used in the textbook.
This course does not have any pre-requisites.
Late/missed work: You must adhere to the due dates for all required submissions. If you miss a deadline, then you will not get credit for that assignment/post.
Incompletes: No "Incompletes" will be assigned for this course unless we are at the very end of the course and you have an emergency.
Responding to Messages: I will check e-mails daily during the week, and I will respond to course-related questions within 48 hours.
Announcements: I will make announcements throughout the semester by e-mail.
Make sure that your email address is updated; otherwise you may miss important emails from me.
Safeguards: Always back up your work in a safe place (an electronic file with a backup is recommended) and make a hard copy. Do not wait until the last minute to do your work. Allow time for deadlines.
Plagiarism: Plagiarism, the presentation of someone else's words or ideas as your own, is a serious offense and will not be tolerated in this class. The first time you plagiarize someone else's work, you will receive a zero for that assignment. The second time you plagiarize, you will fail the course with a notation of academic dishonesty on your official record.
Programming Assignments (40%)
Directed Insights (25%)
Final Project (35%)
1. R for Data Science by Garrett Grolemund & Hadley Wickham (available here)
2. R in Action by Robert I. Kabacoff
3. Several online tutorials (just type "R tutorial" in Google and follow the lead)
You need to document and demonstrate all aspects of data science foundations discussed in the class.
1. Correctly apply tools and techniques of data preparation and wrangling
    a. Handling missing data, joining, removing outliers, and other transformations
    b. Gathering, spreading data (if needed)
2. Use Exploratory Data Analysis and `dplyr` transformation methods to identify structure and correlations in the data
3. Formulate questions and possible ways of analysis and visualization
    a. Identify appropriate visualization methods for analysis of your data set
    b. Choose the right geoms for the questions at hand
4. Correctly interpret results of analysis (clinical/biological significance)
    a. Demonstrate domain specific knowledge of clinical data
    b. Propose a hypothesis based on visualization and results
    c. Compare the usefulness of the obtained results/conclusions
5. Formulate appropriate plans for validation, further analysis, or to collect additional data needed.
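As a rough illustration of items 1 and 2 above, the following minimal sketch handles missing data, removes an outlier, joins two tables, and summarizes with `dplyr`. The tables `patients` and `visits`, their columns, and the 300 mmHg cutoff are all hypothetical and are not taken from the HANES or MIMIC3 data sets; treat this as one possible shape for such an analysis, not the required approach.

```r
# Minimal sketch with made-up data (hypothetical names, not HANES or MIMIC3).
library(dplyr)

# Hypothetical patient table with one missing value and one implausible reading.
patients <- tibble(
  id  = 1:6,
  sex = c("F", "M", "F", "M", "F", "M"),
  sbp = c(118, 135, NA, 142, 121, 400)  # systolic blood pressure, mmHg
)

# Hypothetical lookup table used to demonstrate a join.
visits <- tibble(
  id     = 1:6,
  clinic = c("A", "A", "B", "B", "A", "B")
)

clean <- patients %>%
  filter(!is.na(sbp)) %>%        # drop rows with missing blood pressure
  filter(sbp < 300) %>%          # drop implausible readings (assumed cutoff)
  left_join(visits, by = "id")   # attach the clinic for each patient

# Group-wise summary as a first step of exploratory analysis.
summary_by_clinic <- clean %>%
  group_by(clinic) %>%
  summarise(n = n(), mean_sbp = mean(sbp))

print(summary_by_clinic)
```

From here, a natural next step for item 3 would be to plot `clean` with a geom suited to the question, for example a boxplot of `sbp` by `clinic`. Whether to drop or impute missing values, and where to set an outlier cutoff, are analysis decisions that should be justified in your write-up.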
Introduction to the course (Sept. 07, 2017) | Link
R fundamentals 01: Elementary data types (Sept. 12, 2017) | Presentation | Link
R fundamentals 02: Advanced data types and graphics (Sept. 14, 2017) | Presentation | Link
R fundamentals 03: Elements of prog., R Markdown (Sept. 19, 2017) | Presentation | Link
Data science fundamentals 01: Visualize and explore (Sept. 21, 2017) | Presentation | Link
Data science fundamentals 02: Transform and explore (Sept. 26, 2017) | Presentation | Link
Data science fundamentals 03: Exploratory data analysis (Sept. 28, 2017) | Presentation | Link
Data science fundamentals 04: Exploratory data analysis (Oct. 03, 2017) | Presentation | Link
Data science fundamentals 05: Wrangle data (Oct. 05, 2017) | Presentation | Link
Mid-term presentation 01 (Oct. 10, 2017)
Mid-term presentation 02 (Oct. 12, 2017)
Database 01: Analysis and modeling (Oct. 17, 2017) | Link
Database 02: SQL (Oct. 19, 2017) | Datasets
Database 03: SQL workshop (Oct. 24, 2017)
High-performance computing (Oct. 26, 2017) | Link
Data science fundamentals 06: Basic inference and linear regression (Oct. 31, 2017) | Presentation | Link
Data science fundamentals 07: Advanced modeling (Nov. 02, 2017) | Presentation | Link
Basic modeling workshop (Nov. 07, 2017)
Advanced modeling workshop (Nov. 09, 2017)
Project presentation 01 (Nov. 14, 2017)
Project presentation 02 (Nov. 16, 2017)
Homework #01, assigned - due date Sept. 27, 5pm
Homework #02, assigned - due date Oct. 05, 5pm
Homework #03, assigned - due date Oct. 18, 12am
Homework #04, assigned (this is the mid-term assessment for the final project) - due date Oct. 28, 12am
Homework #05, assigned - due date Nov. 16, 5pm
The final project is easy to state: obtain directed insights from data sets of your choice (listed below) using the Explore, Wrangle, Model, Program, and Communicate paradigm.
You are advised to become familiar with the HANES and MIMIC3 data sets (see below) and their formats right away. The first step in becoming a good data scientist is getting to know the data you are handling. The better you know your data, the better the patterns you can decipher.
You will be continuously assessed to make sure you are progressing towards your final submission. Please see the project page for more information.
New York City Health and Nutrition Examination Survey (HANES) | Original (SAS format) | CSV
New York City Health and Nutrition Examination Survey (HANES), Curated | CSV
MIMIC3, Admissions (10%) | CSV
MIMIC3, Fluid input events, CareVue (0.1%) | CSV
MIMIC3, Fluid input events, MetaVision (0.1%) | CSV
MIMIC3, Chart events 1 (0.01%) | CSV
MIMIC3, Prescriptions (0.1%) | CSV
MIMIC3, Other data sets | Link