Project Assesments


Reading and doing both constitute good project management. Final submission will be assessed based on your reading on data science as well as working towards your project. The first part will focus on your interest (and communicating) in doing data science and the second part will be aimed at identifying where you stand on the project, to help you meet the final delivarable. The third part is the final delivarable.


Part I: Reading/doing and presentation (due Oct. 12, 5pm)

On one of the two days, Oct. 10 or Oct. 12, 2017, you will be given about 10 minutes time to present on any of the following topics. You may have a maximum of 6 slides, within which you have to convey, in broadest terms possible, on what you learned on the topic. It could as well be analysis report and directed insights based on analysis tools (see below). Your time slot will be sent to you.

Grammar of graphics: A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the ``scatterplot’’) and gain insight into the deep structure that underlies statistical graphics\(^{1}\). Here is the original paper. You may apply grammar of graphics (ggplot) to your data set and explain the results.

KNIME: KNIME is an analytics platform and a leading open solution for data-driven innovation, designed for discovering the potential hidden in data, mining for fresh insights, or predicting new futures\(^{2}\). You may make a demo of KNIME or analyze your dataset using KNIME, describing the workflow. KNIME can be downloaded for free here.

Reading List: You may read any chapter in the books given in the section “Statistics and Algorithms Theory” listed here, and present it.

You need to submit the presentation in Github by Oct. 12, 5pm.


Part II: Mid-term project assessment (due Oct. 28, 12am)

You have to apply explore and wrangle paradigm on the data set of your choice and submit the results. Your analysis should reflect all aspects of exploration and wrangling. That is, all sorts of analysis explained in Data science fundamentals lectures 1 through 5. This will be your home work assignment # 04. The due date is Oct. 28 at 12am.


Part III: Final deliverable (due Nov. 30, 5pm)

Make an html document with directed insights based on Explore (E), Wrangle (W), Model(M), and Communicate (C) paradigm. You may use HANES/MIMIC3 datasets or any other datasets listed in the course page. Your analysis should contain each aspect of EWPMC paradigm. The more meticulous/thorough you are in your analysis, you’ll have a better chance of a good grade. The due date is Nov. 30 at 5pm.


Project Grading Rubric (20% each)


You need to document and demonstrate all aspects of data science foundations discussed in the class.

  1. Correctly apply tools and techniques of data preparation and wrangling

    1. Missing data handling, joining, or other transformations, removing outliers etc.

    2. Gathering, spreading data (if needed)

  2. Use Exploratory Data Analysis and dplyr transformation methods to identify structure and correlations in the data

  3. Formulate questions and possible ways of analysis and visualization

    1. Identify appropriate visualization methods for analysis of your data set

    2. Choose the right geoms for the questions at hand

  4. Correctly interpret results of analysis (clinical/biological significance)

    1. Demonstrate domain specific knowledge of clinical data

    2. Propose an hypothesis based on visualization and results

    3. Compare the usefulness of the obtained results/conclusions

  5. Formulate appropriate plans for validation, further analysis, or to collect additional data needed.