Good project management involves both reading and doing. The final submission will be assessed on your reading in data science as well as on your progress toward your project. The first part focuses on your interest in doing (and communicating) data science. The second part is aimed at identifying where you stand on the project, to help you meet the final deliverable. The third part is the final deliverable.
On one of two days, Oct. 10 or Oct. 12, 2017, you will be given about 10 minutes to present on one of the following topics. You may use a maximum of 6 slides, in which you should convey, in the broadest terms possible, what you learned about the topic. Alternatively, you may present an analysis report with directed insights based on the analysis tools (see below). Your time slot will be sent to you.
Grammar of graphics: A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics\(^{1}\). Here is the original paper. You may apply the grammar of graphics (ggplot) to your data set and explain the results.
KNIME: KNIME is an analytics platform and a leading open solution for data-driven innovation, designed for discovering the potential hidden in data, mining for fresh insights, or predicting new futures\(^{2}\). You may give a demo of KNIME, or analyze your dataset using KNIME and describe the workflow. KNIME can be downloaded for free here.
Reading List: You may read any chapter from the books in the “Statistics and Algorithms Theory” section listed here, and present it.
You need to submit the presentation on GitHub by Oct. 12 at 5pm.
You have to apply the explore-and-wrangle paradigm to a data set of your choice and submit the results. Your analysis should reflect all aspects of exploration and wrangling, that is, every kind of analysis explained in Data Science Fundamentals lectures 1 through 5. This will be your homework assignment #04. The due date is Oct. 28 at 12am.
Make an HTML document with directed insights based on the Explore (E), Wrangle (W), Model (M), and Communicate (C) paradigm. You may use the HANES/MIMIC3 datasets or any other dataset listed on the course page. Your analysis should cover each aspect of the EWPMC paradigm. The more meticulous and thorough your analysis, the better your chance of a good grade. The due date is Nov. 30 at 5pm.
You need to document and demonstrate all aspects of the data science foundations discussed in class.
Correctly apply tools and techniques of data preparation and wrangling
Handling missing data, joining or other transformations, removing outliers, etc.
Gathering and spreading data (if needed)
Use Exploratory Data Analysis and dplyr transformation methods to identify structure and correlations in the data
Formulate questions and possible ways of analysis and visualization
Identify appropriate visualization methods for analysis of your data set
Choose the right geoms for the questions at hand
Correctly interpret results of analysis (clinical/biological significance)
Demonstrate domain specific knowledge of clinical data
Propose a hypothesis based on visualization and results
Compare the usefulness of the obtained results/conclusions
Formulate appropriate plans for validation, for further analysis, or for collecting any additional data needed.
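As a rough illustration of the wrangling and exploration items above (missing data, gathering, outliers, correlations), here is a minimal sketch. It is written in plain Python with only the standard library, not in the dplyr/ggplot tooling named in the list, so treat it as a conceptual outline rather than the expected solution. The records, the column names (`sbp_visit1`, `sbp_visit2`), and the helper functions are all hypothetical, and median imputation and the 1.5 × IQR rule are just one possible choice for each step.

```python
from statistics import mean, median, quantiles

# Hypothetical wide-format records: one row per patient, one column per visit.
# None marks a missing measurement. All values are made up for illustration.
wide = [
    {"id": "p1", "age": 34, "sbp_visit1": 118.0, "sbp_visit2": 121.0},
    {"id": "p2", "age": 51, "sbp_visit1": None, "sbp_visit2": 135.0},
    {"id": "p3", "age": 45, "sbp_visit1": 128.0, "sbp_visit2": 130.0},
    {"id": "p4", "age": 62, "sbp_visit1": 142.0, "sbp_visit2": 139.0},
    {"id": "p5", "age": 29, "sbp_visit1": 112.0, "sbp_visit2": 300.0},  # implausible value
    {"id": "p6", "age": 58, "sbp_visit1": 136.0, "sbp_visit2": 138.0},
]


def impute_median(rows, col):
    """Return copies of rows with None in `col` replaced by the observed median."""
    fill = median(r[col] for r in rows if r[col] is not None)
    return [{**r, col: fill if r[col] is None else r[col]} for r in rows]


def gather(rows, id_cols, value_cols, key, value):
    """Reshape wide rows to long form: one output row per (row, value column)."""
    return [
        {**{c: r[c] for c in id_cols}, key: vc, value: r[vc]}
        for r in rows
        for vc in value_cols
    ]


def drop_outliers_iqr(rows, col, k=1.5):
    """Keep rows whose `col` lies within k * IQR of the first and third quartiles."""
    q1, _, q3 = quantiles([r[col] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[col] <= high]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# 1. Handle missing data. 2. Gather to long form. 3. Remove outliers. 4. Explore.
imputed = impute_median(wide, "sbp_visit1")
long_rows = gather(imputed, ["id", "age"], ["sbp_visit1", "sbp_visit2"], "visit", "sbp")
clean = drop_outliers_iqr(long_rows, "sbp")
r_age_sbp = pearson([r["age"] for r in imputed], [r["sbp_visit1"] for r in imputed])

print(len(long_rows), "long rows,", len(clean), "after outlier removal")
print("correlation of age with first-visit reading:", round(r_age_sbp, 3))
```

Each step returns new rows instead of modifying its input, so intermediate results stay available for documenting what was changed and why. In an actual submission, each choice (how to impute, what counts as an outlier) would need to be justified against the clinical meaning of the variable.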