MS in Biostatistics and Data Science

Preparing Students for the Data-Driven Challenges of Today's World

Our MS in Biostatistics and Data Science program provides top-class training in biostatistics and data science techniques that are essential to collect, manage, and analyze biomedical and health data.

Our coursework offers students a foundation for data science careers in health-related fields and beyond. 

Real-World Skills

We provide comprehensive hands-on training in statistical concepts and programming. During the MS in Biostatistics and Data Science program, students will:

  • Use state-of-the-art statistical and data science approaches to address modern data challenges.
  • Gain invaluable real-world exposure under the guidance of experienced biostatisticians and data scientists.
  • Build experience in the field through a faculty-mentored research project.
  • Take advantage of NYC’s proximity to leading educational institutions and some of the largest pharmaceutical hubs in the country.
  • Build close professional relationships with a diverse faculty through low student-to-faculty ratios.
  • Gain exposure to specializations such as health services research, cost-effectiveness research, and comparative-effectiveness research.

Unique Expertise

Our MS in Biostatistics and Data Science program is unique as it focuses on data mining and machine learning techniques yet retains the rigor of a traditional Biostatistics program.

Students from all over the world join this track with backgrounds in science (e.g., statistics, mathematics, biology), engineering, health, and medicine.

Graduates are prepared for data science careers in the public and private biomedical, healthcare, insurance and pharmaceutical sectors, both in academia and industry.

The MS in Biostatistics and Data Science program has close ties to other programs within the Weill Cornell Medical College and Cornell University, the Department of Statistics and Data Science at Cornell University, the Cornell Tech campus in New York City, and NewYork-Presbyterian. Students can complete the MS in Biostatistics and Data Science program in 16 months starting in Fall 2023. Students must complete at least 36 credits to graduate.

Program Director

Xi Kathy Zhou, PhD, MS

BDS Student - Recommended Curriculum Progression

Starting in Fall 2023, students are advised to follow the schedule below to ensure eligibility for graduation. The Education Team will monitor progression, but it is ultimately each student’s responsibility to track their own progress and ensure they meet graduation requirements. Course offerings and course availability are subject to change.

Fall Term 1

Typical course load is 12 credits

Biostatistics I with R Lab (HBDS 5005) - Required

Course Director: Xi Kathy Zhou, PhD
4 credits

This course provides an introduction to important topics in biostatistical concepts and reasoning. Specific topics include tools for describing central tendency and variability in data, probability distributions, sampling distributions, estimation, and hypothesis testing. Assignments will involve computation using the R programming language.

Study Design (HBDS 5015) - Required

Course Director: Linda Gerber, PhD
1.5 credits

The course will describe and apply measures of disease incidence and prevalence, and measures of effect; explain the basic principles underlying different study designs, including descriptive, ecological, cross-sectional, cohort, case-control, and intervention studies; assess strengths and limitations of different study designs; identify problems in interpreting epidemiological data (chance, bias, confounding, and effect modification); and address validity, intra-rater reliability, and inter-rater reliability.

Categorical and Censored Data Analysis (HBDS 5016) - Required

Course Director: Oleksandr Savenkov, PhD
1.5 credits

The course will describe methods for categorical data analysis and introduce basic concepts for censored data, including the Kaplan-Meier estimator. Students will learn how to select appropriate methods and how to interpret the results of categorical data analysis and Kaplan-Meier analysis.

Data Science I (R and Python) (HBDS 5018) - Required

Course Director: Wenna Xi, PhD
3 credits

This course provides an introduction to data science using both the R and Python programming languages. In this course students will gain experience working directly with data to pose and answer questions. The course will be divided into two parts: the first part will be taught in R and the second in Python. Topics covered include reproducible research, exploratory data analysis, data manipulation, data visualization techniques, simulation design, and unsupervised learning methods.

Master’s Project 1 and Professional Development (HCPR 9010) - Required

Course Director: Faculty
2 credits

This is the culminating capstone course of all masters-level graduate education programs. It has two aims: (1) helping students to discover and develop new and effective ways of managing and working together with all the stakeholders within the healthcare field and (2) helping accelerate a student's development of the context awareness, integrative management, and industry skills that are needed to lead in a rapidly changing healthcare sector. This capstone course puts students in a new organization, one they don’t already know well, and gives them the chance to practice hitting the ground running. This culminating course provides a deeper preparation for the next stages of a student's career. The capstone project will last the entire year: the first term involves matching students with the right project, the second term has students working with their client, and the third term consists of a detailed report and final presentation in front of the client as well as faculty and fellow classmates.

Statistical Programming with SAS (HBDS 5011) - Recommended Elective

Course Director: Zhengming Chen, PhD, MPH, MS
3 credits

This course provides an introduction to the statistical software SAS. Students will receive hands-on exposure to data management and report generation with one of the most popular statistical software packages.

Intro to Health Services Research (HBDS 5002) - Elective

Course Director: Jiani Yu, PhD
3 credits

This course is designed to introduce students to the fundamentals of health services research. Health services research is the discipline that evaluates interventions designed to improve healthcare. These interventions can include changes to the organization, delivery, and financing of healthcare and various healthcare policies. Common outcome measures in health services research include (but are not limited to) patient safety, healthcare quality, healthcare utilization, and cost. Specific topics to be covered in this course include refining your research question, identifying common research designs and their strengths and weaknesses, minimizing bias and confounding, selecting data sources, optimizing measurement, and more. There will also be a component of the course that explores how to present your ideas and iteratively refine your work based on feedback from peers and reviewers. This course includes both lectures and interactive group discussions. Students will be able to apply the methods learned in this course to their master’s research projects.

Spring Term 1

Typical course load is 12 or 15 credits

Biostatistics II - Regression Analysis (HBDS 5008) - Required

Course Director: Samprit Banerjee, PhD, MStat
3 credits

The focus of this course is the theory and application of different types of regression analysis. Topics will include linear regression, logistic regression, and Cox proportional hazards regression. Additional topics will include coding of explanatory variables, residual diagnostics, model selection techniques, random effects and mixed models, and maximum likelihood estimation. Homework assignments will involve computation using the R statistical package.

Master’s Project 2 (HCPR 9020) - Required

Course Director: Faculty
3 credits

This is the culminating capstone course of all masters-level graduate education programs. It has two aims: (1) helping students to discover and develop new and effective ways of managing and working together with all the stakeholders within the healthcare field and (2) helping accelerate a student's development of the context awareness, integrative management, and industry skills that are needed to lead in a rapidly changing healthcare sector. This capstone course puts students in a new organization, one they don’t already know well, and gives them the chance to practice hitting the ground running. This culminating course provides a deeper preparation for the next stages of a student's career. The capstone project will last the entire year: the first term involves matching students with the right project, the second term has students working with their client, and the third term consists of a detailed report and final presentation.

Data Management (SQL) (HBDS 5021) - Recommended Elective

Course Director: Debra D’Angelo
3 credits

This course covers tools that students will need to create, manage and maximize value from big databases. The emphasis is on design and implementation of relational databases and the use of Structured Query Language (SQL). At the end of this course, students will be able to explain the requirements for handling large and complex datasets; be able to design, build, and query a relational database; and understand how relational databases and big-data targeted tools complement one another.

Big Data in Medicine (HBDS 5020) - Recommended Elective

Course Director: Samprit Banerjee, PhD, MStat 
3 credits

There has been an explosion of big data in medicine and healthcare. There are four main sources of such big data: 1) administrative databases in healthcare, such as electronic health records and health insurance claims; 2) biomedical imaging (e.g., MRI, CT scan, X-ray); 3) sensors in smartphones, wearable devices, and implantable devices; and 4) genetics and genomics. It is difficult to navigate and critically assess the statistical methods and analytic tools that are needed to conduct analytics and research with such big biomedical data. This course will introduce these four sources of big data in medical studies, discuss the nuances and intricacies of how such data are generated, and introduce tools to navigate, visualize, and describe such databases.

Modern Methods for Causal Inference (HBDS 5017) - Recommended Elective

Course Director: Ivan Diaz, PhD
3 credits

The goal of this course is to introduce a core set of modern statistical concepts and techniques to the students, and to demonstrate how to use them to answer complex research questions in healthcare. The students will acquire knowledge on causal inference methods using machine learning, including directed acyclic graphs, non-parametric structural equation models, inverse probability weighting, g-computation, survival analysis, marginal structural models, longitudinal data, mediation analyses, effect modification, and precision medicine. This course will use the free software R to perform all statistical analysis.

Artificial Intelligence in Medicine (HINF 5012) - Elective

Course Director: Fei Wang, PhD
3 credits

Introduces students to a variety of analytic methods for health data using computational tools. The course covers topics in data mining, machine learning, classification, clustering and prediction. Students engage in hands-on exercises using a popular collection of data mining algorithms.

Health Data for Research (SAS) (HPEC 5003) - Elective

Course Director: Mark Unruh, PhD
3 credits

Addresses challenges in the use of electronic clinical data for research purposes, such as electronic health records, clinical data warehouses, electronic prescribing, clinical decision support systems and health information exchange. Students will learn how clinical processes generate data in these different systems, the tasks required to obtain data for research purposes and steps to prepare data for analysis. Examples of research uses of clinical data will be drawn from case studies in the literature. Students will acquire skills in data review, preparation and analysis through hands-on experience with clinical data.

Summer Term 1

Typical course load is 3 credits 

Master’s Project 3 (HCPR 9030) - Required

Course Director: Faculty
3 credits

This is the culminating capstone course of all masters-level graduate education programs. It has two aims: (1) helping students to discover and develop new and effective ways of managing and working together with all the stakeholders within the healthcare field and (2) helping accelerate a student's development of the context awareness, integrative management, and industry skills that are needed to lead in a rapidly changing healthcare sector. This capstone course puts students in a new organization, one they don’t already know well, and gives them the chance to practice hitting the ground running. This culminating course provides a deeper preparation for the next stages of a student's career. The capstone project will last the entire year: the first term involves matching students with the right project, the second term has students working with their client, and the third term consists of a detailed report and final presentation in front of the client as well as faculty and fellow classmates.

Fall Term 2

Typical course load is 6 or 9 credits

Data Science II – Statistical Learning (HBDS 5014) - Required

Course Director: Samprit Banerjee, PhD, MStat
3 credits

The course starts with logistic regression and discriminant analysis, with emphasis on classification and prediction. It then covers more advanced topics such as regularized regression, resampling methods, tree-based methods, and support vector machines.

Design & Analysis of Biomedical Studies (HBDS 5013) - Recommended Elective

Course Director: Faculty
3 credits

The course covers topics important in the application of statistical methods and relevant statistical software packages (primarily R) to biomedical studies, with an emphasis on applications in the design and analysis related to biomedical experiments, clinical trials and observational studies. The course uses real-world case studies to introduce commonly used experimental designs in biomedical research and discuss a variety of statistical methods and analytic tools for analyzing data generated from those studies. The course promotes good statistical/analytical practice through the introduction of several widely adopted reporting guidelines and tools for carrying out reproducible data analysis. The course aims to help students develop expertise in applying statistical methods and analytical tools, including developing their own R packages, to solve the design and data analysis challenges in biomedical studies and beyond.

Pharmaceutical Statistics (HBDS 5019) - Recommended Elective

Course Director: Faculty 
3 credits

Pharmaceutical studies use many statistical methods that are not routinely taught as part of conventional biostatistics courses. In this course, students will learn the statistical methods specifically used in pharmaceutical studies. The course is divided into three modules. (1) “Statistical Aspects of Phase I Clinical Trial” will include the 3+3 design, accelerated titration, up-and-down designs, the continual reassessment method (CRM), modified CRM, TITE-CRM, the Bayesian logistic regression model (BLRM), escalation with overdose control (EWOC), the toxicity probability interval (TPI), and the modified TPI (mTPI). (2) “Statistical Aspects of Phase II Clinical Trial” will include design and analysis for one-stage designs, Simon’s two-stage designs, and multi-arm Phase II designs. (3) “Statistical Aspects of Phase III Clinical Trial” will include randomization and the design and analysis of parallel, crossover, factorial, seamless Phase II/III, adaptive, and SMART designs.

Hierarchical Modeling & Longitudinal Data Analysis (HBDS 5010) - Required

Course Director: Arindam RoyChoudhury, PhD
3 credits

An independent biostatistician often encounters data collected on patients over a length of time, or data that are otherwise clustered. This course will give students the tools necessary to analyze such data, while building on the core biostatistics material they have learned in other courses. Specifically, students will learn to use mixed-effects models, mixed-effects ANOVA, generalized linear mixed models (GLMM), mixed-effects Cox regression, Bayesian hierarchical models, and repeated-measures and longitudinal data analysis with appropriate covariance structures.

Study Designs & Comparative Effectiveness (HPEC 5006) - Elective

Course Director: Alvin Mushlin, MD, ScM
3 credits

This course will cover the conceptual underpinnings, the policy context, and the methods for comparative effectiveness research (CER), highlighting key issues and controversies. It will provide students with an understanding of the analytic methods and data resources used to conduct comparative effectiveness research. Topics that will be discussed include observational studies, risk adjustment, propensity score matching, instrumental variables, meta-analysis and systematic reviews, and the use of clinical registries and electronic health record data. Although this is not a statistics course, students will learn why comparative research has come to prominence, what constitutes good comparative effectiveness research, the main methods used, and the advantages and disadvantages of each. Sessions will consist of lectures from the instructors and experts on selected topics, as well as student discussions and presentations.