Causal discovery for cohort data


This project capitalised on methods of causal discovery to supplement standard statistical analyses, paving the way to fully exploit the wealth of information provided by cohort data. Cohort studies are a valuable resource for researchers, e.g. in epidemiology or sociology, who study life-course developments in order to understand the relation between early exposures and later outcomes. Causal discovery is a field at the intersection of computer science and statistics. In its idealised form, it takes a dataset as input and outputs a graphical representation of the causal structure among the variables in the dataset, albeit relying on very specific assumptions. These methods have attracted much attention, especially in the context of big data. With this project we filled crucial gaps in theory and software specifically targeted at, and suitable for, cohort data. In particular we:
  • (1) formulated and investigated a new class of causal models, cohort causal graphs (CCGs), and developed suitable and efficient model selection algorithms;
  • (2) extended these new statistical approaches to address the particular challenges causal discovery faces with typical cohort data, especially that of missing values; here we combined multiple imputation with the so-called PC-algorithm and showed the superiority of the method to previous practices;
  • (3) developed guidelines, including recommendations and caveats, as well as user-friendly software for practical applications, which now enable wide dissemination of the new methodology;
  • (4) successfully applied the new methodology in practical examples: (i) we investigated a genetic network underlying the protein 53 signalling pathway, which plays an important role in head-and-neck squamous cell carcinoma; (ii) we analysed the IDEFICS/I-Family cohort data and found evidence for possible new indirect causal pathways.
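
To illustrate the idea in point (2), the following is a minimal sketch of how a single conditional independence test, the building block of the PC-algorithm, might be pooled over several imputed datasets using Rubin's rules. It is not the project's implementation: the function names (`pearson`, `partial_corr`, `pooled_fisher_z`, `demo`) are hypothetical, the data are simulated, only one conditioning variable is handled, and the "imputation" is a toy stand-in that draws missing values from the known simulation model instead of using a real multiple-imputation procedure.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Partial correlation of x and y given a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def pooled_fisher_z(imputed, n_cond=1):
    """Pool Fisher z statistics over imputed datasets with Rubin's rules.

    `imputed` is a list of (x, y, z) column triples, one per imputed dataset.
    Returns the pooled test statistic and its approximate degrees of freedom.
    """
    m = len(imputed)
    n = len(imputed[0][0])
    zs = [math.atanh(partial_corr(x, y, z)) for x, y, z in imputed]
    z_bar = sum(zs) / m
    within = 1.0 / (n - n_cond - 3)  # sampling variance of Fisher's z
    between = sum((v - z_bar) ** 2 for v in zs) / (m - 1)  # across imputations
    total = within + (1 + 1 / m) * between
    stat = z_bar / math.sqrt(total)
    if between > 0:
        df = (m - 1) * (1 + within / ((1 + 1 / m) * between)) ** 2
    else:
        df = math.inf
    return stat, df


def demo(seed=1, n=500, m=10):
    """Simulate data, delete 10% of x, impute m times, and pool two tests."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y_indep = [zi + rng.gauss(0, 1) for zi in z]  # independent of x given z
    y_dep = [xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]  # depends on x
    missing = set(rng.sample(range(n), n // 10))
    # Toy imputation (assumption): redraw missing x from the simulation model,
    # using z only. A real analysis would use a proper imputation model.
    imputed_x = [
        [z[i] + rng.gauss(0, 1) if i in missing else x[i] for i in range(n)]
        for _ in range(m)
    ]
    indep = pooled_fisher_z([(xi, y_indep, z) for xi in imputed_x])
    dep = pooled_fisher_z([(xi, y_dep, z) for xi in imputed_x])
    return indep, dep


if __name__ == "__main__":
    (stat_i, df_i), (stat_d, df_d) = demo()
    print("x vs y_indep given z:", stat_i, df_i)
    print("x vs y_dep given z:", stat_d, df_d)
```

In a PC-type search, the pooled statistic would be compared against a t-distribution with the returned degrees of freedom to decide whether an edge is removed; that step is omitted here to keep the sketch free of external dependencies.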

The project was an important and successful enterprise because causal discovery takes a radically different approach from traditional statistical analyses. It therefore has the potential to generate genuinely novel insights, including valuable suggestions for follow-up intervention studies, for example to inform advice for parents and teachers of overweight children. The project thus ultimately contributes to better-informed public health policies and medical decision making.

Funding period

Begin: January 2018
End: May 2021


  • German Research Foundation


Prof. Dr. rer. nat. Vanessa Didelez

Selected project-related publications

    Articles with peer-review

  • Witte J, Foraita R, Didelez V. Multiple imputation and test-wise deletion for causal discovery with incomplete cohort data. Statistics in Medicine. 2022;41(23):4716-4743.
  • Foraita R, Friemel J, Günther K, Behrens T, Bullerdiek J, Nimzyk R, Ahrens W, Didelez V. Causal discovery of gene regulation with incomplete data. Journal of the Royal Statistical Society. Series A (Statistics in Society). 2020;183(4):1747-1775.
  • Witte J, Henckel L, Maathuis MH, Didelez V. On efficient adjustment in causal graphs. Journal of Machine Learning Research. 2020;21(246):1-45.
  • Witte J, Didelez V. Covariate selection strategies for causal inference: Classification and comparison. Biometrical Journal. 2019;61(5):1270-1289.

    Presentations at scientific meetings/conferences (invited)

  • Foraita R, Witte J, Didelez V. Causal discovery with cohort data. Institute colloquium of the Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI), Universität Mainz, 9 February 2023, Mainz.
  • Didelez V, Witte J, Foraita R. Causal and graphical modelling in epidemiology. Statistische Woche, annual meeting of the Deutsche Statistische Gesellschaft (DStatG), 14-17 September 2021, online talk.
  • Didelez V, Witte J. Kovariablen-Selektion für kausale Inferenz - Verschiedene Ansätze im Vergleich. Institute colloquium of the Institut für Medizinische Biometrie, Epidemiologie und Informatik (IMBEI), Universität Mainz, 10 December 2020, online talk.
  • Witte J, Didelez V. Kovariablen-Selektion für kausale Inferenz: Verschiedene Ansätze im Vergleich. Colloquium "Statistische Methoden in der empirischen Forschung" of the Institut für Veterinär-Epidemiologie und Biometrie, Freie Universität Berlin, 11 December 2018, Berlin.

    Software

  • Witte J, Foraita R. R package tpc. (Version 1.0); 2023.
  • Foraita R, Witte J. Multiple imputation in causal graph discovery. (Version 1.1.0); 2022.
  • Witte J. tPC - Causal discovery with temporal background knowledge. (Version 1.0.0); 2021.
  • Foraita R. micd (Multiple Imputation in Causal Graph Discovery). R package. (Version 0.2.0); 2019.