SAS-macros for the statistical analysis of case-control studies in a two-phase design

Double sampling, also known as two-phase sampling, is a standard technique for stratification (Breslow, 2005). At phase 1, the investigator draws a random sample from the population to measure the covariates needed for stratification. At phase 2, random subsamples of varying size are drawn from within each stratum to collect additional covariates for the selected subjects. Using larger sampling ratios for the most informative strata, e.g. in a balanced sampling design (see Figure 1), enhances the efficiency of estimates of population parameters. The two-phase case-control study embodies a stratified sampling design in which the strata depend on the outcome. Case-control studies in epidemiology typically sample a large fraction of the incident cases of disease and a much smaller fraction of disease-free controls to evaluate the association between disease outcome and risk factors.

The following three examples illustrate fields of application for case-control studies in a stratified two-phase design. White (1982) proposed studying the association between a rare exposure and a rare disease by sampling at phase 2 on the basis of both disease and exposure status. A second example arises from studies of gene-environment interaction (Andrieu and Goldstein, 1998), where efficiency is enhanced by limiting expensive genotyping to a sample stratified both by disease and by family history or a rare environmental exposure. A third example is the validation study. Here, subsamples of cases and controls are drawn to make error-free measurements, so that parameter estimates may be adjusted for the attenuation caused by measurement error (Breslow and Holubkov, 1997).
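The two-phase design described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not part of the SAS-macro package. The function name `two_phase_sample`, the field names `is_case` and `level`, and the artificial phase-1 data are all hypothetical. The sketch also assumes that "balanced" means aiming for the same number of phase-2 subjects in every stratum, which is only one possible reading of a balanced design.

```python
import random
from collections import defaultdict

def two_phase_sample(phase1, stratum_of, n_phase2, seed=0):
    """Draw a phase-2 subsample with (roughly) equal size per stratum.

    Assumption: equal allocation per stratum; strata smaller than the
    target are taken in full, and leftover slots are not redistributed.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for subject in phase1:
        strata[stratum_of(subject)].append(subject)

    target = n_phase2 // len(strata)  # equal allocation, rounded down
    sample = {
        key: rng.sample(members, min(target, len(members)))
        for key, members in strata.items()
    }
    return dict(strata), sample

# Artificial phase-1 data: 150 cases and 350 controls, each with a
# hypothetical four-level stratification covariate (8 strata in total).
phase1 = [{"id": i, "is_case": i < 150, "level": i % 4} for i in range(500)]

strata, sample = two_phase_sample(
    phase1,
    stratum_of=lambda s: (s["is_case"], s["level"]),
    n_phase2=100,  # roughly a 20% subsample of phase 1
)

# Phase-2 sampling fraction per stratum (sampled / available).
fractions = {key: len(sample[key]) / len(strata[key]) for key in strata}

for key in sorted(strata):
    print(key, len(strata[key]), len(sample[key]), round(fractions[key], 3))
```

With equal allocation, the smaller case strata end up with higher sampling fractions than the larger control strata. An analysis of the phase-2 data has to account for these unequal fractions.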

We offer a SAS-macro package for the statistical analysis of case-control studies in a stratified two-phase design (Schill et al., 2014), which implements various methods to obtain effect estimates. In addition, we provide SAS-macros to calculate attributable fractions and to conduct automated covariate selection. As an extension, we present a SAS-macro that obtains effect estimates in two-phase case-control studies without requiring a stratified sampling design (Scott and Wild, 2011). Instead, a parametric model for participation in phase 2 is assumed. All macros require at least SAS/STAT 9.22 and SAS/IML 9.2.

Figure 1: Exemplary case-control study with 150 cases and 350 controls in phase 1 and 8 strata (phase 2 is a 20% subsample of phase 1).


  • Andrieu N, Goldstein AM. Epidemiologic and genetic approaches in the study of gene-environment interaction: an overview of available methods. Epidemiologic Reviews. 1998;20(2):137-147.
  • Breslow NE. Case-control study, two-phase. In: Armitage P, Colton T, editors. Encyclopedia of Biostatistics. 2nd edition. Chichester, West Sussex, England; Hoboken, NJ: John Wiley & Sons. 2005.
  • Breslow NE, Holubkov R. Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. Journal of the Royal Statistical Society, Series B. 1997;59(2):447-461.
  • Schill W, Enders D, Drescher K. A SAS package for logistic two-phase studies. Journal of Statistical Software. 2014;57(9):1-13.
  • Scott A, Wild C. Fitting regression models with response-biased samples. The Canadian Journal of Statistics. 2011;39(3):519-536.
  • White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115(1):119-128.