Average hazard ratio and sample size under non-proportional hazards

Introduction

This document demonstrates applications of the average hazard ratio concept in fixed designs without interim analysis. Throughout we consider a 2-arm trial with an experimental and control group and a time-to-event endpoint. Testing for differences between treatment groups is performed using the stratified logrank test. In this setting, the gsDesign2::ahr() routine provides an average hazard ratio that can be used to derive sample size using the function gsDesign::nSurv(). The approach assumes piecewise constant enrollment rates and piecewise exponential failure rates with the option of including multiple strata. This allows the flexibility to approximate a wide variety of scenarios. We evaluate the approximations used via simulation using the simtrial package; we specifically provide a simulation routine so that any changes specified by the user can be easily incorporated. We consider both non-proportional hazards for a single stratum and multiple strata with different underlying proportional hazards assumptions.

There are two things to note regarding differences between simtrial::simfix() and gsDesign2::ahr():

simtrial::simfix() is less flexible in that it requires that all strata be enrolled at the same relative rates throughout the trial, whereas gsDesign2::ahr() allows, for example, enrollment to start or stop at different times in different strata. In this document, we use the more restrictive parameterization of simtrial::simfix() so that we can confirm the asymptotic sample size approximation based on gsDesign2::ahr() by simulation.
simtrial::simfix() provides more flexibility in test statistics used than gsDesign2::ahr() as documented in the pMaxCombo vignette demonstrating use of Fleming-Harrington weighted logrank tests and combinations of such tests.

Document organization

This vignette is organized as follows:

Two non-proportional hazards examples are introduced for fixed sample size approximation, one with a single stratum and one with three strata.
- The single stratum design assumes a delayed treatment benefit.
- The stratified example assumes different proportional hazards in 3 strata.
Each of these examples has the following subsections:
- Description of the design scenario.
- Deriving an average hazard ratio.
- Deriving sample size based on average hazard ratio.
- Computing and plotting the average hazard ratio as a function of time.
- Simulation to verify that the sample size approximation provides the targeted power.

Each simulation is done with data cutoff performed in 5 different ways:

Based on targeted trial duration
Based on targeted minimum follow-up duration only
Based on targeted event count only
Based on the maximum of targeted event count and targeted trial duration
Based on the maximum of targeted event count and targeted minimum follow-up

The method based on waiting to achieve targeted event count and targeted minimum follow-up appears to be both practical and to provide the targeted power.
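
The five cutoff rules above can be sketched in base R. This is a minimal illustration, not the simtrial implementation: the helper cut_dates() and its argument names are hypothetical, and it assumes that minimum follow-up is counted from the last enrollment.

```r
# Hypothetical helper (not part of simtrial): calendar time of the data
# cutoff under each of the five rules listed above.
# enroll_time: calendar time at which each participant enrolls
# event_time:  calendar time at which each participant has an event
# Assumption: minimum follow-up is counted from the last enrollment.
cut_dates <- function(enroll_time, event_time, target_event,
                      planned_duration, min_followup) {
  # Calendar time of the target_event-th event
  event_cut <- sort(event_time)[target_event]
  # Calendar time at which the last enrollee reaches minimum follow-up
  followup_cut <- max(enroll_time) + min_followup
  c(
    planned_duration = planned_duration,
    min_followup = followup_cut,
    target_event = event_cut,
    max_event_duration = max(event_cut, planned_duration),
    max_event_followup = max(event_cut, followup_cut)
  )
}

# Toy data: 6 participants, all with an observed event
enroll_time <- 0:5
event_time <- enroll_time + c(10, 3, 8, 2, 9, 4)
cuts <- cut_dates(enroll_time, event_time,
  target_event = 4, planned_duration = 12, min_followup = 6
)
cuts
```

The last two rules can only delay the cutoff relative to the event-based rule, which is why they tend to accumulate at least the targeted number of events.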

Initial setup

We begin by setting two parameters that will be used throughout in simulations used to verify accuracy of power approximations; either could be customized for each simulation. First, we set the number of simulations to be performed. You can increase this to improve accuracy of simulation estimates of power.

nsim <- 2000

Simulations using the simtrial::simfix() routine below use blocked randomization. We set that here and do not change it for individual simulations. Since randomization is balanced within each block, we set the randomization ratio of experimental to control to 1.

block <- rep(c("Control", "Experimental"), 2)
ratio <- 1

We load packages needed below.

gsDesign is used for its implementation of the Schoenfeld (1981) approximation to compute the number of events required to power a trial under the proportional hazards assumption.
dplyr and tibble to work with tabular data and the ‘data wrangling’ approach to coding.
simtrial to enable simulations.
survival to enable Cox proportional hazards estimation of the (average) hazard ratio for each simulation to compare with the approximation provided by the gsDesign2::ahr() routine that computes an expected average hazard ratio for the trial (Kalbfleisch and Prentice (1981), Schemper, Wakounig, and Heinze (2009)).
Hidden underneath this is the gsDesign2::eEvents_df() routine that provides expected event counts for each period and stratum where the hazard ratio differs. This is the basic calculation used in the gsDesign2::ahr() routine.

library(gsDesign)
library(gsDesign2)
library(ggplot2)
library(dplyr)
library(tibble)
library(survival)
library(gt)

Single stratum non-proportional hazards example

Design scenario

We set up the design parameters for the first scenario. Enrollment ramps up over the course of the first 4 months, followed by steady-state enrollment thereafter. This will be adjusted proportionately to power the trial later. The control group has a piecewise exponential distribution with a failure rate corresponding to a median of 9 months for the first 3 months and a median of 18 months thereafter. The hazard ratio of the experimental group versus control is 1 for the first 3 months and 0.55 thereafter.

# Note: this is done differently for multiple strata; see below!
enroll_rate <- define_enroll_rate(
  duration = c(2, 2, 10),
  rate = c(3, 6, 9)
)

fail_rate <- define_fail_rate(
  duration = c(3, 100),
  fail_rate = log(2) / c(9, 18),
  dropout_rate = .001,
  hr = c(1, .55)
)

total_duration <- 30
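
To make the piecewise exponential assumption concrete, the following base R sketch computes survival probabilities implied by the failure rates above. The helper pw_surv() is hypothetical (it is not a gsDesign2 function), and the sketch ignores dropout.

```r
# Hypothetical helper: survival probability at time t under piecewise
# constant hazards, where hazard[i] applies for duration[i] months
pw_surv <- function(t, duration, hazard) {
  breaks <- c(0, cumsum(duration))
  # Time at risk spent in each interval up to t
  exposure <- pmax(0, pmin(t, breaks[-1]) - breaks[-length(breaks)])
  exp(-sum(exposure * hazard))
}

duration <- c(3, 100)
control_hazard <- log(2) / c(9, 18) # rates for medians of 9 and 18 months
experimental_hazard <- control_hazard * c(1, 0.55) # HR of 1, then 0.55

# Survival at 12 months in each arm (dropout ignored)
s_control <- pw_surv(12, duration, control_hazard)
s_experimental <- pw_surv(12, duration, experimental_hazard)
c(control = s_control, experimental = s_experimental)
```

The two arms have identical survival through month 3 and separate only afterwards, which is the delayed treatment benefit described above.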

Since there is a single stratum, we set strata to the default:

strata <- tibble::tibble(stratum = "All", p = 1)

Computing average hazard ratio

We compute an average hazard ratio using the gsDesign2::ahr() (average hazard ratio) routine. The result is for the enrollment rates given above, which we will adjust proportionately when the sample size is computed below. Since relative enrollment timing does not change under that adjustment, the average hazard ratio will not change. Approximations of statistical information under the null (info0) and alternate (info) hypotheses are provided here. Recall that the parameterization here is in terms of \(\log(HR)\); thus the information is intended to approximate 1 over the variance of the Cox regression coefficient for treatment effect. This will be checked with simulation later.

avehr <- ahr(
  enroll_rate = enroll_rate,
  fail_rate = fail_rate,
  total_duration = as.numeric(total_duration)
)

avehr %>% gt()

time	ahr	n	event	info	info0
30	0.691405	108	58.13107	14.10216	14.53277

This result can be explained by the number of events observed before and after the first 3 months of treatment in each treatment group.

xx <- pw_info(
  enroll_rate = enroll_rate,
  fail_rate = fail_rate,
  total_duration = as.numeric(total_duration)
)
xx %>% gt()

time	stratum	t	hr	n	event	info	info0
30	All	0	1.00	12	22.24824	5.562060	5.562060
30	All	3	0.55	96	35.88283	8.540105	8.970708

Now we can replicate the geometric average hazard ratio (AHR) computed using the ahr() routine above. We compute the logarithm of each HR above and then compute a weighted average, weighting by the expected number of events under each hazard ratio. Exponentiating the resulting weighted average gives the geometric mean hazard ratio, which we label as AHR.

xx %>%
  summarize(AHR = exp(sum(event * log(hr) / sum(event)))) %>%
  gt()

AHR
0.691405
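
The same calculation can be checked in base R without any package, copying the expected event counts and hazard ratios from the table above. The variable names are illustrative only.

```r
# Expected events and hazard ratios copied from the pw_info() output above
event <- c(22.24824, 35.88283)
hr <- c(1, 0.55)

# Event-weighted mean of log(HR), then exponentiate
ahr_check <- exp(weighted.mean(log(hr), w = event))
ahr_check
```

Because about 62% of the expected events occur while the hazard ratio is 0.55, the AHR lies closer to 0.55 than to 1 on the log scale.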

Deriving the design

With this average hazard ratio, we call gsDesign::nEvents(), which uses the Schoenfeld (1981) approximation to derive a targeted number of events. All you need for this is the average hazard ratio from above, the randomization ratio (experimental/control), Type I error and Type II error (1 - power).

target_event <- gsDesign::nEvents(
  hr = avehr$ahr, # average hazard ratio computed above
  ratio = 1, # randomization ratio
  alpha = .025, # 1-sided Type I error
  beta = .1 # Type II error (1-power)
)

target_event <- ceiling(target_event)
target_event
#> [1] 309
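
For readers who want to see the arithmetic, here is a base R sketch of the standard Schoenfeld event formula. The function schoenfeld_events() is a hypothetical name, and the sketch assumes a null hazard ratio of 1; it is meant to illustrate the approximation, not to reproduce the internals of gsDesign::nEvents().

```r
# Hypothetical helper: Schoenfeld approximation for required events,
# assuming a one-sided test at level alpha and a null hazard ratio of 1
schoenfeld_events <- function(hr, ratio = 1, alpha = 0.025, beta = 0.1) {
  z <- qnorm(1 - alpha) + qnorm(1 - beta)
  # (1 + ratio)^2 / ratio equals 4 under 1:1 randomization
  (1 + ratio)^2 / ratio * (z / log(hr))^2
}

events_check <- schoenfeld_events(hr = 0.691405)
ceiling(events_check)
```

With the average hazard ratio of 0.691405, rounding up gives the same 309 events as reported above.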

We also proportionately increase the enrollment rates to achieve this targeted number of events; we round up the number of events required to the next higher integer.

# Update enroll_rate to obtain targeted events
enroll_rate$rate <- ceiling(target_event) / avehr$event * enroll_rate$rate
avehr <- ahr(
  enroll_rate = enroll_rate,
  fail_rate = fail_rate,
  total_duration = as.numeric(total_duration)
)

avehr %>% gt()

time	ahr	n	event	info	info0
30	0.691405	574.082	309	74.9611	77.25

We also compute sample size, rounding up to the nearest even integer.

# round up sample size in both treatment groups
sample_size <- ceiling(sum(enroll_rate$rate * enroll_rate$duration) / 2) * 2
sample_size
#> [1] 576
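
The scaling and rounding steps can be traced by hand in base R, using the original enrollment rates and the expected events reported earlier. The variable names here are illustrative.

```r
# Original enrollment: rates of 3, 6, 9 per month for 2, 2, 10 months
base_n <- sum(c(3, 6, 9) * c(2, 2, 10))
base_event <- 58.13107 # expected events under the original enrollment

# Scale enrollment so that expected events equal the target of 309
scale <- 309 / base_event
scaled_n <- base_n * scale

# Round up to an even integer so both arms have equal size
even_n <- ceiling(scaled_n / 2) * 2
c(base_n = base_n, scaled_n = scaled_n, even_n = even_n)
```

Since expected events are proportional to enrollment rates when timing is unchanged, a single multiplier is enough to hit the event target.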

Average hazard ratio and expected event accumulation over time

We examine the average hazard ratio as a function of trial duration with the modified enrollment required to power the trial. We also plot expected event accrual over time; although the graphs go through 40 months, recall that the targeted trial duration is 30 months. A key design consideration is selecting trial duration based on things like the degree of AHR improvement over time versus the urgency of completing the trial as quickly as possible, noting that the required sample size will decrease with longer follow-up.

avehrtbl <- ahr(
  enroll_rate = enroll_rate,
  fail_rate = fail_rate,
  total_duration = 1:(total_duration + 10)
)

ggplot(avehrtbl, aes(x = time, y = ahr)) +
  geom_line() +
  ylab("Average HR") +
  ggtitle("Average HR as a function of study duration")


ggplot(avehrtbl, aes(x = time, y = event)) +
  geom_line() +
  ylab("Expected events") +
  ggtitle("Expected event accumulation as a function of study duration")

Simulation to verify power

We use function simtrial::simfix() to simplify setting up and executing a simulation to evaluate the sample size derivation above. Arguments for simtrial::simfix() are slightly different than the set-up that was used for the gsDesign2::ahr() function used above. Thus, there is some reformatting of input parameters involved. One difference from the gsDesign2::ahr() parameterization in simtrial::simfix() is that block is provided to specify fixed block randomization as opposed to ratio for gsDesign2::ahr().

# Do simulations
# timingType = 1:5 evaluates all 5 data cutoff methods described above
results1 <- simtrial::simfix(
  nsim = nsim,
  block = block,
  sampleSize = sample_size,
  strata = strata,
  enroll_rate = enroll_rate,
  fail_rate = fail_rate,
  total_duration = total_duration,
  target_event = ceiling(target_event),
  timingType = 1:5
)

The following summarizes outcomes by the data cutoff chosen. Regardless of cutoff chosen, we see that the power approximates the targeted 90% quite well. The statistical information computed in the simulation is computed as one over the simulation variance of the Cox regression coefficient for treatment (i.e., the log hazard ratio).

# Loading the data saved previously
results1 <- readRDS("fixtures/results1.rds")
results1$Positive <- results1$Z <= qnorm(.025)
results1 %>%
  group_by(cut) %>%
  summarise(
    Simulations = n(), Power = mean(Positive), sdDur = sd(Duration), Duration = mean(Duration),
    sdEvents = sd(Events), Events = mean(Events),
    HR = exp(mean(lnhr)), sdlnhr = sd(lnhr), info = 1 / sdlnhr^2
  ) %>%
  gt() %>%
  fmt_number(column = 2:9, decimals = 3)

cut	Simulations	Power	sdDur	Duration	sdEvents	Events	HR	sdlnhr	info
Max(min follow-up, event cut)	2,000.000	0.895	0.983	30.560	7.050	314.226	0.692	0.116	73.83002
Max(planned duration, event cut)	2,000.000	0.895	0.910	30.551	7.090	314.163	0.692	0.116	73.94340
Minimum follow-up	2,000.000	0.888	0.495	30.024	11.621	310.147	0.694	0.117	73.27034
Planned duration	2,000.000	0.886	0.000	30.000	11.720	309.958	0.694	0.117	72.99295
Targeted events	2,000.000	0.880	1.595	29.824	0.000	309.000	0.694	0.118	71.83207
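
The rejection rule used to compute power above is a one-sided test at the 2.5% level, where a negative Z favors the experimental arm. The following base R sketch, with made-up Z values, shows that comparing Z to the normal quantile is equivalent to comparing its lower-tail p-value to 0.025, which is the form used for the stratified example later.

```r
# Illustrative test statistics (not simulation output)
z <- c(-2.5, -1.96, -1.5, 0.3)

# Rule used for results1: compare Z to the 2.5% normal quantile
reject_z <- z <= qnorm(0.025)
# Rule used for results2: compare the lower-tail p-value to 0.025
reject_p <- pnorm(z) <= 0.025

data.frame(z = z, reject_z = reject_z, reject_p = reject_p)
```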

The column HR above is the exponentiated mean of the Cox regression coefficients (geometric mean of HR). We see that the AHR approximation below matches the simulated HR above quite well. In the table below, the column info is the approximate statistical information under the alternate hypothesis, while info0 is the approximation under the null hypothesis. The value of info0 is 1/4 of the expected events calculated below. In this case, the information approximation under the alternate hypothesis (74.96) is slightly larger than the simulation-based estimates (about 72 to 74), which is consistent with the simulated power falling slightly below the targeted 90%. Nonetheless, the approximation for power appears quite good as noted above.

avehr %>% gt()

time	ahr	n	event	info	info0
30	0.691405	574.082	309	74.9611	77.25

Different proportional hazards by strata

Design scenario

We set up the design scenario parameters. We are limited here to simultaneous enrollment of strata since the simtrial::simfix() routine uses simtrial::simPWSurv() which is limited to this scenario. We specify three strata:

High risk: 1/3 of the population with median time-to-event of 6 months and a treatment effect hazard ratio of 1.2.
Moderate risk: 1/2 of the population with median time-to-event of 9 months and a hazard ratio of 1/3.
Low risk: 1/6 of the population that is essentially cured in both arms (median 100, HR = 1).

strata <- tibble::tibble(stratum = c("High", "Moderate", "Low"), p = c(1 / 3, 1 / 2, 1 / 6))

enroll_rate <- define_enroll_rate(
  stratum = c(array("High", 4), array("Moderate", 4), array("Low", 4)),
  duration = rep(c(2, 2, 2, 18), 3),
  rate = c((1:4) / 3, (1:4) / 2, (1:4) / 6)
)

fail_rate <- define_fail_rate(
  stratum = c("High", "Moderate", "Low"),
  duration = 100,
  fail_rate = log(2) / c(6, 9, 100),
  dropout_rate = .001,
  hr = c(1.2, 1 / 3, 1)
)

total_duration <- 36

Computing average hazard ratio

The enrollment rates above are already split by stratum prevalence, so we can compute the average hazard ratio for the stratified population directly.

ahr2 <- ahr(enroll_rate, fail_rate, total_duration)
ahr2 %>% gt()

time	ahr	n	event	info	info0
36	0.642733	84	53.41293	12.76869	13.35323

We examine the expected events by stratum.

xx <- pw_info(enroll_rate, fail_rate, total_duration)
xx %>% gt()

time	stratum	hr	n	event	info	info0
36	High	1.2000000	28	25.666089	6.4144810	6.4165222
36	Low	1.0000000	14	1.996737	0.4991842	0.4991842
36	Moderate	0.3333333	42	25.750105	5.8550281	6.4375262

Taking the average of log(HR) weighted by expected events and exponentiating, we get the overall AHR just derived.

xx %>%
  ungroup() %>%
  summarise(lnhr = sum(event * log(hr)) / sum(event), AHR = exp(lnhr)) %>%
  gt()

lnhr	AHR
-0.4420259	0.642733
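
As a package-free check, the same weighted average can be reproduced in base R from the stratum-level table above. The named vectors are illustrative; the values are copied from that table.

```r
# Expected events and hazard ratios by stratum, copied from the table above
event <- c(High = 25.666089, Low = 1.996737, Moderate = 25.750105)
hr <- c(High = 1.2, Low = 1, Moderate = 1 / 3)

# Event-weighted mean of log(HR) and the resulting AHR
lnhr <- sum(event * log(hr)) / sum(event)
ahr_check <- exp(lnhr)

# Share of expected events contributed by each stratum
round(event / sum(event), 3)
c(lnhr = lnhr, AHR = ahr_check)
```

The low-risk stratum contributes few events, so the AHR is driven almost entirely by the high-risk and moderate-risk strata.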

Deriving the design

We derive the sample size as before. We plan the sample size based on the average hazard ratio for the overall population and use that across strata. First, we derive the targeted events:

target_event <- gsDesign::nEvents(
  hr = ahr2$ahr,
  ratio = 1,
  alpha = .025,
  beta = .1
)
target_event <- ceiling(target_event)
target_event
#> [1] 216
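
The event target can also be read in the other direction: given a number of events, what power does the Schoenfeld approximation imply? The base R sketch below assumes 1:1 randomization, so that information is events divided by 4. The function schoenfeld_power() is a hypothetical name, not a gsDesign function.

```r
# Hypothetical helper: approximate power of a one-sided level-alpha logrank
# test given an event count and hazard ratio, assuming 1:1 randomization
schoenfeld_power <- function(event, hr, alpha = 0.025) {
  pnorm(sqrt(event / 4) * abs(log(hr)) - qnorm(1 - alpha))
}

power_check <- schoenfeld_power(event = 216, hr = 0.642733)
power_one_fewer <- schoenfeld_power(event = 215, hr = 0.642733)
c(events_216 = power_check, events_215 = power_one_fewer)
```

Under these assumptions 216 is the smallest event count with approximate power of at least 90%, in line with rounding up above.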

Next, we adapt enrollment rates proportionately so that the trial will be powered for the targeted failure rates and follow-up duration.

enroll_rate <- enroll_rate %>% mutate(rate = target_event / ahr2$event * rate)

ahr(
  enroll_rate = enroll_rate,
  fail_rate = fail_rate,
  total_duration = total_duration
) %>% gt()

time	ahr	n	event	info	info0
36	0.642733	339.693	216	51.63614	54

The targeted sample size, rounding up to an even integer, is:

sample_size <- ceiling(sum(enroll_rate$rate * enroll_rate$duration) / 2) * 2
sample_size
#> [1] 340

Average HR and expected event accumulation over time

Plotting the average hazard ratio as a function of study duration, we see that it improves considerably over the course of the study. We also plot expected event accumulation. As before, we plot for 10 months more than the planned study duration of 36 months to allow evaluation of event accumulation versus treatment effect for different trial durations.

avehrtbl <- ahr(
  enroll_rate = enroll_rate,
  fail_rate = fail_rate,
  total_duration = 1:(total_duration + 10)
)

ggplot(avehrtbl, aes(x = time, y = ahr)) +
  geom_line() +
  ylab("Average HR") +
  ggtitle("Average HR as a function of study duration")


ggplot(avehrtbl, aes(x = time, y = event)) +
  geom_line() +
  ylab("Expected events") +
  ggtitle("Expected event accumulation as a function of study duration")

Simulation to verify power

We convert the stratum-specific enrollment rates derived above to the overall enrollment rates needed for simtrial::simfix().

er <- enroll_rate %>%
  group_by(stratum) %>%
  mutate(period = seq_len(n())) %>%
  group_by(period) %>%
  summarise(rate = sum(rate), duration = last(duration))

er %>% gt()

period	rate	duration
1	4.043965	2
2	8.087929	2
3	12.131894	2
4	16.175858	18
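
The overall rates above are simply the stratum rates summed within each enrollment period. A base R sketch of the same collapse, rebuilding the scaled stratum rates from the event target (216) and the original expected events (53.41293), might look as follows; the data frame and column names are illustrative.

```r
# Multiplier applied to the original enrollment rates
scale <- 216 / 53.41293

# Scaled enrollment rates by stratum and period
rates <- data.frame(
  stratum = rep(c("High", "Moderate", "Low"), each = 4),
  period = rep(1:4, times = 3),
  rate = scale * c((1:4) / 3, (1:4) / 2, (1:4) / 6)
)

# Sum across strata within each period
er_check <- aggregate(rate ~ period, data = rates, FUN = sum)
er_check
```

Because the stratum prevalences sum to 1, the overall rate in each period is the period number times the multiplier.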

Now we simulate and summarize results. Once again, the statistical information estimated from the simulation is somewhat less than the Schoenfeld approximation, which is the expected events divided by 4 (here 216 / 4 = 54).

results2 <- simtrial::simfix(
  nsim = nsim,
  block = block,
  sampleSize = sample_size,
  strata = strata,
  enroll_rate = er,
  fail_rate = fail_rate,
  total_duration = as.numeric(total_duration),
  target_event = as.numeric(target_event),
  timingType = 1:5
)

# Loading the data saved previously
results2 <- readRDS("fixtures/results2.rds")
results2$Positive <- (pnorm(results2$Z) <= .025)
results2 %>%
  group_by(cut) %>%
  summarize(
    Simulations = n(), Power = mean(Positive), sdDur = sd(Duration), Duration = mean(Duration),
    sdEvents = sd(Events), Events = mean(Events),
    HR = exp(mean(lnhr)), sdlnhr = sd(lnhr), info = 1 / sdlnhr^2
  ) %>%
  gt() %>%
  fmt_number(column = 2:9, decimals = 3)

cut	Simulations	Power	sdDur	Duration	sdEvents	Events	HR	sdlnhr	info
Max(min follow-up, event cut)	2,000.000	0.895	1.751	36.952	4.899	219.416	0.642	0.137	53.33857
Max(planned duration, event cut)	2,000.000	0.892	1.529	36.951	4.988	219.411	0.641	0.136	53.72394
Minimum follow-up	2,000.000	0.886	1.139	36.051	8.595	215.994	0.644	0.137	53.50321
Planned duration	2,000.000	0.882	0.000	36.000	8.918	215.713	0.644	0.137	53.53033
Targeted events	2,000.000	0.879	2.363	36.060	0.000	216.000	0.644	0.138	52.44797

Finally, compare the simulation results above to the asymptotic approximation below. The power achieved by simulation is just below the targeted 90%; noting that the simulation standard error is approximately 0.007, the asymptotic approximation is quite good. Using the final cutoff that requires both the targeted events and minimum follow-up seems a reasonable convention to preserve targeted design power.

ahr(
  enroll_rate = enroll_rate,
  fail_rate = fail_rate,
  total_duration = total_duration
) %>% gt()

time	ahr	n	event	info	info0
36	0.642733	339.693	216	51.63614	54
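
To gauge how far the simulated power values are from the target, it helps to compute the Monte Carlo standard error of a proportion. The base R sketch below uses the binomial formula with 2000 simulations; the power values are copied from the simulation summary table above.

```r
nsim <- 2000
target_power <- 0.9

# Binomial standard error of a simulated power estimate
mc_se <- sqrt(target_power * (1 - target_power) / nsim)
mc_se

# Simulated power by cutoff method, copied from the table above
simulated_power <- c(
  max_followup_event = 0.895, max_duration_event = 0.892,
  min_followup = 0.886, planned_duration = 0.882, target_event = 0.879
)

# Distance from the target in standard error units
round((simulated_power - target_power) / mc_se, 1)
```

Cutoffs that wait for both the targeted events and a minimum duration or follow-up sit closest to the target, which supports the convention suggested above.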

References

Kalbfleisch, John D, and Ross L Prentice. 1981. “Estimation of the Average Hazard Ratio.” Biometrika 68 (1): 105–12.

Schemper, Michael, Samo Wakounig, and Georg Heinze. 2009. “The Estimation of Average Hazard Ratios by Weighted Cox Regression.” Statistics in Medicine 28 (19): 2473–89.

Schoenfeld, David. 1981. “The Asymptotic Properties of Nonparametric Tests for Comparing Survival Distributions.” Biometrika 68 (1): 316–19.

Keaven M. Anderson
