
Overview

There are multiple scenarios where event-based spending for group sequential designs has limitations, both in ensuring adequate follow-up and in ensuring that adequate spending is preserved for the final analysis. This often arises in trials where

  • there may be a delayed treatment effect,
  • control failure rates are different than expected, and
  • multiple hypotheses are being tested.

In general, for such situations we have found that ensuring both adequate follow-up duration and an adequate number of events is important to fully evaluate the potential effectiveness of a new treatment. For testing of multiple hypotheses, carefully thinking through possible spending issues can be critical. In addition, for group sequential trials, preserving adequate \(\alpha\)-spending for a final evaluation of a hypothesis is important and difficult to do using traditional event-based spending.

In this document, we outline three examples to demonstrate these issues:

  • For a delayed effect scenario we demonstrate:
    • the importance of both adequate events and adequate follow-up duration to ensure power in a fixed design, and
    • the importance of guaranteeing a reasonable amount of \(\alpha\)-spending for the final analysis in a group sequential design.
  • For a trial examining an outcome in both the biomarker positive and overall populations, we show the importance of considering how the design reacts to incorrect design assumptions on biomarker prevalence.

For the group sequential design options, we demonstrate that the concept of spending time is an effective way to adapt. Traditionally (Lan and DeMets 1983), spending has been based on the fraction of a targeted final number of events for an outcome. However, for delayed treatment effect scenarios there is substantial literature (e.g., Lin et al. (2020), Roychoudhury et al. (2021)) documenting the importance of adequate follow-up duration in addition to the adequate number of events required under the traditional proportional hazards assumption.

While other approaches could be taken, we have found the spending time approach generalizes well for addressing a variety of scenarios. The fact that spending does not need to correspond to information fraction was perhaps first raised by Lan and DeMets (1989) where calendar-time spending was discussed. However, we note that Proschan, Lan, and Wittes (2006) have raised other scenarios where spending alternatives are considered. Two specific spending approaches are suggested here:

  • Spending according to the minimum of planned and observed event counts. This is suggested for the delayed effect examples.
  • Spending with a common spending time across multiple hypotheses; e.g., in the multiple population example, spending in the overall population at the same rate as in the biomarker positive subgroup regardless of event counts over time in the overall population. This is consistent with Follmann, Proschan, and Geller (1994) as applied when multiple experimental treatments are compared to a common control. Spending time in this case corresponds to the approach of Fleming, Harrington, and O’Brien (1984) where fixed incremental spending is set for a potentially variable number of interim analyses.

This document is fairly long in that it demonstrates a number of scenarios relevant to the spending time concept. The layout is intended to make it as easy as possible to focus on individual examples for those not interested in the full review. Code is available to unhide for those interested in implementation. Rather than bog down the conceptual discussion with implementation details, we have tried to provide sufficient comments in the code to guide implementation.

Delayed effect scenario

We consider an example in a single stratum where there is a possibility of a delayed treatment effect. The next two sections consider 1) a fixed design with no interim analysis, and 2) a design with an interim analysis. The common assumptions are as follows:

  • The control group time-to-event is exponentially distributed with a median of 12 months.
  • 2.5% one-sided Type I error.
  • 90% power.
  • A constant enrollment rate with an expected enrollment duration of 12 months.
  • A targeted trial duration of 30 months.
  • A delayed effect for the experimental group compared to control, with a hazard ratio of 1 for the first 4 months and a hazard ratio of 0.6 thereafter.

The restrictions on constant control failure rate, only two hazard ratio time intervals and constant enrollment are not required, but simplify the example. The approach taken uses an average-hazard ratio approach for approximating treatment effect as in Mukhopadhyay et al. (2020) and the asymptotic group sequential theory of Tsiatis (1982).
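To make the delayed-effect assumption concrete, the following base-R sketch evaluates survival under the piecewise exponential model listed above. The helper name `surv_piecewise` is hypothetical and is not part of any package used in this document.

```r
# Survival under a two-piece exponential model
# (illustrative helper, not a package function)
surv_piecewise <- function(t, lambda, hr_early = 1, hr_late = 0.6, delay = 4) {
  early <- pmin(t, delay)      # time at risk before the effect starts
  late  <- pmax(t - delay, 0)  # time at risk after the effect starts
  exp(-lambda * (hr_early * early + hr_late * late))
}

lambda_control <- log(2) / 12  # control hazard rate for a 12-month median
times <- c(4, 12, 24)          # months since randomization

surv_control      <- exp(-lambda_control * times)
surv_experimental <- surv_piecewise(times, lambda_control)

# Rows: treatment group; columns: months 4, 12 and 24
round(rbind(control = surv_control, experimental = surv_experimental), 3)
```

The two curves coincide through month 4 and separate only afterwards, which is why early events carry little information about the treatment effect.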

m <- 12 # Control median
enrollRates <- tibble(Stratum="All", duration = 12, rate = 1)
failRates <- tibble(Stratum="All", duration = c(4, 100), hr = c(1, .6), 
                    failRate = log(2) / m, dropoutRate = .001)

Fixed design, delayed effect

# Output table function for fixed design

table_fixed_design <- function(x, enrollRates){
  N <- sum(enrollRates$rate * enrollRates$duration)
  x %>%
    filter(Bound == "Upper") %>%
    transmute(Time = Time, N = ceiling(N), Events = ceiling(Events),
              "Nominal p" = pnorm(-Z), AHR = AHR, Power = Probability) %>%
    gt() %>%
    fmt_number(columns = c("N", "Events"), decimals = 0) %>%
    fmt_number(columns = "Time", decimals = 1) %>%
    fmt_number(columns = c("Nominal p", "Power", "AHR"), decimals = 3)
}

The sample size and events for this design are shown below. We see that the average hazard ratio (AHR) under the above assumptions is 0.703, part way between the early HR of 1 and the later HR of 0.6 assumed for experimental versus control therapy.

# Bounds for fixed design are just a fixed bound for nominal p = 0.025, 1-sided
Z_025 <- qnorm(.975)

# Fixed design, single stratum
# Find sample size for 30 month trial under given 
# enrollment and failure rate assumptions
xx <- gs_design_ahr(enrollRates, 
                    failRates, 
                    analysisTimes= 30,
                    upar = Z_025, 
                    lpar = Z_025)
xx$bounds %>% table_fixed_design(enrollRates)
Time    N  Events  Nominal p    AHR  Power
30.0  510     340      0.025  0.703  0.900
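As a rough plausibility check on the event count, one can plug the average hazard ratio into the Schoenfeld approximation. This is only a sketch: it assumes 1:1 randomization (hence the factor of 4) and is not the variance calculation used by `gs_design_ahr()`, so exact agreement should not be expected.

```r
# Schoenfeld-type approximation for required events in a fixed design,
# assuming 1:1 randomization (an assumption made for this sketch)
alpha <- 0.025   # one-sided Type I error
power <- 0.90
ahr   <- 0.703   # average hazard ratio from the table above

events_schoenfeld <- 4 * (qnorm(1 - alpha) + qnorm(power))^2 / log(ahr)^2
events_schoenfeld
```

The result is slightly below the 340 events reported above, which suggests the average hazard ratio summarizes the delayed-effect alternative reasonably well for this design.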

Power when design assumptions are wrong

Scenario with less experimental benefit

If we assume that the effect delay is 6 months instead of 4 and the control median is 10 months instead of 12, there is a substantial impact on power. Here, we have assumed that only the targeted number of events is required for the final analysis, resulting in an expected final analysis time of 25 months instead of the planned 30 and an average hazard ratio of 0.78 at the expected time of analysis rather than the targeted average hazard ratio of 0.70 under the original assumptions.

am <- 10 # Alternate control median
failRates$duration[1] <- 6
failRates$failRate <- log(2) / am
yy <- 
  gs_power_ahr(
      enrollRates = xx$enrollRates,
      failRates = failRates,
      events = xx$bounds$Events,
      upar = Z_025,
      lpar = Z_025
)
yy %>% table_fixed_design(xx$enrollRates)
Time    N  Events  Nominal p    AHR  Power
25.2  510     340      0.025  0.778  0.631

Now we also require a 30-month trial duration in addition to the targeted events. This improves the power from 63% above to 76%, with an increase from 25 to 30 months duration and from 340 to 377 expected events, an important gain. This is driven both by the average hazard ratio of 0.78 above compared to 0.76 below and by the increased expected number of events. It also ensures adequate follow-up to better describe longer-term differences in survival; this may be particularly important if early follow-up suggests a delayed effect or crossing survival curves. Thus, adapting an event-based design to also require adequate follow-up can help ensure power for a large clinical trial investment where there is a clinically relevant underlying survival benefit.

yy <- 
  gs_power_ahr(
      enrollRates = xx$enrollRates,
      failRates = failRates,
      events = xx$bounds$Events,
      analysisTimes = 30,
      upar = Z_025,
      lpar = Z_025
)
yy %>% table_fixed_design(xx$enrollRates)
Time    N  Events  Nominal p    AHR  Power
30.0  510     377      0.025  0.759  0.759
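The two power values above can be roughly reproduced from the event counts and average hazard ratios alone. The sketch below uses a normal approximation with statistical information approximated by events / 4, which assumes 1:1 randomization; the function name `approx_power` is hypothetical, and the package computes information differently, so small discrepancies are expected.

```r
# Normal-approximation power for a one-sided logrank-type test,
# with information approximated by events / 4 (1:1 randomization assumed)
approx_power <- function(events, ahr, alpha = 0.025) {
  pnorm(sqrt(events / 4) * abs(log(ahr)) - qnorm(1 - alpha))
}

# Analysis triggered by targeted events only (about 25 months)
power_events_only <- approx_power(events = 340, ahr = 0.778)

# Analysis requires both targeted events and 30 months of trial duration
power_events_and_time <- approx_power(events = 377, ahr = 0.759)

round(c(events_only = power_events_only,
        events_and_time = power_events_and_time), 3)
```

Both approximations land within about one percentage point of the tabled values, and they make the two sources of the gain visible: more events and a smaller average hazard ratio.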

Scenario with low control event rates

Now we assume a longer than planned control median of 16 months to demonstrate the value of retaining the event count requirement. If we analyze after 30 months regardless of the number of events, the power of the trial is 87% with 288 events expected.

am <- 16 # Alternate control median
failRates$failRate <- log(2) / am
failRates$duration[1] <- 4
yy <- 
  gs_power_ahr(
      enrollRates = xx$enrollRates,
      failRates = failRates,
      events = 1,
      analysisTimes = 30,
      upar = Z_025,
      lpar = Z_025
)
yy %>% table_fixed_design(xx$enrollRates)
Time    N  Events  Nominal p    AHR  Power
30.0  510     288      0.025  0.693  0.868
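The expected event count at month 30 can be checked independently of the package with a short base-R calculation. This sketch assumes 1:1 randomization of the 510 patients, uniform enrollment over 12 months, and exponential dropout acting as a competing risk; the helper names `p_event` and `expected_events` are hypothetical.

```r
# Probability of an observed event by follow-up time u under a two-piece
# exponential failure model with exponential dropout at rate eta.
# lambda1 applies on [0, delay); lambda2 applies afterwards.
p_event <- function(u, lambda1, lambda2, delay, eta) {
  u1 <- pmin(u, delay)           # follow-up spent in the first interval
  u2 <- pmax(u - delay, 0)       # follow-up spent in the second interval
  first   <- lambda1 / (lambda1 + eta) * (1 - exp(-(lambda1 + eta) * u1))
  at_risk <- exp(-(lambda1 + eta) * u1)  # still followed when interval 1 ends
  second  <- at_risk * lambda2 / (lambda2 + eta) *
    (1 - exp(-(lambda2 + eta) * u2))
  first + second
}

# Expected events in one arm at calendar time t_analysis; with uniform
# enrollment, follow-up is uniform on [t_analysis - enroll_dur, t_analysis]
expected_events <- function(n, t_analysis, enroll_dur, ...) {
  avg <- integrate(p_event, lower = t_analysis - enroll_dur,
                   upper = t_analysis, ...)$value / enroll_dur
  n * avg
}

lambda_c <- log(2) / 16   # control hazard rate for a 16-month median
n_arm    <- 510 / 2       # 1:1 randomization assumed

control <- expected_events(n_arm, 30, 12, lambda1 = lambda_c,
                           lambda2 = lambda_c, delay = 4, eta = 0.001)
experimental <- expected_events(n_arm, 30, 12, lambda1 = lambda_c,
                                lambda2 = 0.6 * lambda_c, delay = 4, eta = 0.001)
total_events <- control + experimental
total_events
```

The total is close to the 288 expected events in the table, well short of the 340 targeted, which is the reason a time cutoff alone loses power here.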

If we also require adequate events, we restore power to 94.5%, above the originally targeted level of 90%. The cost is that the expected trial duration becomes 38.5 months rather than 30; however, since the control median is now larger, the additional follow-up should be useful to characterize tail behavior. Note that for this scenario we are likely particularly interested in retaining power, as the treatment effect is actually stronger than under the original alternate hypothesis. Thus, for this example, the time cutoff alone would not have ensured sufficient follow-up to power the trial.

yy <- 
  gs_power_ahr(
      enrollRates = xx$enrollRates,
      failRates = failRates,
      events = xx$bounds$Events,
      analysisTimes = 30,
      upar = Z_025,
      lpar = Z_025
)
yy %>% table_fixed_design(xx$enrollRates)
Time    N  Events  Nominal p    AHR  Power
38.5  510     340      0.025  0.678  0.945

Conclusions for fixed design

In summary, we have demonstrated the value of requiring both adequate events and adequate follow-up duration over an approach where the analysis is done with only one of these requirements. Requiring both retains power and allows important characterization of treatment benefit over time when there is potential for delayed onset of a beneficial treatment effect.
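Requiring both criteria amounts to taking the later of two times. A minimal sketch, using the hypothetical helper name `final_analysis_time` and the expected times at which the 340 targeted events accrue in the two scenarios above (25.2 and 38.5 months):

```r
# Final analysis occurs once BOTH the targeted events and the minimum
# trial duration have been reached, i.e., at the later of the two times
final_analysis_time <- function(time_at_target_events, min_duration = 30) {
  pmax(time_at_target_events, min_duration)
}

# Expected time to reach 340 events, taken from the tables above
time_at_events <- c(less_benefit = 25.2, low_control_rate = 38.5)
analysis_times <- final_analysis_time(time_at_events)
analysis_times
```

In the first scenario the duration requirement is binding (30 months); in the second the event requirement is binding (38.5 months).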

Group sequential design

Alternative spending strategies

We extend the above design for detecting a delayed effect to a group sequential design with a single interim analysis after 80% of the final planned events have accrued. We will assume the final analysis requires both the targeted trial duration and the targeted events, based on the fixed design evaluations above. We assume the efficacy bound uses the Lan and DeMets (1983) spending function approximating an O’Brien-Fleming bound. No futility bound is planned, with the exception of a demonstration for one scenario. The interim analysis is far enough into the trial that there is a substantial probability of stopping early under design assumptions.

Coding for the different strategies must be done carefully. At the time of design, when spending is based on information fraction, we specify only the spending function; spending time and maximum information are left unspecified.

# Spending for design with planned information fraction
upar_design_IF <- list(
                       # total_spend represents one-sided Type I error
                       total_spend = 0.025,
                       # Spending function and associated 
                       # parameter (NULL, in this case)
                       sf = sfLDOF, param = NULL,
                       # Do NOT specify spending time here as it will be set
                       # by information fraction specified in call to
                       # gs_design_ahr()
                       timing = NULL,
                       # Do NOT specify maximum information here as it will be
                       # set as the design maximum information
                       max_info = NULL
)

If we wished to use 22 and 30 months as calendar analysis times and use calendar fraction for spending, we would need to specify spending time for the design.

upar_design_CF <- upar_design_IF
# Now switch spending time to calendar fraction
upar_design_CF$timing <- c(22, 30)/30
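To see what the choice of spending time does to interim spending, the sketch below evaluates the Lan-DeMets spending function approximating O’Brien-Fleming bounds directly in base R, \(\alpha(t) = 2 - 2\Phi(z_{1-\alpha/2}/\sqrt{t})\). The function name `ldof_spend` is hypothetical; it is intended to mirror the spending function selected through `sf = sfLDOF`, not to replace it.

```r
# Lan-DeMets spending function approximating O'Brien-Fleming bounds:
# cumulative one-sided alpha spent at spending time t (capped at 1)
ldof_spend <- function(t, alpha = 0.025) {
  t <- pmin(t, 1)
  2 - 2 * pnorm(qnorm(1 - alpha / 2) / sqrt(t))
}

# Spending time = planned information fraction (0.8 at interim)
spend_information <- ldof_spend(c(0.8, 1))

# Spending time = calendar fraction (22 of 30 months at interim)
spend_calendar <- ldof_spend(c(22, 30) / 30)

round(rbind(information = spend_information, calendar = spend_calendar), 4)
```

Because the calendar fraction at 22 months (about 0.73) is below the information fraction (0.8), calendar spending allocates less \(\alpha\) to the interim analysis.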

Next we show how to set up information-based spending for power calculation when timing of analysis is not based on information fraction; e.g., we will propose requiring not only achieving planned event counts, but also planned study duration before an analysis is performed. It is critical to set the maximum planned information to update the information fraction calculation in this case.

# We now need to change max_info from spending as specified for design
upar_actual_IF <- upar_design_IF
# Note that we still have timing = NULL, unchanged from information-based design
# (assigning NULL drops the element; upar_actual_IF$timing still evaluates to NULL)
upar_actual_IF$timing <- NULL
# Replace NULL maximum information with planned maximum null hypothesis
# information from design
# This max will be updated for each planned design later
upar_actual_IF$max_info <- 100

The final case replaces information fraction with a specific spending time, which is plugged into the spending function to compute incremental \(\alpha\)-spending for each analysis. For our case, we will use the planned information fraction from the design, which is 0.8 at the interim analysis and 1 for the final analysis. This will be used regardless of the scenario for which we compute power, but recall that information fraction is still used for computing correlations in the asymptotic distribution approximation for design tests.

# Copy original upper planned spending
upar_planned_IF <- upar_design_IF
# Interim and final spending time will always be the same, regardless of 
# expected events or calendar timing of analysis
upar_planned_IF$timing <- c(0.8, 1)
# We will reset planned maximum information later

Finally, we set up a function to print a table describing the characteristics of a design as it is updated for different scenarios.

# Function to print table for group sequential design
# Desire is to print information fraction relative to the planned maximum
# information, which is input in max_info0.
# Input spending_time here needs to be consistent with what is given in design;
# if NULL, information fraction will be used to compute spending time based on input max_info0.
# Enrollment rates input are used to compute sample size.
# This works for cases here, but N calculation will NOT work for
# IA's before planned enrollment completion; for that gs_power_ahr needs to be fixed
table_gs_design <- function(x, enrollRates, spending_time = NULL, max_info0 = NULL){
  N <- sum(enrollRates$rate * enrollRates$duration)
  if (is.null(max_info0)) stop("Must enter maximum H0 planned information in max_info0")
  x$N <- N
  if (is.null(spending_time)) {
    # Default: spending time is information fraction, capped at 1
    x$spending_time <- pmin(x$info0 / max_info0, 1)
  } else {
    x$spending_time <- spending_time
  }
  x %>% 
     transmute(Time = Time, N = ceiling(N), Events = ceiling(Events), "Information fraction" = info0 / max_info0,
               "Spending time" = spending_time,
               "Nominal p" = pnorm(-Z), AHR = AHR, "~HR at bound" = exp(-Z / sqrt(info)), Power = Probability) %>%
     gt() %>%
     fmt_number(columns = c("N","Events"), decimals = 0) %>%
     fmt_number(columns = "Time", decimals=1) %>%
     fmt_number(columns = c("AHR", "Power","Information fraction", "Spending time", "~HR at bound"), decimals=3) %>%
     fmt_number(columns = "Nominal p", decimals = 4)
}

Planned design

We extend the design studied above to a group sequential design with a single interim analysis after 80% of the final planned events have accrued. We will assume the final analysis will require both the targeted trial duration and events based on the fixed design evaluations made above. We assume the efficacy bound uses the Lan-DeMets spending function approximating an O’Brien-Fleming bound. No futility bound is planned. The interim analysis is far enough into the trial that there is a substantial probability of stopping early under design assumptions.

m <- 12 # Control median
failRates$failRate[1] <- log(2) / m
failRates$duration[1] <- 4
# Planned information fraction at interim(s) and final
planned_IF <- c(.8,1)
# No futility bound
lpar <- rep(-Inf,2) # lower Z bound of -Inf at all analyses
enrollRates <- tibble(Stratum="All", duration = 12, rate = 1)
failRates <- tibble(Stratum="All", duration = c(4, 100), hr = c(1, .6), 
                    failRate = log(2) / m, dropoutRate = .001)
# Note that timing here matches what went into planned_IF above
# Final analysis time set to targeted study duration; analysis times before are 'small'
# to ensure use of information fraction for timing
xx <- gs_design_ahr(enrollRates, failRates, analysisTimes = c(1,30), IF = planned_IF,
      upper = gs_spending_bound, upar = upar_design_IF, 
      lower = gs_b, lpar = lpar)
# Get upper bounds
planned_bounds <- xx$bounds %>% filter(Bound == "Upper")
# Planned number of analyses
K <- nrow(planned_bounds)
# Planned events
max_events <- max(planned_bounds$Events)
# save max information planned under H0
# This will be used for future information fraction calculations
# when event accumulation is not same as planned
max_info0 <- max(xx$bounds$info0)
# Planned analysis timing
planned_time <- planned_bounds$Time
planned_bounds %>% table_gs_design(xx$enrollRates, max_info0 = max(xx$bounds$info0))
Time    N  Events  Information fraction  Spending time  Nominal p    AHR  ~HR at bound  Power
22.1  534     285                 0.800          0.800     0.0122  0.731         0.764  0.642
30.0  534     356                 1.000          1.000     0.0214  0.703         0.805  0.900
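The nominal p-values in this table follow from the spending function and the correlation between the interim and final test statistics. The base-R sketch below is an illustration under stated assumptions rather than the package's algorithm: it assumes the two Z-statistics are bivariate normal under the null hypothesis with correlation \(\sqrt{0.8}\), and solves for the final bound that spends the remaining \(\alpha\).

```r
alpha <- 0.025  # total one-sided Type I error
t1    <- 0.8    # planned information fraction at the interim

# Lan-DeMets O'Brien-Fleming-like cumulative spending
ldof_spend <- function(t) 2 - 2 * pnorm(qnorm(1 - alpha / 2) / sqrt(t))

alpha1 <- ldof_spend(t1)     # alpha spent at the interim
b1     <- qnorm(1 - alpha1)  # interim efficacy bound on the Z scale
incr   <- alpha - alpha1     # alpha left for the final analysis

# P(Z1 < b1, Z2 >= b2) under H0, using Z2 | Z1 = z ~ N(sqrt(t1) * z, 1 - t1)
cross_final_only <- function(b2) {
  integrate(function(z) {
    dnorm(z) * pnorm((b2 - sqrt(t1) * z) / sqrt(1 - t1), lower.tail = FALSE)
  }, lower = -Inf, upper = b1)$value
}

# Final bound: first crossing at the final analysis spends exactly incr
b2 <- uniroot(function(b) cross_final_only(b) - incr,
              interval = c(1, 4), tol = 1e-8)$root

p_interim <- pnorm(-b1)
p_final   <- pnorm(-b2)
round(c(interim = p_interim, final = p_final), 4)
```

The final nominal p-value is larger than the remaining incremental \(\alpha\) because many paths that would cross the final bound have already crossed at the interim.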

Two alternate approaches

We consider two alternate approaches to demonstrate the spending time concept that may be helpful in practice. The following two subsections can be skipped if these are not of interest. The first demonstrates calendar spending as in Lan and DeMets (1989). The second is basically the method of Fleming, Harrington, and O’Brien (1984), where a fixed incremental spend is used for a potentially variable number of interim analyses, with the final bound computed based on the unspent one-sided Type I error assigned to a hypothesis.

Calendar spending

We use the same sample size as above, but change efficacy bound spending to calendar-based. The reason this spending is different than information-based spending is mainly due to the fact that the expected information is not linear in time. In this case, the calendar fraction at interim is less than the information fraction, but exactly the opposite would be true earlier in the trial. We just note that if calendar-based spending is chosen, it may be worth comparing the design bounds with bounds using the same spending function, but with information-based spending to see if there are important differences to the trial team or possibly to the scientific or regulatory community. We note also that there is risk there will not be enough events to achieve targeted power at the final analysis under a calendar-based spending strategy. We will not examine calendar-based spending further in this document.

yy <- 
  gs_power_ahr(enrollRates = xx$enrollRates, 
               failRates = xx$failRates, 
               events = 1:2, # Planned time will drive timing since information accrues faster
               analysisTimes = c(22, 30), # Interim time rounded
               upper = gs_spending_bound, upar = upar_design_CF, lpar = lpar) %>% filter(Bound == "Upper")
actual_IF <- NULL # yy$info0 / max(yy$info0)
yy %>% table_gs_design(xx$enrollRates, spending_time = upar_design_CF$timing, max_info0 = max(yy$info0))
Time    N  Events  Information fraction  Spending time  Nominal p    AHR  ~HR at bound  Power
22.0  534     283                 0.796          0.733     0.0089  0.732         0.752  0.590
30.0  534     356                 1.000          1.000     0.0231  0.703         0.808  0.905

Fixed incremental spend with a variable number of analyses

As noted, this method was proposed by Fleming, Harrington, and O’Brien (1984). The general strategy demonstrated is to do an interim analysis every 6 months until both a final targeted follow-up time and a cumulative number of events are achieved. Once efficacy analyses start, a fixed incremental spend of 0.001 is used at each interim. When the criteria for the final analysis are met, the remaining \(\alpha\) is spent. Cumulative spending at months 18 and 24 will be 0.001 and 0.002, respectively, with the full cumulative \(\alpha\)-spending of 0.025 at the final analysis. This is done by setting the spending time at months 18, 24 and 30 to 1/25, 2/25 and 1, respectively; i.e., 1/25 of the total \(\alpha\) is spent incrementally at each efficacy interim analysis and any remaining \(\alpha\) is spent at the final analysis. This enables a strategy such as analyzing every 6 months until both a minimum targeted follow-up and a minimum number of events are observed, at which time the final analysis is performed. We will skip efficacy analyses at the first two interim analyses at months 6 and 12.

For futility, we simply use a nominal 1-sided p-value of 0.05 favoring control at each interim. We note that this only raises a flag if the futility bound is crossed and a Data Monitoring Committee (DMC) can choose to continue the trial even if a futility bound is crossed. However, the bound may be more effective in providing a DMC guidance not to stop for futility prematurely. For comparison with the above designs, we will leave the enrollment rates, failure rates, dropout rates and final analysis time as before.

We see in the following table summarizing efficacy bounds and power that there is little impact on the total power from having futility analyses as specified. While the cumulative \(\alpha\)-spending is 0.001 and 0.002 at the efficacy interim analyses, we see that the nominal p-value bound at the second interim is 0.0015, more than the 0.001 incremental \(\alpha\)-spend. We also note that with these nominal p-values for testing, the approximate hazard ratio required to cross the bounds would presumably help justify consideration of completing the trial based on a definitive interim efficacy finding. Also, with the small interim spend, the final nominal p-value is not reduced much from the overall \(\alpha=0.025\) Type I error set for the group sequential design.

# Cumulative spending at IA3 and IA4 will be 0.001 and 0.002, respectively.
# Power spending function sfPower with param = 1 is linear in timing
# which makes setting the above cumulative spending targets simple by
# setting the timing variable to the cumulative proportion of spending at each analysis.
# There will be no efficacy testing at IA1 or IA2.
# Thus, incremental spend, which will be unused, is set very small for these analyses.
upar_FHO <- list(total_spend = 0.025,
                 sf = sfPower,
                 param = 1,
                 timing = c((1:2)/250, (1:2)/25, 1)
)
FHO <-
  gs_power_ahr(enrollRates = xx$enrollRates,
               failRates = xx$failRates,
               events = NULL,
               analysisTimes = seq(6, 30, 6),
               # No efficacy testing at IA1 or IA2
               # Thus, the small alpha the spending function would have
               # allocated will not be used
               test_upper = c(FALSE, FALSE, TRUE, TRUE, TRUE),
               upper = gs_spending_bound,
               upar = upar_FHO,
               # Futility: nominal one-sided p = 0.05 favoring control at each interim
               lpar = c(rep(qnorm(.05), 4), -Inf)
)
FHO %>% filter(Bound == "Upper", Z < Inf) %>% table_gs_design(xx$enrollRates, spending_time = upar_FHO$timing[3:5], max_info0 = max(FHO$info0))
Time    N  Events  Information fraction  Spending time  Nominal p    AHR  ~HR at bound  Power
18.0  534     235                 0.660          0.040     0.0010  0.762         0.665  0.150
24.0  534     304                 0.854          0.080     0.0015  0.722         0.709  0.433
30.0  534     356                 1.000          1.000     0.0249  0.703         0.811  0.881
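The spending times in this table translate into cumulative and incremental \(\alpha\) as follows. This base-R sketch writes out a power spending function with exponent 1, \(\alpha(t) = \alpha t\), which is what `sfPower` with `param = 1` is described as doing in the code comments above.

```r
total_spend   <- 0.025
spending_time <- c(1 / 25, 2 / 25, 1)  # efficacy analyses at months 18, 24, 30

# Power spending function with exponent 1 is linear in spending time
cumulative_spend  <- total_spend * spending_time^1
incremental_spend <- diff(c(0, cumulative_spend))

data.frame(month = c(18, 24, 30), spending_time,
           cumulative_spend, incremental_spend)
```

Each efficacy interim spends 0.001, leaving 0.023 of incremental \(\alpha\) for the final analysis.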

We also examine the futility bound. The nominal p-value of 0.05 at each analysis is the one-sided p-value in favor of control over experimental treatment. We can see that the probability of stopping early under the alternate hypothesis (\(\beta\)-spending) is not substantial even given the early delayed effect. Also, the substantial approximate observed hazard ratios to cross a futility bound seem reasonable given the timing and number of events observed; the exception to this is the small number of events at the first interim, but a larger number could be observed by this time if there were early excess risk. It may be useful to plan additional analyses if a futility bound is crossed to support stopping or not. For example, looking in subgroups or evaluating smoothed hazard rates over time for each treatment group may be useful. A clinical trial study team should have a complete discussion of futility bound considerations at the time of design.

FHO %>% 
  filter(Bound == "Lower", abs(Z) < Inf) %>%
  transmute(Time = Time, 
            Events = ceiling(Events),
            "Nominal p" = pnorm(Z), 
            "Information fraction" = info0 / max_info0,
            "~HR at bound" = exp(-Z / sqrt(info)),
            "Cumulative beta-spend" = Probability
            ) %>%
  gt() %>%
  fmt_number(columns = c("Events"), decimals = 0) %>%
  fmt_number(columns = "Time", decimals=1) %>%
  fmt_number(columns = c("~HR at bound", "Cumulative beta-spend","Information fraction"), decimals=3) %>%
  fmt_number(columns = "Nominal p", decimals = 4)
Time  Events  Nominal p  Information fraction  ~HR at bound  Cumulative beta-spend
 6.0      41     0.0500                 0.114         1.679                  0.038
12.0     138     0.0500                 0.388         1.326                  0.041
18.0     235     0.0500                 0.660         1.243                  0.041
24.0     304     0.0500                 0.854         1.210                  0.041
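The "~HR at bound" column can be approximated from the event counts alone. The sketch below uses the same formula as the table code, \(\exp(-Z/\sqrt{\text{info}})\), but replaces the package's information with the rough approximation events / 4, which assumes 1:1 randomization; small differences from the table are therefore expected.

```r
events     <- c(41, 138, 235, 304)  # expected events at months 6, 12, 18, 24
z_futility <- qnorm(0.05)           # Z-value for nominal p = 0.05 favoring control

# Approximate hazard ratio needed to cross the futility bound,
# with information approximated by events / 4
hr_at_bound <- exp(-z_futility / sqrt(events / 4))
round(hr_at_bound, 3)
```

The required hazard ratio shrinks toward 1 as events accumulate, so the same nominal p-value corresponds to a much more extreme observed effect at the earliest interim.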

Less treatment effect scenario

As before, we compute power under the assumption that the median control group time-to-event is 10 months rather than the assumed 12 and the delay in effect onset is 6 months rather than 4. We otherwise do not change enrollment, dropout or hazard ratio assumptions. In both of the following examples, we require both the targeted number of events and the targeted trial duration from the group sequential design before doing the interim and final analyses. The first example, which bases interim spending on the observed event count relative to the originally planned final event count, has an information fraction of 325 / 356 = 0.91 at the interim. This gives event-based interim spending of 0.0191, substantially above the targeted interim spending of 0.0122 at the planned information fraction of 285 / 356 = 0.8. Relative to spending at the planned level, this reduces overall power from 76% to 73% and lowers the nominal p-value bound at the final analysis from 0.0219 to 0.0163; see the following two tables. Noting that the average hazard ratio is 0.8 at the interim and 0.76 at the final analysis emphasizes the value of preserving \(\alpha\)-spending until the final analysis. Thus, in this example it is valuable to limit spending at the interim analysis to the planned level as opposed to using event-based spending.

am <- 10 # Alternate control median
failRates$failRate <- log(2) / am
failRates$duration[1] <- 6
# Set planned maximum information from planned design
max_info0 <- max(xx$bounds$info0)
upar_actual_IF <- upar_design_IF
upar_actual_IF$max_info <- max_info0
# compute power if actual information fraction relative to original
# planned total is used
yy <- 
  gs_power_ahr(enrollRates = xx$enrollRates, 
               failRates = failRates, 
               events = 1:2, # Planned time will drive timing since information accrues faster
               analysisTimes = planned_time,
               upper = gs_spending_bound, upar = upar_actual_IF, lpar = lpar) %>% filter(Bound == "Upper")
actual_IF <- yy$info0 / max_info0
yy %>% table_gs_design(xx$enrollRates, spending_time = actual_IF, max_info0 = max_info0)
Time    N  Events  Information fraction  Spending time  Nominal p    AHR  ~HR at bound  Power
22.1  534     325                 0.914          0.914     0.0191  0.797         0.793  0.481
30.0  534     394                 1.108          1.108     0.0163  0.759         0.805  0.726

Just as important, using planned spending ensures that the general design principle of making interim analysis criteria more stringent than final criteria holds for this alternate scenario. There are multiple trials where delayed effects have been observed and where this difference in the final nominal p-value bound would have made a difference in achieving a statistically significant finding.

yz <- 
  gs_power_ahr(enrollRates = xx$enrollRates, 
               failRates = failRates, 
               events = xx$bounds$Events[1:2],
               analysisTimes = planned_time,
               upper = gs_spending_bound, 
               upar = upar_planned_IF, 
               lpar = lpar) %>% filter(Bound == "Upper")
# Note that max_info0 is denominator to compute "Information fraction" in table
# However, spending time is less since we use planned IF to compute spending
yz %>% table_gs_design(xx$enrollRates, spending_time = planned_IF, max_info0 = max_info0)
Time    N  Events  Information fraction  Spending time  Nominal p    AHR  ~HR at bound  Power
22.1  534     325                 0.914          0.800     0.0122  0.797         0.778  0.411
30.0  534     394                 1.108          1.000     0.0219  0.759         0.815  0.761

Scenario with longer control median

Now we return to the example where the control median is longer than expected to confirm that spending at the planned level alone, without considering the actual number of events, also results in a power reduction. While the power gain from using the actual event count is not great (95.0% vs 94.2%), the interim and final p-value bounds are more aligned with the intent of emphasizing the final analysis, where a smaller average hazard ratio is expected (0.678 vs 0.722 at the interim). First, we show the result using planned spending.

am <- 16 # Alternate control median
failRates$failRate <- log(2) / am
# Return to 4 month delay with HR=1 before HR = 0.6
failRates$duration[1] <- 4
# Start with spending based on planned information
# which is greater than actual information
yy <- 
  gs_power_ahr(enrollRates = xx$enrollRates, 
               failRates = failRates, 
               events = c(1,max_events),
               analysisTimes = planned_time,
               upper = gs_spending_bound, upar = upar_planned_IF, lpar = lpar) %>% filter(Bound == "Upper")
yy %>% table_gs_design(xx$enrollRates, spending_time = planned_IF, max_info0 = max_info0)
Time    N  Events  Information fraction  Spending time  Nominal p    AHR  ~HR at bound  Power
22.1  534     234                 0.657          0.800     0.0122  0.722         0.742  0.581
38.5  534     356                 1.000          1.000     0.0191  0.678         0.801  0.942

Since the number of events was less than expected, if we had used the actual number of events the interim bound would be more stringent than above and we obtain slightly greater power.

yz <- 
  gs_power_ahr(enrollRates = xx$enrollRates, 
               failRates = failRates, 
               events = c(1,xx$bounds$Events[2]),
               analysisTimes = planned_time,
               upper = gs_spending_bound, upar = upar_actual_IF, lpar = lpar) %>% filter(Bound == "Upper")
yz %>% table_gs_design(xx$enrollRates, spending_time = NULL, max_info0 = max_info0)
Time    N  Events  Information fraction  Spending time  Nominal p    AHR  ~HR at bound  Power
22.1  534     234                 0.657          0.657     0.0057  0.722         0.715  0.469
38.5  534     356                 1.000          1.000     0.0232  0.678         0.808  0.950

Summary for spending time motivation assuming delayed benefit

In summary, using the minimum of planned and actual spending, rather than purely event-based spending, keeps the interim bound more stringent than the final bound under the different scenarios and ensures better power than event-based interim analysis and spending.
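The rule of spending at the minimum of planned and actual information fraction can be sketched in a few lines of base R. The helper name `ldof_spend` is hypothetical, and the actual information fractions are taken from the scenario tables above.

```r
# Lan-DeMets O'Brien-Fleming-like cumulative spending (spending time capped at 1)
ldof_spend <- function(t, alpha = 0.025) {
  2 - 2 * pnorm(qnorm(1 - alpha / 2) / sqrt(pmin(t, 1)))
}

planned_IF <- c(0.8, 1)

# Actual information fractions relative to planned maximum information
actual_IF_more_events  <- c(0.914, 1.108)  # shorter control median, longer delay
actual_IF_fewer_events <- c(0.657, 1.000)  # longer control median

# Spending time = minimum of planned and actual information fraction
spending_time_more  <- pmin(planned_IF, actual_IF_more_events)
spending_time_fewer <- pmin(planned_IF, actual_IF_fewer_events)

# Cumulative interim spend under the rule vs. purely event-based spending
interim_spend <- c(
  more_events_min_rule    = ldof_spend(spending_time_more[1]),
  more_events_event_based = ldof_spend(actual_IF_more_events[1]),
  fewer_events_min_rule   = ldof_spend(spending_time_fewer[1])
)
round(interim_spend, 4)
```

When events accrue faster than planned the rule holds interim spending at the planned 0.0122 rather than 0.0191; when they accrue more slowly it follows the actual information fraction and spends less.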

Testing multiple hypotheses

Assumptions

We consider a simple case where we use the method of Maurer and Bretz (2013) to test both in the overall population and in a biomarker subgroup for the same endpoint. We assume an exponential failure rate with a median of 12 months for the control group regardless of population. The hazard ratio in the biomarker positive subgroup will be assumed to be 0.6, and in the negative population 0.8. We assume the biomarker positive group represents half of the population, meaning that enrollment rates will be assumed to be the same in negative and positive patients. The only difference between failure rates in the two strata is the hazard ratio. For this case, we assume proportional hazards within negative (HR = 0.8) and positive (HR = 0.6) patients.

enrollRates <- tibble(Stratum=c("Positive","Negative"), duration = 12, rate=20)
m <- 12
failRates <- tibble(Stratum = c("Positive", "Negative"), hr = c(0.6, 0.8),
                    duration = 100,
                    failRate = log(2) / m, dropoutRate = 0.001)

For illustrative purposes, we choose a strategy that reflects much less certainty at study start about whether there is any underlying benefit in the biomarker negative population. We wish to ensure power for the biomarker positive group, but allow a good chance of a positive overall population finding if there is a lesser benefit in the biomarker negative population. If an alternative trial strategy is planned, an alternate approach to the following should be considered. In any case, we design first for the biomarker positive population with one-sided Type I error controlled at \(\alpha = 0.0125\):

Planned design for biomarker positive population

# Spending based on information fraction
# At time of design, timing = NULL is used to select
# maximum planned information for information fraction.
# Since execution will be event-based for biomarker population,
# there will be no need to change spending plan for different scenarios.
# Total alpha spend is now 0.0125 
upar_design_spend <- list(sf = gsDesign::sfLDOF, total_spend = 0.0125, param = NULL, timing = NULL)
# No futility bound
lpar <- rep(-Inf,2) # Z = -infinity for lower bound
# We will base the combined hypothesis design to ensure power in the biomarker subgroup
positive <- 
  gs_design_ahr(enrollRates = enrollRates %>% filter(Stratum == "Positive"),
                failRates = failRates %>% filter(Stratum == "Positive"),
                # Following drives information fraction for interim
                IF = c(.8, 1), 
                # Total study duration driven by final analysisTimes value
                # Enter small increasing values before that so information
                # fraction in planned_IF drives timing of interims
                analysisTimes = c(1,30),
                upper =gs_spending_bound, upar = upar_design_spend,
                # Fixed lower bound with Z = -infinity
                lower = gs_b, lpar = lpar
               )
# This planned design will drive adaptation for deviations from plan
positive_planned_bounds <- positive$bounds %>% filter(Bound == "Upper")
# Planned timing of analysis
positive_planned_time <- positive_planned_bounds$Time
# Planned sample size
positive_planned_N <- max(positive_planned_bounds$N)
# Planned events
positive_max_events <- max(positive_planned_bounds$Events)
# save max information planned under H0
positive_max_info0 <- max(positive_planned_bounds$info0)
positive_planned_bounds %>% 
  table_gs_design(enrollRates = positive$enrollRates, 
                  max_info0 = positive_max_info0)  
| Time | N | Events | Information fraction | Spending time | Nominal p | AHR | ~HR at bound | Power |
|---|---|---|---|---|---|---|---|---|
| 22.6 | 305 | 158 | 0.800 | 0.800 | 0.0052 | 0.600 | 0.661 | 0.726 |
| 30.0 | 305 | 198 | 1.000 | 1.000 | 0.0109 | 0.600 | 0.719 | 0.900 |
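
As a rough cross-check on the final event count, the sketch below applies the Schoenfeld approximation for a fixed design with 1:1 randomization under proportional hazards. This is not the calculation that `gs_design_ahr()` performs; it is only a back-of-the-envelope comparison, and the 1:1 randomization ratio is an assumption.

```r
alpha <- 0.0125   # one-sided Type I error for the biomarker positive hypothesis
power <- 0.9
hr <- 0.6

# Schoenfeld approximation for required events, fixed design, 1:1 randomization
events_fixed <- 4 * (qnorm(1 - alpha) + qnorm(power))^2 / log(hr)^2
events_fixed
```

The result is about 190 events, somewhat below the 198 final events in the group sequential design above, which is the expected direction since adding an interim analysis requires some inflation over a fixed design.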

Planned design for overall population

We adjust the overall study enrollment rate to match the design requirement for the biomarker positive population.

# Get enrollment rate inflation factor compared to originally input rate
inflation_factor <- positive$enrollRates$rate[1] / enrollRates$rate[1]
# Using this inflation factor, set planned enrollment rates
planned_enrollRates <- enrollRates %>% mutate(rate = rate * inflation_factor)
planned_enrollRates %>% gt()
| Stratum | duration | rate |
|---|---|---|
| Positive | 12 | 25.37723 |
| Negative | 12 | 25.37723 |
# Store overall enrollment rates for future use
overall_enrollRates <- 
  planned_enrollRates %>% 
  summarize(Stratum = "All", duration=first(duration), rate = sum(rate))

Now we can examine the power for the overall population based on hazard ratio assumptions in biomarker negative and biomarker positive subgroups and the just calculated enrollment assumption. We use the analysis times from the biomarker positive population design. We see that the interim information fraction for the overall population is slightly greater than for the biomarker positive population above. To compensate for this and to enable flexibility below as biomarker positive prevalence changes, we use the same spending time as the biomarker positive subgroup regardless of the true fraction of final planned events at each analysis. Thus, the interim nominal p-value bound is the same for both the biomarker positive and overall populations. While this does not make much difference here, we see that we have a very natural way to adapt the design if the observed biomarker positive prevalence differs from what was assumed for the design.

# Set total spend for overall population, O'Brien-Fleming spending function, and 
# same spending time as biomarker subgroup
upar_overall_planned_IF <- list(sf = gsDesign::sfLDOF,  param = NULL,
                                   total_spend = 0.0125, timing = c(.8, 1),
                  # We will use actual final information as planned initially
                                   max_info = NULL)
overall_planned_bounds <- 
  gs_power_ahr(enrollRates = planned_enrollRates,
               failRates = failRates,
               analysisTimes = positive_planned_time,
               # Events will be determined by expected events at planned analysis times
               events = NULL, 
               upper = gs_spending_bound,
               # Recall planned spending times are specified the same as before 
               upar = upar_overall_planned_IF,
               lower = gs_b,
               lpar = lpar # fixed lower Z = -infinity
              ) %>% filter(Bound == "Upper")
# Planned events
overall_max_events <- max(overall_planned_bounds$Events)
# save max information planned under H0
overall_max_info0 <- max(overall_planned_bounds$info0)
overall_planned_bounds %>% 
  table_gs_design(planned_enrollRates,
                  spending_time = planned_IF,
                  max_info0 = overall_max_info0)
| Time | N | Events | Information fraction | Spending time | Nominal p | AHR | ~HR at bound | Power |
|---|---|---|---|---|---|---|---|---|
| 22.6 | 610 | 330 | 0.805 | 0.800 | 0.0052 | 0.697 | 0.753 | 0.754 |
| 30.0 | 610 | 410 | 1.000 | 1.000 | 0.0110 | 0.697 | 0.796 | 0.915 |
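
To see where the overall population AHR comes from, the following base R sketch computes expected events per stratum at the final analysis and then the event-weighted geometric mean of the stratum hazard ratios. It assumes uniform enrollment over 12 months, 1:1 randomization within each stratum, and the exponential failure and dropout rates given above; `p_event` is a hypothetical helper and this is our reading of the average hazard ratio calculation, not code taken from gsDesign2.

```r
# Probability that a patient has an observed event by calendar time t_analysis,
# assuming uniform enrollment over [0, enroll_dur], exponential failure rate
# lambda and exponential dropout rate eta (requires t_analysis >= enroll_dur)
p_event <- function(lambda, eta, enroll_dur, t_analysis) {
  stopifnot(t_analysis >= enroll_dur)
  theta <- lambda + eta
  # Mean of exp(-theta * follow-up) over uniformly distributed entry times
  mean_event_free <- (exp(-theta * (t_analysis - enroll_dur)) -
                        exp(-theta * t_analysis)) / (theta * enroll_dur)
  lambda / theta * (1 - mean_event_free)
}

control_rate <- log(2) / 12      # control median of 12
dropout_rate <- 0.001
hr <- c(Positive = 0.6, Negative = 0.8)
n_stratum <- 25.37723 * 12       # planned enrollment per stratum
t_final <- 30

# Expected events per stratum, assuming 1:1 randomization within stratum
events <- sapply(hr, function(h) {
  p_control <- p_event(control_rate, dropout_rate, 12, t_final)
  p_experimental <- p_event(h * control_rate, dropout_rate, 12, t_final)
  n_stratum * (p_control + p_experimental) / 2
})

# Event-weighted geometric mean of the stratum hazard ratios
ahr <- exp(sum(events * log(hr)) / sum(events))
round(c(events, total = sum(events), AHR = ahr), 3)
```

Under these assumptions the total should be close to the 410 final events and the AHR close to the 0.697 in the table above. The AHR sits above the simple geometric mean of 0.6 and 0.8 because the biomarker negative stratum, with its higher experimental hazard, contributes more events.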

Alternate scenarios overview

We divide our further evaluations into three subsections, one with a higher prevalence of biomarker positive patients than expected, one with a lower biomarker prevalence, followed by a section with differing event rate and hazard ratio assumptions. For each case, we will assume the total enrollment rate of 50.8 per month as planned above. We also assume that we enroll until the targeted biomarker positive subgroup enrollment of 305 from above is achieved, regardless of the overall enrollment. We specify interim analysis timing to require both 80% of the planned final analysis events in the biomarker positive population and at least 10 months of minimum follow-up; thus, for the biomarker population we will never vary events or spending here. The same spending time will be used for the overall population, but we will compare with event-based spending. The above choices are arbitrary. While we think they are reasonable, the design planner should think carefully about other variations to suit their clinical trial team needs.
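
The enrollment rule above implies that enrollment duration and total sample size vary with biomarker prevalence. A minimal base R sketch of that arithmetic follows; it uses the unrounded planned values (rate 25.37723 per stratum over 12 months, which displays as 305 biomarker positive patients), and the variable names are illustrative only.

```r
overall_rate <- 2 * 25.37723       # planned total enrollment per month (about 50.8)
positive_target <- 25.37723 * 12   # planned biomarker positive enrollment (about 305)

prevalence <- c(0.6, 0.5, 0.4)

# Enroll until the biomarker positive target is reached
enroll_duration <- positive_target / (overall_rate * prevalence)

# Expected total enrollment at that point
total_n <- overall_rate * enroll_duration

data.frame(prevalence, enroll_duration, total_n)
```

Enrollment takes 10, 12, and 15 months at 60%, 50%, and 40% prevalence, so a lower prevalence means a longer enrollment period and a larger overall population.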

## Setting spending alternatives

# Spending-time approach: set total spend for overall population,
# O'Brien-Fleming spending function, and the same spending time as the
# biomarker subgroup (timing = planned_IF)
upar_overall_planned_IF <- list(sf = gsDesign::sfLDOF, param = NULL,
                                total_spend = 0.0125, timing = planned_IF,
                                # Planned final information for overall population from design
                                max_info = overall_max_info0)
# Information (event)-based spending relative to overall population plan:
# with timing = NULL, the information fraction is computed relative to
# max_info, the planned final information from the design above.
# Using this event-based spending will demonstrate problems below.
upar_overall_actual_IF <- list(sf = gsDesign::sfLDOF, param = NULL,
                               total_spend = 0.0125, timing = NULL,
                               # Planned final information for overall population from design
                               max_info = overall_max_info0)

Biomarker subgroup prevalence higher than planned

Biomarker subgroup power

We suppose the biomarker prevalence is 60%, higher than the 50% prevalence the design anticipated. The enrollment rates by positive versus negative patients and expected enrollment duration are now:

positive_60_enrollRates <- rbind(
  overall_enrollRates %>% mutate(Stratum = "Positive", rate = 0.6 * rate),
  overall_enrollRates %>% mutate(Stratum = "Negative", rate = 0.4 * rate)
)
positive_60_duration <- positive_planned_N / overall_enrollRates$rate / 0.6
positive_60_enrollRates$duration <- positive_60_duration
positive_60_enrollRates %>% gt() %>% fmt_number(columns="rate",decimals=1)
| Stratum | duration | rate |
|---|---|---|
| Positive | 10 | 30.5 |
| Negative | 10 | 20.3 |

Now we can compute the power for the biomarker positive group with the targeted events. Since we have a simple proportional hazards model, the only thing that changes here from the original design is that reaching the targeted events takes slightly less time.

positive_60_power <- 
gs_power_ahr(enrollRates = positive_60_enrollRates %>% filter(Stratum == "Positive"),
             failRates = failRates %>% filter(Stratum == "Positive"),
             events = positive$bounds$Events[1:K],
             analysisTimes = NULL,
             upper = gs_spending_bound,
             upar = upar_design_spend,
             lower = gs_b,
             lpar = lpar
) %>% filter(Bound == "Upper")
positive_60_power %>% 
  table_gs_design(enrollRates = positive_60_enrollRates %>% filter(Stratum == "Positive"), 
                  max_info0 = positive_max_info0)
| Time | N | Events | Information fraction | Spending time | Nominal p | AHR | ~HR at bound | Power |
|---|---|---|---|---|---|---|---|---|
| 21.5 | 305 | 158 | 0.800 | 0.800 | 0.0052 | 0.600 | 0.661 | 0.726 |
| 28.9 | 305 | 198 | 1.000 | 1.000 | 0.0109 | 0.600 | 0.719 | 0.900 |

Overall population power

Now we use the same spending as above for the overall population, resulting in full \(\alpha\)-spending at the end of the trial even though the originally targeted events are not expected to be achieved. We note that the information fraction computed here is based on the originally planned events for the overall population. Given this and the larger proportion of patients that are biomarker positive, the average hazard ratio is stronger than originally planned and the power for the overall population is still over 90%.

gs_power_ahr(enrollRates = positive_60_enrollRates,
             failRates = failRates,
             events = 1:2,
             analysisTimes = positive_60_power$Time[1:2],
             upper = gs_spending_bound,
             # use planned spending in spite of lower overall information
             upar = upar_overall_planned_IF,
             # Still use Z = -infinity lower bound 
             lower = gs_b, lpar = rep(-Inf, 2)) %>% 
  filter(Bound == "Upper") %>% 
  table_gs_design(enrollRates = positive_60_enrollRates, spending_time = c(.8,1), max_info0 = overall_max_info0)
| Time | N | Events | Information fraction | Spending time | Nominal p | AHR | ~HR at bound | Power |
|---|---|---|---|---|---|---|---|---|
| 21.5 | 508 | 273 | 0.665 | 0.800 | 0.0052 | 0.677 | 0.731 | 0.734 |
| 28.9 | 508 | 339 | 0.827 | 1.000 | 0.0110 | 0.677 | 0.778 | 0.904 |

If we had used information-based (i.e., event-based) spending, we would not have reached full spending at final analysis and thus would have lower power.

xz <-
gs_power_ahr(enrollRates = positive_60_enrollRates,
             failRates = failRates,
             events = 1:2,
             analysisTimes = positive_60_power$Time[1:2],
             upper = gs_spending_bound,
             # use actual spending which uses less than complete alpha
             upar = upar_overall_actual_IF,
             # Still use Z = -infinity lower bound 
             lower = gs_b, lpar = lpar) %>% 
  filter(Bound == "Upper" ) 
xz %>% table_gs_design(enrollRates = positive_60_enrollRates, spending_time = NULL, max_info0 = overall_max_info0)
| Time | N | Events | Information fraction | Spending time | Nominal p | AHR | ~HR at bound | Power |
|---|---|---|---|---|---|---|---|---|
| 21.5 | 508 | 273 | 0.665 | 0.665 | 0.0022 | 0.677 | 0.706 | 0.631 |
| 28.9 | 508 | 339 | 0.827 | 0.827 | 0.0054 | 0.677 | 0.756 | 0.850 |
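
To quantify how much \(\alpha\) is left unspent under event-based spending in this scenario, the sketch below evaluates a Lan-DeMets O'Brien-Fleming spending function at the two sets of spending times from the tables above, with total spend 0.0125. The helper `ldof_spend` is hypothetical and written out here for illustration; it is intended to mirror the spending function named in the code above but is not the package implementation.

```r
# Hypothetical helper: Lan-DeMets O'Brien-Fleming-like cumulative spending
ldof_spend <- function(alpha, t) {
  2 - 2 * pnorm(qnorm(1 - alpha / 2) / sqrt(pmin(t, 1)))
}

total_alpha <- 0.0125

# Cumulative spend at interim and final analysis under each approach
spend_event_based <- ldof_spend(total_alpha, c(0.665, 0.827))
spend_spending_time <- ldof_spend(total_alpha, c(0.8, 1))

# Alpha left unspent at the final analysis
unspent <- total_alpha - c(event_based = spend_event_based[2],
                           spending_time = spend_spending_time[2])
round(unspent, 4)
```

Under these assumptions, event-based spending uses only about 0.006 of the available 0.0125 by the final analysis, while the spending time approach spends the full amount, which is consistent with the lower power in the table above.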

Biomarker subgroup prevalence lower than planned

We suppose the biomarker prevalence is 40%, lower than the 50% prevalence the design anticipated. The enrollment rates by positive versus negative patients and expected enrollment duration will now be:

positive_40_enrollRates <- rbind(
  overall_enrollRates %>% mutate(Stratum = "Positive", rate = 0.4 * rate),
  overall_enrollRates %>% mutate(Stratum = "Negative", rate = 0.6 * rate)
)
positive_40_enrollRates$duration <- positive_planned_N / positive_40_enrollRates$rate[1]
positive_40_enrollRates %>% gt() %>% fmt_number(columns = "rate", decimals=1)
| Stratum | duration | rate |
|---|---|---|
| Positive | 15 | 20.3 |
| Negative | 15 | 30.5 |

Biomarker positive subgroup power

Now we can compute the power for the biomarker positive group with the targeted events.

upar_actual_IF$total_spend <- 0.0125
upar_actual_IF$max_info <- positive_max_info0
positive_40_power <- 
gs_power_ahr(enrollRates = positive_40_enrollRates %>% filter(Stratum == "Positive"),
             failRates = failRates %>% filter(Stratum == "Positive"),
             events = (positive$bounds %>% filter(Bound == "Upper"))$Events,
             analysisTimes = NULL,
             upper = gs_spending_bound,
             upar = upar_actual_IF, 
             lpar = rep(-Inf, 2)
)
positive_40_power %>% filter(Bound == "Upper") %>% 
  table_gs_design(enrollRates = positive_40_enrollRates %>% filter(Stratum == "Positive"), 
                  max_info0 = positive_max_info0)
| Time | N | Events | Information fraction | Spending time | Nominal p | AHR | ~HR at bound | Power |
|---|---|---|---|---|---|---|---|---|
| 24.3 | 305 | 158 | 0.800 | 0.800 | 0.0052 | 0.600 | 0.661 | 0.726 |
| 31.7 | 305 | 198 | 1.000 | 1.000 | 0.0109 | 0.600 | 0.719 | 0.900 |

Overall population power

We see that by adapting the overall sample size and spending according to the biomarker subgroup, we retain over 90% power. In spite of the lower overall effect size, the larger adapted sample size ensures power retention.

gs_power_ahr(enrollRates = positive_40_enrollRates,
             failRates = failRates,
             events = 1:2,
             analysisTimes = positive_40_power$Time[1:2],
             upper = gs_spending_bound,
             upar = upar_overall_planned_IF,
             lpar = rep(-Inf,2)
) %>% filter(Bound == "Upper")  %>% 
  table_gs_design(enrollRates = positive_40_enrollRates, 
                  spending_time = c(.8, 1),
                  max_info0 = overall_max_info0)
| Time | N | Events | Information fraction | Spending time | Nominal p | AHR | ~HR at bound | Power |
|---|---|---|---|---|---|---|---|---|
| 24.3 | 508 | 416 | 1.014 | 0.800 | 0.0052 | 0.717 | 0.777 | 0.789 |
| 31.7 | 508 | 517 | 1.259 | 1.000 | 0.0110 | 0.717 | 0.817 | 0.933 |
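
In this scenario the interim information fraction relative to plan already exceeds 1. The short base R sketch below contrasts the two choices of spending time; it assumes that an event-based spending time is capped at 1, in which case all of the \(\alpha\) would be spent at the interim analysis. The variable names are illustrative only.

```r
planned_if <- c(0.8, 1)
actual_if <- c(1.014, 1.259)   # information fractions relative to plan, from the table above

# Event-based spending time, assuming it is capped at 1
event_based_time <- pmin(actual_if, 1)

# Minimum of planned and actual spending time
spending_time <- pmin(planned_if, actual_if)

rbind(event_based_time, spending_time)
```

With the cap, the event-based spending time is 1 at both analyses, leaving no incremental spend for the final analysis, whereas the minimum of planned and actual keeps the planned spending times of 0.8 and 1.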

Summary of findings

We suggested two overall findings when planning and executing a trial with a potentially delayed treatment effect:

  • Requiring both a targeted event count and a minimum follow-up before completing analysis of a trial helps ensure both that the trial is appropriately powered and that there is a better description of the tail behavior, which may be essential if long-term results are key to establishing a potentially positive risk-benefit.
  • Do not over-spend Type I error at interim analyses by using event-based spending. This helps to ensure the least stringent bounds are at the final analysis when the most complete risk-benefit assessment can be made. We gave two options for this:
    • Use a fixed, small incremental \(\alpha\)-spend at each interim such as proposed by Fleming, Harrington, and O’Brien (1984) with a variable number of interim analyses to ensure adequate follow-up.
    • Use the minimum of planned and actual spending at interim analyses.

When implementing the Fleming, Harrington, and O’Brien (1984) approach, we also suggested a simple approach to futility that may be quite useful practically in a scenario with a potentially delayed onset of treatment effect. This basically looks for evidence of a favorable control group effect relative to experimental by setting a nominal p-value cutoff at a 1-sided 0.05 level for early interim futility analyses. Where crossing survival curves or inferior survival curves may exist, this may be a useful way to ensure continuing a trial is ethical; this approach is perhaps most useful when the experimental treatment is replacing components of the control treatment or in a case where add-on treatment may be toxic or potentially have other detrimental effects.

In addition to the delayed effect example, we considered an example testing in both a biomarker positive subgroup and the overall population. Using a common spending time for all hypotheses with a common interim analysis strategy as advocated by Follmann, Proschan, and Geller (1994) can be helpful to implement spending so that all hypotheses have adequate \(\alpha\) to spend at the final analysis and also to ensure full utilization of \(\alpha\)-spending. We suggested again using the minimum of planned and actual spending at interim analysis. Spending can be based on a key hypothesis (e.g., the biomarker positive population) or the minimum spending time among all hypotheses being tested. Taking advantage of known correlations to ensure full \(\alpha\) utilization in multiple hypothesis testing is also more simply implemented with this strategy; see Anderson et al. (2021).

In summary, we have illustrated both the motivation for and the implementation of the spending time approach through examples we have commonly encountered. Approaches suggested included an implementation of Fleming, Harrington, and O’Brien (1984) with a fixed incremental \(\alpha\)-spend at each interim analysis as well as the use of the minimum of planned and actual spending at interim analyses.

References

Anderson, Keaven M, Zifang Guo, Jing Zhao, and Linda Z Sun. 2021. “A Unified Framework for Weighted Parametric Group Sequential Design (WPGSD).” arXiv Preprint arXiv:2103.10537.
Fleming, Thomas R, David P Harrington, and Peter C O’Brien. 1984. “Designs for Group Sequential Tests.” Controlled Clinical Trials 5 (4): 348–61.
Follmann, Dean A, Michael A Proschan, and Nancy L Geller. 1994. “Monitoring Pairwise Comparisons in Multi-Armed Clinical Trials.” Biometrics, 325–36.
Lan, K. K. G., and David L. DeMets. 1983. “Discrete Sequential Boundaries for Clinical Trials.” Biometrika 70: 659–63.
———. 1989. “Group Sequential Procedures: Calendar Versus Information Time.” Statistics in Medicine 8: 1191–98. https://doi.org/10.1002/sim.4780081003.
Lin, Ray S, Ji Lin, Satrajit Roychoudhury, Keaven M Anderson, Tianle Hu, Bo Huang, Larry F Leon, et al. 2020. “Alternative Analysis Methods for Time to Event Endpoints Under Nonproportional Hazards: A Comparative Analysis.” Statistics in Biopharmaceutical Research 12 (2): 187–98.
Maurer, Willi, and Frank Bretz. 2013. “Multiple Testing in Group Sequential Trials Using Graphical Approaches.” Statistics in Biopharmaceutical Research 5: 311–20. https://doi.org/10.1080/19466315.2013.807748.
Mukhopadhyay, Pralay, Wenmei Huang, Paul Metcalfe, Fredrik Öhrn, Mary Jenner, and Andrew Stone. 2020. “Statistical and Practical Considerations in Designing of Immuno-Oncology Trials.” Journal of Biopharmaceutical Statistics, 1–17.
Proschan, Michael A., K. K. Gordon Lan, and Janet Turk Wittes. 2006. Statistical Monitoring of Clinical Trials. A Unified Approach. New York, NY: Springer.
Roychoudhury, Satrajit, Keaven M Anderson, Jiabu Ye, and Pralay Mukhopadhyay. 2021. “Robust Design and Analysis of Clinical Trials with Non-Proportional Hazards: A Straw Man Guidance from a Cross-Pharma Working Group.” Statistics in Biopharmaceutical Research, 1–37.
Tsiatis, Anastasios A. 1982. “Repeated Significance Testing for a General Class of Statistics Used in Censored Survival Analysis.” Journal of the American Statistical Association 77: 855–61.