Computing p-values for Fleming-Harrington weighted logrank tests and the MaxCombo test

Introduction

This vignette demonstrates use of a simple routine to do simulations and testing using Fleming-Harrington weighted logrank tests and the MaxCombo test. In addition, we demonstrate how to perform these tests with a dataset not generated by simulation routines within the package. Note that all \(p\)-values computed here are one-sided with small values indicating that the experimental treatment is favored.

Defining the test

The MaxCombo test has been posed as the maximum of multiple Fleming-Harrington weighted logrank tests (Harrington and Fleming (1982), Fleming and Harrington (2011)). Combination tests looking at a maximum of selected tests in this class have also been proposed; see Lee (2007), Roychoudhury et al. (2021), and Lin et al. (2020). The Fleming-Harrington class is indexed by the parameters \(\rho \geq 0\) and \(\gamma \geq 0\). We will denote these as FH(\(\rho, \gamma\)). This class includes the logrank test as FH(0, 0). Other tests of interest here include:

FH(0, 1): a test that down-weights early events
FH(1, 0): a test that down-weights late events
FH(1, 1): a test that down-weights events increasingly as their quantiles differ from the median

Executing for a single dataset

Generating test statistics with `sim_fixed_n()`

We begin with a single trial simulation generated by the routine sim_fixed_n() using default arguments for that routine. sim_fixed_n() produces one record per test and data cutoff method per simulation. Here we choose 3 tests (logrank = FH(0, 0), FH(0, 1) and FH(1, 1)). When more than one test is chosen the correlation between tests is computed as shown by Karrison (2016), in this case in the columns V1, V2, V3. The columns rho, gamma indicate \(\rho\) and \(\gamma\) used to compute the test. z is the FH(\(\rho, \gamma\)) normal test statistic with variance 1 with a negative value favoring experimental treatment. The variable cut indicates how the data were cut for analysis, in this case at the maximum of the targeted minimum follow-up after last enrollment and the date at which the targeted event count was reached. Sim is a sequential index of the simulations performed.

library(simtrial)
library(knitr)
library(dplyr)
library(gt)

set.seed(123)

x <- sim_fixed_n(
  n_sim = 1,
  timing_type = 5,
  rho_gamma = data.frame(rho = c(0, 0, 1), gamma = c(0, 1, 1))
)
#> Backend uses sequential processing.

x |>
  gt() |>
  fmt_number(columns = c("ln_hr", "z", "duration", "v1", "v2", "v3"), decimals = 2)

method	parameter	estimate	se	z	p_value	v1	v2	v3	event	ln_hr	cut	duration	sim
MaxCombo	FH(0, 0) + FH(0, 1) + FH(1, 1)	-	-	−4.06	3.70532e-07	1.00	0.85	0.93	350	−0.44	Max(min follow-up, event cut)	77.17	1
MaxCombo	FH(0, 0) + FH(0, 1) + FH(1, 1)	-	-	−4.04	3.70532e-07	0.85	1.00	0.94	350	−0.44	Max(min follow-up, event cut)	77.17	1
MaxCombo	FH(0, 0) + FH(0, 1) + FH(1, 1)	-	-	−4.95	3.70532e-07	0.93	0.94	1.00	350	−0.44	Max(min follow-up, event cut)	77.17	1

Generating data with `sim_pw_surv()`

We begin with another simulation generated by sim_pw_surv(). Again, we use defaults for that routine.

set.seed(123)

s <- sim_pw_surv(n = 100)

s |>
  head() |>
  gt() |>
  fmt_number(columns = c("enroll_time", "fail_time", "dropout_time", "cte"), decimals = 2)

stratum	enroll_time	treatment	fail_time	dropout_time	cte	fail
All	0.02	experimental	23.29	1,287.17	23.32	1
All	0.14	control	6.96	306.66	7.10	1
All	0.25	control	16.96	1,761.75	17.21	1
All	0.28	experimental	3.32	1,650.14	3.60	1
All	0.46	control	19.08	787.98	19.53	1
All	0.46	experimental	39.67	50.64	40.13	1

Once generated, we need to cut the data for analysis. Here we cut after 75 events.

x <- s |> cut_data_by_event(75)

x |>
  head() |>
  gt() |>
  fmt_number(columns = "tte", decimals = 2)

tte	event	stratum	treatment
23.29	1	All	experimental
6.96	1	All	control
16.96	1	All	control
3.32	1	All	experimental
19.08	1	All	control
33.29	0	All	experimental

Now we can analyze this data. We begin with s to show how this can be done in a single line. In this case, we use the 4 test combination suggested in Lin et al. (2020), Roychoudhury et al. (2021).

z <- s |>
  cut_data_by_event(75) |>
  maxcombo(rho = c(0, 0, 1, 1), gamma = c(0, 1, 0, 1))

z
#> $method
#> [1] "MaxCombo"
#> 
#> $parameter
#> [1] "FH(0, 0) + FH(0, 1) + FH(1, 0) + FH(1, 1)"
#> 
#> $z
#> [1] -2.511925 -2.907093 -1.899871 -3.119549
#> 
#> $p_value
#> [1] 0.00204688

Suppose we want the \(p\)-value just based on the logrank and FH(0, 1) and FH(1, 0) as suggested by Lee (2007). We remove the rows and columns associated with FH(0, 0) and FH(1, 1) and then apply pvalue_maxcombo().

z <- s |>
  cut_data_by_event(75) |>
  maxcombo(rho = c(0, 1), gamma = c(1, 0))

z
#> $method
#> [1] "MaxCombo"
#> 
#> $parameter
#> [1] "FH(0, 1) + FH(1, 0)"
#> 
#> $z
#> [1] -2.907093 -1.899871
#> 
#> $p_value
#> [1] 0.003395849

Using survival data in another format

For a trial not generated by sim_fixed_n(), the process is slightly more involved. We consider survival data not in the simtrial format and show the transformation needed. In this case we use the small aml dataset from the survival package.

library(survival)
aml |>
  head() |>
  gt()

time	status	x
9	1	Maintained
13	1	Maintained
13	0	Maintained
18	1	Maintained
23	1	Maintained
28	0	Maintained

We rename variables and create a stratum variable as follows:

x <- aml |> transmute(
  tte = time,
  event = status,
  stratum = "All",
  treatment = case_when(
    x == "Maintained" ~ "experimental",
    x == "Nonmaintained" ~ "control"
  )
)

x |>
  head() |>
  gt()

tte	event	stratum	treatment
9	1	All	experimental
13	1	All	experimental
13	0	All	experimental
18	1	All	experimental
23	1	All	experimental
28	0	All	experimental

Now we analyze the data with a MaxCombo with the logrank and FH(0, 1) and compute a \(p\)-value.

x |> maxcombo(rho = c(0, 0), gamma = c(0, 1))
#> $method
#> [1] "MaxCombo"
#> 
#> $parameter
#> [1] "FH(0, 0) + FH(0, 1)"
#> 
#> $z
#> [1] -1.842929 -1.621762
#> 
#> $p_value
#> [1] 0.0491509

Simulation

We now consider the example simulation from the pvalue_maxcombo() help file to demonstrate how to simulate power for the MaxCombo test. However, we increase the number of simulations to 100 in this case; a larger number should be used (e.g., 1000) for a better estimate of design properties. Here we will test at the \(\alpha=0.001\) level.

set.seed(123)

# Only use cut events + min follow-up
x <- sim_fixed_n(
  n_sim = 100,
  timing_type = 5,
  rho_gamma = data.frame(rho = c(0, 0, 1), gamma = c(0, 1, 1))
)

# MaxCombo power estimate for cutoff at max of targeted events, minimum follow-up
x |>
  group_by(sim) |>
  filter(row_number() == 1) |>
  ungroup() |>
  summarize(power = mean(p_value < .001))
#> # A tibble: 1 × 1
#>   power
#>   <dbl>
#> 1  0.79

We note the use of group_map in the above produces a list of \(p\)-values for each simulation. It would be nice to have something that worked more like dplyr::summarize() to avoid unlist() and to allow evaluating, say, multiple data cutoff methods. The latter can be done without having to re-run all simulations as follows, demonstrated with a smaller number of simulations.

# Only use cuts for events and events + min follow-up
set.seed(123)

x <- sim_fixed_n(
  n_sim = 100,
  timing_type = c(2, 5),
  rho_gamma = data.frame(rho = 0, gamma = c(0, 1))
)

Now we compute a \(p\)-value separately for each cut type, first for targeted event count.

# Subset to targeted events cutoff tests
# This chunk will be updated after the development of sim_gs_n and sim_fixed_n
x |>
  filter(cut == "Targeted events") |>
  group_by(sim) |>
  filter(row_number() == 1) |>
  ungroup() |>
  summarize(power = mean(p_value < .025))
#> # A tibble: 1 × 1
#>   power
#>   <dbl>
#> 1  0.95

Now we use the later of targeted events and minimum follow-up cutoffs.

# Subset to targeted events cutoff tests
x |>
  filter(cut != "Targeted events") |>
  group_by(sim) |>
  filter(row_number() == 1) |>
  ungroup() |>
  summarize(power = mean(p_value < .025))
#> # A tibble: 1 × 1
#>   power
#>   <dbl>
#> 1  0.95

References

Fleming, Thomas R, and David P Harrington. 2011. Counting Processes and Survival Analysis. Vol. 169. John Wiley & Sons.

Harrington, David P, and Thomas R Fleming. 1982. “A Class of Rank Test Procedures for Censored Survival Data.” Biometrika 69 (3): 553–66.

Karrison, Theodore G. 2016. “Versatile Tests for Comparing Survival Curves Based on Weighted Log-Rank Statistics.” The Stata Journal 16 (3): 678–90.

Lee, Seung-Hwan. 2007. “On the Versatility of the Combination of the Weighted Log-Rank Statistics.” Computational Statistics & Data Analysis 51 (12): 6557–64.

Lin, Ray S, Ji Lin, Satrajit Roychoudhury, Keaven M Anderson, Tianle Hu, Bo Huang, Larry F Leon, et al. 2020. “Alternative Analysis Methods for Time to Event Endpoints Under Nonproportional Hazards: A Comparative Analysis.” Statistics in Biopharmaceutical Research 12 (2): 187–98.

Roychoudhury, Satrajit, Keaven M Anderson, Jiabu Ye, and Pralay Mukhopadhyay. 2021. “Robust Design and Analysis of Clinical Trials with Nonproportional Hazards: A Straw Man Guidance from a Cross-Pharma Working Group.” Statistics in Biopharmaceutical Research, 1–15.

Keaven Anderson, Yujie Zhao