Introduction to POSSA. Sequential tests for two means (t-tests).

Below, the essence of POSSA’s power (and type 1 error rate, T1ER) calculation for sequential designs is explained using the simple example of a t-test. Of course, for t-tests specifically, you could use other and potentially more user-friendly software – however, the point here is to understand how the POSSA works, and, subsequently, it should be possible for you to extend this example to virtually any statistical null hypothesis test. Cases of multiple hypotheses (and recording descriptives) are explained on a separate page, but these too are straightforward extensions of the present examples.

This tutorial presupposes the basic conceptual understanding of sequential analyses (and the basics or statistical power and T1ER). There are several related free tutorials online (see e.g. Lakens, 2014).

All parameters mentioned below are described in detail in the full documentation (see ?sim and ?pow).

So let’s say you want to compare the means of two independent continuous variables with a simple t-test. Your smallest effect size of interest (SESOI) is 5 units (arbitrary units that could represent anything – e.g., the height of a plant in cm, or the time of performing a task in seconds, etc.).

User functions for sample simulation and testing

First, write a function that simulates the samples (i.e., observed values) for a single instance of the hypothetical experiment in case the null hypothesis is true (no actual difference between the two population means) as well as in case the alternative hypothesis is true (there is in fact difference).

If the null hypothesis “H0” is true, the two samples have the same mean, let’s say 0 units; and let’s assume normally distributed values an SD of 10 units for each sample. For this, two samples can each be simulated via the function rnorm(sampleSize, mean = 0, sd = 10) (where sampleSize is a variable to be provided via another function, as explained later). You can assign this to sample1, which represents a baseline variable (e.g., “group A”), and, separately, to sample2_h0, which represents a sample that does not differ from the baseline (“H0” true). If the null hypothesis “H1” is true, the baseline sample may again have a mean of 0, but the other sample should have the SESOI, 5 units (again assuming an SD of 10) – which can be simulated as rnorm(sampleSize, mean = 5, sd = 10). This can be assigned to sample2_h1, which represents a sample that does differ from the baseline (“H1” true). (The baseline sample has identical properties in case of H0 and H1, so it’s enough to simulate it only once, and use that same resulting sample for both cases – for the sake of brevity as well as of computational speed.)

All this is to be put into a function, which takes a sampleSize argument and returns the simulated samples as a named list, as follows.

customSample = function(sampleSize) {
  list(
    sample1 = rnorm(sampleSize, mean = 0, sd = 10),
    sample2_h0 = rnorm(sampleSize, mean = 0, sd = 10),
    sample2_h1 = rnorm(sampleSize, mean = 5, sd = 10)
  )
}

Note that the customSample and sampleSize variables were written with camel case (likeThis), while the _h0 and _h1 endings used underscores (like_this). This alternate notation is used in the present tutorial to help distinguish what names can be freely chosen, and what is necessarily named so for the POSSA package: custom names are all camel case, while underscores indicate notation necessary for POSSA. Here specifically, the returned list’s element names for the sample that varies depending on whether the null (H0) and alternative (H1) hypothesis is true must be indicated with _h0 and _h1 endings, respectively, with a common root (here: sample2_h). The root name or the other variable names (customSample and sampleSize) could be named differently, and everything would work just as well.

Now, write a function that performs the test; here, a one-sided t-test (with equal variances assumed – this is just for the example’s outcome’s comparability for other software, but normally Welch’s test should be preferred).

The function’s parameters must be identical to the names of the list elements returned by the sample function (here: sample1, sample2_h0, and sample2_h1). The returned value must be a named vector that contains the derived p values. The name of each p value must start with p_, and end with _h0 for the “H0” outcome and with _h1 for the “H1” outcome. (Below, it’s simply p_h0/p_h1, but it could just as well be, e.g., p_my_t.test_h0/p_my_t.test_h1).

customTest = function(sample1, sample2_h0, sample2_h1) {
  c(
    p_h0 = t.test(sample1, sample2_h0, 'less', var.equal = TRUE)$p.value,
    p_h1 = t.test(sample1, sample2_h1, 'less', var.equal = TRUE)$p.value
  )
}

(Other information, in particular descriptives such as means or SDs, can also be returned via this function – but this is described in a separate vignette.)

To make sure the two functions are working correctly, as a quick test you can run the following line (with any applicable number argument).

do.call(customTest, customSample(55))
#>       p_h0       p_h1 
#> 0.67172109 0.03385126

This passes 55 as (arbitrary) sample size to customSample, which passes the created samples to customTest, which finally returns the named vector with the p_h0 and p_h1 p values. (Given the randomization, if you run this line repeatedly, the outcome should differ each time – however, with large enough numbers, p_h1 is likely to be noticeably smaller than p_h0.)

POSSA’s sim and pow

Time to load POSSA.

library('POSSA')

Now, the POSSA::sim function can be used with customSample for fun_obs (function for observation values) and customTest for fun_test (function for statistical testing), and any desired numbers of observation. (This runs the simulation 45000 times by default, so it takes a while. For quick testing, set the number to a lower value via the n_iter parameter, e.g., n_iter = 500.)

dfPvals = sim(fun_obs = customSample,
            n_obs = 80,
            fun_test = customTest)

The returned dfPvals contains all the simulated p values, and can be directly passed to POSSA::pow to get the power for any specified alpha level.

pow(dfPvals)

This prints to the console the following.

#> # POSSA pow() results #
#> N(average-total) = 160.0 (if H0 true) or 160.0 (if H1 true)
#> (p) Type I error: .05140; Power: .93327
#> Local alphas: (1) .05000

This is a fixed sample design with 80 observations per each of the two groups. The function uses an alpha of .05 by default, this may be modified e.g. to .01 below.

pow(dfPvals, alpha_global = 0.01)

#> # POSSA pow() results #
#> N(average-total) = 160.0 (if H0 true) or 160.0 (if H1 true)
#> (p) Type I error: .00991; Power: .79069
#> Local alphas: (1) .01000

So far, we can easily get the same information from other software, e.g. pwr::pwr.t.test(n = 80, d = 0.5, alternative = 'greater') (returns power = 0.93368) and pwr::pwr.t.test(n = 80, d = 0.5, sig.level = 0.01, alternative = 'greater') (returns power = 0.79068).

However, sequential designs are extremely easy to be expanded from the above sim and pow examples. One just has to specify multiple instances of observation numbers for the n_obs in sim and specify a local alphas in pow, e.g. as below.

dfPvalsSeq = sim(fun_obs = customSample,
                n_obs = c(27, 54, 81),
                fun_test = customTest)
pow(dfPvalsSeq, alpha_locals = NA)

#> # POSSA pow() results #
#> N(average-total) = 158.6 (if H0 true) or 98.8 (if H1 true)
#> (p) Type I error: .05000; Power: .89922
#> Local alphas: (1) .02288; (2) .02288; (3) .02288
#> Likelihoods of significance if H0 true: (1) .02296; (2) .01631; (3) .01073
#> Likelihoods of significance if H1 true: (1) .42324; (2) .32298; (3) .15300

The alpha_locals = NA input specifies that all local alphas are initially NA, in which case they will all be calculated to have a single identical number (corresponding to Pocock’s correction in case of equally spaced looks). Rounding to the conventional 3 fractional digits, the adjusted constant local alpha number corresponds to the exact statistics (.023) provided in other software, e.g. gsDesign::gsDesign(k = 3, test.type = 1, alpha = 0.05, sfu = 'Pocock').

The console output contains only the most important information, but, when assigned, the pow function returns a data frame with detailed information per each look (and their totals), including the specified sample sizes, the numbers of iterations, etc. The printed information includes

the average number of subjects in H0 and in H1 scenarios (i.e., out of all simulations, on average at how many participants the experiment was stopped),
the T1ER (here, it’s for a single p value, but, in case of multiple p values, it would be a combined T1ER as specified)
the power (again, would be combined in case of multiple p values),
the local alphas per look (and per each p value; but in the present case there is only one, named simply p, the root derived from p_h0/p_h1),
the ratios of significant findings per each look (corresponding to the local alphas).

The look numbers are always in parenthesis, such that “(1)” designates the first look, “(2)” the second look, etc.

The console output can also be called as print(dfPvalsSeq). The number of fractional digits to which the values are to be rounded can be changed via a round_to argument, such as e.g. print(dfPvalsSeq, round_to = 2).

Adjusting local alphas

POSSA helps obtain the local alphas, in sequential designs, that result in a specified overall (global) T1ER, which may also be called “global alpha”. This procedure is automatized for most practical cases, but still users should ideally have a basic idea of how this works. However, impatient readers (who do not want to have customized adjustments) can skip to the Specifying initial local alphas section below.

The essential mechanism is a staircase procedure, where the local alphas are continually increased when the T1ER is smaller than specified and decreased when the T1ER is larger then specified, and each direction change (from decrease to increase or vice versa) moves onto a smaller step size in the adjustment. For example, the simplest case is when all local alphas are provided as NA, as above: in this case, the initial replacement value (which may be modified via the adj_init parameter) is by default the global alpha divided by the maximum number of looks, as a rough initial estimation. Subsequently, the given p values are tested with this setting, and a global T1ER is calculated. If it is larger than specified, the replacement value is decreased, or vice versa. The amount of decrease is the first step, by default 0.01. Then the testing is repeated, and the change is repeated in the same direction until the obtained T1ER passes the specified T1ER (e.g., becomes smaller, when initially larger). Then the replacement value is increased with a smaller second step size, by default 0.05. And so on, until either there are no more steps (as specified) or (more typically) the obtained T1ER is close enough to the specified one (e.g., by default, matches it up to 5 fractional digits; may be specified via alpha_precision), at which point the procedure stops and the obtained local alphas (each having the same value) and the corresponding results are returned and printed.

The adjustment actually happens via a function that can be specified via the adjust parameter, but, by default, whenever there is at least one NA value among the given alpha_locals argument, the function implements the above-described procedure, and looks as follows.

adjust = function(adj, prev, orig) {
    prev[is.na(orig)] = adj
    return(prev)
}

In this function, adj means the adjustment value (in this case a replacement value), prev the previous vector of local alphas (obtained in the last adjustment), and orig the original vector of local alphas as provided for alpha_locals. (For the first adjustment, prev is the same as orig). Any sort of custom function may be provided as adjust, but it cannot contain any parameter other than these three (adj, prev, and orig).

By default, if no NA value is found among the given alpha_locals, the adjust function uses multiplication to adjust the given values (with adj_init set to 1), as follows.

adjust = function(orig, adj) {
    return(orig * adj)
}

By default, there are altogether 11 step sizes, chosen in a way that typically results in 5-fractional-digit T1ER match before running getting to the last step, and the calculation typically takes only a few seconds. (Step sizes can be modified via the staircase_steps parameter.)

Specifying initial local alphas

Given the above-described default mechanisms, there are, without modifying the adjustment procedure, two easy ways to provide the argument for alpha_locals (for any given p value): (a) a vector that includes (or consist entirely of) NA values that are to be replaced with a constant number that results in the desired global T1ER, or (b) a vector with all numeric values that are each to be multiplied with a number in order to obtain a vector of alphas that again result in the desired global T1ER.

However, you can also simply check the results, without any adjustment, of any given set of local alphas, by setting adjust = FALSE.

Here are some examples, all using the data frame “dfPvalsSeq” created above with POSSA::sim().

To use Haybittle–Peto boundary (all interim local alphas .001), you can specify .001 for each local alpha except for the last, which could be NA to be adjusted for the given global alpha (here: .025), as below.

pow(dfPvalsSeq, alpha_locals = c(0.001, 0.001, NA), alpha_global = 0.025)

#> # POSSA pow() results #
#> N(average-total) = 161.8 (if H0 true) or 140.4 (if H1 true)
#> (p) Type I error: .02500; Power: .88411
#> Local alphas: (1) .00100; (2) .00100; (3) .02431
#> Likelihoods of significance if H0 true: (1) .00111; (2) .00067; (3) .02322
#> Likelihoods of significance if H1 true: (1) .09224; (2) .21596; (3) .57591

In this specific scenario, despite noticeably decreased average sample in case of H1 (140.4 instead of the fixed 162), the adverse effect of interim stops on T1ER is so minimal (see, in case of H0, the tiny ratio significant [false positive] findings at the first and second looks), that the last look’s alpha (.02431) hardly differs from the specified global alpha (.025).

You can also check what happens without adjustment, using adjust = FALSE.

pow(dfPvalsSeq, alpha_locals = c(0.001, 0.001, 0.025), alpha_global = 0.025, adjust = FALSE)

#> # POSSA pow() results #
#> N(average-total) = 161.8 (if H0 true) or 140.4 (if H1 true)
#> (p) Type I error: .02551; Power: .88618
#> Local alphas: (1) .00100; (2) .00100; (3) .02500
#> Likelihoods of significance if H0 true: (1) .00111; (2) .00067; (3) .02373
#> Likelihoods of significance if H1 true: (1) .09224; (2) .21596; (3) .57798

Which leads to the same conclusion from the reverse perspective: the T1ER hardly exceeds .25.

Nonetheless, more substantial sample reduction can be achieved with somewhat more liberal interim local alphas; besides, normally it makes more sense to increase the local alphas gradually (because, in brief, given the greater uncertainty of the initial smaller samples, beginning with lower local alphas and gradually increasing them provides the most efficient “information gain” for the same eventual T1ER). For this, one could base the adjustment on one of the popular boundary methods used for parametric tests, such as the O’Brien-Fleming bounds. These bounds for parametric tests can be easily calculated via other packages. They may not apply to nonparametric tests in exactly the same way, but the principle of exponentially increasing local alphas remains the same. (To decide what fits your specific scenario best, you could also check the outcomes of various versions of the Wang-Tsiatis bounds by changing its Delta parameter; e.g., in gsDesign::gsDesign(), set sfu = 'WT' for Wang-Tsiatis, and vary its Delta via sfupar.) Most importantly in any case, the adjustment of alphas in POSSA will ensure that the T1ER is at the specified level. This will only make a difference for other custom tests and multiple hypotheses, but not for the present t-test example.

For our example of three equally spaced interim looks, O’Brien-Fleming bounds can be calculated e.g. via gsDesign::gsDesign(k = 3, test.type = 1, alpha = 0.05, sfu = 'OF'), giving the local alphas of 0.0015, 0.0181, and 0.0437. (The same is returned by rpact::getDesignGroupSequential(typeOfDesign = "OF", informationRates = c(1/3, 2/3, 1), alpha = 0.05).)

To merely calculate the T1ER and power for any specific tests, again use adjust = FALSE.

pow(dfPvalsSeq, alpha_locals = c(0.0015, 0.0181, 0.0437), alpha_global = 0.05, adjust = FALSE)

#> # POSSA pow() results #
#> N(average-total) = 160.8 (if H0 true) or 118.7 (if H1 true)
#> (p) Type I error: .04951; Power: .93087
#> Local alphas: (1) .00150; (2) .01810; (3) .04370
#> Likelihoods of significance if H0 true: (1) .00162; (2) .01809; (3) .02980
#> Likelihoods of significance if H1 true: (1) .11531; (2) .57211; (3) .24344

(To verify the power via other software, see e.g. summary(rpact::getSampleSizeMeans(rpact::getDesignGroupSequential(typeOfDesign = "OF", informationRates = c(1/3, 2/3, 1), alpha = 0.05, beta = 1-0.93087), alternative = 0.5)).)

Since these local alphas were specifically chosen to have a T1ER of .05 for the t-test comparison used here, they will hardly change even with adjust = TRUE. However, to test the adjustment function, we can set a different global alpha, e.g. .01, so that the adjustment will reduce each local alpha value by multiplication.

pow(dfPvalsSeq, alpha_locals = c(0.0015, 0.0181, 0.0437), alpha_global = 0.025)

#> # POSSA pow() results #
#> N(average-total) = 161.4 (if H0 true) or 126.9 (if H1 true)
#> (p) Type I error: .02498; Power: .87487
#> Local alphas: (1) .00071; (2) .00862; (3) .02080
#> Likelihoods of significance if H0 true: (1) .00089; (2) .00876; (3) .01533
#> Likelihoods of significance if H1 true: (1) .07629; (2) .49747; (3) .30111

(Cf. summary(rpact::getSampleSizeMeans(rpact::getDesignGroupSequential(typeOfDesign = "asUser", informationRates = c(1/3, 2/3, 1), userAlphaSpending = c(.00071, .0089, .0244), beta = 1-.87487), alternative = 0.5)).)

Finally, just for quick demonstration, for the same scenario, here is a function that adjusts all values via additions (constant increase/decrease of all local alpha values). Also, here the global alpha is set to be larger (alpha_global = 0.1), just so as to make the local alphas increase too in this case.

pow(
  dfPvalsSeq,
  alpha_locals = c(0.0015, 0.0181, 0.0437),
  alpha_global = 0.1,
  adjust = function(adj, prev, orig) {
    return(orig + adj)
  }
)

#> # POSSA pow() results #
#> N(average-total) = 161.4 (if H0 true) or 126.9 (if H1 true)
#> (p) Type I error: .02498; Power: .87487
#> Local alphas: (1) .00071; (2) .00862; (3) .02080
#> Likelihoods of significance if H0 true: (1) .00089; (2) .00876; (3) .01533
#> Likelihoods of significance if H1 true: (1) .07629; (2) .49747; (3) .30111

Futility bounds

By default, the interim local alphas will all be zero (i.e., no stopping for significance), and the final local alpha will equal the given global alpha – meaning a simple fixed design (with the maximum specified observation number).

pow(dfPvalsSeq)

#> # POSSA pow() results #
#> N(average-total) = 162.0 (if H0 true) or 162.0 (if H1 true)
#> (p) Type I error: .04860; Power: .93636
#> Local alphas: (1) none; (2) none; (3) .05000
#> Likelihoods of significance if H0 true: (1) 0; (2) 0; (3) .04860
#> Likelihoods of significance if H1 true: (1) 0; (2) 0; (3) .93636

This is to make the setting of futility bounds alone easy and straightforward. For instance, to stop collection for futility whenever the p value exceeds .5 (during interim looks), one can simply add fut_locals = 0.5.

pow(dfPvalsSeq, fut_locals = 0.5)

#> # POSSA pow() results #
#> N(average-total) = 101.1 (if H0 true) or 158.3 (if H1 true)
#> (p) Type I error: .04516; Power: .91622
#> Local alphas: (1) none; (2) none; (3) .05000
#> Likelihoods of significance if H0 true: (1) 0; (2) 0; (3) .04516
#> Likelihoods of significance if H1 true: (1) 0; (2) 0; (3) .91622
#> Futility bounds: (1) .50000; (2) .50000
#> Likelihoods of exceeding futility alpha if H0 true: (1) .03011; (2) .00176
#> Likelihoods of exceeding futility alpha if H1 true: (1) .03324; (2) .00180

This is equivalent to specifying each futility bound. Since futility bound is meaningless for the final look, and the present example (dfPvalsSeq) contains two interim looks, two values are required. Hence the pow(dfPvalsSeq, fut_locals = c(0.5, 0.5)) gives results equivalent to pow(dfPvalsSeq, fut_locals = 0.5) (as above).

Of course, once again, a constant value is not ideal; the futility bound is more effective with values decreasing by each look. The specific choices depends on how much power one is willing to sacrifice in exchange for reduced sample sizes (which, as seen in the outputs, will help primarily in case H0 is true).

pow(dfPvalsSeq, fut_locals = c(0.6, 0.3))

#> # POSSA pow() results #
#> N(average-total) = 100.9 (if H0 true) or 159.3 (if H1 true)
#> (p) Type I error: .04587; Power: .92331
#> Local alphas: (1) none; (2) none; (3) .05000
#> Likelihoods of significance if H0 true: (1) 0; (2) 0; (3) .04587
#> Likelihoods of significance if H1 true: (1) 0; (2) 0; (3) .92331
#> Futility bounds: (1) .60000; (2) .30000
#> Likelihoods of exceeding futility alpha if H0 true: (1) .01549; (2) .01307
#> Likelihoods of exceeding futility alpha if H1 true: (1) .01789; (2) .01336

Importantly, alphas stopping for significance and bounds stopping for futility can be combined in any desired way.

For instance:

pow(
  dfPvalsSeq,
  alpha_locals = c(0.002, 0.018, 0.044),
  fut_locals = c(0.6, 0.3)
)

#> # POSSA pow() results #
#> N(average-total) = 99.7 (if H0 true) or 114.3 (if H1 true)
#> (p) Type I error: .05000; Power: .92229
#> Local alphas: (1) .00211; (2) .01897; (3) .04636
#> Likelihoods of significance if H0 true: (1) .00216; (2) .01864; (3) .02920
#> Likelihoods of significance if H1 true: (1) .13907; (2) .55542; (3) .22780
#> Futility bounds: (1) .60000; (2) .30000
#> Likelihoods of exceeding futility alpha if H0 true: (1) .03304; (2) .37542
#> Likelihoods of exceeding futility alpha if H1 true: (1) .01789; (2) .01336

(Note the robust sample size reduction in case of both H0 and H1, with power hardly changed as compared to the fixed design.)

The alpha_locals functions same as without futility bounds; e.g., in this last example it was adjusted via multiplication, but it could also be again given NA values to be adjusted via replacements.

To disable, at any given look, stopping either for significance or for futility, just set the local alpha/bound to zero or to one, respectively. For example, the above example can be modified as below to have no stopping for futility at the first look and no stopping for significance at the second look.

pow(
  dfPvalsSeq,
  alpha_locals = c(0.002, 0, 0.044),
  fut_locals = c(1, 0.3)
)

#> # POSSA pow() results #
#> N(average-total) = 123.8 (if H0 true) or 145.2 (if H1 true)
#> (p) Type I error: .05000; Power: .93416
#> Local alphas: (1) .00235; (2) none; (3) .05167
#> Likelihoods of significance if H0 true: (1) .00247; (2) 0; (3) .04753
#> Likelihoods of significance if H1 true: (1) .14633; (2) 0; (3) .78782
#> Futility bounds: (1) none; (2) .30000
#> Likelihoods of exceeding futility alpha if H0 true: (1) 0; (2) .01867
#> Likelihoods of exceeding futility alpha if H1 true: (1) 0; (2) .01916

(Of course, once again, this example hardly makes sense in a real case; it serves only as a demonstration here.)

Specifying observation numbers per each sample

Consider the sample creation given in the beginning:

customSample = function(sampleSize) {
  list(
    sample1 = rnorm(sampleSize, mean = 0, sd = 10),
    sample2_h0 = rnorm(sampleSize, mean = 0, sd = 10),
    sample2_h1 = rnorm(sampleSize, mean = 5, sd = 10)
  )
}

Here, we have two different groups (denoted as “sample1” and “sample2”), but, since their sample sizes (numbers of observations) are the same, for convenience the sample size can be passed via a single parameter (here: sampleSize). It could also be specified separately, passing via two parameters. In such a case, the parameter names must exactly correspond to the sample root names (i.e., the names without any final 0/1 that specify “H0” or “H1”); here, sample1 and sample2_h.

customSampleTwoParam = function(sample1, sample2_h) {
  list(
    sample1 = rnorm(sample1, mean = 0, sd = 10),
    sample2_h0 = rnorm(sample2_h, mean = 0, sd = 10),
    sample2_h1 = rnorm(sample2_h, mean = 5, sd = 10)
  )
}

Now, the previous sim and pow procedures can be run just the same, and the results will be identical as in the first example (with single parameter passed).

dfPvalsSeqTwoParam = sim(fun_obs = customSampleTwoParam,
                         n_obs = c(27, 54, 81),
                         fun_test = customTest)

pow(dfPvalsSeqTwoParam, alpha_locals = NA)

Here, given the n_obs = c(27, 54, 81) with a single vector, this is automatically assigned to each of the two given sample parameters (sample1 and sample2_h). These can also be assigned separately, and the results will once again be identical.

dfPvalsSeqTwoParamSeparate = sim(
  fun_obs = customSampleTwoParam,
  n_obs = list(
    sample1 = c(27, 54, 81),
    sample2_h = c(27, 54, 81)
  ),
  fun_test = customTest
)

pow(dfPvalsSeqTwoParamSeparate, alpha_locals = NA)

Importantly, this also allows specifying different observation numbers for each of the two samples.

dfPvalsSeqTwoParamSeparate = sim(
  fun_obs = customSampleTwoParam,
  n_obs = list(
    sample1 = c(17, 44, 71),
    sample2_h = c(37, 64, 91)
  ),
  fun_test = customTest
)

pow(dfPvalsSeqTwoParamSeparate, alpha_locals = NA)

Now, the outcome is of course different (but would be especially affected in case of unequal variances).

Incidentally: all sequential design examples so far used two interim looks – but there could of course be any other number. For instance, here is a design with a senseless amount of six interim looks (for two differently sized samples), just for the fun of it.

dfPvalsManyLooks = sim(
  fun_obs = customSampleTwoParam,
  n_obs = list(
    sample1 = c(8, 16, 24, 32, 40, 48, 56),
    sample2_h = c(10, 20, 30, 40, 50, 60, 70)
  ),
  fun_test = customTest
)

pow(dfPvalsManyLooks, alpha_locals = NA)

#> # POSSA pow() results #
#> N(average-total) = 122.4 (if H0 true) or 78.8 (if H1 true)
#> (p) Type I error: .04991; Power: .78164
#> Local alphas: (1) .01381; (2) .01381; (3) .01381; (4) .01381; (5) .01381; (6) .01381; (7) .01381
#> Likelihoods of significance if H0 true: (1) .01478; (2) .01036; (3) .00682; (4) .00600; (5) .00449; (6) .00378; (7) .00369
#> Likelihoods of significance if H1 true: (1) .10896; (2) .14676; (3) .14107; (4) .12364; (5) .10562; (6) .08716; (7) .06844

Paired (and correlated) samples – “GRP” group notation

Correlated samples can be created with the help of various R packages – for instance, faux has the intuitive rnorm_multi function to generate multiple correlated normal distributions. In the sample generation function below, calling faux::rnorm_multi(n = sampSize, vars = 3, mu = c(0, 0, 5), sd = 10, r = c(.5, .5, 0)) returns a table with three vectors, sampled from theoretical normal distribution: X1 (mean of 0), X2 (mean of 0), and X3 (mean of 5), where X1 is correlated with both X2 and X3, but the latter two are not correlated. (Since X2 and X3 are never used in the same test in the power calculation, for the related results it doesn’t actually matter whether they are correlated; here it’s set to zero merely to imitate a real case, where H0 and H1 cases do not even exist together let alone correlate.)

As for POSSA, to indicate that there is just a single group involved, all sample names must start with GRP. This accomplishes two important things (internally): (a) it adjusts total observation number calculation (to prevent incorrect additions, e.g. counting two observations from the same sample as two samples), and (b) it ensures that, in case of sequential testing, the random sampling happens in pairs (see the pair parameter in the ?POSSA::sim documentation for details).

samplePaired = function(sampSize) {
  correlated_samples = faux::rnorm_multi(n = sampSize, vars = 3, mu = c(0, 0, 5), sd = 10, r = c(.5, .5, 0))
  list(
    GRP_v1 = correlated_samples$X1, # correlated with both X2 and X3
    GRP_v2_h0 = correlated_samples$X2, # correlated only with X1
    GRP_v2_h1 = correlated_samples$X3  # correlated only with X1
  )
}

(Again, the parameters could have been separated into GRP_v1 and GRP_v2_h, but, since their observation numbers are identical, this is unnecessary.)

Now specify the paired test.

testPaired = function(GRP_v1, GRP_v2_h0, GRP_v2_h1) {
  c(
    p_h0 = t.test(GRP_v1, GRP_v2_h0, 'less', paired = TRUE, var.equal = TRUE)$p.value,
    p_h1 = t.test(GRP_v1, GRP_v2_h1, 'less', paired = TRUE, var.equal = TRUE)$p.value
  )
}

(Again, this can be quickly verified as do.call(testPaired, samplePaired(70)).)

Now simulate the p values.

dfPvalsPaired = sim(
  fun_obs = samplePaired,
  n_obs = c(15, 30, 45),
  fun_test = testPaired
)

#> Note: Observation numbers groupped as "GRP" for GRP_v1, GRP_v2_h0, GRP_v2_h1.

Let’s do some checks with pow.

Fixed design with max sample (45 observations; cf. pwr::pwr.t.test(n = 45,d = 0.5,type = 'paired',alternative = 'greater')):

pow(dfPvalsPaired)

#> # POSSA pow() results #
#> N(average-total) = 45.0 (if H0 true) or 45.0 (if H1 true)
#> (p) Type I error: .04956; Power: .95016
#> Local alphas: (1) none; (2) none; (3) .05000
#> Likelihoods of significance if H0 true: (1) 0; (2) 0; (3) .04956
#> Likelihoods of significance if H1 true: (1) 0; (2) 0; (3) .95016

Fixed design by stopping only at the second look (30 observations; cf. pwr::pwr.t.test(n = 30,d = 0.5,type = 'paired',alternative = 'greater')):

pow(dfPvalsPaired, alpha_locals = c(0, .05, 0), adjust = FALSE)

#> # POSSA pow() results #
#> N(average-total) = 44.3 (if H0 true) or 32.3 (if H1 true)
#> (p) Type I error: .04822; Power: .84818
#> Local alphas: (1) none; (2) .05000; (3) none
#> Likelihoods of significance if H0 true: (1) 0; (2) .04822; (3) 0
#> Likelihoods of significance if H1 true: (1) 0; (2) .84818; (3) 0

Now with some sequential designs.

pow(dfPvalsPaired, alpha_locals = NA)

#> # POSSA pow() results #
#> N(average-total) = 44.1 (if H0 true) or 27.2 (if H1 true)
#> (p) Type I error: .05000; Power: .91580
#> Local alphas: (1) .02299; (2) .02299; (3) .02299
#> Likelihoods of significance if H0 true: (1) .02302; (2) .01538; (3) .01160
#> Likelihoods of significance if H1 true: (1) .41867; (2) .34931; (3) .14782

pow(dfPvalsPaired, alpha_locals = NA, fut_locals = 0.5)

#> # POSSA pow() results #
#> N(average-total) = 27.2 (if H0 true) or 26.3 (if H1 true)
#> (p) Type I error: .04998; Power: .90453
#> Local alphas: (1) .02344; (2) .02344; (3) .02344
#> Likelihoods of significance if H0 true: (1) .02353; (2) .01553; (3) .01091
#> Likelihoods of significance if H1 true: (1) .42233; (2) .34644; (3) .13576
#> Futility bounds: (1) .50000; (2) .50000
#> Likelihoods of exceeding futility alpha if H0 true: (1) .22618; (2) .17264
#> Likelihoods of exceeding futility alpha if H1 true: (1) .02720; (2) .00131

pow(dfPvalsPaired, fut_locals = c(0.6, 0.3))

#> # POSSA pow() results #
#> N(average-total) = 28.1 (if H0 true) or 44.4 (if H1 true)
#> (p) Type I error: .04709; Power: .93902
#> Local alphas: (1) none; (2) none; (3) .05000
#> Likelihoods of significance if H0 true: (1) 0; (2) 0; (3) .04709
#> Likelihoods of significance if H1 true: (1) 0; (2) 0; (3) .93902
#> Futility bounds: (1) .60000; (2) .30000
#> Likelihoods of exceeding futility alpha if H0 true: (1) .00609; (2) .00669
#> Likelihoods of exceeding futility alpha if H1 true: (1) .01482; (2) .00931

Arbitrary stops

In a real experiment, you may not know in advance the exact number of observations at the stopping point. For example, you plan (and preregister) to open an initial 30 slots for participation, after which you would perform the first testing (first interim “look”) – however, some participants will probably have to be excluded (e.g. because of failing attention checks), expected to be only a few, but cannot be known exactly in advance.

Assuming that minor variations in power do not matter, you can simply calculate with the approximate observation numbers (at the given planned looks) in advance, and, when the actual observation numbers are known, the calculation is repeated with these latter numbers to ensure that the T1ER remains as desired. For example, taking into account some necessary exclusions, the expected observation numbers with two interim looks are 27 (first look), 54 , and 81. This may be calculated as before, using O’Brien-Fleming bounds (as provided by, e.g., gsDesign::gsDesign(test.type = 1, alpha = 0.05, sfu = 'OF', timing = c(27, 54, 81)/81), then adjusted by POSSA).

dfPvals_planned = sim(fun_obs = customSample,
                n_obs = c(27, 54, 81),
                fun_test = customTest)

pow(dfPvals_planned, alpha_locals = c(0.0015, 0.0181, 0.0437), alpha_global = 0.05)

#> # POSSA pow() results #
#> N(average-total) = 160.8 (if H0 true) or 118.5 (if H1 true)
#> (p) Type I error: .05000; Power: .93164
#> Local alphas: (1) .00152; (2) .01830; (3) .04419
#> Likelihoods of significance if H0 true: (1) .00164; (2) .01831; (3) .03004
#> Likelihoods of significance if H1 true: (1) .11604; (2) .57309; (3) .24251

Now, let’s say that, due to somewhat more exclusions than expected, the initial sample contains only 24 valid observations. You can rerun the simulation with the modified first observation number, and run the power calculation with initial alphas corrected with gsDesign::gsDesign(test.type = 1, alpha = 0.05, sfu = 'OF', timing = c(24, 54, 81)/81).

dfPvals_first = sim(fun_obs = customSample,
                n_obs = c(24, 54, 81),
                fun_test = customTest)

pow(dfPvals_first, alpha_locals = c(0.0009, 0.0182, 0.0438), alpha_global = 0.05)

#> # POSSA pow() results #
#> N(average-total) = 161.0 (if H0 true) or 120.6 (if H1 true)
#> (p) Type I error: .05000; Power: .93484
#> Local alphas: (1) .00091; (2) .01850; (3) .04453
#> Likelihoods of significance if H0 true: (1) .00087; (2) .01760; (3) .03153
#> Likelihoods of significance if H1 true: (1) .07056; (2) .61742; (3) .24687

Let’s say that the p value was not below the local alpha (.02295), so you move open another 30 slots, and once again end up with more exclusions than expected, resulting in only 49 valid observations. The adjusted calculation is then as follows.

gsDesign::gsDesign(test.type = 1, alpha = 0.05, sfu = 'OF', timing = c(24, 49, 81)/81)
# gives new initial local alphas

dfPvals_second = sim(fun_obs = customSample,
                   n_obs = c(24, 49, 81),
                   fun_test = customTest)

pow(dfPvals_second, alpha_locals = c(0.0009, 0.0145, 0.0447), alpha_global = 0.05)

#> # POSSA pow() results #
#> N(average-total) = 161.0 (if H0 true) or 119.6 (if H1 true)
#> (p) Type I error: .04996; Power: .93153
#> Local alphas: (1) .00090; (2) .01453; (3) .04479
#> Likelihoods of significance if H0 true: (1) .00091; (2) .01360; (3) .03544
#> Likelihoods of significance if H1 true: (1) .07009; (2) .53733; (3) .32411

(And, if needed, a similar modified recalculation could be done for the last look.)

As can be seen, the power differences are very small. Nonetheless, if it is important to keep the same level of power with high precision (e.g., up to 3 fractional digits), you can plan for (and preregister) to conditionally adjust (increase/decrease) the subsequent interim (if any) and final sample sizes to achieve the same power. (Technically, another option is to conditionally adjust the alpha level, i.e., the T1ER, but that’s rather questionable.)