Timing of randomization
An important element in designing a TwiCs study is the timing of randomization, which varies according to the intervention or treatment under study [50]. Cohort participants can be randomized to the control or intervention arm at one moment in time, which is a feasible approach in a closed or recruiting cohort, and is referred to as the ‘single-batch sampling approach’. An alternative to the single-batch sampling approach is the ‘multiple-batch sampling’ approach [50], where a subgroup of cohort participants is randomized at one moment in time. In this approach the cohort continues to randomize eligible patients who are not allocated yet to the control or intervention arm. This approach is also feasible for closed or recruiting cohorts. Multiple rounds of randomization are conducted within the cohort. This approach was applied in the UMBRELLA FIT trial [16, 17, 31] and is also adopted in the HONEY trial [30].
For some interventions, single- and multiple-batch randomization are not feasible, because screening for trial eligibility and randomization need to take place within a short timeframe right after diagnosis, progression or relapse [50]. This entails that eligible patients should be randomized as soon as they have consented to the trial, which makes it impossible to randomize all patients at the same time. This randomization procedure is comparable to the way patients are randomized in standard RCTs. Within a cohort setting, this randomization approach often requires a recruiting cohort and can be applied shortly after the start of the cohort. The latter implies that upon (cancer) diagnosis, patients are invited to participate in a cohort study, for which they provide cohort consent and, possibly, consent to randomization into future RCTs (two-staged informed consent procedure). In case the intervention or treatment needs to be administered shortly after diagnosis, patients eligible for the trial are randomized immediately or very soon after cohort enrollment. In these situations, it is impossible to leave much time between cohort enrollment and the moment patients are randomized into a TwiCs study. This procedure was applied in the RECTAL BOOST trial [27, 28, 29], where patients provided informed consent for cohort enrollment after being diagnosed with locally advanced rectal cancer. Directly after cohort enrollment, patients who had consented to randomization into future RCTs (and met the other trial eligibility criteria) were randomized to the control arm or to the alternative treatment arm. Patients in the control arm received standard chemoradiation and patients randomized to the alternative treatment arm were offered a boost before chemoradiation. By nature of the design, patients in the control arm were not informed about this boost possibility. The same procedure was used in the VERTICAL trial [35, 36, 37, 38].
When randomization into a TwiCs study starts on the same day as, or shortly after, cohort enrollment, it is inevitable that the ‘future’ trial is already known to researchers at the moment that patients sign the two-staged informed consent. This may still lead to selection bias in the trial, which is exactly what one wants to minimize when conducting a TwiCs study. Furthermore, this potential selection bias brings another possible risk: selection into the trial may trickle down to selection for cohort enrollment and thus affect the representativeness of the cohort. When a newly diagnosed cancer patient is suited for cohort enrollment but ineligible for the TwiCs study upon diagnosis, it is highly undesirable to exclude that patient from the cohort. In other words, when recruiting patients for the cohort study, eligibility criteria for future RCTs should not be considered. This risk plays a potential role when the TwiCs study investigates (new) interventions that are known to start shortly after cohort enrollment. The advantages of TwiCs studies over standard RCTs (e.g. fast accrual) should not tempt researchers to start a cohort study for the sake of a clinical trial, as this would slowly turn the trial into an RCT following the controversial Zelen design, where patients are randomized before consent is given [51].
Non-compliance in the alternative treatment arm
In a TwiCs study, only patients randomly selected for the alternative treatment arm are asked to provide informed consent after randomization (but before treatment). As stated previously, patients randomly selected for the control arm receiving SOC are not notified about the trial, are therefore not asked for informed consent, and are not aware of the alternative intervention. As a consequence, only patients randomized to the alternative treatment arm can refuse this treatment (after randomization), and patients who refuse will receive SOC. This leads to non-compliance in the alternative treatment arm. Since control patients are not informed about the trial, this type of non-compliance (refusal of assigned treatment) is, unlike in standard RCTs, highly unlikely to be randomly distributed over study arms. It is important to consider this selective non-compliance when defining the research question, determining the effect size and calculating the required sample size of TwiCs studies. In the remainder of this manuscript, non-compliance is defined as refusal of an alternative treatment or intervention (if offered) after randomization.
Most oncology TwiCs studies presented in Table 1 anticipated the occurrence of non-compliance in the treatment arm during the design phase by including the expected non-compliance rate in the sample size calculations. However, the anticipated non-compliance rate might deviate from the actual non-compliance rate. For example, in the UMBRELLA FIT trial, the anticipated non-compliance rate was 30%, but after 152 of the initially required 166 patients had been recruited, the actual non-compliance rate was 45% [17]. In the RECTAL BOOST trial, there was an overall non-compliance rate of 27% compared to an expected rate of 20% [28]. In the VERTICAL trial, the assumed non-compliance rate was 10% while the actual rate was 27% [36]. As the TILT trial was a feasibility study, the non-compliance rate in the alternative treatment arm was considered a primary outcome measure. The authors also included failure to complete follow-up in the control arm in the non-compliance definition, so the definition of the non-compliance rate in the TILT trial differed from that in the other trials [25]. The study was considered feasible with respect to the non-compliance rate if that rate was below 10%. Of the 12 randomized patients, one patient in the alternative treatment arm refused the treatment after randomization and one control patient did not complete the follow-up schedule, which indicates that the 10% maximum was exceeded.
These results show that the actual non-compliance rates deviated from the expected non-compliance rates. The non-compliance rate in the treatment arm can be interpreted as a methodological challenge of a TwiCs study that requires careful consideration when defining the research question, choosing the clinical endpoints and determining the required sample size. In the upcoming subsections we will discuss the implications of non-compliance for the treatment effect estimate and the statistical power. However, before these aspects are discussed, it is necessary to first clarify which effect is estimated in a TwiCs trial and how this is connected to the research question. This will be the topic of the next subsection.
Defining the efficacy estimand in a TwiCs study
For this discussion we consider the guidelines outlined in the ICH E9 (R1) draft addendum on “Estimands and Sensitivity Analysis in Clinical Trials” [52]. The estimand of a clinical trial can be defined as the targeted treatment effect that reflects the research question given by the research objective. It provides a summary at the level of the population of what the treatment effect would be in the same patients under the different treatment options being compared. How the estimand is to be estimated should be specified in advance of the trial; once this is defined, the trial can be designed such that a reliable estimate of that treatment effect can be obtained. To define the estimand in a clinical trial, it is necessary to anticipate so-called intercurrent events, which are defined as events that mark a change in the course of treatment and that influence the estimation and interpretation of treatment effects. Intercurrent events need to be addressed a priori when describing the clinical research question of interest. In a TwiCs trial, non-compliance, or refusal of the alternative treatment after randomization but before the start of treatment, can be regarded as such an intercurrent event. This phenomenon alters the interpretation of the treatment effect and should be considered when defining the estimand. More specifically, we should ask what the estimand in a TwiCs study is, which effect is of interest (what is the research question?) and how that effect can be estimated. For the remainder of this discussion, we only consider the refusal of an offered alternative treatment or intervention after randomization as the known intercurrent event in a TwiCs study and therefore only discuss the implications of that particular event.
First, it is important to assume that non-compliance due to refusal only occurs in the alternative treatment arm, which means that the intercurrent event is dependent on the assigned treatment. Second, it is also assumed that the occurrence of non-compliance will affect the treatment effect indefinitely—once a patient refuses offered treatment, the patient will receive the SOC for the remainder of the trial duration. Finally, it is assumed that the control patients do not have access and will not get the alternative treatment since these patients are not informed about the trial. In other words, there is no contamination in the control group.
Treatment policy strategy
The way non-compliance is addressed in the trial defines the research question that a TwiCs study is able to answer. One of the strategies to address the research question described in the ICH E9 (R1) draft guidance document is the treatment policy strategy—the intercurrent event is taken to be part of the treatment regimen of interest. The treatment effect is then estimated irrespective of the occurrence of an intercurrent event and the estimand is a combined effect of the initial randomized treatment and the treatment modified by the intercurrent event. Adopting a treatment policy approach has become known as the “Intention-to-treat” (ITT) approach—all patients are analyzed ‘as randomized’ regardless of the occurrence of the intercurrent event. For a TwiCs study, this implies that the non-compliance rate in the alternative treatment group is considered as part of the treatments being compared.
What does this mean for the interpretation of the ITT effect in a TwiCs study? This question can be answered by first taking a closer look at the ITT definition in a standard RCT. The ITT effect in a standard RCT is generally interpreted as the average causal effect (ACE) of the assigned treatment. The ACE measures the difference in the mean outcome between patients assigned to the alternative intervention and patients assigned to SOC. It has been argued that the ACE of a standard RCT is, on average, an unbiased estimate of the population mean effect of the alternative treatment compared to SOC in patients receiving treatment, under the assumption that treatments are randomly assigned, thereby assuming no confounding exists [53, 54]. Assuming that all patients also receive the assigned treatment, we refer to the ITT effect in a standard RCT as the ACE of received treatment for the remainder of this discussion. Although this technically is not the pure ITT definition (analyzing ‘as randomized’ regardless of taking up treatment), it is important to mention this nuance here when discussing the difference between the ITT definition of a standard RCT and a TwiCs study. In a TwiCs study, patients are also assigned a treatment, but the difference is that patients are offered alternative treatment when assigned to that treatment, whereas in a standard RCT we expect all patients to receive the assigned treatment. Therefore, to distinguish between the ACE of a standard RCT and a TwiCs study, we refer to received treatment and offered treatment, respectively.
Non-compliance is known to be a methodological problem that can lead to bias in estimating the ACE of received treatment in randomized experiments [55]. However, in a standard RCT, the refusal of treatment happens generally before randomization and these patients do not enter the trial. Furthermore, (potential) non-compliance is randomly distributed over the treatment arms in a standard RCT, which avoids immediate bias in estimating the ACE of a received treatment. In fact, only selective non-compliance in a standard RCT might lead to a more-or-less biased ACE of received treatment relative to the population value of the ACE of received treatment.
In contrast, for a TwiCs study, it is already expected beforehand that non-compliance only occurs in the alternative treatment arm after randomization; the intercurrent event occurs by nature of the design. As a result, non-compliance is known to be not random (selective non-compliance) and therefore, the treatment effect under the ITT-principle will be diluted when incorporating non-compliant patients. One might argue that non-compliance in a TwiCs study leads to a biased ACE of received treatment, but this is incorrect, because a TwiCs study simply adopts a different estimand compared to a standard RCT. As stated earlier, the ITT effect of a TwiCs study is the ACE of offered treatment rather than received treatment. This also means that when we speak of bias in a TwiCs study, it is important to refer to bias in the estimand of a TwiCs study. In a situation where the refusal rate of the alternative intervention in the trial matches that of the population, the ACE in a TwiCs study will provide an unbiased estimate of the population mean effect of the alternative intervention compared to SOC in patients who are offered the alternative intervention compared to patients receiving SOC [54]. The key point here is that a TwiCs study and a standard RCT estimate a different ITT effect (estimand) under the treatment policy strategy and therefore answer different research questions. Bias in a standard RCT is defined as bias relative to the effect of received treatment, whereas bias in a TwiCs study is defined as bias relative to the effect of offered treatment. Consequently, a TwiCs study will not provide a biased estimate of the ACE observed in a standard RCT, as sometimes falsely claimed (see the section on ‘Analysis of a TwiCs study’).
In the UMBRELLA FIT, RECTAL BOOST, VERTICAL, and SPONGE trials, the primary analysis was done according to the ITT principle. However, as explained above, interpreting the ITT effect of a TwiCs study cannot be separated from the non-compliance rate in the alternative treatment arm. Therefore, the final results should be worded carefully. For example, in the VERTICAL trial, the interpretation of the results was expressed as: “we found no differences in pain response, pain scores, and global QOL between patients receiving cRT and those (offered to be) treated with SBRT” (p. , [36]). The part between brackets points out that treatment effects represent the effect of being offered the alternative treatment compared to receiving SOC rather than a comparison of patients receiving two different treatments [54]. The same phrasing with respect to the treatment effect under the ITT principle was adopted when presenting the UMBRELLA FIT trial results. In addition, for the UMBRELLA FIT trial, results were reported both for patients offered the alternative intervention and for patients accepting the alternative intervention [16].
Finally, analyzing a TwiCs study according to the treatment policy strategy ensures that the occurrence of the intercurrent event is also of main interest [56], which means that a TwiCs study can be used to gain insight into the acceptability of an alternative treatment. This was recognized in the VERTICAL trial, where this acceptability was explicitly stated when discussing the results [36]. Therefore, acceptability of the alternative treatment could be part of the research question and must be seen as part of the treatment effect [57].
Principal stratum strategy
In addition to the treatment policy strategy, the ICH E9 (R1) guideline lists four other strategies to address the research question. Each of these strategies addresses a different research question. We will briefly discuss one other strategy that plays a role in the TwiCs setting: the principal stratum strategy, where the intercurrent event is considered a confounding factor when estimating a treatment effect. In sum, the treatment effect is estimated in a (target) population (“stratum”) whose status with respect to the intercurrent event is similar, irrespective of treatment arm. For a TwiCs study, this means that the treatment effect is estimated in a population that is able and willing to accept the assigned treatment. Using analysis strategies other than the ITT approach, an estimate of the treatment effect under perfect compliance can be generated, typically based on causal inference models [58]. An example of such an estimate is the complier average causal effect (CACE), which provides an unbiased treatment effect for patients who comply with the protocol [59]. This definition diverges from the ITT definition in a TwiCs study, which demonstrates that the two estimands address different research questions.
The remaining strategies listed in ICH E9 (R1) may also apply to the TwiCs design, but only the treatment policy strategy and the principal stratum strategy have been described in publications of TwiCs trials, which is why we limit the discussion to these two strategies. For a detailed overview of how to define the estimand based on different strategies, with detailed examples, see [56, 60, 61].
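The relation between the two estimands can be written out as a short worked equation. This is a sketch in notation introduced here, not taken from the cited guidance: Z is the randomization indicator (1 = offered the alternative treatment), Y the outcome, and r the proportion of refusers in the offered arm. It relies on the assumptions stated earlier (refusal occurs only in the alternative arm, no contamination of controls) and on one additional assumption, namely that the offer itself does not change the outcome of patients who refuse.

```latex
\mathrm{ACE}_{\text{offered}}
  = \mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0]
  = (1 - r)\,\mathrm{CACE},
\qquad\text{so}\qquad
\mathrm{CACE} = \frac{\mathrm{ACE}_{\text{offered}}}{1 - r}.
```

As a purely hypothetical illustration, an ITT difference of 3.0 points with a refusal rate of r = 0.27 would correspond to a CACE of 3.0 / 0.73, or about 4.1 points. The two numbers answer different questions and neither replaces the other.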
In sum, different research questions can in principle be addressed by a TwiCs study. The research question drives the definition of the estimand(s) of interest in a TwiCs study, which should be defined before the start of the study. These definitions will then determine the primary analysis and, importantly, power and sample size assessment. It is crucial to mention that these different estimands should not be interpreted as alternatives to one another, but merely as ways to answer different research questions.
Analysis of a TwiCs study
The effect of the alternative treatment arm compared to control can simply be estimated by comparing the group of patients randomized to the alternative treatment arm with the group of patients randomized to SOC, using an appropriate statistical test. This approach is similar to the primary analysis strategy of most randomized trials. However, the result of this analysis in a TwiCs study should not be interpreted as the ACE observed in a standard RCT, because the non-compliance rate observed in the intervention arm dilutes this effect and should be taken into consideration when interpreting the results.
When the main focus is the effect of the intervention under compliance (principal stratum strategy), the analysis must be adapted accordingly. In the TwiCs literature, instrumental variable (IV) analyses have been proposed to accomplish this [57, 62, 63]. These IV analyses use a two-stage least squares method to account for possible non-compliance in the alternative intervention group [64]. In the first stage, the effect of exposure (actual treatment received) is predicted by the effect of randomization. In the second stage, this information is used to understand how the exposure affects the outcome. Two different IV analyses were proposed by Pate et al. [63] and Candlish et al. [62] to analyze TwiCs studies. In the first IV analysis, a two-stage regression model is applied. In the first stage, the effect of randomization on exposure is estimated using logistic regression, which provides the estimated exposure given the allocated treatment. Subsequently, in the second stage, a regression model for the outcome is fitted using the estimated exposure from the previous logistic regression model as covariate. The effect of the estimated exposure on the outcome provides the estimated treatment effect of interest. The second IV analysis also starts with a logistic regression model predicting exposure by randomization, but here the residual term is calculated as the difference between actual exposure and predicted exposure. In the second regression model, the outcome is modeled as a function of the treatment received and the residuals calculated from the previous logistic regression where the coefficient of treatment received provides the estimated treatment effect.
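The two-stage logic described above can be sketched in a few lines of Python. This is an illustrative sketch on simulated data, not the code used by Pate et al. or Candlish et al. All names (`simulate`, `iv_estimates`, `ols`) and all numbers (30% refusal, a true effect of 5 points, a 4-point baseline difference for refusers) are hypothetical. Two simplifying assumptions are made: the outcome is continuous and modeled linearly in the second stage, and randomization is the only first-stage covariate. Under the second assumption, the fitted values of the first-stage logistic regression equal the observed proportion treated in each arm, so the sketch computes those proportions directly instead of fitting the logistic model.

```python
import random


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def simulate(n=2000, refusal=0.3, effect=5.0, seed=1):
    """Hypothetical TwiCs data: z = offered (1) or SOC (0), d = treatment received."""
    rng = random.Random(seed)
    z, d, y = [], [], []
    for _ in range(n):
        zi = 1 if rng.random() < 0.5 else 0
        refuser = 1 if rng.random() < refusal else 0  # would refuse if offered
        di = 1 if (zi == 1 and refuser == 0) else 0   # controls never get treatment
        # Assumption: refusers differ at baseline (selective non-compliance).
        yi = 50.0 - 4.0 * refuser + effect * di + rng.gauss(0.0, 10.0)
        z.append(zi)
        d.append(di)
        y.append(yi)
    return z, d, y


def mean(v):
    return sum(v) / len(v)


def iv_estimates(z, d, y):
    # Stage 1: predicted exposure given allocation (arm-wise proportion treated).
    p = {a: mean([di for zi, di in zip(z, d) if zi == a]) for a in (0, 1)}
    d_hat = [p[zi] for zi in z]
    # ITT: offered versus SOC, analyzed as randomized.
    itt = (mean([yi for zi, yi in zip(z, y) if zi == 1])
           - mean([yi for zi, yi in zip(z, y) if zi == 0]))
    # First IV analysis: outcome regressed on predicted exposure.
    two_sps = ols([[1.0, dh] for dh in d_hat], y)[1]
    # Second IV analysis: outcome regressed on received treatment
    # and the stage-1 residual (actual minus predicted exposure).
    two_sri = ols([[1.0, di, di - dh] for di, dh in zip(d, d_hat)], y)[1]
    return {"acceptance": p[1] - p[0], "itt": itt,
            "wald": itt / (p[1] - p[0]),
            "two_sps": two_sps, "two_sri": two_sri}


estimates = iv_estimates(*simulate())
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

In this linear, covariate-free setting both IV estimates coincide with the ITT difference divided by the acceptance proportion; they can differ once covariates or non-linear outcome models are used, which is the setting the cited simulation studies examine.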
In two simulation studies, the performance and accuracy of the ITT and IV analysis in analyzing TwiCs study results were investigated [62, 63]. The authors reveal that the larger the refusal rate, the more bias was found in the ITT effect as expected in a standard RCT. However, considering our arguments in the previous Section, this is a logical finding. When acknowledging that a TwiCs study estimates a different ITT effect compared to a standard RCT, it is expected that the ITT effect of a TwiCs study deviates from a (simulated) ITT effect of a standard RCT, but that should not be interpreted as bias. Again, bias in the ITT effect of a TwiCs study should not be seen as bias relative to the ITT effect of a standard RCT, but relative to its own definition. For example, when non-compliance depends on certain patient characteristics (e.g. only male participants refuse treatment), we can expect bias in the ACE of offered treatment relative to the population value. Furthermore, in the same simulation studies, it was also found that when refusal in the intervention arm is present, the IV analyses in a TwiCs study provided an effect estimate that was closer to the ITT effect estimate of a standard RCT than the ITT effect of a TwiCs study was to the ITT effect estimate of a standard RCT [62, 63]. This implies that for researchers who are interested in deriving a treatment effect from a TwiCs study that is close to the ITT of a standard RCT, IV analyses offer this possibility. However, this does not fix the issue of non-compliance and we believe that it is not necessary to fix this as long as researchers acknowledge that a TwiCs study estimates something different compared to a standard RCT.
With respect to the completed TwiCs oncology studies (Table 1), only the UMBRELLA FIT trial provided results of an ITT and IV analysis. In addition, in the UMBRELLA FIT trial, another alternative analysis strategy was used, namely a propensity score analysis by comparing intervention accepters to patients in the control group who would have accepted the alternative intervention if offered [16]. This propensity score analysis serves as a sensitivity analysis to the IV analysis, because it is unknown whether intervention refusers are influenced by the offer of the intervention.
Statistical power
In general, sample size calculations should be based on the anticipated treatment effect according to the ITT definition. The anticipated ITT effect of a TwiCs study reflects the ITT effect considering non-compliance in the alternative treatment arm (offered treatment) and will therefore be smaller than the ITT effect in a standard RCT. As a result, required sample sizes for obtaining sufficient power in a TwiCs study are often larger than those of standard RCTs [62, 63].
A critical issue in TwiCs studies is that the expected non-compliance rate may diverge from the actual non-compliance rate, which was the case in the UMBRELLA FIT trial, the RECTAL BOOST trial, and the VERTICAL trial (see the Section on ‘Non-compliance in the alternative treatment arm’). Consequently, the sample size had to be updated during the trial based on the actual non-compliance rate, which was also recommended by Candlish et al. [62]. This can have severe implications when the observational cohort is limited in the number of available patients, which can be the case in a closed cohort [50]. Updating the required sample size is easier in recruiting cohorts. Furthermore, recruiting cohorts have the advantage that the non-compliance rate can be updated after each randomization and the sample size can be adapted until the actual non-compliance rate is reached. It has been recommended to calculate the required sample size under different non-compliance rate assumptions during the design stage [50, 54] or to first perform a pilot study before the actual TwiCs study to obtain insights in the actual refusal rates [17].
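The recommendation to calculate the required sample size under several non-compliance assumptions can be illustrated with a small sketch. It uses the standard normal-approximation formula for comparing two means with 1:1 allocation and assumes that refusal dilutes the effect among accepters by a factor of one minus the refusal rate, that refusers have the same outcomes as they would have had under SOC, and that the outcome standard deviation is unaffected. The function name `n_per_arm` and the input values (effect of 5 points, standard deviation of 10) are hypothetical; the refusal rates mirror those mentioned in this manuscript.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sd, refusal=0.0, alpha=0.05, power=0.80):
    """Patients per arm to detect the diluted ITT effect (two-sided test)."""
    z = NormalDist().inv_cdf
    # Assumption: the ITT effect of offered treatment is (1 - refusal) * delta.
    diluted = (1.0 - refusal) * delta
    return 2.0 * ((z(1.0 - alpha / 2.0) + z(power)) * sd / diluted) ** 2


# Required sample size per arm under a range of assumed refusal rates.
for r in (0.0, 0.10, 0.20, 0.30, 0.45):
    print(f"refusal {r:.0%}: {ceil(n_per_arm(5.0, 10.0, r))} per arm")
```

Under these assumptions the required sample size grows with the inverse square of the acceptance proportion, which is why a refusal rate that turns out higher than anticipated can have a large impact in a closed cohort.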
As a final note on the sample size we would like to point out that the discussion of the (diluted) ITT effect so far holds for superiority trials. A diluted ITT effect makes it easier to demonstrate non-inferiority or equivalence. In general, the ITT effect in non-inferiority trials is anti-conservative [65]. Therefore, in designing and analyzing TwiCs non-inferiority trials, a per protocol analysis excluding non-compliance should be considered. However, since non-compliance only occurs in the alternative treatment arm, it is unclear how this will affect treatment group balance and hence the interpretation of non-inferiority. To our knowledge, there have been no proposed or conducted non-inferiority TwiCs studies to date.
Multiple TwiCs studies within the same cohort
Until now, the discussion about the methodological challenges encountered in TwiCs studies was focused on performing only one TwiCs study within a cohort. However, in the Introduction Section, we mentioned the possibility of running multiple TwiCs studies within the same cohort. A TwiCs study uses a prospective observational cohort study, and this cohort typically represents a broad population of interest. When running multiple TwiCs studies within the same cohort, either consecutively or in parallel, these studies are most often considered separate, stand-alone trials that each answer a different research question and that use their own concurrent control and intervention participants. They may also target different subpopulations within the cohort. This is no different from performing multiple standard RCTs in a general population, or in collaborative networks across study sites, and it is therefore not required to adjust for multiplicity or Type I error inflation when running multiple TwiCs studies within the same cohort. Only in a scenario where, e.g., different TwiCs studies use a shared control group, similar to how controls can be used in platform trials, may a multiplicity correction (e.g. controlling the family-wise error rate) be required [66, 67].
Simulations have shown that when two TwiCs studies share the same control group, results between the two trials are correlated [68]. However, since the objectives of each individual TwiCs study stand on their own, there is no intention to investigate the effect of a series of treatments that are linked together. The scenario of overlapping control arms is thus not likely to occur. Confounding between two treatment arms (control or intervention) of two different TwiCs studies (e.g. when patients can only receive the alternative treatment in one study) also tends to result in correlated trial results [68], but this scenario is unlikely as it violates the equal treatment assignment probability across patients [69]. Moreover, observational cohort studies include a large number of patients and most cohort studies are recruiting cohorts, which also decreases the chance of overlapping treatment arms across TwiCs studies.
In sum, overlap of treatment arms across multiple TwiCs studies is considered a minor potential methodological challenge. However, if it does occur, the availability of a cohort study offers an important advantage, because a patient’s treatment status in other TwiCs studies within the same cohort is known and can thus be taken into account when randomizing patients for a new TwiCs study. For example, in the Dutch PLCRC, the RECTAL BOOST [28] and SPONGE [34] trials are two consecutive trials, and the trial status of the RECTAL BOOST trial was used as a stratification factor when randomizing patients for the SPONGE trial. In contrast, when running multiple standard RCTs within a general population, other trial inclusions are not structurally collected and therefore not always known.