Introduction

Gastric cancer is the second most common cause of cancer mortality worldwide [1]. Patients with advanced disease represent over two-thirds of newly diagnosed cases [2]. Despite advances in diagnosis and treatment, the prognosis for patients with advanced disease remains poor, with reported median survival ranging from 5.3 to 10.2 months [3]. Several randomised trials have demonstrated survival and quality of life (QL) benefits of chemotherapy compared with best supportive care [4]. Fluorouracil (5-FU) remains one of the most widely used drugs, and several other agents, including cisplatin and anthracyclines, have been investigated in doublet and triplet combinations together with 5-FU or capecitabine [5–7]. The survival advantage of any of these combination regimens over the others is small, and no internationally accepted standard of care regimen has emerged [6, 8]. The primary objectives of treatment in this palliative setting are to relieve symptoms, improve QL and prolong survival [9, 10]. However, a recent literature review and meta-analysis concluded that the impact of chemotherapy-related toxicity on the patients’ quality of life has been insufficiently studied in patients with advanced gastric cancer [7].

Webb et al. compared the combination of epirubicin, cisplatin and 5-FU (ECF) with 5-FU, doxorubicin and methotrexate (FAMTX) in previously untreated patients with advanced esophagogastric cancer [11]. Using the EORTC QLQ-C30, the authors showed that ECF resulted in a better QL at 24 weeks compared with FAMTX. Subsequently, Ross et al. (2002) showed that ECF resulted in a better QL at 3 and 6 months when compared with mitomycin C, cisplatin and 5-FU (MCF) [12]. ECF has never been directly compared with cisplatin plus 5-FU (CF), although a meta-analysis suggests ECF has a survival advantage over CF. However, concerns remain over the toxicity of ECF and the role of epirubicin in the combination [13]. More recently, Van Cutsem et al. (2006) compared docetaxel plus cisplatin and fluorouracil (DCF) with CF as first-line therapy for advanced gastric cancer [14]. The study met its primary endpoint, showing a significant improvement in time-to-progression (TTP) with DCF compared with CF; improvements in survival and response rate were also reported. Although higher incidences of toxicity were observed in the DCF treatment arm, this did not appear to impact on QL, which was significantly better in the DCF arm. These results suggest that better tumour control may also have led to better symptom control in the DCF arm [14, 15].

Following promising results using irinotecan in combination with either 5-FU or cisplatin in phase II trials [16–19], a phase II–III trial was initiated in previously untreated advanced gastric adenocarcinoma patients comparing irinotecan plus cisplatin with irinotecan plus 5-FU given as an infusional AIO regimen (IF) [20]. Based on the risk/benefit ratio for IF in this study, a phase III trial was designed to compare IF to cisplatin plus 5-FU administered using a 5-day infusional regimen (CF). QL results for the phase III part of the study are reported here. The clinical results and initial QL results have been presented elsewhere [21].

As with other studies [14], the initial analysis of the QL data for the current study was carried out using time-to-event analysis (e.g. time to a 5% deterioration of the global health status/QL scale) in accordance with the statistical analysis plan. It is generally considered that for QL data, time-to-event analysis is limited since it does not take into account the repeated measures aspect of the data or the potential bias introduced by missing data. The analysis presented in this report addresses the fact that QL is a process and consequently is subject to change over time, that measurements taken at different time points are correlated, and that patients drop out during the study or have intermittent missing data, thus taking the entire structure of the QL data into account.

Patients and methods

Patient eligibility

Patients were to have histologically confirmed adenocarcinoma (including diffuse type, intestinal type and linitis) of the stomach or esophagogastric junction, with measurable or evaluable metastatic disease (cytology or histology was mandatory if a single metastatic lesion was the only manifestation of disease) or locally recurrent disease with at least one measurable lymph node. Patients were also required to be between 18 and 75 years of age, have Karnofsky performance status (KPS) >70%, life expectancy >3 months and adequate haematological parameters. The study was conducted in accordance with the Declaration of Helsinki and was approved by national or local ethics committees, as appropriate. All patients provided written informed consent. Further details regarding patient eligibility are provided elsewhere [21].

Study treatments

Subjects randomised to the IF arm were scheduled to receive irinotecan 80 mg/m2 as a 30-min i.v. infusion, followed by folinic acid (FA) 500 mg/m2 as a 2-h i.v. infusion, immediately followed by 5-FU 2,000 mg/m2 over a 22-h i.v. infusion, on day 1 every week for 6 weeks followed by a 1-week rest. In the CF arm, patients were scheduled to receive cisplatin 100 mg/m2 as a 1–3-h i.v. infusion on day 1, followed by 5-FU 1,000 mg/m2/day over a 24-h i.v. infusion on days 1–5, every 4 weeks. Treatment was administered until disease progression, unacceptable toxicity or withdrawal of consent.

Study design

The primary objective of the phase III part of the study was to detect a statistically significant increase in TTP for the IF arm relative to the CF arm. In addition, a non-inferiority comparison was specified in the protocol, in case of a non-significant trend towards superiority of TTP for the IF arm [22, 23]. Tumour progression was assessed according to World Health Organization Criteria and TTP measured from randomisation until the date of progression or death. Randomisation was performed using the minimisation technique [24], stratifying patients according to presence of measurable vs. evaluable disease, liver involvement, baseline weight loss, prior surgery and centre.

QL assessments

The EORTC QLQ-C30 (version 3.0) and EuroQoL (EQ-5D) instruments were used in this study. The QLQ-C30 is a cancer-specific, self-administered assessment of QL. The scale scores were calculated as per the scoring procedure defined in the EORTC QLQ-C30 Scoring Manual [25]. The EQ-5D is also a self-administered instrument comprising five questions and a visual analogue scale, which represents a rating of the patient’s health state [26]. The five single items are combined to obtain a health utility index (HUI) score. This report focuses on the global health status/QL, physical functioning, social functioning, pain and nausea/vomiting scales of the EORTC QLQ-C30 and the two EQ-5D scales.

Quality of life assessments were required at baseline, every 8 weeks until documentation of disease progression and every 3 months from the documentation of the progression until death. To be considered evaluable at baseline, a questionnaire must have been filled in within 15 days before randomisation. To be considered evaluable on treatment, a questionnaire had to be filled in more than 5 days (IF arm) or 9 days (CF arm) after the start of the latest infusion so as not to take into account the immediate toxicities following infusion. The different lag durations after the start of the infusion allowed for the different infusion durations to be taken into account (1 day in the IF arm, 5 days in the CF arm).
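The evaluability rules above can be expressed as two small checks. This is only an illustrative sketch: the function names are hypothetical, and treating the 15-day baseline limit as inclusive is an assumption, since the text does not specify boundary handling.

```python
from datetime import date

# Minimum lag (days) after the start of the latest infusion, per arm (from the text).
MIN_LAG_DAYS = {"IF": 5, "CF": 9}
BASELINE_WINDOW_DAYS = 15  # questionnaire must precede randomisation by at most 15 days


def evaluable_at_baseline(completed: date, randomised: date) -> bool:
    """True if the questionnaire was filled in within 15 days before randomisation.
    Inclusive boundaries are an assumption, not stated in the protocol text."""
    return 0 <= (randomised - completed).days <= BASELINE_WINDOW_DAYS


def evaluable_on_treatment(arm: str, completed: date, infusion_start: date) -> bool:
    """True if the questionnaire was filled in more than the arm-specific lag
    after the start of the latest infusion."""
    return (completed - infusion_start).days > MIN_LAG_DAYS[arm]
```

For example, a questionnaire completed 7 days after the start of an infusion would count as evaluable in the IF arm but not in the CF arm.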

Questionnaires without a date of assessment, or filled in after the cut-off date or after a further anti-tumour therapy, were excluded (i.e. considered non-evaluable). Since assessments were planned to be evaluated independently from cycle duration, data were to be analysed according to time windows (8-week periods, i.e. ±4 weeks of the theoretical assessment date for assessments before investigator-documented progressive disease). In case of multiple evaluable questionnaires in a time window, the mean value per subject for each scale in the time window was calculated.
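The windowing rule can be sketched as follows for one patient and one scale. The helper names are hypothetical, and the handling of exact window boundaries and of the period around randomisation (nominal week 0) is an assumption made for illustration.

```python
from collections import defaultdict
from statistics import mean

WINDOW_WEEKS = 8  # planned spacing of on-study assessments (from the text)


def window_for(weeks_since_randomisation: float) -> int:
    """Return the nominal assessment week (0, 8, 16, ...) whose +/-4-week window
    contains the questionnaire. A questionnaire exactly on a boundary is
    assigned to the later window (an assumption)."""
    index = int((weeks_since_randomisation + WINDOW_WEEKS / 2) // WINDOW_WEEKS)
    return index * WINDOW_WEEKS


def mean_score_per_window(assessments):
    """assessments: iterable of (weeks_since_randomisation, score) pairs.
    Returns {nominal_week: mean score}, averaging multiple questionnaires
    that fall in the same window."""
    by_window = defaultdict(list)
    for weeks, score in assessments:
        by_window[window_for(weeks)].append(score)
    return {week: mean(scores) for week, scores in sorted(by_window.items())}
```

With two questionnaires at weeks 7 and 9, both fall in the week-8 window and their scores are averaged into a single value.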

Questionnaires excluded from the analysis were either present but not evaluable (as described above) or missing. The reason for missing questionnaires was collected on the CRF pages. The reasons were categorised as follows: random (i.e. administrative and similar reasons not directly related to patient QL), QL related (e.g. the patient was too ill to complete the questionnaire) or dead.

Statistical methods

Quality of life compliance was calculated as the ratio of the total number of subjects with at least one evaluable questionnaire per time window over the total number of expected questionnaires [27]. Patients were counted in the total number of expected questionnaires in the window only until further anti-tumour therapy or death prevented assessment.
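A minimal sketch of this compliance calculation is given below. The data layout (a set of windows with an evaluable questionnaire and the last window in which a questionnaire was still expected) is a hypothetical representation chosen for illustration, not the study's actual data structure.

```python
def compliance_by_window(patients, windows):
    """patients: list of dicts with
         'evaluable'      - set of nominal weeks with >= 1 evaluable questionnaire
         'expected_until' - last nominal week at which a questionnaire was still
                            expected (before further therapy or death)
    Returns {week: (received, expected, rate_percent)}."""
    result = {}
    for week in windows:
        expected = [p for p in patients if p["expected_until"] >= week]
        received = sum(1 for p in expected if week in p["evaluable"])
        rate = 100.0 * received / len(expected) if expected else float("nan")
        result[week] = (received, len(expected), rate)
    return result
```

Patients leave the denominator once a questionnaire is no longer expected, so the rate reflects missing forms among patients who could still have completed one.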

Summary measures of QL scores were generated: the minimum, maximum and mean post-baseline QL scores within each patient were calculated for each scale over all evaluable questionnaires and summarised by treatment group. The Wilcoxon non-parametric test was used to compare treatment groups, as the summary measures, particularly the minimum and maximum, were not expected to be normally distributed.
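The summary-measure approach can be illustrated with the sketch below. It computes the per-patient summaries and a two-sided Wilcoxon rank-sum test using the normal approximation; the omission of tie and continuity corrections is a simplification made here, and a statistics package would normally be used instead.

```python
from statistics import NormalDist, mean


def summarise(scores):
    """Per-patient summary measures over all evaluable post-baseline scores."""
    return {"min": min(scores), "max": max(scores), "mean": mean(scores)}


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, midranks for
    ties, no tie or continuity correction). Returns (U, z, p)."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u, z, p
```

In use, `summarise` would be applied to each patient's scores for a given scale, and `rank_sum_test` to the resulting per-patient minima, maxima or means in the two arms.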

A logistic regression model was fitted to test if the QL data missing from patients who dropped out was missing completely at random (MCAR) [28]. The model included terms for time (as a linear variable expressing the 8-weekly assessment time points), treatment (as a binary variable), time by treatment interaction and two terms representing global health status/QL scores: (1) sum of the two previous scores and (2) the difference between the two previous scores. The P-value for the Wald chi-squared statistic was used to test the effect of QL scores on dropout.
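To make the structure of this dropout model concrete, the sketch below builds the covariate rows for one patient with a monotone dropout pattern. The function name, the row layout and the sign convention for the difference (most recent score minus the earlier one) are assumptions; fitting the logistic model itself is left to a statistics package.

```python
def dropout_rows(scores, treatment, n_planned, step_weeks=8):
    """scores: global health status/QL scores at consecutive planned assessments,
    baseline first, ending at the last completed questionnaire.
    treatment: 0 or 1. n_planned: number of planned assessments.
    Returns one row per assessment at which dropout could be observed."""
    rows = []
    last_index = min(len(scores), n_planned - 1)
    for t in range(2, last_index + 1):
        earlier, recent = scores[t - 2], scores[t - 1]
        time = t * step_weeks
        rows.append({
            "time": time,
            "treatment": treatment,
            "time_x_treatment": time * treatment,
            "sum_prev": recent + earlier,
            "diff_prev": recent - earlier,  # sign convention is an assumption
            "dropout": 1 if t == len(scores) else 0,
        })
    return rows
```

A patient who completed questionnaires at baseline, week 8 and week 16 and then dropped out contributes one non-dropout row (week 16) and one dropout row (week 24).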

A pattern-mixture model was fitted in SAS using Proc Mixed [29, 30]. This model allows one to model the repeated measures structure of the data taking into account the dropout pattern. Terms in the model included treatment, time, dropout pattern and their interactions. Thus, a priori, the fixed effects as well as the covariance parameters were allowed to vary unconstrained according to the dropout pattern. In addition, several baseline clinical variables (age, gender, WHO performance status, pain assessed by the clinician, prior surgery and weight loss) were considered as covariates in the model. Model reduction was carried out using a likelihood ratio test to identify the most parsimonious model consistent with the data. If the treatment effect in the final model was pattern dependent, the delta method would be used to obtain the marginal treatment effect [31]. As such, the treatment effect is estimated within each pattern and the overall marginal treatment effect is estimated using a weighted summation of the individual within pattern treatment effects, weighted according to the proportion of subjects in each dropout pattern. The null hypothesis of no treatment effect would be tested using a Wald statistic, which approximates a chi-squared distribution with one degree of freedom.
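The weighted summation and Wald test can be sketched as follows. This is an illustration under stated assumptions, not the SAS implementation used in the study: the within-pattern estimates are treated as mutually independent and independent of the pattern proportions, and the proportions are treated as multinomial. The function name and inputs are hypothetical.

```python
import math
from statistics import NormalDist


def marginal_treatment_effect(effects, variances, counts):
    """effects:   within-pattern treatment effect estimates
    variances: their estimated variances
    counts:    number of patients in each dropout pattern
    Returns (marginal effect, delta-method variance, Wald statistic, p-value)."""
    n = sum(counts)
    props = [c / n for c in counts]
    theta = sum(p * b for p, b in zip(props, effects))
    # Contribution from uncertainty in the within-pattern effects.
    var_effects = sum(p * p * v for p, v in zip(props, variances))
    # Contribution from uncertainty in the estimated pattern proportions.
    var_props = (sum(p * b * b for p, b in zip(props, effects)) - theta ** 2) / n
    variance = var_effects + var_props
    wald = theta ** 2 / variance
    # Chi-squared with 1 degree of freedom: p = 2 * (1 - Phi(sqrt(W))).
    p_value = 2 * (1 - NormalDist().cdf(math.sqrt(wald)))
    return theta, variance, wald, p_value
```

The second variance term vanishes when the treatment effect is the same in every pattern, in which case the marginal effect reduces to the common within-pattern effect.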

Results

Clinical results

Between June 2000 and March 2002, 337 patients were randomised (172 IF, 165 CF). Two patients in each arm were never treated, one due to disease progression and three due to adverse events. Thus the full analysis population, defined as treated patients analysed in the arm to which they were randomised, consisted of 333 patients (170 IF, 163 CF). The median treatment duration was 21 and 17 weeks in the IF and CF arms, respectively. The proportion of patients for whom an adverse event was reported as the reason for discontinuation was higher in the CF arm (10.0% IF, 21.5% CF; P = 0.004). A non-significant trend towards a longer median time-to-progression (TTP) was observed in the IF arm (5.0 months) compared with the CF arm (4.2 months; P = 0.088). The non-inferiority criterion, that the lower limit of the 95% CI of the Cox hazard ratio be at least 0.93, was satisfied for TTP in the full analysis population but not in the per protocol population. The median overall survival was 9.0 and 8.7 months in the IF and CF arms, respectively.

Safety was assessed in the 333 treated patients, according to treatment actually received (167 patients IF, 166 patients CF). A total of six treatment-related deaths occurred, one in the IF arm and five in the CF arm. The rate of hospitalisations was similar between arms, including the rate of hospitalisations due to toxicity (27.6% of patients). Neutropenia grade 3–4 was observed in 24.8 and 51.6% of IF and CF patients, respectively. Thrombocytopenia grade 3–4 was observed in 1.8 and 11.7% of IF and CF patients, respectively. Diarrhoea was observed more frequently in the IF arm (21.6 vs. 7.2% grade 3–4) whereas stomatitis was more prevalent in the CF arm (16.9 vs. 2.4%). Neurological toxicity was also more frequent in the CF arm, with 22.9% of patients experiencing grade 1–4 events, compared with 5.4% in the IF arm. Further clinical results are provided elsewhere [21].

QL compliance

Table 1 presents the compliance of the EORTC QLQ-C30 questionnaires by treatment group. The number of patients in each time window decreased over time due to attrition of patients. The compliance rate was higher in the IF arm at weeks 8, 16, 24, 40 and 48. The overall compliance rates were 60 and 56% in the IF and CF arms, respectively. Sixteen patients did not complete any evaluable questionnaires during this period. Monotone dropout patterns (i.e. a complete series of questionnaires before dropout) were observed in 202 cases. The remaining 115 patients had intermittent missing questionnaires. Of the 202 monotone dropout cases, 98 patients completed the baseline questionnaire only.

Table 1 Analysis of compliance for QLQ-C30 questionnaires by protocol-planned assessment (full analysis population-randomisation group)

Table 2 presents the reasons for missing/non-evaluable questionnaires. During the first 48 weeks, a total of 63 and 57 assessments which were present were excluded from the analysis because of non-evaluability in the IF and CF treatment arms, respectively. The main reason for missing assessments at earlier visits (i.e. baseline, weeks 8 and 16) was administrative and similar reasons not directly related to patient QL, whereas the main reason for missing questionnaires at later time points was death.

Table 2 Reasons for missing/non-evaluable questionnaires

Summary measures

The minimum, maximum and mean post-baseline QL scale scores were calculated for each patient and summarised by treatment group (see Table 3). There was no significant difference in QL scores between treatment groups for the minimum global health status/QL scale. However, there was a non-significant trend towards a difference when comparing the maximum (P = 0.062) and mean global health status/QL scores (P = 0.061) between groups, suggesting a trend towards a better QL for patients receiving IF. The physical functioning scale and the EQ-5D thermometer consistently presented significantly better results for all summary measures in favour of the IF treatment arm. The nausea/vomiting scale and EQ-5D HUI also indicated significant results for both the minimum and mean summary measures in favour of the IF treatment arm. Trends in favour of IF were also exhibited for the social functioning and pain scales.

Table 3 Summary measures for secondary QL endpoints

Testing the dropout process

Table 4 presents the logistic regression analysis of dropout. The two QL terms in the model “difference in QL” and “sum of QL” were significant indicating that if the sum of the two previous QL scores were low then the probability of dropout was high and if there was a decrease in QL score from the previous assessment then the probability of dropout was also high. Thus, the missing data are not MCAR and caution needs to be taken when analysing the QL data.

Table 4 Logistic regression of dropout

Figure 1 presents the mean global health status/QL scores by time and dropout pattern for each treatment group. Dropout patterns were defined based on the time of the last completed questionnaire. Four patterns were defined as follows: 1 = dropout at baseline, 2 = dropout at week 8 or 16, 3 = dropout at week 24 or 32, 4 = dropout after week 32. This resulted in 80, 88, 91 and 58 patients in the four patterns, respectively, with sufficient data within each pattern to carry out formal statistical analyses. Figure 1 illustrates that the mean global health status/QL score increased initially in all patterns in the IF treatment arm; however, the mean scores subsequently decreased prior to dropout, indicating that there was an initial improvement in global health status/QL score, possibly due to treatment, and a subsequent deterioration prior to dropout. In both treatment groups, there is a clear indication of differences between patterns with respect to mean global health status/QL score. The findings from Fig. 1 are consistent with the logistic regression analysis; in particular, the figure illustrates that patients with a low QL score and a decrease in QL had a higher probability of dropout.
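The pattern definition above amounts to a simple classification by the nominal week of the last completed questionnaire, sketched below. The function name is hypothetical; the pattern sizes are those reported in the text.

```python
def dropout_pattern(last_completed_week: int) -> int:
    """Map the nominal week of the last completed questionnaire to a pattern:
    1 = baseline, 2 = week 8 or 16, 3 = week 24 or 32, 4 = after week 32."""
    if last_completed_week <= 0:
        return 1
    if last_completed_week <= 16:
        return 2
    if last_completed_week <= 32:
        return 3
    return 4


# Pattern sizes reported in the text and the resulting weights, which are the
# proportions used when averaging within-pattern treatment effects.
PATTERN_COUNTS = {1: 80, 2: 88, 3: 91, 4: 58}
TOTAL = sum(PATTERN_COUNTS.values())
PATTERN_WEIGHTS = {k: n / TOTAL for k, n in PATTERN_COUNTS.items()}
```

The four patterns together account for the 317 patients with at least one evaluable questionnaire.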

Fig. 1 Plot of the least squares means estimates of the EORTC QLQ-C30 mean global health status by treatment group and dropout pattern. A higher score represents a better QL. Full green line represents pattern 4 (i.e. dropout after week 32). Other lines represent patterns 1–3, i.e. dropout at baseline, dropout at week 8 or 16, dropout at week 24 or 32, respectively

Model fitting

Several baseline clinical variables were considered as covariates in the model. The final model included an autoregressive order 1 variance–covariance structure, the baseline variables pain and WHO performance status, the treatment by pattern interaction and the main time effect. As there was an interaction between the treatment effect and pattern, the treatment effect was estimated using the delta method. Figure 2 presents the treatment estimates for all the QL variables investigated except for the EQ-5D HUI score (P = 0.518), which is on a different scale. Significant treatment differences were observed for the physical functioning scale, nausea/vomiting and EQ-5D thermometer in favour of the IF treatment arm. All the other scales showed non-significant results.
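For readers unfamiliar with the autoregressive order 1 structure, the sketch below builds the implied covariance matrix for equally spaced assessments: the covariance between two time points decays geometrically with the number of assessment intervals between them. The variance and correlation values in the example are hypothetical and are not estimates from this study.

```python
def ar1_covariance(n_times: int, variance: float, rho: float):
    """Covariance matrix with entries variance * rho ** |i - j| for
    n_times equally spaced assessments."""
    return [[variance * rho ** abs(i - j) for j in range(n_times)]
            for i in range(n_times)]


# Hypothetical example: three assessments, variance 4.0, lag-1 correlation 0.5.
example = ar1_covariance(3, 4.0, 0.5)
```

Only two parameters are needed regardless of the number of assessments, which is why this structure is often preferred over an unstructured matrix when later time points have few observations.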

Fig. 2 Testing the treatment effect using the delta method. EQ-5D HUI is on a different scale and consequently is not included in this figure

Discussion

In this study, preliminary analysis using summary measures was carried out in an exploratory fashion. There were a number of significant results in the comparison of the two treatment groups consistently indicating a better QL in the IF treatment group. The main differences between treatment groups were observed for the physical functioning and nausea/vomiting scales from the EORTC QLQ-C30 and the two EQ-5D scales. Non-significant trends towards a difference were observed for the social functioning, pain and global health status/QL scales in favour of IF.

More questionnaires were completed in the IF arm than in the CF arm. As such, the probability of observing an extreme result (e.g. minimum or maximum) is increased in the IF arm, since the more frequently a process is observed the more often one will observe an extreme result. The number of questionnaires and the patterns of completion of questionnaires also varied considerably between patients. Missing data were prevalent. It was shown that missing data at earlier time points were due mainly to random reasons (e.g. administrative failure), whereas questionnaires at later time points were missing mainly due to death. As such, it may be argued that intermittent missing questionnaires were primarily due to random reasons (i.e. MCAR), whereas monotone missingness was due to progression of disease or death. This latter point is supported by testing the dropout process, which indicated that questionnaires at the time of dropout were not MCAR. The results indicated that if QL scores were low then the probability of dropout was high. This phenomenon was confirmed when plotting the mean global health status/QL scores over time by dropout pattern. The imbalance in compliance and the dropout of patients suggest that simplified analyses such as time-to-event analysis and analysis using summary measures may be biased. Consequently, more complex analyses were carried out using pattern-mixture models to reduce any potential bias.

The final pattern-mixture model indicated that mean QL scores were dependent on dropout pattern and that the variance–covariance structure had an autoregressive component. Analysing the data without taking this information into account could be considered wasteful and potentially biased. Using the pattern-mixture model, significant treatment differences were observed for the physical functioning scale, nausea/vomiting and EQ-5D thermometer in favour of the IF treatment arm. These results were mainly consistent with those using the mean as a summary measure. However, for most scales the treatment effect was less significant using the pattern-mixture model. This is partially explained by the fact that between-patient variation is artificially reduced using summary measures, thus resulting in larger effect sizes. The findings of the QL analysis are also consistent with the toxicity profile as recorded through adverse event reporting [21]. While the rates of diarrhoea, cholinergic syndrome and fever without infection were higher in patients receiving IF, these symptoms were manageable in the current study. The higher rates of neurological toxicities, anorexia, stomatitis, alopecia, febrile neutropenia/neutropenic infection, thrombocytopenia and creatinine elevation in the CF arm, in addition to nausea and vomiting, are consistent with a negative impact on the physical functioning and nausea/vomiting scales. This was also reflected in the significantly higher number of withdrawals due to treatment-related AEs in the CF arm. In addition, the previously reported advantages in terms of efficacy (TTP and time to treatment failure) and clinical benefit (KPS, appetite and weight loss) all favoured patients receiving IF [21].

Analysis of the QL data using pattern-mixture analysis yielded more significant results than using time-to-event analysis. The original analysis of this study and other studies have used time-to-event analyses [14, 21]. However, analyses of QL data using time-to-event methods are limited and potentially biased for a number of reasons. For example, when analysing the global health status/QL scale, 58% of patients had censored time-to-events. In the analysis of QL data from patients with advanced gastric cancer, where the event is deterioration of QL, one could argue that there is informative censoring, i.e. missing questionnaires after dropout are not MCAR and consequently the probability of being censored is not completely at random. This is particularly important in this study when analysing the global health status/QL scale, as the majority of patients had censored time-to-events. Conversely, only 42% of patients had observed the event of interest (i.e. deterioration of QL score). As the number of QL events is small, the power to detect a treatment difference is small. Consequently, even if large differences are expected between treatment groups, the probability of observing a significant difference is small. Thus time-to-event analyses would appear to be potentially biased and wasteful for analysing QL data.

Currently, there is no internationally agreed upon gold standard for conducting and reporting QL studies in cancer clinical trials [32]. While other authors have also used the EORTC QLQ-C30 questionnaire in advanced gastric cancer, sometimes reporting of results was poor and was limited to a few paragraphs within the overall clinical paper [11, 12, 14]. For example, details concerning compliance within treatment arms were not provided and methods of analysis were sub-optimal as they did not take into account the structure of the data, i.e. repeated (correlated) measurements with missing data. It is imperative that sufficient details concerning QL assessment, analysis and reporting are provided to allow comparisons of findings across studies. This is particularly relevant in diseases such as advanced gastric cancer where survival rates are similar across treatment regimens.

The use of irinotecan-based regimens for the treatment of advanced gastric cancer has been further explored in phase II studies during the last few years, especially with the availability of new targeted agents [33–35]. Although initial results are promising, suggesting that IF could represent a potential platinum-free alternative backbone to be combined with new targeted agents, results from phase III studies are required before drawing any firm conclusions. QL assessment should be incorporated as a prominent objective in phase III studies in advanced gastric cancer to help both patients and physicians to discuss treatment choices and aid decision making [6, 7].

In summary, there was a trend in favour of IF over CF in time-to-progression. The IF treatment arm also demonstrated a better safety profile than the CF arm and a better QL on a number of multi-item scales. These results would suggest that IF offers an alternative platinum-free first-line treatment option for advanced gastric cancer which should be explored further in combination with new targeted agents.