Applying the Toxicity Index to Patient-Reported Symptom Data: An Example Using the European Organization for Research and Treatment of Cancer Colorectal Cancer-Specific Quality of Life Questionnaire

Purpose: The toxicity index (TI) is a summary index that accounts for toxicity grades associated with cancer symptoms and is more sensitive than other toxicity systems to treatment differences. The TI can be used with patient-reported symptoms but requires that scores for different items represent equivalent severity. The purpose of this article is to provide an example of scoring patient-reported symptoms that satisfies the requirement of equivalent symptom severity. Methods: A sample of 1232 adults with rectal cancer from a Phase III clinical trial self-reported 18 symptoms on the European Organization for Research and Treatment of Cancer colorectal cancer measure using a 4-category response scale (not at all, a little bit, quite a bit, or very much). The participants were 22 to 85 years of age (mean age, 57 years), 30% were female, 85% were non-Hispanic white, 59% had stage II cancer, and 41% had stage III cancer. A recoded TI was created using item response theory category thresholds. Findings: The recoded TI had larger rank-order correlations than the original TI with Karnofsky performance status index, hemoglobin level, symptom bother, and other aspects of health-related quality of life. Implications: Recoding items based on category thresholds yielded a more valid TI score that can be used to summarize adverse events.
(Clin Ther. 2021;000:1–8.) © 2021 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)


INTRODUCTION
Toxicity data consist of treatment-attributable adverse events (AEs) graded as 0, 1, 2, 3, 4, or 5 for each of the 790 AE terms, grouped in 26 system organ classes according to Common Terminology Criteria for Adverse Events (CTCAE), version 4.0. 1 Toxicity grading by clinicians is a standard component of cancer clinical trial data collection. Grade 0 AEs represent the absence of toxicity. The toxicity index (TI) was inspired by hash functions and provides a summary of all n observed toxicity grades. 2 Each of the n toxicity grades $X_i$ ($i = 1, \ldots, n$) for an individual is represented in descending order: $X_1 \ge X_2 \ge \cdots \ge X_n$. An individual's TI score is a function of the ordered toxicity grades:

$$\mathrm{TI} = \sum_{i=1}^{n} \frac{X_i}{\prod_{j=1}^{i-1} (1 + X_j)}$$

where the empty product for $i = 1$ equals 1. Any TI ≥3 corresponds to a dose-limiting toxicity, and the maximum toxicity grade is the integer part of the final score. For example, a TI of 3.0 indicates a single grade 3 toxic event, whereas a TI of 3.5 means that the patient experienced at least 1 grade 3 toxic event plus additional toxic events. All toxicity grades are represented in the score, although lower grades contribute less to the final score than higher grades.
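The scoring rule can be illustrated with a short Python sketch. It assumes the TI takes the form TI = Σ X_i / Π_{j<i} (1 + X_j) over grades sorted in descending order; the function name `toxicity_index` and the example grades are hypothetical and are not taken from the trial data.

```python
from typing import Iterable


def toxicity_index(grades: Iterable[float]) -> float:
    """Sketch of the TI: sum of X_i / prod_{j<i}(1 + X_j), grades sorted descending."""
    ordered = sorted(grades, reverse=True)
    total = 0.0
    denom = 1.0  # running product of (1 + X_j) over the grades already used
    for x in ordered:
        total += x / denom
        denom *= 1.0 + x
    return total


# Hypothetical grades for illustration only.
print(toxicity_index([3]))        # a single grade 3 event -> 3.0
print(toxicity_index([1, 3, 0]))  # 3 + 1/4 + 0 -> 3.25
```

Because each additional grade is divided by the running product, the contributions after the largest grade always sum to less than 1, which is why the integer part of the score equals the maximum grade.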
The TI has potential to be used with patient-reported symptom measures. However, the TI assumes equal levels of impact for the item response categories for different symptoms. CTCAE grades are treated as equivalent across symptoms. This assumption also may be acceptable for patient-reported symptoms measured using a response scale such as not bothered at all, a little bit bothered, somewhat bothered, bothered quite a bit, and bothered very much. However, the TI approach is not ideal for summarizing patient reports of symptoms when severity is not captured. For example, use of the TI with reports about frequency of symptoms or extent to which symptoms occur may be problematic because severity may differ (eg, runny nose vs vomiting).
Category response curves provide information about item response options in multi-item scales that identifies where they fall on the underlying continuum. Item response theory can be used to get estimates of threshold parameters that represent the underlying trait level necessary to respond above each threshold with 0.50 probability. 3 These thresholds indicate the relative severity by item response options.
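The role of the thresholds can be sketched in Python, assuming the usual logistic form of the graded response model, in which the probability of responding at or above category k is 1 / (1 + exp(−a(θ − b_k))). The discrimination `a` and the thresholds used below are hypothetical values chosen for illustration, not estimates from this study.

```python
import math
from typing import List, Sequence


def prob_at_or_above(theta: float, a: float, b: float) -> float:
    """Probability of responding above threshold b at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def category_probs(theta: float, a: float, thresholds: Sequence[float]) -> List[float]:
    """Probability of each response category (4 categories for 3 thresholds)."""
    cum = [1.0] + [prob_at_or_above(theta, a, b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical item: discrimination 1.5, thresholds at -1.0, 1.0, and 3.0.
a, thresholds = 1.5, [-1.0, 1.0, 3.0]
print(prob_at_or_above(1.0, a, thresholds[1]))  # at theta equal to the threshold -> 0.5
print(category_probs(0.0, a, thresholds))       # probabilities for the 4 response options
```

An item whose top threshold sits higher on the trait continuum requires more of the underlying symptom burden before a respondent endorses very much, which is the sense in which thresholds indicate relative severity.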
This article presents a comparison of scoring the TI for a patient-reported symptom measure scored assuming equal distances between response categories versus scoring based on item thresholds in the National Surgical Adjuvant Breast and Bowel Project R-04 rectal cancer clinical trial.

Study Design and Sample
Eligible patients were diagnosed with surgically resectable stage II or III rectal adenocarcinoma. A total of 1608 patients participated in the Phase III clinical trial of rectal cancer (NCT00058474) between 2004 and 2010. 4 , 5 All patients who spoke English, French, or Spanish were invited to complete a questionnaire at baseline before randomization to treatment. If the patient was not accessible in person, staff were encouraged to mail the questionnaire to the patient or collect responses by telephone.
The trial was approved by the local institutional review boards, and all patients provided written informed consent. The secondary analyses reported here were determined to be exempt by the Cedars-Sinai and UCLA institutional review boards. The analytic sample consisted of 1232 adults with complete data for the 18 symptom items analyzed (see Measures). Adults were 22 to 85 years of age (mean age, 57 years), 30% were female, 85% were non-Hispanic white, 59% had stage II cancer, and 41% had stage III cancer ( Table I ).

Measures
The baseline patient-reported survey included 112 questions. The focus of the analyses is 18 symptoms (items 60 to 77 on the baseline survey) assessed in the European Organization for Research and Treatment of Cancer colorectal cancer-specific quality of life questionnaire (QLQ-CR38). 6 The QLQ-CR38 assesses the extent to which symptoms were experienced in the past week: not at all, a little bit, quite a bit, and very much ( Table II ). A higher score indicates a greater extent of experiencing symptoms. Also included in the baseline survey were the Functional Assessment of Cancer Therapy-Colorectal Trial Outcomes Index (FACT-C TOI), the Functional Assessment of Cancer Therapy-Gynecologic Oncology Group-Neurotoxicity 13 (FACT-GOG-NTX-13), the 36-Item Short Form Health Survey (SF-36) version 2 vitality scale, and a 17-item symptom checklist (SCL-17). 7-10 Clinical measures analyzed were the maximum AE grade, hemoglobin level, and the Karnofsky performance status index.

Statistical Analysis
Our primary analysis used baseline survey data, but we looked at consistency of results with 1-year postsurgery survey data. The standard coding of the QLQ-CR38 items is as follows: 0, not at all; 1, a little bit; 2, quite a bit; and 3, very much. We report internal consistency reliability 11 and item-scale correlations for the 18-item QLQ-CR38 symptom score using this scoring. We used categorical confirmatory factor analysis with diagonally weighted least squares to evaluate whether the items were sufficiently unidimensional to estimate response category thresholds using the item response theory graded response model. Because of content overlap (local dependency) among the QLQ-CR38 symptoms, we included 6 residual correlations (item pairs: 60 and 61, 63 and 64, 72 and 73, 72 and 74, 73 and 74, as well as 76 and 77). We evaluated model fit using the comparative fit index and the root mean square error of approximation. Comparative fit index values > 0.95 and root mean square error of approximation values < 0.06 are considered good fit. 12 Category thresholds for the 18 items (3 thresholds per item) were estimated from the graded response model. 13 The SEs around the thresholds were used to create 95% CIs. Overlapping CIs of the 3 thresholds for each item were identified. Threshold estimates were used to adjust the scoring of item responses. The 0 for not at all was preserved, but the distance between scores assigned for other response options was shifted based on differences in item thresholds.

Table I note: † Fully active (90-100) indicates able to perform all predisease performance without restriction. Restricted (70-80) indicates restricted in physically strenuous activity but ambulatory. Ambulatory (50-60) indicates ambulatory and capable of all self-care but unable to perform any work activities.
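The comparison of threshold CIs described in the statistical analysis can be sketched as follows. This is a minimal illustration that assumes a normal-approximation interval (estimate ± 1.96 × SE); the article does not state the exact procedure. The two thresholds are the item 60 values reported in this article (4.43 [0.41] and −1.28 [0.14]), and the helper names `ci95` and `overlaps` are hypothetical.

```python
from typing import Tuple

Interval = Tuple[float, float]


def ci95(estimate: float, se: float, z: float = 1.96) -> Interval:
    """Normal-approximation 95% CI (an assumption; the article gives no formula)."""
    half_width = z * se
    return (estimate - half_width, estimate + half_width)


def overlaps(ci_a: Interval, ci_b: Interval) -> bool:
    """True when two closed intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


# Item 60 thresholds reported in the article: estimate and SE.
upper = ci95(4.43, 0.41)   # quite a bit vs very much
lower = ci95(-1.28, 0.14)  # not at all vs a little bit
print(upper, lower, overlaps(upper, lower))  # the intervals do not overlap
```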
The TI is scored so that a higher score represents a greater toxicity. We hypothesized positive correlations with measures scored so that a higher score is worse (maximum AE grade, SCL-17, and worried about health in the future) and negative correlations with measures scored so that a higher score is better (Karnofsky performance status, hemoglobin, FACT-C TOI, FACT-GOG-NTX-13, and SF-36 version 2 vitality scale). Spearman rank-order correlations of the TI with these variables were estimated.
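For readers unfamiliar with rank-order correlation, the following pure-Python sketch computes a Spearman coefficient as the Pearson correlation of average ranks. The study itself used SAS; the data below are made-up toy values (not trial data) chosen only to show the expected negative sign when a higher TI accompanies a lower Karnofsky score.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Toy values for illustration only: higher TI paired with lower Karnofsky score.
ti_scores = [0.0, 1.5, 2.75, 3.25]
karnofsky = [100, 90, 80, 70]
print(spearman(ti_scores, karnofsky))  # perfectly inverse ordering -> -1.0
```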
Confirmatory factor analysis was conducted using Mplus 14 version 7, and all other analyses were conducted with SAS software, version 9.4, TS1M3 (SAS Institute, Cary, North Carolina).

RESULTS
Internal consistency reliability for the 18-item symptom scale was 0.79, and item-scale correlations (corrected for item overlap with the scale total) ranged from 0.26 to 0.49. Sufficient unidimensionality of the 18 QLQ-CR38 symptom items was supported by the fit of the 1-factor confirmatory factor analysis model (comparative fit index = 0.962 and root mean square error of approximation = 0.054). The χ 2 was 587.385 with 129 df ( P < 0.0001).
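The reported 129 df is consistent with a simple parameter count, under the assumption (not stated in the article) that the 1-factor model was fit to the correlations among the 18 items, with 1 free loading per item, the factor variance fixed for identification, and the 6 residual correlations freed:

$$df = \frac{18 \times 17}{2} - (18 + 6) = 153 - 24 = 129$$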
The original scoring of the QLQ-CR38 symptom items is given in Table III (0, not at all; 1, a little bit; 2, quite a bit; and 3, very much). Threshold estimates from the graded response model for each item are also given. To determine how to modify the original scoring of the QLQ-CR38 symptoms for the TI summary measure, we compared thresholds across items. We used as many integers as needed and no more than needed to reflect the variation in severity across symptoms indicated by the thresholds. We ended up needing to add 2 integers (4 and 5) to reflect variation in thresholds. For example, we scored responses of very much as 5 for item 60 (urinate frequently during the day) but 4 for item 63 (bloated feeling in your abdomen) because the threshold between quite a bit and very much was smaller for the latter than for item 60 (mean [SE] for item 60, 4.43 [0.41]). The TI was computed from the original scoring, and then the revised TI was scored based on item category thresholds.
The Spearman rank-order correlation between the TI and revised TI at baseline was 0.65. The revised TI was more strongly associated with other variables than was the TI ( Table IV ). The revised TI was significantly more highly associated than the TI with 6 variables: (1) Karnofsky performance status index, (2) hemoglobin level, (3) SCL-17 scale, (4) FACT-C TOI, (5) FACT-GOG-NTX-13 score, and (6) SF-36 version 2 vitality scale. We found similar results for survey data collected 1 year after surgery ( Table V ).

DISCUSSION
The value of patient-reported symptoms has been documented for > 2 decades. 15 The work reported here is consistent with the ongoing efforts to incorporate the patient's voice into the assessment of AEs in cancer clinical trials. For example, Smith et al 16 [...] The greater relative validity of the revised TI compared with the TI was supported by consistently larger associations with other variables as hypothesized.

Table III note: See Table I for item wording. Thresholds in rows that share a superscript letter do not differ significantly from one another. For example, the mean (SE) threshold between quite a bit and very much for item 60 (urinate frequently during the day) was 4.43 (0.41), whereas the mean (SE) threshold between not at all and a little bit was −1.28 (0.14), and these thresholds were significantly different.
One limitation of the TI is that it requires rank-based analysis because it does not follow any well-known probability distribution, such as the normal distribution. However, it contains more information than other toxicity analysis methods by accounting for both the multiplicity and severity of toxic effects, without losing the natural interpretability of the maximum grade approach. This added information yields greater power in detecting treatment differences than maximum grade and average toxicity approaches. 17 , 18

CONCLUSIONS
This article provides a prototype of how the TI can be applied to patient-reported symptom measures and illustrates the value of adjusting item scoring to account for different levels of underlying symptom severity. The method used to adjust scores is not the only or necessarily the best approach. Future research and applications are needed to evaluate similar and different strategies to adjust category scoring of polytomous symptom items to satisfy the underlying assumption of equivalence across items implicit in the scoring of the TI.

ACKNOWLEDGMENTS
Author contributions are as follows: Ron D. Hays: conceptualization, formal analysis, methodology, and writing original draft; Patricia A. Ganz: data curation, funding acquisition, project administration, resources, supervision, and review and editing; Karen L. Spritzer: formal analysis, software supervision, and review and editing of manuscript; and André Rogatko: funding acquisition, project administration, resources, review and editing of manuscript.

FUNDING SOURCES
This work was supported in part by grant 1U01CA232859-01 from the National Cancer Institute, National Institutes of Health. The sponsor did not have a role in the study design, collection, analysis, interpretation of data, or writing of the manuscript.