Appendix 2: Classification tree showing the classification of newspaper articles into 1 of 3 categories: class 1 = moderately to highly exaggerated claims (relative to the scientific article); class 2 = slightly exaggerated claims; class 3 = no exaggerated claims.

The tree shown is the smallest tree whose “relative cost” (i.e., misclassification rate), as determined by 10-fold cross-validation, is within 1 standard error of that of the minimum-cost tree. Twoing for ranked data was the splitting method.24,25 The following paragraphs describe how the classification software functions (for an overview of the software used in this analysis, see www.salford-systems.com/products-cart.html) and detail the left-most partitions of the tree.
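The "1 standard error" selection rule described above can be sketched as follows: among the candidate pruned trees scored by 10-fold cross-validation, choose the smallest tree whose cost is within one standard error of the minimum. The candidate sizes, costs and standard errors below are hypothetical, for illustration only; they are not taken from this analysis.

```python
# Hypothetical cross-validation results per candidate pruned tree:
# (number_of_terminal_nodes, cv_misclassification_cost, standard_error)
candidates = [
    (12, 0.41, 0.03),
    (9, 0.40, 0.03),
    (6, 0.42, 0.03),
    (4, 0.46, 0.04),
]

# Find the minimum cross-validated cost and its standard error.
min_cost, se = min((cost, se) for _, cost, se in candidates)

# The 1-SE rule: accept any tree whose cost is within one standard
# error of the minimum, then take the smallest such tree.
threshold = min_cost + se
chosen_size = min(size for size, cost, _ in candidates if cost <= threshold)
```

With these illustrative numbers, the 9-node tree has the lowest cost, but the 6-node tree is also within one standard error of it, so the smaller 6-node tree is selected.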

The construction of classification and regression trees (CARTs) has become a common method for building statistical models from simple data. The results can be represented as binary trees, permitting simple graphic presentation, even when many variables are under consideration.

At each branching point or node of the decision tree is a binary question or statement (with a Yes or No answer) about some feature of the data set. The terminal groups or “leaves” of the tree represent the best split of the data, according to the “learn” or “training” data set. A terminal group may be a single member of some class (as in this analysis), a probability density function or a predicted mean value for a continuous variable. The basic CART algorithm is given a set of samples and instructed to find some variable that splits the data so as to maximize the differences (and minimize the similarity or “impurity”) of the 2 partitions. This splitting of the data into groups is applied recursively until some stopping criterion is reached, such as a minimum number of samples in an individual partition (or branch of the tree).
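A single split step of the kind described above can be sketched in a few lines: for one feature, try each candidate threshold and keep the binary question that minimizes the weighted impurity of the two partitions. This sketch uses Gini impurity for simplicity (the analysis itself used the twoing method), and the toy feature and labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return the threshold on one numeric feature that minimizes the
    weighted impurity of the resulting Yes/No partitions."""
    n = len(labels)
    best_threshold, best_cost = None, float("inf")
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        cost = (len(left) * gini(left) + len(right) * gini(right)) / n
        if cost < best_cost:
            best_threshold, best_cost = t, cost
    return best_threshold, best_cost

# Toy data (hypothetical): feature = year of publication,
# label = exaggeration class (2 or 3).
years = [2006, 2007, 2008, 2009, 2010, 2011]
classes = [2, 2, 2, 3, 3, 3]
threshold, cost = best_split(years, classes)  # splits cleanly at 2008
```

In a full CART implementation this search runs over every variable at every node, and the winning question becomes that node; the same search is then applied recursively to each partition until a stopping criterion is met.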

Because our purpose was to discriminate between 3 categories of newspaper articles (those with moderately to highly exaggerated claims, those with slightly exaggerated claims and those with no exaggerated claims), we used the classification tree analysis to determine the contribution of the descriptor variables listed in Table 2 to the assignment of the respective newspaper articles to the 3 categories.

The interpretation of this tree is straightforward. The left-hand partitions of the tree are described here in some detail, and the other branches can be interpreted in a similar fashion. The relative contribution of each variable to the overall shape of the tree is presented in Table 2 (as the variable importance score). At the top of the tree, all of the data are best divided into 2 groups according to the likelihood of risks or costs being reported in the scientific paper. The next node (branching point) to the left splits the data on the basis of the likelihood of risks or costs being reported in the newspaper article, and the same criterion applies to the third node. Year of publication is the variable that splits the remaining data into 2 penultimate groups. At the terminal split, the final groups (or “leaves” of the tree) are determined by the primary theme reported in the newspaper article. Node 1 (n = 29) corresponds to cases with slightly exaggerated claims in the newspaper article, whereas node 2 (n = 14) corresponds to cases with no exaggerated claims. Node 2 is “pure,” in that all cases were assigned to class 3 (100%); however, node 1 is “mixed” because both class 2 (48.3%) and class 3 (51.7%) are represented. Further splitting would resolve these apparent impurities, but the CART pruning and stopping rules that we employed determined that this was the most appropriate terminal group.
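The class mix quoted for node 1 can be reproduced from its counts. The counts below (14 of 29 cases in class 2, 15 of 29 in class 3) are inferred from the percentages in the text, not taken from the raw data.

```python
# Inferred composition of node 1: class -> number of cases (hypothetical
# counts reconstructed from the quoted percentages).
node1_counts = {2: 14, 3: 15}

n = sum(node1_counts.values())  # 29 cases in node 1
proportions = {cls: round(100 * c / n, 1) for cls, c in node1_counts.items()}
# proportions reproduces the quoted 48.3% (class 2) and 51.7% (class 3)
```

Note that a node's assigned class need not be the simple plurality class: with splitting methods such as twoing, class priors and misclassification costs also enter the assignment, which is presumably why node 1 is labeled class 2 despite the slight class 3 majority.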

Table 4