## Abstract

Experiments with complex treatment structures are not uncommon in horticultural research. For example, in augmented factorial designs, one control treatment is added to a full factorial arrangement, or an experiment might be arranged as a two-factorial design with some groups omitted because they are practically not of interest. Several statistical procedures have been proposed to analyze such designs. Suitable linear models followed by F-tests provide only global inference for main effects and their interactions. Orthogonal contrasts are demanding to formulate and cannot always reflect all experimental questions underlying the design. Finally, simple mean comparisons following global F-tests do not control the overall error rate of the experiment in the strong sense. In this article, we show how multiple contrast tests can be used as a tool to address the experimental questions underlying complex designs while preserving the overall error rate of the conclusions. Using simultaneous confidence intervals allows for displaying the direction, magnitude, and relevance of the mean comparisons of interest. Along with application in statistical software, shown by two examples, we discuss the possibilities and limitations of the proposed approach.

In agricultural and horticultural research, controlled experiments are set up to evaluate the effect of several treatments and their interactions on physiological or developmental variables. If the levels of a first factor of potential influence are combined with all levels of a second factor, the resulting experiment has a two-factorial, completely crossclassified treatment structure. However, often the experimental questions are manifold and, together with the background knowledge on the practical problem, lead to complex treatment structures. Augmented factorial designs are a common example, in which one or several control treatments are added to a completely crossclassified factorial design. Then, comparisons between the control treatment and the full factorial part are of interest as well as the analysis of the factors arranged in the full factorial part. Another reasonable setting arises when the effect of two factors and their interaction is investigated, but not all combinations between the levels of the two factors are of practical interest. Then, these treatments are reasonably omitted from the experiment, leading to a treatment structure that is crossclassified with a few missing cells.

Among others, Marini (2003) discussed different strategies for analyzing augmented factorial designs. He considered four different linear models followed by comparisons of particular treatments. First, a one-way model was fitted, followed by all pairwise comparisons of the seven treatment means using the least significant difference (lsd) procedure. However, even when protected by a preceding global F-test, decisions based on lsd do not control the familywise error in the strong sense (Hochberg and Tamhane, 1987). Second, a two-way model was fitted with subsequent F-tests for the main effect of formulation, the main effect of concentration, and their interaction. When this approach is used, least squares means cannot be estimated properly. Third, a pseudo one-way model comprising all seven treatment groups was used to compute six orthogonal contrasts defining some hypotheses of interest. The orthogonal contrasts involved the comparison of the control group to the average of all other treatments, two contrasts for the comparisons of the three formulations, one contrast for the comparison of the two concentrations, and two contrasts for certain interaction effects. However, this approach has severe restrictions: the number of hypotheses that can be formulated is restricted to *k* − 1 if *k* is the number of treatment groups (Dean and Voss, 1999; Marini, 2003; Petersen, 1994). Often, not all hypotheses of practical interest can be included in this type of analysis. Moreover, the problem of multiple comparisons is not taken into account. In a last approach, a mixed model was fitted including the concentrations as a quantitative variable and the formulations as a qualitative variable, whereas the blocks were included as a random effect.

Piepho et al. (2006) proposed another strategy. The authors show how to develop linear models by inserting an additional variable that takes the value CON for observations in the control and the value TREAT for observations belonging to treatments in the complete factorial structure. The new variable thus represents the partition into the control group and the six treatments, and one is able to write out a nested model representing the experimental layout correctly. Note that in this approach, the nesting of effects is not used to represent nested random effects in a hierarchical model. Rather, the nesting operator is used to define an appropriate design matrix for fixed effects. Based on this model, least squares means can be estimated properly and an analysis of variance can be performed that takes all existing groups into account and provides F-tests for relevant hypotheses in the complex treatment structure. One disadvantage of this approach is that results of F-tests generally provide only global information about main effects and interaction effects, i.e., a significant result gives evidence for a difference in means somewhere among the considered treatments. Information about the location of the difference(s), the effect size, or comparisons of particular interest is not available from this approach.

If the experimental questions of interest can best be expressed as a set of comparisons among particular treatment means, a multiple comparison problem results. When an increasing number of hypotheses is tested and the number of true hypotheses is unknown, the probability of at least one wrong test decision increases. If the aim of the statistical analysis is to control the probability of at least one false rejection among all tested null hypotheses, procedures are needed that control the familywise error (FWE). Computationally simple procedures like the Bonferroni or Scheffé adjustment (e.g., in Nelson, 1989) are suitable for any type of comparisons between means but are known to be conservative because they ignore the correlations among the comparisons. More advanced standard approaches to control the FWE are the tests according to Tukey (e.g., in Hochberg and Tamhane, 1987, citing Tukey, 1953) or Dunnett (1955). However, these usually do not reflect the experimental questions underlying designs with complex treatment structures. Tukey's procedure is appropriate for all pairwise comparisons and is therefore often considered conservative when it adjusts for more hypotheses than are actually of interest. Dunnett's procedure performs comparisons of several treatments to a control. With complex treatment structures, usually more than these comparisons are of interest. In recent years, multiple comparison procedures have been made available (Bretz et al., 2001, 2002; Hothorn et al., 2008a; Westfall, 1997; Westfall et al., 1999) that control the FWE for a user-defined set of comparisons formulated as multiple contrasts of the treatment means. The number of comparisons as well as the correlation among the comparisons is taken into account specifically for each user-defined set of contrasts. Although these methods are computationally more demanding than well-known standard procedures, they are numerically available in various software packages.

If multiple hypotheses are tested without controlling the FWE rate at *α*, the probability of finding at least one of the considered differences to be significant when this difference is in fact zero can be markedly higher than *α*. For motivation of multiple comparison procedures, we present simulated FWE probabilities in Tables 1 and 2.

Table 1. Probability of rejecting *x* out of *M* tested hypotheses when there is no difference among three, five, and 10 treatments and all pairwise comparisons are performed using multiple *t* tests (comparisonwise Type I error probability 5%) without adjustment for multiple testing or the protection of a global F-test.

Table 2. Probability of rejecting *x* out of *M* tested hypotheses when there is no true difference in two, four, and nine independent two-sided contrasts (comparisonwise Type I error probability 5%) without adjustment for multiple testing.

Assuming that no true effect is present in an experiment with three, five, or 10 groups, Table 1 shows the probabilities to find no significant difference and one, two, or more significant differences when all pairwise comparisons are performed among these treatments without multiplicity adjustment or protection by a global F-test. When there are only three treatments, resulting in three tests, one will conclude for at least one significant difference in ≈12 of 100 experiments. If 10 treatments are compared in such a way, ≈61 of 100 experiments will identify an effect that is not reproducible in a follow-up experiment because a relatively large difference in means simply occurred by chance.
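The kind of simulation behind such numbers can be sketched in a few lines of Python. This is a minimal illustration, not the simulation used for Table 1: the group size `n = 10`, the number of simulated experiments, the seed, and the function name `simulate_fwe` are assumptions, and the critical value 2.0518 is the two-sided 5% *t* quantile for the resulting 27 residual degrees of freedom, hard-coded to avoid external libraries.

```python
import math
import random
from itertools import combinations

def simulate_fwe(k=3, n=10, t_crit=2.0518, n_sim=4000, seed=1):
    """Estimate P(at least one false rejection) for all unadjusted pairwise t tests.

    All k groups are drawn from the same N(0, 1) distribution, so every
    rejection is a Type I error. t_crit must match the df = k * (n - 1).
    """
    rng = random.Random(seed)
    at_least_one = 0
    for _ in range(n_sim):
        groups = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
        means = [sum(g) / n for g in groups]
        # Pooled residual variance of the one-way model
        ss = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
        s = math.sqrt(ss / (k * (n - 1)))
        se = s * math.sqrt(2.0 / n)  # standard error of a pairwise difference
        if any(abs(means[a] - means[b]) / se > t_crit
               for a, b in combinations(range(k), 2)):
            at_least_one += 1
    return at_least_one / n_sim

fwe_three_groups = simulate_fwe()
print(f"Estimated familywise error, 3 groups: {fwe_three_groups:.3f}")
```

Under these assumptions the estimate should land in the vicinity of the ≈12 of 100 experiments quoted above for three treatments; the exact value varies with the seed and the number of simulated experiments.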

Table 2 shows similar numbers for the situation that two, four, or nine independent two-sided *t* tests are performed. This is a situation close to that of testing two, four, or nine orthogonal (independent) contrasts with a common residual degree of freedom (df). With only two independent tests, the overall chance of a Type I error is 0.0975, i.e., 1 – (1–0.05)^{2}, whereas when testing nine independent hypotheses, the odds are ≈1:2 to observe an effect that is not present in truth. When overall (familywise) Type I errors of this magnitude are not acceptable, lsd or single df contrasts should not be used, and multiple contrast procedures, as described in the following, are more appropriate.
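For independent tests, these probabilities follow directly from the binomial distribution, so they can be checked without simulation. The short Python sketch below does this; the function names are illustrative, and exact independence of the tests is assumed, which holds only approximately for orthogonal contrasts sharing a common variance estimate.

```python
from math import comb

def prob_x_rejections(x, m, alpha=0.05):
    """P(exactly x of m independent true null hypotheses are rejected)."""
    return comb(m, x) * alpha ** x * (1 - alpha) ** (m - x)

def familywise_error(m, alpha=0.05):
    """P(at least one false rejection among m independent tests)."""
    return 1 - (1 - alpha) ** m

for m in (2, 4, 9):
    print(m, round(familywise_error(m), 4))
# prints:
# 2 0.0975
# 4 0.1855
# 9 0.3698
```

With nine independent tests, the probability of at least one false rejection is thus about 0.37, which corresponds to the odds of roughly 1:2 mentioned above.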

In the following section, we introduce two example data sets and briefly review the well-known concept of multiple contrast tests. Subsequently, we show the application of multiple contrast tests to the two examples and give an interpretation of the results. Finally, we discuss the advantages and limitations of the proposed approach.

## Material and Methods

### Two example data sets.

Marini (2003) describes the analysis of an experiment with an augmented factorial treatment structure. The effect of three formulations of gibberellic acid (f1, f2, f3), each in two different concentrations, 10 and 20 mg·L^{−1}, on fruit set of apple was investigated. In addition to this two-factorial structure, a control group (application of water) was included in the design. The composition of the treatments is summarized in Table 3.

Table 3. Experimental layout of Example 1 with three gibberellic acid formulations (f1, f2, f3), each in two different concentrations (10 and 20 mg·L^{−1}), and a water control (H_{2}O).

The seven treatment groups were arranged in a randomized complete block design using six apple trees as blocks. As the response variable, the number of fruits (65 d after bloom) per 100 flower clusters was presented (Marini, 2003). In previous analyses, Marini (2003) and Piepho et al. (2006) proposed the following experimental questions. The comparison of all treatments with the untreated control aims to show that the experiment was sensitive enough to reveal marked effects on fruit set. The comparison of increasing dosages pooled over the formulations aims to assess whether, and up to which concentration, a dose effect is present. For the two-factorial part of the trial, interest was also in the main effects of the formulations and concentrations, taking the possibility of an interaction into account. By formulating orthogonal contrasts, Marini (2003) addressed the question of interactions more explicitly, namely, whether the different formulations affect the increase of fruit set from dose 10 to dose 20.

As a second example, we consider a fixed-dose combination experiment originally published by Adeli and Varco (2002). The effects of potassium (K) management on cotton yield (kg·ha^{−1}) were investigated. The objective was “to determine potassium (K) fertilizer rate and placement effects on cotton lint yield” (Adeli and Varco, 2002). The experimental setup contained two different application methods for K, “broadcast” (Bc) and “banded” (Bn), each in four concentrations. The factorial arrangement included 0, 68, and 136 kg·ha^{−1} K broadcast in all possible combinations with 0, 34, and 68 kg·ha^{−1} K banded application. Two additional treatments, 204 kg·ha^{−1} K broadcast with zero banded application and 102 kg·ha^{−1} K banded with zero broadcast application, were included in the design. This treatment structure can be viewed as arising from a complete two-factorial structure in which those cells are omitted that lead to inappropriately high total dosages of K. Table 4 summarizes the treatment combinations. The doses were chosen carefully so that the total amount of K is the same in four pairs of treatments (in Table 4, members of each pair are given the same symbol). The resulting 11 treatment groups were assumed to be arranged in a completely randomized design with replication number n = 12. The data used for the analysis are simulated based on the published summary statistics.
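To make this treatment structure concrete, the following Python sketch enumerates the 11 dose combinations described above and groups them by total K. The label format (`Bc68Bn34`, etc.) follows the acronyms used later in the text; the variable names are illustrative only.

```python
from collections import defaultdict
from itertools import product

# Factorial part: (broadcast, banded) doses in kg/ha K
factorial_part = list(product((0, 68, 136), (0, 34, 68)))
# Two additional single-method treatments
additional = [(204, 0), (0, 102)]
treatments = factorial_part + additional

# Group the treatment labels by the total amount of K applied
by_total = defaultdict(list)
for bc, bn in treatments:
    by_total[bc + bn].append(f"Bc{bc}Bn{bn}")

# Totals that are reached by exactly two treatments
equal_total_pairs = {total: names for total, names in by_total.items()
                     if len(names) == 2}

print(len(treatments))
for total in sorted(equal_total_pairs):
    print(total, equal_total_pairs[total])
```

The enumeration yields 11 treatments and four totals (68, 102, 136, and 204 kg·ha^{−1} K) that are each reached by exactly two treatments, in agreement with the four pairs mentioned above.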

Table 4. Treatment structure of Example 2 (Adeli and Varco, 2002) comprising 11 combinations of broadcast (Bc) and banded (Bn) application of different doses of potassium (K) fertilizer.

The analysis should give information about supposed beneficial influences of the different placement methods and an expected diverging effect of the two methods. Furthermore, the aim is to find treatment combinations resulting in superior cotton lint yield given that the total K fertilizer application is the same. Note that regression, and more specifically response surface regression, is a viable option for analyzing this example. Compared with the methods discussed subsequently, regression methods have the advantage of using fewer parameters to describe the data, but they additionally rely on assumptions concerning the dose–response relationship. Response surface regression is particularly useful when the aim is to estimate the optimum combination of the two quantitative variables. For an introduction to response surface regression, see, for example, Montgomery (2005).

### Simultaneous confidence intervals for user-defined multiple contrasts.

In this section, we review the concept of multiple contrast tests as described, for example, in Bretz et al. (2001). A simple linear model to explain the observations $Y_{ij}$ is:

$$Y_{ij} = \mu_i + \varepsilon_{ij},$$

with $Y_{ij}$ denoting the $j$th observation of the $i$th treatment group, with $i = 1, \ldots, k$ and $j = 1, \ldots, n_i$; $\mu_i$ denoting the mean of the $i$th treatment group; and $\varepsilon_{ij}$ denoting the residual error for the $j$th observation in the $i$th group.

The errors $\varepsilon_{ij}$ are assumed to be independent, i.e., the observations are derived from a completely randomized design, and to be Gaussian distributed with equal variances: $\varepsilon_{ij} \sim N(0, \sigma^2)$. From fitting this one-way model, we derive estimates $\bar{Y}_i$ of the treatment means and a pooled estimate $s^2$ of the variance $\sigma^2$ with $\nu = \sum_{i=1}^{k} (n_i - 1)$ degrees of freedom. The experimental questions can be formulated as differences of the treatment means $\mu_i$ or, more generally, as contrasts of the treatment means $\mu_i$. A contrast $L$ is a weighted difference of the $\mu_1, \mu_2, \ldots, \mu_k$, where the weights $c_i$ are chosen such that a certain difference of interest is built:

$$L = \sum_{i=1}^{k} c_i \mu_i = c_1 \mu_1 + c_2 \mu_2 + \ldots + c_k \mu_k.$$

For example, in a design comprising four treatments, $i = 1, 2, 3, 4$, the difference between Treatments 1 and 2, $\mu_2 - \mu_1$, can be written as $\sum c_i \mu_i$ with $c_1 = -1$, $c_2 = 1$, $c_3 = 0$, $c_4 = 0$. More complicated differences can also be formulated, for example, the difference between the average mean of the three remaining treatments and the first treatment's mean, $(\mu_2 + \mu_3 + \mu_4)/3 - \mu_1$: $c_1 = -1$, $c_2 = 1/3$, $c_3 = 1/3$, $c_4 = 1/3$.

For the choice of the $c_i$, we impose only the restriction that the sum of the $c_i$ should be zero, $\sum c_i = 0$. This ensures that the contrast has expectation 0 if in fact all $\mu_i$ are equal. Moreover, we usually choose the $c_i$ such that the sum of all negative coefficients is −1 and hence the sum of all positive coefficients is 1. Then, we can interpret the confidence intervals for the contrasts as differences of (weighted averages of) treatment means. Usually, several, say $M$, such contrasts are necessary to represent the experimental questions of interest. Note that there are no further restrictions on the choice of the $c_i$ depending on the remaining set of contrasts. That is, there is no necessity to define the $M$ contrasts orthogonal to each other, and there is no restriction on the number of contrasts $M$, as in the case of orthogonal single df contrasts. The test statistic for one contrast can be calculated from:

$$T = \frac{\sum_{i=1}^{k} c_i \bar{Y}_i}{s \sqrt{\sum_{i=1}^{k} c_i^2 / n_i}}.$$

Two-sided simultaneous $(1 - \alpha)$ confidence intervals for the $M$ contrasts can be calculated from

$$\sum_{i=1}^{k} c_{mi} \bar{Y}_i \pm t_{1-\alpha}(M, \nu, R) \, s \sqrt{\sum_{i=1}^{k} c_{mi}^2 / n_i}, \quad m = 1, \ldots, M,$$

where $c_{mi}$ denotes the coefficient of the $i$th treatment in the $m$th contrast, using the two-sided $(1 - \alpha)$ quantile $t_{1-\alpha}(M, \nu, R)$ of a multivariate $t$ distribution with $\nu$ degrees of freedom, dimension $M$, and correlation matrix $R$. For general contrasts, the correlation matrix $R$ has a complicated structure with elements depending on the sample sizes $n_i$ and the contrast coefficients $c_i$.

For a particular contrast, the null hypothesis that the difference defined by the contrast has the value zero can be rejected if $|T| > t_{1-\alpha}(M, \nu, R)$, i.e., if the corresponding simultaneous confidence interval does not include zero.

We favor the graphical display of the simultaneous confidence intervals for reporting the results of a statistical analysis. From such plots, the significance of a particular difference at a FWE level α can be inferred if the value zero is not included in the confidence interval. Additionally, the direction (decrease or increase), magnitude, and, possibly, relevance of an effect can be assessed. If interpreting the relevance of the measured effect is of interest, confidence intervals are advantageous compared with *P* values because of displaying the effect size in the scale of the measured variable rather than in the scale of probability. Finally, the uncertainty concerning the estimated effect, depending on the sample variance and the sample size, is displayed by the width of the confidence interval.
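As an illustration of the quantities in this section, the following Python sketch computes contrast estimates, standard errors, test statistics, and the correlation among contrasts from summary statistics. The summary statistics and the function name `contrast_summary` are hypothetical, and the correlation formula is the standard one for linear combinations of independent group means rather than a formula quoted from this article. The multivariate *t* quantile needed for the simultaneous intervals is not computed here; it would be obtained from software such as the R package multcomp.

```python
import math

def contrast_summary(means, ns, s, contrasts):
    """Return estimates, standard errors, t statistics, and correlation matrix R.

    means: group means; ns: group sample sizes; s: pooled residual standard
    deviation; contrasts: list of coefficient vectors, each summing to zero.
    """
    def cov(a, b):
        # Covariance of two contrasts of independent means, up to the factor sigma^2
        return sum(ai * bi / n for ai, bi, n in zip(a, b, ns))

    estimates, std_errors = [], []
    for c in contrasts:
        assert abs(sum(c)) < 1e-9, "contrast coefficients must sum to zero"
        estimates.append(sum(ci * m for ci, m in zip(c, means)))
        std_errors.append(s * math.sqrt(cov(c, c)))
    t_stats = [e / se for e, se in zip(estimates, std_errors)]
    R = [[cov(a, b) / math.sqrt(cov(a, a) * cov(b, b)) for b in contrasts]
         for a in contrasts]
    return estimates, std_errors, t_stats, R

# Hypothetical summary statistics for four treatments (illustration only)
means = [10.0, 12.0, 15.0, 11.0]
ns = [6, 6, 6, 6]
s = 2.0
contrasts = [
    [-1, 1, 0, 0],            # mu_2 - mu_1
    [-1, 1 / 3, 1 / 3, 1 / 3],  # (mu_2 + mu_3 + mu_4)/3 - mu_1
]
est, se, t_stats, R = contrast_summary(means, ns, s, contrasts)
print(est, se, t_stats, R[0][1])
```

In this example the two contrasts share Treatments 1 and 2 and are therefore strongly positively correlated (about 0.82 with equal group sizes), which is exactly the kind of dependence that a Bonferroni adjustment ignores and the multivariate *t* quantile takes into account.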

## Results

### Evaluation of Example 1.

From previous discussions of Example 1 (Marini, 2003; Piepho et al., 2006), the following experimental questions can be deduced: Is the experimental setting capable of revealing effects on the response? Do the formulations differ? Do the concentrations differ?

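In the following, hypotheses that might be of interest are stated as differences of treatment means using the acronyms introduced in Table 3. The contrast coefficients (*c_{i}*) leading to the stated differences are summarized in Table 5. The first difference of interest compares the pooled means of all treatments versus the untreated control group. The remaining seven comparisons address the pairwise differences among the three formulations, each pooled over the concentrations; the difference between concentrations 20 and 10, averaged over the three formulations; and three interaction contrasts.

Table 5. Contrast coefficients (*c_{i}*) for the multiple contrast tests described in the text above.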

Simultaneous confidence intervals for the comparisons formally defined in Table 5 are plotted in Figure 1. The complete analysis was performed in a single procedure, with all tests inherently adjusted for multiplicity.

Altogether, the six gibberellin treatments lead to a significant increase in the number of fruits per flower cluster. Hence, the experimental setting is capable of revealing effects of the gibberellin treatments compared with an untreated control. On average, over the six gibberellin treatments, we can expect an increase of at least five fruits per 100 clusters more than in the untreated control with 95% confidence. Practically, the mean increase in the response is not very interesting; it is not possible to decide which treatments mainly contribute to the overall effect. Moreover, controlling the FWE for all eight hypotheses, none of the remaining tests is significant at the 5% level. None of the differences among the three formulations, each pooled over the concentrations, are significantly different from 0. Although comparing the average effect of the three formulations at concentration 20 with their average effect at concentration 10 shows a mean increase, the observed difference is not significant when controlling the overall Type I error probability at 5%. Finally, none of the three interaction contrasts differs significantly from zero, although the mean increase in the response when increasing the concentration from 10 to 20 is somewhat more pronounced in f1 and f2 compared with f3 (Comparisons 7 and 8). That is, given the limited sample size of the trial, we cannot conclude that there are differences among the formulations, between the concentrations, and cannot prove the presence of interactions when controlling the FWE at 5%.
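A contrast matrix for eight comparisons of this kind could be set up as in the following Python sketch. It is an assumption-laden illustration, not a copy of Table 5: the treatment labels, the ordering of the comparisons, the direction of each difference, and the scaling of the interaction contrasts by 1/2 (chosen so that the positive coefficients sum to 1) are all assumptions.

```python
from fractions import Fraction as F

# Hypothetical treatment labels; the article's acronyms are defined in its Table 3
treatments = ["H2O", "f1c10", "f1c20", "f2c10", "f2c20", "f3c10", "f3c20"]
half, third, sixth = F(1, 2), F(1, 3), F(1, 6)

def contrast(**weights):
    """Build one row of contrast coefficients; unnamed treatments get weight 0."""
    row = [F(weights.get(name, 0)) for name in treatments]
    assert sum(row) == 0, "coefficients must sum to zero"
    return row

def formulation_diff(a, b):
    # Formulation a minus formulation b, each pooled over both concentrations
    return contrast(**{f"{a}c10": half, f"{a}c20": half,
                       f"{b}c10": -half, f"{b}c20": -half})

def interaction(a, b):
    # Half of [(a: c20 - c10) - (b: c20 - c10)]
    return contrast(**{f"{a}c20": half, f"{b}c10": half,
                       f"{a}c10": -half, f"{b}c20": -half})

contrasts = [
    # 1: average of the six gibberellin treatments minus the control
    contrast(H2O=-1, **{name: sixth for name in treatments[1:]}),
    # 2-4: pairwise formulation differences
    formulation_diff("f2", "f1"),
    formulation_diff("f3", "f1"),
    formulation_diff("f3", "f2"),
    # 5: concentration 20 minus concentration 10, averaged over formulations
    contrast(f1c20=third, f2c20=third, f3c20=third,
             f1c10=-third, f2c10=-third, f3c10=-third),
    # 6-8: interaction contrasts
    interaction("f1", "f2"),
    interaction("f1", "f3"),
    interaction("f2", "f3"),
]
```

Exact fractions are used so that the sum-to-zero restriction can be checked without rounding error. The eight rows are not mutually orthogonal and exceed the *k* − 1 = 6 contrasts available in an orthogonal decomposition.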

The contrasts in Table 5 were constructed to show that hypotheses similar to those of the analysis of variance F-tests used by Piepho et al. (2006) can be tested using a multiple contrast approach. In practice, other comparisons can be more interesting and are as simple to implement. In the following, we show an analysis, alternative to that in Table 5.

First, it could be of interest whether any of the six gibberellin treatments leads to a change in the number of fruits per 100 flower clusters compared with the untreated control. Hypotheses 1 to 6 represent these comparisons; the resulting contrast coefficients are presented in Table 6.

Second, it might be of interest whether the formulations differ, taking the possibility of an interaction into account. This may result in comparisons of the different formulations at each of the two concentration levels (Comparisons 7 through 12). Finally, the concentrations 10 and 20 could be compared separately for each formulation (Comparisons 13 through 15). These are 15 comparisons in total, which could not have been performed using orthogonal contrasts. The correlation structure among these 15 comparisons is not trivial; however, it is taken into account inherently by the statistical software.

Table 6. Contrast coefficients (*c_{i}*) of 15 contrasts for the alternative evaluation of Example 1.

Figure 2 shows simultaneous 95% confidence intervals for the contrasts defined in Table 6.

This evaluation results in a more informative interpretation than the first approach: Two of the six gibberellin treatments result in a significant increase of the number of fruits per 100 flower clusters. With 95% confidence, we can expect an increase in the mean number of fruits per 100 flower clusters of at least two when Formulation 1 with Concentration 20 is applied. Using Formulation 2 with Concentration 20, one can expect at least 18 fruits per 100 clusters more than in the untreated control. The remaining combinations of formulation and concentration led to a mean increase of the number of fruits per cluster, but, controlling the FWE for all comparisons, the observed differences are not significant. Furthermore, the pairwise differences among the formulations are not significant when considered separately for each concentration (Comparisons 7 through 12). Finally, a difference between Concentrations 10 and 20 cannot be shown at the 5% level for any of the three formulations.
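Because all 15 comparisons of this alternative evaluation are simple differences between two treatment means, the contrast matrix can be generated programmatically rather than typed by hand, which reduces the risk of typos. The Python sketch below shows one way to do this; the treatment labels and the direction of each difference are assumptions, not the entries of Table 6.

```python
from itertools import combinations

formulations = ("f1", "f2", "f3")
concentrations = (10, 20)
# Hypothetical labels: control first, then formulation-by-concentration cells
treatments = ["H2O"] + [f"{f}c{c}" for f in formulations for c in concentrations]

def difference(plus, minus):
    """Contrast row for mean(plus) - mean(minus)."""
    row = [0] * len(treatments)
    row[treatments.index(plus)] = 1
    row[treatments.index(minus)] = -1
    return row

contrasts = {}
# Comparisons 1-6: each gibberellin treatment versus the control
for name in treatments[1:]:
    contrasts[f"{name} - H2O"] = difference(name, "H2O")
# Comparisons 7-12: pairwise formulation differences within each concentration
for c in concentrations:
    for fa, fb in combinations(formulations, 2):
        contrasts[f"{fb}c{c} - {fa}c{c}"] = difference(f"{fb}c{c}", f"{fa}c{c}")
# Comparisons 13-15: concentration 20 versus 10 within each formulation
for f in formulations:
    contrasts[f"{f}c20 - {f}c10"] = difference(f"{f}c20", f"{f}c10")

print(len(contrasts))  # prints 15
```

The dictionary keys double as row labels, so the resulting rows can be passed with their names to whatever software computes the simultaneous intervals.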

### Evaluation of Example 2.

The experiment presented by Adeli and Varco (2002) shows a more complex treatment structure. First, it could be of interest which K rate or application method increases the yield compared with the control. That is, the differences between the 10 different K treatments and the untreated control treatment Bc0Bn0 are of primary practical interest. Furthermore, interest is in the magnitude of increase that can at least be expected with high probability, i.e., in lower confidence limits. Hypotheses 1 to 3 compare the three treatments with only broadcast application with the untreated control group. Hypotheses 4, 5, and 6 compare the treatments with only banded application with the untreated control group. Hypotheses 7 to 10 compare the treatments combining banded and broadcast application with the untreated control.

Furthermore, the aim could be to analyze whether any treatment that combines broadcast with banded application leads to an increase in yield compared with treatments supplying the same amount of K with only one application method, broadcast or banded (Hypotheses 11 to 13). Finally, it could be of interest whether banded application of 68 kg·ha^{−1} K leads to an increase in yield compared with broadcast application of the same amount (Hypothesis 14).

The contrast coefficients for these 14 contrasts are presented in Table 7.

Table 7. Contrast coefficients (*c_{i}*) for the 14 comparisons among the 11 treatments of the cotton example.

Simultaneous confidence intervals for the contrasts defined in Table 7 are plotted in Figure 3.

With 95% confidence, one can state that broadcast application of K leads to a significant increase in yield when applied at 136 kg·ha^{−1} or 204 kg·ha^{−1}. For banded application alone, no significant effect can be found. All combinations of banded and broadcast applications lead to a significant increase of cotton yield compared with the untreated control. Applying the combinations Bc68Bn34, Bc68Bn68, and Bc136Bn34 leads to a mean increase in yield compared with the untreated control of at least 30 kg·ha^{−1}, 55 kg·ha^{−1}, and 180 kg·ha^{−1}, respectively. None of the treatments combining broadcast and banded application leads to a significant increase in yield compared with treatments applying the same amount of K with only one application method. Finally, there is no significant difference in yield between banded and broadcast application of 68 kg·ha^{−1} K.
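The set of 14 comparisons evaluated here can be derived from the treatment structure itself, as the following Python sketch suggests. The ordering of the treatments and the direction of each difference are assumptions; the pairing of combined and single-method treatments is found by matching total K, which yields the three same-total comparisons implied by the count of 14.

```python
# (broadcast, banded) doses in kg/ha K; the control comes first
doses = [(0, 0), (68, 0), (136, 0), (204, 0), (0, 34), (0, 68), (0, 102),
         (68, 34), (68, 68), (136, 34), (136, 68)]
names = [f"Bc{bc}Bn{bn}" for bc, bn in doses]

def difference(plus, minus):
    """Contrast row for mean(plus) - mean(minus)."""
    row = [0] * len(names)
    row[names.index(plus)] = 1
    row[names.index(minus)] = -1
    return row

contrasts = {}
# Hypotheses 1-10: every K treatment versus the untreated control
for name in names[1:]:
    contrasts[f"{name} - Bc0Bn0"] = difference(name, "Bc0Bn0")
# Hypotheses 11-13: combined application versus a single method, same total K
for bc, bn in doses:
    if bc > 0 and bn > 0:
        for bc2, bn2 in doses:
            single_method = (bc2 == 0) != (bn2 == 0)
            if single_method and bc2 + bn2 == bc + bn:
                plus, minus = f"Bc{bc}Bn{bn}", f"Bc{bc2}Bn{bn2}"
                contrasts[f"{plus} - {minus}"] = difference(plus, minus)
# Hypothesis 14: banded versus broadcast application of 68 kg/ha K
contrasts["Bc0Bn68 - Bc68Bn0"] = difference("Bc0Bn68", "Bc68Bn0")

print(len(contrasts))  # prints 14
```

Note that Bc136Bn34 (170 kg·ha^{−1} K in total) has no single-method counterpart in the design and therefore enters only the comparison with the control.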

## Discussion

This article shows that simultaneous confidence intervals for multiple contrasts are a flexible method to evaluate factorial experiments with nonstandard treatment structures, for example, augmented factorial designs or experiments with two or more factors that are crossclassified with some factor combinations omitted. The strategy can be summarized as follows: Estimators for the treatment means and variance are derived from a simple general linear model with all treatments combined in a single factor (i.e., a pseudo-one-way layout or cell means model). Contrast coefficients are chosen by the user such that the hypotheses of interest are reflected as differences of (weighted averages of) treatment means. Like other procedures following the general linear model (Marini, 2003; Piepho et al., 2006), the described procedure relies on the assumptions that the observations are mutually independent and continuous with normally distributed errors and homogeneous variances. The method is available in the R environment for statistical computing as well as in SAS.

Compared with other methods that have been proposed for evaluation of experiments with complex treatment structures, the described method has a number of advantages. The individual contrasts give more specific information than the global decisions provided by an analysis of variance F-test. When simultaneous confidence intervals are used, the significance, relevance, and direction (increase or decrease) of the effect of interest as well as the uncertainty concerning the estimates can be interpreted in a scale close to that of the measured variable, which is often easier than interpreting *P* values in the scale of probability. Compared with orthogonal single df contrasts, the contrasts formulated for the method described in this article do not need to be mutually orthogonal and are not restricted in their number. Finally, the overall Type I error probability is controlled inherently for a user-defined set of contrasts.

The described method is of limited use if the experiments to be analyzed comprise many, say more than 20, treatments. Although still methodologically correct, it then has the drawback that the contrast matrix becomes huge and it is hard to check for typos in the definition of the comparisons of interest. Also, when the number of contrasts becomes very high, computations can be very time-consuming or impossible. For some scenarios, the methods described by Piepho et al. (2006) can then be more appropriate.

In this article, we discuss only complex treatment structures and for brevity assume a simple randomization structure and a homoscedastic Gaussian error distribution for the response variable. Nevertheless, the concept of multiple contrast tests can be extended so that situations with different assumptions or randomization schemes are also covered. Block effects or more complex randomization structures may be included as random effects in a linear mixed effects model, whereas the complex treatment structure remains in the fixed part (e.g., Piepho et al., 2003). Computationally, approximate simultaneous confidence intervals for multiple contrasts in mixed models are covered by SAS PROC GLIMMIX as well as the R package multcomp. When the assumption of the Gaussian distribution is not adequate because counts or proportions are considered, generalized linear models (McCullagh and Nelder, 1989; Piepho, 1999) are an alternative. By default, the primary comparisons are then performed on the log scale for count data and on the logit scale for binomial proportions. Exponentiating the resulting estimates and confidence limits yields confidence intervals for ratios of means and for odds ratios when the log and logit links are used, respectively. Again, these cases are computationally solved in PROC GLIMMIX and multcomp. However, in the general linear model with Gaussian errors, too, the comparisons of interest could be formulated in terms of ratios rather than differences of means (Dilba et al., 2006). When interest is in a combination of one-sided and two-sided hypotheses, Braat et al. (2008) provide a method related to the methods shown in this article.
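The back-transformation mentioned for generalized linear models amounts to exponentiating a difference and its confidence limits. A minimal Python sketch with hypothetical numbers (the estimate and limits below are invented for illustration, and the function name is not from any package):

```python
import math

def back_transform(estimate, lower, upper):
    """Exponentiate a difference on the log (or logit) scale and its limits.

    A difference of log means becomes a ratio of means; a difference of
    logits becomes an odds ratio. The limits keep their simultaneous
    coverage because exp() is monotone.
    """
    return math.exp(estimate), math.exp(lower), math.exp(upper)

# Hypothetical log-scale contrast estimate with simultaneous confidence limits
ratio, ratio_lower, ratio_upper = back_transform(0.40, 0.10, 0.70)
print(f"ratio {ratio:.2f}, interval [{ratio_lower:.2f}, {ratio_upper:.2f}]")
```

On the back-transformed scale, the null value is 1 rather than 0, so a contrast is significant when its interval excludes 1.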

## Literature Cited

Adeli, A. & Varco, J.J. 2002. Potassium management effects on cotton yield, nutrition, and soil potassium level. *J. Plant Nutr.* 25:2229–2242.

Braat, S., Gerhard, D. & Hothorn, L.A. 2008. Joint one-sided and two-sided simultaneous confidence intervals. *J. Biopharm. Stat.* 18:293–306.

Bretz, F., Genz, A. & Hothorn, L.A. 2001. On the numerical availability of multiple comparison procedures. *Biom. J.* 43:645–656.

Bretz, F., Hothorn, T. & Westfall, P.H. 2002. On multiple comparisons in R. *R News* 2:14–17.

Dean, A. & Voss, D. 1999. Design and analysis of experiments. Springer-Verlag, New York, NY.

Dilba, G., Bretz, F. & Guiard, V. 2006. Simultaneous confidence sets and confidence intervals for multiple ratios. *J. Stat. Plan. Infer.* 136:2640–2658.

Dunnett, C.W. 1955. A multiple comparison procedure for comparing several treatments with a control. *J. Amer. Stat. Assoc.* 50:1096–1121.

Hochberg, Y. & Tamhane, A.C. 1987. Multiple comparison procedures. Wiley, New York, NY.

Hothorn, T., Bretz, F. & Westfall, P. 2008a. Simultaneous inference in general parametric models. *Biom. J.* 50:346–363.

Hothorn, T., Bretz, F., Westfall, P. & Heiberger, R.M. 2008b. Multcomp: Simultaneous inference for general linear hypotheses. R package version 0.993-2.

Marini, R.P. 2003. Approaches to analyzing experiments with factorial arrangements of treatments plus other treatments. *HortScience* 38:117–120.

McCullagh, P. & Nelder, J.A. 1989. Generalized linear models. Chapman & Hall/CRC, Boca Raton, FL.

Montgomery, D.C. 2005. Design and analysis of experiments. 6th Ed. Wiley, Hoboken, NJ.

Nelson, P.R. 1989. Multiple comparisons of means using simultaneous confidence intervals. *J. Qual. Technol.* 21:232–289.

Petersen, R.G. 1994. Agricultural field experiments: Design and analysis. Marcel Dekker, New York, NY.

Piepho, H.-P. 1999. Analysing disease incidence data from designed experiments by generalized linear mixed models. *Plant Pathol.* 48:668–674.

Piepho, H.-P., Büchse, A. & Emrich, K. 2003. A hitchhiker's guide to mixed models for randomized experiments. *J. Agron. Crop Sci.* 189:310–322.

Piepho, H.-P., Williams, E.R. & Fleck, M. 2006. A note on the analysis of designed experiments with complex treatment structure. *HortScience* 41:446–452.

R Development Core Team. 2008. R: A language and environment for statistical computing. Version 2.6.2. R Foundation for Statistical Computing, Vienna, Austria.

SAS Institute. 2006. The GLIMMIX procedure, June 2006. SAS Institute, Cary, NC.

Tukey, J. 1953. The problem of multiple comparisons, unpublished manuscript, reprinted in: Braun, H.I. (Ed.) 1994. The collected works of John W. Tukey. VIII. Multiple comparisons. Chapman and Hall, New York, NY.

Westfall, P.H. 1997. Multiple testing of general contrasts using logical constraints and correlations. *J. Amer. Stat. Assoc.* 92:299–306.

Westfall, P.H., Tobias, R.D., Rom, D., Wolfinger, R.D. & Hochberg, Y. 1999. Multiple comparisons and multiple tests using the SAS System. SAS Institute, Cary, NC.

The two example data sets are available at http://www.biostat.uni-hannover.de/software/. After loading the data sets into the R workspace under the names ExFruitset (Example 1) and ExKCotton (Example 2), the following R code reproduces the analyses of the two examples shown in this article.

SAS program files, including the data sets and the calculation of the simultaneous intervals, are available at http://www.biostat.uni-hannover.de/software/.

The following R code reproduces the calculation of simultaneous confidence intervals plotted in Figure 1:

The following R code reproduces the calculation of simultaneous confidence intervals plotted in Figure 2:

The following R code reproduces the calculation of simultaneous confidence intervals plotted in Figure 3: