Covarion Structure in Plastid Genome Evolution: A New Statistical Test

MBE Advance Access originally published online on December 29, 2004
Molecular Biology and Evolution 2005 22(4):914-924; doi:10.1093/molbev/msi076

Research Article

Cécile Ané¹, J. Gordon Burleigh, Michelle M. McMahon and Michael J. Sanderson

Section of Evolution and Ecology, University of California, Davis

E-mail: ane{at}stat.wisc.edu.

Abstract

Covarion models of molecular evolution allow the rate of evolution of a site to vary through time. There are few simple and effective tests for covarion evolution, and consequently, little is known about the presence of covarion processes in molecular evolution. We describe two new tests for covarion evolution and demonstrate with simulations that they perform well under a wide range of conditions. A survey of covarion evolution in sequenced plastid genomes found evidence of covarion drift in at least 26 out of 57 genes. Covarion evolution is most evident in first and second codon positions of the plastid genes, and there is no evidence of covarion evolution in third codon positions. Therefore, the significant covarion tests are likely due to changes in the selective constraints of amino acids. The frequency of covarion evolution within the plastid genome suggests that covarion processes of evolution were important in generating the observed patterns of sequence variation among plastid genomes.

Key Words: covarion • model testing • parametric bootstrap • plastid genome evolution

Introduction

The ever-increasing amount of sequence data and the availability of new statistical and computational methods have motivated the development of progressively more complex models of evolution (Liò and Goldman 1998; Whelan et al. 2001; Holder and Lewis 2003). Though variation in the substitution rate across nucleotide or amino acid sites is commonly incorporated into evolutionary analyses (e.g., Yang 1994, 1996), variation in the rate of evolution of a site through time is rarely considered (Galtier 2001; Huelsenbeck 2002). This may be due to difficulties in implementing such a model and the lack of simple tests to determine if it would be appropriate. The covarion (for concomitantly variable codon) hypothesis of molecular evolution proposes that selective pressures on an amino acid or nucleotide site change throughout time, and therefore, a site's rate of evolution also changes (Fitch and Markowitz 1970; Fitch 1971). The term covarion sometimes specifically refers to protein sequences while covariotide refers to nucleotide sequences (Shoemaker and Fitch 1989); however, we will use the more general term covarion to refer to shifts in the substitution rate of any character. In the covarion hypothesis, the functional constraints of a site change through time, and a character that is functionally constrained within one lineage may not be constrained in another lineage. In an extreme case, a site may be either completely constrained or unconstrained and susceptible to substitutions. Under this model, there would be an excess of sites that are invariant in one part of a tree but variant in another. Different sites could be variant or invariant in different parts of the tree. Consequently, the first evidence of covarion-like patterns of evolution was based on detecting sites that had no variation among taxa in one clade and variation among taxa in another clade (e.g., Fitch and Markowitz 1970; Fitch 1971; Miyamoto and Fitch 1995; Lockhart et al. 1998).

Tuffley and Steel (1998; also see Penny et al. 2001) developed the first formal model of covarion evolution. In their model, the substitution process can be turned ON or OFF. Whenever a site is ON, it evolves according to some substitution process, and when a site is OFF, that site is invariant. The ON substitution process can be modeled with any reversible substitution rate matrix (see e.g., Swofford et al. 1996). The switches between ON and OFF are modeled as an additional stationary Markov process with two parameters, the ON equilibrium frequency σ and the average number of switches per substitution ν. The transition matrix of the switch process has an ON/OFF switching rate s₀₁ and an OFF/ON switching rate s₁₀ that are determined by σ and ν. Specifically, the ON/OFF switching rate s₀₁ = σν/(2(1 – σ)), and the OFF/ON switching rate s₁₀ = ν/2. If σ = 0, all sites are always invariable, and if σ = 1, the sites always evolve according to the normal substitution model. Furthermore, if ν = 0, there are no switches between ON and OFF or OFF and ON, and any given site will be either invariable or variable throughout the tree. As ν approaches ∞, the model of evolution resembles the normal substitution model. Such a model may incorporate variation in rates of evolution across sites, which is often modeled using a discrete gamma distribution (Yang 1994). In a model with rate variation across sites, sites may have different rates of evolution, but the rate of evolution for a single site remains constant throughout the tree. We will refer to a model with variable rates across sites as a RAS model, a covarion model as a COV model (COVarion), and a model with both among-site rate variation and covarion evolution as a COV + RAS model. Huelsenbeck (2002) adapted the Tuffley and Steel (1998) covarion model to allow any reversible nucleotide model, including RAS models, to incorporate covarion evolution. Galtier (2001) developed a different covarion model that allows the rate at a site to vary over time and does not require that sites turn completely on or off.
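The relation between (σ, ν) and the two switching rates can be checked numerically. The sketch below is only illustrative: the function names are hypothetical, and it assumes that subscript 0 denotes OFF and subscript 1 denotes ON, so that s₀₁ is the rate of moving from OFF to ON. Under that reading, the stationary ON frequency of the two-state switch process comes out as σ and the number of switches per substitution as ν, provided ON sites substitute at rate 1.

```python
def switching_rates(sigma, nu):
    """Return (s01, s10) from the ON frequency sigma and nu switches per substitution.

    Assumption (not stated explicitly in the text): index 0 = OFF, index 1 = ON,
    so s01 is the OFF-to-ON rate and s10 is the ON-to-OFF rate.
    """
    if not 0.0 < sigma < 1.0:
        raise ValueError("sigma must lie strictly between 0 and 1")
    if nu <= 0.0:
        raise ValueError("nu must be positive")
    s01 = sigma * nu / (2.0 * (1.0 - sigma))
    s10 = nu / 2.0
    return s01, s10


def on_equilibrium_frequency(s01, s10):
    """Stationary probability of the ON state for a two-state Markov switch."""
    return s01 / (s01 + s10)


def switches_per_substitution(s01, s10):
    """Expected switches per substitution when ON sites substitute at rate 1."""
    pi_on = on_equilibrium_frequency(s01, s10)
    switch_rate = (1.0 - pi_on) * s01 + pi_on * s10
    substitution_rate = pi_on  # OFF sites contribute no substitutions
    return switch_rate / substitution_rate
```

With σ = 0.6 and ν = 0.4, the values used later in the simulations, this gives s₀₁ = 0.3 and s₁₀ = 0.2, and the two helper functions return 0.6 and 0.4 respectively.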

There are few formal tests of covarion evolution in sequence data. Lockhart et al. (1998) developed a nonparametric test to detect covarion evolution under the Tuffley and Steel (1998) model and applied it to two loci, and another nonparametric covarion test detected evidence of covarion evolution in four protein-coding loci (Lockhart et al. 2000). Galtier (2001) used approximate likelihood estimates to perform likelihood ratio tests on two loci. Huelsenbeck (2002) developed an integrated likelihood ratio test for covarion evolution using MCMC and found evidence of covarion evolution in 9 out of 11 loci. Though likelihood ratio tests often perform well in evolutionary model testing (Posada and Crandall 2001), the likelihood ratio tests of covarion models are computationally complex and difficult to implement. The complexity of calculating the likelihood with a covarion model is due to bigger substitution matrices (8 × 8 instead of 4 × 4) that, moreover, need to be diagonalized more often during optimization. Additionally, the performance of covarion model testing methods is not well characterized. Therefore, it is not surprising that these tests are rarely utilized, and the importance of the covarion process in molecular evolution is largely unknown.
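To make the 8 × 8 size concrete, the following sketch assembles a covarion rate matrix from a 4 × 4 nucleotide rate matrix and the two switching rates. It is an illustration under assumptions, not the implementation used by any of the cited programs: the state ordering (four ON states followed by four OFF states), the function name, and the convention that s01 is the OFF-to-ON rate are all choices made here.

```python
def covarion_rate_matrix(q, s01, s10):
    """Build a 2n x 2n covarion rate matrix from an n x n substitution matrix q.

    Assumed state order: ON states 0..n-1, then OFF states n..2n-1.
    s01 is taken as the OFF-to-ON rate and s10 as the ON-to-OFF rate.
    """
    n = len(q)
    size = 2 * n
    m = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            if i != j:
                m[i][j] = q[i][j]  # substitutions happen only while ON
        m[i][n + i] = s10          # ON -> OFF, nucleotide unchanged
        m[n + i][i] = s01          # OFF -> ON, nucleotide unchanged
    for k in range(size):
        m[k][k] = -sum(m[k])       # diagonal was 0, so each row now sums to 0
    return m


# Jukes-Cantor matrix scaled to one substitution per unit time while ON
jc = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
cov = covarion_rate_matrix(jc, s01=0.3, s10=0.2)
```

The OFF rows contain a single off-diagonal entry each, which is why OFF sites stay invariant until they switch back ON.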

We describe two new tests for covarion evolution that are based on a test statistic from Tuffley and Steel (1998) and Lockhart et al. (1998). We examine the performance and power of the new tests under a wide range of conditions using simulations. Finally, we use the test to survey for the presence of covarion patterns of evolution in 57 genes obtained from complete plastid genomes.

Methods

Tests of the Covarion Hypothesis
The tests of covarion evolution are based on the Tuffley and Steel (1998) model of the covarion process, though the tests also account for rate variation among sites. The test statistics measure the independence of the substitution process between two groups of taxa. First, an unrooted topology is split into a bipartition. Each element of the bipartition is called a group (fig. 1A). The total number of sites in a sequence alignment is N. Let N₁ be the number of sites that vary in the first group, N₂ be the number of sites that vary in the second group, and N₁₂ be the number of sites that vary in both groups. The probability p₁ that a site varies in the first group is estimated by N₁/N, and N₂/N is an estimate of the probability p₂ that a site varies in the second group. If site variation is independent in the two groups, then the probability p₁₂ that a site varies in both groups would be the product p₁ p₂. The difference w = p₁₂ – p₁ p₂ is a measure of the correlation of site variation in the two groups. Because N₁₂/N is an estimate of p₁₂, we construct the test statistic

W = N₁₂/N – (N₁/N)(N₂/N).

FIG. 1.— Example trees for heterogeneity and covarion tests. (A) is an unrooted tree with a bipartition separating two groups. (B) is an example of a four-taxon rooted tree. T is the distance from the root of each clade to the root of the tree, and t is the distance from each tip to the root of its clade. (C) and (D) represent trees used in the simulation experiments. The tree in (C) represents a tree with deep clades, and the tree in (D) represents shallow clades. The numbers on the branches are the average number of substitutions per site.

If there is no rate variation across sites or within the tree and the underlying substitution process follows the Kimura 3ST (Kimura 1981) or Jukes and Cantor (1969) model, then w is exactly 0 (Tuffley and Steel 1998) because in these models the probability that a site is variable within a group is not dependent on the ancestral nucleotide. This result holds approximately with other reversible substitution matrices. However, under a RAS model, w is positive (Lockhart et al. 1998). For example, if a site varies in one group under a RAS model, it suggests that the site is evolving rapidly. This in turn suggests that the probability that the site is variable in the second group is higher than it would be with no information about its variability in the first group. In other words, site variation in the two groups is positively correlated. If sites are evolving under the Tuffley and Steel (1998) covarion model, w should also be greater than 0. For example, if a site varies in the first group, it suggests that the site is ON at the parent node of the first group. The probability that the site is ON at the parent node of the second group would be higher than if the site were OFF in the first group. It follows that the probability that the site varies in the second group is also higher than it would be without observing variation in the first group. However, the value of w under a COV + RAS model should be lower than it is under the RAS model. Whereas in the RAS model all fast or slow sites in one group are also fast or slow in the other group, under the covarion model, some of the sites will switch from ON to OFF or OFF to ON between groups. These switches diminish the correlation between group variability. Therefore, the test statistic W is an estimate of a value w that is approximately 0 under a homogeneous (neither RAS nor COV) pattern of evolution, positive under a COV or COV + RAS pattern of evolution, and usually even larger under a RAS pattern of evolution. Still, it is not simple to distinguish between models based solely on the value of W.
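The statistic can be computed directly from an alignment once the two groups are fixed. The sketch below plugs the estimates N₁/N, N₂/N, and N₁₂/N into w = p₁₂ – p₁ p₂. It is a minimal Python illustration rather than the authors' C program; the function names, the dictionary representation of the alignment, and the set of gap characters are assumptions. Sites with a gap in any sampled taxon are skipped, which matches how W is computed for the plastid genes.

```python
def varies(column):
    """A site varies within a group if the group shows more than one state."""
    return len(set(column)) > 1


def w_statistic(alignment, group1, group2, gap_chars="-?"):
    """Estimate W = N12/N - (N1/N)(N2/N) for a bipartition of the taxa.

    alignment: dict mapping taxon name to an aligned sequence string.
    group1, group2: lists of taxon names forming the two groups.
    """
    length = len(next(iter(alignment.values())))
    n = n1 = n2 = n12 = 0
    for site in range(length):
        col1 = [alignment[taxon][site] for taxon in group1]
        col2 = [alignment[taxon][site] for taxon in group2]
        if any(state in gap_chars for state in col1 + col2):
            continue  # ignore gapped sites
        n += 1
        v1, v2 = varies(col1), varies(col2)
        n1 += v1
        n2 += v2
        n12 += v1 and v2
    if n == 0:
        raise ValueError("no ungapped sites")
    return n12 / n - (n1 / n) * (n2 / n)


toy = {"a": "AAACA", "b": "AATCG", "c": "AAGGA", "d": "ACGGT"}
w_toy = w_statistic(toy, ["a", "b"], ["c", "d"])
```

In the toy alignment, two of five sites vary in each group and one varies in both, so W = 1/5 – (2/5)(2/5) = 0.04.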

We can compute w analytically for the different models in a simple example. In the calculations that follow, we assume a simpler RAS distribution and substitution model than in the remainder of the paper. They provide formulae that illustrate why W is a useful statistic for examining the covarion and RAS models. Assume that a rooted tree is formed by two clades (groups) each containing two taxa, that each clade's root node is at distance T from the root of the tree, and that each taxon is at distance t from the root node of the clade (fig. 1B). Under a Jukes and Cantor (1969) model with a homogeneous rate equal to r, w is 0. Under a RAS model with a proportion of invariable sites (p_inv) and the remaining sites evolving at rate r,

when t is small. If the evolution process is a COV model, then we can derive from Tuffley and Steel (1998) that

when t is small. We recover the same result as before in case the switching rate ν is 0, because the COV model becomes a mixture of invariable sites and constant rate sites with p_inv = 1 – σ. If we assume a RAS + COV model with the previous parameters, a proportion p_inv of sites are permanently invariable sites and a proportion (1 – p_inv) of sites can turn ON and OFF. Among the sites that may be variable, a proportion (1 – p_inv)σ are ON (variable), and a proportion (1 – p_inv)(1 – σ) are OFF (temporarily invariable), and

when t is small. These results first show that w is positive in all of the heterogeneous (RAS, COV, or RAS + COV) models. Second, it shows the conditions under which w values for different models are close or far apart. For instance, w_cov in the COV model is further from 0 (homogeneous case) when T is small and t is large, and close to 0 when T is large or t small. The effect of T is exponential, unlike t. Thus, T appears to have a greater effect on w than the t values. It is easier to distinguish the homogeneous and COV models when the clade depth is large and the interior branches are small. The difference in w between the RAS and RAS + COV models is

Therefore, it is usually positive because of the small value of the exponential term, yielding a greater value of w in the RAS model than in the COV + RAS model. As in the previous case, w_ras and w_ras+cov are furthest from each other when the clade depth t is large. However, a small value of T decreases the difference between w_ras and w_ras+cov. Thus, a long branch between clades is better to distinguish between the RAS and COV + RAS models.
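One way to see why w is positive in the invariable-sites RAS example is to work from the definition w = p₁₂ – p₁ p₂. The sketch below is a derivation from that definition under stated assumptions, not a transcription of the formulae in the text. It assumes that, for a variable site under Jukes and Cantor, variability in the two clades is independent with the same probability q in each, where q is the probability that two tips separated by a path of 2rt substitutions per site differ. Then p₁ = p₂ = (1 – p_inv)q and p₁₂ = (1 – p_inv)q², so w = p_inv(1 – p_inv)q², which is roughly 4 p_inv(1 – p_inv)r²t² when t is small. The function names are hypothetical.

```python
import math


def jc_two_tip_variable(rate, t):
    """P(two tips differ) under Jukes-Cantor when each tip is t from the clade root.

    The path between the tips has expected length 2 * rate * t substitutions per site.
    """
    distance = 2.0 * rate * t
    return 0.75 * (1.0 - math.exp(-4.0 * distance / 3.0))


def w_invariable_mixture(p_inv, rate, t):
    """w = p12 - p1 * p2 for a mixture of invariable sites and rate-r sites.

    Assumes clade variability is independent across the two clades for
    variable sites, as holds for Jukes-Cantor with a single rate.
    """
    q = jc_two_tip_variable(rate, t)
    p1 = p2 = (1.0 - p_inv) * q
    p12 = (1.0 - p_inv) * q * q
    return p12 - p1 * p2
```

With p_inv = 0 the mixture collapses to the homogeneous model and the function returns 0. In this simple setting the result does not depend on T, because the invariable class never changes along the tree.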

We present two tests that consider a null hypothesis nested within an alternate covarion hypothesis. In the first test, the heterogeneity test, the null hypothesis is a homogeneous model and the alternate hypothesis is any heterogeneous model (COV, RAS, or COV + RAS). If a RAS model has been rejected, this test can use a COV alternate model. In the second test, the covarion test, the null hypothesis is the RAS model, and the alternate hypothesis is the COV + RAS model. The null RAS model has gamma-distributed rate heterogeneity (Yang 1994). The distribution of the test statistic W for both tests is obtained using parametric bootstrapping (e.g., Huelsenbeck et al. 1996). The maximum likelihood parameters used in the parametric bootstrapping are estimated under the null (non-covarion) hypothesis, and many data sets with the same number of characters as the original matrix are simulated using these parameter values. W is computed for each of the simulated data sets to determine its distribution under the null hypothesis. In the heterogeneity test, the P value is the percentage of W's from simulated data sets that are larger than or equal to the observed W, and in the covarion test the P value is the percentage of W's from simulated data sets that are smaller than or equal to the observed W. The calculation of W is implemented in a C program that is available at http://ginger.ucdavis.edu/.
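The two P value rules differ only in the tail of the bootstrap distribution that is counted. A minimal sketch follows, assuming the simulated W values have already been obtained from data sets generated under the relevant null model. The function name and the string labels for the two tests are hypothetical, and the result is returned as a proportion rather than a percentage.

```python
def bootstrap_p_value(observed_w, simulated_ws, test):
    """Parametric-bootstrap P value for the heterogeneity or covarion test.

    heterogeneity: proportion of simulated W >= observed W (upper tail)
    covarion:      proportion of simulated W <= observed W (lower tail)
    """
    if not simulated_ws:
        raise ValueError("need at least one simulated W value")
    if test == "heterogeneity":
        hits = sum(1 for w in simulated_ws if w >= observed_w)
    elif test == "covarion":
        hits = sum(1 for w in simulated_ws if w <= observed_w)
    else:
        raise ValueError("test must be 'heterogeneity' or 'covarion'")
    return hits / len(simulated_ws)
```

The covarion test uses the lower tail because switching between ON and OFF is expected to reduce W relative to its value under a RAS-only null.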

The heterogeneity and covarion tests are similar to the previous nonparametric covarion test of Lockhart et al. (1998) but differ in two ways. First, the Lockhart test incorporates an estimate of p_inv, whereas the heterogeneity and covarion tests do not. It may be difficult to estimate both p_inv and the α shape parameter accurately from a data set with limited taxon sampling (Sullivan et al. 1999). Also, p_inv is difficult to estimate when it differs among groups (Steel et al. 2000). Thus, the test of Lockhart et al. (2000) does not require estimation of p_inv. The original Lockhart et al. (1998) test statistic together with parametric bootstrapping performed poorly in a preliminary simulation analysis (data not shown). However, the performance of the test greatly improved when p_inv was not estimated, and the new test statistic appears to perform well even when data are simulated with a model that includes invariable sites (see below). Second, our test modifies the confidence region w > 0 used by Lockhart et al. (1998) because, as demonstrated analytically in the previous example, w should exceed 0 if sites are evolving under the Tuffley and Steel (1998) covarion model. Instead, the significance of the heterogeneity or covarion tests is determined based on the distribution of the test statistic W from parametric bootstrapping.

Performance of the Covarion Tests
The performance of the heterogeneity and covarion tests was examined with simulation experiments. Trees were generated using a conditional pure-birth (Yule) diversification process with a speciation rate of 2.0 using r8s (Sanderson 2003). The model trees were all rooted with a length of 1 substitution per site from the root to tips (fig. 1C and D). To examine the type 1 error rate (the level) of the covarion test, or how often the test rejects the null hypothesis when it is true, sequences were simulated under an HKY substitution model (Hasegawa, Kishino, and Yano 1985) with a transition to transversion ratio of two, equal nucleotide frequencies, and gamma-distributed rate heterogeneity with four discrete rate categories and a shape parameter α = 0.25 (Yang 1994). To examine the power of the covarion test or how often it correctly rejects the null hypothesis, we simulated data sets according to the same model with an additional covarion process. Similarly, the level of the heterogeneity test was determined by simulating characters using an HKY model without rate heterogeneity, and the power was determined by using a COV model and the HKY substitution model described above. For each set of parameter values, 1,000 trees were generated. Each tree was used to simulate one sequence alignment, and each simulated data matrix had a length of 1,000 nucleotides.

All simulations were done using a version of Seq-Gen (Rambaut and Grassly 1997) modified to incorporate covarion evolution. The modified version of Seq-Gen is available at http://www.ginger.ucdavis.edu/. The test statistic W was calculated using a C program, and maximum likelihood parameter estimates for parametric bootstrapping were obtained using the model tree from PAUP* 4.0b10 (Swofford 2002). The relevant test was then conducted based on 1,000 parametric bootstrap replicates, accepting P values at the 5% level.

We varied several conditions to evaluate sensitivity of the tests. First, the effect of the number of taxa per group on the level and power of the tests was examined. In this experiment, trees had from 4 to 32 taxa contained in two clades of equal size, such that each clade contained 2, 4, 6, 8, or 16 taxa. Additionally, there were two clade depth treatments. In the first, the length from the root of the clade to the tips was 0.5 substitutions per site, and in the second it was 0.1 (fig. 1C and D). Thus, the tree depth to clade depth ratio was either 2 (deep clades; fig. 1C) or 10 (shallow clades; fig. 1D). In order to keep the tree depth equal to 1, branch lengths on the tree were rescaled by a factor of 1/σ before the covarion model simulations. For both of these experiments, the covarion parameters ν and σ were set to 0.4 and 0.6, respectively, as these values are close to estimates in some genes from Huelsenbeck (2002).
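The 1/σ rescaling can be illustrated with a short calculation. The sketch assumes the following interpretation, which the text does not spell out: a site accumulates substitutions only while ON, and at stationarity it is ON a fraction σ of the time, so a branch of length b yields σb expected substitutions per site. Dividing every branch by σ then restores the intended depth. The function names are hypothetical.

```python
def rescale_branch_lengths(branch_lengths, sigma):
    """Divide each branch length by sigma, the ON equilibrium frequency."""
    if not 0.0 < sigma <= 1.0:
        raise ValueError("sigma must be in (0, 1]")
    return [length / sigma for length in branch_lengths]


def expected_substitution_depth(branch_lengths, sigma):
    """Expected substitutions per site along a path if sites are ON a fraction sigma of the time."""
    return sigma * sum(branch_lengths)


# Root-to-tip path of the deep-clade tree: interior branch 0.5, clade depth 0.5
path = [0.5, 0.5]
rescaled = rescale_branch_lengths(path, sigma=0.6)
depth = expected_substitution_depth(rescaled, sigma=0.6)
```

With σ = 0.6, the root-to-tip path of total length 1 is stretched to about 1.67, and the expected depth under the covarion process returns to 1.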

We also determined the effect of the RAS gamma distribution parameters on the level and the power of the covarion tests. Simulations were performed as previously described, except that all trees had eight taxa in both clades. To examine the effect of the α shape parameter on the test, simulations were performed with α values of 0.25, 0.5, 1.0, 1.5, and 2.0 with no invariable sites. We also examined the effect of invariable sites, although the test does not assume their presence. Sequences were simulated with α = 0.25 and 0%, 10%, 15%, and 20% invariable sites. These simulations assess the tests' performance using an incorrect model.

Finally, we examined the effect of ν and σ on the power of the heterogeneity and covarion tests. To examine the effect of ν, the simulations were performed as described in the first simulation experiment except that all trees contained eight taxa in both clades, while the value of ν varied from 0.005 to 2 and σ = 0.6. To examine the effect of σ, ν was set to 0.4 and σ varied from 0.001 to 1.

Covarion Evolution in Plastid Genes
We obtained the amino acid and nucleotide sequences from the protein-coding genes in 23 completely sequenced plant plastid genomes available from GenBank (http://www.ncbi.nlm.nih.gov/; table 1). The number of genes within plastid genomes varied from 25 (Epifagus virginiana) to 174 (Chlorella vulgaris). Sets of potentially homologous amino acid sequences ("clusters") were identified using BLASTCLUST (Dondoshansky 2002) with a clustering threshold of 50% similarity. Of the 72 resulting clusters with at least 12 taxa, we retained 57 that contained at most one sequence from any single taxon. The amino acid sequences from these genes were aligned using ClustalW (Thompson et al. 1994), and the aligned amino acid sequences were used to make nucleotide sequence alignments. The nucleotide alignments are available at http://ginger.ucdavis.edu/. All tests used nucleotide sequence data. For all the model tests, we used a reference tree that represents the likely relationships of plant taxa with complete chloroplast sequences (e.g., Pryer et al. 2002; fig. 2), though there is some controversy regarding the phylogeny of the green plant plastid genome (e.g., Goremykin et al. 2003, 2004; D. E. Soltis and P. S. Soltis 2004).

Table 1 GenBank Accession Numbers for the Complete Plastid Genome Sequences Used in the Tests of Covarion Evolution

FIG. 2.— Reference tree of taxa with completely sequenced plastid genomes (e.g., Pryer et al. 2002). This topology was used in the tests for covarion evolution along with the MP and ML trees made from a single concatenated alignment of all 57 plastid genes.

We first performed a likelihood ratio test on each of the 57 genes to evaluate the null hypothesis of equal rates across sites versus the alternate hypothesis of gamma-distributed rate heterogeneity. We calculated the likelihood for each gene using the HKY model and the HKY model with gamma-distributed rate heterogeneity (Hasegawa et al. 1985; Yang 1994). The likelihood ratio test statistic was evaluated based on a χ² distribution with a 50:50 mixture of 0 and 1 degree of freedom (Self and Liang 1987; Goldman and Whelan 2000; Ota et al. 2000).
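The mixture reference distribution changes the P value in a simple way: the χ² component with 0 degrees of freedom puts all its mass at 0, so for a positive statistic only the 1-degree-of-freedom half contributes. The sketch below illustrates this with the standard library only. The function names and the log-likelihood values in the usage line are hypothetical, and the upper tail of the χ² distribution with 1 degree of freedom is computed from the identity P(X ≥ x) = erfc(√(x/2)).

```python
import math


def chi2_1df_upper_tail(x):
    """P(X >= x) for a chi-square variable with 1 degree of freedom, x >= 0."""
    return math.erfc(math.sqrt(x / 2.0))


def lrt_p_value_mixture(lnl_null, lnl_alt):
    """P value under a 50:50 mixture of chi-square(0) and chi-square(1).

    lnl_null, lnl_alt: maximized log-likelihoods of the equal-rates and
    gamma-rates models (illustrative inputs, not values from the paper).
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    if statistic <= 0.0:
        return 1.0  # both mixture components place all their mass at or above 0
    return 0.5 * chi2_1df_upper_tail(statistic)


p_example = lrt_p_value_mixture(-1000.0, -998.0)  # statistic = 4.0
```

Compared with a plain χ² test on 1 degree of freedom, the mixture halves the P value for any positive statistic, which makes the test less conservative when the null value lies on the boundary of the parameter space.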

Next, we performed the heterogeneity and covarion tests to examine the null hypotheses of homogeneous and RAS evolution for each of the 57 chloroplast genes. Sites with gaps in the alignment may affect the calculation of W, and therefore, W was only calculated from sites that contained no gaps. For the heterogeneity test, the distribution of the test statistic was determined from 2,000 simulations using the maximum likelihood parameters estimated with the HKY model (Hasegawa et al. 1985), with empirical base frequencies and estimating a parameter for ratio of transitions to transversions. Though the parameters were estimated from the full sequence alignment, the simulated data matrices were only as long as the number of sites without gaps (table 2). For the covarion test, the distribution of the test statistic was determined from 2,000 data sets simulated using an HKY model with rate variation among sites following a gamma distribution with four discrete rate categories (Yang 1994) and the reference tree (fig. 2). The original groups used to calculate W were the angiosperms and the nonangiosperms (fig. 2). We also performed the covarion test using eudicots and noneudicots groups as well as seed plant and nonseed plant groups (fig. 2). In order to examine the effect of tree topology, we performed the covarion test with the angiosperm and nonangiosperm partition using parsimony and maximum likelihood trees made using the concatenated nucleotide alignment of all 57 genes. The parsimony tree was found using a heuristic tree search starting from a random sequence addition tree with tree bisection reconnection (TBR) branch swapping. The maximum likelihood tree was constructed with the HKY model (Hasegawa et al. 1985) using empirical base frequencies, a transition/transversion ratio of 2, and no rate variability across sites, using the same tree search heuristic as parsimony. Parsimony and likelihood analyses were done using PAUP* (Swofford 2002). We also investigated the presence of covarion evolution at different codon positions by performing all tests on data sets that included the first and second codon positions and data sets that included only the third codon position.
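Splitting an alignment by codon position is a matter of indexing. The sketch below assumes that each nucleotide alignment is in frame, starting at the first position of a codon, and that it is stored as a dictionary from taxon name to sequence string. Both are assumptions made for illustration, as is the function name.

```python
def split_codon_positions(alignment):
    """Return (first_and_second, third) sub-alignments of an in-frame alignment.

    alignment: dict mapping taxon name to an aligned coding sequence whose
    first column is assumed to be codon position 1.
    """
    first_second = {}
    third = {}
    for taxon, sequence in alignment.items():
        first_second[taxon] = "".join(
            base for index, base in enumerate(sequence) if index % 3 != 2
        )
        third[taxon] = sequence[2::3]
    return first_second, third


pos12, pos3 = split_codon_positions({"taxonA": "ATGGCC", "taxonB": "ATGGCT"})
```

For the two toy sequences, the first and second positions give "ATGC" in both taxa, while the third positions give "GC" and "GT".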

Table 2 Results of Two Tests for 57 Chloroplast Protein-Coding DNA Sequence Datasets

Results

Performance of Covarion Tests in Simulation
The heterogeneity and covarion tests appear to perform well under a wide range of simulation conditions. The level of the tests is close to or below the targeted level of 5% regardless of the number of taxa in each group (fig. 3A). The level appears slightly lower in the covarion test than the heterogeneity test, and the largest difference between the actual and targeted level occurs for the covarion test with shallow clades and few taxa in each clade (fig. 3A). In this case, the test tends to be conservative (level of 0.6% instead of 5%). As the number of taxa increases, the level generally gets closer to the targeted value 5%. The level of the test also remains near 5% in the covarion test as the shape parameter α varies (fig. 4A). However, the level of the test exceeds 5% when there are deep clades and the proportion of invariable sites is 15% or 20% (fig. 4B).
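Level and power are both estimated as a rejection rate over simulation replicates, so the same small calculation serves for either. The sketch below assumes a replicate counts as a rejection when its P value is at most 0.05, and it adds a binomial standard error as a rough guide to Monte Carlo noise. The function name and the example P values are hypothetical.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates rejected at level alpha, with its binomial standard error.

    Gives the level when data were simulated under the null model and the
    power when they were simulated under the alternative.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one replicate")
    rate = sum(1 for p in p_values if p <= alpha) / n
    standard_error = math.sqrt(rate * (1.0 - rate) / n)
    return rate, standard_error


# Illustrative input: 50 rejections among 1,000 null replicates
rate, se = rejection_rate([0.01] * 50 + [0.5] * 950)
```

With 1,000 replicates and a true level of 5%, the standard error is about 0.7 percentage points, which gives a sense of how far an estimated level can drift from 5% by chance.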

FIG. 3.— Results of the simulation study testing the effect of group size on the level and power of the heterogeneity and covarion tests. Each point on the graphs represents 1,000 simulated data sets. The covarion test was performed on data sets simulated with a RAS model using shape parameter α = 0.25, and the heterogeneity test was performed on data sets simulated without rate variation across sites. In (A), the simulations were performed under the null model, and the level of the tests is the percentage of simulation replicates in which the null hypothesis was mistakenly rejected. In (B), the simulations were performed under a covarion model (COV only for the heterogeneity tests; RAS + COV for the covarion test). The power of the test shows the percentage of simulation replicates that correctly rejected the null hypothesis.

FIG. 4.— Results of the simulation study showing the effect of the α shape parameter and p_inv on the covarion test. When simulations use a covarion model (circles), the rejection percentage is the power of the test. When simulations do not use a covarion model (indicated by x in the graph), the rejection percentage is the level of the test. Rate variation across sites was simulated as a mixture of discrete gamma distribution with invariable sites. The tests assume p_inv = 0. In the simulations for (A), the α shape parameter for the gamma distribution varied and p_inv = 0. In the simulations for (B), α = 0.25 and p_inv varied.

The tests' power increases with the number of taxa per clade (fig. 3B). The power of the test is often higher with deep clades and a short intermediate branch than with shallow clades and a long intermediate branch (figs. 3B, 4, and 5). Power is low with only two taxa in each clade, except in case of the heterogeneity test with deep clades (fig. 3B). Six taxa in each clade are usually needed to reach a power of 90%, except for the heterogeneity test with shallow clades, which requires greater than 16 taxa in each clade (fig. 3B). The power of the covarion test is little affected by α or the proportion of invariable sites (fig. 5A). The test power is still above 90% when α is 2 as well as when the proportion of invariable sites is 0.2 (fig. 4). In the covarion test, the power is high when σ is between 0.04 and 0.7 and when ν is below 1. The latter condition should be fulfilled on real data, as the switching process is expected to be much slower than the substitution process. The range of good performance is smaller for the heterogeneity test, requiring σ between 0.2 and 0.6 and ν below 0.6 switches/substitution.

FIG. 5.— Results of the simulation study showing the effect of the covarion parameters (ν and σ) on the power of the heterogeneity and covarion tests. In the simulations for (A), ν is varied and σ = 0.6. In the simulations for (B), σ is varied, ν = 0.4 switches/substitution.

Plastid Genes
The likelihood ratio test strongly rejects (P < 0.0005) a homogeneous rates model in favor of a RAS model in all 57 genes for all sites and for first and second codon positions only. The homogeneous rates model was also strongly rejected in 56 of the 57 genes when only third codon positions were included in the test, the exception being petN (P = 0.115).

The heterogeneity test detected significant heterogeneity (RAS and/or COV) in all of the 57 data sets when all codon positions are included (table 2). When applied to first and second positions only, heterogeneity was detected in 52 out of 57 genes, and in 44 of 57 genes when only third positions are included (table 2). The covarion test detected covarion evolution in 14 of 57 genes across all positions; however, covarion evolution is detected in 26 out of 57 genes when only first and second positions are included (table 2). Only two genes show evidence of a covarion structure in third codon positions, which is not greater than the number of significant tests we would expect by chance alone (table 2).
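The comparison with chance can be made explicit with a binomial calculation. The sketch below assumes that the 57 tests are independent and that each has exactly a 5% chance of rejecting a true null hypothesis. Both are simplifying assumptions, since the genes share a tree and the simulations suggest the covarion test can be conservative. The function names are hypothetical.

```python
import math


def expected_false_positives(n_tests, level):
    """Expected number of rejections when every null hypothesis is true."""
    return n_tests * level


def prob_at_least(k, n_tests, level):
    """Binomial probability of k or more rejections among n_tests true nulls."""
    return sum(
        math.comb(n_tests, i) * level**i * (1.0 - level) ** (n_tests - i)
        for i in range(k, n_tests + 1)
    )


expected = expected_false_positives(57, 0.05)
p_two_or_more = prob_at_least(2, 57, 0.05)
```

Under these assumptions about 2.85 significant results are expected by chance among 57 tests, and two or more occur with probability of roughly 0.79, so two significant third-position tests are unremarkable.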

The results of the covarion test are nearly the same when we used the maximum parsimony (MP) tree and maximum likelihood (ML) tree from the combined data matrix instead of the reference tree (table 3). All of the 57 genes still rejected the homogeneity model with all positions included. With first and second codon positions, heterogeneity was detected in 53 (MP tree) or 52 (ML tree) data sets, and in 42 data sets (for both MP and ML trees) for third positions (table 3). Covarion evolution was detected in 14 (MP tree) or 16 (ML tree) genes when all positions were included, 24 genes with first and second positions, and two genes with third positions only (table 3).

Table 3 Summary of the Results from Heterogeneity and Covarion Tests

Fewer significant results for the covarion test were obtained when taxa were partitioned into eudicot and noneudicot groups rather than angiosperm and nonangiosperm groups, but slightly more significant results were obtained using the seed plant and nonseed plant groups (table 3). In the covarion test with eudicot and noneudicot groups, the RAS-only model was rejected in only one or two genes with all codon positions, 11 genes (7 for MP tree, 10 for ML tree) with first and second positions, and two genes with third positions only (table 3). However, the covarion test with the seed plant–nonseed plant partition generally rejected the RAS model at least as many times as with the angiosperm–nonangiosperm groups. With the seed plant partition, the covarion test rejected the RAS model 15 or 18 times (depending on the tree) for all sites and 27 or 32 times for only the first and second codon positions (table 3). The results of the heterogeneity test were nearly uniform for all sets of groups. Either 56 or all 57 genes rejected the homogeneity model with all positions included, and 51 or 53 genes rejected homogeneity with first and second positions (table 3).

Discussion

Though the idea of covarion evolution was proposed over 30 years ago (Fitch and Markowitz 1970; Fitch 1971), there is still surprisingly little data regarding its frequency or importance. Huelsenbeck (2002) rejected a noncovarion model in favor of a covarion model for 9 out of 11 loci from a variety of organisms, but few other studies have examined more than one or two loci for evidence of covarion evolution (e.g., Lockhart et al. 1998; Galtier 2001; Misof et al. 2002). We present a large-scale analysis of covarion evolution in which we examined most genes from completely sequenced green plant plastid genomes. Our test for covarion patterns of evolution performs well under a wide range of simulated conditions (figs. 3–5) and is simple to implement. Using this test, nearly half of the plastid genes show evidence of covarion evolution in the first two codon positions (table 2), indicating that changes in selective constraints of amino acids through time are an important factor in creating the sequence variation among many plastid genes.

The simulations and the analysis of the plastid genes indicate several conditions under which the heterogeneity and covarion tests will perform well. The levels of the tests are near the desired 5%, except in the heterogeneity test when the interior branch separating the two groups is short and the terminal branches are long or when there is a high percentage of invariable sites (figs. 3A and 5). Thus, the frequency of type 1 error should be relatively low when using the covarion test, or when using the heterogeneity test if the interior branch separating the groups is long compared to the terminal branches. The number of taxa per group is also clearly important for the power of the covarion test, and we recommend that groups have at least six taxa (fig. 3B).

The choice of the bipartition can greatly affect the performance of the heterogeneity and covarion tests. For example, there were many fewer significant covarion tests of the plastid genes using the eudicot–noneudicot bipartition than the angiosperm–nonangiosperm bipartition (table 3). The heterogeneity and covarion tests should work using any true bipartition of the gene phylogeny, but the choice of bipartition can affect the power of the test. We suggest selecting a bipartition based on tree properties, such as the group sizes and the length of the interior branch separating the groups. The interior branch should not be too short so there can be switches in the ON or OFF state of a site between groups. These switches will diminish the correlation of site variability among groups due to the RAS model. Furthermore, the bipartition must also leave an adequate number of taxa in each group. The tree topology can also affect the performance of the test. If the topology is wrong, a significant result may reflect a rejection of the topology rather than the stated null hypothesis. In the plastid data, the covarion test performs similarly using the maximum likelihood or parsimony topologies inferred using all 57 loci (table 3). The results were also similar using the ML or MP topologies of the individual genes (data not shown). However, we suggest using only well-supported topologies to obtain the most accurate results.
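The selection criteria above (adequate group sizes and an interior branch that is not too short) can be sketched as a simple filter. This is a minimal illustration, not a procedure from the study: the `Bipartition` class, the `choose_bipartition` function, and all candidate values are hypothetical, and picking the longest interior branch among eligible splits is our simplification of "not too short". The minimum of six taxa per group follows the recommendation given earlier in this Discussion.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bipartition:
    name: str               # label for the split (hypothetical)
    size_a: int             # number of taxa in the first group
    size_b: int             # number of taxa in the second group
    interior_branch: float  # length of the branch separating the groups


def choose_bipartition(
    candidates: List[Bipartition], min_taxa: int = 6
) -> Optional[Bipartition]:
    """Keep splits with enough taxa per group, then prefer the longest interior branch."""
    eligible = [c for c in candidates if min(c.size_a, c.size_b) >= min_taxa]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.interior_branch)


# Hypothetical candidate splits of a 20-taxon tree (values are made up).
candidates = [
    Bipartition("split_1", 4, 16, 0.30),   # long branch, but one group is too small
    Bipartition("split_2", 8, 12, 0.12),
    Bipartition("split_3", 10, 10, 0.05),
]

best = choose_bipartition(candidates)
print(best.name if best else "no eligible bipartition")
```

Any split chosen this way should still correspond to a well-supported bipartition of the gene phylogeny.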

As expected, the power of the tests decreases when covarion parameters are so extreme that they cause the covarion model to resemble the null hypothesis. Power decreases for both the heterogeneity and covarion tests when ν is large. For example, the power of the covarion test is 80% when ν = 5 and drops further as ν increases (data not shown), and the heterogeneity test appears to be even more sensitive to ν. Power also decreases when ν = 0 for the covarion test (fig. 5A). However, this is not the case when ν = 0 in the heterogeneity test because when ν = 0, the sequences are a heterogeneous mixture of invariable and constant rate sites. The power of the heterogeneity and covarion tests also drops when the ON frequency σ is zero or one (fig. 5B). The power remains high for the low values of σ due to rescaling the branch lengths by a factor of 1/σ, which is necessary to keep an average of one substitution from root to tip of the trees. Due to this constraint, the evolution model does not converge to completely invariable sites when σ approaches zero. However, the model slowly converges to the null hypothesis model (either a homogeneous or RAS model) as σ nears 0.
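The effect of rescaling branch lengths by 1/σ can be illustrated with a small calculation. The sketch below assumes that a site at stationarity is ON for a fraction σ of the time and accumulates substitutions only while ON, so the expected number of substitutions along a path is σ times its total length. The function names and the example path are hypothetical.

```python
from typing import List


def rescale_branch_lengths(branch_lengths: List[float], sigma: float) -> List[float]:
    """Multiply every branch length by 1/sigma (sigma is the stationary ON frequency)."""
    if not 0.0 < sigma <= 1.0:
        raise ValueError("sigma must lie in (0, 1]")
    return [b / sigma for b in branch_lengths]


def expected_substitutions(branch_lengths: List[float], sigma: float) -> float:
    """Expected substitutions per site along a path when the site is ON a fraction sigma of the time."""
    return sigma * sum(branch_lengths)


# Hypothetical root-to-tip path with total length 1 before rescaling.
path = [0.2, 0.3, 0.5]
sigma = 0.25

unscaled = expected_substitutions(path, sigma)
rescaled = expected_substitutions(rescale_branch_lengths(path, sigma), sigma)

print(f"without rescaling: {unscaled:.2f} substitutions per site")
print(f"with rescaling:    {rescaled:.2f} substitutions per site")
```

With the rescaling, the expected root-to-tip divergence stays at one substitution per site for any σ in (0, 1], which is why small values of σ do not drive the model toward completely invariable sites.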

Other tests of covarion evolution are computationally complex and have seen little use (Galtier 2001; Huelsenbeck 2002). The likelihood ratio test for gamma-distributed rate variation across sites appears to be more powerful than our heterogeneity test (table 2). Kelly and Rice (1996) proposed another likelihood ratio test for any distribution of rate variation across sites that uses parametric bootstrapping. If variation in the rate of evolution across sites is causing the heterogeneous patterns of evolution, it may be best to test for heterogeneity with a likelihood ratio test before using the covarion test. However, a likelihood ratio test may not detect heterogeneity if it is caused by covarion evolution without significant rate variation across sites. Therefore, if a likelihood ratio test fails to detect significant heterogeneous evolution, we suggest using our heterogeneity test to see if the COV model may be appropriate. In the plastid loci, both the likelihood ratio test for variation in rates across sites and the heterogeneity tests were nearly always significant, suggesting that heterogeneity in the process of evolution is nearly ubiquitous. Thus, the covarion test is likely more important than the heterogeneity test. The covarion test is a fast and computationally simple alternative to likelihood ratio tests, and it should be useful for screening large numbers of loci for covarion evolution. Also, unlike the current likelihood ratio tests, its performance has been extensively tested with simulations (figs. 3–5). The covarion test detected evidence of covarion drift in 26 plastid loci (tables 2 and 3), and there is strong evidence that this is due to evolution in the rate of nonsynonymous substitution. The most significant tests of covarion evolution were from data sets that included only the first and second codon positions, which are much more likely to represent nonsynonymous substitutions than the third codon position (tables 2 and 3).
Thus, adding the third codon position sites appears to mask evidence of covarion drift, and there is no evidence of significant change in selective constraints of third codon positions. The increased frequency of significant covarion tests using only first and second codon positions is consistent with the hypothesis that changes in selective constraints of the amino acids play an important role in the evolution of the observed patterns of sequence variation in many plastid genes. The plastid gene data sets, especially when only some of the codon sites were included, were often much smaller than the 1,000-bp simulated data sets used to examine the power of the tests, suggesting that the signal for covarion evolution is strong.
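The screening order suggested earlier in this Discussion (a likelihood ratio test for rate variation across sites, the heterogeneity test as a fallback, and then the covarion test) can be sketched as a small decision function. This is one possible reading of that advice rather than a prescribed protocol: the function name `screen_locus`, its arguments, and the returned labels are hypothetical, and the P values are assumed to have been computed elsewhere.

```python
def screen_locus(
    lrt_p: float, heterogeneity_p: float, covarion_p: float, alpha: float = 0.05
) -> str:
    """Classify one locus from three precomputed P values.

    lrt_p           -- likelihood ratio test for rate variation across sites
    heterogeneity_p -- heterogeneity test (null: homogeneous model)
    covarion_p      -- covarion test (null: RAS model)
    """
    # The heterogeneity test serves as a fallback when the likelihood
    # ratio test does not detect heterogeneity.
    heterogeneous = lrt_p < alpha or heterogeneity_p < alpha
    if not heterogeneous:
        return "homogeneous model not rejected"
    if covarion_p < alpha:
        return "RAS model rejected (covarion signal)"
    return "heterogeneous, RAS model not rejected"


# Hypothetical P values for three loci.
print(screen_locus(0.001, 0.020, 0.003))
print(screen_locus(0.400, 0.010, 0.300))
print(screen_locus(0.400, 0.600, 0.010))
```

In this sketch, a locus with no detectable heterogeneity is not passed on to the covarion test, which keeps the number of covarion tests (and hence chance rejections) down when many loci are screened.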

The apparent prevalence of covarion patterns of evolution further suggests that models that do not incorporate rate variation through time may not be adequate for evolutionary inference. The apparent general lack of covarion structure in third codon positions also suggests that the different codon positions evolve under different processes and may not be described adequately with a single model of evolution. We note that our test assumes a stationary covarion drift model of evolution, in which the sites change from the ON to OFF state throughout all lineages in the tree. It does not explicitly test for covarion shift in which there is a large change in the proportion of invariant sites in a specific lineage or part of the tree. It is thus possible that our test is not detecting all instances of covarion evolution. Previous studies have argued that failing to account for the rate variation of sites through time may be problematic for inferring phylogenies, and a covarion structure might help explain the presence of a phylogenetic signal among anciently diverged lineages (Lockhart et al. 1998, 2000; Lopez et al. 1999; Philippe and Germot 2000; Steel et al. 2000). Still, few phylogenetic studies have implemented covarion models of evolution. Vogl et al. (2003) demonstrated that, using standard models of evolution, many plastid genes appear to have significantly different phylogenies, and they postulated that a covarion process of evolution in some loci may explain some of the incongruence. Given our findings that covarion evolution is in fact commonplace among these genes, it would be interesting to examine whether incongruence was in part due to reconstructing gene trees without taking covarion evolution into account.

Acknowledgements

This research was funded by NSF grant DEB0075319. We thank Peter Lockhart and two anonymous reviewers for helpful comments.

Footnotes

¹ Present address: Department of Statistics, University of Wisconsin.

Peter Lockhart, Associate Editor

References

Dondoshansky, I. 2002. Blastclust (NCBI Software Development Toolkit). NCBI, Bethesda, Md.

Fitch, W. M. 1971. Rate of change of concomitantly variable codons. J. Mol. Evol. 1:84–96.

Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579–593.

Galtier, N. 2001. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. 18:866–873.

Goldman, N., and S. Whelan. 2000. Statistical tests of gamma-distributed rate heterogeneity in models of sequence evolution in phylogenetics. Mol. Biol. Evol. 17:975–978.

Goremykin, V. V., K. I. Hirsch-Ernst, S. Wölfl, and F. H. Hellwig. 2003. Analysis of the Amborella trichopoda chloroplast genome suggests that Amborella is not a basal angiosperm. Mol. Biol. Evol. 20:1499–1505.

———. 2004. The chloroplast genome of Nymphaea alba: whole-genome analyses and the problem of identifying the most basal angiosperm. Mol. Biol. Evol. 21:1445–1454.

Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22:160–174.

Holder, M., and P. O. Lewis. 2003. Phylogeny estimation: traditional and Bayesian approaches. Nat. Rev. Genet. 4:275–284.

Huelsenbeck, J. P. 2002. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. 19:698–707.

Huelsenbeck, J. P., D. M. Hillis, and R. Jones. 1996. Parametric bootstrapping in molecular phylogenetics: applications and performance. Pp. 19–45 in J. D. Ferraris and S. R. Palumbi, eds. Molecular zoology: advances, strategies, and protocols. Wiley-Liss, New York.

Jukes, T. H., and C. R. Cantor. 1969. Evolution of protein molecules. Pp. 21–132 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.

Kelly, C., and J. Rice. 1996. Modeling nucleotide evolution: a heterogeneous rate analysis. Math. Biosci. 133:85–109.

Kimura, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 78:454–458.

Liò, P., and N. Goldman. 1998. Models of molecular evolution and phylogeny. Genome Res. 8:1233–1244.

Lockhart, P. J., D. Huson, U. Maier, M. J. Fraunholz, Y. Van de Peer, A. C. Barbrook, C. J. Howe, and M. A. Steel. 2000. How molecules evolve in eubacteria. Mol. Biol. Evol. 17:835–838.

Lockhart, P. J., M. A. Steel, A. C. Barbrook, D. H. Huson, M. A. Charleston, and C. J. Howe. 1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. 15:1183–1188.

Lopez, P., P. Forterre, and H. Philippe. 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49:496–508.

Misof, B., C. L. Anderson, T. R. Buckley, D. Erpenbeck, A. Rickert, and K. Misof. 2002. An empirical analysis of mt 16S rRNA covarion-like evolution in insects: site-specific rate variation is clustered and frequently detected. J. Mol. Evol. 55:460–469.

Miyamoto, M. M., and W. M. Fitch. 1995. Testing the covarion hypothesis of molecular evolution. Mol. Biol. Evol. 12:503–513.

Ota, R., P. J. Waddell, M. Hasegawa, H. Shimodaira, and H. Kishino. 2000. Appropriate likelihood ratio tests and marginal distributions for evolutionary tree models with constraints on parameters. Mol. Biol. Evol. 17:798–803.

Penny, D., B. J. McComish, M. A. Charleston, and M. D. Hendy. 2001. Mathematical elegance with biochemical realism: the covarion model of molecular evolution. J. Mol. Evol. 53:711–723.

Philippe, H., and A. Germot. 2000. Phylogeny of eukaryotes based on ribosomal RNA: long-branch attraction and models of sequence evolution. Mol. Biol. Evol. 17:830–834.

Posada, D., and K. A. Crandall. 2001. A comparison of different strategies for selecting models of DNA substitution. Syst. Biol. 50:580–601.

Pryer, K. M., H. Schneider, E. A. Zimmer, and J. A. Banks. 2002. Deciding among green plants for whole genome studies. Trends Plant Sci. 7:550–554.

Rambaut, A., and N. C. Grassly. 1997. Seq-Gen: an application for the Monte-Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235–238.

Sanderson, M. J. 2003. r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock. Bioinformatics 19:301–302.

Self, S. G., and K.-Y. Liang. 1987. Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Stat. Assoc. 82:605–610.

Shoemaker, J. S., and W. M. Fitch. 1989. Evidence from nuclear sequences that invariable sites should be considered when sequence divergence is calculated. Mol. Biol. Evol. 6:270–289.

Soltis, D. E., and P. S. Soltis. 2004. Amborella not a "basal angiosperm"? Not so fast. Am. J. Bot. 91:997–1001.

Steel, M., D. Huson, and P. J. Lockhart. 2000. Invariable sites models and their use in phylogeny reconstruction. Syst. Biol. 49:225–232.

Sullivan, J., D. L. Swofford, and G. J. P. Naylor. 1999. The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models. Mol. Biol. Evol. 16:1347–1356.

Swofford, D. L. 2002. PAUP*: phylogenetic analysis using parsimony (*and other methods). Version 4. Sinauer Associates, Sunderland, Mass.

Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic inference. Pp. 407–514 in D. M. Hillis, C. Moritz, and B. K. Mable, eds. Molecular systematics. Sinauer, Sunderland, Mass.

Thompson, J. D., D. G. Higgins, and T. J. Gibson. 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.

Tuffley, C., and M. Steel. 1998. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:63–91.

Vogl, C., J. Badger, P. Kearney, M. Li, M. Clegg, and T. Jiang. 2003. Probabilistic analysis indicates discordant gene trees in chloroplast evolution. J. Mol. Evol. 56:330–340.

Whelan, S., P. Liò, and N. Goldman. 2001. Molecular phylogenetics: state of the art methods for looking into the past. Trends Genet. 17:262–272.

Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites. J. Mol. Evol. 39:306–314.

———. 1996. Among-site rate variation and its impact on phylogenetic analysis. Trends Ecol. Evol. 11:367–372.

Accepted for publication December 21, 2004.

This article has been cited by other articles:

K. Shalchian-Tabrizi, M. Skanseng, F. Ronquist, D. Klaveness, T. R. Bachvaroff, C. F. Delwiche, A. Botnen, T. Tengs, and K. S. Jakobsen
Heterotachy Processes in Rhodophyte-Derived Secondhand Plastid Genes: Implications for Addressing the Origin and Evolution of Dinoflagellate Plastids
Mol. Biol. Evol., August 1, 2006; 23(8): 1504 - 1515.

A. Rokas, D. Kruger, and S. B. Carroll
Animal Evolution and the Molecular Signature of Radiations Compressed in Time
Science, December 23, 2005; 310(5756): 1933 - 1938.

P. Lockhart, P. Novis, B. G. Milligan, J. Riden, A. Rambaut, and T. Larkum
Heterotachy and Tree Building: A Case Study with Plastids and Eubacteria
Mol. Biol. Evol., January 1, 2006; 23(1): 40 - 45.

J. Leebens-Mack, L. A. Raubeson, L. Cui, J. V. Kuehl, M. H. Fourcade, T. W. Chumley, J. L. Boore, R. K. Jansen, and C. W. dePamphilis
Identifying the Basal Angiosperm Node in Chloroplast Genome Phylogenies: Sampling One's Way Out of the Felsenstein Zone
Mol. Biol. Evol., October 1, 2005; 22(10): 1948 - 1963.

V. V. Goremykin, B. Holland, K. I. Hirsch-Ernst, and F. H. Hellwig
Analysis of Acorus calamus Chloroplast Genome and Its Phylogenetic Implications
Mol. Biol. Evol., September 1, 2005; 22(9): 1813 - 1822.

M. J. Buck and W. R. Atchley
Networks of Coevolving Sites in Structural and Functional Domains of Serpin Proteins
Mol. Biol. Evol., July 1, 2005; 22(7): 1627 - 1634.
