Tutorial


In the following worked example, the known chimera AY297986 is investigated.  All sequences used in this example were downloaded from the NCBI website at http://www.ncbi.nlm.nih.gov/.  

The program works by comparing the query sequence (in this case AY297986) with a trusted subject sequence that is phylogenetically close to the query.  One way of obtaining such a subject sequence is to search the public database with the NCBI online Blast tool (at http://www.ncbi.nlm.nih.gov/BLAST/).  Doing so with AY297986 identifies the Nitrospira record X96726 as its 'nearest neighbour'.

Entering both sequences into the program, and running with the default settings, generates the following:

Screenshot of program.

Analysing the output

In essence, the program compares evolutionary distances between query and subject along the length of the 16S rRNA gene, using a sampling window of specified size that advances a fixed number of bases at a time.

So in this example, the program aligned AY297986 with X96726, then recorded the percentage of base mismatches every 25 base positions (i.e., step size = 25), within a sampling window of 300 bases (window size = 300).  The resulting 'observed percentage differences' were then plotted (red line) against base position:

Screenshot detail showing plot of observed differences against base position.

Plot generated by program, showing the variation in base mismatches between query and subject along the length of the 16S rRNA gene.
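
The sampling-window calculation described above can be sketched in Python as follows.  This is a minimal illustration, not the program's own code: it assumes the two sequences are already aligned to equal length, it counts any differing character (gaps included) as a mismatch, and the function name window_differences is hypothetical.

```python
def window_differences(query, subject, window=300, step=25):
    """Return (start_position, percent_mismatch) pairs, one per window.

    Assumes query and subject are pre-aligned strings of equal length.
    Any differing character, gaps included, counts as a mismatch
    (a simplification made for this sketch).
    """
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        s = subject[start:start + window]
        mismatches = sum(1 for a, b in zip(q, s) if a != b)
        results.append((start, 100.0 * mismatches / window))
    return results


# Toy aligned pair: identical over the first 20 bases, different over the last 20.
query = "ACGT" * 5 + "AAAA" * 5
subject = "ACGT" * 5 + "CCCC" * 5
profile = window_differences(query, subject, window=10, step=10)
```

With real 16S rRNA sequences the defaults (window 300, step 25) correspond to the settings used in this example; the toy call uses smaller values only because the toy sequences are short.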

This observed percentage differences line reveals the degree to which the query and subject vary from one another along the length of the gene.  The mean of these data is roughly equivalent to the uncorrected evolutionary distance, and in this example is 10.3%.  

The expected percentage differences line (in gray) illustrates the sort of variation one might normally expect to see between sequences exhibiting this degree of evolutionary distance.  As can be seen, this line is largely horizontal (the undulations reflecting the positions of the hypervariable regions in the gene), and is in stark contrast to the observed results.

In effect, the plot shows that the query and subject are virtually identical at the 3' end (with a mean difference of less than 1%), yet distinctly different at the 5' end (mean difference roughly 20%) - a pattern typical of a chimera.

The variation between observation and expectation is summarised by the Deviation from Expectation (DE) statistic.  In this example it is 8.48.  To put this figure in context, the program summarises DE values obtained from reliable type-strain comparisons with similar evolutionary distances.
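
The text does not give the formula behind the DE statistic.  Purely as an illustration of how a single number can summarise the gap between the observed and expected lines, the sketch below uses the root-mean-square of the per-window differences.  That choice is an assumption of this example and may differ from the program's own calculation; the function name and all numbers are invented.

```python
import math


def deviation_from_expectation(observed, expected):
    """Root-mean-square of the per-window (observed - expected) values.

    ASSUMPTION: this formula is a stand-in chosen for illustration; the
    tutorial does not state how the program computes DE.
    """
    if len(observed) != len(expected) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - e) ** 2 for o, e in zip(observed, expected)]
    return math.sqrt(sum(squared) / len(squared))


# Invented per-window percentage differences (not taken from the example).
expected = [10.0, 11.0, 10.0, 9.0, 10.0, 11.0]
flat_observed = [10.5, 11.5, 10.5, 9.5, 10.5, 11.5]   # tracks expectation closely
skewed_observed = [20.0, 21.0, 19.0, 1.0, 0.5, 1.0]   # chimera-like pattern

de_flat = deviation_from_expectation(flat_observed, expected)
de_skewed = deviation_from_expectation(skewed_observed, expected)
```

An observed line that follows the expected line gives a small value, while one that is far above expectation at one end and far below it at the other gives a large one.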

Screenshot detail, illustrating DE values generated from typestrain comparisons.

Table generated by program, summarising DE values from reliable type strain sequences.

Fewer than 0.1% of the type strain comparisons have a DE higher than 5.10.  Therefore, based on these previous type-strain comparisons, the probability of two sequences producing a DE of 8.48, when they differ by 10.3% overall, is estimated to be P < 0.001.

On the basis of this DE value, the program concludes that there is strong evidence of a sequence anomaly.
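
The probability quoted above is an empirical tail estimate: the fraction of reference comparisons whose DE exceeds the observed value.  A minimal sketch of that idea follows, with an invented list standing in for the program's type-strain table (which is not reproduced here) and a hypothetical function name.

```python
def empirical_tail_fraction(reference_values, observed):
    """Fraction of reference DE values strictly greater than the observed DE."""
    if not reference_values:
        raise ValueError("reference_values must not be empty")
    exceeding = sum(1 for v in reference_values if v > observed)
    return exceeding / len(reference_values)


# Invented reference DE values; the real type-strain data are not shown in the text.
reference_de = [0.8, 1.1, 1.4, 1.9, 2.3, 2.7, 3.0, 3.6, 4.2, 5.0]

p_high = empirical_tail_fraction(reference_de, 8.48)  # no reference value exceeds 8.48
p_mid = empirical_tail_fraction(reference_de, 2.5)    # five of the ten exceed 2.5
```

With a finite reference set, a fraction of zero is best read as an upper bound (less than one in the number of comparisons), which is why the result is reported as P < 0.001 rather than as an exact value.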

Is the subject reliable?

The above output identifies the query as chimeric.  But can the subject X96726 be trusted?  We need to confirm that the detected abnormality does indeed derive from the query and is not the result of a chimeric subject.  

The simplest way of checking the reliability of the subject is to re-run the program, this time with X96726 as the query and its nearest neighbour as subject.  A further Blast search reveals the Nitrospira record AB021303 as being phylogenetically closest, and with this sequence the following output is generated:


No anomaly is detected.  A check of the details of records X96726 and AB021303 shows that they came from separate clone libraries generated by different research groups, and so are entirely independent of one another.  We can therefore be confident that X96726 is not anomalous.

So our query AY297986 is definitely the source of the anomaly.  As a final confirmatory step, if AY297986 is compared with AB021303, the same chimeric pattern is observed.

Further investigation

AY297986 is clearly chimeric, with the 3' end originating from the Nitrospira taxon and the 5' end deriving from a phylogenetically more distant source.  The simplest way to identify that second source is to Blast search with the 5' end alone, and then repeat the above procedure.  

The following screenshots show the query being progressively trimmed until the Nitrospira end is removed.  Note how the expected line moves progressively closer to the observed line as more of the 3' end is removed.


(i) Original query AY297986, compared with subject X96726.

(ii) Same comparison but with 178 bases removed from 3' end of query.

(iii) Same comparison but with 359 bases removed from 3' end of query.

(iv) Same comparison but with 540 bases removed from 3' end of query.

(v) Same comparison but with 753 bases removed from 3' end of query.


The remaining sequence is then used in a Blast search, identifying the Firmicutes sequence AJ243189 as its nearest neighbour.  Using this sequence as the new subject we get the following plot:

Query compared with the Firmicutes record AJ243189.

Using different window sizes and estimating the breakpoint

Altering window and step size changes the amount of detail displayed in the plot.  Reducing the window size to 50, for example, allows the hypervariable regions to be defined more clearly, as well as enabling a more accurate determination of break position:

The query compared with  AJ243189, using a window size of 50.

The query compared with  X96726, using a window size of 50.

Superimposing one plot over the other, we can estimate the breakpoint to lie at approximately E. coli position 700.
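
Superimposing the two plots amounts to finding where the two observed-difference curves cross: on one side of the break the query is closer to one subject, and on the other side it is closer to the other.  A minimal sketch follows, assuming both profiles were computed at the same window positions.  The function name and the numbers are illustrative only and are not read from the plots, and the sketch does not convert alignment positions to E. coli numbering.

```python
def estimate_crossover(positions, diff_a, diff_b):
    """Return the position where the two difference profiles cross.

    positions: window positions shared by both profiles.
    diff_a, diff_b: percentage differences of the query against subjects A and B.
    Interpolates linearly between the two windows on either side of the
    sign change; returns None if the curves never cross.
    """
    if not (len(positions) == len(diff_a) == len(diff_b)):
        raise ValueError("all inputs must have the same length")
    gaps = [a - b for a, b in zip(diff_a, diff_b)]
    for i in range(1, len(gaps)):
        if gaps[i - 1] == 0:
            return float(positions[i - 1])
        if gaps[i - 1] * gaps[i] < 0:
            # Fraction of the way from window i-1 to window i where the gap is zero.
            frac = gaps[i - 1] / (gaps[i - 1] - gaps[i])
            return positions[i - 1] + frac * (positions[i] - positions[i - 1])
    if gaps and gaps[-1] == 0:
        return float(positions[-1])
    return None


# Invented profiles: subject A matches the query on the left, subject B on the right.
positions = [100, 300, 500, 700, 900, 1100]
vs_a = [2.0, 2.0, 2.0, 12.0, 20.0, 20.0]
vs_b = [20.0, 20.0, 18.0, 8.0, 2.0, 2.0]
breakpoint_estimate = estimate_crossover(positions, vs_a, vs_b)
```

A smaller window gives more closely spaced and less smoothed points around the crossing, which is why reducing the window size sharpens the estimate.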

In Conclusion

AY297986 is confirmed as a two-fragment chimera with the 5' end derived from a Firmicutes bacterium, the 3' end of Nitrospira origin, and the breakpoint occurring in the region of base position 700.