Gas station without pumps

2012 April 15

Reconstructing genes from PacBio reads

Filed under: Uncategorized — gasstationwithoutpumps @ 11:09
Tags: , ,

In Working with other people’s data, I talked a bit about the data I was using for a gene sequencing project:

… there is a prokaryotic gene that is highly variable and it varies by having a couple of repeat regions that have more or fewer copies of a longish repeat sequence.  The repeat sequence is not always the same, but recent duplication events can result in 1000-long identical blocks.  The repeat block itself may be over 3000 bases long. The goal of the project is to accurately reconstruct the gene for 100s of different strains.

I mentioned the importance of long reads:

More important for me than the average length is the greatly increased number of long reads. That extra length makes a huge difference in how easy it is to reconstruct the repeats—having reads that actually span the repeat blocks makes reconstruction much, much easier. 

And I talked about the error model:

I used a very different error model for Pacbio reads than I would for other sequencing technologies.  …  For the PacBio reads, I treated all errors as indel errors, because indels are very common and base substitution is fairly rare.  

Today I want to talk a little about how the reconstruction works and about some of the errors in PacBio reads other than short indels are important to consider in doing the reconstruction of the gene sequence.

The basic method I’m using for reconstructing a gene is a simple iterative procedure for improving a guess at the gene sequence:

  • Find subset of the reads that are similar to the guess.
  • Orient the reads so that they are all on the same strand as the guess.
  • Build a profile hidden Markov model (HMM) from the guess and retrain it on the reads found. Use model surgery to improve the structure of the HMM.
  • Find a larger set of reads that are similar to the consensus of the retrained HMM, and orient them.
  • Align the larger set using the HMM.
  • Make a new guess based on the consensus of the alignment.
  • Starting from this new guess, repeat the procedure.

There are a lot of details to be filled in to make this outline of the process functional. I’ll cover just a few of them in this post.

Selecting seeds

One of the most important things in any iterative-improvement method is starting from a good initial guess (I call the initial guess a “seed”).  Unless you are working on very easy problems, iterative-improvement methods nearly always suffer from “local optima”—you can only find the best solution close to the guess you start with.  If your seed is really bad, you may converge to something other than the best overall solution.

A technique that has been used in other genome assembly projects is to start from a “reference sequence” and look for changes.  Because the gene I’m trying to sequence has been sequenced in other strains, this initially looks like a good approach—big chunks of the gene have pretty high conservation.  But there is a big danger of “reference bias”—reconstructing the gene to look like the reference, even though the data supports a different model of the repeats better.

I have seen that iterative-improvement method I’m using is reluctant to change the number of repeats in a block. If it starts from a seed with too few repeats, it is unlikely to align the repeats so perfectly that many reads support adding an extra repeat in the same place (which is what would be needed for model surgery to insert the repeat).  The correction of the number of repeats does happen sometimes, but not often, so starting from a reference sequence would introduce a large bias toward keeping the number of repeats the same.  Getting the number of repeats right is precisely the problem that the expensive long-read sequencing is supposed to be addressing, so I cannot tolerate this reference bias.

Instead of starting from a reference sequence, I start from one of the reads.  This means that the iteration does not have any reference bias, but it brings up another question: which read should I use as a seed?  Much of the work I’ve done over the last few months has been on ways to choose good seeds.

I started with an obvious method, suggested by something I saw on the Pac Bio site, but now can’t find: using the longest reads as seeds.  This turned out to be a terrible choice.  Looking at the histogram of lengths in my Working with other people’s data post, you can easily see that in the C2 chemistry, almost all the really long reads are artifacts: they’re longer than the input DNA fragment.  What is not so obvious is that the long reads in the older chemistry were also full of artifacts. The artifacts seem to be of two types: doubled reads and circular reads without adapters.

The doubled reads look like what would happen if two of the double-stranded DNA molecules were ligated together before the hairpin adapters were added in library preparation.  If you take a chunk out of the middle of such a doubled molecule, you’d see a circular permutation of the original DNA fragment.

The circular reads without adapters look like what would happen if you managed to circularize the fragment without an adapter, with the end of the molecule followed by its reverse complement, without an intervening adapter.  I suppose that could result from the same sort of mechanism as a doubled read: if you ligate two of your double-stranded molecules together head to head (instead of head to tail), then a chunk out of the middle would look like a mirroring about one of the ends of the original molecule.  I looked at a few of the very long reads for the C2 chemistry, and they looked like the sequence followed by the reversed sequence, though there were some other artifacts that were harder to characterize.

It would be tempting to use quality data to select seeds, but much of the data I had was in fasta format without quality data.  In any case, I expect a similar problem as with selecting by length:  there may be some very good reads that have duplication artifacts or that are from contamination of the initial DNA sample.

I ended up using a reference sequence, but only to select reads to use as seeds.  I did not use a full gene as a reference sequence, but took the three non-repeat regions (there are two different variable repeat blocks in this gene), and looked for reads that matched before and after the first repeat block and for reads that matched before and after the second repeat block (looking for a consistent orientation of the matches).  Rather than taking just one seed, I took ten seeds, five attempting to span the first repeat block and five attempting to span the second repeat block.  After training for a few iterations from each seed, I took the consensus of the HMM that scored reads the best from each set of five, then merged those consensus sequences to get a full-length consensus sequence.  That full-length consensus was then used as a seed for more iterations.

It turned out that the most important criterion for a successful reconstruction of the gene was finding at least one good read to use as a seed spanning the first repeat block and one good read to use as a seed spanning the second repeat block.

HMM regularizers

The HMM program I’m using is the the buildmodel program from the SAM HMM suite, mainly because I’m very familiar with the suite, having used it for about 15 years in protein structure prediction.  It is not optimized for the problem I’m giving it here, as I don’t want to change the emission probabilities of the states, and am mainly interested in its model surgery capabilities (which are not as robust as I’d like, as they have not been exercised much).  I solved the problem of the emission probabilities by giving the HMMs a very stiff emission regularizer and a moderately stiff transition regularizer that makes mismatches much more expensive than insertions and deletions.  The transition regularizer was trained from an alignment of a few hundred good reads to the correct sequence.  I ended up using different regularizers for the C1 and C2 chemistries, because I had the data to train them separately, but they are not so different that one would get a huge difference in results from using the wrong one.

Orienting Reads

One problem with SAM’s buildmodel is that it expects all the training data to be on the same strand.  (This was never a limitation when working with protein sequences, but it is a big problem when working with unoriented reads.)  I needed to orient the reads, but I quickly found out that there are a lot of strand-switching artifacts in PacBio reads.  A lot of them look like circularization problems, sometimes with fragments that are not full length.  That is, the reads look like A A’ or even A A’ A, rather than random chimeras, but the mirroring is not necessarily at the end of the original DNA fragment.  I suspect that careful library prep makes a big difference in how much of what sort of artifact ones sees, but that is well outside the control of a bioinformatician.

To select and orient reads, I used blastn to find matches between all the reads and the current seeds.  After selecting some number of them as training reads, I used the blastn output to select orientation for either the whole read or for two parts to the read, pretending that it had the form A A’, and looking for the best place to put the symmetric cut point. This orienting of subparts of reads gave me much cleaner training data for the HMMs.  It was an absolutely essential step in removing artifacts that otherwise trapped the iterative improvement into bogus optima.  Just discarding reads that matched both strands would have eliminated too many of the best reads.

Note that this read orientation relies on there not being any significant mirroring happening in the real DNA being sequenced.  That was true for the gene I was working on, but would definitely not be true for all sequencing one might want to do.  For example, David Bernick identified transpositions in the Pyrobaculum oguniense genome that result from homologous (and so far as we can determine identical) regions of the genome that are oriented in opposite directions.  Trying to reconstruct a piece of DNA that contained such an inverted repeat would need a different orientation method than the one that I’m using for the gene I’ve been reconstructing from PacBio data.



There are still a lot of details I’ve not talked about, and I haven’t gotten to one of the interesting results (how many reads are needed to do a decent reconstruction), but this post is getting too long, so I’ll leave those for a later post.


  1. […] the highly repetitive gene I’ve been working on, despite the noisy reads.  The problem is to recreate a gene that has long multiple repeats in it that evolve very rapidly (massive tandem duplication or deletion in a matter of days) from […]

    Pingback by PacBio error model « Gas station without pumps — 2012 June 11 @ 23:29 | Reply

  2. Hi, We just recently develop a method that we call “Hierarchical Genome Assembly Process” (HGAp). You can find some code and example in PacBio’s DevNet. Overall, it follows similar line of thought in your post, although the details of the current implementation is quite different. In the part that we call “pre-assembly” we use as graph based method to generate good consensus. A HHM based can be used. Indeed, the final part in the assembly process, we use “Quiver” which does use HMM to get the best consensus at the end. We will likely use it for the pre-assembly part too. Please take a look at
    Any feedback is welcome. You might be able to apply to your method to the released data sets too.

    Comment by Jason Chin — 2012 November 10 @ 21:28 | Reply

  3. sorry, the link to our reference implementation of HGAp is actually

    Comment by Jason Chin — 2012 November 10 @ 21:31 | Reply

  4. […] and Although I did not publish the work formally, I shared it with the collaborators at PacBio, who […]

    Pingback by Sabbatical leave report | Gas station without pumps — 2015 September 8 @ 22:23 | Reply

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: