Gas station without pumps

2011 December 16

PacBio artifacts

Filed under: Uncategorized — gasstationwithoutpumps @ 21:30

I was recently given some PacBio read data to assemble in order to resolve a repeat-rich region of a genome, and I’m curious about some of the artifacts I saw in the data. I had much more data than I needed to assemble the region of interest, so the artifacts are not important for this particular project, but I’m wondering whether anyone has done an analysis of errors other than indels in PacBio reads.

I’m working with the fasta files that are output from the “Secondary Analysis” step of the PacBio pipeline, and I have no access to the PacBio tools themselves, so I don’t know whether the artifacts I’m seeing are the result of the secondary analysis or are present in the movies of the sequencing itself.

The first artifact I looked for was the obvious one that I expected: remnants of the adapter that had not been caught and removed by the secondary analysis.  There was a little of this contamination, but less than I expected.  Out of over 250,000 reads only about 400 had adapter sequences detectable by megablast (and half of those were only detectable by looking for double adapters).  These numbers are so small as to be negligible.

An artifact I was not expecting was for several reads to have a “fold” in them. That is, the sequence would advance along one strand in the normal way, then turn around and go back along the other strand. I’ve not counted how many of these there were, but they occurred often enough to pose some risk of contaminating the assembly. [UPDATE 17 Dec 2011: about 3% of the reads map to both strands using megablast. Of course, these reads tend to be longer, so they are enriched in the set of reads that I used for assembly.] I first noticed them because the gene I was sequencing had a strong A vs. T imbalance, and the suggested insertions were appearing with the wrong letter enriched. When I looked at how the reads mapped to the consensus so far, I found megablast hits like this:

# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
xxx      ua-try16        89.70   2156    28      180     2       2088    3       2033    0.0     2321
xxx      ua-try16        87.97   1214    19      114     2160    3342    1997    880     0.0     1122

Note that the read (which I’ve renamed “xxx”) matches in the forward direction for the first 2000 bases, then matches backwards from there for the next 1200 bases. I had many such reads, and it took some reading for me to realize that these artifacts were caused by the “circular consensus” library preparation protocol, which ligates a hairpin onto each end of double-stranded DNA. The PacBio analysis is supposed to recognize the sequence of the hairpins and use the two halves to build a better consensus, but it clearly failed to do this in many cases.
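As a rough illustration of how such reads could be flagged automatically, here is a minimal Python sketch that scans tabular megablast output and reports reads hitting both strands of the same subject. It assumes the 12-column layout shown in the “# Fields” line above, and it relies on the convention visible in the example that a reverse-strand hit has s. start greater than s. end. The function names are made up for this sketch and are not part of any BLAST or PacBio tool.

```python
from collections import defaultdict

def parse_hits(lines):
    """Yield (query, subject, q_start, q_end, s_start, s_end) from tabular hit lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the "# Fields" comment header
        f = line.split()
        yield f[0], f[1], int(f[6]), int(f[7]), int(f[8]), int(f[9])

def reads_on_both_strands(lines):
    """Return the set of query ids that hit both strands of the same subject."""
    strands = defaultdict(set)
    for query, subject, _qs, _qe, s_start, s_end in parse_hits(lines):
        # Assumption: a reverse-strand hit is reported with s_start > s_end.
        strand = "+" if s_start <= s_end else "-"
        strands[(query, subject)].add(strand)
    return {q for (q, _s), seen in strands.items() if seen == {"+", "-"}}

# The two hits from the post, used as a tiny test input.
example = """\
# Fields: Query id, Subject id, % identity, alignment length, ...
xxx      ua-try16        89.70   2156    28      180     2       2088    3       2033    0.0     2321
xxx      ua-try16        87.97   1214    19      114     2160    3342    1997    880     0.0     1122
""".splitlines()

print(reads_on_both_strands(example))
```

A read is flagged only when both strands of the same subject are hit, so a read that maps forward to one contig and backward to a different one is not counted.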

The technique for generating circular consensus libraries for the PacBio instrument. Image from PacBio literature: http://www.pacificbiosciences.com/assets/files/pacbio_technology_backgrounder.pdf

I could probably extract more information from a set of reads by looking for the reverse complement mapping pairs and splitting the affected reads, so that the two halves could be independently mapped (in opposite directions) and both contribute to the consensus.
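The splitting step could be sketched roughly as follows. This is only an illustration, not something I ran on the data: it assumes each folded read has exactly two hits on opposite strands, takes their query intervals (1-based, inclusive, as in the megablast output), and cuts the read halfway between the end of the first interval and the start of the second. The function name `split_at_fold` is hypothetical.

```python
def split_at_fold(seq, hit_a, hit_b):
    """Split a read between two query intervals that map to opposite strands.

    hit_a and hit_b are (q_start, q_end) pairs, 1-based and inclusive.
    Returns (left, right); the two pieces together cover the whole read.
    """
    (_a_start, a_end), (b_start, _b_end) = sorted([hit_a, hit_b])
    # Cut at the midpoint of the unaligned gap (or of the overlap, if the
    # two intervals overlap slightly). `cut` is the number of bases kept
    # in the left piece.
    cut = (a_end + b_start) // 2
    return seq[:cut], seq[cut:]

# Toy read: 10 bases of one "strand" followed by 6 of the other.
read = "A" * 10 + "T" * 6
left, right = split_at_fold(read, (1, 9), (12, 16))
print(len(left), len(right))
```

For the read “xxx” shown earlier, the query intervals 2–2088 and 2160–3342 would put the cut after base 2124. Cutting at the midpoint is an arbitrary choice; any unrecognized hairpin sequence in the gap would end up split between the two pieces and might need trimming.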

Incidentally, the megablast parameters are not right for aligning PacBio reads, as the gap openings should be much more frequent but the %identity much higher. I did not bother to figure out how to tweak the megablast costs to get better scoring, but in an alignment of a few thousand reads to my final consensus (confirmed by Sanger sequencing), using a different method, I got essentially no base errors, though short insertions and deletions were frequent. The average run length for matches was only 6.88, and the average run lengths for inserts and deletes were 1.28 and 1.12, respectively. Inserts were about 3.82 times as common as deletes. Of course, some of these statistics are artifacts of an alignment method that preferred opening gaps to mismatching bases and preferred many short gaps to fewer longer ones. Still, this suggests a match-match transition probability of 0.85, match-delete of 0.03, match-insert of 0.12, insert-insert of 0.22, and delete-delete of 0.11 (which are not the parameters I used in making the alignment).
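One way to get from the run lengths to transition probabilities of this kind is sketched below. It assumes run lengths are geometrically distributed, so that a state with mean run length L has self-transition probability 1 − 1/L, and it assumes the 3.82 ratio describes how often a match run is followed by an insert run rather than a delete run. Direct insert-delete transitions are ignored. These are assumptions of the sketch; the function name is made up.

```python
def transition_probs(match_run, insert_run, delete_run, insert_to_delete):
    """Estimate pair-HMM-style transition probabilities from mean run lengths.

    Assumes geometric run lengths: mean L implies self-transition 1 - 1/L.
    The probability of leaving the match state is divided between insert
    and delete in the ratio insert_to_delete : 1.
    """
    mm = 1.0 - 1.0 / match_run
    leave_match = 1.0 - mm
    mi = leave_match * insert_to_delete / (1.0 + insert_to_delete)
    md = leave_match / (1.0 + insert_to_delete)
    return {
        "match-match": mm,
        "match-insert": mi,
        "match-delete": md,
        "insert-insert": 1.0 - 1.0 / insert_run,
        "delete-delete": 1.0 - 1.0 / delete_run,
    }

probs = transition_probs(6.88, 1.28, 1.12, 3.82)
for name, p in probs.items():
    print(f"{name}: {p:.2f}")
```

Rounded to two places, these inputs give 0.85, 0.12, 0.03, 0.22, and 0.11, in agreement with the figures quoted above.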

5 Comments »

  1. Yes, I have seen this too. I am trying to assemble a multi-gene locus and it is proving to be a nightmare with ~10 repeats of 11 kb that are close to identical. Have managed two repeats so far but these artefacts initially threw me for a loop (pun intended) as a unique signature present in one repeat was showing up twice in some clones, but in inverted orientation. I have also observed an excess of insertions over deletions over mismatches. Love your analysis and explanation.

    Comment by Laurie Graham — 2013 October 22 @ 08:02

  2. I’ve found much the same but my inversions all have adapter between them. Have you found inversions without adapter?

    I’ve also found partial adapters at the ends of the last sub read. Not many but best removed for assembly work.

    Comment by Colin — 2013 November 28 @ 01:14

  3. I realize this question is quite old, but I’m replying anyway to see if anyone has any answers. These artifacts are consistent in my data. I call them foldovers, and there is no adapter sequence or primer at the fold.

    Comment by Annette — 2018 May 11 @ 09:33

    • I’ve not looked at recent PacBio data, so I have no idea what artifacts are now seen. I think that most of the artifacts occur in the library preparation, and that has changed a lot in the seven years since the data for this post was collected.

      Comment by gasstationwithoutpumps — 2018 May 11 @ 10:00

