I recently came across this list of orthology databases, which can be very useful for people trying to figure out what various proteins and genes do, and even more useful for those trying to make tools that rely on comparative genomics.
What is orthology and why is it important?
Let’s start with a more general concept: homology. Two sequences are said to be homologous (to be homologs) if they are both descended from a common ancestral sequence. The term can also be applied to phenotypic traits, if the genes and regulatory control regions producing the traits are homologous. Note that it is generally incorrect to talk about “percent homology,” since sequences either are or are not homologous. Generally, when people use that term they mean “percent identity” (the fraction of aligned bases or amino acids that are identical between the two sequences) or occasionally “percent similarity” (the ratio of the similarity of the two sequences to the similarity of a sequence to itself, using some arbitrary definition of similarity such as the Smith-Waterman alignment score with a particular BLOSUM matrix).
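To make “percent identity” concrete, here is a minimal Python sketch. It assumes the two sequences have already been aligned (gaps written as "-"), and it counts only columns where neither sequence has a gap. That choice of denominator is an assumption: other conventions divide by the full alignment length or by the length of the shorter sequence, and they give different numbers. The function name is made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of ungapped alignment columns in which the residues match.

    Assumes both strings come from the same pairwise alignment, so they
    have equal length and use '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x.upper() == y.upper():
            matches += 1

    if compared == 0:
        return 0.0  # no ungapped columns to compare
    return 100.0 * matches / compared


# Four of the five ungapped columns match.
print(percent_identity("ACGT-A", "ACCTTA"))
```

Whatever the number comes out to, it describes the alignment, not whether the sequences are homologous.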
It is, however, reasonable to talk about close or distant homology, referring to the evolutionary time back to the common ancestor. Since close homologs usually have near-identical sequences and distant homologs have diverged much more, the percent identity between sequences is often a good proxy for the evolutionary distance.
Homologs arise in evolution by two main mechanisms: gene duplication within a genome and speciation. Biologists classify pairs of homologs according to which of these two mechanisms occurred at the split from the common ancestor. If the split between the sequences occurred as a gene duplication, the homologs are referred to as paralogs, while if the split was a speciation event, then the sequences are called orthologs.
It is not always easy to tell whether a pair of homologs are paralogs or orthologs, as there may have been several intervening gene duplication or speciation events on the lineage of either sequence. The evolutionary history needs to be accurately reconstructed to tell which event occurred at the split between the sequences. Furthermore, the whole notion of homologs, paralogs, and orthologs assumes that the sequence is an atomic object in evolution, and that its evolutionary history is a tree. Many proteins are formed by arrangements of protein domains, each of which may have a different evolutionary history, and for distantly related proteins, only parts of the proteins may be homologous, not the entire proteins.
It can also be the case that several paralogs within one species are orthologous to a single protein in a different species, if the duplication of the genes forming the paralogs occurred more recently than the speciation event. For that matter, it is possible to have several paralogs in both species, each of which is orthologous to all the paralogs in the other species.
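The definitions above reduce to a simple rule once a gene tree with labeled internal nodes is available: find the most recent common ancestor of the two genes and look at which kind of event it represents. The Python sketch below assumes such a tree is already in hand, which is the hard part in practice. The Node class, the gene names, and the toy tree are all hypothetical. The tree has a speciation at the root followed by a duplication in one species, so it shows two paralogs that are both orthologous to a single gene in the other species.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str                    # gene name for leaves, any tag for internal nodes
    event: Optional[str] = None   # "speciation" or "duplication" (internal nodes only)
    children: List["Node"] = field(default_factory=list)


def path_to(node: Node, gene: str, path: Tuple[Node, ...] = ()) -> Optional[Tuple[Node, ...]]:
    """Return the root-to-leaf path ending at the leaf named `gene`, or None."""
    path = path + (node,)
    if not node.children:
        return path if node.label == gene else None
    for child in node.children:
        found = path_to(child, gene, path)
        if found is not None:
            return found
    return None


def classify(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify two leaves by the event at their most recent common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    path_a = path_to(root, gene_a)
    path_b = path_to(root, gene_b)
    if path_a is None or path_b is None:
        raise KeyError("gene not found in tree")

    # The last node shared by both root-to-leaf paths is the common ancestor.
    ancestor = root
    for x, y in zip(path_a, path_b):
        if x is not y:
            break
        ancestor = x
    return "orthologs" if ancestor.event == "speciation" else "paralogs"


# Toy tree: species X keeps one gene; species Y duplicated it after the split.
tree = Node("root", "speciation", [
    Node("geneX"),
    Node("dupY", "duplication", [Node("geneY1"), Node("geneY2")]),
])

print(classify(tree, "geneX", "geneY1"))
print(classify(tree, "geneX", "geneY2"))
print(classify(tree, "geneY1", "geneY2"))
```

In this toy tree both geneY1 and geneY2 come out as orthologs of geneX, while geneY1 and geneY2 are paralogs of each other.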
Why does orthology matter?
If it is so difficult to tell paralogs from orthologs, why does anyone bother?
Biologists believe (on theoretical rather than empirical grounds) that orthologous genes or proteins are more likely to share a function than paralogous ones. This orthology conjecture drives a lot of functional inference and annotation. It has recently been challenged by those who claim that sequence identity (closeness of homology) is a better predictor of functional similarity than the ortholog/paralog distinction. The evidence on both sides of the question is still rather weak, particularly since orthology is itself inferred mainly from closeness of homology, which makes the two models hard to distinguish.
Why so many databases?
Lots of people have needed to annotate pairs of sequences as orthologs or paralogs, but because they have different applications and because the ortholog/paralog inference is difficult, they have ended up with subtly different solutions. For example, people looking for functions of protein domains may want orthology information for individual domains, while those looking for functions of whole proteins may want to use the domain architecture of the whole protein both for orthology inference and function prediction. Those looking for subtle relationships among sequences from recently diverged species (such as the primates) may have very different needs from those looking for relationships between eukaryotic and bacterial or archaeal sequences.
Why not roll your own?
If choosing orthologs is so application-dependent, why use someone else’s database, rather than defining your own set? Indeed, the proliferation of orthology databases comes from just this approach. But creating a set of orthologs is quite tricky, and the simple techniques that people generally use when they create their own sets of orthologous pairs (like reciprocal best BLAST) do not work very well. So if there is a well-constructed and maintained database whose definitions and scope are suitable for the problem you are addressing, it is probably better to use the database than to try to redo the work yourself—unless (of course) you have come up with a better method for inferring orthology than the methods used in the existing databases. If you have a better method, you should probably put your own database on the web for others to use.
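For readers who have not seen it, the reciprocal-best-hit idea mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inputs are hypothetical dictionaries mapping (query, subject) pairs to similarity scores, standing in for the results of searching genome A against genome B and vice versa. No real BLAST output is parsed, and the gene names and scores are invented.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first seen wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores: b1 and b2 stand for two paralogs in genome B that are
# both related to a1 in genome A.
a_vs_b = {("a1", "b1"): 200.0, ("a1", "b2"): 180.0, ("a2", "b3"): 150.0}
b_vs_a = {("b1", "a1"): 200.0, ("b2", "a1"): 180.0, ("b3", "a2"): 150.0}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

The toy data shows one limitation of the approach: a1 is paired with b1 only, so if b2 arose by a duplication after the species split, that orthologous relationship is silently dropped.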