semantic phenotype annotations and descriptive taxonomy

Stunning Evaniscus rufithorax specimen – one of the reasons I am enamored of these wasps.

Our lab group has published a small flurry of papers in the last two years, through which we highlight problems in the way that phenotype data are represented (visually and textually) in descriptive taxonomy and comparative morphology. I should probably make time to write these ideas up as blog posts, as I feel pretty strongly that the issues – if not our lab’s solutions – warrant deeper discussion. The two most recent and maybe most approachable syntheses are:

  1. Deans AR, Mikó I, Wipfler B, Friedrich F (2012) Evolutionary phenomics and the emerging enlightenment of arthropod systematics. Invertebrate Systematics 26: 323–330. doi: 10.1071/IS12063
  2. Deans AR, Yoder MJ, Balhoff JP (2012) Time to change how we describe biodiversity. Trends in Ecology and Evolution 27 (2): 78–84. doi: 10.1016/j.tree.2011.11.007

The latest installment, which provides real-world examples of semantic phenotype annotations in the context of descriptive taxonomy (see our TREE opinion), came out today in ZooKeys:

  1. Mullins PL, Kawada R, Balhoff JP, Deans AR (2012) A revision of Evaniscus (Hymenoptera, Evaniidae) using ontology-based semantic phenotype annotation. ZooKeys 223: 1–38. doi: 10.3897/zookeys.223.3572

One thing we attempted to do in this paper is layer semantic phenotype annotations (composed in OWL, linked to relevant phenotype ontologies; see Deans et al. 2012) on top of our natural language character descriptions. The end result should be a more explicit textual representation of the phenotype represented in the character. For example, taxonomists, including us, are obsessed with measuring body parts. In this paper we measured the diameter of the lateral ocellus (LOD, for short) and compared it to the shortest distance between the lateral ocellus and the compound eye (OOL, for short):

The ocelli are the somewhat circular light-detecting structures on top of the wasp’s head. This individual has three of them – two lateral ocelli and one median ocellus. I annotated one lateral ocellus to (poorly) illustrate the measurements we made.

The ratio of these measurements in Evaniscus wasps is diagnostic at the species level. But how do we represent these data in a species description? Maybe something like “ocellar ocular line length as long as or longer than lateral ocellus diameter” or maybe “OOL ≥ LOD” or maybe even as a character and its state: “ocellar ocular line length vs. lateral ocellus diameter: as long or longer”. Or someone could be even more verbose and describe the character as “the shortest distance between the lateral ocellus and the margin of the compound eye equal to or perhaps longer than the diameter of said ocellus”.

You see where I’m going. There are many different ways to represent this phenotype using prosaic natural language. Human readers probably would interpret most of these variations correctly and understand the character. But what happens when we pool all species descriptions for Hymenoptera (>145,000 described species) or, even better, for Insecta (>1,000,000 described species)? Can humans read all these descriptions (composed by thousands of taxonomists with different backgrounds, preferences, eccentricities, or even different languages) and interpret the characters correctly? Almost certainly not. Yet data about ocellus size are potentially relevant to many scientific endeavors. Maybe I found an ocellus mutation in Drosophila melanogaster and want to know how common the phenotype is in nature. Or maybe I have a hypothesis about ocellus size (say, that these structures are relatively larger in temperate insects, where light patterns vary greatly according to season) that requires massive amounts of standardized phenotype annotations (connected to distribution data) for a proper test.

So, can we represent complex phenotypes using rigorous concepts and a standard syntax, like we do with DNA (i.e., IUPAC nucleotide symbols)? Our working—let’s call it draft—solution is to represent the phenotype in OWL, using multiple phenotype-relevant ontologies. The ocellus character state mentioned above looks like this and is applied to the specimen examined:

has_part some (ocular ocellar line and (is bearer of some (length and ((increased_in_magnitude_relative_to some (length and (inheres in some lateral ocellus))) or (similar_in_magnitude_relative_to some (length and (inheres in some lateral ocellus)))))))

Looks weird perhaps, but the goal is to write these annotations in a way that makes them understandable/retrievable by computers. These annotations would appear in addition to natural language prose in our new model for descriptive taxonomy. I’ll write a bit more about the approach after our next paper is finished, as we have a fair bit of discussion about advantages, limitations, and real utility (e.g., we do some basic queries across a larger species description data set).
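To give a feel for what “understandable/retrievable by computers” could mean, here is a rough Python sketch. It is not how the annotations in the paper were built (those are composed in OWL and linked to real ontologies); it is only a toy model of the same nested expression. The class names (Named, Some, And, Or), the helper functions, and the specimen identifiers are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple, Union


@dataclass(frozen=True)
class Named:
    """A named ontology class, e.g. 'lateral ocellus' (label only, no IRI)."""
    label: str


@dataclass(frozen=True)
class Some:
    """Existential restriction: '<prop> some <filler>'."""
    prop: str
    filler: "Expr"


@dataclass(frozen=True)
class And:
    operands: Tuple["Expr", ...]


@dataclass(frozen=True)
class Or:
    operands: Tuple["Expr", ...]


Expr = Union[Named, Some, And, Or]


def _nested(e: Expr) -> str:
    # Named classes are printed bare; anything compound gets parentheses.
    return render(e) if isinstance(e, Named) else f"({render(e)})"


def render(e: Expr) -> str:
    """Print the expression in the same style as the annotation in the post."""
    if isinstance(e, Named):
        return e.label
    if isinstance(e, Some):
        return f"{e.prop} some {_nested(e.filler)}"
    joiner = " and " if isinstance(e, And) else " or "
    return joiner.join(_nested(op) for op in e.operands)


def named_classes(e: Expr) -> Iterator[str]:
    """Walk the expression and yield every named class it mentions."""
    if isinstance(e, Named):
        yield e.label
    elif isinstance(e, Some):
        yield from named_classes(e.filler)
    else:
        for op in e.operands:
            yield from named_classes(op)


def length_of(part: str) -> Expr:
    """'length and (inheres in some <part>)'."""
    return And((Named("length"), Some("inheres in", Named(part))))


# The OOL >= LOD character state, rebuilt piece by piece.
ocellus_state = Some("has_part", And((
    Named("ocular ocellar line"),
    Some("is bearer of", And((
        Named("length"),
        Or((
            Some("increased_in_magnitude_relative_to", length_of("lateral ocellus")),
            Some("similar_in_magnitude_relative_to", length_of("lateral ocellus")),
        )),
    ))),
)))

# Hypothetical specimens; only the first carries the ocellus annotation.
annotations = {
    "specimen_A": ocellus_state,
    "specimen_B": Some("has_part", Named("median ocellus")),
}

# A very basic "query": which specimens have an annotation that mentions
# the lateral ocellus at all?
hits = [sid for sid, expr in annotations.items()
        if "lateral ocellus" in set(named_classes(expr))]

print(render(ocellus_state))
print(hits)
```

However the prose version of the character happens to be worded, the structured version can be walked, compared, and filtered mechanically. A real implementation would lean on an OWL reasoner and ontology identifiers rather than string labels, so treat this purely as an illustration of the idea.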
