
BioNLP-Corpora

 

The GeneHomonym Dataset

DOWNLOAD GENEHOMONYM

Daniel J McGoldrick, Ph.D.
UCHSC Health Sciences Center, Dept. of Pharmacology, Center for Computational Pharmacology.

Gene symbol homonyms are gene symbols that refer to more than one gene.  In data integration (DI), natural language processing (NLP), and semantic mapping (SM), logical artifacts can arise in computations when the string value of a gene symbol matches an alias of another gene entity.  This homonym effect is amplified under transitive closure: when ambiguous gene symbols are used to propagate further downstream inferences, the final clique of linked gene identifiers or ontology terms ends up with mixed meaning.  This is exactly what we observed when validating computed cross-references for gene identifiers and ontology terms in practical cases where symbol equality is used to connect gene entities.  Imagine that one research group uses the gene symbol “AAT1” to indicate the gene “alanine amino transferase” while another uses the same symbol to indicate the gene “Aortic aneurysm, familial thoracic 1”.  This is a problem because automated mapping to semantic terms, such as those of the GOA ontology, based on the string “AAT1” would lead to very different GO functions for these two genes, and likely to false conclusions.  Homonyms are a machine logic issue that must be addressed in NLP, DI and SM.
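The amplification under transitive closure can be illustrated with a small sketch. It is not the procedure used to build this dataset; the identifiers (`gene:...`, `GO:placeholder_...`) and the function name `linked_cliques` are made up for illustration, and a simple union-find stands in for whatever closure computation a real pipeline would use.

```python
from collections import defaultdict

def linked_cliques(edges):
    """Group identifiers into cliques by transitive closure over 'same-as' links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    cliques = defaultdict(set)
    for node in list(parent):
        cliques[find(node)].add(node)
    return list(cliques.values())

# Hypothetical cross-references: two distinct genes linked only by symbol equality.
edges = [
    ("gene:alanine_amino_transferase", "symbol:AAT1"),
    ("gene:aortic_aneurysm_familial_thoracic_1", "symbol:AAT1"),
    ("gene:alanine_amino_transferase", "GO:placeholder_1"),
    ("gene:aortic_aneurysm_familial_thoracic_1", "GO:placeholder_2"),
]

cliques = linked_cliques(edges)
for clique in cliques:
    print(sorted(clique))
```

Because both genes link to the string "AAT1", the closure puts the two genes and both placeholder ontology terms into a single clique, which is the mixed meaning described above.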

We distinguish two kinds of homonyms: asserted and proveable.

Asserted:
Some values (such as the gene symbol AAT1) are defined by a public database but have different asserted meanings even though they are spelled the same. The asserted kind of data field can vary: in this example, one record lists the value as an "AliasSymbol" while another lists it as an "OfficialSymbol".
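As a rough sketch of how asserted homonyms could be pulled from such records, assume each record is a (gene identifier, field type, symbol) triple. The record layout, the gene identifiers, and the symbols other than AAT1 are hypothetical and do not reflect the format of the released reports.

```python
from collections import defaultdict

def asserted_homonyms(records):
    """Return symbols asserted for more than one gene, with the field type of each use."""
    uses_by_symbol = defaultdict(set)
    for gene_id, field, symbol in records:
        uses_by_symbol[symbol].add((gene_id, field))
    return {
        symbol: sorted(uses)
        for symbol, uses in uses_by_symbol.items()
        if len({gene_id for gene_id, _ in uses}) > 1  # more than one distinct gene
    }

# Hypothetical records; only the symbol AAT1 comes from the text above.
records = [
    ("gene:1", "OfficialSymbol", "AAT1"),
    ("gene:2", "AliasSymbol", "AAT1"),
    ("gene:2", "OfficialSymbol", "SYM2"),
    ("gene:1", "AliasSymbol", "SYM1B"),
]

homonyms = asserted_homonyms(records)
print(homonyms)
```

Matching here is exact string equality, as in the text; whether symbols should also be compared case-insensitively or per species is a design choice this sketch leaves open.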

Proveable:
In DI, for example, a value (e.g. AMD) can be proven to eventually cross-reference different gene entities by transitive closure. Here we have to prove inconsistency in the implied cross-references using a graph traversal involving the value. For example, an AliasSymbol (such as MCAD) can be shown to reference multiple "gene entities" because the closure set of its cross-references is not consistent. Perhaps multiple chromosomal locations are implied, or multiple cross-referenced Entrez Gene identifiers are clearly not the same gene, or the sequences of the cross-referenced RefSeq identifiers are not all homologous. Here we arrive at a homonym (e.g. "MCAD") not by a conflicting type assertion (AliasSymbol) but by a logical proof showing inconsistent entailments: Acadm and CDH15 share the same alias. In some cases, counting the other identifiers that are cross-referenced, such as Entrez Gene identifiers, will prove that a symbol is a homonym (say there are two). In other cases the different gene identifiers turn out to be consistent with a single gene, and either no one has noticed yet or one of them is deprecated. Proving a gene homonym is therefore harder, especially when there are literally hundreds of thousands of proofs to review and current reasoner technologies take about a second per proof on large datasets. One approach is to define rules (orthologous, same chromosome, overlapping genomic span, same strand, same species, ...) to identify the inconsistencies.
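The rule-based approach might look roughly like the following sketch, which checks every pair of gene records in a symbol's closure set against a few of the rules listed above (same species, same chromosome, same strand, overlapping span). The `GeneRecord` layout, the entity names, and all coordinates are invented for illustration; the orthology rule is left out because it needs external data.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class GeneRecord:
    gene_id: str
    species: str
    chromosome: str
    strand: str
    start: int
    end: int

def violated_rules(a, b):
    """List the consistency rules that two records break if they were the same gene."""
    if a.species != b.species:
        return ["species"]  # positional rules are not comparable across species
    violations = []
    if a.chromosome != b.chromosome:
        violations.append("chromosome")
    if a.strand != b.strand:
        violations.append("strand")
    if a.chromosome == b.chromosome and (a.end < b.start or b.end < a.start):
        violations.append("span")  # same chromosome but no genomic overlap
    return violations

def conflicts(closure_set):
    """Map each inconsistent pair of records to the rules it violates."""
    found = {}
    for a, b in combinations(closure_set, 2):
        violations = violated_rules(a, b)
        if violations:
            found[(a.gene_id, b.gene_id)] = violations
    return found

# Hypothetical closure set for one alias; all coordinates are made up.
closure_set = [
    GeneRecord("entity:A", "human", "1", "+", 100, 200),
    GeneRecord("entity:A_dup", "human", "1", "+", 150, 250),
    GeneRecord("entity:B", "human", "16", "-", 500, 900),
]

result = conflicts(closure_set)
print(result)
```

In this made-up case, `entity:A` and `entity:A_dup` pass every rule and so look like one gene seen under two identifiers, while `entity:B` conflicts with both, which would flag the alias as a proveable homonym.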

Asserted homonyms are released as a raw list and as database reports:
EntrezGeneHomonymyTable_yyyymmdd.rpt.gz
OfficialSymbolHomonymyTable_yyyymmdd.rpt.gz

Proveable Homonyms are soon to be released...


Maintained by Helen L. Johnson.
This file last modified Monday, 04-May-2009 18:10:33 UTC
