PICorpus: The Protein Interaction Corpus


The Protein Design Group's protein-protein interaction corpus was originally created at the PDG in an idiosyncratic format. We refactored the corpus by converting the data into two established formats: WordFreak and Genia-style embedded XML. The refactored corpus (PICorpus) can be used for a variety of biomedical language processing (BLP) tasks, including testing entity extraction, relation identification, and relation extraction systems.
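As a rough illustration of what working with embedded-XML annotation can look like, the sketch below reads entity annotations from a single sentence using Python's standard library. The sample sentence and the tag and attribute names (sentence, cons, sem) are assumptions modeled loosely on the GENIA corpus; they are not taken from the PICorpus files, so check the actual distribution before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in a GENIA-like style. The tag and attribute names
# (sentence, cons, sem) are assumptions, not the documented PICorpus schema.
SAMPLE = (
    '<sentence>'
    '<cons sem="protein">RAD51</cons> interacts with '
    '<cons sem="protein">BRCA2</cons> in vivo.'
    '</sentence>'
)


def extract_entities(xml_text, tag="cons", attr="sem"):
    """Return (surface text, semantic class) pairs for each embedded annotation."""
    root = ET.fromstring(xml_text)
    return [("".join(el.itertext()), el.get(attr)) for el in root.iter(tag)]


def plain_text(xml_text):
    """Return the sentence text with all annotation tags stripped."""
    return "".join(ET.fromstring(xml_text).itertext())


entities = extract_entities(SAMPLE)
text = plain_text(SAMPLE)
```

Because the annotations are embedded in the text, the same parse yields both the annotated entities and the original sentence, which is convenient when scoring an entity extraction system against the corpus.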

We are interested in your feedback about this corpus. Please direct all bug reports and comments about the contents of the corpus to the BioNLP-Corpora Bug Tracker. Be sure to choose "PICorpus" from the dropdown options in the "Category" field.

If you are interested in helping with this effort, please send a message to the PICorpus help/discussion forum. Be sure to include "PICorpus" in the subject line.

If you use the PICorpus, please cite this paper:
Johnson, H.L.; W.A. Baumgartner, Jr.; M. Krallinger; K.B. Cohen; L. Hunter. (2007) Corpus Refactoring: a Feasibility Study. Journal of Biomedical Discovery and Collaboration. [pdf will appear soon] <bibtex>
A short paper was also published earlier:
Johnson, H.L.; W.A. Baumgartner, Jr.; M. Krallinger; K.B. Cohen; L. Hunter (2006) Refactoring Corpora. Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology. pp 116-117 <pdf> <bibtex>

