Quick Facts
  • 67 full-text articles
  • >560,000 tokens
  • >21,000 sentences
  • ~100,000 concept annotations to 7 different biomedical ontologies/terminologies
    • Chemical Entities of Biological Interest
    • Cell Ontology
    • Entrez Gene
    • Gene Ontology (biological process, cellular component, and molecular function)
    • NCBI Taxonomy
    • Protein Ontology
    • Sequence Ontology
  • Penn Treebank markup for each sentence
  • Multiple output formats available
  • Integrated with UIMA

The Colorado Richly Annotated Full Text Corpus (CRAFT) is a manually annotated corpus consisting of 67 full-text biomedical journal articles. Each article is a member of the PubMed Central Open Access Subset.

Annotation guidelines used during the construction of CRAFT:


The CRAFT annotations are licensed under the Creative Commons Attribution 3.0 license (CC BY).

  • October 19th, 2012 -- Version 1.0 of the CRAFT corpus has been released.
    • Version 1.0 contains updated versions of the Gene Ontology Biological Process and Molecular Function annotations and minor modifications to other annotations.
  • May 27th, 2012 -- Version 0.9 of the CRAFT corpus has been released.
    • Version 0.9 contains the complete CRAFT corpus, with one exception: the Gene Ontology Biological Process and Molecular Function annotations are undergoing a quality assurance (QA) review. Some of the GO BP/MF annotations included in the v0.9 release will likely change as a result of the QA review. When the review is complete, CRAFT v1.0 will be released.

To reference the CRAFT corpus, please cite one or both of:
  • Bada, M., Eckert, M., Evans, D., Garcia, K., Shipley, K., Sitnikov, D., Baumgartner Jr., W. A., Cohen, K. B., Verspoor, K., Blake, J. A., and Hunter, L. E. Concept Annotation in the CRAFT Corpus. BMC Bioinformatics. 2012 Jul 9;13:161. doi: 10.1186/1471-2105-13-161. [PubMed:22776079]

  • Verspoor, K.*, Cohen, K.B.*, Lanfranchi, A., Warner, C., Johnson, H.L., Roeder, C., Choi, J.D., Funk, C., Malenkiy, Y., Eckert, M., Xue, N., Baumgartner Jr., W.A., Bada, M., Palmer, M., Hunter, L.E. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics. 2012 Aug 17;13(1):207. [PubMed:22901054]


Accompanying the 1.0 release of CRAFT is a software module that integrates CRAFT with the Unstructured Information Management Architecture (UIMA). The software module is a Maven project. It includes a Collection Reader for the CRAFT corpus as well as the annotations themselves (in the form of UIMA XMI). The CRAFT annotations are made available in two UIMA type systems: CCP and U-Compare.
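
For readers who want to inspect the XMI annotation files without setting up a UIMA pipeline, the following is a minimal sketch in Python using only the standard library. It relies on general UIMA XMI conventions (a cas:Sofa element carrying the document text in its sofaString attribute, and annotation elements carrying begin and end offsets) rather than on anything specific to the craft-code module. The function name read_spans, the sample document, and the ex:Mention type are hypothetical and are not taken from the CCP or U-Compare type systems.

```python
import xml.etree.ElementTree as ET

# Namespace URI that standard UIMA XMI serialization uses for the cas: prefix.
CAS_NS = "http:///uima/cas.ecore"

# Hypothetical one-sentence document with a single made-up annotation type.
SAMPLE = """<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:cas="http:///uima/cas.ecore"
         xmlns:ex="http:///example.ecore" xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text"
            sofaString="Mice lacking Brca1 die early."/>
  <ex:Mention xmi:id="2" sofa="1" begin="13" end="18"/>
</xmi:XMI>"""


def read_spans(xmi_text):
    """Return (document_text, [(type_name, begin, end, covered_text), ...])."""
    root = ET.fromstring(xmi_text)

    # The document text lives on the cas:Sofa element.
    sofa = root.find("{%s}Sofa" % CAS_NS)
    if sofa is None:
        raise ValueError("no cas:Sofa element found")
    text = sofa.get("sofaString", "")

    spans = []
    for elem in root:
        begin, end = elem.get("begin"), elem.get("end")
        if begin is None or end is None:
            continue  # not a span annotation (e.g. the Sofa itself)
        b, e = int(begin), int(end)
        # Drop the "{namespace}" prefix to keep only the local type name.
        type_name = elem.tag.rsplit("}", 1)[-1]
        # Caveat: UIMA offsets count UTF-16 code units, while Python slices
        # by code point; the two differ for characters outside the BMP.
        spans.append((type_name, b, e, text[b:e]))
    return text, spans


if __name__ == "__main__":
    document_text, annotations = read_spans(SAMPLE)
    for annotation in annotations:
        print(annotation)
```

For a file on disk, the same function can be applied to the file's contents after reading it as UTF-8 text. Within a UIMA pipeline, the Collection Reader shipped with the module is the more natural entry point; this sketch is only meant for quick inspection.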

LICENSE: The craft-code software module has been released under the BSD New (3-Clause) license.


SOURCE CODE: Download craft-code-1.0


MAVEN COORDINATES (CCP type system):

<!-- the craft collection reader using the ccp type system -->


MAVEN COORDINATES (U-Compare type system):

<!-- the craft collection reader using the u-compare type system -->



To receive up-to-date information about the CRAFT corpus and future releases, please sign up for the BioNLP-Corpora-CRAFT mailing list.