Quick Facts
  • 67 full-text articles
  • >560,000 tokens
  • >21,000 sentences
  • ~100,000 concept annotations to 7 different biomedical ontologies/terminologies
    • Chemical Entities of Biological Interest
    • Cell Ontology
    • Entrez Gene
    • Gene Ontology (biological process, cellular component, and molecular function)
    • NCBI Taxonomy
    • Protein Ontology
    • Sequence Ontology
  • Coreference annotations
  • Penn Treebank markup for each sentence
  • Multiple output formats available
  • Integrated with UIMA

The Colorado Richly Annotated Full Text Corpus (CRAFT) is a manually annotated corpus consisting of 67 full-text biomedical journal articles. Each article is a member of the PubMed Central Open Access Subset.

Annotation guidelines used during the construction of CRAFT:


The CRAFT annotations are licensed under the Creative Commons Attribution 3.0 license (CC BY).

  • November 22nd, 2016 -- Version 2.0 of the CRAFT corpus (addition of manually curated coreference annotations) has been released. Click on the above link to download or visit: 

  • October 19th, 2012 -- Version 1.0 of the CRAFT corpus has been released. Click on the above link to download or visit: 
    • Version 1.0 contains updated versions of the Gene Ontology Biological Process and Molecular Function annotations and minor modifications to other annotations.

  • May 27th, 2012 -- Version 0.9 of the CRAFT corpus has been released.
    • Version 0.9 contains the complete CRAFT corpus, with one exception: the Gene Ontology Biological Process and Molecular Function (GO BP/MF) annotations are undergoing a quality assurance (QA) review. Some of the GO BP/MF annotations included in the v0.9 release will likely change as a result of that review. When the review is complete, CRAFT v1.0 will be released.

To reference the CRAFT corpus, please cite one of:
  • Bada, M., Eckert, M., Evans, D., Garcia, K., Shipley, K., Sitnikov, D., Baumgartner Jr., W. A., Cohen, K. B., Verspoor, K., Blake, J. A., and Hunter, L. E. Concept Annotation in the CRAFT Corpus. BMC Bioinformatics. 2012 Jul 9;13:161. doi: 10.1186/1471-2105-13-161. [PubMed:22776079]

  • Verspoor, K.*, Cohen, K.B.*, Lanfranchi, A., Warner, C., Johnson, H.L., Roeder, C., Choi, J.D., Funk, C., Malenkiy, Y., Eckert, M., Xue, N., Baumgartner Jr., W.A., Bada, M., Palmer, M., Hunter, L.E. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools. BMC Bioinformatics. 2012 Aug 17;13(1):207. [PubMed:22901054]

  • K. Bretonnel Cohen, Arrick Lanfranchi, Miji Joo-young Choi, Michael Bada, William A. Baumgartner Jr., Natalya Panteleyeva, Karin Verspoor, Martha Palmer, Lawrence E. Hunter. Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles. BMC Bioinformatics. 2016.


Accompanying the release of CRAFT is a software module that integrates CRAFT with the Unstructured Information Management Architecture (UIMA). The software module is a Maven project. It includes a Collection Reader for the CRAFT corpus as well as the annotations themselves (in the form of UIMA XMI).
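
For readers who want to peek at XMI-serialized annotations without setting up the Java/UIMA toolchain, the following is a minimal sketch using only the Python standard library. It is not part of craft-code. The sample document and the "ex:Mention" type name are invented for illustration (the real type names depend on the type system used by the module), and the function name spans_by_type is hypothetical. It assumes the usual UIMA XMI layout: a Sofa element carrying the document text in a sofaString attribute, and annotation elements carrying sofa, begin, and end attributes.

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://www.omg.org/XMI"

# Hand-written sample for illustration only; "ex:Mention" is an invented
# type, not one taken from the CRAFT distribution.
SAMPLE_XMI = """
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:cas="http:///uima/cas.ecore"
         xmlns:ex="http:///example/types.ecore"
         xmi:version="2.0">
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
            mimeType="text" sofaString="Mice lacking Abcg5 were studied."/>
  <ex:Mention xmi:id="10" sofa="1" begin="0" end="4"/>
  <ex:Mention xmi:id="11" sofa="1" begin="13" end="18"/>
</xmi:XMI>
""".strip()


def spans_by_type(xmi_text):
    """Group the covered text of each annotation by its element's local name."""
    root = ET.fromstring(xmi_text)

    # Map each Sofa's xmi:id to the document text it holds.
    sofa_text = {}
    for el in root:
        if el.tag.endswith("}Sofa"):
            sofa_text[el.get(f"{{{XMI_NS}}}id")] = el.get("sofaString", "")

    # Treat any element with begin/end offsets into a known Sofa as an annotation.
    result = {}
    for el in root:
        begin, end, sofa = el.get("begin"), el.get("end"), el.get("sofa")
        if begin is None or end is None or sofa not in sofa_text:
            continue
        local_name = el.tag.split("}", 1)[-1]
        # Assumption: offsets index the text the same way Python strings do,
        # which holds for text without characters outside the Basic
        # Multilingual Plane.
        covered = sofa_text[sofa][int(begin):int(end)]
        result.setdefault(local_name, []).append(covered)
    return result


if __name__ == "__main__":
    for type_name, texts in spans_by_type(SAMPLE_XMI).items():
        print(type_name, texts)
```

For anything beyond quick inspection, the Collection Reader in the Maven module is the intended route, since it loads the annotations with their full type information.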

LICENSE: The craft-code software module has been released under the BSD New (3-Clause) license.


SOURCE CODE: Download craft-code-2.0





To receive up-to-date information about the CRAFT corpus and future releases, please sign up for the BioNLP-Corpora-CRAFT mailing list.