resources

As part of our research activities in the OntoGene project we have compiled a list of terminological resources which we use in our Text Mining applications. We make them available on this page in the hope that they can be helpful for various research activities.

Important notice: the resources on this page are derived from public databases. Most of these databases have restrictive user licences, which typically allow research use but not commercial use. If you are unsure about the legal status of any of the resources below, please consult the originating site.


| DATA | SOURCE | DATE |
|---|---|---|
| UniProtKB | http://www.uniprot.org/downloads | available: 2010-11-30 |
| EntrezGene | ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz | available: 2010-11-30 |
| NCBI (species) | http://www.ncbi.nlm.nih.gov/taxonomy | available: 2010-12-23 |
| CLKB (cell lines) | http://stateslab.org/data/CellLineOntology/cellline.owl | 2010-12-13 (not changed since early 2009) |
| PSI-MI (exp. methods) | http://psidev.sourceforge.net/mi/psi-mi.obo | available: 2010-11-11 |
| Affymetrix Identifiers | http://www.affymetrix.com/support/technical/annotationfilesmain.affx | 2010-06-24 (available: 2010-08-11) |
| Enzymes | http://ca.expasy.org/enzyme/ | 2010-01-29 (available: 2010-12-01) |
| Small molecules | http://www.ebi.ac.uk/chebi/ | 2010-12-06 (latest) |
| Diseases | http://www.nlm.nih.gov/research/umls/licensedcontent/downloads.html | 2010-01-29 (available: 2010-11-01) |
| Symptoms | http://www.nlm.nih.gov/research/umls/licensedcontent/downloads.html | 2010-01-29 (available: 2010-11-01) |
| Drugs | http://www.drugbank.ca/public/downloads/current/drugcards.zip | 2009-04-16 (latest) |


The list of terms can be downloaded as a UTF-8 encoded text file in a simple three-column format, with the columns separated by the TAB character. The first column contains the IDs, the second column the terms, and the third column the ID types. While each entry (row) in the list is unique, the same term, ID, or ID type can occur multiple times: a term can have multiple IDs (ambiguity), and an ID can be referenced by multiple terms (synonymy). With few exceptions, each ID has exactly one type, whereas each type applies to one or more IDs. ID types can thus be understood as a very coarse-grained form of identifying the terms.
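The three-column layout described above can be read with a few lines of Python. The following is only a minimal sketch: the function name `load_terms` and the sample rows are hypothetical and chosen for illustration, and the sketch assumes the file has no header row and no quoted fields.

```python
import csv
import io
from collections import defaultdict


def load_terms(stream):
    """Read (ID, term, ID type) rows from a tab-separated text stream."""
    term_to_ids = defaultdict(set)   # term -> IDs (ambiguity)
    id_to_terms = defaultdict(set)   # ID -> terms (synonymy)
    id_to_types = defaultdict(set)   # ID -> ID types
    # QUOTE_NONE: treat quote characters inside terms as ordinary text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        entry_id, term, id_type = row
        term_to_ids[term].add(entry_id)
        id_to_terms[entry_id].add(term)
        id_to_types[entry_id].add(id_type)
    return term_to_ids, id_to_terms, id_to_types


# Made-up sample rows, for illustration only (not taken from the real list).
SAMPLE = "ID1\tterm a\tTYPE_X\nID1\tterm b\tTYPE_X\nID2\tterm a\tTYPE_Y\n"

term_to_ids, id_to_terms, id_to_types = load_terms(io.StringIO(SAMPLE))
print(sorted(term_to_ids["term a"]))  # IDs that share the ambiguous term
print(sorted(id_to_terms["ID1"]))     # synonyms grouped under one ID
```

To read the downloaded file instead of the sample string, pass a file object opened with `encoding="utf-8"` and `newline=""` to the same function.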

There are a few cases where an ID can have multiple types. This is particularly true for species: for example, the ID HUMAN can come from both NCBI and CLKB. In this particular case, the reason is that we use cell lines as a proxy for species, i.e. we are interested in the species that the cell line is derived from, since our main aim is to use the species for disambiguation purposes. The statistics generator should take this into account, e.g. by qualifying the IDs: NCBI:HUMAN and CLKB:HUMAN.
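One possible way to qualify IDs when computing statistics is to prefix each ID with its type, so that NCBI:HUMAN and CLKB:HUMAN are counted separately. The sketch below is an assumption about how this could be done, not the actual statistics generator; the helper names and the terms in the sample rows are made up, and it assumes the type column holds labels such as NCBI and CLKB.

```python
from collections import Counter


def qualify(entry_id, id_type):
    """Build a type-qualified ID such as 'NCBI:HUMAN'."""
    return f"{id_type}:{entry_id}"


def count_terms_per_qualified_id(rows):
    """Count rows (i.e. terms, since each row is unique) per qualified ID."""
    counts = Counter()
    for entry_id, term, id_type in rows:
        counts[qualify(entry_id, id_type)] += 1
    return counts


# Illustrative rows in (ID, term, ID type) order; the terms are made up.
ROWS = [
    ("HUMAN", "human", "NCBI"),
    ("HUMAN", "Homo sapiens", "NCBI"),
    ("HUMAN", "example cell line", "CLKB"),
]

counts = count_terms_per_qualified_id(ROWS)
for qualified_id, n in sorted(counts.items()):
    print(qualified_id, n)
```

Without the qualification, all three rows would be attributed to a single ID HUMAN, which would merge the species entry with the cell-line entries.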

For more information about the data, see the following papers:

  • Kaarel Kaljurand, Fabio Rinaldi, Thomas Kappeler, Gerold Schneider. Using existing biomedical resources to detect and ground terms in biomedical literature. Artificial Intelligence in Medicine, Verona, July 2009.
  • Fabio Rinaldi, Kaarel Kaljurand, Rune Saetre. Terminological resources for Text Mining over Biomedical Scientific Literature. Journal of Artificial Intelligence in Medicine (to appear).

Automatically computed statistics about these resources are available here.


THE DATA ARE PROVIDED "AS IS" AND WE MAKE NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT WITHOUT LIMITATION, WE MAKE NO REPRESENTATIONS OR WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE DATA WILL MEET YOUR REQUIREMENTS OR THAT THE USE OF THE DATA OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY'S PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS. FURTHERMORE, WE DO NOT WARRANT OR MAKE ANY REPRESENTATIONS REGARDING THE USE OF THE RESULTS OF THE USE OF THE DATA IN TERMS OF CORRECTNESS, ACCURACY, RELIABILITY, OR OTHERWISE OR THAT DEFECTS IN THE DATA WILL BE CORRECTED. WE WILL NOT BE LIABLE FOR ANY CONSEQUENTIAL, INCIDENTAL, OR SPECIAL DAMAGES, OR ANY OTHER RELIEF, OR FOR ANY CLAIM BY ANY THIRD PARTY, ARISING FROM THE USE OF THE DATA.

For any problem, comment, or suggestion, please contact Fabio Rinaldi (fabio AT ontogene.org).
