As part of our research activities in the OntoGene project we have compiled a list of terminological resources which we use in our Text Mining applications. We make them available on this page in the hope that they can be helpful for various research activities.

Important notice: the resources on this page are derived from public databases. Most of these databases have restrictive user licences, which typically allow research usage, but not commercial usage. If you are unsure about the legal status of any of the resources below, please consult the originating site.

The list of terms can be downloaded as a text file encoded in UTF-8, with a simple 3-column format where the columns are separated by the TAB-character.
The first column contains the IDs, the second column the terms, and the third column the ID types.
While each entry in the list is unique, the same term, ID, or ID type can occur multiple times:
a term can have multiple IDs (ambiguity), and an ID can be referenced by multiple terms (synonymy).
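The format described above can be read with a few lines of Python. The following is only a minimal sketch: the function names and the sample rows (`ID1`, `foo`, `TYPE_A`, and so on) are made up for illustration and are not taken from the actual resource. It parses the three TAB-separated columns and builds two lookup tables, one to find ambiguous terms and one to find IDs with synonyms.

```python
import csv
import io
from collections import defaultdict


def load_terms(lines):
    """Parse TAB-separated (ID, term, ID type) rows into a list of tuples."""
    # QUOTE_NONE: treat quote characters inside terms as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Rows that do not have exactly three columns are skipped.
    return [tuple(row) for row in reader if len(row) == 3]


def build_indices(entries):
    """Return (term -> set of IDs, ID -> set of terms)."""
    ids_by_term = defaultdict(set)
    terms_by_id = defaultdict(set)
    for entry_id, term, _id_type in entries:
        ids_by_term[term].add(entry_id)
        terms_by_id[entry_id].add(term)
    return ids_by_term, terms_by_id


# Hypothetical sample data, NOT real entries from the list.
SAMPLE = "ID1\tfoo\tTYPE_A\nID2\tfoo\tTYPE_B\nID1\tbar\tTYPE_A\n"

entries = load_terms(io.StringIO(SAMPLE))
ids_by_term, terms_by_id = build_indices(entries)

# A term with more than one ID is ambiguous.
ambiguous = {t for t, ids in ids_by_term.items() if len(ids) > 1}
# An ID referenced by more than one term has synonyms.
synonymous = {i for i, terms in terms_by_id.items() if len(terms) > 1}

print(ambiguous)
print(synonymous)
```

To read the downloaded file instead of the sample, one would pass a file object opened with `encoding="utf-8"` and `newline=""` to `load_terms`; the local file name is up to the user.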
Each ID normally has exactly one type, while each type applies to one or more IDs.
ID types can therefore be understood as a very coarse-grained way of identifying the terms.
There are, however, a few cases where an ID has multiple types. This is particularly true for species: for example, the ID HUMAN can come from both NCBI and CLKB. In this particular case, the reason is that we use cell lines as a proxy for species. We are interested in the species that the cell line is derived from, as our main aim is to use the species for disambiguation purposes. The statistics generator should take this into account, e.g. by qualifying the IDs as NCBI:HUMAN and CLKB:HUMAN.
For more information about the data see the papers
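The ID qualification suggested above for the statistics can be sketched as follows. This is an illustration under assumptions, not the actual statistics generator: the function names are hypothetical, the sample rows are invented, and it is assumed that the third column literally contains type labels such as `NCBI` and `CLKB`.

```python
from collections import Counter, defaultdict


def qualify(entry_id, id_type):
    """Prefix an ID with its type, e.g. ('HUMAN', 'NCBI') -> 'NCBI:HUMAN'."""
    return f"{id_type}:{entry_id}"


def id_statistics(entries):
    """Count terms per qualified ID and report IDs that carry several types."""
    types_by_id = defaultdict(set)
    terms_per_qualified_id = Counter()
    for entry_id, _term, id_type in entries:
        types_by_id[entry_id].add(id_type)
        terms_per_qualified_id[qualify(entry_id, id_type)] += 1
    multi_typed = {
        i: sorted(types) for i, types in types_by_id.items() if len(types) > 1
    }
    return terms_per_qualified_id, multi_typed


# Illustrative rows only (ID, term, ID type); not copied from the real list.
rows = [
    ("HUMAN", "human", "NCBI"),
    ("HUMAN", "HeLa", "CLKB"),
]

counts, multi_typed = id_statistics(rows)
print(dict(counts))
print(multi_typed)
```

Counting by qualified ID keeps the two sources apart, while `multi_typed` makes the exceptional IDs easy to inspect.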
Automatically computed statistics about these resources are available here.
For any problem, comment, or suggestion please contact Fabio Rinaldi (fabio AT ontogene.org).