Important note: this is just a list of a few selected publications, meant to provide a descriptive overview of the activities of the OntoGene research group. The full list contains almost 100 peer-reviewed publications (including more than 20 journal papers). If needed, contact us for free preprints.
- Marco
Basaldella, Lenz Furrer, Carlo Tasso, Fabio Rinaldi. Entity Recognition
in the Biomedical domain using a hybrid approach. Journal of Biomedical
Semantics (2017), 8:51. doi:10.1186/s13326-017-0157-6 ABSTRACT: This article describes a high-recall, high-precision approach for the extraction of biomedical entities from scientific articles. The approach uses a two-stage pipeline, combining a dictionary-based entity recognizer with a machine-learning classifier. First, the OGER entity recognizer, which has a bias towards high recall, annotates the terms that appear in selected domain ontologies. Subsequently, the Distiller framework uses this information as a feature for a machine learning algorithm to select the relevant entities only. For this step, we compare two different supervised machine-learning algorithms: Conditional Random Fields and Neural Networks. In an in-domain evaluation using the CRAFT corpus, we test the performance of the combined systems when recognizing chemicals, cell types, cellular components, biological processes, molecular functions, organisms, proteins, and biological sequences. Our best system combines dictionary-based candidate generation with Neural-Network-based filtering. It achieves an overall precision of 86% at a recall of 60% on the named entity recognition task, and a precision of 51% at a recall of 49% on the concept recognition task. These results are to our knowledge the best reported so far in this particular task.
- Fabio Rinaldi, Oscar Lithgow, Socorro Gama-Castro, Hilda
Solano, Alejandra López-Fuentes, Luis José Muñiz Rascado, Cecilia
Ishida-Gutiérrez, Carlos-Francisco Méndez-Cruz, Julio Collado-Vides;
Strategies towards digital and semi-automated curation in RegulonDB.
Database (Oxford) 2017; 2017 (1): bax012. doi:10.1093/database/bax012 ABSTRACT: Experimentally generated biological
information needs to be organized and structured in order to become
meaningful knowledge. However, the rate at which new information is
being published makes manual curation increasingly unable to cope.
Devising new curation strategies that leverage upon data mining and text
analysis is, therefore, a promising avenue to help life science
databases to cope with the deluge of novel information. In this article,
we describe the integration of text mining technologies in the curation
pipeline of the RegulonDB database, and discuss how the process can
enhance the productivity of the curators. Specifically, a named
entity recognition approach is used to pre-annotate terms referring to a
set of domain entities which are potentially relevant for the curation
process. The annotated documents are presented to the curator, who,
thanks to a custom-designed interface, can select sentences containing
specific types of entities, thus restricting the amount of text that
needs to be inspected. Additionally, a module capable of computing
semantic similarity between sentences across the entire collection of
articles to be curated is being integrated in the system. We tested the
module using three sets of scientific articles and six domain experts.
All these improvements are gradually enabling us to obtain a high
throughput curation process with the same quality as manual curation.
- Fabio Rinaldi, Simon Clematide, Hernani Marques, Tilia Ellendorff, Martin Romacker, Raul Rodriguez-Esteban. OntoGene web services for biomedical text mining. BMC Bioinformatics 2014, 15(Suppl 14):S6
doi:10.1186/1471-2105-15-S14-S6
- Wanli Liu, Rezarta Islamaj Doğan, Dongseop Kwon, Hernani Marques, Fabio Rinaldi, W. John Wilbur, Donald C. Comeau. BioC implementations in Go, Perl, Python and Ruby. Database 2014: bau059, Oxford Journals. doi:10.1093/database/bau059
ABSTRACT: As part of a communitywide effort for evaluating text mining and
information extraction systems applied to the biomedical
domain, BioC is focused on the goal of
interoperability, currently a major barrier to wide-scale adoption of
text mining tools.
BioC is a simple XML format, specified by DTD, for
exchanging data for biomedical natural language processing. With initial
implementations in C++ and Java, BioC provides
libraries of code for reading and writing BioC text documents and
annotations.
We extend BioC to Perl, Python, Go and Ruby. We
used SWIG to extend the C++ implementation for Perl and one Python
implementation.
A second Python implementation and the Ruby
implementation use native data structures and libraries. BioC is also
implemented
in the Google language Go. BioC modules are
functional in all of these languages, which can facilitate text mining
tasks.
BioC implementations are freely available through
the BioC site: http://bioc.sourceforge.net
- Socorro
Gama-Castro, Fabio Rinaldi, Alejandra Lopez-Fuentes, Yalbi Itzel
Balderas-Martinez, Simon Clematide, Tilia Renate Ellendorff, Alberto
Santos-Zavaleta, Hernani Marques-Madeira, Julio Collado-Vides. Assisted curation of regulatory interactions and growth conditions of OxyR in E. coli K-12. Database 2014: bau049, Oxford Journals. doi:10.1093/database/bau049 ABSTRACT: Given the current explosion of data within original publications
generated in the field of genomics, a recognized bottleneck
is the transfer of such knowledge into
comprehensive databases. We have for years organized knowledge on
transcriptional regulation
reported in the original literature of Escherichia coli K-12 into RegulonDB (http://regulondb.ccg.unam.mx), our database that is currently supported by >5000 papers. Here, we report a first step towards the automatic biocuration
of growth conditions in this corpus. Using the OntoGene text-mining system (http://www.ontogene.org),
we extracted and manually validated regulatory interactions and growth
conditions in a new approach based on filters that
enable the curator to select informative sentences
from preprocessed full papers. Based on a set of 48 papers dealing with
oxidative stress by OxyR, we were able to retrieve
100% of the OxyR regulatory interactions present in RegulonDB, including
the transcription factors and their effect on
target genes. Our strategy was designed to extract, as we did, their
growth
conditions. This result provides a proof of concept
for a more direct and efficient curation process, and enables us to
define
the strategy of the subsequent steps to be
implemented for a semi-automatic curation of original literature dealing
with regulation
of gene expression in bacteria. This project will
enhance the efficiency and quality of the curation of knowledge present
in the literature of gene regulation, and
contribute to a significant increase in the encoding of the regulatory
network of
E. coli.
- Fabio
Rinaldi, Simon Clematide, Yael Garten, Michelle Whirl-Carrillo, Li
Gong, Joan M. Hebert, Katrin Sangkuhl, Caroline F. Thorn, Teri E. Klein,
and Russ B. Altman. Using ODIN for a PharmGKB revalidation experiment. The Journal of Biological Databases and Curation, Oxford Journals, 2012, bas021; doi:10.1093/database/bas021 ABSTRACT: In
this article, we describe an experiment aimed at verifying whether a
text-mining tool capable of extracting meaningful relationships among
domain entities can be successfully integrated into the curation
workflow of a major biological database. We evaluate in particular (i)
the usability of the system's interface, as perceived by users, and (ii)
the correlation of the ranking of interactions, as provided by the
text-mining system, with the choices of the curators.
- Fabio Rinaldi, Gerold Schneider, Simon Clematide. Relation Mining Experiments in the Pharmacogenomics Domain. Journal of Biomedical Informatics (Elsevier), Volume 45, Issue 5, October 2012, pages 851-861. doi:10.1016/j.jbi.2012.04.014 ABSTRACT: The
mutual interactions among genes, diseases, and drugs are at the heart
of biomedical research, and are especially important for the
pharmacological industry. The recent trend towards personalized medicine
makes it increasingly relevant to be able to tailor drugs to specific
genetic makeups. The pharmacogenetics and pharmacogenomics knowledge
base (PharmGKB) aims at capturing relevant information about such
interactions from several sources, including curation of the biomedical
literature. Advanced
text mining tools which can support the process of manual curation are
increasingly necessary in order to cope with the deluge of new published
results. However, effective evaluation of those tools requires the
availability of manually curated data as gold standard.
In
this paper we discuss how the existing PharmGKB database can be used
for such an evaluation task in a way similar to the usage of gold
standard data derived from protein–protein interaction databases in one
of the recent BioCreative shared tasks. Additionally, we present our own
considerations and results on the feasibility and difficulty of such a
task.
- Simon Clematide, Fabio Rinaldi. Ranking relations between diseases, drugs, and genes for a curation task. Journal of Biomedical Semantics (BMC), 3 (Suppl 3): S5, 2012. doi:10.1186/2041-1480-3-S3-S5 ABSTRACT: We
propose a simple and effective method based on logistic regression
(also known as maximum entropy modeling) for an optimized ranking of
relation candidates utilizing curated abstracts. Furthermore, we examine
the effects and difficulties of using widely available metadata (i.e.
MeSH terms and chemical substance index terms) for relation extraction.
Cross-validation experiments result in an improvement of the ranking
quality in terms of AUCiP/R by 39% (PharmGKB) and 116% (CTD) against a
frequency-based baseline of 0.39 (PharmGKB) and 0.21 (CTD). For the
TAP-10 metrics, we achieve an improvement of 53% (PharmGKB) and 134%
(CTD) against the same baseline system (0.21 PharmGKB and 0.15 CTD). Our
experiments with the PharmGKB and the CTD database show a strong
positive effect for the ranking of relation candidates utilizing the
vast amount of curated relations covered by currently available
knowledge databases. The tasks of concept identification and candidate
relation generation profit from the adaptation to previously curated
material. This presents an effective and practical method suitable for
conservative extension and re-validation of biomedical relations from
texts that has been successfully used for curation experiments with the
PharmGKB and CTD database.
- Fabio Rinaldi, Simon Clematide, Simon Hafner, Gerold Schneider, Gintare Grigonyte, Martin Romacker, Therese Vachon. Using the OntoGene pipeline for the triage task of BioCreative 2012. The Journal of Biological Databases and Curation, Database 2013: bas053, Oxford Journals. doi:10.1093/database/bas053 ABSTRACT: In
this article, we describe the architecture of the OntoGene Relation
mining pipeline and its application in the triage task of BioCreative
2012. The aim of the task is to support the triage of abstracts relevant
to the process of curation of the Comparative Toxicogenomics Database.
We use a conventional information retrieval system (Lucene) to provide a
baseline ranking, which we then combine with information provided by
our relation mining system, in order to achieve an optimized ranking.
Our approach additionally delivers domain entities mentioned in each
input document as well as candidate relationships, both ranked according
to a confidence score computed by the system. This information is
presented to the user through an advanced interface aimed at supporting
the process of interactive curation. Thanks, in particular, to the
high-quality entity recognition, the OntoGene system achieved the best
overall results in the task.
- Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Simon Clematide, Thérèse Vachon, Martin Romacker. OntoGene in BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3), pp. 472-480, 2010. http://doi.ieeecomputersociety.org/10.1109/TCBB.2010.50 ABSTRACT: We describe a system
for the detection of mentions of protein-protein interactions in the
biomedical scientific literature. The original system was developed as a
part of the OntoGene project, which focuses on using advanced
computational linguistic techniques for text mining applications in the
biomedical domain. In this paper, we focus in particular on the
participation to the BioCreative II.5 challenge, where the OntoGene
system achieved best-ranked results. Additionally, we
describe a feature-analysis experiment performed after the challenge,
which shows the unexpected result that one single feature alone performs
better than the combination of features used in the challenge.
- Fabio
Rinaldi, Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider, Manfred
Klenner, Simon Clematide, Michael Hess, Jean-Marc von Allmen, Pierre
Parisot, Martin Romacker, Therese Vachon. OntoGene in BioCreative II. Genome Biology, 2008, 9:S13. ABSTRACT: In
this report we describe approaches taken within the scope of the second
BioCreative competition in order to solve two aspects of this problem:
detection of novel protein interactions reported in scientific articles,
and detection of the experimental method that was used to confirm the
interaction. Our approach to the former problem is based on a
high-recall protein annotation step, followed by two strict
disambiguation steps. The remaining proteins are then combined according
to a number of lexico-syntactic filters, which deliver high-precision
results while maintaining reasonable recall. The detection of the
experimental methods is tackled by a pattern matching approach, which
has delivered the best results in the official BioCreative evaluation. Although
the results of BioCreative clearly show that no tool is sufficiently
reliable for fully automated annotations, a few of the proposed
approaches (including our own) already perform at a competitive level.
This makes them interesting either as standalone tools for preliminary
document inspection, or as modules within an environment aimed at
supporting the process of curation of biomedical literature.
We describe and evaluate an environment supporting the extraction of domain-specific relations, such as protein-protein interactions, from a richly-annotated corpus. We use full, deep-linguistic parsing and manually created, versatile patterns, expressing a large set of syntactic alternations, plus semantic ontology information. The experiments show that our approach described is capable of delivering high-precision results, while maintaining sufficient levels of recall. The high level of abstraction of the rules used by the system, which are considerably more powerful and versatile than finite-state approaches, allows speedy interactive development and validation.