publications‎ > ‎

selected publications

Important note: this is just a list of a few selected publications, meant to provide a descriptive overview of the activities of the OntoGene research group. The full list of publications contains more than 70 peer-reviewed publications (including more than 20 journal papers). If needed, contact us for free preprints.

We describe and evaluate an environment supporting the extraction of domain-specific relations, such as protein-protein interactions, from a richly-annotated corpus. We use full, deep-linguistic parsing and manually created, versatile patterns, expressing a large set of syntactic alternations, plus semantic ontology information. The experiments show that our approach described is capable of delivering high-precision results, while maintaining sufficient levels of recall. The high level of abstraction of the rules used by the system, which are considerably more powerful and versatile than finite-state approaches, allows speedy interactive development and validation.
  • Fabio Rinaldi, Thomas Kappeler, Kaarel Kaljurand, Gerold Schneider, Manfred Klenner, Simon Clematide, Michael Hess, Jean-Marc von Allmen, Pierre Parisot, Martin Romacker, Therese Vachon. OntoGene in BioCreative IIGenome Biology, 2008, 9:S13.
In this report we describe approaches taken within the scope of the second BioCreative competition in order to solve two aspects of this problem: detection of novel protein interactions reported in scientific articles, and detection of the experimental method that was used to confirm the interaction. Our approach to the former problem is based on a high-recall protein annotation step, followed by two strict disambiguation steps. The remaining proteins are then combined according to a number of lexico-syntactic filters, which deliver high-precision results while maintaining reasonable recall. The detection of the experimental methods is tackled by a pattern matching approach, which has delivered the best results in the official BioCreative evaluation. Although the results of BioCreative clearly show that no tool is sufficiently reliable for fully automated annotations, a few of the proposed approaches (including our own) already perform at a competitive level. This makes them interesting either as standalone tools for preliminary document inspection, or as modules within an environment aimed at supporting the process of curation of biomedical literature.
  • Fabio Rinaldi, Gerold Schneider, Kaarel Kaljurand, Simon Clematide, Thérèse Vachon, Martin Romacker, OntoGene in BioCreative II.5, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7(3), pp. 472-480, 2010. <>
We describe a system for the detection of mentions of protein-protein interactions in the biomedical scientific literature. The original system was developed as a part of the OntoGene project, which focuses on using advanced computational linguistic techniques for text mining applications in the biomedical domain. In this paper, we focus in particular on the participation to the BioCreative II.5 challenge, where the OntoGene system achieved best-ranked results. Additionally, we describe a feature-analysis experiment performed after the challenge, which shows the unexpected result that one single feature alone performs better than the combination of features used in the challenge.
  • Fabio Rinaldi, Simon Clematide, Yael Garten, Michelle Whirl-Carrillo, Li Gong, Joan M. Hebert, Katrin Sangkuhl, Caroline F. Thorn, Teri E. Klein, and Russ B. Altman. Using ODIN for a PharmGKB revalidation experiment. The Journal of Biological Databases and Curation, Oxford Journals, 2012, bas021; doi:10.1093/database/bas021
In this article, we describe an experiment aimed at verifying whether a text-mining tool capable of extracting meaningful relationships among domain entities can be successfully integrated into the curation workflow of a major biological database. We evaluate in particular (i) the usability of the system's interface, as perceived by users, and (ii) the correlation of the ranking of interactions, as provided by the text-mining system, with the choices of the curators.
  • Fabio Rinaldi, Gerold Schneider, Simon Clematide. Relation Mining Experiments in the Pharmacogenomics Domain. Journal of Biomedical Informatics (Elsevier),  Volume 45, Issue 5, October 2012, pages 851-861, 2012. doi:10.1016/j.jbi.2012.04.014
The mutual interactions among genes, diseases, and drugs are at the heart of biomedical research, and are especially important for the pharmacological industry. The recent trend towards personalized medicine makes it increasingly relevant to be able to tailor drugs to specific genetic makeups. The pharmacogenetics and pharmacogenomics knowledge base (PharmGKB) aims at capturing relevant information about such interactions from several sources, including curation of the biomedical literature. Advanced text mining tools which can support the process of manual curation are increasingly necessary in order to cope with the deluge of new published results. However, effective evaluation of those tools requires the availability of manually curated data as gold standard.

In this paper we discuss how the existing PharmGKB database can be used for such an evaluation task in a way similar to the usage of gold standard data derived from protein–protein interaction databases in one of the recent BioCreative shared tasks. Additionally, we present our own considerations and results on the feasibility and difficulty of such a task.

  • Simon Clematide, Fabio Rinaldi. Ranking relations betweeen diseases, drugs, and genes for a curation task. Journal of Biomedical Semantics (BMC), 3 (Suppl 3): S5, 2012. doi:10.1186/2041-1480-3-S3-S5

We propose a simple and effective method based on logistic regression (also known as maximum entropy modeling) for an optimized ranking of relation candidates utilizing curated abstracts. Furthermore, we examine the effects and difficulties of using widely available metadata (i.e. MeSH terms and chemical substance index terms) for relation extraction. Cross-validation experiments result in an improvement of the ranking quality in terms of AUCiP/R by 39% (PharmGKB) and 116% (CTD) against a frequency-based baseline of 0.39 (PharmGKB) and 0.21 (CTD). For the TAP-10 metrics, we achieve an improvement of 53% (PharmGKB) and 134% (CTD) against the same baseline system (0.21 PharmGKB and 0.15 CTD). Our experiments with the PharmGKB and the CTD database show a strong positive effect for the ranking of relation candidates utilizing the vast amount of curated relations covered by currently available knowledge databases. The tasks of concept identification and candidate relation generation profit from the adaptation to previously curated material. This presents an effective and practical method suitable for conservative extension and re-validation of biomedical relations from texts that has been successfully used for curation experiments with the PharmGKB and CTD database.

  • Fabio Rinaldi, Simon Clematide, Simon Hafner, Gerold Schneider, Gintare Grigonyte, Martin Romacker, Therese Vachon. Using the OntoGene pipeline for the triage task of BioCreative 2012, The Journal of Biological Databases and CurationDatabase 2013: bas053, Oxford Journals. doi:10.1093/database/bas053
In this article, we describe the architecture of the OntoGene Relation mining pipeline and its application in the triage task of BioCreative 2012. The aim of the task is to support the triage of abstracts relevant to the process of curation of the Comparative Toxicogenomics Database. We use a conventional information retrieval system (Lucene) to provide a baseline ranking, which we then combine with information provided by our relation mining system, in order to achieve an optimized ranking. Our approach additionally delivers domain entities mentioned in each input document as well as candidate relationships, both ranked according to a confidence score computed by the system. This information is presented to the user through an advanced interface aimed at supporting the process of interactive curation. Thanks, in particular, to the high-quality entity recognition, the OntoGene system achieved the best overall results in the task.

  • Socorro Gama-Castro, Fabio Rinaldi, Alejandra Lopez-Fuentes, Yalbi Itzel Balderas-Martinez, Simon Clematide, Tilia Renate Ellendorff, Alberto Santos-Zavaleta, Hernani Marques-Madeira, Julio Collado-Vides. Assisted curation of regulatory interactions and growth conditions of OxyR in E. coli K-12. Database 2014: bau049, Oxford Journals. doi:10.1093/database/bau049
Given the current explosion of data within original publications generated in the field of genomics, a recognized bottleneck is the transfer of such knowledge into comprehensive databases. We have for years organized knowledge on transcriptional regulation reported in the original literature of Escherichia coli K-12 into RegulonDB (, our database that is currently supported by >5000 papers. Here, we report a first step towards the automatic biocuration of growth conditions in this corpus. Using the OntoGene text-mining system (, we extracted and manually validated regulatory interactions and growth conditions in a new approach based on filters that enable the curator to select informative sentences from preprocessed full papers. Based on a set of 48 papers dealing with oxidative stress by OxyR, we were able to retrieve 100% of the OxyR regulatory interactions present in RegulonDB, including the transcription factors and their effect on target genes. Our strategy was designed to extract, as we did, their growth conditions. This result provides a proof of concept for a more direct and efficient curation process, and enables us to define the strategy of the subsequent steps to be implemented for a semi-automatic curation of original literature dealing with regulation of gene expression in bacteria. This project will enhance the efficiency and quality of the curation of knowledge present in the literature of gene regulation, and contribute to a significant increase in the encoding of the regulatory network of E. coli.

  • Wanli Liu, Rezarta Islamaj Doğan, Dongseop Kwon, Hernani Marques, Fabio Rinaldi, W. John Wilbur, Donald C. Comeau. BioC implementations in Go, Perl, Python and Ruby. Database 2014:  bau059, Oxford Journals. doi:10.1093/database/bau059
As part of a communitywide effort for evaluating text mining and information extraction systems applied to the biomedical domain, BioC is focused on the goal of interoperability, currently a major barrier to wide-scale adoption of text mining tools. BioC is a simple XML format, specified by DTD, for exchanging data for biomedical natural language processing. With initial implementations in C++ and Java, BioC provides libraries of code for reading and writing BioC text documents and annotations. We extend BioC to Perl, Python, Go and Ruby. We used SWIG to extend the C++ implementation for Perl and one Python implementation. A second Python implementation and the Ruby implementation use native data structures and libraries. BioC is also implemented in the Google language Go. BioC modules are functional in all of these languages, which can facilitate text mining tasks. BioC implementations are freely available through the BioC site:
  • Fabio Rinaldi, Simon Clematide, Hernani Marques, Tilia Ellendorff, Martin Romacker, Raul Rodriguez-Esteban. OntoGene web services for biomedical text mining. BMC Bioinformatics 2014, 15(Suppl 14):S6  doi:10.1186/1471-2105-15-S14-S6

    Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are gaining increased interest.  
    We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community- organized evaluation challenges, with top ranked results in several of them.