The aim of this proposal is to implement a novel way of processing and
accessing the vast detailed knowledge contained within collections of
scientific publications on the regulation of transcription initiation in
bacterial models. In principle, this model for processing and reading
information and new knowledge is applicable to other biological domains,
potentially benefiting any area of biomedical knowledge. It is
certainly critical to generate new strategies to cope with the
ever-increasing amount of knowledge generated in genomics and in
biomedical research at large. Improving the efficiency of the
traditional high-quality manual curation of scientific publications will
also enable us to expand the types of biological knowledge curated, beyond
mechanisms and their elements in the genome, to start including their
connections with larger regulated processes and eventually physiological
properties of the cell. We will first implement the necessary
technology to improve our curation by means of a computational system
that has text mining capabilities for preprocessing the papers before a
human expert curator identifies which sentences contain the information
that is to be added to the database. Premarked options selected by the
curators will accelerate their decisions. The cumulative, precise
mapping between sentences and curated knowledge will provide training
sets for text mining technologies to improve their automatic extraction.
Curation practices will become more efficient, enabling us to curate
selected high-impact published reviews to place mechanisms into a rich
context of their physiological processes and general biology. Another
relevant component of our proposal is the improved modeling of regulated
processes by means of new concepts in biology that capture larger
collections of coregulated genes and their concatenated reactions.
Starting from all interactions of a local regulator, coregulated
regulators and their domain of action will be incorporated to construct
the biobricks of complex decisions, as they are encoded in the genome.
These are conceptual containers that capture the organization of
knowledge to describe the genetic programming of cellular capabilities.
These proposals will be formalized and presented within an international
consortium focused on enriching standard models or ontologies of gene
regulation for use by the scientific community. Finally, we will implement a
portal to navigate across all the sentences of a corpus of more than 5,000
related papers. The different
avenues of navigation will essentially use two technologies: one that
automatically generates simpler sentences from the original sentences,
and another that classifies papers by theme or ontology. Their
combination will enable a novel
navigation reading system. If we achieve our aims, this project will
deliver a proof-of-principle prototype that integrates large amounts of
knowledge at clearly innovative, higher levels. Future directions may
adapt these concepts and methods to the biology of higher organisms,
including humans.

Grant ID: 5R01GM110597-03
Funding Agency: NIH
Title: "High-Throughput Literature Curation of Genetic Regulation in Bacterial Models"
Funding: $406,247 for the first year
Duration: 4 years (1 Jan 2015 - 31 Dec 2018)
PI: Dr. Julio Collado-Vides
Collaborators: Dr. Michael Savageau, UC Davis; Dr. Stephen Busby, Univ. of Birmingham; Dr. Fabio Rinaldi, Univ. of Zurich

One of the goals of this NIH-funded project is to integrate advanced text mining techniques in the curation process of a life science database (RegulonDB). The project will make use of ODIN (OntoGene Document Inspector), a user-friendly interface designed by the OntoGene group for curation tasks, which integrates with the OntoGene text mining pipeline.

Selected Publications
Screenshot of ODIN interface, customized for RegulonDB curation.