RG 2841/2: Beyond the Exome (SP 05)

At a glance

Subproject of

RG 2841: Beyond the Exome – Identifying, Analyzing, and Predicting the Disease Potential of Non-Coding DNA Variants

Project duration

07/2023 – 07/2026

DFG classification of subject areas

Medical Informatics and Medical Bioinformatics

Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing

Computer Science

Funded by

DFG Research Unit

Project description

The study of regulatory DNA elements, features and processes in human cells has a long tradition in biomedical research. Having a comprehensive and high-quality account of the current state of knowledge on human gene regulation is a prerequisite for the design of future experiments and fundamental for the biomedical research carried out in the different projects of the research unit.
The aim of this subproject is to develop a comprehensive repository of regulatory genomic features and their variation in human diseases. Our project is structured into a data integration (DI) part and an information extraction (IE) part. Regarding DI, in the first 24 months of the first phase, we identified suitable data sources, implemented data transformation pipelines, selected RegulationSpotter (Schwarz et al., 2019) as the most suitable target for integrating this information, and implemented a prototype of the integration pipeline and result visualization. Regarding IE, we developed the first annotated text corpus of regulatory information, consisting of 305 abstracts that mention 156 transcription factors and 494 enhancer regions associated with 350 unique diseases and 985 unique genes (Garda et al., 2022). This corpus was used to train text mining algorithms to detect regulatory sequence elements in new texts; the trained models were then applied to the entire PubMed collection, yielding the first large and systematically obtained collections of these elements and their putative associations with genes and diseases. Furthermore, we developed a pipeline based on deep neural networks that performs entity normalization to ease the downstream task of curating the extracted information. Improving and evaluating this pipeline, together with extending the corpus, is ongoing work.
In the second phase, we plan to continue our successful work along the lines of DI and IE. Regarding DI, our focus will be on updating and increasing the number of integrated databases and on automating the integration process. Regarding IE, we plan to shift our focus from entity detection and normalization to relationship extraction, a field in which significant progress has been achieved in recent years. This step requires a re-annotation of the corpus with respect to defined types of relationships between regulatory features and genes, variants, and diseases. We will adapt and tune methods for relationship extraction on this corpus, apply the trained models to disease-specific text collections, and develop an efficient framework for expert curation of the extracted information. As a major innovation, we plan to develop a novel real-time curation technology to improve user satisfaction during curation, which is an urgent requirement for real-life curation processes. We will also develop a user-friendly web interface for easy access to the regulatory features obtained both by text mining and from public data of large research consortia.


Topics

Medical informatics, text mining, genetics

Project head

Person
Prof. Dr. Ulf Leser
- Faculty of Mathematics and Natural Sciences
- Department of Computer Science

Participating institutions

Department of Computer Science
Address
Rudower Chaussee 25, 12489 Berlin
General contact
Tel.: +49 30 2093-41140

Cooperation partners

Cooperation partner
University, Germany
Charité – Berlin University Medicine
