RG 2841/2: Beyond the Exome (SP 05)
Facts
Medical Informatics and Medical Bioinformatics
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Computer Science
DFG Research Unit
Description
The study of regulatory DNA elements, features and processes in human cells has a long tradition in biomedical research. Having a comprehensive and high-quality account of the current state of knowledge on human gene regulation is a prerequisite for the design of future experiments and fundamental for the biomedical research carried out in the different projects of the research unit.
The aim of this subproject is to develop a comprehensive repository of regulatory genomic features and their variation in human diseases. Our project is structured into a data integration (DI) part and an information extraction (IE) part. Regarding DI, in the first 24 months of the first phase, we identified suitable data sources, implemented data transformation pipelines, selected RegulationSpotter (Schwarz et al., 2019) as the most suitable target for integrating this information, and implemented a prototype of the integration pipeline and result visualization. Regarding IE, we developed the first annotated text corpus of regulatory information, consisting of 305 abstracts that mention 156 transcription factors and 494 enhancer regions associated with 350 unique diseases and 985 unique genes (Garda et al., 2022). This corpus was used to train text mining algorithms to detect regulatory sequence elements in new texts. The trained algorithms were then applied to the entire PubMed collection, yielding the first large and systematically obtained collections of these elements and their putative associations with genes and diseases. Furthermore, we developed a pipeline based on deep neural networks for entity normalization, which will ease the downstream task of curating the extracted information. Improving and evaluating this pipeline, together with an extension of the corpus, is ongoing work.
In the second phase, we plan to continue our successful work along the lines of DI and IE. Regarding DI, our focus will be on updating and increasing the number of integrated databases and on automating the integration process. Regarding IE, we plan to shift our focus from entity detection and normalization to relationship extraction, a field in which significant progress has been achieved in recent years. This step requires a re-annotation of the corpus with respect to defined types of relationships of regulatory features to genes, variants, and diseases. We will adapt and tune methods for relationship extraction on this corpus, apply the trained models to disease-specific text collections, and develop an efficient framework for expert curation of the extracted information. As a major innovation, we plan to develop a novel real-time curation technology to improve user satisfaction during curation, which is an urgent requirement for real-life curation processes. We will also develop a user-friendly web interface for easy access to the regulatory features obtained both by text mining and from public data of large research consortia.
Organization entities
Knowledge Management in Bioinformatics
Address
Johann von Neumann-Haus, Institutsgebäude, Rudower Chaussee 25, 12489 Berlin
General contact
Tel.: 030 2093-41280