Data Discovery in the Presence of Temporal Drifts
At a Glance
Computer Science
DAAD
Project Description
Over the past decade, the proliferation of public and enterprise data lakes has fueled intensive research into data discovery: identifying the most relevant data in vast and complex corpora to support diverse user tasks. Finding the handful of datasets relevant to the task at hand in a lake containing hundreds of millions of tables resembles finding a needle in a haystack.
Significant progress has been made through the development of innovative index structures, similarity measures, and querying infrastructures. Additionally, integrated systems such as Delta Lake and Blend have emerged to streamline end-to-end data discovery workflows. Despite these advances, a critical aspect remains overlooked: the temporal validity of discovered data. Relevance is time-varying. Data lakes frequently contain multiple versions of datasets, accumulated over time, where earlier versions may no longer be valid for current analyses. Conversely, to ensure the reproducibility of a downstream analysis, one may require a very specific version of the data from a restricted time window.
Existing discovery methods largely ignore this temporal dimension, especially when explicit date or time metadata is missing. This gap leaves practitioners vulnerable to relying on temporally and semantically drifting data. The mismatch between today's time-agnostic data discovery solutions and real-world, temporally rich data lakes results in ineffective downstream analyses: unreliable machine learning model training, incorrect statistical models, and decisions based on semantically outdated information.
To fill this gap, we propose to develop a formal framework for temporally valid data discovery systems to answer the following research question: Given a specific downstream task—such as reproducibility requiring precise historical data versions, or machine learning model training needing the most recent valid data—how can data discovery systems effectively account for the temporal validity of datasets to ensure that retrieved data is both semantically relevant and temporally appropriate, especially when explicit temporal metadata is missing?
Our research will answer this research question through several key milestones:
- Temporal Lineage Inference: Automatically inferring versioned relationships between datasets, even in the absence of explicit temporal metadata, to uncover how datasets have evolved.
- Change Log Synthesis: Formalizing mechanisms to generate optimal change streams that capture the evolution of datasets across versions.
- Time-Aware Data Discovery Models and Indexes: Leveraging inferred temporal lineage and change streams to design efficient, time-sensitive data discovery methods, enabling precise querying for datasets valid at a specific temporal point or interval.
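The third milestone can be illustrated with a minimal sketch. Assume each dataset version carries a validity interval inferred via temporal lineage (the class names, the `relevance` score, and the `discover` function below are hypothetical placeholders, not part of any existing system): a time-aware discovery query first filters candidates by temporal validity at the query point, then ranks the survivors by semantic relevance.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of time-aware discovery: validity intervals are
# assumed to have been inferred from temporal lineage; `relevance` stands
# in for whatever semantic-similarity score the discovery system computes.

@dataclass
class DatasetVersion:
    dataset_id: str
    version: int
    valid_from: int          # e.g. a timestamp inferred from lineage
    valid_to: Optional[int]  # None means "still current"
    relevance: float         # placeholder semantic-similarity score

def valid_at(v: DatasetVersion, t: int) -> bool:
    """True if version v was valid at query time t."""
    return v.valid_from <= t and (v.valid_to is None or t < v.valid_to)

def discover(candidates: list[DatasetVersion], t: int, k: int = 3) -> list[DatasetVersion]:
    """Top-k semantically relevant versions that were valid at time t."""
    alive = [v for v in candidates if valid_at(v, t)]
    return sorted(alive, key=lambda v: v.relevance, reverse=True)[:k]

# Example: two versions of the same dataset; only one is valid at t=150.
versions = [
    DatasetVersion("sales", 1, valid_from=0, valid_to=100, relevance=0.9),
    DatasetVersion("sales", 2, valid_from=100, valid_to=None, relevance=0.9),
    DatasetVersion("weather", 1, valid_from=0, valid_to=None, relevance=0.4),
]
print([(v.dataset_id, v.version) for v in discover(versions, t=150)])
# → [('sales', 2), ('weather', 1)]
```

Querying for a temporal interval rather than a point would replace `valid_at` with an interval-overlap test; an interval tree or a sorted index over `valid_from` would make the filtering step efficient at data-lake scale.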
Participating Institutions
Institut für Informatik
Address
Johann von Neumann-Haus, Institutsgebäude, Rudower Chaussee 25, 12489 Berlin
Tel.: +49 30 2093-41140