ReDaS Associated Team

Scientific Program

Context and Objective

This project is named ReDaS, which stands for Analysis Techniques and Workflow Methodologies for Reproducible Data Science. It is an INRIA Associated Team accepted in the 2019–2021 call, proposed by the POLARIS INRIA research team and conducted together with our partners from the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil. The main scientific objective of this project is to develop novel analysis techniques and workflow methodologies to support reproducible data science.

We focus our efforts along three axes:

  1. Analysis Techniques: large volumes of data are hard to summarize using simple statistics, which hide important behavior in the data. Raw information visualization therefore plays a key role in exploring such data, in particular when curating data and trying to develop intuition about the mathematical models underlying it. Yet such visualizations require data aggregation, which may lead to significant information loss. It is thus essential to investigate adaptive data aggregation schemes that reduce the data while controlling the information loss.
  2. Workflow Methodologies: the analysis process often involves a mix of tools to produce the end result. The data has to be filtered before being passed to a standard statistical tool, which eventually produces a projection of the transformed data that the analyst can visualize and study. Furthermore, the process is interactive: when the analyst is unsatisfied with the end result, a part of the analysis has to be changed to produce a new visualization. These adaptations typically start from intermediate data, so only a part of the analysis has to be rerun. The difficulty lies in the growing size of these analyses, the heterogeneity of the analysis tools, and the large space of analysis parameters.
  3. Evaluation: the two previous axes will yield both theoretical and practical methodologies, whose relevance should be evaluated on real case studies. We will build our evaluation on well-identified and quite different datasets originating from the following three areas, in which we already have some experience:

    Performance analysis of HPC applications
    These applications and their underlying runtimes tend to be increasingly complex and dynamic. As a consequence, their execution traces become too large and impossible to analyze with classical tools.
    Long-term phenology behavior analysis and correlation with climate change
    Phenology, the study of plant growth, relies here on digital cameras attached to towers installed in the middle of natural environments. These cameras take photos at regular intervals of a set number of minutes and enable researchers to observe how certain species grow, including how their growth relates to the climate.
    General public datasets from government transparency reports
    All public Brazilian institutions are obliged by law to provide datasets about any publicly financed data measurements. The city of Porto Alegre has long-term weather datasets that contain temperature, pressure and other indicators from different parts of the city. The goal of this case study is highly exploratory: for example, to devise a way to represent such data geographically in order to check whether certain parts of the city suffer from flash floods more than others.
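
The trade-off described in the first axis (reducing the data while controlling the information loss) can be illustrated with a minimal sketch. It is not the aggregation scheme developed in the project: it assumes a one-dimensional series, uniform bins, and uses the fraction of variance hidden inside the bins as a stand-in for information loss. The function names (`aggregate`, `information_loss`, `coarsest_width`) and the candidate widths are hypothetical.

```python
from statistics import fmean


def aggregate(values, width):
    """Replace each block of `width` consecutive values by its mean."""
    return [fmean(values[i:i + width]) for i in range(0, len(values), width)]


def information_loss(values, width):
    """Fraction of the total variance hidden inside the bins.

    0.0 means the aggregated series keeps all the variability,
    1.0 means all of it is hidden (assumed proxy, not the project's metric).
    """
    overall_mean = fmean(values)
    total = sum((v - overall_mean) ** 2 for v in values)
    if total == 0:
        return 0.0  # a constant series loses nothing when averaged
    within = 0.0
    for i in range(0, len(values), width):
        block = values[i:i + width]
        block_mean = fmean(block)
        within += sum((v - block_mean) ** 2 for v in block)
    return within / total


def coarsest_width(values, max_loss, widths=(1, 2, 4, 8, 16)):
    """Pick the largest candidate width whose loss stays within `max_loss`."""
    best = 1
    for width in widths:
        if information_loss(values, width) <= max_loss:
            best = max(best, width)
    return best


if __name__ == "__main__":
    series = [0, 0, 0, 0, 10, 10, 10, 10]  # two flat plateaus
    width = coarsest_width(series, max_loss=0.1, widths=(1, 2, 4, 8))
    print(width, aggregate(series, width))
```

With this series made of two flat plateaus, bins of width 4 lose nothing, whereas a single bin of width 8 hides all the variance, so a loss budget of 10% selects width 4.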

Executive Summary

The goals of this project are thus to:

develop interactive, reproducible and scalable analysis workflows

provide uncertainty and quality estimators about the analysis

This will enable the analyst to understand the behaviors hidden in complex datasets collected in large scale dynamic systems, and to proceed with confidence.

List of participants

Name                 Team     Status               Main expertise
Guillaume Huard      INRIA    Associate Professor  Performance evaluation
Jean-Marc Vincent    INRIA    Associate Professor  Statistics & Perf. Evaluation
Arnaud Legrand       INRIA    Researcher           Reproducible Research
Lucas Mello Schnorr  UFRGS    Adjunct Professor    Performance Evaluation
João Comba           UFRGS    Associate Professor  Information Visualization
Alexis Janon         POLARIS  PhD Candidate        HW Counters Co-Design
Tom Cornebize        POLARIS  PhD Candidate        Data Analytics for HPC
Nils Defauw          POLARIS  Master               Information Theory and Aggregation
Flora Gautheron      POLARIS  Master               Data Analytics for Social Sciences
Vinicius G. Pinto    UFRGS    Postdoc              Task-based Performance Analysis
Lucas Nesi           UFRGS    PhD Candidate        Task-based Performance Analysis
Marcelo Miletto      UFRGS    Master student       Sparse Task-based Performance Analysis
Guilherme Alles      UFRGS    Master student       Phenology analysis

Detailed documents and project progress

We provide here detailed documents about the scientific proposal of the project, its ongoing work, and its main achievements:

  • Original detailed proposal that presents the planned research efforts and the expected output
  • First annual report that presents the activities conducted during the year 2019, the results obtained and the planning for the following year