CUSO Winter School in Computer Science

Data Science in Information Society: From Data Acquisition to Data Analysis


January 30 - February 3, 2017
Champéry, Switzerland

 
Program

Participation & Programme

Student participants are asked to bring a poster of their research work, presenting either existing results or ongoing research. Further instructions on how their work will be showcased during the Winter School will be sent to everyone by email closer to the event.

Programme

The programme includes presentations by six speakers:

1. Introduction to Data Science and Big Data

Prof. Philippe Cudre-Mauroux, University of Fribourg

In this talk, I will give a general introduction to the topics of Data Science and Big Data. I will start by giving some motivation, and then give an overview of both Big Data infrastructures and recent advances in Data Science. Finally, I will delve into one topic close to my heart: the third V of Big Data (Variety) and how recent advances combining Big Data, NLP and Machine-Learning technologies can bring us one step closer to being able to integrate heterogeneous pieces of information automatically.

2. Data Management and Analysis as a Service

Prof. Bettina Kemme, McGill University, Canada

As data is generated at an unprecedented rate, and the desire to analyze and mine it grows with it, a wide range of platforms has been developed for its storage, management, transformation, dissemination and analysis. Given the complexity of the tasks at hand, these platforms are not simple press-the-button-and-all-works tools. Instead, users need a good understanding of them in order to deploy the right tools, with the data models and query languages that best suit the applications' needs, and in order to configure these platforms so that they achieve the best possible performance. In this tutorial, we give an overview of well-known data storage and analysis platforms. We present their fundamental building blocks, and the techniques that are used to provide reliable, scalable and performant query execution. Platforms that are likely to be covered are the Hadoop framework (including the HBase data management system and the MapReduce data processing framework), Spark (enhanced distributed analysis), and the publish/subscribe paradigm for data dissemination. The challenges involved in deploying such platforms in the cloud will also be discussed.
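
To make the map-reduce processing model mentioned above concrete, a minimal word-count sketch using the PySpark API might look as follows (a working Spark installation is assumed and the input path is an illustrative placeholder; this is not part of the tutorial material):

    # Minimal PySpark sketch of the map-reduce processing model: split lines
    # into words (map), emit (word, 1) pairs, and sum the counts per word
    # (reduce). The input path is a hypothetical placeholder.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
    lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")

    counts = (
        lines.flatMap(lambda line: line.split())   # map: line -> words
             .map(lambda word: (word, 1))          # map: word -> (word, 1)
             .reduceByKey(lambda a, b: a + b)      # reduce: sum per key
    )

    for word, count in counts.take(10):            # fetch a small sample
        print(word, count)

    spark.stop()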

3. Codes for Cloud and Networked Distributed Storage

Prof. Daniel Enrique Lucani Roetter, Aalborg University, Denmark

This tutorial surveys traditional coding techniques for cloud and networked distributed storage systems as well as cutting-edge research results and recent developments in the topic. Providing robust and repairable storage originally started within a single computer, e.g., with RAID systems designed to withstand the loss of storage disks. The idea of distributing data across multiple storage nodes interconnected over a network, e.g., within one or multiple data centers, is quite natural, as it provides much higher robustness to the loss of individual disks and even of entire computers with multiple disks. Although the simplest technique to provide this reliability is replication (keeping multiple copies of the file), its cost in storage space is large. The key goal of coding techniques in these systems is to achieve fault tolerance efficiently, repairing losses with controlled network and storage use in order to ensure data durability over time. Given the abundance of cloud storage systems and the massive content generated each year, reducing storage costs and exploiting storage beyond controlled data center settings, e.g., by leveraging mobile devices, have become a must. This tutorial has two parts. The first part provides the fundamentals of cloud and networked distributed storage, including key concepts, performance metrics and fundamental trade-offs. It also presents key code constructions and examples of codes specifically designed for repairability, including locally repairable codes and network codes. We cover both static, centralized designs suitable for current cloud storage systems and dynamic, distributed designs at the edge of current research that exploit wireless mobile devices for storage. The latter introduce interesting constraints and higher node unavailability, which make the design of very structured codes nearly impossible; random code designs are presented as a way to cope with these new challenges. The second part of the course focuses on more practical and application-specific designs, in particular the effects of download and access time, data updates, and security. This part is grounded in real-world systems, implementation details, and measurements.
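
As a toy illustration of why coding can beat plain replication on storage overhead, the Python sketch below stores k data blocks plus a single XOR parity block and repairs the loss of any one block. It is a didactic single-parity code only, far simpler than the locally repairable and network codes covered in the tutorial:

    # Toy single-parity erasure code: k data blocks plus one XOR parity block
    # tolerate the loss of any single block, at a storage overhead of 1/k
    # instead of the 2x-3x overhead of replication.

    def xor_blocks(blocks):
        """Bytewise XOR of equally sized blocks."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = b"networked distributed storage"
    k = 4
    size = -(-len(data) // k)  # ceiling division
    blocks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = xor_blocks(blocks)            # stored on a (k+1)-th node

    # Simulate losing block 2 and repair it from the survivors and the parity.
    lost = 2
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    repaired = xor_blocks(survivors + [parity])
    assert repaired == blocks[lost]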

4. HPC and Big Data Convergence

Prof. Jesus Carretero, University Carlos III of Madrid, Spain

Traditional solutions for HPC and Big Data have followed separate paths for several years. Since 2014, however, there has been a strong effort to unify both worlds, driven by the strong interest in science and industry in getting more insights from data stored in huge data vaults, enhancing typical solutions with data from previous experiences, and creating complex processes (presented as workflows) to solve multistage problems. All these trends, grouped under the name "data-intensive computing", are changing both worlds. Convergence is not easy, however, as current Big Data solutions mostly address a small part of the problems that an HPC system may face. In this tutorial, we will present the specifics of data-intensive computing problems, their requirements, and the major paradigms used in the Big Data and HPC worlds to cope with them. We will then present the major efforts being made to promote convergence between both worlds, and some of the lessons learned from the adaptation of real-world problems.
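
As a small illustration of the two programming models the tutorial contrasts, the sketch below computes a global sum in the HPC style with explicit domain decomposition and message passing (using the mpi4py bindings, assumed to be installed), while the closing comment shows an equivalent Big Data formulation in Spark:

    # SPMD-style global sum with explicit domain decomposition and message
    # passing (run with, e.g., "mpirun -n 4 python partial_sums.py").
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each rank owns one slice of the (synthetic) data.
    n = 1_000_000
    chunk = range(rank * n // size, (rank + 1) * n // size)
    local_sum = sum(chunk)

    # Combine the partial results on rank 0.
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)
    if rank == 0:
        print("global sum:", total)

    # A Big Data engine hides the decomposition and communication, e.g. Spark:
    #   sc.parallelize(range(n)).reduce(lambda a, b: a + b)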

5. Technologies for Blockchain - Consensus and Cryptography

Dr Christian Cachin, IBM Research - Zurich

A blockchain is a public ledger for recording transactions, maintained by many nodes without central authority through a secure cryptographic protocol. All nodes validate the information to be appended to the blockchain, and the protocol ensures that the nodes agree on a unique order in which entries are appended.

This tutorial introduces some fundamental abstractions for programming distributed applications that tolerate faults, including reliable broadcast and consensus. It illustrates protocols that realize these concepts in systems subject to uncertainty and failures, and extends them to environments where malicious attacks may occur (so-called Byzantine faults). Applications of Byzantine consensus and distributed cryptography to blockchains are discussed, with a focus on current efforts to build a blockchain for the enterprise under the Hyperledger Project.
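
To make the notion of an append-only ledger concrete, the following Python sketch chains blocks by their hashes, so tampering with an earlier entry invalidates the chain; it deliberately shows only the data structure, not the Byzantine fault-tolerant consensus on the order of entries that the tutorial focuses on:

    # Minimal hash-chained ledger: each block commits to the hash of its
    # predecessor, so rewriting history breaks validation. Agreement on the
    # order of blocks (consensus) is out of scope of this sketch.
    import hashlib
    import json

    def block_hash(block):
        payload = json.dumps(block, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def append_block(chain, transactions):
        prev = block_hash(chain[-1]) if chain else "0" * 64
        chain.append({"prev_hash": prev, "transactions": transactions})

    def validate(chain):
        return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
                   for i in range(1, len(chain)))

    chain = []
    append_block(chain, ["alice pays bob 5"])
    append_block(chain, ["bob pays carol 2"])
    print(validate(chain))                               # True

    chain[0]["transactions"] = ["alice pays bob 500"]    # tamper with history
    print(validate(chain))                               # False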

6. Data Wrangling and Viz

Dr Michele Catasta, EPFL

Many industry experts report that "data scientists spend up to 80% of their time on data wrangling". Being such a time-consuming task, pre-processing the data before performing in-depth analysis is often considered a frustrating experience, with little appeal compared to machine learning and data mining. In this tutorial, you will learn the techniques and the tools of the trade to perform data wrangling efficiently. Moreover, you will see examples of how taking shortcuts and rushing the data wrangling process can badly affect all the subsequent steps in the data science pipeline. The ultimate goal of this tutorial is to shine a positive light on data wrangling, as it can be leveraged to assess the quality of the pre-processing step in your data science pipeline. Such a quality-checking process is iterative in nature and requires a solid background in visualization, so this tutorial will also include a live demo and a crash course on best practices in viz. To conclude, you will get an overview of the most promising research projects in the field.
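
As a small taste of what the wrangling and quality-checking loop looks like in practice, the sketch below uses pandas and matplotlib (both assumed available; the file and column names are hypothetical placeholders):

    # Typical wrangling steps before any analysis, followed by a quick visual
    # sanity check; an unexpected spike or gap in the plot usually means the
    # pre-processing (or the source data) needs another iteration.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("measurements.csv")                 # hypothetical raw export
    df = df.drop_duplicates()                            # remove duplicate rows
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
    df["value"] = pd.to_numeric(df["value"], errors="coerce")
    df = df.dropna(subset=["timestamp", "value"])        # drop unparseable rows

    df["value"].hist(bins=50)                            # quick quality check
    plt.xlabel("value")
    plt.ylabel("count")
    plt.title("Distribution after cleaning")
    plt.show()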

 


 

Schedule

Monday, January 30
  14h00-15h30   Philippe Cudre-Mauroux: Introduction to Data Science and Big Data
  16h00-19h30   Bettina Kemme: Data Management and Analysis as a Service

Tuesday, January 31
  08h30-12h00   Daniel Enrique Lucani Roetter: Codes for Cloud and Networked Distributed Storage
  17h00-19h00   PhD student presentations:
                Veronica Estrada
                Dimitri Percia David
                Mostafa Karimzadeh
                Silvina Caino-Lores
                Andrei Lapin
                Xavier Ouvrard
                Yarco Hayduk
                Alexandre DeMasi
                Roberta Barbi
                Davide Alocci

Wednesday, February 1
  08h30-12h00   Jesus Carretero: HPC and Big Data Convergence
  17h00-19h00   PhD student presentations:
                Jonnahtan Saltarin
                Paolo Rosso
                Bertil Chapuis
                Clément Labadie
                Vaibhav Kulkarni
                Pascal Gremaud
                Arnaud Durand
                Oumaima Ajmi
                Dana Naous

Thursday, February 2
  08h30-12h00   Christian Cachin: Technologies for Blockchain - Consensus and Cryptography
  17h00-19h00   PhD student presentations:
                Allan Berrocal
                Laura Rettig
                Aurelien Havet
                Aziza Merzouki
                Dina Elikan
                Simin Jabbari
                Selena Baset
                Rafael Pires
                Mirco Kocher

Friday, February 3
  08h30-12h00   Michele Catasta: Data Wrangling and Viz
  12h00-12h30   Wrap-up