31/05/2016

AN INTRODUCTION TO DATA-DRIVEN MODELLING & GRAPH CLUSTERING

Dr. Dante Conti

9.3.1 Summary

Society is now immersed in the era of Information and Communication Technologies, and the availability of massive data in many fields and real-world applications has encouraged the use of Data Mining, Machine Learning and Artificial Intelligence approaches to discover and extract non-trivial information from databases. These approaches are the result of multidisciplinary research and advances in Applied Mathematics, Statistics, Computer Science, Engineering and Physics. Some authors speak of a new era of Data Science and Data Scientists, referring to academic and professional profiles whose skills centre on analytics, IT and multidisciplinary thinking, applied to solving problems through knowledge discovery in databases.
Data-driven models (DDM) are becoming increasingly common. DDM is based on analysing the data about a system, in particular finding connections between the system state variables (input, internal and output variables) without explicit knowledge of the physical behaviour of the system. These methods represent a large advance over conventional empirical modelling, with applications in Finance, Marketing, Medicine, Management and Environmental Sciences, among others.
The job market is seeking experts in Analytics. The most in-demand profiles include mathematicians, statisticians and engineers. Some European and American universities already include data science and data modelling in their undergraduate and graduate curricula in Applied Mathematics, Statistics, Systems Engineering and similar disciplines.
Data-driven modelling assumes that a sufficient amount of data describing the underlying system is available. The data are used mainly for classification, pattern recognition, and associative and predictive analysis.
Under these premises, the objective of this course is to introduce students to data-driven modelling. A brief overview of the concepts and methodology will be presented, and the main methods will be described with the support of specialized software (in this case R: A language and environment for statistical computing). Emphasis will be placed on classification and clustering in order to solve two real problems where data-driven modelling has been applied successfully: (1) detecting consumption patterns in urban water networks and (2) graph analysis in flow networks, with a case study in air transport.
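As a rough illustration of what detecting consumption patterns by clustering can look like, the sketch below groups a few invented demand profiles with a plain k-means loop. It is written in Python with only the standard library so that it stays self-contained; the course itself works in R. The data, the function names and the choice of k-means are assumptions made for illustration, not the method used in the case study.

```python
# Toy k-means on invented consumption profiles (morning, midday, evening).
# All numbers are made up for illustration; they are not course data.

def squared_distance(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_profile(profiles):
    """Component-wise mean of a non-empty list of profiles."""
    n = len(profiles)
    return tuple(sum(values) / n for values in zip(*profiles))

def kmeans(profiles, centroids, max_iter=100):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in profiles
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            new_centroids.append(mean_profile(members) if members else old)
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

profiles = [
    (9.0, 4.0, 5.0),  # morning-peak households (hypothetical)
    (8.5, 4.5, 5.5),
    (9.5, 3.5, 4.5),
    (4.0, 5.0, 9.0),  # evening-peak households (hypothetical)
    (4.5, 4.5, 8.5),
    (3.5, 5.5, 9.5),
]

# Deliberately poor start: both initial centroids are morning-peak profiles.
labels, centroids = kmeans(profiles, centroids=[profiles[0], profiles[1]])
print(labels)
```

Even from this poor start, the loop separates the three morning-peak profiles from the three evening-peak ones. On real data the number of clusters is not known in advance, which is the kind of question that cluster-validity tooling is meant to help with.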
The course is designed to be interactive. Two lecture sessions are scheduled (about 6 hours). The rest of the time will be reserved for solving real problems based on the Hydroinformatics application and/or the flow networks (graph theory) application, with support and coaching for the participants.

9.3.2 Prerequisites:

Participants should have previously taken courses in Operations Research or Linear Programming and in Basic Statistics. Some familiarity with the R software is advisable.
For those with no R knowledge, an introduction to this software is available at:
• https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
• http://ocw.mit.edu/courses/sloan-school-of-management/15-097-prediction-machinelearning-and-statistics-spring-2012/lecture-notes/MIT15_097S12_lec02.pdf
• https://cran.r-project.org/web/views/MachineLearning.html

9.3.3 Software:

R: A language and environment for statistical computing
Available at:
https://cran.r-project.org/

Main packages to be used: igraph, igraphdata, randomForest, rpart, tree, e1071, NbClust.

9.3.4 Scheduling:

9.3.4.1 Monday July 04th: Lecture 1

An introduction to Data-driven Modelling and its main algorithms (3-4 hours). Afternoon (from 2 p.m. or 3 p.m.). Homework: some R examples and presentation of the first problem, on water consumption patterns (Milan, Italy and London, U.K.).

9.3.4.2 Tuesday July 05th: Lecture 2

Graph Theory and Graph Clustering: emphasis on shortest path applications and max-flow min-cut (3-4 hours). Afternoon (from 2 p.m. or 3 p.m.). Presentation of the second problem, on air transport between US airports. Homework: practice with igraph:
http://kateto.net/networks-r-igraph
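To make the two graph problems named in this lecture concrete, here is a minimal sketch of a shortest-path computation (Dijkstra) and a max-flow min-cut computation (Edmonds-Karp) on a tiny directed network. It uses only the Python standard library so that it is self-contained, whereas the course exercises use igraph in R. The network, the node names and the function names are hypothetical and chosen only for illustration.

```python
import heapq
from collections import deque

# Hypothetical directed network: edge -> (travel_time, capacity).
# Node names and numbers are invented for illustration only.
EDGES = {
    ("S", "A"): (2, 10),
    ("S", "B"): (5, 5),
    ("A", "B"): (1, 15),
    ("A", "T"): (7, 5),
    ("B", "T"): (3, 10),
}

def shortest_path_length(edges, source, target):
    """Dijkstra's algorithm on non-negative edge weights."""
    adjacency = {}
    for (u, v), (weight, _) in edges.items():
        adjacency.setdefault(u, []).append((v, weight))
    best = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, weight in adjacency.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return float("inf")

def max_flow_min_cut(edges, source, sink):
    """Edmonds-Karp: augment along shortest residual paths found by BFS."""
    residual = {}
    for (u, v), (_, capacity) in edges.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + capacity
        residual[v].setdefault(u, 0)  # reverse edge for cancelling flow
    flow = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            node = queue.popleft()
            for neighbour, capacity in residual[node].items():
                if capacity > 0 and neighbour not in parent:
                    parent[neighbour] = node
                    queue.append(neighbour)
        if sink not in parent:
            break  # no augmenting path left: the flow is maximal
        # Find the bottleneck capacity along the path, then push that much.
        bottleneck = float("inf")
        node = sink
        while parent[node] is not None:
            bottleneck = min(bottleneck, residual[parent[node]][node])
            node = parent[node]
        node = sink
        while parent[node] is not None:
            residual[parent[node]][node] -= bottleneck
            residual[node][parent[node]] += bottleneck
            node = parent[node]
        flow += bottleneck
    # Nodes still reachable from the source form the source side of a minimum cut.
    reachable = set(parent)
    cut = sorted((u, v) for (u, v) in edges
                 if u in reachable and v not in reachable)
    return flow, cut

print(shortest_path_length(EDGES, "S", "T"))
print(max_flow_min_cut(EDGES, "S", "T"))
```

The capacities of the returned cut edges add up to the value of the maximum flow, which is the max-flow min-cut relationship in miniature. A network can have several minimum cuts; this sketch reports the one closest to the source.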

9.3.4.3 From July 06th to 08th

Coaching for participants and work on the proposed problems. Participants will be divided into groups to make the problems easier to tackle. I will be available from 9:00 a.m. to 7 p.m.

9.3.4.4 Saturday July 09th

Final reports and oral presentations.

9.3.5 Languages:

Presentations and coaching activities will be in Portuguese. The bibliography is entirely in English.

9.3.6 Bibliography:

It is advisable to read (or at least quickly review) the following papers, which will be used throughout the week:

1) Survey: Graph clustering by Satu Elisa Schaeffer. Available at:
http://www.leonidzhukov.net/hse/2016/networks/papers/GraphClustering_Schaeffer07.pdf

2) Data-driven modelling: some past experiences and new approaches by Dimitri P. Solomatine and Avi Ostfeld. Available at:
http://jh.iwaponline.com/content/ppiwajhydro/10/1/3.full.pdf

3) Predictive models for forecasting hourly urban water demand by Manuel Herrera et al. Available at:
https://www.researchgate.net/publication/223694461_Predictive_models_for_forecasting_hourly_urban_water_demand_J_Hydrol_3871-2141-150
