Big Data Analytics - Methodological Training in Statistical Data Science

October 28th - 30th, 2019

Prof. Dr. Diego Kuonen, CStat PStat CSci, Statoo Consulting

Training course objectives:
There is no question that “big data” (i.e. the simple yet seemingly revolutionary belief that data are valuable) and “machine learning” (i.e. simply put, a field of advanced statistics designed for a world of “big data”) have hit business and industry, academia, engineering and government. The demand for skills in data science - a rebranding of data mining - is unprecedented in sectors where value, competitiveness and efficiency are driven by data. Nowadays, this is amplified by the digital transformation and the related data revolution.

Data mining technology and methodology have been applied to understand and to optimise various processes within business and industry, academia, engineering and government. It is widely believed that data mining will have a profound impact on our society and that data mining can bring real value. But how can data mining contribute to achieving operational excellence? Is data mining worth the trouble or is it “statistical déjà vu”?

Target group:
This course is aimed at specialists in service or industrial companies who wish to base their analyses on high-quality data.

Course contents:
This three-day training course will provide you with an overview of the potential and limitations of data mining and with a thorough methodological, practical and, most importantly, software-vendor independent coverage of state-of-the-art data mining techniques (e.g. from statistics, machine learning and artificial intelligence). It highlights the applicability of these techniques to accumulated data, and it will enable you to apply the presented methodology and its underlying philosophy to benchmark data or your own data.

This training will provide you with a thorough methodological and practical coverage of state-of-the-art data mining techniques (e.g. from statistics, machine learning and artificial intelligence) that identify unexpected patterns, structures, models or trends in data to make crucial decisions. The course will give you practical data mining experience, and the concepts and methods will be illustrated throughout. Moreover, you will be able to apply what you have learnt within a state-of-the-art data mining workbench using benchmark data or your own data.

Course goals:
The naïve and blind “black-box” use of data mining software packages has its obvious pitfalls and can, and probably often does, lead to practically worthless results and misleading conclusions. Data mining is easy to do badly. It is therefore important to understand enough of the characteristics of the underlying data mining methodologies (both their advantages and their pitfalls) to be able to make an informed choice about which data mining methods to use and also to be able to critically appraise your own results and those of others. In this course we will apply a “white-box” methodology, which emphasises an understanding of the algorithmic and statistical model structures underlying the “black-box” software.

Instruction proceeds from tangible examples to theory – from the big picture, or “whole”, to details, or “parts” – and from a conceptual understanding to the ability to perform specific statistical data mining tasks.

Consequently, the course begins with a brief discussion of the role and applicability of data mining to empower companies to extract previously unrealised information from their data repositories. Next, a general overview of data mining, the art and science of learning from data, will be given. Only then do we look at individual tools in detail and note how they fit into the big picture. As such, in the main part of this training a software-vendor independent overview of the statistical data mining terminology and methods, resources and practical issues will be given. For all techniques considered, the basic methodology will be explained and illustrated with examples. Finally, the course will enable you to apply the presented methodology and its underlying philosophy to benchmark data or your own data.

In summary, this three-day course divides class time between lectures covering, in a software-vendor independent way, the methodological aspects and practical applications of statistical data mining, and hands-on practice, where you will have a chance to try out for yourself the methods learnt in the course within a state-of-the-art data mining workbench using benchmark data or your own data.

Overview of Data Mining Methodology:

  • Introduction
  • Demystifying the “big data” hype
  • Demystifying the “Internet of things” hype
  • Applicability of data mining
  • What is data mining?
    • Is data mining “statistical déjà vu”?
    • What distinguishes data mining from statistics?
  • Demystifying the “data science” hype
  • Demystifying the “machine learning” hype
  • A process model for data mining
  • Data and data preprocessing
    • Data sources
    • Why data preprocessing?
    • Major tasks in data preprocessing (e.g. data integration, data cleaning, data transformation, data reduction, data discretisation)
  • Data mining techniques and tasks
  • Description and visualisation
  • Characterising multivariate data
  • Dissimilarity and distance measures
  • Unsupervised methods (“class discovery”)
    • Principal component analysis
    • Multidimensional scaling
    • Correspondence analysis
    • Cluster analysis (e.g. hierarchical algorithms, partitioning algorithms, using clustering in practice)
    • Kohonen's self-organising maps
    • Affinity grouping or association rules
    • A look forward
  • Supervised methods (“class prediction”)
    • Introduction (e.g. inductive bias and model complexity, score functions, internal validation, external validation)
    • Classification modelling (e.g. discriminant analysis, support vector machines, nearest neighbour classification, naïve Bayes classifier)
    • Regression modelling (e.g. multiple linear models, generalised linear models, nonparametric regression models, generalised additive models, multivariate adaptive regression splines)
    • Neural networks
    • Tree-based methods (e.g. CART, C4.5 and C5.0, CHAID)
    • Ensemble learning (e.g. bagging, subagging, random forests, arcing, boosting, stochastic gradient tree boosting)
    • The curse of dimensionality (e.g. feature extraction, feature subset selection: filters, wrappers, embedded methods)
    • Evaluating and comparing classifiers
    • Comparing regression models
    • A look forward
    • Comparison of chosen supervised learning methods
    • Recent lessons – what has been learnt?
  • Criteria for potential data mining success
  • Conclusion
  • References and resources

The lectures will be given in German. During the course, questions may be asked in English, French or German. All training documents will be in English. All participants will receive a printed version of the documentation for personal use only.

StatSoft (Europe) GmbH, Hamburg - Subject to change.

Participants should be familiar with basic statistics, including multiple linear regression. A laptop with a preinstalled TIBCO Statistica course licence will be provided. We will provide you with the details before the course begins.

Course fees and discounts:
Public course fee              EUR 2.500

Academic discount           30% discount on the public course fee. No further discounts apply.

Group discounts               Group discounts are possible if two or more people from the same organisation register together at the same time. For further information please do not hesitate to contact us. No further discounts apply.

Early bird discount           10% discount on the public course fee if you register at least 6 weeks before the course date. No further discounts apply.

The prices include printed documentation for personal use, coffee breaks and lunch, and exclude VAT. All participants will receive a confirmation of participation.

Duration: 3 days           Time: 9:00 - 17:00 h            Price: EUR 2.500 (plus VAT) per participant

