# Contents & Examples

**Disclaimer:**
Except where otherwise noted, the workflow and code examples are
available under the Creative Commons Attribution-NonCommercial 4.0 International Public License (CC BY-NC 4.0).


## 1 Introduction

- 1.1 Motivation
- 1.1.1 Data and Knowledge
- 1.1.2 Tycho Brahe and Johannes Kepler
- 1.1.3 Intelligent Data Analysis

- 1.2 The Data Analysis Process
- 1.3 Methods, Tasks, and Tools
- 1.4 How to Read This Book

In this introductory chapter we provide a brief overview of some core ideas of data science and their motivation.

As a first step, we carefully distinguish between “data” and “knowledge” in order to obtain clear notions that help us understand why it is usually not enough to simply collect data and why we must strive to turn them into knowledge. As an illustration, we consider a well-known example from the history of science. As a second step, we characterize the data science process, often also referred to as the knowledge discovery process, in which modeling is one important step. We characterize standard data science tasks and summarize the catalog of methods used to tackle them.

## 2 Practical Data Analysis: An Example

- 2.1 The Setup
- 2.2 Data Understanding and Pattern Finding
- 2.3 Explanation Finding
- 2.4 Predicting the Future
- 2.5 Concluding Remarks

Before talking about the full-fledged data science process and diving into the details of individual methods, this chapter demonstrates some typical pitfalls one encounters when analyzing real-world data.

We start our journey through the data science process by looking over the shoulders of two (pseudo) data scientists, Stan and Laura, working on some hypothetical data science problems in a sales environment. Being differently skilled, they show how things should and should not be done. Throughout the chapter, a number of typical problems that data analysts meet in real work situations are demonstrated as well. We will skip algorithmic and other details here and only briefly mention the intention behind applying some of the processes and methods. They will be discussed in depth in subsequent chapters.

## 3 Project Understanding

- 3.1 Determine the Project Objective
- 3.2 Assess the Situation
- 3.3 Determine Analysis Goals
- 3.4 Further Reading

We are at the beginning of a series of interdependent steps, where the project understanding phase marks the first. In this initial phase of the data analysis project, we have to map a problem onto one or many data analysis tasks.

In a nutshell, we conjecture that the nature of the problem at hand can be adequately captured by some data sets (which still have to be identified or constructed), that appropriate modeling techniques can successfully be applied to learn the relationships in the data, and finally that the gained insights or models can be transferred back to the real case and applied successfully. This endeavor relies on a number of assumptions and is threatened by several risks, so the goal of the project understanding phase is to assess the main objective, the potential benefits, and the constraints, assumptions, and risks. While the number of data analysis projects is rapidly expanding, the failure rate is still high; this phase should therefore be carried out with care, so that the chances of success can be rated realistically and the project stays on the right track.

## 4 Data Understanding

- 4.1 Attribute Understanding
- 4.2 Data Quality
- 4.3 Data Visualization
- 4.3.1 Methods for One and Two Attributes
- 4.3.2 Methods for Higher-Dimensional Data

- 4.4 Correlation Analysis
- 4.5 Outlier Detection
- 4.5.1 Outlier Detection for Single Attributes
- 4.5.2 Outlier Detection for Multidimensional Data

- 4.6 Missing Values
- 4.7 A Checklist for Data Understanding
- 4.8 Data Understanding in Practice
- 4.8.1 Visualizing the Iris Data
- 4.8.2 Visualizing a Three-Dimensional Data Set on a Two-Coordinate Plot

The main goal of data understanding is to gain general insights about the data that will potentially be helpful for the further steps in the data analysis process, but data understanding should not be driven exclusively by the goals and methods to be applied in later steps.

Although these requirements should be kept in mind during data understanding, one should approach the data from a neutral point of view. Never trust any data as long as you have not carried out some simple plausibility checks. Methods for such plausibility checks will be discussed in this chapter. At the end of the data understanding phase, we know much better whether the assumptions we made during the project understanding phase concerning representativeness, informativeness, data quality, and the presence or absence of external factors are justified.
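Such plausibility checks can be as simple as counting missing and out-of-range values per attribute. The following Python sketch illustrates the idea; the attribute names and records are invented for illustration:

```python
# Hypothetical customer records; None marks a missing value and the
# attribute names are invented for illustration.
records = [
    {"age": 25, "income": 42000},
    {"age": 34, "income": 58000},
    {"age": -3, "income": 61000},   # -3 is an implausible age
    {"age": None, "income": 39000},
]

# Two simple plausibility checks before trusting the data:
missing = sum(1 for r in records if r["age"] is None)
implausible = sum(1 for r in records if r["age"] is not None and r["age"] < 0)
print(missing, implausible)  # → 1 1
```

Finding even one implausible value this way is a signal to question how the data were collected before proceeding.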

## 5 Principles of Modelling

- 5.1 Model Classes
- 5.2 Fitting Criteria and Score Functions
- 5.2.1 Error Functions for Classification Problems
- 5.2.2 Measures of Interestingness

- 5.3 Algorithms for Model Fitting
- 5.3.1 Closed Form Solutions
- 5.3.2 Gradient Method
- 5.3.3 Combinatorial Optimization
- 5.3.4 Random Search, Greedy Strategies, and Other Heuristics

- 5.4 Types of Errors
- 5.4.1 Experimental Error
- 5.4.2 Sample Error
- 5.4.3 Model Error
- 5.4.4 Algorithmic Error
- 5.4.5 Machine Learning Bias and Variance
- 5.4.6 Learning Without Bias?

- 5.5 Model Validation
- 5.5.1 Training and Test Data
- 5.5.2 Cross-Validation
- 5.5.3 Bootstrapping
- 5.5.4 Measures for Model Complexity
- 5.5.5 Coping with Unbalanced Data

- 5.6 Model Errors and Validation in Practice
- 5.6.1 Scoring Models for Classification
- 5.6.2 Scoring Models for Numeric Predictions

- 5.7 Further Reading

After we have gone through the phases of project and data understanding, we are either confident that modeling will be successful or return to the project understanding phase to revise objectives (or to stop the project).

In the former case, we have to prepare the data set for subsequent modeling. However, as some of the data preparation steps are motivated by modeling itself, we first discuss the principles of modeling. Many modeling methods will be introduced in the following chapters, but this chapter is devoted to problems and aspects that are inherent in and common to all the methods for analyzing the data.
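One aspect common to all modeling methods is validation, for instance by cross-validation (Sect. 5.5.2). The sketch below shows only the index-splitting idea behind k-fold cross-validation; the data size and fold count are illustrative, and a real project would use a library implementation:

```python
import random

# Index splitting behind k-fold cross-validation: shuffle the record
# indices, cut them into k folds, and use each fold once as test data.
# (n and k are illustrative choices.)
def kfold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

for train, test in kfold_indices(10, 5):
    # every record is used exactly once, and train and test never overlap
    assert sorted(train + test) == list(range(10))
```

Each record thus serves as test data exactly once, which is what makes the resulting error estimate less dependent on a single lucky or unlucky split.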

## 6 Data Preparation

- 6.1 Select Data
- 6.1.1 Feature Selection
- 6.1.2 Dimensionality Reduction
- 6.1.3 Record Selection

- 6.2 Clean Data
- 6.2.1 Improve Data Quality
- 6.2.2 Missing Values
- 6.2.3 Remove Outliers

- 6.3 Construct Data
- 6.3.1 Provide Operability
- 6.3.2 Assure Impartiality
- 6.3.3 Maximize Efficiency

- 6.4 Complex Data Types
- 6.5 Data Integration
- 6.5.1 Vertical Data Integration
- 6.5.2 Horizontal Data Integration

- 6.6 Data Preparation in Practice
- 6.6.1 Removing Empty or Almost Empty Attributes and Records in a Data Set
- 6.6.2 Normalization and Denormalization
- 6.6.3 Backward Feature Elimination

- 6.7 Further Reading

In the data understanding phase we have explored all available data and carefully checked if they satisfy our assumptions and correspond to our expectations. We intend to apply various modeling techniques to extract models from the data.

Although we have not yet discussed any modeling technique in greater detail (see the following chapters), we have already glimpsed some fundamental techniques and potential pitfalls in the previous chapter. Before we start modeling, we have to prepare our data set appropriately; that is, we are going to modify our data set so that the modeling techniques are supported as well as possible while being biased as little as possible.
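A simple example of such a preparatory modification is normalization (cf. Sect. 6.6.2). The sketch below shows min-max normalization; the attribute and its values are hypothetical:

```python
# Min-max normalization maps an attribute linearly onto [0, 1], so that
# attributes with very different scales do not dominate each other.
# The price values are invented for illustration.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

prices = [5000, 12000, 30000, 17500]
print(min_max_normalize(prices))  # → [0.0, 0.28, 1.0, 0.5]
```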

## 7 Finding Patterns

- 7.1 Hierarchical Clustering
- 7.1.1 Overview
- 7.1.2 Construction
- 7.1.3 Variations and Issues

- 7.2 Notion of (Dis-)Similarity
- 7.3 Prototype- and Model-Based Clustering
- 7.3.1 Overview
- 7.3.2 Construction
- 7.3.3 Variations and Issues

- 7.4 Density-Based Clustering
- 7.4.1 Overview
- 7.4.2 Construction
- 7.4.3 Variations and Issues

- 7.5 Self-Organizing Maps
- 7.5.1 Overview
- 7.5.2 Construction

- 7.6 Frequent Pattern Mining and Association Rules
- 7.6.1 Overview
- 7.6.2 Construction
- 7.6.3 Variations and Issues

- 7.7 Deviation Analysis
- 7.7.1 Overview
- 7.7.2 Construction
- 7.7.3 Variations and Issues

- 7.8 Finding Patterns in Practice
- 7.8.1 Hierarchical Clustering
- 7.8.2 k-Means and DBSCAN
- 7.8.3 Association Rule Mining

- 7.9 Further Reading

In this chapter we are concerned with summarizing, describing, or exploring the data set as a whole.

We do not (yet) concentrate on a particular target attribute; that will be the focus of Chaps. 8 and 9. Compact and informative representations of (possibly only parts of) the data set stimulate the data understanding phase and are extremely helpful when exploring the data. While a table with the mean and standard deviation of every field also summarizes the data in some way, such a representation misses any interaction between the variables. Investigating two fields in a scatter plot and examining their correlation coefficient could reveal a linear dependency between two variables, but what if more variables are involved, and the dependency is restricted to a certain part of the data only? We will review several techniques that try to group or organize the data intelligently, so that the individual parts are meaningful or interesting by themselves. To get a quick impression of the methods addressed in this chapter, consider a data set of car offers from an internet platform. A number of interactions and dependencies, well known to everyone who has actively searched for a car, can be found in such a data set.
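As a taste of the prototype-based clustering treated in Sect. 7.3, the following sketch runs the two alternating steps of k-means on toy one-dimensional data; the starting prototypes and values are invented for illustration:

```python
# Two alternating steps of k-means: assign each point to its closest
# prototype, then move each prototype to the mean of its points.
def assign(points, centers):
    return [min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            for p in points]

def update(points, labels, k):
    return [sum(p for p, l in zip(points, labels) if l == c)
            / max(1, sum(1 for l in labels if l == c))
            for c in range(k)]

points = [1.0, 1.5, 0.5, 8.0, 8.5, 7.5]   # two obvious groups
centers = [0.0, 10.0]                     # arbitrary starting prototypes
for _ in range(5):                        # iterate until (well past) convergence
    labels = assign(points, centers)
    centers = update(points, labels, 2)
print(centers)  # → [1.0, 8.0]
```

The prototypes settle at the centers of the two groups, giving exactly the kind of compact, interpretable summary of the data this chapter is after.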

## 8 Finding Explanations

- 8.1 Decision Trees
- 8.1.1 Overview
- 8.1.2 Construction
- 8.1.3 Variations and Issues

- 8.2 Bayes Classifiers
- 8.2.1 Overview
- 8.2.2 Construction
- 8.2.3 Variations and Issues

- 8.3 Regression
- 8.3.1 Overview
- 8.3.2 Construction
- 8.3.3 Variations and Issues
- 8.3.4 Two Class Problems
- 8.3.5 Regularization for Logistic Regression

- 8.4 Rule Learning
- 8.4.1 Propositional Rules
- 8.4.2 Inductive Logic Programming or First-Order Rules

- 8.5 Finding Explanations in Practice
- 8.5.1 Decision Trees
- 8.5.2 Naïve Bayes
- 8.5.3 Logistic Regression

- 8.6 Further Reading

In the previous chapter we discussed methods that find patterns of different shapes in data sets. All these methods needed measures of similarity in order to group similar objects. In this chapter we will discuss methods that address a very different setup: instead of finding structure in a data set, we are now focusing on methods that find explanations for an unknown dependency within the data.

Such a search for a dependency usually focuses on a so-called target attribute: we are particularly interested in why one specific attribute has a certain value. If the target attribute is a nominal variable, we are talking about a classification problem; if it is numerical, we are referring to a regression problem. Examples of such problems would be understanding why a customer belongs to the category of people who cancel their account (e.g., classifying her into a yes/no category) or better understanding the risk factors of customers in general.
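The distinction can be made concrete with a toy classification rule for the account-cancellation example; the attribute "age", the threshold, and the records below are purely invented:

```python
# A toy classification rule for a nominal target attribute ("cancels?").
# Attribute, threshold, and records are hypothetical illustrations.
customers = [(23, "yes"), (35, "no"), (61, "no"), (29, "yes")]  # (age, cancels?)

def classify(age, threshold=30):
    # nominal target → classification: predict one of the classes "yes"/"no"
    return "yes" if age < threshold else "no"

assert all(classify(age) == label for age, label in customers)
```

Had the target been, say, the expected account balance (a number), the same setup would instead be a regression problem.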

## 9 Finding Predictors

- 9.1 Nearest-Neighbor Predictors
- 9.1.1 Overview
- 9.1.2 Construction
- 9.1.3 Variations and Issues

- 9.2 Artificial Neural Networks
- 9.2.1 Overview
- 9.2.2 Construction
- 9.2.3 Variations and Issues

- 9.3 Deep Learning
- 9.3.1 Recurrent Neural Networks and Long Short-Term Memory Units
- 9.3.2 Convolutional Neural Networks
- 9.3.3 More Deep Learning Networks: Generative Adversarial Networks (GANs)

- 9.4 Support Vector Machines
- 9.4.1 Overview
- 9.4.2 Construction
- 9.4.3 Variations and Issues

- 9.5 Ensemble Methods
- 9.5.1 Overview
- 9.5.2 Construction
- 9.5.3 Further Reading

- 9.6 Finding Predictors in Practice
- 9.6.1 k Nearest Neighbor (kNN)
- 9.6.2 Artificial Neural Networks and Deep Learning
- 9.6.3 Support Vector Machine (SVM)
- 9.6.4 Random Forest and Gradient Boosted Trees

- 9.7 Further Reading

In this chapter we consider methods of constructing predictors for class labels or numeric target attributes.

However, in contrast to Chap. 8, where we discussed methods for basically the same purpose, the methods in this chapter yield models that do not help much to explain the data, or they even dispense with models altogether. Nevertheless, they can be useful, namely when the main goal is good prediction accuracy rather than an intuitive and interpretable model. Especially artificial neural networks and support vector machines, which we study in Sects. 9.2 and 9.4, are known to outperform other methods with respect to accuracy in many tasks. However, due to the abstract mathematical structure of the prediction procedure, which is usually difficult to map to the application domain, the models they yield are basically “black boxes” that are almost impossible to interpret in terms of the application domain. Hence they should be considered only if a comprehensible model that can easily be checked for plausibility is not required and high accuracy is the main concern.
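The nearest-neighbor predictors of Sect. 9.1 are a simple case in point: predictions come directly from stored examples, with no interpretable model at all. A minimal 1-nearest-neighbor sketch on invented toy data:

```python
# 1-nearest-neighbor prediction: return the label of the closest stored
# training example (squared Euclidean distance). Data are invented.
def nn_predict(query, training):
    return min(training,
               key=lambda t: sum((q - x) ** 2 for q, x in zip(query, t[0])))[1]

training = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((8.0, 8.0), "B")]
print(nn_predict((1.1, 1.0), training))  # → A
```

The prediction can be accurate, yet the "model" is nothing but the training data itself; there is no compact structure to inspect or explain.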

## 10 Evaluation and Deployment

- 10.1 Model Deployment
- 10.1.1 Interactive Applications
- 10.1.2 Model Scoring as a Service
- 10.1.3 Model Representation Standards
- 10.1.4 Frequent Causes for Deployment Failures

- 10.2 Model Management
- 10.2.1 Model Updating and Retraining
- 10.2.2 Model Factories

- 10.3 Model Deployment and Management in Practice
- 10.3.1 Deployment to a Dashboard
- 10.3.2 Deployment as REST Service
- 10.3.3 Integrated Deployment

We have shown in Chap. 5 how to evaluate models and discussed how they are generated using techniques from Chaps. 7–9. The models were also interpreted to gain new insights for feature construction (or even data acquisition). What we have ignored so far is the deployment of the models into production, so that they can be used by other applications or by non-data scientists.

Important here are also issues of continued monitoring and potentially even automatic updating of the models in production.
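One simple monitoring idea is to compare a statistic of the incoming scoring data against its value on the training data and to raise an alert when they diverge. The sketch below is illustrative only; the tolerance and data values are invented:

```python
# Flag possible input drift by comparing the mean of incoming scoring
# data with the mean seen at training time. Tolerance and data invented.
def drift_alert(train_mean, incoming, tolerance=0.5):
    current = sum(incoming) / len(incoming)
    return abs(current - train_mean) > tolerance

print(drift_alert(5.0, [5.1, 4.9, 5.2]))  # stable input  → False
print(drift_alert(5.0, [7.8, 8.1, 7.9]))  # shifted input → True
```

In production such an alert would typically trigger the retraining discussed in Sect. 10.2.1 rather than merely printing a flag.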

## A Statistics

- A.1 Terms and Notation
- A.2 Descriptive Statistics
- A.2.1 Tabular Representations
- A.2.2 Graphical Representations
- A.2.3 Characteristic Measures for One-Dimensional Data
- A.2.4 Characteristic Measures for Multidimensional Data
- A.2.5 Principal Component Analysis

- A.3 Probability Theory
- A.3.1 Probability
- A.3.2 Basic Methods and Theorems
- A.3.3 Random Variables
- A.3.4 Characteristic Measures of Random Variables
- A.3.5 Some Special Distributions

- A.4 Inferential Statistics
- A.4.1 Random Samples
- A.4.2 Parameter Estimation
- A.4.3 Hypothesis Testing

Since classical statistics provides many data analysis methods and supports and justifies a lot of others, we provide in this appendix a brief review of some basics of statistics. We discuss descriptive statistics, inferential statistics, and needed fundamentals from probability theory. Since we strove to make this appendix as self-contained as possible, some overlap with the chapters of this book is unavoidable. However, material is not simply repeated here but presented from a slightly different point of view, emphasizing different aspects and using different examples.
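As a tiny illustration of the characteristic measures reviewed in Sect. A.2, the following snippet computes the mean and the population standard deviation with Python's standard library; the data values are invented and chosen so that the results are round numbers:

```python
import statistics

# Characteristic measures for one-dimensional data (Sect. A.2.3);
# the values are invented so that the results come out round.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mean(data))    # arithmetic mean → 5
print(statistics.pstdev(data))  # population standard deviation → 2.0
```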

## B KNIME

- B.1 Installation and Overview
- B.2 Building Workflows
- B.3 Example Workflow

KNIME (pronounced [naim]) Analytics Platform is an open, open-source, modular data science platform that covers all your data needs, from data access to data preparation, from data visualization to the training of machine learning algorithms, and finally from testing to deployment. KNIME Analytics Platform is based on visual programming: basic units (nodes) dedicated to a certain task are dragged and dropped onto a canvas to build a pipeline (workflow) of actions that takes your data from its raw form to the final state, be it a prediction for an input vector, a summary dashboard, or a particular KPI. KNIME Server complements KNIME Analytics Platform as its enterprise IT platform, allowing for easy deployment, collaboration, scalability, automation, and application management.