Andrea Ahlemeyer-Stubbe, Shirley Coleman
A Practical Guide to Data Mining for Business and Industry (eBook, PDF)
60,99 €
incl. VAT
Available immediately for download
- Format: PDF
- Devices: PC
- With copy protection
- Size: 19.41 MB
Data mining is well on its way to becoming a recognized discipline in the overlapping areas of IT, statistics, machine learning, and AI. Practical Data Mining for Business presents a user-friendly approach to data mining methods, covering the typical uses to which it is applied. The methodology is complemented by case studies to create a versatile reference book, allowing readers to look for specific methods as well as for specific applications. The book is formatted to allow statisticians, computer scientists, and economists to cross-reference from a particular application or method to sectors of interest.
For legal reasons, this download can only be delivered to a billing address in Germany.
Product details
- Publisher: John Wiley & Sons
- Publication date: 21 March 2014
- Language: English
- ISBN-13: 9781118763728
- Item no.: 40683321
Andrea Ahlemeyer-Stubbe, Director Strategic Analytics, DRAFTFCB München GmbH, Germany
Shirley Coleman, Principal Statistician, Industrial Statistics Research Unit, School of Maths and Statistics, Newcastle University, UK
Glossary of terms xii
Part I Data Mining Concept 1
1 Introduction 3
1.1 Aims of the Book 3
1.2 Data Mining Context 5
1.2.1 Domain Knowledge 6
1.2.2 Words to Remember 7
1.2.3 Associated Concepts 7
1.3 Global Appeal 8
1.4 Example Datasets Used in This Book 8
1.5 Recipe Structure 11
1.6 Further Reading and Resources 13
2 Data Mining Definition 14
2.1 Types of Data Mining Questions 15
2.1.1 Population and Sample 15
2.1.2 Data Preparation 16
2.1.3 Supervised and Unsupervised Methods 16
2.1.4 Knowledge-Discovery Techniques 18
2.2 Data Mining Process 19
2.3 Business Task: Clarification of the Business Question behind the Problem 20
2.4 Data: Provision and Processing of the Required Data 21
2.4.1 Fixing the Analysis Period 22
2.4.2 Basic Unit of Interest 23
2.4.3 Target Variables 24
2.4.4 Input Variables/Explanatory Variables 24
2.5 Modelling: Analysis of the Data 25
2.6 Evaluation and Validation during the Analysis Stage 25
2.7 Application of Data Mining Results and Learning from the Experience 28
Part II Data Mining Practicalities 31
3 All about Data 33
3.1 Some Basics 34
3.1.1 Data, Information, Knowledge and Wisdom 35
3.1.2 Sources and Quality of Data 36
3.1.3 Measurement Level and Types of Data 37
3.1.4 Measures of Magnitude and Dispersion 39
3.1.5 Data Distributions 41
3.2 Data Partition: Random Samples for Training, Testing and Validation 41
3.3 Types of Business Information Systems 44
3.3.1 Operational Systems Supporting Business Processes 44
3.3.2 Analysis-Based Information Systems 45
3.3.3 Importance of Information 45
3.4 Data Warehouses 47
3.4.1 Topic Orientation 47
3.4.2 Logical Integration and Homogenisation 48
3.4.3 Reference Period 48
3.4.4 Low Volatility 48
3.4.5 Using the Data Warehouse 49
3.5 Three Components of a Data Warehouse: DBMS, DB and DBCS 50
3.5.1 Database Management System (DBMS) 51
3.5.2 Database (DB) 51
3.5.3 Database Communication Systems (DBCS) 51
3.6 Data Marts 52
3.6.1 Regularly Filled Data Marts 53
3.6.2 Comparison between Data Marts and Data Warehouses 53
3.7 A Typical Example from the Online Marketing Area 54
3.8 Unique Data Marts 54
3.8.1 Permanent Data Marts 54
3.8.2 Data Marts Resulting from Complex Analysis 56
3.9 Data Mart: Do's and Don'ts 58
3.9.1 Do's and Don'ts for Processes 58
3.9.2 Do's and Don'ts for Handling 58
3.9.3 Do's and Don'ts for Coding/Programming 59
4 Data Preparation 60
4.1 Necessity of Data Preparation 61
4.2 From Small and Long to Short and Wide 61
4.3 Transformation of Variables 65
4.4 Missing Data and Imputation Strategies 66
4.5 Outliers 69
4.6 Dealing with the Vagaries of Data 70
4.6.1 Distributions 70
4.6.2 Tests for Normality 70
4.6.3 Data with Totally Different Scales 70
4.7 Adjusting the Data Distributions 71
4.7.1 Standardisation and Normalisation 71
4.7.2 Ranking 71
4.7.3 Box-Cox Transformation 71
4.8 Binning 72
4.8.1 Bucket Method 73
4.8.2 Analytical Binning for Nominal Variables 73
4.8.3 Quantiles 73
4.8.4 Binning in Practice 74
4.9 Timing Considerations 77
4.10 Operational Issues 77
5 Analytics 78
5.1 Introduction 79
5.2 Basis of Statistical Tests 80
5.2.1 Hypothesis Tests and P Values 80
5.2.2 Tolerance Intervals 82
5.2.3 Standard Errors and Confidence Intervals 83
5.3 Sampling 83
5.3.1 Methods 83
5.3.2 Sample Sizes 84
5.3.3 Sample Quality and Stability 84
5.4 Basic Statistics for Pre-analytics 85
5.4.1 Frequencies 85
5.4.2 Comparative Tests 88
5.4.3 Cross Tabulation and Contingency Tables 89
5.4.4 Correlations 90
5.4.5 Association Measures for Nominal Variables 91
5.4.6 Examples of Output from Comparative and Cross Tabulation Tests 92
5.5 Feature Selection/Reduction of Variables 96
5.5.1 Feature Reduction Using Domain Knowledge 96
5.5.2 Feature Selection Using Chi-Square 97
5.5.3 Principal Components Analysis and Factor Analysis 97
5.5.4 Canonical Correlation, PLS and SEM 98
5.5.5 Decision Trees 98
5.5.6 Random Forests 98
5.6 Time Series Analysis 99
6 Methods 102
6.1 Methods Overview 104
6.2 Supervised Learning 105
6.2.1 Introduction and Process Steps 105
6.2.2 Business Task 105
6.2.3 Provision and Processing of the Required Data 106
6.2.4 Analysis of the Data 107
6.2.5 Evaluation and Validation of the Results (during the Analysis) 108
6.2.6 Application of the Results 108
6.3 Multiple Linear Regression for use when Target is Continuous 109
6.3.1 Rationale of Multiple Linear Regression Modelling 109
6.3.2 Regression Coefficients 110
6.3.3 Assessment of the Quality of the Model 111
6.3.4 Example of Linear Regression in Practice 113
6.4 Regression when the Target is not Continuous 119
6.4.1 Logistic Regression 119
6.4.2 Example of Logistic Regression in Practice 121
6.4.3 Discriminant Analysis 126
6.4.4 Log-Linear Models and Poisson Regression 128
6.5 Decision Trees 129
6.5.1 Overview 129
6.5.2 Selection Procedures of the Relevant Input Variables 134
6.5.3 Splitting Criteria 134
6.5.4 Number of Splits (Branches of the Tree) 135
6.5.5 Symmetry/Asymmetry 135
6.5.6 Pruning 135
6.6 Neural Networks 137
6.7 Which Method Produces the Best Model? A Comparison of Regression, Decision Trees and Neural Networks 141
6.8 Unsupervised Learning 142
6.8.1 Introduction and Process Steps 142
6.8.2 Business Task 143
6.8.3 Provision and Processing of the Required Data 143
6.8.4 Analysis of the Data 145
6.8.5 Evaluation and Validation of the Results (during the Analysis) 147
6.8.6 Application of the Results 148
6.9 Cluster Analysis 148
6.9.1 Introduction 148
6.9.2 Hierarchical Cluster Analysis 149
6.9.3 K-Means Method of Cluster Analysis 150
6.9.4 Example of Cluster Analysis in Practice 151
6.10 Kohonen Networks and Self-Organising Maps 151
6.10.1 Description 151
6.10.2 Example of SOMs in Practice 152
6.11 Group Purchase Methods: Association and Sequence Analysis 155
6.11.1 Introduction 155
6.11.2 Analysis of the Data 157
6.11.3 Group Purchase Methods 158
6.11.4 Examples of Group Purchase Methods in Practice 158
7 Validation and Application 161
7.1 Introduction to Methods for Validation 161
7.2 Lift and Gain Charts 162
7.3 Model Stability 164
7.4 Sensitivity Analysis 167
7.5 Threshold Analytics and Confusion Matrix 169
7.6 ROC Curves 170
7.7 Cross-Validation and Robustness 171
7.8 Model Complexity 172
Part III Data Mining in Action 173
8 Marketing: Prediction 175
8.1 Recipe 1: Response Optimisation: to Find and Address the Right Number of Customers 176
8.2 Recipe 2: To Find the x% of Customers with the Highest Affinity to an Offer 186
8.3 Recipe 3: To Find the Right Number of Customers to Ignore 187
8.4 Recipe 4: To Find the x% of Customers with the Lowest Affinity to an Offer 190
8.5 Recipe 5: To Find the x% of Customers with the Highest Affinity to Buy 191
8.6 Recipe 6: To Find the x% of Customers with the Lowest Affinity to Buy 192
8.7 Recipe 7: To Find the x% of Customers with the Highest Affinity to a Single Purchase 193
8.8 Recipe 8: To Find the x% of Customers with the Highest Affinity to Sign a Long-Term Contract in Communication Areas 194
8.9 Recipe 9: To Find the x% of Customers with the Highest Affinity to Sign a Long-Term Contract in Insurance Areas 196
9 Intra-Customer Analysis 198
9.1 Recipe 10: To Find the Optimal Amount of Single Communication to Activate One Customer 199
9.2 Recipe 11: To Find the Optimal Communication Mix to Activate One Customer 200
9.3 Recipe 12: To Find and Describe Homogeneous Groups of Products 206
9.4 Recipe 13: To Find and Describe Groups of Customers with Homogeneous Usage 210
9.5 Recipe 14: To Predict the Order Size of Single Products or Product Groups 216
9.6 Recipe 15: Product Set Combination 217
9.7 Recipe 16: To Predict the Future Customer Lifetime Value of a Customer 219
10 Learning from a Small Testing Sample and Prediction 225
10.1 Recipe 17: To Predict Demographic Signs (Like Sex, Age, Education and Income) 225
10.2 Recipe 18: To Predict the Potential Customers of a Brand New Product or Service in Your Databases 236
10.3 Recipe 19: To Understand Operational Features and General Business Forecasting 241
11 Miscellaneous 244
11.1 Recipe 20: To Find Customers Who Will Potentially Churn 244
11.2 Recipe 21: Indirect Churn Based on a Discontinued Contract 249
11.3 Recipe 22: Social Media Target Group Descriptions 250
11.4 Recipe 23: Web Monitoring 254
11.5 Recipe 24: To Predict Who is Likely to Click on a Special Banner 258
12 Software and Tools: A Quick Guide 261
12.1 List of Requirements When Choosing a Data Mining Tool 261
12.2 Introduction to the Idea of Fully Automated Modelling (FAM) 265
12.2.1 Predictive Behavioural Targeting 265
12.2.2 Fully Automatic Predictive Targeting and Modelling Real-Time Online Behaviour 266
12.3 FAM Function 266
12.4 FAM Architecture 267
12.5 FAM Data Flows and Databases 268
12.6 FAM Modelling Aspects 269
12.7 FAM Challenges and Critical Success Factors 270
12.8 FAM Summary 270
13 Overviews 271
13.1 To Make Use of Official Statistics 272
13.2 How to Use Simple Maths to Make an Impression 272
13.2.1 Approximations 272
13.2.2 Absolute and Relative Values 273
13.2.3 % Change 273
13.2.4 Values in Context 273
13.2.5 Confidence Intervals 274
13.2.6 Rounding 274
13.2.7 Tables 274
13.2.8 Figures 274
13.3 Differences between Statistical Analysis and Data Mining 275
13.3.1 Assumptions 275
13.3.2 Values Missing Because 'Nothing Happened' 275
13.3.3 Sample Sizes 276
13.3.4 Goodness-of-Fit Tests 276
13.3.5 Model Complexity 277
13.4 How to Use Data Mining in Different Industries 277
13.5 Future Views 283
Bibliography 285
Index 296
Part I Data Mining Concept 1
1 Introduction 3
1.1 Aims of the Book 3
1.2 Data Mining Context 5
1.2.1 Domain Knowledge 6
1.2.2 Words to Remember 7
1.2.3 Associated Concepts 7
1.3 Global Appeal 8
1.4 Example Datasets Used in This Book 8
1.5 Recipe Structure 11
1.6 Further Reading and Resources 13
2 Data Mining Definition 14
2.1 Types of Data Mining Questions 15
2.1.1 Population and Sample 15
2.1.2 Data Preparation 16
2.1.3 Supervised and Unsupervised Methods 16
2.1.4 Knowledge-Discovery Techniques 18
2.2 Data Mining Process 19
2.3 Business Task: Clarification of the Business Question behind the
Problem 20
2.4 Data: Provision and Processing of the Required Data 21
2.4.1 Fixing the Analysis Period 22
2.4.2 Basic Unit of Interest 23
2.4.3 Target Variables 24
2.4.4 Input Variables/Explanatory Variables 24
2.5 Modelling: Analysis of the Data 25
2.6 Evaluation and Validation during the Analysis Stage 25
2.7 Application of Data Mining Results and Learning from the Experience 28
Part II Data Mining Practicalities 31
3 All about data 33
3.1 Some Basics 34
3.1.1 Data, Information, Knowledge and Wisdom 35
3.1.2 Sources and Quality of Data 36
3.1.3 Measurement Level and Types of Data 37
3.1.4 Measures of Magnitude and Dispersion 39
3.1.5 Data Distributions 41
3.2 Data Partition: Random Samples for Training, Testing and Validation 41
3.3 Types of Business Information Systems 44
3.3.1 Operational Systems Supporting Business Processes 44
3.3.2 Analysis-Based Information Systems 45
3.3.3 Importance of Information 45
3.4 Data Warehouses 47
3.4.1 Topic Orientation 47
3.4.2 Logical Integration and Homogenisation 48
3.4.3 Reference Period 48
3.4.4 Low Volatility 48
3.4.5 Using the Data Warehouse 49
3.5 Three Components of a Data Warehouse: DBMS, DB and DBCS 50
3.5.1 Database Management System (DBMS) 51
3.5.2 Database (DB) 51
3.5.3 Database Communication Systems (DBCS) 51
3.6 Data Marts 52
3.6.1 Regularly Filled Data Marts 53
3.6.2 Comparison between Data Marts and Data Warehouses 53
3.7 A Typical Example from the Online Marketing Area 54
3.8 Unique Data Marts 54
3.8.1 Permanent Data Marts 54
3.8.2 Data Marts Resulting from Complex Analysis 56
3.9 Data Mart: Do's and Don'ts 58
3.9.1 Do's and Don'ts for Processes 58
3.9.2 Do's and Don'ts for Handling 58
3.9.3 Do's and Don'ts for Coding/Programming 59
4 Data Preparation 60
4.1 Necessity of Data Preparation 61
4.2 From Small and Long to Short and Wide 61
4.3 Transformation of Variables 65
4.4 Missing Data and Imputation Strategies 66
4.5 Outliers 69
4.6 Dealing with the Vagaries of Data 70
4.6.1 Distributions 70
4.6.2 Tests for Normality 70
4.6.3 Data with Totally Different Scales 70
4.7 Adjusting the Data Distributions 71
4.7.1 Standardisation and Normalisation 71
4.7.2 Ranking 71
4.7.3 Box-Cox Transformation 71
4.8 Binning 72
4.8.1 Bucket Method 73
4.8.2 Analytical Binning for Nominal Variables 73
4.8.3 Quantiles 73
4.8.4 Binning in Practice 74
4.9 Timing Considerations 77
4.10 Operational Issues 77
5 Analytics 78
5.1 Introduction 79
5.2 Basis of Statistical Tests 80
5.2.1 Hypothesis Tests and P Values 80
5.2.2 Tolerance Intervals 82
5.2.3 Standard Errors and Confidence Intervals 83
5.3 Sampling 83
5.3.1 Methods 83
5.3.2 Sample Sizes 84
5.3.3 Sample Quality and Stability 84
5.4 Basic Statistics for Pre-analytics 85
5.4.1 Frequencies 85
5.4.2 Comparative Tests 88
5.4.3 Cross Tabulation and Contingency Tables 89
5.4.4 Correlations 90
5.4.5 Association Measures for Nominal Variables 91
5.4.6 Examples of Output from Comparative and Cross Tabulation Tests 92
5.5 Feature Selection/Reduction of Variables 96
5.5.1 Feature Reduction Using Domain Knowledge 96
5.5.2 Feature Selection Using Chi-Square 97
5.5.3 Principal Components Analysis and Factor Analysis 97
5.5.4 Canonical Correlation, PLS and SEM 98
5.5.5 Decision Trees 98
5.5.6 Random Forests 98
5.6 Time Series Analysis 99
6 Methods 102
6.1 Methods Overview 104
6.2 Supervised Learning 105
6.2.1 Introduction and Process Steps 105
6.2.2 Business Task 105
6.2.3 Provision and Processing of the Required Data 106
6.2.4 Analysis of the Data 107
6.2.5 Evaluation and Validation of the Results (during the Analysis) 108
6.2.6 Application of the Results 108
6.3 Multiple Linear Regression for use when Target is Continuous 109
6.3.1 Rationale of Multiple Linear Regression Modelling 109
6.3.2 Regression Coefficients 110
6.3.3 Assessment of the Quality of the Model 111
6.3.4 Example of Linear Regression in Practice 113
6.4 Regression when the Target is not Continuous 119
6.4.1 Logistic Regression 119
6.4.2 Example of Logistic Regression in Practice 121
6.4.3 Discriminant Analysis 126
6.4.4 Log-Linear Models and Poisson Regression 128
6.5 Decision Trees 129
6.5.1 Overview 129
6.5.2 Selection Procedures of the Relevant Input Variables 134
6.5.3 Splitting Criteria 134
6.5.4 Number of Splits (Branches of the Tree) 135
6.5.5 Symmetry/Asymmetry 135
6.5.6 Pruning 135
6.6 Neural Networks 137
6.7 Which Method Produces the Best Model? A Comparison of Regression,
Decision Trees and Neural Networks 141
6.8 Unsupervised Learning 142
6.8.1 Introduction and Process Steps 142
6.8.2 Business Task 143
6.8.3 Provision and Processing of the Required Data 143
6.8.4 Analysis of the Data 145
6.8.5 Evaluation and Validation of the Results (during the Analysis) 147
6.8.6 Application of the Results 148
6.9 Cluster Analysis 148
6.9.1 Introduction 148
6.9.2 Hierarchical Cluster Analysis 149
6.9.3 K-Means Method of Cluster Analysis 150
6.9.4 Example of Cluster Analysis in Practice 151
6.10 Kohonen Networks and Self-Organising Maps 151
6.10.1 Description 151
6.10.2 Example of SOMs in Practice 152
6.11 Group Purchase Methods: Association and Sequence Analysis 155
6.11.1 Introduction 155
6.11.2 Analysis of the Data 157
6.11.3 Group Purchase Methods 158
6.11.4 Examples of Group Purchase Methods in Practice 158
7 Validation and Application 161
7.1 Introduction to Methods for Validation 161
7.2 Lift and Gain Charts 162
7.3 Model Stability 164
7.4 Sensitivity Analysis 167
7.5 Threshold Analytics and Confusion Matrix 169
7.6 ROC Curves 170
7.7 Cross-Validation and Robustness 171
7.8 Model Complexity 172
Part III Data Mining in Action 173
8 Marketing: Prediction 175
8.1 Recipe 1: Response Optimisation: to Find and Address the Right Number
of Customers 176
8.2 Recipe 2: To Find the x% of Customers with the Highest Affinity to an
Offer 186
8.3 Recipe 3: To Find the Right Number of Customers to Ignore 187
8.4 Recipe 4: To Find the x% of Customers with the Lowest Affinity to an
Offer 190
8.5 Recipe 5: To Find the x% of Customers with the Highest Affinity to Buy
191
8.6 Recipe 6: To Find the x% of Customers with the Lowest Affinity to Buy
192
8.7 Recipe 7: To Find the x% of Customers with the Highest Affinity to a
Single Purchase 193
8.8 Recipe 8: To Find the x% of Customers with the Highest Affinity to Sign
a Long-Term Contract in Communication Areas 194
8.9 Recipe 9: To Find the x% of Customers with the Highest Affinity to Sign
a Long-Term Contract in Insurance Areas 196
9 Intra-Customer Analysis 198
9.1 Recipe 10: To Find the Optimal Amount of Single Communication to
Activate One Customer 199
9.2 Recipe 11: To Find the Optimal Communication Mix to Activate One
Customer 200
9.3 Recipe 12: To Find and Describe Homogeneous Groups of Products 206
9.4 Recipe 13: To Find and Describe Groups of Customers with Homogeneous
Usage 210
9.5 Recipe 14: To Predict the Order Size of Single Products or Product
Groups 216
9.6 Recipe 15: Product Set Combination 217
9.7 Recipe 16: To Predict the Future Customer Lifetime Value of a Customer
219
10 Learning from a Small Testing Sample and Prediction 225
10.1 Recipe 17: To Predict Demographic Signs (Like Sex, Age, Education and
Income) 225
10.2 Recipe 18: To Predict the Potential Customers of a Brand New Product
or Service in Your Databases 236
10.3 Recipe 19: To Understand Operational Features and General Business
Forecasting 241
11 Miscellaneous 244
11.1 Recipe 20: To Find Customers Who Will Potentially Churn 244
11.2 Recipe 21: Indirect Churn Based on a Discontinued Contract 249
11.3 Recipe 22: Social Media Target Group Descriptions 250
11.4 Recipe 23: Web Monitoring 254
11.5 Recipe 24: To Predict Who is Likely to Click on a Special Banner 258
12 Software and Tools: A Quick Guide 261
12.1 List of Requirements When Choosing a Data Mining Tool 261
12.2 Introduction to the Idea of Fully Automated Modelling (FAM) 265
12.2.1 Predictive Behavioural Targeting 265
12.2.2 Fully Automatic Predictive Targeting and Modelling Real-Time Online
Behaviour 266
12.3 FAM Function 266
12.4 FAM Architecture 267
12.5 FAM Data Flows and Databases 268
12.6 FAM Modelling Aspects 269
12.7 FAM Challenges and Critical Success Factors 270
12.8 FAM Summary 270
13 Overviews 271
13.1 To Make Use of Official Statistics 272
13.2 How to Use Simple Maths to Make an Impression 272
13.2.1 Approximations 272
13.2.2 Absolute and Relative Values 273
13.2.3 % Change 273
13.2.4 Values in Context 273
13.2.5 Confidence Intervals 274
13.2.6 Rounding 274
13.2.7 Tables 274
13.2.8 Figures 274
13.3 Differences between Statistical Analysis and Data Mining 275
13.3.1 Assumptions 275
13.3.2 Values Missing Because 'Nothing Happened' 275
13.3.3 Sample Sizes 276
13.3.4 Goodness-of-Fit Tests 276
13.3.5 Model Complexity 277
13.4 How to Use Data Mining in Different Industries 277
13.5 Future Views 283
Bibliography 285
Index 296
Glossary of terms xii
Part I Data Mining Concept 1
1 Introduction 3
1.1 Aims of the Book 3
1.2 Data Mining Context 5
1.2.1 Domain Knowledge 6
1.2.2 Words to Remember 7
1.2.3 Associated Concepts 7
1.3 Global Appeal 8
1.4 Example Datasets Used in This Book 8
1.5 Recipe Structure 11
1.6 Further Reading and Resources 13
2 Data Mining Definition 14
2.1 Types of Data Mining Questions 15
2.1.1 Population and Sample 15
2.1.2 Data Preparation 16
2.1.3 Supervised and Unsupervised Methods 16
2.1.4 Knowledge-Discovery Techniques 18
2.2 Data Mining Process 19
2.3 Business Task: Clarification of the Business Question behind the
Problem 20
2.4 Data: Provision and Processing of the Required Data 21
2.4.1 Fixing the Analysis Period 22
2.4.2 Basic Unit of Interest 23
2.4.3 Target Variables 24
2.4.4 Input Variables/Explanatory Variables 24
2.5 Modelling: Analysis of the Data 25
2.6 Evaluation and Validation during the Analysis Stage 25
2.7 Application of Data Mining Results and Learning from the Experience 28
Part II Data Mining Practicalities 31
3 All about data 33
3.1 Some Basics 34
3.1.1 Data, Information, Knowledge and Wisdom 35
3.1.2 Sources and Quality of Data 36
3.1.3 Measurement Level and Types of Data 37
3.1.4 Measures of Magnitude and Dispersion 39
3.1.5 Data Distributions 41
3.2 Data Partition: Random Samples for Training, Testing and Validation 41
3.3 Types of Business Information Systems 44
3.3.1 Operational Systems Supporting Business Processes 44
3.3.2 Analysis-Based Information Systems 45
3.3.3 Importance of Information 45
3.4 Data Warehouses 47
3.4.1 Topic Orientation 47
3.4.2 Logical Integration and Homogenisation 48
3.4.3 Reference Period 48
3.4.4 Low Volatility 48
3.4.5 Using the Data Warehouse 49
3.5 Three Components of a Data Warehouse: DBMS, DB and DBCS 50
3.5.1 Database Management System (DBMS) 51
3.5.2 Database (DB) 51
3.5.3 Database Communication Systems (DBCS) 51
3.6 Data Marts 52
3.6.1 Regularly Filled Data Marts 53
3.6.2 Comparison between Data Marts and Data Warehouses 53
3.7 A Typical Example from the Online Marketing Area 54
3.8 Unique Data Marts 54
3.8.1 Permanent Data Marts 54
3.8.2 Data Marts Resulting from Complex Analysis 56
3.9 Data Mart: Do's and Don'ts 58
3.9.1 Do's and Don'ts for Processes 58
3.9.2 Do's and Don'ts for Handling 58
3.9.3 Do's and Don'ts for Coding/Programming 59
4 Data Preparation 60
4.1 Necessity of Data Preparation 61
4.2 From Small and Long to Short and Wide 61
4.3 Transformation of Variables 65
4.4 Missing Data and Imputation Strategies 66
4.5 Outliers 69
4.6 Dealing with the Vagaries of Data 70
4.6.1 Distributions 70
4.6.2 Tests for Normality 70
4.6.3 Data with Totally Different Scales 70
4.7 Adjusting the Data Distributions 71
4.7.1 Standardisation and Normalisation 71
4.7.2 Ranking 71
4.7.3 Box-Cox Transformation 71
4.8 Binning 72
4.8.1 Bucket Method 73
4.8.2 Analytical Binning for Nominal Variables 73
4.8.3 Quantiles 73
4.8.4 Binning in Practice 74
4.9 Timing Considerations 77
4.10 Operational Issues 77
5 Analytics 78
5.1 Introduction 79
5.2 Basis of Statistical Tests 80
5.2.1 Hypothesis Tests and P Values 80
5.2.2 Tolerance Intervals 82
5.2.3 Standard Errors and Confidence Intervals 83
5.3 Sampling 83
5.3.1 Methods 83
5.3.2 Sample Sizes 84
5.3.3 Sample Quality and Stability 84
5.4 Basic Statistics for Pre-analytics 85
5.4.1 Frequencies 85
5.4.2 Comparative Tests 88
5.4.3 Cross Tabulation and Contingency Tables 89
5.4.4 Correlations 90
5.4.5 Association Measures for Nominal Variables 91
5.4.6 Examples of Output from Comparative and Cross Tabulation Tests 92
5.5 Feature Selection/Reduction of Variables 96
5.5.1 Feature Reduction Using Domain Knowledge 96
5.5.2 Feature Selection Using Chi-Square 97
5.5.3 Principal Components Analysis and Factor Analysis 97
5.5.4 Canonical Correlation, PLS and SEM 98
5.5.5 Decision Trees 98
5.5.6 Random Forests 98
5.6 Time Series Analysis 99
6 Methods 102
6.1 Methods Overview 104
6.2 Supervised Learning 105
6.2.1 Introduction and Process Steps 105
6.2.2 Business Task 105
6.2.3 Provision and Processing of the Required Data 106
6.2.4 Analysis of the Data 107
6.2.5 Evaluation and Validation of the Results (during the Analysis) 108
6.2.6 Application of the Results 108
6.3 Multiple Linear Regression for use when Target is Continuous 109
6.3.1 Rationale of Multiple Linear Regression Modelling 109
6.3.2 Regression Coefficients 110
6.3.3 Assessment of the Quality of the Model 111
6.3.4 Example of Linear Regression in Practice 113
6.4 Regression when the Target is not Continuous 119
6.4.1 Logistic Regression 119
6.4.2 Example of Logistic Regression in Practice 121
6.4.3 Discriminant Analysis 126
6.4.4 Log-Linear Models and Poisson Regression 128
6.5 Decision Trees 129
6.5.1 Overview 129
6.5.2 Selection Procedures of the Relevant Input Variables 134
6.5.3 Splitting Criteria 134
6.5.4 Number of Splits (Branches of the Tree) 135
6.5.5 Symmetry/Asymmetry 135
6.5.6 Pruning 135
6.6 Neural Networks 137
6.7 Which Method Produces the Best Model? A Comparison of Regression,
Decision Trees and Neural Networks 141
6.8 Unsupervised Learning 142
6.8.1 Introduction and Process Steps 142
6.8.2 Business Task 143
6.8.3 Provision and Processing of the Required Data 143
6.8.4 Analysis of the Data 145
6.8.5 Evaluation and Validation of the Results (during the Analysis) 147
6.8.6 Application of the Results 148
6.9 Cluster Analysis 148
6.9.1 Introduction 148
6.9.2 Hierarchical Cluster Analysis 149
6.9.3 K-Means Method of Cluster Analysis 150
6.9.4 Example of Cluster Analysis in Practice 151
6.10 Kohonen Networks and Self-Organising Maps 151
6.10.1 Description 151
6.10.2 Example of SOMs in Practice 152
6.11 Group Purchase Methods: Association and Sequence Analysis 155
6.11.1 Introduction 155
6.11.2 Analysis of the Data 157
6.11.3 Group Purchase Methods 158
6.11.4 Examples of Group Purchase Methods in Practice 158
7 Validation and Application 161
7.1 Introduction to Methods for Validation 161
7.2 Lift and Gain Charts 162
7.3 Model Stability 164
7.4 Sensitivity Analysis 167
7.5 Threshold Analytics and Confusion Matrix 169
7.6 ROC Curves 170
7.7 Cross-Validation and Robustness 171
7.8 Model Complexity 172
Part III Data Mining in Action 173
8 Marketing: Prediction 175
8.1 Recipe 1: Response Optimisation: to Find and Address the Right Number
of Customers 176
8.2 Recipe 2: To Find the x% of Customers with the Highest Affinity to an
Offer 186
8.3 Recipe 3: To Find the Right Number of Customers to Ignore 187
8.4 Recipe 4: To Find the x% of Customers with the Lowest Affinity to an
Offer 190
8.5 Recipe 5: To Find the x% of Customers with the Highest Affinity to Buy
191
8.6 Recipe 6: To Find the x% of Customers with the Lowest Affinity to Buy
192
8.7 Recipe 7: To Find the x% of Customers with the Highest Affinity to a
Single Purchase 193
8.8 Recipe 8: To Find the x% of Customers with the Highest Affinity to Sign
a Long-Term Contract in Communication Areas 194
8.9 Recipe 9: To Find the x% of Customers with the Highest Affinity to Sign
a Long-Term Contract in Insurance Areas 196
9 Intra-Customer Analysis 198
9.1 Recipe 10: To Find the Optimal Amount of Single Communication to
Activate One Customer 199
9.2 Recipe 11: To Find the Optimal Communication Mix to Activate One
Customer 200
9.3 Recipe 12: To Find and Describe Homogeneous Groups of Products 206
9.4 Recipe 13: To Find and Describe Groups of Customers with Homogeneous
Usage 210
9.5 Recipe 14: To Predict the Order Size of Single Products or Product
Groups 216
9.6 Recipe 15: Product Set Combination 217
9.7 Recipe 16: To Predict the Future Customer Lifetime Value of a Customer
219
10 Learning from a Small Testing Sample and Prediction 225
10.1 Recipe 17: To Predict Demographic Signs (Like Sex, Age, Education and
Income) 225
10.2 Recipe 18: To Predict the Potential Customers of a Brand New Product
or Service in Your Databases 236
10.3 Recipe 19: To Understand Operational Features and General Business
Forecasting 241
11 Miscellaneous 244
11.1 Recipe 20: To Find Customers Who Will Potentially Churn 244
11.2 Recipe 21: Indirect Churn Based on a Discontinued Contract 249
11.3 Recipe 22: Social Media Target Group Descriptions 250
11.4 Recipe 23: Web Monitoring 254
11.5 Recipe 24: To Predict Who is Likely to Click on a Special Banner 258
12 Software and Tools: A Quick Guide 261
12.1 List of Requirements When Choosing a Data Mining Tool 261
12.2 Introduction to the Idea of Fully Automated Modelling (FAM) 265
12.2.1 Predictive Behavioural Targeting 265
12.2.2 Fully Automatic Predictive Targeting and Modelling Real-Time Online
Behaviour 266
12.3 FAM Function 266
12.4 FAM Architecture 267
12.5 FAM Data Flows and Databases 268
12.6 FAM Modelling Aspects 269
12.7 FAM Challenges and Critical Success Factors 270
12.8 FAM Summary 270
13 Overviews 271
13.1 To Make Use of Official Statistics 272
13.2 How to Use Simple Maths to Make an Impression 272
13.2.1 Approximations 272
13.2.2 Absolute and Relative Values 273
13.2.3 % Change 273
13.2.4 Values in Context 273
13.2.5 Confidence Intervals 274
13.2.6 Rounding 274
13.2.7 Tables 274
13.2.8 Figures 274
13.3 Differences between Statistical Analysis and Data Mining 275
13.3.1 Assumptions 275
13.3.2 Values Missing Because 'Nothing Happened' 275
13.3.3 Sample Sizes 276
13.3.4 Goodness-of-Fit Tests 276
13.3.5 Model Complexity 277
13.4 How to Use Data Mining in Different Industries 277
13.5 Future Views 283
Bibliography 285
Index 296