Anco Hundepool, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Eric Schulte Nordholt, Keith Spicer, Peter-Paul De Wolf
Statistical Disclosure Control
By Anco Hundepool, Josep Domingo-Ferrer, Luisa Franconi et al.
- Hardcover
A reference to answer all your statistical confidentiality questions.
This handbook provides technical guidance on statistical disclosure control and on how to balance the need to provide users with statistical outputs against the need to protect the confidentiality of respondents. Statistical disclosure control is combined with administrative, legal and IT tools to define a proper data dissemination strategy based on a risk management approach.
The key concepts of statistical disclosure control are presented, along with the methodology and software that can be used to apply various methods of statistical disclosure control. Numerous examples and guidelines are also featured to illustrate the topics covered.
Statistical Disclosure Control:
- Presents a combination of both theoretical and practical solutions.
- Introduces all the key concepts and definitions involved with statistical disclosure control.
- Provides a high-level overview of how to approach problems associated with confidentiality.
- Provides a broad-ranging review of the methods available to control disclosure.
- Explains the subtleties of group disclosure control.
- Features examples throughout the book along with case studies demonstrating how particular methods are used.
- Discusses microdata, magnitude and frequency tabular data, and remote access issues.
- Written by experts within leading National Statistical Institutes.
Official statisticians, academics and market researchers who need to be informed and make decisions on disclosure limitation will benefit from this book.
Product details
- Wiley Series in Survey Methodology
- Publisher: Wiley & Sons
- 1st edition
- Pages: 304
- Publication date: 17 September 2012
- Language: English
- Dimensions: 244mm x 161mm x 20mm
- Weight: 528g
- ISBN-13: 9781119978152
- ISBN-10: 1119978157
- Item no.: 35259463
- Manufacturer information
- Libri GmbH
- Europaallee 1
- 36244 Bad Hersfeld
- 06621 890
Anco Hundepool, Statistics Netherlands, The Netherlands. Josep Domingo-Ferrer, Universitat Rovira i Virgili, Spain. Luisa Franconi, Head of Unit on Statistical Disclosure Control Methods, ISTAT, Italy. Sarah Giessing, Federal Statistical Office of Germany, Germany. Keith Spicer, Office for National Statistics, Portsmouth, UK. Eric Schulte Nordholt, Senior researcher and project leader at Statistics Netherlands, The Netherlands. Peter-Paul De Wolf, Methodologist at National Institute of Statistics, The Netherlands.
Preface
Acknowledgements
1 Introduction
1.1 Concepts and definitions
1.1.1 Disclosure
1.1.2 Statistical disclosure control
1.1.3 Tabular data
1.1.4 Microdata
1.1.5 Risk and utility
1.2 An approach to Statistical Disclosure Control
1.2.1 Why is confidentiality protection needed?
1.2.2 What are the key characteristics and uses of the data?
1.2.3 What disclosure risks need to be protected against?
1.2.4 Disclosure control methods
1.2.5 Implementation
1.3 The chapters of the handbook
2 Ethics, principles, guidelines and regulations - a general background
2.1 Introduction
2.2 Ethical codes and the new ISI code
2.2.1 ISI Declaration on Professional Ethics
2.2.2 New ISI Declaration on Professional Ethics
2.2.3 European Statistics Code of Practice
2.3 UNECE principles and guidelines
2.3.1 UNECE Principles and Guidelines on Confidentiality Aspects of Data Integration
2.3.2 Future activities on the UNECE principles and guidelines
2.4 Laws
2.4.1 Committee on Statistical Confidentiality
2.4.2 European Statistical System Committee
3 Microdata
3.1 Introduction
3.2 Microdata concepts
3.2.1 Stage 1: Assess need for confidentiality protection
3.2.2 Stage 2: Key characteristics and use of microdata
3.2.3 Stage 3: Disclosure risk
3.2.4 Stage 4: Disclosure control methods
3.2.5 Stage 5: Implementation
3.3 Definitions of disclosure
3.3.1 Definitions of disclosure scenarios
3.4 Definitions of disclosure risk
3.4.1 Disclosure risk for categorical quasi-identifiers
3.4.2 Notation and assumptions
3.4.3 Disclosure risk for continuous quasi-identifiers
3.5 Estimating re-identification risk
3.5.1 Individual risk based on the sample: Threshold rule
3.5.2 Estimating individual risk using sampling weights
3.5.3 Estimating individual risk by Poisson model
3.5.4 Further models that borrow information from other sources
3.5.5 Estimating per record risk via heuristics
3.5.6 Assessing risk via record linkage
3.6 Non-perturbative microdata masking
3.6.1 Sampling
3.6.2 Global recoding
3.6.3 Top and bottom coding
3.6.4 Local suppression
3.7 Perturbative microdata masking
3.7.1 Additive noise masking
3.7.2 Multiplicative noise masking
3.7.3 Microaggregation
3.7.4 Data swapping and rank swapping
3.7.5 Data shuffling
3.7.6 Rounding
3.7.7 Re-sampling
3.7.8 PRAM
3.7.9 MASSC
3.8 Synthetic and hybrid data
3.8.1 Fully synthetic data
3.8.2 Partially synthetic data
3.8.3 Hybrid data
3.8.4 Pros and cons of synthetic and hybrid data
3.9 Information loss in microdata
3.9.1 Information loss measures for continuous data
3.9.2 Information loss measures for categorical data
3.10 Release of multiple files from the same microdata set
3.11 Software
3.11.1 μ-ARGUS
3.11.2 sdcMicro
3.11.3 IVEware
3.12 Case studies
3.12.1 Microdata files at Statistics Netherlands
3.12.2 The European Labour Force Survey microdata for research purposes
3.12.3 The European Structure of Earnings Survey microdata for research purposes
3.12.4 NHIS-linked mortality data public use file, USA
3.12.5 Other real case instances
4 Magnitude tabular data
4.1 Introduction
4.1.1 Magnitude tabular data: Basic terminology
4.1.2 Complex tabular data structures: Hierarchical and linked tables
4.1.3 Risk concepts
4.1.4 Protection concepts
4.1.5 Information loss concepts
4.1.6 Implementation: Software, guidelines and case study
4.2 Disclosure risk assessment I: Primary sensitive cells
4.2.1 Intruder scenarios
4.2.2 Sensitivity rules
4.3 Disclosure risk assessment II: Secondary risk assessment
4.3.1 Feasibility interval
4.3.2 Protection level
4.3.3 Singleton and multi cell disclosure
4.3.4 Risk models for hierarchical and linked tables
4.4 Non-perturbative protection methods
4.4.1 Global recoding
4.4.2 The concept of cell suppression
4.4.3 Algorithms for secondary cell suppression
4.4.4 Secondary cell suppression in hierarchical and linked tables
4.5 Perturbative protection methods
4.5.1 A pre-tabular method: Multiplicative noise
4.5.2 A post-tabular method: Controlled tabular adjustment
4.6 Information loss measures for tabular data
4.6.1 Cell costs for cell suppression
4.6.2 Cell costs for CTA
4.6.3 Information loss measures to evaluate the outcome of table protection
4.7 Software for tabular data protection
4.7.1 Empirical comparison of cell suppression algorithms
4.8 Guidelines: Setting up an efficient table model systematically
4.8.1 Defining spanning variables
4.8.2 Response variables and mapping rules
4.9 Case studies
4.9.1 Response variables and mapping rules of the case study
4.9.2 Spanning variables of the case study
4.9.3 Analysing the tables of the case study
4.9.4 Software issues of the case study
5 Frequency tables
5.1 Introduction
5.2 Disclosure risks
5.2.1 Individual attribute disclosure
5.2.2 Group attribute disclosure
5.2.3 Disclosure by differencing
5.2.4 Perception of disclosure risk
5.3 Methods
5.3.1 Pre-tabular
5.3.2 Table re-design
5.3.3 Post-tabular
5.4 Post-tabular methods
5.4.1 Cell suppression
5.4.2 ABS cell perturbation
5.4.3 Rounding
5.5 Information loss
5.6 Software
5.6.1 Introduction
5.6.2 Optimal, first feasible and RAPID solutions
5.6.3 Protection provided by controlled rounding
5.7 Case studies
5.7.1 UK Census
5.7.2 Australian and New Zealand Censuses
6 Data access issues
6.1 Introduction
6.2 Research data centres
6.3 Remote execution
6.4 Remote access
6.5 Licensing
6.6 Guidelines on output checking
6.6.1 Introduction
6.6.2 General approach
6.6.3 Rules for output checking
6.6.4 Organisational/procedural aspects of output checking
6.6.5 Researcher training
6.7 Additional issues concerning data access
6.7.1 Examples of disclaimers
6.7.2 Output description
6.8 Case studies
6.8.1 The US Census Bureau Microdata Analysis System
6.8.2 Remote access at Statistics Netherlands
Glossary
References
Author index
Subject index