David Taniar, Clement H. C. Leung, Wenny Rahayu
High-Performance Parallel DP &
By David Taniar, Clement H. C. Taniar, Wenny Rahayu et al.
David Taniar, Clement H. C. Leung, Wenny Rahayu
High-Performance Parallel DP &
By David Taniar, Clement H. C. Taniar, Wenny Rahayu et al.
- Gebundenes Buch
- Merkliste
- Auf die Merkliste
- Bewerten Bewerten
- Teilen
- Produkt teilen
- Produkterinnerung
- Produkterinnerung
The latest techniques and principles of parallel and grid database processing
The growth in grid databases, coupled with the utility of parallel query processing, presents an important opportunity to understand and utilize high-performance parallel database processing within a major database management system (DBMS). This important new book provides readers with a fundamental understanding of parallelism in data-intensive applications, and demonstrates how to develop faster capabilities to support them. It presents a balanced treatment of the theoretical and practical aspects of…mehr
Andere Kunden interessierten sich auch für
- Vittorio Castelli / Lawrence D. Bergman (Hgg.)Image Databases236,99 €
- Len SilverstonThe Data Model Resource Book, Volume 179,99 €
- Len SilverstonThe Data Model Resource Book, Volume 285,99 €
- Tamraparni DasuExploratory Data Mining and Data Cleaning178,99 €
- Tapio LahdenmakiRelational Database Index Design and the Optimizers154,99 €
- Paulraj PonniahData Modeling Fundamentals185,99 €
- Ralph KimballThe Data Warehouse ETL Toolkit49,99 €
-
-
-
The latest techniques and principles of parallel and grid database processing
The growth in grid databases, coupled with the utility of parallel query processing, presents an important opportunity to understand and utilize high-performance parallel database processing within a major database management system (DBMS). This important new book provides readers with a fundamental understanding of parallelism in data-intensive applications, and demonstrates how to develop faster capabilities to support them. It presents a balanced treatment of the theoretical and practical aspects of high-performance databases to demonstrate how parallel query is executed in a DBMS, including concepts, algorithms, analytical models, and grid transactions.
High-Performance Parallel Database Processing and Grid Databases serves as a valuable resource for researchers working in parallel databases and for practitioners interested in building a high-performance database. It is also a much-needed, self-contained textbook for database courses at the advanced undergraduate and graduate levels.
Hinweis: Dieser Artikel kann nur an eine deutsche Lieferadresse ausgeliefert werden.
The growth in grid databases, coupled with the utility of parallel query processing, presents an important opportunity to understand and utilize high-performance parallel database processing within a major database management system (DBMS). This important new book provides readers with a fundamental understanding of parallelism in data-intensive applications, and demonstrates how to develop faster capabilities to support them. It presents a balanced treatment of the theoretical and practical aspects of high-performance databases to demonstrate how parallel query is executed in a DBMS, including concepts, algorithms, analytical models, and grid transactions.
High-Performance Parallel Database Processing and Grid Databases serves as a valuable resource for researchers working in parallel databases and for practitioners interested in building a high-performance database. It is also a much-needed, self-contained textbook for database courses at the advanced undergraduate and graduate levels.
Hinweis: Dieser Artikel kann nur an eine deutsche Lieferadresse ausgeliefert werden.
Produktdetails
- Produktdetails
- Wiley Series on Parallel and Distributed Computing
- Verlag: Wiley & Sons
- 1. Auflage
- Seitenzahl: 576
- Erscheinungstermin: 1. September 2008
- Englisch
- Abmessung: 240mm x 161mm x 35mm
- Gewicht: 906g
- ISBN-13: 9780470107621
- ISBN-10: 0470107626
- Artikelnr.: 23877698
- Herstellerkennzeichnung
- Libri GmbH
- Europaallee 1
- 36244 Bad Hersfeld
- 06621 890
- Wiley Series on Parallel and Distributed Computing
- Verlag: Wiley & Sons
- 1. Auflage
- Seitenzahl: 576
- Erscheinungstermin: 1. September 2008
- Englisch
- Abmessung: 240mm x 161mm x 35mm
- Gewicht: 906g
- ISBN-13: 9780470107621
- ISBN-10: 0470107626
- Artikelnr.: 23877698
- Herstellerkennzeichnung
- Libri GmbH
- Europaallee 1
- 36244 Bad Hersfeld
- 06621 890
David Taniar, PhD, lectures in information technology at Monash University, Australia. Dr. Taniar has published extensively in the field of high- performance parallel databases and is the Editor in Chief of the International Journal of Data Warehousing and Mining. Clement H. C. Leung, PhD, is Foundation Chair in Computer Science at Victoria University, Australia. Dr. Leung previously held the Established Chair in Computer Science at the University of London. Wenny Rahayu, PhD, is Associate Professor at La Trobe University, Australia, and actively works in the areas of database design and implementation, covering object-relational databases and Web databases. Sushant Goel, PhD, is a software consultant and holds a PhD in computer systems engineering from RMIT University, Australia. His research interests are in grid transaction management and software development processes, such as agile computing.
Preface xv
Part I Introduction
1. Introduction 3
1.1. A Brief Overview: Parallel Databases and Grid Databases 4
1.2. Parallel Query Processing: Motivations 5
1.3. Parallel Query Processing: Objectives 7
1.3.1. Speed Up 7
1.3.2. Scale Up 8
1.3.3. Parallel Obstacles 10
1.4. Forms of Parallelism 12
1.4.1. Interquery Parallelism 13
1.4.2. Intraquery Parallelism 14
1.4.3. Intraoperation Parallelism 15
1.4.4. Interoperation Parallelism 15
1.4.5. Mixed Parallelism-A More Practical Solution 18
1.5. Parallel Database Architectures 19
1.5.1. Shared-Memory and Shared-Disk Architectures 20
1.5.2. Shared-Nothing Architecture 22
1.5.3. Shared-Something Architecture 23
1.5.4. Interconnection Networks 24
1.6. Grid Database Architecture 26
1.7. Structure of this Book 29
1.8. Summary 30
1.9. Bibliographical Notes 30
1.10. Exercises 31
2. Analytical Models 33
2.1. Cost Models 33
2.2. Cost Notations 34
2.2.1. Data Parameters 34
2.2.2. Systems Parameters 36
2.2.3. Query Parameters 37
2.2.4. Time Unit Costs 37
2.2.5. Communication Costs 38
2.3. Skew Model 39
2.4. Basic Operations in Parallel Databases 43
2.4.1. Disk Operations 44
2.4.2. Main Memory Operations 45
2.4.3. Data Computation and Data Distribution 45
2.5. Summary 47
2.6. Bibliographical Notes 47
2.7. Exercises 47
Part II Basic Query Parallelism
3. Parallel Search 51
3.1. Search Queries 51
3.1.1. Exact-Match Search 52
3.1.2. Range Search Query 53
3.1.3. Multiattribute Search Query 54
3.2. Data Partitioning 54
3.2.1. Basic Data Partitioning 55
3.2.2. Complex Data Partitioning 60
3.3. Search Algorithms 69
3.3.1. Serial Search Algorithms 69
3.3.2. Parallel Search Algorithms 73
3.4. Summary 74
3.5. Bibliographical Notes 75
3.6. Exercises 75
4. Parallel Sort and GroupBy 77
4.1. Sorting, Duplicate Removal, and Aggregate Queries 78
4.1.1. Sorting and Duplicate Removal 78
4.1.2. Scalar Aggregate 79
4.1.3. GroupBy 80
4.2. Serial External Sorting Method 80
4.3. Algorithms for Parallel External Sort 83
4.3.1. Parallel Merge-All Sort 83
4.3.2. Parallel Binary-Merge Sort 85
4.3.3. Parallel Redistribution Binary-Merge Sort 86
4.3.4. Parallel Redistribution Merge-All Sort 88
4.3.5. Parallel Partitioned Sort 90
4.4. Parallel Algorithms for GroupBy Queries 92
4.4.1. Traditional Methods (Merge-All and Hierarchical Merging) 92
4.4.2. Two-Phase Method 93
4.4.3. Redistribution Method 94
4.5. Cost Models for Parallel Sort 96
4.5.1. Cost Models for Serial External Merge-Sort 96
4.5.2. Cost Models for Parallel Merge-All Sort 98
4.5.3. Cost Models for Parallel Binary-Merge Sort 100
4.5.4. Cost Models for Parallel Redistribution Binary-Merge Sort 101
4.5.5. Cost Models for Parallel Redistribution Merge-All Sort 102
4.5.6. Cost Models for Parallel Partitioned Sort 103
4.6. Cost Models for Parallel GroupBy 104
4.6.1. Cost Models for Parallel Two-Phase Method 104
4.6.2. Cost Models for Parallel Redistribution Method 107
4.7. Summary 109
4.8. Bibliographical Notes 110
4.9. Exercises 110
5. Parallel Join 112
5.1. Join Operations 112
5.2. Serial Join Algorithms 114
5.2.1. Nested-Loop Join Algorithm 114
5.2.2. Sort-Merge Join Algorithm 116
5.2.3. Hash-Based Join Algorithm 117
5.2.4. Comparison 120
5.3. Parallel Join Algorithms 120
5.3.1. Divide and Broadcast-Based Parallel Join Algorithms 121
5.3.2. Disjoint Partitioning-Based Parallel Join Algorithms 124
5.4. Cost Models 128
5.4.1. Cost Models for Divide and Broadcast 128
5.4.2. Cost Models for Disjoint Partitioning 129
5.4.3. Cost Models for Local Join 130
5.5. Parallel Join Optimization 132
5.5.1. Optimizing Main Memory 132
5.5.2. Load Balancing 133
5.6. Summary 134
5.7. Bibliographical Notes 135
5.8. Exercises 136
Part III Advanced Parallel Query Processing
6. Parallel GroupBy-Join 141
6.1. Groupby-Join Queries 141
6.1.1. Groupby Before Join 142
6.1.2. Groupby After Join 142
6.2. Parallel Algorithms for Groupby-Before-Join Query Processing 143
6.2.1. Early Distribution Scheme 143
6.2.2. Early GroupBy with Partitioning Scheme 145
6.2.3. Early GroupBy with Replication Scheme 146
6.3. Parallel Algorithms for Groupby-After-Join Query Processing 148
6.3.1. Join Partitioning Scheme 148
6.3.2. GroupBy Partitioning Scheme 150
6.4. Cost Model Notations 151
6.5. Cost Model for Groupby-Before-Join Query Processing 153
6.5.1. Cost Models for the Early Distribution Scheme 153
6.5.2. Cost Models for the Early GroupBy with Partitioning Scheme 156
6.5.3. Cost Models for the Early GroupBy with Replication Scheme 158
6.6. Cost Model for "Groupby-After-Join" Query Processing 159
6.6.1. Cost Models for the Join Partitioning Scheme 159
6.6.2. Cost Models for the GroupBy Partitioning Scheme 161
6.7. Summary 163
6.8. Bibliographical Notes 164
6.9. Exercises 164
7. Parallel Indexing 167
7.1. Parallel Indexing-an Internal Perspective on Parallel Indexing
Structures 168
7.2. Parallel Indexing Structures 169
7.2.1. Nonreplicated Indexing (NRI) Structures 169
7.2.2. Partially Replicated Indexing (PRI) Structures 171
7.2.3. Fully Replicated Indexing (FRI) Structures 178
7.3. Index Maintenance 180
7.3.1. Maintaining a Parallel Nonreplicated Index 182
7.3.2. Maintaining a Parallel Partially Replicated Index 182
7.3.3. Maintaining a Parallel Fully Replicated Index 188
7.3.4. Complexity Degree of Index Maintenance 188
7.4. Index Storage Analysis 188
7.4.1. Storage Cost Models for Uniprocessors 189
7.4.2. Storage Cost Models for Parallel Processors 191
7.5. Parallel Processing of Search Queries using Index 192
7.5.1. Parallel One-Index Search Query Processing 192
7.5.2. Parallel Multi-Index Search Query Processing 195
7.6. Parallel Index Join Algorithms 200
7.6.1. Parallel One-Index Join 200
7.6.2. Parallel Two-Index Join 203
7.7. Comparative Analysis 207
7.7.1. Comparative Analysis of Parallel Search Index 207
7.7.2. Comparative Analysis of Parallel Index Join 213
7.8. Summary 216
7.9. Bibliographical Notes 217
7.10. Exercises 217
8. Parallel Universal Qualification-Collection Join Queries 219
8.1. Universal Quantification and Collection Join 220
8.2. Collection Types and Collection Join Queries 222
8.2.1. Collection-Equi Join Queries 222
8.2.2. Collection-Intersect Join Queries 223
8.2.3. Subcollection Join Queries 224
8.3. Parallel Algorithms for Collection Join Queries 225
8.4. Parallel Collection-Equi Join Algorithms 225
8.4.1. Disjoint Data Partitioning 226
8.4.2. Parallel Double Sort-Merge Collection-Equi Join Algorithm 227
8.4.3. Parallel Sort-Hash Collection-Equi Join Algorithm 228
8.4.4. Parallel Hash Collection-Equi Join Algorithm 232
8.5. Parallel Collection-Intersect Join Algorithms 233
8.5.1. Non-Disjoint Data Partitioning 234
8.5.2. Parallel Sort-Merge Nested-Loop Collection-Intersect Join Algorithm
244
8.5.3. Parallel Sort-Hash Collection-Intersect Join Algorithm 245
8.5.4. Parallel Hash Collection-Intersect Join Algorithm 246
8.6. Parallel Subcollection Join Algorithms 246
8.6.1. Data Partitioning 247
8.6.2. Parallel Sort-Merge Nested-Loop Subcollection Join Algorithm 248
8.6.3. Parallel Sort-Hash Subcollection Join Algorithm 249
8.6.4. Parallel Hash Subcollection Join Algorithm 251
8.7. Summary 252
8.8. Bibliographical Notes 252
8.9. Exercises 254
9. Parallel Query Scheduling and Optimization 256
9.1. Query Execution Plan 257
9.2. Subqueries Execution Scheduling Strategies 259
9.2.1. Serial Execution Among Subqueries 259
9.2.2. Parallel Execution Among Subqueries 261
9.3. Serial vs. Parallel Execution Scheduling 264
9.3.1. Nonskewed Subqueries 264
9.3.2. Skewed Subqueries 265
9.3.3. Skewed and Nonskewed Subqueries 267
9.4. Scheduling Rules 269
9.5. Cluster Query Processing Model 270
9.5.1. Overview of Dynamic Query Processing 271
9.5.2. A Cluster Query Processing Architecture 272
9.5.3. Load Information Exchange 273
9.6. Dynamic Cluster Query Optimization 275
9.6.1. Correction 276
9.6.2. Migration 280
9.6.3. Partition 281
9.7. Other Approaches to Dynamic Query Optimization 284
9.8. Summary 285
9.9. Bibliographical Notes 286
9.10. Exercises 286
Part IV Grid Databases
10. Transactions in Distributed and Grid Databases 291
10.1. Grid Database Challenges 292
10.2. Distributed Database Systems and Multidatabase Systems 293
10.2.1. Distributed Database Systems 293
10.2.2. Multidatabase Systems 297
10.3. Basic Definitions on Transaction Management 299
10.4. Acid Properties of Transactions 301
10.5. Transaction Management in Various Database Systems 303
10.5.1. Transaction Management in Centralized and Homogeneous Distributed
Database Systems 303
10.5.2. Transaction Management in Heterogeneous Distributed Database
Systems 305
10.6. Requirements in Grid Database Systems 307
10.7. Concurrency Control Protocols 309
10.8. Atomic Commit Protocols 310
10.8.1. Homogeneous Distributed Database Systems 310
10.8.2. Heterogeneous Distributed Database Systems 313
10.9. Replica Synchronization Protocols 314
10.9.1. Network Partitioning 315
10.9.2. Replica Synchronization Protocols 316
10.10. Summary 318
10.11. Bibliographical Notes 318
10.12. Exercises 319
11. Grid Concurrency Control 321
11.1. A Grid Database Environment 321
11.2. An Example 322
11.3. Grid Concurrency Control 324
11.3.1. Basic Functions Required by GCC 324
11.3.2. Grid Serializability Theorem 325
11.3.3. Grid Concurrency Control Protocol 329
11.3.4. Revisiting the Earlier Example 333
11.3.5. Comparison with Traditional Concurrency Control Protocols 334
11.4. Correctness of GCC Protocol 336
11.5. Features of GCC Protocol 338
11.6. Summary 339
11.7. Bibliographical Notes 339
11.8. Exercises 339
12. Grid Transaction Atomicity and Durability 341
12.1. Motivation 342
12.2. Grid Atomic Commit Protocol (Grid-ACP) 343
12.2.1. State Diagram of Grid-ACP 343
12.2.2. Grid-ACP Algorithm 344
12.2.3. Early-Abort Grid-ACP 346
12.2.4. Discussion 348
12.2.5. Message and Time Complexity Comparison Analysis 349
12.2.6. Correctness of Grid-ACP 350
12.3. Handling Failure of Sites with Grid-ACP 351
12.3.1. Model for Storing Log Files at the Originator and Participating
Sites 351
12.3.2. Logs Required at the Originator Site 352
12.3.3. Logs Required at the Participant Site 353
12.3.4. Failure Recovery Algorithm for Grid-ACP 353
12.3.5. Comparison of Recovery Protocols 359
12.3.6. Correctness of Recovery Algorithm 361
12.4. Summary 365
12.5. Bibliographical Notes 366
12.6. Exercises 366
13. Replica Management in Grids 367
13.1. Motivation 367
13.2. Replica Architecture 368
13.2.1. High-Level Replica Management Architecture 368
13.2.2. Some Problems 369
13.3. Grid Replica Access Protocol (GRAP) 371
13.3.1. Read Transaction Operation for GRAP 371
13.3.2. Write Transaction Operation for GRAP 372
13.3.3. Revisiting the Example Problem 375
13.3.4. Correctness of GRAP 377
13.4. Handling Multiple Partitioning 378
13.4.1. Contingency GRAP 378
13.4.2. Comparison of Replica Management Protocols 381
13.4.3. Correctness of Contingency GRAP 383
13.5. Summary 384
13.6. Bibliographical Notes 385
13.7. Exercises 385
14. Grid Atomic Commitment in Replicated Data 387
14.1. Motivation 388
14.1.1. Architectural Reasons 388
14.1.2. Motivating Example 388
14.2. Modified Grid Atomic Commitment Protocol 390
14.2.1. Modified Grid-ACP 390
14.2.2. Correctness of Modified Grid-ACP 393
14.3. Transaction Properties in Replicated Environment 395
14.4. Summary 397
14.5. Bibliographical Notes 397
14.6. Exercises 398
Part V Other Data-Intensive Applications
15. Parallel Online Analytic Processing (OLAP) and Business Intelligence
401
15.1. Parallel Multidimensional Analysis 402
15.2. Parallelization of ROLLUP Queries 405
15.2.1. Analysis of Basic Single ROLLUP Queries 405
15.2.2. Analysis of Multiple ROLLUP Queries 409
15.2.3. Analysis of Partial ROLLUP Queries 411
15.2.4. Parallelization Without Using ROLLUP 412
15.3. Parallelization of CUBE Queries 412
15.3.1. Analysis of Basic CUBE Queries 413
15.3.2. Analysis of Partial CUBE Queries 416
15.3.3. Parallelization Without Using CUBE 417
15.4. Parallelization of Top-N and Ranking Queries 418
15.5. Parallelization of Cume_Dist Queries 419
15.6. Parallelization of NTILE and Histogram Queries 420
15.7. Parallelization of Moving Average and Windowing Queries 422
15.8. Summary 424
15.9. Bibliographical Notes 424
15.10. Exercises 425
16. Parallel Data Mining-Association Rules and Sequential Patterns 427
16.1. From Databases To Data Warehousing To Data Mining: A Journey 428
16.2. Data Mining: A Brief Overview 431
16.2.1. Data Mining Tasks 431
16.2.2. Querying vs. Mining 433
16.2.3. Parallelism in Data Mining 436
16.3. Parallel Association Rules 440
16.3.1. Association Rules: Concepts 441
16.3.2. Association Rules: Processes 444
16.3.3. Association Rules: Parallel Processing 448
16.4. Parallel Sequential Patterns 450
16.4.1. Sequential Patterns: Concepts 452
16.4.2. Sequential Patterns: Processes 456
16.4.3. Sequential Patterns: Parallel Processing 459
16.5. Summary 461
16.6. Bibliographical Notes 461
16.7. Exercises 462
17. Parallel Clustering and Classification 464
17.1. Clustering and Classification 464
17.1.1. Clustering 464
17.1.2. Classification 465
17.2. Parallel Clustering 467
17.2.1. Clustering: Concepts 467
17.2.2. k-Means Algorithm 468
17.2.3. Parallel k-Means Clustering 471
17.3. Parallel Classification 477
17.3.1. Decision Tree Classification: Structures 477
17.3.2. Decision Tree Classification: Processes 480
17.3.3. Decision Tree Classification: Parallel Processing 488
17.4. Summary 495
17.5. Bibliographical Notes 498
17.6. Exercises 498
Permissions 501
List of Conferences and Journals 507
Bibliography 511
Index 541
Part I Introduction
1. Introduction 3
1.1. A Brief Overview: Parallel Databases and Grid Databases 4
1.2. Parallel Query Processing: Motivations 5
1.3. Parallel Query Processing: Objectives 7
1.3.1. Speed Up 7
1.3.2. Scale Up 8
1.3.3. Parallel Obstacles 10
1.4. Forms of Parallelism 12
1.4.1. Interquery Parallelism 13
1.4.2. Intraquery Parallelism 14
1.4.3. Intraoperation Parallelism 15
1.4.4. Interoperation Parallelism 15
1.4.5. Mixed Parallelism-A More Practical Solution 18
1.5. Parallel Database Architectures 19
1.5.1. Shared-Memory and Shared-Disk Architectures 20
1.5.2. Shared-Nothing Architecture 22
1.5.3. Shared-Something Architecture 23
1.5.4. Interconnection Networks 24
1.6. Grid Database Architecture 26
1.7. Structure of this Book 29
1.8. Summary 30
1.9. Bibliographical Notes 30
1.10. Exercises 31
2. Analytical Models 33
2.1. Cost Models 33
2.2. Cost Notations 34
2.2.1. Data Parameters 34
2.2.2. Systems Parameters 36
2.2.3. Query Parameters 37
2.2.4. Time Unit Costs 37
2.2.5. Communication Costs 38
2.3. Skew Model 39
2.4. Basic Operations in Parallel Databases 43
2.4.1. Disk Operations 44
2.4.2. Main Memory Operations 45
2.4.3. Data Computation and Data Distribution 45
2.5. Summary 47
2.6. Bibliographical Notes 47
2.7. Exercises 47
Part II Basic Query Parallelism
3. Parallel Search 51
3.1. Search Queries 51
3.1.1. Exact-Match Search 52
3.1.2. Range Search Query 53
3.1.3. Multiattribute Search Query 54
3.2. Data Partitioning 54
3.2.1. Basic Data Partitioning 55
3.2.2. Complex Data Partitioning 60
3.3. Search Algorithms 69
3.3.1. Serial Search Algorithms 69
3.3.2. Parallel Search Algorithms 73
3.4. Summary 74
3.5. Bibliographical Notes 75
3.6. Exercises 75
4. Parallel Sort and GroupBy 77
4.1. Sorting, Duplicate Removal, and Aggregate Queries 78
4.1.1. Sorting and Duplicate Removal 78
4.1.2. Scalar Aggregate 79
4.1.3. GroupBy 80
4.2. Serial External Sorting Method 80
4.3. Algorithms for Parallel External Sort 83
4.3.1. Parallel Merge-All Sort 83
4.3.2. Parallel Binary-Merge Sort 85
4.3.3. Parallel Redistribution Binary-Merge Sort 86
4.3.4. Parallel Redistribution Merge-All Sort 88
4.3.5. Parallel Partitioned Sort 90
4.4. Parallel Algorithms for GroupBy Queries 92
4.4.1. Traditional Methods (Merge-All and Hierarchical Merging) 92
4.4.2. Two-Phase Method 93
4.4.3. Redistribution Method 94
4.5. Cost Models for Parallel Sort 96
4.5.1. Cost Models for Serial External Merge-Sort 96
4.5.2. Cost Models for Parallel Merge-All Sort 98
4.5.3. Cost Models for Parallel Binary-Merge Sort 100
4.5.4. Cost Models for Parallel Redistribution Binary-Merge Sort 101
4.5.5. Cost Models for Parallel Redistribution Merge-All Sort 102
4.5.6. Cost Models for Parallel Partitioned Sort 103
4.6. Cost Models for Parallel GroupBy 104
4.6.1. Cost Models for Parallel Two-Phase Method 104
4.6.2. Cost Models for Parallel Redistribution Method 107
4.7. Summary 109
4.8. Bibliographical Notes 110
4.9. Exercises 110
5. Parallel Join 112
5.1. Join Operations 112
5.2. Serial Join Algorithms 114
5.2.1. Nested-Loop Join Algorithm 114
5.2.2. Sort-Merge Join Algorithm 116
5.2.3. Hash-Based Join Algorithm 117
5.2.4. Comparison 120
5.3. Parallel Join Algorithms 120
5.3.1. Divide and Broadcast-Based Parallel Join Algorithms 121
5.3.2. Disjoint Partitioning-Based Parallel Join Algorithms 124
5.4. Cost Models 128
5.4.1. Cost Models for Divide and Broadcast 128
5.4.2. Cost Models for Disjoint Partitioning 129
5.4.3. Cost Models for Local Join 130
5.5. Parallel Join Optimization 132
5.5.1. Optimizing Main Memory 132
5.5.2. Load Balancing 133
5.6. Summary 134
5.7. Bibliographical Notes 135
5.8. Exercises 136
Part III Advanced Parallel Query Processing
6. Parallel GroupBy-Join 141
6.1. Groupby-Join Queries 141
6.1.1. Groupby Before Join 142
6.1.2. Groupby After Join 142
6.2. Parallel Algorithms for Groupby-Before-Join Query Processing 143
6.2.1. Early Distribution Scheme 143
6.2.2. Early GroupBy with Partitioning Scheme 145
6.2.3. Early GroupBy with Replication Scheme 146
6.3. Parallel Algorithms for Groupby-After-Join Query Processing 148
6.3.1. Join Partitioning Scheme 148
6.3.2. GroupBy Partitioning Scheme 150
6.4. Cost Model Notations 151
6.5. Cost Model for Groupby-Before-Join Query Processing 153
6.5.1. Cost Models for the Early Distribution Scheme 153
6.5.2. Cost Models for the Early GroupBy with Partitioning Scheme 156
6.5.3. Cost Models for the Early GroupBy with Replication Scheme 158
6.6. Cost Model for "Groupby-After-Join" Query Processing 159
6.6.1. Cost Models for the Join Partitioning Scheme 159
6.6.2. Cost Models for the GroupBy Partitioning Scheme 161
6.7. Summary 163
6.8. Bibliographical Notes 164
6.9. Exercises 164
7. Parallel Indexing 167
7.1. Parallel Indexing-an Internal Perspective on Parallel Indexing
Structures 168
7.2. Parallel Indexing Structures 169
7.2.1. Nonreplicated Indexing (NRI) Structures 169
7.2.2. Partially Replicated Indexing (PRI) Structures 171
7.2.3. Fully Replicated Indexing (FRI) Structures 178
7.3. Index Maintenance 180
7.3.1. Maintaining a Parallel Nonreplicated Index 182
7.3.2. Maintaining a Parallel Partially Replicated Index 182
7.3.3. Maintaining a Parallel Fully Replicated Index 188
7.3.4. Complexity Degree of Index Maintenance 188
7.4. Index Storage Analysis 188
7.4.1. Storage Cost Models for Uniprocessors 189
7.4.2. Storage Cost Models for Parallel Processors 191
7.5. Parallel Processing of Search Queries using Index 192
7.5.1. Parallel One-Index Search Query Processing 192
7.5.2. Parallel Multi-Index Search Query Processing 195
7.6. Parallel Index Join Algorithms 200
7.6.1. Parallel One-Index Join 200
7.6.2. Parallel Two-Index Join 203
7.7. Comparative Analysis 207
7.7.1. Comparative Analysis of Parallel Search Index 207
7.7.2. Comparative Analysis of Parallel Index Join 213
7.8. Summary 216
7.9. Bibliographical Notes 217
7.10. Exercises 217
8. Parallel Universal Qualification-Collection Join Queries 219
8.1. Universal Quantification and Collection Join 220
8.2. Collection Types and Collection Join Queries 222
8.2.1. Collection-Equi Join Queries 222
8.2.2. Collection-Intersect Join Queries 223
8.2.3. Subcollection Join Queries 224
8.3. Parallel Algorithms for Collection Join Queries 225
8.4. Parallel Collection-Equi Join Algorithms 225
8.4.1. Disjoint Data Partitioning 226
8.4.2. Parallel Double Sort-Merge Collection-Equi Join Algorithm 227
8.4.3. Parallel Sort-Hash Collection-Equi Join Algorithm 228
8.4.4. Parallel Hash Collection-Equi Join Algorithm 232
8.5. Parallel Collection-Intersect Join Algorithms 233
8.5.1. Non-Disjoint Data Partitioning 234
8.5.2. Parallel Sort-Merge Nested-Loop Collection-Intersect Join Algorithm
244
8.5.3. Parallel Sort-Hash Collection-Intersect Join Algorithm 245
8.5.4. Parallel Hash Collection-Intersect Join Algorithm 246
8.6. Parallel Subcollection Join Algorithms 246
8.6.1. Data Partitioning 247
8.6.2. Parallel Sort-Merge Nested-Loop Subcollection Join Algorithm 248
8.6.3. Parallel Sort-Hash Subcollection Join Algorithm 249
8.6.4. Parallel Hash Subcollection Join Algorithm 251
8.7. Summary 252
8.8. Bibliographical Notes 252
8.9. Exercises 254
9. Parallel Query Scheduling and Optimization 256
9.1. Query Execution Plan 257
9.2. Subqueries Execution Scheduling Strategies 259
9.2.1. Serial Execution Among Subqueries 259
9.2.2. Parallel Execution Among Subqueries 261
9.3. Serial vs. Parallel Execution Scheduling 264
9.3.1. Nonskewed Subqueries 264
9.3.2. Skewed Subqueries 265
9.3.3. Skewed and Nonskewed Subqueries 267
9.4. Scheduling Rules 269
9.5. Cluster Query Processing Model 270
9.5.1. Overview of Dynamic Query Processing 271
9.5.2. A Cluster Query Processing Architecture 272
9.5.3. Load Information Exchange 273
9.6. Dynamic Cluster Query Optimization 275
9.6.1. Correction 276
9.6.2. Migration 280
9.6.3. Partition 281
9.7. Other Approaches to Dynamic Query Optimization 284
9.8. Summary 285
9.9. Bibliographical Notes 286
9.10. Exercises 286
Part IV Grid Databases
10. Transactions in Distributed and Grid Databases 291
10.1. Grid Database Challenges 292
10.2. Distributed Database Systems and Multidatabase Systems 293
10.2.1. Distributed Database Systems 293
10.2.2. Multidatabase Systems 297
10.3. Basic Definitions on Transaction Management 299
10.4. Acid Properties of Transactions 301
10.5. Transaction Management in Various Database Systems 303
10.5.1. Transaction Management in Centralized and Homogeneous Distributed
Database Systems 303
10.5.2. Transaction Management in Heterogeneous Distributed Database
Systems 305
10.6. Requirements in Grid Database Systems 307
10.7. Concurrency Control Protocols 309
10.8. Atomic Commit Protocols 310
10.8.1. Homogeneous Distributed Database Systems 310
10.8.2. Heterogeneous Distributed Database Systems 313
10.9. Replica Synchronization Protocols 314
10.9.1. Network Partitioning 315
10.9.2. Replica Synchronization Protocols 316
10.10. Summary 318
10.11. Bibliographical Notes 318
10.12. Exercises 319
11. Grid Concurrency Control 321
11.1. A Grid Database Environment 321
11.2. An Example 322
11.3. Grid Concurrency Control 324
11.3.1. Basic Functions Required by GCC 324
11.3.2. Grid Serializability Theorem 325
11.3.3. Grid Concurrency Control Protocol 329
11.3.4. Revisiting the Earlier Example 333
11.3.5. Comparison with Traditional Concurrency Control Protocols 334
11.4. Correctness of GCC Protocol 336
11.5. Features of GCC Protocol 338
11.6. Summary 339
11.7. Bibliographical Notes 339
11.8. Exercises 339
12. Grid Transaction Atomicity and Durability 341
12.1. Motivation 342
12.2. Grid Atomic Commit Protocol (Grid-ACP) 343
12.2.1. State Diagram of Grid-ACP 343
12.2.2. Grid-ACP Algorithm 344
12.2.3. Early-Abort Grid-ACP 346
12.2.4. Discussion 348
12.2.5. Message and Time Complexity Comparison Analysis 349
12.2.6. Correctness of Grid-ACP 350
12.3. Handling Failure of Sites with Grid-ACP 351
12.3.1. Model for Storing Log Files at the Originator and Participating
Sites 351
12.3.2. Logs Required at the Originator Site 352
12.3.3. Logs Required at the Participant Site 353
12.3.4. Failure Recovery Algorithm for Grid-ACP 353
12.3.5. Comparison of Recovery Protocols 359
12.3.6. Correctness of Recovery Algorithm 361
12.4. Summary 365
12.5. Bibliographical Notes 366
12.6. Exercises 366
13. Replica Management in Grids 367
13.1. Motivation 367
13.2. Replica Architecture 368
13.2.1. High-Level Replica Management Architecture 368
13.2.2. Some Problems 369
13.3. Grid Replica Access Protocol (GRAP) 371
13.3.1. Read Transaction Operation for GRAP 371
13.3.2. Write Transaction Operation for GRAP 372
13.3.3. Revisiting the Example Problem 375
13.3.4. Correctness of GRAP 377
13.4. Handling Multiple Partitioning 378
13.4.1. Contingency GRAP 378
13.4.2. Comparison of Replica Management Protocols 381
13.4.3. Correctness of Contingency GRAP 383
13.5. Summary 384
13.6. Bibliographical Notes 385
13.7. Exercises 385
14. Grid Atomic Commitment in Replicated Data 387
14.1. Motivation 388
14.1.1. Architectural Reasons 388
14.1.2. Motivating Example 388
14.2. Modified Grid Atomic Commitment Protocol 390
14.2.1. Modified Grid-ACP 390
14.2.2. Correctness of Modified Grid-ACP 393
14.3. Transaction Properties in Replicated Environment 395
14.4. Summary 397
14.5. Bibliographical Notes 397
14.6. Exercises 398
Part V Other Data-Intensive Applications
15. Parallel Online Analytic Processing (OLAP) and Business Intelligence
401
15.1. Parallel Multidimensional Analysis 402
15.2. Parallelization of ROLLUP Queries 405
15.2.1. Analysis of Basic Single ROLLUP Queries 405
15.2.2. Analysis of Multiple ROLLUP Queries 409
15.2.3. Analysis of Partial ROLLUP Queries 411
15.2.4. Parallelization Without Using ROLLUP 412
15.3. Parallelization of CUBE Queries 412
15.3.1. Analysis of Basic CUBE Queries 413
15.3.2. Analysis of Partial CUBE Queries 416
15.3.3. Parallelization Without Using CUBE 417
15.4. Parallelization of Top-N and Ranking Queries 418
15.5. Parallelization of Cume_Dist Queries 419
15.6. Parallelization of NTILE and Histogram Queries 420
15.7. Parallelization of Moving Average and Windowing Queries 422
15.8. Summary 424
15.9. Bibliographical Notes 424
15.10. Exercises 425
16. Parallel Data Mining-Association Rules and Sequential Patterns 427
16.1. From Databases To Data Warehousing To Data Mining: A Journey 428
16.2. Data Mining: A Brief Overview 431
16.2.1. Data Mining Tasks 431
16.2.2. Querying vs. Mining 433
16.2.3. Parallelism in Data Mining 436
16.3. Parallel Association Rules 440
16.3.1. Association Rules: Concepts 441
16.3.2. Association Rules: Processes 444
16.3.3. Association Rules: Parallel Processing 448
16.4. Parallel Sequential Patterns 450
16.4.1. Sequential Patterns: Concepts 452
16.4.2. Sequential Patterns: Processes 456
16.4.3. Sequential Patterns: Parallel Processing 459
16.5. Summary 461
16.6. Bibliographical Notes 461
16.7. Exercises 462
17. Parallel Clustering and Classification 464
17.1. Clustering and Classification 464
17.1.1. Clustering 464
17.1.2. Classification 465
17.2. Parallel Clustering 467
17.2.1. Clustering: Concepts 467
17.2.2. k-Means Algorithm 468
17.2.3. Parallel k-Means Clustering 471
17.3. Parallel Classification 477
17.3.1. Decision Tree Classification: Structures 477
17.3.2. Decision Tree Classification: Processes 480
17.3.3. Decision Tree Classification: Parallel Processing 488
17.4. Summary 495
17.5. Bibliographical Notes 498
17.6. Exercises 498
Permissions 501
List of Conferences and Journals 507
Bibliography 511
Index 541
Preface xv
Part I Introduction
1. Introduction 3
1.1. A Brief Overview: Parallel Databases and Grid Databases 4
1.2. Parallel Query Processing: Motivations 5
1.3. Parallel Query Processing: Objectives 7
1.3.1. Speed Up 7
1.3.2. Scale Up 8
1.3.3. Parallel Obstacles 10
1.4. Forms of Parallelism 12
1.4.1. Interquery Parallelism 13
1.4.2. Intraquery Parallelism 14
1.4.3. Intraoperation Parallelism 15
1.4.4. Interoperation Parallelism 15
1.4.5. Mixed Parallelism-A More Practical Solution 18
1.5. Parallel Database Architectures 19
1.5.1. Shared-Memory and Shared-Disk Architectures 20
1.5.2. Shared-Nothing Architecture 22
1.5.3. Shared-Something Architecture 23
1.5.4. Interconnection Networks 24
1.6. Grid Database Architecture 26
1.7. Structure of this Book 29
1.8. Summary 30
1.9. Bibliographical Notes 30
1.10. Exercises 31
2. Analytical Models 33
2.1. Cost Models 33
2.2. Cost Notations 34
2.2.1. Data Parameters 34
2.2.2. Systems Parameters 36
2.2.3. Query Parameters 37
2.2.4. Time Unit Costs 37
2.2.5. Communication Costs 38
2.3. Skew Model 39
2.4. Basic Operations in Parallel Databases 43
2.4.1. Disk Operations 44
2.4.2. Main Memory Operations 45
2.4.3. Data Computation and Data Distribution 45
2.5. Summary 47
2.6. Bibliographical Notes 47
2.7. Exercises 47
Part II Basic Query Parallelism
3. Parallel Search 51
3.1. Search Queries 51
3.1.1. Exact-Match Search 52
3.1.2. Range Search Query 53
3.1.3. Multiattribute Search Query 54
3.2. Data Partitioning 54
3.2.1. Basic Data Partitioning 55
3.2.2. Complex Data Partitioning 60
3.3. Search Algorithms 69
3.3.1. Serial Search Algorithms 69
3.3.2. Parallel Search Algorithms 73
3.4. Summary 74
3.5. Bibliographical Notes 75
3.6. Exercises 75
4. Parallel Sort and GroupBy 77
4.1. Sorting, Duplicate Removal, and Aggregate Queries 78
4.1.1. Sorting and Duplicate Removal 78
4.1.2. Scalar Aggregate 79
4.1.3. GroupBy 80
4.2. Serial External Sorting Method 80
4.3. Algorithms for Parallel External Sort 83
4.3.1. Parallel Merge-All Sort 83
4.3.2. Parallel Binary-Merge Sort 85
4.3.3. Parallel Redistribution Binary-Merge Sort 86
4.3.4. Parallel Redistribution Merge-All Sort 88
4.3.5. Parallel Partitioned Sort 90
4.4. Parallel Algorithms for GroupBy Queries 92
4.4.1. Traditional Methods (Merge-All and Hierarchical Merging) 92
4.4.2. Two-Phase Method 93
4.4.3. Redistribution Method 94
4.5. Cost Models for Parallel Sort 96
4.5.1. Cost Models for Serial External Merge-Sort 96
4.5.2. Cost Models for Parallel Merge-All Sort 98
4.5.3. Cost Models for Parallel Binary-Merge Sort 100
4.5.4. Cost Models for Parallel Redistribution Binary-Merge Sort 101
4.5.5. Cost Models for Parallel Redistribution Merge-All Sort 102
4.5.6. Cost Models for Parallel Partitioned Sort 103
4.6. Cost Models for Parallel GroupBy 104
4.6.1. Cost Models for Parallel Two-Phase Method 104
4.6.2. Cost Models for Parallel Redistribution Method 107
4.7. Summary 109
4.8. Bibliographical Notes 110
4.9. Exercises 110
5. Parallel Join 112
5.1. Join Operations 112
5.2. Serial Join Algorithms 114
5.2.1. Nested-Loop Join Algorithm 114
5.2.2. Sort-Merge Join Algorithm 116
5.2.3. Hash-Based Join Algorithm 117
5.2.4. Comparison 120
5.3. Parallel Join Algorithms 120
5.3.1. Divide and Broadcast-Based Parallel Join Algorithms 121
5.3.2. Disjoint Partitioning-Based Parallel Join Algorithms 124
5.4. Cost Models 128
5.4.1. Cost Models for Divide and Broadcast 128
5.4.2. Cost Models for Disjoint Partitioning 129
5.4.3. Cost Models for Local Join 130
5.5. Parallel Join Optimization 132
5.5.1. Optimizing Main Memory 132
5.5.2. Load Balancing 133
5.6. Summary 134
5.7. Bibliographical Notes 135
5.8. Exercises 136
Part III Advanced Parallel Query Processing
6. Parallel GroupBy-Join 141
6.1. Groupby-Join Queries 141
6.1.1. Groupby Before Join 142
6.1.2. Groupby After Join 142
6.2. Parallel Algorithms for Groupby-Before-Join Query Processing 143
6.2.1. Early Distribution Scheme 143
6.2.2. Early GroupBy with Partitioning Scheme 145
6.2.3. Early GroupBy with Replication Scheme 146
6.3. Parallel Algorithms for Groupby-After-Join Query Processing 148
6.3.1. Join Partitioning Scheme 148
6.3.2. GroupBy Partitioning Scheme 150
6.4. Cost Model Notations 151
6.5. Cost Model for Groupby-Before-Join Query Processing 153
6.5.1. Cost Models for the Early Distribution Scheme 153
6.5.2. Cost Models for the Early GroupBy with Partitioning Scheme 156
6.5.3. Cost Models for the Early GroupBy with Replication Scheme 158
6.6. Cost Model for "Groupby-After-Join" Query Processing 159
6.6.1. Cost Models for the Join Partitioning Scheme 159
6.6.2. Cost Models for the GroupBy Partitioning Scheme 161
6.7. Summary 163
6.8. Bibliographical Notes 164
6.9. Exercises 164
7. Parallel Indexing 167
7.1. Parallel Indexing-an Internal Perspective on Parallel Indexing
Structures 168
7.2. Parallel Indexing Structures 169
7.2.1. Nonreplicated Indexing (NRI) Structures 169
7.2.2. Partially Replicated Indexing (PRI) Structures 171
7.2.3. Fully Replicated Indexing (FRI) Structures 178
7.3. Index Maintenance 180
7.3.1. Maintaining a Parallel Nonreplicated Index 182
7.3.2. Maintaining a Parallel Partially Replicated Index 182
7.3.3. Maintaining a Parallel Fully Replicated Index 188
7.3.4. Complexity Degree of Index Maintenance 188
7.4. Index Storage Analysis 188
7.4.1. Storage Cost Models for Uniprocessors 189
7.4.2. Storage Cost Models for Parallel Processors 191
7.5. Parallel Processing of Search Queries using Index 192
7.5.1. Parallel One-Index Search Query Processing 192
7.5.2. Parallel Multi-Index Search Query Processing 195
7.6. Parallel Index Join Algorithms 200
7.6.1. Parallel One-Index Join 200
7.6.2. Parallel Two-Index Join 203
7.7. Comparative Analysis 207
7.7.1. Comparative Analysis of Parallel Search Index 207
7.7.2. Comparative Analysis of Parallel Index Join 213
7.8. Summary 216
7.9. Bibliographical Notes 217
7.10. Exercises 217
8. Parallel Universal Qualification-Collection Join Queries 219
8.1. Universal Quantification and Collection Join 220
8.2. Collection Types and Collection Join Queries 222
8.2.1. Collection-Equi Join Queries 222
8.2.2. Collection-Intersect Join Queries 223
8.2.3. Subcollection Join Queries 224
8.3. Parallel Algorithms for Collection Join Queries 225
8.4. Parallel Collection-Equi Join Algorithms 225
8.4.1. Disjoint Data Partitioning 226
8.4.2. Parallel Double Sort-Merge Collection-Equi Join Algorithm 227
8.4.3. Parallel Sort-Hash Collection-Equi Join Algorithm 228
8.4.4. Parallel Hash Collection-Equi Join Algorithm 232
8.5. Parallel Collection-Intersect Join Algorithms 233
8.5.1. Non-Disjoint Data Partitioning 234
8.5.2. Parallel Sort-Merge Nested-Loop Collection-Intersect Join Algorithm
244
8.5.3. Parallel Sort-Hash Collection-Intersect Join Algorithm 245
8.5.4. Parallel Hash Collection-Intersect Join Algorithm 246
8.6. Parallel Subcollection Join Algorithms 246
8.6.1. Data Partitioning 247
8.6.2. Parallel Sort-Merge Nested-Loop Subcollection Join Algorithm 248
8.6.3. Parallel Sort-Hash Subcollection Join Algorithm 249
8.6.4. Parallel Hash Subcollection Join Algorithm 251
8.7. Summary 252
8.8. Bibliographical Notes 252
8.9. Exercises 254
9. Parallel Query Scheduling and Optimization 256
9.1. Query Execution Plan 257
9.2. Subqueries Execution Scheduling Strategies 259
9.2.1. Serial Execution Among Subqueries 259
9.2.2. Parallel Execution Among Subqueries 261
9.3. Serial vs. Parallel Execution Scheduling 264
9.3.1. Nonskewed Subqueries 264
9.3.2. Skewed Subqueries 265
9.3.3. Skewed and Nonskewed Subqueries 267
9.4. Scheduling Rules 269
9.5. Cluster Query Processing Model 270
9.5.1. Overview of Dynamic Query Processing 271
9.5.2. A Cluster Query Processing Architecture 272
9.5.3. Load Information Exchange 273
9.6. Dynamic Cluster Query Optimization 275
9.6.1. Correction 276
9.6.2. Migration 280
9.6.3. Partition 281
9.7. Other Approaches to Dynamic Query Optimization 284
9.8. Summary 285
9.9. Bibliographical Notes 286
9.10. Exercises 286
Part IV Grid Databases
10. Transactions in Distributed and Grid Databases 291
10.1. Grid Database Challenges 292
10.2. Distributed Database Systems and Multidatabase Systems 293
10.2.1. Distributed Database Systems 293
10.2.2. Multidatabase Systems 297
10.3. Basic Definitions on Transaction Management 299
10.4. Acid Properties of Transactions 301
10.5. Transaction Management in Various Database Systems 303
10.5.1. Transaction Management in Centralized and Homogeneous Distributed
Database Systems 303
10.5.2. Transaction Management in Heterogeneous Distributed Database
Systems 305
10.6. Requirements in Grid Database Systems 307
10.7. Concurrency Control Protocols 309
10.8. Atomic Commit Protocols 310
10.8.1. Homogeneous Distributed Database Systems 310
10.8.2. Heterogeneous Distributed Database Systems 313
10.9. Replica Synchronization Protocols 314
10.9.1. Network Partitioning 315
10.9.2. Replica Synchronization Protocols 316
10.10. Summary 318
10.11. Bibliographical Notes 318
10.12. Exercises 319
11. Grid Concurrency Control 321
11.1. A Grid Database Environment 321
11.2. An Example 322
11.3. Grid Concurrency Control 324
11.3.1. Basic Functions Required by GCC 324
11.3.2. Grid Serializability Theorem 325
11.3.3. Grid Concurrency Control Protocol 329
11.3.4. Revisiting the Earlier Example 333
11.3.5. Comparison with Traditional Concurrency Control Protocols 334
11.4. Correctness of GCC Protocol 336
11.5. Features of GCC Protocol 338
11.6. Summary 339
11.7. Bibliographical Notes 339
11.8. Exercises 339
12. Grid Transaction Atomicity and Durability 341
12.1. Motivation 342
12.2. Grid Atomic Commit Protocol (Grid-ACP) 343
12.2.1. State Diagram of Grid-ACP 343
12.2.2. Grid-ACP Algorithm 344
12.2.3. Early-Abort Grid-ACP 346
12.2.4. Discussion 348
12.2.5. Message and Time Complexity Comparison Analysis 349
12.2.6. Correctness of Grid-ACP 350
12.3. Handling Failure of Sites with Grid-ACP 351
12.3.1. Model for Storing Log Files at the Originator and Participating
Sites 351
12.3.2. Logs Required at the Originator Site 352
12.3.3. Logs Required at the Participant Site 353
12.3.4. Failure Recovery Algorithm for Grid-ACP 353
12.3.5. Comparison of Recovery Protocols 359
12.3.6. Correctness of Recovery Algorithm 361
12.4. Summary 365
12.5. Bibliographical Notes 366
12.6. Exercises 366
13. Replica Management in Grids 367
13.1. Motivation 367
13.2. Replica Architecture 368
13.2.1. High-Level Replica Management Architecture 368
13.2.2. Some Problems 369
13.3. Grid Replica Access Protocol (GRAP) 371
13.3.1. Read Transaction Operation for GRAP 371
13.3.2. Write Transaction Operation for GRAP 372
13.3.3. Revisiting the Example Problem 375
13.3.4. Correctness of GRAP 377
13.4. Handling Multiple Partitioning 378
13.4.1. Contingency GRAP 378
13.4.2. Comparison of Replica Management Protocols 381
13.4.3. Correctness of Contingency GRAP 383
13.5. Summary 384
13.6. Bibliographical Notes 385
13.7. Exercises 385
14. Grid Atomic Commitment in Replicated Data 387
14.1. Motivation 388
14.1.1. Architectural Reasons 388
14.1.2. Motivating Example 388
14.2. Modified Grid Atomic Commitment Protocol 390
14.2.1. Modified Grid-ACP 390
14.2.2. Correctness of Modified Grid-ACP 393
14.3. Transaction Properties in Replicated Environment 395
14.4. Summary 397
14.5. Bibliographical Notes 397
14.6. Exercises 398
Part V Other Data-Intensive Applications
15. Parallel Online Analytic Processing (OLAP) and Business Intelligence
401
15.1. Parallel Multidimensional Analysis 402
15.2. Parallelization of ROLLUP Queries 405
15.2.1. Analysis of Basic Single ROLLUP Queries 405
15.2.2. Analysis of Multiple ROLLUP Queries 409
15.2.3. Analysis of Partial ROLLUP Queries 411
15.2.4. Parallelization Without Using ROLLUP 412
15.3. Parallelization of CUBE Queries 412
15.3.1. Analysis of Basic CUBE Queries 413
15.3.2. Analysis of Partial CUBE Queries 416
15.3.3. Parallelization Without Using CUBE 417
15.4. Parallelization of Top-N and Ranking Queries 418
15.5. Parallelization of Cume_Dist Queries 419
15.6. Parallelization of NTILE and Histogram Queries 420
15.7. Parallelization of Moving Average and Windowing Queries 422
15.8. Summary 424
15.9. Bibliographical Notes 424
15.10. Exercises 425
16. Parallel Data Mining-Association Rules and Sequential Patterns 427
16.1. From Databases To Data Warehousing To Data Mining: A Journey 428
16.2. Data Mining: A Brief Overview 431
16.2.1. Data Mining Tasks 431
16.2.2. Querying vs. Mining 433
16.2.3. Parallelism in Data Mining 436
16.3. Parallel Association Rules 440
16.3.1. Association Rules: Concepts 441
16.3.2. Association Rules: Processes 444
16.3.3. Association Rules: Parallel Processing 448
16.4. Parallel Sequential Patterns 450
16.4.1. Sequential Patterns: Concepts 452
16.4.2. Sequential Patterns: Processes 456
16.4.3. Sequential Patterns: Parallel Processing 459
16.5. Summary 461
16.6. Bibliographical Notes 461
16.7. Exercises 462
17. Parallel Clustering and Classification 464
17.1. Clustering and Classification 464
17.1.1. Clustering 464
17.1.2. Classification 465
17.2. Parallel Clustering 467
17.2.1. Clustering: Concepts 467
17.2.2. k-Means Algorithm 468
17.2.3. Parallel k-Means Clustering 471
17.3. Parallel Classification 477
17.3.1. Decision Tree Classification: Structures 477
17.3.2. Decision Tree Classification: Processes 480
17.3.3. Decision Tree Classification: Parallel Processing 488
17.4. Summary 495
17.5. Bibliographical Notes 498
17.6. Exercises 498
Permissions 501
List of Conferences and Journals 507
Bibliography 511
Index 541
Part I Introduction
1. Introduction 3
1.1. A Brief Overview: Parallel Databases and Grid Databases 4
1.2. Parallel Query Processing: Motivations 5
1.3. Parallel Query Processing: Objectives 7
1.3.1. Speed Up 7
1.3.2. Scale Up 8
1.3.3. Parallel Obstacles 10
1.4. Forms of Parallelism 12
1.4.1. Interquery Parallelism 13
1.4.2. Intraquery Parallelism 14
1.4.3. Intraoperation Parallelism 15
1.4.4. Interoperation Parallelism 15
1.4.5. Mixed Parallelism-A More Practical Solution 18
1.5. Parallel Database Architectures 19
1.5.1. Shared-Memory and Shared-Disk Architectures 20
1.5.2. Shared-Nothing Architecture 22
1.5.3. Shared-Something Architecture 23
1.5.4. Interconnection Networks 24
1.6. Grid Database Architecture 26
1.7. Structure of this Book 29
1.8. Summary 30
1.9. Bibliographical Notes 30
1.10. Exercises 31
2. Analytical Models 33
2.1. Cost Models 33
2.2. Cost Notations 34
2.2.1. Data Parameters 34
2.2.2. Systems Parameters 36
2.2.3. Query Parameters 37
2.2.4. Time Unit Costs 37
2.2.5. Communication Costs 38
2.3. Skew Model 39
2.4. Basic Operations in Parallel Databases 43
2.4.1. Disk Operations 44
2.4.2. Main Memory Operations 45
2.4.3. Data Computation and Data Distribution 45
2.5. Summary 47
2.6. Bibliographical Notes 47
2.7. Exercises 47
Part II Basic Query Parallelism
3. Parallel Search 51
3.1. Search Queries 51
3.1.1. Exact-Match Search 52
3.1.2. Range Search Query 53
3.1.3. Multiattribute Search Query 54
3.2. Data Partitioning 54
3.2.1. Basic Data Partitioning 55
3.2.2. Complex Data Partitioning 60
3.3. Search Algorithms 69
3.3.1. Serial Search Algorithms 69
3.3.2. Parallel Search Algorithms 73
3.4. Summary 74
3.5. Bibliographical Notes 75
3.6. Exercises 75
4. Parallel Sort and GroupBy 77
4.1. Sorting, Duplicate Removal, and Aggregate Queries 78
4.1.1. Sorting and Duplicate Removal 78
4.1.2. Scalar Aggregate 79
4.1.3. GroupBy 80
4.2. Serial External Sorting Method 80
4.3. Algorithms for Parallel External Sort 83
4.3.1. Parallel Merge-All Sort 83
4.3.2. Parallel Binary-Merge Sort 85
4.3.3. Parallel Redistribution Binary-Merge Sort 86
4.3.4. Parallel Redistribution Merge-All Sort 88
4.3.5. Parallel Partitioned Sort 90
4.4. Parallel Algorithms for GroupBy Queries 92
4.4.1. Traditional Methods (Merge-All and Hierarchical Merging) 92
4.4.2. Two-Phase Method 93
4.4.3. Redistribution Method 94
4.5. Cost Models for Parallel Sort 96
4.5.1. Cost Models for Serial External Merge-Sort 96
4.5.2. Cost Models for Parallel Merge-All Sort 98
4.5.3. Cost Models for Parallel Binary-Merge Sort 100
4.5.4. Cost Models for Parallel Redistribution Binary-Merge Sort 101
4.5.5. Cost Models for Parallel Redistribution Merge-All Sort 102
4.5.6. Cost Models for Parallel Partitioned Sort 103
4.6. Cost Models for Parallel GroupBy 104
4.6.1. Cost Models for Parallel Two-Phase Method 104
4.6.2. Cost Models for Parallel Redistribution Method 107
4.7. Summary 109
4.8. Bibliographical Notes 110
4.9. Exercises 110
5. Parallel Join 112
5.1. Join Operations 112
5.2. Serial Join Algorithms 114
5.2.1. Nested-Loop Join Algorithm 114
5.2.2. Sort-Merge Join Algorithm 116
5.2.3. Hash-Based Join Algorithm 117
5.2.4. Comparison 120
5.3. Parallel Join Algorithms 120
5.3.1. Divide and Broadcast-Based Parallel Join Algorithms 121
5.3.2. Disjoint Partitioning-Based Parallel Join Algorithms 124
5.4. Cost Models 128
5.4.1. Cost Models for Divide and Broadcast 128
5.4.2. Cost Models for Disjoint Partitioning 129
5.4.3. Cost Models for Local Join 130
5.5. Parallel Join Optimization 132
5.5.1. Optimizing Main Memory 132
5.5.2. Load Balancing 133
5.6. Summary 134
5.7. Bibliographical Notes 135
5.8. Exercises 136
Part III Advanced Parallel Query Processing
6. Parallel GroupBy-Join 141
6.1. Groupby-Join Queries 141
6.1.1. Groupby Before Join 142
6.1.2. Groupby After Join 142
6.2. Parallel Algorithms for Groupby-Before-Join Query Processing 143
6.2.1. Early Distribution Scheme 143
6.2.2. Early GroupBy with Partitioning Scheme 145
6.2.3. Early GroupBy with Replication Scheme 146
6.3. Parallel Algorithms for Groupby-After-Join Query Processing 148
6.3.1. Join Partitioning Scheme 148
6.3.2. GroupBy Partitioning Scheme 150
6.4. Cost Model Notations 151
6.5. Cost Model for Groupby-Before-Join Query Processing 153
6.5.1. Cost Models for the Early Distribution Scheme 153
6.5.2. Cost Models for the Early GroupBy with Partitioning Scheme 156
6.5.3. Cost Models for the Early GroupBy with Replication Scheme 158
6.6. Cost Model for "Groupby-After-Join" Query Processing 159
6.6.1. Cost Models for the Join Partitioning Scheme 159
6.6.2. Cost Models for the GroupBy Partitioning Scheme 161
6.7. Summary 163
6.8. Bibliographical Notes 164
6.9. Exercises 164
7. Parallel Indexing 167
7.1. Parallel Indexing-an Internal Perspective on Parallel Indexing
Structures 168
7.2. Parallel Indexing Structures 169
7.2.1. Nonreplicated Indexing (NRI) Structures 169
7.2.2. Partially Replicated Indexing (PRI) Structures 171
7.2.3. Fully Replicated Indexing (FRI) Structures 178
7.3. Index Maintenance 180
7.3.1. Maintaining a Parallel Nonreplicated Index 182
7.3.2. Maintaining a Parallel Partially Replicated Index 182
7.3.3. Maintaining a Parallel Fully Replicated Index 188
7.3.4. Complexity Degree of Index Maintenance 188
7.4. Index Storage Analysis 188
7.4.1. Storage Cost Models for Uniprocessors 189
7.4.2. Storage Cost Models for Parallel Processors 191
7.5. Parallel Processing of Search Queries using Index 192
7.5.1. Parallel One-Index Search Query Processing 192
7.5.2. Parallel Multi-Index Search Query Processing 195
7.6. Parallel Index Join Algorithms 200
7.6.1. Parallel One-Index Join 200
7.6.2. Parallel Two-Index Join 203
7.7. Comparative Analysis 207
7.7.1. Comparative Analysis of Parallel Search Index 207
7.7.2. Comparative Analysis of Parallel Index Join 213
7.8. Summary 216
7.9. Bibliographical Notes 217
7.10. Exercises 217
8. Parallel Universal Qualification-Collection Join Queries 219
8.1. Universal Quantification and Collection Join 220
8.2. Collection Types and Collection Join Queries 222
8.2.1. Collection-Equi Join Queries 222
8.2.2. Collection-Intersect Join Queries 223
8.2.3. Subcollection Join Queries 224
8.3. Parallel Algorithms for Collection Join Queries 225
8.4. Parallel Collection-Equi Join Algorithms 225
8.4.1. Disjoint Data Partitioning 226
8.4.2. Parallel Double Sort-Merge Collection-Equi Join Algorithm 227
8.4.3. Parallel Sort-Hash Collection-Equi Join Algorithm 228
8.4.4. Parallel Hash Collection-Equi Join Algorithm 232
8.5. Parallel Collection-Intersect Join Algorithms 233
8.5.1. Non-Disjoint Data Partitioning 234
8.5.2. Parallel Sort-Merge Nested-Loop Collection-Intersect Join Algorithm
244
8.5.3. Parallel Sort-Hash Collection-Intersect Join Algorithm 245
8.5.4. Parallel Hash Collection-Intersect Join Algorithm 246
8.6. Parallel Subcollection Join Algorithms 246
8.6.1. Data Partitioning 247
8.6.2. Parallel Sort-Merge Nested-Loop Subcollection Join Algorithm 248
8.6.3. Parallel Sort-Hash Subcollection Join Algorithm 249
8.6.4. Parallel Hash Subcollection Join Algorithm 251
8.7. Summary 252
8.8. Bibliographical Notes 252
8.9. Exercises 254
9. Parallel Query Scheduling and Optimization 256
9.1. Query Execution Plan 257
9.2. Subqueries Execution Scheduling Strategies 259
9.2.1. Serial Execution Among Subqueries 259
9.2.2. Parallel Execution Among Subqueries 261
9.3. Serial vs. Parallel Execution Scheduling 264
9.3.1. Nonskewed Subqueries 264
9.3.2. Skewed Subqueries 265
9.3.3. Skewed and Nonskewed Subqueries 267
9.4. Scheduling Rules 269
9.5. Cluster Query Processing Model 270
9.5.1. Overview of Dynamic Query Processing 271
9.5.2. A Cluster Query Processing Architecture 272
9.5.3. Load Information Exchange 273
9.6. Dynamic Cluster Query Optimization 275
9.6.1. Correction 276
9.6.2. Migration 280
9.6.3. Partition 281
9.7. Other Approaches to Dynamic Query Optimization 284
9.8. Summary 285
9.9. Bibliographical Notes 286
9.10. Exercises 286
Part IV Grid Databases
10. Transactions in Distributed and Grid Databases 291
10.1. Grid Database Challenges 292
10.2. Distributed Database Systems and Multidatabase Systems 293
10.2.1. Distributed Database Systems 293
10.2.2. Multidatabase Systems 297
10.3. Basic Definitions on Transaction Management 299
10.4. Acid Properties of Transactions 301
10.5. Transaction Management in Various Database Systems 303
10.5.1. Transaction Management in Centralized and Homogeneous Distributed
Database Systems 303
10.5.2. Transaction Management in Heterogeneous Distributed Database
Systems 305
10.6. Requirements in Grid Database Systems 307
10.7. Concurrency Control Protocols 309
10.8. Atomic Commit Protocols 310
10.8.1. Homogeneous Distributed Database Systems 310
10.8.2. Heterogeneous Distributed Database Systems 313
10.9. Replica Synchronization Protocols 314
10.9.1. Network Partitioning 315
10.9.2. Replica Synchronization Protocols 316
10.10. Summary 318
10.11. Bibliographical Notes 318
10.12. Exercises 319
11. Grid Concurrency Control 321
11.1. A Grid Database Environment 321
11.2. An Example 322
11.3. Grid Concurrency Control 324
11.3.1. Basic Functions Required by GCC 324
11.3.2. Grid Serializability Theorem 325
11.3.3. Grid Concurrency Control Protocol 329
11.3.4. Revisiting the Earlier Example 333
11.3.5. Comparison with Traditional Concurrency Control Protocols 334
11.4. Correctness of GCC Protocol 336
11.5. Features of GCC Protocol 338
11.6. Summary 339
11.7. Bibliographical Notes 339
11.8. Exercises 339
12. Grid Transaction Atomicity and Durability 341
12.1. Motivation 342
12.2. Grid Atomic Commit Protocol (Grid-ACP) 343
12.2.1. State Diagram of Grid-ACP 343
12.2.2. Grid-ACP Algorithm 344
12.2.3. Early-Abort Grid-ACP 346
12.2.4. Discussion 348
12.2.5. Message and Time Complexity Comparison Analysis 349
12.2.6. Correctness of Grid-ACP 350
12.3. Handling Failure of Sites with Grid-ACP 351
12.3.1. Model for Storing Log Files at the Originator and Participating
Sites 351
12.3.2. Logs Required at the Originator Site 352
12.3.3. Logs Required at the Participant Site 353
12.3.4. Failure Recovery Algorithm for Grid-ACP 353
12.3.5. Comparison of Recovery Protocols 359
12.3.6. Correctness of Recovery Algorithm 361
12.4. Summary 365
12.5. Bibliographical Notes 366
12.6. Exercises 366
13. Replica Management in Grids 367
13.1. Motivation 367
13.2. Replica Architecture 368
13.2.1. High-Level Replica Management Architecture 368
13.2.2. Some Problems 369
13.3. Grid Replica Access Protocol (GRAP) 371
13.3.1. Read Transaction Operation for GRAP 371
13.3.2. Write Transaction Operation for GRAP 372
13.3.3. Revisiting the Example Problem 375
13.3.4. Correctness of GRAP 377
13.4. Handling Multiple Partitioning 378
13.4.1. Contingency GRAP 378
13.4.2. Comparison of Replica Management Protocols 381
13.4.3. Correctness of Contingency GRAP 383
13.5. Summary 384
13.6. Bibliographical Notes 385
13.7. Exercises 385
14. Grid Atomic Commitment in Replicated Data 387
14.1. Motivation 388
14.1.1. Architectural Reasons 388
14.1.2. Motivating Example 388
14.2. Modified Grid Atomic Commitment Protocol 390
14.2.1. Modified Grid-ACP 390
14.2.2. Correctness of Modified Grid-ACP 393
14.3. Transaction Properties in Replicated Environment 395
14.4. Summary 397
14.5. Bibliographical Notes 397
14.6. Exercises 398
Part V Other Data-Intensive Applications
15. Parallel Online Analytic Processing (OLAP) and Business Intelligence
401
15.1. Parallel Multidimensional Analysis 402
15.2. Parallelization of ROLLUP Queries 405
15.2.1. Analysis of Basic Single ROLLUP Queries 405
15.2.2. Analysis of Multiple ROLLUP Queries 409
15.2.3. Analysis of Partial ROLLUP Queries 411
15.2.4. Parallelization Without Using ROLLUP 412
15.3. Parallelization of CUBE Queries 412
15.3.1. Analysis of Basic CUBE Queries 413
15.3.2. Analysis of Partial CUBE Queries 416
15.3.3. Parallelization Without Using CUBE 417
15.4. Parallelization of Top-N and Ranking Queries 418
15.5. Parallelization of Cume_Dist Queries 419
15.6. Parallelization of NTILE and Histogram Queries 420
15.7. Parallelization of Moving Average and Windowing Queries 422
15.8. Summary 424
15.9. Bibliographical Notes 424
15.10. Exercises 425
16. Parallel Data Mining-Association Rules and Sequential Patterns 427
16.1. From Databases To Data Warehousing To Data Mining: A Journey 428
16.2. Data Mining: A Brief Overview 431
16.2.1. Data Mining Tasks 431
16.2.2. Querying vs. Mining 433
16.2.3. Parallelism in Data Mining 436
16.3. Parallel Association Rules 440
16.3.1. Association Rules: Concepts 441
16.3.2. Association Rules: Processes 444
16.3.3. Association Rules: Parallel Processing 448
16.4. Parallel Sequential Patterns 450
16.4.1. Sequential Patterns: Concepts 452
16.4.2. Sequential Patterns: Processes 456
16.4.3. Sequential Patterns: Parallel Processing 459
16.5. Summary 461
16.6. Bibliographical Notes 461
16.7. Exercises 462
17. Parallel Clustering and Classification 464
17.1. Clustering and Classification 464
17.1.1. Clustering 464
17.1.2. Classification 465
17.2. Parallel Clustering 467
17.2.1. Clustering: Concepts 467
17.2.2. k-Means Algorithm 468
17.2.3. Parallel k-Means Clustering 471
17.3. Parallel Classification 477
17.3.1. Decision Tree Classification: Structures 477
17.3.2. Decision Tree Classification: Processes 480
17.3.3. Decision Tree Classification: Parallel Processing 488
17.4. Summary 495
17.5. Bibliographical Notes 498
17.6. Exercises 498
Permissions 501
List of Conferences and Journals 507
Bibliography 511
Index 541