Ravishankar K. Iyer, Zbigniew T. Kalbarczyk, Nithin M. Nakka
Dependable Computing
Design and Assessment
€139.99
incl. VAT
Free shipping*
Ready to ship in over 4 weeks
- Hardcover
The only recent book on dependability and fault tolerance that covers both the software and hardware aspects of dependability, Dependable Computing: Design and Assessment addresses the new reality of dependable system design. After a discussion of reliability, availability, and hardware and software fault models, the authors explore hardware redundancy, coding techniques, processor-level error detection and recovery, checkpointing and recovery, software fault-tolerance techniques, and network-specific issues. Ideal for both students and practitioners, the book illustrates the capabilities and applicability of every technique with examples drawn from real applications and systems.
Product Details
- Publisher: Wiley & Sons
- 1st edition
- Pages: 848
- Publication date: May 30, 2024
- Language: English
- Dimensions: 235 mm x 161 mm x 52 mm
- Weight: 1382 g
- ISBN-13: 9781118709443
- ISBN-10: 1118709446
- Item no.: 40782073
Ravishankar K. Iyer is George and Ann Fisher Distinguished Professor of Engineering at the University of Illinois Urbana-Champaign, USA. He holds joint appointments in the Departments of Electrical & Computer Engineering and Computer Science as well as the Coordinated Science Laboratory (CSL), the National Center for Supercomputing Applications (NCSA), and the Carl R. Woese Institute for Genomic Biology. The winner of numerous awards and honors, he was the founding chief scientist of the Information Trust Institute at UIUC, a campus-wide research center addressing security, reliability, and safety issues in critical infrastructures.

Zbigniew T. Kalbarczyk is a Research Professor in the Department of Electrical & Computer Engineering and the Coordinated Science Laboratory of the University of Illinois Urbana-Champaign, USA. He is a member of the IEEE, the IEEE Computer Society, and IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance. Dr. Kalbarczyk's research interests are in the design and validation of reliable and secure computing systems. His current work explores emerging computing technologies, machine learning-based methods for early detection of security attacks, analysis of data on failures and security attacks in large computing systems, and more.

Nithin M. Nakka received his B.Tech (hons.) degree from the Indian Institute of Technology, Kharagpur, India, and his M.S. and Ph.D. degrees from the University of Illinois Urbana-Champaign, USA. He is a Technical Leader at Cisco Systems and has worked across most layers of the networking stack, from network data-plane hardware, including layer-2 and layer-3 (control plane), to network controllers and network fabric monitoring. His areas of research interest include systems reliability, network telemetry, and hardware-implemented fault tolerance.
Table of Contents
About the Authors xxiii
Preface xxv
Acknowledgments xxvii
About the Companion Website xxix
1 Dependability Concepts and Taxonomy 1
1.1 Introduction 1
1.2 Placing Classical Dependability Techniques in Perspective 2
1.3 Taxonomy of Dependable Computing 4
1.3.1 Faults, Errors, and Failures 5
1.4 Fault Classes 6
1.5 The Fault Cycle and Dependability Measures 6
1.6 Fault and Error Classification 7
1.7 Mean Time Between Failures 11
1.8 User-Perceived System Dependability 13
1.9 Technology Trends and Failure Behavior 14
1.10 Issues at the Hardware Level 15
1.11 Issues at the Platform Level 17
1.12 What Is Unique About This Book? 18
1.13 Overview of the Book 19
References 20
2 Classical Dependability Techniques and Modern Computing Systems: Where and How Do They Meet? 25
2.1 Illustrative Case Studies of Design for Dependability 25
2.2 Cloud Computing: A Rapidly Expanding Computing Paradigm 31
2.3 New Application Domains 37
2.4 Insights 52
References 52
3 Hardware Error Detection and Recovery Through Hardware-Implemented Techniques 57
3.1 Introduction 57
3.2 Redundancy Techniques 58
3.3 Watchdog Timers 67
3.4 Information Redundancy 69
3.5 Capability and Consistency Checking 93
3.6 Insights 93
References 96
4 Processor-Level Error Detection and Recovery 101
4.1 Introduction 101
4.2 Logic-Level Techniques 104
4.3 Error Protection in the Processors 115
4.4 Academic Research on Hardware-Level Error Protection 122
4.5 Insights 134
References 137
5 Hardware Error Detection Through Software-Implemented Techniques 141
5.1 Introduction 141
5.2 Duplication-Based Software Detection Techniques 142
5.3 Control-Flow Checking 146
5.4 Heartbeats 166
5.5 Assertions 173
5.6 Insights 174
References 175
6 Software Error Detection and Recovery Through Software Analysis 179
6.1 Introduction 179
6.2 Diverse Programming 183
6.3 Static Analysis Techniques 194
6.4 Error Detection Based on Dynamic Program Analysis 217
6.5 Processor-Level Selective Replication 233
6.6 Runtime Checking for Residual Software Bugs 239
6.7 Data Audit 242
6.8 Application of Data Audit Techniques 246
6.9 Insights 252
References 253
7 Measurement-Based Analysis of System Software: Operating System Failure Behavior 261
7.1 Introduction 261
7.2 MVS (Multiple Virtual Storage) 262
7.3 Experimental Analysis of OS Dependability 273
7.4 Behavior of the Linux Operating System in the Presence of Errors 275
7.5 Evaluation of Process Pairs in Tandem GUARDIAN 295
7.6 Benchmarking Multiple Operating Systems: A Case Study Using Linux on Pentium, Solaris on SPARC, and AIX on POWER 308
7.7 Dependability Overview of the Cisco Nexus Operating System 326
7.8 Evaluating Operating Systems: Related Studies 330
7.9 Insights 331
References 332
8 Reliable Networked and Distributed Systems 337
8.1 Introduction 337
8.2 System Model 339
8.3 Failure Models 340
8.4 Agreement Protocols 342
8.5 Reliable Broadcast 346
8.6 Reliable Group Communication 351
8.7 Replication 358
8.8 Replication of Multithreaded Applications 370
8.9 Atomic Commit 396
8.10 Opportunities and Challenges in Resource-Disaggregated Cloud Data Centers 400
References 405
9 Checkpointing and Rollback Error Recovery 413
9.1 Introduction 413
9.2 Hardware-Implemented Cache-Based Checkpointing Schemes 415
9.3 Memory-Based Schemes 421
9.4 Operating-System-Level Checkpointing 424
9.5 Compiler-Assisted Checkpointing 432
9.6 Error Detection and Recovery in Distributed Systems 438
9.7 Checkpointing Latency Modeling 451
9.8 Checkpointing in Main Memory Database Systems (MMDB) 455
9.9 Checkpointing in Distributed Database Systems 463
9.10 Multithreaded Checkpointing 468
References 470
10 Checkpointing Large-Scale Systems 475
10.1 Introduction 475
10.2 Checkpointing Techniques 476
10.3 Checkpointing in Selected Existing Systems 484
10.4 Modeling Coordinated Checkpointing for Large-Scale Supercomputers 492
10.5 Checkpointing in Large-Scale Systems: A Simulation Study 502
10.6 Cooperative Checkpointing 506
References 508
11 Internals of Fault Injection Techniques 511
11.1 Introduction 511
11.2 Historical View of Software Fault Injection 513
11.3 Fault Model Attributes 517
11.4 Compile-Time Fault Injection 517
11.5 Runtime Fault Injection 521
11.6 Simulation-Based Fault Injection 529
11.7 Dependability Benchmark Attributes 530
11.8 Architecture of a Fault Injection Environment: NFTAPE Fault/Error Injection Framework Configured to Evaluate Linux OS 531
11.9 ML-Based Fault Injection: Evaluating Modern Autonomous Vehicles 547
11.10 Insights and Concluding Remarks 574
References 574
12 Measurement-Based Analysis of Large-Scale Clusters: Methodology 585
12.1 Introduction 585
12.2 Related Research 587
12.3 Steps in Field Failure Data Analysis 594
12.4 Failure Event Monitoring and Logging 597
12.5 Data Processing 608
12.6 Data Analysis 622
12.7 Estimation of Empirical Distributions 634
12.8 Dependency Analysis 641
References 651
13 Measurement-Based Analysis of Large Systems: Case Studies 667
13.1 Introduction 667
13.2 Case Study I: Failure Characterization of a Production Software-as-a-Service Cloud Platform 667
13.3 Case Study II: Analysis of Blue Waters System Failures 686
13.4 Case Study III: Autonomous Vehicles: Analysis of Human-Generated Data 710
References 737
14 The Future: Dependable and Trustworthy AI Systems 745
14.1 Introduction 745
14.2 Building Trustworthy AI Systems 748
14.3 Offline Identification of Deficiencies 753
14.4 Online Detection and Mitigation 769
14.5 Trust Model Formulation 772
14.6 Modeling the Trustworthiness of Critical Applications 775
14.7 Conclusion: How Can We Make AI Systems Trustworthy? 786
References 788
Index 797