- Paperback
Other customers were also interested in
- Danielle C Tarraf: The Department of Defense Posture for Artificial Intelligence: Assessment and Recommendations (29,99 €)
- Mohammad Shahid Husain: Big Data Concepts, Technologies, and Applications (65,99 €)
- Dursun Delen: Prescriptive Analytics (53,99 €)
- Joanne Rodrigues: Product Analytics (49,99 €)
- Benjamin Bengfort: Data Analytics with Hadoop (30,99 €)
- Thomas Erl: Big Data Fundamentals (39,99 €)
- Marco Russo: Tabular Modeling in Microsoft SQL Server Analysis Services (50,99 €)
In Expert Hadoop® Administration, leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples. Alapati demystifies complex Hadoop environments, helping readers understand exactly what happens behind the scenes when they administer their cluster. Students will gain unprecedented insight as they walk through building clusters from scratch and configuring high availability, performance, security, encryption, and other key attributes.
Note: This item can only be delivered to an address in Germany.
Product details
- Addison-Wesley Data & Analytics Series
- Publisher: Pearson Education (US)
- Pages: 848
- Publication date: December 6, 2016
- Language: English
- Dimensions: 231mm x 177mm x 45mm
- Weight: 1262g
- ISBN-13: 9780134597195
- ISBN-10: 0134597192
- Item no.: 45102766
Sam R. Alapati has been working with various aspects of the Hadoop environment for the past six years. He is currently the principal Hadoop administrator at Sabre Corporation in Westlake, Texas, and works on a daily basis with multiple large Hadoop 2 clusters. In addition to being the point person for all Hadoop administration at Sabre, Sam manages multiple critical data-science- and data-analysis-related Hadoop job flows and is also an expert Oracle Database Administrator. His vast knowledge of relational databases and SQL contributes to his work with Hadoop-related projects. Sam’s recognition in the database and middleware area includes having published 18 well-received books over the past 14 years, mostly on Oracle Database Administration and Oracle WebLogic Server. His experience dealing with numerous configuration, architectural, and performance-related Hadoop issues over the years led him to the realization that many working Hadoop administrators and developers would appreciate having a handy reference such as this book to turn to when creating, managing, securing, and optimizing their Hadoop infrastructure.
Foreword
Preface
Acknowledgments
About the Author
Part I: Introduction to Hadoop—Architecture and Hadoop Clusters
Chapter 1: Introduction to Hadoop and Its Environment
Hadoop—An Introduction
Cluster Computing and Hadoop Clusters
Hadoop Components and the Hadoop Ecosphere
What Do Hadoop Administrators Do?
Key Differences between Hadoop 1 and Hadoop 2
Distributed Data Processing: MapReduce and Spark, Hive and Pig
Data Integration: Apache Sqoop, Apache Flume and Apache Kafka
Key Areas of Hadoop Administration
Summary
Chapter 2: An Introduction to the Architecture of Hadoop
Distributed Computing and Hadoop
Hadoop Architecture
Data Storage—The Hadoop Distributed File System
Data Processing with YARN, the Hadoop Operating System
Summary
Chapter 3: Creating and Configuring a Simple Hadoop Cluster
Hadoop Distributions and Installation Types
Setting Up a Pseudo-Distributed Hadoop Cluster
Performing the Initial Hadoop Configuration
Operating the New Hadoop Cluster
Summary
Chapter 4: Planning for and Creating a Fully Distributed Cluster
Planning Your Hadoop Cluster
Going from a Single Rack to Multiple Racks
Creating a Multinode Cluster
Modifying the Hadoop Configuration
Starting Up the Cluster
Configuring Hadoop Services, Web Interfaces and Ports
Summary
Part II: Hadoop Application Frameworks
Chapter 5: Running Applications in a Cluster—The MapReduce Framework (and Hive and Pig)
The MapReduce Framework
Apache Hive
Apache Pig
Summary
Chapter 6: Running Applications in a Cluster—The Spark Framework
What Is Spark?
Why Spark?
The Spark Stack
Installing Spark
Spark Run Modes
Understanding the Cluster Managers
Spark and Data Access
Summary
Chapter 7: Running Spark Applications
The Spark Programming Model
Spark Applications
Architecture of a Spark Application
Running Spark Applications Interactively
Creating and Submitting Spark Applications
Configuring Spark Applications
Monitoring Spark Applications
Handling Streaming Data with Spark Streaming
Using Spark SQL for Handling Structured Data
Summary
Part III: Managing and Protecting Hadoop Data and High Availability
Chapter 8: The Role of the NameNode and How HDFS Works
HDFS—The Interaction between the NameNode and the DataNodes
Rack Awareness and Topology
HDFS Data Replication
How Clients Read and Write HDFS Data
Understanding HDFS Recovery Processes
Centralized Cache Management in HDFS
Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage)
Summary
Chapter 9: HDFS Commands, HDFS Permissions and HDFS Storage
Managing HDFS through the HDFS Shell Commands
Using the dfsadmin Utility to Perform HDFS Operations
Managing HDFS Permissions and Users
Managing HDFS Storage
Rebalancing HDFS Data
Reclaiming HDFS Space
Summary
Chapter 10: Data Protection, File Formats and Accessing HDFS
Safeguarding Data
Data Compression
Hadoop File Formats
Using Hadoop WebHDFS and HttpFS
Summary
Chapter 11: NameNode Operations, High Availability and Federation
Understanding NameNode Operations
The Checkpointing Process
NameNode Safe Mode Operations
Configuring HDFS High Availability
HDFS Federation
Summary
Part IV: Moving Data, Allocating Resources, Scheduling Jobs and Security
Chapter 12: Moving Data Into and Out of Hadoop
Introduction to Hadoop Data Transfer Tools
Loading Data into HDFS from the Command Line
Copying HDFS Data between Clusters with DistCp
Ingesting Data from Relational Databases with Sqoop
Ingesting Data from External Sources with Flume
Ingesting Data with Kafka
Summary
Chapter 13: Resource Allocation in a Hadoop Cluster
Resource Allocation in Hadoop
The FIFO Scheduler
The Capacity Scheduler
The Fair Scheduler
Comparing the Capacity Scheduler and the Fair Scheduler
Summary
Chapter 14: Working with Oozie to Manage Job Workflows
Using Apache Oozie to Schedule Jobs
Oozie Architecture
Deploying Oozie in Your Cluster
Understanding Oozie Workflows
How Oozie Runs an Action
Creating an Oozie Workflow
Running an Oozie Workflow Job
Oozie Coordinators
Managing and Administering Oozie
Summary
Chapter 15: Securing Hadoop
Hadoop Security—An Overview
Hadoop Authentication with Kerberos
Hadoop Authorization
Auditing Hadoop
Securing Hadoop Data
Other Hadoop-Related Security Initiatives
Summary
Part V: Monitoring, Optimization and Troubleshooting
Chapter 16: Managing Jobs, Using Hue and Performing Routine Tasks
Using the YARN Commands to Manage Hadoop Jobs
Decommissioning and Recommissioning Nodes
ResourceManager High Availability
Performing Common Management Tasks
Managing the MySQL Database
Backing Up Important Cluster Data
Using Hue to Administer Your Cluster
Implementing Specialized HDFS Features
Summary
Chapter 17: Monitoring, Metrics and Hadoop Logging
Monitoring Linux Servers
Hadoop Metrics
Using Ganglia for Monitoring
Understanding Hadoop Logging
Using Hadoop’s Web UIs for Monitoring
Monitoring Other Hadoop Components
Summary
Chapter 18: Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking
How to Allocate YARN Memory and CPU
Configuring Efficient Performance
Tuning Map and Reduce Tasks—What the Administrator Can Do
Optimizing Pig and Hive Jobs
Benchmarking Your Cluster
Hadoop Counters
Optimizing MapReduce
Summary
Chapter 19: Configuring and Tuning Apache Spark on YARN
Configuring Resource Allocation for Spark on YARN
Dynamic Resource Allocation when Running Spark on YARN
Storage Formats and Compressing Data
Monitoring Spark Applications
Tuning Garbage Collection
Tuning Spark Streaming Applications
Summary
Chapter 20: Optimizing Spark Applications
Revisiting the Spark Execution Model
Shuffle Operations and How to Minimize Them
Partitioning and Parallelism (Number of Tasks)
Optimizing Data Serialization and Compression
Understanding Spark’s SQL Query Optimizer
Caching Data
Summary
Chapter 21: Troubleshooting Hadoop—A Sampler
Space-Related Issues
Handling YARN Jobs That Are Stuck
JVM Memory-Allocation and Garbage-Collection Strategies
Handling Different Types of Failures
Troubleshooting Spark Jobs
Debugging Spark Applications
Summary
Chapter 22: Installing VirtualBox and Linux and Cloning the Virtual Machines
Installing Oracle VirtualBox
Installing Oracle Enterprise Linux
Cloning the Linux Server
Index