- Paperback
The go-to guidebook for deploying Big Data solutions with Hadoop

Today's enterprise architects need to understand how the Hadoop frameworks and APIs fit together, and how they can be integrated to deliver real-world solutions. This book is a practical, detailed guide to building and implementing those solutions, with code-level instruction in the popular Wrox tradition. It covers storing data with HDFS and HBase, processing data with MapReduce, and automating data processing with Oozie. Hadoop security, running Hadoop with Amazon Web Services, best practices, and automating Hadoop processes in real time are also covered in depth. With in-depth code examples in Java and XML and the latest on recent additions to the Hadoop ecosystem, this complete resource also covers the use of APIs, exposing their inner workings and allowing architects and developers to better leverage and customize them.

- The ultimate guide for developers, designers, and architects who need to build and deploy Hadoop applications
- Covers storing and processing data with various technologies, automating data processing, Hadoop security, and delivering real-time solutions
- Includes detailed, real-world examples and code-level guidelines
- Explains when, why, and how to use these tools effectively
- Written by a team of Hadoop experts in the programmer-to-programmer Wrox style

Professional Hadoop Solutions is the reference enterprise architects and developers need to maximize the power of Hadoop.
Note: This item can only be delivered to a German shipping address.
Product details
- Publisher: Wiley & Sons
- Publisher's item no.: 1W118611930
- Number of pages: 504
- Publication date: 23 September 2013
- Language: English
- Dimensions: 234mm x 189mm x 28mm
- Weight: 857g
- ISBN-13: 9781118611937
- ISBN-10: 1118611934
- Item no.: 37604343
- Manufacturer identification
- Person responsible for product safety
- Europaallee 1
- 36244 Bad Hersfeld
- gpsr@libri.de
Boris Lublinsky is principal architect at Nokia and an author of more than 70 publications, including Applied SOA: Service-Oriented Architecture and Design Strategies. Kevin T. Smith is Director of Technology Solutions for the AMS division of Novetta Solutions, where he builds highly secure, data-oriented solutions for customers. Alexey Yakubovich is a system architect at Hortonworks and a member of the Object Management Group SIG on SOA governance and model-driven architecture.
Introduction

Chapter 1: Big Data and the Hadoop Ecosystem
- Big Data Meets Hadoop
- Hadoop: Meeting the Big Data Challenge
- Data Science in the Business World
- The Hadoop Ecosystem
- Hadoop Core Components
- Hadoop Distributions
- Developing Enterprise Applications with Hadoop
- Summary

Chapter 2: Storing Data in Hadoop
- HDFS
- HDFS Architecture
- Using HDFS Files
- Hadoop-Specific File Types
- HDFS Federation and High Availability
- HBase
- HBase Architecture
- HBase Schema Design
- Programming for HBase
- New HBase Features
- Combining HDFS and HBase for Effective Data Storage
- Using Apache Avro
- Managing Metadata with HCatalog
- Choosing an Appropriate Hadoop Data Organization for Your Applications
- Summary

Chapter 3: Processing Your Data with MapReduce
- Getting to Know MapReduce
- MapReduce Execution Pipeline
- Runtime Coordination and Task Management in MapReduce
- Your First MapReduce Application
- Building and Executing MapReduce Programs
- Designing MapReduce Implementations
- Using MapReduce as a Framework for Parallel Processing
- Simple Data Processing with MapReduce
- Building Joins with MapReduce
- Building Iterative MapReduce Applications
- To MapReduce or Not to MapReduce?
- Common MapReduce Design Gotchas
- Summary

Chapter 4: Customizing MapReduce Execution
- Controlling MapReduce Execution with InputFormat
- Implementing InputFormat for Compute-Intensive Applications
- Implementing InputFormat to Control the Number of Maps
- Implementing InputFormat for Multiple HBase Tables
- Reading Data Your Way with Custom RecordReaders
- Implementing a Queue-Based RecordReader
- Implementing RecordReader for XML Data
- Organizing Output Data with Custom Output Formats
- Implementing OutputFormat for Splitting MapReduce Job's Output into Multiple Directories
- Writing Data Your Way with Custom RecordWriters
- Implementing a RecordWriter to Produce Output tar Files
- Optimizing Your MapReduce Execution with a Combiner
- Controlling Reducer Execution with Partitioners
- Implementing a Custom Partitioner for One-to-Many Joins
- Using Non-Java Code with Hadoop
- Pipes
- Hadoop Streaming
- Using JNI
- Summary

Chapter 5: Building Reliable MapReduce Apps
- Unit Testing MapReduce Applications
- Testing Mappers
- Testing Reducers
- Integration Testing
- Local Application Testing with Eclipse
- Using Logging for Hadoop Testing
- Processing Applications Logs
- Reporting Metrics with Job Counters
- Defensive Programming in MapReduce
- Summary

Chapter 6: Automating Data Processing with Oozie
- Getting to Know Oozie
- Oozie Workflow
- Executing Asynchronous Activities in Oozie Workflow
- Oozie Recovery Capabilities
- Oozie Workflow Job Life Cycle
- Oozie Coordinator
- Oozie Bundle
- Oozie Parameterization with Expression Language
- Workflow Functions
- Coordinator Functions
- Bundle Functions
- Other EL Functions
- Oozie Job Execution Model
- Accessing Oozie
- Oozie SLA
- Summary

Chapter 7: Using Oozie
- Validating Information about Places Using Probes
- Designing Place Validation Based on Probes
- Designing Oozie Workflows
- Implementing Oozie Workflow Applications
- Implementing the Data Preparation Workflow
- Implementing Attendance Index and Cluster Strands Workflows
- Implementing Workflow Activities
- Populating the Execution Context from a java Action
- Using MapReduce Jobs in Oozie Workflows
- Implementing Oozie Coordinator Applications
- Implementing Oozie Bundle Applications
- Deploying, Testing, and Executing Oozie Applications
- Deploying Oozie Applications
- Using the Oozie CLI for Execution of an Oozie Application
- Passing Arguments to Oozie Jobs
- Using the Oozie Console to Get Information about Oozie Applications
- Getting to Know the Oozie Console Screens
- Getting Information about a Coordinator Job
- Summary

Chapter 8: Advanced Oozie Features
- Building Custom Oozie Workflow Actions
- Implementing a Custom Oozie Workflow Action
- Deploying Oozie Custom Workflow Actions
- Adding Dynamic Execution to Oozie Workflows
- Overall Implementation Approach
- A Machine Learning Model, Parameters, and Algorithm
- Defining a Workflow for an Iterative Process
- Dynamic Workflow Generation
- Using the Oozie Java API
- Using Uber Jars with Oozie Applications
- Data Ingestion Conveyer
- Summary

Chapter 9: Real-Time Hadoop
- Real-Time Applications in the Real World
- Using HBase for Implementing Real-Time Applications
- Using HBase as a Picture Management System
- Using HBase as a Lucene Back End
- Using Specialized Real-Time Hadoop Query Systems
- Apache Drill
- Impala
- Comparing Real-Time Queries to MapReduce
- Using Hadoop-Based Event-Processing Systems
- HFlame
- Storm
- Comparing Event Processing to MapReduce
- Summary

Chapter 10: Hadoop Security
- A Brief History: Understanding Hadoop Security Challenges
- Authentication
- Kerberos Authentication
- Delegated Security Credentials
- Authorization
- HDFS File Permissions
- Service-Level Authorization
- Job Authorization
- Oozie Authentication and Authorization
- Network Encryption
- Security Enhancements with Project Rhino
- HDFS Disk-Level Encryption
- Token-Based Authentication and Unified Authorization Framework
- HBase Cell-Level Security
- Putting It All Together: Best Practices for Securing Hadoop
- Authentication
- Authorization
- Network Encryption
- Stay Tuned for Hadoop Enhancements
- Summary

Chapter 11: Running Hadoop Applications on AWS
- Getting to Know AWS
- Options for Running Hadoop on AWS
- Custom Installation using EC2 Instances
- Elastic MapReduce
- Additional Considerations before Making Your Choice
- Understanding the EMR-Hadoop Relationship
- EMR Architecture
- Using S3 Storage
- Maximizing Your Use of EMR
- Utilizing CloudWatch and Other AWS Components
- Accessing and Using EMR
- Using AWS S3
- Understanding the Use of Buckets
- Content Browsing with the Console
- Programmatically Accessing Files in S3
- Using MapReduce to Upload Multiple Files to S3
- Automating EMR Job Flow Creation and Job Execution
- Orchestrating Job Execution in EMR
- Using Oozie on an EMR Cluster
- AWS Simple Workflow
- AWS Data Pipeline
- Summary

Chapter 12: Building Enterprise Security Solutions for Hadoop Implementations
- Security Concerns for Enterprise Applications
- Authentication
- Authorization
- Confidentiality
- Integrity
- Auditing
- What Hadoop Security Doesn't Natively Provide for Enterprise Applications
- Data-Oriented Access Control
- Differential Privacy
- Encrypted Data at Rest
- Enterprise Security Integration
- Approaches for Securing Enterprise Applications Using Hadoop
- Access Control Protection with Accumulo
- Encryption at Rest
- Network Isolation and Separation Approaches
- Summary

Chapter 13: Hadoop's Future
- Simplifying MapReduce Programming with DSLs
- What Are DSLs?
- DSLs for Hadoop
- Faster, More Scalable Processing
- Apache YARN
- Tez
- Security Enhancements
- Emerging Trends
- Summary

Appendix: Useful Reading

Index