24,99 €
incl. VAT

Ready to ship in over 4 weeks
  • Paperback

Product description
The book describes a data-driven approach to optimal monitoring and alerting in distributed computer systems. It interprets monitoring as a continuous process aimed at extracting meaning from a system's data. The resulting wisdom drives effective maintenance and fast recovery, the bread and butter of web operations.

The content of the book gives a scalable perspective on the following topics:

- anatomy of monitoring and alerting
- conclusive interpretation of time series
- data-driven approach to setting up monitors
- addressing system failures by their impact
- applications of monitoring in automation
- reporting on quality with quantitative means
- and more!

With this practical book, you'll discover how to catch complications in your distributed system before they develop into costly problems. Based on his extensive experience in systems ops at large technology companies, author Slawek Ligus describes an effective data-driven approach for monitoring and alerting that enables you to maintain high availability and deliver a high quality of service. Learn methods for measuring state changes and data flow in your system, and set up alerts to help you recover quickly from problems when they do arise. If you're a system operator waging the daily battle to provide the best performance at the lowest cost, this book is for you.

- Monitor every component of your application stack, from the network to user experience
- Learn how to draw the right conclusions from the metrics you obtain
- Develop a robust alerting system that can identify problematic anomalies without raising false alarms
- Address system failures by their impact on resource utilization and user experience
- Plan an alerting configuration that scales with your expanding network
- Learn how to choose appropriate maintenance times automatically
- Develop a work environment that fosters flexibility and adaptability

About the author
Slawek is a systems and software engineer with a background in web operations and service-oriented architectures. He specializes in implementing solutions to tough problems in large-scale information systems. Slawek has been involved in automation of infrastructures and product development, working with leading Internet giants.