Sources of errors fall into four categories:
1. software bugs; 2. human mistakes during the
configuration and deployment of applications and
during machine maintenance; 3. system hardware
failures; and 4. network problems.
Software bugs directly affect the availability of
system resources; memory leaks are one example. Human
mistakes usually reduce application availability.
When thousands of computers are connected to form
an application and serve network traffic, hardware
failures such as RAID failures, file system issues,
and disk failures become common. An example of a
network problem is a failed network switch. The ugly
part is that a switch usually fails only partially.
Before the faulty switch is identified, application
timeouts and intermittent application availability
are already frustrating the people doing the
troubleshooting. Embedded network issues are
sometimes hard to identify.
Error event filtering, in both the temporal and the
spatial dimension, helps to identify problems most of
the time and is useful in troubleshooting. Once the
errors are identified, modeling and failure prediction
come into play.
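
The text does not spell out how the filtering works, so
the following is only a minimal sketch under one common
reading: temporal filtering collapses repeats of the
same error on the same machine within a time window,
and spatial filtering collapses the same error reported
by different machines within a window. The ErrorEvent
type, its field names, the two function names, the
10-second window, and the sample events are all
assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical error record; the field names are assumptions."""
    timestamp: float  # seconds since some reference point
    node: str         # machine that reported the error
    kind: str         # error category, e.g. "timeout"


def temporal_filter(events: List[ErrorEvent], window: float) -> List[ErrorEvent]:
    """Drop an event if the same (node, kind) was seen within `window` seconds.

    The window slides: a dropped event still refreshes the last-seen time,
    so a sustained burst collapses into its first event.
    """
    last_seen: Dict[Tuple[str, str], float] = {}
    kept: List[ErrorEvent] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.node, ev.kind)
        prev = last_seen.get(key)
        if prev is None or ev.timestamp - prev > window:
            kept.append(ev)
        last_seen[key] = ev.timestamp
    return kept


def spatial_filter(events: List[ErrorEvent], window: float) -> List[ErrorEvent]:
    """Drop an event if the same kind was seen on any node within `window` seconds."""
    last_seen: Dict[str, float] = {}
    kept: List[ErrorEvent] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        prev = last_seen.get(ev.kind)
        if prev is None or ev.timestamp - prev > window:
            kept.append(ev)
        last_seen[ev.kind] = ev.timestamp
    return kept


# Made-up sample: a burst of timeouts on two machines, one disk failure,
# and a later, unrelated timeout.
events = [
    ErrorEvent(0, "web01", "timeout"),
    ErrorEvent(2, "web01", "timeout"),
    ErrorEvent(3, "web02", "timeout"),
    ErrorEvent(4, "web01", "timeout"),
    ErrorEvent(100, "db01", "disk_failure"),
    ErrorEvent(200, "web01", "timeout"),
]

after_temporal = temporal_filter(events, window=10)
after_spatial = spatial_filter(after_temporal, window=10)
for ev in after_spatial:
    print(ev)
```

With this sample, the six raw events reduce to four
after the temporal pass and to three after the spatial
pass: one event for the timeout burst, one for the disk
failure, and one for the later timeout. Running the
temporal pass first is a design choice of the sketch,
not something the text prescribes.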