
Chapter 3   Alarm Correlation

This chapter explains the challenge of correlating alarms. Section 3.1 gives the background on how the difficulties that call for correlation arise. Section 3.2 gives more detail on what an alarm is. Section 3.3 explains the task of correlating alarms, and Section 3.4 briefly discusses how others have accomplished the task.

3.1   Alarms and Faults

The first point is that alarms and faults are not the same thing.

An alarm is the notification of an event. In fault management this event is a fault. Alarms may also be used in other areas, for example traffic management, where the alarm gives information about a rise in traffic in one area [Meira, 1997].

Alarms have a life span. Some alarms have a pre-determined lifetime, others are started by a SET alarm then ended by a CLEAR alarm. Fault alarms are usually of the latter sort.
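The SET and CLEAR life span can be pictured with a small sketch. The function name, the tuple layout of the notifications and the alarm identifiers below are assumptions made for illustration only.

```python
def active_alarms(notifications):
    """Return the ids of alarms that are SET and not yet CLEARed."""
    active = set()
    for kind, alarm_id in notifications:
        if kind == "SET":
            active.add(alarm_id)
        elif kind == "CLEAR":
            active.discard(alarm_id)  # ignore a CLEAR with no matching SET
    return active


# Hypothetical notification log: one alarm is cleared, one stays active.
log = [("SET", "link-down"), ("SET", "door-open"), ("CLEAR", "link-down")]
still_active = active_alarms(log)
```

Alarms with a pre-determined lifetime would need a timer instead of a CLEAR, which this sketch does not model.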

In fault management the alarm is a symptom of the fault. There may be many alarms generated for a single fault.




Figure 3.1: The connection between X and Y has broken. The arrows denote alarms being sent to the manager.


For example, see Figure 3.1. Here we have a break between elements X and Y. The alarms are generated when A and B try to communicate. A knows that something is wrong since it can't communicate with B, so it generates an alarm. A doesn't know what has caused the failure. The cause may be in A itself, or it may be in B. In this hypothetical case the elements are too primitive to do any testing, so A simply sends the alarm to let the manager know that something is wrong. B also can't get a message to A, so it sets an alarm too. X and Y can't pass on the messages they received from A and B respectively, so they also generate alarms.

It may happen that a device generates more than one alarm for one fault. For example, X may send two alarms, perhaps one for the connection failure, and one for a dropped packet. Perhaps X will send an alarm every time A tries to send a message through it. This will generate intermittent alarms.

Alarm correlation is necessary to separate the event from its symptoms, in other words, to find the fault from the generated alarms.

3.2   An Alarm Object

According to the TMN information architecture, an alarm can be thought of as an object. The attributes of the alarm try to describe the event that triggered it.

This is a possible set of attributes:
Object ID
which identifies the object sending the alarm
Time Stamp
which gives the time the alarm was issued
Severity
which gives a state ranging from critical to indeterminate
Event Type
which gives some indication of what happened
Probable Cause
which gives some indication of why it happened
Specific Problems
which clarifies what happened
Proposed Repair Actions
which gives some suggestion about what to do

There can be extra fields added and modified by the correlator, such as a field to keep a count of how many times an alarm is SET and CLEARed within a certain time frame.
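The attribute list above can be pictured as a simple record. The following Python sketch is illustrative only: the field names, the severity levels between critical and indeterminate, the sample values and the `set_count` field are assumptions made for the example, not a definition taken from TMN.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Severity(Enum):
    # The text names only "critical" and "indeterminate";
    # the levels in between are assumed for illustration.
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3
    WARNING = 4
    INDETERMINATE = 5


@dataclass
class Alarm:
    object_id: str        # object sending the alarm
    time_stamp: datetime  # time the alarm was issued
    severity: Severity
    event_type: str       # what happened
    probable_cause: str   # why it happened
    specific_problems: list[str] = field(default_factory=list)
    proposed_repair_actions: list[str] = field(default_factory=list)
    set_count: int = 0    # extra field kept by the correlator (assumed name)


# Hypothetical alarm from element X in Figure 3.1.
alarm = Alarm(
    object_id="X",
    time_stamp=datetime(1997, 1, 1, 12, 0),
    severity=Severity.MAJOR,
    event_type="communications",
    probable_cause="link to Y lost",
)
alarm.set_count += 1  # the correlator notes one SET
```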

3.3   Correlation

Alarm correlation aims to pinpoint the triggering events from the incoming alarms. Generally this is done by reducing the number of alarm notifications. This helps add meaning to the alarms remaining.

All the alarms are logged into an alarm surveillance database. This helps when trying to trace what has happened in the past.

3.3.1   Functional Requirements

The basic requirements are to filter, compress, count, classify and generalise about the incoming alarms. These help the operator to locate the fault.

Modify Alarms

The operator may wish to clarify an alarm by overwriting one or more of its fields. For example, suppose an alarm says that there is something in the letter box. If this alarm was first set while another alarm registering the sound of a motorcycle was active, then the first alarm could be modified to say that there are letters in the letter box.
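A minimal sketch of the letter box example follows. The dictionary layout, the `event` key and the event strings are assumptions chosen for illustration.

```python
def modify_letter_box_alarm(alarm, active_alarms):
    """Return the alarm, clarified if a motorcycle alarm is also active."""
    motorcycle_heard = any(
        a["event"] == "motorcycle sound" for a in active_alarms
    )
    if alarm["event"] == "something in letter box" and motorcycle_heard:
        # Overwrite one field; the original alarm is left untouched.
        return {**alarm, "event": "letters in letter box"}
    return alarm


letter_box = {"object_id": "letter box", "event": "something in letter box"}
motorcycle = {"object_id": "street", "event": "motorcycle sound"}

clarified = modify_letter_box_alarm(letter_box, [motorcycle])
unchanged = modify_letter_box_alarm(letter_box, [])
```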

Suppress Alarm

Some alarms may be known to be superfluous to solving a problem. For example, if there were a major collapse in the network and a large number of alarms were coming in from one area, then the operator might like to suppress most of the alarms and keep only those that convey the boundary of the problem area.
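One way to picture keeping the boundary of a problem area is sketched below. It assumes the manager knows which elements are neighbours, and it treats an alarming element as being on the boundary when at least one of its neighbours is not alarming. Both the topology and this definition of boundary are assumptions for the example.

```python
def boundary_alarms(alarming, neighbours):
    """Keep alarming elements that touch at least one healthy neighbour."""
    return {
        element
        for element in alarming
        if neighbours.get(element, set()) - alarming
    }


# Hypothetical chain of elements: A - B - C - D - E
neighbours = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
alarming = {"B", "C", "D"}

kept = boundary_alarms(alarming, neighbours)
suppressed = alarming - kept
```

Here B and D are kept because they border the healthy elements A and E, while the alarm from C is suppressed.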

Clear Active Alarms

Once a fault has been identified and its cause is clear, it may be useful to clear all the active alarms in one go rather than clearing each one manually. For example, suppose a test was carried out that set off many alarms. Once the test was complete, nothing prompted the elements to send CLEAR alarms back to the manager, so the alarms remained active until the operator cleared them.
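Clearing in one go might look like the sketch below. It assumes each active alarm has already been attributed to a cause; the mapping from alarm id to cause and the names used are hypothetical.

```python
def clear_all(active, cause):
    """Remove every active alarm attributed to `cause`; return their ids."""
    cleared = sorted(
        alarm_id for alarm_id, c in active.items() if c == cause
    )
    for alarm_id in cleared:
        del active[alarm_id]
    return cleared


# Two alarms were set off by a test run, one by a real fault.
active = {"a1": "test-run", "a2": "test-run", "a3": "link-down"}
cleared = clear_all(active, "test-run")
```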

Generate New Alarms

Even though alarm correlation mainly aims to reduce the number of alarms, there may be times when one needs to generate a new alarm. For example, if an alarm saying that the dog is barking came in 20 minutes ago and still hasn't been cleared, then a new alarm would be useful, to alert the operator to this problem and increase the severity.
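The 20 minute example can be sketched as follows. The dictionary keys, the severity labels and the wording of the generated event are assumptions for illustration.

```python
from datetime import datetime, timedelta


def overdue_alarms(active, now, limit=timedelta(minutes=20)):
    """Generate a higher-severity alarm for each alarm active too long."""
    new_alarms = []
    for alarm in active:
        if now - alarm["set_at"] >= limit:
            new_alarms.append({
                "event": alarm["event"] + " not cleared",
                "severity": "major",  # assumed to rank above "minor"
                "set_at": now,
            })
    return new_alarms


now = datetime(1997, 1, 1, 12, 20)
active = [
    {"event": "dog barking", "severity": "minor",
     "set_at": datetime(1997, 1, 1, 12, 0)},
    {"event": "door open", "severity": "minor",
     "set_at": datetime(1997, 1, 1, 12, 15)},
]
generated = overdue_alarms(active, now)
```

Only the dog barking alarm has been active for the full 20 minutes, so only it produces a new alarm.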

Delay Alarms

This is a trickier operation to implement. If an alarm is unimportant, the operator may only want to know about it once it has been set for an extended time. For example, the dog barking alarm could be delayed for 20 minutes. When the operator does then receive the dog barking alarm, it can be acted on immediately.
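A delay can be sketched as a buffer that holds each SET and passes it on only if no CLEAR arrives within the delay. The class and method names are hypothetical, and times are plain minute counts to keep the example short.

```python
class DelayBuffer:
    """Hold SET alarms and release them only if they outlive the delay."""

    def __init__(self, delay_minutes):
        self.delay = delay_minutes
        self.pending = {}  # alarm id -> minute at which it was SET

    def on_set(self, alarm_id, minute):
        self.pending.setdefault(alarm_id, minute)

    def on_clear(self, alarm_id):
        # Cleared before the delay expired: the operator never sees it.
        self.pending.pop(alarm_id, None)

    def release(self, minute):
        """Return and forget the alarms that have been set long enough."""
        due = [a for a, t in self.pending.items() if minute - t >= self.delay]
        for alarm_id in due:
            del self.pending[alarm_id]
        return due


buffer = DelayBuffer(delay_minutes=20)
buffer.on_set("dog barking", 0)
buffer.on_set("cat meowing", 5)
buffer.on_clear("cat meowing")
early = buffer.release(10)  # nothing has been set for 20 minutes yet
late = buffer.release(20)   # the dog barking alarm is now passed on
```

The tricky part the text alludes to is visible here: something has to call `release` periodically, so a real correlator needs a timer as well as the incoming alarm stream.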

Suppress Alarm then Generate a Consequential Alarm

This is a combination of delaying and generating alarms. Take the dog example again. The alarm could be suppressed, as before. Then after the waiting period, if the alarm is still active, a new alarm is generated to distinguish it from the low severity dog barking alarm.

Count Alarms

This operation takes an estimate of the frequency of an alarm as its condition for firing. In Figure 3.2 we see an alarm being SET and CLEARed in quick succession over 10 minutes. Counting the SETs that take place over those 10 minutes gives a count of 5. This value could be used as the condition for performing another function, such as generating a new alarm once the count exceeds a certain threshold.




Figure 3.2: Frequent alarms. The upward arrows represent the SET alarms, the downward arrows represent the CLEAR alarms.
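Counting SETs over a time frame can be sketched with a sliding window. The class name, the SET times and the threshold below are assumptions; the times are chosen only so that five SETs fall within 10 minutes, as in the count described above.

```python
from collections import deque


class SetCounter:
    """Count the SETs of one alarm within the last `window_minutes`."""

    def __init__(self, window_minutes):
        self.window = window_minutes
        self.set_times = deque()

    def record_set(self, minute):
        self.set_times.append(minute)
        # Drop SETs that have fallen out of the window.
        while self.set_times[0] <= minute - self.window:
            self.set_times.popleft()
        return len(self.set_times)


counter = SetCounter(window_minutes=10)
counts = [counter.record_set(m) for m in (0, 2, 4, 6, 8)]

threshold = 4  # assumed value
exceeded = counts[-1] > threshold  # condition for generating a new alarm
```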


3.4   Implementations

The function of correlating alarms has been tackled in a number of ways. This thesis focuses on rule-based correlation. So far we have described the functions that an alarm correlator performs. The most straightforward way to implement these is to define conditions and operations, then keep a large database of the connections between the two. This is a rule-based system known as a production system (see Chapter 5). Rule-based systems are the most common.
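The idea of connecting conditions to operations can be sketched as a list of rules applied to each alarm. This is a deliberately simplified illustration: the `Rule` type, the rule names and the alarm fields are all assumptions, and a real production system would also need a working memory and a strategy for choosing between competing rules.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    condition: Callable[[dict], bool]  # when the rule fires
    operation: Callable[[dict], dict]  # what it does to the alarm


def run_rules(alarm, rules):
    """Apply, in order, every rule whose condition holds."""
    for rule in rules:
        if rule.condition(alarm):
            alarm = rule.operation(alarm)
    return alarm


rules = [
    Rule("escalate frequent alarms",
         lambda a: a["set_count"] > 3,
         lambda a: {**a, "severity": "major"}),
    Rule("suppress warnings",
         lambda a: a["severity"] == "warning",
         lambda a: {**a, "suppressed": True}),
]

frequent = run_rules(
    {"event": "dog barking", "severity": "warning", "set_count": 5}, rules)
occasional = run_rules(
    {"event": "dog barking", "severity": "warning", "set_count": 1}, rules)
```

The frequent alarm is escalated by the first rule and so escapes the second, while the occasional one is suppressed.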

The other main categories are:

Model-based
This uses a model of the system to analyse monitored information. The model may be in the form of a tree, rules, state machine or networked functional nodes. These methods can become complicated and need special algorithms to adapt from the model to the action required. They have great potential to be adaptable to changes in the model.
Case-based
This uses cases to define the system. New cases are integrated into the system by modifying each case according to tests devised beforehand, until the case fits the knowledge base. Devising the tests needs to be done in the light of previous knowledge about what the system requires. It is a bit unpredictable, as the control over the modifications made to the cases as they are added is not explicit.
Neural Network
This is modelled on the pattern of development of the brain. The process has been likened to annealing metal, where the magnetic pole of the metal is set by heating then cooling. When the block is heated the particles are excited and unpredictable; then, in the cooling stage, all the particles fall into place. Likewise, in neural nets the program goes through a stage of randomness before getting things right consistently. The rules are implicit rather than explicit. This makes it hard to locate mistakes in the program, and there is no way to test the reasoning behind a certain decision [Gardner and Harle, 1996].

3.5   Summary

Alarm correlation is a useful tool for network management. When monitoring faults in a telecommunications network it is more often used as an alarm filter than as a tool to provide a complete diagnosis, owing to the complexity of the system. Different approaches have been used to tackle this problem. Rule-based systems come out ahead because they are the most commonly used and because they are safer against machines making unpredictable assumptions.

