Chapter 6. Monitoring Services

You will learn about the following in this chapter:

  • The difference between proactive and reactive monitoring

  • Effective techniques for host monitoring

  • Important elements of network monitoring

  • How to monitor an Internet service

  • Maintaining log files

  • How to monitor your logs

  • Combining internal and external system monitoring techniques

  • Using enterprise monitoring software

Doctors today say that every adult should go for a physical exam once a year, to monitor overall health and catch any problems that may be developing. Unix systems need to be monitored, too, and system administrators should conduct regular system “checkups” to ensure their systems' overall functionality and ongoing reliability. Service monitoring tools can help administrators conduct regular system checks, and administrators can choose from among a wide variety of these tools.

This chapter introduces the basic concepts of service monitoring, the types of services that can be monitored, and the tools used to monitor them (collectively called monitors). Logs also are an essential part of the monitoring process; in this chapter, you learn about keeping and monitoring logs, and some helpful techniques for analyzing and acting on a log's contents. Finally, the chapter discusses some popular enterprise-wide monitoring tools that can oversee your entire organization's technology infrastructure.

What Is Monitoring?

You monitor your Unix systems to make sure they are performing as they should. System monitoring assesses the performance of individual machines, to determine the status of, for example, remaining available disk space, CPU load, and network availability. Monitoring also assesses the performance of servers, to determine (for example) that mail servers are routing mail properly and Web servers are serving Web pages within acceptable standards.

Monitoring encompasses a variety of operations, including service checking, log analysis, and notification, but all of these functions are designed to identify two types of problems: those that the system is in danger of developing and those that the system already has developed. These two types of monitoring are referred to as proactive and reactive monitoring. As a system administrator, you need to understand the uses of and the differences between both types of system monitoring.

Proactive Monitoring

The Merriam-Webster Online Dictionary defines proactive as “acting in anticipation of future problems, needs, or changes.” This is exactly what proactive monitoring does; it monitors various aspects of a system, and using a set of known or learned parameters, warns system support staff of impending problems. Note that proactive monitoring does not require some sort of grand artificial intelligence, nor does it try to predict the future. Proactive monitoring does try to discover problems before anyone else does.

A simple but effective example of proactive monitoring is periodically checking disk space on a server's file systems to make sure nothing is approaching 100% capacity. If one file system is becoming full, you can examine its contents more closely to see what is causing the problem before something disastrous happens.
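As an illustration, here is a minimal sketch of such a check, assuming the Solaris/Linux df -k column layout and a mail-capable host; the 90% threshold and alert address are arbitrary placeholders:

#!/bin/sh
# Warn when any file system crosses a 90% capacity threshold.
# (df -k column order varies slightly between Unix flavors; this parsing
# assumes the capacity percentage is in the fifth column.)
THRESHOLD=90
df -k | sed 1d | while read fs kbytes used avail capacity mount; do
    pct=`echo $capacity | tr -d '%'`
    if [ "$pct" -gt "$THRESHOLD" ] 2>/dev/null; then
        echo "WARNING: $mount is ${pct}% full" | mailx -s "disk space warning" root@example.com
    fi
done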

Reactive Monitoring

Reactive monitoring, unlike proactive monitoring, uncovers and reports system problems that have already occurred, rather than trying to cut problems off at the pass. Reactive monitoring is often frowned upon in favor of its proactive counterpart because it depends on a problem existing beforehand, but it has its place with services that are either on or off, much like a light bulb. There are many situations that you cannot possibly predict ahead of time, such as

  • Power and HVAC outages

  • Security breaches

  • Kernel panics

  • Human error

For these situations, reactive monitoring is the only choice.

Reactive monitoring is not appropriate in all situations, though. Using the disk space example from the previous section, if you had instead monitored system logs for a file system full message, you would not know about any disk space problems until a file system reaches 100% capacity. Such reactive monitoring only prepares you to deal with a problem once it has occurred, instead of watching for the telltale signs ahead of time and preventing the problem from occurring in the first place. In this situation, proactive monitoring would have been a better choice.

Host Monitoring

System administrators are responsible for servers, and they use host monitoring to track and troubleshoot server performance. Host monitoring looks at various performance and capacity parameters of a server and compares them against standard thresholds determined by the system administrator. Some of these parameters include the following:

  • CPU load

  • Available memory

  • Paging activity

  • Disk utilization

  • Number of users (for public login servers)

  • ICMP echo response time (ping)

Host monitoring examines the individual parts of the Unix system infrastructure, to uncover existing problems and individual performance issues that can develop into larger problems for the Unix system. Host monitoring might uncover, for example, a problem on one mail server that isn't necessarily a problem for the mail service as a whole, but may be a harbinger of a larger problem yet to come. Because most host-monitoring tools provide detailed, almost real-time statistics, you can use them for proactive monitoring.

Consider this example of using host-monitoring tools for proactive monitoring: Say you have four mail servers hosting 10,000 mailboxes, and you notice that as the hours of prime-time usage approach, the CPU load on each server increases at a rate that might process mail very slowly or even render the systems unusable (see Chapter 11, “Performance Tuning and Capacity Planning,” for more information about CPU load). As a result of this proactive monitoring, you might add another server to the pool of mail servers, or launch a more thorough investigation into why the load has increased so suddenly (an investigation that might result in a code change).

In either case, you have become aware of a problem before it occurred and have an opportunity to resolve the problem before it becomes apparent to your customers. Many host-monitoring processes measure performance; you learn more about them and their use in Chapter 11.

Network Monitoring

At some point in their careers, system administrators must put on their network engineer hats. Some people are quite skilled at network design and router configuration, while some are network neophytes, skilled at system administration but clueless when confronted with a failing T1 circuit (a circuit is just another name for a network connection). At the very least, a system administrator should understand the networks that connect machines within the Unix system for which they are responsible and monitor the availability of those networks.

In Chapter 4, “Testing Your Systems,” you learned about system testing using the example of Nearest Integer Bank; here, the same example illustrates the system monitoring process. Figure 6.1 shows a logical network diagram of a small part of Nearest Integer's network.

Figure 6.1. Logical diagram of Nearest Integer Bank's Web server and customer service networks.


The network portion depicted in the diagram includes four separate networks:

  • The online services network (192.168.1.0/24)

  • The customer service network (192.168.2.0/24)

  • The New York database center (192.168.3.0/24)

  • The San Francisco database center (192.168.4.0/24)

This portion of the network also includes two T1 circuits (1.544Mbps) and one DS3 circuit (45Mbps), each of which must be monitored. When monitoring a network circuit, two useful metrics to look at are the link status and bandwidth. You can monitor these metrics either at a router or firewall's console, or remotely via SNMP. Network errors are also important to watch, using the tools in the following sections. Network errors are discussed in more detail in Chapter 11.

Simple Network Management Protocol

SNMP stands for Simple Network Management Protocol. SNMP was designed to provide a framework for querying and managing network devices. Although version 1 of the protocol (SNMPv1) is shunned by many people for being insecure and many consider SNMPv2 too complex, SNMP is still in wide use today, especially for querying routers and firewalls. Router and firewall queries aren't the only use for SNMP, however; it can also be used to query servers for statistics like disk usage and CPU load. Check your vendor's documentation for details.
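For example, a server running the UCD-SNMP agent exposes its load average through the UCD MIB; a query might look something like the following sketch (the hostname, community string, and agent are assumptions, so consult your own agent's MIB documentation):

bash$ snmpget server1 public enterprises.ucdavis.laTable.laEntry.laLoad.1
enterprises.ucdavis.laTable.laEntry.laLoad.1 = "0.24"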


Monitoring Link Status

The link status of a network circuit like a T1 or T3 simply indicates whether a router (or another network device) considers the circuit up or down. If a circuit is down, no network traffic can flow across it and many services are likely to be affected. The easiest way to determine the link status of a circuit is to ask the router itself. For example, you can log in to a Cisco router and use the sh int (show interface) command to view the status of a T1 on one of the router's interfaces, as follows:

router>sh int Serial0
Serial0 is up, line protocol is up

In this output, Serial0 is up means that the actual circuit attached to the interface is up and running; line protocol is up indicates that the protocol responsible for sending data across that circuit is running normally.

Use SNMP to Remotely Query Link Status

It is often more convenient to query link status remotely than to log in to the router directly. You can query via SNMP with the snmpget command, which would look something like this:

bash$ snmpget router public interfaces.ifTable.ifEntry.ifOperStatus.2
interfaces.ifTable.ifEntry.ifOperStatus.2 = up(1)

As you can see, SNMP can be quite confusing, but a good network monitoring program will take care of these details for you. Some of these tools will be discussed later in this chapter.


Another way to test link status is to ping the router on one side of a link from the router on the other side. This is also a great way to test the availability of any device on your network. The following script pings the server apple from another server and sends an email if it fails (this usage of ping is specific to Solaris):

#!/bin/sh
# Ping the server "apple"; if it does not answer, email an alert.
# (On Solaris, "ping host" exits nonzero when the host is unreachable.)
if ! ping apple; then
    echo "WARNING: apple ping test failed" | mailx -s apple root@example.com
fi



Beware of Low-Priority Pings

Using the ping command to test link status can produce misleading results. Many sites choose to reduce the priority of ICMP packets (the type of network traffic used by ping) to allow more important application-related traffic like HTTP to pass through first. During periods of heavy traffic, a router may drop ICMP packets in favor of the higher priority application traffic. As a result, a user might believe the network is experiencing problems when it is just the router doing its job.


Monitoring Network Bandwidth

Bandwidth refers to the maximum speed of a communications line, or how much data can travel from one side of a network link to the other in a specified amount of time. Higher bandwidth means that more bits can travel across a link, and more people can use it simultaneously without negatively impacting the performance of the connection.

Monitoring the amount of traffic that moves across network links is critical; it is very easy for one particular circuit to become congested due to a single transfer of a large amount of data or a large number of connections running across it. Again, you can ask a router about bandwidth utilization for each interface, as follows:

router>sh int Serial1
Serial1 is up, line protocol is up
[some output removed]
MTU 1500 bytes, BW 1544 Kbit, DLY 20000 usec, rely 255/255, load 1/255
[some output removed]
5 minute input rate 71000 bits/sec, 11 packets/sec
5 minute output rate 4000 bits/sec, 8 packets/sec

The bandwidth for this link is 1,544Kbps (1.544Mbps), which is the speed of a full T1. Because this is a full-duplex connection, it can handle 1.544Mbps in both directions (input and output) simultaneously. The input rate over the last 5 minutes is 71Kbps, or 4.6% of the available bandwidth. The output rate is even lower, so this link is doing just fine. Now, consider a link that is a bit more congested, as follows:

router>sh int Serial0
Serial0 is up, line protocol is up
[some output removed]
MTU 1500 bytes, BW 1544 Kbit, DLY 20000 usec, rely 255/255, load 3/255
[some output removed]
5 minute input rate 1386000 bits/sec, 133 packets/sec
5 minute output rate 24000 bits/sec, 67 packets/sec

In this case, 90% of the bandwidth is utilized, leaving only 10%, or about 155Kbps, of bandwidth for other applications (about three 56Kbps modems' worth). This situation might seem like a cause for panic, but remember—network applications can use a tremendous amount of network resources. A 5-minute long file transfer over that line could account for the entire bandwidth use shown in this test; in that case, bandwidth use would return to normal levels when the transfer finished.

Such network bandwidth usage “spikes” are normal, and as a system administrator, you should expect to see them periodically. However, if this kind of bandwidth use continued over a period of time, it definitely would be a cause for concern, and may indicate one or more of the following problems:

  • Misbehaving application. Configuration mistakes might cause an application to send a flurry of network traffic to other servers.

  • Denial-of-service (DoS) attack. A person outside of your organization may try to disable your network services by flooding them with an excess of traffic. Chapter 13, “Implementing System Security,” discusses denial-of-service attacks in more detail.

  • Inappropriate use of network resources. An employee could be downloading large files like MP3s or listening to streaming audio, which can take up a significant portion of an organization's bandwidth.

  • Inadequate bandwidth for normal operations. You may not have enough bandwidth for all of the network services you run. You can determine bandwidth needs with tools like tcpdump and ntop that can help you profile your traffic. These tools are described in the “Monitoring Traffic Contents” section of this chapter.

A Fractional T1 Is Scalable

If your bandwidth requirements are unknown, start small and order a fractional T1, at 512Kbps or 768Kbps, for example. If you need more bandwidth in the future, you won't have to order a new T1 and incur the costs and waiting associated with the activation process—your provider can usually increase your capacity while keeping the same circuit.


One of the best and most widely used tools for recording bandwidth utilization is MRTG (Multi Router Traffic Grapher), which can be downloaded from http://www.mrtg.org. You can run MRTG periodically from cron on a Unix server to query routers, switches, and other network devices for statistics via SNMP (see Chapter 12, “Process Automation,” for more information on automating tasks with cron). MRTG stores these statistics and creates daily, weekly, monthly, and yearly graphs of the data, which is extremely useful to show trends over time.
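As a sketch, the following crontab entry would collect samples every five minutes; the paths under /usr/local/mrtg are assumptions, so adjust them to match your installation:

# Run MRTG every five minutes to poll devices and update its graphs
0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/local/mrtg/bin/mrtg /usr/local/mrtg/mrtg.cfg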

You Can Use MRTG to Monitor Other Network Statistics

In this section, we use MRTG to graph network bandwidth over time, but it is not limited to measuring bandwidth. Any statistic available through an SNMP query can be graphed with MRTG, such as CPU usage, disk capacity, and network errors.


Figure 6.2 shows an MRTG-generated daily graph for a company's primary Internet connection over a T1. Note the trend: a slow buildup of traffic in the morning as employees arrive, and a slow decrease in traffic in the late afternoon as people head home for the night. You would expect this pattern to repeat itself day after day, save for weekends when nobody is at the office. Daily graphs tend to be “spiked,” in that short surges in network usage can show up as spikes on a short-term graph such as this daily example. In this particular case, samples were taken every five minutes, so even short surges in bandwidth utilization appear as spikes on the graph.

Figure 6.2. A daily MRTG graph of a company's T1 circuit to the Internet. Note the usage trend as the workday progresses from right to left.


In the MRTG-generated yearly graph shown in Figure 6.3, each data point represents the average bandwidth usage during a 24-hour period. Such yearly graphs demonstrate the following three major usage metrics:

  • Average peak bandwidth over time. This company averages about 15KBps (or 120Kbps), which is well under the capacity of the link.

  • Short-term periods of particularly high usage. The example graph shows that the company's servers experienced abnormally high usage in May. This particular company's Web site went live during that month and experienced a surge in usage that is reflected on the graph.

  • Seasonal usage. The example graph illustrates abnormally high usage in December, followed by a dramatic drop and continuing low usage for several days at a time (look closely and you can see wider-than-normal gaps in the plot).

Figure 6.3. A yearly MRTG graph of a company's T1 circuit to the Internet. This type of graph makes it easy to discern average usage trends over time. Note the abnormally high usage in May.


Each graph is useful for establishing bandwidth utilization trends. After a few weeks of watching this data, you should have a good idea about how your network is normally used, which makes it a cinch to spot serious abnormalities.

Monitoring Traffic Contents

After you have determined there is a problem on your network, you may need to look at the actual traffic flowing through it to determine what that problem is and what is causing it. This is called sniffing the network, and the most popular tool for doing this is tcpdump (http://www.tcpdump.org). tcpdump listens to a network interface and outputs the traffic that you specify with filters on the command line.

The typical usage of tcpdump is as follows:

tcpdump [-i interface] [-s snaplen] [-a] [filter] ...

The interface parameter is the name of the network interface you want to sniff (if you want to limit the sniffing to a single interface). Normally, tcpdump dumps every packet that comes across a network interface, providing a window into what is flowing across your network. Filters, however, tell tcpdump to display only the traffic you are interested in. The following filter expressions are the most useful:

host hostname    match traffic to or from the given hostname or IP address
port number      match traffic on the given port number
tcp              TCP (Transmission Control Protocol) traffic
udp              UDP (User Datagram Protocol) traffic
icmp             ICMP traffic (ping)
src              apply the next filter to the source address only
dst              apply the next filter to the destination address only

Filters Look in Both Directions

Filters such as host and port apply to traffic flowing in both directions—tcpdump does not care whether the filter matches the source or destination address. Use the src and dst expressions immediately before a host or port filter to make them apply to the source and destination addresses, respectively.


To display network traffic involving the host apple.example.com, type the following:

# tcpdump host apple.example.com
Kernel filter, protocol ALL, datagram packet socket
tcpdump: listening on all devices
12:14:41.423382 eth0 > apple.example.com.ssh > 192.168.0.100.3662: P 412688691
:412688799(108) ack 2432159415 win 5840 (DF) [tos 0x10]
12:14:41.424291 eth0 > apple.example.com.32770 > orange.example.com.domain
: 39513+ PTR? 100.0.168.192.in-addr.arpa. (44) (DF)
12:14:41.443721 eth0 < orange.example.com.domain > apple.example.com.32770
: 39513 NXDomain 0/1/0 (121) (DF)



You can use and and or to combine filters. In the preceding example, you could limit the output to traffic flowing to or from apple.example.com on port 22 (SSH) as follows:

# tcpdump host apple.example.com and port 22
Kernel filter, protocol ALL, datagram packet socket
tcpdump: listening on all devices
12:21:03.343382 eth0 > apple.example.com.ssh > 192.168.0.100.3662: P 412693775
:412693883(108) ack 2432160235 win 5840 (DF) [tos 0x10]
12:21:03.363366 eth0 > apple.example.com.ssh > 192.168.0.100.3662: P 108:264(1
56) ack 1 win 5840 (DF) [tos 0x10]
12:21:03.363606 eth0 < 192.168.0.100.3662 > apple.example.com.ssh: . 1:1(0) ac
k 264 win 63388 (DF)



tcpdump on Switched Networks

If you have a switched network (using a switch instead of a hub), you see only network traffic flowing to and from the server on which tcpdump is run. A shared network configuration using a hub causes tcpdump to show traffic from all devices on the hub.


If you are interested in the actual contents of the packets, you can use the -a flag to print out each packet in ASCII format. Note that the -a flag does not just print the data inside of the packet—it also prints the data that actually makes up the network transport protocols. Nevertheless, you should be able to make out any data that is not encrypted. You also must specify the snaplen, which is the maximum number of bytes that tcpdump is to take from each packet. The default is 68, which is not enough to show you any relevant data beyond the protocols themselves. In most cases, you can safely set snaplen anywhere from 1024 to 8192.
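Putting these flags together, a capture of unencrypted Web traffic might look something like the following sketch (note that newer tcpdump releases spell the ASCII flag -A rather than -a):

# tcpdump -i eth0 -s 1024 -a host apple.example.com and port 80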

Service Monitoring

During the Internet boom in the late 1990s, an increasing number of companies began offering clients services such as email, dial-up services, and news feeds. In this environment, no system administrator could be content with determining that the company's servers were up and running; the more important question to ask was if the company's services were up and running. As a system administrator today, your system monitoring routine must include service monitoring.

The most significant benefit of service monitoring is its proactive nature; it probes for potential failures, rather than waiting for them to happen. Naturally, some services fail without warning, so some service monitoring is reactive. But in optimum circumstances, service monitoring alerts you to problems before they become noticeable to your customers.

Service monitoring can reveal even the smallest problems in a large Unix system; a glitch in a minor dependency, for example, can cause the service monitoring test to fail. Service monitoring processes can be made even more powerful by implementing timeouts, which enable you to test how effectively your service is performing. The following sections discuss service monitoring techniques, some common services that can be monitored, what to look out for in the results, and how to tune each type of monitor. These tips should be applicable to any service monitoring software you are using.

Monitoring Port Connections

A very basic method for service monitoring is to make sure the monitoring tool can connect to the network port the service is listening on. If the monitoring tool can establish a connection, then the service must be running. Common services and their ports are listed in Table 6.1, and a minimal connection check is sketched after the table.

Table 6.1. Common Network Services and Their Port Numbers
Service      Transport  Port
FTP          TCP        21
SSH          TCP        22
Telnet       TCP        23
SMTP         TCP        25
DNS          UDP, TCP   53
Finger       TCP        79
HTTP         TCP        80
POP          TCP        110
NNTP (news)  TCP        119
HTTPS        TCP        443
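As a sketch of such a connection check, the following script assumes the netcat utility (nc) is installed; the alert address is a placeholder:

#!/bin/sh
# Usage: portcheck host port
# -z asks nc to connect without sending any data; -w 5 sets a 5-second timeout.
host=$1
port=$2
if ! nc -z -w 5 "$host" "$port" >/dev/null 2>&1; then
    echo "WARNING: cannot connect to $host port $port" | mailx -s "port check failed" root@example.com
fi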

A Comprehensive List of Services

You can find the list of network services your server knows about in /etc/services, which maps service names to ports and protocols, much like Table 6.1. The official list of Internet services is available from RFC 1340, Assigned Numbers, at http://www.faqs.org/rfcs/rfc1340.html.


An ability to establish a connection does indicate that something is listening on that port, but it doesn't guarantee that the service you've connected to is running correctly. The best service monitoring tools test services by subjecting them to use that simulates that of a real user. These types of monitoring tools attempt to send and retrieve mail from a mail server, dial out via modems and authenticate to your network, and post articles to newsgroups. In each case, if the service can't perform the attempted action, the service fails the test. This kind of testing establishes a service's usability, as opposed to its availability.

The interval at which you choose to monitor your services is just as important as the monitoring itself. This interval should be based on your organization's tolerance for downtime. For example, if you monitor a service once every 30 minutes, it will take an average of 15 minutes and potentially as long as 30 minutes for you to notice any failures. If 30 minutes of downtime is acceptable to your organization, this interval would make sense, but if it requires a response within 5 minutes of any outage, you need to monitor every 5 minutes. You can read more about service uptime and choosing monitoring intervals in Chapter 8, “Service Outages.”

Monitoring the POP (Post Office Protocol) Internet Service

System administrators commonly monitor the POP, or Post Office Protocol, Internet service. POP is used to access remote mailboxes; you probably use it every day without thinking about it. The POP monitoring program has to interact with the service just like a real mail client, so it needs to speak POP. Here's what a typical POP session looks like (connect, log in, and list the number and sizes of new messages):

SERVER   +OK QPOP (version 3.0.1b1) at mail.
CLIENT user test
SERVER +OK Password required for test.
CLIENT pass foobar
SERVER +OK test has 1 visible message (0 hidden) in 703 octets.
CLIENT list
SERVER +OK 1 visible messages (703 octets)
1 703
.
CLIENT quit

You can try this yourself by using Telnet to connect to port 110 of any POP-capable mail server. POP is a fairly simple protocol, as you can see from this example, so it's actually quite easy to write a service monitor for it. In its simplest form, the monitor just has to look for a response starting with +OK after each step, in which case the test succeeds; the service is functioning normally. Otherwise, some part of the service has failed, and an administrator should be notified. Here's an example of a failed login; perhaps the authentication mechanism is broken:

SERVER   +OK QPOP (version 3.0.1b1) at mail.
CLIENT user test
SERVER +OK Password required for test.
CLIENT pass foobar
SERVER -ERR [AUTH] Password supplied for "jhorwitz" is incorrect.
SERVER +OK Pop server at mail.laserconnect.net signing off.

In this example, you can see that -ERR indicates that the last request resulted in an error; when the error happens during the login process, the client is automatically disconnected. The diagnostics provided by the POP protocol make it a cinch for anyone with C or shell programming skills to write a monitor for it.
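As an illustration, here is a rough shell sketch of such a monitor; it assumes nc is available and that a dedicated monitoring mailbox exists (the server name, account, and password are placeholders):

#!/bin/sh
# Log in to the POP server as a test user and scan the replies for errors.
# The commands are sent blind; a real monitor would read each reply first.
SERVER=mail.example.com
OUTPUT=`printf 'user test\r\npass foobar\r\nquit\r\n' | nc -w 10 $SERVER 110`
if echo "$OUTPUT" | grep '^-ERR' >/dev/null; then
    echo "WARNING: POP test against $SERVER failed" | mailx -s "POP monitor" root@example.com
fi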

Monitoring the Domain Name Service

The Domain Name Service (DNS) is another common network service, without which the Internet wouldn't function as it does today. Unlike POP, DNS provides many functions, including name resolution, address caching, request forwarding, change notifications, and zone transfers. DNS conducts most communication using UDP, and in an unreadable binary format; therefore, DNS doesn't give us a simple, readable protocol to work with.

In addition to being a more complex service than POP, DNS also provides numerous features to monitor. Perhaps the most useful of these features is also one of DNS's most widely used: name resolution. To test name resolution, you must determine whether the name server can successfully resolve a foreign name into an IP address (or vice versa); if it can, then the name server is working correctly.

All of the monitoring software discussed later in this chapter will monitor name servers, but you can also try this test on your own before configuring the software. To perform the test, you must first decide which foreign address to query. Don't choose one that is going to disappear anytime soon; choose a well-established, highly available site.

Choose an Established Foreign Address for Reliable Results

When monitoring DNS name resolution performance, choose a well-established foreign address—you are depending on a third party for your monitoring, and if that third party's DNS servers are unreliable, your monitoring results are just as unreliable.


One example of a well-established, reliable foreign address is that of Yahoo!. To run the test, query for www.yahoo.com. When you run the query manually, you see the following result (as of the time of this writing):

bash$ nslookup www.yahoo.com
Server: stiffness.it.targetrx.com
Address: 192.168.1.23
Non-authoritative answer:
Name: www.yahoo.akadns.net
Addresses: 64.58.76.229, 64.58.76.176, 64.58.76.177, 64.58.76.178
64.58.76.179, 64.58.76.222, 64.58.76.223, 64.58.76.224, 64.58.76.225
64.58.76.226, 64.58.76.227, 64.58.76.228
Aliases: www.yahoo.com



This lists a lot of different addresses, but you are only concerned that you get a valid response from your name server—if it responds with any results, you assume that the name server is running correctly. However, if you actually compare the results to a known list of addresses, you run the risk of Yahoo! changing addresses behind your back at any time, triggering false alarms.
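A scripted version of this test might look like the following sketch, which uses dig (distributed with BIND) and treats any answer as success; the name server address is a placeholder:

#!/bin/sh
# Ask a specific name server to resolve a well-established foreign name.
# dig +short prints only the answer records; empty output means no answer.
NAMESERVER=192.168.1.23
if [ -z "`dig @$NAMESERVER www.yahoo.com +short`" ]; then
    echo "WARNING: DNS resolution test failed on $NAMESERVER" | mailx -s "DNS monitor" root@example.com
fi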

Let Third Parties Know That You Are Monitoring Them

If you decide to use third-party data in your monitoring processes, as in the Yahoo! DNS example in this section, let the third party know that you are using their data. They may be able to alert you to changes ahead of time. This is most effective with organizations that you already have relationships with, such as your ISP or a parent company.


When indirectly monitoring third-party services as in the previous example, you also run the risk of experiencing false alarms as a result of the services becoming unavailable. This is an acceptable risk; if it happens more than a few times, have your monitor query a different address or give it a series of addresses in a loop to avoid a single point of failure. You can read more about these third-party outages in Chapter 8.

Timeouts and Repeats

The monitoring processes you've learned about to this point in the chapter have tested services and passed or failed those services based on their responses to the tests. What if the services take 2 hours to respond, or what if they don't respond at all? Your monitoring tools should have the ability to time out after a designated interval, at which point the test considers the service to be down.

There are no established rules for timeout values; the length of the timeout interval is highly dependent on the nature of each service. DNS, which drives the name-to-address translation that makes the Internet more user friendly, needs to be very responsive, so a timeout of 15 seconds is reasonable. SMTP, on the other hand, isn't as time-sensitive as DNS, and is also more prone to sluggishness due to server load. An additional 20 seconds over the usual time required to send an email is unimportant, so perhaps a timeout of 30 seconds is more appropriate for an SMTP service check.

Occasionally, a timeout failure is a fluke; nearly every Internet user has clicked on a link in a browser only to wait two minutes for a DNS query to complete. These timeouts can be caused by network congestion or a quick burst of server activity—pinning down the cause can be difficult. In any case, most services shouldn't be marked down because of a single timeout; a second service test shortly after the first may yield a successful result. If a service fails the second test, however, the service should probably be marked down. Not all monitoring systems offer repeat testing, but if yours does, you should definitely take advantage of it.
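If your monitoring software lacks repeat testing, you can approximate it with a wrapper like this sketch, where check_service stands in for a hypothetical test command that exits nonzero on failure or timeout:

#!/bin/sh
# Run the check twice, 30 seconds apart; alert only on two straight failures.
if ! check_service; then
    sleep 30
    if ! check_service; then
        echo "WARNING: service failed two consecutive checks" | mailx -s "service down" root@example.com
    fi
fi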

Logging

System logs are a system administrator's best friend. They tell the story of each system, service, and network in your infrastructure. There are many types of logs, and each operating system and application has a different way of logging information. The most common manifestation of logs in Unix is the log file. This is a simple text file to which application or system log messages are appended. Most well-constructed logs contain the date and time each message was generated, the application that generated it, and a detailed description of the event that occurred.

Many logging techniques and log files are common to all Unix systems. The following sections discuss the use of these techniques and log files in monitoring your systems.

syslog

syslog is the de facto Unix logging mechanism. It first appeared in BSD 4.2 in the early 1980s, and has been a part of every subsequent Unix release. Like many things in Unix, syslog is a client-server architecture; a client submits logs to a syslog server (the server can be located on the same machine), which can record the logs or notify administrators in a variety of ways.

Configuration

All logs are not equal in syslog; each log message is assigned a facility and a level, collectively called the log's priority. The facility identifies the type of event the log is referencing, such as authentication information, email diagnostics, messages from the kernel, or a custom application message. Unfortunately, system administrators can't just make up a facility out of the blue; Unix has already allocated several facilities, the most common of which are listed in Table 6.2. Examples of programs that use the facility for logging are provided where appropriate.

Table 6.2. Common Predefined syslog Facility Names and the Subsystems That Use Them
Facility          Used By
user              user processes (the default facility)
kern              the Unix kernel
mail              mail applications (sendmail)
daemon            system daemons (named, ftpd)
auth or authpriv  authentication system (login, su)
cron              cron and at subsystems
local0-7          reserved for local use
*                 wildcard for any facility

Facilities are broken down even further into levels. A log message's level is determined by its urgency, so you can use levels to differentiate urgent messages from informational ones. A message regarding a Telnet connection, for example, might be considered informational, but a message reporting a disk failure would be critical. Table 6.3 lists the predefined syslog levels.

Table 6.3. syslog Levels Listed by Decreasing Severity and Their Descriptions
Level    Description
emerg    panic situations
alert    actions that need immediate attention
crit     critical conditions
err      any other errors
warning  warning messages
notice   non-error conditions
info     informational messages
debug    debugging messages
none     do not log this facility

In order to receive logs, a machine must be running a syslog server, usually called syslogd. In most standard implementations, syslogd is configured by editing the file /etc/syslog.conf, which consists of pairs of selectors and actions. A selector is simply the text representation of a priority in syslog.conf; it is formed by joining the facility and level with a period between them. A log with a facility of auth and a level of info would have the selector auth.info. The selector is used to “select” which action to take for any particular message.

Examine Your Own syslog Configuration

syslog is a standard part of every Unix system; one of the first things you should do after loading a system is examine syslog.conf to find out where your logs are going.


The Unix kernel and most applications that log via syslog use a special logging device (for example, /dev/log) or the standard C function syslog(). Shell scripts cannot easily take advantage of these logging mechanisms, so you should use the logger command. logger allows you to specify a syslog priority with the -p flag; it accepts a message from the command line or standard input that is then routed through syslog to the appropriate destination. The basic usage of the logger command is as follows:

logger [-p priority] [-f file] [-t tag] message ...

You can log from a file using the -f flag, and you can also specify a tag to identify the program sending the log using the -t flag. The following example logs the message, “Shutting down web server,” from the machine server1 with a priority of daemon.info:

bash$ logger -p daemon.info "Shutting down web server"

This command produces the following log message:

May  5 15:57:14 server1 jeff: Shutting down web server

A server can do several things when it receives a log, based on the selector specified in syslog.conf. The syslog actions are listed in Table 6.4.

Table 6.4. syslog Actions and Their Configuration Formats
Action                        Format
append log to file            filename
broadcast to some users       username1, username2
broadcast to all users        *
send to remote syslog server  @server

Multiple selectors can be specified on the same line with a single action by separating the selectors with semicolons (System V) or commas (BSD). The default syslog configuration for Solaris 8 looks like this:

# Selector                                      Action
*.err;kern.notice;auth.notice                   /dev/sysmsg
*.err;kern.debug;daemon.notice;mail.crit        /var/adm/messages
*.alert;kern.err;daemon.err                     operator
*.alert                                         root
*.emerg                                         *

Implementing Multiple Actions

You can apply multiple actions to a single selector by adding multiple lines to syslog.conf with the same selector and its additional actions, as follows:

auth.info                                       /var/log/auth
auth.info                                       @logserver.example.com


Output

syslog outputs log messages in a standard format, though it is often augmented with operating-system-specific fields. The format is as follows:

timestamp hostname program[PID]: message

The timestamp is the date and time the message was logged. It is set to the time of the machine that generated the log. The hostname is the name of the machine that generated the log, program and PID are the name and process ID of the process that generated the log, and the message is the actual text of the log. An example of a real syslog message follows:

Jan  9 20:37:03 servo su: 'su root' failed for jhorwitz on /dev/pts/1

The logs are appended to each log file in the order that they were received. The format of these logs makes them very easy to search through using standard Unix tools such as grep and tail. More advanced log-monitoring software and even shell scripts can also take advantage of their simple layout, making log files the most universal mechanism for keeping track of activities on a Unix system.

Tailing a Log File

Tailing a log is a method by which you can view the end of a particular log file and every new message that is appended to it from that point on. tail -f is used to do this; for example, tail -f messages will tail the messages log file.


Remote Logging

The syslog server is not required to run on the same machine as the syslog clients. Processes that run on one machine can submit logs that are routed to a separate log server. This arrangement, known as remote logging, is useful for several reasons. First, because a log server is a central repository for all system logs, remote logging enables you to examine the results from several servers in a single log on a single server.

Improved log integrity is another benefit of remote logging. When hackers break into systems, one of the first things they do to cover their tracks is remove all logs that could provide evidence of intrusion. When logs live on the same server, it's easy for a hacker to alter them as he or she sees fit. When logs live on a remote server, records of a hacker's activity are sent to that server, far away from the compromised server. At that point, even though syslog can be disabled by the hacker, the damage is already done: the logs have been sent off to the remote log server, and you can see exactly what happened. This alone should be reason enough to implement a central log server.

Isolate Your Log Server

If you do not isolate your log server to prevent third parties from logging in, it could suffer the same fate as other systems that are broken into, and hackers could potentially remove valuable logs. You can use a firewall or TCP Wrappers, as explained in Chapter 13, to limit the addresses that are allowed to connect to the log server.


The stock syslog on Unix systems will happily accept logs from remote hosts without any changes. All you need to do is tell the remote hosts where to send their logs, which is done with the @server action. The following syslog configuration sends all auth.info logs to the server lumberjack:

auth.info                                       @lumberjack

syslog-ng

syslog was a wonderful addition to Unix in the early days, but it has its limitations. The most glaring limitation is that the only thing that differentiates one syslog log from another is the priority. If two logs have the same priority, there is no way to perform different actions on each log; syslog considers them identical from a configuration point of view.

This limitation can be a real problem during log analysis; Unix systems can be quite verbose in their logging, and the signal-to-noise ratio is often lower than you'd like it to be. In addition, many programs hard-code their syslog facility; administrators can't change the syslog facility to suit their needs. If an FTP server is logging authentications with a facility of daemon, but you want it logged with a facility of auth, you may be out of luck.

Enter syslog-ng in the late 1990s. syslog-ng was designed to overcome the limitations in syslog, and it's quickly gaining popularity in the Unix community. syslog-ng provides a much more granular filtering mechanism than the common everyday syslog, making it a perfect fit for a central log server. For example, you can send all logs coming from a host whose name contains the word “firewall” to a file called firewall. You can send all logs containing the word “login” to a file called auth. syslog-ng even supports a limited set of regular expressions for pattern matching. Of course, you can still filter logs based on their priorities.

The configuration of syslog-ng is quite interesting. In its single configuration file, there are five major types of configuration directives: options, sources, filters, destinations, and log paths. These directives tell syslog-ng where to find log messages and how to route them to their final destinations. The following sections discuss these directives in detail.

Options

Options are syslog-ng's global configuration options. You can use options to set the default permissions, owner, and group of log files, have syslog-ng automatically create directories for log files, and set the behavior of timestamps (use the remote timestamp or the time the log was received). Several other options are also available; you can learn about them by visiting the syslog-ng home page at http://www.balabit.hu/en/downloads/syslog-ng.

A typical options directive looks like this:

options {
    chain_hostnames(no);
    use_time_recvd(yes);
    create_dirs(yes);
    owner(root);
    group(systems);
    dir_perm(0750);
};

Sources

Sources define all of the mechanisms through which syslog-ng can receive a log. If your system sends logging information through the device /dev/log and through UDP network connections on port 514, you would define those mechanisms as sources. A source directive for a central log server running Solaris 8 might look like this:

source src { sun-streams("/dev/log" door("/etc/.syslog_door")); internal();
udp(port(514)); };

Filters

Filters are where most of the dirty work in syslog-ng is done. Filters specify a named group of logs that share common patterns. For example, you might create a filter called f_firewall (the f_ prefix is for clarity, and is not required) that represents all logs originating from machines whose hostnames contain the word “firewall.” Boolean logic makes this system even more powerful. You could also create one called f_mail for logs that have a facility of “mail” or come from the program sendmail. Add to this wildcards and a limited set of regular expressions, and you have an infinitely powerful log filtering system.

Here are some examples of syslog-ng filters:

# filter based on facility
filter f_auth { facility(auth); };
filter f_kern { facility(kern); };
filter f_daemon { facility(daemon); };
filter f_cron { facility(cron); };
# filter based on program name and facility
filter f_mail { ( program("sendmail") or facility(mail) ); };
# filter based on hostname
filter f_firewall { host("pix*"); };

Destinations

Destinations specify the actions that can be performed on log messages. syslog-ng comes with several predefined destination “drivers” that provide various functionality. These destination drivers are listed in Table 6.5 (adapted from syslog-ng documentation).

Table 6.5. Destination Drivers in syslog-ng Provide Many Options When Deciding the Fate of a Log Message
Driver       Description
file         output to a file
fifo, pipe   output to a named pipe
unix-stream  output to a Unix socket (stream)
unix-dgram   output to a Unix socket (datagram)
udp          output to a UDP port on another machine
tcp          output to a TCP port on another machine
usertty      output to a user's terminal
program      pipe log to standard input of a program

Drivers also support a limited set of variables. There are date variables that can be used to name log files based on the date the log was received, there are hostname variables that contain the name of the originating machine, and so on.

One of the most useful implementations of these variables is to have syslog-ng put log files into automatically created directories named after the current year and month. Using these directories is like having an automatic log archiving system; here's how you do it:

destination mail { file("/var/log/syslog-ng/$YEAR$MONTH/mail" group(wheel)
perm(0640) ); };
destination cron { file("/var/log/syslog-ng/$YEAR$MONTH/cron" group(wheel)
perm(0640) ); };
destination auth { file("/var/log/syslog-ng/$YEAR$MONTH/auth" group(wheel)
perm(0640) ); };
destination kern { file("/var/log/syslog-ng/$YEAR$MONTH/kern" group(wheel)
perm(0640) ); };
destination daemon { file("/var/log/syslog-ng/$YEAR$MONTH/daemon" group(wheel)
perm(0640) ); };
destination firewall { file("/var/log/syslog-ng/$YEAR$MONTH/firewall" group(wheel)
perm(0640) ); };



Log Paths

Log paths bring the entire configuration together, linking sources, filters, and destinations into one cohesive logging system. Each log path specifies sources from which syslog-ng should obtain log messages, filters to categorize the messages, and destinations to specify what action to take for each filtered message. Writing these directives is relatively simple, as follows:

log { source(src); filter(f_mail); destination(mail); };
log { source(src); filter(f_cron); destination(cron); };
log { source(src); filter(f_firewall); destination(firewall); };
log { source(src); filter(f_auth); destination(auth); };
log { source(src); filter(f_kern); destination(kern); };
log { source(src); filter(f_daemon); destination(daemon); };

Appropriate Use of syslog-ng

Installing syslog-ng is easy, but keep in mind that although it can be installed on all of your machines, it's most useful on a central log server. The stock syslog provided with your operating systems should be able to redirect all log messages to your new syslog-ng log server (see the previous section on syslog). The configuration to do this on Solaris follows:

# send all syslogs to syslog server
*.emerg;*.alert;*.crit;*.err;*.warning;*.notice;*.info;*.debug @syslog-server



Application Logs

Not all utilities and applications use syslog as a logging mechanism. In fact, the client-server aspect of syslog adds a lot of overhead to a system that produces massive amounts of logs, such as a Web server taking 500 hits per minute. In such a situation, it makes more sense to store the logs on the local file system. Application logs can grow out of control if not monitored properly. On a system with many logs, placing the logs under one common subdirectory, such as /var/log, lets you know exactly where all of the log files are located. Using a common subdirectory also simplifies log archiving. You learn more about log management in the next section of this chapter.

Log Management

Log management is a mundane daily task for all system administrators. Logs tell the story of your systems; you should consider them sacred and watch over them at all times. Properly managed logs do not take up too much space on a file system, are of a usable size, are archived for future reference, and provide an adequate view into the past activity of your systems. Many Unix systems provide tools that handle some of these tasks automatically, but a few tweaks here and there can make the job even easier.

Log management involves four major functions: location, file size, rotation, and archiving. This section describes each of these functions and how to implement and tune them on your own systems.

Location

The location of log files in a Unix file system is critical to the performance and scalability of the overall system. The most important aspect of a log file is that it grows, and sometimes that growth can get out of control. A system administrator's great fear is that log file growth will quickly fill up the file system where the file is located. This fear is justified; most (if not all) administrators have seen a single log file fill a file system to capacity.

The best way to mitigate this situation is to place your logs in their own dedicated file system. Putting logs in the / or /usr file systems is probably a bad idea, as those systems usually contain critical system files. Filling up / can have a disastrous effect on a running system, often requiring maintenance in single-user mode.

Luckily, the /var file system comes to the rescue. Named for the variable-length files contained within, /var is a perfect place to put log files. In fact, most Unix operating systems put system logs there by default. Linux uses /var/log, while Solaris makes use of a variety of subdirectories, including /var/log and /var/adm. When logs fill the /var partition, the log files (and anything else you might store in /var) are the only files affected.

Although you may lose logs when the /var partition fills up, logs are the only thing you'll lose. The operating system will keep running, and most applications that are already running will go about their normal business. However, /var is often a part of the / file system by default; you may have to force its creation during installation.

Create a Separate/var/log File System

If you are running an application on a server that makes heavy use of the /var file system, you may want to create a separate /var/log file system (separate from /var itself). This prevents overactive log files from interfering with your applications and vice versa. Sendmail is an example of such an application; its default message queue is located in /var/spool/mqueue.


Real-World Example: Logs Gone Wild

A company running a popular Java application server had just released a new version of their Web application. Two days later, the system administrators were paged about a file system being full on one of the servers. One of the administrators checked the Web site to make sure everything was still running, and it was. However, on one of the servers, the /var file system was filled to capacity with error logs from the application server. In fact, the log file had been growing at a rate of almost 1,000 logs per second! The administrator moved the offending log to another file system and restarted the server, which eliminated the problem temporarily. Through this whole ordeal, though, both the server and application were still running! Because no critical application data resided in the /var file system, the overflowing log files had no serious adverse effect on the application, and there was no outage.


File Size

Log file size affects the overall file system, but file size also has a direct impact on a log's usability. Would you rather search for information in a 2MB system log or a 500MB system log? Obviously, it's much easier to find information in a smaller log, but it's also useful to have as much information at your disposal as possible; searching through backup tapes for last week's logs may not be acceptable to you. These factors, including the growth rate of the logs in question, should all be taken into account when deciding what your tolerance for log file size should be.

Choosing Log File Size Limits

Choosing an appropriate size limit for log files can only be accomplished through experience. The size should be large enough to keep the amount of historical data you require, but small enough so you can search through and transfer the files without too much delay. You also should have enough room in your file system to keep multiple online copies of older log files (rotations). Watch your log files over time to see how each of them grows, and then you can make a decision as to the appropriate limit to place on each file.


Rotation

Log rotation helps keep active log information available while managing file size. You need older logs to be readily available so you can research recent problems and audit who is accessing your systems. Depending on the nature of your organization, you may even be legally required to keep logs for a certain period of time. You may eventually write these logs to long-term storage like CD-ROM or tape, but having them online makes them readily available for use.

Logs cannot live forever, though; some eventually become so large that they become unusable, and files can become so old that they're no longer of any use. Log rotation moves active logs to a separate location so they don't grow out of control or become so stale that their information is no longer relevant.

Most operating systems have built-in facilities for log rotation, but it's easy enough to write your own if the need arises. Red Hat Linux rotates most logs in /var/log on a weekly basis, renaming each file every week:

bash$ ls -l messages*
-rw------- 1 root root 95505 Jan 12 15:14 messages
-rw------- 1 root root 110162 Jan 6 04:02 messages.1
-rw------- 1 root root 91290 Dec 30 04:02 messages.2
-rw------- 1 root root 91736 Dec 23 04:02 messages.3
-rw------- 1 root root 121928 Dec 16 04:02 messages.4

This rotation scheme is customizable. Red Hat Linux uses a program called logrotate to automate the rotation of log files. You can find a detailed description of logrotate in Chapter 12.

OpenBSD employs a similar rotation scheme, but it also takes the liberty of compressing the files after rotating them. In addition to rotating files, you should also think about your log retention policy, or how long you want to keep each log on disk. As you can see from the preceding example, Red Hat Linux has a four-week log retention policy; after four weeks, rotated logs are permanently removed from the system.

Retention is all about available disk space; how much disk space are you willing to dedicate to old logs? Will the size of the old logs seriously interfere with the space available for new logs? Experiment and see what works for your systems.

Log Rotation Procedures

syslog opens log files once and keeps them open for the life of the process. If you move or rename the log files, they will still remain open and will be written to. The proper way to rotate these logs is to rename the logs and then restart the syslog daemon. Alternatively, you could stop the daemon, move the logs, and then restart it, but you may lose some logs during the process.
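A hand-rolled rotation following this procedure might look like the sketch below, which keeps four generations of the Solaris messages log; the pid file location /etc/syslog.pid is Solaris-specific, so adjust it for your system:

#!/bin/sh
# Rotate /var/adm/messages, keeping four old generations.
cd /var/adm || exit 1
rm -f messages.3
mv messages.2 messages.3 2>/dev/null
mv messages.1 messages.2 2>/dev/null
mv messages messages.1 2>/dev/null
touch messages
# Signal syslogd so it closes the renamed file and reopens "messages".
kill -HUP `cat /etc/syslog.pid`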


Archival Storage

When the oldest log in a rotation is deleted, it is gone forever. Logs are critical for problem diagnosis and performance monitoring, and may even have legal ramifications, so it is important to archive them to some kind of long-term storage. This storage is usually some sort of tape backup, but archiving to CD-ROM is becoming popular because of its longer shelf life.

Keep Archives Readily Available

No matter what medium you use to store the logs, it should be readily available and convenient to access, as the logs may be needed at a moment's notice. A CD-ROM or tape library can make older archives available at the touch of a button.


You should schedule your logs to be written to archival media at some point before the rotation process takes place so you have archived copies of files that are about to be removed from online storage. You can archive files manually with tools such as tar and ufsdump, schedule archiving with cron (see Chapter 12), or have your backup software manage it for you. Chapter 9, “Preparing for Disaster Recovery,” offers more information on backup software.

As a simple example, the following crontab entry writes all of the message logs to the Solaris tape device /dev/rmt/0n at 11:00 p.m. on Sunday nights (check your system documentation for the name of your system's tape device):

0 23 * * 0 ( cd /var/adm ; tar cf /dev/rmt/0n messages* )

Log Monitoring

Most people are content to passively monitor system logs; they might take a look at log files once a day just to make sure everything looks kosher. If something goes wrong, these administrators fix it after the fact. This approach just won't do for critical system errors such as a file system filling up or an operating system panic; you need to know about these things as soon as they happen.

Fortunately, there are many third-party log-monitoring packages available on the Internet. The following sections discuss one of those packages, logsurfer, which boasts many of the features you should be looking for in log-monitoring software. You also learn how these packages notify system administrators of errors that occur both during and after normal working hours.

Log-Monitoring Software

The first well-known log-monitoring software was swatch. Swatch worked very well as a basic monitoring program, but it had its limitations: it required Perl (a more significant problem in the early 1990s), it could work only with single-line log messages, and it had limited support for dynamic rules.

Enter logsurfer in 1996. Logsurfer overcame all of the limitations of swatch and added a few features of its own, as follows:

  • Context-sensitive analysis enables you to monitor collections of related logs as one unit.

  • Dynamic rules let you create new monitoring rules based on the content of previous messages.

  • Multiple actions can be taken on one log message.

  • Individual log messages must pass through two regular expressions to trigger an action: one that each message must match and, optionally, one that the message must not match.

  • The entire logsurfer program is written in C, with no third-party software requirements.

Logsurfer's configuration file is a set of rules. Rules determine the actions that are taken for logs that match the patterns you specify. For example, if the program /usr/local/bin/logmail were written to email its standard input to system administrators, the following rule would send all su failures to the administrators:

"'su root' failed" - - - 0
pipe "/usr/local/bin/logmail"

This rule tells logsurfer to look for the pattern 'su root' failed and pipe the log to the logmail program.
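
The logmail program itself is a hypothetical helper, but a minimal version of it (with a placeholder subject and address) could be as simple as this:

#!/bin/sh
# logmail: mail standard input to the system administrators.
# The subject line and address are placeholders.
exec mail -s "logsurfer alert" sysadmins@example.com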

The most significant feature that logsurfer offers is the addition of dynamic rules. Dynamic rules are created on the fly to handle certain situations as they happen. For example, in swatch, if the /var file system filled up on a server, your email, pager, and every other mechanism you had set up for notification would be full of thousands of messages like this:

Nov 26 18:43:03 server1 ufs: NOTICE: alloc: /var: file system full

You'd receive thousands of these messages because they would be generated by the kernel for each unsuccessful write to the already full file system. Swatch would receive the messages and act on each of them independently; if you were paged for file system full errors, your pager battery would soon need to be replaced.

Logsurfer can handle this situation much more gracefully. You could create a dynamic rule that ignores file system full errors for a particular server after the first message has been received, as follows:

"file system full" - - - 0 continue
pipe "/usr/local/bin/logmail"
"file system full" - - - 0
rule top "file system full" - - - 900 ignore

The first rule pipes the first message to logmail. The second rule creates a new rule that ignores the file system full pattern for 900 seconds (15 minutes). It does this by inserting the rule on top of all other rules (the top in rule top). Instead of getting thousands of notifications that the file system was full, you'd get one every 15 minutes, a more subtle reminder that your server needs you.

Notification

All the monitoring in the world won't do you any good if you're never notified about outages. There are various notification methods to choose from, and picking the right one depends on your needs and capabilities. The most important factor to consider when making this choice is reliability; emergencies can happen at any time, and your notification system needs to be able to get in touch with you at any time. Common notification options include pagers, cell phones, email, and instant messaging. The following sections describe using these options individually or in combination to provide good emergency notification for your system.

Pagers

Pagers are the quintessential system administrator accessory. Nearly every system administrator carries at least one pager. Pagers come in two types: numeric and text. Numeric pagers were designed to communicate phone numbers and, therefore, display only a limited series of numbers. Although codes such as “911” for emergencies and “215” for running late have made numeric pager communications more tolerable, numeric pagers usually aren't the right solution for a notification system.

Text pagers can receive text messages of several kilobytes and have larger displays that can show several lines of each message at a time. Messages can be sent to pagers by an operator at a dispatch center, through a modem, or through email. Many modern text pagers are two-way and can both receive and send messages. Text pagers are clearly the notification tool of choice for any system administrator.

Email and modem are the two most popular methods of sending pages. Although email is highly convenient and integrates well into most monitoring systems, it is also an in-band solution; your email infrastructure must be operational in order for you to send email messages. If your email servers are down, you can't receive a notification of the problem via email.

Out-of-band communication solves this problem. Using infrastructure that is outside your production network guarantees that pages will be sent out even if your system has lost all connectivity to the Internet. Modems fall into this out-of-band category, and most pager services today can provide a modem gateway to your pager. These services usually provide a public phone number through which anybody can send pages using TAP (Telocator Alphanumeric Protocol, also known as the IXO protocol). Using a modem attached to your monitoring server to send pages completely separates your production infrastructure from your notification mechanisms; you'll never be in the dark when something goes down.

Monitor Your Monitoring Server

It is important to keep watch over your monitoring server so it is always available to notify you in case of failures. Even a simple ping test from a separate server can alert you when your monitoring server crashes (this is covered in the “Network Monitoring” section of this chapter).


Several paging software packages are available for Unix, but by far the most versatile and cost-effective (free) package is Qpage (http://www.qpage.org). Although it's been almost three years since the last release, Qpage still enjoys widespread use, and it's still a good example of what to look for in paging software:

  • Support for TAP/IXO.

  • Client-server architecture; paging clients can send pages through multiple servers in a failover fashion, providing redundancy.

  • Pager groups facilitate sending pages to multiple users using one identifier.

  • Message queues store messages for periodic transmissions and during retries.
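
Assuming a running Qpage server and a pager alias defined in its configuration (the alias oncall here is hypothetical), a notification script can send a page with a single command:

bash$ qpage -p oncall "web1: HTTP service not responding"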

Cell Phones

Cell phones are becoming an increasingly popular notification mechanism, as prices for both service and phones decrease and wireless features increase. Conventionally, cell phones have been used as a secondary notification mechanism when pagers didn't work. Today, in addition to voice, cell phones can also support text messaging and even email and Web browsing, making them even more versatile than two-way text pagers.

Email

Some events are important enough to warrant notification, but you don't necessarily want to hear about them at three o'clock in the morning on your pager. Although emergencies like failed services, dead network circuits, and power outages warrant immediate notification, a failed su or a connection denied by a firewall doesn't fit into the emergency category.

Even if an event doesn't warrant emergency notification, you still need to know about it. Email is a convenient middle ground for notification; it lets you know what's going on without intruding into your life. You can check email on your schedule.

Instant Messaging

The Internet's first “killer app” was email. The second was the Web. The latest killer app is instant messaging (IM). Services like AOL Instant Messenger and ICQ enable two or more people on the Internet to converse in real time, exchange files, and play games. Instant messaging is also a great tool for system problem notifications, because it doesn't even require that a person be on either end of the “instant” communication.

Various command-line utilities and APIs exist that can be incorporated into notification scripts, providing real-time notification of problems. The disadvantage to this arrangement, of course, is that it relies on an outside party to provide the IM service; this limitation may not be acceptable from the standpoints of reliability and security. Hosting your own instant messaging service alleviates these problems. Jabber is the most mature of these services (http://www.jabber.org).

As an example, the following fragment of a Perl script uses the Net::AOLIM module to log into the AOL Instant Messenger service and send an instant message to an administrator who is also logged into the service:

#!/usr/bin/perl
use strict;
use Net::AOLIM;
# The screen names and password here are placeholders.
my $user = 'monitor';
my $password = 'secret';
my $admin_user = 'admin';
my $aim = Net::AOLIM->new(
    'username' => $user,
    'password' => $password);
$aim->signon() or die "AIM signon failed";
$aim->toc_send_im($admin_user, "The server is down!");

Combination Solutions

Sometimes more of a good thing is better. Many companies use two or more notification mechanisms to make sure that staff receive the messages they're meant to see. A common combination provides paging and email: when a log that is deemed an emergency comes through the system, both a page and an email are sent out. If the page doesn't make it (if, for example, the pager is out of range or the provider gives a busy signal), administrators still receive the email notification of the problem. The likelihood of both services being down at the same time is slim, so this combination effectively eliminates any single point of failure in your notification infrastructure. The only noticeable effect is the potential delay in receiving the message.
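
A minimal sketch of such a combined notifier, reusing the hypothetical oncall pager alias from the Qpage example and a placeholder email address:

#!/bin/sh
# notify: send an emergency message by both pager and email, so a
# failure of either path alone can't silence the alert.
MSG="$1"
echo "$MSG" | mail -s "ALERT: $MSG" oncall@example.com
qpage -p oncall "$MSG"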

Internal Versus External Monitoring

Most monitoring is done from the inside; an organization's production services are monitored from within its own network infrastructure. This setup has some inherent problems. Because it operates from inside your own network, for example, an internal monitoring system can't tell you when a problem in a third-party network is causing an outage for your outside users.

If you must monitor from within, you may want to consider provisioning a separate circuit through a different network provider (not the one providing your primary Internet connection) through which you can monitor your systems. This arrangement gives you the same view of your systems that an end user would have, and a more accurate picture of the real-world performance of your systems.

Another problem with internal monitoring is false positives. If any part of the network between your monitoring server and the production systems fails, the internal monitoring system generates alarms, just as if the services had gone down. This problem isn't monumental, since it's good to be notified about the network problem anyway. But when you're reporting uptime to clients with service level agreements (as you learn in Chapter 8), false positives can cause significant hassles.

Third-party monitoring companies that watch your service from outside of your network help to curb these problems. Sometimes your ISP can provide third-party monitoring for you, but there are other companies that provide this service as the primary focus of their business. Third-party monitoring companies will still have issues with false positives, but at least you know those problems lie outside of your network. If you choose to go this route, look for a monitoring service with multiple connections to the Internet, or even a dedicated T1 to your ISP, to help mitigate any connectivity problems.

Sounding Multiple Alarms

By using both internal and external monitoring services, you can add an extra level of urgency to notifications that you receive from both monitors. In this case, you can be fairly certain that the notification is not a false alarm because two separate servers taking two separate routes to your service saw the same error.

Monitoring Applications

This chapter has discussed at great length the various methodologies and tools used to monitor your services, but how do you put them all together into one system? Some people like to build their own systems, especially in academia. In the corporate world, however, there isn't always the time or management support to build your own service monitoring system. Fortunately, there are many quality products on the market (both commercial and free) that wrap everything together into one powerful package. The following sections look at a large-scale commercial monitoring package and two open-source products that take different approaches to monitoring.

Writing Your Own Monitoring Tools

Smaller organizations with very little infrastructure may consider writing their own monitoring tools. With moderate C, shell, or Perl programming skills, you can easily write simple scripts that ping servers or test network services. However, maintaining and debugging these scripts becomes your responsibility, and this can become a real burden as your organization grows and the number of servers and networks you administer increases by leaps and bounds. Research all of the available third-party monitoring tools in your price range to see if they meet your needs; if they don't, writing your own monitoring tools may be the right answer for you.
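
As a minimal sketch of such a homegrown tool (the host names and admin address are placeholders, and ping flags vary slightly between systems), a few lines of shell can watch a handful of servers:

#!/bin/sh
# pingwatch: mail the administrator about any host that fails a ping.
# -c 1 sends a single probe on Linux and BSD; Solaris ping differs.
HOSTS="server1 server2 server3"
ADMIN="admin@example.com"
for host in $HOSTS; do
    if ! ping -c 1 "$host" > /dev/null 2>&1; then
        echo "$host is not responding to ping" | \
            mail -s "ALERT: $host down" "$ADMIN"
    fi
done

Run from cron every few minutes, this is the beginning of a workable, if primitive, monitor.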


Micromuse Netcool®

Netcool® (http://www.micromuse.com) is a large-scale fault management system geared toward large, complex networks. Netcool is currently deployed at many national ISPs, cable companies, and telephone companies across the world. Netcool brings together a whole suite of network monitoring tools into one massive all-in-one package: service monitoring, log analysis, historical reporting, problem assignment, integration with help desk systems, and notification, just to name a few.

Netcool's architecture is based on the object server, an in-memory database optimized for event collection from a variety of sources. An application generates a syslog message; Netcool stores the message in the event list of the object server. If a service monitor fails for your POP3 service, for example, that event is stored in the object server. When the POP3 service is working again, Netcool stores that event message in the object server as well.

Every event logged by Netcool is stored in the object server's event list. Filters and views can be built on top of the event list to customize which events each user is interested in seeing. Figure 6.4 shows two such event lists.

Figure 6.4. Netcool event lists.


Compiling all of your network events in one place is an extremely handy feature, but Netcool performs an additional step, called normalization, to help you use and manage the system monitoring activities. Log data doesn't look like service monitor results, so how can you possibly list this data and these results together in a coherent fashion? Normalization takes all of the fields associated with logs, monitor results, probes, and whatever else you monitor, and translates them into a set of common fields. These fields include machine name, severity, location, time of first occurrence, time of last occurrence, acknowledgment status, and so on.

Netcool's De-Duplication Feature

The Netcool report fields also include an event count. Netcool “de-duplicates” events so you never see the same event on more than one line. Remember the multiple “file system full” errors that you configured Logsurfer to ignore? Netcool's de-duplication mechanism handles that automatically.


Netcool also sports a robust trigger and action mechanism that can react to events based on complex Boolean logic. A simple example of this logic could be, “if the POP3 service fails on server3 between 5:00 p.m. and 9:00 a.m., page the on-call person for the first instance of the failure.” These triggers and actions work off of the object server's in-memory database and support a SQL-like query language that should be familiar to anyone who has used a relational database.

Unfortunately, Netcool can be very costly and could be out of reach for smaller companies with limited budgets. Netcool is a godsend for large, complex networks like those at ISPs, cable companies, and phone companies, but it is just not worth the time or effort for small networks. The learning curve is steep; it can take several weeks to several months to fully understand Netcool's capabilities and how to use them effectively.

Netcool also requires a large amount of system resources, including memory, CPU, and disk space, often necessitating the purchase of an expensive dedicated monitoring server, something a small company may not be willing to pay for. Using Netcool effectively also requires many different components, including license servers, Apache Web servers, and Oracle; each of these requires specific knowledge and experience to run correctly. Many organizations simply do not have the resources to dedicate to these tasks, even for the few months it takes to get things up and running. Organizations can hire consultants to handle this job for them, but that makes this already expensive product even more costly to use. For these organizations, some of the other alternatives in this section may be a better fit.

That said, Netcool is a great example of an enterprise network monitoring system that does it all. If you have the time and resources to dedicate to administering Netcool, it can be well worth the investment.

NetSaint

NetSaint (http://www.netsaint.org) is a free, open-source network monitor. Building it from source code is hassle free, and once configured, it is mostly maintenance free, an excellent quality in a monitoring system. NetSaint is not a full-featured enterprise system such as Netcool. Instead, NetSaint does one thing, service monitoring, and it does it well.

NetSaint plugins allow for a variety of standard and third-party service monitors. Old favorites like HTTP, FTP, POP3, and NNTP come standard, as well as an Oracle plugin that makes sure applications can connect to your Oracle databases. These service monitors work much like Netcool's; they run from the monitoring server and access services as a user would. If you have an in-house network application, you can write your own plugin to monitor it (just one of the benefits of open source).
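
The plugin convention is simple: print a single line of status text and exit 0 for OK, 1 for a warning, or 2 for a critical failure. A minimal sketch of a custom plugin (the daemon name myappd is hypothetical, and ps flags vary between System V and BSD derivatives):

#!/bin/sh
# check_myapp: NetSaint plugin that verifies an in-house daemon is up.
# The bracketed pattern keeps grep from matching its own process.
if ps -ef | grep '[m]yappd' > /dev/null; then
    echo "MYAPP OK: myappd is running"
    exit 0
else
    echo "MYAPP CRITICAL: myappd process not found"
    exit 2
fi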

NetSaint's user interface is provided by a set of CGI applications and requires a Web server like Apache. Most monitoring functions can be accessed via the Web, including the status of each host and service, summary views (see Figure 6.5), and the history of alerts for all of your systems.

Figure 6.5. NetSaint's Web interface provides a summary view of the status of your systems. In this example, most hosts and services are available, some are down, and some are still awaiting a service test.


Where NetSaint tends to fall short is in its configuration. The primary configuration is stored in a dense, busy-looking file; it contains every host and service that NetSaint monitors, in addition to contact information, notification methods, and host dependencies. Other files store contact information and monitoring commands. The distribution itself provides no configuration application; you have to create the configuration files on your own using a set of sample files.

A variety of third-party NetSaint configuration applications are available, and they make administration of the server much easier. The most popular of these applications is NEAT (NetSaint Easy Administration Tool), another CGI program that provides a Web-based tool for editing NetSaint configuration files. If you use NetSaint on a large network, an administration aid like NEAT is highly recommended. You can find it on the NetSaint downloads page at http://www.netsaint.org/download.

That said, once NetSaint is configured, it does the job well, and quietly at that. Months might pass without notifications from NetSaint; it is not chatty and won't produce many (if any) false alarms. If you don't have the budget for something like Netcool, NetSaint is a nice (and free) alternative.

Big Brother

Big Brother (http://www.bb4.com) is a monitoring system that takes a slightly different approach to system monitoring. In addition to monitoring services like POP and HTTP from a central server, you can also deploy Big Brother clients on various hosts; these clients actively monitor metrics like disk space, CPU usage, and process existence, and report back to the Big Brother server. This capability enables you to monitor virtually anything on remote servers. Big Brother also supports a plugin mechanism, so you can write your own custom monitors (see the sketch below). As you might expect, Big Brother supports service monitoring, with features like timeouts and thresholds (good for disk monitoring).
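
As a hedged sketch of that plugin mechanism, a Big Brother external test is a shell script that reports a colored status line back to the display server through variables the client framework sets ($BB, $BBDISP, and $MACHINE); the test name myapp and daemon myappd here are hypothetical:

#!/bin/sh
# myapp.sh: Big Brother external test reporting green or red status.
COLOR=green
MSG="myappd is running"
if ! ps -ef | grep '[m]yappd' > /dev/null; then
    COLOR=red
    MSG="myappd process not found"
fi
$BB $BBDISP "status $MACHINE.myapp $COLOR `date` $MSG"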

You configure Big Brother through various confusing configuration files, which can be frustrating to work with. But as a Unix administrator, you should be able to manage this task.

The Big Brother user interface is Web based and can display the status of all of your hosts in a table on one Web page. This all-in-one view is a real time saver, as you can see the status of your entire network with a quick glance. Tiny icons indicate the status of each monitored service, which is one of the following: OK, attention, trouble, no report, unavailable, or offline.

Many people shy away from Big Brother because it is more difficult to configure and doesn't look as pretty as other systems. While it does lack flair, Big Brother is very effective and has a large following on the Internet, where you can always find help with configuration, bugs, and plugins. In addition to the help you'll receive from those sources, Big Brother support is available for an annual fee as well.

Big Brother is free for non-commercial use, but if you're a business, you'll have to spend some money for a license. Fortunately, as of the time of this writing, the licenses are inexpensive compared with other commercial solutions.

Summary

After hours, days, or even weeks of installing, configuring, and testing your systems, constant monitoring can reassure you that the systems continue to work properly. In addition to existing systems, it's important to monitor any new systems you place into your infrastructure, and that monitoring should begin as soon as you deploy the new systems. Continually updating your monitoring system to reflect your current infrastructure allows you to be proactive about your monitoring, actively looking for problems rather than reacting to problems that have already occurred. Nothing is more embarrassing or career-threatening than having someone else report a major outage on your own systems to you. There are many ways to monitor systems, services, and logs, and many applications that can do the job well. You should choose the applications that best meet your monitoring needs and fit into your budget.