
Chapter 9. Networks

The subject of network configuration and performance has been extensively covered by other writers[1]. For that reason, this chapter concentrates on Sun-specific networking issues, such as the performance characteristics of the many network adapters and operating system releases.

[1] In particular, see Managing NFS and NIS by Hal Stern.

New NFS Metrics

Local disk usage and NFS usage are functionally interchangeable, so Solaris 2.6 was changed to instrument NFS client mount points as if they were disks! NFS mounts are always shown by iostat and sar. Automounted directories come and go far more often than disks come online, which may be an issue for performance tools that don’t expect the number of iostat or sar records to change often.

The full instrumentation includes the wait queue for commands in the client (biod wait) that have not yet been sent to the server; the active queue for commands currently in the server; and utilization (%busy) for the server mount point activity level. Note that unlike the case with disks, 100% busy does not indicate that the server itself is saturated; it just indicates that the client always has outstanding requests to that server. An NFS server is much more complex than a disk drive and can handle many more simultaneous requests than a single disk drive can.

Figure 9-1 shows the new -xnP option, although NFS mounts appear in all formats. Note that the P option suppresses disks and shows only disk partitions. The xn option breaks down the response time, svc_t, into wait and active times and puts the expanded device name at the end of the line so that long names don’t mess up the columns. The vold entry is used to mount floppy and CD-ROM devices.

Figure 9-1. Example iostat Output Showing NFS Mount Points
crun% iostat -xnP 
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 crun:vold(pid363)
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 servdist:/usr/dist
0.0 0.5 0.0 7.9 0.0 0.0 0.0 20.7 0 1 servhome:/export/home/adrianc
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 servhome:/var/mail
0.0 1.3 0.0 10.4 0.0 0.2 0.0 128.0 0 2 c0t2d0s0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 c0t2d0s2

New Network Metrics

The standard SNMP network management MIB for a network interface is supposed to contain IfInOctets and IfOutOctets counters that report the number of bytes input and output on the interface. These were not measured by network devices for Solaris 2, so the MIB always reported zero. Brian Wong and I filed bugs against all the different interfaces a few years ago, and bugs were filed more recently against the SNMP implementation. The result is that these counters have been added to the “le” and “hme” interfaces in Solaris 2.6, and the fix has been backported in patches for Solaris 2.5.1, as 103903-03 (le) and 104212-04 (hme).

The new counters added were:

  • rbytes, obytes — read and output byte counts

  • multircv, multixmt — multicast receive and transmit packet counts

  • brdcstrcv, brdcstxmt — broadcast receive and transmit packet counts

  • norcvbuf, noxmtbuf — buffer allocation failure counts

The full set of data collected for each interface can be obtained as described in “The Solaris 2 “kstat” Interface” on page 387. An SE script called dumpkstats.se prints out all of the available data, and an undocumented option, netstat -k, prints out the data. In Solaris 2.6, netstat -k takes an optional kstat name, as shown in Figure 9-2, so you don’t have to search through reams of data to find what you want.

Figure 9-2. Solaris 2.6 Example of netstat -k to See Network Interface Data in Detail
% netstat -k le0 
le0:
ipackets 0 ierrors 0 opackets 0 oerrors 5 collisions 0
defer 0 framing 0 crc 0 oflo 0 uflo 0 missed 0 late_collisions 0
retry_error 0 nocarrier 2 inits 11 notmds 0 notbufs 0 norbufs 0
nocanput 0 allocbfail 0 rbytes 0 obytes 0 multircv 0 multixmt 0
brdcstrcv 0 brdcstxmt 5 norcvbuf 0 noxmtbuf 0
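
Because rbytes and obytes are cumulative counters, throughput has to be derived from the difference between two samples. The following sketch is not from the original text: it samples netstat -k twice, a fixed interval apart, and prints an approximate combined bytes-per-second figure. The interface name is an assumption, and the script depends on the Solaris 2.6 “name value” output format shown in Figure 9-2.

#!/bin/sh
# Rough sketch: estimate interface throughput from the rbytes/obytes kstats.
IF=${1:-hme0}        # interface to sample (hypothetical default)
INTERVAL=10          # seconds between samples

bytes() {
        netstat -k $1 | awk '
                { for (i = 1; i < NF; i++)
                        if ($i == "rbytes" || $i == "obytes") sum += $(i + 1) }
                END { print sum }'
}

b1=`bytes $IF`
sleep $INTERVAL
b2=`bytes $IF`
rate=`expr \( $b2 - $b1 \) / $INTERVAL`
echo "$IF: approximately $rate bytes/sec (in + out)"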

Virtual IP Addresses

You can configure more than one IP address on each interface, as shown in Figure 9-3. This is one way that a large machine can pretend to be many smaller machines consolidated together. It is also used in high-availability failover situations. In earlier releases, up to 256 addresses could be configured on each interface. Some large virtual web sites found this limiting, and now a new ndd tunable in Solaris 2.6 can be used to increase that limit. Up to about 8,000 addresses on a single interface have been tested. Some work was also done to speed up ifconfig of large numbers of interfaces. You configure a virtual IP address by using ifconfig on the interface, with the number separated by a colon. Solaris 2.6 also allows groups of interfaces to feed several ports on a network switch on the same network to get higher bandwidth.

Figure 9-3. Configuring More Than 256 IP Addresses Per Interface
# ndd -set /dev/ip ip_addrs_per_if 300
# ifconfig hme0:283...
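
As a purely hypothetical expansion of Figure 9-3 (the address and netmask are made up, and exact argument requirements vary by release), the logical interface gets its own address with ifconfig and can then be inspected like any other interface; the logical interface number just has to stay below the ip_addrs_per_if limit.

# ifconfig hme0:283 192.168.10.28 netmask 255.255.255.0 up
# ifconfig hme0:283

The first command assigns the address and brings the logical interface up; the second simply reports its state.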

Network Interface Types

There are many interface types in use on Sun systems. In this section, I discuss some of their distinguishing features.

10-Mbit SBus Interfaces — “le” and “qe”

The “le” interface is used on many SPARC desktop machines. The built-in Ethernet interface shares its direct memory access (DMA) connection to the SBus with the SCSI interface but has higher priority, so heavy Ethernet activity can reduce disk throughput. This can be a problem with the original DMA controller used in the SPARCstation 1, 1+, SLC, and IPC, but subsequent machines have enough DMA bandwidth to support both.

The add-on SBus Ethernet card uses exactly the same interface as the built-in Ethernet but has an SBus DMA controller to itself. The more recent buffered Ethernet interfaces used in the SPARCserver 600, the SBE/S, the FSBE/S, and the DSBE/S have a 256-Kbyte buffer to provide a low-latency source and sink for the Ethernet. This buffer cuts down on dropped packets, especially when many Ethernets are configured in a system that also has multiple CPUs consuming the memory bandwidth. The disadvantage is increased CPU utilization as data is copied between the buffer and main memory. The most recent and efficient “qe” Ethernet interface uses a buffer but has a DMA mechanism to transfer data between the buffer and memory. This interface is found in the SQEC/S qe quadruple 10-Mbit Ethernet SBus card and the 100-Mbit “be” Ethernet interface SBus card.

100-Mbit Interfaces — “be” and “hme”

The 100baseT standard takes the approach of requiring shorter and higher-quality, shielded, twisted-pair cables, then running the normal Ethernet standard at ten times the speed. Performance is similar to FDDI, but with the Ethernet characteristic of collisions under heavy load. It is most useful to connect a server to a hub, which converts the 100baseT signal into many conventional 10baseT signals for the client workstations.

FDDI Interfaces

Two FDDI interfaces have been produced by Sun, and several third-party PCIbus and SBus options are available as well. FDDI runs at 100 Mbits/s and so has ten times the bandwidth of standard Ethernet. The SBus FDDI/S 2.0 “bf” interface is the original Sun SBus FDDI board and driver. It is a single-width SBus card that provides single-attach only. The SBus FDDI/S 3.0, 4.0, 5.0 “nf” software supports a range of SBus FDDI cards, including both single- and dual-attach types. These are OEM products from Network Peripherals Inc. The nf_stat command provided in /opt/SUNWconn/SUNWnf may be useful for monitoring the interface.

SBus ATM 155-Mbit Asynchronous Transfer Mode Cards

There are two versions of the SBus ATM 155-Mbit Asynchronous Transfer Mode card: one version uses a fiber interface, the other uses twisted-pair cables like the 100baseT card. The ATM standard allows isochronous connections to be set up (so audio and video data can be piped at a constant rate), but the AAL5 standard used to carry IP protocol data makes it behave like a slightly faster FDDI or 100baseT interface for general-purpose use. You can connect systems back-to-back with just a pair of ATM cards and no switch if you only need a high-speed link between two systems. ATM configures a 9-Kbyte segment size for TCP, which is much more efficient than Ethernet’s 1.5-Kbyte segment.

622-Mbit ATM Interface

The 622-Mbit ATM interface is one of the few cards that comes close to saturating an SBus. Over 500 Mbits/s of TCP traffic have been measured on a dual-CPU Ultra 2/2200. The PCIbus version has a few refinements and a higher-bandwidth bus interface, so it runs a little more efficiently. It was used for the SPECweb96 benchmark results when the Enterprise 450 server was announced. The four-CPU E450 needed two 622-Mbit ATM interfaces to deliver maximum web server throughput. See “SPECweb96 Performance Results” on page 83.

Gigabit Ethernet Interfaces — “vge”

Gigabit Ethernet is the latest development. With the initial release, a single interface cannot completely fill the network, but this will be improved over time. If a server is feeding multiple 100-Mbit switches, then a gigabit interface may be useful because all the packets are the same 1.5-Kbyte size. Overall, Gigabit Ethernet is less efficient than ATM and slower than ATM622 because of its small packet sizes and relative immaturity as a technology. If the ATM interface was going to be feeding many Ethernet networks, ATM’s large segment size would not be used, so Gigabit Ethernet may be a better choice for integrating into existing Ethernet networks.

Using NFS Effectively

The NFS protocol itself limits throughput to about 3 Mbytes/s per active client-side process because it has limited prefetch and small block sizes. The NFS version 3 protocol allows larger block sizes and other changes that improve performance on high-speed networks. This limit doesn’t apply to the aggregate throughput if you have many active client processes on a machine.
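
As an illustration only (the server, path, and mount point are made up, and option support varies by release; see mount_nfs), an NFS version 3 mount over a fast network might request larger transfer sizes like this:

# mount -o vers=3,rsize=32768,wsize=32768 servhome:/export/home /mnt/home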

First, some references:

  • Managing NFS and NIS by Hal Stern (O’Reilly)—essential reading!

  • SMCC NFS Server Performance and Tuning Guide

    The SMCC NFS Server Performance and Tuning Guide is part of the SMCC hardware-specific manual set. It contains a good overview of how to size an NFS server configuration. It is updated with each Solaris release, and I think you will find it very useful.

How Many NFS Server Threads?

In SunOS 4, the NFS daemon nfsd services requests from the network, and a number of nfsd daemons are started so that a number of outstanding requests can be processed in parallel. Each nfsd takes one request off the network and passes it to the I/O subsystem. To cope with bursts of NFS traffic, you should configure a large number of nfsds, even on low-end machines. All the nfsds run in the kernel and do not context switch in the same way as user-level processes do, so the number of hardware contexts is not a limiting factor (despite folklore to the contrary!). If you want to “throttle back” the NFS load on a server so that it can do other things, you can reduce the number. If you configure too many nfsds, some may not be used, but it is unlikely that there will be any adverse side effects as long as you don’t run out of process table entries. Take the highest number you get by applying the following three rules (a worked example follows the list):

  • Two NFS threads per active client process

  • Sixty-four NFS threads per SuperSPARC processor, 200 per UltraSPARC

  • Sixteen NFS threads per Ethernet, 160 per 100-Mbit network
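
As a worked example with made-up numbers: a server with two UltraSPARC processors, four 10-Mbit Ethernets, and about 100 active client processes gets 200 threads from the first rule, 400 from the second, and 64 from the third, so the highest value, 400, wins. On Solaris 2, the thread count is the numeric argument to the nfsd invocation in /etc/init.d/nfs.server (the -a flag selects all transports); the exact invocation below is an assumption based on that convention.

/usr/lib/nfs/nfsd -a 400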

What Is a Typical NFS Operation Mix?

There are characteristic NFS operation mixes for each environment. The SPECsfs mix is based on the load generated by slow diskless workstations with a small amount of memory that are doing intensive software development. It has a large proportion of writes compared to the typical load mix from a modern workstation. If workstations are using the cachefs option, then many reads will be avoided, so the total load is less, but the percentage of writes is more like the SPECsfs mix. Table 9-1 summarizes the information.

Table 9-1. The LADDIS NFS Operation Mix
NFS Operation   Mix   Comment (Possible Client Command)
getattr         13%   Get file attributes (ls -l)
setattr          1%   Set file attributes (chmod)
lookup          34%   Search directory for a file and return handle (open)
readlink         8%   Follow a symbolic link on the server (ls)
read            22%   Read an 8-KB block of data
write           15%   Write an 8-KB block of data
create           2%   Create a file
remove           1%   Remove a file (rm)
readdir          3%   Read a directory entry (ls)
fsstat           1%   Get filesystem information (df)

The nfsstat Command

The nfsstat -s command shows operation counts for the components of the NFS mix. This section is based upon the Solaris 2.4 SMCC NFS Server Performance and Tuning Guide. Figure 9-4 illustrates the results of an nfsstat -s command.

Figure 9-4. NFS Server Operation Counts
% nfsstat -s 

Server rpc:
calls badcalls nullrecv badlen xdrcall
2104792 0 0 0 0

Server nfs:
calls badcalls
2104792 5
null getattr setattr root lookup readlink read
10779 1% 966412 46% 13165 1% 0 0% 207574 10% 572 0% 686477 33%
wrcache write create remove rename link symlink
0 0% 179582 9% 5348 0% 9562 0% 557 0% 579 0% 32 0%
mkdir rmdir readdir statfs
120 0% 386 0% 12650 1% 10997 1%

The meaning and interpretation of the measurements are as follows:

  • calls — The total number of remote procedure (RPC) calls received. NFS is just one RPC application.

  • badcalls — The total number of RPC calls rejected, the sum of badlen and xdrcall. If this value is non-zero, then RPC requests are being rejected. Reasons include having a user in too many groups, attempts to access an unexported file system, or an improper secure RPC configuration.

  • nullrecv — The number of times an RPC call was not there when one was thought to be received.

  • badlen — The number of calls with length shorter than the RPC minimum.

  • xdrcall — The number of RPC calls whose header could not be decoded by the external data representation (XDR) translation.

  • readlink — If this value is more than 10 percent of the mix, then client machines are making excessive use of symbolic links on NFS-exported file systems. Replace the link with a directory, perhaps using a loopback mount on both server and clients.

  • getattr — If this value is more than 60 percent of the mix, then check that the attribute cache value on the NFS clients is set correctly. It may have been reduced or set to zero. See the mount_nfs command and the actimeo option; an example mount follows this list.

  • null — If this value is more than 1 percent, then the automounter time-out values are set too short. Null calls are made by the automounter to locate a server for the file system.

  • writes — If this value is more than 5 percent, then configure a Prestoserve option, NVRAM in the disk subsystem, or a logging file system on the server.
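
For example (the server, mount point, and timeout value here are made up; check mount_nfs for the actimeo option and its default on your release), a client whose getattr rate is excessive because its attribute cache has been disabled could be mounted with a 60-second attribute cache timeout:

# mount -o actimeo=60 servhome:/export/home/adrianc /home/adrianc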

NFS Clients

On each client machine, use nfsstat -c to see the mix, as shown in Figure 9-5; for Solaris 2.6 or later clients, use iostat -xnP to see the response times.

Figure 9-5. NFS Client Operation Counts (Solaris 2.4 Version)
% nfsstat -c 

Client rpc:
calls badcalls retrans badxids timeouts waits newcreds
1121626 61 464 15 518 0 0
badverfs timers toobig nomem cantsend bufulocks
0 442 0 0 0 0

Client nfs:
calls badcalls clgets cltoomany
1109675 6 1109607 0
Version 2: (1109678 calls)
null getattr setattr root lookup readlink read
0 0% 345948 31% 4097 0% 0 0% 375991 33% 214 0% 227031 20%
wrcache write create remove rename link symlink
0 0% 112821 10% 3525 0% 3120 0% 290 0% 54 0% 0 0%
mkdir rmdir readdir statfs
370 0% 45 0% 10112 0% 26060 2%


  • calls — The total number of calls sent.

  • badcalls — The total number of calls rejected by RPC.

  • retrans — The total number of retransmissions.

  • badxid — The number of times that a duplicate acknowledgment was received for a single NFS request. If it is approximately equal to timeout and above 5 percent, then look for a server bottleneck.

  • timeout — The number of calls that timed out waiting for a reply from the server. If the value is more than 5 percent, then RPC requests are timing out. If badxid stays below 5 percent while calls are timing out, the network is dropping parts of the requests or replies. Check that intervening networks and routers are working properly; consider reducing the NFS buffer size parameters (see mount_nfs rsize and wsize, and the example after this list), but reducing the parameters will reduce peak throughput.

  • wait — The number of times a call had to wait because no client handle was available.

  • newcred — The number of times authentication information had to be refreshed.

  • null — If this value is above zero by a nontrivial amount, then increase the automount timeout parameter timeo.
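
For example (a hypothetical mount, and the trade-off is lower peak throughput), a client seeing timeouts with a near-zero badxid count across an unreliable route could drop the transfer size from the 8-Kbyte default to 4 Kbytes:

# mount -o rsize=4096,wsize=4096 servhome:/var/mail /var/mail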

You can also view each UDP-based mount point by using the nfsstat -m command on a client, as shown in Figure 9-6. TCP-based NFS mounts do not use these timers.

Figure 9-6. NFS Operation Response Times Measured by Client
% nfsstat -m 
/home/username from server:/export/home3/username
Flags: vers=2,hard,intr,down,dynamic,rsize=8192,wsize=8192,retrans=5
Lookups: srtt=7 (17ms), dev=4 (20ms), cur=2 (40ms)
Reads: srtt=16 (40ms), dev=8 (40ms), cur=6 (120ms)
Writes: srtt=15 (37ms), dev=3 (15ms), cur=3 (60ms)
All: srtt=15 (37ms), dev=8 (40ms), cur=5 (100ms)
/var/mail from server:/var/mail
Flags: vers=2,hard,intr,dynamic,rsize=8192,wsize=8192,retrans=5
Lookups: srtt=8 (20ms), dev=3 (15ms), cur=2 (40ms)
Reads: srtt=18 (45ms), dev=6 (30ms), cur=5 (100ms)
Writes: srtt=9 (22ms), dev=5 (25ms), cur=3 (60ms)
All: srtt=8 (20ms), dev=3 (15ms), cur=2 (40ms)

This output shows the smoothed round-trip time (srtt), the deviation or variability of this measure (dev), and the current time-out level for retransmission (cur). Values are converted into milliseconds and are quoted separately for read, write, lookup, and all types of calls.

The system will seem slow if any of the round-trip times exceeds 50 ms. If you find a problem, watch the iostat -x measures on the server for the disks that export the slow file system, as described in “How iostat Uses the Underlying Disk Measurements” on page 194. If the write operations are much slower than the other operations, you may need a Prestoserve, assuming that writes are an important part of your mix.

NFS Server Not Responding

If you see the “not responding” message on clients and the server has been running without any coincident downtime, then you have a serious problem. Either the network connections or the network routing is having problems, or the NFS server is completely overloaded.

The netstat Command

Several options to the netstat command show various parts of the TCP/IP protocol parameters and counters. The most useful options are the basic netstat command, which monitors a single interface, and the netstat -i command, which summarizes all the interfaces. Figure 9-7 shows output from the netstat -i command.

Figure 9-7. netstat -i Output Showing Multiple Network Interfaces
% netstat -i 
Name Mtu Net/Dest Address Ipkts Ierrs Opkts Oerrs Collis Queue
lo0 8232 loopback localhost 1247105 0 1247105 0 0 0
bf0 4352 labnet-fddi testsys-fddi 5605601 0 1266263 0 0 0
le1 1500 labnet-71 testsys-71 738403 0 442941 0 11485 0
le2 1500 labnet testsys-lab 4001566 1 3141510 0 47094 0
le3 1500 labnet-tpt testsys 4140495 2 6121934 0 70169 0


From a single measurement, you can calculate the collision rate since boot time; by noting the difference in the packet and collision counts over time, you can calculate the ongoing collision rate as Collis * 100 / Opkts for each device. In this case, lo0 is the internal loopback device, bf0 is an FDDI so has no collisions, le1 has a 2.6 percent collision rate, le2 has 1.5 percent, and le3 has 1.2 percent.
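
A rough one-line sketch of that calculation (not from the original text), using awk over the netstat -i columns shown in Figure 9-7 and skipping interfaces that have not sent any packets:

% netstat -i | awk 'NR > 1 && $7 > 0 { printf "%-6s %5.2f%% collisions\n", $1, $9 * 100 / $7 }'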

For more useful network performance summaries, see the network commands of the SE toolkit, as described starting with “net.se” on page 486.