Chapter 6. Source Code Optimization

This chapter is aimed at software developers. It covers instrumenting and tuning Java programs, 64-bit development issues, linking with libraries, and optimizing compiler options.

Java Tuning

There is a good paper on Java that includes a lot of performance tuning information at http://www.sun.com/solaris/java/wp-java/, and I have found a very useful web site with up-to-date performance and optimization information at http://www.cs.cmu.edu/~jch/java/optimization.html. A Java percollator graphing program that I'm tuning includes a useful class to help measure Java performance. I've tried out the profiler feature of Java WorkShop 2.0 on it.

Almost two years ago, I figured out Java far enough to put together a very simple performance graphing applet, and during summer 1997, I helped a student intern rewrite and extend it. The result is a much more useful tool that runs both as an applet and as an application and browses the files generated by my percollator script for monitoring web server performance. When graphing large amounts of data, the program can slow down, so I added code to time critical operations and report a summary. Since that is a useful code fragment in itself, that's where I'll start.

The Metrognome Class

I wanted a simple object that measured a time interval multiple times, then generated a summary of the results. The name is a joke, as it's a kind of small, helpful, personalized metronome. The main design idea was to create a separate object that had a simple collection interface but which could be upgraded and extended without changing the interface. That way, a program using the Metrognome class could be instrumented in more detail by just replacing the class. No recompilation or interface changes should be necessary.

You construct a Metrognome by giving it a text label that identifies it.

Metrognome redrawTimer = new Metrognome("Display redraw");

At some point in the code, let's say during the paint method that redraws the display, we call the start method, then the stop method.

public void paint(Graphics g) {
    redrawTimer.start();
    // lots of display update code
    redrawTimer.stop();
}

The Metrognome class will collect the duration of all these display updates. Somewhere else in the code, we need a way to show the results. The getSummary method returns a String that can be displayed in a GUI, written on exit to the Java console, or whatever else you like. Some individual performance data can also be read from the Metrognome and displayed, but it is easy to update the contents of the String if a more sophisticated Metrognome is used.
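For example, the summary can simply be written to the Java console as the program exits (this particular call is my illustration, not code from the tool):

System.out.println(redrawTimer.getSummary());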

The summary string shows the label, count, and min/mean/max times in seconds.

"Display redraw: count= 5 Latency(s) min= 0.017 mean= 0.029 max= 0.041"

Figure 6-1. Code for the Metrognome Class
// Metrognome.java - the 'g' was Adrian's idea, not mine

import java.lang.*;
import java.util.*;
import java.text.*;

public class Metrognome extends Object {
    private String label;
    private long total_time;
    private long begin_time;
    private long count;
    private long min;
    private long max;

    public Metrognome(String lb) {
        label = new String(lb);
        begin_time = 0;
        count = 0;
        total_time = 0;
        min = 0;
        max = 0;
    }

    public void start() {
        if (begin_time == 0) {
            begin_time = new Date().getTime();
        }
    }

    public void stop() {
        long diff;
        if (begin_time != 0) {
            diff = (new Date().getTime() - begin_time);
            if (count == 0 || min > diff) min = diff;
            if (max < diff) max = diff;
            total_time += diff;
            begin_time = 0;
            count++;
        }
    }

    public double getMeanLatency() {
        if (count == 0) {
            return 0.0;
        } else {
            return ((total_time / 1000.0) / count);
        }
    }

    public long getCount() {
        return count;
    }

    public long getTotalTime() {
        return total_time;
    }

    public String getSummary() {
        DecimalFormat df = new DecimalFormat("0.0##");
        return label + ": count= " + count + " latency(s) min= " +
            df.format(min / 1000.0) + " mean= " + df.format(getMeanLatency()) +
            " max= " + df.format(max / 1000.0);
    }
}
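
To try the class out standalone, a minimal harness along these lines works (the iteration count and sleep intervals are arbitrary choices of mine, standing in for real work):

// MetrognomeTest.java - minimal standalone check, not part of GPercollator
public class MetrognomeTest {
    public static void main(String[] args) throws InterruptedException {
        Metrognome timer = new Metrognome("Busy loop");
        for (int i = 0; i < 5; i++) {
            timer.start();
            Thread.sleep(20 + i * 10); // stand-in for real work
            timer.stop();
        }
        System.out.println(timer.getSummary());
    }
}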


As usual, I have more ideas than time to implement them. Some possibilities for extending the Metrognome class without changing the interface to the rest of the program might include these actions:

  • Accumulate time buckets to produce a histogram of the time distribution (a sketch of this idea follows the list).

  • Log a timestamped record to a buffer on each start.

  • Update the record on each stop with the duration.

  • Log the records somewhere.

  • Make all the Metrognomes share one big buffer—you would need to apply access locking.

  • Wait until a few times have been counted, then compare against the mean and deviation to report on abnormally large delays.

  • Open an RMI connection to report back to a central data collection server.
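
As an illustration of the first idea, a histogram-collecting variant might look like the sketch below. It keeps the same start/stop/getSummary interface so it can replace Metrognome directly; the bucket boundaries are arbitrary choices of mine, not from the original tool.

// HistogramMetrognome.java - a sketch of the time-bucket idea;
// same interface as Metrognome, so it can be swapped in
import java.util.Date;

public class HistogramMetrognome {
    // bucket upper bounds in milliseconds; illustrative values only
    private static final long[] BOUNDS = { 10, 20, 50, 100, 200, 500 };
    private long[] buckets = new long[BOUNDS.length + 1];
    private String label;
    private long begin_time = 0;

    public HistogramMetrognome(String lb) {
        label = lb;
    }

    public void start() {
        if (begin_time == 0) {
            begin_time = new Date().getTime();
        }
    }

    public void stop() {
        if (begin_time == 0) return;
        long diff = new Date().getTime() - begin_time;
        begin_time = 0;
        int i = 0;
        while (i < BOUNDS.length && diff > BOUNDS[i]) i++;
        buckets[i]++; // the last bucket catches everything over 500 ms
    }

    public String getSummary() {
        StringBuffer sb = new StringBuffer(label + ": bucket counts");
        for (int i = 0; i < buckets.length; i++) {
            sb.append(" ").append(buckets[i]);
        }
        return sb.toString();
    }
}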

Java Tuning and Java Virtual Machines

One of the nice things about Java is that investment levels are very high. That is, everyone is competing to provide the fastest Java Virtual Machine, and speeds are increasing all the time. This can trap the unwary. Things that are slow in one release can suddenly become fast in the next one. You must be very careful not to write unnatural code just for the sake of short-term performance gains. The introduction of first-generation Just In Time (JIT) compilers has made loops go much faster, but JIT compilers do not speed up method invocations very much, and since a whole class is compiled before it is run, startup time is increased and gets worse as optimization levels rise. As more JIT-based optimizations are perfected, more language features will be accelerated. Sun is working on an incremental compiler technology, known as HotSpot™, which uses a run-time profile of the code to decide which classes and methods to compile. It has quite different performance characteristics from those of a JIT, with a faster startup time and better directed optimization.

On Solaris, there are several versions of Java. Java is upward compatible, but if you use recent features, you cannot go back and run the code on earlier releases. The Java 1.0 specification and its support classes were found to need improvements in some areas. Java 1.1 was based on a lot of user feedback and takes code from several vendors, not just Sun, to provide a much more robust and usable standard that should remain stable for much longer than 1.0.

GPercollator Performance and Size Issues

The GPercollator GUI is shown in Figure 16-22 on page 493. The tool can append many days of data from separate URLs and can display several colored traces of different metrics. The two operations that we timed using Metrognomes in GPercollator are the graphical display update and the data load from URL operation. At present, they seem to run reasonably quickly, but as the amount of data on the display increases, the update slows down. One display option, triggered by an Info button, shows credits, the version, and the Metrognome summaries.

The most CPU-intensive operation is loading a new day's data. Each file contains about 280 rows and 20 to 30 columns of data, mostly numeric. Processing the file is tricky and inefficient, taking up to a second of CPU time. I tried to use the StreamTokenizer, but it doesn't behave the way I want it to when processing numbers. I store the result in a class called a DataFrame that contains a Vector of Vectors, which can cope with the variation in type from one column to another. Some data starts with a numeral but is really a string, like the local timezone timestamp: “08:35:20”. The StreamTokenizer breaks this into two tokens if you let it interpret the number. We force it to pick off space-delimited strings and to check that the whole token is a valid number before converting the data, using Double(String). This conversion operation is inefficient; a comment in the code for the Double class points this out. The StreamTokenizer contains its own code for parsing numbers, which seems like wasteful duplication, but it is much faster. I'm hoping that the performance of Double(String) will be fixed for me one day.
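
A sketch of that parsing approach is shown below; the class and method names are mine, since the relevant GPercollator code is not reproduced here. Every token is read as a whitespace-delimited word, and Double(String) doubles as the validity check by throwing NumberFormatException for anything that is not a number.

import java.io.*;
import java.util.Vector;

// Sketch: read whitespace-delimited tokens, keeping numbers as Doubles
// and anything else (like "08:35:20") as Strings.
public class TokenizeSketch {
    static Vector load(Reader r) throws IOException {
        Vector row = new Vector();
        StreamTokenizer st = new StreamTokenizer(r);
        st.resetSyntax();           // turn off built-in number parsing
        st.whitespaceChars(0, ' '); // space and below separate tokens
        st.wordChars(' ' + 1, 255); // all printing characters build words
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            try {
                row.addElement(new Double(st.sval)); // costly, as noted above
            } catch (NumberFormatException e) {
                row.addElement(st.sval); // not numeric: keep the String
            }
        }
        return row;
    }
}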

The other area of concern is the size of the program. It can load many days' worth of data and then show up to 16 differently colored plot lines on the graph. The process size on Solaris when GPercollator is run as an application can reach well over 10 Mbytes. The code size is under 50 Kbytes, but it does require the additional Java WorkShop™ GUI class library visualrt.zip, which is about 600 Kbytes. To save downloading this library dynamically each time you start an applet, you can grab a copy and put it on your local CLASSPATH. The main memory hog is the array of DataFrames, each containing a Vector of Vectors of Doubles or Strings. I'd like to make it a Vector of arrays of double where possible but haven't yet figured out how. I do trim the Vectors down to size once I have finished reading the file. I tried to find a tool that could pinpoint the memory usage of each object but only found people who agreed with me that it would be a good idea to build one.

The Java WorkShop Performance Analyzer

My summer intern, Graham Hazel, wrote the new graphical percollator browser (GPercollator) with a lot of feedback and a few code fixes from me. You can run it from http://www.sun.com/sun-on-net/www.sun.com/gpercol, and you can also get a tar file of the code from that location. We built GPercollator using Java WorkShop 2.0 as the development tool and GUI builder. One feature of Java WorkShop is that it provides a simple menu option that starts the program or applet along with a performance profiler. After the program exits, the profile is loaded and you can see which methods took longest to run. You can also see and traverse the call hierarchy.

When we first tried this, our top routine was an iso8859 character conversion method. Initially, we didn't see it because the profiler shows only your own code. When we looked at the system library code as well, we could see the problem. When we tracked it down, we realized that we were processing the input data without buffering it first. This is a common mistake, and when we wrapped a buffer around the input stream, the processing went a lot faster and that routine dropped way down the list. We also compiled the application with debug turned on to start with, and when we changed to invoke the optimizer, the individual class sizes dropped and we got a reasonable speedup on our classes. Overall performance is dominated by the provided system classes, so the biggest gains come from using the libraries more effectively.

Java WorkShop Profiler Display Examples

I compiled the code with debug and used the old-style input stream methods. This technique is deprecated in Java 1.1 but is based on old code I wrote using Java 1.0. I started it up as an applet from Java WorkShop, using the profiler button. The tool automatically loaded a data file, and I reloaded it another four times so that the load time would dominate the tool startup time. The initial profile shown in Figure 6-2 does not include system routines.

Figure 6-2. Debug Profile Without System Routines (Times in ms)


When the system routines are shown, as in Figure 6-3, the top one is the idle time routine Object.wait. Next comes the stream tokenizer, using about 15 seconds of CPU. The first routine of my own code, DataFrame.fetch, is about 1.5 seconds. Input goes via a BufferedInputStream.

Figure 6-3. Debug Profile with System Routines (Times in ms)
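
The old-style buffered setup that produced this profile would have looked something like this one-liner (my reconstruction; the tool's exact code is not shown here):

StreamTokenizer st = new StreamTokenizer(new BufferedInputStream(url.openStream()));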


The code is now brought up to date by adding an InputStreamReader between the input stream and the StreamTokenizer, rather than a BufferedInputStream.

InputStreamReader is = new InputStreamReader(url.openStream()); 
StreamTokenizer st = new StreamTokenizer(is);

This is part of the improved internationalization of Java 1.1. Spot the deliberate mistake—there are now about 200 seconds of overhead, with 104 seconds in ByteToChar8859_1.convert on its own. It needs a buffer!

Figure 6-4. Debug Profile Times with Unbuffered Input


Bufferingincreases the size of the chunk of data being processed by each methodinvocation, thus reducing the overall overhead. The new code wraps aBufferedReader around the input.

BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream())); 
StreamTokenizer st = new StreamTokenizer(br);


This buffering reduces the overhead to about the same level as the original code, as you can see in Figure 6-5.

Figure 6-5. Debug Profile Times with Java 1.1 BufferedReader


The next step is to turn off the debug compilation flag and turn on the optimizer. The total size of the compiled classes is 54 Kbytes when compiled with debug and 46 Kbytes when compiled with -O.

When the code is run with debug code but without the profiler, the average time taken to do a load operation, as measured by the Metrognome, is 0.775 seconds. This is a lot faster than the profiled time of 5.47 seconds, so there is quite a large profiler overhead. This result justifies using techniques like the Metrognome class to instrument your own code. When the code was optimized, overall performance did not increase much, but since most of the time is spent in system routines, this is not really surprising. If we look at the profile of the optimized code in Figure 6-6, excluding system time, the DataFrame.fetch routine is a lot faster, but that only amounts to a reduction from 1.5 seconds to 1.0 seconds as the total CPU for five fetch operations. To tune the code further, I need to work on making more efficient use of the system functions. Here is the optimized profile for my own code.

Figure 6-6. Optimized Profile Times


So far, I have managed to build a program that is fast enough to be useful but bigger than I would like. It's a useful test bed for me as I learn more about tuning Java code to make it smaller and faster. The subject is not covered much in the many books on Java, but SunSoft Press is planning a book on Java Performance, by Achut Reddy of the Java WorkShop development team.

When Does “64 Bits” Mean More Performance?

There has recently been a lot of talk about 64-bit systems. There are also claims that 64 bits are faster than 32 bits, and some of the fastest CPUs have 64-bit capability. Can you take advantage of 64-bit operations to make applications run faster? There have been several occasions over the last few years when systems have been marketed as “64 bit,” so what is the difference this time?

There is a lot of confusion surrounding this issue. Part of the problem is that it is not always clear what the term “64 bit” is referring to. I'll start with a generic statement of the difference between 32-bit and 64-bit operations. We can then look at the many ways in which “64 bitness” has been applied over the years.

Note

64-bit operations provide more capacity than do 32-bit operations. If you are performing operations that use more than 32 bits, then these operations can be performed with several 32-bit operations or a reduced number of 64-bit operations. If you are performing operations that fit within 32 bits, then they will run on a 64-bit system at the same speed—or sometimes more slowly.
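
A Java illustration of the capacity point (my example, not the book's): int is always 32 bits and long always 64 bits in Java, so a value that exceeds 32-bit capacity wraps in an int but not in a long.

// Capacity.java - 32-bit arithmetic wraps where 64-bit arithmetic has room
public class Capacity {
    public static void main(String[] args) {
        int  i32 = Integer.MAX_VALUE; // 2147483647, the 32-bit signed limit
        long i64 = Integer.MAX_VALUE;
        i32 += 1; // wraps to -2147483648: past 32-bit capacity
        i64 += 1; // 2147483648: well within 64-bit capacity
        System.out.println(i32 + " vs " + i64);
    }
}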


There are two ways to increase computer performance. You can increase the clock rate, and you can increase the amount of work done in each clock cycle. Over the years, microprocessors have steadily increased their clock rates. To do more work per cycle, they have also increased the width of their internal and external buses, arithmetic operations, and memory addresses. Over the last few years of microprocessor evolution, arithmetic operations moved to 64 bits first. Internal and external buses went to 64 bits at the same time or soon afterward. We are now in the final phase, as all the major microprocessor architectures are adding 64-bit addressing capability.

It is important to consider the software implications of the move to 64 bits. In the case of the width of an internal or external bus, no software changes are required. In the case of arithmetic operations, most computer languages already support a full range of 32-bit and 64-bit arithmetic options. If some of the implicit defaults change (such as the size of an int or long in C), then some work is required to port code. In the case of changes in addressing, the implication is a major rewrite of the operating system code, and any applications that use the increased addressing capability will need some work.

SPARC and Solaris Versions

Older SPARC CPUs implement the 32-bit Version 8 definition of the architecture. This includes microSPARC™ I, microSPARC II, HyperSPARC™, SuperSPARC, and SuperSPARC II. The very oldest SPARC CPUs rely on the operating system to emulate any missing features so that they can still run SPARC V8 code. Sun's UltraSPARC and Hal Computer Systems' SPARC64 are the first two 64-bit SPARC V9 implementations. Since SPARC V9 is a full superset of V8 for user-mode applications, V8 code runs perfectly well on these new CPUs. In this case, the terms 32 bit and 64 bit refer to the size of the linear address space that can be manipulated directly by the CPU. To take advantage of any SPARC V9 features, the application has to be specially recompiled and will no longer run on SPARC V8 systems. An intermediate specification is known as V8plus. It is a minor 32-bit upgrade to V8 that includes some of the extended features that are part of V9. Solaris 2.5 through 2.6 support V8plus mode for UltraSPARC-specific code. Additional support for SPARC V9 is due in the next major Solaris release.

64-bit Floating-Point Arithmetic

Floating-point arithmetic is defined by the IEEE 754 standard for almost all systems. The standard includes 32-bit single precision and 64-bit double precision data types. For many years, single precision operations were faster than double precision. This changed when first supercomputers, then microprocessors, implemented 64-bit floating point in a single cycle. Since both 32-bit and 64-bit operations take the same time, this was the first opportunity for marketing departments to talk about 64 bitness. It is common for Fortran programmers to assume that a 64-bit system is one where double precision floating point runs at full speed.

The SuperSPARC, HyperSPARC, and UltraSPARC processors all contain floating-point units that can perform double precision multiply and add operations at the same rate as single precision. They can all be counted as full-speed, 64-bit arithmetic processors. The older SPARC CPUs and microSPARC I and II all take longer to perform double precision operations. One of the first microprocessors to be marketed as 64 bit, on the basis of arithmetic operations, was the Intel i860 in the early 1990s.

SPARC defines 32 floating-point registers. SPARC V8 defines these as single precision registers. For 64-bit operations, the registers pair up, so there are only 16 double precision floating-point registers. SPARC V8plus and V9 define all the registers to be capable of operating in single, double, or a SPARC V8-compatible paired mode. You get 32 double precision registers in SPARC V9 mode.

In summary: Full-speed, 64-bit floating-point arithmetic has been available since SuperSPARC shipped in 1992. A full set of 64-bit floating-point registers is available in V8plus mode on UltraSPARC.

64-bit Integer Arithmetic

The C language is usually defined to have int and long as 32-bit values on 32-bit systems. To provide for 64-bit arithmetic, an extension to ANSI C creates the 64-bit long long type. Some CPU architectures have no support for 64-bit integer operations and use several 32-bit instructions to implement each 64-bit operation. The SPARC architecture has always provided support for 64-bit integers, using pairs of 32-bit registers. Older SPARC chips take many cycles to execute the instructions, but SuperSPARC, HyperSPARC, and UltraSPARC can all efficiently implement 64-bit load, store, add, subtract, and shift operations.

SPARC V8 defines the integer registers to be 32 bit. In SPARC V9, all the registers support 32-bit, 64-bit, and a V8-compatible paired mode. UltraSPARC supports some 64-bit integer operations in V8plus mode and can accelerate 64-bit integer multiplies and divides. A problem is that because the data is passed in V8-compatible, paired 32-bit registers, it has to be moved to 64-bit V9 registers to do the multiply, then moved back, so there will be some benefit to integer arithmetic from V9 mode.

In summary: Fast 64-bit integer arithmetic has been available since SuperSPARC shipped in 1992. Integer multiply and divide are accelerated by V8plus and will be accelerated further by V9.

The Size of int and long

In an unprecedented display of common sense, the major players in the industry have made some decisions about a 64-bit version of the Unix programming interface. The most basic decision revolves around the question of the size of int and long when pointers are 64 bits. The agreement is to leave int as 32 bits but to make long 64 bits. This decision is in line with systems that have been shipping for some time from SGI and Digital. You can prepare your code in advance by always using long types to perform pointer arithmetic and avoiding int. This approach is known as the LP64 option: long and pointer types are 64 bit, and int remains 32 bit. Correctly written code is portable between ILP32 and LP64 modes without the need to use any conditional compilation options.

Internal Buses or Datapaths

The internal buses that route data inside the CPU are also known as datapaths. If a datapath is only 32 bits wide, then all 64-bit-wide data will take two cycles to transfer. As you might guess, the same CPUs that can perform 64-bit floating-point and integer operations in one clock cycle also have 64-bit internal datapaths, namely, SuperSPARC, HyperSPARC, and UltraSPARC. The wide datapath connects the registers, the arithmetic units, and the load/store unit. Since these CPUs are also superscalar, they can execute several instructions in a single clock cycle.

The permitted combinations of instructions are complex to explain, so I'll just provide one example. In SuperSPARC, the integer register file can accept one 64-bit result or two 32-bit results in each clock cycle. For example, two 32-bit integer adds could be processed—or a single 64-bit integer load. Even though a single 64-bit operation takes the same length of time as a single 32-bit operation, the CPU can sustain twice as many 32-bit operations in the same time period.

External Buses and Caches

Data has to get in and out of the CPU somehow. On the way, it gets stored in caches that speed up repeated access to the same data and accesses to adjacent data items in the same cache block. For some applications, the ability to move a lot of data very quickly is the most important performance measure. The width of the cache memory and the buses that connect to caches and to main memory is usually 64 bits and as much as 128 or 256 bits in some CPUs. Does this make these CPUs into something that can be called a 64-, 128-, or 256-bit CPU? Apparently, Solbourne thought so in 1991 when it launched its own SPARC chip design, marketing it as the first 64-bit SPARC chip! In fact, it was the first SPARC chip that had a 64-bit-wide cache and memory interface, and it was similar in performance to a 33 MHz microSPARC.

All of the microSPARC, SuperSPARC, HyperSPARC, and UltraSPARC designs have caches and main memory interfaces that are at least 64 bits wide. The SuperSPARC on-chip instruction cache uses a 128-bit-wide datapath to load four instructions into the CPU in each cycle. The SuperSPARC and HyperSPARC access memory over the MBus, a 64-bit-wide system interconnect. The MBus-based systems use 144-bit-wide memory SIMMs to provide 128 bits of data plus error correction. It takes two data transfer cycles to pass the data from the SIMM over the MBus to the CPU. For UltraSPARC system designs, the 128-bit interface from the UltraSPARC chip connects to its external cache and continues to a main memory system that provides 256 or 512 bits of data per cycle.

In summary: External interfaces and caches are already at least 64 bits wide. The fastest designs are wider still.

64-bit Addressing

The latest development is 64-bit addressing. Pointers and addresses go from 32-bit quantities to 64-bit quantities. Naturally, the marketing departments are in full swing, and everyone is touting his high-performance, 64-bit capabilities as the solution to the world's problems. In fact, the performance improvements in UltraSPARC and other 64-bit processors come despite the 64-bit address support, not because of it!

If we go back to my generic statement at the beginning of this section, you get a performance improvement in going from 32 bit to 64 bit only if you are exceeding the capacity of 32 bits. In the case of addresses, this implies that you can expect a performance improvement if you are trying to address more than 4 Gbytes of RAM. There is also a downside from the increase in size of every pointer and address. Applications embed many addresses and pointers in code and data. When they all double in size, the application grows, reducing cache hit rates and increasing memory demands. It seems that you can expect a small decrease in performance for everyday applications.

One benchmark that benefits from 64-bit addressing is the TPC-C OLTP database test. High-end results are already using almost 4 Gbytes of RAM for the shared memory database buffer pool. You can expect that as vendors move to 64-bit addressing and can configure more than 4 Gbytes of shared memory, even higher TPC-C results will be posted. This expectation has little relevance to most production systems and is not an issue for the TPC-D Data Warehouse database test, which can use many private data areas and a much smaller shared memory space.

A Hundred Times Faster? Nothing to Scream About!

How does this discussion square up with those Digital advertisements that claimed their computer was 100 times faster because of the use of 64 bits? On page 89 of the July/August 1995 issue of Oracle Magazine (http://www.oramag.com) is a description of the tests that Digital performed. I'll try to briefly summarize what the article says about the “100 times faster” figure.

The comparison was performed by running two tests on the same uniprocessor Digital Alpha 8400 system. That is, both the slow and the 100-times-faster results were measured on a 64-bit machine running a 64-bit Unix implementation. The performance differences were obtained with two configurations of an Oracle “data warehouse style” database running scans, index builds, and queries on each configuration. Of several operations that were evaluated, the slowest was 3 times faster, and the fastest was a five-way join that was 107 times faster.

The special features of the configuration were that the database contained about 6 Gbytes of data, and the system was configured with two different database buffer memory and database block sizes. The slow configuration had 128 Mbytes of shared buffer memory and 2-Kbyte database blocks. The fast configuration had 8.5 Gbytes of shared buffer memory and 32-Kbyte database blocks. For the fast tests, the entire database was loaded into RAM, and for the slow tests, lots of disk I/O in small blocks was performed.

So, what does this have to do with 64 bits versus 32 bits? The performance difference is what you would expect to find when comparing RAM speeds with disk speeds. They greatly magnified the performance difference by using a database small enough to fit in memory on the fast configuration, and only 128 Mbytes of shared buffer memory and smaller blocks on the slow configuration.

To redress the balance, a more realistic test should have a database that is much larger than 6 Gbytes. A few hundred gigabytes is more like the kind of data warehouse that would justify buying a big system. Six Gbytes is no more than a “data shoebox”; it would fit on a single disk drive. The comparison should be performed with the same 32-Kbyte database block size for both tests, and a 3.5-Gbyte shared buffer, not a 128-Mbyte shared buffer, for the “32 bit” configuration. In this updated comparison, I would not be surprised if both configurations became CPU and disk bound, in which case they would both run at essentially the same speed.

At Sun, we ran some similar tests, using an E6000 with 16 Gbytes of RAM. We found that we could use UFS filesystem caching to hold a 10-Gbyte database in RAM, for an OLTP transaction speedup of 350 times over a small memory configuration. This did not require any 64-bit address support.

How Does a 32-bit System Handle More Than 4 Gbytes of RAM?

It may have occurred to some of you that Sun currently ships SPARC/Solaris systems that can be configured with more than 4 Gbytes of RAM. How can this be? The answer is that the SuperSPARC memory management unit maps a 32-bit virtual address to a 36-bit physical address, and the UltraSPARC memory management unit maps a 32-bit virtual address to a 44-bit physical address. While any one process can access only 4 Gbytes, the rest of memory is available to other processes and also acts as a filesystem cache. UltraSPARC systems also have a separate 4-Gbyte address space just for the kernel.

The Sun SPARCcenter 2000 supports 5 Gbytes of RAM, and the Cray CS6400 supports 16 Gbytes of RAM. The Sun E6000 supports 30 Gbytes of RAM, and the E10000 supports 64 Gbytes of RAM. It would be possible to make database tables in the filesystem and to cache up to 64 Gbytes of files in memory on an E10000. There is some overhead mapping data from the filesystem cache to the shared database buffer area, but nowhere near the overhead of disk access.

The Bottom Line

So, where does this leave us on the issue of performance through having 64 bits? In the cases that affect a large number of applications, there is plenty of support for high-performance 64-bit operations in SuperSPARC and HyperSPARC. The performance improvements that UltraSPARC brings are due to its higher clock rate, wider memory interface, and other changes. The full 64-bit SPARC V9 architecture has an advantage over the intermediate 32-bit V8plus specification because the integer registers can handle 64-bit integer arithmetic more efficiently.

There are a small number of applications that can use 64-bit addresses to access more than 4 Gbytes of RAM to improve performance. There should be good application source code portability across the 64-bit-addressed implementations of several Unix variants.

Most vendors have released details of their 64-bit CPU architectures and have announced or shipped 64-bit implementations. The marketing of 64 bitness is now moving to software considerations like operating system support and application programming interfaces.

The next major release of Solaris will have the option of running with a 64-bit-addressed kernel on UltraSPARC systems. In this mode:

  • Both 32-bit and 64-bit environments are supported simultaneously for user programs.

  • All 32-bit user-level applications will run unchanged. There is complete backward compatibility with applications that run under Solaris 2.6. The majority of the system-supplied commands and utilities will still run in the 32-bit environment.

  • All device drivers will need to be updated to handle 64-bit addresses. Existing device drivers will work only in a 32-bit kernel.

  • New 64-bit applications will link to additional 64-bit versions of the supplied system libraries. A 64-bit application cannot link with a 32-bit library.

  • The file command can be used to identify whether a binary or library is 32 bit or 64 bit.

% file /bin/ls 
/bin/ls: ELF 32-bit MSB executable SPARC Version 1, dynamically linked, stripped


If you are a developer of device drivers and libraries, you should talk to Sun about early access and ISV support programs for the upcoming release, as other users will be dependent upon device and library support.

The white paper, 64-bit Computing in Solaris, is available at http://www.sun.com/solaris/wp-64bit.

Linker Options and Tuning

Dynamic linking is the default for Solaris 2. It has several advantages and, in many cases, better performance than static linking.

I'll start by explaining how static and dynamic linking operate so that you understand the difference. I'll explain why dynamic linking is preferred, then I'll look at using dynamic linking to improve application performance. The difference between interfaces and implementations is a crucial distinction to make.

  • Interfaces — Interfaces are designed to stay the same over many releases of the product. That way, users or programmers have time to figure out how to use the interface.

  • Implementations — The implementation hides behind the interface and does the actual work. Bug fixes, performance enhancements, and underlying hardware differences are handled by changes in the implementation. There are often changes from one release of the product to the next, or even from one system to another running the same software release.

Static Linking

Static linking is the traditional method used to combine an application program with the parts of various library routines that it uses. The linker is given your compiled code, containing many unresolved references to library routines. It also gets archive libraries (for example, /usr/lib/libm.a) containing each library routine as a separate module. The linker keeps working until there are no more unresolved references, then writes out a single file that combines your code and a jumbled-up mixture of modules containing parts of several libraries. The library routines make system calls directly, so a statically linked application is built to work with the kernel's system call interface.

Archive libraries are built with the ar command, and, in older versions of Unix, the libraries had to be processed by ranlib to create an index of the contents for random access to the modules. In Solaris 2, ranlib is not needed; ar does the job properly in the first place. Sun had so many people ask “Where's ranlib?” that in Solaris 2.5, it was put back as a script that does nothing! It acts as a placebo for portable makefiles that expect to find it on every system.

The main problem with static linking is that the kernel system call interface is in itself a dynamic binding, but it is too low level. Once upon a time, the kernel interface defined the boundary between applications and the system. The architecture of the system is now based on abstractions more sophisticated than the kernel system call interface. For example, name service lookups use a different dynamic library for each type of server (files, NIS, NIS+, DNS), and this library is linked to the application at run time.

Performance problems with static linking arise in three areas. First, the RAM that is wasted by duplicating the same library code in every statically linked process can be very significant. For example, if all the window system tools were statically linked, several tens of megabytes of RAM would be wasted for a typical user, and the user would be slowed down a lot by paging. The second problem is that the statically linked program contains a subset of the library routines, and they are jumbled up. The library cannot be tuned as a whole to put routines that call each other onto the same memory page. The whole application could be tuned this way, but very few developers take the trouble. The third performance problem is that subsequent versions of the operating system contain better tuned and debugged library routines or routines that enable new functionality. Static linking locks in the old, slow, or buggy routines and prevents access to the new functionality.

There are a few ways in which static linking may be faster. Calls into the library routines have a little less overhead if they are linked together directly, and startup time is reduced because there is no need to locate and load the dynamic libraries. The address space of the process is simpler, so fork(2) can duplicate it more quickly. The static layout of the code also makes run times for small benchmarks more deterministic, so that when the same benchmark is run several times, there will be less variation in the run times. These speedups tend to be larger on small utilities or toy benchmarks, and much less significant for large, complex applications.

Dynamic Linking

When the linker builds a dynamically linked application, it resolves all the references to library routines but does not copy the code into the executable. When you consider the number of commands provided with Solaris, it is clear that the reduced size of each executable file is saving a lot of disk space. The linker adds startup code to load the required libraries at run time, and each library call goes through a jump table. The first time a routine is actually called, the jump table is patched to point at the library routine. For subsequent calls, the only overhead is the indirect reference. You can see the libraries that a command depends upon by using the ldd command. Shared object libraries have a ‘.so’ suffix and a version number.

% ldd /bin/grep 
libintl.so.1 => /usr/lib/libintl.so.1
libc.so.1 => /usr/lib/libc.so.1
libw.so.1 => /usr/lib/libw.so.1
libdl.so.1 => /usr/lib/libdl.so.1

These libraries include the main system interface library (libc.so), the dynamic linking library (libdl.so), wide-character support (libw.so), and internationalization support (libintl.so). This support raises another good reason to use dynamic linking: statically linked programs may not be able to take advantage of extended internationalization and localization features.

Many of the libraries supplied with Solaris 2 have been carefully laid out so that their internal intercalling patterns tend to reference the minimum possible number of pages. This layout reduces the working-set size for a library and contributes to a significant speedup on small-memory systems. A lot of effort has been put into the OpenWindows and CDE window system libraries to make them smaller and faster.

Solaris 1 Compatibility

Many Solaris 1/SunOS 4 applications can be run on Solaris 2 in a binary compatibility mode. A very similar dynamic linking mechanism is also the default in Solaris 1. Dynamically linked Solaris 1 applications link through specially modified libraries on Solaris 2 that provide the best compatibility and the widest access to the new features of the system. Statically linked Solaris 1 applications can run on Solaris 2.3 and later releases by dynamically creating a new kernel system call layer for the process. This solution is a bit slower; moreover, the applications cannot access some of the features of Solaris 2.

There are more problems with access to files that have changed formats, and the applications can make name lookups only via the old name services. Solaris 2.5 adds the capability of running some mixed-mode Solaris 1 applications that are partly dynamically and partly statically linked.

Mixed-Mode Linking

Mixed-mode linking can also be used with Solaris 2 applications. I don't mean the case where you are building an application out of your own archive libraries. Where there is a choice of both archive and shared libraries to link to, the linker defaults to using the shared one. You can force some libraries to be statically linked, but you should always dynamically link to the basic system interface libraries and name service lookup library.

Interposition and Profiling

It is possible to interpose an extra layer of library between the application and its regular dynamically linked library. This layer can be used to instrument applications at the library interface. You build a new shared library containing only the routines you wish to interpose upon, then set the LD_PRELOAD environment variable to indicate the library and run your application. Interposition is disabled for setuid programs to prevent security problems.

Internally at Sun, two applications have made heavy use of interposition. One application instruments and tunes the use of graphics libraries by real applications. The other helps automate testing by making sure that use of standard APIs by applications does actually conform to those standards. The API tester is available as a free, unsupported tool from the http://opcom.sun.ca web site.

The LD_PROFILE option allows a gprof-format summary of the usage of a library to be generated and accumulated over multiple command invocations. That way, the combined use of a shared library can be measured directly. See the ld(1) manual page for more details. LD_PROFILE is available in Solaris 2.5 and later releases. A new extension in Solaris 2.6 is the sotruss(1) command, which traces calls to shared libraries.

Dynamic Performance Improvements in Solaris 2.5

Several new features of Solaris 2.5 libraries provide a significant performance boost over earlier releases. Dynamically linked applications transparently get these speedups. The standard recommendation is to build applications on the oldest Solaris release that you wish to run on. If you statically link, you miss out on these improvements.

The libraries dynamically figure out whether the SPARC CPU in use has integer multiply and divide support. This support is present in all recent machines but is not available in the old SPARCstation IPX and SPARCstation 2. The new library routines use the instructions if they are there and calculate the result the old way if not. You no longer have to choose between running fast and running on every old SPARC system.

Parts of the standard I/O library (stdio) were tuned. This tuning also helps some I/O-intensive Fortran programs.

For UltraSPARC-based systems, a special “platform-specific” version of some library routines is interposed over the standard routines. These provide a hardware speedup that triples the performance of bcopy, bzero, memcpy, and memset operations. The speedup is transparent as long as the application is dynamically linked.

Dynamic linking is the only way to go. Sun was an early adopter of this technology, but every other vendor now offers shared libraries in some form. Banish static linking from your makefiles and figure out how to take best advantage of the technology.

SunSoft Official Position on Linking Options

by Rob Gingell, SunSoft Chief Scientist.

Using the default linker options provides several important properties:

  • Portability — Options that aren't universally available are not used.

  • Insulation — By using the default configurations of libraries that are dynamically linked, you are insulated from bugs or limitations in the implementation which would otherwise become part of your program.

  • Evolution — Related to insulation, the evolution of the shared library implementation may provide performance or other benefits to your application, benefits which become available simply by running the existing application against the new libraries.

  • Quality — The dynamic implementation of a library, with its single configuration of modules and capability for being manipulated independent of any application, yields better testing and ultimately better quality. Although any bug fixes found in a dynamic version are also applied to the static implementation, such fixes are not available to application programs without reconstruction of the application. And, the essentially arbitrary combinations of modules possible with archive libraries are not exhaustively tested by us or any other vendor—the combinatorics are simply too vast.

  • Stability — A property of dynamic linking is that it creates an expression of dependencies between software modules. These dependencies are expressed in terms of the interfaces through which they interact rather than an (often very temporal) relationship based on the status and behavior of the implementation. These relationships can, through emerging tools, be validated in both their static and dynamic behaviors. Thus, a level of application portability higher than one currently enjoys is assured without the need for constant retesting as the implementation behavior changes.

UltraSPARC Compiler Tuning

Compilers have a huge variety of options. Some should be used as a matter of course, some give a big speedup if used correctly, and others can get you into trouble. What are the implications of using the UltraSPARC-specific options, and which options make the most difference to performance?

The answer depends upon your situation. If you are a software vendor, your main concern is likely to be portability and testing costs. With a careful choice of options, you can support a large proportion of the Solaris installed base with very good performance. If you are an end user with source code and some CPU-intensive applications that take a long time to run, you may be more interested in getting the very best possible performance from the particular system you have.

Applications are what sell computers. In recognition of this fact, Sun designs its systems to be compatible with preexisting applications. Sun is also concerned about the costs a software vendor incurs to support an application on Solaris. The key is to provide the opportunity for the largest possible volume of sales for a single version of an application.

The installed base of Solaris on SPARC is now between one and two million units. This installed base is not uniform, however, as it consists of many different SPARC implementations and operating system releases. It is possible to build applications that work on all these systems, but it is easy to inadvertently build in a dependency on a particular implementation or release.

This section tells you what you can depend upon for maximum coverage of the installed base. It indicates several ways that you can optimize for a particular implementation without becoming incompatible with all the others. It also describes opportunities to optimize further, where performance or functionality may justify production of an implementation-specific version of the product.

Solaris 2.5 contains support for hardware implementations based on the UltraSPARC processor. UltraSPARC is based on an extended SPARC Version 9 instruction set and is completely upward compatible with the installed base of SPARC Version 8 applications. The new Ultra Systems Architecture requires its own specific “sun4u” kernel, as do the previous “sun4m” MBus-based desktop systems and “sun4d” XDBus-based server systems. Correctly written device drivers will work with all these kernels.

Although end users may be concerned about the implications of the new features provided by UltraSPARC, they will find that it operates as just another SPARC chip. If their applications work on microSPARC-, SuperSPARC-, and HyperSPARC-based systems, then they will also work on UltraSPARC. End users may find that over time some application developers will produce versions that are specifically optimized for UltraSPARC.

Some vendors may wish to balance the work required to optimize for UltraSPARC against the likely benefits. A trade-off between several compatibility, performance, and functionality issues can be made. I'm going to clarify the issues and recommend a course of action to follow that allows incremental benefits to be investigated over time.

An Incremental Plan for UltraSPARC Optimization

There are several steps to take; each step should usually be completed before the next step is attempted. After each step is complete, you have the option to productize the application, taking advantage of whatever benefits have been obtained so far. I'll briefly describe the steps, then cover each step in detail.

1. Test and support the existing application on UltraSPARC hardware.

All user mode applications will work. Correctly written device drivers will also work. You should see a substantial performance boost over previous generation SPARC systems.

2. Design for portability.

Review the interfaces that your application depends upon, and review the assumptions you have made about the implementation. Move to portable standard interfaces where possible, and isolate implementation-specific interfaces and code into a replaceable module if possible.

3. Make sure the application uses dynamic linking.

Solaris 2.5 and subsequent releases contain platform-specific, tuned versions of some shared library routines. They are automatically used by dynamically linked programs to transparently optimize for an implementation but will not be used by statically linked programs.

4. Migrate your application to at least the SPARCompilers™ 4.0 release.

You can do this step independently of the OS and hardware testing, but it is a necessary precursor to optimization for UltraSPARC. With no special, platform-specific options, a “generic” binary is produced that will run reasonably well on any SPARC system.

5. Optimize code scheduling for UltraSPARC by using SPARCompilers 4.0.

The optimal sequences of instructions for UltraSPARC can be generated using the same SPARC instructions that current systems use. This is the option that Sun recommends application developers try out. Compare the performance of this binary with the one produced at the end of Step 4, both on UltraSPARC machines and any older machines that comprise a significant segment of your user base.

Note

The following optimizations are not backward compatible, and software vendors are strongly discouraged from using them. They are intended primarily for end users in the imaging and high-performance computing markets.

6. Build an UltraSPARC-only application.

Using the all-out UltraSPARC compile options, you get access to the 32-bit subset of the SPARC V9 instruction set, and you double the number of double precision floating-point registers available. Many programs will see no improvement. A few Fortran programs speed up a great deal. Determine if the benefit of any extra performance outweighs the cost of maintaining two binaries (one for UltraSPARC and one for older machines).

7. Build a VIS instruction set-enabled, device-specific module.

Applications that already implement a device-specific driver module mechanism for access to graphics and imaging accelerators can build a module by using the VIS instruction set extensions. Determine if the benefit of any extra performance outweighs the cost of maintaining an UltraSPARC-specific module or using the standard VIS-optimized XGL and XIL libraries.

Running the Existing Application on UltraSPARC Hardware

All user-mode applications that work on older SPARC systems running Solaris 2.5 or later will also work on an UltraSPARC system. Applications that depend upon the kernel architecture may need minor changes or a recompile. If in the past you needed to take into account the difference in kernel architecture between a SPARCstation 2 (sun4c), SPARCstation 20 (sun4m), and a SPARCserver 2000 (sun4d), you may need to be aware of the new features of the UltraSPARC kernel (sun4u).

Comparing Performance

You should collect performance data to compare against older hardware. There is a wide range of speedups for existing unmodified code running on UltraSPARC. A rough guideline for integer applications is that the average speedup is the ratio of the clock rates of the SuperSPARC and UltraSPARC CPUs tested (use % /usr/sbin/psrinfo -v to check the clock rates). For floating-point applications, the speedup is a bit larger. This ratio does not apply for microSPARC and HyperSPARC. If you get less than you would expect, you may be disk, RAM, or network limited. There are also a few programs that fit entirely in the 1- to 2-Mbyte SuperSPARC cache and don't fit in a 512-Kbyte UltraSPARC cache. If you compare systems with the same-sized caches or increase the cache size on UltraSPARC, results will be more predictable. If you get more speedup than you would expect, then you may have been memory-bandwidth limited on the older MBus-based machines.

Processor-Specific Dynamic Libraries

Solaris 2.5 and later releases contain platform-specific versions of some library routines. They are automatically used by dynamically linked programs.

The UltraSPARC versions take advantage of VIS instructions for high-speed block move and graphics operations. If you statically link to libc, you will not take advantage of the platform-specific versions. When used with Creator graphics, the X server, XGL, and XIL libraries have all been accelerated by means of VIS instruction set extensions and the Creator frame buffer device driver.

Solaris 2.5 libraries automatically use the integer multiply and divide instructions on any platform that has them. This use allows generic binaries to be built for the oldest SPARC Version 7 CPUs (e.g., SPARCstation 2) but to take advantage of the instructions implemented in SPARC Version 8 and subsequent CPUs (e.g., SuperSPARC and UltraSPARC). In Solaris 2.5.1, some additional optimizations use UltraSPARC-specific instructions, for example, to multiply the 64-bit long long type.

The UltraSPARC-specific VIS block move instruction performs a 64-byte transfer that is both cache coherent and nonpolluting. This feature is used by the platform-specific libc bcopy, bzero, memcpy, memmove, memset, and memcmp operations. The term “nonpolluting” refers to the fact that data that is moved is not cached. After copying a 1-Mbyte block of data, the CPU cache still holds its original contents, unlike the case with other designs that would have overwritten the cache with the data being moved. On a 168 MHz Ultra 1/170E, large memory-to-memory data moves occur at over 170 Mbytes/s, limited by the 350-Mbyte/s throughput of the single Ultra 1/170 memory bank. Memory-to-Creator frame buffer moves occur at 300 Mbytes/s, limited by the processor interface throughput of 600 Mbytes/s. Faster UltraSPARC systems have even higher data rates. A move involves a read and a write of the data, which is why the data is moved at half the throughput. These operations are about five times faster than on a typical SuperSPARC-based system. The Ultra Enterprise Server systems have more memory banks and sustain aggregate rates of 2.5 Gbytes/s on the E6000 and 10.4 Gbytes/s on the E10000.

Migrating Your Application to at Least SPARCompilers 4.2

You can migrate your application independently of the OS and hardware testing, but it is a necessary precursor to optimization for UltraSPARC. SPARCompilers 4.2 improves performance on all platforms by perhaps 10%–30% for CPU-intensive applications. There may be issues with old code written in C++ because the language is evolving; it changes from one release of the compiler to the next, as the compiler tracks the language standard. If you are already using SPARCompilers 3.0, you should have few, if any, problems.

To support the maximum proportion of the installed base, Sun recommends that applications be compiled on the oldest practical release of Solaris. SPARCompilers 4.2 is fully supported on Solaris 2.3 and 2.4, and all code generation options, including UltraSPARC-specific ones, can be used on older releases.

Application vendors who want to ship one binary product for all SPARC Solaris 2 systems and who want the best performance possible on older systems should use the generic compiler options. The options are stated explicitly in Figure 6-7 for clarity. The level of optimization set by -xO3 generates small, efficient code for general-purpose use. The -xdepend option tells the compiler to perform full dependency analysis; it increases compile time (which is why it is not done by default) but can give up to a 40% performance boost in some cases. Try with and without -xdepend to quantify the difference on your application. The compiler defaults to -xchip=generic -xarch=generic; these options tell the compiler that you want to run reasonably well on all SPARC processors. Adding the options to your makefile, even though they are defaults, makes it clear that this is what you are trying to do.

Figure 6-7. Recommended Compiler Options to Optimize for All SPARC Systems
cc -xO3 -xdepend -xchip=generic -xarch=generic *.c 
f77 -xO3 -xdepend -xchip=generic -xarch=generic *.f

For the C compiler, the commonly used -O option defaults to -xO2. The extra optimization invoked by -xO3 is only problematic in device driver code that does not declare memory-mapped device registers as volatile. The Fortran compiler already maps -O to -xO3.

Optimizing Code Scheduling for UltraSPARC

The optimal sequences of instructions for UltraSPARC can be generated by means of the same SPARC instructions that current systems use.

To understand what I mean by this statement, let's take the analogy of an Englishman talking to an American. If the Englishman speaks normally, the American will be able to understand what is being said, probably with a little extra effort (and the comment “I do love your accent...”). If the Englishman tries harder and says exactly the same words with an American accent, they may be more easily digested, but other English people listening in will still understand them. The equivalent of full optimization would be to talk in an American accent, with American vocabulary, phrasing, and colloquialisms (let's touch base before we go the whole nine yards, y'all). The words mostly look familiar but only make sense to other Englishmen familiar with American culture.

The sequencing level of optimization avoids using anything that cannot be understood by older SPARC chips, but instructions are put in the optimal sequence for fast execution on UltraSPARC.

Compare the performance of this binary with the one produced in the previous stage, both on UltraSPARC machines and on any older machines that constitute a significant segment of your customer base. Performance on UltraSPARC platforms can show a marked improvement. The -xchip=ultra option puts instructions in the most efficient order for execution by UltraSPARC. The -xarch=generic option is the default, but it is good to explicitly state your intentions; the generic option tells the compiler to use only instructions that are implemented in all SPARC processors.

Figure 6-8. Recommended Compiler Options to Optimize for UltraSPARC
cc -xO3 -xdepend -xchip=ultra -xarch=generic *.c 
f77 -xO3 -xdepend -xchip=ultra -xarch=generic *.f

These options are intended to be realistic and safe settings for use on large applications. Higher performance can be obtained from more aggressive optimizations if assumptions can be made about numerical stability and the code “lints” cleanly.

The Implications of Nonportable Optimizations

Up to this point, the generated code will run on any SPARC system. The trade-off implicit in following the above recommendations is that a single copy of your application will be portable across all SPARC-based environments. The performance and capability of the CPU will be maximized by the run-time environment, but some performance benefits unique to specific implementations may not be available.

The subsequent optimizations are not backward compatible, and software vendors are strongly discouraged from using them unless they are only interested in running on UltraSPARC-based systems.

There may be cases in which tuning an application to a specific processor or specific system platform is worth the loss of generality and portability. If you use UltraSPARC-specific compiler options, you should be aware that you will either need to create a different binary, or continue to support an existing one, to run on older systems.

Building an UltraSPARC-Only Application

The primary interest in UltraSPARC-specific code comes from Fortran end users in the high-performance computing (HPC) marketplace. It is common for HPC end users to have access to the source code of their applications and to be interested in reducing the very long run times associated with large simulation and modeling computations by any means available. There is also relatively less interest in running the code on older, slower SPARC systems. The commonly used SPECfp95 benchmarks contain several examples of this kind of application.

There are also situations where the UltraSPARC system is embedded in a product manufactured by an OEM. Since there is complete control over the hardware and software combination, it is possible to optimize the two very closely together without concern for backward compatibility.

Using the all-out UltraSPARC compile options, you get access to the 32-bit subset of the SPARC V9 instruction set, and you increase the number of double-precision, floating-point registers from 16 to 32. This combination of V8 and V9 is known as the V8plus specification; it is enabled with the -xarch=v8plus compiler option. No source code changes will be required, but the code will no longer run on older systems. You can identify the binaries by using the file command.

% f77 -o prog -fast -xO4 -depend -xchip=ultra -xarch=v8plus *.f 
% file prog
prog: ELF 32-bit MSB executable SPARC32PLUS Version 1, V8+ Required,
dynamically linked, not stripped

Compare the performance of this binary with one using -xarch=v8 instead of -xarch=v8plus. Determine if the benefit of any extra performance outweighs the cost of maintaining two binaries.
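For comparison, the -xarch=v8 build of the same program is reported by file as a plain SPARC executable (a sketch; the exact wording varies by release):

% f77 -o prog8 -fast -xO4 -depend -xchip=ultra -xarch=v8 *.f
% file prog8
prog8: ELF 32-bit MSB executable SPARC Version 1,
dynamically linked, not stripped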

You can expect a speedup from -xarch=v8plus if your code is double precision and vectorizable, and the compiler can unroll do loops. A large number of temporary variables need to be stored in registers in the unrolled loop to hide load-use latencies. A version of the Linpack DP1000 benchmark went 70% faster with this option, which is the most you can expect. Single-precision code shows no speedup because there are already 32 single-precision registers. The performance improvement obtained by the above options with -xarch=v8 and -xarch=v8plus on each component of SPECfp92 varied from 0% in several cases to a best case of 29%; the geometric mean increased by 11%. It is rare for one loop to dominate an application, so a mixture of accelerated and unaccelerated loops gives rise to a varying overall speedup. The potential for speedup increases with the highest optimization levels and should increase over time as the compiler improves its optimization strategies.
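The pattern that benefits looks like the following sketch, written here in C for brevity (mine, not from Linpack; as noted below, the gain actually shows up in the equivalent Fortran do loop):

/*
 * A double-precision, vectorizable loop that the compiler can unroll.
 * Each unrolled iteration needs its own temporaries to hide the latency
 * between loading x[i] and the multiply-add that uses it; -xarch=v8plus
 * doubles the number of double-precision registers available to hold them.
 */
void daxpy(int n, double a, const double *x, double *y)
{
    int i;

    for (i = 0; i < n; i++)
        y[i] = y[i] + a * x[i];
}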

I have not seen a significant speedup on typical C code. In general, don’t waste time trying -xarch=v8plus with the C compiler.

The compiler’s code generator has many more options; the ones I have described are the ones that usually make a significant difference. In a few cases, I have found that profile feedback can be useful as well. The highest level of optimization is now -xO5, and it should only ever be used in conjunction with a collected profile, so that the code generator knows which loops to optimize aggressively. You simply compile with -xO4 and -xprofile=collect, run the program, then recompile with -xO5 and -xprofile=use. This is easy to set up on small benchmarks but is more tricky with a large application.
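The sequence looks like this (the program name and training input are placeholders; the intermediate run generates the profile data that the second compile consumes):

% cc -xO4 -xdepend -xprofile=collect -o prog *.c
% prog < typical.input
% cc -xO5 -xdepend -xprofile=use -o prog *.c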

Numerical Computation Issues

Floating-point arithmetic suffers from rounding errors and denormalization (when very small differences occur), and results can vary if you change the order of evaluation in an expression. The IEEE 754 standard ensures that all systems which comply fully with the standard get exactly the same result when they use the same evaluation order. It does add overhead, and it is not used on older architectures such as the IBM S390 mainframe, the Digital VAX, and the Cray supercomputer. If you perform the same calculation on all of these systems, you will get results that are different. Algorithms that have high numerical stability produce very similar results; poor algorithms can produce wildly variable results. By default, the Sun compiler is very conservative and sticks to IEEE 754. You can configure IEEE 754 rounding modes and other options.
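For example, on Solaris the rounding direction can be changed at run time with the ieee_flags routine from libsunmath (a minimal sketch, assuming the interface described in Sun’s Numerical Computation Guide; link with -lsunmath):

#include <stdio.h>
#include <sunmath.h>

int main(void)
{
    char *mode;

    /* Switch from the default round-to-nearest to round-toward-zero. */
    ieee_flags("set", "direction", "tozero", &mode);

    /* Query the current setting to confirm the change. */
    ieee_flags("get", "direction", "", &mode);
    printf("rounding direction is now: %s\n", mode);
    return 0;
}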

If you are working with code that is known to be numerically stable and produces good results on Cray, VAX, mainframe, and IEEE 754-based systems, then you can run faster by turning off some of the expensive IEEE 754 features, using the -fsimple and -fnonstd options. The -fsimple option tells the compiler that you want it to assume that simple mathematical transformations are valid, and -fnonstd turns off rounding and underflow checks. The Sun compiler manual set comes with a Numerical Computation Guide, which explains these issues in much more detail.
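For example, a numerically robust Fortran application might be built like this (an illustrative combination of the options discussed above, not a universal recommendation):

% f77 -xO4 -xdepend -fsimple -fnonstd -xchip=ultra -xarch=v8plus *.f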

Building a VIS Instruction Set-Enabled, Device-Specific Module

Going back to my analogy, if the Englishman and the American start to talk in dense industry jargon, full of acronyms, no one else will have a clue what they are discussing, but the communication can be very efficient. The UltraSPARC processor implements an extension to the SPARC V9 instruction set that is dedicated to imaging and graphical operations. Using it is a bit like talking in jargon: it is highly nonportable.

Some applications already have device-specific modules that provide access to accelerators for imaging and graphics operations. These modules can code directly to the VIS instruction set. For pixel-based operations, the VIS instructions operate on four or more pixels at a time. This approach translates into a fourfold speedup for large filtering, convolution, and table lookup operations. Several developers are reporting this kind of gain over base-level UltraSPARC performance for applications like photographic image manipulation and medical image processing. For the first time, MPEG2 video and audio stream decoding can be performed at full speed and resolution with no add-on hardware. The best way to access VIS is via the standard graphics (XGL) and imaging (XIL) libraries, which are optimized to automatically take advantage of the available CPU and framebuffer hardware.

Sun Microelectronics has created a VIS developer’s kit and several libraries of routines that implement common algorithms, and is promoting the use of VIS for specialized new-media applications. The libraries are available from http://www.sun.com/sparc.

Summary

The implication of following Sun’s general recommendations is that a single copy of your application will be portable across all SPARC-based environments. The inherent performance and capability of the processor will be maximized by the run-time environment, but some performance benefits unique to specific implementations may not be available to your application.

A white paper on compiler optimization techniques, called UltraComputing: How to Achieve Peak Performance from Solaris Applications, is available at http://www.sun.com/solaris/wp-ultracomputing/ultracomputing.pdf.