Windows Data Alignment on IPF, x86, and x86-64

Visual Studio Technical Articles
Kang Su Gatlin
Microsoft Corporation
March 2006
Applies to:
Microsoft Visual C++
Microsoft Windows XP application development
Microsoft Windows Server 2003 application development
Summary: Gives developers the information they need in order to confront data alignment problems critical to the performance of 64-bit and 32-bit applications developed for the Microsoft Windows XP and Microsoft Windows Server 2003 platforms. (17 printed pages)
Contents
Introduction
What Is Data Alignment?
Why Is Alignment a Concern?
Data Alignment Exceptions and Fix-Ups
Compiler Support for Alignment
Some Quick Tips on How to Avoid Alignment Issues
What About Instruction Alignment?
Conclusion
Introduction
Intel and AMD have introduced a new family of processors, the Intel Itanium Processor Family (IPF) Architecture and the x64 Architecture. These processors join the IA-32 Intel Architecture family in the Microsoft Windows desktop/server world. With Microsoft Visual C++ and Microsoft Windows on these platforms, you can get incredible performance, but this good performance is contingent upon certain programming practices. One of these programming practices is proper data alignment. Proper data alignment allows you to get the most out of your 64-bit and 32-bit applications, and on the Itanium it is often not only a matter of performance, but it can also be a matter of correctness.
In this document we explain why you should care about data alignment, the costs if you do not, how to get your data aligned, and what to do when you cannot. You will never look at your data access the same way again.
What Is Data Alignment?
All variables have two components associated with them: their value, and their storage location. In this article our concern is the storage location. The storage location of a variable is also called its address, and it is the integer (the mathematical term integer, not the data type) offset in memory where the data begins. The alignment of a given variable is the largest power-of-two value, L, where the address of the variable, A, modulo this power-of-two value is 0; that is, A mod L = 0. We will call this variable L-byte aligned. Note that when X > Y, and both X and Y are power-of-two values, a variable that is X-byte aligned is also Y-byte aligned.
In Listing 1, we give a code example to illustrate where variables get stored/aligned. Don't worry if you do not understand why things are aligned where they are. You will understand all of this by the end of the paper. We do encourage you to have fun and play with the example (reorder the local variables and class member variables, and then see what happens to the addresses).
Listing 1. Data alignment example
#include <stdio.h>

int main()
{
    char a;
    char b;
    class S1
    {
    public:
        char m_1;         // 1-byte element
                          // 3 bytes of padding are placed here
        int m_2;          // 4-byte element
        double m_3, m_4;  // 8-byte elements
    };
    S1 x;
    long y;
    S1 z[5];
    printf("b = %p\n", &b);
    printf("x = %p\n", &x);
    printf("x.m_2 = %p\n", &x.m_2);
    printf("x.m_3 = %p\n", &x.m_3);
    printf("y = %p\n", &y);
    printf("z[0] = %p\n", z);
    printf("z[1] = %p\n", &z[1]);
    return 0;
}
In Listing 2, we show the output of what Listing 1 might print. Remember that this is just what it prints on my computer. Your computer will almost certainly print different numbers. That is to be expected.
Listing 2. Output from example in Listing 1
b = 000006FBFFB8FEB1
x = 000006FBFFB8FE98
x.m_2 = 000006FBFFB8FE9C
x.m_3 = 000006FBFFB8FEA0
y = 000006FBFFB8FE90
z[0] = 000006FBFFB8FEB8
z[1] = 000006FBFFB8FED0
So, from the example in Listings 1 and 2, you can now see how each of the variables is aligned. The char, b, is aligned on a 1-byte boundary (0xB1 % 2 = 1). The class, x, is aligned on an 8-byte boundary (0x98 % 8 = 0). The member, x.m_2, is aligned on a 4-byte boundary (0x9C % 8 = 4). x.m_3 is on an 8-byte boundary, as is y. z[0] and z[1] are also 8-byte aligned (we omit the modulo math for those last sets of variables, because it is straightforward).
If we look at the class S1, we see that the whole class has become 8-byte aligned. The packing within the class is not optimal: x.m_2 starts 4 bytes after x.m_1, although x.m_1 is merely a 1-byte element, so 3 bytes of padding sit between the two.
The Itanium and x64 compilers provide for data items of natural lengths of 1, 2, 4, 8, 10, and 16 bytes. All types are aligned on their natural lengths, except items that are greater than 8 bytes in length: those are aligned on the next power-of-two boundary. For example, 10-byte data items are aligned on 16-byte boundaries. The x86 compiler supports aligning on boundaries of the natural lengths of 1, 2, 4, and 8 bytes.
Next we give a relatively simple way to determine the alignment of a given type. To do this, use the __alignof(type) operator. (The macro equivalent is TYPE_ALIGNMENT(type)). This operator returns the alignment requirement of the variable/type passed to it.
Stack Alignment
On both of the 64-bit platforms, the top of each stack frame is 16-byte aligned. Although this uses more space than is needed, it guarantees that the compiler can place all data on the stack in a way that all elements are aligned.
The x86 compiler uses a different method for aligning the stack. By default, the stack is 4-byte aligned. Although this is space efficient, you can see that there are some data types that need to be 8-byte aligned, and that, in order to get good performance, 16-byte alignment is sometimes needed. The compiler can determine, on some occasions, that dynamic 8-byte stack alignment would be beneficial, notably when there are double values on the stack.
The compiler does this in two ways. First, the compiler can use link-time code generation (LTCG), when specified by the user at compile and link time, to generate the call-tree for the complete program. With this, it can determine regions of the call-tree where 8-byte stack alignment would be beneficial, and it determines call-sites where the dynamic stack alignment gets the best payoff. The second way is used when the function has doubles on the stack, but, for whatever reason, has not yet been 8-byte aligned. The compiler applies a heuristic (which improves with each iteration of the compiler) to determine whether the function should be dynamically 8-byte aligned.
Note   A downside to dynamic 8-byte stack alignment, with respect to performance, is that frame pointer omission (/Oy) effectively gets turned off. Register EBP must be used to reference the stack with dynamic 8-byte stack alignment, and therefore it cannot be used as a general register in the function.
Structure and Union Layout
The layout with respect to alignment in structures and unions is dependent on a few simple rules. We can break structure and union alignment into two components: inter-structure/union alignment, and intra-structure alignment. (There is no intra-union alignment.)
Inter-structure/union alignment is the simpler case. The rule here is that the compiler aligns the structure with the largest alignment requirement of any of the members of the structure. Unions follow the same rule.
Intra-structure alignment works by the principle that the members are aligned by the compiler at their natural boundaries, and it does this through padding, inserting as much padding as necessary up to the padding limit. The padding limit is set by the compilation switch /Zpn. The default for this switch is /Zp8.
The programmer can use the #pragma pack at the point of declaration of the structure, to also set the padding limit from that point in the translation unit onward. That is, it does not affect structures declared prior to the #pragma pack. Access to structure members that are packed may result in access to data that is unaligned. The compiler inserts the fix-up code for these members, which means that the access will not result in an exception, but it will result in slower and more bloated code. (The fix-up code and exception may not make sense yet, but you will understand them by the end of this article.)
The padding limits (#pragma pack and /Zpn) should be used with care. Unless most of your work consists of simply moving data, without reading or writing particular elements, or you are space constrained, the trade-offs involved with using padding limits that violate the alignment rules usually do not work in the programmer's favor.
Why Is Alignment a Concern?
Okay, so now you know what it means for a variable to be aligned. Why do we care about alignment? Well, as you may have guessed, the reason is performance. On the Itanium platform, the reason is correctness as well, due to the way misalignment is handled. Now the question is, Why? What is the underlying reason that we care about alignment? Certainly, no computer architect arbitrarily decided to make our lives difficult. No, but these alignment issues are, in fact, a remnant of architectural trade-offs made by computer architects.
On most modern RISC-based designs, data can be accessed only at the boundary defined by the natural length of the data being requested. This fills the destination register with the data of that length. The implication of this is that the computer gets data in natural-length chunks from addresses that are a multiple of the natural length. What this further implies is that reading data from addresses that are not a multiple of the natural length will be problematic (it may slow down or crash the application).
For example, a 32-bit computer with a word boundary starting at 0 can load data from bytes at location 0 to 3 in one load, or 4 to 7 in one load, or 40 to 43 in one load, but NOT 2 to 5 in one load (because bytes 2 to 5 span two words). What this means is that if the computer actually needed to retrieve the 32-bit value from location 2 to 5, it would have to retrieve the data from 0 to 3, and also retrieve the value from location 4 to 7, and then perform some operations to properly extract and shift the bytes that it needs. Depending on the computer system, either the operating system or compiler does this for you. If they do not, then the hardware can raise an exception (and you do not want that to happen; as a worst case, it could crash). When the software bails you out, this not only requires some extra logic, but it also takes extra memory accesses. In fact, for many applications on modern computers, the memory system is the performance bottleneck, thus making extra memory requests very costly. In the particular example of this paragraph, it will take two memory accesses to get the 32-bit value from 2 to 5, rather than the one memory access it would take to get the 32-bit value from an aligned address. See Figure 1, because a visual representation might help to make more sense of this potentially tricky topic.

Figure 1. Loading bytes at addresses 2 to 5
Figure 1 shows: a) loading the first word (bytes 0 to 3); b) extracting bytes 2 to 3 from the loaded word; c) loading the second word; and d) extracting the first two bytes from the second loaded word and appending them to the previously extracted bytes.
This notion of data alignment goes beyond the word-size of the given computer architecture, extending up the memory hierarchy, through the multiple levels of cache, translation lookaside buffer, and pages. Each of these, like the 32-bit words, has an associated unit chunk size. Caches have cache lines that are on the order of 32 to 128 bytes. Pages go from 1024 bytes to megabytes in size. This is all done to make our programs perform more efficiently. We just need to know how to deal with it when it bites us.
Data Alignment Exceptions and Fix-Ups
The obvious way to deal with alignment issues is to avoid them; however, in the real world, that is not always possible. To help generate correct programs, Microsoft Visual C++ and Microsoft Windows have some mechanisms to help the programmer. These do not come without some performance impact, but they do assist in rapid development and/or porting of applications.
The first question that comes to mind might be, "What if I violate the alignment restrictions?" That is, what happens if I generate an alignment fault? Well, a few things can happen, and none of them are good.
In Windows, an application program that generates an alignment fault will raise an exception, EXCEPTION_DATATYPE_MISALIGNMENT. On the Itanium, by default, the operating system (OS) will make this exception visible to the application, and a structured exception handler might be useful in these cases. If you do not set up a handler, then your program will hang or crash. In Listing 3, we provide an example that shows how to catch the EXCEPTION_DATATYPE_MISALIGNMENT exception.
Listing 3. Code to catch alignment exception on Itanium
#include <windows.h>
#include <stdio.h>
#include <string.h>

// Exception filter: handle only the misalignment fault.
int mswindows_handle_hardware_exceptions(DWORD code)
{
    printf("Handling exception\n");
    if (code == STATUS_DATATYPE_MISALIGNMENT)
    {
        printf("misalignment fault!\n");
        return EXCEPTION_EXECUTE_HANDLER;
    }
    else
        return EXCEPTION_CONTINUE_SEARCH;
}

int main()
{
    __try
    {
        char temp[10];
        memset(temp, 0, 10);
        double *val;
        val = (double *)(&temp[3]);  // deliberately misaligned pointer
        printf("%lf\n", *val);
    }
    __except (mswindows_handle_hardware_exceptions(GetExceptionCode()))
    {
    }
}
The application can change the behavior of the alignment fault from the default, to one where the alignment fault is fixed up. This is done with the Win32 API call SetErrorMode, with the argument field SEM_NOALIGNMENTFAULTEXCEPT set. This allows the OS to handle the alignment fault, but at considerable performance cost. There are two things to note: 1) this is on a per-process basis, so each process should set this before the first alignment fault, and 2) SEM_NOALIGNMENTFAULTEXCEPT is sticky; that is, if this bit is ever set in an application through SetErrorMode, then it can never be reset for the duration of the application (inadvertently or otherwise).
On the x86 architecture, the operating system does not make the alignment fault visible to the application. On this platform, you will also suffer performance degradation on the alignment fault, but it will be significantly less severe than on the Itanium, because the hardware will make the multiple accesses of memory to retrieve the unaligned data.
On the x64 architecture, the alignment exceptions are disabled by default, and the fix-ups are done by the hardware. The application can enable alignment exceptions by setting a couple of register bits, in which case the exceptions will be raised unless the user has the operating system mask the exceptions with SEM_NOALIGNMENTFAULTEXCEPT. (For details, see the AMD Architecture Programmer's Manual Volume 2: System Programming.)
With that said, there are situations on the x86 and x64 platforms where unaligned access will generate a general-protection exception. (Note that these are general-protection exceptions, not alignment-check exceptions.) This is when the misalignment occurs on a 128-bit type, specifically SSE/SSE2-based types.
In some experimental runs, with the code in Listing 4 (we used 9,000,000 iterations, with 0 and 3 offset representing aligned and unaligned, respectively), we saw that on a slower Pentium III (731 MHz, running Microsoft Windows XP Professional), the program with the unaligned access runs about 3.25 times slower than the program with the aligned access. On a faster Pentium IV (2.53 GHz, running Windows XP Professional), the program with an unaligned access runs about 2 times slower than the program with the aligned access.
This is definitely not the type of performance hit you want to take. Unfortunately, it gets even worse on the Itanium Processor Family. With the same test, running on an Itanium2 at 900 MHz with Microsoft Windows Server 2003 (but only for 90,000 iterations, due to how long the test takes to run), the unaligned program runs 459 times slower! As you can see, unaligned access in an inner-loop can devastate the performance of your application.
So, even with the OS fix-up, which prevents your application from crashing, you should avoid unaligned access.
Listing 4. Example code to compare OS fix-up unaligned vs. aligned
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN64
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif

int main(int argc, char* argv[])
{
    // Ask the OS to fix up alignment faults instead of raising exceptions.
    SetErrorMode(GetErrorMode() | SEM_NOALIGNMENTFAULTEXCEPT);
    UINT iters, offset;
    if (argc < 2)
        iters = 9000000;
    else
        iters = atoi(argv[1]);
    if (argc < 3)
        offset = 0;
    else
        offset = atoi(argv[2]);
    printf("iters = %u, offset = %u\n", (unsigned int)iters, (unsigned int)offset);
    double *dest, *origsource;
    double *source;
    dest = new double[128];
    origsource = new double[150];
    // A nonzero offset makes source point at misaligned doubles.
    source = (double *)((UINT)origsource + offset);
    printf("dest = %p source = %p\n", (void *)dest, (void *)source);
    LARGE_INTEGER startCount, endCount, freq;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&startCount);
    for (UINT x = 0; x < iters; x++)
        for (UINT i = 0; i < 128; ++i)
            dest[i] = source[i];
    QueryPerformanceCounter(&endCount);
    printf("elapsed time = %lf\nTo keep stuff from being optimized %lf\n",
           (double)(endCount.ENDPART - startCount.ENDPART) / freq.ENDPART, dest[75]);
    delete[] origsource;
    delete[] dest;
    return 0;
}
Compiler Support for Alignment
Sometimes, through explicit syntax, the compiler can help with these alignment issues. In this section, we give a few extensions that you can use in the source code to either minimize the cost of unaligned access, or to help ensure aligned access.
__unaligned keyword
As we stated earlier, by default, the compiler will align data on their natural boundaries. Most of the time, this is sufficient, and there will not be a problem; however, there can be situations where an alignment issue will exist, with no clear way to work around it (or it would take too much effort to do so).
When you, the programmer, can determine statically which variables might be accessed on unaligned boundaries, you can specify these variables as being unaligned, by using the __unaligned keyword (the macro equivalent is UNALIGNED). This keyword is useful in that the compiler will insert the code to access the variable on an unaligned boundary, and it will not fault. It does this by inserting extra code that will finesse its way around the unaligned boundary, but this does not come without a trade-off. These extra instructions will slow your code down, plus increase the code size. Unfortunately, these extra instructions are generated even in places where it might be provable that the data is aligned! So use this keyword with care.
We can modify the program of Listing 4 by using the __unaligned keyword in a variable declaration. In this example, we change the declaration of source to the following:
__unaligned double *source;
This program will now run correctly on the Itanium, even if you do not enable the operating system to fix up the alignment faults, although it will suffer some performance degradation. This is still better than having your program crash or suffer the severe performance penalty of the OS fix-up. (Keep in mind that, as noted earlier, the compiler inserts code to handle misaligned access, even where it is provable that the data is aligned. The OS goes into its fix-up code only when an exception occurs, and these occur only when the misaligned access actually happens.)
In Figure 2, we have a chart that gives the running time on an Itanium 2 for the example program of Listing 4 when using various data access methods. The program executes fastest when the data is aligned, and the __unaligned keyword is not used. It runs next fastest when the data is aligned, but the __unaligned keyword is used. (Recall that if you use the __unaligned keyword, you pay a performance penalty, even if your data is aligned.) You run slightly slower if you use the __unaligned keyword on unaligned data. Lastly, you run much slower if you access unaligned data, but you have set SetErrorMode with SEM_NOALIGNMENTFAULTEXCEPT.
Note   In this chart, the y-axis is on a log10 scale.
Figure 2. Comparative runtimes of test program to illustrate effect of different types of accesses
__declspec(align(#))
So, we have dealt with the problem of a variable that you know is going to have unaligned access, but what about when you have a variable, and you would like it to be allocated on a boundary that is different from its natural boundary? For example, when using SSE2 instructions, you may want to align your operands on a 16-byte boundary, or you may want to align certain variables on cache-line boundaries. __declspec(align(#)) (where # is a power of two) is made for such purposes. In Listing 5, we give an example of its use.
Listing 5. Code to demonstrate how __declspec(align(#)) works
#include <stdio.h>

class ClassA
{
public:
    char d1;
    __declspec(align(256)) char d2;
    double d3;
};

int main()
{
    __declspec(align(32)) double a;
    double b;
    __declspec(align(512)) char c;
    ClassA d;
    printf("sizeof(a) = %d, address(a) = %0x\n", sizeof(a), &a);
    printf("sizeof(b) = %d, address(b) = %0x\n", sizeof(b), &b);
    printf("sizeof(c) = %d, address(c) = %0x\n", sizeof(c), &c);
    printf("sizeof(d) = %d, address(d.d2) = %0x\n", sizeof(d), &d.d2);
    return 0;
}
The output might look something like the following (taken from my computer):
sizeof(a) = 8, address(a) = 12fde0
sizeof(b) = 8, address(b) = 12fdd8
sizeof(c) = 1, address(c) = 12fa00
sizeof(d) = 512, address(d.d2) = 12f900
Note the sizeof of the class. The sizeof value for any structure/class is the offset of the final member, plus that member's size, rounded up to the nearest multiple of the largest member alignment value, or the whole structure/class alignment value, whichever is greater. (This definition is taken from MSDN's entry on align.)
The CRT and Intrinsics
__declspec(align) is a useful tool, but it cannot align dynamic data allocated from the heap. For this, the C runtime library (CRT) gives a set of aligned memory allocation routines. These are listed below (and come with <malloc.h>):
void *_aligned_malloc(size_t size, size_t alignment)
void *_aligned_offset_malloc(size_t size, size_t alignment, size_t offset)
void _aligned_free(void *aligned_block)
void *_aligned_realloc(void *aligned_block, size_t size, size_t alignment)
void *_aligned_offset_realloc(void *aligned_block, size_t size, size_t alignment, size_t offset)
See Data Alignment on MSDN for more information on these routines.
One of the best ways to get performance is to use code that programmers have spent a lot of time tuning. The supplied CRT memory routines (strncpy, memcpy, memset, memmove, and so on) are a great example of this. The CRT routines are hand-coded routines (often assembly) that are tuned to the particular architecture, which will align the source and destination so that, for large moves, the costs of the unaligned accesses are minimized.
Alternatively, the user can use the /Oi flag or the #pragma intrinsic(functions) pragma, which enables generation of intrinsics. (Note that the /Oi flag is implied by the /O2 flag.) Intrinsics are inlined routines emitted by the compiler, that are generally not as well tuned as the assembly language CRT routines. They do avoid the overhead of the function call, but at the additional cost of code bloat. It is also worth noting that using /Oi or #pragma intrinsic is a suggestion to the compiler, and the compiler is free to emit intrinsics or the CRT routines. Looking at the assembly code is a good way to determine which was generated.
The IPF compiler will also use type information to assist in expanding the inline intrinsics. The compiler will examine the types of pointers to the source and destination addresses, and from this, it will infer the alignment of these addresses. If the pointer types are not correct, you might take an alignment exception, or the program will run slower (with the dreaded OS fix-ups).
In Listing 6, we give code to show the effects of aligned versus unaligned accesses on code that uses the compiler intrinsics for memcpy or the CRT assembly language hand-tuned routines. To use the CRT assembly language hand-tuned routines, make sure to insert the #pragma function(function) pragma.
Listing 6. Code to demonstrate the effect of intrinsic and CRT routines on aligned vs. unaligned accesses
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>

#ifdef _WIN64
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif

#pragma function(memcpy) // comment out this line for intrinsic generation.

int main(int argc, char *argv[])
{
    if (argc < 4)
    {
        printf("usage: exename iters size offset\n");
        return 1;
    }
    int iters1 = atoi(argv[1]);
    int size1 = atoi(argv[2]);
    int offset = atoi(argv[3]);
    char *source, *origsource = (char *)_aligned_malloc(size1, 8);
    char *dest, *origdest = (char *)_aligned_malloc(size1, 8);
    // A nonzero offset makes both buffers misaligned.
    source = (char *)((UINT)origsource + offset);
    dest = (char *)((UINT)origdest + offset);
    LARGE_INTEGER startCount, endCount, freq;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&startCount);
    for (int i = 0; i < iters1; ++i)
        memcpy(dest, source, size1 - offset);
    QueryPerformanceCounter(&endCount);
    printf("&source = %p \t &dest = %p\n", (void *)source, (void *)dest);
    printf("elapsed time = %lf\nTo keep stuff from being optimized %d\n",
           (double)(endCount.ENDPART - startCount.ENDPART) / freq.ENDPART, dest[1]);
    // Free the pointers returned by _aligned_malloc, not the offset copies.
    _aligned_free(origsource);
    _aligned_free(origdest);
    return 0;
}
Figures 3 and 4 show the relative performance of each of the four configurations on memcpys of various sizes, on a Pentium III and an Itanium2 computer, respectively. We generated this data with the code from Listing 6, using the following parameters:
exename 1000000 size offset
Where 8 ≤ size ≤ 4096 and 0 ≤ offset ≤ 1.

Figure 3. The time to perform a memcpy using aligned vs. unaligned data and CRT vs. intrinsic routines on a Pentium III
On the Pentium III, for aligned copies, it does not matter too much whether you use CRT or intrinsic. However, for large unaligned copies, using the CRT version is a big win. On the Itanium2, we compare only the CRT versions, because the compiler almost always uses the CRT versions, even when the programmer specifies /Oi or #pragma intrinsic. In Figure 4, we compare unaligned versus aligned CRT calls. You can clearly see that using aligned data results in better performance. The lesson here is not subtle at all.

Figure 4. The time to perform a memcpy using aligned vs. unaligned data with CRT routines on an Itanium2
Some Quick Tips on How to Avoid Alignment Issues
If you are short on time, and just want a quick section to refer to, you have found the right place. Here are some quick tips to help deal with data alignment related issues:
When casting from an aligned pointer P1 to a pointer P2, where the TYPE_ALIGNMENT(P1) < TYPE_ALIGNMENT(P2), you must ensure that all accesses are properly aligned. Using P2 to dereference addresses originally pointed to by P1 may result in an alignment fault. However, if TYPE_ALIGNMENT(P1) > TYPE_ALIGNMENT(P2), then P2 is fine to dereference all elements, element-wise, that it points to.
Do not pack structures unless you are sure that the space savings is a win—for example, if you are simply transporting the structure around, and never accessing individual members.
Understand what boundaries you need to align data on. Not having your alignment high enough can lead to alignment problems, but setting the alignment too high can lead to data bloat.
What About Instruction Alignment?
Well, you are almost at the end of this article, and some of you may be wondering, "You've talked about data alignment, but what about instruction alignment? Aren't instructions also stored in memory?" The answer is, instruction alignment is also an issue, but it is not covered in this article, because most programmers do not have to deal with it at all. Instruction alignment is mostly an issue for compiler writers. The one type of general-purpose programmer who might still care about instruction alignment would be the assembly-language programmer, especially if he or she is not using an assembler.
Conclusion
Hopefully, you will now feel confident that you know the ins and outs of data alignment when you sit down to do Windows development. This article has covered how to avoid many data-alignment faults, what to do when they are inevitable, and the various costs associated with them. This knowledge will be useful for all Windows development, but it will prove especially useful when porting code from x86 to Itanium, where data alignment plays a front-and-center role. In the end, the result will be faster, more reliable code.
 
About the author
Kang Su Gatlin is a Program Manager at Microsoft in the Visual C++ group. He received his PhD from UC San Diego. His focus is on high-performance computation and optimization.