Measuring instead of speculating (Part 1)

 2011-3-31 19:28  2048 15 15 分类: 消费电子

Some programmers think modeling memory-mapped devices as C++ objects is too costly. With some architectures, the chances are they're wrong.

I have begun a series of columns on representing and manipulating memory-mapped devices in C and C++. In one piece, I considered the alternatives in C and argued that structures are the best way to represent device registers.¹ In another article, I explained why C++ classes are even better than C structures.²

In Standard C and C++, you can't declare a memory-mapped object at a specified absolute address, but you can initialize a pointer with the value of that address. Then you can dereference that pointer to access the object. That's what I did in those articles last spring.

One column prompted several readers to respond, alleging that using a pointer to access a C++ class object representing a memory-mapped device incurs a performance penalty by somehow adding unnecessary indirect addressing. Interestingly, no one complained that using a pointer to access a C structure incurs a similar performance penalty. This left me wondering whether the allegation is that using a C++ class is more expensive than using a C structure, or that using pointers to access memory-mapped class objects is more costly than using some other means.

My impression is that the authors of the comments were more concerned about the latter—the alleged cost of using pointers. However, I suspect many of you are also interested in knowing whether using C++ classes is more expensive than using comparable C structures. I know I am. Therefore, I decided to evaluate alternative memory-mapped device implementations in both C and C++.

In one column, I described the common standard and non-standard mechanisms for placing objects into memory-mapped locations.³ I have also presented alternative data representations for memory-mapped device registers that eliminate the need to use pointers to access memory-mapped devices,⁴ and delineated some of those alternatives more explicitly.⁵

All of this brings us back to the question I set out to answer: Does eliminating pointer references from function calls and function bodies actually improve the run-time performance of C or C++ code that manipulates memory-mapped devices? To answer this, I ran some timing tests using a few different C and C++ compilers. I'll describe how I wrote the tests and what conclusions I think we can draw from the results. Some of the results surprised me. I suspect they'll surprise many of you, too.

Test design considerations

Different processors support different combinations of addressing modes. Some are better at, say, absolute addressing than they are at base+offset addressing, and others are just the opposite. For a given processor, some compilers may be better than others at leveraging the addressing modes on the target processor. Thus, the results you get from measurements made with one compiler targeting one processor may not be the same as what you get with a different compiler or different target processor. No surprise there.

I have access to only a modest assortment of compilers and processors. Any conclusions that we can draw from running tests with the tools I have might be broadly applicable, but I have no illusions about discovering universal truths. Running tests on only a small set of compilers or processors can still yield useful information—just not as much as most of us would like. Therefore, I'll explain how and why I designed the test programs as I did so that you can write similar (or perhaps better) tests for other compilers and processors, make your own measurements, and share your observations with the rest of us.

For this first round of measurements, I decided to use the one evaluation board I have that I can program with multiple compilers. The board has a 50 MHz ARM processor with 512 KB of memory and a small assortment of memory-mapped devices. I used three different compilers, each from a different vendor and of a different vintage. Each compiler supported both C and C++. I compiled for the ARM (rather than Thumb) instruction set with little-endian byte ordering. I set each compiler to optimize for maximum speed. I left the instruction cache off.

All of the tests are variations on the same theme: the main function in each program repeatedly calls a function that accesses a memory-mapped device, and counts the number of calls it makes in a given span of time. Each program differs in (1) how it represents the registers of the memory-mapped device, (2) how it accesses those registers, and (3) whether the access functions are inline or not.
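The shape of such a test can be sketched roughly as follows. This is a hosted approximation, not the actual test program: the timer_registers layout, the TE bit, the timer_enable function, and the use of std::chrono in place of a hardware timer are all assumptions made for illustration, and the "device" lives in ordinary memory so the sketch can run on a desktop.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical register layout; the real device's registers are not given here.
struct timer_registers {
    volatile std::uint32_t TMOD;   // assumed mode/control register
    volatile std::uint32_t TDATA;  // assumed data register
};

constexpr std::uint32_t TE = 0x01;  // assumed "timer enable" bit

// Stand-in for the memory-mapped device, placed in ordinary memory.
static timer_registers fake_timer = {0, 0};

// The access function under test (non-inline variant).
void timer_enable(timer_registers *t) {
    t->TMOD = t->TMOD | TE;
}

// Call timer_enable repeatedly for roughly `span` and return the call count.
unsigned long count_calls(std::chrono::milliseconds span) {
    using clock = std::chrono::steady_clock;
    unsigned long calls = 0;
    auto const stop = clock::now() + span;
    while (clock::now() < stop) {
        timer_enable(&fake_timer);
        ++calls;
    }
    return calls;
}
```

Calling count_calls(std::chrono::milliseconds(50)) from main yields one number per variant. Note that on a desktop the cost of reading the clock dominates the loop, so a real test would poll a cheap hardware timer or check the time only every so many calls.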

The purpose of these test programs is to provide information to help evaluate programming techniques. They're not for compiler benchmarking. Therefore, I won't identify the compiler vendors. Rather, I'll refer to each compiler by the year in which it was released: 2000, 2004, and 2010.

Implementation choices

Each program tests either a polystate implementation, a bundled monostate implementation, or an unbundled monostate implementation.

As I explained before, a polystate implementation of a memory-mapped device uses a C structure or C++ class accessed via a pointer or reference. Polystate implementations support multiple instances of the same kind of device. The structures and classes that I presented in previous columns, and that provoked the comments that triggered this investigation, are polystate implementations.

A monostate implementation eliminates the need for pointers or references to access the device. A monostate implementation for a device permits only one instance of that device. As I explained previously, a monostate implementation can be bundled or unbundled. A bundled monostate wraps its data members in an additional structure; an unbundled monostate does not.
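A minimal sketch of the three designs in C++ might look like this. The class names, the two registers, and the TE bit are assumptions for illustration, and the objects sit in ordinary zero-initialized static storage rather than at a device address, so the sketch runs on a desktop.

```cpp
#include <cstdint>

typedef std::uint32_t volatile device_register;
constexpr std::uint32_t TE = 0x01;  // assumed "timer enable" bit

// Polystate: non-static data members, so each object is one device
// instance, reached through a pointer or reference (here, `this`).
class polystate_timer {
public:
    void enable() { TMOD = TMOD | TE; }
    bool enabled() const { return (TMOD & TE) != 0; }
private:
    device_register TMOD;
    device_register TDATA;
};

// Two stand-in devices of the same kind.
static polystate_timer timer0, timer1;

// Unbundled monostate: each register is its own static data member,
// so there is exactly one device and no object pointer is needed.
class unbundled_timer {
public:
    static void enable() { TMOD = TMOD | TE; }
    static bool enabled() { return (TMOD & TE) != 0; }
private:
    static device_register TMOD;
    static device_register TDATA;
};
device_register unbundled_timer::TMOD;
device_register unbundled_timer::TDATA;

// Bundled monostate: the registers are wrapped in one additional
// structure, and that single structure is the static data member.
class bundled_timer {
public:
    static void enable() { bundle.TMOD = bundle.TMOD | TE; }
    static bool enabled() { return (bundle.TMOD & TE) != 0; }
private:
    struct bundle_type {
        device_register TMOD;
        device_register TDATA;
    };
    static bundle_type bundle;
};
bundled_timer::bundle_type bundled_timer::bundle;
```

With the polystate class, timer0.enable() and timer1.enable() act on different devices. With either monostate class, the call is simply unbundled_timer::enable() or bundled_timer::enable(), with no object involved.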

For each design, I wrote a C implementation and a C++ implementation, and for each of those implementations, I wrote a version that used inline functions and another that did not. Thus, for example, I wrote a polystate implementation in C with inline functions, a polystate implementation in C with non-inline functions, a polystate implementation in C++ with inline functions, a polystate implementation in C++ with non-inline functions, and so on for the bundled and unbundled monostate implementations.

Not all the C compilers I used support inline functions, so I implemented the inline functions in C using function-like macros.
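As a rough illustration of that substitution, the following sketch pairs an ordinary function with a function-like macro that does the same job. It is written in the common subset of C and C++; the register layout, the TE bit, and the names timer_enable and TIMER_ENABLE are assumptions, not the code used in the tests.

```cpp
#include <stdint.h>

/* Hypothetical register layout for illustration. */
typedef struct timer_registers {
    volatile uint32_t TMOD;   /* assumed mode/control register */
    volatile uint32_t TDATA;  /* assumed data register */
} timer_registers;

#define TE 0x01u  /* assumed "timer enable" bit */

/* Non-inline version: an ordinary function call. */
void timer_enable(timer_registers *t) {
    t->TMOD = t->TMOD | TE;
}

/* "Inline" version for compilers without inline support: a function-like
   macro. The parameter is parenthesized, and the cast to void keeps the
   expansion from being used as a value, mimicking a void function.
   The argument appears twice in the expansion, so it should be free of
   side effects. */
#define TIMER_ENABLE(t) ((void)((t)->TMOD = (t)->TMOD | TE))
```

A call such as TIMER_ENABLE(the_timer); then expands in place at each call site, which is the effect an inline function would be expected to have under optimization.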

Placement choices

As I noted earlier, in Standard C and C++, you can't declare a memory-mapped object at a specified absolute address, but you can initialize a pointer with the value of that address. In the following discussion of the test cases, I describe this technique as using pointer-placement.

For example, to access a timer_registers object residing at location 0xFFFF6000, you can declare a pointer called the_timer as a macro:

#define the_timer \
    ((timer_registers *)0xFFFF6000)

or as a constant pointer:

timer_registers *const the_timer
    = (timer_registers *)0xFFFF6000;

My tests that use pointer-placement use the latter form. I randomly replaced the pointer constants with macros in a few test cases and saw no difference in the generated code.

In C++, you can also write pointer casts using the reinterpret_cast operator, as in:

timer_registers *const the_timer
    = reinterpret_cast<timer_registers *>(0xFFFF6000);

I use this form in my C++ tests.

In C++, but not C, you can use a reference instead of a pointer, as in:

timer_registers &the_timer
    = *reinterpret_cast<timer_registers *>(0xFFFF6000);

I call this technique reference-placement.
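To try pointer-placement and reference-placement side by side on a desktop, where 0xFFFF6000 is not a mapped address, one option is to borrow the address of an ordinary object. The sketch below does that. The names backing_store, timer_address, the_timer_ptr, and the_timer_ref, along with the register layout and the enable bit value, are assumptions for illustration only.

```cpp
#include <cstdint>

// Hypothetical register layout for illustration.
struct timer_registers {
    volatile std::uint32_t TMOD;   // assumed mode/control register
    volatile std::uint32_t TDATA;  // assumed data register
};

// On the target, the address would be a literal such as 0xFFFF6000.
// Here it is the address of an ordinary object, converted to an integer
// so that the casts below have the same form as in the target code.
static timer_registers backing_store = {0, 0};
static std::uintptr_t const timer_address =
    reinterpret_cast<std::uintptr_t>(&backing_store);

// Pointer-placement: a constant pointer initialized from the address.
timer_registers *const the_timer_ptr =
    reinterpret_cast<timer_registers *>(timer_address);

// Reference-placement: a reference bound to the object at that address.
timer_registers &the_timer_ref =
    *reinterpret_cast<timer_registers *>(timer_address);

// Access through the pointer uses ->, access through the reference uses
// the dot operator; both reach the same registers.
void enable_via_pointer() {
    the_timer_ptr->TMOD = the_timer_ptr->TMOD | 0x01;  // assumed enable bit
}

void set_data_via_reference(std::uint32_t value) {
    the_timer_ref.TDATA = value;
}
```

Both names refer to the same storage, so a write made through one is visible through the other.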

As some readers suggested, you can declare a memory-mapped object using a standard extern declaration such as:

extern timer_registers the_timer;

and then use a linker command to force the_timer into the desired address. I call this technique linker-placement.

Some C and C++ compilers provide language extensions that let you position an object at a specified memory address. For example, to declare a timer_registers object residing at location 0xFFFF6000, you might write:

timer_registers the_timer @ 0xFFFF6000;

with one compiler, or:

timer_registers the_timer _at(0xFFFF6000);

with another, or:

timer_registers the_timer
    __attribute__((at(0xFFFF6000)));

with yet another. I describe test cases that use declarations such as these as using at-placement.

Of the compilers at my disposal, all three support pointer-placement—as they should because it's standard—and all three C++ compilers support reference-placement. Only one compiler supports at-placement.

I suspect all the compilers support linker-placement, but to be honest, I could figure out how to do it with only one. However, I realized I could simulate the behavior of linker-placement easily, and increase the portability of the tests at the same time. Here's how.

[Continued at Measuring instead of speculating (Part 2)]