[Continued from Measuring instead of speculating (Part 1)]
A software "device"
In order to time the execution of function calls that manipulate memory-mapped devices, I needed just two devices: a timer and something else. My hardware platform has a small assortment of devices such as lights, switches and serial ports, and I could have picked any one of them. But then again, I wanted these tests to be fairly easy to migrate, and I didn't think I could rely on any one of these devices being available on other platforms.
Rather than use a real hardware device, I invented a software "device" that manipulates memory, and I wrote the tests to address the "device" as if it really were a memory-mapped device. The "device" is a Fibonacci sequence generator; each "get" operation applied to the device returns the next number in the Fibonacci sequence. A C++ implementation appears in Listing 1. The corresponding C implementation appears in Listing 2.
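Since the listings are not reproduced here, the following is a minimal C sketch of what such a "device" might look like. It is not Listing 2. The member names current and next, the function names fibonacci_init and fibonacci_get, and the choice to start the sequence at 0 are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical register layout: the "device" keeps the two most
   recent values of the sequence in memory. The members are volatile
   so the compiler treats each access as a real device access. */
typedef struct fibonacci_registers {
    uint32_t volatile current;
    uint32_t volatile next;
} fibonacci_registers;

/* Reset the "device" so that the first get returns 0 (an assumption;
   the sequence could just as well start at 1). */
void fibonacci_init(fibonacci_registers *f)
{
    f->current = 0;
    f->next = 1;
}

/* Return the current Fibonacci number and advance the "device".
   Unsigned arithmetic wraps modulo 2^32 once the values overflow. */
uint32_t fibonacci_get(fibonacci_registers *f)
{
    uint32_t const result = f->current;
    uint32_t const sum = result + f->next;
    f->current = f->next;
    f->next = sum;
    return result;
}
```

Successive calls to fibonacci_get on a freshly initialized object return 0, 1, 1, 2, 3, 5 and so on.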
The address range of the memory on my target evaluation board is from 0 to 0x7FFFF. I determined that my test programs weren't using any memory at the higher addresses, so I placed the fibonacci_registers object at 0x7FF00. For example, C test cases that use pointer-placement declare a pointer to the fibonacci_registers using a declaration such as:
fibonacci_registers *const fibonacci
= (fibonacci_registers *)(0x7FF00);
Programs that use at-placement declare the device as:
fibonacci_registers fibonacci @ 0x7FF00;
Using the Fibonacci "device" also gave me a way to mimic linker-placement when I couldn't figure out the linker commands to really do it. With linker-placement, the test program declares the "device" using the extern declaration:
extern fibonacci_registers fibonacci;
When I couldn't use genuine linker-placement, I simply wrote the definition for the object in another source file, as in:
/* fibonacci.c */
#include "fibonacci.h"
fibonacci_registers fibonacci;
The linker places the compiled definition for the fibonacci object among the other global objects in the program, not at 0x7FF00. I call this technique default-placement. Using default-placement let me measure the performance of linker-placement without actually using linker-placement. On the one compiler with which I actually tested linker-placement, default-placement did indeed produce the same run times as linker-placement.
Computing run times
All of the test programs have a main function that repeatedly calls the Fibonacci device's get function, and counts the number of calls it makes in 15 seconds. The C version of main for a polystate implementation appears in Listing 3. The C++ version (not shown) is nearly identical; it uses the C++ member function notation for calls to the timer and Fibonacci functions.
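The shape of that measurement loop can be sketched in C as follows. This is not Listing 3: it uses the standard library's clock function as a stand-in for the hardware timer, it assumes a hosted environment, and the function name count_iterations and the minimal stand-in device are hypothetical.

```c
#include <time.h>

/* Minimal stand-in for the Fibonacci "device" (hypothetical layout). */
typedef struct fibonacci_registers {
    unsigned long volatile current;
    unsigned long volatile next;
} fibonacci_registers;

static unsigned long fibonacci_get(fibonacci_registers *f)
{
    unsigned long const result = f->current;
    unsigned long const sum = result + f->next;
    f->current = f->next;
    f->next = sum;
    return result;
}

/* Call fibonacci_get repeatedly until the given number of seconds has
   elapsed, and return the number of calls made. clock() reports
   processor time, which stands in here for the board's timer device. */
unsigned long count_iterations(fibonacci_registers *f, double seconds)
{
    unsigned long iterations = 0;
    clock_t const start = clock();
    clock_t const limit = (clock_t)(seconds * CLOCKS_PER_SEC);
    while (clock() - start < limit) {
        fibonacci_get(f);
        ++iterations;
    }
    return iterations;
}
```

A main function in this style would initialize the device, call count_iterations with a duration of 15, and leave the result in a variable for inspection with a debugger.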
The test program generates no output. I used a debugger to examine the value of iterations when the program terminated. The final value of iterations indicates the relative speed of the call to the Fibonacci get function: faster implementations of the get function yielded more iterations, while slower implementations yielded fewer iterations.
I thought the results would be easier to read if I converted the number of iterations into the actual execution time of the function call. The time for each iteration is simply the elapsed time for all the iterations (15 seconds) divided by the number of iterations. But that time, which I'll call Te, includes the time for loop overhead (the increments, compares and branches) as well as the function call. I wanted the time for just the function call.
To measure the loop overhead by itself, I simply commented out the function call and ran the test again. (I did check that commenting out the function call removed only the instructions for the call.) I used that iteration count to compute To, the execution time for one iteration without the call. The execution time for each call, Tf, is then simply Te – To.
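The arithmetic can be sketched as a small C function. The function name time_per_call and its parameter names are hypothetical, and the iteration counts in the example that follows are made up for illustration; they are not measurements from the tables.

```c
/* Compute Tf, the time for one function call, from two runs of the
   same duration: one with the call in the loop and one without.
     Te = elapsed / iterations_with_call     (call plus loop overhead)
     To = elapsed / iterations_without_call  (loop overhead only)
     Tf = Te - To */
double time_per_call(double elapsed_seconds,
                     unsigned long iterations_with_call,
                     unsigned long iterations_without_call)
{
    double const te = elapsed_seconds / (double)iterations_with_call;
    double const to = elapsed_seconds / (double)iterations_without_call;
    return te - to;
}
```

For instance, if a 15-second run with the call completed 3,000,000 iterations and the run without it completed 15,000,000, then Te is 5 microseconds, To is 1 microsecond, and Tf is 4 microseconds.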
The envelope, please
Tables 1 through 3 show the results I obtained running my tests with each of the three compilers. The results in each table are sorted from fastest to slowest, with bands of shading to highlight tests with the same run times.
The tables have unequal lengths because some compilers support more testable features than others. For example, Compiler 2004 was the only one for which I could figure out how to test linker-placement. Only Compiler 2010 has a syntax for at-placement.
The tables have a column labeled "scope." For monostate implementations, the scope is always global. For polystate implementations, the scope indicates whether the placement declaration is at the global scope (outside main) or local scope (within main). For example, the main function for a C polystate implementation using local pointer-placement would look in part like:
int main()
{
    fibonacci_registers *const fibonacci
        = (fibonacci_registers *)0x7FF00;
    ~~~
}
So what have we learned today?
It appears that using inline functions does more than anything to improve the performance of memory-mapped accesses. With each compiler, every inline implementation outperformed every non-inline implementation. In fact, with the exception of one slightly unusual case in Table 2 (the C++ polystate implementation using reference-placement at global scope), inlining erased every other factor from consideration.
Among the non-inline implementations, three things strike me as significant. One is that, in general, the best non-inline polystate implementations outperform the best non-inline monostate implementations. This is the case for all three compilers. I believe this is because the ARM architecture loves base+offset addressing, which the polystate implementations leverage effectively.
The second is that the non-inline unbundled monostate implementations with linker-placement (or default-placement) have the worst performance of all, and the C implementation is worse than the C++ implementation. Again, this is true for every compiler. By shifting knowledge of the data layout from the compiler to the linker, this approach robs the compiler of useful information it might use to improve code quality.
The third and possibly the most striking observation is that, except for the non-inline unbundled monostate in C++, every non-inline C++ implementation outperformed every non-inline C implementation. Once again, this was true for every compiler, regardless of its age.
I gotta say, this last one surprised me. I expected the C code to be clearly better in the oldest compiler and for the gap between C and C++ to disappear with the newer ones. I didn't expect the non-inline C++ to be better than the non-inline C across the board.
As I cautioned earlier, we shouldn't put too much stock in measurements made using just one architecture, albeit a popular one. No doubt there are things we could do to improve these measurements. But I think I've done my part to lift the discussion above speculation and anecdotes. To those who maintain that C++ polystate implementations are too costly to use: the ball's now in your court.
Acknowledgements:
Thanks to Greg Davis, Bill Gatliff, and Nigel Jones for their valuable feedback.
Endnotes:
1. Saks, Dan. "Alternative models for memory-mapped devices," Embeddeddesignindia.com, March 2010. http://forum.embeddeddesignindia.com/BLOG_ARTICLE_7086.HTM.
2. Saks, Dan. "Memory-mapped devices as C++ classes," Embeddeddesignindia.com, March 2010. http://forum.embeddeddesignindia.co.in/BLOG_ARTICLE_7022.HTM.
3. Saks, Dan. "Compared to what?" Embeddeddesignindia.com, March 2010. http://forum.embeddeddesignindia.co.in/BLOG_ARTICLE_7029.HTM.
4. Saks, Dan. "Accessing memory-mapped classes directly," Embeddeddesignindia.com, March 2010. http://forum.embeddeddesignindia.co.in/BLOG_ARTICLE_7042.HTM.
5. Saks, Dan. "Bundled vs. unbundled monostate classes," Embeddeddesignindia.com, March 2010. http://forum.embeddeddesignindia.co.in/BLOG_ARTICLE_7069.HTM.