Tag: multicore

Related posts
  • Popularity 22
    2015-9-18 18:19
    2173 reads | 0 comments
    In college courses on fiction and literature, my teachers emphasized the difference between implicit and explicit meanings, and the importance of understanding that difference in order to derive the full meaning of what I was reading. In technical documentation about specifications and standards, the distinction between the two is even more important. Mismatches between the engineering reader's level of understanding and that of the expert writing the standard complicate the problem.

If the writer assumes too much about the reader's expertise, s/he gets into the habit of communicating in technical shorthand, and the mismatch widens. I have found this to be the case in highly technical and nuanced areas such as the specifications for high-reliability and safety standards like DO-178. What was implicitly assumed in DO-178B has become explicit in DO-178C.

For all but the rarest of use cases, when developing software for traditional single-core processors under the older DO-178B, it was safest and fastest to certify according to what the standard explicitly required. In the transition to multicore designs things got much more complicated, and that assumption began to break down. In such an environment the program has to be broken down into well-defined modular functional units with interfaces that are as unambiguous as possible. Implicit in this methodology is the need to test all of the software modules carefully to make sure they come together and interact properly.

One place where depending only on DO-178B's explicit requirements has caused a lot of problems is control and data coupling, which are particularly problematic in multicore designs. Control coupling is where one software module sends data to another module in order to influence or direct its behavior. Data coupling is where one software module simply sends information to another, placing no requirement on the receiving module and expecting no return action.

Many multicore developers started having trouble getting their designs through the certification process. For example, they did not realize that they had to demonstrate that the software modules and their couplings interacted only in the ways specified in their original design, a requirement that was implicit in DO-178B rather than explicit. For the same reason, even though they followed the explicit rules in DO-178B, they were unable to demonstrate that unplanned anomalous or erroneous actions were not possible.

That confusion has been cleared up in the newer DO-178C. Designed as the standard for software used in civil airborne systems, it now explicitly requires that control and data coupling assessments be performed on safety-critical software to ensure that design, integration, and test objectives are met. DO-178C requires increased and more rigorous testing at each level of the standard's qualification process.

This requires developers to carefully measure control and data coupling using a detailed combination of control flow and data flow analysis. That is a difficult enough process in single-core designs, but it borders on impossible in some mil-aero multicore designs and requires a new approach.
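To make the distinction concrete, here is a minimal C sketch of the two kinds of coupling. The module and function names are my own invention for illustration; they are not drawn from DO-178C or from any real avionics code. A coupling assessment has to demonstrate that interactions like these, and only these, occur between the modules:

/* Hypothetical modules for illustration only.
 * Control coupling: sensor_task() passes a command that directs the
 * behavior of the actuator module.
 * Data coupling: sensor_task() publishes a reading that the logging
 * module consumes, with no behavioral requirement attached. */

#include <stdio.h>

typedef enum { ACTUATOR_HOLD, ACTUATOR_EXTEND } actuator_cmd_t;

static double last_reading;            /* shared data: data coupling */

/* Receiving module: its control flow branches on the command it is
 * sent, so the caller influences what it does (control coupling). */
static void actuator_set(actuator_cmd_t cmd)
{
    if (cmd == ACTUATOR_EXTEND)
        printf("extending actuator\n");
    else
        printf("holding position\n");
}

/* Receiving module: merely consumes data; the sender places no
 * requirement on it and expects no return action (data coupling). */
static void log_reading(void)
{
    printf("reading = %f\n", last_reading);
}

static void sensor_task(double reading)
{
    last_reading = reading;                       /* data coupling    */
    actuator_set(reading > 10.0 ? ACTUATOR_EXTEND /* control coupling */
                                : ACTUATOR_HOLD);
}

int main(void)
{
    sensor_task(12.5);
    log_reading();
    return 0;
}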
One of the best tools for this tough job is the newest version 9.5 of the LDRA tool suite, with its new, improved Uniview. This is a sophisticated graphical tool for observing all of the software components and artifacts in a multicore design and providing requirements traceability for system interdependencies and behavior.

By means of LDRA Uniview call graphs, the hierarchy of a system can be observed graphically, allowing direct tracking of the behavior of all the various nodes and their dependencies. (Source: LDRA)

I am a sucker for graphical approaches to solving almost any problem. Even in high school, instead of using the traditional mathematical techniques for solving math and physics problems, I tried to come up with a graphical representation of the problem. I found that I not only got the right answers faster, but also came away with a better understanding of the nature of the problem I was dealing with.

I find that is also true with LDRA's version 9.5. The improved Uniview capabilities include not only traditional code coverage but also the tracking of data and control coupling. On the control coupling side, it lets a developer perform flow analysis on both a program's calling hierarchy and individual procedures and see the results instantly, showing which control functions are invoked by others, and how.

On the data flow analysis side, I am impressed by the way it follows variables through the source code, performing checks at both the procedure level and the system level. A developer gets an ongoing report on any anomalous behavior, an important part of a data coupling assessment. The graphical framework makes it much easier to see data dependencies between modules, considerably speeding up the verification of all data and control coupling paths.

This capability may be a big plus in complex and demanding heterogeneous multicore designs beyond the safety-critical ones governed by standards such as DO-178C. In such applications there are numerous cores, and their software module interactions are increasingly complex. Even the current generation of mobile devices mixes anywhere from five or six to a dozen processors of various types: general-purpose CPUs, graphics processing units, and digital signal processors. And the number and diversity continue to climb.

In such an environment, structured design using software modules with clear and unambiguous data and control coupling will be forced on developers. To get a design that simply works, let alone meets an imposed standard, developers will have to structure their code into clearly defined functional modules that can operate as a cohesive system across all the processors in the system. But when you go modular, even in non-safety-critical designs it becomes necessary to examine closely the way the software blocks come together and interact. And to do that effectively, a graphical approach such as the one incorporated into LDRA's tools may be the only way to go.
  • Popularity 19
    2011-4-28 18:00
    2420 reads | 0 comments
    We have seen the challenges of mainstream multicore programming in earlier articles. The main challenge is refactoring code into multiple parallel threads of execution that can run on different cores, while effectively handling synchronization and race conditions along the way. But if no synchronization is needed between the parallel threads, adopting parallel programming on multicore becomes easy, even natural. We have seen some such areas already, like networking and scientific computing. Server-side processing is another.

If we look at a modern web server, it runs software such as Apache on Linux or IIS on Microsoft servers. When a request to read a web page lands on the server, the server spawns a thread that runs to completion (see the sketch at the end of this post). Since most web requests to a server are likely to be page reads, the web server can spawn multiple threads in parallel to handle them. If the server has a multicore processor such as an Intel Xeon 7500, this helps: each thread can run on a different core, speeding up the web server many times over. For jobs like responding to HTTP requests or generating dynamic pages, the speedup is real. One study reports that quad-core CPUs provide 55% more performance for generating dynamic pages than a similarly priced pair of dual-core CPUs.

But there is a catch. If the web requests access a database or perform I/O-intensive jobs, bottlenecks can appear. One should not expect an 8x performance gain just by moving to an eight-core server. After the initial boost, more cores do not necessarily translate linearly into higher performance. One needs to do a lot of tuning of various parameters: the number of threads, the number of Ethernet interfaces, cache sizes, I/O, databases, and so on!

To summarize, one can get better performance by moving to multicore servers, and with careful tuning of system parameters one can improve it further!
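Here is the minimal thread-per-request sketch promised above, in C with POSIX threads. It is an illustration of the pattern, not Apache's or IIS's actual implementation; the request type and handle_request() are invented for the example. Build with the -pthread flag.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int id;   /* stand-in for a parsed HTTP request */
} request_t;

/* Each request runs to completion on its own thread; with no shared
 * state between requests, no locking is needed, and the OS scheduler
 * is free to place each thread on a different core. */
static void *handle_request(void *arg)
{
    request_t *req = arg;
    printf("serving page for request %d\n", req->id);
    free(req);
    return NULL;
}

int main(void)
{
    pthread_t workers[8];

    for (int i = 0; i < 8; i++) {       /* pretend 8 requests arrived */
        request_t *req = malloc(sizeof *req);
        req->id = i;
        pthread_create(&workers[i], NULL, handle_request, req);
    }
    for (int i = 0; i < 8; i++)
        pthread_join(workers[i], NULL);
    return 0;
}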
  • Popularity 20
    2010-7-3 22:27
    3804 reads | 0 comments
    Networking gear companies are bullish about multicore processors and are seeing tangible benefits from them. Let us look at the reasons why.

If we look at a piece of networking equipment such as a router, it has two broad components: the data plane and the control plane.

The data plane is also referred to as the fast path. This is the code that executes when a packet (an Ethernet packet, say) arrives on an interface such as an Ethernet port. It is a real-time task, and the throughput of the system depends on this code. Its logic is not complex, but it is typically handcrafted for high throughput. It does not need much OS support. To process a packet, it consults the forwarding table, which is set up by the control plane.

The second component is the control plane, also referred to as the slow path. It contains the code for the various protocols, which perform functions such as learning new routes, processing control packets, and setting up the forwarding tables that the data plane accesses. If the route from A to B breaks, the route changes, and the forwarding tables must be updated to reflect that. This code is typically large, runs on top of an OS, runs infrequently, and is compute intensive.

In the early days, designers had the same CPU handling both the data plane and the control plane, which led to poor performance. If the control plane is given higher priority than the data plane, packet processing gets delayed, which leads to packet queueing, congestion, and packet drops. If the data plane gets higher priority than the control plane, then line events (e.g., link up/down) and control plane indications (e.g., route changes) get analyzed late.

To avoid this, system designers moved to two processors, one for the control plane and the other for the data plane. The control processor runs the routing protocols, computes the forwarding tables, and updates them for the data plane processor. But this came at the cost of extra hardware.

Now multicore processors allow the control plane protocols to run on one core and the data plane code to run on the other cores.

The availability of cheap cores has helped the data plane immensely. Since the packet processing algorithm is essentially the same for all packets, the code can be partitioned across multiple cores for a performance boost.

The first way of partitioning the data plane works like this. Suppose the data plane code has three logical functions, f1(), f2(), and f3(). We can write the code so that the first core runs f1(), then passes control to a second core to run f2(), and then to a third core to run f3(). In this "pipelining," the number of packets under processing at any moment equals the number of cores in the pipeline. By designing the pipeline efficiently, we can increase throughput several times over.

The alternative partitioning is to have all the cores run the same code, reading packets from the same input stream and writing to the same output stream. Whichever core is free picks up the next packet from the input stream for processing. The full packet processing functionality is applied to each packet in a single stage, rather than the packet being handed from one stage to the next. A hybrid of the two models is also possible.
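Here is a minimal sketch in C with POSIX threads of the second ("run to completion") model, where every core runs the full f1()/f2()/f3() sequence on whichever packet it picks up next. The packet type, the stage functions, and the fixed input batch are invented for illustration:

#include <pthread.h>
#include <stdio.h>

#define NUM_PACKETS 16
#define NUM_CORES   4

typedef struct { int id; } packet_t;

static packet_t input[NUM_PACKETS];          /* shared input stream  */
static int next_pkt = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void f1(packet_t *p) { (void)p; /* e.g. parse headers     */ }
static void f2(packet_t *p) { (void)p; /* e.g. forwarding lookup */ }
static void f3(packet_t *p) { (void)p; /* e.g. rewrite and queue */ }

/* Whichever core is free picks up the next packet and applies the
 * whole processing sequence to it in a single stage. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int i = next_pkt < NUM_PACKETS ? next_pkt++ : -1;
        pthread_mutex_unlock(&lock);
        if (i < 0)
            return NULL;
        f1(&input[i]);
        f2(&input[i]);
        f3(&input[i]);
        printf("packet %d done\n", input[i].id);
    }
}

int main(void)
{
    pthread_t cores[NUM_CORES];

    for (int i = 0; i < NUM_PACKETS; i++)
        input[i].id = i;
    for (int i = 0; i < NUM_CORES; i++)
        pthread_create(&cores[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_CORES; i++)
        pthread_join(cores[i], NULL);
    return 0;
}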
To summarize, the availability of multicore processors is a big blessing for networking gear designers. It allows them to partition code efficiently between the control and data planes, and to get a performance boost by pipelining the data plane code for higher throughput. Semiconductor companies like Cavium, Tilera, and Freescale all offer multicore network processors, which have helped bring better networking gear to market.
  • Popularity 21
    2010-5-7 17:42
    3463 reads | 1 comment
    Recently, scientists from NCSU have discovered a technique that improves the performance of applications by up to 20% on a multicore system. And the technique needs no change to existing code.

To understand its significance, we should first answer the question, "Why is it difficult to parallelize a desktop application?" Take a word processing application as an example. It has a loop waiting for a key to be pressed; when one is pressed, it applies the format and styles to the character, displays the character on screen, and returns to the wait loop. It is difficult to convert this logic into pieces of code that can execute in parallel, because the whole flow is sequential and each step waits for the previous step to complete. Such single-threaded applications have little scope for parallelization and have not been able to benefit from multiple cores.

That is why the news from the scientists at North Carolina State University (NCSU) is exciting. They have found a simple, effective technique that lets existing programs get a speed improvement on a multicore system, without any change to the code.

What is their magic? Most applications need dynamic memory, and they allocate and free chunks of memory as they go. Programs never depend on a result from free(): it returns nothing. The NCSU scientists simply moved the execution of free() from the main core to another core, allowing the deallocation work to run in parallel with the main code. Since applications use malloc() and free() extensively, just moving free() to another core yields performance improvements of up to 20%! This "simple drop-in replacement" is achieved by linking applications against a new memory management library; it needs no change to the application code. It is a low-hanging fruit of performance improvement, and it is surprising that it escaped notice for so long!

Simple techniques like this will help desktop users get visible, tangible performance improvements in their existing applications on multicore systems!
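Here is a minimal sketch in C of the underlying idea, written with POSIX threads. It is not the NCSU library itself; the queue, its size, and all the names are invented for illustration. Calls to free() are handed to a helper thread, ideally running on another core, so deallocation proceeds in parallel with the main code:

#include <pthread.h>
#include <stdlib.h>

#define QSIZE 1024

static void *queue[QSIZE];                /* ring buffer of pointers  */
static int head = 0, tail = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t nonfull  = PTHREAD_COND_INITIALIZER;

/* Drop-in replacement for free(): enqueue the pointer and return
 * immediately instead of deallocating inline. */
static void deferred_free(void *p)
{
    pthread_mutex_lock(&lock);
    while ((tail + 1) % QSIZE == head)    /* back-pressure if helper lags */
        pthread_cond_wait(&nonfull, &lock);
    queue[tail] = p;
    tail = (tail + 1) % QSIZE;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

/* Helper thread: performs the real free() in parallel with the
 * application's main thread. */
static void *free_worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail)
            pthread_cond_wait(&nonempty, &lock);
        void *p = queue[head];
        head = (head + 1) % QSIZE;
        pthread_cond_signal(&nonfull);
        pthread_mutex_unlock(&lock);
        free(p);
    }
}

int main(void)
{
    pthread_t helper;
    pthread_create(&helper, NULL, free_worker, NULL);

    for (int i = 0; i < 100000; i++)
        deferred_free(malloc(64));  /* main thread never blocks on free */

    return 0;  /* sketch: exits without draining; OS reclaims the rest */
}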
  • Popularity 18
    2010-3-12 16:08
    2634 reads | 0 comments
    Dear readers, in the last article we looked at the so-called "embarrassingly parallel" operations, which can easily take advantage of multicore systems. In this article, let us look at one more way of getting performance improvements on multicore systems: OpenMP.

The OpenMP specification was originally defined in 1997 by industry vendors such as Sun and Intel. It was popular on symmetric multiprocessing (SMP) systems. A typical SMP system is multiprocessor hardware in which two or more identical processors are connected to a single shared memory and all run the same OS instance.

Today's multicore systems are, perhaps surprisingly, similar to the SMP architecture. Instead of multiple processors we have multiple cores, but all cores access a common shared memory and run the same OS instance. That is why a solution like OpenMP, which dates from the SMP era, is suddenly finding renewed interest in today's multicore systems.

The OpenMP specification is defined for the C, C++, and Fortran languages. It consists of three parts: compiler directives, a runtime library, and environment variables. The code is instrumented with directives and compiled with an OpenMP-capable compiler, then linked with the runtime library to generate the executable. Runtime environment variables control the code's execution.

An OpenMP program works like this:

  • It starts as a single process called the master thread.
  • The master thread executes sequentially, like any normal program, until the first parallel region construct is encountered.
  • The master thread then creates a team of parallel threads.
  • The statements in the parallel region construct are executed in parallel among the various threads.
  • When the individual threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

Since the creation, starting, and joining of threads is done automatically, programmers are relieved of those complexities. The model also allows variables to be locked and shared between threads and supports fairly advanced features. Here is an example adapted from Wikipedia:

int main(int argc, char **argv)
{
    const int N = 100000;
    int i, a[N];

    #pragma omp parallel for    /* compiler directive */
    for (i = 0; i < N; i++)
        a[i] = 2 * i;

    return 0;
}

As a first step, the code is compiled with an OpenMP-enabled compiler. The environment variable OMP_NUM_THREADS is set to the desired number of threads and the program is executed. Suppose OMP_NUM_THREADS is set to 4. The code starts normally, but when it reaches the for loop it creates four threads, and each thread fills in 100000/4 = 25000 of the array entries. This speeds up the processing, as the four threads work in parallel on different cores. As we keep increasing the value of OMP_NUM_THREADS, execution time decreases and performance improves, until system-bus bottlenecks start showing up.
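As a further illustrative sketch (not from the original article), the reduction clause shows how OpenMP handles a shared variable safely: each thread accumulates a private partial sum, and the partials are combined when the loop ends. With GCC, such code is compiled with the -fopenmp flag.

#include <stdio.h>

#define N 100000

static int a[N];

int main(void)
{
    long sum = 0;
    int i;

    for (i = 0; i < N; i++)
        a[i] = 2 * i;

    /* Each thread gets a private partial sum; OpenMP combines the
     * partials at the end of the loop, so no locking is needed. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %ld\n", sum);  /* same result for any OMP_NUM_THREADS */
    return 0;
}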
The advantages of OpenMP include:

  • The learning curve is low, since it builds on existing languages through #pragma directives.
  • It hides thread semantics.
  • One can parallelize incrementally across the code and see the effects. This "change and see" approach gives programmers confidence.
  • It supports a good set of platforms (C/C++/Fortran on Linux and Windows).
  • It supports both coarse- and fine-grained parallelism.

The main disadvantage of OpenMP is that it needs specific tool chains (compilers and runtime). Not all compiler tool chains support OpenMP; popular ones that do include the Sun Studio tool chain and GCC 4.3.1.

OpenMP can deliver a big performance improvement on multicore systems, since each thread can run separately on its own core.

Then why is OpenMP not better known in the mainstream? Because its big performance gains come mainly in mathematical and scientific computing workloads such as large matrix multiplications. For a desktop or server application, OpenMP may not be of great help unless the application logic contains such code.