Tag: multi-core

Related posts
  • Popularity 21
    2013-12-5 22:22
    2169 reads
    0 comments
After several years of hype, multi-core and multiple-CPU embedded systems are now becoming mainstream. There are numerous articles about multi-core design that address different hardware architectures (homogeneous and heterogeneous multi-core) and software architectures (AMP: asymmetrical multi-processing and SMP: symmetrical multi-processing). In this article the development of an AMP system is outlined, highlighting various challenges that were addressed. What is unusual is that the design was completed in 1981!

It often appears that there is nothing truly new in the world. With technology, it is often a question of the time being right. The programming languages that we use nowadays are mostly developments of 30-40 year old technology. The current enthusiasm for multi-core designs is really nothing new either. Browsing the literature turns up titles like "Multi-core is Here Now" that have been appearing for at least 5 years. But multi-core goes back further still. I was working on a multi-core system in 1980 ...

How it all started
It was my first job out of college and I worked for a company that made materials testing instruments – large machines and systems that stressed and broke things under controlled conditions. The use of computer or microprocessor control was new. Hitherto, the machines had been controlled by racks of analogue electronics, with meters and chart recorders providing results. I worked in the division that provided the computer control. Initially, the approach was simply to link a mini-computer to a traditional console. The next – and, at the time, brave – step was to replace the entire console with a microprocessor, where a keypad enabled input of parameters and selection of settings from menus on a screen. Of course, a mouse or touch screen might have been better, but that technology would not appear for some years.

The project to which I was assigned was to facilitate the "user programmability" of the new microprocessor-controlled machines – the "User Programmability Option" or "UPO". It was decided that the best way to provide this capability would be to add an additional computer, rather than potentially compromising the real-time behaviour of the controlling microprocessor. This is exactly how I might advise a customer today who is designing a multi-core system with real-time and non-real-time components.

The processors
The advanced console was built around a Texas Instruments 9900 microprocessor, which was one of the first true 16-bit, single-chip devices on the market. It had an advanced architecture, with some interesting pros and cons: it could intrinsically support multi-threading in a very simple way, with context saving accommodated in hardware; but its registers were mostly RAM based, which, at the time, was a significant performance limiter. The instruction set and addressing modes bore some similarity to the 68000. I recall that the documentation was confusing, as the bits were numbered backwards, with the most significant bit being #0. This part of the system was programmed in Forth. I have no idea why this design decision was made, but I found the language intriguing and my interest persists.

The UPO computer was an SBC-11. The "11" came from the internal processor, which was essentially a DEC PDP-11, a mini-computer which was familiar to us at the time. "SBC" was apparently short for "shoe box computer", because that is what it looked like.
I have a suspicion that this was a joke and it actually stood for "single board computer", as it does today. We implemented user programmability using a variant of the BASIC language, with some extensions to access capabilities of the testing machine.

Interprocessor communications
Using multiple CPUs (or cores) presents a variety of challenges. One is the division of labour, which was reasonably straightforward in this case. Another is communication between the processors ...

In designing the UPO, we considered a number of means by which the two CPUs might be connected. As they were separate boxes, serial and parallel connections were considered. But we were concerned about any possible compromise of the real-time performance of the console microprocessor. Also, we did not want the user to be faced with the UPO freezing while it waited for attention from the console. So, clearly, a buffering mechanism was needed, and shared memory seemed to be a good option. A small memory board was designed. I have no idea of the hardware architecture, except that I seem to recall that the TI 9900 had priority over the SBC-11, as it could not afford to be delayed by slow memory access. If I remember correctly, the board was 2K (words, probably).

Protocol
It was down to us to define a protocol for communication, so we aimed to produce something that was simple and reliable. We divided the memory into two halves: one was a buffer for communication from the UPO to the console and the other for the opposite direction. The first word of each buffer was for a command/status code, which was simply a non-zero value. We did not use interrupts. The receiving CPU just polled the first word when appropriate, awaiting a non-zero value. When a command was found, any data could be copied and the command word cleared to zero, indicating that the processing was complete. So, the UPO sending a command to the console might go through a sequence like this:
* Write data to the buffer.
* Write a command to the first word.
* Poll the word, waiting for it to become zero.
If it was expecting a response, it would then start monitoring the other buffer. Of course, there were other facilities to handle a situation where one CPU did not respond within a timeout period.

Nowadays, multi-core and multi-chip systems have a variety of interconnection technologies, but shared memory is still common. A number of standardised protocols have been developed over the years, including derivatives of TCP/IP. In recent years, the Multicore Association has produced the Multicore Communications API (MCAPI), which is rapidly gaining broad acceptance in multi-core embedded system designs.

Challenges
When we hooked up the shared memory and started to send test messages between the processors, we hit a problem: they seemed to get scrambled. At first we assumed that there was a problem with the memory board, but it was checked by the hardware guys, who pronounced it healthy. Then we spotted a pattern: the bytes of each 16-bit word were getting swapped. We thought that it was a wiring problem with the board, but studying the schematics and the board layout showed no error. Of course, the reason for the problem was that the two CPUs were different architectures, from different manufacturers, each of whom had a different idea about which byte went where in the word. Now I would describe one as big-endian and the other as little-endian, but I did not have the vocabulary back then. An adjustment to the board design could have made this problem go away, but of course it was too late for that. So we had to resort to the age-old method of rectifying a problem found late in the design process: fix it in the software. I simply put a byte swap on one side of the interface and the other side was none the wiser.
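Purely as an illustration, here is a minimal C sketch of that kind of shared-memory mailbox, including the software byte swap. The base address, buffer size, and function names are assumptions made for this example; this is not the original 1981 code, which was Forth on one side and BASIC on the other.

#include <stdint.h>

/* Hypothetical layout of the 2K-word shared-memory board: one buffer per
   direction, with the first word of each buffer used as the command/status
   word. The base address and sizes are illustrative assumptions. */
#define SHARED_BASE ((volatile uint16_t *)0x00F00000u)
#define BUF_WORDS   1024u                       /* half of the board each way */
#define TX_BUF      (SHARED_BASE)               /* this CPU -> peer CPU       */
#define RX_BUF      (SHARED_BASE + BUF_WORDS)   /* peer CPU -> this CPU       */
#define CMD_IDLE    0u                          /* zero means "no command"    */

/* The two processors disagreed about byte order, so one side swaps every
   word it reads or writes; the other side is none the wiser. */
static inline uint16_t swap16(uint16_t w)
{
    return (uint16_t)((w << 8) | (w >> 8));
}

/* Send: write the data, then publish a non-zero command word, then poll
   until the peer clears it - the same sequence as the protocol above.
   A real system would also enforce a timeout around the polling loop. */
void mailbox_send(uint16_t cmd, const uint16_t *data, uint16_t nwords)
{
    for (uint16_t i = 0; i < nwords; i++)
        TX_BUF[1 + i] = swap16(data[i]);        /* 1. data into the buffer    */
    TX_BUF[0] = swap16(cmd);                    /* 2. non-zero command word   */
    while (TX_BUF[0] != CMD_IDLE)               /* 3. wait for completion     */
        ;
}

/* Receive: poll the command word; when it is non-zero, copy the data out
   and clear the word to zero to signal that processing is complete. */
uint16_t mailbox_poll(uint16_t *data, uint16_t nwords)
{
    uint16_t cmd = swap16(RX_BUF[0]);
    if (cmd == CMD_IDLE)
        return CMD_IDLE;                        /* nothing pending            */
    for (uint16_t i = 0; i < nwords; i++)
        data[i] = swap16(RX_BUF[1 + i]);
    RX_BUF[0] = CMD_IDLE;                       /* tell the sender we're done */
    return cmd;
}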
How would I do it now?
In the end, we got the UPO working to our satisfaction and I think we even sold a few of them. It is interesting to consider how I might build such a system now, thirty-odd years later. First off, I would design the console as a multi-core system. There would probably be one core to do the non-real-time work (user interface, data storage, networking, etc.) and maybe two more to do real-time work like data sampling and control of the servo-hydraulics. The UPO would just be an app on a standard Windows PC with a USB interface to the console. I have no insight into how the company builds testing systems now, but it would be interesting to compare notes with my 2013 counterparts.

Colin Walls has over thirty years' experience in the electronics industry, largely dedicated to embedded software. A frequent presenter at conferences and seminars and author of numerous technical articles and two books on embedded software, Colin is an embedded software technologist with Mentor Embedded (the Mentor Graphics Embedded Software Division) and is based in the UK.
  • Popularity 17
    2013-8-11 15:21
    1985 reads
    0 comments
As multi-core and manycore systems emerge whose complexity is daunting to embedded software developers, those developers have looked with envy at standardisation efforts such as the IEEE's IP-XACT, which defines and describes electronic components for EDA hardware design. Created initially by the SPIRIT consortium, IP-XACT describes electronic components and their designs as a way to automate the configuration and integration of the various EDA tools in a system-on-chip design. This was done by creating a standardised XML data format that describes components from multiple vendors in a vendor-neutral way, allowing the exchange of component libraries between diverse electronic design automation tools.

Developers of embedded multi-core and manycore processor applications – and the suppliers of the tools they use – need no longer wait for such capabilities to come to software development. Drawing to some degree on what IP-XACT has done for hardware components, the Multicore Association has created the Software-Hardware Interface for Multi-many-core (SHIM) working group, whose goal is the definition of a common interface to abstract the hardware properties that matter to multi-core tools.

According to Markus Levy, Multicore Association president, the working group aims to deliver an initial standard using XML models by the end of the year, then turn its attention to ironing out implementation issues for specific use cases. He said that while the IEEE's IP-XACT standard is related to SHIM in some ways, their emphases are quite different. Where IP-XACT gives more detail on connections between hardware blocks and is intended for use by hardware designers, he said, SHIM presents more detail about latencies, cache types and sizes, and other information of interest to its target audience of software developers.

"Multicore and manycore system development often gets sidetracked because development tool vendors and runtime systems for these programs are challenged to support the virtually unlimited number of processor configurations," said Levy. "The primary goal of the SHIM working group is to define an architecture description standard useful for software design."

A good case in point, he said, is the way in which it will simplify the management of the processor cores, the inter-core communication channels, the memory system (including hierarchy, topology, coherency, memory size, and latency), the network-on-chip (NoC) and routing protocols, and hardware virtualisation features. "These are among the architectural features that SHIM will either directly or indirectly describe," he said, "and the aim is to make it flexible enough to allow vendor-specific, non-standard architectural information for customised tools."

He said the standard will also simplify communications between chip vendors and tool suppliers. The former, he said, can use these models to automatically report details of their chips in a standard way to operating systems and to developer tools such as performance analysis programs, runtime libraries, and auto-parallelizing compilers. While the XML models will be publicly available, vendor-specific chip details can remain confidential between a processor vendor and its software partners. And while the XML models are intended to be descriptive for configuration purposes and not for use in simulations, said Levy, they can be used to provide rough performance estimates that might inform decisions about how software automatically configures itself.
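To make that concrete, the C sketch below shows the kind of architectural properties such a description might expose to a tool once the XML has been parsed. The struct and field names are invented purely for illustration; they are not taken from the SHIM specification, which was still being defined at the time of writing.

#include <stdint.h>

/* Hypothetical only: the sort of data a parallelizing compiler or RTOS
   configurator might load from a SHIM-style architecture description.
   These types and field names are invented for illustration; they do not
   come from the SHIM schema. */
typedef struct {
    const char *name;            /* e.g. "L1D", "L2"                      */
    uint32_t    size_bytes;
    uint32_t    latency_cycles;  /* typical access latency                */
    int         shared;          /* non-zero if shared between cores      */
} cache_desc;

typedef struct {
    const char *core_type;       /* e.g. "Cortex-A15"                     */
    uint32_t    core_count;      /* number of identical cores             */
    uint32_t    clock_hz;
    cache_desc  caches[4];
    uint32_t    cache_count;
} cluster_desc;

typedef struct {
    cluster_desc clusters[8];
    uint32_t     cluster_count;
    uint32_t     dram_latency_cycles;   /* main-memory access latency     */
    uint32_t     ipc_channel_count;     /* inter-core communication links */
} platform_desc;

/* A tool would populate a platform_desc from the vendor's XML model rather
   than asking the engineer to key these numbers in by hand. */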
SHIM aims to replace manual software configuration by engineers, as well as the existing proprietary formats that handle similar jobs. "I think SHIM will be a useful adjunct for many types of tools," said Levy, "including performance estimation, system configuration, and hardware modelling. Performance information is critical for most software development tools, including performance analysis tools, auto-parallelizing compilers, and other parallelizing tools." Moreover, operating systems, middleware, and other runtime libraries, he said, require basic architectural information for system configuration.

Because the SHIM standard can be used with hardware modelling to support architecture exploration, an important goal of the SHIM effort is to align it with work underway in the Multicore Association's Tools Infrastructure Working Group (TIWG).

Initial work on SHIM started about a year ago as part of a government-funded project in Japan, headed by Masaki Gondo, a general manager at eSOL, who opened contact with the MCA, leading to the formation of the working group, which Gondo is chairing. Other members of the working group include Cavium, CriticalBlue, eSOL, Freescale Semiconductor, Mentor Graphics, Nagoya University, Nokia Siemens Networks, PolyCore Software, Renesas Electronics, Texas Instruments, TOPS Systems, Vector Fabrics, Wind River, and Xilinx.

"Ultimately, I hope SHIM will promote highly optimised tools that can provide efficient utilisation of very complex SoCs and eliminate the need for users to work their way through 1000-page manuals to program all the device features," said Levy.
  • Popularity 12
    2013-3-22 21:49
    1719 reads
    0 comments
In 1974 Robert Dennard conceptualized a scaling theory that drew on Moore's Law to promise ever-faster microprocessors. If, from one generation to the next, the transistor length shrinks by a factor of about 0.7, the transistor budget doubles, speed goes up by 40%, total chip power remains the same, and a legion of other good things continues to be bestowed on the semiconductor industry.

Unfortunately, Dennard scaling petered out at 90 nm. Clock rates stagnated and power budgets have grown at each process node. Many traditional tricks just don't work any more. For instance, shrinking transistors meant thinner gate oxides, but once those hit 1.2 nm (about the size of five adjacent silicon atoms), tunnelling created unacceptable levels of leakage. Semiconductor engineers replaced the silicon-dioxide insulator (with a dielectric constant of 3.9) with other materials like hafnium dioxide (dielectric constant = 25) to allow for somewhat thicker insulation. Voltages had to go down, but are limited by subthreshold leakage as the transistors' threshold voltage inevitably declines, and more leakage means greater power dissipation. A lot of innovative work is being done, like the use of 3D finFETs, but the Moore's manna of yore has, to a large extent, dried up.

Like the cavalry in a bad western, multi-core came riding to the rescue, and it's hard to go a day without seeing some new many-core CPU introduction. Most sport symmetric multi-processing architectures, where two or more cores share some cache plus the main memory. Some problems can really profit from SMP, but many can't. Amdahl's Law tells us that even with an infinite number of cores, an application that is 50% parallelizable will get only a 2x speed-up over a single-core design. And that law is optimistic: it doesn't account for the inevitable bus conflicts that occur when sharing L2 and main memory, and interprocessor communication, locks, and the like make things even worse. Data from Sandia National Labs shows that, even for some very parallel problems, multi-core just doesn't scale once more than a small number of processors are involved.
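For reference, that 2x figure falls straight out of Amdahl's Law (standard textbook notation, not symbols taken from the paper discussed below): if a fraction p of the work can be parallelized across n cores, the ideal speed-up is

S(n) = \frac{1}{(1 - p) + p/n}

With p = 0.5, even n \to \infty gives S = 1/0.5 = 2. And even a 90% parallel workload on 16 cores manages only S = 1/(0.1 + 0.9/16) \approx 6.4, well short of 16 – before any of the bus conflicts mentioned above are counted.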
In "Power Challenges May End the Multicore Era" (Communications of the ACM, February 2013, subscription required), the authors develop rather complex models which show that multi-core may (and the operative word is "may") bang into a dead end due to power constraints. Soon. The key takeaways are that by the 8 nm node (expected around 2018) more than 50% of the transistors on a microprocessor die will have to be dark, or turned off, at any one time just to keep the parts from self-destructing from overheating. The most optimistic scenario shows only a 7.9x speed-up between the 45 nm and 8 nm nodes; a more conservative estimate pegs that at 3.7x. The latter is some 28 times less than the gains Moore's Law has led us to expect.

I have some problems with the paper:

- The authors assume an Intel/AMD-like CPU architecture – huge, honking processors whose entire zeitgeist is performance. We in the embedded space are already power-constrained and generally use simpler CPUs. It's reasonable to assume a mid-level ARM part will run into the same issues, but perhaps not at 8 nm.

- They don't discuss memory contention, locks, and interprocessor communication. That's probably logical, as their thesis is predicated on power constraints, but these issues will make the results even worse in real-world applications. The equations presented indicate no bus contention for shared L2 (and L2 is always shared on multi-core CPUs) and none for main memory accesses. Given that L1 is tiny (32-64 KB), one would expect plenty of L1 misses and thus lots of L2 activity... and therefore plenty of contention.

- The models analyse applications in which 75% to 99% of the work can be done in parallel. Plenty of embedded systems won't come near 75%.

- It appears the analysis assumes cache wait states are constant: three for L1 and 20 for L2. Historically that has not been the case – the 486 had zero-wait-state cache. It's hard to predict how future caches will behave, but if past trends continue, the paper's conclusions will be even worse.

- The paper assumes a linear relationship between frequency and performance, and the authors acknowledge that memory speeds don't support this assumption.

The last point is insanely hard to analyse. Miss rates for L1 and L2 are extremely dependent on the application. SDRAM is very slow for the first access to a block, though succeeding transfers happen very quickly indeed. So any transaction could take anywhere from three cycles (if the data is in L1) to hundreds. One wonders how much tolerance a typical hard real-time system would have for such uncertainty (a back-of-the-envelope illustration appears at the end of this post).

Two conclusions are presented. The pessimistic one is the Chicken Little scenario, where we hit a computational brick wall. Happily, the paper also addresses a number of more optimistic possibilities, ranging from microarchitecture improvements to unpredictable disruptive technologies. The latter have driven semiconductor technology for decades, and I for one am optimistic that some cool and unexpected inventions will continue to drive computer performance on its historical upward trajectory.
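The back-of-the-envelope illustration promised above (the miss rates and the SDRAM figure are assumed values for the example, not numbers from the paper): using the paper's 3-cycle L1 and 20-cycle L2, an assumed 5% L1 miss rate, 20% L2 miss rate, and 150 cycles to SDRAM, the average memory access time is

AMAT = t_{L1} + m_{L1}\,(t_{L2} + m_{L2}\,t_{mem}) = 3 + 0.05\,(20 + 0.2 \times 150) = 5.5 \text{ cycles}

yet a single access that misses both caches costs 3 + 20 + 150 = 173 cycles. The average looks benign; the worst case is more than thirty times larger, and that spread is exactly the uncertainty a hard real-time design has to budget for.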
  • Popularity 19
    2012-12-27 20:29
    1464 reads
    0 comments
My editors have asked me to prepare a list of what I feel are the ten best things that have happened in the embedded space in 2012. Rather than do that, I've compiled what I see as the ten most important things this year for embedded systems.

Number 10: Sub-$0.50 32-bit processors
NXP and others have introduced ARM Cortex-M0 microcontrollers for tens of cents. Put a high-end CPU in your product for a tenth of the cost of a cup of Starbucks. Does this spell the end of 8 and 16 bits? I don't think so, but it does shift the landscape considerably.

Number 9: Ada 2012
The new version of Ada includes design-by-contract to automatically detect large classes of runtime errors. Though Ada's use is still very small, it does offer incredibly low bug rates. In the past, design-by-contract was only available natively in Eiffel, which has a 0% market share in the embedded space.

Number 8: Xilinx acquires PetaLogix and ships the Zynq FPGA
The Zynq has twin Cortex-A9 cores, and it is interesting in that it's less about a massive FPGA and more about cores with some configurable logic. PetaLogix has a great demo showing interrupt latency on each core, one running FreeRTOS and the other Linux. Although Linux is a wonderful OS, it isn't an RTOS replacement.

Number 7: The CoreMark benchmark goes mainstream
While CoreMark has been around for some time, in 2012 a number of microprocessor manufacturers started using it strategically to differentiate their offerings. Now CoreMark is even found in datasheets. ARM leveled the playing field... will CoreMark upend it?

Number 6: Ivy Bridge released
Although Intel's part is not targeted at the embedded space, its successful use of 22 nm geometry, enabled by FinFET transistors, is causing the other foundries to scramble. You can be sure we'll see FPGAs at this process node before long, which will mean higher density and lower power consumption (at least on a per-transistor basis). Today both Altera and Xilinx are shipping 28 nm parts.

Number 5: Foxconn plans to add 1 million robots
Nope, this isn't happening in 2012, but that oft-reviled company is starting to ramp up its robotics. What will this mean? A ton of layoffs in China, that's for sure. It will also be a shot in the arm for the vendors who make the embedded systems that go into robots. I suspect the economy of scale will drive prices down substantially, creating more opportunities for robots there and here in the West. The impact on employment will be scary.

Number 4: ARM's big.LITTLE heterogeneous cores
If there is a theme in embedded over the last year or two, it's power management. It's all about the joules when running from a battery. A smartphone demands a ton of computational capability when active, but spends most of its time loafing. ARM mixed a Cortex-A15 with a Cortex-A7 on one die: the A15 runs when demands are high; otherwise it sleeps and the A7 runs exactly the same code while consuming less power. Other vendors have taken somewhat similar approaches, like NXP in their LPC4350, which mixes a Cortex-M4 and a Cortex-M0 on a single chip.

Number 3: Improved tools to measure power consumption of devices
To continue with the theme of power management, a number of vendors have introduced or improved tools that measure the power consumption of devices. ARM's DS-5 toolchain now operates with National Instruments' data acquisition devices, Segger has a brand-new debugger that measures power, and IAR's tools have been improved.
All three of these correlate power consumption to the running code (with some caveats). Then there are the low-cost devices like Dave Jones' µCurrent, and a new and very innovative product I'm not allowed to talk about yet. The bottom line is that designers of low-power systems now have tools that operate in the power/code domains.

Number 2: Innovations in gesture UIs, such as Microchip's GestIC parts
Also huge in the last few years are new ways to interact with devices. Apple refined the UI with touchscreen swiping. Kinect uses a camera to sense a player's inputs. This year Microchip introduced its GestIC parts, which sense hand gestures made within 15 cm of a device. It can detect the hand position in 3D space, flicks, an index finger making clockwise or counterclockwise circles, and various symbols. And, no, as yet it cannot detect that gesture you were just thinking about.

Number 1: Searching... searching...
Finally, the biggest development in 2012 is the one that didn't happen. Despite sales of hundreds of millions of multi-core chips this year, no one really knows how to program them. The problem of converting intrinsically serial code to parallel remains unsolved. Here's my six-core PC's current state as half a dozen busy apps are running.
  • Popularity 15
    2012-7-20 01:11
    4338 reads
    0 comments
The tech world has gone through unprecedented changes in the last few quarters. Apple and Samsung have taken mobile phone market leadership to a different level, surging ahead of their nearest rivals in terms of innovation, technology, and the revenue and profits earned in this business. We now see software and operating-system players like Microsoft and Google entering the hardware market with their own branded products. The message is loud and clear: these companies are looking for increased business from the consumer and from the actions the consumer carries out on the internet, and they are out to influence consumer-side devices as well as server/network-side applications in order to maximise that business. All of these changes are having a huge impact on the traditional ecosystems in the mobile, handheld, and consumer markets. Only time will tell if the integrated strategy played out by Apple is the way to go for the consumer electronics leaders of the world, though there is an apparent shift in that direction.

All of these changes are fuelling tremendous growth in the embedded markets. Some of the trends that we see in embedded system design are as follows:

Increased use of multi-core processor platforms: Traditional embedded systems design principles favoured processor and design simplicity in order to meet stringent requirements for cost, reliability, thermal performance, and so on, so the use of multi-core processors was not very common. Of late, new process and power-conservation technologies are driving the use of multi-core processors in embedded system design without compromising those traditional principles. Enhancements in processor design now target not only increased clock speed but also increased efficiency, lower power consumption, and integrated graphics performance.

Connectivity is driving security needs in devices: The convergence of device features and technologies is happening faster than anyone imagined, and the need for connectivity is driving device designs. All of this adds a security nightmare: protecting personal and professional information from hostile attacks. The embedded system components (processor, operating system, applications) need better security features in order to tackle these challenges.

Demand for video processing: The enhanced processing power in devices is driving the need for better video processing for personal and professional data transfer, and an increasing proportion of devices are being designed with video capability. Innovative application use cases are built to take advantage of social networking and other converged platforms to share video across devices.

Irrespective of the global economic turbulence, there will be continued investment in more innovative and efficient embedded solutions to cater to these trends. To be a winner in the embedded market, companies and individuals need to constantly develop and innovate on new ideas and approaches that deliver efficient, fast, low-power, cost-effective solutions to consumers. The trends above – increased video data, security needs, and the use of complex processors – will demand a new level of expertise in providing these solutions.

- Krishnakumar M.
Related resources