embedded power for your project

32-bit and more - microprocessor power for every application

30 years of microprocessors:

The selection of powerful processors for many specialized purposes has virtually exploded in recent years. On the one hand, configurable architectures are replacing conventional board-level design for small production volumes; on the other, well-known silicon vendors keep devising ever more refined variations of their standard architectures, so that many former wishes come true. A ColdFire with CAN and Ethernet? Freescale makes it possible. A processor that keeps 6 stepper motors in check at the same time and is Bluetooth-capable on top of that? Why not: suppliers to the automotive industry in particular are very receptive to such ideas, because volumes in the millions beckon in the only industry in Europe that still earns good returns.

Evolution in Architectures

Processor design, powered by faster, denser, and more plentiful silicon, has broadened beyond a simple CPU with caches and on-chip peripherals. Today's embedded processor choices include:
Chip-level MP: Multiprocessing on-chip is reality. Implementations range from handset SOCs integrating a DSP + uC, to multiple DSPs on a common bus.
True Systems-on-Chip: SOCs have gone beyond just a processor, its memory bus and peripherals. SOCs are taking on the aspects of systems, with multiple processors, common memory, and peripherals, with sophisticated system buses to tie it all together.
Parallel Processor Element Designs: Alternative architectures built from multiple Processing Elements (PEs) can deliver massive amounts of processing power from ganged PEs. These processors may function as a co-processor or be integrated with a host CPU in an SOC.
Extensible Processors: RISCs that can be extended at the ISA level and that rely on system-level logic synthesis to integrate the designs.
Add-On Functionality: RISC and DSP architectures that enable 3rd parties and vendors to add logic functionality at the ISA level. They rely on logic synthesis to integrate the new functions into the design.
And last, but not least, CPU speeds are up. Embedded processors are moving up the speed curve. Already some embedded processors are cracking the 1 GHz barrier, along with the associated EMC and heat problems.

Silicon-Driven Revolution

Everybody knows it, but it's still true. It is a technical commonplace, even a cliché, that silicon technology follows Moore's Law, roughly doubling the number of transistors (or functionality, or clock rates, or capabilities) every 18 months to two years. And we are now seeing the benefits of this relentless advance up the silicon curve (time vs. capability).
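As a rough worked example of the doubling rule, the sketch below projects a transistor count forward under an assumed fixed doubling period. The function name is made up for illustration, and the 7 M starting figure is simply borrowed from the transistor range quoted later in this article; none of this is vendor data.

```python
def projected_transistors(start_count, years, doubling_period_years=1.5):
    """Project a transistor count assuming a fixed doubling period."""
    return start_count * 2 ** (years / doubling_period_years)


# Hypothetical starting point: a 7 M transistor chip, projected 6 years ahead.
for period in (1.5, 2.0):
    count = projected_transistors(7e6, years=6, doubling_period_years=period)
    print(f"doubling every {period} years: {count / 1e6:.0f} M after 6 years")
```

Six years is four doublings at 18 months but only three at two years, so the two ends of the quoted range already differ by a factor of two over that span.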

We are now seeing lower core voltages, accompanied by lower power dissipation, higher clock rates, smaller geometries, and much, much higher transistor counts. We can do more, faster, with less power, and even cheaper than before. Core voltages are dropping to 1.2 to 1.8 V for the more advanced designs, on their way to 1.0 V and less.

Such lower core voltages and smaller silicon, coupled with power saving design techniques, have brought down chip level power dissipation, in spite of higher clock rates. Power dissipation for many embedded processors, even for advanced processors, has come down to a matter of a few Watts.
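Why lower core voltage can outweigh a higher clock rate is easiest to see with the standard first-order CMOS dynamic power relation, P = activity x C x V^2 x f. That relation is textbook background rather than something stated in this article, and the capacitance, voltages, and clock rates below are made-up illustrative numbers.

```python
def dynamic_power_watts(switched_capacitance_f, core_voltage_v, clock_hz, activity=1.0):
    """First-order CMOS dynamic power: P = activity * C * V^2 * f."""
    return activity * switched_capacitance_f * core_voltage_v ** 2 * clock_hz


# Illustrative assumption: 2 nF of switched capacitance in both cases.
old = dynamic_power_watts(2e-9, core_voltage_v=3.3, clock_hz=200e6)
new = dynamic_power_watts(2e-9, core_voltage_v=1.2, clock_hz=600e6)
print(f"3.3 V @ 200 MHz: {old:.2f} W")
print(f"1.2 V @ 600 MHz: {new:.2f} W")
```

Under these assumed numbers, tripling the clock while dropping from 3.3 V to 1.2 V still lowers dynamic power, because the voltage enters squared. Leakage and I/O power are ignored in this sketch.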

This silicon is not free. But it is available. And with today's higher and higher clock rates, processors need more on-chip memory to minimize off-chip memory access delays. So processors are bulking up on on-chip memory. Most RISCs, for example, can now afford to run with 16 KB or even 32 KB Instruction and Data caches. And many, following Intel Pentium, are moving toward large on-chip L2 caches to localize processing and to minimize off-chip memory access delays.
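The benefit of on-chip memory can be estimated with the usual average-memory-access-time model: hit time plus miss rate times miss penalty. The model is standard background, not taken from this article, and the cycle counts and miss rates below are assumptions chosen only to show the effect.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time in cycles for a single cache level."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


# Assumed: 1-cycle cache hit, 50-cycle off-chip access.
small_cache = average_access_cycles(1, miss_rate=0.10, miss_penalty_cycles=50)
large_cache = average_access_cycles(1, miss_rate=0.02, miss_penalty_cycles=50)
print(f"small cache: {small_cache:.1f} cycles per access")
print(f"large cache: {large_cache:.1f} cycles per access")
```

As clock rates rise, the off-chip penalty grows when measured in cycles, which is why a lower miss rate from a larger cache pays off more and more.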

The cement that binds these technical advantages to more advanced designs is logic synthesis. Without HDL-based design and the logic synthesis to transform HDL designs into silicon, current designs would not be possible. With chips running from 7 to 45 M+ transistors, the time for gate-level hand design is over. Today's complex designs are the products of a new generation of synthesis tools. These tools enable designers to easily add IP or to extend their processor designs.

Chip-Level MP

In the 21st century, chip-level multiprocessing became a reality. SOCs moved from being a way to integrate a processor with its peripherals on one piece of silicon to taking on the characteristics of true systems. Multiple processors on an FPGA became a working reality, one that designers could count on for delivering a large amount of processing power within a realistic silicon budget.

SOC MP ranges from paired processors, such as a RISC paired with a microcontroller, to full-scale MP architectures with multiple RISC or DSP processors. In addition, a new class of MP processing has emerged, that of multiple specialized processors or coprocessors arranged in sequential processing order or in processing arrays. This latter class represents the deployment of specialized math, vector, graphic, or media processors, which collectively can deliver a very high level of performance at modest clock rates. DSP algorithms in particular benefit greatly from the parallel execution that adaptable logic makes possible, compared with traditional sequential DSP processors.

Taking advantage of today's plentiful silicon, vendors are packing multiple processors on a single die to minimize design chip counts and costs. For example, Motorola is now integrating its 32-bit M-CORE RISC with its 4th generation, 16-bit fixed-point StarCore VLIW DSP for one-chip wireless transceivers. Another example of such high-level on-chip integration is Lucent's StarPro, which integrates 3 StarCore DSPs with 768 KB of RAM and on-chip peripherals. A third example is Infineon's Carmel DSP, which supports on-chip MP, with 4 or more DSPs integrated on a single chip.

Clocks vs. Execution Units

There's a new variation on an age-old tradeoff: clock rates vs. execution units. Many designers are working from the idea that maybe we don't have to go faster if we can do more in parallel with lots of execution units. We can then run the execution units at slower clock rates and get GHz level performance without straining the silicon. It's a variation of the "wider rather than faster" design theme. If you think about it, that's precisely what superscalar RISC, VLIW and SIMD are all about, essentially deploying more execution units in parallel.
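The "wider rather than faster" tradeoff comes down to simple arithmetic. The sketch below computes the clock rate needed to reach a peak throughput target for a given number of execution units; it assumes, optimistically, that every unit completes one operation per cycle, and the 1 G operations per second target is an illustrative choice.

```python
def required_clock_hz(target_ops_per_second, execution_units):
    """Clock needed for a peak throughput if every unit completes one op per cycle."""
    if execution_units < 1:
        raise ValueError("need at least one execution unit")
    return target_ops_per_second / execution_units


target = 1e9  # 1 G operations per second, an illustrative target
for units in (1, 4, 16):
    print(f"{units:2d} units -> {required_clock_hz(target, units) / 1e6:.1f} MHz")
```

These are peak figures only. Real code rarely keeps every unit busy on every cycle, which is why sustained throughput is usually lower.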

Sounds good, but most superscalar RISCs, VLIWs or SIMDs can't get that many execution units chugging away in parallel. For example, a 4-way superscalar RISC will run 4 execution units in parallel. An 8-way VLIW like TI's C6x has, at best, 8 units executing in parallel. SIMDs do a bit better, especially for 8-bit operations: a 128-bit SIMD like Motorola's PowerPC G4 does 16 operations in parallel. But if you need 16-bit accuracy, it only does 8 operations in parallel.
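The SIMD figures above follow directly from dividing the register width by the element width. A minimal sketch, with a hypothetical helper name:

```python
def simd_lanes(register_bits, element_bits):
    """Number of elements a SIMD register of the given width processes at once."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


for element_bits in (8, 16, 32):
    print(f"128-bit register, {element_bits}-bit elements: "
          f"{simd_lanes(128, element_bits)} operations in parallel")
```

The same division applied to a 64-bit register gives 8, 4, and 2 lanes.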

However, there's another way to get more parallel processing power to deliver massive amounts of execution MIPS at relatively low clock rates. New architecture designers have done this by basically upping the number of parallel execution units that can be deployed in tandem. Today's emerging parallel designs are all over the place architecturally, but basically all get their top-level performance by ganging multiple parallel execution units for massive parallelism.

Several dynamically reconfigurable MP designs pair an on-chip ARC RISC host with a 32-bit reconfigurable processing fabric. The fabric is configurable through FPGA-like programmable local and layer interconnects and datapath cells. Examples of such architectures can be found at Stretch, Altera, Atmel, and Xilinx, with more companies to come.

Through the looking glass:

Extensible Processors

Maybe fixed instruction sets and standardized instruction set architectures (ISA) are not the best way to get the ultimate efficiency into embedded applications. Maybe the better way is to tailor the instruction set and the peripheral units of YOUR OWN processor (system on a chip - SOC). A few good instructions can save milliseconds. Years ago, engineers and programmers settled on standardized processor architectures to maximize software life and to minimize reprogramming. Instruction sets remained fixed, while architectural implementations varied, taking advantage of new technology and design methodologies. While this approach did deliver cost-effective processing, it forced the programmers to tailor their software to the application problem, even if a hardware assist would be much more cost effective.

Today, there is a third way, one between fixed instruction sets and custom ISAs. A new instruction or hardware capability would be automatically included in the assemblers, compilers, libraries, and operating software. Thus you can add a new instruction, one that the operating system will automatically use, or one that provides new hardware functionality available to programmers as one or more new instructions used in functional libraries. This addition of instruction functionality can make a big difference for many embedded applications that handle specialized interfacing and packet processing tasks.
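The effect of adding an instruction can be illustrated with a toy register machine whose instruction table is extensible. This is a deliberately simplified sketch: the ToyCpu class, its mnemonics, and the one-cycle-per-instruction cost model are all assumptions made for illustration, and it does not represent the tool flow of any vendor mentioned in this article.

```python
class ToyCpu:
    """Toy register machine with an instruction table that can be extended."""

    def __init__(self):
        self.regs = [0] * 8
        self.executed = 0  # instructions executed; one cycle each in this model
        self.ops = {"add": self._add, "mul": self._mul}

    def _add(self, rd, ra, rb):
        self.regs[rd] = self.regs[ra] + self.regs[rb]

    def _mul(self, rd, ra, rb):
        self.regs[rd] = self.regs[ra] * self.regs[rb]

    def extend(self, mnemonic, handler):
        """Register a new instruction; the handler is called as handler(cpu, *operands)."""
        self.ops[mnemonic] = lambda *operands: handler(self, *operands)

    def run(self, program):
        for mnemonic, *operands in program:
            self.ops[mnemonic](*operands)
            self.executed += 1


def mac(cpu, rd, ra, rb):
    """Hypothetical multiply-accumulate extension: rd += ra * rb."""
    cpu.regs[rd] += cpu.regs[ra] * cpu.regs[rb]


# Base instruction set: multiply into a scratch register, then add.
base = ToyCpu()
base.regs[1], base.regs[2] = 3, 4
base.run([("mul", 7, 1, 2), ("add", 0, 0, 7)])

# Extended instruction set: the same work in a single instruction.
extended = ToyCpu()
extended.extend("mac", mac)
extended.regs[1], extended.regs[2] = 3, 4
extended.run([("mac", 0, 1, 2)])

print(base.regs[0], base.executed)
print(extended.regs[0], extended.executed)
```

Both machines compute the same result, but the extended one needs half the instructions and no scratch register. Inside a tight inner loop, savings of this kind are where "a few good instructions" pay off.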

Two fabless processor vendors - ARC and Tensilica - have taken this extensible processor approach for their RISC CPUs. Both companies based their designs on a RISC processor base, and both designs enable developers to pare down instructions for a minimal design, or to add new instruction functionality and coprocessors to the CPU. ARC has made a name for itself by fielding a minimal RISC core with a small footprint and a scalable instruction set. Tensilica, a more recent arrival to the processor business, has a 16-/24-bit instruction word RISC. Both companies rely on synthesis to handle the instruction integration.

The ARC architecture and tool chain enabled designers to tailor the RISC CPU to their design needs, especially letting them eliminate unnecessary instructions and functions. They could also add new instructions, registers and logic resources as needed, relying on logic synthesis to integrate the logic. Later developments opened up this add-on capability to third parties, not just licensees, allowing the third party to add functions (and the instructions to access them) to a library for licensees' use. ARC now has a Plug-In program to encourage 3rd party developers to supply new functions. These ARC Plug-Ins can include instructions, new registers, memory, peripheral IP, custom register flags, a DSP function unit, and bus interfaces. The Plug-Ins are automatically supported by the ARC software tool chain.

Tensilica has taken a language and layout approach to CPU extensibility. You can specify added functionality in Tensilica's special design language, TIE (Tensilica Instruction Extension Language), a Verilog-like language. With TIE, you can define the hardware resources, such as registers and functional units, or coprocessors and their operations for new instructions. Or you can extend existing instructions with new resources, such as more registers. But the additions have to fit into Tensilica's efficient chip layout scheme. You can also add processor and user states, register files, instructions, coprocessors, and C data types.

Next, Stretch’s software-configurable processors combine the ease of software development associated with general purpose processors and DSPs, with the parallelism and flexibility of FPGAs. Stretch achieves this unique combination by embedding programmable logic entirely inside the processor architecture.

Celoxica goes one step further by allowing C code (their Handel-C, and soon SystemC) to be transformed directly into FPGA programming for the Xilinx Virtex family of devices. An IBM PowerPC 405 may control the FPGA, or it may be synthesized inside the FPGA, albeit at a lower speed than the original integrated chip. The ability to put C code into hardware without going through an HDL opens doors to new design paradigms; the next few years will show where these doors lead the industry.

And more CPU vendors are moving toward adding extensibility to their architectures. Infineon, for example, has added a hardware plug-in capability, PowerPlug, to its Carmel DSP. This feature enables the vendor or a customer to customize the Carmel SOC via logic synthesis. Up to 4 PowerPlug units can be added to the Carmel core. Using these plug-ins, users can add two Infineon-supplied MACs to the DSP core, doubling the number of MACs performed per clock cycle from 2 to 4. User-generated PowerPlug units can be attached to new instructions that are added to the software tool chain, including the compilers.
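To get a feel for what doubling the MACs per cycle means, the sketch below counts the cycles needed for the multiply-accumulate work of one filter output. The 64-tap filter length is an assumed example, and the model ignores loads, stores, and loop overhead, so it shows only the best-case effect of the extra MAC units.

```python
import math


def mac_cycles(num_macs, macs_per_cycle):
    """Best-case cycles to perform num_macs multiply-accumulates."""
    if macs_per_cycle < 1:
        raise ValueError("need at least one MAC per cycle")
    return math.ceil(num_macs / macs_per_cycle)


taps = 64  # assumed filter length, for illustration only
for macs_per_cycle in (2, 4):
    print(f"{macs_per_cycle} MACs/cycle: {mac_cycles(taps, macs_per_cycle)} "
          f"cycles per output sample")
```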

RISC, Superscalar, VLIW, and SIMD

Today's processor design techniques include RISC, Superscalar, VLIW and SIMD. Each of these techniques enables designers to get more out of their silicon by squeezing down cycle logic, executing instructions in parallel, or multiplying the number of operations a single instruction can execute. The trick is to get more done in the same amount of clock time.

RISC In classic RISCs, the trick was to squeeze down the register-to-ALU-to-register cycle for higher execution speeds. One way to get it faster was to simplify the logic: to simplify the instruction set, use fixed multi-word addressing, use a Load/Store architecture (operate only on registers), use pipelining to sequentially stage execution (enabling the next instruction to start before the current one finished), and use fixed instruction words. These design techniques enabled RISCs to run faster than the older CISC (complex instruction set computer) processors.
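The gain from pipelining can be sketched with the usual idealized cycle counts: without overlap, each instruction occupies all stages in turn; with an ideal pipeline, the stages fill once and then one instruction completes per cycle. The stage and instruction counts below are illustrative, and stalls, branches, and memory delays are ignored.

```python
def unpipelined_cycles(instructions, stages):
    """Each instruction passes through every stage before the next one starts."""
    return instructions * stages


def pipelined_cycles(instructions, stages):
    """Ideal pipeline: fill it once, then complete one instruction per cycle."""
    if instructions == 0:
        return 0
    return stages + instructions - 1


n, k = 100, 5  # illustrative values, not from the article
print(unpipelined_cycles(n, k), pipelined_cycles(n, k))
```

For long instruction sequences the speedup approaches the number of stages, which is the ideal case that real pipelines fall short of whenever they stall.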

Superscalar The next step to up RISC performance was adding superscalar execution. Superscalar designs can issue more than one RISC instruction per cycle, using multiple execution units to execute multiple instructions in parallel. For example, many RISCs can issue and execute an integer and a floating-point instruction in parallel. But superscalar design techniques ran into some natural limits, namely that the more instructions you issue, the more intermediate state you have to hold in case something goes wrong, such as having to take a branch, which negates the instructions that follow it in sequence. Superscalar has settled out into implementations that can issue 2, 3, or 4 instructions in parallel.

VLIW Some new design techniques have evolved from RISC. These include VLIW and SIMD. VLIW (very long instruction word) implementations are a relatively successful attempt to bypass the problems of superscalar RISC. VLIW is very like RISC superscalar; both techniques issue a number of RISC instructions. The difference is that RISC superscalar does it dynamically in hardware, deciding which instructions to issue and to handle intermediate scheduling problems. VLIW lets the compiler handle the scheduling, with the hardware receiving and issuing a block of RISC instructions.
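What "letting the compiler handle the scheduling" means can be sketched with a very simple static bundler. The function below packs instructions, in program order, into bundles of a given width and starts a new bundle whenever an instruction depends on a result produced in the current one. It is a minimal greedy sketch with hypothetical names and register labels; real VLIW compilers reorder instructions and model functional units and latencies.

```python
def pack_bundles(instructions, width):
    """Greedy in-order packing of (dest, sources) instructions into VLIW bundles."""
    bundles, current, written = [], [], set()
    for dest, sources in instructions:
        # An instruction cannot share a bundle with one that writes a register
        # it reads (or writes) itself.
        depends = dest in written or any(src in written for src in sources)
        if current and (len(current) == width or depends):
            bundles.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


program = [
    ("r1", ("r8", "r9")),   # r1 = r8 op r9
    ("r2", ("r8", "r10")),  # independent of r1, shares the first bundle
    ("r3", ("r1", "r2")),   # needs r1 and r2, so it starts a new bundle
    ("r4", ("r8", "r8")),   # independent, shares the bundle with r3
]
for cycle, bundle in enumerate(pack_bundles(program, width=4)):
    print(cycle, [dest for dest, _ in bundle])
```

A superscalar RISC makes essentially the same grouping decision, but in hardware and at run time; in a VLIW the grouping is fixed in the instruction stream before the program ever runs.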

SIMD It turns out that SIMD (single instruction, multiple data) has been around a long time. It means that a single instruction controls the operation on multiple data elements. For example, an ADD instruction causes n units to do an add. SIMD has proved to be a very powerful mechanism, especially for 8-, 16-, and 32-bit DSP and graphics operations done on large register words. SIMD was a natural extension for floating-point units in RISC and the X86 PC processors. Originally pioneered by Sun for its SPARC and picked up by Intel for its Pentium, SIMD enables one instruction to be applied to multiple fields in a floating-point register word. For a 64-bit word, that can be 8 8-bit adds, 4 16-bit adds, or 2 32-bit adds, delivering an 8x, 4x or 2x speedup. SIMD has now been extended to other architectures and designs: Motorola's PowerPC G4 implements a 128-bit vector engine co-processor with a G3 PPC core. The latest SIMD designs are moving to a separate 128-bit vector unit instead of the earlier 64-bit Floating-Point Execution Units.
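The idea of eight 8-bit adds in one 64-bit word can be emulated in software with a well-known bit-masking trick that prevents carries from crossing lane boundaries. The sketch below only illustrates the principle; it is not the instruction set of any processor named above, and the helper names are made up.

```python
LOW7 = 0x7F7F7F7F7F7F7F7F  # low 7 bits of every 8-bit lane
HIGH = 0x8080808080808080  # top bit of every 8-bit lane


def packed_add_u8(a, b):
    """Add eight unsigned 8-bit lanes packed in 64-bit ints, wrapping per lane."""
    low = (a & LOW7) + (b & LOW7)   # add the low 7 bits; carries stay inside each lane
    return low ^ ((a ^ b) & HIGH)   # fold in each lane's top bits, dropping the carry out


def pack(lanes):
    """Pack eight byte values into one 64-bit word (lane 0 in the low byte)."""
    return int.from_bytes(bytes(lanes), "little")


def unpack(word):
    """Split a 64-bit word back into eight byte values."""
    return list(word.to_bytes(8, "little"))


a = [1, 2, 3, 4, 250, 6, 7, 255]
b = [10, 20, 30, 40, 10, 60, 70, 1]
print(unpack(packed_add_u8(pack(a), pack(b))))
```

Lanes that overflow, such as 250 + 10 and 255 + 1, wrap around within their own byte without disturbing the neighbouring lane. A hardware SIMD unit gets that per-lane isolation for free.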

(by Ray Weiss/techonline2000, revised and updated by Bernhard Kockoth, embeddedexpert 2006)

Embedded Expert 2008 - All brands, trademarks, and trade names are the property of their respective owners.

© BK media systems 2002, 2008.
