Computer Systems Architecture
Informatics Institute - UvA
NWO Project Microgrids

Introduction

This page describes the work being undertaken in the NWO-funded project Foundations for Massively Parallel on-chip Architectures using Microthreading - Microgrids. This is a four-year project exploring some novel foundations for multi- and many-core chips. The project started in September 2005 and will finish in August 2009. The project has already been successful in its outcomes: the model of microthreading has been adopted as the SANE Virtual Processor (the SVP model), where SANE stands for Self-Adaptive Network Entity, in the FP6 European Integrated project AETHER. Work has also started on the FP7 European STREP project Apple-CORE, which is developing an infrastructure of compilers and tools for the SVP/Microthread model.

This project addresses a number of fundamental research questions in its attempt to provide a systematic approach to the development of many-core chips. Although the questions are very ambitious, they are posed incrementally and have a direct impact on current developments, as well as providing a framework for future developments right up to the end of silicon scaling. The research questions can be summarised as follows:

Is it possible, through the introduction of simple and explicit concurrency controls, to develop a systematic approach to:

  1. incrementally designing new processor architectures (i.e. based on an existing ISA and infrastructure);
  2. dynamically managing and optimising the available resources for a variety of goals such as performance, power and reliability (i.e. resulting in autonomous and self-adaptive microgrids);
  3. formally defining the architectures' execution properties;
  4. incrementally developing the architectures' infrastructure (i.e. simulators, compilers, binary-to-binary translators and even silicon intellectual property);

all within the context of ten to fifteen years of silicon-technology scaling (i.e. over a thousand fold increase in chip density)?

Issues

The issues being researched in the Microgrids project are:

  • Speedup - ILP is not following Moore's law: gates are used profligately on unscalable ILP, with most speedup coming from clock speed.
  • Programmability - industry is concerned about compatibility and, of course, the can of worms opened up when you introduce non-determinism into the bug pot; see Edward Lee's 2006 paper for a good take on this.
  • Power dissipation - high clock rates mean greater power density and chips are already too hot.
  • Scalability - speedup can also be obtained from concurrency... but how do performance, area and power dissipated scale with concurrency in instruction issue?
  • Concurrency management - there is a belief that concurrency is inherently difficult; it is not! What is difficult is implementing support for concurrency with appropriate synchronisation and scheduling mechanisms.

Microthreading

Microthreading is an execution model that breaks code down into fragments that can execute simultaneously. It provides data-driven synchronisation close to the processor in a distributed register file, which manages dependencies in pipeline operations. Memory is assumed to be slow and is synchronised in bulk using barriers on the model's families of threads. Recently we have extended the model to one in which complete programs can be decomposed into a parallel control structure over many threads. This control structure is built dynamically, is constrained by resources, and a fragment may be as small as a single instruction.

Microthreading can be implemented in any instruction set by adding support for the following instructions:

  1. create - creates an identifiable family of threads
  2. sync - blocks until the identified family has completed
  3. break - terminates the execution of all other threads and stops creation of threads in the same family; this provides support for infinite concurrency
  4. kill - terminates the execution of all threads and the creation of new threads in an identified family
  5. squeeze - stops creation of threads but allows created threads to complete; this provides for preemption of families to support resource management

In addition to these instructions, some form of control stream is required to identify thread end points and context-switch points. An example of the use of create for executing a loop concurrently is shown to the right. Create can also be used to represent task concurrency and instruction-level concurrency.
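The create/sync pair described above can be approximated in ordinary software. The sketch below (plain Python threads, not the microthreaded ISA) creates a "family" of threads for the iterations of a loop and then syncs on the whole family; the function names mirror the instructions, but the implementation is only an illustrative analogy.

```python
from concurrent.futures import ThreadPoolExecutor

def create(start, limit, step, thread_body):
    """Approximate 'create': spawn a family of threads, one per index.

    Returns a handle (a list of futures) identifying the family."""
    executor = ThreadPoolExecutor()
    family = [executor.submit(thread_body, i) for i in range(start, limit, step)]
    executor.shutdown(wait=False)  # threads keep running; we don't block here
    return family

def sync(family):
    """Approximate 'sync': block until every thread in the family completes."""
    return [f.result() for f in family]

# A loop executed concurrently as a family of threads, one per iteration:
data = [3, 1, 4, 1, 5]
family = create(0, len(data), 1, lambda i: data[i] * data[i])
print(sync(family))  # squares of data, in index order
```

Note that sync collects results in index order regardless of the order in which the threads finish, matching the barrier-style bulk synchronisation on a family.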

Microcontexts
In order to share code between threads in a family, each thread has its own microcontext. A microcontext is a window of registers accessible only to one thread, although parts of it are accessible to a dependent thread. Various models of dependency can be considered, but our current work focuses on support for threads with locality of communication, i.e. communication to successor threads in a family only. Addressing a microcontext uses a base address set on thread instantiation, which is part of a thread's state. All threads can also access some registers (values) from the creating context, but these are read-only. This management of microcontexts allows a conventional register specifier (e.g. 5 bits) to address a large distributed register file without renaming, and to classify register accesses as local, global or shared between threads. All communication in the model we implement can be achieved over a ring network.
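The decode step implied above can be sketched as follows. The window layout (globals first, then shareds, then locals) and the field sizes are assumptions made for illustration, not the actual hardware encoding; only the principle (a 5-bit specifier plus a per-thread base address selects a physical register and a sharing class) comes from the text.

```python
# Illustrative decode of a 5-bit register specifier into a large
# distributed register file, per the microcontext scheme described above.
NUM_GLOBALS = 8   # read-only values from the creating context (assumed size)
NUM_SHARED = 4    # registers visible to the dependent successor thread (assumed size)

def resolve(spec, thread_base, parent_base):
    """Map a 5-bit specifier to a (class, physical register index) pair."""
    assert 0 <= spec < 32
    if spec < NUM_GLOBALS:                      # global: creating context, read-only
        return ("global", parent_base + spec)
    if spec < NUM_GLOBALS + NUM_SHARED:         # shared: communicated to the successor
        return ("shared", thread_base + (spec - NUM_GLOBALS))
    return ("local", thread_base + (spec - NUM_GLOBALS))  # private to this thread

print(resolve(2, thread_base=96, parent_base=0))   # ('global', 2)
print(resolve(10, thread_base=96, parent_base=0))  # ('shared', 98)
print(resolve(20, thread_base=96, parent_base=0))  # ('local', 108)
```

Because the base address is part of thread state, no renaming hardware is needed: neighbour-only sharing falls out of overlapping one thread's shared window with the next thread's base.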

Microgrids
A microgrid is a scalable CMP comprising N independently clocked and asynchronously communicating processors. Microgrids run microthreaded programs on groups of ring-connected processors. Each processor has a processor id, Pid {0..N-1}, and a group id, Gid {0..P-1}. Hardware determines which group of processors should execute a family of threads based on information in the metadata of each create instruction, configuring them into a ring network for broadcast and register sharing. A second network addressed by Pid (perhaps TCP/IP off chip) manages resource acquisition and delegation. Each processor uses a local scheduler to manage the concurrent code fragments distributed to it, executing instructions only when all their data is known to be available. As the model is data driven, sophisticated power-management strategies can be adopted that exploit the conservative properties of instruction execution (no work - no power dissipated). The networks required on chip are illustrated in the diagram on the right.
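The intra-group ring can be sketched in a few lines: each processor forwards to its successor, so a value injected at the creating processor visits all group members in at most N-1 hops. The Pid numbering follows the text; the functions themselves are only an illustration of the topology, not of the hardware protocol.

```python
# Sketch of the ring network within a group of N ring-connected processors.
def ring_successor(pid, n):
    """The next processor on the ring, wrapping from N-1 back to 0."""
    return (pid + 1) % n

def broadcast_order(origin, n):
    """Order in which a value injected at `origin` visits the group."""
    order, pid = [origin], origin
    for _ in range(n - 1):
        pid = ring_successor(pid, n)
        order.append(pid)
    return order

print(broadcast_order(2, 5))  # [2, 3, 4, 0, 1]
```

A ring suffices here because, as noted under microcontexts, communication is restricted to successor threads, so no all-to-all network is needed within a group.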

Implementations of microgrids use a tiled floorplan partitioned into clusters of processors forming allocation units. These units are allocated dynamically at any level of create, allowing concurrency to recursively unfold over a chip (or many chips) according to resource utilisation models and the dynamic metadata associated with each level of create instruction. The key features of this model are:

  • it provides for fully scalable CMP implementations
  • it provides for conservative instruction issue with hooks for controlling power dissipation
  • concurrency is parametric and dynamic but schedule invariant, providing code compatibility across generations of implementation
  • allocation of iterations to processors is deterministic, which means cache locality and memory partitioning can be managed statically
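Deterministic allocation means the iteration-to-processor mapping is a pure function of the iteration index and the group size, so it never changes between runs. Simple modulo placement, sketched below, is one plausible such policy; the mapping actually used on a microgrid may differ, and this is stated only to make the determinism concrete.

```python
def place(iteration, group_size):
    """Deterministically map a loop iteration to a processor in the group.

    Modulo placement is an assumed example policy: the result depends only
    on the iteration index and the group size, never on runtime timing."""
    return iteration % group_size

# With a fixed group size the mapping is identical on every run, which is
# what allows cache locality and memory partitioning to be managed statically.
assignments = [place(i, 4) for i in range(10)]
print(assignments)  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```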

Scalability

We have performed extensive, cycle-accurate simulation of a microgrid. The results below are for the FFT. They show speedup for FFTs of length 2^8, 2^12, 2^16 and 2^20 on n processors against the performance of a single processor. The same results are plotted on two different scales for clarity. These results are translated into absolute performance in the final figure, assuming a 1.5GHz clock.

[Figures FFT1, FFT2 and FFT3: FFT speedup on two scales, and absolute performance assuming a 1.5GHz clock]

Publications

