Helper Threads

Extreme-scale computers will contain thousands of cores.  Applications that can effectively exploit such massive parallelism have the potential to achieve unprecedented performance gains.  Unfortunately, the anticipated performance may not materialize for all applications.  Some may lack sufficient parallelism to keep all the cores busy.  Others may incur parallelization overheads--e.g., communication and synchronization--that outweigh the benefits of exploiting large-scale parallelism.  And for memory-intensive applications, cache pressure and limited off-chip bandwidth may prevent thread scaling.

Rather than execute application threads on every core, an alternative approach is to use some of the cores that would otherwise run inefficiently to execute ``helper threads.''  Helper threads assist the application's compute threads in some way, typically to improve performance.  A prime example of a task for helper threads is triggering, in advance, the cache misses or branch mispredictions that would normally stall compute threads, thus improving performance and power efficiency.  Such threads are known as load and branch ``pre-execution'' helper threads.

A key step in deploying helper thread techniques is creating the pre-execution code that runs in helper threads.  For example, compiler-based pre-execution extracts helper thread code directly from the application source code.  Application code regions (typically loops) containing performance-degrading memory instructions are cloned to form helper threads.  The cloned code is then optimized using program slicing, prefetch conversion, and speculative parallelization so that it runs faster than the application code and can stay ahead of it.
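The cloning and optimization steps can be illustrated with a small hand-written sketch.  It is not the output of any particular compiler; the workload (an indirect array sum), the names workload_t, compute_sum, helper_slice, and run_with_helper, and the use of POSIX threads with the GCC/Clang __builtin_prefetch intrinsic are all assumptions made for illustration.  The helper is a clone of the compute loop in which program slicing has removed the accumulation and prefetch conversion has turned the potentially missing load into a non-blocking prefetch.  Speculative parallelization of the helper loop is omitted.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical workload: an indirect array sum whose loads of
 * data[index[i]] are assumed to miss in the cache frequently. */
typedef struct {
    const double *data;   /* large array accessed irregularly */
    const size_t *index;  /* access pattern; every entry must be in bounds */
    size_t n;             /* number of accesses */
} workload_t;

/* Compute thread: the original application loop, unchanged. */
double compute_sum(const workload_t *w) {
    double sum = 0.0;
    for (size_t i = 0; i < w->n; i++)
        sum += w->data[w->index[i]];
    return sum;
}

/* Helper thread: a clone of the loop above.
 * Program slicing: only the address computation is kept; the
 * accumulation into sum is dropped because no address depends on it.
 * Prefetch conversion: the load of data[index[i]] becomes a prefetch
 * hint, so the helper never stalls on the miss it triggers. */
static void *helper_slice(void *arg) {
    const workload_t *w = arg;
    for (size_t i = 0; i < w->n; i++)
        __builtin_prefetch(&w->data[w->index[i]], 0 /* read */, 1);
    return NULL;
}

/* Run the compute loop with a helper thread alongside it.  The helper
 * only issues hints, so the result is the same whether or not the
 * helper starts, runs ahead, or falls behind. */
double run_with_helper(const workload_t *w) {
    pthread_t helper;
    int started = (pthread_create(&helper, NULL, helper_slice, (void *)w) == 0);
    double sum = compute_sum(w);
    if (started)
        pthread_join(helper, NULL);
    return sum;
}
```

With GCC or Clang this can be built with the -pthread flag.  Because the helper has less work per iteration, it tends to run ahead of the compute thread; whether its prefetches arrive in a cache the compute thread can use depends on how the two threads are placed and which cache levels they share, which this sketch does not control.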

In the past, helper threads have been studied extensively on SMT processors.  However, less is known about their effectiveness on multicore processors.  Certainly, the role of helper threads in a 1000-core chip is an open question.  Our research addresses several key questions:
 

  • Unlike on SMT processors, compute and helper threads are physically distributed in large-scale factored multicore processors.  How does this affect the generation of effective helper thread code?
     
  • What is the right proportion of helper threads to compute threads to achieve the best performance and power efficiency?  And how should the operating system schedule helper versus compute threads to maximize benefit while minimizing resource contention?
     
  • Because helper and compute threads have asymmetric runtime requirements, helper threads can exploit heterogeneous cores.  Can helper threads run on extremely low-power cores (e.g., Angstrom's partner cores) to achieve very high power efficiency yet still provide effective memory and branch latency tolerance?

     

Researchers: