Extreme-scale computers will contain thousands of cores. Applications that can effectively exploit such massive parallelism stand to achieve unprecedented performance gains. Unfortunately, the anticipated performance may not materialize for every application. Some lack sufficient parallelism to keep all the cores busy. Others incur parallelization overheads, such as communication and synchronization, that outweigh the benefits of exploiting large-scale parallelism. And for memory-intensive applications, cache pressure and limited off-chip bandwidth may prevent thread scaling.
Rather than execute application threads on every core, an alternative approach is to employ some cores that would otherwise be used inefficiently to execute ``helper threads.'' Helper threads assist the application's compute threads in some way, typically to improve performance. A prime example is triggering, in advance, the cache misses or branch mispredictions that would otherwise stall the compute threads, improving both performance and power efficiency. Such threads are known as load and branch ``pre-execution'' helper threads.
A key step in deploying helper thread techniques is creating the pre-execution code that runs in the helper threads. For example, compiler-based pre-execution extracts helper thread code directly from the application source code. Application code regions (typically loops) containing performance-degrading memory instructions are cloned to form helper threads. The cloned code is then optimized using program slicing, prefetch conversion, and speculative parallelization to increase its speed relative to the application code.
In the past, helper threads have been studied extensively on SMT processors. However, less is known about their effectiveness on multicore processors, and the role of helper threads in a 1000-core chip is certainly an open question. Our research addresses several key questions: