Third Workshop on Computer Architecture and Operating System Co-design

In conjunction with:
the 7th ACM International Conference on
High-Performance Embedded Architectures and Compilers (HiPEAC'12)
Paris, France, January 23-25, 2012

News and Announcements

The workshop proceedings are now available online.

Keynote: Blue Gene/Q: Architecture, Co-Design, and the Road to Exascale (talk slides)

Multicores are monopolizing the market, from embedded systems to supercomputers. However, extracting high performance from these modern systems has become a complex task. As the number of cores per chip and/or the number of hardware threads per core continues to increase, new research questions arise in scheduling, power, temperature, dependability, scalability, design complexity, efficiency, throughput, heterogeneity, languages, compilers, etc. Performance is no longer the only important metric, and other metrics (such as security, power, total throughput, reliability, and Quality of Service) are becoming more important than ever. It is therefore evident that neither hardware nor software alone can achieve the desired performance while also complying with these constraints. One approach to tackling these new challenges is hardware-software co-design.

This workshop aims to bring together researchers and engineers from academia and industry to share ideas and research directions in Computer Architecture and Operating System co-design and interaction. Authors are invited to submit innovative manuscripts in all areas of parallel architecture, distributed processing, real-time systems, HPC systems and commercial/server systems.

Topics of interest

Papers are sought on topics including, but not limited to:

Architectural and OS support for scheduling applications on emerging multicore systems
Architectural and OS support for programming languages and compilers
Architectural and OS support for power and thermal management
Architectural and OS support for specialized architectures (e.g., heterogeneous processors, accelerators, GPGPUs, etc.)
Architectural and OS support for reliability, dependability, and security
Benchmarking and characterization of OS activity in multicore architectures
Architectural and OS support for virtualization
Architectural and OS support to manage processor resource allocation and heterogeneity for Quality of Service
Simulation tools for full system simulation

The workshop provides a forum to discuss the latest proposals in co-designing the computer architecture, software systems, and OS, and to bring ideas and research problems to the attention of the audience. Papers reporting on ongoing work that address cross-cutting issues and provide thought-provoking insights into the main themes are encouraged. Position and vision papers are also welcome.

Workshop proceedings will be made available at the workshop.

The one or two best papers will be considered for publication in IEEE Computer Architecture Letters (CAL).

Program

10.00am	Opening
10.05-11.00	Keynote: Blue Gene/Q: Architecture, Co-Design, and the Road to Exascale
Dr. Robert Wisniewski (IBM Research)
Abstract: In 2004 Blue Gene made a significant impact by introducing an ultra-scalable computer with a focus on low power. After that, Blue Gene/L maintained the number 1 spot on the TOP500 list for an unprecedented 7 lists. In 2007 Blue Gene/P was announced and a peak 1 PF machine was installed at Juelich, and Blue Gene/P garnered the top position on the Green500 list. At Supercomputing 2011 we announced Blue Gene/Q, a 208 TF per rack machine delivering over 2 GF/watt, which obtained the number 1 position on the Green500; a 4-rack machine was ranked number 17 on the TOP500 list. Blue Gene/Q was also number 1 on the Graph 500 list. The announced LLNL Sequoia machine will be a 96-rack, 20 PF machine, and will be delivered in mid-2012. Blue Gene/Q contains innovative technology including hardware transactional memory and speculative execution, as well as mechanisms such as scalable atomic operations and a wakeup unit to help us better exploit the 17 cores and 68 threads per node. In the talk I will describe the base architecture of Blue Gene/Q, including the hardware, packaging, and software, with a focus on the co-design process between the applications, system software, and hardware teams that led to the above capability. I will also describe how Blue Gene/Q is a research vehicle for helping us explore the challenges that face us on the road to exascale.
Bio: Dr. Robert Wisniewski is the chief software architect for Blue Gene Research and manager of the Blue Gene and Exascale Research Software Team at the IBM T.J. Watson Research Facility. He is an ACM Distinguished Scientist and IBM Master Inventor. He has published over 60 papers in the areas of high performance computing, computer systems, and system performance, and has filed over 50 patents. Prior to working on Blue Gene, he worked on the K42 Scalable Operating System project, targeted at scalable next-generation servers, and on the HPCS project on Continuous Program Optimization, which utilizes integrated performance data to automatically improve application and system performance. Before joining IBM Research, and after receiving a Ph.D. in Computer Science from the University of Rochester, Robert worked at Silicon Graphics on high-end parallel OS development, parallel real-time systems, and real-time performance monitoring. His research interests lie in experimental scalable systems, with the goal of achieving high performance through cooperation between the system and the application. He is interested in how to structure and design systems to perform well on parallel machines, and how those systems can be designed to allow user customization.
11.00-11.30	Coffee break
11.30-12.00	NUMA Implications for Storage I/O Throughput in Modern Servers
Shoaib Akram, Manolis Marazakis, and Angelos Bilas
Foundation for Research and Technology - Hellas (FORTH), Institute of Computer Science (ICS), Greece
Abstract: Current server architectures have started to move away from traditional memory buses that do not scale and towards point-to-point interconnects for communication among processors, memories, and I/O devices. As a result, memory modules are not equidistant from all cores, leading to significant differences in memory access performance from different cores. Similar to memory modules, I/O devices are connected today to processor sockets in a NUMA manner. This results in NUMA effects for transfers between I/O devices and memory banks, as well as for processor I/O (PIO) accesses to I/O devices. This trend towards NUMA architectures increases complexity for buffer placement, device data transfers, and code execution, creating a complex affinity space. In this paper, we discuss problems that arise when performing I/O and present a preliminary evaluation of the impact of different types of affinity. We use a server-type system with two Intel Xeon processors, four storage controllers, and 24 solid-state disks (SSDs). Our experiments with various machine configurations show that, compared to local transfers between devices and memory, remote transfers have the potential to reduce maximum achievable throughput by 8% to 40%. Further, for I/O-intensive applications, remote transfers can potentially increase I/O completion time by up to 130%.
12.00-12.30	Judicious Thread Migration When Accessing Distributed Shared Caches
Keun Sup Shim, Mieszko Lis, Omer Khan, and Srinivas Devadas
Massachusetts Institute of Technology (MIT), USA
Abstract: Chip-multiprocessors (CMPs) have become the mainstream chip design in recent years; for scalability reasons, designs with high core counts tend towards tiled CMPs with physically distributed shared caches. This naturally leads to a Non-Uniform Cache Architecture (NUCA) design, where on-chip access latencies depend on the physical distances between requesting cores and home cores where the data is cached. Improving data locality is thus key to performance, and several studies have addressed this problem using data replication and data migration. In this paper, we consider another mechanism, hardware-level thread migration. This approach, we argue, can better exploit shared data locality for NUCA designs by effectively replacing multiple round-trip remote cache accesses with a smaller number of migrations. High migration costs, however, make it crucial to use thread migrations judiciously; we therefore propose a novel, on-line prediction scheme which decides whether to perform a remote access (as in traditional NUCA designs) or to perform a thread migration at the instruction level. For a set of parallel benchmarks, our thread migration predictor improves performance by 18% on average and at best by 2.3X over the standard NUCA design that only uses remote accesses.
12.30-13.00	Programming and Scheduling Model for Supporting Heterogeneous Accelerators in Linux
Tobias Beisel, Tobias Wiersema, Christian Plessl, and Andre Brinkmann
University of Paderborn, Germany
Abstract: Computer systems increasingly integrate heterogeneous computing elements like graphics processing units and specialized co-processors. The systematic programming and exploitation of such heterogeneous systems is still a subject of research. While many efforts address the programming of accelerators, scheduling heterogeneous systems, i.e., mapping parts of an application to accelerators at runtime, is still performed from within the applications. Moving the scheduling decisions into an external component would not only simplify application development, but also allow the operating system to make scheduling decisions using a global view. In this paper we present a generic scheduling model that can be used for systems using heterogeneous accelerators. To accomplish this generic scheduling, we introduce a scheduling component that provides queues for available accelerators, offers the possibility to take application-specific meta information into account, and allows for using different scheduling policies to map tasks to the queues of both accelerators and CPUs. Our additional programming model allows the user to integrate checkpoints into applications, which permits the preemption and, in particular, the subsequent migration of applications between accelerators. We have implemented this model as an extension to the current Linux scheduler and show that cooperative multitasking with time-sharing enabled by our approach is beneficial for heterogeneous systems.
13.00-14.30	Closing and Lunch

Important dates

Abstract submission deadline:	Oct. 21, 2011, 12PM EST (final)
Paper submission deadline:	Oct. 28, 2011, 12PM EST (final)
Notification to authors:	Nov. 21, 2011
Camera-ready submission deadline:	Dec. 28, 2011
HiPEAC Conference:	Jan. 23-25, 2012
Workshop:	Jan. 25, 2012

Paper submission

Papers submitted to CAOS 2012 must use the two-column, 10-pt font, IEEE conference proceedings format. A template for Microsoft Word and LaTeX can be downloaded from here. Submissions should be a maximum of eight (8) pages, excluding references and appendices.

The submission site is now open. Use EasyChair to submit both abstracts and papers.

Workshop Co-Chairs

Roberto Gioiosa	Barcelona Supercomputing Center	Spain	roberto.gioiosa[_at_]bsc.es
Omer Khan	MIT, CSAIL	USA	okhan[_at_]csail.mit.edu

Program committee

Buyuktosunoglu, Alper	(IBM T.J. Watson, USA)
Cesati, Marco	(University of Rome Tor Vergata, Italy)
Davis, Kei	(LANL, USA)
Etsion, Yoav	(Barcelona Supercomputing Center, Spain)
Falcon, Ayose	(Intel Barcelona Research Center, Spain)
Hempstead, Mark	(Drexel University, USA)
Holt, Jim	(Freescale, USA)
Koushanfar, Farinaz	(Rice University, USA)
Kursun, Eren	(IBM Research, USA)
Lang, Michael	(LANL, USA)
Miller, Jason	(MIT, USA)
Nikolopoulos, Dimitrios	(University of Crete, Greece)
Schirner, Gunar	(Northeastern University, USA)
Tumeo, Antonino	(PNNL, USA)
Wisniewski, Robert	(IBM Research, USA)

Webmaster: okhan[_at_]csail.mit.edu