09:00-10:00 Invited Talk

What Your Mother Never Taught You About Multicore Programming
by Richard L. Sites, Google, Inc.

10:00-11:00 Low-Power and Embedded Systems

Power Efficiency Evaluation of Commodity Processors on Earth Science Workloads
by Benjamin Mayer, National Center for Atmospheric Research

As high-performance computing systems often exceed tens of thousands of processors and require hundreds of kilowatts of power to operate, selecting power efficient processor architectures for target application workloads is becoming increasingly important. This paper evaluates the power and price efficiency of several current-generation power-aware and low-power microprocessor architectures for use with computational workloads typical at the U.S. National Center for Atmospheric Research.

The results show that the Intel Core i7 (Nehalem) provides the best power efficiency and shortest run time for the tested applications. With respect to the AMD Opteron's power management features, reducing the clock speed from 2.7 GHz to 2.0 GHz increases its power efficiency. Finally, our interest in constructing a power-efficient throughput supercomputer as a cluster of low-power embedded processors has been tempered by the power requirements of evaluation boards. While the processor itself has sufficiently low power to be attractive, its performance on our computationally-intensive codes results in low efficiency.
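
The comparison above hinges on energy-to-solution rather than raw power or raw runtime alone. The sketch below illustrates the arithmetic with hypothetical numbers (the figures are invented for illustration, not taken from the paper):

```python
# Illustrative only: comparing processors by energy-to-solution.
# All wattages and runtimes below are hypothetical, not measured values.

def energy_to_solution(avg_power_watts, runtime_seconds):
    """Energy consumed to complete the workload, in joules."""
    return avg_power_watts * runtime_seconds

# Hypothetical measurements for the same workload:
nehalem = energy_to_solution(avg_power_watts=200, runtime_seconds=1000)    # 200 kJ
opteron_27 = energy_to_solution(avg_power_watts=180, runtime_seconds=1400)  # 252 kJ
opteron_20 = energy_to_solution(avg_power_watts=140, runtime_seconds=1700)  # 238 kJ

# A faster chip can still win on energy, and a downclocked chip can beat
# its own full-speed configuration even though it runs longer.
assert nehalem < opteron_20 < opteron_27
```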

Design Issues in Hybrid Embedded Systems
by Irvin R. Jones Jr., US Air Force Academy

The scope of the embedded system design platform has expanded from the traditional system of a programmed microprocessor or microcontroller to the use of embedded cores, processors, and reconfigurable devices as components within the system. This advance in technology provides greater design flexibility: the system architect/designer can choose the mix of hardware and software components of a system and has more control over its computing architecture and organization. The resulting "hybrid embedded systems" are inherently concurrent and add complexity to the design of the timing, synchronization, and control mechanisms. We will discuss the hardware/software design tradeoffs; the design issues of timing, synchronization, and control; and how they are addressed in the design of hybrid embedded systems using FPGAs.

A Low-Power Dynamically Reconfigurable Nanophotonic-electric Network-on-Chip Architecture
by Shirish Bahirat, Colorado State University

Networks-on-chip (NoCs) are beginning to be used as scalable, reliable, and high-bandwidth on-chip communication fabrics in emerging chip multiprocessor (CMP) designs. However, their high power dissipation continues to be a bottleneck that must be addressed to sustain the evolution of multi-core computing. Several NoC prototypes have shown NoCs consuming a substantial portion of system power, e.g., 40% in the MIT RAW chip and 30% in the Intel 80-core teraflop chip. Recent studies have also suggested that electrical NoC power dissipation is much higher (by a factor of 10) than what is needed to meet tera- and petaflop performance levels. To overcome this power drawback, we propose a novel hybrid nanophotonic-electric NoC communication architecture called PHOTON. Photonic interconnect technology is of interest because it can be much more energy efficient than conventional copper (Cu) interconnects, especially at high speeds and long distances. In addition, the ability of photonic waveguides to carry many information channels simultaneously increases interconnect bandwidth density significantly, eliminating the need for a large number of wires to achieve adequate bandwidth. PHOTON couples one or more photonic ring paths with a traditional electrical mesh-based NoC fabric to significantly reduce on-chip communication power dissipation. A key feature of PHOTON is its support for dynamic reconfiguration to adapt to changing runtime traffic requirements and uncover further opportunities for reducing power dissipation. PHOTON supports runtime reconfiguration in three ways: (i) DVS/DFS, (ii) dynamic wavelength-division multiplexing, and (iii) a dynamic photonic region of influence (PRI). Experimental results on several CMP benchmarks indicate that PHOTON can reduce power dissipation by orders of magnitude and improve the energy-delay product compared to traditional electrical NoC architectures.

11:00-11:20 Break

11:20-12:00 Static Analysis

Mixing Type Checking and Symbolic Evaluation
by Bor-Yuh Evan Chang, University of Colorado at Boulder

Program analysis design is an exercise in trade-offs---between precision and efficiency. That is, for any program analyzer, there are programs where it cannot prove a property that actually holds (i.e., its abstraction is not precise enough) or where it does more work than necessary (i.e., its abstraction is more precise than needed). In this talk, I present a mixing of two very different kinds of program analyses: type checking and symbolic evaluation. The novel aspect of our proposal is that these analyses are applied independently on disjoint parts of a program in an off-the-shelf manner. Yet, the result of the mix is a sound analysis on the whole program that is more precise than type checking alone and more efficient than symbolic evaluation exclusively.

Joint work with Khoo Yit Phang and Jeffrey S. Foster (University of Maryland, College Park)

Generating and Analyzing Symbolic Traces of Simulink/Stateflow(tm) Models of Control Systems
by Sriram Sankaranarayanan, University of Colorado at Boulder

Simulink/Stateflow (tm) diagrams form a de-facto standard for the model-based development of control systems. We present a methodology and a toolkit for improving simulation coverage of Simulink/Stateflow models of hybrid systems using symbolic analysis of simulation traces. We propose a novel instrumentation scheme that allows the simulation engine of Simulink/Stateflow to output, along with the concrete simulation trace, the symbolic transformers needed for our analysis. Given a simulation trace, along with the symbolic transformers, our analysis computes a set of initial states that would lead to traces with the same sequence of discrete components at each step of the simulation.

Such an analysis relies critically on the use of convex polyhedra to represent sets of states. However, the exponential complexity of the polyhedral operations implies that the performance of the analysis would degrade rapidly with the increasing size of the model and the simulation traces. We propose a new representation, called the bounded vertex representation, which allows us to perform under-approximate computations while fixing the complexity of the representation a priori. Using this representation we achieve a trade-off between the complexity of the symbolic computation and the quality of the under-approximation.

Joint work with Rajeev Alur (University of Pennsylvania), Aditya Kanade (Indian Institute of Science), K.C. Sashidhar and S. Ramesh (General Motors Research, India), and Franjo Ivancic and Aarti Gupta (NEC Laboratories, Princeton).

12:00-01:00 Lunch

01:00-02:00 Madness

We encourage all participants to bring their wild and crazy ideas to present in the madness session. Each speaker will have approximately 5 minutes.

02:00-02:20 Break

02:20-03:20 High-Performance Computing

Modeling Ion Channel Kinetics with High-Performance Computation
by Allison Gehrke, University of Colorado at Denver

Performance improvements for computational sciences such as biology, physics, and chemistry are critically dependent on advances in multicore and GPU hardware. However, these emerging systems require substantial investment in software development time to migrate, optimize, and validate existing science models. Scientists often fear that a new implementation of their model will not compute the same expected results, or doubt that the new system will deliver its advertised level of performance. The focus of our study is to examine the step-by-step process of adapting new and existing computational biology models to multicores and GPUs. Our target application, Kingen, was developed to simulate AMPAR ion channel activity and to optimize kinetic model rate constants against biological data. Kingen uses a genetic algorithm to stochastically search the parameter space for global optima. Since each individual in the population describes a rate constant parameter set in the kinetic model and the model is evaluated for each individual, there is significant computational complexity and parallelism in even a simple model run.

We begin by demonstrating characteristics of Kingen through workload characterization and application profiling. Next, we examine a coarse-grained parallel implementation of the model, developed using Intel's Threading Building Blocks (TBB). On a multicore architecture, the TBB implementation achieves a ~15x reduction in run time (2-4 days with more demanding workloads, compared with >30 days sequentially with a simpler model). Preliminary analysis suggests there is even more parallelism to exploit that may be ideally suited for GPU acceleration. Finally, we explore issues in model validation and in our ability to reliably and efficiently port applications so that they scale on massively parallel architectures. Overall, improving the time-intensive optimization of kinetic models will accelerate discovery in neuroscience. Furthermore, the methodology for doing so will be applicable to a broad range of scientific applications, accelerating discovery across computational science.
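
The coarse-grained parallelism described above can be sketched in a few lines: each individual holds one candidate set of rate constants, and because fitness evaluations are independent, a pool can evaluate the whole population concurrently. The names here (`evaluate_model`, `ga_step`) are illustrative, not Kingen's actual API, and the real code uses Intel TBB in C++ rather than a Python pool:

```python
# Hypothetical sketch of one generation of a genetic algorithm with
# coarse-grained parallel fitness evaluation (not Kingen's real code).
import random
from concurrent.futures import ThreadPoolExecutor

def evaluate_model(rate_constants):
    # Stand-in for simulating the kinetic model with this parameter set
    # and scoring its fit against biological data (lower is better).
    return sum((k - 1.0) ** 2 for k in rate_constants)

def ga_step(population):
    # Independent fitness evaluations: the coarse-grained parallel region.
    with ThreadPoolExecutor() as pool:
        fitness = list(pool.map(evaluate_model, population))
    # Keep the better half, refill by mutating the survivors.
    ranked = [ind for _, ind in sorted(zip(fitness, population))]
    survivors = ranked[: len(ranked) // 2]
    children = [[k + random.gauss(0, 0.1) for k in ind] for ind in survivors]
    return survivors + children

random.seed(42)
population = [[random.uniform(0.0, 2.0) for _ in range(8)] for _ in range(16)]
population = ga_step(population)
```

For CPU-bound model evaluation, a process pool (or TBB's `parallel_for` in C++) would be the natural choice; the thread pool here just keeps the sketch short.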

GPU Parallelization of RNA Folding
by Guillaume Rizk, INRIA/Irisa

The prediction of RNA secondary structures is an important problem for biologists, but the underlying algorithm is computationally intensive, with O(n^3) complexity, which can become prohibitive when dealing with very large sequences or with large data sets of sequences. We investigate a parallelization on GPUs, which are claimed to provide better performance/price and performance/energy ratios than CPUs.
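
The abstract does not name the folding algorithm; production folding codes minimize thermodynamic free energy, but the classic Nussinov base-pair maximization recurrence below is a representative O(n^3) dynamic program with the same structure, and it shows where the parallelism comes from: all cells on one anti-diagonal depend only on shorter spans and are mutually independent.

```python
# Nussinov-style base-pair maximization: a representative O(n^3) RNA
# folding dynamic program (illustrative; not the code the talk describes).

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def nussinov(seq):
    n = len(seq)
    # dp[i][j] = max number of nested base pairs in seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):          # cells with the same span are
        for i in range(n - span):     # independent: the parallelism a
            j = i + span              # GPU implementation can exploit
            best = dp[i + 1][j]       # case: base i left unpaired
            for k in range(i + 1, j + 1):   # case: i pairs with k
                if (seq[i], seq[k]) in PAIRS:
                    left = dp[i + 1][k - 1] if k > i + 1 else 0
                    right = dp[k + 1][j] if k < j else 0
                    best = max(best, left + right + 1)
            dp[i][j] = best
    return dp[0][n - 1]

print(nussinov("GGGAAAUCC"))  # → 3 (G-C, G-C, G-U around the AAA loop)
```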

The main computation is a dynamic programming algorithm whose analysis reveals an easy way to extract parallelism. By also exploiting parallelism at the coarse-grained level, among several sequences, we are able to provide enough independent tasks to the GPU for most execution configurations. Our implementation faces two major GPU-specific issues. The first, "SIMD divergence", can lead to inefficient use of computational resources. The second is that complex memory access patterns can lead to low memory bandwidth. We tackle these issues by off-loading part of the divergent work to the CPU and through careful use of the GPU memory spaces: shared, constant, and texture memory.

Next, we developed a tiled approach with greater data reuse. It allows blocks of input data to be stored in on-chip shared memory, leading to a significant additional improvement. The implementation achieves up to a 40x speedup over the original sequential algorithm.

We expect that what works well on a GPU also works well on a CPU. In order to ensure that we compare best efforts on both platforms, we tiled the sequential program, and found that it indeed improved performance by increasing memory cache reuse and by allowing an efficient SIMD optimization.

We provide an exhaustive performance comparison between the GPU implementation, a fully optimized CPU code, the reference unoptimized code used by biologists, and a related OpenMP parallelization of the same application developed at Georgia Tech.

Introducing the Sparse Polyhedral Framework (SPF)
by Michelle Mills Strout, Colorado State University

Loops often dominate the execution time of applications. Various transformation frameworks have been developed to enable the automatic compile-time transformation of loops. Many of the existing models fit within the polyhedral framework and although they are quite powerful, they are restricted to compile-time transformations and loop bounds and memory accesses that are affine or can be approximated as affine. In this talk, I will present the Sparse Polyhedral Framework (SPF). The SPF builds on the polyhedral programming model, but is also capable of expressing and supporting the code generation for run-time reordering transformations implemented with inspector/executor strategies. I will then discuss the idea of abstractions for exposing transformation frameworks in performance programmer models.
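
The inspector/executor strategy mentioned above can be sketched as follows: the inspector examines the index arrays at run time (information unavailable to a compile-time-only framework) and builds a new iteration order, and the executor then runs the loop in that order. This is an illustrative hand-written sketch; SPF's point is to generate such code automatically.

```python
# Run-time reordering transformation, inspector/executor style
# (hypothetical example; SPF generates this kind of code).

def inspector(index):
    # Reorder iterations so accesses to `data` become consecutive,
    # improving locality for the sparse, non-affine access data[index[i]].
    return sorted(range(len(index)), key=lambda i: index[i])

def executor(data, index, order):
    total = 0.0
    for i in order:            # transformed loop: permuted iteration order
        total += data[index[i]]
    return total

data = [10.0, 20.0, 30.0, 40.0]
index = [3, 0, 3, 2, 1, 0]     # known only at run time
order = inspector(index)
print(executor(data, index, order))  # same sum as the original order
```

Since the loop body here is a reduction, any iteration order yields the same result; in general the inspector must also verify that the reordering preserves the loop's dependences.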

03:20-03:30 Break

03:30-04:10 Programming Languages Medley

A Calculus for Reflective Metaprogramming
by Weiyu Miao, University of Colorado at Boulder

C++ templates support reflective metaprogramming: they enable compile-time computation, customized code generation, computing over program metadata (such as types), and the checking of library-specific safety properties. Despite the power and expressiveness of C++, it is a far cry from the ideal metaprogramming language: compile-time computation is encoded in class templates, and errors residing in templates are reported at instantiation time, not definition time. Garcia introduces a foundational core calculus for reflective metaprogramming that addresses some of the shortcomings of C++ templates. His calculus can express reflective metaprogramming capabilities directly, including compile-time computation and type reflection. In the calculus, types are data to compile-time computations, so type expressions in a residual program may be the result of static computation. In this talk, we extend Garcia's calculus with several more features needed to provide the full power of C++ templates, including reflection over classes and user-defined safety checks. In addition, we present a modular type system that catches errors at definition time.

The Semantic Gap in Java Programs
by Devin Coughlin, University of Colorado at Boulder

The Java language specification imposes strict requirements on compilers and virtual machines: they must throw exceptions when array accesses are out of bounds, they must evaluate arguments to functions from left to right, they must preserve iteration order in for-each loops, etc. This talk will present work investigating to what extent real programs rely on these requirements and discuss the gap between the language's specification and the programmer's intent.

04:10-04:30 Break

04:30-05:30 System Simulation and Analysis

Large Program Trace Analysis and Compression with ZDDs
by Graham Price, University of Colorado at Boulder

Prior work has shown that reduced ordered binary decision diagrams (BDDs) can be a powerful tool for program trace analysis and visualization. Unfortunately, it can take hours or days to encode large traces as BDDs. Further, techniques used to improve BDD performance are inapplicable to large dynamic program traces. In this presentation I will explore the use of zero-suppressed decision diagrams (ZDDs) for compressing dynamic trace data. Prior work has shown that ZDDs can represent sparse data sets with less memory than BDDs. This work demonstrates that (1) ZDDs do indeed provide greater compression for sets of dynamic traces (25% smaller than BDDs on average), (2) with proper tuning, ZDDs encode sets of dynamic trace data over 9 times faster than BDDs, and (3) ZDDs can be used for all prior applications of BDDs in trace analysis and visualization.
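
A toy sketch can show why ZDDs suit sparse set families such as trace data: the zero-suppression rule simply drops any node whose hi-branch is the empty family, so absent elements cost nothing. Real ZDD packages hash-cons nodes into a shared DAG and memoize operations; this illustrative version uses plain tuples.

```python
# Minimal zero-suppressed decision diagram (ZDD) sketch (illustrative,
# not the talk's implementation).

ZERO, ONE = "zero", "one"      # empty family / family containing only {}

def node(var, lo, hi):
    if hi == ZERO:             # the zero-suppression rule
        return lo
    return (var, lo, hi)

def singleton(s):
    """ZDD for the family containing exactly the set s."""
    f = ONE
    for v in sorted(s, reverse=True):   # smallest variable at the root
        f = node(v, ZERO, f)
    return f

def union(a, b):
    if a == ZERO: return b
    if b == ZERO: return a
    if a == b: return a
    if a == ONE: return (b[0], union(ONE, b[1]), b[2])
    if b == ONE: return (a[0], union(a[1], ONE), a[2])
    if a[0] == b[0]:
        return node(a[0], union(a[1], b[1]), union(a[2], b[2]))
    if a[0] < b[0]:
        return node(a[0], union(a[1], b), a[2])
    return node(b[0], union(a, b[1]), b[2])

def count(f):
    """Number of sets in the family."""
    if f == ZERO: return 0
    if f == ONE: return 1
    return count(f[1]) + count(f[2])

family = ZERO
for trace_set in [{1, 2}, {1, 3}, {2, 3}]:
    family = union(family, singleton(trace_set))
print(count(family))  # → 3
```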

Topological Entropy as a Metric for Computer System Performance
by Zach Alexander, University of Colorado at Boulder

Topological entropy is a number that measures the complexity of a dynamical system. In particular, positive topological entropy is indicative of chaos and a system with higher topological entropy can be thought of as 'more chaotic'. We use this metric to distinguish differences in performance across multiple software and hardware configurations. We run two simple programs on two different processors and then again using SimpleScalar. In each case, we measure the average number of cache misses per cycle. We find that the topological entropy correctly identifies differences in each combination of software and hardware, most notably when SimpleScalar is configured to simulate one of the two processors. This indicates that useful information can be obtained by viewing computer system performance as a dynamical system, and suggests the development of further topological methods to characterize the dynamics of computer systems.
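
One standard way to estimate topological entropy from a measured symbol sequence (e.g. cache-miss counts quantized into symbols) is via the growth rate of distinct length-n blocks, h ≈ log N(n) / n, which stays near zero for periodic signals and approaches log of the alphabet size for fully chaotic ones. The sketch below uses this block-counting estimator as an illustration; it is not necessarily the authors' exact method.

```python
# Block-entropy estimate of topological entropy from a symbol sequence
# (illustrative estimator, hypothetical data).
import random
from math import log

def block_entropy_estimate(symbols, n):
    """h ≈ log N(n) / n, where N(n) counts distinct length-n blocks."""
    blocks = {tuple(symbols[i : i + n]) for i in range(len(symbols) - n + 1)}
    return log(len(blocks)) / n

periodic = [0, 1] * 500                  # period-2 signal: entropy near 0
random.seed(0)
noisy = [random.randint(0, 1) for _ in range(1000)]

print(block_entropy_estimate(periodic, 8))   # near zero
print(block_entropy_estimate(noisy, 8))      # near log 2 ≈ 0.69
```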

Scalable Simulation of Complex Network Routing Policies
by Andy Stone, Colorado State University

Network simulation is useful for examining how policy and topology affect routing. Modern routing protocols implement complex policies that take more into account than just path length. Current simulators are limited to either working with simple hard-coded policies or working on small networks (1000 nodes or fewer). The routing algebra meta-language (RAML) is a formal framework, reminiscent of data-flow analysis frameworks, for specifying network protocols with semiring algebras. We are developing a C++ library implementation of RAML to define new routing protocols. A direct implementation of a routing simulation has scaling issues due to the O(n^2) memory requirements for routing tables. However, due to properties of routing algebras specified in RAML, we are able to segment the simulation problem into multiple runs, propagating route information for subsets of the network on each run. This strategy enables us to perform a simulation that does not exceed system memory, and the segmentation also exposes tasks that can be run in parallel. In this presentation I will discuss our policy implementation framework and its similarities to data-flow analysis, describe how simulation with policies expressed in the framework works, and outline how segmentation overcomes the memory hurdles.
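
The semiring idea behind routing algebras can be sketched briefly: the algebra supplies how weights combine along a path (extend) and how a router chooses among candidate routes (summarize), and swapping the algebra changes the policy without touching the propagation loop. This is an illustrative sketch of the general concept, not RAML's API:

```python
# Generalized Bellman-Ford parameterized by a routing algebra
# (plus = choose among routes, extend = combine along a path).

def bellman_ford(nodes, edges, source, plus, extend, zero, one):
    dist = {v: zero for v in nodes}
    dist[source] = one
    for _ in range(len(nodes) - 1):        # propagate until convergence
        for u, v, w in edges:
            dist[v] = plus(dist[v], extend(dist[u], w))
    return dist

nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 1), ("a", "c", 4), ("b", "c", 2), ("c", "d", 1)]

# Shortest-path algebra: (min, +) with identities (inf, 0).
inf = float("inf")
sp = bellman_ford(nodes, edges, "a", min, lambda x, w: x + w, inf, 0)

# Widest-path (max-bandwidth) algebra: (max, min) with identities (0, inf).
wp = bellman_ford(nodes, edges, "a", max, lambda x, w: min(x, w), 0, inf)

print(sp)  # shortest distances from "a"
print(wp)  # best bottleneck bandwidths from "a"
```

Real policies (e.g. BGP-like local preference with path filtering) need richer algebras than these two, which is exactly the gap a meta-language for routing algebras targets.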