Network of Excellence on High Performance and Embedded Architecture and Compilation

Welcome to the Autumn Computing Systems Week, 21-23 September 2015, Milano, Italy

The 11th International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC 2016), 18-20 January 2016, Prague, Czech Republic

Follow us on LinkedIn! hipeac.net/linkedin
I hope you have enjoyed relaxing summer holidays and that you are ready to tackle the challenges of the coming year. This summer, the international news was overshadowed by the Greek financial crisis and the refugee crisis in the Mediterranean. Europe seems to have difficulty finding adequate answers to these problems. The two problems have much in common: they affect a large part of Europe, they are about the future of real people, they won’t go away in the short term, and at a more philosophical level, they raise the question of how we want to organize Europe in the future. I don’t know the answers, but it is clear that these problems need answers. Not acting is not an option.

In July, many of us enjoyed the yearly ACACES summer school, one of the flagship activities of the network. Again, the appreciation scores for this year’s edition were very high, in fact the second highest ever. I would like to thank the 188 participants for attending and contributing to the summer school, and the instructors and keynote speakers for their excellent service to the HiPEAC community.

The next event we are organizing is the Autumn Computing Systems Week in Milano. The event is a great networking opportunity for the community, and I am happy to see that an increasing number of European projects are co-locating their meetings with the Computing Systems Weeks. HiPEAC encourages this trend because it wants to be a meeting platform for publicly funded research projects too.

Then, there is of course the yearly HiPEAC conference, in beautiful Prague in January. This time, we have a record number of 40 co-located events, making the conference by far the largest networking event for the computing systems community in Europe. New this year is that we will organize a number of communication and recruitment activities throughout the conference. These activities aim at increasing the impact of European research efforts.

I learned this summer that our proposal for HiPEAC4 was selected for funding. This means that HiPEAC will continue for another two years. The activities will largely remain the same, but their focus will gradually shift to helping projects and researchers to increase the impact of their work. The new project will start in January 2016. We are very committed to making it at least as successful as the previous incarnations of HiPEAC.

Take care,
Koen De Bosschere
MESSAGE FROM THE PROJECT OFFICER

While most people across Europe were enjoying their holidays, an IT system running in an obscure server farm of the European Commission sent out an automatically generated email to inform us that the HiPEAC proposal had been “retained” after the last round of evaluations for call ICT-04-2015. In plain English, this means that we are moving towards a “HiPEAC 4” project, and that we should start planning for the coming years.

The future of the HiPEAC community looks interesting and very, very challenging. Recent technological developments can have an enormous impact on computing. To mention just one example, the introduction of 3D V-NAND Flash memory will make terabytes of data available with very low latency; this is likely to significantly change the design of the software stack for many applications, even redefining what we mean by “file” or “database”. And this is certainly not the only disruptive technology that will become practically usable in the coming months.

Mastering these new technologies is becoming more and more important for many economic activities. For this reason, the European Commission wants to support the creation of platforms, in order to “ensure that all industrial sectors make the best use of new technologies and manage their transition towards higher value digitised products and processes” (Commissioner Oettinger, speech at Hannover Fair 2015).

This is a very important objective: the Europe that we want, where everybody can find a job, everybody can make use of high-quality public services and enjoy the freedom we are used to, needs a strong industry which creates value (and jobs!) through innovation and quality of products and services.

Commissioner Oettinger stated this very clearly when he proposed the outline of a strategy for the digitisation of European industry. One of the pillars of this strategy is the “Leadership in next generation open and interoperable digital platforms”, which will be a primary objective for the coming years; this means that the future research and innovation workprogrammes, as well as many other European initiatives, will be based on this “platform” concept. And here, the HiPEAC community can play a very important role.

A platform is something that is well known in the computing world: a part of it is what we describe as architecture, middleware, API, blueprint, standard, template, or with a plethora of similar concepts. All these things are certainly necessary in a platform, but not sufficient: to be useful, a platform must have a real economic value, attract several actors, and allow value creation on top of it. The various “app stores” which we all know are good examples of platforms, but they are also very limited: what industry needs most are platforms for industrial applications, cyber-physical systems, connected vehicles, transportation, avionics, factories of the future, internet of things… in other words the business-critical applications which the HiPEAC community knows very well. These platforms will allow European industry to strengthen its leadership and to create value. To build these platforms, the HiPEAC community will have to engage even more than today with European industry in all possible application areas, in order to understand what is really needed on the market and which technologies are the most important for real-life applications.

There are interesting years ahead, and I believe that the HiPEAC community will certainly be at the core of the technological platforms which will shape our future.

References to speeches of Commissioner Oettinger:

Sandro D’Elia
I could almost call it a personal tradition by now: attending the ACACES summer school in July. For the third time, I travelled to Fiuggi, Italy, and enjoyed an amazing week of lectures, discussions, and Italian cuisine. For me, the summer school is the perfect opportunity for connecting with other peers, following up on their work, and exchanging research ideas.

Every year, prominent figures from academia and industry are invited to hold lectures about their research. The topics cover a broad range of interests, from computer architecture and compiler design to entrepreneurship. For example, Fabrice Rastello’s lecture about “SSA based Compiler Design” was of particular interest to me this year, as my PhD focuses on compilers and code generation. Another particularly engaging lecture was given by Koen Bertels. He shared his expertise and insights on launching three start-ups, discussing the ‘when’ and ‘how’ of starting one’s own company.

Even though the lectures last the entire week, they are not my main reason for attending ACACES. For me, the week at Fiuggi is more about the other participants and the unique atmosphere found there. Each year, ca. 150-200 people take part in the summer school. One meets fellow PhD students from other universities, as well as engineers from major IT companies. All of these attendees come together to present their research, share new ideas, or gain valuable feedback and inspiration for their own work. This can be done either within the setting of the official poster session, or, as preferred by many, throughout the numerous coffee breaks and meals. This is what makes ACACES so attractive to me: it brings together like-minded people in a unique environment, giving them ample opportunity to exchange ideas and establish collaborations.

However, ACACES is by no means all work and no play. Evenings are normally free, and people take the chance to explore Fiuggi or relax in the spa or by the pool. The offered meals provide a tasty introduction to Italian cuisine, with people going to town for a beer or wine afterwards.

All in all, ACACES is a rich intellectual, professional and cultural experience. It may have just ended, but I’m already looking forward to next year!

Nico Reissmann, Norwegian University of Science and Technology (NTNU)

A HiPEAC Workshop on “Building Partnership” was organized at the Faculty of Electrical Engineering, Budapest University of Technology and Economics (BME), on June 22, 2015. The workshop was intended to offer a face-to-face meeting opportunity for HiPEAC representatives, the local HiPEAC members and other professionals interested in cooperation possibilities with the EU HiPEAC Network of Excellence. The participants were welcomed by Prof. Gábor Péceli, Rector of BME, who gave a concise introduction to the university and the main research topics related to HiPEAC. During the first session Koen De Bosschere and Rainer Leupers presented the HiPEAC Network and its key activities, as well as some advanced projects on embedded security and energy efficient processor designs.

The presentations from local HiPEAC members were related to Cyber-Physical Systems, Heterogeneous Architectures, Many-Core Processors and Services in the Cloud. Together, the members gave a good representation of the Hungarian academic community and of industrial partners from both SMEs and multinational R&D centers.

The afternoon sessions were dedicated to young researchers and PhD students, who presented their research work in multiple areas of embedded systems design and optimization, especially FPGA-based reconfigurable computing, HIL simulation, and mobile robot navigation.

The program ended with a short discussion about the possible involvement of participants and other Hungarian professionals in future HiPEAC activities.

Fehér Béla, Budapest University of Technology and Economics
BOOK: WORKLOAD MODELING FOR COMPUTER SYSTEMS PERFORMANCE EVALUATION

“It’s the workload, stupid!”

The workload used in a performance evaluation study is no less important than the evaluation methodology. But until now there was no comprehensive source of information on workload modeling and characterization. This book fills this gap. Moreover, it emphasizes the intuition and reasoning behind the definitions and derivations of workload models, rather than the abstract mathematics, with hundreds of graphs from real workload data. The included information will help readers to analyze collected workload data and clean it if necessary, to derive statistical models that include skewed marginal distributions and correlations, and to consider the need for generative models and feedback from the system.

http://www.cs.huji.ac.il/~feit/wlmod/
Dror Feitelson, Hebrew University of Jerusalem

INTERNATIONAL WORKSHOP ON SOLUTIONS FOR MULTICORE DEBUG (SMD 2015)

MOTIVATION
Multicore processors and systems-on-chip have become predominant in all computing domains. The development of novel architectures, programming models, tools and compilers is extensively addressed in both research and industry. However, debugging, diagnosis, and validation of software/hardware systems have not yet received the corresponding level of attention in the multicore age, and they still seem to be an afterthought. With ever-increasing complexity and multicore-specific effects and bugs, classical debug approaches such as breakpointing and tracing have reached their limits. System developers face limited observability within SoC platforms, due to platform heterogeneity, skyrocketing software complexity, and upcoming manycore systems with hundreds of integrated processing elements. These challenges demand radically new debug approaches, methods and tools.

GOALS AND TOPICS
The workshop is aimed at discussing engineering requirements, upcoming issues and innovative, maybe unconventional approaches related to all aspects of multicore system and application debugging. Position statements, industry needs and experiences, and research presentations by experts from industry and academia will provide the required background for fruitful discussions and follow-up activities in the area. The following topics are of particular interest:

- Debugging of complete systems including low level and application software as well as hardware to meet functional and non-functional requirements
- Debugging multicore/manycore-specific problems (e.g. races)
- Reduction of debug complexity by e.g. increasing software abstraction or incremental software development
- Novel, scalable debugging tools and methodologies for multicores/manycores
- Hardware support for software debugging
- Debug for certification
- Debugging software for timing errors
- Debugging model-based software
- Debug-relevant standardization efforts

WORKSHOP ORGANIZERS

Chairs
Rainer Leupers (RWTH Aachen)
Andreas Herkersdorf (TU Munich)
Albrecht Mayer (Infineon)

Steering Committee
Adam Morawiec (ECSI)
Philipp Wagner (TU Munich)
Luis Gabriel Murillo (RWTH Aachen)
For questions and registrations please contact: smd@ecsi.org

http://www.ecsi.org/smd2015

Rainer Leupers, RWTH Aachen University
WHERE ARE THE GIRLS?

You may know that one of the foundations of the European Union is diversity. For example, we enjoy diversity in the workforce, because building Europe is difficult, so very diverse skills and experiences are needed. At a more general level, we enjoy having different languages, different traditions, different cultures, and even different types of food (even if sometimes I think that standardising on Italian cuisine would be a very good idea ... ;-) )

So, the question is: why do I find myself so often in an all-male environment at work? It happens in many activities we do at the European Commission: proposal evaluations, project reviews, workshops, internal meetings, conferences, and seminars. It reminds me of my boys-only primary school where only the teacher was a lady, but that was many, many years ago and I would never have thought of living a similar experience in 2015.

In the European Commission we experience a serious shortage of female experts, with a very low percentage of women in our database; this has an impact on all the activities for which we hire external experts from the Participant Portal database. A few examples are: proposal evaluation, project monitoring, report writing, and assistance in policy drafting.

**THIS SHORT NOTE IS AN INVITATION TO ALL BRIGHT AND SMART WOMEN, WHO HAVE A HIGH LEVEL OF EXPERTISE IN COMPUTING, TO REGISTER IN THE PARTICIPANT PORTAL AS EXPERTS.**

We need you: we always try to get a reasonable mix of skills, experiences and affiliations in the teams of experts that work with us, but we also need to get a reasonable mix of genders, which should not mean 90% men and 10% women. Working as an expert for the European Commission may not be the best way to become a millionaire, but it will certainly allow you to meet smart people and to be in touch with the latest ideas and trends in research and innovation in your domain.

So, please, think about it: if you are a woman and you have a high level of expertise in computing, go to http://ec.europa.eu/research/participants/portal/desktop/en/experts/ and register your candidature. In order to be easily found, you should use as “Specialist field” the option “Computer and information science” and as “Predefined theme” the option “Information and communication technologies”. Also, please enter a few relevant open keywords, like “advanced computing”, “low power computing”, “embedded computing”, “real time systems” or whatever describes your real expertise.

We will be very happy to consider your candidature the next time we need the help of external experts.

Sandro D’Elia, European Commission

---

GROMACS 5.1 IS OUT WITH OPENCL SUPPORT

Molecular dynamics software GROMACS now also works on AMD hardware via OpenCL.

HiPEAC member StreamComputing has helped GROMACS by porting their code to OpenCL. The result is that the software now achieves competitive performance on AMD hardware. The actual port took 4.5 months, and it has now been released after rigorous testing.

GROMACS is a software package used by many research groups worldwide for doing simulations in chemistry, bio-informatics and physics. Up until now, only CUDA support has been available, leaving out more than half of the available hardware. The OpenCL code is open source and we hope that others will help out by continuing to add OpenCL support in GROMACS for Intel GPUs, Xeon Phi and other OpenCL devices. The goal of the port was to show StreamComputing’s expertise to the world, while also helping the open source community.


Vincent Hindriksen, StreamComputing
**HiPEAC START-UP NEWS**

**SILEXICA ANNOUNCES VERSION 2015.06 OF MULTICORE PROGRAMMING TOOL SUITE**

New release of SLX product line provides major improvements in user interface and programming flow automation

Silexica, the leading provider of embedded multicore programming solutions, announced the availability of the latest release 2015.06 of its SLX compiler tool suite on June 17, 2015.

"Many of our customers demanded an even easier-to-use software task mapping facility," explained Silexica’s CEO Maximilian Odendahl. "With SLX Mapper’s novel user interface, numerous different task mapping options can be conveniently explored, leading to optimal parallel software distribution in a very short time, even for complex, heterogeneous SoCs." "Moreover," Weihua Sheng, CTO of Silexica, added, "next to many smaller tool improvements, we also integrated major new features into SLX Explorer. These enable HW/SW system architects to efficiently co-optimize the target multicore architecture, SW task allocation, and task scheduling policies, altogether within a unified IDE cockpit."

The SLX tool suite comprises four unique software development tools that can be licensed either as a bundle or separately:

- SLX Parallelizer - automated parallelization of sequential legacy C code
- SLX Mapper - parallel task-to-processor mapping and software distribution
- SLX Generator - automatic native target C code generation, including inter-task communication code
- SLX Explorer - fast early multicore hardware platform selection and optimization

As a fast-growing start-up company in the multicore software business, Silexica has recently started marketing and promotion campaigns in Europe, North America and Asia. Initial customer feedback has been highly encouraging. A sales representative in Japan was hired earlier this year, and multiple additional worldwide distribution channels are currently under preparation.

With the announcement of the new SLX 2015.06 tool suite, an evaluation license is available to customers upon request. Please contact info@silexica.com.

http://www.silexica.com/

Rainer Leupers, Chief Scientist, Silexica Software Solutions GmbH

---

**SPARSITY TECHNOLOGIES, HIGHEST INNOVATION CAPACITY EUROPEAN SME**

The first Innovation Radar Report reviews the innovation potential of ICT projects funded under the 7th Framework Programme and has pinpointed Sparsity Technologies, a HiPEAC member, as the SME with the highest innovation capacity in Europe.

Sparsity Technologies (www.sparsity-technologies.com), a Barcelona-based spin-out from the Universitat Politècnica de Catalunya, focuses on the management of Graphs from the technology and applications point of view. The work of the company, in deep collaboration with the DAMA-UPC research group, concentrates on high-end graph technologies, including Sparksee, its generic graph database, and other highly-focused technologies based on graph analysis in the areas of social network analysis, knowledge management, recommendation and routing, all of which allow third party industry to improve their software products and services.

The Innovation Radar Report (see http://sciencebusiness.net/news/77125/EU-Commission-releases-list-of-top-10-most-innovative-SMEs-in-ICT) has selected Sparsity as the SME with the highest innovation capacity in Europe. In its global ranking, which includes larger companies and research institutions, Sparsity was ranked in third position, just after the University of Cambridge and Fraunhofer.

Sparsity has been evaluated for two very significant projects, LDBC (www.ldbcouncil.org) and CoherentPaaS (www.coherent-paas.eu), and the company also participates in other research and development FP7 projects including Tetracom (www.tetracom.eu) and frontierCities (www.fi-frontiercities.eu) and H2020 projects like SOMATCH (www.somatch.eu) and IT2Rail (www.it2rail.eu).

LDBC has created a suite of benchmarks and a benchmark council for Graph and RDF technologies. The software released by LDBC is Open Source and the effort is public and open to the participation of all companies, research institutions and individuals in the area. The council now comprises prominent European companies in the area of Graph and RDF technologies, such as OpenLink Software, Ontotext and Neo4j, and larger US companies like Oracle and IBM, as well as academic institutions and individuals.

CoherentPaaS focuses on the creation of a high performance unified transactional layer for different flavours of database management solutions, among which are Sparksee, MonetDB and MongoDB. The project is also focusing on the creation of a single point of query entry for the different systems and a whole suite of optimization and analysis tools. The project leader is Universidad Politécnica de Madrid.

The other projects in which Sparsity Technologies is involved are Tetracom, where Sparsity is producing research on Community Search and Knowledge Management; frontierCities, where Sparsity is creating a platform for the management of mobility policies by the city in collaboration with Turisme de Barcelona; SOMATCH, where Sparsity is providing its Social Network Analytics capacity to predict the trends in fashion and IT2Rail, where Sparsity is providing technology to create the IT infrastructure of the future for the European Railway system.

http://www.sparsity-technologies.com/

Josep Lluis Larriba-Pey, UPC BarcelonaTech

INNOVATIVE EMBEDDED ARCHITECTURE FOR HPC MADE IN EUROPE

Early in February, Barcelona Supercomputing Center (BSC) successfully deployed the Mont-Blanc prototype. After three years of intensive research effort, the team installed a two-rack prototype, which is now available to the Mont-Blanc consortium partners. This has been a formidable challenge as it is the first time that a large HPC system based on mobile embedded technology has been deployed and made fully operational to a scientific community composed of scientists at six of the most important research centres in Europe.

Installed in the Torre Girona chapel, the Mont-Blanc prototype is made up of a total of two racks containing 8 standard BullX chassis, 72 compute blades fitting 1,080 compute cards, for a total of 2,160 CPUs and 1,080 GPUs. The heterogeneous architecture of the Mont-Blanc prototype takes advantage of computing elements (CPUs and GPUs) developed by ARM and integrated by BULL under the design guidance of all Mont-Blanc partners.
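The figures above imply a regular layout. As a quick sanity check, and assuming that blades, compute cards and processors are spread evenly across the racks (an assumption; the article does not state this), the per-unit counts can be derived as follows:

```python
# Totals quoted in the article for the Mont-Blanc prototype.
chassis = 8
blades = 72
cards = 1080
cpus = 2160
gpus = 1080

# Assumption (not stated in the article): components are distributed evenly.
blades_per_chassis = blades // chassis  # 9 blades in each chassis
cards_per_blade = cards // blades       # 15 compute cards on each blade
cpus_per_card = cpus // cards           # 2 CPUs on each compute card
gpus_per_card = gpus // cards           # 1 GPU on each compute card

print(blades_per_chassis, cards_per_blade, cpus_per_card, gpus_per_card)
```

All four divisions are exact, which is consistent with a uniform configuration of the two racks.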

"After the installation of the prototype the next steps will be an intensive evaluation of the scientific applications in terms of their performance and scalability, measuring power consumption under different types of workloads and opening up access to industries interested in testing it", says Filippo Mantovani, coordinator of the Mont-Blanc project.

The Mont-Blanc partners and industrial members of the End User Group will now take full advantage of this hardware. The Mont-Blanc prototype offers the same application development and tuning environment that is available on a standard supercomputer, with an extensive and stable software ecosystem developed by the partners, including software support for embedded hardware components, scientific libraries, debugging and performance analysis tools, and support for the most used programming languages and parallel programming models. Besides the standard use, the users can take advantage of an advanced power monitoring tool that allows researchers to monitor the application’s power consumption. This is a fundamental tool in order to carefully measure the power efficiency of the prototype.

“Now the challenge starts”, says Filippo Mantovani, “because with this platform we can foresee how inexpensive technologies from the mobile market can be leveraged for traditional scientific high-performance workloads at a much lower total cost of ownership than state-of-the-art supercomputers.”

The work of the Mont-Blanc project will continue. The team will focus its efforts on developing the OmpSs parallel programming model further to automatically exploit multiple cluster nodes, on transparent application checkpointing and application-based techniques for fault tolerance, on support for ARMv8 64-bit processors, and on the initial study of the Mont-Blanc Exascale architecture.

“It is important to remember that this is all possible thanks to the synergy amongst the industrial and academic partners that joined together to address technological and multidisciplinary scientific challenges with the support of the European Commission”, says Dr. Mantovani.
HiPEAC NEWS

A HETEROGENEOUS EXECUTION ENGINE FOR LLVM

Adding Heterogeneity Support to LLVM

Hexe, which stands for Heterogeneous Execution Engine, is a new compiler component that integrates with the LLVM infrastructure. It targets efficient computation on heterogeneous platforms by allowing automatic offloading of workloads to computational accelerators, such as Graphics Processing Units (GPUs) or Digital Signal Processors (DSPs).

The workloads we consider for offloading are either explicitly annotated by the programmer or automatically detected by static compiler analysis and runtime checks. Our infrastructure operates at the level of the LLVM intermediate representation and it effectively supports multiple source languages.

Hexe consists of a set of compiler passes and a runtime environment. The compiler passes perform the required code analysis and transformations to enable workload offloading. The runtime environment manages data transfers and synchronization operations, and performs dynamic workload scheduling.

We consider a diverse set of heterogeneous systems ranging from mobile devices equipped with ARM-based multi-core CPUs, embedded GPUs and DSPs, to data center nodes consisting of x86 multi-cores and high-end GPUs. Hexe has a modular design where new accelerator types and programming environments can be supported via a plugin interface. We also consider interoperability between Hexe and modern JIT technologies, such as LLVM MCJIT.
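To make the idea of a plugin interface combined with dynamic workload scheduling more concrete, here is a minimal sketch in Python. It is not Hexe's actual API: the names `Accelerator`, `Runtime`, `register` and `schedule`, the simple cost model and all numbers are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Accelerator:
    """Hypothetical plugin describing one device type (not Hexe's API)."""
    name: str
    launch_overhead_s: float  # assumed fixed cost: data transfer + launch
    throughput: float         # assumed work items processed per second

    def estimated_time(self, work_items: int) -> float:
        # Toy cost model: fixed overhead plus time proportional to the work.
        return self.launch_overhead_s + work_items / self.throughput


class Runtime:
    """Hypothetical runtime that selects a device for each workload."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Accelerator] = {}

    def register(self, plugin: Accelerator) -> None:
        # New device types are added without changing the scheduler.
        self._plugins[plugin.name] = plugin

    def schedule(self, work_items: int) -> str:
        # Choose the device with the lowest estimated completion time.
        best = min(self._plugins.values(),
                   key=lambda p: p.estimated_time(work_items))
        return best.name


runtime = Runtime()
runtime.register(Accelerator("cpu", launch_overhead_s=0.0, throughput=1e6))
runtime.register(Accelerator("gpu", launch_overhead_s=0.01, throughput=1e8))

small = runtime.schedule(1_000)        # overhead dominates, stays on the CPU
large = runtime.schedule(10_000_000)   # throughput dominates, goes to the GPU
print(small, large)
```

With these illustrative numbers, the small workload stays on the CPU because the offloading overhead outweighs the gain, while the large one is sent to the GPU. A real runtime would also have to handle data transfers and synchronization, which this sketch leaves out.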

Experiments on workload offloading for benchmarks and real world applications indicate significant performance improvements. We evaluate Hexe on modern x86 cluster nodes equipped with high-end NVIDIA and AMD GPUs and mobile development boards.

Hexe development started as an internship project at the Qualcomm Innovation Center in Silicon Valley. The project is still under development and code patches have been released.

Chris Margiolas, The University of Edinburgh

UNIVERSITY OF KAISERSLAUTERN RELEASES DRAMSPEC IN COOPERATION WITH ARM

DRAMSpec is an open source DRAM current and timing generator, which generates the datasheet values for current and future DRAM chips.

In systems ranging from mobile devices to servers, DRAM has a big impact on performance and contributes a significant part of the total power consumption. The performance and power of the system depend on the architecture of the DRAM chip, the design of the memory controller and the access patterns received by the memory controller. We introduce DRAMSpec, an open source high-level DRAM bank modeling tool. As our major contribution, we move the DRAM modeling abstraction level from the detailed circuit level to the DRAM bank level. Through integration into full-system simulators like gem5, we allow system or processor designers (non-DRAM experts) to tune future DRAM architectures for their target applications and use cases.

DRAMSpec was developed in cooperation with ARM and was presented at the IEEE International Conference on Embedded Computer Systems Architectures Modeling and Simulation (SAMOS15), Samos Island, Greece.

http://www.uni-kl.de/3d-dram/tools/dramspec/

Omar Naji, Christian Weis, Matthias Jung, Norbert Wehn (University of Kaiserslautern) and Andreas Hansson (ARM Ltd.)

**MATEO VALERO RECEIVES THE EURO-PAR ACHIEVEMENT AWARD 2015**

Mateo Valero has received the Euro-Par Achievement Award 2015. Euro-Par is a major series of European conferences on parallel processing. Since 2008, the Euro-Par Steering Committee has conferred the Euro-Par Achievement Award at the yearly Euro-Par conference on individuals who have made special contributions to parallel processing. The 2015 edition of Euro-Par was held in Vienna, Austria, from August 24 to 28. It was hosted at the Vienna University of Technology and was organized by the Research Group for Parallel Computing.

Recipients must both have made an extraordinary impact on parallel processing and have contributed significantly to the Euro-Par conference series. In his address, Prof. Christian Lengauer, Chair of the Euro-Par Steering Committee, described Valero as the most prominent computer scientist in Spain and the most prominent academic computer architect in Europe and emphasized “his over 400 invited talks and over 800 academic descendants as an impact unequalled. He has been involved in the organization of more than 300 international conferences”.

Lengauer highlighted Valero’s allegiance to Euro-Par through his seminal membership of the steering committee in the mid-nineties and his co-authorship of a dozen Euro-Par papers. Most importantly, though, he stressed Valero’s pleasant and disarming disposition and way of interacting with people.

Mateo Valero expressed his gratitude to the steering committee for the award, which he “collected on behalf of my collaborators and doctoral students”. In his address, Valero said he had tried to follow the advice of the people who are closest to him, and that academically he had followed the instructions of his mentor Tomas Lang: “hire people better than you”. Valero also stressed the importance of giving back to the community, which is why he says he always makes an effort to participate in talks and conferences.

Past recipients are:
• 2014: Henri E. Bal (Vrije Universiteit Amsterdam, The Netherlands)
• 2013: Arndt Bode (TU Munich, Germany)
• 2012: Barbara Chapman (University of Houston, U.S.A.)
• 2011: Michel Cosnard (INRIA, France)
• 2010: Jack Dongarra (University of Tennessee / Oak Ridge National Laboratory, U.S.A.)
• 2009: Paul Feautrier (ENS Lyon, France)
• 2008: Ron Perrott (Oxford e-Research Centre, U.K.)

About Euro-Par
Euro-Par is the prime European conference covering all aspects of parallel and distributed processing, ranging from theory to practice, from small to the largest parallel and distributed systems and infrastructures, from fundamental computational problems to full-fledged applications, from architecture, compiler, language and interface design and implementation to tools, support infrastructures, and application performance aspects. Euro-Par’s organization into topics provides an excellent forum for focused technical discussion, as well as interaction with a large, broad and diverse audience.

**COLLABORATION REPORT: AMIR H. ASHOURI**

I am a third year PhD student at Politecnico Di Milano. I have been working on my PhD under the advice of Cristina Silvano and Gianluca Palermo. For the past six months, I was privileged to be awarded the HiPEAC PhD short-term visiting grant for my proposal and I have been working under the supervision of John Cavazos at University of Delaware, USA. John Cavazos has been applying machine learning to hard problems such as compiler auto-tuning and high-performance computing. During my stay, I attended the 26th annual International Conference for High Performance Computing, Networking, Storage, and Analysis (Supercomputing 2014). Attending this conference helped me to substantially refine my domain knowledge in distributed computing. I have also successfully completed a 3-month short-term visit with the HiPEAC HiPacking team at University of Delaware, USA.

I am very grateful to the HiPEAC team for providing me with this wonderful opportunity and the HiPEAC PhD committee for selecting my proposal.

**Host Institution:** University of Delaware  
**Title:** Compiler Optimization Using Machine Learning
HiPEAC STUDENTS

and parallel systems, build my professional
network, and learn about the most recent
advances in the field of high performance
computing.

I was involved in three main projects while at the University of Delaware. Cavazos’ lab targets high-performance computing, compilers and cyber-security. First, I extended one of my previous works, strengthening the evaluation and analysis of the experiments by adding more evaluation programs, datasets, and machine learning models. The write-up of this work will be finished upon return to my home institution. I collaborated with Eunjung Park on this project.

Second, as the field is shifting toward the exclusive use of many/multi-core low-power systems, traditional sequential techniques should be adapted to the parallel domain. We propose a solution that exploits hardware parallelism to a greater degree and brings it to the compiler level. In this project I collaborated with Wei Wang. We used the LLVM-OMP targeting compiler to perform multi-objective (performance and power consumption) auto-tuning in a parallel domain exploration space.

Third, I was involved in a new, exploratory research project applying machine learning techniques to the field of cyber-security, specifically to malware clustering. This was done in collaboration with William Killian, Tristan Vanderbruggen, and Marco Alvarez. This project gave me insight into working with assembly files and disassembler tools such as IDA Pro, and broadened my application of machine learning to a new research domain.

My involvement in these projects helped me to establish good research connections with the colleagues I met and worked with at the University of Delaware, connections that I can draw on in my future research. I am planning to publish the first two aforementioned projects upon my return to Politecnico Di Milano.

Last but not least, I would like to thank the HiPEAC organizers, specifically Koen De Bosschere and Vicky Wandels, for providing such opportunities for PhD students to be involved in mobility programs.

Amir H. Ashouri, Politecnico Di Milano

HiPEAC info 44

COLLABORATION REPORT: SOHAN LAL

Host Institution: TU Eindhoven
Title: Memory Divergence in GPUs: Analysis and Potentials

GPUs are much more power-efficient than CPUs for many compute-intensive applications. However, due to several performance bottlenecks, the performance per watt of GPUs is often much lower than what could be achieved theoretically. Memory divergence is one such bottleneck. A memory access is divergent if the accesses of the threads in a warp (a SIMD vector of 32 threads) for a load or store instruction cannot all be combined into a single memory transaction. There are many possible causes for the low performance and low energy efficiency of memory-divergent workloads. For example, divergence causes data over-fetch that wastes cache capacity, increases cache contention, and consumes scarce resources such as Miss Status Holding Registers (MSHRs) and memory bandwidth, thus reducing the number of ready warps and exposing stalls that can throttle performance. However, there is no clear study that reveals all the possible causes and quantifies the impact of each on the overall performance loss. In this work, we study the data over-fetch and cache contention caused by memory divergence.
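To make the definition of a divergent access concrete, the following is a minimal Python sketch that counts how many memory transactions one warp-wide load would need. It is an illustration only, not the analysis used in this work: the 128-byte transaction size, the 4-byte access size and the function names are assumptions chosen for the example.

```python
WARP_SIZE = 32  # threads per warp, as in the text


def transactions_per_warp(addresses, segment_bytes=128):
    """Count the distinct aligned segments touched by one warp-wide access.

    segment_bytes is an assumed transaction size, not a value from the report.
    """
    return len({addr // segment_bytes for addr in addresses})


def is_divergent(addresses, segment_bytes=128):
    """An access is divergent if it cannot be served by a single transaction."""
    return transactions_per_warp(addresses, segment_bytes) > 1


def overfetch_ratio(addresses, access_bytes=4, segment_bytes=128):
    """Bytes fetched from memory divided by bytes the threads actually use."""
    fetched = transactions_per_warp(addresses, segment_bytes) * segment_bytes
    used = len(addresses) * access_bytes
    return fetched / used


# Consecutive 4-byte words: the whole warp falls into one 128-byte segment.
coalesced = [4 * lane for lane in range(WARP_SIZE)]
# A 128-byte stride: every thread touches a different segment.
strided = [128 * lane for lane in range(WARP_SIZE)]

print(transactions_per_warp(coalesced), overfetch_ratio(coalesced))
print(transactions_per_warp(strided), overfetch_ratio(strided))
```

Under these assumptions the strided pattern needs one transaction per thread and fetches far more data than the threads consume, which is the over-fetch effect described above.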

I would like to thank HiPEAC for giving me this wonderful internship opportunity, and the ES group of TU/e for all their support and warm hospitality.

Sohan Lal, TU Berlin

COLLABORATION REPORT: DAVIDE ZONI

Host Institution: University of Cyprus
Title: A flexible, low cost signalling network and methodology to aggressively reduce the leakage power in On-Chip Networks by exploiting the power gating mechanism

I have a post-doc position at the Politecnico di Milano - Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB). My research activities are mainly focused on power-performance optimization in Networks-on-Chip (NoCs), also considering the cache hierarchy and coherence protocols. Moreover, hardware design issues and feasibility are taken into account, as well as the exploitation of control theory for the optimization stages. Thanks to the HiPEAC grant I spent three months, from January 2015 to April 2015, at the University of Cyprus (UCY) working with Prof. Yannakis Sazeides and Prof. Chrysostomos Nicopoulos.

The internship focused on static power reduction in Networks-on-Chip (NoCs) by exploiting power gating actuators on the router buffers. Such an interconnection fabric promises to be an efficient, reliable and flexible communication infrastructure for multi-core platforms, although its power consumption cannot be neglected. Besides, the overall chip performance is strongly influenced by this communication layer. Thus, a power-performance methodology has been investigated and the work has been organized in two main parts.

First, the state of the art was investigated in depth to extract the real issues in designing power gating for NoCs and to collect the most promising and effective solutions to reduce their leakage power. A key objective was to keep almost the same performance as the baseline NoC with a significant leakage power reduction, mainly due to the high impact the NoC buffers have on the static power.

Second, a control-theory inspired methodology was proposed to aggressively switch off idle buffers and to prolong their switch-off time by allowing the same buffer to be reused for different traffic types at different points in time. The proposed methodology was implemented and compared with both the baseline and one of the state-of-the-art solutions, namely Power Punch. Power and performance metrics were investigated by means of an RTL Verilog implementation to accurately account for both area and power, achieving a large static power reduction with a minimal impact on both area and performance.

I would like to express my appreciation to the HiPEAC NoE for this excellent opportunity that allowed me to establish a cooperation bridge with a different European research group, with a cross sharing of competencies and the creation of a long-term scientific contact between the two universities. In particular, I would like to sincerely thank professors Yannakis Sazeides and Chrysostomos Nicopoulos for their guidance and hospitality and the rest of the people of their research group that warmly welcomed me. I think all PhD students and young researchers may greatly benefit from the collaboration grants sponsored by HiPEAC.

Davide Zoni, Politecnico di Milano

---

**INTERNSHIP REPORT: KARTHIKEYAN P. SARAVANAN**

*Host Institution: Samsung R&D Institute UK*

*Title: Machine learning based GPU DVFS control for Power/Performance optimization*

Our investigation into machine learning based DVFS control for GPUs and high graphics games involved multiple challenges:

1. Analysing the problem and identifying potential for improvement
2. Identifying the appropriate machine learning heuristic for the given problem
3. Estimating power within the system and identifying high graphic workloads

**Analysing the problem and identifying potential for improvement**

Our analysis was focused on the Samsung Galaxy Note 4 device, and we used the Monsoon power monitor for our power analysis. Our analysis of the problem involved graphics applications, specifically games such as Crossy Road and Godfire, GFX Benchmark, and others. Our analysis showed that interesting trade-offs between power and performance existed, which could be used for energy savings with a minimal effect on performance.

**Identifying the appropriate machine learning heuristic for the given problem**

We looked at several machine learning techniques and found that a combination of Q-learning and neural networks would work best for machine learning based DVFS control. Our problem required reinforcement learning, for which Q-learning has proven to work very well. Our search space and action space are also very large, so we explored the use of neural networks for function approximation within the Q-learning model. Our results showed that the Q-learning based algorithm successfully controlled the GPU DVFS for specific temperature, load and performance goals, and that the Q-learning heuristic managed to find trade-offs when presented with conflicting goals. We further analysed various ways to improve the Q-learning technique, such as better heuristics for action choice and reward choice. Our analysis using a simple neural network as a function approximation mechanism did not show interesting results, and we concluded that further work in that direction was required.
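For readers unfamiliar with Q-learning, the sketch below shows the standard tabular update rule applied to a toy DVFS setting. It is not the prototype engine described here: the load levels, the frequency values, the reward and the learning parameters are all hypothetical, and a table stands in for the neural network function approximator.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


def greedy_action(q, state):
    """Pick the frequency with the highest learned value for this state."""
    return max(q[state], key=q[state].get)


# Hypothetical state and action spaces for illustration only.
LOADS = ("low", "high")
FREQS_MHZ = (266, 420, 600)


def new_table():
    """Create a Q-table with every (load, frequency) value set to zero."""
    return {load: {freq: 0.0 for freq in FREQS_MHZ} for load in LOADS}


table = new_table()
# Assume running at 600 MHz under high load earned a reward of 1.0
# and the system then moved to the low-load state.
q_update(table, "high", 600, reward=1.0, next_state="low")
print(greedy_action(table, "high"))
```

In a real controller the reward would combine the conflicting goals mentioned above (temperature, load and performance), and the choice of that reward function is one of the design decisions the text refers to.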

**Estimating power within the system and identifying high graphic workloads**

Two interesting challenges in controlling DVFS using machine learning are (1) the need to estimate power consumption, so that the estimates can be used for power/performance trade-offs, and (2) identifying applications, and specifically high graphics workloads, so that DVFS control methods can be tuned for application domains. In this regard, we used a neural network based learning technique for power estimation and workload detection. We had some success with each of these challenges, but in general more complex neural network models were required for more accurate estimation in both cases.

**Conclusions**

Power/thermal and performance are key trade-offs that require careful optimization on mobile devices. Many variables in mobile devices contribute in different ways to the device’s power, performance and thermal points. Our work involved investigating smarter DVFS control mechanisms. We investigated various aspects of the problem and built a prototype Q-learning engine that controls DVFS, which showed promising results within the scope of our problem. We also identified several opportunities for further extension of this work. Specifically, improving the neural net logic will make the estimation part of our work more accurate.

Karthikeyan P. Saravanan, Barcelona Supercomputing Center
INTERNSHIP REPORT: TIAN XU

Host Institution: Samsung R&D Institute UK
Title: Vision Algorithms Optimisation on Multi-Core Mobile Devices

I am a PhD student at University of Glasgow, working on Parallel Image Processing. My research interests focus on investigating how to optimize and accelerate image processing algorithms for real-time applications by utilizing multi-core processors (Multi-core CPU and GPU). Thanks to the HiPEAC Industrial PhD Internship scheme, I had the opportunity to spend three months at Samsung Research & Development Institute UK during the summer of 2014.

I joined the Android Graphic Team in Samsung, and aimed to develop and optimise vision algorithms on multi-core mobile devices, e.g. using GPU compute and CPU. I first familiarised myself with the Android platform and learned how to create an Android project. By studying the mobile CPU and GPU, I learned how they work together to get better performance on mobile devices, and how they differ from a desktop CPU and GPU. In particular, the mobile CPU and GPU share the same memory, so no memory copy is needed when using the GPU, which is a significant difference compared with a desktop GPU. Also, the limited memory size, the low CPU and GPU computational capability and the power consumption constraints have to be taken into consideration when designing algorithms for mobile devices.

After understanding the hardware architecture, we focused on developing a 3D reconstruction algorithm for mobile devices, which reads in several colour images or a set of frames, and generates multiple depth maps. For the input images, we extract the image features first, and then match key points to estimate the camera position for each image. Finally, a plane sweep algorithm is applied to generate the depth maps. The implementation involves OpenCV and OpenGL libraries. OpenGL code can be performed directly on the mobile GPU, while other parts need special care to utilise the multi-core architecture. So we rewrote part of the program in OpenCL to improve the algorithm’s efficiency.

I would like to express my appreciation to HiPEAC for providing this excellent opportunity. It allowed me to gain industrial experience and see how academic knowledge can be applied in practice. I would also like to thank Samsung for accepting me as an intern. I met many excellent engineers there and had a really wonderful time.

Tian Xu, University of Glasgow

INTERNSHIP REPORT: JELENA MILOSEVIC

Host Institution: IBM Cyber Security Center of Excellence
Title: Anomaly Detection in Time Series

My name is Jelena Milosevic. I am a PhD student at Advanced Learning and Research Institute, Faculty of Informatics, University of Lugano, Switzerland. My research is in the field of malware detection. In order to broaden my knowledge and gain more industrial experience, from February 2015 until May 2015 I was an intern at the IBM Cyber Security Center of Excellence and the Department of Computer Science at Ben-Gurion University of the Negev, located in Beersheba, Israel. The internship was related to the field of anomaly detection in time series. Anomaly detection, sometimes also called outlier detection or novelty detection, aims to identify points, events or patterns that do not conform to the expected behavior. Such techniques have broad usage in many fields, including malware detection, intrusion detection, fraud detection, and health monitoring.

In order to detect anomalies, different approaches have been proposed in the literature. Among them, symbolic representation appears to be very promising, together with approaches based on regression methods and sliding window techniques. We aimed at a comparative analysis of the most relevant state-of-the-art techniques in the field with a novel anomaly detection approach developed at IBM laboratories. During the internship, the mentioned techniques were implemented and initial results were obtained. The analysis was performed with data related to anomalies in heartbeats and power supply. Our goal is to extend the tested dataset with data related to network anomalies and malicious traffic that we collected during the internship. We have continued our collaboration after my internship finished and we aim to publish the results at a relevant conference as soon as the work is finalized.
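As a simple illustration of the sliding window family of techniques mentioned above, the sketch below flags a point as anomalous when it deviates strongly from the mean of the preceding window. It is a generic textbook baseline, not the IBM approach nor any specific method compared during the internship; the window length, the threshold and the sample series are assumptions.

```python
from statistics import mean, pstdev


def sliding_window_anomalies(series, window=5, threshold=3.0):
    """Return the indices whose value lies more than `threshold` standard
    deviations away from the mean of the `window` preceding values."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # A perfectly flat window: treat any change as an anomaly.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up signal with a single spike at index 7.
signal = [10, 11, 10, 9, 10, 11, 10, 50, 10, 9, 11, 10]
print(sliding_window_anomalies(signal))
```

Note that the spike inflates the statistics of the windows that follow it, which is one reason more robust representations are studied in the literature.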

I would like to thank HiPEAC for providing me with the opportunity to participate in this internship. I am also thankful to all colleagues from IBM, as well as to Professor Shlomi Dolev from Ben-Gurion University of the Negev, for all the support and assistance during the internship and for the great experience I had there.

Jelena Milosevic, Advanced Learning and Research Institute, Faculty of Informatics, University of Lugano
COLLABORATION REPORT: IVAN RATKOVIC

Host Institution: University of California at Berkeley
Title: Fully Parametrizable Floating Point Unit Design for Energy Efficient Vector Processors

A HiPEAC collaborative internship has allowed me to spend four months as a visiting scholar at the Berkeley Wireless Research Center, University of California at Berkeley. I was doing research as a part of the Raven Project group. The most recent Raven tapeout, in 28 nm FD-SOI, supports a fully-realized RISC processor and an associated vector processor (Hwacha) in a variable voltage domain supplied by switched capacitor DC-DC converters. The main design goals of the Raven processor are power and energy efficiency. The next version will be sent for fabrication in October.

My part of the project was to work on the low-power floating point unit (FPU) that is used by both the vector and scalar cores. However, since the vector core is more computationally intensive, the emphasis was on optimizations for vector processing. The design goals were to make it fully IEEE compliant, fully parameterizable, power and energy-efficient, and to integrate a division/square-root unit into the FPU. As a baseline I used the previous Raven FPU, which is written in Chisel and implemented as a Fused Multiply Add unit capable of performing fused multiply-add, multiplication, and addition. First, I devoted some time to studying the internal architecture and trade-offs of the FPU. Then, I made it fully compatible with the IEEE floating point standards. Afterwards, I worked on full parameterization of the FPU by adding implementations of several pipelining styles, register types, and low power techniques, together with parameters that select which pipelining style, register type and low power techniques are applied to the current FPU. I evaluated the low power techniques independently; for example, the vector processing aware operand isolation technique achieves on average 3.1x power and 1.1x area reductions with a 1.26x timing overhead. I also implemented a set of fine-grained internal clock-gating techniques, whose precise evaluation is still pending. Finally, I integrated division/square-root, thus making our FPU an “All-in-One” FP unit capable of performing fused multiply-add, multiplication, addition, division and square-root floating point operations. I obtained all the results using Synopsys tool-flows and 28 nm FDSOI STMicroelectronics technology.
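The difference between a fused multiply-add and a separate multiplication followed by an addition is that the fused operation rounds only once. The following Python sketch emulates that behaviour in software with exact rational arithmetic; it says nothing about the Chisel implementation, handles only finite inputs, and the function names and operand values are chosen purely for illustration.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """Round after the multiplication and again after the addition."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round once to the nearest double.

    Fraction(float) is exact and float(Fraction) rounds correctly, so this
    mimics the single rounding of a hardware FMA for finite inputs.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(unfused_multiply_add(a, b, c))  # the small term is lost to rounding
print(fused_multiply_add(a, b, c))    # the small term survives
```

With these operands the separate operations return zero, while the single-rounding version keeps the residual term of magnitude 2^-60.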

I would like to thank HiPEAC and the University of California, Berkeley for the opportunity of doing this internship, as it was a very valuable experience at all levels. Moreover, I would like to thank all the people that contributed to my integration in the research center, as well as those that shared their knowledge with me.

Ivan Ratkovic, Barcelona Supercomputing Center

COLLABORATION REPORT: HAZEM ISMAIL ALI

Host Institution: TU Eindhoven
Title: Unifying SDF graphs and Traditional Real-Time scheduling

I am a PhD student at CISTER (Research Centre in Real-Time and Embedded Computing Systems)/INESC-TEC, based in the Polytechnic Institute of Porto, Portugal. Between mid-August and mid-November of 2014, I visited the Embedded Systems (ES) group based in the Electrical Engineering Department in Eindhoven University of Technology (TU/e), to work with dr.ir. Sander Stuijk as a part of a HiPEAC PhD collaboration grant.

Eindhoven University of Technology (TU/e) in general and the Embedded Systems (ES) group in particular offered me a high quality research environment with a bright and friendly group of researchers to interact with and exchange research ideas. The ES group has extensive experience in dataflow computation and analysis, from which I learned a great deal throughout the three-month collaboration period. The main goal of my research at the ES group is to develop an efficient and scalable algorithm that extracts the timing parameters of dataflow applications represented in the Synchronous Dataflow (SDF) computational model.
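As background on the SDF model, a standard first analysis step is to compute the repetition vector, i.e. how often each actor must fire per graph iteration so that token production and consumption balance on every channel. The sketch below solves these balance equations for a connected graph. It illustrates the model only and is not the timing-parameter extraction algorithm under development; the edge format and the example graph are assumptions.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Solve q[src] * produced == q[dst] * consumed for every edge.

    edges is a list of (src, produced, dst, consumed) tuples.
    Returns the smallest positive integer solution as a dict.
    """
    rate = {actors[0]: Fraction(1)}
    pending = [actors[0]]
    while pending:
        actor = pending.pop()
        for src, produced, dst, consumed in edges:
            if src == actor:
                other, value = dst, rate[actor] * produced / consumed
            elif dst == actor:
                other, value = src, rate[actor] * consumed / produced
            else:
                continue
            if other not in rate:
                rate[other] = value
                pending.append(other)
            elif rate[other] != value:
                raise ValueError("inconsistent SDF graph: rates do not balance")
    if len(rate) != len(actors):
        raise ValueError("SDF graph is not connected")
    # Scale the rational solution to the smallest integer solution.
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    counts = {actor: int(r * scale) for actor, r in rate.items()}
    divisor = reduce(gcd, counts.values())
    return {actor: count // divisor for actor, count in counts.items()}


# A produces 2 tokens per firing and B consumes 3; B produces 1 and C consumes 2.
example = repetition_vector(
    ["A", "B", "C"],
    [("A", 2, "B", 3), ("B", 1, "C", 2)],
)
print(example)
```

For the example chain, three firings of A, two of B and one of C return every channel to its initial token count.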

In addition, I gave two research seminars at TU/e about my PhD work on Integrating Dataflow and Non-Dataflow Real-time Application Models on Multi-core Platforms. The first seminar was given to the ES group at the annual ES day event, whose objective is to present the work of the researchers of the ES group. The second seminar was given to the System Architecture and Networking (SAN) group at the Department of Mathematics and Computer Science in TU/e. The work that I began at the ES group in TU/e is still ongoing and I expect joint publications to emerge in the near future. During my visit, I had many fruitful interactions with other researchers in the dataflow field beyond the ES group, including Professor Marco Bekooij from the University of Twente and Professor Todor Stefanov from the University of Leiden. These interactions gave me valuable feedback and deep insights on my ongoing research. Overall, I found the visit to be greatly beneficial for me in experiencing a top class research environment and in building connections with several excellent researchers working on topics related to my interests, which I am sure will prove helpful in the future. I wish to thank dr.ir. Sander Stuijk for being an excellent host and all the ES group members for their warm hospitality. I would also like to thank HiPEAC for supporting my visit.

Hazem Ismail Ali, CISTER/ISEP
A MODEL-BASED APPROACH FOR THE SPECIFICATION AND REFINEMENT OF STREAMING APPLICATIONS

Christian Zebelein, Siemens AG, Germany
Advisor: Prof. Dr.-Ing. habil. Christian Haubelt, University of Rostock
Graduation date: September 2014

Today, embedded systems can be found in a wide range of applications including transportation systems and consumer electronics. Model-based design flows can be a solution to the increasingly challenging task of designing and programming embedded systems, which have to meet a wide range of constraints. Focusing on data flow models, this thesis proposes a seamless model-based design flow from system level to the instruction/logic level for a wide range of streaming applications. In the proposed design flow, the same (refined) data flow model used at system level constitutes the input model for subsequent hardware/software synthesis steps at the next lower levels of abstraction. As a result, complex model-based optimizations like inter-process resource sharing can be automatically applied during synthesis even at these lower levels of abstraction, considerably reducing the modeling complexity. Moreover, the design space is greatly extended, as different configurations can be automatically synthesized and evaluated during design space exploration.

HYBRID INTERCONNECT DESIGN FOR HETEROGENEOUS HARDWARE ACCELERATORS

Cuong Pham Quoc, Delft University of Technology, The Netherlands
Advisor: Dr. ir. Z. Al-Ars and Prof. Dr. K. Bertels
Graduation date: April 2015

As illustrated by the recent acquisition of Altera by Intel, hardware accelerators are becoming increasingly important as one solution to increase computational power. In order to use such accelerators efficiently, a major challenge is to deal with the data flowing through the system, where dependencies and communication bottlenecks may kill the envisioned performance improvement. As data communication patterns can be specific to each application, one way to address this issue is to define tailored interconnects. In this dissertation, we study the possible benefits of hybrid (and tailored) interconnects based on detailed data communication information, providing the most appropriate support for the communication pattern inside an application while keeping the hardware resource usage for the interconnect minimal. We define such interconnects as ‘hybrid’ interconnects since ultimately the entire interconnect will consist not only of e.g. a NoC but also of uni- or bidirectional communication channels, locally shared buffers for data exchange, and so on. To minimize the hardware resource usage for the hybrid interconnect, we also propose an adaptive mapping algorithm to connect the computing kernels to the proposed hybrid interconnect. The experimental results on both an embedded system and a high performance computing system show that our design not only improves system performance but also reduces overall energy consumption compared to the baseline systems.
This thesis proposes a novel model of computation, called stream rewriting, for the specification and implementation of highly concurrent applications. Basically, the active tasks of an application and their dependencies are encoded as a token stream, which is iteratively modified by a set of rewriting rules at runtime. The creation of new tasks, the synchronization of cooperating tasks, and the scheduling of dependent tasks are implemented as local pattern matching, which can be performed in parallel on several regions of the stream. Hence, stream rewriting is most useful for compute-intensive applications with frequently varying and unpredictable data rates and further enables global resource sharing as well as lightweight lock-free synchronization. Several many-core systems with up to 128 general purpose processors have been implemented on an FPGA and show the scalability of stream rewriting for complex examples including recursive algorithms and graphics processing.

The high performance computing landscape is shifting from assemblies of homogeneous nodes towards heterogeneous systems, in which nodes consist of a combination of traditional out-of-order execution cores and accelerator devices, such as GPUs or many-core coprocessors. These devices provide superior theoretical performance compared to traditional multi-core CPUs, but not every application fits into the programming model they impose, and exploiting their computing power remains a challenging task.

This dissertation discusses the issues that arise when trying to efficiently use general purpose accelerators. To that end, we use as a case study the statistical technique Kernel Density Estimation (KDE). KDE is a memory bound application that poses several challenges for its adaptation to the accelerator-based model. We present a novel algorithm for the computation of KDE that considerably reduces its computational complexity, and we analyse its performance on a set of different coprocessors, trying to highlight the bottlenecks and the limits that the code reaches on each platform. In addition, we present an application of our KDE algorithm in the field of climatology: a novel methodology for the evaluation of environmental models.
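For reference, the textbook form of KDE evaluates every query point against every sample. The Python sketch below shows this direct formulation with a Gaussian kernel in one dimension; it is the naive baseline, not the reduced-complexity algorithm proposed in the dissertation, and the sample values and bandwidth are made up.

```python
import math


def gaussian_kde(samples, query_points, bandwidth):
    """Direct 1-D kernel density estimate with a Gaussian kernel.

    The cost is O(len(samples) * len(query_points)): each query point
    visits every sample once.
    """
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in query_points
    ]


# Made-up data for illustration.
data = [-1.0, 0.0, 1.0]
density = gaussian_kde(data, [-1.0, 0.0, 1.0], bandwidth=0.5)
print(density)
```

Every query point reads the full sample array, which gives a sense of the memory traffic involved when this computation is moved to an accelerator.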
**CONTRIBUTIONS TO HIGH-THROUGHPUT COMPUTING BASED ON THE PEER-TO-PEER PARADIGM**

Carlos Pérez Miguel, University of the Basque Country, Spain
Advisor: José Miguel-Alonso and Alexander Mendiburu
Graduation date: June 2015

In this thesis, we propose a novel High Throughput Computing (HTC) architecture in which the queue and the administrative tasks of the master node of usual HTC systems are distributed across every node of the system. This system is built on top of a Peer-to-Peer data management tool, Cassandra, which provides high scalability and availability. The data availability provided by Cassandra is analysed by means of several stochastic models. These models can be used to make predictions about the availability of any Cassandra deployment, as well as to select the best possible configuration of any Cassandra system. Finally, we propose a set of scheduling policies that try to solve a common problem of HTC systems, namely the re-execution of tasks due to a failure in the node where the task was running, without wasting additional resources. In order to reduce the number of re-executions, our proposals try to find good fits between the reliability of nodes and the estimated length of each task.

**DETERMINISTIC EXECUTION OF MULTITHREADED APPLICATIONS FOR RELIABILITY OF MULTICORE SYSTEMS**

Hamid Mushtaq, Delft University of Technology, The Netherlands
Advisor: Dr. ir. Zaid Al-Ars, Prof. Dr. Koen Bertels
Graduation date: June 2015

Constant reduction in the size of transistors has made it possible to implement many cores on a single die. However, smaller transistors are more susceptible to both temporary and permanent faults. To make such systems more reliable, online fault tolerance techniques can be applied. A common approach for providing fault tolerance is to perform redundant execution of the software. This is done by using the program replication approach. In this approach, the replicated copies of a program (known as replicas) follow the same execution sequence and produce the same output if given the same input. This requirement necessitates that the replicas handle non-deterministic events such as asynchronous signals and non-deterministic functions deterministically. This is usually done by having one replica log the non-deterministic events and have the other replicas replay them at the same point in program execution. In a shared memory multithreaded program, this also means that the replicas perform non-deterministic shared memory accesses deterministically, so that they do not diverge in the absence of faults.
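The log-and-replay idea described above can be sketched in a few lines of Python: one replica records the result of each non-deterministic call, and the other replicas consume the recorded values instead of calling the source again. This is a conceptual illustration only, not the user-level library developed in the thesis; the class name and the use of `random.random` as the non-deterministic source are assumptions.

```python
import random


class EventLog:
    """Record non-deterministic values in one replica, replay them in others."""

    def __init__(self, recorded=None):
        # With no recorded log this instance records; otherwise it replays.
        self.replaying = recorded is not None
        self.events = list(recorded) if recorded is not None else []
        self.position = 0

    def observe(self, source):
        """Return a non-deterministic value, either fresh or replayed."""
        if self.replaying:
            value = self.events[self.position]
            self.position += 1
            return value
        value = source()
        self.events.append(value)
        return value


# The leading replica records three non-deterministic values.
leader = EventLog()
leader_values = [leader.observe(random.random) for _ in range(3)]

# A following replica replays them at the same points in its execution.
follower = EventLog(leader.events)
follower_values = [follower.observe(random.random) for _ in range(3)]

print(leader_values == follower_values)
```

Because both replicas now see identical inputs at identical points, any difference in their outputs can be attributed to a fault rather than to non-determinism.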

In this thesis, we employed two techniques for doing so: record/replay and deterministic multithreading. Both of our schemes are implemented using a user-level library and do not require a modified kernel. Moreover, they are very portable since they do not depend upon any special hardware for deterministic execution. In addition, we compared the advantages and disadvantages of both schemes in terms of performance, memory consumption and reliability. We also showed how our techniques improve upon existing techniques in terms of performance, scalability and portability. Lastly, we implemented specialized hardware extensions to further improve the performance and scalability of deterministic multithreading.

Deterministic multithreading is useful not only for fault tolerance, but also for debugging and testing of multithreaded applications running on a multicore system. It can be useful in reducing the time needed to calculate the worst-case execution time (WCET) of tasks running on multicore systems, as deterministic multithreading reduces the possible number of states a multithreaded program can reach. Finding a good (less pessimistic) WCET estimate of a real-time task is much simpler if it runs on a single core processor than if it runs on a multicore processor concurrently with other tasks. This is because those tasks can share resources, such as a shared cache or a shared bus, and/or may need to concurrently read and/or write shared data. In this thesis, we show that using deterministic shared memory accesses helps in reducing the possible number of states used by the estimation algorithm and therefore reduces the WCET calculation time. Moreover, we implemented optimizations to further reduce WCET calculation time as well as to get a tighter WCET estimate, in addition to utilizing our specialized hardware extensions for that purpose.
Within the last decade, the industry shifted from designs with a single processor to multi-core designs, integrating multiple processing units on a single chip. The number of cores is expected to grow exponentially and future systems are expected to integrate thousands of processing units. In order to provide sufficient memory bandwidth in these systems, main memory is physically distributed over multiple memory controllers with non-uniform access to memory (NUMA).

Past research has identified programming models based on fine-grained, dependent tasks as a key technique to unleash the parallel processing power of massively parallel general-purpose computing architectures. However, many questions regarding the dynamic optimization of task-parallel programs for architectures with non-uniform memory access remain open.

In this thesis, we explored the main factors affecting locality, and therefore performance, of task-parallel programs. We proposed a set of transparent, portable and fully automatic online mapping mechanisms for tasks to cores and data to memory controllers in order to improve locality. Placement decisions are based on information about point-to-point data dependences, readily available in the run-time systems of modern task-parallel programming frameworks. The experimental evaluation of these techniques was conducted on our implementation in the run-time of the OpenStream language and a set of high-performance scientific benchmarks. Finally, we designed and implemented Aftermath, a tool for performance analysis and debugging of task-parallel applications and run-times.
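One simple way to picture a dependence-driven placement decision is to run each task on the NUMA node that already holds most of its input data. The sketch below implements that heuristic in Python. It is a hypothetical illustration of the general idea and should not be read as the mapping mechanisms proposed in the thesis or as part of OpenStream; the function name and the input format are assumptions.

```python
from collections import defaultdict


def pick_node(input_buffers):
    """Choose the NUMA node that holds the most input bytes for a task.

    input_buffers is a list of (node_id, size_in_bytes) pairs describing
    where each input of the task currently resides. Ties are broken in
    favour of the lowest node id. An empty list raises ValueError.
    """
    bytes_per_node = defaultdict(int)
    for node, size in input_buffers:
        bytes_per_node[node] += size
    # max() keeps the first maximum, so sorting yields the lowest id on ties.
    return max(sorted(bytes_per_node), key=bytes_per_node.__getitem__)


# A task with 192 bytes of input on node 0 and 256 bytes on node 1.
print(pick_node([(0, 64), (1, 256), (0, 128)]))
```

A run-time system could apply the same reasoning in the other direction, allocating the output buffer of a task on the node where its consumer is expected to run.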

I am continuing this research on the optimization and performance analysis of task-parallel applications as a postdoctoral researcher at the University of Manchester.

http://www.drebesium.org/thesis

The push for more energy-efficient embedded devices yields products with greater capabilities and longer battery life, but this push must come from multiple directions. Hardware can only be energy efficient if the software running on it allows this. My thesis proposes energy modelling techniques for multi-threaded and multi-core deeply embedded systems that produce estimates of the energy consumption of programs. This allows developers to explore aspects such as algorithm selection, thread and core allocation, and communication patterns, helping them to understand the energy impact of their code and to make better-informed design and implementation choices.

A profiling framework for a hardware multi-threaded processor is presented, and an energy model is developed from it. Swallow, a multi-core embedded system built from many dual-core chips, is used to extend the profiling and modelling to multiple devices and to the impact of communication across Swallow’s grid-like network. The work centres on instruction set simulation, but through the EU FP7 FET project ENTRA, the models have also been used in static analysis at various levels, including assembly and LLVM IR.
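The general shape of an instruction-level energy estimate can be sketched as follows: a dynamic part summed over the executed instructions plus a static part that grows with run time. This is a generic textbook-style formulation and not the model developed in the thesis; the function `estimate_energy_nj`, the instruction classes and all numbers are hypothetical and for illustration only.

```python
def estimate_energy_nj(instr_counts, energy_per_instr_nj,
                       static_power_mw, runtime_ms):
    """Estimate program energy in nanojoules.

    instr_counts:        dict of instruction class -> executed count
    energy_per_instr_nj: dict of instruction class -> energy in nJ
    static_power_mw:     baseline power draw in milliwatts
    runtime_ms:          execution time in milliseconds
    """
    dynamic_nj = sum(count * energy_per_instr_nj[op]
                     for op, count in instr_counts.items())
    # 1 mW * 1 ms = 1 microjoule = 1000 nJ
    static_nj = static_power_mw * runtime_ms * 1000.0
    return dynamic_nj + static_nj


if __name__ == "__main__":
    # Made-up counts and costs, e.g. taken from a simulation trace.
    counts = {"add": 1000, "load": 200}
    costs = {"add": 1.0, "load": 2.5}
    print(estimate_energy_nj(counts, costs,
                             static_power_mw=10.0, runtime_ms=0.5))
```

Comparing such estimates for two candidate algorithms or thread allocations is the kind of design-space exploration the abstract describes.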
UPCOMING EVENTS

IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-15)
23-25 September 2015, Turin, Italy
http://mcsoc-forum.org/2015/

Embedded Systems Week
4-9 October 2015, Amsterdam, The Netherlands
http://www.esweek.org

2015 IEEE Nordic Circuits and Systems Conference (NORCAS)
26-28 October 2015, Oslo, Norway
http://www.norcas.org/

7-10 December 2015, Limassol, Cyprus
http://cyprusconferences.org/ucc2015/
http://datasys.cs.iit.edu/events/BDC2015/

The 11th International Conference on High Performance and Embedded Architectures and Compilers (HiPEAC 2016)
18-20 January 2016, Prague, Czech Republic
https://www.hipeac.net/2016/prague/

24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2016)
17-19 February 2016, Heraklion, Crete, Greece
http://www.pdp2016.org/

12-16 March 2016, Barcelona, Spain
http://hPCA22.site.ac.upc.edu/
http://conf.researchr.org/home/PPoPP-2016
http://cgo.org/cgo2016/

Design, Automation and Test in Europe (DATE16)
14-18 March 2016, Dresden, Germany
http://www.date-conference.com/

4-7 April 2016, Nuremberg, Germany
http://www3.cs.fau.de/arcs2016/

CONTRIBUTIONS

If you are a HiPEAC member and would like to contribute to future HiPEAC newsletters, please visit https://www.hipeac.net/publications/newsletter/

HIPEAC INFO IS A QUARTERLY NEWSLETTER PUBLISHED BY THE HIPEAC NETWORK OF EXCELLENCE, FUNDED BY THE 7TH EUROPEAN FRAMEWORK PROGRAMME (FP7) UNDER CONTRACT NO. FP7/ICT 287759
WEBSITE: HTTPS://WWW.HIPEAC.NET/
SUBSCRIPTIONS: HTTPS://WWW.HIPEAC.NET/PUBLICATIONS/NEWSLETTER/