Embedded Computing for High Performance: Performance

Recently, embedded.com ran excerpts from Embedded Computing for High Performance, a book by João Cardoso, José Gabriel Coutinho, and Pedro Diniz, which was published last spring. Performance is always a timely topic, so I’ve been devoting some posts to excerpt summaries. My first such post focused on target architectures and multiprocessor and multicore architectures; the second covered core-based architectural enhancement and hardware accelerators.

While the entire book is dedicated to performance, there’s a section (called Performance, by the way) that really drills down on the topic. And that’s the excerpt I’m addressing here.

This section begins by noting that there are myriad nonfunctional product requirements (e.g., execution time, memory capacity, energy consumption) that developers need to take into account as they devise solutions that optimize system performance, and that attention needs to be paid to identifying “the most suitable performance metrics to guide” the process by which they evaluate possible solutions.

The authors list common metrics:

The arithmetic inverse (that is, the reciprocal) of the app’s execution time
Clock cycles required to execute a function, code section or app
Task latency
Task throughput
Scalability

In the full excerpt, they further define each of these metrics.

They also provide formulae, starting with:

In terms of raw performance metrics, the execution time of an application or task, designated as Texec, is computed as the number of clock cycles the hardware (e.g., the CPU) takes to complete it, multiplied by the period of the operating clock (or divided by the clock frequency).
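To make that definition concrete, here is a minimal Python sketch of the calculation. The function name, parameter names, and the example numbers are my own illustration, not taken from the book.

```python
def execution_time(cycles: int, clock_hz: float) -> float:
    """Return Texec in seconds: clock cycles divided by clock frequency.

    Dividing by the frequency is the same as multiplying by the
    clock period, since period = 1 / frequency.
    """
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_hz


# Hypothetical task: 2,000,000 cycles on a 100 MHz clock.
t_exec = execution_time(2_000_000, 100e6)
print(f"Texec = {t_exec * 1e3:.1f} ms")  # prints "Texec = 20.0 ms"
```

The same cycle count at double the clock frequency would take half the time, which is why cycle counts alone are not enough to compare two platforms.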

This is followed by formulae for the metric used when some computations are offloaded to a hardware accelerator, and for speedup, “which quantifies the performance improvement of an optimized version over a baseline implementation.” In addressing speedup, the authors go into some detail, including an explication of Amdahl’s Law, which “states that the performance improvement of a program is limited by the sections that must be executed sequentially, and thus can be used to estimate the potential for speeding up applications using parallelization and/or hardware acceleration.”
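For a feel of what Amdahl’s Law implies, here is a small Python sketch using the standard textbook form of the law. The function names and the 80%/4x figures are my own assumptions for illustration; the book’s notation may differ.

```python
def speedup(t_baseline: float, t_optimized: float) -> float:
    """Speedup of an optimized version over a baseline (ratio of execution times)."""
    return t_baseline / t_optimized


def amdahl_speedup(improved_fraction: float, factor: float) -> float:
    """Overall speedup when only `improved_fraction` of the run time
    is sped up by `factor`; the rest still runs at the original speed."""
    if not 0.0 <= improved_fraction <= 1.0:
        raise ValueError("improved_fraction must be between 0 and 1")
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - improved_fraction) + improved_fraction / factor)


# Hypothetical case: 80% of the run time is offloaded and made 4x faster.
print(f"{amdahl_speedup(0.8, 4):.2f}x")    # overall gain is only about 2.5x
# Even an (almost) infinitely fast accelerator cannot beat 1 / 0.2 = 5x.
print(f"{amdahl_speedup(0.8, 1e12):.2f}x")
```

The second call shows the point of the law: the 20% that stays sequential caps the overall gain at 5x, no matter how fast the accelerated part becomes.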

They finish up the Performance section with a discussion of the Roofline Model, “an increasingly popular method for capturing the compute-memory ratio of a computation and hence quickly identify if the computation is compute or memory bound.”
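The core idea of the Roofline Model fits in a few lines of Python. This is a sketch of the commonly used form of the model (attainable performance is the lesser of peak compute and memory bandwidth times arithmetic intensity); the function names and the machine numbers are hypothetical, not figures from the book.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      intensity_flops_per_byte: float) -> float:
    """Roofline bound: performance is capped either by peak compute
    or by how fast memory can feed the computation."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)


def is_memory_bound(peak_gflops: float, bandwidth_gbs: float,
                    intensity_flops_per_byte: float) -> bool:
    """True when the memory roof, not the compute roof, is the limit."""
    return bandwidth_gbs * intensity_flops_per_byte < peak_gflops


# Hypothetical machine: 100 GFLOP/s peak compute, 25 GB/s memory bandwidth.
PEAK, BW = 100.0, 25.0

for intensity in (0.5, 8.0):  # FLOPs performed per byte moved
    bound = "memory" if is_memory_bound(PEAK, BW, intensity) else "compute"
    print(f"intensity {intensity}: "
          f"{attainable_gflops(PEAK, BW, intensity)} GFLOP/s ({bound} bound)")
```

On this made-up machine the crossover sits at 100 / 25 = 4 FLOPs per byte: kernels below that intensity are limited by memory traffic, and kernels above it are limited by the arithmetic units.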

As with the other parts of this book, this excerpt is well worth the read. (And I think I’m talking myself into going out and buying the actual book.)