I am from the days when computer engineers and scientists had to write assembly language
on IBM mainframes to develop high-performance programs. Programs were written on
punch cards, and compilation was a one-day process; you dropped off your punch-card program and picked up the results the next day. If there was an error, you did
it again. In those days, a good programmer had to understand the underlying machine
hardware to produce good code. I get a little nervous when I see computer science students being taught only at a high level of abstraction, in languages like Ruby. Although abstraction is a beautiful thing that lets you develop software without getting bogged down in unnecessary details, it is a bad thing when you are trying to develop super-high-performance code.
Since the introduction of the first CPU, computer architects have added incredible features to CPU hardware to "forgive" bad programming skills; while you had to order the sequence of machine code instructions by hand two decades ago, CPUs do that in hardware for you today (e.g., out-of-order processing). A similar trend is clearly visible in the GPU world.
Most of the techniques that were taught as performance improvement techniques in GPU programming five years ago (e.g., thread divergence, shared memory bank conflicts, and reduced usage of atomics) are becoming less relevant with improved GPU architectures, because GPU architects keep adding hardware features that mitigate these inefficiencies so well that within another 5-10 years it won't even matter if a programmer is sloppy about them. However, this is just a guess. What GPU architects can do depends on
their (i) transistor budget, as well as (ii) their customers' demands. When I say transistor budget, I am referring to how many transistors the GPU manufacturers can cram into an Integrated Circuit (IC), aka a "chip." When I say customer demands, I mean that even if they can implement a feature, the applications that their customers are using might not