Multi-cores are here, and they are here to stay. Industry trends show that each individual core is likely to become smaller and slower (see my post to understand the reason). Improving the performance of a single program on a multi-core chip requires that the program be split into threads that can run on multiple cores concurrently. In effect, this pushes the problem of finding parallelism in the code onto the programmers. I have noticed that many hardware designers do not understand the challenges of multi-threaded (MT) programming, since they have never written MT apps. This post is meant to show them the tip of this massive iceberg.
    macroclock_output_data mb; // Global variable

    for (...) { // The OUTER loop
        decode_slice(...);
        if (mb->last_mb_in_slice)
            break;
    }

    void decode_slice(...) {
        ...
        mb = ...
    }
X = X + 1
|A||Load X, R0||D||Load X, R0|
|B||Increment R0||E||Increment R0|
|C||Store R0, X||F||Store R0, X|