Topic: Carl Love (IBM)
i is the column index, j is the row index, so the element is a[j][i]
for(j=...)
for(i=...)
a[j][i]
will run faster than
for(i=...)
for(j=...)
a[j][i]
since cache lines won't be evicted (swapped out) as often.
row-major order (row by row) [C]
column-major order (column by column) [Fortran]
this is language dependent.
the first way, each cache line gives one cache miss followed by 7 hits (8 ints per line); the second way, every access is a miss, so 7 more misses per line than the first.
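A minimal sketch of the two traversal orders in C (the array size and element type are my own choices, not from the talk); with C's row-major layout the row-by-row loop walks memory sequentially while the column-by-column loop strides a whole row between accesses:

#include <stdio.h>

#define ROWS 1024
#define COLS 1024

static int a[ROWS][COLS];

/* cache-friendly in C: the inner loop walks along a row,
   so consecutive accesses hit the same cache line */
long sum_row_major(void) {
    long sum = 0;
    for (int j = 0; j < ROWS; j++)        /* j is the row    */
        for (int i = 0; i < COLS; i++)    /* i is the column */
            sum += a[j][i];
    return sum;
}

/* cache-unfriendly in C: the inner loop jumps a whole row
   (COLS * sizeof(int) bytes) between accesses */
long sum_col_major(void) {
    long sum = 0;
    for (int i = 0; i < COLS; i++)
        for (int j = 0; j < ROWS; j++)
            sum += a[j][i];
    return sum;
}

int main(void) {
    printf("%ld %ld\n", sum_row_major(), sum_col_major());
    return 0;
}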
4-way set associative - 4 elements of a 2-dim array can fit into the four cache lines of a set.
with a 256K L2 cache, elements up to a(4,95) will fit into the cache.
L1 cache direct mapped. 16K.
to measure it:
int a[N];                 // N a high number, in the millions
register int val;
start = time();           // read the real-time clock
for (i = 0; i < N; i++) val = a[i];
stop = time();
With optimization on, the compiler may throw this loop away because val is never used.
int *a[N];
a[0] points to a[1], a[1] points to a[2], ... and the last element points back to a[0]
p = *(int **)p;           // forces the compiler to wait until p is loaded before the next iteration
int *a[N];                // a[i] = (int *)&a[i+1], last element points back to the start
register int *p = (int *)a;
start = time();           // read the real-time clock
for (i = 0; i < N; i++) p = *(int **)p;
stop = time();
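A complete, runnable sketch of this pointer-chasing measurement under my own assumptions (clock_gettime instead of a raw cycle counter, a sequential chain, and an array size I picked; sweep N to trace out the curve described below):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)            /* number of pointer-sized elements */

static void *a[N];

int main(void) {
    /* build the chain: each element holds the address of the next,
       the last one points back to the first */
    for (long i = 0; i < N - 1; i++)
        a[i] = &a[i + 1];
    a[N - 1] = &a[0];

    void **p = &a[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        p = (void **)*p;       /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* print p so the compiler can't discard the loop */
    printf("%.1f ns per access (p=%p)\n", ns / N, (void *)p);
    return 0;
}

In practice the chain is often built in a randomized order so a hardware prefetcher can't hide the miss latency.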
the time is flat at 2 cycles per access while the array is smaller than the L1 cache, then jumps and is flat at about 5 cycles while it is smaller than the L2 cache (in practice only up to about 128K, not the full 256K, because of virtual memory pages), then rises roughly linearly to a maximum of about 102 cycles and flattens out. Pages in memory are not laid out perfectly (they may be in any order), so the program probably isn't using all of the L2 cache. The array should be about 2 times the size of the L2 cache to see the maximum time.
each cache miss costs about 95 cycles.
run in Linux single-user mode to eliminate network activity during the measurement.
8 ints per cache line
int a,c;
if a CPU has control of a variable, it has to take control of the entire cache line that variable is in. While the CPU has the line locked, another CPU can't take it until the first one releases it.
if a and c end up in the same cache line and are used by different CPUs, this causes false sharing.
rewrite it as:
int a;
int junk[7];
int c;
this forces c into a different cache line from a; junk isn't used, but the space is allocated.
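A minimal sketch of the effect with two threads (my own illustration, not from the talk; assumes pthreads, a GCC alignment attribute, and the 8-int / 32-byte line from these notes):

#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* unpadded: a and c share one cache line, so two CPUs updating them
   keep stealing the line from each other (false sharing) */
struct { int a; int c; } tight;

/* padded: junk pushes c into the next 32-byte cache line */
struct { int a; int junk[7]; int c; } padded __attribute__((aligned(32)));

static void *bump(void *arg) {
    volatile int *p = arg;                 /* volatile so the loop isn't optimized away */
    for (long i = 0; i < ITERS; i++) (*p)++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    /* time this run, then swap in &padded.a and &padded.c and time it again;
       the padded version should be noticeably faster on a multi-CPU machine */
    pthread_create(&t1, NULL, bump, &tight.a);
    pthread_create(&t2, NULL, bump, &tight.c);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%d %d\n", tight.a, tight.c);
    return 0;
}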
write programs with regard to cache line alignment.
abstraction can cause poor performance.
knowledge of the hardware, how it works, and how the compiler works can help by orders of magnitude.
sample the IP (instruction pointer) every x clock cycles
to help find the cache misses in a program
use performance counters and profilers.
look in loops and nested loops, and at functions that are being called a lot.
inline macros are better than calls to small functions.
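For example (my own illustration, not from the talk):

/* small function: pays call/return overhead on every use
   unless the compiler decides to inline it */
int square_fn(int x) { return x * x; }

/* macro: expanded in place, no call overhead
   (but beware: SQ(i++) evaluates i++ twice) */
#define SQ(x) ((x) * (x))

/* static inline gives the same effect with type checking */
static inline int square_inl(int x) { return x * x; }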
copying memory can be expensive so use assembly functions to copy blocks of memory.