Wednesday, 2 February 2005
memory performance management
Topic: Carl Love (IBM)
In C, a 2-D array a[i][j] (i is the row, j is the column) is stored in row-major order (row by row); FORTRAN stores a(i,j) in column-major order (column by column), so the best loop order is language dependent. In C,

for (i = ...)
    for (j = ...)
        ... a[i][j] ...

will run faster than

for (j = ...)
    for (i = ...)
        ... a[i][j] ...

since the inner loop walks through memory contiguously and cache lines aren't evicted before they have been fully used.

With the first ordering there is one cache miss followed by 7 hits for each cache line; the second ordering misses on every access, so 7 more misses per cache line than the first.
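A minimal sketch of the comparison (the array size, the clock() timing, and the sum accumulator are my additions, not from the talk):

#include <stdio.h>
#include <time.h>

#define N 4096

static int a[N][N];

int main(void)
{
    clock_t start, stop;
    long sum = 0;

    /* cache-friendly in C: the inner loop walks contiguous memory */
    start = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    stop = clock();
    printf("row order:    %.3f s\n", (double)(stop - start) / CLOCKS_PER_SEC);

    /* cache-unfriendly: the inner loop strides by a whole row each step */
    start = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    stop = clock();
    printf("column order: %.3f s\n", (double)(stop - start) / CLOCKS_PER_SEC);

    return (int)(sum & 1);   /* use sum so the compiler can't discard the loops */
}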

4-way set associative - four elements of the 2-D array that map to the same set can sit in the four cache lines (ways) of that set at once.

With a 256 KB L2 cache, elements up to a(4,95) will fit into the cache.
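For scale (my arithmetic, assuming 4-byte ints): a 256 KB L2 holds 256 * 1024 / 4 = 65,536 ints, or 8,192 cache lines of 8 ints each.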

The L1 cache is direct mapped, 16 KB.

To measure it:

int a[N];
register int val;            // N is a high number, in the millions
start = time();              // real-time clock
for (i = 0; i < N; i++)
    val = a[i];
stop = time();

With optimized code the compiler may throw this loop away, since val is never used.

The fix is to chase a chain of pointers instead:

int *a[N];
/* a[0] points to a[1], a[1] to a[2], ..., and the last element points back to a[0] */
p = *(int **)p;              // forces the compiler to wait until p is loaded inside the loop

int *a[N];
for (i = 0; i < N - 1; i++)  // N is a high number, in the millions
    a[i] = (int *)&a[i + 1];
a[N - 1] = (int *)&a[0];     // close the chain
int *p = (int *)a;
start = time();              // real-time clock
for (i = 0; i < N; i++)
    p = *(int **)p;
stop = time();
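A self-contained sketch of that measurement (the chain layout, clock() timing, and nanosecond conversion are my additions; divide by the clock period to get cycles per access):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)              /* number of pointers; sweep this past the L1 and L2 sizes */

int main(void)
{
    void **a = malloc(N * sizeof *a);
    if (a == NULL)
        return 1;

    /* build the chain: a[0] -> a[1] -> ... -> a[N-1] -> a[0] */
    for (long i = 0; i < N - 1; i++)
        a[i] = &a[i + 1];
    a[N - 1] = &a[0];

    void **p = a;
    clock_t start = clock();
    for (long i = 0; i < N; i++)
        p = *p;                  /* each load depends on the previous one */
    clock_t stop = clock();

    /* printing p keeps the loop from being optimized away */
    printf("%.2f ns per access (p=%p)\n",
           (double)(stop - start) / CLOCKS_PER_SEC * 1e9 / N, (void *)p);

    free(a);
    return 0;
}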

The measured time is flat at 2 cycles per access while the array fits in the L1 cache, then jumps and stays flat at 5 cycles per access while the array fits in the L2 cache (though the plateau doesn't reach 128 KB, because of virtual memory pages), then rises linearly to a maximum of about 102 cycles per access and flattens out. Pages in memory are not laid out perfectly (they may be in any order), so the program is probably not using all of the L2 cache. The array has to be about twice the size of the L2 cache to see the maximum time.
Each cache miss costs about 95 cycles.

Boot Linux into single-user mode to eliminate network activity while measuring.

There are 8 ints per cache line on this machine.

int a, c;

If a CPU takes control of a variable, it has to take control of the entire cache line that the variable is in, and while it holds that line locked another CPU can't take it until the first one releases it. If a and c end up in the same cache line but are updated by different CPUs, this causes false sharing. Rewrite it as:

int a;
int junk[7];
int c;

This forces c into a different cache line from a; junk is never used, but the space is allocated as padding. Write programs with regard to cache line alignment.
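A minimal two-thread sketch of the effect using pthreads (compile with -pthread; the iteration count, names, and volatile qualifiers are my choices, not from the talk):

#include <pthread.h>
#include <stdio.h>

struct counters {
    volatile int a;
    int junk[7];      /* padding: with 8 ints per 32-byte line, c lands in the next line
                         (assuming the struct starts on a line boundary) */
    volatile int c;
};

static struct counters cnt;

static void *bump_a(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000000L; i++)
        cnt.a++;      /* this thread only ever touches a */
    return NULL;
}

static void *bump_c(void *arg)
{
    (void)arg;
    for (long i = 0; i < 100000000L; i++)
        cnt.c++;      /* this thread only ever touches c */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_c, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%d c=%d\n", cnt.a, cnt.c);
    return 0;
}

Removing junk[7] puts a and c back in the same cache line; on a multiprocessor the two threads then ping-pong that line between their caches and the loops slow down noticeably.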
Abstraction can cause poor performance. Knowing how the hardware works and how the compiler works can help by orders of magnitude.

To help find the cache misses in a program, use performance counters and profilers; they sample the instruction pointer (IP) every x clock cycles.

Look in loops and nested loops, and at functions that are being called a lot.

Inline macros are better than calls to small functions, since the call overhead is avoided.
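A small illustration (the names and the squaring operation are mine; a modern compiler will often inline the small function on its own):

#include <stdio.h>

/* function version: each use pays call/return overhead unless the compiler inlines it */
static int square_fn(int x) { return x * x; }

/* macro version: expanded in place, no call overhead (beware of side effects in the argument) */
#define SQUARE(x) ((x) * (x))

int main(void)
{
    long sum_fn = 0, sum_mac = 0;
    for (int i = 0; i < 1000; i++) {
        sum_fn  += square_fn(i);
        sum_mac += SQUARE(i);
    }
    printf("%ld %ld\n", sum_fn, sum_mac);
    return 0;
}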

Copying memory can be expensive, so use tuned assembly routines to copy blocks of memory instead of copying element by element.
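For example (a sketch; memcpy in the C library is typically a tuned, often hand-written assembly routine):

#include <string.h>

#define N 100000

static int src[N], dst[N];

void copy_slow(void)
{
    /* element-by-element copy: one load and one store per int */
    for (int i = 0; i < N; i++)
        dst[i] = src[i];
}

void copy_fast(void)
{
    /* block copy through the library's optimized memcpy */
    memcpy(dst, src, sizeof src);
}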

Posted by lcbell4 at 7:01 PM PST
Updated: Wednesday, 2 February 2005 7:04 PM PST