A Somewhat Painless Introduction to IPROBE

John L. Henning

CSD Performance Group

19 Jun 1997

 

IPROBE

+ Access low-level chip counters

+ Works with ev4, ev45, ev5, ev56, pca56

+ Works with NT, Unix, VMS

 

- Hard to get started

- Incomplete documentation

- Lots of pieces to put together

- A few missing floorboards (but work is in progress to fix them - if
you find any, send email to goddard@zko.dec.com)

 

 

Purpose of this talk

 

 

What you are about to see is true…

A live terminal session

with only minor edits and explanations

 

 

Problem: 102.swim is slower when compiled with

 

-tune=ev5

than with:

-tune=ev4

 

    1. Verify that the -tune switch causes a difference
    2. Find out why
    3. Discover a workaround and/or file detailed problem reports with the right engineering group(s)

 

Parts 1 and 2 are shown on the pages that follow.

 

 

 

 

 

 

Compile it both ways and verify the difference

Script started on Tue Jun 17 06:06:42 1997

% unlimit

% setenv PARALLEL 4

% pwd

/users01/john/swim

% ls *

typescript

 

e4:

swim.f swim.in swim2.in

e5:

swim.f swim.in swim2.in

% cd e5

% kf77 -v -V -machine_code -fkapargs='-mc=10000' swim.f

/usr/bin/kapf -cmp=./swim.cmp.f swim.f -mc=10000 -conc -tune=EV5 -natural

KAP/Digital_UA_F 3.1a k280615 970519 17-Jun-1997 06:07:39

0 errors in file swim.f

/usr/bin/f77 -fast -automatic -v -V -machine_code ./swim.cmp.f -tune host

-call_shared -lkmp_osfp10 -pthread

. . .

% time ./a.out <swim.in >swim.out

228.55u 0.26s 0:57 399% 0+153k 0+8io 0pf+0w

 

% cd ../e4

% kf77 -v -V -machine_code -fkapargs='-mc=10000 -tune=ev4' swim.f

/usr/bin/kapf -cmp=./swim.cmp.f swim.f -mc=10000 -tune=ev4 -conc -natural

KAP/Digital_UA_F 3.1a k280615 970519 17-Jun-1997 06:09:43

0 errors in file swim.f

/usr/bin/f77 -fast -automatic -v -V -machine_code ./swim.cmp.f -tune host

-call_shared -lkmp_osfp10 -pthread

. . .

% time ./a.out < swim.in > swim.out

196.58u 0.21s 0:49 399% 0+153k 0+2io 0pf+0w

 

 

 

 

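  1. -v so we can see the kapf and f77 command lines that kf77 generates
  2. -V -machine_code for ease of interpreting disassemblies: -V writes a listing (swim.cmp.l) and -machine_code adds the generated code to it, so the compiler output can be compared with dis output
  3. -tune=EV5 is generated automatically unless -tune is given via -fkapargs
  4. The EV5 tuning is about 16% slower (228.55 vs. 196.58 user seconds; 0:57 vs. 0:49 elapsed)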
  5. Get IPROBE over the network
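The slowdown can be computed directly from the two csh time lines above. Below is a minimal Python sketch that parses that output format and prints the EV5/EV4 ratios for user and elapsed time. It assumes the default csh format shown in this session: user seconds, system seconds, elapsed [h:]m:ss, and %CPU, followed by fields it ignores. parse_csh_time is a hypothetical helper written for this note. It is not part of IPROBE or kf77.

```python
import re

# csh `time` builtin output, e.g. "228.55u 0.26s 0:57 399% 0+153k 0+8io 0pf+0w".
# Only user s, system s, elapsed [h:]m:ss and %CPU are parsed; the rest is ignored.
TIME_RE = re.compile(r"(\d+(?:\.\d+)?)u\s+(\d+(?:\.\d+)?)s\s+([\d:.]+)\s+(\d+)%")


def parse_csh_time(line):
    """Return user/system/elapsed seconds and %CPU from one csh time line."""
    m = TIME_RE.search(line)
    if m is None:
        raise ValueError(f"not csh time output: {line!r}")
    elapsed = 0.0
    for part in m.group(3).split(":"):  # [h:]m:ss -> seconds
        elapsed = elapsed * 60 + float(part)
    return {
        "user": float(m.group(1)),
        "system": float(m.group(2)),
        "elapsed": elapsed,
        "cpu_pct": int(m.group(4)),
    }


# The two runs from the session above.
ev5 = parse_csh_time("228.55u 0.26s 0:57 399% 0+153k 0+8io 0pf+0w")
ev4 = parse_csh_time("196.58u 0.21s 0:49 399% 0+153k 0+2io 0pf+0w")
for key in ("user", "elapsed"):
    print(f"{key:8s} ev5/ev4 = {ev5[key] / ev4[key]:.3f}")
```

Both ratios come out at about 1.16. With PARALLEL 4 and 399% CPU, user time and elapsed time tell the same story here.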

 

% su

Password:

# mkdir iprobe

# cd iprobe

# ftp

ftp> op 16.31.144.83

Connected to 16.31.144.83.

220 perf.zko.dec.com FTP server (Digital UNIX Version 5.60) ready.

Name (16.31.144.83:john): anonymous

331 Guest login ok, send ident as password.

Password:

230 Guest login ok, access restrictions apply.

Remote system type is UNIX.

Using binary mode to transfer files.

ftp> cd pub

250 CWD command successful.

ftp> cd IprobeKits

250 CWD command successful.

ftp> pwd

257 "/pub/IprobeKits" is current directory.

ftp> ls

200 PORT command successful.

150 Opening ASCII mode data connection for file list (16.31.32.191,1039).

Iprobe.ps

WhatsHere.txt

Iprobe020.a

IprobeVms021.a

IprobeVms022.a

Nt35IprobeT21.zip

Nt35IprobeT21Update.zip

Nt40IprobeT22Ev4.zip

Nt40IprobeT23Ev5.zip

UnzipAxp.exe

IprNew.mod

Iprobe020ProgrammingKit.bck

Api.doc

Iprobe021Osf.tar.Z

Iprobe0221Unix40.tar.Z

TurboLaserBusMonitorUnix.tar.Z

226 Transfer complete.

ftp> bin

200 Type set to I.

ftp> get Iprobe0221Unix40.tar.Z

200 PORT command successful.

150 Opening BINARY mode data connection for Iprobe0221Unix40.tar.Z (16.31.32.191,1040) (692920 bytes).

226 Transfer complete.

692920 bytes received in 4.2 seconds (1.6e+02 Kbytes/s)

ftp> bye

221 Goodbye.

 

Install IPROBE

 

# ls

Iprobe0221Unix40.tar.Z

# zcat * | tar xvf -

blocksize = 16

x ./IPRTEST

x ./IPRTEST/instctrl

x ./IPRTEST/instctrl/IPRBASE221.inv, 3284 bytes, 7 tape blocks

x ./IPRTEST/instctrl/IPRBASE221.ctrl, 168 bytes, 1 tape blocks

x ./IPRTEST/instctrl/IPRBASE221.scp, 4187 bytes, 9 tape blocks

x ./IPRTEST/instctrl/IPRBASE221.image, 24 bytes, 1 tape blocks

x ./IPRTEST/IPRBASE221, 1505280 bytes, 2940 tape blocks

x ./IPRTEST/INSTCTRL, 20480 bytes, 40 tape blocks

x ./IPRTEST/.image, 24 bytes, 1 tape blocks

x ./IPRTEST/IPRBASE221.image, 24 bytes, 1 tape blocks

# cd IPRTEST

# ls

.image IPRBASE221 instctrl

INSTCTRL IPRBASE221.image

# setld -l .

 

The subsets listed below are optional:

 

There may be more optional subsets than can be presented on a single

screen. If this is the case, you can choose subsets screen by screen

or all at once on the last screen. All of the choices you make will

be collected for your confirmation before any subsets are installed.

 

1) IPROBE kit for Digital Unix 4.0

 

Or you may choose one of the following options:

 

2) ALL of the above

3) CANCEL selections and redisplay menus

4) EXIT without installing any subsets

 

Enter your choices or press RETURN to redisplay menus.

 

Choices (for example, 1 2 4-6): 1

 

You are installing the following optional subsets:

 

IPROBE kit for Digital Unix 4.0

 

Is this correct? (y/n): y

 

Checking file system space required to install selected subsets:

 

File system space checked OK.

 

1 subset(s) will be installed.

 

Loading 1 of 1 subset(s)....

Checking for previous IPROBE kits...

No previous IPROBE installation found

 

IPROBE kit for Digital Unix 4.0

Copying from . (disk)

Verifying

 

1 of 1 subset(s) installed successfully.

 

 

Configuring "IPROBE kit for Digital Unix 4.0" (IPRBASE221)

Copying Digital Unix V4.0 specific files to target directories...

Done

Configuring and loading the IPROBE driver...

IPROBE for Digital Unix 4.0 succesfully installed

# exit

 

 

The actual installation of IPROBE is the only part that requires privileges. At this point we return to non-privileged mode.

 

Note that a reboot of the system is NOT required - IPROBE has a loadable driver.

 

Test that IPROBE was installed correctly

 

% rehash

% iprobe

Node name : gemosf.zko.dec.com

OS : OSF1 T4.0-738.5

CPU count : 4

Model : Unknown

Memory size : 381 MB

Counter count : 3

cycles : Low frequency

Current time : Tue Jun 17 06:15:44 1997

Start time: : immediate

Duration : 0 (until user interrupts)

Interval : 1

Method : count

Measured Modes : all modes

Measured Data : pid ctr ps pc

Buffer_count : 3

Buffer_size : 8192

time cpu freq event # events evts/sec

06:15:44 0 2^16 cycles 399310848 399310848

06:15:44 1 2^16 cycles 399310848 399310848

06:15:44 2 2^16 cycles 399310848 399310848

06:15:44 3 2^16 cycles 399310848 399310848

06:15:45 0 2^16 cycles 399572992 399572992

06:15:45 1 2^16 cycles 399572992 399572992

06:15:45 2 2^16 cycles 399572992 399572992

06:15:45 3 2^16 cycles 399572992 399572992

06:15:46 0 2^16 cycles 399638528 399638528

06:15:46 1 2^16 cycles 399638528 399638528

06:15:46 2 2^16 cycles 399638528 399638528

06:15:46 3 2^16 cycles 399572992 399572992

06:15:47 0 2^16 cycles 79495168 79495168

06:15:47 1 2^16 cycles 79429632 79429632

06:15:47 2 2^16 cycles 79429632 79429632

06:15:47 3 2^16 cycles 79495168 79495168

Total event count:

06:15:47 0 2^16 cycles 1278017536

06:15:47 1 2^16 cycles 1277952000

06:15:47 2 2^16 cycles 1277952000

06:15:47 3 2^16 cycles 1277952000

 

(^C to terminate)

 

 

The display shows that IPROBE sees all 4 CPUs. Each one counts about 400 million cycles per second, so the CPUs are running at 400 MHz.
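The 400 MHz figure follows from the per-second counts. Each full interval shows roughly 399.3 to 399.6 million cycles per CPU, and every count is a multiple of 2^16, which matches the freq column. The sketch below averages each CPU's full intervals. It drops the last interval, on the assumption that ^C cut it short (its counts are only about 79 million). clock_mhz is a hypothetical Python helper and is not shipped with IPROBE.

```python
from collections import defaultdict

# Per-second count lines from the `iprobe` run above:
#   time  cpu  freq  event  #events  evts/sec
COUNT_OUTPUT = """\
06:15:44 0 2^16 cycles 399310848 399310848
06:15:44 1 2^16 cycles 399310848 399310848
06:15:44 2 2^16 cycles 399310848 399310848
06:15:44 3 2^16 cycles 399310848 399310848
06:15:45 0 2^16 cycles 399572992 399572992
06:15:45 1 2^16 cycles 399572992 399572992
06:15:45 2 2^16 cycles 399572992 399572992
06:15:45 3 2^16 cycles 399572992 399572992
06:15:46 0 2^16 cycles 399638528 399638528
06:15:46 1 2^16 cycles 399638528 399638528
06:15:46 2 2^16 cycles 399638528 399638528
06:15:46 3 2^16 cycles 399572992 399572992
06:15:47 0 2^16 cycles 79495168 79495168
06:15:47 1 2^16 cycles 79429632 79429632
06:15:47 2 2^16 cycles 79429632 79429632
06:15:47 3 2^16 cycles 79495168 79495168
Total event count:
06:15:47 0 2^16 cycles 1278017536
"""


def clock_mhz(text, event="cycles"):
    """Estimate each CPU's clock rate in MHz from per-second cycle counts."""
    per_cpu = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Interval rows have 6 fields; "Total event count" rows have only 5.
        if len(fields) == 6 and fields[3] == event:
            per_cpu[int(fields[1])].append(int(fields[5]))
    # Assumption: the last interval was cut short by ^C, so it is dropped.
    return {cpu: sum(rates[:-1]) / (len(rates) - 1) / 1e6
            for cpu, rates in per_cpu.items() if len(rates) > 1}


for cpu, mhz in sorted(clock_mhz(COUNT_OUTPUT).items()):
    print(f"cpu {cpu}: {mhz:.1f} MHz")
```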

Create a script to invoke IPROBE and the benchmark

 

% pwd

/users01/john/swim/e4

% cd ..

% cat >run_cycl_and_bmiss.csh

set verbose

unlimit

set notify

setenv PARALLEL 4

iprobe -quiet -method sample cycles bcache_miss &

time ./a.out <swim.in >swim.out

kill %iprobe

unset verbose

 

 

IPROBE is run in the background and killed by the script.

Why csh? Its builtin time and kill commands are handy, and kill can name the background job (%iprobe) instead of a process ID.

Usage:

csh

% source run_cycl_and_bmiss.csh

Source the script from an interactive csh, which knows about the %iprobe job. Otherwise the kill fails:

kill %iprobe

%iprobe: No such process

-quiet: don’t litter my screen

-method sample: save the event samples (in pcsample.dat) for later analysis

cycles bcache_miss: the events we want
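For readers who would rather not depend on csh job control, the same sequence (start the profiler, run the benchmark, kill the profiler) can be sketched in Python with the subprocess module. run_under_profiler is a hypothetical helper and is not part of IPROBE. The sketch assumes a POSIX system. It also assumes that IPROBE finishes writing its samples when it receives SIGTERM, which is the signal csh's kill sends by default. So that it runs anywhere, the demo uses stand-in commands. The iprobe invocation in the comment uses only the options from the script above.

```python
import contextlib
import signal
import subprocess
import sys
import time


def run_under_profiler(profiler_cmd, workload_cmd, stdin_path=None,
                       stdout_path=None, env=None):
    """Start profiler_cmd in the background, run workload_cmd to completion,
    then SIGTERM the profiler (csh's `kill %iprobe` also sends SIGTERM).
    Returns (workload exit status, profiler exit status, elapsed seconds)."""
    with contextlib.ExitStack() as stack:
        stdin = stack.enter_context(open(stdin_path, "rb")) if stdin_path else None
        stdout = stack.enter_context(open(stdout_path, "wb")) if stdout_path else None
        profiler = subprocess.Popen(profiler_cmd, env=env)
        try:
            start = time.perf_counter()
            workload = subprocess.run(workload_cmd, stdin=stdin, stdout=stdout, env=env)
            elapsed = time.perf_counter() - start
        finally:
            profiler.send_signal(signal.SIGTERM)  # like `kill %iprobe`
            profiler.wait()
    return workload.returncode, profiler.returncode, elapsed


# Stand-ins so the sketch runs anywhere: a "profiler" that just sleeps and a
# trivial "benchmark". With IPROBE the call would be something like
#   run_under_profiler(["iprobe", "-quiet", "-method", "sample", "cycles", "bcache_miss"],
#                      ["./a.out"], "swim.in", "swim.out",
#                      env=dict(os.environ, PARALLEL="4"))
fake_profiler = [sys.executable, "-c", "import time; time.sleep(60)"]
fake_benchmark = [sys.executable, "-c", "print('benchmark done')"]
work_rc, prof_rc, secs = run_under_profiler(fake_profiler, fake_benchmark)
print(work_rc, prof_rc, f"{secs:.2f}s")
```

The sketch does not replicate unlimit. Resource limits would have to be raised separately, for example with Python's resource module.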

 

 

→ Pick your two favorite events and always start there

 

 

Collect the events

 

% cd e4

% source ../run_cycl_and_bmiss.csh

unlimit

set notify

setenv PARALLEL 4

iprobe -quiet -method sample cycles bcache_miss &

[1] 3529

time ./a.out < swim.in > swim.out

Start of sampling

196.35u 0.53s 0:49 396% 0+153k 0+5io 0pf+0w

kill %iprobe

unset verbose

%

[1] Terminated iprobe -quiet -method sample cycles bcache_miss

 

% cd ../e5

% !sour

% source ../run_cycl_and_bmiss.csh

unlimit

set notify

setenv PARALLEL 4

iprobe -quiet -method sample cycles bcache_miss &

[1] 3535

time ./a.out < swim.in > swim.out

Start of sampling

238.44u 0.59s 1:00 397% 0+153k 0+5io 0pf+0w

kill %iprobe

unset verbose

%

[1] Terminated iprobe -quiet -method sample cycles bcache_miss

 

% cd ..

% ls *

run_cycl_and_bmiss.csh typescript

 

e4:

a.out swim.cmp.f swim.f swim.out

pcsample.dat swim.cmp.l swim.in swim2.in

 

e5:

a.out swim.cmp.f swim.f swim.out

pcsample.dat swim.cmp.l swim.in swim2.in

 

 

Grab John’s data reduction harness

 

% ftp

ftp> op 16.31.144.83

Connected to 16.31.144.83.

220 perf.zko.dec.com FTP server (Digital UNIX Version 5.60) ready.

Name (16.31.144.83:john): anonymous

331 Guest login ok, send ident as password.

Password:

ftp> cd pub/henning

ftp> get harness.pl

200 PORT command successful.

150 Opening BINARY mode data connection for harness.pl (16.31.32.191,1045) (22835 bytes).

226 Transfer complete.

22835 bytes received in 0.032 seconds (7e+02 Kbytes/s)

ftp> bye

221 Goodbye.

 

% chmod +x harness.pl

 

% head harness.pl

#!/usr/local/bin/perl

# "harness.pl" - this edition noon 24 Mar 97

#

# Assist with data reduction for IPROBE pc samples.

#

# One of the common complaints about IPROBE is that the supporting tools

# for data reduction are hard to use. This script tries to make things

# easier in two ways:

#

# 1) By providing a harness for data reduction which may meet users'

 

%

% ls /usr/local/bin/perl

/usr/local/bin/perl

 

If you don’t already have perl (gasp!) see
http://www.perl.com/perl/info/software.html

 

If you do have it but your copy is not in /usr/local/bin, be sure to change the first line of the script (#!/usr/local/bin/perl).

 

Report bugs in harness.pl to henning@zko.dec.com or PERFOM::INTERNAL_PERF_TOOLS note 89

Attempt to invoke the harness, discover the non_shared floorboard

 

% cd e4

% ls

a.out swim.cmp.f swim.f swim.out

pcsample.dat swim.cmp.l swim.in swim2.in

 

% ../harness.pl -x a.out -d pcsample.dat -e cycles

Os=unix

Running rep to create addresses.resolved

rep a.out

You did remember to include a ./, didn't you?

 

Your output file will be addresses.resolved

 

SPEC benchmark 102.swim

Time passes… press Control-C

 

Generating top-level report for cycles

ipreduce -input_file pcsample.dat -output_file cycles.rpt -event cycles -pthresh 1

forrtl: info: Fortran error message number is 69.

forrtl: warning: Could not open message catalog: for_msg.cat.

forrtl: info: Check environment variable NLSPATH and protection of /usr/lib/nls/msg/en_US.ISO8859-1/for_msg.cat.

forrtl: error (69): Message not found

can't open cycles.rpt at ../harness.pl line 245.

 

% pwd

/users01/john/swim/e4

% ./a.out <swim.in >swim.out &

[1] 3552

% rep -pid 3552

 

Your output file will be addresses.resolved

 

Resolving addresses

Writing output file addresses.resolved

 

% kill %1

% cd ../e5

% ./a.out < swim.in >swim.out &

[1] 3561

% rep -pid 3561

 

Your output file will be addresses.resolved

 

Resolving addresses

Writing output file addresses.resolved

 

% kill %1

 

Try the harness again…

 

% cd e4

% ../harness.pl -x a.out -d pcsample.dat -e cycles

Os=unix

 

Generating top-level report for cycles

ipreduce -input_file pcsample.dat -output_file cycles.rpt -event cycles

-pthresh 1

 

ipreduce -o cycles_pkcalc3__.rpd -d pc -event cycles -input_file pcsample.dat

-pc 12000D430:12000D8AF

dis -h -p pkcalc3__ a.out > pkcalc3__.dis_tmp

Annotating pkcalc3__

 

ipreduce -o cycles_pkcalc2__.rpd -d pc -event cycles -input_file pcsample.dat

-pc 12000C350:12000C8AF

dis -h -p pkcalc2__ a.out > pkcalc2__.dis_tmp

Annotating pkcalc2__

 

ipreduce -o cycles_pkcalc1__.rpd -d pc -event cycles -input_file pcsample.dat

-pc 12000B690:12000BAFF

dis -h -p pkcalc1__ a.out > pkcalc1__.dis_tmp

Annotating pkcalc1__

 

ipreduce -o cycles_spin_wait_join_barrier.rpd -d pc -event cycles -input_file

pcsample.dat -pc 1200B6088:1200B61CF

dis -h -p spin_wait_join_barrier a.out > spin_wait_join_barrier.dis_tmp

Annotating spin_wait_join_barrier

 

% ../harness.pl -x a.out -d pcsample.dat -e bcache_miss

Os=unix

 

Generating top-level report for bcache_miss

ipreduce -input_file pcsample.dat -output_file bcache_miss.rpt -event

bcache_miss -pthresh 1

 

ipreduce -o bcache_miss_pkcalc2__.rpd -d pc -event bcache_miss -input_file

pcsample.dat -pc 12000C350:12000C8AF

Annotating pkcalc2__

 

ipreduce -o bcache_miss_pkcalc3__.rpd -d pc -event bcache_miss -input_file

pcsample.dat -pc 12000D430:12000D8AF

Annotating pkcalc3__

 

ipreduce -o bcache_miss_pkcalc1__.rpd -d pc -event bcache_miss -input_file

pcsample.dat -pc 12000B690:12000BAFF

Annotating pkcalc1__

 

ipreduce -o bcache_miss_calc3_.rpd -d pc -event bcache_miss -input_file

pcsample.dat -pc 12000CDD0:12000D42F

dis -h -p calc3_ a.out > calc3_.dis_tmp

Annotating calc3_

 

% cd ../e5

% ../harness.pl -x a.out -d pcsample.dat -e cycles

Os=unix

 

Generating top-level report for cycles

ipreduce -input_file pcsample.dat -output_file cycles.rpt -event cycles

-pthresh 1

 

ipreduce -o cycles_pkcalc3__.rpd -d pc -event cycles -input_file pcsample.dat

-pc 12000C510:12000C93F

dis -h -p pkcalc3__ a.out > pkcalc3__.dis_tmp

Annotating pkcalc3__

 

ipreduce -o cycles_pkcalc2__.rpd -d pc -event cycles -input_file pcsample.dat

-pc 12000B790:12000B9DF

dis -h -p pkcalc2__ a.out > pkcalc2__.dis_tmp

Annotating pkcalc2__

 

ipreduce -o cycles_pkcalc1__.rpd -d pc -event cycles -input_file pcsample.dat

-pc 12000B040:12000B22F

dis -h -p pkcalc1__ a.out > pkcalc1__.dis_tmp

Annotating pkcalc1__

 

ipreduce -o cycles_spin_wait_join_barrier.rpd -d pc -event cycles -input_file

pcsample.dat -pc 1200B5118:1200B525F

dis -h -p spin_wait_join_barrier a.out > spin_wait_join_barrier.dis_tmp

Annotating spin_wait_join_barrier

 

% ../harness.pl -x a.out -d pcsample.dat -e bcache_miss

Os=unix

 

Generating top-level report for bcache_miss

ipreduce -input_file pcsample.dat -output_file bcache_miss.rpt -event

bcache_miss -pthresh 1

 

ipreduce -o bcache_miss_pkcalc2__.rpd -d pc -event bcache_miss -input_file

pcsample.dat -pc 12000B790:12000B9DF

Annotating pkcalc2__

 

ipreduce -o bcache_miss_pkcalc3__.rpd -d pc -event bcache_miss -input_file

pcsample.dat -pc 12000C510:12000C93F

Annotating pkcalc3__

 

ipreduce -o bcache_miss_pkcalc1__.rpd -d pc -event bcache_miss -input_file

pcsample.dat -pc 12000B040:12000B22F

Annotating pkcalc1__

 

ipreduce -o bcache_miss_calc3_.rpd -d pc -event bcache_miss -input_file

pcsample.dat -pc 12000BF00:12000C50F

dis -h -p calc3_ a.out > calc3_.dis_tmp

Annotating calc3_

 

 

Results of harness.pl

 

 

% ls e4 e5

 

e4:
a.out                      pkcalc1__.dis
addresses.resolved         pkcalc1__.source_bcache_miss
bcache_miss.hot_routines   pkcalc1__.source_cycles
bcache_miss.rpt            pkcalc2__.dib
bcache_miss_calc3_.rpd     pkcalc2__.dis
bcache_miss_pkcalc1__.rpd  pkcalc2__.source_bcache_miss
bcache_miss_pkcalc2__.rpd  pkcalc2__.source_cycles
bcache_miss_pkcalc3__.rpd  pkcalc3__.dib
calc3_.dib                 pkcalc3__.dis
calc3_.dis                 pkcalc3__.source_bcache_miss
calc3_.source_bcache_miss  pkcalc3__.source_cycles
cycles.hot_routines        swim.cmp.f
cycles.rpt                 swim.cmp.l
cycles_pkcalc1__.rpd       swim.f
cycles_pkcalc2__.rpd       swim.in
cycles_pkcalc3__.rpd       swim.out
pcsample.dat               swim2.in
pkcalc1__.dib

e5:
a.out                      pkcalc1__.dis
addresses.resolved         pkcalc1__.source_bcache_miss
bcache_miss.hot_routines   pkcalc1__.source_cycles
bcache_miss.rpt            pkcalc2__.dib
bcache_miss_calc3_.rpd     pkcalc2__.dis
bcache_miss_pkcalc1__.rpd  pkcalc2__.source_bcache_miss
bcache_miss_pkcalc2__.rpd  pkcalc2__.source_cycles
bcache_miss_pkcalc3__.rpd  pkcalc3__.dib
calc3_.dib                 pkcalc3__.dis
calc3_.dis                 pkcalc3__.source_bcache_miss
calc3_.source_bcache_miss  pkcalc3__.source_cycles
cycles.hot_routines        swim.cmp.f
cycles.rpt                 swim.cmp.l
cycles_pkcalc1__.rpd       swim.f
cycles_pkcalc2__.rpd       swim.in
cycles_pkcalc3__.rpd       swim.out
pcsample.dat               swim2.in
pkcalc1__.dib

 

Hot Routines Report, first clues from disassembly

 

% cat e4/cycles.hot_routines

Hot Routines for cycles -pthresh 1

Events % Routine Image Addr

434186 36 pkcalc3__ a.out 12000D430:12000D8AF

352761 29 pkcalc2__ a.out 12000C350:12000C8AF

267796 22 pkcalc1__ a.out 12000B690:12000BAFF

101029 8 spin_wait_join_barrier a.out 1200B6088:1200B61CF

% cat e5/cycles.hot_routines

Hot Routines for cycles -pthresh 1

Events % Routine Image Addr

504803 34 pkcalc3__ a.out 12000C510:12000C93F

422918 29 pkcalc2__ a.out 12000B790:12000B9DF

278563 19 pkcalc1__ a.out 12000B040:12000B22F

199509 14 spin_wait_join_barrier a.out 1200B5118:1200B525F

 

 

Notice that pkcalc2__ has about 20% more cycle samples with -tune=ev5 (422918 vs. 352761).
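Comparing two hot-routines reports by hand gets tedious once several events are involved. Below is a minimal Python sketch that reads the report format shown above and prints the EV5/EV4 ratio for each routine. parse_hot_routines is a hypothetical helper and is not part of harness.pl. The ratios compare sample counts rather than raw cycles. They are meaningful here only because both runs were sampled with the same iprobe command.

```python
EV4_CYCLES = """\
Hot Routines for cycles -pthresh 1
Events % Routine Image Addr
434186 36 pkcalc3__ a.out 12000D430:12000D8AF
352761 29 pkcalc2__ a.out 12000C350:12000C8AF
267796 22 pkcalc1__ a.out 12000B690:12000BAFF
101029 8 spin_wait_join_barrier a.out 1200B6088:1200B61CF
"""

EV5_CYCLES = """\
Hot Routines for cycles -pthresh 1
Events % Routine Image Addr
504803 34 pkcalc3__ a.out 12000C510:12000C93F
422918 29 pkcalc2__ a.out 12000B790:12000B9DF
278563 19 pkcalc1__ a.out 12000B040:12000B22F
199509 14 spin_wait_join_barrier a.out 1200B5118:1200B525F
"""


def parse_hot_routines(text):
    """Map routine name -> event count for the data rows of a report."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows: events  percent  routine  image  lo:hi
        if len(fields) == 5 and fields[0].isdigit():
            counts[fields[2]] = int(fields[0])
    return counts


ev4 = parse_hot_routines(EV4_CYCLES)
ev5 = parse_hot_routines(EV5_CYCLES)
print(f"{'routine':24s} {'ev4':>8s} {'ev5':>8s} {'ev5/ev4':>8s}")
for name in sorted(ev4, key=ev4.get, reverse=True):
    if name in ev5:
        print(f"{name:24s} {ev4[name]:8d} {ev5[name]:8d} {ev5[name] / ev4[name]:8.2f}")
```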

 

 

 

% cd e4

% wc -l pkcalc2__.dis

348 pkcalc2__.dis

% cd ../e5

% wc -l pkcalc2__.dis

152 pkcalc2__.dis

 

Note that the ev5 version has far fewer instructions (152 lines of disassembly vs. 348). This is usually a bad sign, often indicating that a loop was not unrolled. (Rule of thumb: for integer programs, you care about the istream and want it to be small. For floating-point programs, you care about the dstream and will spend lots of instructions to get better dstream flow.)
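The wc -l counts can be cross-checked against the address ranges that harness.pl passed to ipreduce, because every Alpha instruction is 4 bytes. The ranges give 344 instructions for the EV4 pkcalc2__ and 148 for EV5. The 4-line difference from 348 and 152 is presumably the header lines at the top of each .dis file. The same arithmetic can be applied to the inner loops in the disassemblies that follow, counting from the branch target to the closing bne. That gives 139 instructions per trip for EV4, where each trip handles three elements, versus 59 per element for EV5. Both helpers are hypothetical, written for this note.

```python
ALPHA_INSN_BYTES = 4  # every Alpha instruction is one 32-bit longword


def insn_count(first, last):
    """Instructions from hex address `first` through `last`, inclusive.
    `last` may be the final byte (as in the -pc ranges) or the address of
    the final instruction (as in the .dis listing); both give the same count."""
    return (int(last, 16) - int(first, 16)) // ALPHA_INSN_BYTES + 1


def range_insns(pc_range):
    """Instruction count for a "lo:hi" range as printed by harness.pl."""
    lo, hi = pc_range.split(":")
    return insn_count(lo, hi)


print("pkcalc2__ ev4:", range_insns("12000C350:12000C8AF"))  # whole routine
print("pkcalc2__ ev5:", range_insns("12000B790:12000B9DF"))
# Inner loops: from the branch target to the closing bne.
ev4_loop = insn_count("12000c4b4", "12000c6dc")  # 3 elements per trip
ev5_loop = insn_count("12000b8d8", "12000b9c0")  # 1 element per trip
print(f"instructions per element: ev4 {ev4_loop / 3:.1f}, ev5 {ev5_loop}")
```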

 

Source cycles report EV5

 

% cat pkcalc2__.source_cycles

cycles for pkcalc2__ by source line

printing lines with at least 4229.18 events

 

swim 983 18564

swim 984 102647

swim 985 93550

swim 986 25799

swim 987 61730

swim 988 52982

swim 989 21188

swim 990 38300

swim 991 7957

 

% head -993 swim.cmp.f | tail -12

      DO J1=II1,II2
      DO I1=1,M1
      UNEW1(I1+1,J1) = UOLD1(I1+1,J1) + TDTS81 * (Z1(I1+1,J1+1) + Z1(
     X I1+1,J1)) * (CV1(I1+1,J1+1) + CV1(I1,J1+1) + CV1(I1,J1) + CV1
     X (I1+1,J1)) - TDTSDX1 * (H1(I1+1,J1) - H1(I1,J1))
      VNEW1(I1,J1+1) = VOLD1(I1,J1+1) - TDTS81 * (Z1(I1+1,J1+1) + Z1(
     X I1,J1+1)) * (CU1(I1+1,J1+1) + CU1(I1,J1+1) + CU1(I1,J1) + CU1
     X (I1+1,J1)) - TDTSDY1 * (H1(I1,J1+1) - H1(I1,J1))
      PNEW1(I1,J1) = POLD1(I1,J1) - TDTSDX1 * (CU1(I1+1,J1) - CU1(I1,
     X J1)) - TDTSDY1 * (CV1(I1,J1+1) - CV1(I1,J1))
      END DO
      END DO

 

The cycles are concentrated in source lines 983-991, the body of this one loop nest, and KAP has not unrolled the loop.
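The threshold in the report header appears to be the -pthresh 1 cutoff applied to the routine's total. 4229.18 is exactly 1% of the 422918 pkcalc2__ samples in e5/cycles.hot_routines, and 3527.61 is 1% of 352761 for EV4. Assuming that rule, this Python sketch reproduces the threshold. It also shows that the printed lines 983-991 hold about 99.95% of the routine's cycle samples. hot_source_lines is a hypothetical helper and is not part of harness.pl.

```python
def hot_source_lines(line_events, routine_total, pthresh=1.0):
    """Keep source lines with at least pthresh percent of the routine's events."""
    threshold = routine_total * pthresh / 100.0
    kept = {line: n for line, n in line_events.items() if n >= threshold}
    return threshold, kept


# EV5 pkcalc2__ per-line cycle samples (pkcalc2__.source_cycles above)
# and the routine total (e5/cycles.hot_routines).
EV5_LINES = {983: 18564, 984: 102647, 985: 93550, 986: 25799, 987: 61730,
             988: 52982, 989: 21188, 990: 38300, 991: 7957}
EV5_TOTAL = 422918

threshold, kept = hot_source_lines(EV5_LINES, EV5_TOTAL)
share = sum(kept.values()) / EV5_TOTAL
print(f"threshold {threshold:.2f}; lines {min(kept)}-{max(kept)} hold {share:.2%}")
```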

Source Cycles EV4

 

% cd ../e4

% !cat

% cat pkcalc2__.source_cycles

cycles for pkcalc2__ by source line

printing lines with at least 3527.61 events

 

swim 1559 33234

swim 1560 10864

swim 1561 8004

swim 1565 13459

swim 1567 4733

swim 1568 5978

swim 1569 10359

swim 1570 4670

swim 1573 10287

swim 1577 8946

swim 1578 8405

swim 1579 13754

swim 1582 6124

swim 1589 4057

swim 1595 19931

swim 1596 25718

swim 1597 6779

swim 1598 9580

swim 1600 3598

swim 1603 3984

swim 1607 7356

swim 1608 9548

swim 1609 7021

swim 1610 11351

swim 1611 6496

swim 1612 8425

swim 1617 4728

swim 1625 3693

swim 1626 20320

swim 1627 11306

 

But with -tune=ev4, the action seems more spread out…

 

 

EV4 KAP Source Code

 

% head -1630 swim.cmp.f | tail -70

RR51 = Z1(I1+II21,J1+1) + Z1(I1+II21,J1)

RR131 = TDTS81 * RR111

RR141 = TDTS81 * RR121

RR41 = TDTS81 * RR51

RR151 = CV1(I1+1,J1+1) + CV1(I1,J1+1)

RR161 = CV1(I1+II11,J1+1) + CV1(I1+II31,J1+1)

RR81 = CV1(I1+II21,J1+1) + CV1(I1+II11,J1+1)

RR171 = RR151 + CV1(I1,J1)

RR181 = RR161 + CV1(I1+II31,J1)

RR71 = RR81 + CV1(I1+II11,J1)

RR191 = RR171 + CV1(I1+1,J1)

RR201 = RR181 + CV1(I1+II11,J1)

RR61 = RR71 + CV1(I1+II21,J1)

RR211 = RR131 * RR191

RR221 = RR141 * RR201

RR31 = RR41 * RR61

RR231 = UOLD1(I1+1,J1) + RR211

RR241 = UOLD1(I1+II11,J1) + RR221

RR29 = UOLD1(I1+II21,J1) + RR31

RR251 = H1(I1+1,J1) - H1(I1,J1)

RR261 = H1(I1+II11,J1) - H1(I1+II31,J1)

RR101 = H1(I1+II21,J1) - H1(I1+II11,J1)

RR271 = TDTSDX1 * RR251

RR281 = TDTSDX1 * RR261

RR91 = TDTSDX1 * RR101

UNEW1(I1+1,J1) = RR231 - RR271

UNEW1(I1+II11,J1) = RR241 - RR281

UNEW1(I1+II21,J1) = RR29 - RR91

RR111 = Z1(I1+1,J1+1) + Z1(I1,J1+1)

RR121 = Z1(I1+II11,J1+1) + Z1(I1+II31,J1+1)

RR51 = Z1(I1+II21,J1+1) + Z1(I1+II11,J1+1)

RR131 = TDTS81 * RR111

RR141 = TDTS81 * RR121

RR41 = TDTS81 * RR51

RR151 = CU1(I1+1,J1+1) + CU1(I1,J1+1)

RR161 = CU1(I1+II11,J1+1) + CU1(I1+II31,J1+1)

RR81 = CU1(I1+II21,J1+1) + CU1(I1+II11,J1+1)

RR171 = RR151 + CU1(I1,J1)

RR181 = RR161 + CU1(I1+II31,J1)

RR71 = RR81 + CU1(I1+II11,J1)

RR191 = RR171 + CU1(I1+1,J1)

RR201 = RR181 + CU1(I1+II11,J1)

RR61 = RR71 + CU1(I1+II21,J1)

RR211 = RR131 * RR191

RR221 = RR141 * RR201

RR31 = RR41 * RR61

RR231 = VOLD1(I1,J1+1) - RR211

RR241 = VOLD1(I1+II31,J1+1) - RR221

RR29 = VOLD1(I1+II11,J1+1) - RR31

RR251 = H1(I1,J1+1) - H1(I1,J1)

RR261 = H1(I1+II31,J1+1) - H1(I1+II31,J1)

RR101 = H1(I1+II11,J1+1) - H1(I1+II11,J1)

RR271 = TDTSDY1 * RR251

RR281 = TDTSDY1 * RR261

RR91 = TDTSDY1 * RR101

VNEW1(I1,J1+1) = RR231 - RR271

VNEW1(I1+II31,J1+1) = RR241 - RR281

VNEW1(I1+II11,J1+1) = RR29 - RR91

RR131 = CU1(I1+1,J1) - CU1(I1,J1)

RR141 = CU1(I1+II11,J1) - CU1(I1+II31,J1)

RR41 = CU1(I1+II21,J1) - CU1(I1+II11,J1)

RR211 = TDTSDX1 * RR131

RR221 = TDTSDX1 * RR141

RR31 = TDTSDX1 * RR41

RR231 = POLD1(I1,J1) - RR211

RR241 = POLD1(I1+II31,J1) - RR221

RR29 = POLD1(I1+II11,J1) - RR31

RR191 = CV1(I1,J1+1) - CV1(I1,J1)

RR201 = CV1(I1+II31,J1+1) - CV1(I1+II31,J1)

RR61 = CV1(I1+II11,J1+1) - CV1(I1+II11,J1)

 

 

KAP has unrolled the loop by 3x; note the three stores to UNEW1:

 

UNEW1(I1+1,J1) = RR231 - RR271

UNEW1(I1+II11,J1) = RR241 - RR281

UNEW1(I1+II21,J1) = RR29 - RR91

 

In this routine the constants are defined as II31=1, II11=2, and II21=3:

 

 

% grep -n II11 swim.cmp.f | grep PARA

1064: PARAMETER (II11 = 2)

1459: PARAMETER (II11 = 2)

2038: PARAMETER (II11 = 3)

% grep -n II21 swim.cmp.f | grep PARA

1055: PARAMETER (II21 = 3)

1448: PARAMETER (II21 = 3)

2150: PARAMETER (II21 = 4)

% grep -n II31 swim.cmp.f | grep PARA

1072: PARAMETER (II31 = 1)

1467: PARAMETER (II31 = 1)

2054: PARAMETER (II31 = 1)

 

EV5 disassembly

% cd ../e5

% cat pkcalc2__.dis

Cycle=cycles

BMis=bcache_miss

pkcalc2__:

file line addr Instr Cycle BMis

...

swim 985 12000b8d8 lds $f13,2052(r18) 7251 4

swim 985 12000b8dc lds $f12,2056(r18) 66 1

swim 988 12000b8e0 lds $f15,2052(r6) 3914

swim 988 12000b8e4 lds $f14,2056(r6) 60

swim 984 12000b8e8 lds $f11,0(r19) 3687

swim 983 12000b8ec addl r16,1,r16

swim 984 12000b8f0 lds $f18,-2052(r19) 3590

swim 983 12000b8f4 cmple r16,r1,r24

swim 983 12000b8f8 lda r23,4(r23) 3477

swim 983 12000b8fc lda r26,4(r26)

swim 983 12000b900 lda r19,4(r19) 3765

swim 983 12000b904 lda r22,4(r22)

swim 983 12000b908 lda r8,4(r8) 3634

swim 983 12000b90c lda r6,4(r6)

swim 983 12000b910 lda r18,4(r18) 3965

swim 983 12000b914 lda r17,4(r17)

swim 987 12000b918 lds $f19,-8(r19) 7330 32

swim 987 12000b91c lda r21,4(r21)

swim 985 12000b920 adds $f12,$f13,$f12 68442 1205

swim 983 12000b924 lda r27,4(r27)

swim 988 12000b928 lds $f16,-4(r6) 209

swim 988 12000b92c adds $f14,$f15,$f14 25107 277

swim 988 12000b930 lds $f17,0(r6) 3983

swim 984 12000b934 adds $f11,$f18,$f18 29849 725

swim 985 12000b938 lds $f22,0(r18)

swim 986 12000b93c lds $f24,-4(r22) 3566 2

swim 987 12000b940 adds $f11,$f19,$f11 13322 22

swim 984 12000b944 muls $f0,$f18,$f18 6427

swim 986 12000b948 lds $f25,0(r22)

swim 988 12000b94c adds $f14,$f16,$f14 12049 88

swim 985 12000b950 lds $f20,-4(r18) 3455

swim 990 12000b954 subs $f17,$f16,$f21 5842 38

swim 987 12000b958 muls $f0,$f11,$f11 542

swim 989 12000b95c lds $f26,2048(r22) 133 2

swim 988 12000b960 adds $f14,$f17,$f14 7660

swim 990 12000b964 lds $f23,-4(r8) 112

swim 990 12000b968 muls $f1,$f21,$f21 6373 1

swim 984 12000b96c lds $f27,-4(r23) 4842 72

swim 986 12000b970 subs $f25,$f24,$f25 21816 171

swim 987 12000b974 muls $f11,$f14,$f11 1165

swim 987 12000b978 lds $f28,-4(r26) 209

swim 985 12000b97c adds $f12,$f20,$f12 7693 19

swim 991 12000b980 subs $f13,$f20,$f13 3743

swim 989 12000b984 subs $f26,$f24,$f24 17159 244

swim 986 12000b988 muls $f1,$f25,$f25 411

swim 985 12000b98c adds $f12,$f22,$f12 6643 2

swim 991 12000b990 muls $f10,$f13,$f13 4196

swim 990 12000b994 subs $f23,$f21,$f21 18804 510

swim 989 12000b998 muls $f10,$f24,$f24 3869

swim 984 12000b99c muls $f18,$f12,$f12 6513 1

swim 987 12000b9a0 subs $f28,$f11,$f11 30419 743

swim 990 12000b9a4 subs $f21,$f13,$f13 3740

swim 984 12000b9a8 adds $f27,$f12,$f12 29793 621

swim 987 12000b9ac subs $f11,$f24,$f11 4043

swim 990 12000b9b0 sts $f13,-4(r17) 3368 2

swim 984 12000b9b4 subs $f12,$f25,$f12 6793

swim 987 12000b9b8 sts $f11,-4(r27) 4462 1

swim 984 12000b9bc sts $f12,-4(r21) 11045 3

swim 983 12000b9c0 bne r24,12000b8d8 3686
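Annotated listings like this one are easy to rank mechanically. Below is a minimal Python sketch that parses rows in the format above and treats missing event columns as zero. It reports the hottest instructions and the cycle totals per opcode. The sample data is a subset of the EV5 listing, and parse_dis is a hypothetical helper that is not part of harness.pl. In this subset the largest counts fall on adds and subs instructions rather than on the loads that feed them. The sketch only reports that pattern; it does not explain it.

```python
from collections import Counter

# A few rows from the annotated EV5 pkcalc2__.dis above:
#   file line addr instr operands [cycles [bcache_miss]]
EV5_DIS = """\
Cycle=cycles
BMis=bcache_miss
pkcalc2__:
file line addr Instr Cycle BMis
swim 984 12000b8e8 lds $f11,0(r19) 3687
swim 983 12000b8ec addl r16,1,r16
swim 985 12000b920 adds $f12,$f13,$f12 68442 1205
swim 988 12000b92c adds $f14,$f15,$f14 25107 277
swim 984 12000b934 adds $f11,$f18,$f18 29849 725
swim 987 12000b9a0 subs $f28,$f11,$f11 30419 743
swim 984 12000b9bc sts $f12,-4(r21) 11045 3
swim 983 12000b9c0 bne r24,12000b8d8 3686
"""


def parse_dis(text):
    """Parse annotated instruction rows; missing event columns count as 0."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) < 5 or not f[1].isdigit():
            continue  # header lines, routine label, "..." elisions
        cycles, bmiss = ([int(x) for x in f[5:7]] + [0, 0])[:2]
        rows.append({"line": int(f[1]), "addr": f[2], "op": f[3],
                     "args": f[4], "cycles": cycles, "bmiss": bmiss})
    return rows


rows = parse_dis(EV5_DIS)
for r in sorted(rows, key=lambda r: r["cycles"], reverse=True)[:3]:
    print(r["addr"], r["op"], r["args"], r["cycles"], r["bmiss"])

cycles_by_op = Counter()
for r in rows:
    cycles_by_op[r["op"]] += r["cycles"]
print(cycles_by_op.most_common())
```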

 

EV4 Disassembly

 

% cd ../e4

% cat pkcalc2__.dis

Cycle=cycles

BMi=bcache_miss

pkcalc2__:

file line addr Instr Cycle BMi

...

swim 1565 12000c4b4 lds $f12,2052(r24) 3682 2

swim 1566 12000c4b8 lds $f22,2060(r24) 69

swim 1565 12000c4bc lds $f11,2056(r24) 1199

swim 1567 12000c4c0 lds $f24,2064(r24) 1250

swim 1559 12000c4c4 lds $f14,0(r26) 106 1

swim 1561 12000c4c8 lds $f20,8(r26) 1319 3

swim 1559 12000c4cc lds $f15,-2052(r26) 1135

swim 1568 12000c4d0 lds $f16,0(r24) 4086 22

swim 1558 12000c4d4 addl r17,3,r17

swim 1570 12000c4d8 lds $f25,8(r24) 3359 29

swim 1558 12000c4dc cmple r17,r16,r19

swim 1569 12000c4e0 lds $f19,4(r24) 9068 133

swim 1558 12000c4e4 lda r21,12(r21)

swim 1558 12000c4e8 lda r22,12(r22) 1287 1

swim 1558 12000c4ec lda r26,12(r26)

swim 1573 12000c4f0 lds $f30,12(r24) 9076 161

swim 1558 12000c4f4 lda r27,12(r27)

swim 1595 12000c4f8 lds $f27,2052(r6) 3332 6

swim 1558 12000c4fc lda r8,12(r8)

swim 1596 12000c500 lds $f3,2060(r6) 22240 396

swim 1565 12000c504 adds $f11,$f12,$f13 8578 121

swim 1558 12000c508 lda r23,12(r23)

swim 1595 12000c50c lds $f26,2056(r6) 9675 185

swim 1597 12000c510 lds $f4,2064(r6) 3208 14

swim 1559 12000c514 adds $f14,$f15,$f15 31975 694

swim 1558 12000c518 lda r6,12(r6)

swim 1560 12000c51c lds $f17,-8(r26) 1823 3

swim 1560 12000c520 lds $f18,-2060(r26) 1909

swim 1568 12000c524 adds $f13,$f16,$f13 1892 31

swim 1558 12000c528 lda r0,12(r0)

swim 1567 12000c52c adds $f24,$f22,$f24 3483 76

swim 1561 12000c530 lds $f21,-2056(r26) 1420

swim 1562 12000c534 muls $f0,$f15,$f15

swim 1566 12000c538 adds $f22,$f11,$f23

swim 1558 12000c53c lda r24,12(r24)

swim 1628 12000c540 subs $f12,$f16,$f12 1185

swim 1558 12000c544 lda r18,12(r18)

swim 1589 12000c548 lds $f28,-16(r26) 2198 1

swim 1571 12000c54c adds $f13,$f19,$f13 2176 2

swim 1570 12000c550 adds $f24,$f25,$f24 1311 1

swim 1598 12000c554 lds $f29,-12(r6) 2437 49

swim 1595 12000c558 adds $f26,$f27,$f27 6924 42

swim 1569 12000c55c adds $f23,$f19,$f23 1291 1

swim 1600 12000c560 lds $f5,-4(r6) 2259 11

swim 1574 12000c564 muls $f15,$f13,$f13

swim 1596 12000c568 adds $f3,$f26,$f26 3478 93

swim 1599 12000c56c lds $f15,-8(r6) 911

swim 1579 12000c570 lds $f6,-4(r21) 3204 24

swim 1560 12000c574 adds $f17,$f18,$f18 7132 55

swim 1631 12000c578 muls $f10,$f12,$f12

swim 1561 12000c57c adds $f20,$f21,$f21 5265 79

swim 1573 12000c580 adds $f24,$f30,$f24 1211

swim 1577 12000c584 lds $f30,-12(r21)

swim 1589 12000c588 adds $f14,$f28,$f28 1859 1

swim 1563 12000c58c muls $f0,$f18,$f18 508

swim 1598 12000c590 adds $f27,$f29,$f27 7143 89

swim 1564 12000c594 muls $f0,$f21,$f21 154

swim 1572 12000c598 adds $f23,$f25,$f23 1213

swim 1592 12000c59c muls $f0,$f28,$f28 1176

swim 1590 12000c5a0 adds $f17,$f14,$f14 1206

swim 1601 12000c5a4 adds $f27,$f15,$f27 3280 1

swim 1576 12000c5a8 muls $f21,$f24,$f21

swim 1580 12000c5ac lds $f24,-8(r8) 21

swim 1591 12000c5b0 adds $f20,$f17,$f17 1200 1

swim 1603 12000c5b4 lds $f20,0(r6)

swim 1575 12000c5b8 muls $f18,$f23,$f18

swim 1578 12000c5bc lds $f23,-8(r21)

swim 1597 12000c5c0 adds $f4,$f3,$f3 3571 76

swim 1582 12000c5c4 lds $f4,0(r8)

swim 1577 12000c5c8 adds $f30,$f13,$f13 8931 167

swim 1581 12000c5cc lds $f30,-4(r8)

swim 1604 12000c5d0 muls $f28,$f27,$f27 1296

swim 1580 12000c5d4 lds $f28,-12(r8)

swim 1599 12000c5d8 adds $f26,$f15,$f26

swim 1593 12000c5dc muls $f0,$f14,$f14 1210

swim 1600 12000c5e0 adds $f3,$f5,$f3 1339

swim 1594 12000c5e4 muls $f0,$f17,$f17

swim 1579 12000c5e8 adds $f6,$f21,$f6 10550 262

swim 1607 12000c5ec lds $f21,-12(r27) 64 1

swim 1602 12000c5f0 adds $f26,$f5,$f26 1297

swim 1619 12000c5f4 subs $f15,$f29,$f29 1178

swim 1603 12000c5f8 adds $f3,$f20,$f3 3984 39

swim 1578 12000c5fc adds $f23,$f18,$f18 8405 28

swim 1582 12000c600 subs $f4,$f30,$f4 6124 21

swim 1605 12000c604 muls $f14,$f26,$f14

swim 1610 12000c608 lds $f26,2040(r8) 38

swim 1580 12000c60c subs $f24,$f28,$f23 1928 8

swim 1606 12000c610 muls $f17,$f3,$f3 1251

swim 1608 12000c614 lds $f17,-8(r27) 26

swim 1581 12000c618 subs $f30,$f24,$f2

swim 1620 12000c61c subs $f5,$f15,$f15 1221

swim 1585 12000c620 muls $f1,$f4,$f4 1275

swim 1607 12000c624 subs $f21,$f27,$f21 7191 156

swim 1626 12000c628 lds $f27,-8(r0) 109 2

swim 1583 12000c62c muls $f1,$f23,$f23 1189

swim 1584 12000c630 muls $f1,$f2,$f2 1247

swim 1621 12000c634 subs $f20,$f5,$f5

swim 1622 12000c638 muls $f1,$f29,$f29 1147

swim 1629 12000c63c subs $f11,$f19,$f11

swim 1588 12000c640 subs $f6,$f4,$f4 1236

swim 1612 12000c644 lds $f6,2048(r8) 101 5

swim 1623 12000c648 muls $f1,$f15,$f15

swim 1586 12000c64c subs $f13,$f23,$f13 1154

swim 1611 12000c650 lds $f23,2044(r8) 1202

swim 1587 12000c654 subs $f18,$f2,$f2

swim 1609 12000c658 lds $f18,-4(r27) 38

swim 1624 12000c65c muls $f1,$f5,$f5

swim 1610 12000c660 subs $f26,$f28,$f26 11299 267

swim 1625 12000c664 lds $f28,-12(r0)

swim 1632 12000c668 muls $f10,$f11,$f11

swim 1608 12000c66c subs $f17,$f14,$f14 9522 165

swim 1627 12000c670 lds $f17,-4(r0) 1212

swim 1630 12000c674 subs $f22,$f25,$f22

swim 1626 12000c678 subs $f27,$f15,$f15 20211 424

swim 1586 12000c67c sts $f13,-12(r22) 101 2

swim 1613 12000c680 muls $f10,$f26,$f26 1169

swim 1587 12000c684 sts $f2,-8(r22) 53 3

swim 1588 12000c688 sts $f4,-4(r22) 1316 2

swim 1612 12000c68c subs $f6,$f30,$f6 8324 185

swim 1633 12000c690 muls $f10,$f22,$f22 1147

swim 1635 12000c694 subs $f15,$f11,$f11 1089

swim 1611 12000c698 subs $f23,$f24,$f23 5294

swim 1609 12000c69c subs $f18,$f3,$f3 6983 139

swim 1615 12000c6a0 muls $f10,$f6,$f6 1253

swim 1625 12000c6a4 subs $f28,$f29,$f28 3665

swim 1627 12000c6a8 subs $f17,$f5,$f5 10094 173

swim 1635 12000c6ac sts $f11,-8(r18) 88

swim 1614 12000c6b0 muls $f10,$f23,$f23 1352

swim 1616 12000c6b4 subs $f21,$f26,$f21

swim 1618 12000c6b8 subs $f3,$f6,$f3 1542

swim 1634 12000c6bc subs $f28,$f12,$f12 1149

swim 1617 12000c6c0 subs $f14,$f23,$f14 1996 1

swim 1616 12000c6c4 sts $f21,-12(r23) 124 4

swim 1636 12000c6c8 subs $f5,$f22,$f5 1188

swim 1618 12000c6cc sts $f3,-4(r23) 524 1

swim 1634 12000c6d0 sts $f12,-12(r18) 1248

swim 1617 12000c6d4 sts $f14,-8(r23) 2732 19

swim 1636 12000c6d8 sts $f5,-4(r18) 1810 14

swim 1558 12000c6dc bne r19,12000c4b4

 

 

Count the stores

 

% cd e4

% grep sts pkcalc2__.dis | wc -l

12

% cd ../e5

% grep sts pkcalc2__.dis | wc -l

3

 

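The EV4 version is unrolled by 3x, plus a cleanup loop. Nine of its 12 stores are in the unrolled loop shown above (three arrays, three elements each), and the other 3 are presumably in the cleanup loop. All 3 EV5 stores are in its one-element-per-trip loop.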
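To make the shape of the EV4 code concrete, here is a simplified one-dimensional Python analogue of the PNEW1 update. It is a hypothetical illustration only: the real loop is nested over J1, updates three arrays, and was rewritten by KAP at the Fortran level. The unrolled version handles three elements per trip and uses a cleanup loop for the n % 3 leftovers. Both versions produce identical results. In Python this buys no speed; the point is the loop structure.

```python
def pnew_rolled(pold, cu, c):
    """One element per trip, like the loop KAP left alone under -tune=ev5."""
    return [pold[i] - c * (cu[i + 1] - cu[i]) for i in range(len(pold))]


def pnew_unrolled3(pold, cu, c):
    """Same result, unrolled by 3 with a cleanup loop, like the -tune=ev4 code."""
    n = len(pold)
    out = [0.0] * n
    i = 0
    while i + 3 <= n:  # main loop: 3 elements (3 stores) per trip
        out[i] = pold[i] - c * (cu[i + 1] - cu[i])
        out[i + 1] = pold[i + 1] - c * (cu[i + 2] - cu[i + 1])
        out[i + 2] = pold[i + 2] - c * (cu[i + 3] - cu[i + 2])
        i += 3
    while i < n:  # cleanup loop: the n % 3 leftover elements
        out[i] = pold[i] - c * (cu[i + 1] - cu[i])
        i += 1
    return out


# Check every remainder case (n % 3 == 0, 1, 2) against the rolled loop.
for n in range(8):
    pold = [float(k) for k in range(n)]
    cu = [0.5 * k * k for k in range(n + 1)]
    assert pnew_unrolled3(pold, cu, 0.25) == pnew_rolled(pold, cu, 0.25)
print("unrolled and rolled loops agree")
```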

How about other events?

 

% cd ..

% cat e4/bcache_miss.hot_routines

Hot Routines for bcache_miss -pthresh 1

Events % Routine Image Addr

4536 39 pkcalc2__ a.out 12000C350:12000C8AF

3513 30 pkcalc3__ a.out 12000D430:12000D8AF

2883 25 pkcalc1__ a.out 12000B690:12000BAFF

155 1 calc3_ a.out 12000CDD0:12000D42F

% cat e5/bcache_miss.hot_routines

Hot Routines for bcache_miss -pthresh 1

Events % Routine Image Addr

4787 37 pkcalc2__ a.out 12000B790:12000B9DF

4640 35 pkcalc3__ a.out 12000C510:12000C93F

2778 21 pkcalc1__ a.out 12000B040:12000B22F

215 2 calc3_ a.out 12000BF00:12000C50F

 

% cat >run_sca.csh

set verbose

unlimit

set notify

setenv PARALLEL 4

iprobe -quiet -method sample scache_miss &

time ./a.out <swim.in >swim.out

kill %iprobe

unset verbose

 

% csh

% cd e4

% source ../run_sca.csh

unlimit

set notify

setenv PARALLEL 4

iprobe -quiet -method sample scache_miss &

[1] 3701

time ./a.out < swim.in > swim.out

Start of sampling

192.78u 0.23s 0:48 398% 0+153k 0+6io 0pf+0w

kill %iprobe

unset verbose

[1] Terminated iprobe -quiet -method sample scache_miss

 

% cd ../e5

% source ../run_sca.csh

unlimit

set notify

setenv PARALLEL 4

iprobe -quiet -method sample scache_miss &

[1] 3702

time ./a.out < swim.in > swim.out

Start of sampling

227.14u 0.18s 0:56 399% 0+153k 0+6io 0pf+0w

kill %iprobe

unset verbose

[1] Terminated iprobe -quiet -method sample scache_miss
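With sampling turned on, the timings point the same way as the first runs: about 193 user seconds for -tune=ev4 versus about 227 for -tune=ev5. A small parser for the csh time line makes the comparison explicit. It assumes the default csh layout, in which the first four fields are user seconds, system seconds, elapsed [h:]m:ss, and percent CPU. parse_csh_time is a hypothetical helper.

```python
import re

# Assumed csh default `time` layout; only the first four fields are used:
#   <user>u <sys>s <elapsed [h:]m:ss> <cpu>% ...
_CSH_TIME = re.compile(
    r"(?P<user>[\d.]+)u\s+(?P<sys>[\d.]+)s\s+(?P<elapsed>[\d:.]+)\s+(?P<cpu>\d+)%")


def parse_csh_time(line):
    """Return user, sys, elapsed seconds and CPU percent from a csh time line."""
    m = _CSH_TIME.match(line.strip())
    if m is None:
        raise ValueError(f"not a csh time line: {line!r}")
    elapsed = 0.0
    for part in m.group("elapsed").split(":"):  # "0:48" -> 48.0
        elapsed = elapsed * 60 + float(part)
    return {"user": float(m.group("user")), "sys": float(m.group("sys")),
            "elapsed": elapsed, "cpu_pct": int(m.group("cpu"))}


ev4 = parse_csh_time("192.78u 0.23s 0:48 398% 0+153k 0+6io 0pf+0w")
ev5 = parse_csh_time("227.14u 0.18s 0:56 399% 0+153k 0+6io 0pf+0w")
print(f"user time,    ev5/ev4: {ev5['user'] / ev4['user']:.3f}")        # 1.178
print(f"elapsed time, ev5/ev4: {ev5['elapsed'] / ev4['elapsed']:.3f}")  # 1.167
```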

 

The Dreaded User Signal 1 Floorboard

 

% ../harness.pl -d pcsample.dat -e scache_miss

User defined signal 1

% iprobe

Node name : gemosf.zko.dec.com

OS : OSF1 T4.0-738.5

CPU count : 4

Model : Unknown

Memory size : 381 MB

Counter count : 3

cycles : Low frequency

Current time : Tue Jun 17 07:17:04 1997

Start time: : immediate

Duration : 0 (until user interrupts)

Interval : 1

Method : count

Measured Modes : all modes

Measured Data : pid ctr ps pc

Buffer_count : 3

Buffer_size : 8192

time cpu freq event # events evts/sec

07:17:04 0 2^16 cycles 399966208 399966208

07:17:04 1 2^16 cycles 399966208 399966208

07:17:04 2 2^16 cycles 399966208 399966208

07:17:04 3 2^16 cycles 399769600 399769600

07:17:05 0 2^16 cycles 399572992 399572992

07:17:05 1 2^16 cycles 399507456 399507456

07:17:05 2 2^16 cycles 399507456 399507456

07:17:05 3 2^16 cycles 399572992 399572992

07:17:06 0 2^16 cycles 2818048 2818048

07:17:06 1 2^16 cycles 2883584 2883584

07:17:06 2 2^16 cycles 2883584 2883584

07:17:06 3 2^16 cycles 2818048 2818048

Total event count:

07:17:06 0 2^16 cycles 802357248

07:17:06 1 2^16 cycles 802357248

07:17:06 2 2^16 cycles 802357248

07:17:06 3 2^16 cycles 802160640
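Run with no arguments, iprobe counts cycles on every CPU, reporting once per second until interrupted (Method: count, Interval: 1, Duration: until user interrupts). Two quick checks can be made on this output. First, every count is an exact multiple of 2^16 = 65,536. That is consistent with reading the freq column as "accumulated in 2^16-event units", but this reading is an assumption; the output does not say so. Second, a full one-second interval records about 4.0e8 cycles per CPU, which suggests a clock of roughly 400 MHz. The 07:17:06 interval is short only because it was cut off by the interrupt. The sketch below redoes that arithmetic, and parse_rows is a hypothetical helper.

```python
from collections import defaultdict

GRANULE = 2 ** 16  # the "freq" column printed by iprobe

# Per-interval rows copied from the iprobe output above:
# time  cpu  freq  event  #events  evts/sec
IPROBE_ROWS = """\
07:17:04 0 2^16 cycles 399966208 399966208
07:17:04 1 2^16 cycles 399966208 399966208
07:17:04 2 2^16 cycles 399966208 399966208
07:17:04 3 2^16 cycles 399769600 399769600
07:17:05 0 2^16 cycles 399572992 399572992
07:17:05 1 2^16 cycles 399507456 399507456
07:17:05 2 2^16 cycles 399507456 399507456
07:17:05 3 2^16 cycles 399572992 399572992
07:17:06 0 2^16 cycles 2818048 2818048
07:17:06 1 2^16 cycles 2883584 2883584
07:17:06 2 2^16 cycles 2883584 2883584
07:17:06 3 2^16 cycles 2818048 2818048
"""


def parse_rows(text):
    """Return (time, cpu, events) tuples for each per-interval cycles row."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) == 6 and f[3] == "cycles":
            rows.append((f[0], int(f[1]), int(f[4])))
    return rows


rows = parse_rows(IPROBE_ROWS)
totals = defaultdict(int)
for _, cpu, events in rows:
    totals[cpu] += events

# Every count is a whole number of 2^16-event units.
assert all(events % GRANULE == 0 for _, _, events in rows)
print(dict(totals))  # {0: 802357248, 1: 802357248, 2: 802357248, 3: 802160640}

# 07:17:05 is a full 1-second interval, so cycles/second approximates the clock.
full = [events for t, _, events in rows if t == "07:17:05"]
print(f"~{max(full) / 1e6:.0f} MHz")  # ~400 MHz
```

The per-CPU sums match the "Total event count" lines that iprobe prints.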

 

Reduce the scache miss events

 

% ../harness.pl -d pcsample.dat -e scache_miss

Os=unix

 

Generating top-level report for scache_miss

ipreduce -input_file pcsample.dat -output_file scache_miss.rpt -event scache_miss -pthresh 1

 

ipreduce -o scache_miss_pkcalc3__.rpd -d pc -event scache_miss -input_file pcsample.dat -pc 12000C510:12000C93F

Annotating pkcalc3__

 

ipreduce -o scache_miss_pkcalc2__.rpd -d pc -event scache_miss -input_file pcsample.dat -pc 12000B790:12000B9DF

Annotating pkcalc2__

 

ipreduce -o scache_miss_pkcalc1__.rpd -d pc -event scache_miss -input_file pcsample.dat -pc 12000B040:12000B22F

Annotating pkcalc1__

 

ipreduce -o scache_miss_hardclock.rpd -d pc -event scache_miss -input_file pcsample.dat -pc FFFFFC0000253490:FFFFFC00002540EF

dis -h -p hardclock /vmunix > hardclock.dis_tmp

Annotating hardclock

 

ipreduce -o scache_miss_spin_wait_join_barrier.rpd -d pc -event scache_miss -input_file pcsample.dat -pc 1200B5118:1200B525F

dis -h -p spin_wait_join_barrier a.out > spin_wait_join_barrier.dis_tmp

Annotating spin_wait_join_barrier

% cd ../e4

% ../harness.pl -d pcsample.dat -e scache_miss

Os=unix

 

Generating top-level report for scache_miss

ipreduce -input_file pcsample.dat -output_file scache_miss.rpt -event scache_miss -pthresh 1

 

ipreduce -o scache_miss_pkcalc3__.rpd -d pc -event scache_miss -input_file pcsample.dat -pc 12000D430:12000D8AF

Annotating pkcalc3__

 

ipreduce -o scache_miss_pkcalc2__.rpd -d pc -event scache_miss -input_file pcsample.dat -pc 12000C350:12000C8AF

Annotating pkcalc2__

 

ipreduce -o scache_miss_pkcalc1__.rpd -d pc -event scache_miss -input_file pcsample.dat -pc 12000B690:12000BAFF

Annotating pkcalc1__

 

ipreduce -o scache_miss_hardclock.rpd -d pc -event scache_miss -input_file pcsample.dat -pc FFFFFC0000253490:FFFFFC00002540EF

dis -h -p hardclock /vmunix > hardclock.dis_tmp

Annotating hardclock

 

ipreduce -o scache_miss_spin_wait_join_barrier.rpd -d pc -event scache_miss -input_file pcsample.dat -pc 1200B6088:1200B61CF

dis -h -p spin_wait_join_barrier a.out > spin_wait_join_barrier.dis_tmp

Annotating spin_wait_join_barrier
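The harness output follows a regular pattern. There is one ipreduce call for the top-level report, then one per hot routine, in hot_routines order, each using that routine's address range from the report. The sketch below rebuilds those command lines, using only the flags printed in the transcript. It is a reconstruction of what the printed commands look like, not harness.pl's actual code. It also leaves out the dis step that the harness runs for routines such as hardclock. top_level_command and routine_commands are hypothetical names.

```python
def top_level_command(event, data_file="pcsample.dat", pthresh=1):
    """Top-level report, as printed under 'Generating top-level report'."""
    return ["ipreduce", "-input_file", data_file,
            "-output_file", f"{event}.rpt",
            "-event", event, "-pthresh", str(pthresh)]


def routine_commands(event, hot_routines, data_file="pcsample.dat"):
    """One per-PC reduction per hot routine, in hot_routines order.

    hot_routines: (routine, addr_range) pairs, e.g.
        ("pkcalc3__", "12000D430:12000D8AF")
    """
    return [["ipreduce", "-o", f"{event}_{routine}.rpd", "-d", "pc",
             "-event", event, "-input_file", data_file, "-pc", addr_range]
            for routine, addr_range in hot_routines]


e4_hot = [("pkcalc3__", "12000D430:12000D8AF"),
          ("pkcalc2__", "12000C350:12000C8AF"),
          ("pkcalc1__", "12000B690:12000BAFF")]
for cmd in [top_level_command("scache_miss")] + routine_commands("scache_miss", e4_hot):
    print(" ".join(cmd))
```

Because each command is built as an argument list rather than a string, it can be passed straight to subprocess.run without shell quoting.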

 

 

% cd ..

% cat e4/scache_miss.hot_routines

Hot Routines for scache_miss -pthresh 1

Events % Routine Image Addr

17489 36 pkcalc3__ a.out 12000D430:12000D8AF

16835 35 pkcalc2__ a.out 12000C350:12000C8AF

11260 23 pkcalc1__ a.out 12000B690:12000BAFF

668 1 hardclock /vmunix FFFFFC0000253490:FFFFFC00002540EF

546 1 spin_wait_join_barrier a.out 1200B6088:1200B61CF

% !!:s/e4/e5/

% cat e5/scache_miss.hot_routines

Hot Routines for scache_miss -pthresh 1

Events % Routine Image Addr

19570 37 pkcalc3__ a.out 12000C510:12000C93F

18734 35 pkcalc2__ a.out 12000B790:12000B9DF

11570 22 pkcalc1__ a.out 12000B040:12000B22F

697 1 hardclock /vmunix FFFFFC0000253490:FFFFFC00002540EF

549 1 spin_wait_join_barrier a.out 1200B5118:1200B525F
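The two reports can be joined by routine name (the addresses differ between the two binaries) to see where the extra scache misses fall. Here is a sketch, assuming the five-column Events / % / Routine / Image / Addr layout shown above. parse_hot_routines and compare are hypothetical helpers.

```python
def parse_hot_routines(text):
    """Return {routine: events} from a <event>.hot_routines report.

    Assumed row layout: Events % Routine Image Addr
    """
    events = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) == 5 and f[0].isdigit() and f[1].isdigit():
            events[f[2]] = int(f[0])
    return events


def compare(base, other):
    """(routine, base, other, other/base) for routines present in both reports."""
    return [(name, base[name], other[name], other[name] / base[name])
            for name in base if name in other]


E4 = """\
Hot Routines for scache_miss -pthresh 1
Events % Routine Image Addr
17489 36 pkcalc3__ a.out 12000D430:12000D8AF
16835 35 pkcalc2__ a.out 12000C350:12000C8AF
11260 23 pkcalc1__ a.out 12000B690:12000BAFF
668 1 hardclock /vmunix FFFFFC0000253490:FFFFFC00002540EF
546 1 spin_wait_join_barrier a.out 1200B6088:1200B61CF
"""

E5 = """\
Hot Routines for scache_miss -pthresh 1
Events % Routine Image Addr
19570 37 pkcalc3__ a.out 12000C510:12000C93F
18734 35 pkcalc2__ a.out 12000B790:12000B9DF
11570 22 pkcalc1__ a.out 12000B040:12000B22F
697 1 hardclock /vmunix FFFFFC0000253490:FFFFFC00002540EF
549 1 spin_wait_join_barrier a.out 1200B5118:1200B525F
"""

for name, n4, n5, ratio in compare(parse_hot_routines(E4), parse_hot_routines(E5)):
    print(f"{name:24s} {n4:6d} {n5:6d} {ratio:6.3f}")
```

For these two runs the ev5/ev4 ratio comes out at roughly 1.12 for pkcalc3__, 1.11 for pkcalc2__ and 1.03 for pkcalc1__.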

 

% cd e5

% cat pkcalc2__.dis

Cycle=cycles

BMis=bcache_miss

SMis=scache_miss

pkcalc2__:

file line addr Instr Cycle BMis SMis

...

swim 985 12000b8d8 lds $f13,2052(r18) 7251 4 32

swim 985 12000b8dc lds $f12,2056(r18) 66 1 8

swim 988 12000b8e0 lds $f15,2052(r6) 3914 17

swim 988 12000b8e4 lds $f14,2056(r6) 60 2

swim 984 12000b8e8 lds $f11,0(r19) 3687 215

swim 983 12000b8ec addl r16,1,r16

swim 984 12000b8f0 lds $f18,-2052(r19) 3590 14

swim 983 12000b8f4 cmple r16,r1,r24

swim 983 12000b8f8 lda r23,4(r23) 3477 12

swim 983 12000b8fc lda r26,4(r26)

swim 983 12000b900 lda r19,4(r19) 3765 951

swim 983 12000b904 lda r22,4(r22)

swim 983 12000b908 lda r8,4(r8) 3634 120

swim 983 12000b90c lda r6,4(r6)

swim 983 12000b910 lda r18,4(r18) 3965 163

swim 983 12000b914 lda r17,4(r17)

swim 987 12000b918 lds $f19,-8(r19) 7330 32 926

swim 987 12000b91c lda r21,4(r21)

swim 985 12000b920 adds $f12,$f13,$f12 68442 1205 2109

swim 983 12000b924 lda r27,4(r27)

 

script done on Tue Jun 17 07:19:18 1997
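The pkcalc2__ rows above can also be ranked by their Cycle column. In the portion shown, the adds at 12000b920 collects far more cycle samples than any other row. That instruction takes its operands from the lds instructions at 12000b8d8 and 12000b8dc. The sketch below assumes the first number after the operands is always the Cycle value. Because the listing lost its column alignment, a row with only two counts does not say whether the second belongs to BMis or SMis, so the sketch uses only the first. hottest is a hypothetical helper.

```python
def hottest(dis_text, n=3):
    """Return the n rows with the largest Cycle count as (cycles, addr, opcode, operands).

    Assumes rows look like: file line addr opcode operands Cycle [more counts]
    and that the first number after the operands is the Cycle column.
    Rows without any counts are skipped.
    """
    rows = []
    for line in dis_text.splitlines():
        f = line.split()
        if len(f) < 6:
            continue  # headers, "...", or instruction rows with no samples
        try:
            int(f[2], 16)        # instruction rows carry a hex address
            cycles = int(f[5])   # assumed to be the Cycle column
        except ValueError:
            continue
        rows.append((cycles, f[2], f[3], f[4]))
    rows.sort(reverse=True)
    return rows[:n]


LISTING = """\
file line addr Instr Cycle BMis SMis
swim 985 12000b8d8 lds $f13,2052(r18) 7251 4 32
swim 985 12000b8dc lds $f12,2056(r18) 66 1 8
swim 983 12000b8ec addl r16,1,r16
swim 987 12000b918 lds $f19,-8(r19) 7330 32 926
swim 985 12000b920 adds $f12,$f13,$f12 68442 1205 2109
"""

for row in hottest(LISTING):
    print(row)
```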

 

 

Summary

 

 

 

 

 

 

 

Comments on IPROBE may be sent to goddard@zko.dec.com

 

Comments on the data reduction harness may be sent to henning@zko.dec.com

 

Digital-internal users can find IPROBE via the CSD/PG web site http://sdtad.zko.dec.com/pub/csdpg and the data reduction harness via http://tlg-www.zko.dec.com/~henning

 

External users should contact Greg Tarsa, tarsa@zko.dec.com, to inquire about licensing.