Snapshot #6

Snapshot #6, against -CURRENT of 14 Apr 2005


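| Snapshot date | Against -CURRENT of date | Format               | Download          |
|---------------|--------------------------|----------------------|-------------------|
| 14-Apr-2005   | 14-Apr-2005              | Patch (-p1), gzip'ed | Download (~107KB) |

MD5: 3b59d1ffc391d2bcff212bcf9646082a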

Note: Code equivalent in functionality to this snapshot is in -CURRENT as of 19 Apr 2005. The differences are:


## Snapshot 6 of the hardware PMC support code

I am pleased to announce the next snapshot of the hardware performance
counter support code.

While its stability is much better than before, please keep in mind
that this is still pre-alpha code.  Please test on a scratch box.

## What's new since the previous snapshot

  - Support for Intel P-Pro, Pentium-III, Pentium-II and
    Celeron processors.
  - Improved P4/HTT support.
  - Additions/changes to the pmc(3) API.
  - A Python extension to the pmc(3) library.
  - Many bug fixes, improved documentation.

## What's available

You can now answer the question "what are the hardware events
happening on this system?" on the following CPUs:

  - AMD Athlon64/Opteron
  - AMD Athlon
  - Intel P4 and P4/HTT processors
  - Intel Pentium Pro, Pentium II, Celeron and Pentium III processors

You can write programs that use this functionality in Python and C.

(Support for answering the next question, namely "which portions of
code are related to these events?", is being worked on.)

## Code components

  - A kernel driver pmc(4). [hwpmc(4) in -CURRENT]

  - A userland library ("libpmc", see pmc(3)) to access the driver.

  - Userland utilities to use the driver (pmcstat(8) and

  - Documentation in the form of manual pages.

  - A Python interface (available as a separate download).

## What can it do today?

  - Measure a whole bunch of hardware events.  See the documentation
    for pmc(3).

  - Supported PMC kinds:

    (a) Process-virtual PMCs: these PMCs count hardware events only
        when their target process is scheduled on a CPU,

    (b) System-wide PMCs: these PMCs count hardware events for
        the system as a whole.

  - Supported PMC modes:

    (a) "Counting mode" PMCs: these PMCs only count events, and do not
        sample the instruction pointer.

    "Sampling mode" PMCs are being worked on.

  (Please see the section on "Known bugs and limitations" below).
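
The difference between the two PMC kinds can be illustrated with a toy model. This sketch is not how the driver is implemented: it assumes a single CPU and a made-up scheduling timeline of `(pid, events)` pairs, and the function name `count_events` is hypothetical.

```python
def count_events(timeline, target_pid=None):
    """Sum hardware events over a scheduling timeline.

    `timeline` is a list of (pid, events) pairs: the process that was
    scheduled on the CPU during a time slice, and the number of events
    the hardware saw during that slice.

    With `target_pid` set, this models a process-virtual PMC, which only
    counts while the target is scheduled.  With `target_pid=None`, it
    models a system-wide PMC, which counts every slice.
    """
    return sum(
        events
        for pid, events in timeline
        if target_pid is None or pid == target_pid
    )


# Hypothetical timeline: pids 1884 and 25 alternate on one CPU.
timeline = [(1884, 120), (25, 30), (1884, 80), (25, 5)]

process_virtual = count_events(timeline, target_pid=1884)  # slices of pid 1884 only
system_wide = count_events(timeline)                       # all slices
```

In this model the process-virtual count is always less than or equal to the system-wide count for the same event.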

## Using the code

  - Download the patch.

  - Apply it to a freshly checked out -CURRENT source.

    # cd /usr/src
    # patch -p1 < PATCH-FILE

  - Update 'world'.

  - Add "options PMC_HOOKS" to your kernel config file, recompile
    and reboot the new kernel. [Use "options HWPMC_HOOKS" in

  - Load the new kernel module and start using it.

    # kldload pmc	["kldload hwpmc" in -CURRENT]

    # pmcstat [options] ...

    Or, using Python:

    # python
    ... snip ...
    >>> import pmc
    >>> pmc.initialize()
    >>> p = pmc.X86Pmc('p4-machine-clear') # On Intel P4 machines
    >>> p.attach(TARGET-PID)
    >>> p.start()
    ... etc ...

## Examples

  These examples use the pmcstat(8) tool.

  - Example 1:  Measure the TLB miss behaviour of 'firefox' on an
    AMD Athlon.  Print counts every 1 second.

% ps -ax | grep firefox
... [snip]
 1884  v0  S      0:04.59  /usr/X11R6/lib/firefox/lib/firefox-0.9.3/firefox-bin
... [snip]

    'firefox' is already running so we attach to it using the '-t
    TARGET' option.  The '-w 1' option specifies the desired interval.

% pmcstat -p k7-l1-dtlb-miss-and-l2-dtlb-hits -p k7-l1-and-l2-dtlb-misses \
          -w 1 -t 1884
# p/k7-l1-dtlb-miss-and-l2-dtlb-hits p/k7-l1-and-l2-dtlb-misses
  ... [snip]
                               63415                      13455
                              124529                      35816
                              113868                      28945
                              152241                      39426
                              306551                      78661
                              290212                      61392
                               40013                      11361
                               38530                      11169
                              183136                      47750
                               45264                      12981
                              169459                      37038
                               81363                      19049
                             1306901                     451348
                             1504465                     557414
                              482502                     100939
                              498394                      76948
                              962112                     110082
                             2677131                     245249
                             1258533                     178191
                              812905                     166234
                              144888                      34476
                               89319                      21937
                              330546                      46530
                              282000                      39137
                               85583                      19415
                              330585                      53437
                               37653                      10805
                               37263                      10892
                               48671                      14793
                                1952                       1105
                                   0                          0
  ... [snip]

    Clearly this program can stress the TLB!

  - Example 2: Measure the cycles for which interrupts were masked
    while the ATA driver's interrupt handling thread was executing,
    during a run of the 'diskinfo' command.

    We need to be root to do this:

amd64# ps -ax | grep ata
  25  ??  WL     0:00.25 [irq14: ata0]
  26  ??  WL     0:00.00 [irq15: ata1]
  31  ??  WL     0:00.00 [irq20: atapci0]

    We set up pmcstat(8) to count cycles spent with the processor's IF
    bit cleared while the ata0 thread (pid 25) is executing.

amd64# diskinfo -c ad0 > /dev/null & \
  pmcstat -p k8-fr-interrupts-masked-while-pending-cycles -t 25 -w 1
# p/k8-fr-interrupts-masked-while-pending-cycles
  ... [snip]
   ... [snip]

  - Example 3: Measure the total number of interrupts seen by the
    system while a particular command was executing.  Also count the
    number of cycles the CPU's IF bit was zero when the command was
    scheduled on a CPU.

amd64# pmcstat -p k8-fr-interrupts-masked-while-pending-cycles \
       -s k8-fr-taken-hardware-interrupts -w 1 diskinfo -c ad0 > /dev/null
# p/k8-fr-interrupts-masked-while-pending-cycles s/k8-fr-taken-hardware-interrupts
                                           22887                              1149
                                           88001                              1308
                                           48058                              6406
                                           34986                              7910
                                           47714                              7893
                                           22399                              1961
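
The per-interval counts printed in the examples above can be post-processed with a short script. The sketch below assumes the output layout shown in these examples (a header line beginning with `#`, then whitespace-separated integer columns); the actual pmcstat(8) output may differ, and the function names are illustrative.

```python
def parse_counts(text):
    """Parse pmcstat-style counting output into (column names, integer rows).

    Lines that are neither the '#' header nor purely numeric (for example
    '... [snip]') are ignored.
    """
    names, rows = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#":
            names = fields[1:]
        elif all(f.isdigit() for f in fields):
            rows.append([int(f) for f in fields])
    return names, rows


def column_totals(rows):
    """Return the sum of each column across all rows."""
    return [sum(col) for col in zip(*rows)]


# Output copied from Example 3 above.
SAMPLE = """\
# p/k8-fr-interrupts-masked-while-pending-cycles s/k8-fr-taken-hardware-interrupts
                                           22887                              1149
                                           88001                              1308
                                           48058                              6406
                                           34986                              7910
                                           47714                              7893
                                           22399                              1961
"""

names, rows = parse_counts(SAMPLE)
totals = dict(zip(names, column_totals(rows)))
```

Here `totals` maps each event specifier to its count summed over the whole run, which is handy when comparing runs of different lengths.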

## Known bugs and limitations

  - Sampling mode support is disabled in this snapshot.

  - P4 HTT CPUs:
    - Using system-mode PMCs on a P4/HTT system and the SCHED_ULE
      scheduler can cause lockups.  The SCHED_4BSD scheduler seems to
      work fine.

      The problem has been narrowed down to the call to 'sched_bind()'
      locking up when switching CPUs.

      You can work around this problem by turning off HTT before
      loading the pmc(4) module.

      # sysctl machdep.hlt_logical_cpus=1
      # kldload pmc

  - Intel Pentium III CPUs:
    - P-III counters see a large jump in value when their counts cross
      2^31.  This has been narrowed down to the fact that writing out
      a perf counter value that has bit 31 (e.g., 0x80000000) set
      seems to trigger a sign extension to 40 bits.
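
The size of the jump described for the P-III can be worked out with a little arithmetic. The sketch below only models the behaviour as described above (a 32-bit value with bit 31 set being sign-extended to 40 bits); it is not derived from hardware documentation, and the function name `sign_extend` is illustrative.

```python
def sign_extend(value, from_bits=32, to_bits=40):
    """Sign-extend the low `from_bits` of `value` to a `to_bits`-wide field."""
    low_mask = (1 << from_bits) - 1
    value &= low_mask
    if value & (1 << (from_bits - 1)):
        # The sign bit is set: fill bits from_bits..to_bits-1 with ones.
        value |= ((1 << to_bits) - 1) & ~low_mask
    return value


below = sign_extend(0x7FFFFFFF)   # bit 31 clear: value is unchanged
above = sign_extend(0x80000000)   # bit 31 set: bits 32..39 become ones

print(hex(below))                 # 0x7fffffff
print(hex(above))                 # 0xff80000000
print(hex(above - 0x80000000))    # 0xff00000000, the size of the jump
```

Under this model, a counter written with 0x80000000 would read back as 0xFF80000000, which matches the "large jump" seen when counts cross 2^31.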

## Next Steps (in approximate order)

  Please contact me if you would like to take up any of these.

  - Implement sampling modes.

  - Test suites for all the PMC architectures supported.

  - A number of Intel P4 specific features (precise sampling,
    PMC cascading etc.) remain to be implemented.

  - Userland tools:
    - Use PMC based instruction pointer sampling with
      gcc -pg.
    - Enhance our profiling support code to use the ability to read
      process-mode PMC counts with the RDPMC instruction.
    - Convert sampling mode output to gprof format.
    - Create a tool that can correlate measured cache/TLB/etc.
      behaviour with data structure layout and code layout.

  - Write documentation suitable for /usr/share/doc/papers/.

  - A port of PAPI.

Last Modified: Sat Apr 21 22:53:24 2007