Exchanges in NA Digest regarding flops

NA digest, December 3, 2000, Issue 49

Subject: Flop Count Missing in Matlab R12
From: Stephen Vavasis 
Date: Thu, 30 Nov 2000 15:13:46 -0500

I installed Matlab 6.0 (R12) last week and was dismayed to find that the
'flops' function is gone.  I have used this function many times in my
published papers to compare two algorithms in computational tests.
Probably many of you also have used 'flops' in your own research.  I'm an
editor of several NA journals and have seen 'flops' used in papers I have
handled.

I propose that numerical analysts should lobby Mathworks to reinstate the
'flops' function in a future version of Matlab.  Do you agree?  I'm not
sure how to proceed.  Perhaps you can send me email if you agree or
disagree.  Or maybe you can make a followup posting to NA Digest if you
have some thoughts about 'flops'.

Clearly there is a connection between our community and Mathworks,
considering that Cleve Moler is the editor of this newsletter!  Maybe we
can use that connection to our advantage!

  Steve Vavasis (vavasis@cs.cornell.edu)


[Cleve's response:] You don't need to lobby us, Steve.  You've already got
our vote for resurrecting FLOPS as soon as we can.

The trouble is, LAPACK and, especially, optimized BLAS have no provision
for keeping track of counts of floating point operations.  Unfortunately,
it would require more extensive software modifications than we are able to
make. There might also be some degradation of performance.

Jack Dongarra and his colleagues at the University of Tennessee ATLAS
project tell us that they can get flop counts through access to privileged
hardware instructions available on some modern chips, but I'm not sure yet how
that works.

It would also be feasible to provide rough flop estimates that would have
approximately the right coefficient of n^3 for matrix operations, but we
don't think that's good enough.  A number of people were concerned that
the traditional FLOP function didn't count every single operation,
including those in system level libraries outside of our control.
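
For concreteness, an estimator of that rough kind might look like the
sketch below.  The function name and the set of operations covered are
illustrative only; this is not something MATLAB provides or has proposed.

        function f = flopest (op, n)
        % FLOPEST  rough flop estimate with the right leading term in n.
        % Illustrative sketch only; not a MathWorks interface.
        switch op
            case 'chol'      % Cholesky factorization: n^3/3 + O(n^2)
                f = n^3/3 ;
            case 'lu'        % LU with partial pivoting: 2n^3/3 + O(n^2)
                f = 2*n^3/3 ;
            case 'mtimes'    % dense matrix multiply: 2n^3
                f = 2*n^3 ;
            otherwise
                error ('unknown operation') ;
        end

An estimate like this gets the leading term right but, as noted above,
still misses operations performed inside system level libraries.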

By the way, I mentioned the FLOP difficulties briefly in my MATLAB News
and Notes column about LAPACK.  
(http://www.mathworks.com/company/newsletter/clevescorner/winter2000.cleve.shtml)

   Cleve

NA digest, December 10, 2000, Issue 50

Subject: Re: Flop Count Missing in Matlab R12
From: Tim Davis 
Date: Tue, 05 Dec 2000 10:53:38 -0500

To count flops, we need to first know what they are.  What is a flop?

LAPACK is not the only place where the question "what is a flop?" is
relevant. Sparse matrix codes are another.  Multifrontal and supernodal
factorization algorithms store L and U (and intermediate submatrices, for
the multifrontal method) as a set of dense submatrices.  It's more
efficient that way, since the dense BLAS can be used within the dense
submatrices.  It is often better to explicitly store some of the numerical
zeros, so that one ends up with fewer frontal matrices or supernodes.

So what happens when I compute zero times zero plus zero?  Is that a flop
(or two flops)?  I computed it, so one could argue that it counts.  But it
was useless, so one could argue that it shouldn't count.  Computing it
allowed me to use more BLAS-3, so I get a faster algorithm that happens to
do some useless flops.  How do I compare the "mflop rate" of two
algorithms that make different decisions on what flops to perform and
which of those to include in the "flop count"?

A somewhat better measure would be to compare the two algorithms based on an
external count.  For example, the "true" flop counts for sparse LU
factorization can be computed in Matlab from the pattern of L and U as:

        [L,U,P] = lu (A) ;
        Lnz = full (sum (spones (L))) - 1 ;    % off diagonal nz in cols of L
        Unz = full (sum (spones (U')))' - 1 ;  % off diagonal nz in rows of U
        flops = 2*Lnz*Unz + sum (Lnz) ;

The same can be done on the LU factors found by any other factorization
code.  This does count a few spurious flops: the computation of
a_ij + l_ik*u_kj is always counted as two flops, even if a_ij is
initially zero.
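
Packaged as a function (the name luflops is mine, for illustration), the
count above can be applied to the factors from any code:

        function f = luflops (L, U)
        % LUFLOPS  flop count implied by the nonzero pattern of L and U,
        % counting 2 flops per off-diagonal update and 1 per division.
        Lnz = full (sum (spones (L))) - 1 ;    % off diagonal nz in cols of L
        Unz = full (sum (spones (U')))' - 1 ;  % off diagonal nz in rows of U
        f = 2*Lnz*Unz + sum (Lnz) ;

so that f = luflops (L, U) gives the same answer no matter which
factorization code produced L and U.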

However, even with this "better" measure, the algorithm that does more
flops can be much faster.  You're better off picking the algorithm with
the smallest memory requirements (which is not always the one with the
smallest nnz (L+U)) and/or the fastest run time.

So my vote is to either leave out the flop count, or at most return a
reasonable agreed-upon estimate (like the "true flop count" for LU, above)
that is somewhat independent of algorithmic details.  Matrix multiply, for
example, should report 2*n^3, as Cleve states in his Winter 2000
newsletter, even though "better" methods with fewer flops (Strassen's
method) are available.
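
Under that convention, an mflop rate is measured against the agreed-upon
count rather than the flops the implementation actually performed, e.g.:

        n = 1000 ;
        A = rand (n) ;  B = rand (n) ;
        tic ; C = A*B ; t = toc ;
        mflops = (2*n^3 / t) / 1e6   % nominal rate from the agreed count

The rate is nominal: a Strassen-based multiply would report the same
2*n^3 numerator even though it performs fewer arithmetic operations.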

Tim Davis
University of Florida
davis@cise.ufl.edu

NA digest, December 10, 2000, Issue 50

Subject: Re: Flop Count Missing in Matlab R12
From: Jack Dongarra 
Date: Tue, 05 Dec 2000 15:29:45 -0500

This is to clarify some points in the discussion last week by Steve
Vavasis and Cleve Moler in the na-digest about counting floating point
operations. We have a project at the University of Tennessee called PAPI
(http://icl.cs.utk.edu/papi/). The Performance API (PAPI) project
specifies a standard application programming interface (API) for accessing
hardware performance counters available on most modern microprocessors.

For years, collecting performance data on application programs has been an
imprecise art.  The user has had to rely on timers with poor resolution or
granularity, imprecise empirical information on the number of operations
performed in the program in question, vague information on the effects of
the memory hierarchy, and so on.

Today hardware counters exist on every major processor platform.  These
counters can provide application developers with valuable information about
the performance of critical parts of the application, and point to ways of
improving performance.  The current problem facing users and tool
developers is that access to these counters is often poorly documented,
unstable, or unavailable to user-level programs.

The focus of the PAPI project is to provide an easy-to-use, common set of
interfaces that will gain access to these performance counters on all major
processor platforms, thereby giving application developers the information
they need to tune their software on different platforms.  The goal is to
make it easy for users to gain access to the counters to aid in performance
analysis, modeling, and tuning.
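
In MATLAB terms, a counter-based replacement for flops might look
something like the sketch below.  Here papi_counter is a hypothetical MEX
gateway to the PAPI C library; PAPI itself does not ship this interface.

        n = 500 ;
        A = rand (n) ;  B = rand (n) ;
        papi_counter ('start') ;    % hypothetical: zero the FP-op counter
        C = A*B ;                   % the work to be measured
        f = papi_counter ('read')   % roughly 2*n^3, counted in hardware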

For more details on PAPI see
http://www.netlib.org/utk/people/JackDongarra/PAPERS/papi-sc2000.pdf

Jack