File listing, and purpose of each file (20190710)
-------------------------------------------------

Please always look up the history of each file to see how current it can
possibly be. A file that has hardly been touched for many years is most
likely completely outdated.


linalg/bwc/CMakeLists-nodist.txt
linalg/bwc/CMakeLists.txt
linalg/bwc/Makefile

 => build system


linalg/bwc/GUIDED_TOUR_OF_SOURCES
linalg/bwc/README
linalg/bwc/README.gfp

 => some bits of documentation


linalg/bwc/mpfq_layer.h
linalg/bwc/mpfq/*

 => low-level code that performs arithmetic in the base field. This is
    auto-generated from mpfq, and in truth only a very tiny portion of
    mpfq is actually used.


linalg/bwc/arith-modp.hpp

 => low-level arithmetic code used for operations modulo p when doing
    linear algebra over prime fields


linalg/bwc/matmul-basic.c
linalg/bwc/matmul-basicp.cpp
linalg/bwc/matmul-bucket.cpp
linalg/bwc/matmul-common.c
linalg/bwc/matmul-common.h
linalg/bwc/matmul-libnames.h.in
linalg/bwc/matmul-mf.c
linalg/bwc/matmul-mf.h
linalg/bwc/matmul-sliced.cpp
linalg/bwc/matmul-sub-large-fbd.S
linalg/bwc/matmul-sub-large-fbd.h
linalg/bwc/matmul-sub-large-fbi.S
linalg/bwc/matmul-sub-large-fbi.h
linalg/bwc/matmul-sub-small1.S
linalg/bwc/matmul-sub-small1.h
linalg/bwc/matmul-sub-small2.S
linalg/bwc/matmul-sub-small2.h
linalg/bwc/matmul-sub-vsc-combine.S
linalg/bwc/matmul-sub-vsc-combine.h
linalg/bwc/matmul-sub-vsc-dispatch.S
linalg/bwc/matmul-sub-vsc-dispatch.h
linalg/bwc/matmul-threaded.c
linalg/bwc/matmul-zone.cpp
linalg/bwc/matmul.c
linalg/bwc/matmul.h
linalg/bwc/matmul_facade.c
linalg/bwc/matmul_facade.h

 => low-level code that creates the in-memory matrix cache for fast
    matrix-times-vector evaluation. The important entry points are
    linalg/bwc/matmul-bucket.cpp for matrices over GF(2) and
    linalg/bwc/matmul-zone.cpp for matrices over GF(p).


linalg/bwc/bench_matcache.c
linalg/bwc/build_matcache.c

 => utilities & bench code for the matrix cache; this builds upon the
    matmul_ files.


linalg/bwc/balancing.c
linalg/bwc/balancing.h
linalg/bwc/intersections.c
linalg/bwc/intersections.h
linalg/bwc/balancing_workhorse.cpp
linalg/bwc/balancing_workhorse.h
linalg/bwc/raw_matrix_u32.h
linalg/bwc/mf.c
linalg/bwc/mf.h
linalg/bwc/mf_bal.c
linalg/bwc/mf_bal.h
linalg/bwc/mf_bal_main.c
linalg/bwc/mf_scan.c
linalg/bwc/mf_scan2.cpp

 => code that implements the dispatching of matrix data to several
    nodes. The mf_ files define some auxiliary binaries that are
    useful, and even used in test scripts.


linalg/bwc/random_matrix.c
linalg/bwc/random_matrix.h

 => code to generate random matrices (including an API to do that on
    the fly)


linalg/bwc/parallelizing_info.c
linalg/bwc/parallelizing_info.h

 => backend that implements the hybrid MPI+threads computing model
    used by bwc. Take it with a grain of salt: by the time it was
    written, MPI-3 didn't exist, and this was the only way to go. Now,
    with the MPI-3 MPI_Win primitives, we would perhaps have been able
    to do without all this complication.


linalg/bwc/matmul_top.c
linalg/bwc/matmul_top.h

 => very important piece of code. This implements the parallel
    matrix-times-vector operation.


linalg/bwc/bwc.pl

 => main driver script; prepares command lines for the binaries.


linalg/bwc/flint-fft/*
linalg/bwc/test-flint.c
linalg/bwc/update-flint.sh
linalg/bwc/parse-flint-fft-doc.pl

 => low-level arithmetic that is specific to the lingen part of BW,
    and only over GF(p). This includes an ad hoc implementation of an
    FFT caching interface.


linalg/bwc/lingen_bigmatpoly_ft.c
linalg/bwc/lingen_bigmatpoly_ft.h
linalg/bwc/lingen_bigmatpoly.c
linalg/bwc/lingen_bigmatpoly.h
linalg/bwc/lingen_matpoly_ft.cpp
linalg/bwc/lingen_matpoly_ft.h
linalg/bwc/lingen_matpoly.c
linalg/bwc/lingen_matpoly.h
linalg/bwc/lingen_polymat.c
linalg/bwc/lingen_polymat.h

 => low-level arithmetic code that multiplies matrices of polynomials
    over prime fields, used for lingen in BW over GF(p)


linalg/bwc/strassen-thresholds.hpp
linalg/bwc/lingen_mat_types.hpp

 => low-level arithmetic code that deals with matrices of polynomials
    over GF(2), used for lingen in BW over GF(2). The multiplication is
    then forwarded to GF2X.


linalg/bwc/lingen_qcode_binary.c
linalg/bwc/lingen_qcode_binary.h
linalg/bwc/lingen_qcode_binary_do.cpp

 => low-level arithmetic code that does the quadratic base case of BW
    over GF(2)


linalg/bwc/blocklanczos.c

 => top-level implementation of block Lanczos (builds upon matmul_top,
    really).


linalg/bwc/cpubinding.cpp
linalg/bwc/cpubinding.h
linalg/bwc/worker-threads.c
linalg/bwc/worker-threads.h

 => used for threads. cpubinding plays an important role, while
    worker-threads is actually mostly unused. Most of the thread-level
    stuff is handled by either parallelizing_info or OpenMP.


linalg/bwc/async.c
linalg/bwc/async.h
linalg/bwc/logline.c
linalg/bwc/logline.h
linalg/bwc/tree_stats.cpp
linalg/bwc/tree_stats.hpp

 => utility code for printing intermediate timings (async.[ch]:
    per-iteration timings, e.g. in krylov and mksol; logline.[ch] and
    tree_stats.[ch]pp: used in (p)lingen).


linalg/bwc/acollect.c

 => utility code for stitching together all A files before calling
    lingen. This should not exist.


linalg/bwc/bw-common.c
linalg/bwc/bw-common.h
linalg/bwc/bwc_config_h.in
linalg/bwc/cheating_vec_init.h
linalg/bwc/rolling.c
linalg/bwc/rolling.h
linalg/bwc/xvectors.c
linalg/bwc/xvectors.h
linalg/bwc/xdotprod.c
linalg/bwc/xdotprod.h

 => utility files (from the most to the least general).


linalg/bwc/dispatch.cpp
linalg/bwc/prep.cpp
linalg/bwc/secure.cpp
linalg/bwc/krylov.cpp
linalg/bwc/lingen-*
linalg/bwc/mksol.cpp
linalg/bwc/gather.cpp
linalg/bwc/cleanup.c
linalg/bwc/bwccheck.cpp

 => main code for the different binaries. See linalg/bwc/README.


linalg/bwc/README.TORQUE
linalg/bwc/TODO
linalg/bwc/lingen.txt

 => outdated documentation


linalg/bwc/bcast-file.c

 => debug/bench/example code to share a file across multiple nodes
    over MPI. Can be handy in scripts.


linalg/bwc/bench_polmatmul.cpp

 => old bench code, being phased out.


The pi_go() function
--------------------

This is the main workhorse for the parallelizing of the computation. It
allows multi-threaded (with pthreads) and multi-node (with MPI)
processing. Some of the calls offered by the parallelizing_info
structure are reminiscent of the MPI calling interface.
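
As a concrete (if simplified) illustration, here is roughly what a
function run under pi_go() can look like. It only uses calls described
below, and it is a sketch under assumptions, not actual bwc code (the
exact prototypes should be checked in parallelizing_info.h):

    /* sketch only -- not taken from the bwc sources */
    void * my_fcn(parallelizing_info_ptr pi, void * arg)
    {
        pi_comm_ptr wr = pi->wr[0];   /* one of the communicators in pi */
        unsigned long x = 0;

        pi_hello(pi);                 /* sanity check: everybody says hello */

        if (wr->trank == 0)
            x = 42;                   /* thread 0 of each job picks a value */

        /* share it with the other threads of the job, then make sure
         * everybody reached this point before going on */
        pi_thread_bcast(&x, sizeof(x), BWC_PI_BYTE, 0, wr);
        serialize(wr);

        (void) arg;                   /* extra parameters travel via arg */
        return NULL;
    }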

The parameters given to pi_go() are:
 . an outer description of how many MPI jobs and threads should be set up.
 . a pointer to the function that must be called in parallel.
 . arguments which are to be passed to the function.

Its implementation is in parallelizing_info.[ch]. The main data
structure that conveys information on the parallel setting is
    struct parallelizing_info_s
The details are not so simple... but in a nutshell, we get 3
communicators, the meaning of which is more or less a generalization of
"MPI_Comm" that also encompasses the pthread level. The first
communicator allows full communication, the second is for row-wise
communications, and the third is for column-wise communications.

The prototype of the function passed to pi_go() is
    void *(*fcn)(parallelizing_info_ptr pi, void * arg)
where parallelizing_info_ptr pi is built by pi_go() and passed to the
function. The last argument allows any additional data to be passed
through.

Below are a few things you can do with the pi variable. In these
prototypes, wr should be one of the communicators inside pi.
    pi_hello(pi)        // a small hello-world that tests the pi
    pi_thread_bcast(void* ptr, size_t size, BWC_PI_BYTE, uint root, wr)
                        // broadcasting function at thread level
    pi_bcast(...)
                        // broadcasting function at MPI+thread level
    serialize(wr)       // serializing stuff
    serialize_threads(wr)
The most interesting example of using these is matmul_top.[ch]. Be
careful: in these files, the pi's communicators are complemented by
other communicators that link pieces of the matrix. In particular, there
are two intersecting communicators which are sub-communicators of the
larger one. This implies that serializing on one of these
sub-communicators concurrently from all threads of the other
communicator is illegal, because serialize() entails MPI calls which do
not handle concurrency well.

The "abase" mechanism
---------------------

The matrix code is kind of generic w.r.t. the type of objects inside the
matrix. The "abase" mechanism provides this genericity -- don't expect a
clean ring structure, beautifully following the mathematical notions.
This layer is built from the mpfq library, together with a table of
indirect functions serving as wrappers.

Thanks to the abase layer, the same code may be used both in
characteristic two and in characteristic p, with only minor changes.

The abase layer is accessed in different ways depending on whether we're
looking at a higher-level or lower-level part of the code.

For the higher-level code (typically matmul_top.c, krylov.c, etc.), we
have some object-oriented access. Everything goes through the
mpfq_vbase_ptr type, which is the type of the parent structure holding
all the required function pointers. The elements are cast to void*
pointers, with all the accompanying infelicities. The code compiled in
this way is independent of the actual implementation of the underlying
arithmetic. (A much-simplified sketch of this idea is given below.)

For lower-level code (typically matmul-bucket.cpp), it's different. We
have two main types:
    abdst_field - a descriptor of the field we're working with
    abelt - an element.
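
To sketch the object-oriented access mentioned above: the idea is a
table of function pointers operating on void * elements. This is not the
real mpfq_vbase interface; all names below are made up:

    #include <stddef.h>

    /* illustration only; the actual layer is generated from mpfq */
    struct fake_vbase {
        size_t elt_size;                    /* bytes per element */
        void (*set_zero)(void * dst);
        void (*add)(void * dst, const void * a, const void * b);
    };

    /* Generic code never sees the element type, only void * and the
     * function table. This is how the higher-level code stays
     * independent of the underlying arithmetic. */
    void fake_vec_add(const struct fake_vbase * A, void * w,
                      const void * u, const void * v, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            A->add((char *) w + i * A->elt_size,
                   (const char *) u + i * A->elt_size,
                   (const char *) v + i * A->elt_size);
    }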

Each instance of the abase mechanism is a set of files, as follows:
    mpfq/mpfq_<something>.c
    mpfq/mpfq_<something>.h
    mpfq/mpfq_<something>_t.c
    mpfq/mpfq_<something>_t.h
This may describe elements whose sizes are known at compile time
(uint64_t), or blocks of them, of width possibly known only at run time
(this is the case for the mpfq_pz layer).

The idea is to have genericity and variable width for boring stuff (like
prep) that does not need maximal efficiency, and fixed-size code for the
main operations inside krylov.

See the example uses of the abase layer in either high-level or
low-level code for details on how to use this. All functions follow the
mpfq API.

Currently the following are implemented:
    abase-u64k1  // uint64_t
    abase-m128   // uint128_t, packed into SSE2
    abase-u64k#  // block of # uint64_t, size known at compile time,
                 // with # being 1, 2, 3, or 4

The main use of these types is for vectors to be multiplied by the
binary matrix. We have for instance in matmul.[ch] a function called:
    void matmul_mul(matmul_ptr M , void * B , void const * A , int d)
that computes B = M . A (or the transpose product if d=1), where A and B
are vectors of "abase" elements.

Matrix - vector product
-----------------------

The code that runs on a single thread is in matmul.[ch] and the
different implementations it encapsulates. For the moment:
    matmul-basic.[ch]
    matmul-sliced.{cpp,h}
    matmul-bucket.{cpp,h}   [ default ]

Practically all CPU time is spent in these routines. Therefore,
optimizing these for a particular CPU/memory controller combination is
very relevant. See more on this particular step in the ``low-level''
section below.

On top of this very basic single-threaded matrix x vector product, the
matmul_top.[ch] mechanism handles the parallelizing: each node / thread
gets a piece of the matrix (produced by the balancing step) and does its
own small block x partial_vector product, and then the partial results
are transmitted and collected.

The intersections.[ch] files are used to help with the communications.
In a word, they describe how a horizontal vector (spread over various
nodes/threads) can be transposed into a vertical vector (also spread
over various nodes/threads). This is tricky and will be explained
somewhere else (the code is short, but beware of headaches).

Matrix - vector product -- low-level
------------------------------------

Among the different possible source files listed above, presumably only
one will be used for a single bwc run. All bwc binaries (well, the
important ones: prep, secure, krylov, mksol, gather) accept the mm_impl=
argument, which selects the implementation to use.

The primary on-disk matrix is in a trivial binary format (32-bit integer
indices for non-zero coefficients, each row prefixed by its length).
However, the in-memory representation used for fast computation is
different, and depends on the implementation. For convenience, a
``cached'' matrix file matching the in-memory representation for a given
implementation is generated on first use, and saved on disk.

[ Note that this single task can be benched separately from the bw
  infrastructure with the standalone bench program. Be aware though
  that some of the performance difficulty arises when all CPU cores are
  trying to interact with memory simultaneously. ]

(documentation to be continued).
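
To make the semantics concrete, here is a toy reference for what the
matmul layer computes over GF(2) with the abase-u64k1 element type (one
uint64_t per vector entry, i.e. 64 binary vectors handled at once),
taking the matrix directly in the trivial row-based format described
above. This shows what matmul_mul() computes for d=0, not how
matmul-bucket.cpp computes it -- the whole point of the cached format is
to do much better than this loop:

    #include <stdint.h>
    #include <string.h>

    /* rows points to nrows records, each being "length, then that many
     * 32-bit column indices"; A has one uint64_t per matrix column,
     * B one per matrix row. */
    void toy_matmul_u64k1(uint64_t * B, const uint32_t * rows,
                          uint32_t nrows, const uint64_t * A)
    {
        const uint32_t * p = rows;
        memset(B, 0, nrows * sizeof(uint64_t));
        for (uint32_t i = 0; i < nrows; i++) {
            uint32_t len = *p++;        /* number of non-zero coefficients */
            for (uint32_t k = 0; k < len; k++)
                B[i] ^= A[*p++];        /* over GF(2), adding is xor */
        }
    }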

Distributed vectors -- the mmt_vec type
---------------------------------------

The mmt_vec type holds vector data, which is to be distributed across
the different threads and jobs. An mmt_vec is split into as many pieces
as we have threads and jobs: when we work with an N*N matrix split into
a mesh of nh rows and nv columns, each thread "owns" a fraction
N/(nh*nv) of a vector. However, the mmt_vec has storage space for
slightly more than that, because each thread at some point needs access
to all the data required to perform a local matrix times vector, or
vector times matrix, product.

We first focus on the chunks (of size N/(nh*nv)) owned by the various
threads. How a vector is distributed depends on the ordering in which
these parts are spread across threads. This is controlled by a
"direction" parameter d, stored in the mmt_vec structure; we say that an
mmt_vec is distributed "in direction d". For d==0, the chunks are
distributed row-major with respect to the parallelizing_info structure:
 - chunks 0 to nv-1 go to the first row (threads with pi->wr[1]->trank
   == 0 and pi->wr[1]->jrank == 0),
 - chunks nv to 2nv-1 go to the second row, and so on.
When d==1, the chunks are distributed column-major.

Now, as said above, a vector has storage for slightly more than the
chunk it owns. The storage area for v is v->v. As a whole, v->v is meant
to contain coefficients for the index range [v->i0..v->i1[.

The chunk owned by v is a subpart of v->v; it starts at the pointer
mmt_my_own_subvec(), namely at offset mmt_my_own_offset_in_items()
within v->v, and its length is mmt_my_own_size(). The index range
corresponding to this sub-area is
[v->i0+own_offset..v->i0+own_offset+own_size[.

Several threads and jobs are likely to have data areas (v->v) relative
to the same index range [v->i0..v->i1[. This raises the question of how
consistent these data areas are.

A vector may be:
 - fully consistent: this means that all threads and jobs which
   have a data area for the range [v->i0..v->i1[ agree on its
   contents.
 - partially consistent: each job and thread is correct about its
   own sub-area, but the other parts may be bogus.
 - inconsistent: data differs everywhere, and even each thread's
   own sub-area contains data which is meaningless if we forget
   about what's in the other threads for this index range.

Matrix - vector product (1)
---------------------------

The matmul_top_data type is responsible for tying together the
different threads and nodes, so that they work collectively on a matrix
times vector product.

The matmul_top_data object (generally abbreviated mmt) contains in
particular:
 - on each thread/job, a pointer to the local share of the matrix,
 - a pointer to a parallelizing_info structure, which provides the
   different communicators,
 - more bookkeeping info about the matrix (embedded permutations and
   such, as well as the total number of rows and columns of the
   matrix).

The main function is matmul_top_mul(), which decomposes into two steps:
    matmul_top_mul_cpu(), which does no communication;
    matmul_top_mul_comm(), which does (almost) no computation.

In a Wiedemann setting, bw->dir (nullspace=... at the top level) should
be the same as the d parameter passed to matmul_top_mul().

In a Lanczos setting, matmul_top_mul() is called in both directions.
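
Coming back to the chunk layout of the previous section, the ownership
rule can be summarized by a small helper computing which index range a
given position in the nh x nv mesh owns, depending on the direction d.
This is only an illustration with made-up names (the real code uses
mmt_my_own_offset_in_items() and friends, and pads the vector so that
the division need not be exact):

    #include <stdint.h>

    struct toy_chunk { uint64_t start, size; };

    struct toy_chunk toy_owned_chunk(uint64_t N, unsigned int nh,
                                     unsigned int nv, unsigned int row,
                                     unsigned int col, int d)
    {
        uint64_t csize = N / ((uint64_t) nh * nv); /* assume exact division */
        uint64_t k = (d == 0)
            ? (uint64_t) row * nv + col    /* d==0: chunks go row-major */
            : (uint64_t) col * nh + row;   /* d==1: column-major */
        struct toy_chunk c = { k * csize, csize };
        return c;
    }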

Apart from this d=[0|1] nightmare, matmul_top_mul_cpu() just calls
matmul_mul(), and puts the result in the destination (partial) vector.

The input of matmul_top_mul_cpu() must be a distributed vector with full
consistency.

The output of matmul_top_mul_cpu() is a vector distributed in the
opposite direction, but which is inconsistent. We need to do some work
in order to reconcile the data.

This tricky part is the matmul_top_mul_comm() part.

Matrix - vector product (2)
---------------------------

Two phases of communication are defined in order to reach full
consistency again:
 - mmt_vec_reduce()
 - mmt_vec_broadcast()
Together they take as input the result of the cpu step, which is in the
"destination" vector, and do all the communications needed to
reconstruct the vector for the next iteration (thus, stored again in the
"source" vector).
Most of the comments in the source code somehow implicitly assume that
the vector being passed is distributed in the "d=1" direction, even
though "d=0" is probably the most extensively tested. Such a point of
view is also visible in the naming of variables, e.g. when we write:
    pi_comm_ptr picol = mmt->pi->wr[d];
this actually corresponds to a column communicator only when d==1.

mmt_vec_reduce is actually a slight misnomer. The real operation is a
"reduce-scatter" operation, in MPI parlance. It adds all the data, and
sends each share of the resulting vector to its owner. Of course, it is
really a reduction "across" only in the case where we're doing a matrix
times vector product.

Likewise, mmt_vec_broadcast is more accurately an MPI "all-gather"
operation. Each thread sends the data it owns to all the threads which
are going to need it.

Note that the vector layout which has been presented has the effect
that, if the matrix is in fact the identity, then matmul_top_mul() will
not compute exactly the expected w = v * Id, but a shuffle of it, which
involves some transposition of indices. In other words, what is computed
corresponds to a matrix M' = P^-1*M, where P^-1 is a permutation matrix.
This reduces / simplifies / accelerates the mul_comm() part, and if the
goal is to find a kernel vector, knowing one for M' is equivalent to
knowing one for M. There's more about this in matmul_top.c.


*** Reduce_accross:

If at the top level one has d=1 (nullspace=right), then reduce_accross()
will be called with d=0, i.e. we want to reduce across rows.
First, the reduction is done at the thread level. This is just some
additions of vectors, done in the appropriate places in memory.
Advantage is taken of multithreading here: each thread is in charge of a
subset of the rows. This is done in place (i.e., still in the dst
vector).
Second, the reduction is done at the MPI level. Here, only thread 0 in
each node is working. This is done with MPI_Reduce_scatter (or any
replacement routine). And this is where the shuffle is implicitly
started. Indeed, each job will take care of reducing only a portion of
its indices. Therefore, in the end, the pieces of the vector are not
where they should be.
The reduction at the MPI level is where the result is written back into
the src vector, so as to be ready for the next iteration.

*** Broadcast_down:

The broadcast is simply a no-op at the thread level, since with shared
memory, everyone sees everyone else's data.
At the MPI level, though, there is again some work to do, and only
thread number 0 in each node is in charge. The MPI primitive that is
used here is MPI_Allgather. If at the top level one has d=1, then this
is called with d=1, i.e. the communications happen within columns.
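
To summarize the two MPI-level phases in isolation, here is a bare-bones
sketch outside of any bwc data structure; it ignores the thread level
and the shuffle discussed above, and simply assumes that every job holds
a full (inconsistent) copy of an n-item block of the vector, over GF(2)
where adding is a bitwise xor. This mirrors the pattern
(MPI_Reduce_scatter followed by MPI_Allgather), not the actual bwc code:

    #include <mpi.h>
    #include <stdint.h>
    #include <stdlib.h>

    void toy_reduce_then_broadcast(uint64_t * v, int n, MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);
        int share = n / size;                  /* assume size divides n */

        int * counts = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++)
            counts[i] = share;
        uint64_t * mine = malloc(share * sizeof(uint64_t));

        /* phase 1 ("reduce-scatter"): add everybody's copy, each job
         * keeps only the share it owns */
        MPI_Reduce_scatter(v, mine, counts, MPI_UINT64_T, MPI_BXOR, comm);

        /* phase 2 ("all-gather"): redistribute the shares so that every
         * job ends up with the full, consistent block */
        MPI_Allgather(mine, share, MPI_UINT64_T, v, share, MPI_UINT64_T,
                      comm);

        free(mine);
        free(counts);
    }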