
File listing, and purpose of each file (20190710)
-------------------------------------------------

Please always look up the history of each file to see how current it can
possibly be. A file that has hardly been touched for many years is almost
surely completely outdated.


linalg/bwc/CMakeLists-nodist.txt
linalg/bwc/CMakeLists.txt
linalg/bwc/Makefile

    => build system


linalg/bwc/GUIDED_TOUR_OF_SOURCES
linalg/bwc/README
linalg/bwc/README.gfp

    => some bits of documentation


linalg/bwc/mpfq_layer.h
linalg/bwc/mpfq/*

    => low-level code that performs arithmetic in the base field. This is
    auto-generated from mpfq, and in truth only a very tiny portion of
    mpfq is actually used.


linalg/bwc/arith-modp.hpp

    => low-level arithmetic code used for operations modulo p when doing
    linear algebra over prime fields


linalg/bwc/matmul-basic.c
linalg/bwc/matmul-basicp.cpp
linalg/bwc/matmul-bucket.cpp
linalg/bwc/matmul-common.c
linalg/bwc/matmul-common.h
linalg/bwc/matmul-libnames.h.in
linalg/bwc/matmul-mf.c
linalg/bwc/matmul-mf.h
linalg/bwc/matmul-sliced.cpp
linalg/bwc/matmul-sub-large-fbd.S
linalg/bwc/matmul-sub-large-fbd.h
linalg/bwc/matmul-sub-large-fbi.S
linalg/bwc/matmul-sub-large-fbi.h
linalg/bwc/matmul-sub-small1.S
linalg/bwc/matmul-sub-small1.h
linalg/bwc/matmul-sub-small2.S
linalg/bwc/matmul-sub-small2.h
linalg/bwc/matmul-sub-vsc-combine.S
linalg/bwc/matmul-sub-vsc-combine.h
linalg/bwc/matmul-sub-vsc-dispatch.S
linalg/bwc/matmul-sub-vsc-dispatch.h
linalg/bwc/matmul-threaded.c
linalg/bwc/matmul-zone.cpp
linalg/bwc/matmul.c
linalg/bwc/matmul.h
linalg/bwc/matmul_facade.c
linalg/bwc/matmul_facade.h

    => low-level code that creates the in-memory matrix cache for fast
    matrix-times-vector evaluation. The important entry points are
    linalg/bwc/matmul-bucket.cpp for matrices over GF(2) and
    linalg/bwc/matmul-zone.cpp for matrices over GF(p).


linalg/bwc/bench_matcache.c
linalg/bwc/build_matcache.c

    => utilities & bench code for the matrix cache; this builds upon the
    matmul_ files.


linalg/bwc/balancing.c
linalg/bwc/balancing.h
linalg/bwc/intersections.c
linalg/bwc/intersections.h
linalg/bwc/balancing_workhorse.cpp
linalg/bwc/balancing_workhorse.h
linalg/bwc/raw_matrix_u32.h
linalg/bwc/mf.c
linalg/bwc/mf.h
linalg/bwc/mf_bal.c
linalg/bwc/mf_bal.h
linalg/bwc/mf_bal_main.c
linalg/bwc/mf_scan.c
linalg/bwc/mf_scan2.cpp

    => code that implements the dispatching of matrix data to several
    nodes. The mf_ files define some auxiliary binaries that are useful,
    and are even used in test scripts.


linalg/bwc/random_matrix.c
linalg/bwc/random_matrix.h

    => code to generate random matrices (including an API to do that on
    the fly)


linalg/bwc/parallelizing_info.c
linalg/bwc/parallelizing_info.h

    => backend that implements the hybrid MPI+threads computing model
    used by bwc. Take it with a grain of salt: at the time it was
    written, MPI-3 didn't exist, and this was the only way to go. Now,
    with the MPI-3 MPI_Win primitives, we would perhaps have been able to
    do without all this complication.


linalg/bwc/matmul_top.c
linalg/bwc/matmul_top.h

    => very important piece of code. This implements the parallel
    matrix-times-vector operation.


linalg/bwc/bwc.pl

    => main driver script; it prepares the command lines for the various
    binaries.


linalg/bwc/flint-fft/*
linalg/bwc/test-flint.c
linalg/bwc/update-flint.sh
linalg/bwc/parse-flint-fft-doc.pl

    => low-level arithmetic that is specific to the lingen part of BW,
    and only over GF(p). This includes an ad hoc implementation of an FFT
    caching interface.


linalg/bwc/lingen_bigmatpoly_ft.c
linalg/bwc/lingen_bigmatpoly_ft.h
linalg/bwc/lingen_bigmatpoly.c
linalg/bwc/lingen_bigmatpoly.h
linalg/bwc/lingen_matpoly_ft.cpp
linalg/bwc/lingen_matpoly_ft.h
linalg/bwc/lingen_matpoly.c
linalg/bwc/lingen_matpoly.h
linalg/bwc/lingen_polymat.c
linalg/bwc/lingen_polymat.h

    => low-level arithmetic code that multiplies matrices of polynomials
    over prime fields, used for lingen in BW over GF(p).


linalg/bwc/strassen-thresholds.hpp
linalg/bwc/lingen_mat_types.hpp

    => low-level arithmetic code that deals with matrices of polynomials
    over GF(2), used for lingen in BW over GF(2). The multiplication is
    then forwarded to GF2X.


linalg/bwc/lingen_qcode_binary.c
linalg/bwc/lingen_qcode_binary.h
linalg/bwc/lingen_qcode_binary_do.cpp

    => low-level arithmetic code that does the quadratic base case of BW
    over GF(2).



linalg/bwc/blocklanczos.c

    => top-level implementation of block Lanczos (builds upon matmul_top,
    really).


linalg/bwc/cpubinding.cpp
linalg/bwc/cpubinding.h
linalg/bwc/worker-threads.c
linalg/bwc/worker-threads.h

    => used for threads. cpubinding plays an important role, while
    worker-threads is actually mostly unused. Most of the thread-level
    stuff is handled by either parallelizing_info or OpenMP.


linalg/bwc/async.c
linalg/bwc/async.h
linalg/bwc/logline.c
linalg/bwc/logline.h
linalg/bwc/tree_stats.cpp
linalg/bwc/tree_stats.hpp

    => utility code for printing intermediate timings (async.[ch]:
    per-iteration timings, e.g. in krylov and mksol; logline.[ch] and
    tree_stats.[ch]pp: used in (p)lingen).


linalg/bwc/acollect.c

    => utility code for stitching together all A files before calling
    lingen. This should not exist.


linalg/bwc/bw-common.c
linalg/bwc/bw-common.h
linalg/bwc/bwc_config_h.in
linalg/bwc/cheating_vec_init.h
linalg/bwc/rolling.c
linalg/bwc/rolling.h
linalg/bwc/xvectors.c
linalg/bwc/xvectors.h
linalg/bwc/xdotprod.c
linalg/bwc/xdotprod.h

    => utility files (from the most to the least general).


linalg/bwc/dispatch.cpp
linalg/bwc/prep.cpp
linalg/bwc/secure.cpp
linalg/bwc/krylov.cpp
linalg/bwc/lingen-*
linalg/bwc/mksol.cpp
linalg/bwc/gather.cpp
linalg/bwc/cleanup.c
linalg/bwc/bwccheck.cpp

    => main code for the different binaries. See linalg/bwc/README.


linalg/bwc/README.TORQUE
linalg/bwc/TODO
linalg/bwc/lingen.txt

    => outdated documentation


linalg/bwc/bcast-file.c

    => debug/bench/example code to share a file across multiple nodes
    over MPI. Can be handy in scripts.


linalg/bwc/bench_polmatmul.cpp

    => old bench code, being phased out.


The pi_go() function
--------------------

This is the main workhorse for parallelizing the computation. It allows
multi-threaded (with pthreads) and multi-node (with MPI) processing.
Some of the calls offered by the parallelizing_info structure are
reminiscent of the MPI calling interface.

The parameters given to pi_go() are:
   . an outer description of how many MPI jobs and threads should be set up.
   . a pointer to the function that must be called in parallel.
   . arguments which are to be passed to the function.

Its implementation is in parallelizing_info.[ch]. The main data
structure that conveys information on the parallel setting is
  struct parallelizing_info_s
The details are not so simple... but in a nutshell, we get 3
communicators, the meaning of which is more or less a generalization of
"MPI_Comm" that also encompasses the pthread level. The first communicator
allows full communication, the second is for row-wise communications, the
third is for column-wise communications.

The prototype of the function passed to pi_go() is
  void *(*fcn)(parallelizing_info_ptr pi, void * arg)
where parallelizing_info_ptr pi is built by pi_go() and passed to the
function. The last argument allows arbitrary additional data to be passed
through.

Below are a few things you can do with the pi variable. In these protos,
wr should be one of the communicators inside pi.
  pi_hello(pi)           // a small hello-world that tests the pi
  pi_thread_bcast(void* ptr, size_t size, BWC_PI_BYTE, uint root, wr)
                        // Broadcasting function at thread level
  pi_bcast(...)
                        // Broadcasting function at MPI+thread level
  serialize(wr)         // Serializing stuff
  serialize_threads(wr)
The most interesting example of using these is matmul_top.[ch]. Be
careful: in these files, the pi's communicators are complemented by other
communicators that link pieces of the matrix. In particular, there are
two intersecting communicators which are sub-communicators of the larger
one. This implies that serializing on one of these sub-communicators with
all threads of the other communicator concurrently is illegal, because
serialize() entails MPI calls which do not handle concurrency well.
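
As a minimal illustration, here is a sketch of a worker function with the
prototype expected by pi_go(), using only the calls listed above. The
name of the global communicator field (pi->m) and the exact argument list
of pi_go() itself are assumptions made for this example; see
parallelizing_info.h for the authoritative interface.

    /* Hypothetical sketch, not actual project code. */
    void * my_worker(parallelizing_info_ptr pi, void * arg)
    {
        (void) arg;            /* extra data forwarded by pi_go(), unused */

        /* every job/thread of the grid enters this function */
        pi_hello(pi);          /* quick sanity check of the communicators */

        /* ... thread- and MPI-parallel work goes here ... */

        serialize(pi->m);      /* barrier on the full communicator
                                  (field name assumed) */
        return NULL;
    }

The function is then handed to pi_go(), together with the description of
how many MPI jobs and threads to set up, and the argument to forward as
arg.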

The "abase" mechanism
---------------------

The matrix stuff is kind of generic w.r.t. the type of objects inside the
matrix. The "abase" mechanism provides this genericity -- don't expect a
clean ring structure, beautifully following the mathematical notions.
This layer is built from the mpfq library, together with a table of
indirect functions serving as wrappers.

Thanks to the abase layer, the same code may be used both in
characteristic two and in characteristic p, with only minor changes.

The abase layer is accessed in different ways depending on whether we're
looking at a higher level or lower level part of the code.

For the higher level code (typically, matmul_top.c, krylov.c, etc), we
have some object-oriented access. Everything goes through the
mpfq_vbase_ptr type, which is the type of the parent structure, holding
all the required function pointers.  The elements are cast to void*
pointers, with all the accompanying infelicities. The code compiled in
this way is independent of the actual implementation of the underlying
arithmetic.
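
As a small, purely illustrative sketch of what this looks like from the
caller's side (the member names vec_init/vec_set_zero/vec_clear are
placeholders chosen for the example -- the real ones are whatever the
generated mpfq_vbase structure provides):

    /* Hypothetical sketch; member names are placeholders. */
    void demo(mpfq_vbase_ptr A, unsigned int n)
    {
        void * v;                  /* elements only travel as void*      */
        A->vec_init(A, &v, n);     /* allocate a vector of n elements    */
        A->vec_set_zero(A, v, n);  /* arithmetic goes through the table  */
        /* ... hand v over to matmul_top / krylov level code ... */
        A->vec_clear(A, &v, n);
    }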

For lower level code (typically, matmul-bucket.cpp), it's different. We
have two main types:
    abdst_field - a descriptor of the field we're working with
    abelt       - an element.

Each instance of the abase mechanism is a set of files, as follows:
    mpfq/mpfq_<something>.c
    mpfq/mpfq_<something>.h
    mpfq/mpfq_<something>_t.c
    mpfq/mpfq_<something>_t.h
This may describe elements whose sizes are known at compile time
(uint64_t), or blocks of them, of width possibly known only at
run time (this is the case for the mpfq_pz layer).

The idea is to have genericity and variable width for boring stuff (like
prep) that does not need maximal efficiency, and fixed-size code for
the main operations inside Krylov.

See the example uses of the abase layer in either high level or low level
code for details on how to use this. All functions follow the mpfq
API.

Currently the following are implemented:
  abase-u64k1   // uint64_t
  abase-m128    // uint128_t, packed into SSE2
  abase-u64k#   // block of # uint64_t, size known at compile-time, with
                // # being 1,2,3, or 4

The main use of this type is for vectors to be multiplied by the binary
matrix. We have for instance in matmul.[ch] a function called:
  void matmul_mul(matmul_ptr M, void * B, void const * A, int d)
that computes B = M . A (or the transposed product if d=1), where A and B
are vectors of "abase" elements.
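
To fix ideas, here is a naive, self-contained rendition of what such a
product boils down to when the abase elements are uint64_t blocks (as
with abase-u64k1): each element packs 64 binary coordinates, and a row of
the matrix simply XORs together the source elements indexed by its
non-zero columns. This is illustrative code only, not the optimized
matmul-* routines.

    #include <stdint.h>

    /* Naive B = M . A over GF(2). rows[i] lists the ncols[i] non-zero
     * column indices of row i; each A[j] / B[i] packs 64 binary vectors. */
    static void naive_matmul_gf2(uint64_t * B, uint64_t const * A,
                                 uint32_t const * const * rows,
                                 uint32_t const * ncols, uint32_t nrows)
    {
        for (uint32_t i = 0; i < nrows; i++) {
            uint64_t acc = 0;
            for (uint32_t k = 0; k < ncols[i]; k++)
                acc ^= A[rows[i][k]];   /* addition over GF(2) is XOR */
            B[i] = acc;
        }
    }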

Matrix - vector product
-----------------------

The code that will run on a single thread is in matmul.[ch] and the
different implementations it encapsulates. For the moment:
    matmul-basic.[ch]
    matmul-sliced.{cpp,h}
    matmul-bucket.{cpp,h}       [ default ]

Practically all CPU time is spent in these routines. Therefore, optimizing
these for a particular CPU/memory controller combination is very relevant.
See more on this particular step in the ``low-level'' section below.

On top of this very basic single-threaded matrix x vector product, the
matmul_top.[ch] mechanism handles the parallelization: each node / thread
gets a piece of the matrix (produced by the balancing step) and does its
own small block x partial_vector product, and then the partial results are
transmitted and collected.

The intersections.[ch] files are used to help the communications. In a
word, they describe how a horizontal vector (spread over various
nodes/threads) can be transposed into a vertical vector (also spread over
various nodes/threads). This is tricky and will be explained somewhere
else (the code is short, but beware of headaches).

Matrix - vector product -- low-level
------------------------------------

Among the different possible source files listed above, presumably only one
will be used for a single bwc run. All bwc binaries (well, the important
ones: prep, secure, krylov, mksol, gather) accept the mm_impl= argument,
which selects the implementation to use.
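
For instance, one would typically pass something like the following on
the command line (the exact value names are an assumption here -- they
are expected to follow the matmul-<impl> source file names, so check the
sources if in doubt):

    mm_impl=bucket        (the default, binary case)
    mm_impl=zone          (typical choice over GF(p))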

The primary on-disk matrix is in a trivial binary format (32-bit integer
indices for the non-zero coefficients, each row prefixed by its length).
However, the in-memory representation used for fast computation is
different, and depends on the implementation. For convenience, a
``cached'' matrix file matching the in-memory representation for a given
implementation is generated on first use, and saved on disk.
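
As an illustration of this on-disk layout, here is a minimal sketch of a
reader for the row-prefixed format (machine-endian 32-bit words are
assumed, including for the row length; this mirrors what mf.c and friends
do, but it is not the project's code):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* For each row: a 32-bit row length w, then w 32-bit column indices. */
    int dump_rows(const char * filename)
    {
        FILE * f = fopen(filename, "rb");
        if (!f) return -1;
        uint32_t w;
        for (uint32_t row = 0; fread(&w, sizeof(w), 1, f) == 1; row++) {
            uint32_t * cols = malloc((w ? w : 1) * sizeof(uint32_t));
            if (!cols) { fclose(f); return -1; }
            if (fread(cols, sizeof(uint32_t), w, f) != w) {
                free(cols); fclose(f); return -1;
            }
            printf("row %u has %u non-zero coefficients\n",
                   (unsigned) row, (unsigned) w);
            free(cols);
        }
        fclose(f);
        return 0;
    }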

[ Note that this single task can be benched separately from the bw
  infrastructure by the standalone bench program. Be aware though that
  some of the difficulty, performance-wise, arises when all CPU cores are
  trying to interact with memory simultaneously. ]

(documentation to be continued).

Distributed vectors -- the mmt_vec type
---------------------------------------

The mmt_vec type holds vector data, which is to be distributed across the
different threads and jobs. An mmt_vec is split into as many pieces as
we have threads and jobs: when we work with an N*N matrix split into a
mesh of nh rows and nv columns, each thread "owns" a fraction N/(nh*nv)
of the vector. However, the mmt_vec has storage space for slightly more
than that, because each thread needs at some point to access all the data
required to perform a local matrix-times-vector or vector-times-matrix
product.

We first focus on the chunks (of size N/(nh*nv)) owned by the various
threads. How a vector is distributed depends on the ordering in which
these parts are spread across threads. This is controlled by a
"direction" parameter: we say that an mmt_vec is distributed "in
direction d", where d is the d field of the mmt_vec itself. For d==0,
the chunks are distributed row-major with respect to the
parallelizing_info structure:
 - chunks 0 to nv-1 go to the first row (threads with pi->wr[1]->trank ==
   0 and pi->wr[1]->jrank == 0)
 - chunks nv to 2nv-1 go to the second row, and so on.
When d==1, the chunks are distributed column-major.

Now, as said above, a vector has storage for slightly more than the chunk
it owns. The storage area for v is v->v.  As a whole, v->v is meant to
contain coefficients for the index range [v->i0..v->i1[.

The chunk owned by v is a sub-part of v->v; it starts at the pointer
returned by mmt_my_own_subvec(), namely at offset
mmt_my_own_offset_in_items() within v->v, and its length is
mmt_my_own_size(). The index range corresponding to this sub-area is
[v->i0+own_offset..v->i0+own_offset+own_size[.
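
In code, the relation between these quantities can be pictured as follows
(a sketch only: the exact prototypes of the mmt_my_own_* helpers and the
mmt_vec_ptr type are assumptions, check matmul_top.h; the numbers in the
comment are just an example):

    /* Sketch; helper prototypes assumed. With, say, N = 600, nh = 2 and
     * nv = 3, each of the 6 threads owns a chunk of 100 coefficients,
     * while v->v spans the larger window [v->i0..v->i1[. */
    void inspect_ownership(mmt_vec_ptr v)
    {
        size_t off  = mmt_my_own_offset_in_items(v);
        size_t size = mmt_my_own_size(v);
        void * own  = mmt_my_own_subvec(v);
        /* own points inside v->v and covers the global index range
         * [v->i0 + off, v->i0 + off + size[, i.e. the part of the vector
         * this particular thread/job is responsible for. */
        (void) own; (void) off; (void) size;
    }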

Several threads and jobs are likely to have data areas (v->v) relative to
the same index range [v->i0..v->i1[. This raises the question of how
consistent these data areas are.

A vector may be:
     - fully consistent: this means that all threads and jobs which
     have a data area for the range [v->i0..v->i1[ agree on its
     contents.
     - partially consistent: each job and thread is correct about its
     own sub-area, but the other parts may be bogus.
     - inconsistent: data differs everywhere, and even each thread's
     own sub-area contains data which is meaningless if we forget
     about what's in the other threads for this index range.

Matrix - vector product (1)
---------------------------

The matmul_top_data type is responsible for tying together the different
threads and nodes, so that they work collectively on a matrix times vector
product.

The matmul_top_data object (generally abbreviated mmt) contains in
particular
 - on each thread/job, a pointer to the local share of the matrix,
 - a pointer to a parallelizing_info structure, which provides the
   different communicators,
 - more bookkeeping info about the matrix (embedded permutations and
   such, as well as the total number of rows and columns of the matrix).

The main function is matmul_top_mul(), which decomposes into two steps:
  matmul_top_mul_cpu(), which does no communication
  matmul_top_mul_comm(), which does (almost) no computation

In a Wiedemann setting, bw->dir (nullspace=... at the top level) should
be the same as the d parameter passed to matmul_top_mul().

In a Lanczos setting, matmul_top_mul() is called in both directions.

Apart from this d=[0|1] nightmare, matmul_top_mul_cpu() just calls
matmul_mul(), and puts the result in the destination (partial) vector.

The input of matmul_top_mul_cpu() must be a distributed vector with full
consistency.

The output of matmul_top_mul_cpu() is a vector distributed in the
opposite direction, but which is inconsistent. We need to do some work in
order to reconcile the data.

This tricky part is the matmul_top_mul_comm() part.

Matrix - vector product (2)
---------------------------

Two phases of communication are defined in order to reach maximal
consistency again:
  - mmt_vec_reduce()
  - mmt_vec_broadcast()
This pair takes as input the result of the cpu step, which is in the
"destination" vector, and does all the communications needed to
reconstruct the vector for the next iteration (thus, stored again in the
"source" vector).
Most of the comments in the source code somehow implicitly assume that
the vector being passed is distributed in the "d=1" direction,
even though "d=0" is probably the most extensively tested.
Such a point of view is also visible in the naming of variables, e.g.
when we write:
    pi_comm_ptr picol = mmt->pi->wr[d];
this does actually correspond to a column communicator only when d==1.


mmt_vec_reduce is actually a slight misnomer. The real operation is a
"reduce-scatter" operation, in MPI parlance. It adds all data, and sends
each share of the resulting vector to its owner. Of course, it's really
"across" only in the case where we're doing matrix times vector.

Likewise, mmt_vec_broadcast is more accurately an MPI "all-gather"
operation. Each thread sends the data it owns to all threads which are
going to need it.
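
At the pure MPI level (ignoring the thread layer and the index shuffle),
the combined effect of the two phases can be pictured with exactly those
standard primitives. The toy function below is self-contained and only
illustrates the communication pattern; it is not the mmt_vec code, and it
uses a plain sum where the real code adds in the base field (XOR for
GF(2), modular addition for GF(p)).

    #include <mpi.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Each rank enters with "partial": a full-length (nranks*blocksize)
     * inconsistent copy holding its own partial sums. On return,
     * "consistent" (same length) holds the reconciled vector on every
     * rank. */
    void reconcile(uint64_t const * partial, uint64_t * consistent,
                   int blocksize, MPI_Comm comm)
    {
        uint64_t * own = malloc(blocksize * sizeof(uint64_t));

        /* Phase 1 (mmt_vec_reduce-like): add all partial copies, and
         * hand each rank the share of the result that it owns. */
        MPI_Reduce_scatter_block(partial, own, blocksize,
                                 MPI_UINT64_T, MPI_SUM, comm);

        /* Phase 2 (mmt_vec_broadcast-like): every rank gathers all the
         * shares, so the whole vector is consistent everywhere again. */
        MPI_Allgather(own, blocksize, MPI_UINT64_T,
                      consistent, blocksize, MPI_UINT64_T, comm);

        free(own);
    }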

Note that the vector layout which has been presented has the effect that
if the matrix is in fact the identity, then matmul_top_mul() will not
compute exactly the expected w = v * Id, but a shuffle of it, which
involves some transposition of indices. In other words, what is computed
corresponds to a matrix M' = P^-1*M, where P^-1 is a permutation matrix.
This reduces / simplifies / accelerates the mul_comm() part, and if the
goal is to find a kernel vector, knowing one for M' is equivalent to
knowing one for M. There's more about this in matmul_top.c.



*** Reduce_accross:

If at the top level one has d=1 (nullspace=right), then reduce_accross()
will be called with d=0, i.e. we want to reduce across rows.
First, the reduction is done at the thread level. This is just some
additions of vectors, done in the appropriate places in memory. Advantage
is taken of multithreading here: each thread is in charge of a subset of
the rows. This is done in place (i.e., still in the dst vector).
Second, the reduction is done at the MPI level. Here, only thread 0 of
each node is working. This is done with MPI_Reduce_scatter (or any
replacement routine). And this is where the shuffle is implicitly
started. Indeed, each job will take care of reducing only a portion of
its indices. Therefore, in the end, the pieces of the vector are not
where they should be.
The reduction at the MPI level is where the result is written back into
the src vector in order to be ready for the next iteration.

*** Broadcast_down:

The broadcast is simply a no-op at the thread level, since with shared
memory, everyone sees everyone else's data.
At the MPI level, though, there is again some work to do, and only
thread number 0 in each node is in charge. The MPI primitive that is used
here is MPI_Allgather. If at the top level one has d=1, then this is
called with d=1, i.e. the communications are done within columns.
