File listing, and purpose of each file (20190710)
-------------------------------------------------

Please always look up the history of each file to see how current it can
possibly be. A file that has hardly been touched for many years is most
likely completely outdated.


linalg/bwc/CMakeLists-nodist.txt
linalg/bwc/CMakeLists.txt
linalg/bwc/Makefile

 => build system


linalg/bwc/GUIDED_TOUR_OF_SOURCES
linalg/bwc/README
linalg/bwc/README.gfp

 => some bits of documentation


linalg/bwc/mpfq_layer.h
linalg/bwc/mpfq/*

 => low-level code that performs arithmetic in the base field. This is
    auto-generated from mpfq, and in truth only a very tiny portion of
    mpfq is actually used.


linalg/bwc/arith-modp.hpp

 => low-level arithmetic code used for operations modulo p when doing
    linear algebra over prime fields


linalg/bwc/matmul-basic.c
linalg/bwc/matmul-basicp.cpp
linalg/bwc/matmul-bucket.cpp
linalg/bwc/matmul-common.c
linalg/bwc/matmul-common.h
linalg/bwc/matmul-libnames.h.in
linalg/bwc/matmul-mf.c
linalg/bwc/matmul-mf.h
linalg/bwc/matmul-sliced.cpp
linalg/bwc/matmul-sub-large-fbd.S
linalg/bwc/matmul-sub-large-fbd.h
linalg/bwc/matmul-sub-large-fbi.S
linalg/bwc/matmul-sub-large-fbi.h
linalg/bwc/matmul-sub-small1.S
linalg/bwc/matmul-sub-small1.h
linalg/bwc/matmul-sub-small2.S
linalg/bwc/matmul-sub-small2.h
linalg/bwc/matmul-sub-vsc-combine.S
linalg/bwc/matmul-sub-vsc-combine.h
linalg/bwc/matmul-sub-vsc-dispatch.S
linalg/bwc/matmul-sub-vsc-dispatch.h
linalg/bwc/matmul-threaded.c
linalg/bwc/matmul-zone.cpp
linalg/bwc/matmul.c
linalg/bwc/matmul.h
linalg/bwc/matmul_facade.c
linalg/bwc/matmul_facade.h

 => low-level code that creates the in-memory matrix cache for fast
    matrix-times-vector evaluation. The important entry points are
    linalg/bwc/matmul-bucket.cpp for matrices over GF(2) and
    linalg/bwc/matmul-zone.cpp for matrices over GF(p).


linalg/bwc/bench_matcache.c
linalg/bwc/build_matcache.c

 => utilities & bench code for the matrix cache; this builds upon the
    matmul_ files.


linalg/bwc/balancing.c
linalg/bwc/balancing.h
linalg/bwc/intersections.c
linalg/bwc/intersections.h
linalg/bwc/balancing_workhorse.cpp
linalg/bwc/balancing_workhorse.h
linalg/bwc/raw_matrix_u32.h
linalg/bwc/mf.c
linalg/bwc/mf.h
linalg/bwc/mf_bal.c
linalg/bwc/mf_bal.h
linalg/bwc/mf_bal_main.c
linalg/bwc/mf_scan.c
linalg/bwc/mf_scan2.cpp

 => code that implements the dispatching of matrix data to several
    nodes. The mf_ files define some auxiliary binaries that are
    useful, and even used in test scripts.


linalg/bwc/random_matrix.c
linalg/bwc/random_matrix.h

 => code to generate random matrices (including an API to do that on
    the fly)


linalg/bwc/parallelizing_info.c
linalg/bwc/parallelizing_info.h

 => backend that implements the hybrid MPI+threads computing model
    used by bwc. Take it with a grain of salt: by the time it was
    written, MPI-3 didn't exist, and this was the only way to go. Now,
    with the MPI-3 MPI_Win primitives, we would perhaps have been able
    to do without all this complication.


linalg/bwc/matmul_top.c
linalg/bwc/matmul_top.h

 => very important piece of code. This implements the parallel
    matrix-times-vector operation.


linalg/bwc/bwc.pl

 => main driver script; prepares command lines for the binaries.


linalg/bwc/flint-fft/*
linalg/bwc/test-flint.c
linalg/bwc/update-flint.sh
linalg/bwc/parse-flint-fft-doc.pl

 => low-level arithmetic that is specific to the lingen part of BW,
    and only over GF(p). This includes an ad hoc implementation of an
    FFT caching interface.


linalg/bwc/lingen_bigmatpoly_ft.c
linalg/bwc/lingen_bigmatpoly_ft.h
linalg/bwc/lingen_bigmatpoly.c
linalg/bwc/lingen_bigmatpoly.h
linalg/bwc/lingen_matpoly_ft.cpp
linalg/bwc/lingen_matpoly_ft.h
linalg/bwc/lingen_matpoly.c
linalg/bwc/lingen_matpoly.h
linalg/bwc/lingen_polymat.c
linalg/bwc/lingen_polymat.h

 => low-level arithmetic code that multiplies matrices of polynomials
    over prime fields, used for lingen in BW over GF(p)


linalg/bwc/strassen-thresholds.hpp
linalg/bwc/lingen_mat_types.hpp

 => low-level arithmetic code that deals with matrices of polynomials
    over GF(2), used for lingen in BW over GF(2). The multiplication is
    then forwarded to GF2X.


linalg/bwc/lingen_qcode_binary.c
linalg/bwc/lingen_qcode_binary.h
linalg/bwc/lingen_qcode_binary_do.cpp

 => low-level arithmetic code that does the quadratic base case of BW
    over GF(2)


linalg/bwc/blocklanczos.c

 => top-level implementation of block Lanczos (builds upon matmul_top,
    really).


linalg/bwc/cpubinding.cpp
linalg/bwc/cpubinding.h
linalg/bwc/worker-threads.c
linalg/bwc/worker-threads.h

 => used for threads. cpubinding plays an important role, while
    worker-threads is actually mostly unused. Most of the thread-level
    stuff is handled by either parallelizing_info or OpenMP.


linalg/bwc/async.c
linalg/bwc/async.h
linalg/bwc/logline.c
linalg/bwc/logline.h
linalg/bwc/tree_stats.cpp
linalg/bwc/tree_stats.hpp

 => utility code for printing intermediate timings (async.[ch]:
    per-iteration timings, e.g. in krylov and mksol; logline.[ch] and
    tree_stats.[ch]pp: used in (p)lingen).


linalg/bwc/acollect.c

 => utility code for stitching together all A files before calling
    lingen. This should not exist.


linalg/bwc/bw-common.c
linalg/bwc/bw-common.h
linalg/bwc/bwc_config_h.in
linalg/bwc/cheating_vec_init.h
linalg/bwc/rolling.c
linalg/bwc/rolling.h
linalg/bwc/xvectors.c
linalg/bwc/xvectors.h
linalg/bwc/xdotprod.c
linalg/bwc/xdotprod.h

 => utility files (from the most to the least general).


linalg/bwc/dispatch.cpp
linalg/bwc/prep.cpp
linalg/bwc/secure.cpp
linalg/bwc/krylov.cpp
linalg/bwc/lingen-*
linalg/bwc/mksol.cpp
linalg/bwc/gather.cpp
linalg/bwc/cleanup.c
linalg/bwc/bwccheck.cpp

 => main code for the different binaries. See linalg/bwc/README.


linalg/bwc/README.TORQUE
linalg/bwc/TODO
linalg/bwc/lingen.txt

 => outdated documentation


linalg/bwc/bcast-file.c

 => debug/bench/example code to share a file across multiple nodes
    over MPI. Can be handy in scripts.


linalg/bwc/bench_polmatmul.cpp

 => old bench code, being phased out.


The pi_go() function
--------------------

This is the main workhorse for the parallelizing of the computation. It
allows multi-threaded (with pthreads) and multi-node (with MPI)
processing. Some of the calls offered by the parallelizing_info
structure are reminiscent of the MPI calling interface.
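
As a concrete (if simplified) illustration, here is roughly what a
function run under pi_go() can look like. It only uses calls described
below, and it is a sketch under assumptions, not actual bwc code (the
exact prototypes should be checked in parallelizing_info.h):

    /* sketch only -- not taken from the bwc sources */
    void * my_fcn(parallelizing_info_ptr pi, void * arg)
    {
        pi_comm_ptr wr = pi->wr[0];   /* one of the communicators in pi */
        unsigned long x = 0;

        pi_hello(pi);                 /* sanity check: everybody says hello */

        if (wr->trank == 0)
            x = 42;                   /* thread 0 of each job picks a value */

        /* share it with the other threads of the job, then make sure
         * everybody reached this point before going on */
        pi_thread_bcast(&x, sizeof(x), BWC_PI_BYTE, 0, wr);
        serialize(wr);

        (void) arg;                   /* extra parameters travel via arg */
        return NULL;
    }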

The parameters given to pi_go() are:
 . an outer description of how many MPI jobs and threads should be set up.
 . a pointer to the function that must be called in parallel.
 . arguments which are to be passed to the function.

Its implementation is in parallelizing_info.[ch]. The main data
structure that conveys information on the parallel setting is
    struct parallelizing_info_s
The details are not so simple... but in a nutshell, we get 3
communicators, the meaning of which is more or less a generalization of
"MPI_Comm" that also encompasses the pthread level. The first
communicator allows full communication, the second is for row-wise
communications, and the third is for column-wise communications.

The prototype of the function passed to pi_go() is
    void *(*fcn)(parallelizing_info_ptr pi, void * arg)
where parallelizing_info_ptr pi is built by pi_go() and passed to the
function. The last argument allows any additional data to be passed
through.

Below are a few things you can do with the pi variable. In these
prototypes, wr should be one of the communicators inside pi.
    pi_hello(pi)        // a small hello-world that tests the pi
    pi_thread_bcast(void* ptr, size_t size, BWC_PI_BYTE, uint root, wr)
                        // broadcasting function at thread level
    pi_bcast(...)
                        // broadcasting function at MPI+thread level
    serialize(wr)       // serializing stuff
    serialize_threads(wr)
The most interesting example of using these is matmul_top.[ch]. Be
careful: in these files, the pi's communicators are complemented by
other communicators that link pieces of the matrix. In particular, there
are two intersecting communicators which are sub-communicators of the
larger one. This implies that serializing on one of these
sub-communicators concurrently from all threads of the other
communicator is illegal, because serialize() entails MPI calls which do
not handle concurrency well.

The "abase" mechanism
---------------------

The matrix code is kind of generic w.r.t. the type of objects inside the
matrix. The "abase" mechanism provides this genericity -- don't expect a
clean ring structure, beautifully following the mathematical notions.
This layer is built from the mpfq library, together with a table of
indirect functions serving as wrappers.

Thanks to the abase layer, the same code may be used both in
characteristic two and in characteristic p, with only minor changes.

The abase layer is accessed in different ways depending on whether we're
looking at a higher-level or lower-level part of the code.

For the higher-level code (typically matmul_top.c, krylov.c, etc.), we
have some object-oriented access. Everything goes through the
mpfq_vbase_ptr type, which is the type of the parent structure holding
all the required function pointers. The elements are cast to void*
pointers, with all the accompanying infelicities. The code compiled in
this way is independent of the actual implementation of the underlying
arithmetic. (A much-simplified sketch of this idea is given below.)

For lower-level code (typically matmul-bucket.cpp), it's different. We
have two main types:
    abdst_field - a descriptor of the field we're working with
    abelt - an element.
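
To sketch the object-oriented access mentioned above: the idea is a
table of function pointers operating on void * elements. This is not the
real mpfq_vbase interface; all names below are made up:

    #include <stddef.h>

    /* illustration only; the actual layer is generated from mpfq */
    struct fake_vbase {
        size_t elt_size;                    /* bytes per element */
        void (*set_zero)(void * dst);
        void (*add)(void * dst, const void * a, const void * b);
    };

    /* Generic code never sees the element type, only void * and the
     * function table. This is how the higher-level code stays
     * independent of the underlying arithmetic. */
    void fake_vec_add(const struct fake_vbase * A, void * w,
                      const void * u, const void * v, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            A->add((char *) w + i * A->elt_size,
                   (const char *) u + i * A->elt_size,
                   (const char *) v + i * A->elt_size);
    }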

Each instance of the abase mechanism is a set of files, as follows:
    mpfq/mpfq_<something>.c
    mpfq/mpfq_<something>.h
    mpfq/mpfq_<something>_t.c
    mpfq/mpfq_<something>_t.h
This may describe elements whose sizes are known at compile time
(uint64_t), or blocks of them, of width possibly known only at run time
(this is the case for the mpfq_pz layer).

The idea is to have genericity and variable width for boring stuff (like
prep) that does not need maximal efficiency, and fixed-size code for the
main operations inside krylov.

See the example uses of the abase layer in either high-level or
low-level code for details on how to use this. All functions follow the
mpfq API.

Currently the following are implemented:
    abase-u64k1  // uint64_t
    abase-m128   // uint128_t, packed into SSE2
    abase-u64k#  // block of # uint64_t, size known at compile time,
                 // with # being 1, 2, 3, or 4

The main use of these types is for vectors to be multiplied by the
binary matrix. We have for instance in matmul.[ch] a function called:
    void matmul_mul(matmul_ptr M , void * B , void const * A , int d)
that computes B = M . A (or the transpose product if d=1), where A and B
are vectors of "abase" elements.

Matrix - vector product
-----------------------

The code that runs on a single thread is in matmul.[ch] and the
different implementations it encapsulates. For the moment:
    matmul-basic.[ch]
    matmul-sliced.{cpp,h}
    matmul-bucket.{cpp,h}   [ default ]

Practically all CPU time is spent in these routines. Therefore,
optimizing these for a particular CPU/memory controller combination is
very relevant. See more on this particular step in the ``low-level''
section below.

On top of this very basic single-threaded matrix x vector product, the
matmul_top.[ch] mechanism handles the parallelizing: each node / thread
gets a piece of the matrix (produced by the balancing step) and does its
own small block x partial_vector product, and then the partial results
are transmitted and collected.

The intersections.[ch] files are used to help with the communications.
In a word, they describe how a horizontal vector (spread over various
nodes/threads) can be transposed into a vertical vector (also spread
over various nodes/threads). This is tricky and will be explained
somewhere else (the code is short, but beware of headaches).

Matrix - vector product -- low-level
------------------------------------

Among the different possible source files listed above, presumably only
one will be used for a single bwc run. All bwc binaries (well, the
important ones: prep, secure, krylov, mksol, gather) accept the mm_impl=
argument, which selects the implementation to use.

The primary on-disk matrix is in a trivial binary format (32-bit integer
indices for non-zero coefficients, each row prefixed by its length).
However, the in-memory representation used for fast computation is
different, and depends on the implementation. For convenience, a
``cached'' matrix file matching the in-memory representation for a given
implementation is generated on first use, and saved on disk.

[ Note that this single task can be benched separately from the bw
  infrastructure with the standalone bench program. Be aware though
  that some of the performance difficulty arises when all CPU cores are
  trying to interact with memory simultaneously. ]

(documentation to be continued).
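
To make the semantics concrete, here is a toy reference for what the
matmul layer computes over GF(2) with the abase-u64k1 element type (one
uint64_t per vector entry, i.e. 64 binary vectors handled at once),
taking the matrix directly in the trivial row-based format described
above. This shows what matmul_mul() computes for d=0, not how
matmul-bucket.cpp computes it -- the whole point of the cached format is
to do much better than this loop:

    #include <stdint.h>
    #include <string.h>

    /* rows points to nrows records, each being "length, then that many
     * 32-bit column indices"; A has one uint64_t per matrix column,
     * B one per matrix row. */
    void toy_matmul_u64k1(uint64_t * B, const uint32_t * rows,
                          uint32_t nrows, const uint64_t * A)
    {
        const uint32_t * p = rows;
        memset(B, 0, nrows * sizeof(uint64_t));
        for (uint32_t i = 0; i < nrows; i++) {
            uint32_t len = *p++;        /* number of non-zero coefficients */
            for (uint32_t k = 0; k < len; k++)
                B[i] ^= A[*p++];        /* over GF(2), adding is xor */
        }
    }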

Distributed vectors -- the mmt_vec type
---------------------------------------

The mmt_vec type holds vector data, which is to be distributed across
the different threads and jobs. An mmt_vec is split into as many pieces
as we have threads and jobs: when we work with an N*N matrix split into
a mesh of nh rows and nv columns, each thread "owns" a fraction
N/(nh*nv) of a vector. However, the mmt_vec has storage space for
slightly more than that, because each thread at some point needs access
to all the data required to perform a local matrix times vector, or
vector times matrix, product.

We first focus on the chunks (of size N/(nh*nv)) owned by the various
threads. How a vector is distributed depends on the ordering in which
these parts are spread across threads. This is controlled by a
"direction" parameter d, stored in the mmt_vec structure; we say that an
mmt_vec is distributed "in direction d". For d==0, the chunks are
distributed row-major with respect to the parallelizing_info structure:
 - chunks 0 to nv-1 go to the first row (threads with pi->wr[1]->trank
   == 0 and pi->wr[1]->jrank == 0),
 - chunks nv to 2nv-1 go to the second row, and so on.
When d==1, the chunks are distributed column-major.

Now, as said above, a vector has storage for slightly more than the
chunk it owns. The storage area for v is v->v. As a whole, v->v is meant
to contain coefficients for the index range [v->i0..v->i1[.

The chunk owned by v is a subpart of v->v; it starts at the pointer
mmt_my_own_subvec(), namely at offset mmt_my_own_offset_in_items()
within v->v, and its length is mmt_my_own_size(). The index range
corresponding to this sub-area is
[v->i0+own_offset..v->i0+own_offset+own_size[.

Several threads and jobs are likely to have data areas (v->v) relative
to the same index range [v->i0..v->i1[. This raises the question of how
consistent these data areas are.

A vector may be:
 - fully consistent: this means that all threads and jobs which
   have a data area for the range [v->i0..v->i1[ agree on its
   contents.
 - partially consistent: each job and thread is correct about its
   own sub-area, but the other parts may be bogus.
 - inconsistent: data differs everywhere, and even each thread's
   own sub-area contains data which is meaningless if we forget
   about what's in the other threads for this index range.

Matrix - vector product (1)
---------------------------

The matmul_top_data type is responsible for tying together the
different threads and nodes, so that they work collectively on a matrix
times vector product.

The matmul_top_data object (generally abbreviated mmt) contains in
particular:
 - on each thread/job, a pointer to the local share of the matrix,
 - a pointer to a parallelizing_info structure, which provides the
   different communicators,
 - more bookkeeping info about the matrix (embedded permutations and
   such, as well as the total number of rows and columns of the
   matrix).

The main function is matmul_top_mul(), which decomposes into two steps:
    matmul_top_mul_cpu(), which does no communication;
    matmul_top_mul_comm(), which does (almost) no computation.

In a Wiedemann setting, bw->dir (nullspace=... at the top level) should
be the same as the d parameter passed to matmul_top_mul().

In a Lanczos setting, matmul_top_mul() is called in both directions.
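
Coming back to the chunk layout of the previous section, the ownership
rule can be summarized by a small helper computing which index range a
given position in the nh x nv mesh owns, depending on the direction d.
This is only an illustration with made-up names (the real code uses
mmt_my_own_offset_in_items() and friends, and pads the vector so that
the division need not be exact):

    #include <stdint.h>

    struct toy_chunk { uint64_t start, size; };

    struct toy_chunk toy_owned_chunk(uint64_t N, unsigned int nh,
                                     unsigned int nv, unsigned int row,
                                     unsigned int col, int d)
    {
        uint64_t csize = N / ((uint64_t) nh * nv); /* assume exact division */
        uint64_t k = (d == 0)
            ? (uint64_t) row * nv + col    /* d==0: chunks go row-major */
            : (uint64_t) col * nh + row;   /* d==1: column-major */
        struct toy_chunk c = { k * csize, csize };
        return c;
    }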

Apart from this d=[0|1] nightmare, matmul_top_mul_cpu() just calls
matmul_mul(), and puts the result in the destination (partial) vector.

The input of matmul_top_mul_cpu() must be a distributed vector with full
consistency.

The output of matmul_top_mul_cpu() is a vector distributed in the
opposite direction, but which is inconsistent. We need to do some work
in order to reconcile the data.

This tricky part is the matmul_top_mul_comm() part.

Matrix - vector product (2)
---------------------------

Two phases of communication are defined in order to reach full
consistency again:
 - mmt_vec_reduce()
 - mmt_vec_broadcast()
Together they take as input the result of the cpu step, which is in the
"destination" vector, and do all the communications needed to
reconstruct the vector for the next iteration (thus, stored again in the
"source" vector).
Most of the comments in the source code somehow implicitly assume that
the vector being passed is distributed in the "d=1" direction, even
though "d=0" is probably the most extensively tested. Such a point of
view is also visible in the naming of variables, e.g. when we write:
    pi_comm_ptr picol = mmt->pi->wr[d];
this actually corresponds to a column communicator only when d==1.

mmt_vec_reduce is actually a slight misnomer. The real operation is a
"reduce-scatter" operation, in MPI parlance. It adds all the data, and
sends each share of the resulting vector to its owner. Of course, it is
really a reduction "across" only in the case where we're doing a matrix
times vector product.

Likewise, mmt_vec_broadcast is more accurately an MPI "all-gather"
operation. Each thread sends the data it owns to all the threads which
are going to need it.

Note that the vector layout which has been presented has the effect
that, if the matrix is in fact the identity, then matmul_top_mul() will
not compute exactly the expected w = v * Id, but a shuffle of it, which
involves some transposition of indices. In other words, what is computed
corresponds to a matrix M' = P^-1*M, where P^-1 is a permutation matrix.
This reduces / simplifies / accelerates the mul_comm() part, and if the
goal is to find a kernel vector, knowing one for M' is equivalent to
knowing one for M. There's more about this in matmul_top.c.


*** Reduce_accross:

If at the top level one has d=1 (nullspace=right), then reduce_accross()
will be called with d=0, i.e. we want to reduce across rows.
First, the reduction is done at the thread level. This is just some
additions of vectors, done in the appropriate places in memory.
Advantage is taken of multithreading here: each thread is in charge of a
subset of the rows. This is done in place (i.e., still in the dst
vector).
Second, the reduction is done at the MPI level. Here, only thread 0 in
each node is working. This is done with MPI_Reduce_scatter (or any
replacement routine). And this is where the shuffle is implicitly
started. Indeed, each job will take care of reducing only a portion of
its indices. Therefore, in the end, the pieces of the vector are not
where they should be.
The reduction at the MPI level is where the result is written back into
the src vector, so as to be ready for the next iteration.

*** Broadcast_down:

The broadcast is simply a no-op at the thread level, since with shared
memory, everyone sees everyone else's data.
At the MPI level, though, there is again some work to do, and only
thread number 0 in each node is in charge. The MPI primitive that is
used here is MPI_Allgather. If at the top level one has d=1, then this
is called with d=1, i.e. the communications happen within columns.
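
To summarize the two MPI-level phases in isolation, here is a bare-bones
sketch outside of any bwc data structure; it ignores the thread level
and the shuffle discussed above, and simply assumes that every job holds
a full (inconsistent) copy of an n-item block of the vector, over GF(2)
where adding is a bitwise xor. This mirrors the pattern
(MPI_Reduce_scatter followed by MPI_Allgather), not the actual bwc code:

    #include <mpi.h>
    #include <stdint.h>
    #include <stdlib.h>

    void toy_reduce_then_broadcast(uint64_t * v, int n, MPI_Comm comm)
    {
        int size;
        MPI_Comm_size(comm, &size);
        int share = n / size;                  /* assume size divides n */

        int * counts = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++)
            counts[i] = share;
        uint64_t * mine = malloc(share * sizeof(uint64_t));

        /* phase 1 ("reduce-scatter"): add everybody's copy, each job
         * keeps only the share it owns */
        MPI_Reduce_scatter(v, mine, counts, MPI_UINT64_T, MPI_BXOR, comm);

        /* phase 2 ("all-gather"): redistribute the shares so that every
         * job ends up with the full, consistent block */
        MPI_Allgather(mine, share, MPI_UINT64_T, v, share, MPI_UINT64_T,
                      comm);

        free(mine);
        free(counts);
    }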