Responsible for performing addition and multiplication in parallel.
inner_product: Inner product function.
max_vecs_per_node: Max number of vecs in memory per node.
verbosity: 1 prints progress and warnings, 0 prints almost nothing.
print_interval: Min time (secs) between printed progress messages.
Note: Computations are often sped up by using all available processors, even if this lowers max_vecs_per_node proportionally. However, this depends on the computer and the nature of the functions supplied, and sometimes loading from file is slower with more processors.
Computes the matrix of inner product combinations between vectors.
The vecs are retrieved in memory-efficient chunks and are not all in memory at once. The row vecs and col vecs are assumed to be different. When they are the same, use compute_symmetric_inner_product() for a 2x speedup.
Each MPI worker (processor) is responsible for retrieving a subset of the rows and columns. The processors then send/recv columns via MPI so they can be used to compute all IPs for the rows on each MPI worker. This is repeated until all MPI workers are done with all of their row chunks. If there are 2 processors:
        | x o |
rank0   | x o |
        | x o |
    -
        | o x |
rank1   | o x |
        | o x |
In the next step, rank 0 sends column 0 to rank 1 and rank 1 sends column 1 to rank 0. The remaining IPs are filled in:
        | x x |
rank0   | x x |
        | x x |
    -
        | x x |
rank1   | x x |
        | x x |
When the number of cols and rows is not divisible by the number of processors, the processors are assigned unequal numbers of tasks. However, all processors are always part of the passing cycle.
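The passing cycle described above can be sketched serially, without MPI. This is only an illustration under stated assumptions, not the library's implementation: vecs are plain Python lists, each "worker" is a loop index, and the names split_indices, dot, and simulated_inner_product_mat are hypothetical.

```python
def split_indices(n, n_workers):
    """Split range(n) into n_workers contiguous, nearly equal chunks."""
    base, extra = divmod(n, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def dot(u, v):
    """Plain Euclidean inner product, used as a stand-in for inner_product."""
    return sum(a * b for a, b in zip(u, v))


def simulated_inner_product_mat(row_vecs, col_vecs, n_workers, inner_product=dot):
    """Serially mimic the cycle in which column chunks are passed around."""
    row_chunks = split_indices(len(row_vecs), n_workers)
    col_chunks = split_indices(len(col_vecs), n_workers)
    mat = [[None] * len(col_vecs) for _ in row_vecs]
    # In step s, worker w holds the column chunk first retrieved by worker
    # (w + s) % n_workers, as if the chunks were passed around a ring.
    # Every worker takes part in every step, even if its chunk is empty.
    for step in range(n_workers):
        for w in range(n_workers):
            owner = (w + step) % n_workers
            for i in row_chunks[w]:
                for j in col_chunks[owner]:
                    mat[i][j] = inner_product(row_vecs[i], col_vecs[j])
    return mat


rows = [[1, 0], [0, 1], [1, 1]]
cols = [[1, 2], [3, 4]]
result = simulated_inner_product_mat(rows, cols, n_workers=2)
# result == [[1, 3], [2, 4], [3, 7]]
```

After n_workers steps every worker has seen every column chunk, so every entry of the matrix is filled, including when the rows and columns do not divide evenly among the workers.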
The scaling is:
where n_r is the number of rows, n_c is the number of columns, max is max_vecs_per_proc = max_vecs_per_node/num_procs_per_node, and n_p is the number of MPI workers (processors).
If there are more rows than columns, then an internal transpose and un-transpose is performed to improve efficiency (since n_c only appears in the scaling in the quadratic term).
Computes an upper-triangular symmetric matrix of inner products.
See the documentation for compute_inner_product_mat() for an idea of how this works.
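The saving in the symmetric case comes from evaluating only the entries on and above the diagonal. A minimal serial sketch, assuming list-valued vecs and a real, symmetric inner product; the names dot and symmetric_inner_product_mat are hypothetical and the parallel chunking is omitted:

```python
def dot(u, v):
    """Plain Euclidean inner product, used as a stand-in for inner_product."""
    return sum(a * b for a, b in zip(u, v))


def symmetric_inner_product_mat(vecs, inner_product=dot):
    """Fill only the upper triangle; entries below the diagonal stay zero."""
    n = len(vecs)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Start at j = i so each unordered pair is evaluated exactly once.
        for j in range(i, n):
            mat[i][j] = inner_product(vecs[i], vecs[j])
    return mat
```

For n vecs this makes n*(n+1)/2 inner product calls instead of n*n, which approaches half the work as n grows.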
TODO: JON, write detailed documentation similar to compute_inner_product_mat().
Linearly combines the basis vecs and calls put on result.
sum_vec_handles: List of handles for the sum vectors.
basis_vec_handles: List of handles for the basis vecs.
Each processor retrieves a subset of the basis vecs and computes as many outputs as it can hold in memory at once. Each processor computes the "layers" from the basis vecs it is responsible for, for as many modes as it can fit in memory. The layers from all processors are summed together to form the sum_vecs, which are then put.
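The layer-summing idea can be sketched on a single processor. This is an illustration under assumptions, not the library's code: vecs are plain lists, coeffs[j][k] is taken to be the weight of basis vec j in sum vec k, and lin_combine_chunked and max_sums_in_memory are hypothetical names.

```python
def lin_combine_chunked(basis_vecs, coeffs, max_sums_in_memory):
    """Build sum vecs a chunk at a time so only a few are held at once."""
    n_sums = len(coeffs[0])
    sum_vecs = []
    for start in range(0, n_sums, max_sums_in_memory):
        stop = min(start + max_sums_in_memory, n_sums)
        # One running total per sum vec in the current chunk.
        partial = [None] * (stop - start)
        for j, basis in enumerate(basis_vecs):  # each pass stands in for a "get"
            for k in range(start, stop):
                layer = [coeffs[j][k] * x for x in basis]
                if partial[k - start] is None:
                    partial[k - start] = layer
                else:
                    partial[k - start] = [
                        a + b for a, b in zip(partial[k - start], layer)
                    ]
        sum_vecs.extend(partial)  # stands in for calling "put" on each result
    return sum_vecs
```

In the parallel version each processor would hold only its own share of the basis vecs, and the per-processor layers would be added across processors before the put.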
num gets/worker = n_s/(n_p*(max-2)) * n_b/n_p
passes/worker = (n_p-1) * n_s/(n_p*(max-2)) * (n_b/n_p)
scalar multiplies/worker = n_s*n_b/n_p
Where n_s is number of sum vecs, n_b is number of basis vecs, n_p is number of processors, max = max_vecs_per_node.
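To get a feel for these estimates, the three expressions can be evaluated directly. The helper below is a hypothetical convenience, not part of the library; it transcribes the formulas as written, with max_vecs standing for max.

```python
def lin_combine_scaling(n_s, n_b, n_p, max_vecs):
    """Evaluate the per-worker cost estimates for a linear combination."""
    gets = n_s / (n_p * (max_vecs - 2)) * n_b / n_p
    passes = (n_p - 1) * n_s / (n_p * (max_vecs - 2)) * (n_b / n_p)
    scalar_multiplies = n_s * n_b / n_p
    return gets, passes, scalar_multiplies


# 40 sum vecs, 100 basis vecs, 4 processors, max = 12
estimate = lin_combine_scaling(n_s=40, n_b=100, n_p=4, max_vecs=12)
# estimate == (25.0, 75.0, 1000.0)
```

The passes estimate is (n_p - 1) times the gets estimate, so communication grows with the processor count while the scalar multiplies per worker shrink.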
Print a message from rank 0.
Check user-supplied vec handle and vec objects.
The add and mult functions are tested for the vector object. This is not a complete test, but it catches some common mistakes. Raises an error if a check fails.
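A check of this kind might look like the sketch below. It is an assumption about the general approach, not the library's actual test: ToyVec, BrokenMulVec, toy_inner_product, and sanity_check are hypothetical names, and the vec is assumed to support + and * with a scalar.

```python
import copy


class ToyVec:
    """Minimal vector object supporting addition and scalar multiplication."""

    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        return ToyVec(a + b for a, b in zip(self.data, other.data))

    def __mul__(self, scalar):
        return ToyVec(a * scalar for a in self.data)


class BrokenMulVec(ToyVec):
    """Deliberately wrong: multiplication ignores the scalar."""

    def __mul__(self, scalar):
        return self


def toy_inner_product(u, v):
    return sum(a * b for a, b in zip(u.data, v.data))


def sanity_check(vec, inner_product, tol=1e-10):
    """Raise ValueError if add or mult behaves inconsistently."""
    reference = inner_product(copy.deepcopy(vec), vec)
    doubled = vec * 2.0
    if abs(inner_product(doubled, doubled) - 4.0 * reference) > tol:
        raise ValueError("Multiplication of vec by a scalar failed")
    if abs(inner_product(vec, vec) - reference) > tol:
        raise ValueError("Multiplication modified the original vec")
    summed = vec + vec
    if abs(inner_product(summed, summed) - 4.0 * reference) > tol:
        raise ValueError("Addition of vecs failed")
    if abs(inner_product(vec, vec) - reference) > tol:
        raise ValueError("Addition modified the original vec")
```

Calling sanity_check(ToyVec([1.0, 2.0]), toy_inner_product) passes silently, while the same call with BrokenMulVec raises ValueError.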
Inner products and linear combinations with matrices.