kokkos-3.4.01/doc/design_notes_space_instances.md

# Design Notes for Execution and Memory Space Instances

## Objective

 * Enable Kokkos interoperability with coarse-grain tasking models

## Requirements

 * Backwards compatible with existing Kokkos API
 * Support existing Host execution spaces (Serial, Threads, OpenMP)
 * Support DARMA threading model (may require a new Host execution space)
 * Support Uintah threading model, i.e. indepentant worker threadpools working of of shared task queues


## Execution Space

  * Parallel work is *dispatched* on an execution space instance

  * Execution space instances are conceptually disjoint/independent from each other


## Host Execution Space Instances

  *  A host-side *control* thread dispatches work to an instance

  * `main` is the initial control thread

  *  A host execution space instance is an organized thread pool

  *  All instances are disjoint, i.e. hardware resources are not shared between instances

  *  Exactly one control thread is associated with
     an instance and only that control thread may
     dispatch work to to that instance

  *  The control thread is a member of the instance

  *  The pool of threads associated with an instances is not mutatable during that instance existence

  *  The pool of threads associated with an instance may be masked

    -  Allows work to be dispatched to a subset of the pool

    -  Example: only one hyperthread per core of the instance

    -  A mask can be applied during the policy creation of a parallel algorithm

    -  Masking is portable by defining it as ceiling of fraction between [0.0, 1.0]
       of the available resources

```
class ExecutionSpace {
public:
  using execution_space = ExecutionSpace;
  using memory_space = ...;
  using device_type = Kokkos::Device<execution_space, memory_space>;
  using array_layout = ...;
  using size_type = ...;
  using scratch_memory_space = ...;


  class Instance
  {
    int thread_pool_size( int depth = 0 );
    ...
  };

  class InstanceRequest
  {
  public:
    using Control = std::function< void( Instance * )>;

    InstanceRequest( Control control
                   , unsigned thread_count
                   , unsigned use_numa_count = 0
                   , unsigned use_cores_per_numa = 0
                   );

  };

  static bool in_parallel();

  static bool sleep();
  static bool wake();

  static void fence();

  static void print_configuration( std::ostream &, const bool detailed = false );

  static void initialize( unsigned thread_count = 0
                        , unsigned use_numa_count = 0
                        , unsigned use_cores_per_numa = 0
                        );

  // Partition the current instance into the requested instances
  // and run the given functions on the cooresponding instances
  // will block until all the partitioned instances complete and
  // the original instance will be restored
  //
  // Requires that the space has already been initialized
  // Requires that the request can be statisfied by the current instance
  //   i.e. the sum of number of requested threads must be less than the
  //   max_hardware_threads
  //
  // Each control functor will accept a handle to its new default instance
  // Each instance must be independent of all other instances
  //   i.e. no assumption on scheduling between instances
  // The user is responible for checking the return code for errors
  static int run_instances( std::vector< InstanceRequest> const& requests );

  static void finalize();

  static int is_initialized();

  static int concurrency();

  static int thread_pool_size( int depth = 0 );

  static int thread_pool_rank();

  static int max_hardware_threads();

  static int hardware_thread_id();

 };

```