:mod:`heat.linalg`
=======================
.. py:module:: heat.core.linalg

.. autoapi-nested-parse::

   Import all linear algebra functions into the ht.linalg namespace


Submodules
----------
.. toctree::
   :titlesonly:
   :maxdepth: 1

   basics/index.rst
   eigh/index.rst
   polar/index.rst
   qr/index.rst
   randomized/index.rst
   solver/index.rst
   svd/index.rst
   svdtools/index.rst


Package Contents
----------------


.. function:: condest(A: heat.core.dndarray.DNDarray, p: Union[int, str] = None, algorithm: str = 'randomized', params: list = None) -> heat.core.dndarray.DNDarray

   Computes a (possibly randomized) upper estimate of the l2-condition number of the input 2D DNDarray.

   :param A: The matrix, i.e., a 2D DNDarray, for which the condition number shall be estimated.
   :type A: DNDarray
   :param p: The norm to use for the condition number computation. If None, the l2-norm (default, p=2) is used.
             So far, only p=2 is implemented.
   :type p: int or str (optional)
   :param algorithm: The algorithm to use for the estimation. Currently, only "randomized" (default) is implemented.
   :type algorithm: str
   :param params: A list of parameters required for the chosen algorithm; if not provided, default values for the respective algorithm are chosen.
                  If `algorithm="randomized"` the number of random samples to use can be specified under the key "nsamples"; default is 10.
   :type params: dict (optional)

   .. rubric:: Notes

   The "randomized" algorithm follows the approach described in [1]; note that in the paper actually the condition number w.r.t. the Frobenius norm is estimated.
   However, this yields an upper bound for the condition number w.r.t. the l2-norm as well.

   .. rubric:: References

   [1] T. Gudmundsson, C. S. Kenney, and A. J. Laub. Small-Sample Statistical Estimates for Matrix Norms. SIAM Journal on Matrix Analysis and Applications 1995 16:3, 776-792.


.. function:: cross(a: heat.core.dndarray.DNDarray, b: heat.core.dndarray.DNDarray, axisa: int = -1, axisb: int = -1, axisc: int = -1, axis: int = -1) -> heat.core.dndarray.DNDarray

   Returns the cross product. 2D vectors will we converted to 3D.

   :param a: First input array.
   :type a: DNDarray
   :param b: Second input array. Must have the same shape as 'a'.
   :type b: DNDarray
   :param axisa: Axis of `a` that defines the vector(s). By default, the last axis.
   :type axisa: int
   :param axisb: Axis of `b` that defines the vector(s). By default, the last axis.
   :type axisb: int
   :param axisc: Axis of the output containing the cross product vector(s). By default, the last axis.
   :type axisc: int
   :param axis: Axis that defines the vectors for which to compute the cross product. Overrides `axisa`, `axisb` and `axisc`. Default: -1
   :type axis: int

   :raises ValueError: If the two input arrays don't match in shape, split, device, or comm. If the vectors are along the split axis.
   :raises TypeError: If 'axis' is not an integer.

   .. rubric:: Examples

   >>> a = ht.eye(3)
   >>> b = ht.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
   >>> cross = ht.cross(a, b)
   DNDarray([[0., 0., 1.],
             [1., 0., 0.],
             [0., 1., 0.]], dtype=ht.float32, device=cpu:0, split=None)


.. function:: det(a: heat.core.dndarray.DNDarray) -> heat.core.dndarray.DNDarray

   Returns the determinant of a square matrix.

   :param a: A square matrix or a stack of matrices. Shape = (...,M,M)
   :type a: DNDarray

   :raises RuntimeError: If the dtype of 'a' is not floating-point.
   :raises RuntimeError: If `a.ndim < 2` or if the length of the last two dimensions is not the same.

   .. rubric:: Examples

   >>> a = ht.array([[-2, -1, 2], [2, 1, 4], [-3, 3, -1]])
   >>> ht.linalg.det(a)
   DNDarray(54., dtype=ht.float64, device=cpu:0, split=None)


.. function:: dot(a: heat.core.dndarray.DNDarray, b: heat.core.dndarray.DNDarray, out: Optional[heat.core.dndarray.DNDarray] = None) -> Union[heat.core.dndarray.DNDarray, float]

   Returns the dot product of two ``DNDarrays``.
   Specifically,

       1. If both a and b are 1-D arrays, it is inner product of vectors.

       2. If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or ``a@b`` is preferred.

       3. If either a or b is 0-D (scalar), it is equivalent to multiply and using ``multiply(a, b)`` or ``a*b`` is preferred.

   :param a: First input DNDarray
   :type a: DNDarray
   :param b: Second input DNDarray
   :type b: DNDarray
   :param out: Output buffer.
   :type out: DNDarray, optional

   .. seealso::

      :py:obj:`vecdot`
          Supports (vector) dot along an axis.


.. function:: inv(a: heat.core.dndarray.DNDarray) -> heat.core.dndarray.DNDarray

   Computes the multiplicative inverse of a square matrix.

   :param a: Square matrix of floating-point data type or a stack of square matrices. Shape = (...,M,M)
   :type a: DNDarray

   :raises RuntimeError: If the inverse does not exist.
   :raises RuntimeError: If the dtype is not floating-point
   :raises RuntimeError: If a is not at least two-dimensional or if the lengths of the last two dimensions are not the same.

   .. rubric:: Examples

   >>> a = ht.array([[1.0, 2], [2, 3]])
   >>> ht.linalg.inv(a)
   DNDarray([[-3.,  2.],
             [ 2., -1.]], dtype=ht.float32, device=cpu:0, split=None)


.. function:: matmul(a: heat.core.dndarray.DNDarray, b: heat.core.dndarray.DNDarray, allow_resplit: bool = False) -> heat.core.dndarray.DNDarray

   Matrix multiplication of two ``DNDarrays``: ``a@b=c`` or ``A@B=c``.
   Returns a tensor with the result of ``a@b``. The split dimension of the returned array is
   typically the split dimension of a. If both are ``None`` and if ``allow_resplit=False`` then ``c.split`` is also ``None``.

   Batched inputs (with batch dimensions being leading dimensions) are allowed; see also the Notes below.

   :param a: matrix :math:`L \times P` or vector :math:`P` or batch of matrices: :math:`B_1 \times ... \times B_k \times L \times P`
   :type a: DNDarray
   :param b: matrix :math:`P \times Q` or vector :math:`P` or batch of matrices: :math:`B_1 \times ... \times B_k \times P \times Q`
   :type b: DNDarray
   :param allow_resplit: Whether to distribute ``a`` in the case that both ``a.split is None`` and ``b.split is None``.
                         Default is ``False``. If ``True``, if both are not split then ``a`` will be distributed in-place along axis 0.
   :type allow_resplit: bool, optional

   .. rubric:: Notes

   - For batched inputs, batch dimensions must coincide and if one matrix is split along a batch axis the other must be split along the same axis.
   - If ``a`` or ``b`` is a vector the result will also be a vector.
   - We recommend to avoid the particular split combinations ``1``-``0``, ``None``-``0``, and ``1``-``None`` (for ``a.split``-``b.split``) due to their comparably high memory consumption, if possible. Applying ``DNDarray.resplit_`` or ``heat.resplit`` on one of the two factors before calling ``matmul`` in these situations might improve performance of your code / might avoid memory bottlenecks.

   .. rubric:: References

   [1] R. Gu, et al., "Improving Execution Concurrency of Large-scale Matrix Multiplication on
   Distributed Data-parallel Platforms," IEEE Transactions on Parallel and Distributed Systems,
   vol 28, no. 9. 2017.

   [2] S. Ryu and D. Kim, "Parallel Huge Matrix Multiplication on a Cluster with GPGPU
   Accelerators," 2018 IEEE International Parallel and Distributed Processing Symposium
   Workshops (IPDPSW), Vancouver, BC, 2018, pp. 877-882.

   .. rubric:: Examples

   >>> a = ht.ones((n, m), split=1)
   >>> a[0] = ht.arange(1, m + 1)
   >>> a[:, -1] = ht.arange(1, n + 1).larray
   [0/1] tensor([[1., 2.],
                 [1., 1.],
                 [1., 1.],
                 [1., 1.],
                 [1., 1.]])
   [1/1] tensor([[3., 1.],
                 [1., 2.],
                 [1., 3.],
                 [1., 4.],
                 [1., 5.]])
   >>> b = ht.ones((j, k), split=0)
   >>> b[0] = ht.arange(1, k + 1)
   >>> b[:, 0] = ht.arange(1, j + 1).larray
   [0/1] tensor([[1., 2., 3., 4., 5., 6., 7.],
                 [2., 1., 1., 1., 1., 1., 1.]])
   [1/1] tensor([[3., 1., 1., 1., 1., 1., 1.],
                 [4., 1., 1., 1., 1., 1., 1.]])
   >>> linalg.matmul(a, b).larray
   [0/1] tensor([[18.,  8.,  9., 10.],
                 [14.,  6.,  7.,  8.],
                 [18.,  7.,  8.,  9.],
                 [22.,  8.,  9., 10.],
                 [26.,  9., 10., 11.]])
   [1/1] tensor([[11., 12., 13.],
                 [ 9., 10., 11.],
                 [10., 11., 12.],
                 [11., 12., 13.],
                 [12., 13., 14.]])


.. function:: matrix_norm(x: heat.core.dndarray.DNDarray, axis: Optional[Tuple[int, int]] = None, keepdims: bool = False, ord: Optional[Union[int, str]] = None) -> heat.core.dndarray.DNDarray

   Computes the matrix norm of an array.

   :param x: Input array
   :type x: DNDarray
   :param axis: Both axes of the matrix. If `None` 'x' must be a matrix. Default: `None`
   :type axis: tuple, optional
   :param keepdims: Retains the reduced dimension when `True`. Default: `False`
   :type keepdims: bool, optional
   :param ord: The matrix norm order to compute. If `None` the Frobenius norm (`'fro'`) is used. Default: `None`
   :type ord: int, 'fro', 'nuc', optional

   .. seealso::

      :py:obj:`norm`
          Computes the vector or matrix norm of an array.

      :py:obj:`vector_norm`
          Computes the vector norm of an array.

   .. rubric:: Notes

   The following norms are supported:

   =====  ============================
   ord    norm for matrices
   =====  ============================
   None   Frobenius norm
   'fro'  Frobenius norm
   'nuc'  nuclear norm
   inf    max(sum(abs(x), axis=1))
   -inf   min(sum(abs(x), axis=1))
   1      max(sum(abs(x), axis=0))
   -1     min(sum(abs(x), axis=0))
   =====  ============================

   The following matrix norms are currently **not** supported:

   =====  ============================
   ord    norm for matrices
   =====  ============================
   2      largest singular value
   -2     smallest singular value
   =====  ============================

   :raises TypeError: If axis is not a 2-tuple
   :raises ValueError: If an invalid matrix norm is given or 'x' is a vector.

   .. rubric:: Examples

   >>> ht.matrix_norm(ht.array([[1, 2], [3, 4]]))
   DNDarray([[5.4772]], dtype=ht.float64, device=cpu:0, split=None)
   >>> ht.matrix_norm(ht.array([[1, 2], [3, 4]]), keepdims=True, ord=-1)
   DNDarray([[4.]], dtype=ht.float64, device=cpu:0, split=None)


.. function:: matrix_exp(A: heat.core.dndarray.DNDarray) -> heat.core.dndarray.DNDarray

   Computes the matrix exponential of a square matrix.

   Letting :math:`\mathbb{K}` be :math:`\mathbb{R}` or :math:`\mathbb{C}`,
   this function computes the **matrix exponential** of :math:`A \in \mathbb{K}^{n \times n}`, which is defined as

   .. math::
       \mathrm{matrix\_exp}(A) = \sum_{k=0}^\infty \frac{1}{k!}A^k \in \mathbb{K}^{n \times n}.

   If the matrix :math:`A` has eigenvalues :math:`\lambda_i \in \mathbb{C}`,
   the matrix :math:`\mathrm{matrix\_exp}(A)` has eigenvalues :math:`e^{\lambda_i} \in \mathbb{C}`.

   Supports input of bfloat16, float, double, cfloat and cdouble dtypes.
   Also supports batches of matrices, and if :attr:`A` is a batch of matrices then
   the output has the same batch dimensions.

   .. note::
        A may only be distributed in the batch dimensions.

   .. seealso::
            :func:`torch.linalg.matrix_exp` is called under the hood on the local data.

   :param A: DNDarray of shape `(*, n, n)` where `*` is zero or more batch dimensions.
   :type A: DNDarray

   Example::

       >>> A = ht.empty((2, 2, 2), split=0)
       >>> A[0, :, :] = ht.eye((2, 2))
       >>> A[1, :, :] = 2 * ht.eye((2, 2))
       >>> ht.linalg.matrix_exp(A)
       DNDarray([[[2.7183, 0.0000],
          [0.0000, 2.7183]],

         [[7.3891, 0.0000],
          [0.0000, 7.3891]]], dtype=ht.float32, device=cpu:0, split=0)


.. data:: expm
   

   Alias for :py:func:`matrix_exp`

.. function:: norm(x: heat.core.dndarray.DNDarray, axis: Optional[Union[int, Tuple[int, int]]] = None, keepdims: bool = False, ord: Optional[Union[int, float, str]] = None) -> heat.core.dndarray.DNDarray

   Return the vector or matrix norm of an array.

   :param x: Input vector
   :type x: DNDarray
   :param axis: Axes along which to compute the norm. If an integer, vector norm is used. If a 2-tuple, matrix norm is used.
                If `None`, it is inferred from the dimension of the array. Default: `None`
   :type axis: int, tuple, optional
   :param keepdims: Retains the reduced dimension when `True`. Default: `False`
   :type keepdims: bool, optional
   :param ord: The norm order to compute. See Notes
   :type ord: int, float, inf, -inf, 'fro', 'nuc'

   .. seealso::

      :py:obj:`vector_norm`
          Computes the vector norm of an array.

      :py:obj:`matrix_norm`
          Computes the matrix norm of an array.

   .. rubric:: Notes

   The following norms are supported:

   =====  ============================  ==========================
   ord    norm for matrices             norm for vectors
   =====  ============================  ==========================
   None   Frobenius norm                L2-norm (Euclidean)
   'fro'  Frobenius norm                --
   'nuc'  nuclear norm                  --
   inf    max(sum(abs(x), axis=1))      max(abs(x))
   -inf   min(sum(abs(x), axis=1))      min(abs(x))
   0      --                            sum(x != 0)
   1      max(sum(abs(x), axis=0))      L1-norm (Manhattan)
   -1     min(sum(abs(x), axis=0))      1./sum(1./abs(a))
   2      --                            L2-norm (Euclidean)
   -2     --                            1./sqrt(sum(1./abs(a)**2))
   other  --                            sum(abs(x)**ord)**(1./ord)
   =====  ============================  ==========================

   The following matrix norms are currently **not** supported:

   =====  ============================
   ord    norm for matrices
   =====  ============================
   2      largest singular value
   -2     smallest singular value
   =====  ============================

   :raises ValueError: If 'axis' has more than 2 elements

   .. rubric:: Examples

   >>> from heat import linalg as LA
   >>> a = ht.arange(9, dtype=ht.float) - 4
   >>> a
   DNDarray([-4., -3., -2., -1.,  0.,  1.,  2.,  3.,  4.], dtype=ht.float32, device=cpu:0, split=None)
   >>> b = a.reshape((3, 3))
   >>> b
   DNDarray([[-4., -3., -2.],
         [-1.,  0.,  1.],
         [ 2.,  3.,  4.]], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(a)
   DNDarray(7.7460, dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(b)
   DNDarray(7.7460, dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(b, ord="fro")
   DNDarray(7.7460, dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(a, float("inf"))
   DNDarray([4.], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(b, ht.inf)
   DNDarray([9.], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(a, -ht.inf))
   DNDarray([0.], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(b, -ht.inf)
   DNDarray([2.], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(a, 1)
   DNDarray([20.], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(b, 1)
   DNDarray([7.], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(a, -1)
   DNDarray([0.], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(b, -1)
   DNDarray([6.], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(a, 2)
   DNDarray(7.7460, dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(a, -2)
   DNDarray([0.], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(a, 3)
   DNDarray([5.8480], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(a, -3)
   DNDarray([0.], dtype=ht.float32, device=cpu:0, split=None)
   c = ht.array([[ 1, 2, 3],
                 [-1, 1, 4]])
   >>> LA.norm(c, axis=0)
   DNDarray([1.4142, 2.2361, 5.0000], dtype=ht.float64, device=cpu:0, split=None)
   >>> LA.norm(c, axis=1)
   DNDarray([3.7417, 4.2426], dtype=ht.float64, device=cpu:0, split=None)
   >>> LA.norm(c, axis=1, ord=1)
   DNDarray([6., 6.], dtype=ht.float64, device=cpu:0, split=None)
   >>> m = ht.arange(8).reshape(2, 2, 2)
   >>> LA.norm(m, axis=(1, 2))
   DNDarray([ 3.7417, 11.2250], dtype=ht.float32, device=cpu:0, split=None)
   >>> LA.norm(m[0, :, :]), LA.norm(m[1, :, :])
   (DNDarray(3.7417, dtype=ht.float32, device=cpu:0, split=None), DNDarray(11.2250, dtype=ht.float32, device=cpu:0, split=None))


.. function:: outer(a: heat.core.dndarray.DNDarray, b: heat.core.dndarray.DNDarray, out: Optional[heat.core.dndarray.DNDarray] = None, split: Optional[int] = None) -> heat.core.dndarray.DNDarray

   Compute the outer product of two 1-D DNDarrays: :math:`out(i, j) = a(i) \times b(j)`.
   Given two vectors, :math:`a = (a_0, a_1, ..., a_N)` and :math:`b = (b_0, b_1, ..., b_M)`, the outer product is:

   .. math::
       :nowrap:

       \begin{pmatrix}
          a_0 \cdot b_0  & a_0 \cdot b_1 & . & . &  a_0 \cdot b_M \\
          a_1 \cdot b_0 & a_1 \cdot b_1 & . & . & a_1 \cdot b_M \\
          . & . & . & . & .   \\
          a_N \cdot b_0 & a_N \cdot b_1 & . & . & a_N \cdot b_M
       \end{pmatrix}

   :param a: 1-dimensional: :math:`N`
             Will be flattened by default if more than 1-D.
   :type a: DNDarray
   :param b: 1-dimensional: :math:`M`
             Will be flattened by default if more than 1-D.
   :type b: DNDarray
   :param out: 2-dimensional: :math:`N \times M`
               A location where the result is stored
   :type out: DNDarray, optional
   :param split: Split dimension of the resulting DNDarray. Can be 0, 1, or None.
                 This is only relevant if the calculations are memory-distributed.
                 Default is ``split=0`` (see Notes).
   :type split: int, optional

   .. rubric:: Notes

   Parallel implementation of outer product, assumes arrays are dense.
   In the classical (dense) case, one of the two arrays needs to be communicated around the processes in
   a ring.

   * Sending ``b`` around in a ring results in ``outer`` being split along the rows (``outer.split = 0``).

   * Sending ``a`` around in a ring results in ``outer`` being split along the columns (``outer.split = 1``).


   So, if specified, ``split`` defines which ``DNDarray`` stays put and which one is passed around.
   If ``split`` is ``None`` or unspecified, the result will be distributed along axis ``0``, i.e. by default ``b`` is
   passed around, ``a`` stays put.

   .. rubric:: Examples

   >>> a = ht.arange(4)
   >>> b = ht.arange(3)
   >>> ht.outer(a, b).larray
   (3 processes)
   [0/2]   tensor([[0, 0, 0],
                   [0, 1, 2],
                   [0, 2, 4],
                   [0, 3, 6]], dtype=torch.int32)
   [1/2]   tensor([[0, 0, 0],
                   [0, 1, 2],
                   [0, 2, 4],
                   [0, 3, 6]], dtype=torch.int32)
   [2/2]   tensor([[0, 0, 0],
                   [0, 1, 2],
                   [0, 2, 4],
                   [0, 3, 6]], dtype=torch.int32)
   >>> a = ht.arange(4, split=0)
   >>> b = ht.arange(3, split=0)
   >>> ht.outer(a, b).larray
   [0/2]   tensor([[0, 0, 0],
                   [0, 1, 2]], dtype=torch.int32)
   [1/2]   tensor([[0, 2, 4]], dtype=torch.int32)
   [2/2]   tensor([[0, 3, 6]], dtype=torch.int32)
   >>> ht.outer(a, b, split=1).larray
   [0/2]   tensor([[0],
                   [0],
                   [0],
                   [0]], dtype=torch.int32)
   [1/2]   tensor([[0],
                   [1],
                   [2],
                   [3]], dtype=torch.int32)
   [2/2]   tensor([[0],
                   [2],
                   [4],
                   [6]], dtype=torch.int32)
   >>> a = ht.arange(5, dtype=ht.float32, split=0)
   >>> b = ht.arange(4, dtype=ht.float64, split=0)
   >>> out = ht.empty((5,4), dtype=ht.float64, split=1)
   >>> ht.outer(a, b, split=1, out=out)
   >>> out.larray
   [0/2]   tensor([[0., 0.],
                   [0., 1.],
                   [0., 2.],
                   [0., 3.],
                   [0., 4.]], dtype=torch.float64)
   [1/2]   tensor([[0.],
                   [2.],
                   [4.],
                   [6.],
                   [8.]], dtype=torch.float64)
   [2/2]   tensor([[ 0.],
                   [ 3.],
                   [ 6.],
                   [ 9.],
                   [12.]], dtype=torch.float64)


.. function:: projection(a: heat.core.dndarray.DNDarray, b: heat.core.dndarray.DNDarray) -> heat.core.dndarray.DNDarray

   Projection of vector ``a`` onto vector ``b``

   :param a: The vector to be projected. Must be a 1D ``DNDarray``
   :type a: DNDarray
   :param b: The vector to project onto. Must be a 1D ``DNDarray``
   :type b: DNDarray


.. function:: trace(a: heat.core.dndarray.DNDarray, offset: Optional[int] = 0, axis1: Optional[int] = 0, axis2: Optional[int] = 1, dtype: Optional[heat.core.types.datatype] = None, out: Optional[heat.core.dndarray.DNDarray] = None) -> Union[heat.core.dndarray.DNDarray, float]

   Return the sum along diagonals of the array

   If `a` is 2D, the sum along its diagonal with the given offset is returned, i.e. the sum of
   elements a[i, i+offset] for all i.

   If `a` has more than two dimensions, then the axes specified by `axis1` and `axis2` are used
   to determine the 2D-sub-DNDarrays whose traces are returned.
   The shape of the resulting array is the same as that of `a` with `axis1` and `axis2` removed.

   :param a: Input array, from which the diagonals are taken
   :type a: array_like
   :param offset: Offsets of the diagonal from the main diagonal. Can be both positive and negative. Defaults to 0.
   :type offset: int, optional
   :param axis1: Axis to be used as the first axis of the 2D-sub-arrays from which the diagonals
                 should be taken. Default is the first axis of `a`
   :type axis1: int, optional
   :param axis2: Axis to be used as the second axis of the 2D-sub-arrays from which the diagonals
                 should be taken. Default is the second two axis of `a`
   :type axis2: int, optional
   :param dtype: Determines the data-type of the returned array and of the accumulator where the elements are
                 summed. If `dtype` has value None than the dtype is the same as that of `a`
   :type dtype: dtype, optional
   :param out: Array into which the output is placed. Its type is preserved and it must be of the right shape
               to hold the output
               Only applicable if `a` has more than 2 dimensions, thus the result is not a scalar.
               If distributed, its split axis might change eventually.
   :type out: ht.DNDarray, optional

   :returns: **sum_along_diagonals** -- If `a` is 2D, the sum along the diagonal is returned as a scalar
             If `a` has more than 2 dimensions, then a DNDarray of sums along diagonals is returned
   :rtype: number (of defined dtype) or ht.DNDarray

   .. rubric:: Examples

   2D-case
   >>> x = ht.arange(24).reshape((4, 6))
   >>> x
       DNDarray([[ 0,  1,  2,  3,  4,  5],
                 [ 6,  7,  8,  9, 10, 11],
                 [12, 13, 14, 15, 16, 17],
                 [18, 19, 20, 21, 22, 23]], dtype=ht.int32, device=cpu:0, split=None)
   >>> ht.trace(x)
       42
   >>> ht.trace(x, 1)
       46
   >>> ht.trace(x, -2)
       31

   > 2D-case
   >>> x = x.reshape((2, 3, 4))
   >>> x
       DNDarray([[[ 0,  1,  2,  3],
                  [ 4,  5,  6,  7],
                  [ 8,  9, 10, 11]],

                 [[12, 13, 14, 15],
                  [16, 17, 18, 19],
                  [20, 21, 22, 23]]], dtype=ht.int32, device=cpu:0, split=None)
   >>> ht.trace(x)
       DNDarray([16, 18, 20, 22], dtype=ht.int32, device=cpu:0, split=None)
   >>> ht.trace(x, 1)
       DNDarray([24, 26, 28, 30], dtype=ht.int32, device=cpu:0, split=None)
   >>> ht.trace(x, axis1=0, axis2=2)
       DNDarray([13, 21, 29], dtype=ht.int32, device=cpu:0, split=None)


.. function:: transpose(a: heat.core.dndarray.DNDarray, axes: Optional[List[int]] = None) -> heat.core.dndarray.DNDarray

   Permute the dimensions of an array.

   :param a: Input array.
   :type a: DNDarray
   :param axes: By default, reverse the dimensions, otherwise permute the axes according to the values given.
   :type axes: None or List[int,...], optional


.. function:: tril(m: heat.core.dndarray.DNDarray, k: int = 0) -> heat.core.dndarray.DNDarray

   Returns the lower triangular part of the ``DNDarray``.
   The lower triangular part of the array is defined as the elements on and below the diagonal, the other elements of
   the result array are set to 0.
   The argument ``k`` controls which diagonal to consider. If ``k=0``, all elements on and below the main diagonal are
   retained. A positive value includes just as many diagonals above the main diagonal, and similarly a negative
   value excludes just as many diagonals below the main diagonal.

   :param m: Input array for which to compute the lower triangle.
   :type m: DNDarray
   :param k: Diagonal above which to zero elements. ``k=0`` (default) is the main diagonal, ``k<0`` is below and ``k>0`` is above.
   :type k: int, optional


.. function:: triu(m: heat.core.dndarray.DNDarray, k: int = 0) -> heat.core.dndarray.DNDarray

   Returns the upper triangular part of the ``DNDarray``.
   The upper triangular part of the array is defined as the elements on and below the diagonal, the other elements of the result array are set to 0.
   The argument ``k`` controls which diagonal to consider. If ``k=0``, all elements on and below the main diagonal are
   retained. A positive value includes just as many diagonals above the main diagonal, and similarly a negative
   value excludes just as many diagonals below the main diagonal.

   :param m: Input array for which to compute the upper triangle.
   :type m: DNDarray
   :param k: Diagonal above which to zero elements. ``k=0`` (default) is the main diagonal, ``k<0`` is below and ``k>0`` is above.
   :type k: int, optional


.. function:: vdot(x1: heat.core.dndarray.DNDarray, x2: heat.core.dndarray.DNDarray) -> heat.core.dndarray.DNDarray

   Computes the dot product of two vectors. Higher-dimensional arrays will be flattened.

   :param x1: first input array. If it's complex, it's complex conjugate will be used.
   :type x1: DNDarray
   :param x2: second input array.
   :type x2: DNDarray

   :raises ValueError: If the number of elements is inconsistent.

   .. seealso::

      :py:obj:`dot`
          Return the dot product without using the complex conjugate.

   .. rubric:: Examples

   >>> a = ht.array([1 + 1j, 2 + 2j])
   >>> b = ht.array([1 + 2j, 3 + 4j])
   >>> ht.vdot(a, b)
   DNDarray([(17+3j)], dtype=ht.complex64, device=cpu:0, split=None)
   >>> ht.vdot(b, a)
   DNDarray([(17-3j)], dtype=ht.complex64, device=cpu:0, split=None)


.. function:: vecdot(x1: heat.core.dndarray.DNDarray, x2: heat.core.dndarray.DNDarray, axis: Optional[int] = None, keepdims: Optional[bool] = None) -> heat.core.dndarray.DNDarray

   Computes the (vector) dot product of two DNDarrays.

   :param x1: first input array.
   :type x1: DNDarray
   :param x2: second input array. Must be compatible with x1.
   :type x2: DNDarray
   :param axis: axis over which to compute the dot product. The last dimension is used if 'None'.
   :type axis: int, optional
   :param keepdims: If this is set to 'True', the axes which are reduced are left in the result as dimensions with size one.
   :type keepdims: bool, optional

   .. seealso::

      :py:obj:`dot`
          NumPy-like dot function.

   .. rubric:: Examples

   >>> ht.vecdot(ht.full((3, 3, 3), 3), ht.ones((3, 3)), axis=0)
   DNDarray([[9., 9., 9.],
             [9., 9., 9.],
             [9., 9., 9.]], dtype=ht.float32, device=cpu:0, split=None)


.. function:: vector_norm(x: heat.core.dndarray.DNDarray, axis: Optional[Union[int, Tuple[int]]] = None, keepdims=False, ord: Optional[Union[int, float]] = None) -> heat.core.dndarray.DNDarray

   Computes the vector norm of an array.

   :param x: Input array
   :type x: DNDarray
   :param axis: Axis along which to compute the vector norm. If `None` 'x' must be a vector. Default: `None`
   :type axis: int, tuple, optional
   :param keepdims: Retains the reduced dimension when `True`. Default: `False`
   :type keepdims: bool, optional
   :param ord: The norm order to compute. If `None` the euclidean norm (`2`) is used. Default: `None`
   :type ord: int, float, optional

   .. seealso::

      :py:obj:`norm`
          Computes the vector norm or matrix norm of an array.

      :py:obj:`matrix_norm`
          Computes the matrix norm of an array.

   .. rubric:: Notes

   The following norms are suported:

   =====  ==========================
   ord    norm for vectors
   =====  ==========================
   None   L2-norm (Euclidean)
   inf    max(abs(x))
   -inf   min(abs(x))
   0      sum(x != 0)
   1      L1-norm (Manhattan)
   -1     1./sum(1./abs(a))
   2      L2-norm (Euclidean)
   -2     1./sqrt(sum(1./abs(a)**2))
   other  sum(abs(x)**ord)**(1./ord)
   =====  ==========================

   :raises TypeError: If axis is not an integer or a 1-tuple
   :raises ValueError: If an invalid vector norm is given.

   .. rubric:: Examples

   >>> ht.vector_norm(ht.array([1, 2, 3, 4]))
   DNDarray([5.4772], dtype=ht.float64, device=cpu:0, split=None)
   >>> ht.vector_norm(ht.array([[1, 2], [3, 4]]), axis=0, ord=1)
   DNDarray([[4., 6.]], dtype=ht.float64, device=cpu:0, split=None)


.. function:: cg(A: heat.core.dndarray.DNDarray, b: heat.core.dndarray.DNDarray, x0: heat.core.dndarray.DNDarray, out: Optional[heat.core.dndarray.DNDarray] = None) -> heat.core.dndarray.DNDarray

   Conjugate gradients method for solving a system of linear equations :math: `Ax = b`

   :param A: 2D symmetric, positive definite Matrix
   :type A: DNDarray
   :param b: 1D vector
   :type b: DNDarray
   :param x0: Arbitrary 1D starting vector
   :type x0: DNDarray
   :param out: Output Vector
   :type out: DNDarray, optional


.. function:: lanczos(A: heat.core.dndarray.DNDarray, m: int, v0: Optional[heat.core.dndarray.DNDarray] = None, V_out: Optional[heat.core.dndarray.DNDarray] = None, T_out: Optional[heat.core.dndarray.DNDarray] = None) -> Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]

   The Lanczos algorithm is an iterative approximation of the solution to the eigenvalue problem, as an adaptation of
   power methods to find the m "most useful" (tending towards extreme highest/lowest) eigenvalues and eigenvectors of
   an :math:`n \times n` Hermitian matrix, where often :math:`m<<n`.
   It returns two matrices :math:`V` and :math:`T`, where:

       - :math:`V` is a Matrix of size :math:`n\times m`, with orthonormal columns, that span the Krylow subspace \n
       - :math:`T` is a Tridiagonal matrix of size :math:`m\times m`, with coefficients :math:`\alpha_1,..., \alpha_n`
         on the diagonal and coefficients :math:`\beta_1,...,\beta_{n-1}` on the side-diagonals\n

   :param A: 2D Hermitian (if complex) or symmetric positive-definite matrix.
             Only distribution along axis 0 is supported, i.e. `A.split` must be `0` or `None`.
   :type A: DNDarray
   :param m: Number of Lanczos iterations
   :type m: int
   :param v0: 1D starting vector of Euclidean norm 1. If not provided, a random vector will be used to start the algorithm
   :type v0: DNDarray, optional
   :param V_out: Output Matrix for the Krylow vectors, Shape = (n, m), dtype=A.dtype, must be initialized to zero
   :type V_out: DNDarray, optional
   :param T_out: Output Matrix for the Tridiagonal matrix, Shape = (m, m), must be initialized to zero
   :type T_out: DNDarray, optional


.. function:: solve(A: heat.core.dndarray.DNDarray, b: heat.core.dndarray.DNDarray, out: Optional[heat.core.dndarray.DNDarray] = None) -> heat.core.dndarray.DNDarray

   Computes the solution of a square system of linear equations with a unique solution.

   Letting :math:`\mathbb{K}` be :math:`\mathbb{R}` or :math:`\mathbb{C}`,
   this function computes the solution :math:`X \in \mathbb{K}^{n \times k}` of the **linear system** associated to
   :math:`A \in \mathbb{K}^{n \times n}, B \in \mathbb{K}^{n \times k}`, which is defined as

   .. math:: AX = B

   Supports inputs of integer, float, double, cfloat and cdouble dtypes.
   Also supports batches of matrices, and if the inputs are batches of matrices then
   the output has the same batch dimensions.

   Letting `*` be zero or more batch dimensions,

   - If :attr:`A` has shape `(*, n, n)` and :attr:`B` has shape `(*, n)` (a batch of vectors) or shape
     `(*, n, k)` (a batch of matrices or "multiple right-hand sides"), this function returns `X` of shape
     `(*, n)` or `(*, n, k)` respectively.
   - Otherwise, if :attr:`A` has shape `(*, n, n)` and  :attr:`B` has shape `(n,)`  or `(n, k)`, :attr:`B`
     is broadcast to have shape `(*, n)` or `(*, n, k)` respectively.
     This function then returns the solution of the resulting batch of systems of linear equations.

   .. note::
       A and b may only be distributed in the batch dimensions. If both are split, they must be split along matching batch axes.

   .. seealso::

           :func:`torch.linalg.solve` is called under the hood on the local data. This docstring is also heavily inspired by the docstring of this function.


   :param A: Matrix to be inverted of shape `(*, n, n)` where `*` is zero or more batch dimensions
   :type A: DNDarray
   :param b: Right-hand side of shape `(*, n)` or  `(*, n, k)` or `(n,)` or `(n, k)`
   :type b: DNDarray
   :param out: Output Vector
   :type out: DNDarray, optional
   :param Examples::: >>> A = ht.random.randn(3, 3)
                      >>> b = ht.random.randn(3)
                      >>> x = ht.linalg.solve(A, b)
                      >>> ht.allclose(A @ x, b, atol=1e-5)
                      True
                      >>> A = ht.random.randn(2, 3, 3, split=0)
                      >>> B = ht.random.randn(2, 3, 4, split=0)
                      >>> X = ht.linalg.solve(A, B)
                      >>> X.shape
                      (2, 3, 4)
                      >>> ht.allclose(A @ X, B, atol=1e-5)
                      True
                      >>> A = ht.random.randn(2, 3, 3, split=None)
                      >>> B = ht.random.randn(2, 3, 4, split=2)
                      >>> X = ht.linalg.solve(A, B)
                      >>> X.split
                      2
                      >>> ht.allclose(A @ X, B, atol=1e-5)
                      True

                      >>> A = ht.random.randn(2, 3, 3, split=0)
                      >>> b = ht.random.randn(3, 1)
                      >>> x = ht.linalg.solve(A, b) # b is broadcast to size (2, 3, 1)
                      >>> x.shape
                      (2, 3, 1)
                      >>> x.split
                      0
                      >>> ht.allclose((A @ x).resplit_(None), b, atol=1e-5)
                      True


.. function:: solve_triangular(A: heat.core.dndarray.DNDarray, b: heat.core.dndarray.DNDarray) -> heat.core.dndarray.DNDarray

   Solver for (possibly batched) upper triangular systems of linear equations: it returns `x` in `Ax = b`, where `A` is a (possibly batched) upper triangular matrix and
   `b` a (possibly batched) vector or matrix of suitable shape, both provided as input to the function.
   The implementation builts on the corresponding solver in PyTorch and implements an memory-distributed, MPI-parallel block-wise version thereof.

   :param A: An upper triangular invertible square (n x n) matrix or a batch thereof,  i.e. a ``DNDarray`` of shape `(..., n, n)`.
   :type A: DNDarray
   :param b: a (possibly batched) n x k matrix, i.e. an DNDarray of shape (..., n, k), where the batch-dimensions denoted by ... need to coincide with those of A.
             (Batched) Vectors have to be provided as ... x n x 1 matrices and the split dimension of b must the second last dimension if not None.
   :type b: DNDarray

   .. note::

      Since such a check might be computationally expensive, we do not check whether A is indeed upper triangular.
      If you require such a check, please open an issue on our GitHub page and request this feature.


.. function:: qr(A: heat.core.dndarray.DNDarray, mode: str = 'reduced', procs_to_merge: int = 2) -> Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]

   Calculates the QR decomposition of a 2D ``DNDarray``.
   Factor the matrix ``A`` as *QR*, where ``Q`` is orthonormal and ``R`` is upper-triangular.
   If ``mode = "reduced"``, function returns ``QR(Q=Q, R=R)``, if ``mode = "r"`` function returns ``QR(Q=None, R=R)``

   This function also works for batches of matrices; in this case, the last two dimensions of the input array are considered as the matrix dimensions.
   The output arrays have the same leading batch dimensions as the input array.

   :param A: Array which will be decomposed.
   :type A: DNDarray of shape (M, N), of shape (...,M,N) in the batched case
   :param mode: default "reduced" returns Q and R with dimensions (M, min(M,N)) and (min(M,N), N). Potential batch dimensions are not modified.
                "r" returns only R, with dimensions (min(M,N), N).
   :type mode: str, optional
   :param procs_to_merge: This parameter is only relevant for split=0 (-2, in the batched case) and determines the number of processes to be merged at one step during the so-called TS-QR algorithm.
                          The default is 2. Higher choices might be faster, but will probably result in higher memory consumption. 0 corresponds to merging all processes at once.
                          We only recommend to modify this parameter if you are familiar with the TS-QR algorithm (see the references below).
   :type procs_to_merge: int, optional

   .. rubric:: Notes

   The distribution schemes of ``Q`` and ``R`` depend on that of the input ``A``.

       - If ``A`` is distributed along the columns (A.split = 1), so will be ``Q`` and ``R``.

       - If ``A`` is distributed along the rows (A.split = 0), ``Q`` too will have  `split=0`. ``R`` won't be distributed, i.e. `R. split = None`, if ``A`` is tall-skinny, i.e., if
         the largest local chunk of data of ``A`` has at least as many rows as columns. Otherwise, ``R`` will be distributed along the rows as well, i.e., `R.split = 0`.

   Note that the argument `calc_q` allowed in earlier Heat versions is no longer supported; `calc_q = False` is equivalent to `mode = "r"`.
   Unlike ``numpy.linalg.qr()``, `ht.linalg.qr` only supports ``mode="reduced"`` or ``mode="r"`` for the moment, since "complete" may result in heavy memory usage.

   Heats QR function is built on top of PyTorchs QR function, ``torch.linalg.qr()``, using LAPACK (CPU) and MAGMA (CUDA) on
   the backend. Both cases split=0 and split=1 build on a column-block-wise version of stabilized Gram-Schmidt orthogonalization.
   For split=1 (-1, in the batched case), this is directly applied to the local arrays of the input array.
   For split=0, a tall-skinny QR (TS-QR) is implemented for the case of tall-skinny matrices (i.e., the largest local chunk of data has at least as many rows as columns),
   and extended to non tall-skinny matrices by applying a block-wise version of stabilized Gram-Schmidt orthogonalization.

   .. rubric:: References

   Basic information about QR factorization/decomposition can be found at, e.g.:

       - https://en.wikipedia.org/wiki/QR_factorization,

       - Gene H. Golub and Charles F. Van Loan. 1996. Matrix Computations (3rd Ed.).

   For an extensive overview on TS-QR and its variants we refer to, e.g.,

       - Demmel, James, et al. “Communication-Optimal Parallel and Sequential QR and LU Factorizations.” SIAM Journal on Scientific Computing, vol. 34, no. 1, 2 Feb. 2012, pp. A206–A239., doi:10.1137/080731992.


.. function:: hsvd_rank(A: heat.core.dndarray.DNDarray, maxrank: int, compute_sv: bool = False, maxmergedim: Optional[int] = None, safetyshift: int = 5, silent: bool = True) -> Union[Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, float], Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray], heat.core.dndarray.DNDarray]

   Hierarchical SVD (hSVD) with prescribed truncation rank `maxrank`.
   If A = U diag(sigma) V^T is the true SVD of A, this routine computes an approximation for U[:,:maxrank] (and sigma[:maxrank], V[:,:maxrank]).

   The accuracy of this approximation depends on the structure of A ("low-rank" is best) and appropriate choice of parameters.

   One can expect a similar outcome from this routine as for sci-kit learn's TruncatedSVD (with `algorithm='randomized'`) although a different, determinstic algorithm is applied here. Hereby, the parameters `n_components`
   and `n_oversamples` (sci-kit learn) roughly correspond to `maxrank` and `safetyshift` (see below).

   :param A: 2D-array (float32/64) of which the hSVD has to be computed.
   :type A: DNDarray
   :param maxrank: truncation rank. (This parameter corresponds to `n_components` in sci-kit learn's TruncatedSVD.)
   :type maxrank: int
   :param compute_sv: compute_sv=True implies that also Sigma and V.T are computed and returned. The default is False.
   :type compute_sv: bool, optional
   :param maxmergedim: maximal size of the concatenation matrices during the merging procedure. The default is None and results in an appropriate choice depending on the size of the local slices of A and maxrank.
                       Too small choices for this parameter will result in failure if the maximal size of the concatenation matrices does not allow to merge at least two matrices. Too large choices for this parameter can cause memory errors if the resulting merging problem becomes too large.
   :type maxmergedim: int, optional
   :param safetyshift: Increases the actual truncation rank within the computations by a safety shift. The default is 5. (There is some similarity to `n_oversamples` in sci-kit learn's TruncatedSVD.)
   :type safetyshift: int, optional
   :param silent: silent=False implies that some information on the computations are printed. The default is True.
   :type silent: bool, optional

   :returns: if compute_sv=True: U, Sigma, V.T, a-posteriori error estimate for the reconstruction error ||A-U Sigma V^T ||_F / ||A||_F (computed according to [2] along the "true" merging tree).
             if compute_sv=False: U, a-posteriori error estimate
   :rtype: (Union[    Tuple[DNDarray, DNDarray, DNDarray, float], Tuple[DNDarray, DNDarray, DNDarray], DNDarray])

   .. rubric:: Notes

   The size of the process local SVDs to be computed during merging is proportional to the non-split size of the input A and (maxrank + safetyshift). Therefore, conservative choice of maxrank and safetyshift is advised to avoid memory issues.
   Note that, as sci-kit learn's randomized SVD, this routine is different from `numpy.linalg.svd` because not all singular values and vectors are computed
   and even those computed may be inaccurate if the input matrix exhibts a unfavorable structure.

   Please note that "rank" in the context of SVD always refers to the number of singular values/vectors to compute (i.e., "rank" refers to the mathematical rank
   of a matrix). This is completely different from the notion of "(MPI-)rank", i.e., the ID given to a process, in a parallel MPI-application.

   .. seealso:: :func:`hsvd`, :func:`hsvd_rtol`

   .. rubric:: References

   [1] Iwen, Ong. A distributed and incremental SVD algorithm for agglomerative data analysis on large networks. SIAM J. Matrix Anal. Appl., 37(4), 2016.
   [2] Himpe, Leibner, Rave. Hierarchical approximate proper orthogonal decomposition. SIAM J. Sci. Comput., 40 (5), 2018.


.. function:: hsvd_rtol(A: heat.core.dndarray.DNDarray, rtol: float, compute_sv: bool = False, maxrank: Optional[int] = None, maxmergedim: Optional[int] = None, safetyshift: int = 5, no_of_merges: Optional[int] = None, silent: bool = True) -> Union[Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, float], Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray], heat.core.dndarray.DNDarray]

   Hierchical SVD (hSVD) with prescribed upper bound on the relative reconstruction error.
   If A = U diag(sigma) V^T is the true SVD of A, this routine computes an approximation for U[:,:r] (and sigma[:r], V[:,:r])
   such that the rel. reconstruction error ||A-U[:,:r] diag(sigma[:r]) V[:,:r]^T ||_F / ||A||_F does not exceed rtol.

   The accuracy of this approximation depends on the structure of A ("low-rank" is best) and appropriate choice of parameters. This routine is similar to `hsvd_rank` with the difference that
   truncation is not performed after a fixed number (namly `maxrank` many) singular values but after such a number of singular values that suffice to capture a prescribed fraction of the amount of information
   contained in the input data (`rtol`).

   :param A: 2D-array (float32/64) of which the hSVD has to be computed.
   :type A: DNDarray
   :param rtol: desired upper bound on the relative reconstruction error ||A-U Sigma V^T ||_F / ||A||_F. This upper bound is processed into 'local'
                tolerances during the actual computations assuming the worst case scenario of a binary "merging tree"; therefore, the a-posteriori
                error for the relative error using the true "merging tree" (see output) may be significantly smaller than rtol.
                Prescription of maxrank or maxmergedim (disabled in default) can result in loss of desired precision, but can help to avoid memory issues.
   :type rtol: float
   :param compute_sv: compute_sv=True implies that also Sigma and V.T are computed and returned. The default is False.
   :type compute_sv: bool, optional
   :param no_of_merges: Maximum number of processes to be merged at each step. If no further arguments are provided (see below),
                        this completely determines the "merging tree" and may cause memory issues. The default is None and results in a binary merging tree.
                        Note that no_of_merges dominates maxrank and maxmergedim in the sense that at most no_of_merges processes are merged
                        even if maxrank and maxmergedim would allow merging more processes.
   :type no_of_merges: int, optional
   :param maxrank: maximal truncation rank. The default is None.
                   Setting at least one of maxrank and maxmergedim is recommended to avoid memory issues, but can result in loss of desired precision.
                   Setting only maxrank (and not maxmergedim) results in an appropriate default choice for maxmergedim depending on the size of the local slices of A and the value of maxrank.
   :type maxrank: int, optional
   :param maxmergedim: maximal size of the concatenation matrices during the merging procedure. The default is None and results in an appropriate choice depending on the size of the local slices of A and maxrank. The default is None.
                       Too small choices for this parameter will result in failure if the maximal size of the concatenation matrices does not allow to merge at least two matrices. Too large choices for this parameter can cause memory errors if the resulting merging problem becomes too large.
                       Setting at least one of maxrank and maxmergedim is recommended to avoid memory issues, but can result in loss of desired precision.
                       Setting only maxmergedim (and not maxrank) results in an appropriate default choice for maxrank.
   :type maxmergedim: int, optional
   :param safetyshift: Increases the actual truncation rank within the computations by a safety shift. The default is 5.
   :type safetyshift: int, optional
   :param silent: silent=False implies that some information on the computations are printed. The default is True.
   :type silent: bool, optional

   :returns: if compute_sv=True: U, Sigma, V.T, a-posteriori error estimate for the reconstruction error ||A-U Sigma V^T ||_F / ||A||_F (computed according to [2] along the "true" merging tree used in the computations).
             if compute_sv=False: U, a-posteriori error estimate
   :rtype: (Union[    Tuple[DNDarray, DNDarray, DNDarray, float], Tuple[DNDarray, DNDarray, DNDarray], DNDarray])

   .. rubric:: Notes

   The maximum size of the process local SVDs to be computed during merging is proportional to the non-split size of the input A and (maxrank + safetyshift). Therefore, conservative choice of maxrank and safetyshift is advised to avoid memory issues.
   For similar reasons, prescribing only rtol and the number of processes to be merged in each step (without specifying maxrank or maxmergedim) may result in memory issues.
   Although prescribing maxrank is therefore strongly recommended to avoid memory issues, but may result in loss of desired precision (rtol). If this occures, a separate warning will be raised.

   Note that this routine is different from `numpy.linalg.svd` because not all singular values and vectors are computed and even those computed may be inaccurate if the input matrix exhibts a unfavorable structure.

   To avoid confusion, note that `rtol` in this routine does not have any similarity to `tol` in scikit learn's TruncatedSVD.

   Please note that "rank" in the context of SVD always refers to the number of singular values/vectors to compute (i.e., "rank" refers to the mathematical rank
   of a matrix). This is completely different from the notion of "(MPI-)rank", i.e., the ID given to a process, in a parallel MPI-application.

   .. seealso:: :func:`hsvd`, :func:`hsvd_rank`

   .. rubric:: References

   [1] Iwen, Ong. A distributed and incremental SVD algorithm for agglomerative data analysis on large networks. SIAM J. Matrix Anal. Appl., 37(4), 2016.
   [2] Himpe, Leibner, Rave. Hierarchical approximate proper orthogonal decomposition. SIAM J. Sci. Comput., 40 (5), 2018.


.. function:: hsvd(A: heat.core.dndarray.DNDarray, maxrank: Optional[int] = None, maxmergedim: Optional[int] = None, rtol: Optional[float] = None, safetyshift: int = 0, no_of_merges: Optional[int] = 2, compute_sv: bool = False, silent: bool = True, warnings_off: bool = False) -> Union[Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, float], Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray], heat.core.dndarray.DNDarray]

   Computes an approximate truncated SVD of A utilizing a distributed hierarchical algorithm; see the references.
   The present function `hsvd` is a low-level routine, provides many options/parameters, but no default values, and is not recommended for usage by non-experts since conflicts
   arising from inappropriate parameter choice will not be caught. We strongly recommend to use the corresponding high-level functions `hsvd_rank` and `hsvd_rtol` instead.

   Input
   -------
   A: DNDarray
       2D-array (float32/64) of which hSVD has to be computed
   maxrank: int, optional
       truncation rank of the SVD
   maxmergedim: int, optional
       maximal size of the concatenation matrices when "merging" the local SVDs
   rtol: float, optional
       upper bound on the relative reconstruction error ||A-U Sigma V^T ||_F / ||A||_F (may deteriorate due to other parameters)
   safetyshift: int, optional
       shift that increases the actual truncation rank of the local SVDs during the computations in order to increase accuracy
   no_of_merges: int, optional
       maximum number of local SVDs to be "merged" at one step
   compute_sv: bool, optional
       determines whether to compute U, Sigma, V.T (compute_sv=True) or not (then U only)
   silent: bool, optional
       determines whether to print infos on the computations performed (silent=False)
   warnings_off: bool, optional
       switch on and off warnings that are not intended for the high-level routines based on this function

   .. rubric:: Notes

   Please note that "rank" in the context of SVD always refers to the number of singular values/vectors to compute (i.e., "rank" refers to the mathematical rank
   of a matrix). This is completely different from the notion of "(MPI-)rank", i.e., the ID given to a process, in a parallel MPI-application.

   .. rubric:: References

   [1] Iwen, Ong. A distributed and incremental SVD algorithm for agglomerative data analysis on large networks. SIAM J. Matrix Anal. Appl., 37(4), 2016.
   [2] Himpe, Leibner, Rave. Hierarchical approximate proper orthogonal decomposition. SIAM J. Sci. Comput., 40 (5), 2018.

   .. seealso:: :func:`hsvd_rank`, :func:`hsvd_rtol`


.. function:: isvd(new_data: heat.core.dndarray.DNDarray, U_old: heat.core.dndarray.DNDarray, S_old: heat.core.dndarray.DNDarray, Vt_old: heat.core.dndarray.DNDarray, maxrank: Optional[int] = None) -> Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]

   Incremental SVD (iSVD) for the addition of new data to an existing SVD.
   Given the SVD of an "old" matrix, :math:`X_\textnormal{old} = `U_\textnormal{old} \cdot S_\textnormal{old} \cdot V_\textnormal{old}^T`, and additional columns :math:`N` (\"`new_data`\"), this routine computes
   (a possibly approximate) SVD of the extended matrix :math:`X_\textnormal{new} = [ X_\textnormal{old} | N]`.

   :param new_data: 2D-array (float32/64) of columns that are added to the "old" SVD. It must hold `new_data.split != 1` if `U_old.split = 0`.
   :type new_data: DNDarray
   :param U_old: U-factor of the SVD of the "old" matrix, 2D-array (float32/64). It must hold `U_old.split != 0` if `new_data.split = 1`.
   :type U_old: DNDarray
   :param S_old: Sigma-factor of the SVD of the "old" matrix, 1D-array (float32/64)
   :type S_old: DNDarray
   :param Vt_old: Transpose of V-factor of the SVD of the "old" matrix, 2D-array (float32/64)
   :type Vt_old: DNDarray
   :param maxrank: truncation rank of the SVD of the extended matrix. The default is None, i.e., no bound on the maximal rank is imposed.
   :type maxrank: int, optional

   .. rubric:: Notes

   Inexactness may arise due to truncation to maximal rank `maxrank` if rank of the data to be processed exceeds this rank.
   If you set `maxrank` to a high number (or None) in order to avoid inexactness, you may encounter memory issues.
   The implementation follows the approach described in Ref. [1], Sect. 2.

   Please note that "rank" in the context of SVD always refers to the number of singular values/vectors to compute (i.e., "rank" refers to the mathematical rank
   of a matrix). This is completely different from the notion of "(MPI-)rank", i.e., the ID given to a process, in a parallel MPI-application.

   .. rubric:: References

   [1] Brand, M. (2006). Fast low-rank modifications of the thin singular value decomposition. Linear algebra and its applications, 415(1), 20-30.


.. function:: svd(A: heat.core.dndarray.DNDarray, full_matrices: bool = False, compute_uv: bool = True, qr_procs_to_merge: int = 2, r_max_zolopd: int = 8) -> Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]

   Computes the singular value decomposition of a matrix (the input array ``A``).
   For an input DNDarray ``A`` of shape ``(M, N)``, the function returns DNDarrays ``U``, ``S``, and ``V.T`` such that ``A = U @ ht.diag(S) @ V.T``
   with shapes ``(M, min(M,N))``, ``(min(M, N),)``, and ``(min(M,N),N)``, respectively, in the case that ``compute_uv=True``, or
   only the vector containing the singular values ``S`` of shape ``(min(M, N),)`` in the case that ``compute_uv=False``. By definition of the singular value decomposition,
   the matrix ``U`` is orthogonal, the matrix ``V`` is orthogonal, and the entries of the vector ``S``are non-negative real numbers.

   We refer to, e.g., wikipedia (https://en.wikipedia.org/wiki/Singular_value_decomposition) or to Gene H. Golub and Charles F. Van Loan, Matrix Computations (3rd Ed., 1996),
   for more detailed information on the singular value decomposition.

   :param A: The input array (2D, float32 or float64) for which the singular value decomposition is computed.
             Must be tall skinny (``M >> N``) or short fat (``M << n``) for the current implementation; an implementation that covers the remaining cases is planned.
   :type A: ht.DNDarray
   :param full_matrices: currently, only the default value ``False`` is supported. This argument is included for compatibility with NumPy.
   :type full_matrices: bool, optional
   :param compute_uv: if ``True``, the matrices ``U`` and ``V.T`` are computed and returned together with the singular values ``S``.
                      If ``False``, only the vector ``S`` containing the singular values is returned.
   :type compute_uv: bool, optional
   :param qr_procs_to_merge: the number of processes to merge in the tall skinny QR decomposition that is applied if the input array is tall skinny (``M > N``) or short fat (``M < N``).
                             See the corresponding remarks for :func:``heat.linalg.qr`` for more details.
   :type qr_procs_to_merge: int, optional
   :param r_max_zolopd: an internal parameter only relevant for the case that the input matrix is neither tall-skinny nor short-fat.
                        This parameter is passed to the Zolotarev-Polar Decomposition and the symmetric eigenvalue decomposition that is applied in this case.
                        See the documentation of :func:``heat.linalg.polar`` as well as of :func:``heat.linalg.eigh`` for more details.
   :type r_max_zolopd: int, optional

   .. rubric:: Notes

   Unlike in NumPy, we currently do not support the option ``full_matrices=True``, since this can result in heavy memory consumption (in particular for tall skinny
   and short fat matrices) that should be avoided in the context Heat is designed for. If you nevertheless require this feature, please open an issue on GitHub.

   The algorithm used for the computation of the singular value depends on the shape of the input array ``A``.
   For tall and skinny matrices (``M > N``), the algorithm is based on the tall-skinny QR decomposition. For the remaining cases we use the approach based on
   Zolotarev-Polar Decomposition and a symmetric eigenvalue decomposition based on Zolotarev-Polar Decomposition; see Algorithm 5.3 in:

       Nakatsukasa, Y., & Freund, R. W. (2016). Computing fundamental matrix decompositions accurately via the
       matrix sign function in two iterations: The power of Zolotarev's functions. SIAM Review, 58(3).

   .. seealso:: :func:`heat.linalg.qr`, :func:`heat.linalg.polar`, :func:`heat.linalg.eigh`


.. function:: polar(A: heat.core.dndarray.DNDarray, r: int = None, calcH: bool = True, condition_estimate: float = 1e+16, silent: bool = True, r_max: int = 8) -> Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]

   Computes the so-called polar decomposition of the input 2D DNDarray ``A``, i.e., it returns the orthogonal matrix ``U`` and the symmetric, positive definite
   matrix ``H`` such that ``A = U @ H``.

   Input
   -----
   A : ht.DNDarray,
       The input matrix for which the polar decomposition is computed;
       must be two-dimensional, of data type float32 or float64, and must have at least as many rows as columns.
   r : int, optional, default: None
       The parameter r used in the Zolotarev-PD algorithm; if provided, must be an integer between 1 and 8 that divides the number of MPI processes.
       Higher values of r lead to faster convergence, but memory consumption is proportional to r.
       If not provided, the largest 1 <= r <= r_max that divides the number of MPI processes is chosen.
   calcH : bool, optional, default: True
       If True, the function returns the symmetric, positive definite matrix H. If False, only the orthogonal matrix U is returned.
   condition_estimate : float, optional, default: 1.e16.
       This argument allows to provide an estimate for the condition number of the input matrix ``A``, if such estimate is already known.
       If a positive number greater than 1., this value is used as an estimate for the condition number of A.
       If smaller or equal than 1., the condition number is estimated internally.
       The default value of 1.e16 is the worst case scenario considered in [1].
   silent : bool, optional, default: True
       If True, the function does not print any output. If False, some information is printed during the computation.
   r_max : int, optional, default: 8
       See the description of r for the meaning; r_max is only taken into account if r is not provided.


   .. rubric:: Notes

   The implementation follows Algorithm 5.1 in Reference [1]; however, instead of switching from QR to Cholesky decomposition depending on the condition number,
   we stick to QR decomposition in all iterations.

   .. rubric:: References

   [1] Nakatsukasa, Y., & Freund, R. W. (2016). Computing Fundamental Matrix Decompositions Accurately via the Matrix Sign Function in Two Iterations: The Power of Zolotarev's Functions. SIAM Review, 58(3), DOI: https://doi.org/10.1137/140990334.


.. function:: eigh(A: heat.core.dndarray.DNDarray, r_max_zolopd: int = 8, silent: bool = True) -> Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]

   Computes the symmetric eigenvalue decomposition of a symmetric n x n - matrix A, provided as a DNDarray.

   The function returns DNDarrays Lambda (shape (n,) with split = 0) and V (shape (n,n)) such that
   A = V @ diag(Lambda) @ V^T, where Lambda contains the eigenvalues of A and V is an orthonormal matrix
   containing the corresponding eigenvectors as columns.

   :param A: The input matrix. Must be symmetric.
   :type A: DNDarray
   :param r_max_zolopd: This is a hyperparameter for the computation of the polar decomposition via :func:`heat.linalg.polar` which is
                        applied multiple times in this function. See the documentation of :func:`heat.linalg.polar` for more details on its
                        meaning and the respective default value.
   :type r_max_zolopd: int, optional
   :param silent: If True (default), suppresses output messages; otherwise, some information on the recursion is printed to the console.
   :type silent: bool, optional

   .. rubric:: Notes

   Unlike the :func:`torch.linalg.eigh` function, the eigenvalues are returned in descending order.
   Note that no check of symmetry is performed on the input matrix A; thus, applying this function to a non-symmetric matrix may
   result in unpredictable behaviour without a specific error message pointing to this issue.

   The algorithm used for the computation of the symmetric eigenvalue decomposition is based on the Zolotarev polar decomposition;
   see Algorithm 5.2 in:

       Nakatsukasa, Y., & Freund, R. W. (2016). Computing fundamental matrix decompositions accurately via the
       matrix sign function in two iterations: The power of Zolotarev's functions. SIAM Review, 58(3).

   .. seealso:: :func:`heat.linalg.polar`


.. function:: rsvd(A: heat.core.dndarray.DNDarray, svd_rank: int, n_oversamples: int = 10, power_iter: int = 0, qr_procs_to_merge: int = 2) -> Union[Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray], Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]]

   Randomized SVD (rSVD) with prescribed truncation rank `svd_rank`.
   If :math:`A = U \operatorname{diag}(S) V^T` is the true SVD of A, this routine computes an approximation for U[:,:svd_rank] (and S[:svd_rank], V.T[:,:svd_rank]).

   The accuracy of this approximation depends on the structure of A ("low-rank" is best) and appropriate choice of parameters.

   :param A: 2D-array (float32/64) of which the rSVD has to be computed.
   :type A: DNDarray
   :param svd_rank: truncation rank of the SVD. (This parameter corresponds to `n_components` in scikit-learn's TruncatedSVD.)
   :type svd_rank: int
   :param n_oversamples: number of oversamples. The default is 10.
   :type n_oversamples: int, optional
   :param power_iter: number of power iterations. The default is 0.
                      Choosing `power_iter > 0` can improve the accuracy of the SVD approximation in the case of slowly decaying singular values, but increases the computational cost.
   :type power_iter: int, optional
   :param qr_procs_to_merge: number of processes to merge at each step of QR decomposition in the power iteration (if power_iter > 0). The default is 2. See the corresponding remarks for :func:`heat.linalg.qr() <heat.core.linalg.qr.qr()>` for more details.
   :type qr_procs_to_merge: int, optional

   .. rubric:: Notes

   Memory requirements: the SVD computation of a matrix of size (svd_rank + n_oversamples) x (svd_rank + n_oversamples) must fit into the memory of a single process.
   The implementation follows Algorithm 4.4 (randomized range finder) and Algorithm 5.1 (direct SVD) in [1].

   Please note that "rank" in the context of SVD always refers to the number of singular values/vectors to compute (i.e., "rank" refers to the mathematical rank
   of a matrix). This is completely different from the notion of "(MPI-)rank", i.e., the ID given to a process, in a parallel MPI-application.

   .. rubric:: References

   [1] Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2), 217-288.


.. function:: reigh(A: heat.core.dndarray.DNDarray, n_eigenvalues: int, n_oversamples: int = 10, power_iter: int = 0, qr_procs_to_merge: int = 2) -> Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]

   Randomized eigenvalue decomposition (rEIGH). Only the top `n_eigenvalues` eigenvalues (ordered by magnitude) and corresponding eigenvectors are computed.

   :param A: 2D-array (float32/64) of which the rEIGH has to be computed. Must be symmetric.
   :type A: DNDarray
   :param n_eigenvalues: number of eigenvalues to compute. (This parameter corresponds to `n_components` in scikit-learn's PCA.)
   :type n_eigenvalues: int
   :param n_oversamples: number of oversamples. The default is 10.
   :type n_oversamples: int, optional
   :param power_iter: number of power iterations. The default is 0.
                      Choosing `power_iter > 0` can improve the accuracy of the eigenvalue approximation in the case of slowly decaying eigenvalues, but increases the computational cost.
   :type power_iter: int, optional
   :param qr_procs_to_merge: number of processes to merge at each step of QR decomposition in the power iteration (if power_iter > 0). The default is 2. See the corresponding remarks for :func:`heat.linalg.qr() <heat.core.linalg.qr.qr()>` for more details.
   :type qr_procs_to_merge: int, optional

   .. rubric:: Notes

   Memory requirements: the symmetric eigenvalue decomposition of a matrix of size (n_eigenvalues + n_oversamples) x (n_eigenvalues + n_oversamples) must fit into the memory of a single process.
   The implementation follows Algorithm 4.4 (randomized range finder) and Algorithm 5.3 (eigenvalue decomposition from an SVD) in [1].

   .. rubric:: References

   [1] Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.
   SIAM review, 53(2), 217-288.