heat.cluster
Add the clustering functions to the ht.cluster namespace
Submodules
Package Contents
- class DNDarray(array: torch.Tensor, gshape: Tuple[int, Ellipsis], dtype: heat.core.types.datatype, split: int | None, device: heat.core.devices.Device, comm: Communication, balanced: bool)
Distributed N-Dimensional array. The core element of HeAT. It is composed of PyTorch tensors local to each process.
- Parameters:
array (torch.Tensor) – Local array elements
gshape (Tuple[int,...]) – The global shape of the array
dtype (datatype) – The datatype of the array
split (int or None) – The axis on which the array is divided between processes
device (Device) – The device on which the local arrays reside (CPU or GPU)
comm (Communication) – The communications object for sending and receiving data
balanced (bool or None) – Describes whether the data are evenly distributed across processes. If this information is not available (self.balanced is None), it can be gathered via the is_balanced() method (requires communication).
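The constructor above is mostly used internally; in practice, DNDarrays are usually created through factory functions. A minimal sketch, assuming an MPI launch such as mpirun -n 2 and using only the attributes documented here:

import heat as ht

a = ht.arange(10, split=0)  # distributed 1-D array, split along axis 0
print(a.shape)              # global shape, identical on every process
print(a.lshape)             # shape of this process' local chunk
print(a.split)              # the split axis, here 0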
- __prephalo(start, end) torch.Tensor
Extracts the halo indexed by start, end from self.array in the direction of self.split.
- Parameters:
start (int) – Start index of the halo extracted from self.array
end (int) – End index of the halo extracted from self.array
- get_halo(halo_size: int) torch.Tensor
Fetch halos of size halo_size from neighboring ranks and save them in self.halo_next / self.halo_prev.
- Parameters:
halo_size (int) – Size of the halo.
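A hedged usage sketch of the halo mechanism, assuming more than one MPI process and the halo_prev/halo_next attributes named above:

import heat as ht

a = ht.zeros((4, 4), split=0)  # rows distributed along axis 0
a.get_halo(1)                  # exchange a one-row halo with neighboring ranks
# Edge ranks have no neighbor on one side, so a halo attribute may stay None.
prev, nxt = a.halo_prev, a.halo_next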
- __cat_halo() torch.Tensor
Return local array concatenated to halos if they are available.
- __array__() numpy.ndarray
Returns a view of the process-local slice of the DNDarray as a numpy ndarray, if the DNDarray resides on CPU. Otherwise, it returns a copy, on CPU, of the process-local slice of the DNDarray as a numpy ndarray.
- astype(dtype, copy=True) DNDarray
Returns a casted version of this array: a new array of the same shape but with the given datatype. If copy is False, the cast is performed in-place and this array is returned.
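A short sketch of the casting semantics as described above (dtype names from heat's type system):

import heat as ht

a = ht.ones((2, 3))                 # float32 by default
b = a.astype(ht.int64)              # new array with the requested dtype
c = a.astype(ht.int64, copy=False)  # cast in place; returns the array itself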
- balance_() DNDarray
Function for balancing a DNDarray between all nodes. To determine whether this is needed, use the is_balanced() function. If the DNDarray is already balanced, this function does nothing. This function modifies the DNDarray itself and will not return anything.
Examples
>>> a = ht.zeros((10, 2), split=0)
>>> a[:, 0] = ht.arange(10)
>>> b = a[3:]
[0/2] tensor([[3., 0.]])
[1/2] tensor([[4., 0.],
              [5., 0.],
              [6., 0.]])
[2/2] tensor([[7., 0.],
              [8., 0.],
              [9., 0.]])
>>> print(b.gshape, b.lshape)
[0/2] (7, 2) (1, 2)
[1/2] (7, 2) (3, 2)
[2/2] (7, 2) (3, 2)
>>> b.balance_()
>>> b
[0/2] tensor([[3., 0.],
              [4., 0.],
              [5., 0.]])
[1/2] tensor([[6., 0.],
              [7., 0.]])
[2/2] tensor([[8., 0.],
              [9., 0.]])
>>> print(b.gshape, b.lshape)
[0/2] (7, 2) (3, 2)
[1/2] (7, 2) (2, 2)
[2/2] (7, 2) (2, 2)
- __cast(cast_function) float | int
Implements a generic cast function for DNDarray objects.
- Parameters:
cast_function (function) – The actual cast function, e.g. float or int
- Raises:
TypeError – If the DNDarray object cannot be converted into a scalar.
- collect_(target_rank: int | None = 0) None
A method collecting a distributed DNDarray to one MPI rank, chosen by the target_rank variable. It is a specific case of the redistribute_ method.
- Parameters:
target_rank (int, optional) – The rank to which the DNDarray will be collected. Default: 0.
- Raises:
TypeError – If the target rank is not an integer.
ValueError – If the target rank is out of bounds.
Examples
>>> st = ht.ones((50, 81, 67), split=2)
>>> print(st.lshape)
[0/2] (50, 81, 23)
[1/2] (50, 81, 22)
[2/2] (50, 81, 22)
>>> st.collect_()
>>> print(st.lshape)
[0/2] (50, 81, 67)
[1/2] (50, 81, 0)
[2/2] (50, 81, 0)
>>> st.collect_(1)
>>> print(st.lshape)
[0/2] (50, 81, 0)
[1/2] (50, 81, 67)
[2/2] (50, 81, 0)
- counts_displs() Tuple[Tuple[int], Tuple[int]]
Returns actual counts (number of items per process) and displacements (offsets) of the DNDarray. Does not assume load balance.
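For illustration, a minimal sketch of counts_displs; the concrete tuples depend on the number of processes:

import heat as ht

a = ht.arange(10, split=0)
counts, displs = a.counts_displs()
# With two processes, for example: counts == (5, 5), displs == (0, 5).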
- cpu() DNDarray
Returns a copy of this object in main memory. If this object is already in main memory, then no copy is performed and the original object is returned.
- create_lshape_map(force_check: bool = False) torch.Tensor
Generate a ‘map’ of the lshapes of the data on all processes. Units are
(process rank, lshape)
- Parameters:
force_check (bool, optional) – if False (default) and the lshape map has already been created, use the previous result. Otherwise, create the lshape_map
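A small sketch of create_lshape_map; the resulting tensor has one row per rank, as described above:

import heat as ht

a = ht.zeros((10, 4), split=0)
lshape_map = a.create_lshape_map()
# With two processes, roughly:
# tensor([[5, 4],
#         [5, 4]])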
- create_partition_interface()
Create a partition interface in line with the DPPY proposal. This is subject to change. The intention of this is to facilitate the use of a general format for referencing distributed datasets.
An example of the output and shape is shown below.
__partitioned__ = {
    'shape': (27, 3, 2),
    'partition_tiling': (4, 1, 1),
    'partitions': {
        (0, 0, 0): {
            'start': (0, 0, 0), 'shape': (7, 3, 2), 'data': tensor([...], dtype=torch.int32),
            'location': [0], 'dtype': torch.int32, 'device': 'cpu'
        },
        (1, 0, 0): {
            'start': (7, 0, 0), 'shape': (7, 3, 2), 'data': None,
            'location': [1], 'dtype': torch.int32, 'device': 'cpu'
        },
        (2, 0, 0): {
            'start': (14, 0, 0), 'shape': (7, 3, 2), 'data': None,
            'location': [2], 'dtype': torch.int32, 'device': 'cpu'
        },
        (3, 0, 0): {
            'start': (21, 0, 0), 'shape': (6, 3, 2), 'data': None,
            'location': [3], 'dtype': torch.int32, 'device': 'cpu'
        }
    },
    'locals': [(rank, 0, 0)],
    'get': lambda x: x,
}
- Return type:
dictionary containing the partition interface as shown above.
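A hedged sketch of reading the partition interface, using only the keys shown in the example dictionary above:

import heat as ht

a = ht.zeros((27, 3, 2), split=0)
parts = a.create_partition_interface()
print(parts["shape"])             # global shape, here (27, 3, 2)
print(parts["partition_tiling"])  # e.g. (4, 1, 1) when running on four processes
for key in parts["locals"]:       # keys of the partitions owned by this rank
    local = parts["partitions"][key]
    print(local["start"], local["shape"])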
- fill_diagonal(value: float) DNDarray
Fill the main diagonal of a 2D DNDarray. This function modifies the input tensor in-place, and returns the input array.
- Parameters:
value (float) – The value to be placed in the DNDarray's main diagonal
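A minimal fill_diagonal sketch:

import heat as ht

a = ht.zeros((4, 4), split=0)
a.fill_diagonal(1.0)  # in-place; also returns the modified array
# a is now a distributed 4x4 identity-like matrix.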
- __getitem__(key: int | Tuple[int, Ellipsis] | List[int, Ellipsis]) DNDarray
Global getter function for DNDarrays. Returns a new DNDarray composed of the elements of the original tensor selected by the given indices. This does NOT redistribute or rebalance the resulting tensor. If the selection of values is unbalanced, the resulting tensor is also unbalanced! To redistribute the DNDarray, use balance() (issue #187).
- Parameters:
key (int, slice, Tuple[int,...], List[int,...]) – Indices to get from the tensor.
Examples
>>> a = ht.arange(10, split=0)
(1/2) tensor([0, 1, 2, 3, 4], dtype=torch.int32)
(2/2) tensor([5, 6, 7, 8, 9], dtype=torch.int32)
>>> a[1:6]
(1/2) tensor([1, 2, 3, 4], dtype=torch.int32)
(2/2) tensor([5], dtype=torch.int32)
>>> a = ht.zeros((4, 5), split=0)
(1/2) tensor([[0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.]])
(2/2) tensor([[0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.]])
>>> a[1:4, 1]
(1/2) tensor([0.])
(2/2) tensor([0., 0.])
- is_balanced(force_check: bool = False) bool
Determine if self is distributed evenly (or as evenly as possible) across all processes. This is equivalent to returning self.balanced. If no information is available (self.balanced = None), the balanced status will be assessed via collective communication.
- Parameters:
force_check (bool, optional) – If True, the balanced status of the DNDarray will be assessed via collective communication in any case.
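A short sketch combining is_balanced() and balance_(), following the slicing caveat noted for __getitem__ above:

import heat as ht

a = ht.arange(10, split=0)
b = a[3:]              # slicing can leave b unevenly distributed
if not b.is_balanced():
    b.balance_()       # redistribute in place
assert b.is_balanced()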
- is_distributed() bool
Determines whether the data of this DNDarray is distributed across multiple processes.
- item()
Returns the only element of a 1-element DNDarray. Mirror of the PyTorch command of the same name. If the DNDarray has more than one element, a ValueError is raised (by PyTorch).
Examples
>>> import heat as ht
>>> x = ht.zeros((1))
>>> x.item()
0.0
- __len__() int
The length of the DNDarray, i.e. the number of items in the first dimension.
- numpy() numpy.array
Returns a copy of the DNDarray as a numpy ndarray. If the DNDarray resides on the GPU, the underlying data will be copied to the CPU first.
If the DNDarray is distributed, an MPI Allgather operation will be performed before converting to np.ndarray, i.e. each MPI process will end up holding a copy of the entire array in memory. Make sure process memory is sufficient!
Examples
>>> import heat as ht
>>> T1 = ht.random.randn(10, 8)
>>> T1.numpy()
- __repr__() str
Computes a printable representation of the passed DNDarray.
- ravel()
Flattens the DNDarray.
See also
flatten()
Examples
>>> a = ht.ones((2, 3), split=0)
>>> b = a.ravel()
>>> a[0, 0] = 4
>>> b
DNDarray([4., 1., 1., 1., 1., 1.], dtype=ht.float32, device=cpu:0, split=0)
- redistribute_(lshape_map: torch.Tensor | None = None, target_map: torch.Tensor | None = None)
Redistributes the data of the DNDarray along the split axis to match the given target map. This function does not modify the non-split dimensions of the DNDarray. This is an abstraction and extension of the balance function.
- Parameters:
lshape_map (torch.Tensor, optional) – The current lshape of processes. Units are [rank, lshape].
target_map (torch.Tensor, optional) – The desired distribution across the processes. Units are [rank, target lshape]. Note: the only important parts of the target map are the values along the split axis; the values off this axis are there only to mimic the shape of the lshape_map.
Examples
>>> st = ht.ones((50, 81, 67), split=2)
>>> target_map = torch.zeros((st.comm.size, 3), dtype=torch.int64)
>>> target_map[0, 2] = 67
>>> print(target_map)
[0/2] tensor([[ 0,  0, 67],
[0/2]         [ 0,  0,  0],
[0/2]         [ 0,  0,  0]])
[1/2] tensor([[ 0,  0, 67],
[1/2]         [ 0,  0,  0],
[1/2]         [ 0,  0,  0]])
[2/2] tensor([[ 0,  0, 67],
[2/2]         [ 0,  0,  0],
[2/2]         [ 0,  0,  0]])
>>> print(st.lshape)
[0/2] (50, 81, 23)
[1/2] (50, 81, 22)
[2/2] (50, 81, 22)
>>> st.redistribute_(target_map=target_map)
>>> print(st.lshape)
[0/2] (50, 81, 67)
[1/2] (50, 81, 0)
[2/2] (50, 81, 0)
- __redistribute_shuffle(snd_pr: int | torch.Tensor, send_amt: int | torch.Tensor, rcv_pr: int | torch.Tensor, snd_dtype: torch.dtype)
Helper abstracting the shuffling of data between processes along the split axis during redistribute_.
- Parameters:
snd_pr (int or torch.Tensor) – Sending process
send_amt (int or torch.Tensor) – Amount of data to be sent by the sending process
rcv_pr (int or torch.Tensor) – Receiving process
snd_dtype (torch.dtype) – Torch type of the data in question
- resplit_(axis: int = None)
In-place option for resplitting a DNDarray.
- Parameters:
axis (int) – The new split axis. None denotes gathering; an int sets the new split axis.
Examples
>>> a = ht.zeros((4, 5), split=0)
>>> a.lshape
(0/2) (2, 5)
(1/2) (2, 5)
>>> ht.resplit_(a, None)
>>> a.split
None
>>> a.lshape
(0/2) (4, 5)
(1/2) (4, 5)
>>> a = ht.zeros((4, 5), split=0)
>>> a.lshape
(0/2) (2, 5)
(1/2) (2, 5)
>>> ht.resplit_(a, 1)
>>> a.split
1
>>> a.lshape
(0/2) (4, 3)
(1/2) (4, 2)
- __setitem__(key: int | Tuple[int, Ellipsis] | List[int, Ellipsis], value: float | DNDarray | torch.Tensor)
Global item setter
- Parameters:
key (Union[int, Tuple[int,...], List[int,...]]) – Index/indices to be set
value (Union[float, DNDarray,torch.Tensor]) – Value to be set to the specified positions in the DNDarray (self)
Notes
If a DNDarray is given as the value to be set, then the split axes are assumed to be equal. If they are not, PyTorch will raise an error when the values are set on the local array.
Examples
>>> a = ht.zeros((4, 5), split=0)
(1/2) tensor([[0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.]])
(2/2) tensor([[0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.]])
>>> a[1:4, 1] = 1
>>> a
(1/2) tensor([[0., 0., 0., 0., 0.],
              [0., 1., 0., 0., 0.]])
(2/2) tensor([[0., 1., 0., 0., 0.],
              [0., 1., 0., 0., 0.]])
- __setter(key: int | Tuple[int, Ellipsis] | List[int, Ellipsis], value: float | DNDarray | torch.Tensor)
Utility function for checking value and forwarding to __setitem__.
- Raises:
NotImplementedError – If the type of value is not supported.
- __str__() str
Computes a string representation of the passed DNDarray.
- tolist(keepsplit: bool = False) List
Return a copy of the local array data as a (nested) Python list. For scalars, a standard Python number is returned.
- Parameters:
keepsplit (bool) – Whether the list should be returned locally or globally.
Examples
>>> a = ht.array([[0, 1], [2, 3]])
>>> a.tolist()
[[0, 1], [2, 3]]
>>> a = ht.array([[0, 1], [2, 3]], split=0)
>>> a.tolist()
[[0, 1], [2, 3]]
>>> a = ht.array([[0, 1], [2, 3]], split=1)
>>> a.tolist(keepsplit=True)
(1/2) [[0], [2]]
(2/2) [[1], [3]]
- __torch_proxy__() torch.Tensor
Return a 1-element torch.Tensor strided as the global self shape. Used internally for sanitation purposes.
- __xitem_get_key_start_stop(rank: int, actives: list, key_st: int, key_sp: int, step: int, ends: torch.Tensor, og_key_st: int) Tuple[int, int]
- class _KCluster(metric: Callable, n_clusters: int, init: str | heat.core.dndarray.DNDarray, max_iter: int, tol: float, random_state: int)
Bases: heat.ClusteringMixin, heat.BaseEstimator
Base class for k-statistics clustering algorithms (kmeans, kmedians, kmedoids). The clusters are represented by centroids \(c_i\) (we use the term from kmeans for simplicity).
- Parameters:
metric (function) – One of the distance metrics in ht.spatial.distance. Needs to be passed as a lambda function taking only two arrays as input.
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
init (str or DNDarray, default: ‘random’) –
Method for initialization:
‘probability_based’: selects initial cluster centers for the clustering in a smart way to speed up convergence (k-means++); see the sketch after this parameter list.
‘random’: choose k observations (rows) at random from the data for the initial centroids.
‘batchparallel’: use the batch parallel algorithm to initialize the centroids; only available for split=0 and KMeans or KMedians.
DNDarray: the initial centers, of shape (n_clusters, n_features).
max_iter (int) – Maximum number of iterations for a single run.
tol (float, default: 1e-4) – Relative tolerance with regards to inertia to declare convergence.
random_state (int) – Determines random number generation for centroid initialization.
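For illustration only: a single-process NumPy sketch of the ‘probability_based’ (k-means++) seeding idea referenced in the parameter list above. Heat's actual implementation operates on distributed DNDarrays and differs in detail; the function name here is hypothetical.

import numpy as np

def probability_based_init(x: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    # k-means++-style seeding: draw each new center with probability
    # proportional to the squared distance to the nearest chosen center.
    rng = np.random.default_rng(seed)
    centers = [x[rng.integers(len(x))]]  # first center: uniform at random
    for _ in range(k - 1):
        c = np.asarray(centers)  # (m, n_features)
        d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    return np.asarray(centers)

# Hypothetical usage on toy data:
points = np.random.default_rng(0).normal(size=(100, 2))
init_centers = probability_based_init(points, k=4)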
- _initialize_cluster_centers(x: heat.core.dndarray.DNDarray)
Initializes the K-Means centroids.
- Parameters:
x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)
- _assign_to_cluster(x: heat.core.dndarray.DNDarray, eval_functional_value: bool = False)
Assigns the passed data points to the centroids based on the respective metric
- _update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)
The Update strategy is algorithm specific (e.g. calculate mean of assigned points for kmeans, median for kmedians, etc.)
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroid of the clustering algorithm to fit the data x. The full pipeline is algorithm specific.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to.
In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
- class KMeans(n_clusters: int = 8, init: str | heat.core.dndarray.DNDarray = 'random', max_iter: int = 300, tol: float = 0.0001, random_state: int | None = None)
Bases: heat.cluster._kcluster._KCluster
K-Means clustering algorithm. An implementation of Lloyd’s algorithm [1].
- Variables:
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
init (str or DNDarray) –
Method for initialization:
‘k-means++’: selects initial cluster centers for the clustering in a smart way to speed up convergence [2].
‘random’: choose k observations (rows) at random from the data for the initial centroids.
‘batchparallel’: initialize by using the batch parallel algorithm (see BatchParallelKMeans for more information).
DNDarray: the initial centers, of shape (n_clusters, n_features).
max_iter (int) – Maximum number of iterations of the k-means algorithm for a single run.
tol (float) – Relative tolerance with regards to inertia to declare convergence.
random_state (int) – Determines random number generation for centroid initialization.
Notes
The average complexity is given by \(O(k \cdot n \cdot T)\), where \(n\) is the number of samples and \(T\) is the number of iterations. In practice, the k-means algorithm is very fast, but it may fall into local minima. That is why it can be useful to restart it several times. If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. the cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.
References
[1] Lloyd, Stuart P., “Least squares quantization in PCM”, IEEE Transactions on Information Theory, 28 (2), pp. 129–137, 1982.
[2] Arthur, D., Vassilvitskii, S., “k-means++: The Advantages of Careful Seeding”, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics Philadelphia, PA, USA. pp. 1027–1035, 2007.
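A minimal usage sketch, assuming the cluster_centers_ attribute and the fit/predict pattern documented below; run under MPI for actual distribution:

import heat as ht
from heat.cluster import KMeans

x = ht.random.randn(1000, 3, split=0)  # 1000 samples, 3 features
kmeans = KMeans(n_clusters=4, max_iter=100, random_state=1)
kmeans.fit(x)
labels = kmeans.predict(x)         # closest-centroid index per sample
centers = kmeans.cluster_centers_  # DNDarray of shape (4, 3)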
- _update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)
Compute coordinates of the new centroid as the mean of the data points in x that are assigned to this centroid.
- fit(x: heat.core.dndarray.DNDarray) self
Computes the centroid of a k-means clustering.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
- _initialize_cluster_centers(x: heat.core.dndarray.DNDarray)
Initializes the K-Means centroids.
- Parameters:
x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)
- _assign_to_cluster(x: heat.core.dndarray.DNDarray, eval_functional_value: bool = False)
Assigns the passed data points to the centroids based on the respective metric
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to.
In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
- class KMedians(n_clusters: int = 8, init: str | heat.core.dndarray.DNDarray = 'random', max_iter: int = 300, tol: float = 0.0001, random_state: int = None)
Bases: heat.cluster._kcluster._KCluster
K-Medians clustering algorithm [1]. Uses the Manhattan (City-block, \(L_1\)) metric for distance calculations.
- Parameters:
n_clusters (int, optional, default: 8) – The number of clusters to form as well as the number of centroids to generate.
init (str or DNDarray, default: ‘random’) –
Method for initialization:
‘k-medians++’: selects initial cluster centers for the clustering in a smart way to speed up convergence [2].
‘random’: choose k observations (rows) at random from the data for the initial centroids.
‘batchparallel’: initialize by using the batch parallel algorithm (see BatchParallelKMedians for more information).
DNDarray: the initial centers, of shape (n_clusters, n_features).
max_iter (int, default: 300) – Maximum number of iterations of the k-medians algorithm for a single run.
tol (float, default: 1e-4) – Relative tolerance with regards to inertia to declare convergence.
random_state (int) – Determines random number generation for centroid initialization.
References
[1] Hakimi, S., and O. Kariv. “An algorithmic approach to network location problems II: The p-medians.” SIAM Journal on Applied Mathematics 37.3 (1979): 539-560.
[2] Arthur, D., Vassilvitskii, S., “k-means++: The Advantages of Careful Seeding”, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics Philadelphia, PA, USA. pp. 1027–1035, 2007.
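A usage sketch analogous to the KMeans example above; KMedians assigns points with the \(L_1\) metric but exposes the same fit/predict interface:

import heat as ht
from heat.cluster import KMedians

x = ht.random.rand(500, 2, split=0)    # 500 samples, 2 features
kmedians = KMedians(n_clusters=3, random_state=1)
kmedians.fit(x)
labels = kmedians.predict(x)           # closest-centroid index per sample
medians = kmedians.cluster_centers_    # DNDarray of shape (3, 2)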
- _update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)
Compute coordinates of the new centroid as the median of the data points in x that are assigned to it.
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroid of a k-medians clustering.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
- _initialize_cluster_centers(x: heat.core.dndarray.DNDarray)
Initializes the K-Means centroids.
- Parameters:
x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)
- _assign_to_cluster(x: heat.core.dndarray.DNDarray, eval_functional_value: bool = False)
Assigns the passed data points to the centroids based on the respective metric
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to.
In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
- class _KCluster(metric: Callable, n_clusters: int, init: str | heat.core.dndarray.DNDarray, max_iter: int, tol: float, random_state: int)
Bases:
heat.ClusteringMixin
,heat.BaseEstimator
Base class for k-statistics clustering algorithms (kmeans, kmedians, kmedoids). The clusters are represented by centroids ci (we use the term from kmeans for simplicity)
- Parameters:
metric (function) – One of the distance metrics in ht.spatial.distance. Needs to be passed as lambda function to take only two arrays as input
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
init (str or DNDarray, default: ‘random’) –
Method for initialization:
‘probability_based’ : selects initial cluster centers for the clustering in a smart way to speed up convergence (k-means++)
‘random’: choose k observations (rows) at random from data for the initial centroids.
’batchparallel’: use the batch parallel algorithm to initialize the centroids, only available for split=0 and KMeans or KMedians
DNDarray
: gives the initial centers, should be of Shape = (n_clusters, n_features)
max_iter (int) – Maximum number of iterations for a single run.
tol (float, default: 1e-4) – Relative tolerance with regards to inertia to declare convergence.
random_state (int) – Determines random number generation for centroid initialization.
- _initialize_cluster_centers(x: heat.core.dndarray.DNDarray)
Initializes the K-Means centroids.
- Parameters:
x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)
- _assign_to_cluster(x: heat.core.dndarray.DNDarray, eval_functional_value: bool = False)
Assigns the passed data points to the centroids based on the respective metric
- _update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)
The Update strategy is algorithm specific (e.g. calculate mean of assigned points for kmeans, median for kmedians, etc.)
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroid of the clustering algorithm to fit the data
x
. The full pipeline is algorithm specific.- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in
x
belongs to.In the vector quantization literature,
cluster_centers_()
is called the code book and each value returned by predict is the index of the closest code in the code book.- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
- class KMedoids(n_clusters: int = 8, init: str | heat.core.dndarray.DNDarray = 'random', max_iter: int = 300, random_state: int = None)
Bases: heat.cluster._kcluster._KCluster
This is not the original implementation of k-medoids using PAM as proposed in [1]. This is k-medoids with the Manhattan distance as fixed metric: the median of the points assigned to a cluster is calculated as the new cluster center, and the centroid is then snapped to the nearest data point.
- Parameters:
n_clusters (int, optional, default: 8) – The number of clusters to form as well as the number of centroids to generate.
init (str or DNDarray, default: ‘random’) –
Method for initialization:
‘k-medoids++’ : selects initial cluster centers for the clustering in a smart way to speed up convergence [2].
‘random’: choose k observations (rows) at random from data for the initial centroids.
DNDarray: gives the initial centers, should be of Shape = (n_clusters, n_features)
max_iter (int, default: 300) – Maximum number of iterations of the algorithm for a single run.
random_state (int) – Determines random number generation for centroid initialization.
References
[1] Kaufman, L. and Rousseeuw, P.J. (1987), Clustering by means of Medoids, in Statistical Data Analysis Based on the L1 Norm and Related Methods, edited by Y. Dodge, North-Holland, 405–416.
[2] Arthur, D. and Vassilvitskii, S. (2007), k-means++: The Advantages of Careful Seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’07), 1027–1035.
- _update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)
Compute the new centroid ci as the sample closest to the median of the data points in x that are assigned to ci
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroids of a k-medoids clustering.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
- _initialize_cluster_centers(x: heat.core.dndarray.DNDarray)
Initializes the K-Means centroids.
- Parameters:
x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)
- _assign_to_cluster(x: heat.core.dndarray.DNDarray, eval_functional_value: bool = False)
Assigns the passed data points to the centroids based on the respective metric
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to. In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
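A minimal usage sketch (data sizes and parameter values are illustrative, not taken from the library’s documentation):
>>> import heat as ht
>>> x = ht.random.randn(100, 2, split=0)  # samples distributed along split=0
>>> kmedoids = ht.cluster.KMedoids(n_clusters=2, init="random", random_state=1)
>>> kmedoids.fit(x)
>>> labels = kmedoids.predict(x)  # index of the closest medoid per sample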
- class Spectral(n_clusters: int = None, gamma: float = 1.0, metric: str = 'rbf', laplacian: str = 'fully_connected', threshold: float = 1.0, boundary: str = 'upper', n_lanczos: int = 300, assign_labels: str = 'kmeans', **params)
Bases: heat.ClusteringMixin, heat.BaseEstimator
Spectral clustering
- Variables:
n_clusters (int) – Number of clusters to fit
gamma (float) – Kernel coefficient sigma for ‘rbf’, ignored for metric=’euclidean’
metric (string) –
How to construct the similarity matrix.
’rbf’ : construct the similarity matrix using a radial basis function (RBF) kernel.
’euclidean’ : construct the similarity matrix as only euclidean distance.
laplacian (str) – How to calculate the graph Laplacian (affinity). Currently supported: ‘fully_connected’, ‘eNeighbour’
threshold (float) – Threshold for the affinity matrix if laplacian=’eNeighbour’. Ignored for laplacian=’fully_connected’
boundary (str) – How to interpret the threshold: ‘upper’, ‘lower’. Ignored for laplacian=’fully_connected’
n_lanczos (int) – Number of Lanczos iterations for the eigenvalue decomposition
assign_labels (str) – The strategy to use to assign labels in the embedding space.
**params (dict) – Parameter dictionary for the assign_labels estimator
- _spectral_embedding(x: heat.core.dndarray.DNDarray) Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]
Helper function for the embedding of dataset x. Returns a Tuple(eigenvalues, eigenvectors) of the graph’s Laplacian matrix.
- Parameters:
x (DNDarray) – Sample Matrix for which the embedding should be calculated
Notes
The imaginary part of any complex eigenvalues found during this computation is discarded.
- fit(x: heat.core.dndarray.DNDarray)
Clusters dataset x via spectral embedding. Computes the low-dimensional representation by calculating the eigenspectrum (eigenvalues and eigenvectors) of the graph Laplacian obtained from the similarity matrix, and fits the eigenvectors corresponding to the k lowest eigenvalues with a separate clustering algorithm (currently only k-means is supported). Similarity metrics for adjacency calculations are supported via spatial.distance. The eigenvalues and eigenvectors are computed by reducing the Laplacian via Lanczos iterations and applying the torch eigenvalue solver to the resulting smaller matrix. If other eigenvalue decomposition methods become supported, this will be expanded.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
- predict(x: heat.core.dndarray.DNDarray) heat.core.dndarray.DNDarray
Return the label each sample in x belongs to. x is transformed to the low-dimensional representation by calculating the eigenspectrum (eigenvalues and eigenvectors) of the graph Laplacian obtained from the similarity matrix. Labels are inferred by extracting the closest centroid of the n_clusters eigenvectors from the previously fitted clustering algorithm (k-means).
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
Warning
Calculating the low-dimensional representation is computationally expensive and may take considerable time.
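A minimal usage sketch (parameter values are illustrative; n_lanczos controls the size of the reduced eigenproblem):
>>> import heat as ht
>>> x = ht.random.randn(250, 4, split=0)
>>> spectral = ht.cluster.Spectral(n_clusters=3, gamma=1.0, metric="rbf", n_lanczos=30)
>>> spectral.fit(x)
>>> labels = spectral.predict(x)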
- class _KCluster(metric: Callable, n_clusters: int, init: str | heat.core.dndarray.DNDarray, max_iter: int, tol: float, random_state: int)
Bases: heat.ClusteringMixin, heat.BaseEstimator
Base class for k-statistics clustering algorithms (k-means, k-medians, k-medoids). The clusters are represented by centroids ci (we use the k-means term for simplicity)
- Parameters:
metric (function) – One of the distance metrics in ht.spatial.distance. Must be passed as a lambda function taking exactly two arrays as input (see the sketch after this class entry)
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
init (str or DNDarray, default: ‘random’) –
Method for initialization:
‘probability_based’ : selects initial cluster centers for the clustering in a smart way to speed up convergence (k-means++)
‘random’: choose k observations (rows) at random from data for the initial centroids.
’batchparallel’: use the batch parallel algorithm to initialize the centroids, only available for split=0 and KMeans or KMedians
DNDarray: gives the initial centers, should be of Shape = (n_clusters, n_features)
max_iter (int) – Maximum number of iterations for a single run.
tol (float, default: 1e-4) – Relative tolerance with regards to inertia to declare convergence.
random_state (int) – Determines random number generation for centroid initialization.
- _initialize_cluster_centers(x: heat.core.dndarray.DNDarray)
Initializes the K-Means centroids.
- Parameters:
x (DNDarray) – The data to initialize the clusters for. Shape = (n_samples, n_features)
- _assign_to_cluster(x: heat.core.dndarray.DNDarray, eval_functional_value: bool = False)
Assigns the passed data points to the centroids based on the respective metric
- _update_centroids(x: heat.core.dndarray.DNDarray, matching_centroids: heat.core.dndarray.DNDarray)
The update strategy is algorithm specific (e.g. mean of the assigned points for k-means, median for k-medians, etc.)
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroids of the clustering algorithm to fit the data x. The full pipeline is algorithm specific.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features)
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to. In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
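The metric convention mentioned above can be sketched as follows; ht.spatial.distance.cdist (heat’s pairwise Euclidean distance) stands in for the k-means case, and the variable names are illustrative:
>>> import heat as ht
>>> euclidean = lambda a, b: ht.spatial.distance.cdist(a, b)  # callable taking exactly two arrays
>>> samples = ht.random.randn(10, 3, split=0)
>>> centroids = ht.random.randn(4, 3)  # candidate centroids, not split
>>> distances = euclidean(samples, centroids)  # pairwise distances, shape (10, 4)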
Auxiliary single-process functions and base class for batch-parallel k-clustering
- _initialize_plus_plus(X, n_clusters, p, random_state=None)
Auxiliary function: single-process k-means++/k-medians++ initialization in PyTorch. p is the norm used for computing distances.
- _kmex(X, p, n_clusters, init, max_iter, tol, random_state=None)
Auxiliary function: single-process k-means and k-medians in PyTorch. p is the norm used for computing distances: p=2 yields k-means and p=1 yields k-medians, so p should be 1 or 2. For any other choice of p, the routine proceeds as for p=2. (Note: “kmex” stands for k-means and k-medians.)
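The role of p can be sketched in plain PyTorch (all names below are illustrative, not the internals of _kmex):
>>> import torch
>>> X = torch.randn(8, 2)  # local batch of samples
>>> C = torch.randn(3, 2)  # current cluster centers
>>> labels = torch.cdist(X, C, p=2).argmin(dim=1)  # p=2: k-means geometry; p=1 would give k-medians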
- _parallel_batched_kmex_predict(X, centers, p)
Auxiliary function: predict labels for parallel_batched_kmex
- class _BatchParallelKCluster(p: int, n_clusters: int, init: str, max_iter: int, tol: float, random_state: int | None, n_procs_to_merge: int | None)
Bases: heat.ClusteringMixin, heat.BaseEstimator
Base class for batch parallel k-clustering
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroids of the clustering algorithm to fit the data x.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features). It must hold x.split=0.
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to. In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
- class BatchParallelKMeans(n_clusters: int = 8, init: str = 'k-means++', max_iter: int = 300, tol: float = 0.0001, random_state: int = None, n_procs_to_merge: int = None)
Bases: _BatchParallelKCluster
Batch-parallel K-Means clustering algorithm from Ref. [1]. The input must be a DNDarray of shape (n_samples, n_features) with split=0, i.e. split along the sample axis. This method performs K-Means clustering on each batch (i.e. each process-local chunk) of the data individually and in parallel. Afterwards, all centroids from the local K-Means runs are gathered and another instance of K-Means is performed on them to determine the final centroids. To improve the scalability of this approach on a large number of processes, the procedure can be applied hierarchically via the parameter n_procs_to_merge.
- Variables:
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
init (str) – Method for initialization for local and global k-means:
‘k-means++’ : selects initial cluster centers for the clustering in a smart way to speed up convergence [2].
‘random’: choose k observations (rows) at random from data for the initial centroids. (Not implemented yet)
max_iter (int) – Maximum number of iterations of the local/global k-means algorithms.
tol (float) – Relative tolerance with regards to inertia to declare convergence, both for local and global k-means.
random_state (int) – Determines random number generation for centroid initialization.
n_procs_to_merge (int) – Number of processes to merge after each iteration of the local k-means. If None, all processes are merged after each iteration.
References
[1] Rasim M. Alguliyev, Ramiz M. Aliguliyev, Lyudmila V. Sukhostat, Parallel batch k-means for Big data clustering, Computers & Industrial Engineering, Volume 152 (2021). https://doi.org/10.1016/j.cie.2020.107023.
[2] Arthur, D. and Vassilvitskii, S. (2007), k-means++: The Advantages of Careful Seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’07), 1027–1035.
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroids of the clustering algorithm to fit the data x.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features). It must hold x.split=0.
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to. In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
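A minimal usage sketch (sizes are illustrative; the input must be split along the sample axis):
>>> import heat as ht
>>> x = ht.random.randn(10000, 3, split=0)
>>> bpkm = ht.cluster.BatchParallelKMeans(n_clusters=4, init="k-means++", random_state=0)
>>> bpkm.fit(x)
>>> labels = bpkm.predict(x)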
- class BatchParallelKMedians(n_clusters: int = 8, init: str = 'k-medians++', max_iter: int = 300, tol: float = 0.0001, random_state: int = None, n_procs_to_merge: int = None)
Bases: _BatchParallelKCluster
Batch-parallel K-Medians clustering algorithm, in analogy to the K-Means algorithm from Ref. [1]. The input must be a DNDarray of shape (n_samples, n_features) with split=0, i.e. split along the sample axis. The idea of the method is to perform classical K-Medians on each batch of data (i.e. each process-local chunk) individually and in parallel. Afterwards, all centroids from the local K-Medians runs are gathered and another instance of K-Medians is performed on them to determine the final centroids. To improve the scalability of this approach on a large number of processes, the procedure can be applied hierarchically via the parameter n_procs_to_merge.
- Variables:
n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
init (str) – Method for initialization for local and global k-medians:
‘k-medians++’ : selects initial cluster centers for the clustering in a smart way to speed up convergence [2].
‘random’: choose k observations (rows) at random from data for the initial centroids. (Not implemented yet)
max_iter (int) – Maximum number of iterations of the local/global k-Medians algorithms.
tol (float) – Relative tolerance with regards to inertia to declare convergence, both for local and global k-Medians.
random_state (int) – Determines random number generation for centroid initialization.
n_procs_to_merge (int) – Number of processes to merge after each iteration of the local k-Medians. If None, all processes are merged after each iteration.
References
[1] Rasim M. Alguliyev, Ramiz M. Aliguliyev, Lyudmila V. Sukhostat, Parallel batch k-means for Big data clustering, Computers & Industrial Engineering, Volume 152 (2021). https://doi.org/10.1016/j.cie.2020.107023.
[2] Arthur, D. and Vassilvitskii, S. (2007), k-means++: The Advantages of Careful Seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’07), 1027–1035.
- fit(x: heat.core.dndarray.DNDarray)
Computes the centroids of the clustering algorithm to fit the data x.
- Parameters:
x (DNDarray) – Training instances to cluster. Shape = (n_samples, n_features). It must hold x.split=0.
- predict(x: heat.core.dndarray.DNDarray)
Predict the closest cluster each sample in x belongs to. In the vector quantization literature, cluster_centers_ is called the code book, and each value returned by predict is the index of the closest code in the code book.
- Parameters:
x (DNDarray) – New data to predict. Shape = (n_samples, n_features)
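A minimal usage sketch analogous to BatchParallelKMeans (sizes are illustrative; n_procs_to_merge=2 merges local centroids pairwise, i.e. hierarchically, instead of all at once):
>>> import heat as ht
>>> x = ht.random.randn(10000, 3, split=0)
>>> bpkmed = ht.cluster.BatchParallelKMedians(n_clusters=4, n_procs_to_merge=2)
>>> bpkmed.fit(x)
>>> labels = bpkmed.predict(x)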