heat.naive_bayes

Adds the Gaussian Naive Bayes (GaussianNB) classifier to the ht.naive_bayes namespace

Submodules

Package Contents

class DNDarray(array: torch.Tensor, gshape: Tuple[int, Ellipsis], dtype: heat.core.types.datatype, split: int | None, device: heat.core.devices.Device, comm: Communication, balanced: bool)

Distributed N-Dimensional array. The core element of HeAT. It is composed of PyTorch tensors local to each process.

Parameters:
  • array (torch.Tensor) – Local array elements

  • gshape (Tuple[int,...]) – The global shape of the array

  • dtype (datatype) – The datatype of the array

  • split (int or None) – The axis on which the array is divided between processes

  • device (Device) – The device on which the local arrays are allocated (CPU or GPU)

  • comm (Communication) – The communications object for sending and receiving data

  • balanced (bool or None) – Describes whether the data are evenly distributed across processes. If this information is not available (self.balanced is None), it can be gathered via the is_balanced() method (requires communication).

__prephalo(start, end) torch.Tensor

Extracts the halo indexed by start, end from self.array in the direction of self.split

Parameters:
  • start (int) – Start index of the halo extracted from self.array

  • end (int) – End index of the halo extracted from self.array

get_halo(halo_size: int) torch.Tensor

Fetch halos of size halo_size from neighboring ranks and save them in self.halo_next/self.halo_prev.

Parameters:

halo_size (int) – Size of the halo.

__cat_halo() torch.Tensor

Return local array concatenated to halos if they are available.
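A minimal halo sketch, assuming two or more MPI processes; array_with_halos is assumed here to be the public property wrapping __cat_halo():

>>> import heat as ht
>>> a = ht.arange(12, split=0)  # e.g. on 3 processes: ranks hold [0..3], [4..7], [8..11]
>>> a.get_halo(1)  # each rank fetches one boundary element from its neighbors
>>> local = a.array_with_halos  # process-local torch.Tensor including the fetched halos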

__array__() numpy.ndarray

Returns a view of the process-local slice of the DNDarray as a numpy ndarray, if the DNDarray resides on CPU. Otherwise, it returns a CPU copy of the process-local slice as a numpy ndarray.
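Because __array__() implements NumPy's conversion protocol, the process-local slice can be handed to NumPy directly; a minimal sketch:

>>> import numpy as np
>>> import heat as ht
>>> a = ht.arange(6, split=0)
>>> np.asarray(a)  # view of the process-local slice on CPU; a CPU copy if a lives on GPU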

astype(dtype, copy=True) DNDarray

Returns a casted version of this array: a new array of the same shape but with the given datatype. If copy is False, the cast is performed in-place and this array is returned instead.

Parameters:
  • dtype (datatype) – Heat type to which the array is cast

  • copy (bool, optional) – By default the operation returns a copy of this array. If copy is set to False the cast is performed in-place and this array is returned
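A short sketch of both copy modes:

>>> import heat as ht
>>> a = ht.arange(5, split=0)  # integer DNDarray
>>> b = a.astype(ht.float64)  # copy=True (default): returns a new, casted array
>>> a.astype(ht.float32, copy=False)  # in-place cast: returns this array itself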

balance_() DNDarray

Function for balancing a DNDarray between all nodes. To determine whether this is needed, use is_balanced(). If the DNDarray is already balanced, this function does nothing. The balancing is performed in place on the DNDarray itself.

Examples

>>> a = ht.zeros((10, 2), split=0)
>>> a[:, 0] = ht.arange(10)
>>> b = a[3:]
>>> b
[0/2] tensor([[3., 0.]])
[1/2] tensor([[4., 0.],
              [5., 0.],
              [6., 0.]])
[2/2] tensor([[7., 0.],
              [8., 0.],
              [9., 0.]])
>>> print(b.gshape, b.lshape)
[0/2] (7, 2) (1, 2)
[1/2] (7, 2) (3, 2)
[2/2] (7, 2) (3, 2)
>>> b.balance_()
>>> b
[0/2] tensor([[3., 0.],
              [4., 0.],
              [5., 0.]])
[1/2] tensor([[6., 0.],
              [7., 0.]])
[2/2] tensor([[8., 0.],
              [9., 0.]])
>>> print(b.gshape, b.lshape)
[0/2] (7, 2) (3, 2)
[1/2] (7, 2) (2, 2)
[2/2] (7, 2) (2, 2)
__bool__() bool

Boolean scalar casting.

__cast(cast_function) float | int

Implements a generic cast function for DNDarray objects.

Parameters:

cast_function (function) – The actual cast function, e.g. float or int

Raises:

TypeError – If the DNDarray object cannot be converted into a scalar.

collect_(target_rank: int | None = 0) None

A method that collects a distributed DNDarray on a single MPI rank, specified by target_rank. It is a special case of the redistribute_ method.

Parameters:

target_rank (int, optional) – The rank to which the DNDarray will be collected. Default: 0.

Raises:
  • TypeError – If the target rank is not an integer.

  • ValueError – If the target rank is out of bounds.

Examples

>>> st = ht.ones((50, 81, 67), split=2)
>>> print(st.lshape)
[0/2] (50, 81, 23)
[1/2] (50, 81, 22)
[2/2] (50, 81, 22)
>>> st.collect_()
>>> print(st.lshape)
[0/2] (50, 81, 67)
[1/2] (50, 81, 0)
[2/2] (50, 81, 0)
>>> st.collect_(1)
>>> print(st.lshape)
[0/2] (50, 81, 0)
[1/2] (50, 81, 67)
[2/2] (50, 81, 0)
__complex__() complex

Complex scalar casting.

counts_displs() Tuple[Tuple[int], Tuple[int]]

Returns actual counts (number of items per process) and displacements (offsets) of the DNDarray. Does not assume load balance.
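For illustration (the concrete numbers assume three processes and depend on the actual distribution):

>>> a = ht.arange(10, split=0)
>>> a.counts_displs()
((4, 3, 3), (0, 4, 7))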

cpu() DNDarray

Returns a copy of this object in main memory. If this object is already in main memory, then no copy is performed and the original object is returned.

create_lshape_map(force_check: bool = False) torch.Tensor

Generate a ‘map’ of the lshapes of the data on all processes. Units are (process rank, lshape)

Parameters:

force_check (bool, optional) – If False (default) and the lshape map has already been created, the previous result is reused; otherwise the lshape map is (re)created.
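A brief sketch (the local shapes shown assume three processes):

>>> a = ht.zeros((10, 4), split=0)
>>> a.create_lshape_map()
tensor([[4, 4],
        [3, 4],
        [3, 4]])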

create_partition_interface()

Create a partition interface in line with the DPPY proposal. This is subject to change. The intention is to facilitate the use of a general format for referencing distributed datasets.

An example of the output and shape is shown below.

__partitioned__ = {
    'shape': (27, 3, 2),
    'partition_tiling': (4, 1, 1),
    'partitions': {
        (0, 0, 0): {
            'start': (0, 0, 0),
            'shape': (7, 3, 2),
            'data': tensor([...], dtype=torch.int32),
            'location': [0],
            'dtype': torch.int32,
            'device': 'cpu'
        },
        (1, 0, 0): {
            'start': (7, 0, 0),
            'shape': (7, 3, 2),
            'data': None,
            'location': [1],
            'dtype': torch.int32,
            'device': 'cpu'
        },
        (2, 0, 0): {
            'start': (14, 0, 0),
            'shape': (7, 3, 2),
            'data': None,
            'location': [2],
            'dtype': torch.int32,
            'device': 'cpu'
        },
        (3, 0, 0): {
            'start': (21, 0, 0),
            'shape': (6, 3, 2),
            'data': None,
            'location': [3],
            'dtype': torch.int32,
            'device': 'cpu'
        },
    },
    'locals': [(rank, 0, 0)],
    'get': lambda x: x,
}

Return type:

dictionary containing the partition interface as shown above.

__float__() float

Float scalar casting.

See also

flatten()

fill_diagonal(value: float) DNDarray

Fill the main diagonal of a 2D DNDarray. This function modifies the input tensor in-place, and returns the input array.

Parameters:

value (float) – The value to be placed on the DNDarray's main diagonal
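A minimal sketch (the repr shown assumes the default float32 dtype on CPU):

>>> a = ht.zeros((3, 3), split=0)
>>> a.fill_diagonal(1.0)  # in place; the modified array is also returned
DNDarray([[1., 0., 0.],
          [0., 1., 0.],
          [0., 0., 1.]], dtype=ht.float32, device=cpu:0, split=0)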

__getitem__(key: int | Tuple[int, Ellipsis] | List[int, Ellipsis]) DNDarray

Global getter function for DNDarrays. Returns a new DNDarray composed of the elements of the original tensor selected by the given indices. This does NOT redistribute or rebalance the resulting tensor. If the selection of values is unbalanced, the resulting tensor is also unbalanced! To redistribute the DNDarray, use balance_() (issue #187)

Parameters:

key (int, slice, Tuple[int,...], List[int,...]) – Indices to get from the tensor.

Examples

>>> a = ht.arange(10, split=0)
>>> a
(1/2) tensor([0, 1, 2, 3, 4], dtype=torch.int32)
(2/2) tensor([5, 6, 7, 8, 9], dtype=torch.int32)
>>> a[1:6]
(1/2) tensor([1, 2, 3, 4], dtype=torch.int32)
(2/2) tensor([5], dtype=torch.int32)
>>> a = ht.zeros((4, 5), split=0)
>>> a
(1/2) tensor([[0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.]])
(2/2) tensor([[0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.]])
>>> a[1:4, 1]
(1/2) tensor([0.])
(2/2) tensor([0., 0.])
__int__() int

Integer scalar casting.

is_balanced(force_check: bool = False) bool

Determine whether self is distributed evenly (or as evenly as possible) across all processes. This is equivalent to returning self.balanced. If no information is available (self.balanced = None), the balanced status will be assessed via collective communication.

Parameters:

force_check (bool, optional) – If True, the balanced status of the DNDarray will be assessed via collective communication in any case.

is_distributed() bool

Determines whether the data of this DNDarray is distributed across multiple processes.
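A combined sketch of both queries (assuming three processes):

>>> a = ht.zeros((10, 2), split=0)
>>> a.is_distributed()  # True whenever the data is split across more than one process
True
>>> b = a[3:]  # slicing can leave the result unbalanced
>>> b.is_balanced()  # local row counts are 1, 3, 3
False
>>> b.balance_()
>>> b.is_balanced(force_check=True)  # force a fresh, communicated check
True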

__key_is_singular(key: any, axis: int, self_proxy: torch.Tensor) bool
__key_adds_dimension(key: any, axis: int, self_proxy: torch.Tensor) bool
item()

Returns the only element of a 1-element DNDarray. Mirror of the PyTorch method of the same name. If the DNDarray has more than one element, a ValueError is raised (by PyTorch)

Examples

>>> import heat as ht
>>> x = ht.zeros((1))
>>> x.item()
0.0
__len__() int

The length of the DNDarray, i.e. the number of items in the first dimension.

numpy() numpy.ndarray

Returns a copy of the DNDarray as numpy ndarray. If the DNDarray resides on the GPU, the underlying data will be copied to the CPU first.

If the DNDarray is distributed, an MPI Allgather operation will be performed before converting to np.ndarray, i.e. each MPI process will end up holding a copy of the entire array in memory. Make sure process memory is sufficient!

Examples

>>> import heat as ht
>>> T1 = ht.random.randn(10, 8)
>>> T1.numpy()
__repr__() str

Computes a printable representation of the passed DNDarray.

ravel()

Flattens the DNDarray.

See also

ravel()

Examples

>>> a = ht.ones((2,3), split=0)
>>> b = a.ravel()
>>> a[0,0] = 4
>>> b
DNDarray([4., 1., 1., 1., 1., 1.], dtype=ht.float32, device=cpu:0, split=0)
redistribute_(lshape_map: torch.Tensor | None = None, target_map: torch.Tensor | None = None)

Redistributes the data of the DNDarray along the split axis to match the given target map. This function does not modify the non-split dimensions of the DNDarray. This is an abstraction and extension of the balance function.

Parameters:
  • lshape_map (torch.Tensor, optional) – The current lshape of processes. Units are [rank, lshape].

  • target_map (torch.Tensor, optional) – The desired distribution across the processes. Units are [rank, target lshape]. Note: the only important parts of the target map are the values along the split axis, values which are not along this axis are there to mimic the shape of the lshape_map.

Examples

>>> st = ht.ones((50, 81, 67), split=2)
>>> target_map = torch.zeros((st.comm.size, 3), dtype=torch.int64)
>>> target_map[0, 2] = 67
>>> print(target_map)
[0/2] tensor([[ 0,  0, 67],
[0/2]         [ 0,  0,  0],
[0/2]         [ 0,  0,  0]])
[1/2] tensor([[ 0,  0, 67],
[1/2]         [ 0,  0,  0],
[1/2]         [ 0,  0,  0]])
[2/2] tensor([[ 0,  0, 67],
[2/2]         [ 0,  0,  0],
[2/2]         [ 0,  0,  0]])
>>> print(st.lshape)
[0/2] (50, 81, 23)
[1/2] (50, 81, 22)
[2/2] (50, 81, 22)
>>> st.redistribute_(target_map=target_map)
>>> print(st.lshape)
[0/2] (50, 81, 67)
[1/2] (50, 81, 0)
[2/2] (50, 81, 0)
__redistribute_shuffle(snd_pr: int | torch.Tensor, send_amt: int | torch.Tensor, rcv_pr: int | torch.Tensor, snd_dtype: torch.dtype)

Helper that abstracts the shuffling of data between processes along the split axis during redistribute_

Parameters:
  • snd_pr (int or torch.Tensor) – Sending process

  • send_amt (int or torch.Tensor) – Amount of data to be sent by the sending process

  • rcv_pr (int or torch.Tensor) – Receiving process

  • snd_dtype (torch.dtype) – Torch type of the data in question

resplit_(axis: int = None)

In-place option for resplitting a DNDarray.

Parameters:

axis (int or None) – The new split axis; None denotes gathering, an int sets the new split axis

Examples

>>> a = ht.zeros((4, 5,), split=0)
>>> a.lshape
(0/2) (2, 5)
(1/2) (2, 5)
>>> a.resplit_(None)
>>> a.split
None
>>> a.lshape
(0/2) (4, 5)
(1/2) (4, 5)
>>> a = ht.zeros((4, 5,), split=0)
>>> a.lshape
(0/2) (2, 5)
(1/2) (2, 5)
>>> a.resplit_(1)
>>> a.split
1
>>> a.lshape
(0/2) (4, 3)
(1/2) (4, 2)
__setitem__(key: int | Tuple[int, Ellipsis] | List[int, Ellipsis], value: float | DNDarray | torch.Tensor)

Global item setter

Parameters:
  • key (Union[int, Tuple[int,...], List[int,...]]) – Index/indices to be set

  • value (Union[float, DNDarray,torch.Tensor]) – Value to be set to the specified positions in the DNDarray (self)

Notes

If a DNDarray is given as the value to be set, the split axes are assumed to be equal. If they are not, PyTorch will raise an error when the values are set on the local array.

Examples

>>> a = ht.zeros((4, 5), split=0)
>>> a
(1/2) tensor([[0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.]])
(2/2) tensor([[0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 0.]])
>>> a[1:4, 1] = 1
>>> a
(1/2) tensor([[0., 0., 0., 0., 0.],
              [0., 1., 0., 0., 0.]])
(2/2) tensor([[0., 1., 0., 0., 0.],
              [0., 1., 0., 0., 0.]])
__setter(key: int | Tuple[int, Ellipsis] | List[int, Ellipsis], value: float | DNDarray | torch.Tensor)

Utility function for checking value and forwarding to :func:__setitem__

Raises:

NotImplementedError – If the type of value is not supported

__str__() str

Computes a string representation of the passed DNDarray.

tolist(keepsplit: bool = False) List

Return a copy of the local array data as a (nested) Python list. For scalars, a standard Python number is returned.

Parameters:

keepsplit (bool) – Whether the returned list contains only the process-local data (True) or the entire, gathered array (False).

Examples

>>> a = ht.array([[0,1],[2,3]])
>>> a.tolist()
[[0, 1], [2, 3]]
>>> a = ht.array([[0,1],[2,3]], split=0)
>>> a.tolist()
[[0, 1], [2, 3]]
>>> a = ht.array([[0,1],[2,3]], split=1)
>>> a.tolist(keepsplit=True)
(1/2) [[0], [2]]
(2/2) [[1], [3]]
__torch_proxy__() torch.Tensor

Return a 1-element torch.Tensor strided as the global self shape. Used internally for sanitization purposes.

__xitem_get_key_start_stop(rank: int, actives: list, key_st: int, key_sp: int, step: int, ends: torch.Tensor, og_key_st: int) Tuple[int, int]
class GaussianNB(priors=None, var_smoothing=1e-09)

Bases: heat.ClassificationMixin, heat.BaseEstimator

Gaussian Naive Bayes (GaussianNB), based on sklearn.naive_bayes.GaussianNB. Can perform online updates to model parameters via the partial_fit() method. For details on the algorithm used to update feature means and variances online, see Chan, Golub, and LeVeque 1983 [1].

Parameters:
  • priors (DNDarray) – Prior probabilities of the classes, with shape (n_classes,). If specified, the priors are not adjusted according to the data.

  • var_smoothing (float, optional) – Portion of the largest variance of all features that is added to variances for calculation stability.

Variables:
  • class_count (DNDarray) – Number of training samples observed in each class. Shape = (n_classes,)

  • class_prior (DNDarray) – Probability of each class. Shape = (n_classes,)

  • classes (DNDarray) – Class labels known to the classifier. Shape = (n_classes,)

  • epsilon (float) – Absolute additive value to variances

  • sigma (DNDarray) – Variance of each feature per class. Shape = (n_classes, n_features)

  • theta (DNDarray) – Mean of each feature per class. Shape = (n_classes, n_features)

References

[1] Chan, Tony F., Golub, Gene H., and Leveque, Randall J., “Algorithms for Computing the Sample Variance: Analysis and Recommendations”, The American Statistician, 37:3, pp. 242-247, 1983

Examples

>>> import heat as ht
>>> X = ht.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=ht.float32)
>>> Y = ht.array([1, 1, 1, 2, 2, 2])
>>> from heat.naive_bayes import GaussianNB
>>> clf = GaussianNB()
>>> clf.fit(X, Y)
<heat.naive_bayes.gaussianNB.GaussianNB object at 0x1a249f6dd8>
>>> print(clf.predict(ht.array([[-0.8, -1]])))
tensor([1])
>>> clf_pf = GaussianNB()
>>> clf_pf.partial_fit(X, Y, ht.unique(Y, sorted=True))
<heat.naive_bayes.gaussianNB.GaussianNB object at 0x1a249fbe10>
>>> print(clf_pf.predict(ht.array([[-0.8, -1]])))
tensor([1])
fit(x: heat.core.dndarray.DNDarray, y: heat.core.dndarray.DNDarray, sample_weight: heat.core.dndarray.DNDarray | None = None)

Fit the Gaussian Naive Bayes classifier according to x and y.

Parameters:
  • x (DNDarray) – Training set, where n_samples is the number of samples and n_features is the number of features. Shape = (n_samples, n_features)

  • y (DNDarray) – Labels for training set. Shape = (n_samples, )

  • sample_weight (DNDarray, optional) – Weights applied to individual samples (1. for unweighted). Shape = (n_samples, )
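A sketch of weighted fitting, reusing X and Y from the class example above; the weight values are made up for illustration:

>>> w = ht.array([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])  # hypothetical per-sample weights
>>> clf = GaussianNB()
>>> clf.fit(X, Y, sample_weight=w)  # samples of class 2 now count twice as much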

__check_partial_fit_first_call(classes: heat.core.dndarray.DNDarray | None = None) bool

Adapted to HeAT from scikit-learn.

This function returns True if it detects that this was the first call to partial_fit() on GaussianNB. In that case the classes_ attribute is also set on GaussianNB.

__update_mean_variance(n_past: int, mu: heat.core.dndarray.DNDarray, var: heat.core.dndarray.DNDarray, x: heat.core.dndarray.DNDarray, sample_weight: heat.core.dndarray.DNDarray | None = None) Tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]

Adapted to HeAT from scikit-learn. Compute online update of Gaussian mean and variance. Given starting sample count, mean, and variance, a new set of points x, and optionally sample weights, return the updated mean and variance. (NB - each dimension (column) in x is treated as independent – you get variance, not covariance). Can take scalar mean and variance, or vector mean and variance to simultaneously update a number of independent Gaussians. See Chan, Golub, and LeVeque 1983 [1]

Parameters:
  • n_past (int) – Number of samples represented in old mean and variance. If sample weights were given, this should contain the sum of sample weights represented in old mean and variance.

  • mu (DNDarray) – Means for Gaussians in original set. Shape = (number of Gaussians,)

  • var (DNDarray) – Variances for Gaussians in original set. Shape = (number of Gaussians,)

  • x (DNDarray) – Input data

  • sample_weight (DNDarray, optional) – Weights applied to individual samples (1. for unweighted). Shape = (n_samples,)

References

[1] Chan, Tony F., Golub, Gene H., and Leveque, Randall J., “Algorithms for Computing the Sample Variance: Analysis and Recommendations”, The American Statistician, 37:3, pp. 242-247, 1983
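For reference, a NumPy sketch of the unweighted combination step from [1]; the actual method operates on DNDarrays and additionally handles sample weights:

import numpy as np

def combine_mean_var(n_past, mu, var, x):
    # merge existing per-feature mean/variance with a new batch x (Chan et al., 1983)
    n_new = x.shape[0]
    mu_new, var_new = x.mean(axis=0), x.var(axis=0)
    n_total = n_past + n_new
    total_mu = (n_past * mu + n_new * mu_new) / n_total
    # combine the sums of squared differences, correcting for the shift between the means
    ssd = n_past * var + n_new * var_new + (n_past * n_new / n_total) * (mu - mu_new) ** 2
    return total_mu, ssd / n_total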

partial_fit(x: heat.core.dndarray.DNDarray, y: heat.core.dndarray.DNDarray, classes: heat.core.dndarray.DNDarray | None = None, sample_weight: heat.core.dndarray.DNDarray | None = None)

Adapted to HeAT from scikit-learn. Incremental fit on a batch of samples. This method is expected to be called several times consecutively on different chunks of a dataset so as to implement out-of-core or online learning. This is especially useful when the whole dataset is too big to fit in memory at once. This method has some performance and numerical stability overhead, hence it is better to call partial_fit() on chunks of data that are as large as possible (as long as they fit in the memory budget) to hide the overhead. A minimal streaming sketch follows the parameter list below.

Parameters:
  • x (DNDarray) – Training set, where n_samples is the number of samples and n_features is the number of features. Shape = (n_samples, n_features)

  • y (DNDarray) – Labels for training set. Shape = (n_samples,)

  • classes (DNDarray, optional) – List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit(), can be omitted in subsequent calls. Shape = (n_classes,)

  • sample_weight (DNDarray, optional) – Weights applied to individual samples (1. for unweighted). Shape = (n_samples,)
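A minimal streaming sketch; X1, Y1, X2, Y2 stand in for hypothetical data chunks:

>>> clf = GaussianNB()
>>> clf.partial_fit(X1, Y1, classes=ht.array([1, 2]))  # first call: classes required
>>> clf.partial_fit(X2, Y2)  # subsequent chunks: classes may be omitted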

__partial_fit(x: heat.core.dndarray.DNDarray, y: heat.core.dndarray.DNDarray, classes: heat.core.dndarray.DNDarray | None = None, _refit: bool = False, sample_weight: heat.core.dndarray.DNDarray | None = None)

Actual implementation of Gaussian NB fitting. Adapted to HeAT from scikit-learn.

Parameters:
  • x (DNDarray) – Training set, where n_samples is the number of samples and n_features is the number of features. Shape = (n_samples, n_features)

  • y (DNDarray) – Labels for training set. Shape = (n_samples,)

  • classes (DNDarray, optional) – List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit(), can be omitted in subsequent calls. Shape = (n_classes,)

  • _refit (bool, optional) – If True, act as though this were the first time __partial_fit() is called (i.e., throw away any past fitting and start over).

  • sample_weight (DNDarray, optional) – Weights applied to individual samples (1. for unweighted). Shape = (n_samples,)

__joint_log_likelihood(x: heat.core.dndarray.DNDarray) heat.core.dndarray.DNDarray

Adapted to HeAT from scikit-learn. Calculates the joint log-likelihood of n_samples being assigned to each class. Returns a DNDarray of shape (n_samples, n_classes).
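Conceptually, entry (i, c) is the log prior of class c plus the summed per-feature Gaussian log-densities; a NumPy sketch for a single sample, with class_prior, theta, and sigma as defined in the class docstring:

import numpy as np

def joint_log_likelihood(x, class_prior, theta, sigma):
    # log P(c) + sum_j log N(x_j | theta[c, j], sigma[c, j]) for every class c
    log_prior = np.log(class_prior)  # shape: (n_classes,)
    log_gauss = -0.5 * np.sum(
        np.log(2.0 * np.pi * sigma) + (x - theta) ** 2 / sigma, axis=1
    )  # shape: (n_classes,)
    return log_prior + log_gauss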

logsumexp(a: heat.core.dndarray.DNDarray, axis: int | Tuple[int, Ellipsis] | None = None, b: heat.core.dndarray.DNDarray | None = None, keepdims: bool = False, return_sign: bool = False) heat.core.dndarray.DNDarray

Adapted to HeAT from scikit-learn. Compute the log of the sum of exponentials of the input elements. The result, np.log(np.sum(np.exp(a))), is calculated in a numerically more stable way. If b is given, then np.log(np.sum(b*np.exp(a))) is returned.

Parameters:
  • a (DNDarray) – Input array.

  • axis (None or int or Tuple [int,...], optional) – Axis or axes over which the sum is taken. By default axis is None, and all elements are summed.

  • keepdims (bool, optional) – If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the original array.

  • b (DNDarray, optional) – Scaling factor for exp(a); must be of the same shape as a or broadcastable to it. These values may be negative in order to implement subtraction.

  • return_sign (bool, optional) – If this is set to True, the result will be a pair containing sign information; if False, results that are negative will be returned as NaN. #TODO: returns NotImplementedYet error.

  • sgn (DNDarray, NOT IMPLEMENTED YET) – #TODO If return_sign is True, this will be an array of floating-point numbers matching res and +1, 0, or -1 depending on the sign of the result. If False, only one result is returned.
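A sketch of why the stable evaluation matters (the max-shift trick); the import path is assumed from this module's layout:

>>> import heat as ht
>>> from heat.naive_bayes.gaussianNB import logsumexp  # assumed import path
>>> a = ht.array([1000.0, 1000.0])
>>> ht.log(ht.sum(ht.exp(a)))  # naive evaluation overflows to inf
>>> logsumexp(a)  # stable: max(a) + log(sum(exp(a - max(a)))) = ~1000.6931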

predict(x: heat.core.dndarray.DNDarray) heat.core.dndarray.DNDarray

Adapted to HeAT from scikit-learn. Perform classification on a tensor of test data x.

Parameters:

x (DNDarray) – Input data with shape (n_samples, n_features)

predict_log_proba(x: heat.core.dndarray.DNDarray) heat.core.dndarray.DNDarray

Adapted to HeAT from scikit-learn. Return log-probability estimates of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.

Parameters:

x (DNDarray) – Input data. Shape = (n_samples, n_features).

predict_proba(x: heat.core.dndarray.DNDarray) heat.core.dndarray.DNDarray

Adapted to HeAT from scikit-learn. Return probability estimates for the test tensor x of the samples for each class in the model. The columns correspond to the classes in sorted order, as they appear in the attribute classes_.

Parameters:

x (DNDarray) – Input data. Shape = (n_samples, n_features).
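Continuing the fitted clf from the class example above:

>>> clf.predict_proba(ht.array([[-0.8, -1.0]]))  # columns follow the sorted classes_; rows sum to 1
>>> clf.predict_log_proba(ht.array([[-0.8, -1.0]]))  # elementwise log of the above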