heat.utils.data

Adds data utility functions to the ht.utils.data namespace.

Submodules

Package Contents

class DataLoader(dataset: torch.utils.data.Dataset | heat.utils.data.partial_dataset.PartialH5Dataset, batch_size: int = 1, num_workers: int = 0, collate_fn: Callable = None, pin_memory: bool = False, drop_last: bool = False, timeout: int | float = 0, worker_init_fn: Callable = None)[source]

This class combines either a DNDarray or a torch Dataset with a sampler. It provides an iterable over the local dataset and shuffles the data once an iterator is exhausted. If a DNDarray is given, a Dataset() is created internally.

Currently, this supports only map-style datasets with single-process loading. It uses the random batch sampler. The rest of the DataLoader functionality described in torch.utils.data.dataloader applies.

Parameters:
  • dataset

    Dataset(), torch Dataset, heat.utils.data.partial_dataset.PartialH5Dataset() A torch dataset from which the data will be returned by the created iterator

  • batch_size

    int, optional How many samples per batch to load

    Default: 1

  • num_workers – int, optional How many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. Default: 0

  • collate_fn – callable, optional Merges a list of samples to form a mini-batch of torch.Tensor(s). Used when using batched loading from a map-style dataset. Default: None

  • pin_memory – bool, optional If True, the data loader will copy torch.Tensors into CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the memory pinning section of the torch.utils.data documentation. Default: False

  • drop_last – bool, optional Set to True to drop the last incomplete batch if the dataset size is not divisible by the batch size. If False and the size of the dataset is not divisible by the batch size, then the last batch will be smaller. Default: False

  • timeout – int or float, optional If positive, the timeout value for collecting a batch from workers. Should always be non-negative. Default: 0

  • worker_init_fn – callable, optional If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. Default: None

Variables:
  • dataset – The dataset created from the local data

  • DataLoader – The local DataLoader object. Used in the creation of the iterable and the length

  • _first_iter (bool) – Flag indicating if the iterator created is the first one. If it is not, then the data will be shuffled before the iterator is created

  • last_epoch (bool) – Flag indicating last epoch

dataset
DataLoader
_first_iter = True
last_epoch = False
__iter__() Iterator[source]

Generate a new iterator whose type depends on the type of the dataset. Returns a partial_dataset.PartialH5DataLoaderIter if the dataset is a partial_dataset.PartialH5Dataset, and the result of self._full_dataset_shuffle_iter() otherwise.
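The interplay between the _first_iter flag and shuffling can be pictured with a small stand-alone sketch. It uses no Heat or torch code: ToyLoader and its shuffle_calls counter are hypothetical names introduced only for illustration, and a local random shuffle stands in for the shuffle between processes.

```python
import random


class ToyLoader:
    """Hypothetical stand-in for the loader's first-iteration logic."""

    def __init__(self, items, seed=0):
        self.items = list(items)
        self._first_iter = True  # mirrors the documented flag
        self.shuffle_calls = 0   # illustration only, not a Heat attribute
        self._rng = random.Random(seed)

    def __iter__(self):
        # Assumption: only iterators created after the first one shuffle.
        if not self._first_iter:
            self._rng.shuffle(self.items)
            self.shuffle_calls += 1
        self._first_iter = False
        return iter(list(self.items))


loader = ToyLoader(range(6))
first_epoch = list(loader)   # original order, nothing shuffled yet
second_epoch = list(loader)  # data is shuffled before this iterator is built
```

In this sketch the first pass sees the data in its original order, and every later pass triggers exactly one shuffle.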

__len__() int[source]

Get the length of the dataloader. Returns the number of batches.
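As a rough illustration of what "number of batches" means, the sketch below computes the count under the usual torch.utils.data convention for batch_size and drop_last. The helper num_batches is a hypothetical name, and the assumption is that Heat's DataLoader follows the same convention for its local data.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Batch count under the usual torch.utils.data convention."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # The trailing incomplete batch is discarded.
        return num_samples // batch_size
    # The trailing incomplete batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


full = num_batches(10, 4)                     # batches of 4, 4 and 2 samples
dropped = num_batches(10, 4, drop_last=True)  # the batch of 2 is dropped
```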

_full_dataset_shuffle_iter()[source]
class Dataset(array, transforms: List | Callable | None = None, ishuffle: bool | None = False, test_set: bool | None = False)[source]

Bases: torch.utils.data.Dataset

An abstract class representing a given dataset. This inherits from torch.utils.data.Dataset.

This class is a general example of what should be done to create a Dataset. When creating a dataset, all of the standard attributes should be set, and the __getitem__, __len__, and shuffle functions must be defined.

  • __getitem__ : how an item is given to the network

  • __len__ : the number of data elements to be given to the network in total

  • Shuffle() : how the data should be shuffled between the processes. The function shown below is for a dataset composed of only data and without targets. The function dataset_shuffle() abstracts this. For this function only the dataset and a list of attributes to shuffle are given.

  • Ishuffle() : A non-blocking version of Shuffle(), this is handled in the abstract function dataset_ishuffle(). It works similarly to dataset_shuffle().

As the amount of data across processes can be non-uniform, the dataset class will slice off the remaining elements on whichever processes have more data than the others. This should only be 1 element. The shuffle function will shuffle all of the data on the process.
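The trimming step described above can be sketched without Heat. The helper cut_slice is a hypothetical name, plain lists stand in for the per-process tensors, and the use of floor division of the global length by the number of processes is an assumption about how the uniform length is chosen.

```python
def cut_slice(global_len: int, num_procs: int) -> slice:
    """Slice that keeps the same number of leading items on every process."""
    if num_procs <= 0:
        raise ValueError("num_procs must be positive")
    return slice(0, global_len // num_procs)


# 10 samples over 3 processes: local sizes 4, 3, 3 (one process holds an extra item)
local_chunks = [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
s = cut_slice(10, 3)
uniform = [chunk[s] for chunk in local_chunks]  # every chunk now has 3 items
```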

It is recommended that for DNDarrays, the split is either 0 or None.

Parameters:
  • array (DNDarray) – DNDarray for which to create the dataset

  • transforms (List | Callable, optional) – Transformation(s) to call before a data item is returned

  • ishuffle (bool, optional) – Flag indicating whether to use non-blocking communications for shuffling the data between epochs. Note: if True, the Ishuffle() function must be defined within the class. Default: False

Variables:
  • These are the required attributes.

  • htdata (DNDarray) – Full data

  • _cut_slice (slice) – Slice to cut off the last element to get a uniform amount of data on each process

  • comm (MPICommunicator) – Communication object used to send the data between processes

  • lcl_half (int) – Half of the number of data elements on the process

  • data (torch.Tensor) – The local data to be used in training

  • transforms (Callable) – Transform to be called during the getitem function

  • ishuffle (bool) – Flag indicating if non-blocking communications are used for shuffling the data between epochs

htdata
comm
test_set = False
lcl_half
_cut_slice
data
transforms = None
ishuffle = False
__getitem__(index: int | slice | tuple | list | torch.Tensor) torch.Tensor[source]

Basic form of __getitem__. As item access is often very specific to the dataset, this should be overridden by the user. In this form it only gets the raw items from the data.

__len__() int[source]

Get the number of items in the dataset. This should be overridden by custom datasets.

Shuffle()[source]

Send half of the local data to the process self.comm.rank + 1 if available, else wrap around. After receiving the new data, shuffle the local tensor.
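The exchange pattern can be simulated in a single process, which may help when reasoning about where samples end up. This is only a sketch: ring_shuffle is a hypothetical helper, lists stand in for the per-rank tensors, and the choice to send the first half of each local list is an assumption.

```python
import random


def ring_shuffle(ranks, seed=0):
    """Simulate one shuffle step over a list of per-rank sample lists."""
    rng = random.Random(seed)
    size = len(ranks)
    # Each rank splits its local data into a half to send and a half to keep.
    halves = [len(local) // 2 for local in ranks]
    outgoing = [local[:h] for local, h in zip(ranks, halves)]
    kept = [local[h:] for local, h in zip(ranks, halves)]
    result = []
    for rank in range(size):
        # Rank r receives from rank r - 1; rank 0 receives from the last rank.
        received = outgoing[(rank - 1) % size]
        merged = kept[rank] + received
        rng.shuffle(merged)  # local shuffle after the exchange
        result.append(merged)
    return result


before = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
after = ring_shuffle(before)
```

No sample is lost or duplicated in this scheme, and each rank keeps the same number of items when all ranks start with equal sizes.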

Ishuffle()[source]

Non-blocking version of Shuffle(). Send half of the local data to the process self.comm.rank + 1 if available, else wrap around. After receiving the new data, shuffle the local tensor.

dataset_shuffle(dataset: Dataset | torch.utils.data.Dataset, attrs: List[list])[source]

Shuffle the given attributes of a dataset across multiple processes. This will send half of the data to rank + 1. Once the new data is received, it will be shuffled into the existing data on the process. This function will be called by the DataLoader automatically if dataset.ishuffle = False. attrs should have the form [[torch.Tensor, DNDarray], ...], given as attribute names, e.g. [["data", "htdata"]]. All of the attrs are assumed to have the same dim0 shape as the local data.

Parameters:
  • dataset (Dataset) – the dataset to shuffle

  • attrs (List[List[str, str], ... ]) – List of lists, each of which contains 2 strings. The strings are the names of the Dataset attributes holding the local data and the global DNDarray it belongs to, e.g. [["data", "htdata"]] would shuffle the htdata around and set the correct amount of data for the dataset.data attribute. For multiple parameters, multiple lists are required, e.g. [["data", "htdata"], ["targets", "httargets"]].

Notes

dataset.comm must be defined for this function to work.
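Since attrs carries attribute names rather than the arrays themselves, the lookup can be pictured with getattr. The sketch below is illustrative only: ToyDataset and resolve_attrs are hypothetical, plain lists stand in for torch.Tensor and DNDarray objects, and the [local, global] ordering follows the [[torch.Tensor, DNDarray], ...] form stated for dataset_shuffle.

```python
class ToyDataset:
    """Hypothetical dataset; lists stand in for torch.Tensor / DNDarray."""

    def __init__(self):
        self.data, self.htdata = [1, 2], [1, 2, 3, 4]
        self.targets, self.httargets = [0, 1], [0, 1, 0, 1]


def resolve_attrs(dataset, attrs):
    """Return (local, global) objects for every [local_name, global_name] pair."""
    resolved = []
    for pair in attrs:
        if len(pair) != 2:
            raise ValueError("each entry must contain exactly two names")
        local_name, global_name = pair
        resolved.append((getattr(dataset, local_name), getattr(dataset, global_name)))
    return resolved


pairs = resolve_attrs(ToyDataset(), [["data", "htdata"], ["targets", "httargets"]])
```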

dataset_ishuffle(dataset: Dataset | torch.utils.data.Dataset, attrs: List[list])[source]

Shuffle the given attributes of a dataset across multiple processes, using non-blocking communications. This will send half of the data to rank + 1. The data must be received by the dataset_irecv() function.

This function will be called by the DataLoader automatically if dataset.ishuffle = True. This is set either in the definition of the class or at its initialization via a given parameter.

Parameters:
  • dataset (Dataset) – the dataset to shuffle

  • attrs (List[List[str, str], ... ]) – List of lists, each of which contains 2 strings. The strings are the handles corresponding to the Dataset attributes corresponding to the global data DNDarray and the local data of that array, e.g. [["htdata", "data"]] would shuffle the htdata around and set the correct amount of data for the dataset.data attribute. For multiple parameters, multiple lists are required, e.g. [["htdata", "data"], ["httargets", "targets"]].

Notes

dataset.comm must be defined for this function to work.

class DistributedDataset(dndarray: heat.core.dndarray.DNDarray, transforms: torchvision.transforms.Compose = None)[source]

Bases: torch.utils.data.Dataset

A DistributedDataset for usage in PyTorch. Saves the dndarray and the larray tensor. Uses the larray tensor for the distribution and for getting the items. Intended to be used with DistributedSampler.

dndarray
transforms = None
__len__() int[source]
__getitem__(index)[source]
__getitems__(indices)[source]
class DistributedSampler(dataset: DistributedDataset, shuffle: bool = False, seed: int | None = None, shuffle_type: Literal['global'] | Literal['local'] = 'global', correction: bool = False)[source]

Bases: torch.utils.data.Sampler

A DistributedSampler for usage in PyTorch with Heat arrays. Uses the nature of the Heat DNDarray to give the locally stored data on the larray. Shuffling is done by shuffling the indices. The given indices correspond to the index of the larray tensor. Works only with DNDarrays that are split on axis 0.
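The idea of shuffling indices rather than data can be sketched in a few lines. This is not the sampler's implementation: local_indices is a hypothetical helper, and Python's random module stands in for torch.randperm so that the example has no dependencies.

```python
import random


def local_indices(local_len, shuffle=False, seed=None):
    """Indices into the local tensor, optionally permuted."""
    indices = list(range(local_len))
    if shuffle:
        # A seeded generator makes the permutation reproducible.
        random.Random(seed).shuffle(indices)
    return indices


ordered = local_indices(5)
permuted = local_indices(5, shuffle=True, seed=42)
same_again = local_indices(5, shuffle=True, seed=42)
```

Two samplers that use the same seed produce the same permutation, which is what keeps data and labels aligned when they are sampled separately.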

dataset
dndarray
shuffle = False
linked_sampler = None
correction = False
_in_slice(idx: int, a_slice: slice) bool[source]

Check if the given index is inside the given slice

Parameters:
  • idx (int) – Index to check

  • a_slice (slice) – Slice to check

Returns:

Whether the index is in the slice

Return type:

bool
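A membership check of this kind might look like the sketch below. It is a stand-alone illustration, not the method's source: in_slice is a hypothetical free function, and it assumes the slice has explicit, non-negative start and stop values.

```python
def in_slice(idx: int, a_slice: slice) -> bool:
    """Return True if idx is one of the positions selected by a_slice."""
    # Assumes start and stop are set and non-negative.
    if idx < a_slice.start or idx >= a_slice.stop:
        return False
    step = a_slice.step if a_slice.step else 1
    # With a step, only every step-th position from start is selected.
    return (idx - a_slice.start) % step == 0


inside = in_slice(4, slice(2, 10, 2))
skipped = in_slice(5, slice(2, 10, 2))
```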

_shuffle() None[source]

Shuffles the given dndarray at creation across processes.

_alltoall_shuffle() None[source]
set_shuffle_type(shuffle_type: Literal['global'] | Literal['local']) None[source]

Sets the Shuffle type for the Sampler.

Parameters:

shuffle_type (Literal["global"] | Literal["local"]) –

  • Local Shuffle means the shuffle of the larray only.

  • Global Shuffle means the shuffle across all processes

Raises:
  • TypeError – Shuffle type needs to be a string

  • ValueError – Only Global/Local shuffle types exist
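The two documented error cases can be mirrored by a small validation helper. The function validate_shuffle_type below is hypothetical and only sketches the checks implied by the Raises section; the exact messages and whether the comparison is case-sensitive are assumptions.

```python
def validate_shuffle_type(shuffle_type):
    """Return the shuffle type unchanged, or raise for invalid input."""
    if not isinstance(shuffle_type, str):
        raise TypeError("shuffle_type must be a string")
    if shuffle_type not in ("global", "local"):
        raise ValueError("shuffle_type must be 'global' or 'local'")
    return shuffle_type


checked = validate_shuffle_type("local")
```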

set_seed(value: int | None) None[source]

Sets the seed for the torch.randperm

Parameters:

value (int) – seed to set

Links another DistributedSampler to this one so that the seed and shuffle_type are set on both automatically, rather than setting each one manually. Useful when one sampler contains the training data and the linked one the label data.

Removes an established link. For more information, see the link function.

__iter__() Iterator[int][source]
__len__() int[source]
create_train_val_split(X: heat.core.dndarray.DNDarray, y: heat.core.dndarray.DNDarray, p: float = 0.95, seed: int | None = None) tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray][source]

Shuffles the data and then creates the train val split.

Parameters:
  • X (DNDarray) – Training Data

  • y (DNDarray) – Training Labels

  • p (float, optional) – Fraction of the data that the training set should contain, by default 0.95

  • seed (int | None, optional) – Random Seed to be used, by default None

Returns:

Tuple of (train_arr, train_labels_arr, val_arr, val_labels_arr)

Return type:

tuple[DNDarray, DNDarray, DNDarray, DNDarray]
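The shuffle-then-split behaviour can be illustrated on plain lists. This sketch does not use DNDarrays: train_val_split is a hypothetical helper, and truncating p * n to an integer for the cut point is an assumption about rounding.

```python
import random


def train_val_split(X, y, p=0.95, seed=None):
    """Shuffle X and y with one permutation, then split at fraction p."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)  # same permutation for data and labels
    cut = int(p * len(order))           # assumption: truncate to an integer
    train_idx, val_idx = order[:cut], order[cut:]
    return (
        [X[i] for i in train_idx],
        [y[i] for i in train_idx],
        [X[i] for i in val_idx],
        [y[i] for i in val_idx],
    )


X = list(range(20))
y = [x * 10 for x in X]
train, train_labels, val, val_labels = train_val_split(X, y, p=0.75, seed=1)
```

Applying one permutation to both inputs keeps each sample paired with its label on either side of the split.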

class PartialH5Dataset(file: str, comm: heat.core.communication.MPICommunication = MPI_WORLD, dataset_names: str | List[str] = 'data', transforms: List[Callable] = None, use_gpu: bool = True, validate_set: bool = False, initial_load: int = 7000, load_length: int = 1000)[source]

Bases: torch.utils.data.Dataset

Create a Dataset object for a dataset which loads portions of data from an HDF5 file. Very similar to heat.utils.data.datatools.Dataset. This will create 2 threads: one for loading the data from the target file, and one for converting items before they are passed to the network. The conversion is done by the iterator. A portion of the data of length initial_load is loaded upon initialization; the rest of the data is loaded after the loaded data is returned by PartialH5DataLoaderIter. This iterator will be used by the HeAT heat.utils.data.datatools.DataLoader automatically with this type of dataset.
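One way to picture the partial loading is as a sequence of index windows over the local part of the file: one large initial window, then smaller refills. The sketch below is only a mental model under that assumption. load_windows is a hypothetical helper, and how the class actually handles a trailing remainder is not stated here.

```python
def load_windows(local_size, initial_load=7000, load_length=1000):
    """Return (start, end) index pairs: the initial load, then the refills."""
    if load_length <= 0:
        raise ValueError("load_length must be positive")
    first_end = min(initial_load, local_size)
    windows = [(0, first_end)]
    start = first_end
    while start < local_size:
        # Assumption: a final shorter window covers any remainder.
        end = min(start + load_length, local_size)
        windows.append((start, end))
        start = end
    return windows


windows = load_windows(local_size=9500)
```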

Notes

H5 datasets require the GIL to load data. This can be a bottleneck if data needs to be loaded multiple times (e.g. the case for using this dataset). It is recommended to find another way to preprocess the data and avoid using H5 files for this reason.

Parameters:
  • file (str) – H5 file to use

  • comm (MPICommunication) – Global MPI communicator generated by HeAT

  • dataset_names (Union[str, List[str]], optional) – Name/s of dataset/s to load from file. If a string is given, it will be the only dataset loaded. Default is “data”.

  • transforms (List[Callable], optional) – Transforms to apply to the data after it is gotten from the loaded data before it is used by the network. This should be a list of Callable torch functions for each item returned by the __getitem__ function of the individual dataset. If a list element is None then no transform will be applied to the corresponding element returned by __getitem__. I.e. if __getitem__ returns an image and a label then the list would look like this: transforms = [image_transforms, None]. If this is None, no transforms will be applied to any elements. Default is None.

  • use_gpu (bool, optional) – Use GPUs if available. Defaults to True.

  • validate_set (bool, optional) – Load the entire dataset onto each node upon initialization and skip loading in the iterator. This is typically needed for validation sets, when the network should be tested against the whole dataset. Default is False.

  • initial_load (int, optional) – How many elements to load from the file in the 0th dimension. Default is 7000 elements

  • load_length (int, optional) – How many elements to load from the file in the iterator. Default is 1000 elements
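The per-element behaviour of the transforms parameter described above, including None entries, can be sketched as follows. apply_transforms and scale are hypothetical helpers, and plain Python values stand in for the tensors a real __getitem__ would return.

```python
def apply_transforms(item, transforms=None):
    """Apply transforms[i] to item[i]; None entries leave the element unchanged."""
    if transforms is None:
        return tuple(item)  # no transforms at all
    if len(transforms) != len(item):
        raise ValueError("need one transform (or None) per returned element")
    return tuple(x if t is None else t(x) for x, t in zip(item, transforms))


def scale(image):
    """Hypothetical image transform: map 0..255 pixel values to 0..1."""
    return [px / 255 for px in image]


# __getitem__ is assumed to return (image, label); only the image is transformed.
image, label = apply_transforms(([0, 255], 3), [scale, None])
```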

ishuffle = False
file
comm
transforms
gpu
torch_device = 'cpu'
total_size
lcl_full_sz
local_data_start
local_data_end
loads_left = 0
load_start
load_end
dataset_names = 'data'
dataset_order = []
load_thread = None
epoch_end = False
loading_queue
loading_condition
convert_queue
used_indices = []
Shuffle()[source]

Send half of the local data to the process self.comm.rank + 1 if available, else wrap around. After receiving the new data, shuffle the local tensor.

Not implemented for partial dataset

Ishuffle()[source]

Send half of the local data to the process self.comm.rank + 1 if available, else wrap around. After receiving the new data, shuffle the local tensor.

Not implemented for partial dataset

__getitem__(index: int | slice | List[int] | torch.Tensor) torch.Tensor[source]

Abstract __getitem__ method. This should be defined by the user at runtime. This function needs to be designed such that the data is in the 0th dimension and the indexes called are only in the 0th dim!

__len__() int[source]

Get the total length of the dataset

thread_replace_converted_batches()[source]

Replace the elements of the dataset with newly loaded elements. PartialH5DataLoaderIter will put the used indices in the used_indices parameter. This object is reset to an empty list after these elements are overwritten with new data.

class PartialH5DataLoaderIter(loader)[source]

Bases: object

Iterator to be used with PartialH5Dataset. It closely mirrors the standard torch iterator while loading new data to replace the loaded batches automatically. It also pre-fetches the batches and begins their preparation, collation, and device setting in the background.

dataset
_dataset_kind
_IterableDataset_len_called
_auto_collation
_drop_last
_index_sampler
_num_workers
_pin_memory
_timeout
_collate_fn
_sampler_iter
_base_seed
_num_yielded = 0
batch_size
comm
_dataset_fetcher
__len__()[source]

Get the length of the iterator

_next_data()[source]
__next__()[source]

Get the next batch of data. Shamelessly taken from torch.

__iter__()[source]

Get a new iterator of this class

Return type:

PartialH5DataLoaderIter

__thread_convert_all(index_list)