:mod:`heat.utils.data.datatools`
================================

.. py:module:: heat.utils.data.datatools

.. autoapi-nested-parse::

   Functions and classes useful for loading data into neural networks



Module Contents
---------------

.. py:class:: DataLoader(dataset: Union[torch.utils.data.Dataset, heat.utils.data.partial_dataset.PartialH5Dataset], batch_size: int = 1, num_workers: int = 0, collate_fn: Callable = None, pin_memory: bool = False, drop_last: bool = False, timeout: Union[int, float] = 0, worker_init_fn: Callable = None)

   This class combines either a :func:`DNDarray <heat.core.dndarray.DNDarray>` or a torch :class:`torch.utils.data.Dataset` with a sampler. It provides an iterable over the local dataset and shuffles the data once the iterator is exhausted. If a :func:`DNDarray <heat.core.dndarray.DNDarray>` is given, then a :func:`Dataset` is created internally.

   Currently, only map-style datasets with single-process loading are supported. The random batch sampler is used. The rest of the ``DataLoader`` functionality described in ``torch.utils.data.dataloader`` applies.

   :param dataset: A torch dataset from which the data will be returned by the created iterator
   :type dataset: :func:`Dataset`, :class:`torch.utils.data.Dataset`, :func:`heat.utils.data.partial_dataset.PartialH5Dataset`
   :param batch_size: How many samples per batch to load. Default: 1
   :type batch_size: int, optional
   :param num_workers: How many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. Default: 0
   :type num_workers: int, optional
   :param collate_fn: Merges a list of samples to form a mini-batch of torch.Tensor(s). Used when using batched loading from a map-style dataset. Default: None
   :type collate_fn: callable, optional
   :param pin_memory: If ``True``, the data loader will copy torch.Tensors into CUDA pinned memory before returning them. If your data elements are a custom type, or your :attr:`collate_fn` returns a batch that is a custom type, see the memory pinning example in the PyTorch ``DataLoader`` documentation. Default: False
   :type pin_memory: bool, optional
   :param drop_last: Set to ``True`` to drop the last incomplete batch if the dataset size is not divisible by the batch size. If ``False`` and the size of the dataset is not divisible by the batch size, then the last batch will be smaller. Default: False
   :type drop_last: bool, optional
   :param timeout: If positive, the timeout value for collecting a batch from workers. Should always be non-negative. Default: 0
   :type timeout: int or float, optional
   :param worker_init_fn: If not ``None``, this will be called on each worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as input, after seeding and before data loading. Default: None
   :type worker_init_fn: callable, optional

   :ivar dataset: The dataset created from the local data
   :vartype dataset: :func:`Dataset`, :class:`torch.utils.data.Dataset`, :func:`heat.utils.data.partial_dataset.PartialH5Dataset`
   :ivar DataLoader: The local DataLoader object. Used in the creation of the iterable and the length
   :vartype DataLoader: ``torch.utils.data.dataloader``
   :ivar _first_iter: Flag indicating whether the created iterator is the first one. If it is not, the data is shuffled before the iterator is created
   :vartype _first_iter: bool
   :ivar last_epoch: Flag indicating last epoch
   :vartype last_epoch: bool

   .. attribute:: dataset

   .. attribute:: DataLoader

   .. attribute:: _first_iter
      :annotation: = True

   .. attribute:: last_epoch
      :annotation: = False

   .. role:: raw-html(raw)
       :format: html

   .. method:: __iter__() -> Iterator

      Generate a new iterator of a type dependent on the type of dataset. Returns a :class:`partial_dataset.PartialH5DataLoaderIter` if the dataset is a :class:`partial_dataset.PartialH5Dataset`, and the result of :func:`self._full_dataset_shuffle_iter` otherwise.

   .. method:: __len__() -> int

      Get the length of the dataloader. Returns the number of batches.

   .. method:: _full_dataset_shuffle_iter()


.. py:class:: Dataset(array, transforms: Optional[Union[List, Callable]] = None, ishuffle: Optional[bool] = False, test_set: Optional[bool] = False)

   Bases: :class:`torch.utils.data.Dataset`

   An abstract class representing a given dataset. This inherits from torch.utils.data.Dataset.

   This class is a general example of what should be done to create a Dataset. When creating a dataset, all of the standard attributes should be set, and the ``__getitem__``, ``__len__``, and ``Shuffle`` functions must be defined.

   - ``__getitem__`` : how an item is given to the network
   - ``__len__`` : the number of data elements to be given to the network in total
   - ``Shuffle()`` : how the data should be shuffled between the processes. The function shown below is for a dataset composed of only data and without targets. The function :func:`dataset_shuffle` abstracts this. For this function only the dataset and a list of attributes to shuffle are given.
   - ``Ishuffle()`` : A non-blocking version of ``Shuffle()``. This is handled in the abstract function :func:`dataset_ishuffle`. It works similarly to :func:`dataset_shuffle`.

   As the amount of data across processes can be non-uniform, the dataset class will slice off the remaining elements on whichever processes have more data than the others. This should only be 1 element. The shuffle function will shuffle all of the data on the process.
   It is recommended that for DNDarrays, the split is either 0 or ``None``.

   :param array: DNDarray for which to create the dataset
   :type array: DNDarray
   :param transforms: Transformation to call before a data item is returned
   :type transforms: Callable
   :param ishuffle: Flag indicating whether to use non-blocking communications for shuffling the data between epochs. Note: if ``True``, the ``Ishuffle()`` function must be defined within the class. Default: False
   :type ishuffle: bool, optional

   The following attributes are required:

   :ivar htdata: Full data
   :vartype htdata: DNDarray
   :ivar _cut_slice: Slice to cut off the last element to get a uniform amount of data on each process
   :vartype _cut_slice: slice
   :ivar comm: Communication object used to send the data between processes
   :vartype comm: MPICommunicator
   :ivar lcl_half: Half of the number of data elements on the process
   :vartype lcl_half: int
   :ivar data: The local data to be used in training
   :vartype data: torch.Tensor
   :ivar transforms: Transform to be called during the getitem function
   :vartype transforms: Callable
   :ivar ishuffle: Flag indicating if non-blocking communications are used for shuffling the data between epochs
   :vartype ishuffle: bool

   .. attribute:: htdata

   .. attribute:: comm

   .. attribute:: test_set
      :annotation: = False

   .. attribute:: lcl_half

   .. attribute:: _cut_slice

   .. attribute:: data

   .. attribute:: transforms
      :annotation: = None

   .. attribute:: ishuffle
      :annotation: = False

   .. role:: raw-html(raw)
       :format: html

   .. method:: __getitem__(index: Union[int, slice, tuple, list, torch.Tensor]) -> torch.Tensor

      Basic form of ``__getitem__``. As item access is often very specific to the dataset, this should be overridden by the user. In this form it only returns the raw items from the data.

   .. method:: __len__() -> int

      Get the number of items in the dataset. This should be overridden by custom datasets.

   .. method:: Shuffle()

      Send half of the local data to the process ``self.comm.rank + 1`` if available, else wrap around. After receiving the new data, shuffle the local tensor.

   .. method:: Ishuffle()

      Non-blocking counterpart of ``Shuffle()``. Send half of the local data to the process ``self.comm.rank + 1`` if available, else wrap around. After receiving the new data, shuffle the local tensor.


.. py:class:: DistributedDataset(dndarray: heat.core.dndarray.DNDarray, transforms: torchvision.transforms.Compose = None)

   Bases: :class:`torch.utils.data.Dataset`

   A DistributedDataset for usage in PyTorch. Saves the dndarray and the larray tensor. Uses the larray tensor for the distribution and for getting the items. Intended to be used with DistributedSampler.

   .. attribute:: dndarray

   .. attribute:: transforms
      :annotation: = None

   .. role:: raw-html(raw)
       :format: html

   .. method:: __len__() -> int

   .. method:: __getitem__(index)

   .. method:: __getitems__(indices)


.. py:class:: DistributedSampler(dataset: DistributedDataset, shuffle: bool = False, seed: Optional[int] = None, shuffle_type: Literal['global'] | Literal['local'] = 'global', correction: bool = False)

   Bases: :class:`torch.utils.data.Sampler`

   A DistributedSampler for usage in PyTorch with Heat arrays. Uses the nature of the Heat DNDarray to give the locally stored data on the larray. Shuffling is done by shuffling the indices. The given indices correspond to the indices of the larray tensor.

   Works only with DNDarrays that are split on axis 0.

   .. attribute:: dataset

   .. attribute:: dndarray

   .. attribute:: shuffle
      :annotation: = False

   .. attribute:: linked_sampler
      :annotation: = None

   .. attribute:: correction
      :annotation: = False

   .. role:: raw-html(raw)
       :format: html

   .. method:: _in_slice(idx: int, a_slice: slice) -> bool

      Check if the given index is inside the given slice

      :param idx: Index to check
      :type idx: int
      :param a_slice: Slice to check
      :type a_slice: slice

      :returns: Whether the index is in the slice
      :rtype: bool

   .. method:: _shuffle() -> None

      Shuffles the given dndarray at creation across processes.

   .. method:: _alltoall_shuffle() -> None

   .. method:: set_shuffle_type(shuffle_type: Literal['global'] | Literal['local']) -> None

      Sets the shuffle type for the sampler.

      :param shuffle_type: The shuffle type to use.

                           - Local shuffle means the shuffle of the larray only.
                           - Global shuffle means the shuffle across all processes.
      :type shuffle_type: Literal["global"] | Literal["local"]

      :raises TypeError: Shuffle type needs to be a string
      :raises ValueError: Only global/local shuffle types exist

   .. method:: set_seed(value: int | None) -> None

      Sets the seed for ``torch.randperm``

      :param value: Seed to set
      :type value: int | None

   .. method:: link(sampler: DistributedSampler) -> None

      Links another DistributedSampler to this one, so that the seed and shuffle_type of this sampler and the linked one are set automatically rather than manually and separately. Useful when one sampler contains the training data and the linked one the label data.

   .. method:: unlink() -> None

      Removes an established link. For more information, see :meth:`link`.

   .. method:: __iter__() -> Iterator[int]

   .. method:: __len__() -> int


.. function:: create_train_val_split(X: heat.core.dndarray.DNDarray, y: heat.core.dndarray.DNDarray, p: float = 0.95, seed: int | None = None) -> tuple[heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray, heat.core.dndarray.DNDarray]

   Shuffles the data and then creates the train/validation split.

   :param X: Training data
   :type X: DNDarray
   :param y: Training labels
   :type y: DNDarray
   :param p: Fraction of the data that the training set should contain, by default 0.95
   :type p: float, optional
   :param seed: Random seed to be used, by default None
   :type seed: int | None, optional

   :returns: Tuple of (train_arr, train_labels_arr, val_arr, val_labels_arr)
   :rtype: tuple[DNDarray, DNDarray, DNDarray, DNDarray]


.. function:: dataset_shuffle(dataset: Union[Dataset, torch.utils.data.Dataset], attrs: List[list])

   Shuffle the given attributes of a dataset across multiple processes. This will send half of the data to rank + 1. Once the new data is received, it will be shuffled into the existing data on the process. This function will be called by the DataLoader automatically if ``dataset.ishuffle = False``. ``attrs`` should have the form ``[[torch.Tensor, DNDarray], ...]``, e.g. ``[["data", "htdata"]]``. It is assumed that all of the attrs have the same dim0 shape as the local data.

   :param dataset: The dataset to shuffle
   :type dataset: Dataset
   :param attrs: List of lists, each of which contains 2 strings. The strings are the names of the Dataset attributes corresponding to the global data DNDarray and the local data of that array, e.g. ``[["data", "htdata"]]`` would shuffle the htdata around and set the correct amount of data for the ``dataset.data`` attribute. For multiple parameters multiple lists are required, e.g. ``[["data", "htdata"], ["targets", "httargets"]]``.
   :type attrs: List[List[str, str], ... ]

   .. rubric:: Notes

   ``dataset.comm`` must be defined for this function to work.


.. function:: dataset_ishuffle(dataset: Union[Dataset, torch.utils.data.Dataset], attrs: List[list])

   Shuffle the given attributes of a dataset across multiple processes, using non-blocking communications. This will send half of the data to rank + 1. The data must be received by the :func:`dataset_irecv` function.

   This function will be called by the DataLoader automatically if ``dataset.ishuffle = True``. This is set either in the class definition or at initialization via a parameter.

   :param dataset: The dataset to shuffle
   :type dataset: Dataset
   :param attrs: List of lists, each of which contains 2 strings. The strings are the names of the Dataset attributes corresponding to the global data DNDarray and the local data of that array, e.g. ``[["htdata", "data"]]`` would shuffle the htdata around and set the correct amount of data for the ``dataset.data`` attribute. For multiple parameters multiple lists are required, e.g. ``[["htdata", "data"], ["httargets", "targets"]]``.
   :type attrs: List[List[str, str], ... ]

   .. rubric:: Notes

   ``dataset.comm`` must be defined for this function to work.
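
The exchange described for :func:`dataset_shuffle` and :func:`dataset_ishuffle` (each process sends half of its local data to rank + 1, wrapping around, and then shuffles what it holds) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: it uses plain Python lists instead of Heat or MPI, the names ``ring_shuffle`` and ``local_data`` are hypothetical, and sending the *first* half of the local data is an assumption made for illustration.

.. code-block:: python

   import random


   def ring_shuffle(local_data, rng):
       """Simulate one shuffle step for all ranks at once.

       ``local_data[r]`` stands in for the data held by rank ``r``
       (hypothetical stand-in for ``dataset.data`` on each process).
       """
       size = len(local_data)
       # Assumption: each rank sends the first half of its local data.
       halves = [len(chunk) // 2 for chunk in local_data]
       outgoing = [chunk[:h] for chunk, h in zip(local_data, halves)]
       kept = [chunk[h:] for chunk, h in zip(local_data, halves)]

       shuffled = []
       for rank in range(size):
           # Rank r receives from rank r - 1; rank 0 receives from the last rank.
           received = outgoing[(rank - 1) % size]
           merged = kept[rank] + received
           # After receiving, the local data is shuffled in place.
           rng.shuffle(merged)
           shuffled.append(merged)
       return shuffled


   ranks = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
   result = ring_shuffle(ranks, random.Random(0))
   # Every rank keeps the same number of elements; half of them now
   # originate from the previous rank in the ring.

In this simulation rank 1 ends up holding the elements 0, 1, 12 and 13 in some shuffled order. Repeating the step over several epochs gradually mixes data between all ranks, which is the motivation for exchanging only half of the data per step.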