:mod:`heat.utils.data.partial_dataset`
======================================

.. py:module:: heat.utils.data.partial_dataset

.. autoapi-nested-parse::

   Tool for using a dataset which will not fit in memory with neural networks



Module Contents
---------------

.. py:class:: PartialH5Dataset(file: str, comm: heat.core.communication.MPICommunication = MPI_WORLD, dataset_names: Union[str, List[str]] = 'data', transforms: List[Callable] = None, use_gpu: bool = True, validate_set: bool = False, initial_load: int = 7000, load_length: int = 1000)

   Bases: :class:`torch.utils.data.Dataset`

   Create a Dataset object that loads portions of the data from an HDF5 file. Very similar to
   :func:``. This creates two threads: one loads data from the target file, and one converts
   items before they are passed to the network. The conversion is done by the iterator.
   A portion of the data of length ``initial_load`` is loaded upon initialization; the rest of the data
   is loaded after the already-loaded data has been returned by :class:`PartialH5DataLoaderIter`.
   The HeAT :func:`heat.utils.data.datatools.DataLoader` uses this iterator automatically
   with this type of dataset.

   .. rubric:: Notes

   H5 datasets require the GIL to load data. This can be a bottleneck if data needs to be loaded multiple
   times (as is the case when using this dataset). For this reason, it is recommended to find another way
   to preprocess the data and to avoid using H5 files.

   :param file: H5 file to use
   :type file: str
   :param comm: Global MPI communicator generated by HeAT
   :type comm: MPICommunication
   :param dataset_names: Name/s of dataset/s to load from ``file``. If a string is given, it will be the only dataset loaded.
                         Default is "data".
   :type dataset_names: Union[str, List[str]], optional
   :param transforms: Transforms to apply to the data after it is retrieved from the loaded data and before it is used by
                      the network. This should be a list of Callable torch functions, one for each item returned by the
                      ``__getitem__`` function of the individual dataset. If a list element is ``None``, no transform is
                      applied to the corresponding element returned by ``__getitem__``. For example, if ``__getitem__``
                      returns an image and a label, the list would look like this:
                      ``transforms = [image_transforms, None]``. If this is ``None``, no transforms are applied to any
                      elements. Default is ``None``.
   :type transforms: List[Callable], optional
   :param use_gpu: Use GPUs if available. Defaults to True.
   :type use_gpu: bool, optional
   :param validate_set: Load the entire dataset onto each node upon initialization and skip loading in the iterator.
                        This is typically needed for validation sets, when the network should be tested against the
                        whole dataset. Default is False.
   :type validate_set: bool, optional
   :param initial_load: How many elements to load from the file in the 0th dimension upon initialization.
                        Default is 7000 elements.
   :type initial_load: int, optional
   :param load_length: How many elements to load from the file in the iterator. Default is 1000 elements.
   :type load_length: int, optional

   .. attribute:: ishuffle
      :annotation: = False

   .. attribute:: file

   .. attribute:: comm

   .. attribute:: transforms

   .. attribute:: gpu

   .. attribute:: torch_device
      :annotation: = 'cpu'

   .. attribute:: total_size

   .. attribute:: lcl_full_sz

   .. attribute:: local_data_start

   .. attribute:: local_data_end

   .. attribute:: loads_left
      :annotation: = 0

   .. attribute:: load_start

   .. attribute:: load_end

   .. attribute:: dataset_names
      :annotation: = 'data'

   .. attribute:: dataset_order
      :annotation: = []

   .. attribute:: load_thread
      :annotation: = None

   .. attribute:: epoch_end
      :annotation: = False

   .. attribute:: loading_queue

   .. attribute:: loading_condition

   .. attribute:: convert_queue

   .. attribute:: used_indices
      :annotation: = []

   .. role:: raw-html(raw)
      :format: html

   .. method:: Shuffle()

      Send half of the local data to the process ``self.comm.rank + 1`` if available, else wrap around. After
      receiving the new data, shuffle the local tensor.

      Not implemented for partial datasets.

   .. method:: Ishuffle()

      Send half of the local data to the process ``self.comm.rank + 1`` if available, else wrap around. After
      receiving the new data, shuffle the local tensor.

      Not implemented for partial datasets.

   .. method:: __getitem__(index: Union[int, slice, List[int], torch.Tensor]) -> torch.Tensor

      Abstract ``__getitem__`` method. This should be defined by the user at runtime. It must be designed so
      that the data is in the 0th dimension and the requested indices refer only to the 0th dimension.

   .. method:: __len__() -> int

      Get the total length of the dataset.

   .. method:: thread_replace_converted_batches()

      Replace the elements of the dataset with newly loaded elements.
      :class:`PartialH5DataLoaderIter` puts the used indices in the ``used_indices`` attribute. That list is
      reset to an empty list after these elements are overwritten with new data.


.. py:class:: PartialH5DataLoaderIter(loader)

   Bases: :class:`object`

   Iterator to be used with :class:`PartialH5Dataset`. It closely mirrors the standard torch iterator while
   automatically loading new data to replace the batches that have already been used. It also pre-fetches
   the batches and begins their preparation, collation, and device setting in the background.

   .. attribute:: dataset

   .. attribute:: _dataset_kind

   .. attribute:: _IterableDataset_len_called

   .. attribute:: _auto_collation

   .. attribute:: _drop_last

   .. attribute:: _index_sampler

   .. attribute:: _num_workers

   .. attribute:: _pin_memory

   .. attribute:: _timeout

   .. attribute:: _collate_fn

   .. attribute:: _sampler_iter

   .. attribute:: _base_seed

   .. attribute:: _num_yielded
      :annotation: = 0

   .. attribute:: batch_size

   .. attribute:: comm

   .. attribute:: _dataset_fetcher

   .. role:: raw-html(raw)
      :format: html

   .. method:: __len__()

      Get the length of the iterator.

   .. method:: _next_data()

   .. method:: __next__()

      Get the next batch of data. Shamelessly taken from torch.

   .. method:: __iter__()

      Get a new iterator of this class.

      :rtype: PartialH5DataLoaderIter

   .. method:: __thread_convert_all(index_list)
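
The load-then-replace behaviour and the per-element ``transforms`` list described above can be illustrated with a small standard-library sketch. This is not HeAT's implementation: ``WindowedDataset``, ``apply_transforms`` and ``replace_used`` are hypothetical names introduced only for this example, a plain Python list stands in for the HDF5 file, and the sketch refills one element per used slot rather than loading ``load_length`` elements at a time.

```python
import threading


def apply_transforms(item, transforms):
    """Apply one transform per element of ``item``; ``None`` leaves it unchanged."""
    if transforms is None:
        return item
    return tuple(x if t is None else t(x) for x, t in zip(item, transforms))


class WindowedDataset:
    """Hypothetical stand-in that keeps only a window of ``source`` in memory."""

    def __init__(self, source, initial_load, transforms=None):
        self.source = source                      # stands in for the HDF5 file
        self.window = list(source[:initial_load]) # data loaded at initialization
        self.next_start = len(self.window)        # next unread position in source
        self.used_indices = []                    # slots already handed out
        self.transforms = transforms
        self.lock = threading.Lock()

    def __getitem__(self, index):
        # Hand out an item from the window and remember that its slot was used.
        with self.lock:
            item = self.window[index]
            self.used_indices.append(index)
        return apply_transforms(item, self.transforms)

    def replace_used(self):
        """Overwrite used slots with unread items, then reset ``used_indices``."""
        with self.lock:
            for slot in self.used_indices:
                if self.next_start >= len(self.source):
                    break  # nothing left to load
                self.window[slot] = self.source[self.next_start]
                self.next_start += 1
            self.used_indices = []


# Each item is an (image, label) pair; only the "image" is transformed.
source = [(i, i % 2) for i in range(10)]
ds = WindowedDataset(source, initial_load=4, transforms=[lambda x: x * 10, None])

batch = [ds[i] for i in range(2)]

# Refill the used slots in a background thread, as a loading thread would.
loader = threading.Thread(target=ds.replace_used)
loader.start()
loader.join()
```

After the background thread finishes, the two slots that were read hold the next two unread items from ``source`` and ``used_indices`` is empty again, which mirrors the reset described for ``thread_replace_converted_batches``.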