Snapshot ======== :class:`~torchsnapshot.Snapshot` is the core API of TorchSnapshot. The class represents application state persisted in storage. A user can take a snapshot of an application via :meth:`Snapshot.take() ` (a.k.a saving a checkpoint), or restore the state of an application from a snapshot via :meth:`Snapshot.restore() ` (a.k.a loading a checkpoint). .. _app-state: Describing the application state -------------------------------- Before using :class:`~torchsnapshot.Snapshot` to save or restore application state, the user needs to **describe the application state**. This is done by creating a dictionary that contains all **stateful objects** that the user wishes to capture as application state: .. code-block:: Python app_state = {"model": model, "optimizer": optimizer} Any object that exposes ``.state_dict()`` and ``.load_state_dict()`` are considered **stateful objects**. Common PyTorch objects such as :class:`~torch.nn.Module`, :class:`~torch.optim.Optimizer`, and LR Schedulers all qualify as **stateful objects** and can be captured directly. Objects that don't meet this requirement can be captured via :class:`~torchsnapshot.StateDict`: .. code-block:: Python from torchsnapshot import StateDict extra_state = StateDict(iterations=0) app_state = {"model": model, "optimizer": optimizer, "extra_state": extra_state} Taking a snapshot ----------------- Once the :ref:`application state ` is described, the user can take a snapshot of the application via :func:`Snapshot.take() `. :func:`Snapshot.take() ` persists the application state to the user specified path and returns a :class:`~torchsnapshot.Snapshot` object, which is a reference to the snapshot. .. code-block:: Python from torchsnapshot import Snapshot snapshot = Snapshot.take(path="/path/to/my/snapshot", app_state=app_state) The user specified path can optionally be prepended with a URI prefix. By default, the prefix is ``fs://``, which suggests that the path is a file system location. TorchSnapshot also provides performant and reliable integration with commonly used cloud object storages. A storage backend can be selected by prepending the corresponding URI prefix (e.g. ``s3://`` for S3, ``gs://`` for Google Cloud Storage). .. code-block:: Python snapshot = Snapshot.take( path="s3://bucket/path/to/my/snapshot", app_state=app_state ) .. note:: Do not move GPU tensors to CPU before saving them with TorchSnapshot. TorchSnapshot implements various optimizations for increasing the throughput and decreasing the host memory usage of GPU-to-storage transfers. Moving GPU tensors to CPU manually will lower the throughput and increase the chance of "out of memory" issues. Restoring from a snapshot ------------------------- To restore from a snapshot, the user first need to obtain a reference to the snapshot. As seen previously, in the process where the snapshot is taken, a reference to the snapshot is returned by :func:`Snapshot.take() `. In another process (which is more common for resumption), a reference can be obtained by creating a :class:`~torchsnapshot.Snapshot` object with the snapshot path: .. code-block:: Python from torchsnapshot import Snapshot snapshot = Snapshot(path="/path/to/my/snapshot") To restore the :ref:`application state ` from the snapshot, invoke :func:`Snapshot.restore() ` with the application state: .. code-block:: Python snapshot.restore(app_state=app_state) .. note:: :func:`Snapshot.restore() ` restores stateful objects in-place to avoid creating unneccessary intermediate copies of the state. Distributed snapshot -------------------- TODO Elasticity ---------- TODO Reproducibility --------------- TODO Taking a snapshot asynchronously -------------------------------- TODO API Reference ------------- .. autoclass:: torchsnapshot.Snapshot :members: :undoc-members: