Lazy dataset access#
Warning
As of February 2025, this feature is considered highly experimental: there is no guarantee that it is fully functional, the documentation may be broken, and the API can change without any deprecation warning from one release to another. This is part of a wider effort to prepare argopy for evolutions of the Argo dataset in the cloud (cf. the ADMT working group on Argo cloud format activities).
This argopy feature is implemented with the open_dataset methods relying on the s3 store (stores.s3store) and with ArgoFloat. Since this is somehow a low-level implementation whereby users need to work with float data directly, it is probably targeting users with operator or expert knowledge of Argo.
Contrary to the other performance improvement methods, this one is not accessible with a DataFetcher.
What is laziness?#
Laziness, in our use case, relates to limiting remote data transfer/load to what is really needed for an operation. For instance:
- if you want to work with a single Argo parameter for a given float, you don't need to download all the other parameters from the GDAC server,
- if you are only interested in assessing a file's content (e.g. its number of profiles or vertical levels), you don't need to load anything else than the dimensions of the netcdf file.
Remote laziness is natively supported by some file formats like zarr and parquet. However, netcdf was not designed for this use case.
How does it work?#
Since a regular Argo netcdf is not intended to be accessed partially from a remote server, it is rather tricky to access Argo data lazily. Fortunately, the kerchunk library has been developed precisely for this use case.
In order to lazily access a remote Argo netcdf file from a server supporting byte range requests, the netcdf content has to be analysed in order to build a byte range catalogue of its content: this is called a reference. To do so, you need to have the kerchunk library installed in your working environment.
If someone else has already created the netcdf reference and made it accessible, you don't need the kerchunk library installed.
kerchunk will analyse a netcdf file's content (e.g. dimensions, list of variables) and store these metadata in a json file compatible with zarr. With a specific syntax, these metadata can then be given to xarray.open_dataset() or xarray.open_zarr() to open the netcdf file lazily.
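For illustration, this pattern could look like the following sketch with plain xarray; the reference file name "6903091_prof.json" is a hypothetical kerchunk output, and the remote options depend on the hosting server:
import xarray as xr

# Open a netcdf file lazily through its kerchunk reference, with the zarr engine:
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "6903091_prof.json",         # kerchunk reference (json catalogue of byte ranges)
            "remote_protocol": "s3",           # protocol of the server hosting the netcdf file
            "remote_options": {"anon": True},  # options for that server
        },
    },
)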
Warning
Since variable content is not loaded, one limitation of the lazy approach is that variables are not necessarily cast appropriately and are often returned with a plain object dtype.
You can use the Dataset.argo.cast_types() method to cast Argo variables correctly.
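For instance, on a dataset opened lazily (see examples below):
ds = ds.argo.cast_types()  # cast object-dtype variables to their expected Argo types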
Laziness support status#
Not all Argo data servers support the byte range requests that are mandatory to lazily access a netcdf file. Nor do all argopy methods support laziness through kerchunk and zarr reference data.
The table below summarizes the lazy support status for all possible GDAC hosts:
GDAC hosts | Server support | argopy support | Can you use it? |
---|---|---|---|
https://data-argo.ifremer.fr | ✅ | ✅ | ✅ |
https://usgodae.org/pub/outgoing/argo | ❌ | 🤷 | ❌ |
ftp://ftp.ifremer.fr/ifremer/argo | ❌ | 🤷 | ❌ |
s3://argo-gdac-sandbox/pub | ✅ | ✅ | ✅ |
a local GDAC copy | ✅ | 🤷 | ❌ |
Laziness with an ArgoFloat or gdacfs#
When opening an Argo dataset with ArgoFloat, you can simply add the lazy argument to apply laziness:
In [1]: from argopy import ArgoFloat
In [2]: ds = ArgoFloat(6903091, host='s3').open_dataset('prof', lazy=True)
Without additional arguments, netcdf reference data are computed on the fly using the kerchunk library and stored in memory.
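At this point, only metadata have been read; data values are transferred when a variable is actually accessed. For instance (a sketch, assuming the standard TEMP parameter is present in this prof file):
print(ds.sizes)           # dimension sizes, read from the reference, no data transfer
temp = ds['TEMP'].load()  # transfers only the TEMP variable content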
You can also use laziness from a gdacfs:
In [3]: from argopy import gdacfs
In [4]: ds = gdacfs('s3').open_dataset("dac/coriolis/6903090/6903090_prof.nc", lazy=True)
A simple way to ensure that you opened the dataset lazily is to check the source value in the encoding attribute; it should be:
In [5]: ds.encoding['source'] == 'reference://'
Out[5]: True
argopy kerchunk helper#
Under the hood, ArgoFloat and gdacfs will rely on kerchunk reference data provided by a stores.ArgoKerchunker instance.
A typical direct use case of stores.ArgoKerchunker is to save kerchunk data in a shared store (local or remote), so that other users will be able to use it. From the user perspective, this has the huge advantage of not requiring the kerchunk library, since opening a dataset lazily will be done with the zarr engine of xarray.
It could go like this:
In [6]: from argopy.stores import ArgoKerchunker
# Create an instance that will save netcdf to zarr references on a local
# folder at "~/kerchunk_data_folder":
In [7]: ak = ArgoKerchunker(store='local',
...: root='~/kerchunk_data_folder',
...: storage_options={'anon': True}, # so that AWS credentials are not required
...: )
...:
# Note that you could also use a remote reference store, for instance:
# ak = ArgoKerchunker(store=fsspec.filesystem('dir',
#                                             path='s3://.../kerchunk_data_folder/',
#                                             target_protocol='s3'))
Now we can get a sample list of netcdf files:
In [8]: from argopy import ArgoIndex
In [9]: idx = ArgoIndex(host='s3').search_lat_lon_tim([-65, -55, 30, 40,
...: '2025-01-01', '2025-02-01'])
...:
In [10]: ncfiles = [af.ls_dataset()['prof'] for af in idx.iterfloats()]
In [11]: print(len(ncfiles))
29
and compute zarr references that will be saved by the stores.ArgoKerchunker instance. Note that this computation is done with Dask delayed when available, otherwise with multithreading:
In [12]: ak.translate(ncfiles, fs=idx.fs['src'], chunker='auto');
The chunker option determines which chunker to use; netcdf3 and netcdf4/hdf5 files require different ones. Check out the API documentation of stores.ArgoKerchunker.translate() for more details.
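If you want the translation to be distributed with Dask, you can simply start a client beforehand (a sketch, assuming dask.distributed is installed):
from dask.distributed import Client

client = Client()  # start a local Dask cluster; translate() will then use Dask delayed
ak.translate(ncfiles, fs=idx.fs['src'], chunker='auto')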
To later re-use such references to lazily open one of these netcdf files, an operation that does not require the kerchunk library, you can provide the appropriate stores.ArgoKerchunker instance to an ArgoFloat or gdacfs:
In [13]: ak = ArgoKerchunker(store='local',
....: root='~/kerchunk_data_folder',
....: storage_options={'anon': True}, # so that AWS credentials are not required
....: )
....:
In [14]: wmo = idx.read_wmo()[0] # Select one float from the index search above
In [15]: ds = ArgoFloat(wmo, host='s3').open_dataset('prof', lazy=True, ak=ak)
In [16]: ds.encoding['source'] == 'reference://'
Out[16]: True
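This should work similarly through a gdacfs, since the ak argument can be given there as well; a sketch based on the file path used earlier:
from argopy import gdacfs

ds = gdacfs('s3').open_dataset("dac/coriolis/6903090/6903090_prof.nc", lazy=True, ak=ak)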