Lazy dataset access#
Warning
As of February 2025, this feature is considered highly experimental: there is no guarantee that it is fully functional, the documentation may be broken, and the API can change without any deprecation warning from one release to another. This is part of a wider effort to prepare argopy for evolutions of the Argo dataset in the cloud (cf. the ADMT working group on Argo cloud format activities).
This argopy feature is implemented with the open_dataset methods relying on the s3 store (stores.s3store) and with ArgoFloat. Since this is somehow a low-level implementation whereby users need to work with float data directly, it is probably targeting users with operator or expert knowledge of Argo.
Contrary to the other performance improvement methods, this one is not accessible with a DataFetcher.
What is laziness?#
Laziness, in our use case, relates to limiting remote data transfer/load to what is really needed for an operation. For instance:
- if you want to work with a single Argo parameter for a given float, you don't need to download all the other parameters from the GDAC server,
- if you are only interested in assessing a file's content (e.g. its number of profiles or vertical levels), you don't need to load anything else than the dimensions of the netcdf file.
Remote laziness is natively supported by some file formats like zarr and parquet. However, netcdf was not designed for this use case.
How does it work?#
Since a regular Argo netcdf is not intended to be accessed partially from a remote server, it is rather tricky to access Argo data lazily. Fortunately, the kerchunk library has been developed precisely for this use case.
In order to lazily access a remote Argo netcdf file from a server supporting byte range requests, the netcdf content has to be analysed in order to build a byte range catalogue of its content: this is called a reference. To do so, you need to have the kerchunk library installed in your working environment.
If someone else has already created the netcdf reference and made it accessible, you don't need the kerchunk library installed.
kerchunk will analyse a netcdf file's content (e.g. dimensions, list of variables) and store these metadata in a json file compatible with zarr. With a specific syntax, these metadata can then be given to xarray.open_dataset() or xarray.open_zarr() to open the netcdf file lazily.
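For illustration, this pattern could look like the following sketch with plain xarray; the reference file name "6903091_prof.json" is a hypothetical kerchunk output, and the remote options depend on the hosting server:
import xarray as xr

# Open a netcdf file lazily through its kerchunk reference, with the zarr engine:
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "6903091_prof.json",         # kerchunk reference (json catalogue of byte ranges)
            "remote_protocol": "s3",           # protocol of the server hosting the netcdf file
            "remote_options": {"anon": True},  # options for that server
        },
    },
)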
Warning
Since variable content is not loaded, one limitation of the lazy approach is that variables are not necessarily cast appropriately and are often returned with a plain object dtype.
You can use the Dataset.argo.cast_types() method to cast Argo variables correctly.
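For instance, on a dataset opened lazily (see examples below):
ds = ds.argo.cast_types()  # cast object-dtype variables to their expected Argo types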
Laziness support status#
Not all Argo data servers support the byte range requests that are mandatory to lazily access a netcdf file. Nor do all argopy methods support laziness through kerchunk and zarr reference data.
The table below summarizes the lazy support status for all possible GDAC hosts:
GDAC hosts | Server support | argopy support | Can you use it? |
---|---|---|---|
https://data-argo.ifremer.fr | ✅ | ✅ | ✅ |
https://usgodae.org/pub/outgoing/argo | ❌ | 🤷 | ❌ |
ftp://ftp.ifremer.fr/ifremer/argo | ❌ | 🤷 | ❌ |
s3://argo-gdac-sandbox/pub | ✅ | ✅ | ✅ |
a local GDAC copy | ✅ | 🤷 | ❌ |
Laziness with an ArgoFloat or gdacfs#
When opening an Argo dataset with ArgoFloat, you can simply add the lazy argument to apply laziness:
In [1]: from argopy import ArgoFloat
In [2]: ds = ArgoFloat(6903091, host='s3').open_dataset('prof', lazy=True)
Without additional arguments, netcdf reference data are computed on the fly using the kerchunk library and stored in memory.
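At this point, only metadata have been read; data values are transferred when a variable is actually accessed. For instance (a sketch, assuming the standard TEMP parameter is present in this prof file):
print(ds.sizes)           # dimension sizes, read from the reference, no data transfer
temp = ds['TEMP'].load()  # transfers only the TEMP variable content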
You can also use laziness from a gdacfs:
In [3]: from argopy import gdacfs
In [4]: ds = gdacfs('s3').open_dataset("dac/coriolis/6903090/6903090_prof.nc", lazy=True)
A simple way to ensure that you opened the dataset lazily is to check the source value in the encoding attribute; it should be:
In [5]: ds.encoding['source'] == 'reference://'
Out[5]: True
argopy kerchunk helper#
Under the hood, ArgoFloat and gdacfs will rely on kerchunk reference data provided by a stores.ArgoKerchunker instance.
A typical direct use case of stores.ArgoKerchunker is to save kerchunk data in a shared store (local or remote), so that other users will be able to use it. From the user perspective, this has the huge advantage of not requiring the kerchunk library, since opening a dataset lazily will be done with the zarr engine of xarray.
It could go like this:
In [6]: from argopy.stores import ArgoKerchunker
# Create an instance that will save netcdf to zarr references on a local
# folder at "~/kerchunk_data_folder":
In [7]: ak = ArgoKerchunker(store='local',
...: root='~/kerchunk_data_folder',
...: storage_options={'anon': True}, # so that AWS credentials are not required
...: )
...:
# Note that you could also use a remote reference store, for instance:
# ak = ArgoKerchunker(store=fsspec.filesystem('dir',
#                                             path='s3://.../kerchunk_data_folder/',
#                                             target_protocol='s3'))
Now we can get a sample list of netcdf files:
In [8]: from argopy import ArgoIndex
In [9]: idx = ArgoIndex(host='s3').search_lat_lon_tim([-65, -55, 30, 40,
...: '2025-01-01', '2025-02-01'])
...:
In [10]: ncfiles = [af.ls_dataset()['prof'] for af in idx.iterfloats()]
In [11]: print(len(ncfiles))
29
and compute zarr references that will be saved by the stores.ArgoKerchunker instance. Note that this computation is done with Dask delayed when available, otherwise with multithreading:
In [12]: ak.translate(ncfiles, fs=idx.fs['src'], chunker='auto');
The chunker option determines which chunker to use; netcdf3 and netcdf4/hdf5 files require different ones. Check out the API documentation of stores.ArgoKerchunker.translate() for more details.
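If you want the translation to be distributed with Dask, you can simply start a client beforehand (a sketch, assuming dask.distributed is installed):
from dask.distributed import Client

client = Client()  # start a local Dask cluster; translate() will then use Dask delayed
ak.translate(ncfiles, fs=idx.fs['src'], chunker='auto')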
To later re-use such references to lazily open one of these netcdf files, an operation that does not require the kerchunk library, you can provide the appropriate stores.ArgoKerchunker instance to an ArgoFloat or gdacfs:
In [13]: ak = ArgoKerchunker(store='local',
....: root='~/kerchunk_data_folder',
....: storage_options={'anon': True}, # so that AWS credentials are not required
....: )
....:
In [14]: wmo = idx.read_wmo()[0] # Select one float from the index search above
In [15]: ds = ArgoFloat(wmo, host='s3').open_dataset('prof', lazy=True, ak=ak)
In [16]: ds.encoding['source'] == 'reference://'
Out[16]: True
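This should work similarly through a gdacfs, since the ak argument can be given there as well; a sketch based on the file path used earlier:
from argopy import gdacfs

ds = gdacfs('s3').open_dataset("dac/coriolis/6903090/6903090_prof.nc", lazy=True, ak=ak)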