argopy.stores.ArgoKerchunker#

class ArgoKerchunker(store: Literal['memory', 'local'] | AbstractFileSystem = 'memory', root: Path | str = '.', preload: bool = True, inline_threshold: int = 0, max_chunk_size: int = 0, storage_options: Dict | None = None)[source]#

Argo netcdf file kerchunk helper

This class is for expert users who wish to test lazy access to remote netcdf files. It is designed to be used through one of the argopy stores inheriting from ArgoStoreProto.

The kerchunk library is required only if you need to extract zarr data from a netcdf file, i.e. execute ArgoKerchunker.translate().
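Since kerchunk is an optional dependency, you may want to check that it is importable before attempting a translation. A minimal sketch (kerchunk is the import name of the library referenced throughout this class):

try:
    import kerchunk  # noqa: F401  # only needed by ArgoKerchunker.translate()
    HAS_KERCHUNK = True
except ImportError:
    HAS_KERCHUNK = False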

Notes

According to AWS, typical sizes for byte-range requests are 8 MB or 16 MB.

If you intend to compute kerchunk zarr data on-demand, we don't recommend using this method on mono- or multi-profile files that are only a few MB in size, because (ker)chunking adds a significant performance overhead.
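A minimal sketch of such a size check, assuming anonymous s3 access to the GDAC sandbox path used in the examples below:

import fsspec

# Check the remote netcdf file size before deciding to kerchunk it on-demand
# (assumes the Argo GDAC s3 sandbox allows anonymous access):
fs = fsspec.filesystem('s3', anon=True)
ncfile = "argo-gdac-sandbox/pub/dac/coriolis/6903090/6903090_prof.nc"
size_mb = fs.info(ncfile)['size'] / 1e6
if size_mb < 8:
    print(f"{ncfile}: only {size_mb:.1f} MB, a plain download may be faster than kerchunking")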

Warning

We noticed that kerchunk zarr data for Rtraj files can be considerably larger than the netcdf file itself: reference data for a 10 MB netcdf file can reach 228 MB!

Examples

Listing 32 ArgoKerchunker API#
import fsspec
from argopy.stores import ArgoKerchunker

# Use the default memory store to manage kerchunk zarr data:
ak = ArgoKerchunker(store='memory')

# Use a local file store to keep track of zarr kerchunk data (for later
# re-use or sharing):
ak = ArgoKerchunker(store='local', root='kerchunk_data_folder')

# Use a remote file store to keep track of zarr kerchunk data (for later
# re-use or sharing):
fs = fsspec.filesystem('dir',
                       path='s3://.../kerchunk_data_folder/',
                       target_protocol='s3')
ak = ArgoKerchunker(store=fs)

# Methods:
ak.supported(ncfile)
ak.translate(ncfiles)
ak.to_reference(ncfile)
ak.pprint(ncfile)
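If to_reference() returns a kerchunk reference dict (an assumption here), it can also be opened directly with xarray through fsspec's 'reference' filesystem. A sketch, assuming `ak` and `ncfile` from above and a netcdf file hosted on an anonymously accessible s3 bucket:

import fsspec
import xarray as xr

# Open the zarr reference data with xarray:
refs = ak.to_reference(ncfile)
fs_ref = fsspec.filesystem('reference', fo=refs,
                           remote_protocol='s3',
                           remote_options={'anon': True})
ds = xr.open_dataset(fs_ref.get_mapper(''), engine='zarr',
                     backend_kwargs={'consolidated': False})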
Listing 33 Loading one file lazily#
# Let's consider a remote Argo netcdf file hosted on an s3 server that supports lazy access
# (i.e. byte-range requests):
ncfile = "argo-gdac-sandbox/pub/dac/coriolis/6903090/6903090_prof.nc"

# Simply open the netcdf file lazily:
from argopy.stores import s3store
ds = s3store().open_dataset(ncfile, lazy=True)

# You can also do it with the GDAC fs:
from argopy.stores import gdacfs
ds = gdacfs('s3').open_dataset("dac/coriolis/6903090/6903090_prof.nc", lazy=True)
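Since the dataset is opened lazily, essentially only metadata is read at this point; variable values are fetched through byte-range requests when you actually access them. A short sketch (TEMP and N_PROF follow the standard Argo profile file layout):

# Only the requested values are downloaded, e.g. temperature of the first profile:
ds['TEMP'].isel(N_PROF=0).load()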
Listing 34 Translate and save references for a batch of netcdf files#
# Create an instance that will save netcdf-to-zarr references in a local
# folder at "~/kerchunk_data_folder":
from argopy.stores import ArgoKerchunker

ak = ArgoKerchunker(store='local', root='~/kerchunk_data_folder')

# Get a dummy list of netcdf files:
from argopy import ArgoIndex
idx = ArgoIndex(host='s3').search_lat_lon_tim([-70, -55, 30, 45,
                                               '2025-01-01', '2025-02-01'])
ncfiles = [af.ls_dataset()['prof'] for af in idx.iterfloats()]

# Translate and save references for this batch of netcdf files:
# (done in parallel, possibly using a Dask client if available)
ak.translate(ncfiles, fs=idx.fs['src'], chunker='auto')
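Once saved on the local store, references can be reloaded in a later session without redoing the translation. A sketch, assuming that to_reference() returns references already present on the store rather than recomputing them:

# In a later session, re-use the saved references (preload=True reads the store content):
ak = ArgoKerchunker(store='local', root='~/kerchunk_data_folder', preload=True)
refs = ak.to_reference(ncfiles[0], fs=idx.fs['src'])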
__init__(store: Literal['memory', 'local'] | AbstractFileSystem = 'memory', root: Path | str = '.', preload: bool = True, inline_threshold: int = 0, max_chunk_size: int = 0, storage_options: Dict | None = None)[source]#
Parameters:
  • store (str, default='memory') – Kerchunk data store, i.e. the file system used to load and/or save kerchunk json files

  • root (Path, str, default='.') – Local folder used as the root of the store

  • preload (bool, default=True) – Indicate whether kerchunk references already on the store should be preloaded.

  • inline_threshold (int, default=0) –

    Byte size below which an array will be embedded in the output. Use 0 to disable inlining.

    This argument is passed to kerchunk.netCDF3.NetCDF3ToZarr or kerchunk.hdf.SingleHdf5ToZarr.

  • max_chunk_size (int, default=0) –

    How big a chunk can be before triggering subchunking. If 0, there is no subchunking, and there is never subchunking for coordinate/dimension arrays. E.g., if an array contains 10,000 bytes and this value is 6000, there will be two output chunks, split on the biggest available dimension.

    This argument is passed to kerchunk.netCDF3.NetCDF3ToZarr only.

  • storage_options (dict, default=None) – This argument is passed to kerchunk.netCDF3.NetCDF3ToZarr or kerchunk.hdf.SingleHdf5ToZarr during translation. These options are, in turn, passed to fsspec when opening the netcdf file.
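For illustration, the chunking and access options above can be combined at instantiation. A sketch (the anonymous s3 option is an assumption about the remote host):

from argopy.stores import ArgoKerchunker

ak = ArgoKerchunker(store='local',
                    root='~/kerchunk_data_folder',
                    inline_threshold=256,          # inline arrays smaller than 256 bytes
                    max_chunk_size=8 * 1024 ** 2,  # sub-chunk arrays above ~8 MB (netCDF3 files only)
                    storage_options={'anon': True})  # forwarded to fsspec when opening the netcdf files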

Methods

  • __init__([store, root, preload, ...])

  • nc2reference(ncfile[, fs, chunker]) – Compute reference data for a netcdf file (kerchunk json data)

  • pprint(ncfile[, params]) – Pretty print kerchunk json data for a netcdf file

  • supported(ncfile[, fs]) – Check if a netcdf file can be accessed through byte ranges

  • to_reference(ncfile[, fs, overwrite]) – Return zarr reference data for a given netcdf file

  • translate(ncfiles[, fs, chunker]) – Translate netcdf file(s) into kerchunk reference data

  • update_kerchunk_references_from_store() – Load kerchunk data already on store

Attributes

  • store_path – Path to the reference store, including protocol

  • inline_threshold (int) – Byte size below which an array will be embedded in the output.

  • max_chunk_size (int) – How big a chunk can be before triggering subchunking.