argopy.stores.ArgoKerchunker#

class ArgoKerchunker(store: Literal['memory', 'local'] | AbstractFileSystem = 'memory', root: Path | str = '.', preload: bool = True, inline_threshold: int = 0, max_chunk_size: int = 0, storage_options: Dict | None = None)[source]#

Argo netcdf file kerchunk helper

This class is for expert users who wish to test lazy access to remote netcdf files. It is designed to be used through one of the argopy stores inheriting from ArgoStoreProto.

The kerchunk library is required only if you need to extract zarr data from a netcdf file, i.e. execute ArgoKerchunker.translate().
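Since kerchunk is an optional dependency, you may want to check that it is importable before attempting a translation. A minimal sketch (kerchunk is the import name of the library referenced throughout this class):

try:
    import kerchunk  # noqa: F401  # only needed by ArgoKerchunker.translate()
    HAS_KERCHUNK = True
except ImportError:
    HAS_KERCHUNK = False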

Notes

According to AWS, typical sizes for byte-range requests are 8 MB or 16 MB.

If you intend to compute kerchunk zarr data on-demand, we don't recommend using this method on mono- or multi-profile files that are only a few MB in size, because (ker)chunking adds a significant performance overhead.
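A minimal sketch of such a size check, assuming anonymous s3 access to the GDAC sandbox path used in the examples below:

import fsspec

# Check the remote netcdf file size before deciding to kerchunk it on-demand
# (assumes the Argo GDAC s3 sandbox allows anonymous access):
fs = fsspec.filesystem('s3', anon=True)
ncfile = "argo-gdac-sandbox/pub/dac/coriolis/6903090/6903090_prof.nc"
size_mb = fs.info(ncfile)['size'] / 1e6
if size_mb < 8:
    print(f"{ncfile}: only {size_mb:.1f} MB, a plain download may be faster than kerchunking")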

Warning

We noticed that kerchunk zarr data for Rtraj files can be considerably larger than the netcdf file itself: reference data for a 10 MB netcdf file can reach 228 MB!

Examples

Listing 32 ArgoKerchunker API#
import fsspec
from argopy.stores import ArgoKerchunker

# Use the default memory store to manage kerchunk zarr data:
ak = ArgoKerchunker(store='memory')

# Use a local file store to keep track of zarr kerchunk data (for later
# re-use or sharing):
ak = ArgoKerchunker(store='local', root='kerchunk_data_folder')

# Use a remote file store to keep track of zarr kerchunk data (for later
# re-use or sharing):
fs = fsspec.filesystem('dir',
                       path='s3://.../kerchunk_data_folder/',
                       target_protocol='s3')
ak = ArgoKerchunker(store=fs)

# Methods:
ak.supported(ncfile)
ak.translate(ncfiles)
ak.to_reference(ncfile)
ak.pprint(ncfile)
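If to_reference() returns a kerchunk reference dict (an assumption here), it can also be opened directly with xarray through fsspec's 'reference' filesystem. A sketch, assuming `ak` and `ncfile` from above and a netcdf file hosted on an anonymously accessible s3 bucket:

import fsspec
import xarray as xr

# Open the zarr reference data with xarray:
refs = ak.to_reference(ncfile)
fs_ref = fsspec.filesystem('reference', fo=refs,
                           remote_protocol='s3',
                           remote_options={'anon': True})
ds = xr.open_dataset(fs_ref.get_mapper(''), engine='zarr',
                     backend_kwargs={'consolidated': False})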
Listing 33 Loading one file lazily#
# Let's consider a remote Argo netcdf file hosted on an s3 server that supports lazy access
# (i.e. byte-range requests):
ncfile = "argo-gdac-sandbox/pub/dac/coriolis/6903090/6903090_prof.nc"

# Simply open the netcdf file lazily:
from argopy.stores import s3store
ds = s3store().open_dataset(ncfile, lazy=True)

# You can also do it with the GDAC fs:
from argopy.stores import gdacfs
ds = gdacfs('s3').open_dataset("dac/coriolis/6903090/6903090_prof.nc", lazy=True)
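Since the dataset is opened lazily, essentially only metadata is read at this point; variable values are fetched through byte-range requests when you actually access them. A short sketch (TEMP and N_PROF follow the standard Argo profile file layout):

# Only the requested values are downloaded, e.g. temperature of the first profile:
ds['TEMP'].isel(N_PROF=0).load()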
Listing 34 Translate and save references for a batch of netcdf files#
# Create an instance that will save netcdf-to-zarr references in a local
# folder at "~/kerchunk_data_folder":
from argopy.stores import ArgoKerchunker

ak = ArgoKerchunker(store='local', root='~/kerchunk_data_folder')

# Get a dummy list of netcdf files:
from argopy import ArgoIndex
idx = ArgoIndex(host='s3').search_lat_lon_tim([-70, -55, 30, 45,
                                               '2025-01-01', '2025-02-01'])
ncfiles = [af.ls_dataset()['prof'] for af in idx.iterfloats()]

# Translate and save references for this batch of netcdf files:
# (done in parallel, possibly using a Dask client if available)
ak.translate(ncfiles, fs=idx.fs['src'], chunker='auto')
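Once saved on the local store, references can be reloaded in a later session without redoing the translation. A sketch, assuming that to_reference() returns references already present on the store rather than recomputing them:

# In a later session, re-use the saved references (preload=True reads the store content):
ak = ArgoKerchunker(store='local', root='~/kerchunk_data_folder', preload=True)
refs = ak.to_reference(ncfiles[0], fs=idx.fs['src'])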
__init__(store: Literal['memory', 'local'] | AbstractFileSystem = 'memory', root: Path | str = '.', preload: bool = True, inline_threshold: int = 0, max_chunk_size: int = 0, storage_options: Dict | None = None)[source]#
Parameters:
  • store (str, default='memory') – Kerchunk data store, i.e. the file system used to load and/or save kerchunk json files

  • root (Path, str, default='.') – Local folder used as the root of the store

  • preload (bool, default=True) – Indicate whether kerchunk references already on the store should be preloaded.

  • inline_threshold (int, default=0) –

    Byte size below which an array will be embedded in the output. Use 0 to disable inlining.

    This argument is passed to kerchunk.netCDF3.NetCDF3ToZarr or kerchunk.hdf.SingleHdf5ToZarr.

  • max_chunk_size (int, default=0) –

    How big a chunk can be before triggering subchunking. If 0, there is no subchunking, and there is never subchunking for coordinate/dimension arrays. E.g., if an array contains 10,000 bytes and this value is 6000, there will be two output chunks, split on the biggest available dimension.

    This argument is passed to kerchunk.netCDF3.NetCDF3ToZarr only.

  • storage_options (dict, default=None) – This argument is passed to kerchunk.netCDF3.NetCDF3ToZarr or kerchunk.hdf.SingleHdf5ToZarr during translation. These options are, in turn, passed to fsspec when opening the netcdf file.
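For illustration, the chunking and access options above can be combined at instantiation. A sketch (the anonymous s3 option is an assumption about the remote host):

from argopy.stores import ArgoKerchunker

ak = ArgoKerchunker(store='local',
                    root='~/kerchunk_data_folder',
                    inline_threshold=256,          # inline arrays smaller than 256 bytes
                    max_chunk_size=8 * 1024 ** 2,  # sub-chunk arrays above ~8 MB (netCDF3 files only)
                    storage_options={'anon': True})  # forwarded to fsspec when opening the netcdf files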

Methods

  • __init__([store, root, preload, ...])

  • nc2reference(ncfile[, fs, chunker]) – Compute reference data for a netcdf file (kerchunk json data)

  • pprint(ncfile[, params]) – Pretty print kerchunk json data for a netcdf file

  • supported(ncfile[, fs]) – Check if a netcdf file can be accessed through byte ranges

  • to_reference(ncfile[, fs, overwrite]) – Return zarr reference data for a given netcdf file

  • translate(ncfiles[, fs, chunker]) – Translate netcdf file(s) into kerchunk reference data

  • update_kerchunk_references_from_store() – Load kerchunk data already on store

Attributes

  • store_path – Path to the reference store, including protocol

  • inline_threshold (int) – Byte size below which an array will be embedded in the output.

  • max_chunk_size (int) – How big a chunk can be before triggering subchunking.