argopy.stores.ArgoKerchunker
- class ArgoKerchunker(store: Literal['memory', 'local'] | AbstractFileSystem = 'memory', root: Path | str = '.', preload: bool = True, inline_threshold: int = 0, max_chunk_size: int = 0, storage_options: Dict | None = None)
Argo netcdf file kerchunk helper
This class is for expert users who wish to test lazy access to remote netcdf files. It is designed to be used through one of the argopy stores inheriting from ArgoStoreProto. The kerchunk library is required only if you need to extract zarr data from a netcdf file, i.e. execute ArgoKerchunker.translate().
Notes
According to AWS, typical sizes for byte-range requests are 8 MB or 16 MB.
If you intend to compute kerchunk zarr data on-demand, we don't recommend using this method on mono- or multi-profile files that are only a few MB in size, because (ker)chunking creates a significant performance overhead.
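For instance, a rough way to decide whether on-demand kerchunking is worth it is to check the remote file size first. This is a minimal sketch, assuming anonymous s3 access and an arbitrary 8 MB cut-off:
import fsspec

ncfile = "argo-gdac-sandbox/pub/dac/coriolis/6903090/6903090_prof.nc"

fs = fsspec.filesystem("s3", anon=True)  # anonymous access, assumed for this example
size_mb = fs.size(ncfile) / 1e6          # file size in bytes, converted to MB

if size_mb < 8:  # assumed threshold, in line with typical 8-16 MB byte-range requests
    print(f"{ncfile}: only {size_mb:.1f} MB, kerchunking overhead may not pay off")
else:
    print(f"{ncfile}: {size_mb:.1f} MB, lazy byte-range access looks worthwhile")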
Warning
We noticed that kerchunk zarr data for Rtraj files can be far larger than the netcdf file itself: this could go from 10 MB to 228 MB!
Examples
# Import the helper (and fsspec for the remote store example below):
import fsspec
from argopy.stores import ArgoKerchunker

# Use default memory store to manage kerchunk zarr data:
ak = ArgoKerchunker(store='memory')

# Use a local file store to keep track of zarr kerchunk data (for later
# re-use or sharing):
ak = ArgoKerchunker(store='local', root='kerchunk_data_folder')

# Use a remote file store to keep track of zarr kerchunk data (for later
# re-use or sharing):
fs = fsspec.filesystem('dir', path='s3://.../kerchunk_data_folder/', target_protocol='s3')
ak = ArgoKerchunker(store=fs)

# Methods:
ak.supported(ncfile)
ak.translate(ncfiles)
ak.to_reference(ncfile)
ak.pprint(ncfile)
# Let's consider a remote Argo netcdf file from a s3 server supporting lazy
# access (i.e. byte range requests):
ncfile = "argo-gdac-sandbox/pub/dac/coriolis/6903090/6903090_prof.nc"

# Simply open the netcdf file lazily:
from argopy.stores import s3store
ds = s3store().open_dataset(ncfile, lazy=True)

# You can also do it with the GDAC fs:
from argopy.stores import gdacfs
ds = gdacfs('s3').open_dataset("dac/coriolis/6903090/6903090_prof.nc", lazy=True)
# Create an instance that will save netcdf-to-zarr references in a local
# folder at "~/kerchunk_data_folder":
ak = ArgoKerchunker(store='local', root='~/kerchunk_data_folder')

# Get a dummy list of netcdf files:
from argopy import ArgoIndex
idx = ArgoIndex(host='s3').search_lat_lon_tim([-70, -55, 30, 45, '2025-01-01', '2025-02-01'])
ncfiles = [af.ls_dataset()['prof'] for af in idx.iterfloats()]

# Translate and save references for this batch of netcdf files
# (done in parallel, possibly using a Dask client if available):
ak.translate(ncfiles, fs=idx.fs['src'], chunker='auto')
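Because the references were written to a local folder, they can be reused without translating the files again. This is a short sketch, assuming a fresh instance pointed at the same root (with preload=True, the default, existing references are loaded back):
# Re-open the reference store; previously saved references under
# "~/kerchunk_data_folder" are preloaded:
ak = ArgoKerchunker(store='local', root='~/kerchunk_data_folder')

# Fetch zarr reference data for one of the files translated above;
# references already on the store should not need to be recomputed:
refs = ak.to_reference(ncfiles[0])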
- __init__(store: Literal['memory', 'local'] | AbstractFileSystem = 'memory', root: Path | str = '.', preload: bool = True, inline_threshold: int = 0, max_chunk_size: int = 0, storage_options: Dict | None = None)
- Parameters:
store (str, default='memory') – Kerchunk data store, i.e. the file system used to load from and/or save to kerchunk json files.
root (Path or str, default='.') – Local folder used as the base of the store.
preload (bool, default=True) – Whether kerchunk references already on the store should be preloaded or not.
inline_threshold (int, default=0) – Byte size below which an array will be embedded in the output. Use 0 to disable inlining. This argument is passed to kerchunk.netCDF3.NetCDF3ToZarr or kerchunk.hdf.SingleHdf5ToZarr.
max_chunk_size (int, default=0) – How big a chunk can be before triggering subchunking. If 0, there is no subchunking, and there is never subchunking for coordinate/dimension arrays. E.g., if an array contains 10,000 bytes and this value is 6,000, there will be two output chunks, split on the biggest available dimension. This argument is passed to kerchunk.netCDF3.NetCDF3ToZarr only.
storage_options (dict, default=None) – This argument is passed to kerchunk.netCDF3.NetCDF3ToZarr or kerchunk.hdf.SingleHdf5ToZarr during translation. These, in turn, will pass options to fsspec when opening the netcdf file.
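For instance, these tuning arguments could be combined as follows. This is an illustrative sketch; the threshold and chunk size values are arbitrary choices, not recommendations:
ak = ArgoKerchunker(
    store='local',
    root='~/kerchunk_data_folder',
    inline_threshold=1024,            # inline arrays smaller than 1 kB in the references
    max_chunk_size=8 * 1024 * 1024,   # split chunks larger than ~8 MB (NetCDF3 files only)
    storage_options={'anon': True},   # forwarded to fsspec when opening the netcdf file
)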
Methods
__init__([store, root, preload, ...])
nc2reference(ncfile[, fs, chunker]) – Compute reference data for a netcdf file (kerchunk json data)
pprint(ncfile[, params]) – Pretty print kerchunk json data for a netcdf file
supported(ncfile[, fs]) – Check if a netcdf file can be accessed through byte ranges
to_reference(ncfile[, fs, overwrite]) – Return zarr reference data for a given netcdf file
translate(ncfiles[, fs, chunker]) – Translate netcdf file(s) into kerchunk reference data
update_kerchunk_references_from_store() – Load kerchunk data already on store
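A typical round trip through these methods might look like this. This is a minimal sketch reusing the s3 profile file from the examples above:
ak = ArgoKerchunker(store='memory')
ncfile = "argo-gdac-sandbox/pub/dac/coriolis/6903090/6903090_prof.nc"

# Check that this netcdf file can be accessed through byte ranges:
if ak.supported(ncfile):
    # Return zarr reference data (computed, or taken from the store if available):
    refs = ak.to_reference(ncfile)
    # Pretty print the kerchunk json data:
    ak.pprint(ncfile)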
Attributes
store_path – Path to the reference store, including protocol
inline_threshold – int: Byte size below which an array will be embedded in the output.
max_chunk_size – int: How big a chunk can be before triggering subchunking.
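These attributes can be read directly from an instance, e.g.:
ak = ArgoKerchunker(store='local', root='~/kerchunk_data_folder')
print(ak.store_path)        # path to the reference store, including protocol
print(ak.inline_threshold)  # 0 by default (inlining disabled)
print(ak.max_chunk_size)    # 0 by default (no subchunking)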