A common scenario in geospatial data workflows is multiple raster files. Let’s consider 23 snapshots of OISST sea surface temperature spread across 2025. One NetCDF file per date, each raster is the same grid, four variables each. Let’s see what R does with that. Here we assume files is a vector of paths to the NetCDF files.
(Fully reproducible setup, downloading directly from NCEI is in the appendix.)
str(basename(files))
# chr [1:23] "oisst-avhrr-v02r01.20250101.nc" "oisst-avhrr-v02r01.20250117.nc" "oisst-avhrr-v02r01.20250202.nc" ...
library(terra)
rast(files)
# class : SpatRaster
# size : 720, 1440, 92 (nrow, ncol, nlyr)
# resolution : 0.25, 0.25 (x, y)
# extent : 0, 360, -90, 90
# coord. ref. : lon/lat WGS 84 (CRS84)
# sources : oisst-avhrr-v02r01.20250101.nc (4 layers)
# oisst-avhrr-v02r01.20250117.nc (4 layers)
# ... and 21 more sources
# varnames : anom, err, ice, sst, anom, err, ...
# names : anom_zlev=0, err_zlev=0, ice_zlev=0, sst_zlev=0, ...
# time (days) : 2025-01-01 to 2025-12-19 (23 steps)The result is pretty good. terra::rast() is no slouch: it reads all subdatasets across all files, concatenates band-wise, preserves the time axis, carries units, and records depth metadata. For a lot of workflows this is exactly what is needed, and it’s lazy — no pixel values are read until they are needed.
But look at the names: anom_zlev=0, err_zlev=0, ice_zlev=0, sst_zlev=0 — repeated 23 times, 92 layers total. Four variables and 23 timesteps are there, flattened into band order. The n-D structure — this is a (variable × time × lat × lon) array — is implicit in the interleaving, not encoded anywhere. There’s an implicit contract about how this was unrolled, there are conventions for band ordering, workarounds for the _zlev=0 naming. Nothing is wrong, but the band-stack model forces all of that n-D structure into a single axis, and recovering it requires a lot of information that is not usual in the multi-band model.
We can sidestep the variable interleaving with the vrt:// URI syntax:
vrt(
sprintf("vrt://%s?sd_name=sst", files),
"/tmp/sst_stack.vrt",
options = "-separate",
overwrite = TRUE
)
# class : SpatRaster
# size : 720, 1440, 23 (nrow, ncol, nlyr)
# source : sst_stack.vrt
# names : r_1, r_2, r_3, ..., r_23One variable, 23 layers. The vrt:// GDAL protocol is worth knowing: it lets us name a subdataset in a URI query string, vectorizing cleanly with sprintf and using the underlying raster-stack capability of GDAL VRT (-separate). But there’s a trade-off: no time axis, layer names are r_1 through r_23 rather than dates. We fixed the variable interleaving problem and lost the coordinate metadata.
rast(files, "sst") also selects a single subdataset without the vrt:// syntax, and produces the same 23-layer result — but it goes through the ncdf4 backend rather than GDAL, so it doesn’t produce a VRT file and doesn’t carry the same interoperability guarantees. The two approaches look similar but are doing different things underneath, and that distinction matters for what comes next.
stars approaches this differently. In default proxy (lazy) mode it uses GDAL classic raster and keeps variables as separate attributes, file-per-timestep structure explicit, doing this by harvesting metadata from the underlying files as carried by GDAL. stars::read_mdim() goes further — it uses the GDAL multidimensional API and gives named attributes, real dimension coordinates with udunits, proper n-D structure. The representation is right, but it too has limits; details are in the appendix.
None of these approaches is the best — each embeds real expertise and genuine design choices. But with excellent tools like terra and stars, you still can’t hand someone a file that says “this is a year of global SST.” You can hand them code, or you can hand them gigabytes. The code only works if they have the same files in the same place. The gigabytes require a format decision and a copy operation.
There’s no serializable recipe that is light, language-agnostic, and handles a broad variety of cases.
Fifteen minutes with xarray
A key moment for me was while learning Python and xarray, I was impressed that xarray could open these files — a huge list of them (16,337 days of OISST at time of writing) as local files or remote URLs. xarray checks them all for consistency, detects that what is really there is four variables in longitude, latitude, and time, and that time is an implicit concatenation from a single time step in each of those thousands of files. It’s remarkable: one object in memory in Python, ready to lazy-read any part of the four 16337×720×1440 arrays. But that took about 15 minutes to run, let alone the software and computing infrastructure needed to get there. And then: how do you share that progress with colleagues?
(I thought: “hmmm, what’s the xarray equivalent of a VRT?” Spoiler: there isn’t one - and although VirtualiZarr does serve that role, it’s missing an intermediate state where we only record what files were opened and what structure was learned.)
What the VRT already does
Back to our 23 files — a sample of the 16,337+ in the full OISST set as of writing.
Open /tmp/sst_stack.vrt in a text editor (browseURL("/tmp/sst_stack.vrt") works too). It’s a few kilobytes of XML that fully describes a virtual 23-layer SST dataset — source files, spatial extent, band mapping. It’s a tiny file: you could email it, put it in a repository, or patch the source paths to /vsicurl/https://....nc URLs so it works without any local files at all. Anyone with GDAL can open it with rast("sst_stack.vrt") and get exactly what you built, without copying a byte of the underlying data.
That’s a recipe, we’ve stacked our files into as simple chunk of text that GDAL can open very quickly.
VRT (Virtual Format) is a plain XML file that describes a virtual raster dataset. It can mosaic, warp, reproject, reband, and combine source files without touching the underlying data. For the 2D raster world it is exactly the recipe format that’s missing from the terra/stars in-memory story. The 2D VRT is one of GDAL’s most underappreciated features — already powering a lot of geospatial infrastructure, especially where band-stacking or tile-mosaicking is sufficient as the solution.
So what about the full n-D case: multiple variables, time coordinates, arbitrary dimensions?
What Zarr figured out
The Python/xarray community has been working on exactly this problem, approaching it from the chunk level rather than the file level.
Zarr is a chunked array format. An array is split into regular chunks; each chunk is compressed independently and stored as a separate object — a file, an S3 object, a range of bytes in a larger store. The chunks themselves are opaque: raw numeric arrays in compressed bytes, not self-describing. What makes Zarr work is the metadata sidecar: a small JSON file that records the shape, dtype, chunk layout, compression codec, and dimension coordinates — everything xarray needs to provide a rich, self-describing interface over the data. This is really good for cloud access: fetch only the chunks you need, in parallel, without a format-specific library. It’s a bit like taking a pile of NetCDF files and turning them inside out — all the metadata (tiny in comparison to the data) is placed upfront in JSON, and all the compressed opaque chunks are stored as individually addressable objects.
The deeper insight comes when you realize that HDF5, netCDF4, GeoTIFF, GRIB, and others already store their data in chunks. The chunks are buried inside files that require format-specific libraries to navigate, but the byte ranges are all there. We just need an index.
kerchunk and its successor VirtualiZarr do exactly that: open your existing files, find where each chunk lives (file path, byte offset, byte length), and write that index as a JSON or Parquet reference store. The bulk data never moves. What you get is a Zarr-shaped recipe that points back into your original files. It’s worth noting that this idea is not new — NASA and Unidata’s DMR++ format did the same thing for OPeNDAP. Cloud-Optimized GeoTIFF is the same idea applied to TIFF: the IFD tile index is moved to the front of the file so a client can read the chunk map in one range request and fetch only what it needs thereafter. The difference is that Zarr’s client-side framing made the pattern composable and ecosystem-wide rather than server-specific.
from virtualizarr import open_virtual_dataset
import xarray as xr
# index the chunks — no data copied
combined = xr.concat(
[open_virtual_dataset(f) for f in files],
dim="time"
)
combined.virtualize.to_kerchunk("sst_virtual.parquet", format="parquet")
# open it — chunk-aware, lazy, S3-ready, fully structured
ds = xr.open_dataset("sst_virtual.parquet", engine="kerchunk")
# <xarray.Dataset>
# Dimensions: (time: 23, zlev: 1, lat: 720, lon: 1440)
# Coordinates:
# * time (time) datetime64[ns] 2025-01-01 ... 2025-12-19
# * lat (lat) float32 -90.0 ... 89.75
# * lon (lon) float32 0.0 ... 359.75
# Data variables: anom, err, ice, sstThe store is kilobytes of metadata. The data stays where it is. The n-D structure is fully encoded. Compare this to the terra output above:
sst,anom,err,iceas named variables, not interleaved bandstimeas a real coordinate with date values, not layer indicesr_1..r_23lat,lon,zlevas labelled dimensions with coordinate values- the whole thing serialized as a file anyone with Zarr support can open
This is what stars::read_mdim() is reaching toward in R — the representation is structurally similar. The difference is that the xarray object came from a 12KB Parquet file, not from reconstructing structure at runtime over 23 original netCDF files. That difference is the recipe.
GDAL is the bridge
GDAL already spans both worlds.
GDAL has a multidimensional API (the “mdim” layer) that represents data as named arrays with labelled dimensions — the same model as xarray, not the band-stacked model of classic GDAL rasters. GDAL’s Zarr driver can open VirtualiZarr/kerchunk Parquet reference stores through this API. A store created in Python is readable via GDAL — which means it’s reachable from R today.
More importantly, GDAL has gdal mdim mosaic — a tool that takes a collection of raster files and produces a multidim VRT describing them as a unified n-D array dataset. This VRT encodes the array structure, dimension names, coordinate values, and source file mapping. It is the GDAL-native recipe for an n-D dataset — and as shown in the appendix, stars::read_mdim() opens it faithfully, giving you the full multi-file dataset with the right structure, right coordinates, and a serializable description that travels independently of the original files.
Converting a GDAL multidim VRT to a Zarr reference store is a small transformation on kilobytes of XML, not a data operation. That’s the near-term path to R-native VirtualiZarr creation.
The gap: creation, not reading
R can read virtual stores — Zarr and multidim VRT and VirtualiZarr Parquet reference via stars::read_mdim(), gdalraster::mdim_translate or zaro or real Zarr with pizzarr or zarr. What R can’t yet do is create the chunk-level byte-reference stores that VirtualiZarr generates for HDF5 and netCDF.
That’s a frontier, and the pieces are converging:
gdal mdim mosaic creates multidim VRTs from file collections today. The XML fully describes what a VirtualiZarr store needs — shape, chunks, coordinate values, source file paths and byte offsets. This is the hard structural work; emitting the Zarr reference store format from it is a straightforward conversion.
rhdf5 can now open remote HDF5 files at URLs or on S3 (dev version on GitHub). The next step is extracting chunk byte-reference indices directly — the core operation VirtualiZarr performs in Python. Step 7 in the appendix shows what this already looks like.
zaro (pure-R Zarr reader via Arrow) already reads VirtualiZarr Parquet stores. The reading side is there, and needs comparison with pizzarr, and zarr and other zarr-enabled packages.
vrtstack generates multidim VRTs from 2D raster collections in R — the gdal mdim mosaic equivalent without leaving R.
The path: vrtstack or gdal mdim mosaic for file collections → emit Zarr reference store → readable by zaro, xarray, or any Zarr-aware tool. In this way R can become a first-class creator of portable dataset descriptions, not just a consumer of what Python produces.
Why this matters beyond R
A Zarr reference store created in R from OISST netCDF files is immediately openable in Python, Julia, a browser via Zarr.js, any tool that speaks Zarr. No conversion. No data copy. No format negotiation.
STAC is complementary but different: STAC describes where files are and their spatiotemporal footprint. A virtual store describes what the dataset structurally is — how chunks map to array coordinates, what the dimensions mean, how to reassemble a slice. A STAC item can point to a virtual store the way it points to a COG. They compose; they don’t overlap.
The R community has genuine leverage here. GDAL is the common substrate between R and the xarray/Zarr ecosystem. We are not catching up to Python — we are building on the same foundation, and there are things we can do with it that aren’t on their roadmap yet.
What would help
If you work with time series of raster files and you’ve felt this friction:
- Talk about your use case — what files, what structure, what would a portable dataset description let you do that you can’t do today?
- Try
terra::vrt()withvrt://URI syntax for netCDF subdataset selection — it’s underused and solves real problems - Try
stars::read_mdim()on agdal mdim mosaicVRT — the full tour in the appendix shows how these fit together - Try
remotes::install_github("hypertidy/tidync@multi-input")if you have remote netCDF files and want to skip the download entirely — Step 8 in the appendix shows what this already looks like - Look at
vrtstackandzaroon the hypertidy GitHub to poke at the frontier - File issues against GDAL’s multidim API when you find gaps — R users exercising it in real workflows makes it better for everyone
Appendix: a progressive tour on 23 OISST files
Setup: download 23 OISST files from NCEI
All code below uses only this setup — no internal data infrastructure required.
library(glue)
date <- seq(as.Date("2025-01-01"), by = "16 days", length.out = 23)
base0 <- "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr"
ym <- format(date, "%Y%m")
ymd <- format(date, "%Y%m%d")
urls <- glue("{base0}/{ym}/oisst-avhrr-v02r01.{ymd}.nc")
dir <- tempdir()
files <- file.path(dir, basename(urls))
download.file(urls, files, method = "libcurl", mode = "wb")The URL structure itself is the catalogue: a simple path template over date components, no database required. 23 files, ~3.5MB each, ~80MB total.
Step 1 — terra: rast(), band-stacked, all variables, time preserved
rast(files)
# class : SpatRaster
# size : 720, 1440, 92 (nrow, ncol, nlyr)
# sources : oisst-avhrr-v02r01.20250101.nc (4 layers)
# ... and 22 more sources
# varnames : anom, err, ice, sst, anom, err, ice, sst, ...
# names : anom_zlev=0, err_zlev=0, ice_zlev=0, sst_zlev=0, ...
# unit : Celsius, Celsius, %, Celsius, ...
# time (days) : 2025-01-01 to 2025-12-19 (23 steps)terra uses the ncdf4 R package for netCDF reads — a design inherited from the raster package — rather than routing through GDAL. This is why its behaviour differs subtly from stars proxy mode. The time axis survives via ncdf4’s own metadata handling. The n-D structure is flattened into 92 bands; recovering it means knowing the 4-variable interleaving convention.
Step 2 — terra: vrt() with vrt://, single variable, time lost
vrt(
sprintf("vrt://%s?sd_name=sst", files),
"/tmp/sst_stack.vrt",
options = "-separate",
overwrite = TRUE
)
# class : SpatRaster
# size : 720, 1440, 23 (nrow, ncol, nlyr)
# source : sst_stack.vrt
# names : r_1, r_2, r_3, ..., r_23vrt:// URI syntax selects a subdataset inline; ?sd_name=sst is the query param. Vectorises cleanly with sprintf. The result is a real file — a 2D recipe for this one variable. Trade-off: time coordinate is gone, layer names are positional. The 2D VRT has no slot for dimension coordinates.
Step 3 — stars: read_stars() proxy, GDAL classic raster, variables separated
read_stars(files)
# stars_proxy object with 4 attributes in 92 file(s):
# $sst
# [1] "[...]/oisst-avhrr-v02r01.20250101.nc:sst"
# [2] "[...]/oisst-avhrr-v02r01.20250117.nc:sst"
# ...
# $anom, $err, $ice (similarly, 23 files each)
#
# dimension(s):
# from to offset delta refsys x/y
# x 1 1440 0 0.25 NA [x]
# y 1 720 90 -0.25 NA [y]
# zlev 1 1 0 [m] NA udunits
# time 1 23 NA NA NAstars proxy mode uses GDAL classic raster and separates variables as attributes rather than interleaving as bands — a meaningful structural difference from terra. Time is present as a dimension but without coordinate values. Fully lazy.
Step 4 — stars: read_mdim() on a single file, right model, wrong scope
stars::read_mdim(files[1L])
# stars object with 4 dimensions and 4 attributes
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
# sst [°C] -1.80 -1.80 -1.78 -1.70779 -1.64 -1.03 89635
# anom [°C] -0.72 -0.08 -0.02 -0.03531 0.00 0.49 89635
# err [°C] 0.30 0.30 0.30 0.30446 0.30 0.35 89635
# ice [%] 0.19 0.81 0.95 0.87747 0.98 1.00 91877
# dimension(s):
# from to offset delta refsys x/y
# lon 1 1440 0 0.25 WGS 84 (CRS84) [x]
# lat 1 720 -90 0.25 WGS 84 (CRS84) [y]
# zlev 1 1 0 [m] NA udunits
# time 1 1 17532 [days since 1978-01-01 12:00:00] NA udunitsNow we’re talking. read_mdim() goes through GDAL’s multidimensional API: named attributes, labelled dimensions, udunits-encoded time coordinate, proper spatial reference. This is the right representation of the dataset.
But pass it all 23 files and it silently opens only the first. It’s not multi-file aware. You get one timestep, and no error to tell you so.
This is the pivot point.
Step 5 — gdal mdim mosaic + read_mdim(), the full tour
gdal mdim mosaic is the missing piece. It takes a collection of files and produces a multidim VRT that describes them as a unified n-D array — dimension coordinates and all:
gdal mdim mosaic \
$(printf 'NETCDF:%s:sst ' $(cat filelist.txt)) \
-o /tmp/sst_mdim.vrtOr from R via vrtstack or a direct system() call. Now hand that VRT to read_mdim():
stars::read_mdim("/tmp/sst_mdim.vrt")
# stars object with 3 dimensions and 1 attribute
# attribute(s):
# Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
# sst [°C] -1.80 -1.80 -1.55 -1.198 -1.03 29.0 357256
# dimension(s):
# from to offset delta refsys x/y
# lon 1 1440 0 0.25 WGS 84 (CRS84) [x]
# lat 1 720 -90 0.25 WGS 84 (CRS84) [y]
# time 1 23 2025-01-01T00:00:00+00:00 16 days POSIXct23 timesteps. Named attribute. Real time coordinates as POSIXct with a 16-day delta. Proper spatial reference. And /tmp/sst_mdim.vrt is a file — a few kilobytes of XML — that fully describes this dataset spanning the year. Hand it to a colleague and they get the same thing. Put it in a repository. Point tidync or vapour at it. Open it in Python via GDAL. It’s the recipe.
This is what gdal mdim mosaic unlocks: read_mdim() was always the right model. It just needed the VRT to make the multi-file case work.
Step 6 — skip the download: URL access, who can and can’t
The files we downloaded are publicly accessible over plain HTTPS. Why download at all? Let’s see who can open them directly.
GDAL — just works. GDAL’s /vsicurl/ virtual filesystem handles HTTP range requests transparently, and both terra::rast() (via GDAL for the spatial read) and stars::read_mdim() can open the URLs directly once you prefix them appropriately. No download needed.
# GDAL-backed read direct from URL
stars::read_mdim(paste0("/vsicurl/", urls[1L]))
# stars object with 4 dimensions and 4 attributes
# ... same output as Step 4, no local file requiredncdf4 — the picture is more complicated. ncdf4 routes through the netCDF-C library, which expects either a local file path or an OPeNDAP/Thredds dodsC URL. A plain HTTPS URL fails, and the error is instructive:
ncdf4::nc_open(urls[1])
# syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or
# SCAN_DATASET or SCAN_ERROR
# context: <!DOCTYPE^ HTML PUBLIC ...><h1>Not Found</h1>...
# Error in ncdf4::nc_open: NetCDF: file not foundncdf4 tried to parse the 404 HTML response as netCDF syntax. But there’s a workaround — the #mode=bytes suffix instructs the HDF5 layer to use HTTP byte-range requests rather than treating the URL as a file path:
ncdf4::nc_open(sprintf("%s#mode=bytes", urls[1]))
# File https://...oisst-avhrr-v02r01.20250101.nc#mode=bytes
# (NC_FORMAT_NETCDF4):
# 4 variables (excluding dimension variables):
# sst, anom, err, ice ...It works — but you have to know the magic suffix. And because terra’s netCDF reading is backed by ncdf4, it inherits this same limitation: plain HTTPS URLs don’t work without #mode=bytes, and OPeNDAP dodsC URLs are the usual workaround for terra users working with remote data.
This is the library-identity problem in miniature: the same URL, three different outcomes depending on which library is doing the opening. GDAL’s cloud-native HTTP access is a first-class citizen; ncdf4’s is an afterthought requiring a non-obvious suffix. The recipe format inherits these differences — a VRT that references HTTPS URLs is GDAL-native and just works; a workflow built on ncdf4 needs extra plumbing.
Step 7 — rhdf5-dev: opening the door to chunk indexing (dev, not yet on CRAN)
The final step in this tour is a preview. The development version of rhdf5 on GitHub can open remote HDF5 files over HTTPS and S3 using the ROS3 virtual file driver — no download, no #mode=bytes workaround, just direct anonymous access:
# remotes::install_github("grimbough/rhdf5")
library(rhdf5)
fapl <- H5Pcreate("H5P_FILE_ACCESS")
H5Pset_fapl_ros3(fapl) # no credentials → anonymous mode
fid <- H5Fopen(urls[1], flags = "H5F_ACC_RDONLY", fapl = fapl)
fid
# HDF5 FILE
# name /
# filename
#
# name otype dclass dim
# 0 anom H5I_DATASET INTEGER 1440 x 720 x 1 x 1
# 1 err H5I_DATASET INTEGER 1440 x 720 x 1 x 1
# 2 ice H5I_DATASET INTEGER 1440 x 720 x 1 x 1
# 3 lat H5I_DATASET FLOAT 720
# 4 lon H5I_DATASET FLOAT 1440
# 5 sst H5I_DATASET INTEGER 1440 x 720 x 1 x 1
# 6 time H5I_DATASET FLOAT 1
# 7 zlev H5I_DATASET FLOAT 1What you’re seeing here is the raw HDF5 structure of the file — variables as datasets, dimensions as separate float arrays, all visible without reading a single data value. This is exactly the level at which VirtualiZarr operates: it walks this structure, reads the chunk B-tree for each dataset, and records the byte offset and length of every chunk.
Once rhdf5 can expose those chunk byte-references — which is the next step in its development — R will have a native path to VirtualiZarr store creation without Python, without shelling out, and working directly on files at HTTPS or S3 URLs. The H5Pset_fapl_ros3() call with no credentials is the anonymous S3/HTTPS access pattern: it works for any publicly accessible endpoint that supports range requests, including NCEI, NASA Earthdata public buckets, and most open data archives.
The reading side (zaro) is already there. The indexing side (rhdf5-dev) is almost there. The assembly side (gdal mdim mosaic → Zarr reference store) is a small conversion on top of infrastructure that already exists. The pieces are in place.
Step 8 — tidync dev: multi-file remote concat in one call (unmerged dev branch)
And then this happened.
# remotes::install_github("hypertidy/tidync@multi-input")
tidync::tidync(
sprintf("%s#mode=bytes", urls),
fast = TRUE,
concat_dim = list(name = "time", values = date)
)
# not a file:
# ' https://.../oisst-avhrr-v02r01.20250101.nc#mode=bytes '
#
# ... attempting remote connection
#
# Connection succeeded.
# Data Source (23): oisst-avhrr-v02r01.20250101.nc#mode=bytes,
# oisst-avhrr-v02r01.20250117.nc#mode=bytes ...
# Concatenated along 'time' (23/23 steps, 23 sources, fast mode)
#
# Grids (5) <dimension family> : <associated variables>
#
# [1] D3,D2,D1,D0 : sst, anom, err, ice **ACTIVE GRID**
# [2] D0 : time
# [3] D1 : zlev
# [4] D2 : lat
# [5] D3 : lon
#
# Dimensions 4 (all active):
# dim name length min max unlim coord_dim
# D0 time 23 20089 20441 TRUE TRUE
# D1 zlev 1 0 0 FALSE TRUE
# D2 lat 720 -89.875 89.875 FALSE TRUE
# D3 lon 1440 0.125 359.88 FALSE TRUENo download. No VRT. No pre-built reference store. Twenty-three remote netCDF files concatenated into one lazy n-D dataset in a single call — #mode=bytes vectorized over the URL vector, a synthetic time axis injected via concat_dim, all four variables on the active grid, unlimited time dimension, proper coordinates throughout.
The “not a file / attempting remote connection / Connection succeeded” in the output is tidync doing its own detective work: it knows this isn’t a local path, tries the remote connection, and succeeds. The #mode=bytes suffix — the ncdf4 workaround from Step 6, quietly rediscovered after a few years — turns out to be the key that unlocks the whole thing.
What this gives you is the R equivalent of the xarray VirtualiZarr open: the same structure, the same laziness, the same multi-file concatenation. The recipe here is the code rather than a Parquet store — but it’s remarkably clean code, and the tidync object it produces is fully navigable with hyper_filter() and hyper_array() without touching the data.
The serializable recipe (the Parquet store, the multidim VRT) is still the goal for sharing and interoperability. But as a working representation for analysis, this is already there.
The xarray counterpart — for comparison
ds = xr.open_dataset("sst_virtual.parquet", engine="kerchunk")
# <xarray.Dataset>
# Dimensions: (time: 23, zlev: 1, lat: 720, lon: 1440)
# Coordinates:
# * time (time) datetime64[ns] 2025-01-01 ... 2025-12-19
# * lat (lat) float32 -90.0 ... 89.75
# * lon (lon) float32 0.0 ... 359.75
# Data variables: sst, anom, err, iceOpened from a ~15KB Parquet reference store. Fully lazy — no data read yet. The store is the recipe; the original netCDF files are the data. Structurally this is what Steps 5–7 are reaching toward: the multidim VRT is GDAL’s version of the Parquet store, and converting between them is the near-term target. The VirtualiZarr store goes one level deeper — chunk-level byte references rather than file-level references — which is what makes it S3-native and format-library-free. That’s the remaining gap, and it’s closing.
Michael Sumner is a geospatial researcher at the Australian Antarctic Division. Packages mentioned — tidync, vapour, terra, stars, vrtstack, zaro — are part of or adjacent to the hypertidy ecosystem.