Virtualization

Read archival files as one Zarr dataset. No conversion, no copies. Point it at NetCDF4, HDF5, GRIB, or GeoTIFF files.

What it does

VirtualiZarr scans existing archival files and builds a virtual Zarr manifest that records where every chunk lives inside those files. Icechunk stores and version-controls the manifest with transactional commits. xarray then reads across the whole archive as if it were a single Zarr dataset, with no copying or rewriting of the original data.
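
To make the idea of a virtual manifest concrete, here is a minimal sketch that uses only the Python standard library. It does not call VirtualiZarr or Icechunk, and it is not how either library stores manifests internally. The `ChunkRef` type, the `read_chunk` helper, and the file names are hypothetical, chosen only to illustrate the concept: a manifest maps each chunk key to a file path, byte offset, and byte length, and a reader fetches exactly those bytes from the untouched original file.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical record of where one chunk lives inside an existing file."""

    path: str
    offset: int
    length: int


def read_chunk(ref: ChunkRef) -> bytes:
    """Fetch a chunk's bytes directly from the original file."""
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# Two stand-in "archival files": a header followed by one chunk payload each.
workdir = Path(tempfile.mkdtemp(prefix="manifest-sketch-"))
file_a = workdir / "year_2010.bin"
file_b = workdir / "year_2011.bin"
file_a.write_bytes(b"HEADER--" + b"chunk-2010")
file_b.write_bytes(b"HDR-" + b"chunk-2011")

# The manifest: chunk key -> byte range in a source file. Keys "0" and "1"
# index along time, so the two files appear as one array with two chunks.
manifest = {
    "0": ChunkRef(str(file_a), offset=8, length=10),
    "1": ChunkRef(str(file_b), offset=4, length=10),
}

# Reading "the dataset" means resolving each key to bytes; the source files
# are never copied or rewritten.
chunks = [read_chunk(manifest[key]) for key in sorted(manifest)]
print(chunks)
```

In this toy version the manifest is a plain dictionary. The real pipeline additionally records array metadata (shapes, dtypes, codecs) so that the bytes can be decoded as Zarr chunks.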

Why it matters

  • No conversion costs: terabyte-scale archives can be exposed as Zarr datasets without rewriting them
  • Atomic updates: Icechunk commits are transactional, so adding or amending data in the manifest is safe under concurrent writes
  • Format-agnostic: the same manifest can point at NetCDF4, HDF5, GRIB, GeoTIFF, FITS, and more

Virtualization flow

How to use it

Run with uv run site/examples/virtualization.py. The script lists a slice of NASA NEX-GDDP-CMIP6 from public S3 with obspec-utils, builds the virtual dataset with VirtualiZarr's v2 API, persists it in Icechunk, and reads it back.

python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "virtualizarr==2.4.0",
#   "zarr==3.1.6",
#   "obstore",
#   "obspec-utils>=0.9.0",
#   "icechunk>=2.0",
#   "xarray",
#   "h5py",
# ]
# ///
"""Virtualization: read archival NetCDF files as a single Zarr dataset.

Lists a slice of NASA NEX-GDDP-CMIP6 daily climate projections from public S3,
builds a virtual dataset with VirtualiZarr's v2 API, persists the manifest in
Icechunk, and reads it back as a normal Zarr dataset.
"""

import tempfile
from pathlib import Path

import virtualizarr as vz
import xarray as xr
from obspec_utils.glob import glob
from obspec_utils.registry import ObjectStoreRegistry
from obstore.store import from_url

import icechunk

BUCKET = "s3://nex-gddp-cmip6/"
# A small, deterministic subset: ACCESS-CM2 historical surface air temperature, 5 years.
PATTERN = "NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/tas/*201[0-4]*.nc"

# 1. Build an obstore-backed registry for the public bucket (anonymous reads).
store = from_url(BUCKET, region="us-west-2", skip_signature=True)
registry = ObjectStoreRegistry({BUCKET: store})

# 2. List the input files via obspec-utils glob (no per-file HEAD requests).
urls = [f"{BUCKET}{path}" for path in glob(store, PATTERN)]
print(f"virtualizing {len(urls)} files")

# 3. Build one virtual dataset spanning all of them (v2 API).
parser = vz.parsers.HDFParser()
vds = vz.open_virtual_mfdataset(
    urls, registry=registry, parser=parser, combine="by_coords"
)

# 4. Tell Icechunk how to read virtual chunks from the public S3 bucket
#    (anonymous access, same region as the source data).
store_config = icechunk.ObjectStoreConfig.S3(
    icechunk.S3Options(region="us-west-2", anonymous=True)
)
container = icechunk.VirtualChunkContainer(BUCKET, store_config)
config = icechunk.RepositoryConfig(virtual_chunk_containers={BUCKET: container})

# 5. Persist the manifest in an Icechunk repo (transactional commits).
workdir = Path(tempfile.mkdtemp(prefix="virtualization-demo-"))
repo = icechunk.Repository.create(
    icechunk.local_filesystem_storage(str(workdir / "nex-gddp-icechunk")),
    config=config,
    authorize_virtual_chunk_access={BUCKET: None},
)
session = repo.writable_session("main")
vds.virtualize.to_icechunk(session.store)
session.commit("Initial NEX-GDDP-CMIP6 virtual dataset")

# 6. Read it back as a normal Zarr dataset.
ds = xr.open_zarr(repo.readonly_session("main").store, consolidated=False)
print(ds)
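
The `PATTERN` in the script relies on a character class to select five years of files. As a rough check of which names it picks, the sketch below applies the final path segment of that pattern to a few file names using the standard library's `fnmatch`. The file names are hypothetical stand-ins, not a listing of the bucket, and `fnmatch` is only assumed to agree with the obspec-utils glob for a pattern this simple.

```python
from fnmatch import fnmatchcase

# Final path segment of PATTERN from the script.
FILE_PATTERN = "*201[0-4]*.nc"

# Hypothetical file names, used only to illustrate the matching rules.
candidates = [
    "tas_day_ACCESS-CM2_historical_r1i1p1f1_gn_2009.nc",
    "tas_day_ACCESS-CM2_historical_r1i1p1f1_gn_2010.nc",
    "tas_day_ACCESS-CM2_historical_r1i1p1f1_gn_2014.nc",
    "tas_day_ACCESS-CM2_historical_r1i1p1f1_gn_2015.nc",
    "tas_day_ACCESS-CM2_historical_r1i1p1f1_gn_2012.nc.md5",
]

# Keep names that contain "201" followed by a digit 0-4 and end in ".nc".
selected = [name for name in candidates if fnmatchcase(name, FILE_PATTERN)]
for name in selected:
    print(name)
```

Because the pattern matches `201[0-4]` anywhere in the name, a file whose name carries those digits outside the year field would also be selected, so it is worth printing the listed URLs before virtualizing a new subset.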

Learn more

EGU 2026 · ESSI2.2 · EGU26-15196