GDAL multidim and cloud-ready ZARR

Author

Michael Sumner

The GDAL API

I have been working on a better understanding of the GDAL multidimensional model and to do that really needs a closer look at the GDAL API itself.

This post demonstrates loading the GDAL API via its Python bindings into R. We connect to a modern “cloud-ready” ZARR dataset, find out some details about its contents and then convert from its native multidimensional form to a more classic 2D raster model, then use that to create a map from Sentinel 2 imagery.

We don’t go very deep into any part, but just want to show a quick tour of some parts of GDAL that don’t get as much attention as deserved (IMO).

ZARR and the European Space Agency (ESA)

The ESA is moving Copernicus to ZARR, launching its Earth Observation Processing Framework (EOPF) data format (Zarr). ZARR is “a community project to develop specifications and software for storage of large N-dimensional typed arrays”.

A ZARR is a dataset consisting of trees of array chunks stored (usually) in object storage and indexed by fairly simple JSON metadata that describes how those chunks align together in one potentially very large array. The idea is that all the metadata lives upfront in instantly readable JSON, and changes made to the dataset (extending it each day as new data arrives) affects only the relevant chunks and the small parts of the JSON. This is different to a long list of NetCDFs that grows every day, where the metadata is self-contained for each file and there is no overarching abstraction for the entire file set.

ZARR is usually in object storage, and loaded by datacube software such as xarray. It’s not intended to be zipped into a huge file and downloaded or read, the real power lies in the entire dataset being lazy, and understood by software that needs just one data set desription (url, or S3 path, etc).

ESA sample Zarr datasets

The ESA provide a set of example ZARRs that are available in zip files:

https://eopf-public.s3.sbg.perf.cloud.ovh.net/product.html

We choose one that is described by this URL:

url <- "https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip"

GDAL urls and zip files

GDAL doesn’t force us to download data, but we need some syntax to leverage its remote capabilities in the simplest way. The Virtual File System (VSI) allows us to declare special sources like zip files /vsizip/ and urls /vsicurl/, which we can chain together. With ZARR we also need careful quoting of the description, and we declare the ZARR driver upfront.

(dsn <- sprintf('ZARR:"/vsizip//vsicurl/%s"', url))
[1] "ZARR:\"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip\""

GDAL and multidimensional datasets

GDAL has a multidimensional mode for data sources that aren’t “imagery” in the traditional sense. (If we open one of these datasets in “classic” mode we end up with a lot of bands on a 2D raster.)

Multidimensional mode is avaible in the API via OpenEx() and declaring type OF_MULTIDIM_RASTER.

To actually load this python library we use {reticulate} py_require() which drives the awesome Python uv package manager.

(For some reason the pypi name of the package is “gdal”, but the actual model is obtained with “osgeo.gdal”).

reticulate::py_require("gdal")
gdal <- reticulate::import("osgeo.gdal")
gdal$UseExceptions()

sample(names(gdal), 40)  ## see that we have a huge coverage of the underlying API, 544 elements at time of writing
 [1] "ReprojectImage"                                    
 [2] "CXT_Literal"                                       
 [3] "wrapper_GDALWarpDestDS"                            
 [4] "OF_RASTER"                                         
 [5] "DataTypeUnion"                                     
 [6] "OF_READONLY"                                       
 [7] "GCI_LWIRBand"                                      
 [8] "Sync"                                              
 [9] "DCAP_CURVE_GEOMETRIES"                             
[10] "GRA_Bilinear"                                      
[11] "osr"                                               
[12] "DCAP_OPEN"                                         
[13] "GetCacheMax"                                       
[14] "DecToDMS"                                          
[15] "CPLE_NotSupported"                                 
[16] "GCI_YCbCr_CrBand"                                  
[17] "CreateRasterAttributeTableFromMDArrays"            
[18] "CE_Fatal"                                          
[19] "TileIndexInternalNames"                            
[20] "DCAP_GNM"                                          
[21] "GetSubdatasetInfo"                                 
[22] "GARIO_PENDING"                                     
[23] "MkdirRecursive"                                    
[24] "GCI_SAR_P_Band"                                    
[25] "DMD_NUMERIC_FIELD_WIDTH_INCLUDES_DECIMAL_SEPARATOR"
[26] "AutoCreateWarpedVRT"                               
[27] "GMF_ALPHA"                                         
[28] "GDAL_GCP_GCPZ_get"                                 
[29] "GDT_CInt32"                                        
[30] "VectorInfoInternal"                                
[31] "GCI_YCbCr_YBand"                                   
[32] "GRA_Max"                                           
[33] "Driver"                                            
[34] "GDALInfoOptions"                                   
[35] "TranslateInternal"                                 
[36] "GARIO_COMPLETE"                                    
[37] "GFU_PixelCount"                                    
[38] "GetDataTypeByName"                                 
[39] "SieveFilter"                                       
[40] "GDAL_DMD_RELATIONSHIP_FLAGS"                       

The API elements chain in the usual way that works in python with ‘object.element.thing.etc’ syntax uses R’s $ accessor.

gdal$Dimension$GetIndexingVariable
<function Dimension.GetIndexingVariable at 0x7f0d83c71760>
 signature: (self, *args) -> 'GDALMDArrayHS *'

Open the data

It’s not very exciting yet.

ds <- gdal$OpenEx(dsn, gdal$OF_MULTIDIM_RASTER)
ds
<osgeo.gdal.Dataset; proxy of <Swig Object of type 'GDALDatasetShadow *' at 0x7f0d8be973f0> >

To actually find out what’s in there we have to traverse a tree of potentially nested “groups” that organize actual datasets in a hierarchy.

Get the root group and dive in, it’s very tediuous but shows some of what is there. There might be MDArrays in a group, or there might just be more groups.

rg <- ds$GetRootGroup()
rg$GetMDArrayNames()
list()
rg$GetGroupNames()
[1] "conditions"   "measurements" "quality"     
g1 <- rg$OpenGroup("quality")
g1$GetMDArrayNames()
list()
g1$GetGroupNames()
[1] "l1c_quicklook" "mask"         
g2 <- g1$OpenGroup("l1c_quicklook")
g2$GetMDArrayNames()
list()
g2$GetGroupNames()
[1] "r10m"
g3 <- g2$OpenGroup("r10m")
g3$GetMDArrayNames()
[1] "band" "x"    "y"    "tci" 
g3$GetGroupNames()
list()

Finally we got to actual data, we recognize ‘tci’ as being the quicklook RGB of Sentinel 2.

To avoid tedium write a quick recursive function to find all the MDArray, at each level use GetFullName() which provides the cumulative path to where we are in the tree.

get_all_mdnames <- function(rootgroup) {
  groups <- rootgroup$GetGroupNames()
  groupname <- rootgroup$GetFullName()
  amd <- rootgroup$GetMDArrayNames()
  md <- sprintf("%s/%s", groupname, amd)
  md <- c(md, unlist(lapply(groups, \(.g) get_all_mdnames(rootgroup$OpenGroup(.g)))))
  md
}

get_all_mdnames(ds$GetRootGroup())
  [1] "/conditions/geometry/angle"                        
  [2] "/conditions/geometry/band"                         
  [3] "/conditions/geometry/detector"                     
  [4] "/conditions/geometry/x"                            
  [5] "/conditions/geometry/y"                            
  [6] "/conditions/geometry/mean_sun_angles"              
  [7] "/conditions/geometry/mean_viewing_incidence_angles"
  [8] "/conditions/geometry/sun_angles"                   
  [9] "/conditions/geometry/viewing_incidence_angles"     
 [10] "/conditions/mask/detector_footprint/r10m/x"        
 [11] "/conditions/mask/detector_footprint/r10m/y"        
 [12] "/conditions/mask/detector_footprint/r10m/b02"      
 [13] "/conditions/mask/detector_footprint/r10m/b03"      
 [14] "/conditions/mask/detector_footprint/r10m/b04"      
 [15] "/conditions/mask/detector_footprint/r10m/b08"      
 [16] "/conditions/mask/detector_footprint/r20m/x"        
 [17] "/conditions/mask/detector_footprint/r20m/y"        
 [18] "/conditions/mask/detector_footprint/r20m/b05"      
 [19] "/conditions/mask/detector_footprint/r20m/b06"      
 [20] "/conditions/mask/detector_footprint/r20m/b07"      
 [21] "/conditions/mask/detector_footprint/r20m/b11"      
 [22] "/conditions/mask/detector_footprint/r20m/b12"      
 [23] "/conditions/mask/detector_footprint/r20m/b8a"      
 [24] "/conditions/mask/detector_footprint/r60m/x"        
 [25] "/conditions/mask/detector_footprint/r60m/y"        
 [26] "/conditions/mask/detector_footprint/r60m/b01"      
 [27] "/conditions/mask/detector_footprint/r60m/b09"      
 [28] "/conditions/mask/detector_footprint/r60m/b10"      
 [29] "/conditions/mask/l1c_classification/r60m/x"        
 [30] "/conditions/mask/l1c_classification/r60m/y"        
 [31] "/conditions/mask/l1c_classification/r60m/b00"      
 [32] "/conditions/meteorology/cams/latitude"             
 [33] "/conditions/meteorology/cams/longitude"            
 [34] "/conditions/meteorology/cams/aod1240"              
 [35] "/conditions/meteorology/cams/aod469"               
 [36] "/conditions/meteorology/cams/aod550"               
 [37] "/conditions/meteorology/cams/aod670"               
 [38] "/conditions/meteorology/cams/aod865"               
 [39] "/conditions/meteorology/cams/bcaod550"             
 [40] "/conditions/meteorology/cams/duaod550"             
 [41] "/conditions/meteorology/cams/isobaricInhPa"        
 [42] "/conditions/meteorology/cams/number"               
 [43] "/conditions/meteorology/cams/omaod550"             
 [44] "/conditions/meteorology/cams/ssaod550"             
 [45] "/conditions/meteorology/cams/step"                 
 [46] "/conditions/meteorology/cams/suaod550"             
 [47] "/conditions/meteorology/cams/surface"              
 [48] "/conditions/meteorology/cams/time"                 
 [49] "/conditions/meteorology/cams/valid_time"           
 [50] "/conditions/meteorology/cams/z"                    
 [51] "/conditions/meteorology/ecmwf/latitude"            
 [52] "/conditions/meteorology/ecmwf/longitude"           
 [53] "/conditions/meteorology/ecmwf/isobaricInhPa"       
 [54] "/conditions/meteorology/ecmwf/msl"                 
 [55] "/conditions/meteorology/ecmwf/number"              
 [56] "/conditions/meteorology/ecmwf/r"                   
 [57] "/conditions/meteorology/ecmwf/step"                
 [58] "/conditions/meteorology/ecmwf/surface"             
 [59] "/conditions/meteorology/ecmwf/tco3"                
 [60] "/conditions/meteorology/ecmwf/tcwv"                
 [61] "/conditions/meteorology/ecmwf/time"                
 [62] "/conditions/meteorology/ecmwf/u10"                 
 [63] "/conditions/meteorology/ecmwf/v10"                 
 [64] "/conditions/meteorology/ecmwf/valid_time"          
 [65] "/measurements/reflectance/r10m/x"                  
 [66] "/measurements/reflectance/r10m/y"                  
 [67] "/measurements/reflectance/r10m/b02"                
 [68] "/measurements/reflectance/r10m/b03"                
 [69] "/measurements/reflectance/r10m/b04"                
 [70] "/measurements/reflectance/r10m/b08"                
 [71] "/measurements/reflectance/r20m/x"                  
 [72] "/measurements/reflectance/r20m/y"                  
 [73] "/measurements/reflectance/r20m/b05"                
 [74] "/measurements/reflectance/r20m/b06"                
 [75] "/measurements/reflectance/r20m/b07"                
 [76] "/measurements/reflectance/r20m/b11"                
 [77] "/measurements/reflectance/r20m/b12"                
 [78] "/measurements/reflectance/r20m/b8a"                
 [79] "/measurements/reflectance/r60m/x"                  
 [80] "/measurements/reflectance/r60m/y"                  
 [81] "/measurements/reflectance/r60m/b01"                
 [82] "/measurements/reflectance/r60m/b09"                
 [83] "/measurements/reflectance/r60m/b10"                
 [84] "/quality/l1c_quicklook/r10m/band"                  
 [85] "/quality/l1c_quicklook/r10m/x"                     
 [86] "/quality/l1c_quicklook/r10m/y"                     
 [87] "/quality/l1c_quicklook/r10m/tci"                   
 [88] "/quality/mask/r10m/x"                              
 [89] "/quality/mask/r10m/y"                              
 [90] "/quality/mask/r10m/b02"                            
 [91] "/quality/mask/r10m/b03"                            
 [92] "/quality/mask/r10m/b04"                            
 [93] "/quality/mask/r10m/b08"                            
 [94] "/quality/mask/r20m/x"                              
 [95] "/quality/mask/r20m/y"                              
 [96] "/quality/mask/r20m/b05"                            
 [97] "/quality/mask/r20m/b06"                            
 [98] "/quality/mask/r20m/b07"                            
 [99] "/quality/mask/r20m/b11"                            
[100] "/quality/mask/r20m/b12"                            
[101] "/quality/mask/r20m/b8a"                            
[102] "/quality/mask/r60m/x"                              
[103] "/quality/mask/r60m/y"                              
[104] "/quality/mask/r60m/b01"                            
[105] "/quality/mask/r60m/b09"                            
[106] "/quality/mask/r60m/b10"                            

Happily, we see our target MDArray name in there /quality/l1c_quicklook/r10m/tci.

Actual data

Finally let’s get some data out. We can obtain the MDArray now by full name and find out some properties.

reticulate::py_require("gdal")
gdal <- reticulate::import("osgeo.gdal")
gdal$UseExceptions()

dsn <- "ZARR:\"/vsizip//vsicurl/https://eopf-public.s3.sbg.perf.cloud.ovh.net/eoproducts/S02MSIL1C_20230629T063559_0000_A064_T3A5.zarr.zip\""

ds <- gdal$OpenEx(dsn, gdal$OF_MULTIDIM_RASTER)
rg <- ds$GetRootGroup()
mdname <- "/quality/l1c_quicklook/r10m/tci"
md <- rg$OpenMDArrayFromFullname(mdname)
md$GetDimensionCount()
[1] 3

Traverse the dimensions to get their sizes and names.

vapply(md$GetDimensions(), \(.d) .d$GetSize(), 0L)
[1]     3 10980 10980
vapply(md$GetDimensions(), \(.d) .d$GetName(), "")
[1] "band" "y"    "x"   

Explore some metadata attributes.

mnames <- vapply(md$GetAttributes(), \(.a) .a$GetName(), "")

(meta <- setNames(lapply(md$GetAttributes(), \(.a) unlist(.a$Read())), mnames))
$long_name
[1] "TCI: True Color Image"

$`proj:bbox`
[1]  300000 4490220  409800 4600020

$`proj:epsg`
[1] 32636

$`proj:shape`
[1] 10980 10980

$`proj:transform`
[1]      10       0  300000       0     -10 4600020       0       0       1

$`proj:wkt2`
[1] "PROJCS[\"WGS 84 / UTM zone 36N\",GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4326\"]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",0],PARAMETER[\"central_meridian\",33],PARAMETER[\"scale_factor\",0.9996],PARAMETER[\"false_easting\",500000],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH],AUTHORITY[\"EPSG\",\"32636\"]]"

We see that the CRS information is in there, but sadly while the geotransform of the data is maintained when we convert to classic raster the CRS does not make the journey (the geotransform is just the image bbox intermingled with its resolution in a mathematical abstraction used in matrix manipulation).

A multidim raster can be converted to classic form, either in whole or after applying a $GetView() operation that acts like numpy ‘[:]’ array subsetting.

When converting to classice we specify the x, then y dimensions of the dataset, which from multdim convention are in reverse order.

cc <- md$AsClassicDataset(2L, 1L)
tr <- unlist(cc$GetGeoTransform())
dm <- c(cc$RasterXSize, cc$RasterYSize)
cc$GetSpatialRef() ## empty, so we retrieve from the attributes
crs <- cc$GetMetadata()[["proj:wkt2"]]

Now, I used reproj_extent to convert the dataset’s bbox to one in longlat, but I won’t share that here, we’ll just use the result so our map is a little more familiar. The source data xy bbox is xmin: 300000 ymin: 4490220 xmax: 409800 ymax:4600020, and we take small section of that which is this in longitude latitude:

# xmin,xmax,ymin,ymax converted below to bbox 
ex <- c(31.689, 31.774, 40.665, 40.724)

To warp the imagery to this region we first need to set the CRS properly on the dataset, so translate to a temporary VRT and use that dataset for the next step.

dsvrt <- gdal$Translate(tempfile(fileext = ".vrt", tmpdir = "/vsimem"), cc, options = c("-a_srs", crs))
tf <- tf <- "/vsimem/result.tif"
ww <- gdal$Warp(tf, dsvrt,  dstSRS = "EPSG:4326", outputBounds = ex[c(1, 3, 2, 4)])
ww$Close()
[1] 0

Finally we can read the image, plot it, obtain some contextual data and be assured that our map is correct.

library(gdalraster)
GDAL 3.11.0dev-dd6009a0eb, released 2018/99/99, GEOS 3.13.0, PROJ 9.6.0
data <- read_ds(new(GDALRaster, tf))
library(osmdata)
Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
x <- opq(ex[c(1, 3, 2, 4)]) |> add_osm_feature(key = "highway") |> osmdata_sf()

library(sf)
Linking to GEOS 3.13.0, GDAL 3.11.0dev-dd6009a0eb, PROJ 9.6.0; sf_use_s2() is
TRUE
WARNING: different compile-time and runtime versions for GEOS found:
Linked against: 3.13.0-CAPI-1.19.0 compiled against: 3.8.0-CAPI-1.13.1
It is probably a good idea to reinstall sf (and maybe lwgeom too)
plot_raster(data)
plot(x$osm_lines[0], add = TRUE, col = "hotpink")

Summary