Skip to content
Sindbad.DataLoaders Module
julia
DataLoaders

The DataLoaders module provides tools for ingesting and preprocessing SINDBAD input data. It supports reading, cleaning, masking, and managing forcing/observation data with an emphasis on spatial and temporal dimensions.

Purpose

Streamline the ingestion and preprocessing of input data for SINDBAD experiments.

Dependencies

Related (SINDBAD ecosystem)

  • OmniTools: Utility functions for handling NamedTuples, printing, and shared helpers.

External (third-party)

  • AxisKeys: Labeled multidimensional arrays (KeyedArray).

  • DimensionalData: Dimension-aware indexing/slicing.

  • FillArrays: Efficient filled-array representations.

  • NCDatasets: NetCDF reader/writer.

  • YAXArrays, YAXArrayBase: Multidimensional array/cube abstractions used for IO and spatial data.

  • Zarr: Chunked/compressed array storage for large datasets.

Internal (within Sindbad)

  • Sindbad.Setup

  • Sindbad.Types

  • SindbadTEM

Included Files

  • utilsDataLoaders.jl: Utility functions for data preprocessing (cleaning, masking, bounds checks).

  • spatialSubset.jl: Spatial operations (extracting subsets based on spatial dimensions).

  • getForcing.jl: Extracting and processing forcing data (environmental drivers).

  • getObservation.jl: Reading and processing observational data for evaluation/validation.

Notes

  • The module uses NCDatasets, YAXArrays, and Zarr directly; it does not re-export them.

  • Designed to handle large datasets efficiently, leveraging chunked and compressed data formats like NetCDF and Zarr.

  • Ensures compatibility with SINDBAD's experimental framework by integrating spatial and temporal data management tools.

Functions

getForcing

Sindbad.DataLoaders.getForcing Function
julia
getForcing(info::NamedTuple)

Reads forcing data from the data_path specified in the experiment configuration and returns a NamedTuple with the forcing data.

Arguments:

  • info: A SINDBAD NamedTuple containing all information needed for setup and execution of an experiment.

Returns:

  • A NamedTuple forcing containing:
    • data: The processed input cubes.

    • dims: The dimensions of the forcing data.

    • variables: The names of the forcing variables.

    • f_types: The types of the forcing data (e.g., ForcingWithTime or ForcingWithoutTime).

    • helpers: Helper information for the forcing data.

Examples

julia
julia> using Sindbad

julia> # Load forcing data from experiment configuration
julia> # forcing = getForcing(info)

Notes:

  • Reads forcing data from the specified data path and processes it using the SINDBAD framework.

  • Handles spatiotemporal and spatial-only forcing data.

  • Applies masks and subsets to the forcing data if specified in the configuration.

Code
julia
function getForcing(info::NamedTuple)
    nc_default = nothing
    forcing_data_settings = info.experiment.data_settings.forcing
    # forcing_data_settings = info.experiment.data_settings.forcing
    data_path = forcing_data_settings.default_forcing.data_path
    if !isnothing(data_path)
        data_path = getAbsDataPath(info, data_path)
        print_info(getForcing, @__FILE__, @__LINE__, "default_data_path: `$(data_path)`")
        nc_default = loadDataFile(data_path)
    end
    data_backend = getfield(Types, to_uppercase_first(info.helpers.run.input_data_backend, "Backend"))()

    forcing_mask = nothing
    if :sel_mask keys(forcing_data_settings)
        if !isnothing(forcing_data_settings.forcing_mask.data_path)
            mask_path = getAbsDataPath(info, forcing_data_settings.forcing_mask.data_path)
            _, forcing_mask = getYaxFromSource(nothing, mask_path, nothing, forcing_data_settings.forcing_mask.source_variable, info, data_backend)
            forcing_mask = positive_mask(forcing_mask)
        end
    end

    default_info = info.experiment.data_settings.forcing.default_forcing
    forcing_vars = keys(forcing_data_settings.variables)
    tar_dims = getTargetDimensionOrder(info)
    print_info(getForcing, @__FILE__, @__LINE__, "getting forcing variables. Units given in forcing settings are not strictly enforced but shown for reference. Bounds are applied after unit conversion...", n_m=1)
    vinfo = nothing
    f_sizes = nothing
    f_dimension = nothing
    num_type = Val{info.helpers.numbers.num_type}()
    incubes = map(forcing_vars) do k
        nc = nc_default
        vinfo = merge_namedtuple_prefer_nonempty(default_info, forcing_data_settings.variables[k])
        data_path_v = getAbsDataPath(info, getfield(vinfo, :data_path))
        nc, yax = getYaxFromSource(nc, data_path, data_path_v, vinfo.source_variable, info, data_backend)
        incube = subsetAndProcessYax(yax, forcing_mask, tar_dims, vinfo, info, num_type)
        v_op = vinfo.additive_unit_conversion ? " + " : " * "
        v_op = v_op * "$(vinfo.source_to_sindbad_unit)"
        v_string = "`$(k)` ($(vinfo.sindbad_unit), $(vinfo.bounds)) = <$(vinfo.space_time_type)> `$(vinfo.source_variable)` ($(vinfo.source_unit)) $(v_op)"
        print_info(nothing, @__FILE__, @__LINE__, v_string, n_m=4)
        if vinfo.space_time_type == "spatiotemporal" && isnothing(f_sizes)
            f_sizes = collectForcingSizes(info, incube)
            f_dimension = getSindbadDims(incube)
        end
        incube
    end
    return createForcingNamedTuple(incubes, f_sizes, f_dimension, info)
end

getNumberOfTimeSteps

Sindbad.DataLoaders.getNumberOfTimeSteps Function
julia
getNumberOfTimeSteps(incubes, time_name)

Returns the number of time steps in the input data cubes.

Arguments

  • incubes: Input data cubes containing temporal information

  • time_name: Name of the time dimension/variable

Returns

Integer representing the total number of time steps in the data

Code
julia
function getNumberOfTimeSteps(incubes, time_name)
    i1 = findfirst(c -> YAXArrays.Axes.findAxis(time_name, c) !== nothing, incubes)
    return length(getAxis(time_name, incubes[i1]).values)
end

getObservation

Sindbad.DataLoaders.getObservation Function
julia
getObservation(info::NamedTuple, forcing_helpers::NamedTuple)

Processes observation data and returns a NamedTuple containing the observation data, dimensions, and variables.

Arguments:

  • info: A SINDBAD NamedTuple containing all information needed for setup and execution of an experiment.

  • forcing_helpers: A SINDBAD NamedTuple containing helper information for forcing data.

Returns:

  • A NamedTuple with the following fields:
    • data: The processed observation data as an input array.

    • dims: The dimensions of the observation data.

    • variables: A tuple of variable names for the observation data.

Notes:

  • Reads observation data from the path specified in the experiment configuration.

Examples

julia
julia> using Sindbad

julia> # Load observation data from experiment configuration
julia> # observations = getObservation(info, forcing_helpers)
  • Handles quality flags, uncertainty, spatial weights, and selection masks for each observation variable.

  • Subsets and harmonizes the observation data based on the target dimensions and masks.

Code
julia
function getObservation(info::NamedTuple, forcing_helpers::NamedTuple)
    observation_data_settings = info.experiment.data_settings.optimization
    forcing_data_settings = info.experiment.data_settings.forcing
    exe_rules_settings = info.experiment.exe_rules
    data_path = observation_data_settings.observations.default_observation.data_path
    data_backend = getfield(Types, to_uppercase_first(exe_rules_settings.input_data_backend, "Backend"))()
    default_info = observation_data_settings.observations.default_observation
    tar_dims = getTargetDimensionOrder(info)

    nc_default = nothing

    if !isnothing(data_path)
        data_path = getAbsDataPath(info, data_path)
        print_info(getObservation, @__FILE__, @__LINE__, "default_observation_data_path: `$(data_path)`")
        nc_default = loadDataFile(data_path)
    end

    varnames = Symbol.(observation_data_settings.observational_constraints)

    yax_mask = nothing
    if :one_sel_mask keys(observation_data_settings)
        if !isnothing(observation_data_settings.one_sel_mask)
            mask_path = getAbsDataPath(info, observation_data_settings.one_sel_mask)
            _, yax_mask = getYaxFromSource(nothing, mask_path, nothing, "mask", info, data_backend)
            yax_mask = positive_mask(yax_mask)
        end
    end
    obscubes = []
    num_type = Val{info.helpers.numbers.num_type}()
    num_type_bool = Val{Bool}()

    print_info(getObservation, @__FILE__, @__LINE__, "getting observation variables. Units given in optimization settings are not strictly enforced but shown for reference. Bounds are applied after unit conversion...")
    map(varnames) do k

        vinfo = getproperty(observation_data_settings.observations.variables, k)
        print_info(nothing, @__FILE__, @__LINE__, "constraint: `$k`", n_m=4)

        src_var = vinfo.data.source_variable
        nc = nc_default
        nc, yax, vinfo_data, bounds_data = getAllConstraintData(nc, data_backend, data_path, default_info, vinfo, :data, info)

        # get quality flags data and use it later to mask observations. Set to value of 1 when :qflag field is not given for a data stream or all are turned off by setting optimizatio.optimization.observations.use_quality_flag to false
        nc_qc, yax_qc, vinfo_qc, bounds_qc = getAllConstraintData(nc, data_backend, data_path, default_info, vinfo, :qflag, info; yax=yax, use_data_sub=observation_data_settings.observations.use_quality_flag)

        # get uncertainty data and add to observations. Set to value of 1 when :unc field is not given for a data stream or all are turned off by setting observation_data_settings.use_uncertainty to false
        nc_unc, yax_unc, vinfo_unc, bounds_unc = getAllConstraintData(nc, data_backend, data_path, default_info, vinfo, :unc, info; yax=yax, use_data_sub=observation_data_settings.observations.use_uncertainty)

        nc_wgt, yax_wgt, vinfo_wgt, bounds_wgt = getAllConstraintData(nc, data_backend, data_path, default_info, vinfo, :weight, info; yax=yax, use_data_sub=observation_data_settings.observations.use_spatial_weight)

        _, yax_mask_v, vinfo_mask, bounds_mask = getAllConstraintData(nc, data_backend, data_path, default_info, vinfo, :sel_mask, info; yax=yax)
        yax_mask_v = positive_mask(yax_mask_v)
        if !isnothing(yax_mask)
            yax_mask_v .= yax_mask .* yax_mask_v
        end
        print_info(nothing, @__FILE__, @__LINE__, "harmonize/subset...", n_m=6)
        @debug "      qflag"
        cyax_qc = subsetAndProcessYax(yax_qc, yax_mask_v, tar_dims, vinfo_qc, info, num_type; clean_data=false)
        @debug "      data"
        cyax = subsetAndProcessYax(yax, yax_mask, tar_dims, vinfo_data, info, num_type; fill_nan=true, yax_qc=cyax_qc, bounds_qc=bounds_qc)
        @debug "      unc"
        cyax_unc = subsetAndProcessYax(yax_unc, yax_mask, tar_dims, vinfo_unc, info, num_type; fill_nan=true, yax_qc=cyax_qc, bounds_qc=bounds_qc)
        @debug "      weight"
        cyax_wgt = subsetAndProcessYax(yax_wgt, yax_mask, tar_dims, vinfo_wgt, info, num_type; fill_nan=true, yax_qc=cyax_qc, bounds_qc=bounds_qc)
        @debug "      mask"
        yax_mask_v = subsetAndProcessYax(yax_mask_v, yax_mask_v, tar_dims, vinfo_mask, info, num_type_bool; clean_data=false)

        push!(obscubes, cyax)
        push!(obscubes, cyax_unc)
        push!(obscubes, yax_mask_v)
        push!(obscubes, cyax_wgt)
    end
    print_info(getObservation, @__FILE__, @__LINE__, "getting observation helpers...", n_m=2)
    @debug "getObservation: getting observation dimensions..."
    indims = getDataDims.(obscubes, Ref(forcing_data_settings.data_dimension.space))
    @debug "getObservation: getting number of time steps..."
    nts = forcing_helpers.sizes
    @debug "getObservation: getting variable name..."
    varnames_all = []
    for v  varnames
        push!(varnames_all, v)
        push!(varnames_all, Symbol(string(v) * "_σ"))
        push!(varnames_all, Symbol(string(v) * "_mask"))
        push!(varnames_all, Symbol(string(v) * "_weight"))
    end
    input_array_type = getfield(Types, to_uppercase_first(exe_rules_settings.input_array_type, "Input"))()
    print_info_separator()

    return (; data=getInputArrayOfType(obscubes, input_array_type), dims=indims, variables=Tuple(varnames_all))
end

getSpatialSubset

Sindbad.DataLoaders.getSpatialSubset Function
julia
getSpatialSubset(ss, v)

Extracts a spatial subset of data based on specified spatial subsetting type/strategy.

Arguments

  • ss: Spatial subset parameters or geometry defining the region of interest

  • v: Data to be spatially subset

Returns

Spatially subset data according to the specified parameters

Note

The function assumes input data and spatial parameters are in compatible formats

Examples

julia
julia> using Sindbad

julia> # Get spatial subset from configuration
julia> # subset_data = getSpatialSubset(spatial_subset_config, data_cube)
Code
julia
function getSpatialSubset(ss, v)
    if isa(ss, Dict)
        ss = dict_to_namedtuple(ss)
    end
    if !isnothing(ss)
        ssname = propertynames(ss)
        for ssn  ssname
            ss_r = getproperty(ss, ssn)
            if !isnothing(ss_r)
                ss_range = collect(ss_r)
                ss_typeName = Symbol("Space" * string(ssn))
                v = spatialSubset(v, ss_range, getfield(Types, ss_typeName)())
            end
        end
    end
    return v
end

mapCleanData

Sindbad.DataLoaders.mapCleanData Function

Maps and cleans data based on quality control parameters and fills missing values.

Arguments

  • _data: Raw input data to be cleaned

  • _data_qc: Quality control data corresponding to input data

  • _data_fill: Fill values for replacing invalid/missing data

  • bounds_qc: Quality control bounds/thresholds

  • _data_info: Additional information about the data

  • ::Val{T}: Value type parameter for dispatch

Returns

Cleaned and mapped data with invalid values replaced according to QC criteria

Note

This function performs quality control checks and data cleaning based on the provided bounds and fill values. The exact behavior depends on the value type T.

Code
julia
function mapCleanData(_data, _data_qc, _data_fill, bounds_qc, _data_info, ::Val{T}) where {T}
    if !isnothing(bounds_qc) && !isnothing(_data_qc)
        _data = map((da, dq) -> applyQCBound(da, dq, bounds_qc, _data_fill), _data, _data_qc)
    end
    vT = Val{T}()
    _data = map(data_point -> cleanData(data_point, _data_fill, _data_info, vT), _data)
    return _data
end

subsetAndProcessYax

Sindbad.DataLoaders.subsetAndProcessYax Function
julia
subsetAndProcessYax(yax, forcing_mask, tar_dims, _data_info, info, ::Val{num_type}; clean_data=true, fill_nan=false, yax_qc=nothing, bounds_qc=nothing) where {num_type}

Subset and process YAX data according to specified parameters and quality control criteria.

Arguments

  • yax: YAX data to be processed

  • forcing_mask: Mask to apply to the data

  • tar_dims: Target dimensions

  • _data_info: Data information

  • info: a SINDBAD NT that includes all information needed for setup and execution of an experiment

  • ::Val{num_type}: Value type parameter for numerical type specification

  • clean_data=true: Boolean flag to enable/disable data cleaning

  • fill_nan=false: Boolean flag to control NaN filling

  • yax_qc=nothing: Optional quality control parameters for YAX data

  • bounds_qc=nothing: Optional boundary quality control parameters

Returns

Processed and subset YAX data according to specified parameters and quality controls.

Type Parameters

  • num_type: Numerical type specification for the processed data
Code
julia
function subsetAndProcessYax(yax, forcing_mask, tar_dims, _data_info, info, ::Val{num_type}; clean_data=true, fill_nan=false, yax_qc=nothing, bounds_qc=nothing) where {num_type}
    forcing_data_settings = info.experiment.data_settings.forcing
    if !isnothing(forcing_mask)
        yax = yax #todo: mask the forcing variables here depending on the mask of 1 and 0
    end

    if !isnothing(tar_dims)
        permutes = getDimPermutation(YAXArrayBase.dimnames(yax), tar_dims)
        @debug "         -> permuting dimensions to $(tar_dims)..."
        yax = permutedims(yax, permutes)
    end
    if hasproperty(yax, Symbol(forcing_data_settings.data_dimension.time))
        init_date = DateTime(info.helpers.dates.date_begin)
        last_date = DateTime(info.helpers.dates.date_end)
        yax = yax[time=(init_date .. last_date)]
    end

    if hasproperty(forcing_data_settings, :subset)
        yax = getSpatialSubset(forcing_data_settings.subset, yax)
    end

    #todo mean of the data instead of zero or nan
    vfill = 0.0
    if fill_nan
        vfill = NaN
    end
    vNT = Val{num_type}()
    if clean_data
        yax = mapCleanData(yax, yax_qc, vfill, bounds_qc, _data_info, vNT)
    else
        yax = map(yax_point -> replace_invalid_number(yax_point, vfill), yax)
        # yax = num_type.(yax)
    end
    return yax
end

toDimStackArray

Sindbad.DataLoaders.toDimStackArray Function

Convert a stacked array into a DimensionalArray with specified dimensions and metadata.

Arguments

  • stackArr: The input stacked array to be converted

  • time_interval: Time interval information for temporal dimension

  • p_names: Names of pools/variables

  • name: Optional keyword argument to specify the name of the dimension (default: :pools)

Returns

A DimensionalArray with proper dimensions and labels.

This function is useful for converting raw stacked arrays into properly dimensioned arrays with metadata, particularly for time series data with multiple pools or variables.

Code
julia
function toDimStackArray(stackArr, time_interval, p_names; name=:pools)
    return DimArray(stackArr,  (p_names=p_names, Ti=time_interval,); name=name,)
end

yaxCubeToKeyedArray

Sindbad.DataLoaders.yaxCubeToKeyedArray Function
julia
yaxCubeToKeyedArray(c)

Convert a YAXArray cube to a KeyedArray.

Arguments

  • c: YAXArray input cube to be converted

Returns

KeyedArray representation of the input YAXArray cube

Description

Transforms a YAXArray data cube into a KeyedArray format, preserving the dimensional structure and associated metadata of the original cube.

Code
julia
function yaxCubeToKeyedArray(c)
    t_dims = getSindbadDims(c);
    return KeyedArray(Array(c.data); t_dims...)
end

Types

AllNaN

Sindbad.DataLoaders.AllNaN Type
julia
AllNaN <: YAXArrays.DAT.ProcFilter

Specialized filter for YAXArrays to skip pixels with all NaN or missing values.

Description

This struct is used as a specialized filter in data processing pipelines to identify or handle cases where all values in a data segment are NaN (Not a Number).