Sindbad.DataLoaders Module
DataLoadersThe DataLoaders module provides tools for ingesting and preprocessing SINDBAD input data. It supports reading, cleaning, masking, and managing forcing/observation data with an emphasis on spatial and temporal dimensions.
Purpose
Streamline the ingestion and preprocessing of input data for SINDBAD experiments.
Dependencies
Related (SINDBAD ecosystem)
OmniTools: Utility functions for handling NamedTuples, printing, and shared helpers.
External (third-party)
AxisKeys: Labeled multidimensional arrays (KeyedArray).DimensionalData: Dimension-aware indexing/slicing.FillArrays: Efficient filled-array representations.NCDatasets: NetCDF reader/writer.YAXArrays,YAXArrayBase: Multidimensional array/cube abstractions used for IO and spatial data.Zarr: Chunked/compressed array storage for large datasets.
Internal (within Sindbad)
Sindbad.SetupSindbad.TypesSindbadTEM
Included Files
utilsDataLoaders.jl: Utility functions for data preprocessing (cleaning, masking, bounds checks).spatialSubset.jl: Spatial operations (extracting subsets based on spatial dimensions).getForcing.jl: Extracting and processing forcing data (environmental drivers).getObservation.jl: Reading and processing observational data for evaluation/validation.
Notes
The module uses
NCDatasets,YAXArrays, andZarrdirectly; it does not re-export them.Designed to handle large datasets efficiently, leveraging chunked and compressed data formats like NetCDF and Zarr.
Ensures compatibility with SINDBAD's experimental framework by integrating spatial and temporal data management tools.
Functions
getForcing
Sindbad.DataLoaders.getForcing Function
getForcing(info::NamedTuple)Reads forcing data from the data_path specified in the experiment configuration and returns a NamedTuple with the forcing data.
Arguments:
info: A SINDBAD NamedTuple containing all information needed for setup and execution of an experiment.
Returns:
- A NamedTuple
forcingcontaining:data: The processed input cubes.dims: The dimensions of the forcing data.variables: The names of the forcing variables.f_types: The types of the forcing data (e.g.,ForcingWithTimeorForcingWithoutTime).helpers: Helper information for the forcing data.
Examples
julia> using Sindbad
julia> # Load forcing data from experiment configuration
julia> # forcing = getForcing(info)Notes:
Reads forcing data from the specified data path and processes it using the SINDBAD framework.
Handles spatiotemporal and spatial-only forcing data.
Applies masks and subsets to the forcing data if specified in the configuration.
Code
function getForcing(info::NamedTuple)
nc_default = nothing
forcing_data_settings = info.experiment.data_settings.forcing
# forcing_data_settings = info.experiment.data_settings.forcing
data_path = forcing_data_settings.default_forcing.data_path
if !isnothing(data_path)
data_path = getAbsDataPath(info, data_path)
print_info(getForcing, @__FILE__, @__LINE__, "default_data_path: `$(data_path)`")
nc_default = loadDataFile(data_path)
end
data_backend = getfield(Types, to_uppercase_first(info.helpers.run.input_data_backend, "Backend"))()
forcing_mask = nothing
if :sel_mask ∈ keys(forcing_data_settings)
if !isnothing(forcing_data_settings.forcing_mask.data_path)
mask_path = getAbsDataPath(info, forcing_data_settings.forcing_mask.data_path)
_, forcing_mask = getYaxFromSource(nothing, mask_path, nothing, forcing_data_settings.forcing_mask.source_variable, info, data_backend)
forcing_mask = positive_mask(forcing_mask)
end
end
default_info = info.experiment.data_settings.forcing.default_forcing
forcing_vars = keys(forcing_data_settings.variables)
tar_dims = getTargetDimensionOrder(info)
print_info(getForcing, @__FILE__, @__LINE__, "getting forcing variables. Units given in forcing settings are not strictly enforced but shown for reference. Bounds are applied after unit conversion...", n_m=1)
vinfo = nothing
f_sizes = nothing
f_dimension = nothing
num_type = Val{info.helpers.numbers.num_type}()
incubes = map(forcing_vars) do k
nc = nc_default
vinfo = merge_namedtuple_prefer_nonempty(default_info, forcing_data_settings.variables[k])
data_path_v = getAbsDataPath(info, getfield(vinfo, :data_path))
nc, yax = getYaxFromSource(nc, data_path, data_path_v, vinfo.source_variable, info, data_backend)
incube = subsetAndProcessYax(yax, forcing_mask, tar_dims, vinfo, info, num_type)
v_op = vinfo.additive_unit_conversion ? " + " : " * "
v_op = v_op * "$(vinfo.source_to_sindbad_unit)"
v_string = "`$(k)` ($(vinfo.sindbad_unit), $(vinfo.bounds)) = <$(vinfo.space_time_type)> `$(vinfo.source_variable)` ($(vinfo.source_unit)) $(v_op)"
print_info(nothing, @__FILE__, @__LINE__, v_string, n_m=4)
if vinfo.space_time_type == "spatiotemporal" && isnothing(f_sizes)
f_sizes = collectForcingSizes(info, incube)
f_dimension = getSindbadDims(incube)
end
incube
end
return createForcingNamedTuple(incubes, f_sizes, f_dimension, info)
endgetNumberOfTimeSteps
Sindbad.DataLoaders.getNumberOfTimeSteps Function
getNumberOfTimeSteps(incubes, time_name)Returns the number of time steps in the input data cubes.
Arguments
incubes: Input data cubes containing temporal informationtime_name: Name of the time dimension/variable
Returns
Integer representing the total number of time steps in the data
Code
function getNumberOfTimeSteps(incubes, time_name)
i1 = findfirst(c -> YAXArrays.Axes.findAxis(time_name, c) !== nothing, incubes)
return length(getAxis(time_name, incubes[i1]).values)
endgetObservation
Sindbad.DataLoaders.getObservation Function
getObservation(info::NamedTuple, forcing_helpers::NamedTuple)Processes observation data and returns a NamedTuple containing the observation data, dimensions, and variables.
Arguments:
info: A SINDBAD NamedTuple containing all information needed for setup and execution of an experiment.forcing_helpers: A SINDBAD NamedTuple containing helper information for forcing data.
Returns:
- A NamedTuple with the following fields:
data: The processed observation data as an input array.dims: The dimensions of the observation data.variables: A tuple of variable names for the observation data.
Notes:
- Reads observation data from the path specified in the experiment configuration.
Examples
julia> using Sindbad
julia> # Load observation data from experiment configuration
julia> # observations = getObservation(info, forcing_helpers)Handles quality flags, uncertainty, spatial weights, and selection masks for each observation variable.
Subsets and harmonizes the observation data based on the target dimensions and masks.
Code
function getObservation(info::NamedTuple, forcing_helpers::NamedTuple)
observation_data_settings = info.experiment.data_settings.optimization
forcing_data_settings = info.experiment.data_settings.forcing
exe_rules_settings = info.experiment.exe_rules
data_path = observation_data_settings.observations.default_observation.data_path
data_backend = getfield(Types, to_uppercase_first(exe_rules_settings.input_data_backend, "Backend"))()
default_info = observation_data_settings.observations.default_observation
tar_dims = getTargetDimensionOrder(info)
nc_default = nothing
if !isnothing(data_path)
data_path = getAbsDataPath(info, data_path)
print_info(getObservation, @__FILE__, @__LINE__, "default_observation_data_path: `$(data_path)`")
nc_default = loadDataFile(data_path)
end
varnames = Symbol.(observation_data_settings.observational_constraints)
yax_mask = nothing
if :one_sel_mask ∈ keys(observation_data_settings)
if !isnothing(observation_data_settings.one_sel_mask)
mask_path = getAbsDataPath(info, observation_data_settings.one_sel_mask)
_, yax_mask = getYaxFromSource(nothing, mask_path, nothing, "mask", info, data_backend)
yax_mask = positive_mask(yax_mask)
end
end
obscubes = []
num_type = Val{info.helpers.numbers.num_type}()
num_type_bool = Val{Bool}()
print_info(getObservation, @__FILE__, @__LINE__, "getting observation variables. Units given in optimization settings are not strictly enforced but shown for reference. Bounds are applied after unit conversion...")
map(varnames) do k
vinfo = getproperty(observation_data_settings.observations.variables, k)
print_info(nothing, @__FILE__, @__LINE__, "constraint: `$k`", n_m=4)
src_var = vinfo.data.source_variable
nc = nc_default
nc, yax, vinfo_data, bounds_data = getAllConstraintData(nc, data_backend, data_path, default_info, vinfo, :data, info)
# get quality flags data and use it later to mask observations. Set to value of 1 when :qflag field is not given for a data stream or all are turned off by setting optimizatio.optimization.observations.use_quality_flag to false
nc_qc, yax_qc, vinfo_qc, bounds_qc = getAllConstraintData(nc, data_backend, data_path, default_info, vinfo, :qflag, info; yax=yax, use_data_sub=observation_data_settings.observations.use_quality_flag)
# get uncertainty data and add to observations. Set to value of 1 when :unc field is not given for a data stream or all are turned off by setting observation_data_settings.use_uncertainty to false
nc_unc, yax_unc, vinfo_unc, bounds_unc = getAllConstraintData(nc, data_backend, data_path, default_info, vinfo, :unc, info; yax=yax, use_data_sub=observation_data_settings.observations.use_uncertainty)
nc_wgt, yax_wgt, vinfo_wgt, bounds_wgt = getAllConstraintData(nc, data_backend, data_path, default_info, vinfo, :weight, info; yax=yax, use_data_sub=observation_data_settings.observations.use_spatial_weight)
_, yax_mask_v, vinfo_mask, bounds_mask = getAllConstraintData(nc, data_backend, data_path, default_info, vinfo, :sel_mask, info; yax=yax)
yax_mask_v = positive_mask(yax_mask_v)
if !isnothing(yax_mask)
yax_mask_v .= yax_mask .* yax_mask_v
end
print_info(nothing, @__FILE__, @__LINE__, "harmonize/subset...", n_m=6)
@debug " qflag"
cyax_qc = subsetAndProcessYax(yax_qc, yax_mask_v, tar_dims, vinfo_qc, info, num_type; clean_data=false)
@debug " data"
cyax = subsetAndProcessYax(yax, yax_mask, tar_dims, vinfo_data, info, num_type; fill_nan=true, yax_qc=cyax_qc, bounds_qc=bounds_qc)
@debug " unc"
cyax_unc = subsetAndProcessYax(yax_unc, yax_mask, tar_dims, vinfo_unc, info, num_type; fill_nan=true, yax_qc=cyax_qc, bounds_qc=bounds_qc)
@debug " weight"
cyax_wgt = subsetAndProcessYax(yax_wgt, yax_mask, tar_dims, vinfo_wgt, info, num_type; fill_nan=true, yax_qc=cyax_qc, bounds_qc=bounds_qc)
@debug " mask"
yax_mask_v = subsetAndProcessYax(yax_mask_v, yax_mask_v, tar_dims, vinfo_mask, info, num_type_bool; clean_data=false)
push!(obscubes, cyax)
push!(obscubes, cyax_unc)
push!(obscubes, yax_mask_v)
push!(obscubes, cyax_wgt)
end
print_info(getObservation, @__FILE__, @__LINE__, "getting observation helpers...", n_m=2)
@debug "getObservation: getting observation dimensions..."
indims = getDataDims.(obscubes, Ref(forcing_data_settings.data_dimension.space))
@debug "getObservation: getting number of time steps..."
nts = forcing_helpers.sizes
@debug "getObservation: getting variable name..."
varnames_all = []
for v ∈ varnames
push!(varnames_all, v)
push!(varnames_all, Symbol(string(v) * "_σ"))
push!(varnames_all, Symbol(string(v) * "_mask"))
push!(varnames_all, Symbol(string(v) * "_weight"))
end
input_array_type = getfield(Types, to_uppercase_first(exe_rules_settings.input_array_type, "Input"))()
print_info_separator()
return (; data=getInputArrayOfType(obscubes, input_array_type), dims=indims, variables=Tuple(varnames_all))
endgetSpatialSubset
Sindbad.DataLoaders.getSpatialSubset Function
getSpatialSubset(ss, v)Extracts a spatial subset of data based on specified spatial subsetting type/strategy.
Arguments
ss: Spatial subset parameters or geometry defining the region of interestv: Data to be spatially subset
Returns
Spatially subset data according to the specified parameters
Note
The function assumes input data and spatial parameters are in compatible formats
Examples
julia> using Sindbad
julia> # Get spatial subset from configuration
julia> # subset_data = getSpatialSubset(spatial_subset_config, data_cube)Code
function getSpatialSubset(ss, v)
if isa(ss, Dict)
ss = dict_to_namedtuple(ss)
end
if !isnothing(ss)
ssname = propertynames(ss)
for ssn ∈ ssname
ss_r = getproperty(ss, ssn)
if !isnothing(ss_r)
ss_range = collect(ss_r)
ss_typeName = Symbol("Space" * string(ssn))
v = spatialSubset(v, ss_range, getfield(Types, ss_typeName)())
end
end
end
return v
endmapCleanData
Sindbad.DataLoaders.mapCleanData Function
Maps and cleans data based on quality control parameters and fills missing values.
Arguments
_data: Raw input data to be cleaned_data_qc: Quality control data corresponding to input data_data_fill: Fill values for replacing invalid/missing databounds_qc: Quality control bounds/thresholds_data_info: Additional information about the data::Val{T}: Value type parameter for dispatch
Returns
Cleaned and mapped data with invalid values replaced according to QC criteria
Note
This function performs quality control checks and data cleaning based on the provided bounds and fill values. The exact behavior depends on the value type T.
Code
function mapCleanData(_data, _data_qc, _data_fill, bounds_qc, _data_info, ::Val{T}) where {T}
if !isnothing(bounds_qc) && !isnothing(_data_qc)
_data = map((da, dq) -> applyQCBound(da, dq, bounds_qc, _data_fill), _data, _data_qc)
end
vT = Val{T}()
_data = map(data_point -> cleanData(data_point, _data_fill, _data_info, vT), _data)
return _data
endsubsetAndProcessYax
Sindbad.DataLoaders.subsetAndProcessYax Function
subsetAndProcessYax(yax, forcing_mask, tar_dims, _data_info, info, ::Val{num_type}; clean_data=true, fill_nan=false, yax_qc=nothing, bounds_qc=nothing) where {num_type}Subset and process YAX data according to specified parameters and quality control criteria.
Arguments
yax: YAX data to be processedforcing_mask: Mask to apply to the datatar_dims: Target dimensions_data_info: Data informationinfo: a SINDBAD NT that includes all information needed for setup and execution of an experiment::Val{num_type}: Value type parameter for numerical type specificationclean_data=true: Boolean flag to enable/disable data cleaningfill_nan=false: Boolean flag to control NaN fillingyax_qc=nothing: Optional quality control parameters for YAX databounds_qc=nothing: Optional boundary quality control parameters
Returns
Processed and subset YAX data according to specified parameters and quality controls.
Type Parameters
num_type: Numerical type specification for the processed data
Code
function subsetAndProcessYax(yax, forcing_mask, tar_dims, _data_info, info, ::Val{num_type}; clean_data=true, fill_nan=false, yax_qc=nothing, bounds_qc=nothing) where {num_type}
forcing_data_settings = info.experiment.data_settings.forcing
if !isnothing(forcing_mask)
yax = yax #todo: mask the forcing variables here depending on the mask of 1 and 0
end
if !isnothing(tar_dims)
permutes = getDimPermutation(YAXArrayBase.dimnames(yax), tar_dims)
@debug " -> permuting dimensions to $(tar_dims)..."
yax = permutedims(yax, permutes)
end
if hasproperty(yax, Symbol(forcing_data_settings.data_dimension.time))
init_date = DateTime(info.helpers.dates.date_begin)
last_date = DateTime(info.helpers.dates.date_end)
yax = yax[time=(init_date .. last_date)]
end
if hasproperty(forcing_data_settings, :subset)
yax = getSpatialSubset(forcing_data_settings.subset, yax)
end
#todo mean of the data instead of zero or nan
vfill = 0.0
if fill_nan
vfill = NaN
end
vNT = Val{num_type}()
if clean_data
yax = mapCleanData(yax, yax_qc, vfill, bounds_qc, _data_info, vNT)
else
yax = map(yax_point -> replace_invalid_number(yax_point, vfill), yax)
# yax = num_type.(yax)
end
return yax
endtoDimStackArray
Sindbad.DataLoaders.toDimStackArray Function
Convert a stacked array into a DimensionalArray with specified dimensions and metadata.
Arguments
stackArr: The input stacked array to be convertedtime_interval: Time interval information for temporal dimensionp_names: Names of pools/variablesname: Optional keyword argument to specify the name of the dimension (default: :pools)
Returns
A DimensionalArray with proper dimensions and labels.
This function is useful for converting raw stacked arrays into properly dimensioned arrays with metadata, particularly for time series data with multiple pools or variables.
Code
function toDimStackArray(stackArr, time_interval, p_names; name=:pools)
return DimArray(stackArr, (p_names=p_names, Ti=time_interval,); name=name,)
endyaxCubeToKeyedArray
Sindbad.DataLoaders.yaxCubeToKeyedArray Function
yaxCubeToKeyedArray(c)Convert a YAXArray cube to a KeyedArray.
Arguments
c: YAXArray input cube to be converted
Returns
KeyedArray representation of the input YAXArray cube
Description
Transforms a YAXArray data cube into a KeyedArray format, preserving the dimensional structure and associated metadata of the original cube.
Code
function yaxCubeToKeyedArray(c)
t_dims = getSindbadDims(c);
return KeyedArray(Array(c.data); t_dims...)
endTypes
AllNaN
Sindbad.DataLoaders.AllNaN Type
AllNaN <: YAXArrays.DAT.ProcFilterSpecialized filter for YAXArrays to skip pixels with all NaN or missing values.
Description
This struct is used as a specialized filter in data processing pipelines to identify or handle cases where all values in a data segment are NaN (Not a Number).