Application Programming Interface#

The core of Arcana’s framework is located under the arcana.core sub-package, which contains all the domain-independent logic. Domain-specific extensions for alternative data stores, dimensions and formats should be placed in arcana.data.stores, arcana.data.spaces and arcana.data.formats respectively.

Warning

Under construction

Data Model#

Core#

class arcana.core.data.store.DataStore[source]#
class arcana.core.data.set.Dataset(id, store: DataStore, hierarchy: List[DataSpace], space: Optional[DataSpace] = None, id_inference=NOTHING, include=NOTHING, exclude=NOTHING, name: str = 'default', columns=NOTHING, pipelines=NOTHING)[source]#

A representation of a “dataset”, the complete collection of data (file-sets and fields) to be used in an analysis.

Parameters
  • id (str) – The dataset id/path that uniquely identifies the dataset within the store in which it is stored (e.g. FS directory path or project ID)

  • store (Repository) – The store in which the dataset is stored. Can be the local file system, by providing a FileSystem repo.

  • hierarchy (Sequence[str]) –

    The data frequencies that are explicitly present in the data tree. For example, if a FileSystem dataset (i.e. directory) has a two-layer hierarchy of sub-directories, with the first layer labelled by unique subject ID and the second by study time-point, then the hierarchy would be

    ['subject', 'timepoint']

    Alternatively, in some stores (e.g. XNAT) the second layer in the hierarchy may be named with session ID that is unique across the project, in which case the layer dimensions would instead be

    ['subject', 'session']

    In such cases, if there are multiple timepoints, the timepoint ID of the session will need to be extracted using the id_inference argument.

    Alternatively, the hierarchy could be organised such that the tree first splits on longitudinal time-points, then a second directory layer labelled by member ID, with the final layer containing sessions of matched members labelled by their groups (e.g. test & control):

    ['timepoint', 'member', 'group']

    Note that the combination of layers in the hierarchy must span the space defined in the DataSpace enum, i.e. the “bitwise or” of the layer values of the hierarchy must be 1 across all bits (e.g. 'session': 0b111).

  • space (DataSpace) – The space of the dataset. See https://arcana.readthedocs.io/en/latest/data_model.html#spaces for a description

  • id_inference (list[tuple[DataSpace, str]]) –

    Not all IDs will appear explicitly within the hierarchy of the data tree, and some will need to be inferred by extracting components of more specific labels.

    For example, given a set of subject IDs that are a combination of the ID of the group they belong to and their member ID within that group (i.e. matched test & control subjects have the same member ID)

    CONTROL01, CONTROL02, CONTROL03, … and TEST01, TEST02, TEST03

    the group ID can be extracted by providing a list of tuples, each containing the ID to source the inferred IDs from, coupled with a regular expression with named groups

    id_inference=[('subject', r'(?P<group>[A-Z]+)(?P<member>[0-9]+)')]

  • include (list[tuple[DataSpace, str or list[str]]]) – The IDs to be included in the dataset per row_frequency. E.g. can be used to limit the subject IDs in a project to the sub-set that passed QC. If a row_frequency is omitted or its value is None, then all available IDs will be used

  • exclude (list[tuple[DataSpace, str or list[str]]]) – The IDs to be excluded from the dataset per row_frequency. E.g. can be used to exclude specific subjects that failed QC. If a row_frequency is omitted or its value is None, then no IDs will be excluded for it

  • name (str) – The name under which the dataset is saved in the store

  • columns (list[tuple[str, DataSource or DataSink]]) – The sources and sinks to be initially added to the dataset (columns are explicitly added when workflows are applied to the dataset).

  • workflows (Dict[str, pydra.Workflow]) – Workflows that have been applied to the dataset to generate sink data

  • access_args (ty.Dict[str, Any]) – Repository-specific args used to control the way the dataset is accessed
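The id_inference mechanism above reduces to matching a regular expression with named groups against the IDs of an existing layer. A minimal stand-in using only the standard library, with the CONTROL/TEST subject IDs from the example (the helper name infer_ids is hypothetical, not part of the Arcana API):

```python
import re

# The named-group regex as it would be passed to id_inference=[("subject", ...)]
ID_INFERENCE_REGEX = r"(?P<group>[A-Z]+)(?P<member>[0-9]+)"

def infer_ids(subject_id: str) -> dict:
    """Extract the 'group' and 'member' IDs from a composite subject ID."""
    match = re.match(ID_INFERENCE_REGEX, subject_id)
    if match is None:
        raise ValueError(f"Subject ID {subject_id!r} does not match pattern")
    return match.groupdict()

print(infer_ids("CONTROL01"))  # {'group': 'CONTROL', 'member': '01'}
```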

add_sink(name, format, path=None, row_frequency=None, overwrite=False, **kwargs)[source]#

Specify a data sink in the dataset, which can then be referenced when connecting workflow outputs.

Parameters
  • name (str) – The name used to reference the dataset “column” for the sink

  • format (type) – The file-format (for file-groups) or format (for fields) in which the sink will be stored within the dataset

  • path (str, default name) – The location of the sink within the dataset

  • row_frequency (DataSpace, default self.leaf_freq) – The row_frequency of the sink within the dataset

  • overwrite (bool) – Whether to overwrite an existing sink

add_source(name, format, path=None, row_frequency=None, overwrite=False, **kwargs)[source]#

Specify a data source in the dataset, which can then be referenced when connecting workflow inputs.

Parameters
  • name (str) – The name used to reference the dataset “column” for the source

  • format (type) – The file-format (for file-groups) or format (for fields) in which the source is stored within the dataset

  • path (str, default name) – The location of the source within the dataset

  • row_frequency (DataSpace, default self.leaf_freq) – The row_frequency of the source within the dataset

  • overwrite (bool) – Whether to overwrite existing columns

  • **kwargs (ty.Dict[str, Any]) – Additional kwargs to pass to DataSource.__init__
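Putting the calls above together, a dataset over a subject/timepoint directory tree might be defined and given one source and one sink column as follows. This is a hedged sketch only: the Clinical space and the Dicom/NiftiGz format classes are assumed to live in the medimage extension packages, and the column names and regex are illustrative; check all names against your installed version.

```python
from arcana.core.data.set import Dataset
from arcana.data.stores.common import FileSystem
from arcana.data.spaces.medimage import Clinical    # assumed space enum
from arcana.data.formats.medimage import Dicom, NiftiGz  # assumed formats

# Define the dataset over a two-layer subject/timepoint hierarchy
dataset = Dataset(
    id="/data/my-project",
    store=FileSystem(),
    space=Clinical,
    hierarchy=["subject", "timepoint"],
)

# Source column: select the T1-weighted scan in each session by regex
dataset.add_source(
    name="t1w",
    format=Dicom,
    path=r".*mprage.*",
    is_regex=True,
)

# Sink column: where converted NIfTI derivatives will be stored
dataset.add_sink(name="t1w_nifti", format=NiftiGz)
```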

class arcana.core.data.space.DataSpace(value)[source]#

Base class for all “data space” enums. DataSpace enums specify the relationships between rows of a dataset.

For example, in imaging studies, scanning sessions are typically organised by analysis group (e.g. test & control), membership within the group (i.e. matched subjects) and time-points (for longitudinal studies). We can visualise the rows arranged in a 3-D grid along the group, member, and timepoint dimensions. Note that datasets that only contain one group or time-point can still be represented in the same space; they are just of depth=1 along those dimensions.

All dimensions should be included as members of a DataSpace subclass enum with orthogonal binary vector values, e.g.

member = 0b001
group = 0b010
timepoint = 0b100

In this space, an imaging session row is uniquely defined by its member, group and timepoint IDs. The most commonly present dimension should be given the least significant bit (e.g. imaging datasets will not always have different groups or time-points, but will always have different members, which are equivalent to subjects when there is only one group).

In addition to the data items stored in the data rows for each session, some items only vary along a particular dimension of the grid. The “row_frequency” of these items can be specified using the “basis” members (i.e. member, group, timepoint), in contrast to the session row_frequency, which is the combination of all three

session = 0b111

Additionally, some data is stored in rows that aggregate across a plane of the grid. These frequencies should also be added to the enum (all combinations of the basis frequencies must be included) and given intuitive names if possible, e.g.

subject = 0b011 - uniquely identified subject within the dataset
batch = 0b110 - separate group + timepoint combinations
matchedpoint = 0b101 - matched members and time-points aggregated across groups

Finally, for items that are singular across the whole dataset there should also be a dataset-wide member with value=0:

dataset = 0b000
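The bitwise scheme above can be checked with a plain Enum. A real space would subclass arcana.core.data.space.DataSpace rather than enum.Enum, but the member values follow exactly the pattern described:

```python
from enum import Enum

class ExampleSpace(Enum):  # a real space subclasses DataSpace
    # basis dimensions (orthogonal bits)
    member = 0b001
    group = 0b010
    timepoint = 0b100
    # planes (combinations of basis bits)
    subject = 0b011       # group + member
    matchedpoint = 0b101  # member + timepoint
    batch = 0b110         # group + timepoint
    # extremes
    session = 0b111  # unique per member/group/timepoint combination
    dataset = 0b000  # singular across the whole dataset

# The basis members must span the space: their bitwise-or covers every bit
assert (ExampleSpace.member.value
        | ExampleSpace.group.value
        | ExampleSpace.timepoint.value) == ExampleSpace.session.value
```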

class arcana.core.data.row.DataRow(ids: Dict[DataSpace, str], frequency: DataSpace, dataset: Dataset, children: DefaultDict[DataSpace, Dict[Union[str, Tuple[str]], str]] = NOTHING, unresolved=None)[source]#

A “row” in a dataset “frame” where file-groups and fields can be placed, e.g. a session or subject.

Parameters
  • ids (Dict[DataSpace, str]) – The ids for the frequency of the row and all “parent” frequencies within the tree

  • frequency (DataSpace) – The frequency of the row

  • dataset (Dataset) – A reference to the root of the data tree

class arcana.core.data.column.DataSource(name: str, path: str, format, row_frequency: DataSpace, dataset=None, quality_threshold=None, order=None, header_vals: Optional[Dict[str, Any]] = None, is_regex=False)[source]#

Specifies the criteria by which an item is selected from a data row to be a data source.

Parameters
  • path (str) – A regex name_path to match the file_group names with. Must match one and only one file_group per <row_frequency>. If None, the name is used instead.

  • format (type) – The file format that the data will be retrieved in

  • row_frequency (DataSpace) – The row_frequency of the file-group within the dataset tree, e.g. per ‘session’, ‘subject’, ‘timepoint’, ‘group’, ‘dataset’

  • quality_threshold (DataQuality) – The acceptable quality (or above) that should be considered. Data items below this threshold will be considered missing

  • order (int | None) – To be used to distinguish multiple file_groups that match the name_path in the same session. The order of the file_group within the session (0-indexed). Based on the scan ID but is more robust to small changes to the IDs within the session if for example there are two scans of the same type taken before and after a task.

  • header_vals (Dict[str, str]) – To be used to distinguish multiple items that match the other criteria. The provided dictionary contains header values that must match the stored header_vals exactly.

  • is_regex (bool) – Flags whether the name_path is a regular expression or not

class arcana.core.data.column.DataSink(name: str, path: str, format, row_frequency: DataSpace, dataset=None, salience=ColumnSalience.supplementary, pipeline_name: Optional[str] = None)[source]#

A specification for a file group within an analysis, to be derived from a processing pipeline.

Parameters
  • path (str) – The path to the relative location the corresponding data items will be stored within the rows of the data tree.

  • format (type) – The file format or data type used to store the corresponding items in the store dataset.

  • row_frequency (DataSpace) – The row_frequency of the file-group within the dataset tree, e.g. per ‘session’, ‘subject’, ‘timepoint’, ‘group’, ‘dataset’

  • salience (Salience) – The salience of the specified file-group, i.e. whether it would typically be of interest for publication outputs, is just a temporary file in a workflow, or lies at a stage in between

  • pipeline_name (str) – The name of the workflow applied to the dataset to generate the data for the sink

class arcana.core.data.format.DataItem(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None)[source]#

A representation of a data item (file-group or field) within the dataset.

Parameters
  • name_path (str) – The name_path to the relative location of the file group, i.e. excluding information about which row in the data tree it belongs to

  • order (int | None) – The order in which the file-group appears in the row it belongs to (starting at 0). Typically corresponds to the acquisition order for scans within an imaging session. Can be used to distinguish between scans with the same series description (e.g. multiple BOLD or T1w scans) in the same imaging sessions.

  • quality (str) – The quality label assigned to the file_group (e.g. as is saved on XNAT)

  • row (DataRow) – The data row within a dataset that the file-group belongs to

  • exists (bool) – Whether the file_group exists or is just a placeholder for a sink

  • provenance (Provenance | None) – The provenance for the pipeline that generated the file-group, if applicable

abstract get(assume_exists=False)[source]#

Pulls data from the store (if remote) and caches locally

Parameters

assume_exists (bool) – If set, the check of whether the item exists is skipped (used to pull data after a successful workflow run)

abstract put(value)[source]#

Updates the value of the item in the store to the provided value, pushing remotely if necessary.

Parameters

value (ty.Any) – The value to update

class arcana.core.data.format.FileGroup(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None, fs_path=None)[source]#

A representation of a file_group within the dataset.

Parameters
  • name_path (str) – The name_path to the relative location of the file group, i.e. excluding information about which row in the data tree it belongs to

  • order (int | None) – The order in which the file-group appears in the row it belongs to (starting at 0). Typically corresponds to the acquisition order for scans within an imaging session. Can be used to distinguish between scans with the same series description (e.g. multiple BOLD or T1w scans) in the same imaging sessions.

  • quality (str) – The quality label assigned to the file_group (e.g. as is saved on XNAT)

  • row (DataRow) – The data row within a dataset that the file-group belongs to

  • exists (bool) – Whether the file_group exists or is just a placeholder for a sink

  • provenance (Provenance | None) – The provenance for the pipeline that generated the file-group, if applicable

  • fs_path (str | None) – Path to the primary file or directory on the local file system

  • side_cars (ty.Dict[str, str] | None) – Additional files in the file_group. Keys should match corresponding side_cars dictionary in format.

  • checksums (ty.Dict[str, str]) – Checksums of all files within the file_group, in a dictionary sorted by relative file name_paths

class arcana.core.data.format.Field(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None, value=None)[source]#

A representation of a value field in the dataset.

Parameters
  • name_path (str) – The name_path to the relative location of the field, i.e. excluding information about which row in the data tree it belongs to

  • derived (bool) – Whether or not the value belongs to the derived session

  • row (DataRow) – The data row that the field belongs to

  • exists (bool) – Whether the field exists or is just a placeholder for a sink

  • provenance (Provenance | None) – The provenance for the pipeline that generated the field, if applicable

class arcana.core.data.format.BaseFile(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None, fs_path=None)[source]#
class arcana.core.data.format.BaseDirectory(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None, fs_path=None)[source]#
class arcana.core.data.format.WithSideCars(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None, fs_path=None, side_cars=NOTHING)[source]#

Base class for file-groups with a primary file and several header or side car files

Stores#

class arcana.data.stores.common.FileSystem[source]#

A Repository class for data stored hierarchically within sub-directories of a file-system directory. The depth of the tree and which layers the sub-directories correspond to are defined by the hierarchy argument.

Parameters

base_dir (str) – Path to the base directory of the “store”, i.e. datasets are arranged by name as sub-directories of the base dir.

class arcana.data.stores.medimage.Xnat(server: str, cache_dir, user: Optional[str] = None, password: Optional[str] = None, check_md5: bool = True, race_condition_delay: int = 30)[source]#

Access class for XNAT data repositories

Parameters
  • server (str (URI)) – URI of XNAT server to connect to

  • project_id (str) – The ID of the project in the XNAT repository

  • cache_dir (str (name_path)) – Path to local directory to cache remote data in

  • user (str) – Username with which to connect to XNAT

  • password (str) – Password to connect to the XNAT repository with

  • check_md5 (bool) – Whether to check the MD5 digest of cached files before using. This checks for updates on the server since the file was cached

  • race_condition_delay (int) – The amount of time to wait, when another process is attempting to download the same file_group, before checking whether its download to the cache has completed
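A connection sketch using the documented signature. The server URL, credentials and cache path are placeholders, and hard-coding a password is for illustration only (a real deployment would read it from the environment or a credentials store):

```python
from arcana.data.stores.medimage import Xnat

store = Xnat(
    server="https://xnat.example.org",  # placeholder URL
    cache_dir="/tmp/xnat-cache",
    user="myuser",
    password="mypassword",
    check_md5=True,            # verify cached files against server digests
    race_condition_delay=30,   # seconds to wait on concurrent downloads
)
```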

class arcana.data.stores.medimage.XnatViaCS(check_md5: bool = True, race_condition_delay: int = 30, row_frequency: DataSpace = Clinical.session, row_id: Optional[str] = None, input_mount=PosixPath('/input'), output_mount=PosixPath('/output'), server: str = NOTHING, user: str = NOTHING, password: str = NOTHING, cache_dir=PosixPath('/cache'))[source]#

Access class for XNAT repositories via the XNAT container service plugin. The container service exposes the underlying file system, allowing imaging data to be accessed directly (for performance) and outputs to be written back to the store.

Parameters
  • server (str (URI)) – URI of XNAT server to connect to

  • project_id (str) – The ID of the project in the XNAT repository

  • cache_dir (str (name_path)) – Path to local directory to cache remote data in

  • user (str) – Username with which to connect to XNAT

  • password (str) – Password to connect to the XNAT repository with

  • check_md5 (bool) – Whether to check the MD5 digest of cached files before using. This checks for updates on the server since the file was cached

  • race_cond_delay (int) – The amount of time to wait, when another process is attempting to download the same file_group, before checking whether its download to the cache has completed

Processing#

class arcana.core.pipeline.Pipeline(name: str, row_frequency: DataSpace, workflow: Workflow, inputs, outputs, converter_args=NOTHING, dataset: Optional[Dataset] = None)[source]#

A thin wrapper around a Pydra workflow to link it to sources and sinks within a dataset

Parameters
  • row_frequency (DataSpace, optional) – The row_frequency of the pipeline, i.e. the row_frequency of the derivatives within the dataset, e.g. per-session, per-subject, etc., by default None

  • workflow (Workflow) – The pydra workflow that performs the actual analysis

  • inputs (Sequence[ty.Union[str, ty.Tuple[str, type]]]) – List of column names (i.e. either data sources or sinks) to be connected to the inputs of the pipeline. If the pipeline requires an input in a different format from that of the source column, it can be specified in a tuple (NAME, FORMAT)

  • outputs (Sequence[ty.Union[str, ty.Tuple[str, type]]]) – List of sink names to be connected to the outputs of the pipeline. If an output is produced in a specific format, it can be provided in a tuple (NAME, FORMAT)

  • converter_args (dict[str, dict]) – keyword arguments passed on to the converter to control how the conversion is performed.

  • dataset (Dataset) – the dataset the pipeline has been applied to

Enums#

class arcana.core.enum.ColumnSalience(value)[source]#

An enum that holds the salience level options that can be used when specifying data columns. Salience is used to indicate whether it is best to store the data in the data store, or whether it can just be stored on the local file-system and discarded after it has been used. This choice is ultimately made by the user, by defining a salience threshold for a store.

The salience is also used when providing information on what sinks are available to avoid cluttering help menus

primary = (100, 'Primary input data, typically reconstructed by the instrument that collects them')#
raw = (90, "Raw data from the scanner that haven't been reconstructed and are only typically used in advanced analyses")#
publication = (80, 'Results that would typically be used as main outputs in publications')#
supplementary = (60, 'Derivatives that would typically only be provided in supplementary material')#
qa = (40, 'Derivatives that would typically be only kept for quality assurance of analysis workflows')#
debug = (20, 'Derivatives that would typically only need to be checked when debugging analysis workflows')#
temp = (0, 'Data only temporarily stored to pass between pipelines, e.g. that operate on different row frequencies')#
classmethod default()[source]#
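Because each member's value carries a numeric level, a store's salience threshold reduces to a simple comparison. A stand-in sketch mirroring the documented levels (a real check would use arcana.core.enum.ColumnSalience itself, and the keep_in_store helper is hypothetical):

```python
from enum import Enum

class Salience(Enum):  # levels mirror ColumnSalience above
    primary = 100
    raw = 90
    publication = 80
    supplementary = 60
    qa = 40
    debug = 20
    temp = 0

def keep_in_store(salience: Salience, threshold: Salience) -> bool:
    """Store the derivative only if its salience meets the store's threshold."""
    return salience.value >= threshold.value

print(keep_in_store(Salience.publication, Salience.supplementary))  # True
print(keep_in_store(Salience.debug, Salience.supplementary))        # False
```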
class arcana.core.enum.ParameterSalience(value)[source]#

An enum that holds the salience level options that can be used when specifying class parameters. Salience is used to indicate whether a parameter should show up by default when listing the available parameters of an Analysis class in a menu.

debug = (0, 'typically only needed to be altered for debugging')#
recommended = (20, 'recommended to keep defaults')#
dependent = (40, 'best value can be dependent on the context of the analysis, but the default should work for most cases')#
check = (60, 'default value should be checked for validity for particular use case')#
arbitrary = (80, 'a default is provided, but it is not clear which value is best')#
required = (100, 'No sensible default value, should be provided')#
classmethod default()[source]#
class arcana.core.enum.DataQuality(value)[source]#

The quality of a data item. Can be manually specified or set by automatic quality control methods

usable = 100#
noisy = 75#
questionable = 50#
artefactual = 25#
unusable = 0#
classmethod default()[source]#