Application Programming Interface#

The core of Arcana’s framework is located under the arcana.core sub-package, which contains all the domain-independent logic. Domain-specific extensions for alternative data stores, dimensions and formats should be placed in arcana.data.stores, arcana.data.spaces and arcana.data.formats respectively.

Warning

Under construction

Data Model#

Core#

class arcana.core.data.store.DataStore[source]#
class arcana.core.data.set.Dataset(id, store: DataStore, hierarchy: List[DataSpace], space: Optional[DataSpace] = None, id_inference=NOTHING, include=NOTHING, exclude=NOTHING, name: str = 'default', columns=NOTHING, pipelines=NOTHING)[source]#

A representation of a “dataset”, the complete collection of data (file-sets and fields) to be used in an analysis.

Parameters
  • id (str) – The dataset id/path that uniquely identifies the dataset within the store in which it is stored (e.g. FS directory path or project ID)

  • store (Repository) – The store in which the dataset is stored. Can be the local file system, by providing a FileSystem repo.

  • hierarchy (Sequence[str]) –

    The data frequencies that are explicitly present in the data tree. For example, if a FileSystem dataset (i.e. directory) has a two-layer hierarchy of sub-directories, with the first layer labelled by unique subject ID and the second by study time-point, then the hierarchy would be

    ['subject', 'timepoint']

    Alternatively, in some stores (e.g. XNAT) the second layer in the hierarchy may be named with session ID that is unique across the project, in which case the layer dimensions would instead be

    ['subject', 'session']

    In such cases, if there are multiple timepoints, the timepoint ID of the session will need to be extracted using the id_inference argument.

    Alternatively, the hierarchy could be organised such that the tree first splits on longitudinal time-points, then a second directory layer labelled by member ID, with the final layer containing sessions of matched members labelled by their groups (e.g. test & control):

    ['timepoint', 'member', 'group']

    Note that the combination of layers in the hierarchy must span the space defined in the DataSpace enum, i.e. the “bitwise or” of the layer values of the hierarchy must be 1 across all bits (e.g. 'session': 0b111).

  • space (DataSpace) – The space of the dataset. See https://arcana.readthedocs.io/en/latest/data_model.html#spaces for a description

  • id_inference (list[tuple[DataSpace, str]]) –

    Not all IDs will appear explicitly within the hierarchy of the data tree, and some will need to be inferred by extracting components of more specific labels.

    For example, given a set of subject IDs that are a combination of the ID of the group they belong to and their member ID within that group (i.e. matched test & control subjects have the same member ID)

    CONTROL01, CONTROL02, CONTROL03, … and TEST01, TEST02, TEST03

    the group ID can be extracted by providing a list of tuples, each containing the ID to source the inferred IDs from, coupled with a regular expression with named groups

    id_inference=[('subject', r'(?P<group>[A-Z]+)(?P<member>[0-9]+)')]

  • include (list[tuple[DataSpace, str or list[str]]]) – The IDs to be included in the dataset per row_frequency. E.g. can be used to limit the subject IDs in a project to the sub-set that passed QC. If a row_frequency is omitted or its value is None, then all available IDs will be used

  • exclude (list[tuple[DataSpace, str or list[str]]]) – The IDs to be excluded from the dataset per row_frequency. E.g. can be used to exclude specific subjects that failed QC. If a row_frequency is omitted or its value is None, then no IDs will be excluded for it

  • name (str) – The name under which the dataset is saved in the store

  • columns (list[tuple[str, DataSource or DataSink]]) – The sources and sinks to be initially added to the dataset (columns are explicitly added when workflows are applied to the dataset).

  • workflows (Dict[str, pydra.Workflow]) – Workflows that have been applied to the dataset to generate sink data

  • access_args (ty.Dict[str, Any]) – Repository-specific args used to control the way the dataset is accessed
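The id_inference mechanism above reduces to matching a regular expression with named groups against the IDs of an existing layer. A minimal stand-in using only the standard library, with the CONTROL/TEST subject IDs from the example (the helper name infer_ids is hypothetical, not part of the Arcana API):

```python
import re

# The named-group regex as it would be passed to id_inference=[("subject", ...)]
ID_INFERENCE_REGEX = r"(?P<group>[A-Z]+)(?P<member>[0-9]+)"

def infer_ids(subject_id: str) -> dict:
    """Extract the 'group' and 'member' IDs from a composite subject ID."""
    match = re.match(ID_INFERENCE_REGEX, subject_id)
    if match is None:
        raise ValueError(f"Subject ID {subject_id!r} does not match pattern")
    return match.groupdict()

print(infer_ids("CONTROL01"))  # {'group': 'CONTROL', 'member': '01'}
```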

add_sink(name, format, path=None, row_frequency=None, overwrite=False, **kwargs)[source]#

Specify a data sink in the dataset, which can then be referenced when connecting workflow outputs.

Parameters
  • name (str) – The name used to reference the dataset “column” for the sink

  • format (type) – The file-format (for file-groups) or format (for fields) in which the sink will be stored within the dataset

  • path (str, default name) – The location of the sink within the dataset

  • row_frequency (DataSpace, default self.leaf_freq) – The row_frequency of the sink within the dataset

  • overwrite (bool) – Whether to overwrite an existing sink

add_source(name, format, path=None, row_frequency=None, overwrite=False, **kwargs)[source]#

Specify a data source in the dataset, which can then be referenced when connecting workflow inputs.

Parameters
  • name (str) – The name used to reference the dataset “column” for the source

  • format (type) – The file-format (for file-groups) or format (for fields) in which the source is stored within the dataset

  • path (str, default name) – The location of the source within the dataset

  • row_frequency (DataSpace, default self.leaf_freq) – The row_frequency of the source within the dataset

  • overwrite (bool) – Whether to overwrite existing columns

  • **kwargs (ty.Dict[str, Any]) – Additional kwargs to pass to DataSource.__init__
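Putting the calls above together, a dataset over a subject/timepoint directory tree might be defined and given one source and one sink column as follows. This is a hedged sketch only: the Clinical space and the Dicom/NiftiGz format classes are assumed to live in the medimage extension packages, and the column names and regex are illustrative; check all names against your installed version.

```python
from arcana.core.data.set import Dataset
from arcana.data.stores.common import FileSystem
from arcana.data.spaces.medimage import Clinical    # assumed space enum
from arcana.data.formats.medimage import Dicom, NiftiGz  # assumed formats

# Define the dataset over a two-layer subject/timepoint hierarchy
dataset = Dataset(
    id="/data/my-project",
    store=FileSystem(),
    space=Clinical,
    hierarchy=["subject", "timepoint"],
)

# Source column: select the T1-weighted scan in each session by regex
dataset.add_source(
    name="t1w",
    format=Dicom,
    path=r".*mprage.*",
    is_regex=True,
)

# Sink column: where converted NIfTI derivatives will be stored
dataset.add_sink(name="t1w_nifti", format=NiftiGz)
```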

class arcana.core.data.space.DataSpace(value)[source]#

Base class for all “data space” enums. DataSpace enums specify the relationships between rows of a dataset.

For example, in imaging studies, scanning sessions are typically organised by analysis group (e.g. test & control), membership within the group (i.e. matched subjects) and time-points (for longitudinal studies). We can visualise the rows arranged in a 3-D grid along the group, member, and timepoint dimensions. Note that datasets that only contain one group or time-point can still be represented in the same space; they are just of depth=1 along those dimensions.

All dimensions should be included as members of a DataSpace subclass enum with orthogonal binary vector values, e.g.

member = 0b001
group = 0b010
timepoint = 0b100

In this space, an imaging session row is uniquely defined by its member, group and timepoint IDs. The most commonly present dimension should be given the least significant bit (e.g. imaging datasets will not always have different groups or time-points, but will always have different members, which are equivalent to subjects when there is only one group).

In addition to the data items stored in the data rows for each session, some items only vary along a particular dimension of the grid. The “row_frequency” of these items can be specified using the “basis” members (i.e. member, group, timepoint), in contrast to the session row_frequency, which is the combination of all three

session = 0b111

Additionally, some data is stored in rows that aggregate across a plane of the grid. These frequencies should also be added to the enum (all combinations of the basis frequencies must be included) and given intuitive names if possible, e.g.

subject = 0b011 - uniquely identified subject within the dataset
batch = 0b110 - separate group + timepoint combinations
matchedpoint = 0b101 - matched members and time-points aggregated across groups

Finally, for items that are singular across the whole dataset there should also be a dataset-wide member with value=0:

dataset = 0b000
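The bitwise scheme above can be checked with a plain Enum. A real space would subclass arcana.core.data.space.DataSpace rather than enum.Enum, but the member values follow exactly the pattern described:

```python
from enum import Enum

class ExampleSpace(Enum):  # a real space subclasses DataSpace
    # basis dimensions (orthogonal bits)
    member = 0b001
    group = 0b010
    timepoint = 0b100
    # planes (combinations of basis bits)
    subject = 0b011       # group + member
    matchedpoint = 0b101  # member + timepoint
    batch = 0b110         # group + timepoint
    # extremes
    session = 0b111  # unique per member/group/timepoint combination
    dataset = 0b000  # singular across the whole dataset

# The basis members must span the space: their bitwise-or covers every bit
assert (ExampleSpace.member.value
        | ExampleSpace.group.value
        | ExampleSpace.timepoint.value) == ExampleSpace.session.value
```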

class arcana.core.data.row.DataRow(ids: Dict[DataSpace, str], frequency: DataSpace, dataset: Dataset, children: DefaultDict[DataSpace, Dict[Union[str, Tuple[str]], str]] = NOTHING, unresolved=None)[source]#

A “row” in a dataset “frame” where file-groups and fields can be placed, e.g. a session or subject.

Parameters
  • ids (Dict[DataSpace, str]) – The ids for the frequency of the row and all “parent” frequencies within the tree

  • frequency (DataSpace) – The frequency of the row

  • dataset (Dataset) – A reference to the root of the data tree

class arcana.core.data.column.DataSource(name: str, path: str, format, row_frequency: DataSpace, dataset=None, quality_threshold=None, order=None, header_vals: Optional[Dict[str, Any]] = None, is_regex=False)[source]#

Specifies the criteria by which an item is selected from a data row to be a data source.

Parameters
  • path (str) – A regex name_path to match the file_group names with. Must match one and only one file_group per <row_frequency>. If None, the name is used instead.

  • format (type) – The file format that the data will be retrieved in

  • row_frequency (DataSpace) – The row_frequency of the file-group within the dataset tree, e.g. per ‘session’, ‘subject’, ‘timepoint’, ‘group’, ‘dataset’

  • quality_threshold (DataQuality) – The acceptable quality (or above) that should be considered. Data items below this threshold will be considered missing

  • order (int | None) – To be used to distinguish multiple file_groups that match the name_path in the same session. The order of the file_group within the session (0-indexed). Based on the scan ID but is more robust to small changes to the IDs within the session if for example there are two scans of the same type taken before and after a task.

  • header_vals (Dict[str, str]) – To be used to distinguish multiple items that match the other criteria. The provided dictionary contains header values that must match the stored header_vals exactly.

  • is_regex (bool) – Flags whether the name_path is a regular expression or not

class arcana.core.data.column.DataSink(name: str, path: str, format, row_frequency: DataSpace, dataset=None, salience=ColumnSalience.supplementary, pipeline_name: Optional[str] = None)[source]#

A specification for a file group within an analysis, to be derived from a processing pipeline.

Parameters
  • path (str) – The path to the relative location the corresponding data items will be stored within the rows of the data tree.

  • format (type) – The file format or data type used to store the corresponding items in the store dataset.

  • row_frequency (DataSpace) – The row_frequency of the file-group within the dataset tree, e.g. per ‘session’, ‘subject’, ‘timepoint’, ‘group’, ‘dataset’

  • salience (Salience) – The salience of the specified file-group, i.e. whether it would typically be of interest for publication outputs, is just a temporary file in a workflow, or lies at a stage in between

  • pipeline_name (str) – The name of the workflow applied to the dataset to generate the data for the sink

class arcana.core.data.format.DataItem(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None)[source]#

A representation of a data item (file-group or field) within the dataset.

Parameters
  • name_path (str) – The name_path to the relative location of the file group, i.e. excluding information about which row in the data tree it belongs to

  • order (int | None) – The order in which the file-group appears in the row it belongs to (starting at 0). Typically corresponds to the acquisition order for scans within an imaging session. Can be used to distinguish between scans with the same series description (e.g. multiple BOLD or T1w scans) in the same imaging sessions.

  • quality (str) – The quality label assigned to the file_group (e.g. as is saved on XNAT)

  • row (DataRow) – The data row within a dataset that the file-group belongs to

  • exists (bool) – Whether the file_group exists or is just a placeholder for a sink

  • provenance (Provenance | None) – The provenance for the pipeline that generated the file-group, if applicable

abstract get(assume_exists=False)[source]#

Pulls data from the store (if remote) and caches locally

Parameters

assume_exists (bool) – If set, the check of whether the item exists is skipped (used to pull data after a successful workflow run)

abstract put(value)[source]#

Updates the value of the item in the store to the provided value, pushing remotely if necessary.

Parameters

value (ty.Any) – The value to update

class arcana.core.data.format.FileGroup(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None, fs_path=None)[source]#

A representation of a file_group within the dataset.

Parameters
  • name_path (str) – The name_path to the relative location of the file group, i.e. excluding information about which row in the data tree it belongs to

  • order (int | None) – The order in which the file-group appears in the row it belongs to (starting at 0). Typically corresponds to the acquisition order for scans within an imaging session. Can be used to distinguish between scans with the same series description (e.g. multiple BOLD or T1w scans) in the same imaging sessions.

  • quality (str) – The quality label assigned to the file_group (e.g. as is saved on XNAT)

  • row (DataRow) – The data row within a dataset that the file-group belongs to

  • exists (bool) – Whether the file_group exists or is just a placeholder for a sink

  • provenance (Provenance | None) – The provenance for the pipeline that generated the file-group, if applicable

  • fs_path (str | None) – Path to the primary file or directory on the local file system

  • side_cars (ty.Dict[str, str] | None) – Additional files in the file_group. Keys should match corresponding side_cars dictionary in format.

  • checksums (ty.Dict[str, str]) – Checksums of all files within the file_group, in a dictionary sorted by relative file name_paths

class arcana.core.data.format.Field(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None, value=None)[source]#

A representation of a value field in the dataset.

Parameters
  • name_path (str) – The name_path to the relative location of the field, i.e. excluding information about which row in the data tree it belongs to

  • derived (bool) – Whether or not the value belongs to the derived session

  • row (DataRow) – The data row that the field belongs to

  • exists (bool) – Whether the field exists or is just a placeholder for a sink

  • provenance (Provenance | None) – The provenance for the pipeline that generated the field, if applicable

class arcana.core.data.format.BaseFile(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None, fs_path=None)[source]#
class arcana.core.data.format.BaseDirectory(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None, fs_path=None)[source]#
class arcana.core.data.format.WithSideCars(path: str, uri: Optional[str] = None, order: Optional[int] = None, quality: DataQuality = DataQuality.usable, exists: bool = True, provenance: Optional[Dict[str, Any]] = None, row=None, fs_path=None, side_cars=NOTHING)[source]#

Base class for file-groups with a primary file and several header or side car files

Stores#

class arcana.data.stores.common.FileSystem[source]#

A Repository class for data stored hierarchically within sub-directories of a file-system directory. The depth of the tree and which layers the sub-directories correspond to are defined by the hierarchy argument.

Parameters

base_dir (str) – Path to the base directory of the “store”, i.e. datasets are arranged by name as sub-directories of the base dir.

class arcana.data.stores.medimage.Xnat(server: str, cache_dir, user: Optional[str] = None, password: Optional[str] = None, check_md5: bool = True, race_condition_delay: int = 30)[source]#

Access class for XNAT data repositories

Parameters
  • server (str (URI)) – URI of XNAT server to connect to

  • project_id (str) – The ID of the project in the XNAT repository

  • cache_dir (str (name_path)) – Path to local directory to cache remote data in

  • user (str) – Username with which to connect to XNAT

  • password (str) – Password to connect to the XNAT repository with

  • check_md5 (bool) – Whether to check the MD5 digest of cached files before using. This checks for updates on the server since the file was cached

  • race_condition_delay (int) – The amount of time to wait, when another process is attempting to download the same file_group, before checking whether its download to the cache has completed
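A connection sketch using the documented signature. The server URL, credentials and cache path are placeholders, and hard-coding a password is for illustration only (a real deployment would read it from the environment or a credentials store):

```python
from arcana.data.stores.medimage import Xnat

store = Xnat(
    server="https://xnat.example.org",  # placeholder URL
    cache_dir="/tmp/xnat-cache",
    user="myuser",
    password="mypassword",
    check_md5=True,            # verify cached files against server digests
    race_condition_delay=30,   # seconds to wait on concurrent downloads
)
```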

class arcana.data.stores.medimage.XnatViaCS(check_md5: bool = True, race_condition_delay: int = 30, row_frequency: DataSpace = Clinical.session, row_id: Optional[str] = None, input_mount=PosixPath('/input'), output_mount=PosixPath('/output'), server: str = NOTHING, user: str = NOTHING, password: str = NOTHING, cache_dir=PosixPath('/cache'))[source]#

Access class for XNAT repositories via the XNAT container service plugin. The container service exposes the underlying file system, allowing imaging data to be accessed directly (for performance) and outputs to be written back to the store.

Parameters
  • server (str (URI)) – URI of XNAT server to connect to

  • project_id (str) – The ID of the project in the XNAT repository

  • cache_dir (str (name_path)) – Path to local directory to cache remote data in

  • user (str) – Username with which to connect to XNAT

  • password (str) – Password to connect to the XNAT repository with

  • check_md5 (bool) – Whether to check the MD5 digest of cached files before using. This checks for updates on the server since the file was cached

  • race_cond_delay (int) – The amount of time to wait, when another process is attempting to download the same file_group, before checking whether its download to the cache has completed

Processing#

class arcana.core.pipeline.Pipeline(name: str, row_frequency: DataSpace, workflow: Workflow, inputs, outputs, converter_args=NOTHING, dataset: Optional[Dataset] = None)[source]#

A thin wrapper around a Pydra workflow to link it to sources and sinks within a dataset

Parameters
  • row_frequency (DataSpace, optional) – The row_frequency of the pipeline, i.e. the row_frequency of the derivatives within the dataset, e.g. per-session, per-subject, etc., by default None

  • workflow (Workflow) – The pydra workflow that performs the actual analysis

  • inputs (Sequence[ty.Union[str, ty.Tuple[str, type]]]) – List of column names (i.e. either data sources or sinks) to be connected to the inputs of the pipeline. If the pipeline requires an input in a different format from that of the source column, it can be specified in a tuple (NAME, FORMAT)

  • outputs (Sequence[ty.Union[str, ty.Tuple[str, type]]]) – List of sink names to be connected to the outputs of the pipeline. If an output is produced in a specific format, it can be provided in a tuple (NAME, FORMAT)

  • converter_args (dict[str, dict]) – keyword arguments passed on to the converter to control how the conversion is performed.

  • dataset (Dataset) – the dataset the pipeline has been applied to

Enums#

class arcana.core.enum.ColumnSalience(value)[source]#

An enum that holds the salience level options that can be used when specifying data columns. Salience is used to indicate whether it is best to store the data in the data store, or whether it can just be stored on the local file-system and discarded after it has been used. This choice is ultimately made by the user, by defining a salience threshold for a store.

The salience is also used when providing information on what sinks are available to avoid cluttering help menus

primary = (100, 'Primary input data, typically reconstructed by the instrument that collects them')#
raw = (90, "Raw data from the scanner that haven't been reconstructed and are only typically used in advanced analyses")#
publication = (80, 'Results that would typically be used as main outputs in publications')#
supplementary = (60, 'Derivatives that would typically only be provided in supplementary material')#
qa = (40, 'Derivatives that would typically be only kept for quality assurance of analysis workflows')#
debug = (20, 'Derivatives that would typically only need to be checked when debugging analysis workflows')#
temp = (0, 'Data only temporarily stored to pass between pipelines, e.g. that operate on different row frequencies')#
classmethod default()[source]#
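Because each member's value carries a numeric level, a store's salience threshold reduces to a simple comparison. A stand-in sketch mirroring the documented levels (a real check would use arcana.core.enum.ColumnSalience itself, and the keep_in_store helper is hypothetical):

```python
from enum import Enum

class Salience(Enum):  # levels mirror ColumnSalience above
    primary = 100
    raw = 90
    publication = 80
    supplementary = 60
    qa = 40
    debug = 20
    temp = 0

def keep_in_store(salience: Salience, threshold: Salience) -> bool:
    """Store the derivative only if its salience meets the store's threshold."""
    return salience.value >= threshold.value

print(keep_in_store(Salience.publication, Salience.supplementary))  # True
print(keep_in_store(Salience.debug, Salience.supplementary))        # False
```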
class arcana.core.enum.ParameterSalience(value)[source]#

An enum that holds the salience level options that can be used when specifying class parameters. Salience is used to indicate whether a parameter should show up by default when listing the available parameters of an Analysis class in a menu.

debug = (0, 'typically only needed to be altered for debugging')#
recommended = (20, 'recommended to keep defaults')#
dependent = (40, 'best value can be dependent on the context of the analysis, but the default should work for most cases')#
check = (60, 'default value should be checked for validity for particular use case')#
arbitrary = (80, 'a default is provided, but it is not clear which value is best')#
required = (100, 'No sensible default value, should be provided')#
classmethod default()[source]#
class arcana.core.enum.DataQuality(value)[source]#

The quality of a data item. Can be manually specified or set by automatic quality control methods

usable = 100#
noisy = 75#
questionable = 50#
artefactual = 25#
unusable = 0#
classmethod default()[source]#