Application Programming Interface#
The core of Arcana’s framework is located under the arcana.core sub-package, which contains all the domain-independent logic. Domain-specific extensions for alternative data stores, dimensions and formats should be placed in arcana.data.stores, arcana.data.spaces and arcana.data.types respectively.
Warning
Under construction
Data Model#
Core#
- class arcana.core.data.set.Dataset(id, store: DataStore, hierarchy: List[DataSpace], space: Optional[DataSpace] = None, id_inference=_Nothing.NOTHING, include=_Nothing.NOTHING, exclude=_Nothing.NOTHING, name: str = 'default', columns=_Nothing.NOTHING, pipelines=_Nothing.NOTHING)[source]#
A representation of a “dataset”, the complete collection of data (file-sets and fields) to be used in an analysis.
- Parameters
id (str) – The dataset ID/path that uniquely identifies the dataset within the store it is stored in (e.g. file-system directory path or project ID)
store (Repository) – The store the dataset is stored in. This can be the local file system, by providing a SimpleStore repo.
hierarchy (Sequence[str]) –
The data frequencies that are explicitly present in the data tree. For example, if a SimpleStore dataset (i.e. a directory) has a two-layer hierarchy of sub-directories, with the first layer labelled by unique subject ID and the second layer labelled by study time-point, then the hierarchy would be
[‘subject’, ‘timepoint’]
Alternatively, in some stores (e.g. XNAT) the second layer in the hierarchy may be named with a session ID that is unique across the project, in which case the layer dimensions would instead be
[‘subject’, ‘session’]
In such cases, if there are multiple timepoints, the timepoint ID of the session will need to be extracted using the id_inference argument.
Alternatively, the hierarchy could be organised such that the tree first splits on longitudinal time-points, then on member ID in a second directory layer, with the final layer containing the sessions of matched members labelled by their group (e.g. test & control):
[‘timepoint’, ‘member’, ‘group’]
Note that the combination of layers in the hierarchy must span the space defined in the DataSpace enum, i.e. the “bitwise or” of the layer values of the hierarchy must be 1 across all bits (e.g. ‘session’: 0b111).
space (DataSpace) – The space of the dataset. See https://arcana.readthedocs.io/en/latest/data_model.html#spaces for a description
id_inference (list[tuple[DataSpace, str]]) –
Not all IDs will appear explicitly within the hierarchy of the data tree, and some will need to be inferred by extracting components of more specific labels.
For example, given a set of subject IDs that are a combination of the ID of the group they belong to and the member ID within that group (i.e. matched test & control subjects would have the same member ID)
CONTROL01, CONTROL02, CONTROL03, … and TEST01, TEST02, TEST03
the group ID can be extracted by providing a list of tuples, each containing the ID to source the inferred IDs from coupled with a regular expression with named groups:
id_inference=[('subject', r'(?P<group>[A-Z]+)(?P<member>[0-9]+)')]
include (list[tuple[DataSpace, str or list[str]]]) – The IDs to be included in the dataset per row_frequency. E.g. can be used to limit the subject IDs in a project to the sub-set that passed QC. If a row_frequency is omitted or its value is None, then all available IDs will be used
exclude (list[tuple[DataSpace, str or list[str]]]) – The IDs to be excluded from the dataset per row_frequency. E.g. can be used to exclude specific subjects that failed QC. If a row_frequency is omitted or its value is None, then no IDs will be excluded for that row_frequency
name (str) – The name that the dataset is saved under in the store
columns (list[tuple[str, DataSource or DataSink]]) – The sources and sinks to be initially added to the dataset (columns are explicitly added when workflows are applied to the dataset).
workflows (Dict[str, pydra.Workflow]) – Workflows that have been applied to the dataset to generate sink data
access_args (ty.Dict[str, Any]) – Repository-specific args used to control the way the dataset is accessed
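As an illustration, a dataset with a subject/session hierarchy whose group and member IDs are inferred from the subject labels could be constructed roughly as follows. This is a minimal sketch: the store and space objects, dataset path and IDs are placeholders for illustration rather than part of the documented API.
from arcana.core.data.set import Dataset

store = ...   # a DataStore instance, e.g. a file-system or XNAT store (assumed)
space = ...   # a DataSpace subclass enum; see the DataSpace sketch below

dataset = Dataset(
    id="/data/my_project",                # path/ID identifying the dataset within the store
    store=store,
    space=space,
    hierarchy=["subject", "session"],     # layers explicitly present in the data tree
    id_inference=[
        # infer group and member IDs from the subject label, as described above
        ("subject", r"(?P<group>[A-Z]+)(?P<member>[0-9]+)"),
    ],
    include=[("subject", ["CONTROL01", "CONTROL02", "TEST01", "TEST02"])],
)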
- add_sink(name, datatype, path=None, row_frequency=None, overwrite=False, **kwargs)[source]#
Specify a data sink in the dataset, which can then be referenced when connecting workflow outputs.
- Parameters
name (str) – The name used to reference the dataset “column” for the sink
datatype (type) – The file format (for file-groups) or datatype (for fields) in which the sink will be stored within the dataset
path (str, default name) – The location of the sink within the dataset
row_frequency (DataSpace, default self.leaf_freq) – The row_frequency of the sink within the dataset
overwrite (bool) – Whether to overwrite an existing sink
- add_source(name, datatype, path=None, row_frequency=None, overwrite=False, **kwargs)[source]#
Specify a data source in the dataset, which can then be referenced when connecting workflow inputs.
- Parameters
name (str) – The name used to reference the dataset “column” for the source
datatype (type) – The file format (for file-groups) or datatype (for fields) in which the source is stored within the dataset
path (str, default name) – The location of the source within the dataset
row_frequency (DataSpace, default self.leaf_freq) – The row_frequency of the source within the dataset
overwrite (bool) – Whether to overwrite existing columns
**kwargs (ty.Dict[str, Any]) – Additional kwargs to pass to DataSource.__init__
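Together, add_source and add_sink define the columns that workflows read from and write to. A hedged sketch, assuming the dataset constructed above and using illustrative column names, regexes and datatype placeholders:
# `dataset` is a Dataset instance, e.g. as constructed above
dataset.add_source(
    name="t1w",              # column name referenced when connecting workflow inputs
    datatype=...,            # a file-format class, e.g. from arcana.data.types (assumed)
    path=r".*t1w.*",         # pattern matched against item names within each row
    is_regex=True,           # extra kwargs are passed through to DataSource.__init__
)
dataset.add_sink(
    name="brain_mask",       # column name referenced when connecting workflow outputs
    datatype=...,            # format in which the derivative will be stored (assumed)
)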
- class arcana.core.data.space.DataSpace(value)[source]#
Base class for all “data space” enums. DataSpace enums specify the relationships between rows of a dataset.
For example, in imaging studies, scanning sessions are typically organised by analysis group (e.g. test & control), membership within the group (i.e. matched subjects) and time-points (for longitudinal studies). We can visualise the rows arranged in a 3-D grid along the group, member, and timepoint dimensions. Note that datasets that only contain one group or time-point can still be represented in the same space, and just be of depth=1 along those dimensions.
All dimensions should be included as members of a DataSpace subclass enum with orthogonal binary vector values, e.g.
member = 0b001
group = 0b010
timepoint = 0b100
In this space, an imaging session row is uniquely defined by its member, group and timepoint IDs. The most commonly present dimension should be given the least significant bit (e.g. imaging datasets will not always have different groups or time-points, but will always have different members, which are equivalent to subjects when there is only one group).
In addition to the data items stored in the data rows for each session, some items only vary along a particular dimension of the grid. The “row_frequency” of these rows can be specified using the “basis” members (i.e. member, group, timepoint) in contrast to the session row_frequency, which is the combination of all three
session = 0b111
Additionally, some data is stored in aggregated rows that span a plane of the grid. These frequencies should also be added to the enum (all combinations of the basis frequencies must be included) and given intuitive names if possible, e.g.
subject = 0b011 - uniquely identified subject within the dataset
batch = 0b110 - separate group + timepoint combinations
matchedpoint = 0b101 - matched members and time-points aggregated across groups
Finally, for items that are singular across the whole dataset there should also be a dataset-wide member with value=0:
dataset = 0b000
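Putting these conventions together, a complete data space could be defined along the following lines. This sketch simply collects the member values listed above into a single enum; the class name Clinical is illustrative.
from arcana.core.data.space import DataSpace

class Clinical(DataSpace):
    # singular across the whole dataset
    dataset = 0b000
    # basis dimensions (orthogonal binary vectors)
    member = 0b001
    group = 0b010
    timepoint = 0b100
    # aggregated planes of the grid
    subject = 0b011        # group + member
    matchedpoint = 0b101   # member + timepoint
    batch = 0b110          # group + timepoint
    # leaf rows spanning the full space
    session = 0b111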
- class arcana.core.data.row.DataRow(ids: ty.Dict[DataSpace, str], frequency: DataSpace, dataset: arcana.core.data.set.Dataset, children: ty.DefaultDict[DataSpace, ty.Dict[ty.Union[str, ty.Tuple[str]], str]] = _Nothing.NOTHING, unresolved=None)[source]#
A “row” in a dataset “frame” where file-groups and fields can be placed, e.g. a session or subject.
- class arcana.core.data.column.DataSource(name: str, path: str, datatype, row_frequency: DataSpace, dataset=None, quality_threshold=None, order=None, header_vals: Optional[Dict[str, Any]] = None, is_regex=False)[source]#
Specifies the criteria by which an item is selected from a data row to be a data source.
- Parameters
path (str) – A regex name_path to match the file_group names with. Must match one and only one file_group per <row_frequency>. If None, the name is used instead.
datatype (type) – The file format or datatype that the matched data will be in
row_frequency (DataSpace) – The row_frequency of the file-group within the dataset tree, e.g. per ‘session’, ‘subject’, ‘timepoint’, ‘group’, ‘dataset’
quality_threshold (DataQuality) – The acceptable quality (or above) that should be considered. Data items below this threshold will be considered missing
order (int | None) – To be used to distinguish multiple file_groups that match the name_path in the same session. The order of the file_group within the session (0-indexed). Based on the scan ID, but more robust to small changes to the IDs within the session if, for example, there are two scans of the same type taken before and after a task.
header_vals (Dict[str, str]) – To be used to distinguish multiple items that match the other criteria. The provided dictionary contains header values that must match the stored header_vals exactly.
is_regex (bool) – Flags whether the name_path is a regular expression or not
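Since Dataset.add_source forwards additional keyword arguments to DataSource.__init__, these criteria can be combined to disambiguate between multiple matching items. A sketch with illustrative names, regex and header values, assuming an existing dataset:
dataset.add_source(
    name="rest_bold",
    datatype=...,                          # a file-format class (assumed)
    path=r".*task-rest.*bold.*",           # regex matched against item names
    is_regex=True,
    order=0,                               # take the first matching item in the session
    header_vals={"SeriesDescription": "resting-state fMRI"},  # must match stored headers exactly
)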
- class arcana.core.data.column.DataSink(name: str, path: str, datatype, row_frequency: DataSpace, dataset=None, salience=ColumnSalience.supplementary, pipeline_name: Optional[str] = None)[source]#
A specification for a file-group within an analysis that is to be derived from a processing pipeline.
- Parameters
path (str) – The relative location where the corresponding data items will be stored within the rows of the data tree.
datatype (type) – The file format or datatype used to store the corresponding items in the dataset.
row_frequency (DataSpace) – The row_frequency of the file-group within the dataset tree, e.g. per ‘session’, ‘subject’, ‘timepoint’, ‘group’, ‘dataset’
salience (Salience) – The salience of the specified file-group, i.e. whether it would typically be of interest for publication outputs, is just a temporary file in a workflow, or somewhere in between
pipeline_name (str) – The name of the workflow applied to the dataset to generate the data for the sink
Stores#
- class arcana.bids.data.Bids(name: str = 'file', json_edits: list = _Nothing.NOTHING)[source]#
Repository for working with data stored on the file-system in BIDS format
- Parameters
json_edits (list[tuple[str, str]], optional) – Specifications to edit JSON files as they are written to the store to enable manual modification of fields to correct metadata. List of tuples of the form: FILE_PATH - path expression to select the files, EDIT_STR - jq filter used to modify the JSON document.
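For example, a BIDS store that corrects a metadata field in matching JSON side-car files as they are written could be set up as follows. This is a sketch assuming that the first element of each tuple is a path expression selecting files and the second is a jq filter, as described above; the particular expression and filter are illustrative only.
from arcana.bids.data import Bids

store = Bids(
    json_edits=[
        # (FILE_PATH expression to select files, jq filter applied to the JSON document)
        (r"func/.*task-rest.*", '.TaskName = "rest"'),
    ],
)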
Processing#
- class arcana.core.analysis.pipeline.Pipeline(name: str, row_frequency: DataSpace, workflow: Workflow, inputs, outputs, converter_args=_Nothing.NOTHING, dataset: Optional[Dataset] = None)[source]#
A thin wrapper around a Pydra workflow to link it to sources and sinks within a dataset
- Parameters
name (str) – the name of the pipeline, used to differentiate it from others
row_frequency (DataSpace, optional) – The row_frequency of the pipeline, i.e. the row_frequency of the derivatives within the dataset, e.g. per-session, per-subject, etc., by default None
workflow (Workflow) – The pydra workflow that performs the actual analysis
inputs (Sequence[ty.Union[str, ty.Tuple[str, type]]]) – List of column names (i.e. either data sources or sinks) to be connected to the inputs of the pipeline. If the pipeline requires the input to be in a different datatype from that of the source, then it can be specified in a tuple (NAME, FORMAT)
outputs (Sequence[ty.Union[str, ty.Tuple[str, type]]]) – List of sink names to be connected to the outputs of the pipeline. If the output is in a specific datatype, then it can be provided in a tuple (NAME, FORMAT)
converter_args (dict[str, dict]) – keyword arguments passed on to the converter to control how the conversion is performed.
dataset (Dataset) – the dataset the pipeline has been applied to
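A pipeline linking a Pydra workflow to dataset columns could then be assembled roughly as follows. The workflow, dataset, space member and datatype placeholders are assumptions for illustration; only the parameter names come from the signature above.
from arcana.core.analysis.pipeline import Pipeline

wf = ...        # a pydra.Workflow that performs the analysis (assumed)
dataset = ...   # the Dataset the pipeline will be applied to (assumed)

pipeline = Pipeline(
    name="brain_extraction",
    row_frequency=...,                  # e.g. the `session` member of the dataset's space
    workflow=wf,
    inputs=[("t1w", ...)],              # (NAME, FORMAT) tuple requests conversion if needed
    outputs=[("brain_mask", ...)],      # sink column(s) that receive the workflow outputs
    dataset=dataset,
)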