New formats and spaces#
Arcana was initially developed for medical-imaging analysis. Therefore, with
the notable exception of the generic data spaces and file formats defined in
arcana.dirtree.data, the
majority of file formats and data spaces are specific to medical imaging.
However, new formats and data spaces for other fields can be implemented as
required with just a few lines of code.
File formats#
File formats are defined by subclasses of the FileGroup
base class.
“File group” is a catch-all term that encompasses three sub-types, each with
their own FileGroup subclass:

File - single files
WithSideCars - files + side-car files (e.g. separate headers/JSON files)
Directory - single directories containing specific file types
New datatype classes should extend one of these classes or an existing file
datatype class (or both), as they include methods to interact with the data
store. Note that File is a base class of WithSideCars, so multiple inheritance
is possible where a datatype with side cars inherits from the same datatype
without side cars (e.g. Nifti -> NiftiX). In this case, ensure that
WithSideCars appears before the other class to be extended in the bases list,
e.g. NiftiX(WithSideCars, Nifti).
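The base-order rule can be illustrated with plain Python stand-ins (toy classes, not the real FileGroup hierarchy): listing WithSideCars first ensures its side-car-aware overrides win in Python's method resolution order.

```python
# Toy stand-ins for the real Arcana classes, illustrating the MRO rule.
class File:
    def describe(self):
        return "plain file"

class WithSideCars(File):
    def describe(self):
        # side-car-aware override must take precedence in subclasses
        return "file with side cars"

class Nifti(File):
    ext = 'nii'

class NiftiX(WithSideCars, Nifti):  # WithSideCars listed first
    pass

print(NiftiX().describe())  # resolves to the WithSideCars override
```

If the bases were reversed, `Nifti`'s plain-file behaviour would shadow the side-car handling.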
File subclasses typically only need to set an ext attribute
to the extension string used to identify the type of file, e.g.

from fileformats.common import File

class Json(File):
    ext = 'json'
Note

If the file datatype doesn’t have an identifiable extension, it is possible to
override File.set_fs_paths() to peek inside the contents of the
file to determine its type, but this shouldn’t be necessary in most cases.
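As a hypothetical sketch of the peek-inside approach (the helper name is illustrative, and the actual set_fs_paths() signature should be checked against the FileGroup API), a file type with a known magic number can be identified from its first bytes:

```python
# Hypothetical sketch: identify a gzip file by its magic bytes when no
# extension is available. A real implementation would perform this check
# inside an overridden File.set_fs_paths().
from pathlib import Path

GZIP_MAGIC = b'\x1f\x8b'

def looks_like_gzip(path: Path) -> bool:
    """Return True if the file starts with the gzip magic number."""
    with open(path, 'rb') as f:
        return f.read(2) == GZIP_MAGIC
```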
WithSideCars subclasses can set the ext and side_car_types
attributes. The side_car_types attribute is a tuple of the side-car
formats that are expected alongside the “primary file”:

from fileformats.common import File
from arcana.core.data.type.file import WithSideCars

class AnalyzeHeader(File):
    ext = 'hdr'

class Analyze(WithSideCars):
    ext = 'img'
    side_car_types = (AnalyzeHeader,)
Note
When using a file + side-cars datatype in a workflow, the side-car files can
be assumed to have the same name-stem as the primary file, just with different
extensions (e.g. /path/to/data/myfile.nii.gz and /path/to/data/myfile.json).
Also, when setting paths, if side-car paths are not explicitly provided they
will be assumed to have the same name-stem.
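The name-stem convention can be sketched as a small helper (the function name is illustrative, not part of the Arcana API): the side-car path is derived by stripping the primary file's full extension and appending the side-car's.

```python
# Sketch of the name-stem convention: derive a side-car path from the
# primary file's path by swapping the (possibly multi-part) extension.
from pathlib import Path

def default_side_car_path(primary: Path, side_car_ext: str) -> Path:
    """Return the side-car path sharing the primary file's name-stem."""
    stem = primary.name
    # strip the full extension, handling multi-part ones like '.nii.gz'
    for suffix in reversed(primary.suffixes):
        stem = stem[:-len(suffix)]
    return primary.with_name(stem + '.' + side_car_ext)

print(default_side_car_path(Path('/path/to/data/myfile.nii.gz'), 'json'))
# /path/to/data/myfile.json
```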
Directory subclasses can define the content_types attribute,
a tuple of the file formats that are expected within the directory. The list
is not exclusive, so additional files inside the directory will not affect its
identification.
from fileformats.common import Directory, File

class DicomFile(File):
    ext = 'dcm'

class Dicom(Directory):
    content_types = (DicomFile,)
It is a good idea to make use of class inheritance when defining related
formats, to capture the relationship between them. For example, a datatype to
handle the Siemens variant of the DICOM format, which uses the ‘.IMA’
extension:

class SiemensDicomFile(DicomFile):
    ext = 'IMA'

class SiemensDicom(Dicom):
    content_types = (SiemensDicomFile,)
Defining hierarchical relationships between file formats is most useful when
defining implicit converters between file formats. This is done by adding
classmethods to the file-format class decorated by arcana.core.mark.converter().
The decorator specifies the format that the method converts from into the
current class, and the converter method adds the Pydra nodes that perform the
conversion.

The first argument of a converter method should be the fs_path, followed by
any side cars as keyword arguments. Converter methods should return the Pydra
task that performs the conversion, followed by a lazy field that points to the
fs_path of the converted file group. If the datatype to convert to has side
cars, then the method should return the task followed by a tuple consisting of
lazy fields that point to the fs_path and then the side-car files of the
converted file group, in the order they appear in side_car_types.
from pydra.engine.core import LazyField
from pydra.tasks.dcm2niix import Dcm2niix
from pydra.tasks.mrtrix3.utils import MRConvert
from arcana.core.mark import converter
from fileformats.common import File

class Nifti(File):
    ext = 'nii'

    @classmethod
    @converter(Dicom)
    def dcm2niix(cls, fs_path: LazyField):
        node = Dcm2niix(
            name='dcm2niix',
            in_file=fs_path,
            compress='n')
        return node, node.lzout.out_file

    @classmethod
    @converter(Analyze)
    def mrconvert(cls, fs_path: LazyField, hdr: LazyField):
        node = MRConvert(
            name='mrconvert',
            in_file=fs_path,
            out_file='out.' + cls.ext)
        return node, node.lzout.out_file
If the class to convert to is a WithSideCars subclass, then the return value
should be a tuple consisting of the primary path followed by the side-car
paths, in the same order they are defined in the class. To remove a converter
in a specialised subclass (one that the converter isn’t able to produce),
simply override the converter method with an arbitrary value such as None.
class NiftiX(WithSideCars, Nifti):

    ext = 'nii'
    side_car_types = (Json,)

    @classmethod
    @converter(Dicom)
    def dcm2niix(cls, fs_path: LazyField):
        node, out_file = super().dcm2niix(fs_path)
        return node, (out_file, node.lzout.out_json)

    # Only dcm2niix produces the required JSON files for NiftiX
    mrconvert = None
Use dummy base classes in order to avoid circular-reference issues when
defining two-way conversions between formats:
class ExampleFormat2Base(File):
    pass

class ExampleFormat1(File):
    ext = 'exm1'

    @classmethod
    @converter(ExampleFormat2Base)
    def from_example2(cls, fs_path: LazyField):
        node = Converter2to1(
            in_file=fs_path)
        return node, node.lzout.out_file

class ExampleFormat2(ExampleFormat2Base):
    ext = 'exm2'

    @classmethod
    @converter(ExampleFormat1)
    def from_example1(cls, fs_path: LazyField):
        node = Converter1to2(
            in_file=fs_path)
        return node, node.lzout.out_file
While not necessary, it can be convenient to add methods for accessing
file-group data within Python. This makes it possible to write generic methods
to generate publication outputs. Some suggested methods are:

data - access the data array, particularly relevant for imaging data
metadata - access a dictionary containing metadata extracted from a header or side-car file
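As a sketch of the metadata accessor, a JSON side-car can be read with only the standard library (the class name `NiftiXLike` and its constructor are illustrative, not the real Arcana classes; a `data` property for imaging formats would typically wrap a reader such as nibabel instead):

```python
# Illustrative file-group-like class with a `metadata` accessor backed by
# a JSON side-car that shares the primary file's name-stem.
import json
from pathlib import Path

class NiftiXLike:
    def __init__(self, fs_path):
        self.fs_path = Path(fs_path)

    @property
    def metadata(self) -> dict:
        """Read header metadata from the side-car JSON next to the primary file."""
        stem = self.fs_path.name
        # strip the full extension, handling multi-part ones like '.nii.gz'
        for suffix in reversed(self.fs_path.suffixes):
            stem = stem[:-len(suffix)]
        with open(self.fs_path.with_name(stem + '.json')) as f:
            return json.load(f)
```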
Data spaces#
New data spaces (see Spaces) are defined by extending the
DataSpace abstract base class. DataSpace subclasses are
enums with binary-string
values of consistent length (i.e. all of length 2 or all of length 3, etc.).
The length of the binary string defines the rank of the data space,
i.e. the maximum depth of a data tree within the space. The enum must contain
members for each permutation of the bit string (e.g. for 2 dimensions, there
must be members corresponding to the values 0b00, 0b01, 0b10 and 0b11).
For example, in imaging studies, scanning sessions are typically organised by analysis group (e.g. test & control), membership within the group (i.e. matched subject ID) and time-point for longitudinal studies. In this case, we can visualise the imaging sessions arranged in a 3-D grid along the group, member and timepoint axes. Note that datasets that only contain one group or time-point can still be represented in this space; they are just singleton along the corresponding axis.
All axes should be included as members of a DataSpace subclass enum with orthogonal binary vector values, e.g.:
member = 0b001
group = 0b010
timepoint = 0b100
The axis that is most often non-singleton should be given the smallest bit as this will be assumed to be the default when there is only one layer in the data tree, e.g. imaging datasets will not always have different groups or time-points but will always have different members (which are equivalent to subjects when there is only one group).
The “leaf rows” of a data tree, imaging sessions in this example, will be the
bitwise OR of the axis vectors, i.e. an imaging session is uniquely defined by
its member, group and timepoint IDs:
session = 0b111
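Combining the three axis values from the example above shows how the leaf-row value arises:

```python
# The session value is the bitwise OR of the three axis vectors: a session
# is pinned down by its member, group and timepoint IDs together.
member = 0b001
group = 0b010
timepoint = 0b100

session = member | group | timepoint
print(bin(session))  # 0b111
```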
In addition to the data items stored in leaf rows, some data, particularly
derivatives, may be stored in the dataset along a particular dimension at a
lower “row_frequency” than ‘per session’. For example, brain templates are
sometimes calculated ‘per group’. Data can also be stored in rows that
aggregate across a plane of the grid. These frequencies should also be added
to the enum, i.e. all permutations of the base dimensions must be included,
and given intuitive names where possible:
subject = 0b011 - a uniquely identified subject within the dataset
batch = 0b110 - separate group + timepoint combinations
matchedpoint = 0b101 - matched members and time-points aggregated across groups
Finally, for items that are singular across the whole dataset, there should also be a dataset-wide member with a value of 0:
dataset = 0b000
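Putting the members above together, the full 3-D imaging space can be sketched as a plain Enum (in Arcana it would extend arcana.core.data.space.DataSpace instead; the class name `Imaging` is illustrative):

```python
# Sketch of the complete 3-D imaging data space: every permutation of the
# three axis bits is named, from the dataset root (0b000) to the leaf
# sessions (0b111).
from enum import Enum

class Imaging(Enum):
    # axes
    member = 0b001
    group = 0b010
    timepoint = 0b100
    # planes aggregating pairs of axes
    subject = 0b011       # group | member
    matchedpoint = 0b101  # member | timepoint
    batch = 0b110         # group | timepoint
    # leaf and root rows
    session = 0b111
    dataset = 0b000
```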
For example, if you wanted to analyse daily recordings from various weather stations, you could define a 2-dimensional “Weather” data space with axes for the date and weather station of the recordings, with the following code:

from arcana.core.data.space import DataSpace

class Weather(DataSpace):

    # Define the axes of the data space
    timepoint = 0b01
    station = 0b10

    # Name the leaf and root frequencies of the data space
    recording = 0b11
    dataset = 0b00
Note
All permutations of N-D binary strings need to be named within the enum.
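A quick way to check this requirement is to verify that the enum's values cover every integer up to 2^rank, sketched here with a plain-Enum stand-in for the Weather space above:

```python
# Sanity check that a data-space enum names every permutation of its
# bit string (plain Enum stand-in for the DataSpace subclass above).
from enum import Enum

class Weather(Enum):
    timepoint = 0b01
    station = 0b10
    recording = 0b11
    dataset = 0b00

rank = max(member.value for member in Weather).bit_length()
assert {member.value for member in Weather} == set(range(2 ** rank))
print(f"rank {rank}: all {2 ** rank} permutations named")
```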