# Contribution Guidelines
## Preparing the environment
1. First, set up a cache storage. Create the file `~/.config/amid/.bev.yml` with the following content:

```yaml
main:
  storage: /path/to/storage
  cache: /path/to/cache
```

where `/path/to/storage` and `/path/to/cache` are some paths in your filesystem.

2. Run `amid init`.

The full command could look something like this:

```shell
mkdir -p ~/.config/amid
cat >~/.config/amid/.bev.yml <<EOL
main:
  storage: /mount/data/storage
  cache: /mount/data/cache
EOL
amid init
```
## Adding a new dataset
We will be using LiTS as an example.
1. Download the raw data to a separate folder in your filesystem.

2. (Optionally) create a new branch for the dataset.
3. Create a class that loads the raw data. `LiTSBase` is a good example. Note how each field is implemented as a separate function.

There are no strict rules regarding the dataset fields, but try to keep the output "as raw as possible", i.e., do not apply heavy processing that modifies the data irreversibly.

Rule of thumb: the dataset should be written in such a way that making a submission to a contest would work out of the box.
**Note:** in case of DICOM files, make sure to transpose the first 2 image axes. This way, the image axes will be consistent with the potential contour coordinates.

**Tip:** if some value is missing for a given id, it is preferable to return `None` instead of raising an exception.

**Tip:** the dataset must have a docstring which describes it and provides a link to the original data.

**Tip:** if the raw data contains a table with metadata, it is preferable to split the metadata columns into separate fields.
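The conventions from step 3 and the tips above can be sketched with a toy class. Everything here is hypothetical (the class name, the one-file-per-id layout, the link) and only illustrates the one-function-per-field style, not the real amid base classes:

```python
from pathlib import Path


class ToyDataset:
    """A toy dataset: one text file per id under `root`.

    Original data: https://example.org/toy-data (hypothetical link)
    """

    def __init__(self, root):
        self.root = Path(root)

    @property
    def ids(self):
        # one id per raw file; sorted so the order is stable
        return tuple(sorted(p.stem for p in self.root.glob('*.txt')))

    def image(self, i):
        # return the raw content "as is" - no irreversible processing here
        return (self.root / f'{i}.txt').read_text()

    def label(self, i):
        # prefer returning None over raising when a value is missing
        path = self.root / f'{i}.label'
        return path.read_text().strip() if path.exists() else None
```

Note how `label` returns `None` for ids without a label file, and how each field is a separate method keyed by id.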
4. Register the dataset using `normalize`, where the first 3 arguments are:

- the raw dataset
- the final dataset name
- a short name for the dataset (mostly used for various files generation)

and `...` stands for the following arguments:

- `modality` — the images' modality/modalities, e.g., CT, MRI
- `body_region` — the anatomical regions present in the dataset, e.g., Head, Thorax, Abdomen
- `license` — the dataset's license, if any
- `link` — the link to the original data
- `raw_data_size` — the total size required for the raw data, e.g., 10G, 500M
- `task` — the dataset's downstream task, if any, e.g., Supervised Learning, Domain Adaptation, Self-supervised Learning, Tumor Segmentation, etc.
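Based on the argument list above, the registration call presumably looks something like the sketch below; the exact signature of `normalize` and the argument values are assumptions for illustration, not the verified amid API:

```python
# Hypothetical sketch of the registration call - check the real datasets
# in the repository (e.g., the LiTS module) for the actual usage.
LiTS = normalize(
    LiTSBase,  # the raw dataset
    'LiTS',    # the final dataset name
    'lits',    # short name; also determines the hash file name
    modality='CT',
    body_region='Abdomen',
    task='Tumor Segmentation',
    # license, link, raw_data_size, ... as described above
)
```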
5. Make sure all the methods are working as expected:

```python
from amid.lits import LiTS

dataset = LiTS(root="/datasets/LiTS")

print(len(dataset.ids))
id_ = dataset.ids[0]
print(dataset.image(id_).shape)
```
6. Populate the dataset:

**Tip:** use the option `--n-jobs` to speed up the process.

**Tip:** use the option `--help` for more detailed information on this command.
7. If there is no error, the file `amid/data/lits.hash` will appear (the name depends on the `short_name` given to `normalize`).
8. Check the codestyle using the `lint.sh` script in the repository's root and make changes if flake8 is not happy.
9. Commit all the files you added, including the `*.hash` one.