
Implement version tracking system for CanProCo data #86

Open · jcohenadad opened this issue Apr 9, 2024 · 6 comments
@jcohenadad
Member

jcohenadad commented Apr 9, 2024

Context

Note

Conversation started via email, but redirected here for transparency and easy cross-referencing. Everyone please feel free to contribute to the discussion!

There have been multiple conversations about issues with the dataset (eg: the exact same images being labeled as M0 and M12: #39), which are being fixed locally without knowing whether the errors are also being fixed at the source, and at the other institutions using the data for analysis.

In addition, some of the issues are reported to us from a site other than the source site (example: #13), so we end up fixing things on our internal server without knowing whether the exact same corrections are also being made at the source site.

Problem: Given that the dataset is not being synced across the multiple user sites, we end up with multiple versions of the dataset that are not being tracked, potentially leading to errors and lack of reproducibility.

Solutions

We should look at ways to version track the data and its usage across all the user sites. The earlier the better: as time passes, errors accumulate, making it increasingly difficult to reconstruct the history.

Track source dataset with git-annex

git-annex is a popular technology for version-tracking datasets, based on git. It is notably used by DataLad, a reference tool in the neuroimaging community for sharing data and performing reproducible science. An excellent solution would be to convert the source repos into a git-annex repos, and make modifications via regular git commit/push so that they are trackable.
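For illustration, here is a minimal sketch of what this conversion could look like at the source site; the dataset path, clone description, and tag name are hypothetical:

```sh
# Minimal sketch (hypothetical paths): convert the existing dataset folder into a git-annex repos
cd /data/canproco                   # source dataset root (assumption)
git init
git annex init "ubc-source"         # describe this clone for git-annex
git annex add .                     # large files are moved into the annex and replaced by pointers
git commit -m "Initial import of CanProCo dataset"
git tag v1.0                        # tag releases so user sites can reference an exact version
```

Tagging each release gives every user site an unambiguous reference (tag or commit SHA) to the exact dataset state they analyzed.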

Several levels of permissions are possible:

  1. The most restrictive: the source site would be the only one with R/W permission, and no other site would have either R or W permission (due to network access limitations at the source site, for security reasons). The source site would manage the git-annex repos and distribute (eg: via secured SFTP) specific versions of the repos (with a specific commit SHA); see the sketch after this list.
  2. Less restrictive: the source site has R/W access, and some sites have R access, to be able to git-annex checkout/pull a specific version of the repos. Pros: less manual work to distribute the data; cons: possible security issues from the source IT management team.
  3. Even less restrictive: the source site has R/W access, and some sites have R/W access, to be able to fetch data and to push contributions (eg: manual segmentations, see below). Pros: less manual work, less prone to human error when copying from a collaborative site to the source site; cons: security issues (likely not going to work).

I think that option 1 is the most realistic/reasonable given the IT context.
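As an illustration of option 1, a receiving site could pin its local copy to the version communicated by the source site; the path and the SHA placeholder below are hypothetical:

```sh
# Work directly in the repos received out-of-band (eg: via secured SFTP), per option 1
cd /data/canproco
git checkout <commit-SHA>           # pin the working tree to the version the source site named
git annex fsck                      # verify that file contents match their annexed checksums
```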

Create manual checksums

If a git-annex repos is not possible, or while it is being implemented, a "quick and dirty" solution is for the source site to create checksums of all files in the dataset (recursively), which could be done with:

find . -type f ! -name 'CANPROCO_v*' -exec shasum -b -a 256 {} \; > CANPROCO_vX.Y

The sums can then be verified by collaborating sites with:

shasum -c -a 256 CANPROCO_vX.Y

Additional usage

We should also consider that other sites might contribute to the dataset, eg, with manual segmentation labels. Using git-annex would be a means to push the segmentations to the source repository, so that they could also serve other sites for analysis.

Resource

Examples of multi-site data managed with git-annex

@aman-s

aman-s commented Apr 22, 2024

Thanks Julien - in the last couple of days we have changed our BIDS data structure to be centralized and contain both M0 and M12 timepoints, as opposed to previously having separate M0 and M12 data storage locations (and having to send separate zipfiles for each timepoint).

This was done to allow one data directory which can be git-annex tracked, as per your suggestion. In the coming days, I will send this M0-M12 data in the new structure over UBC OneDrive, and add you to its git-annex repository on GitHub, so any future changes can be tracked.

Please let us know how this sounds.

@jcohenadad
Member Author

Fantastic! Thank you @aman-s

> In the coming days, I will send this M0-M12 data in the new structure over UBC OneDrive, and add you to its git-annex repository on GitHub, so any future changes can be tracked.

If the repos is tracked with git-annex and you can add us to the repos, why send the data via UBC's OneDrive? It defeats the purpose of getting the data directly via git-annex, no?

@aman-s

aman-s commented Apr 22, 2024

> If the repos is tracked with git-annex and you can add us to the repos, why send the data via UBC's OneDrive? It defeats the purpose of getting the data directly via git-annex, no?

@jcohenadad UBC doesn't allow us to put the actual data files on GitHub; therefore, we were planning to have just the file pointers and hashes of the zipped data files on GitHub, for tracking purposes.
The GitHub repo can still be downloaded on your end and locally combined with the zipped data we send over UBC OneDrive.
This does add one extra step for you before you can 'git-annex pull' files locally. We also plan on noting, for each git-annex push, which data zipfile the push corresponds to.
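For illustration, the combine-locally step might look like the sketch below; the repos URL, zip name, and paths are hypothetical:

```sh
# Hypothetical names: combine the GitHub pointer repos with the zip received via OneDrive
git clone https://github.com/<ubc-org>/canproco.git
cd canproco
unzip ~/Downloads/canproco_M0_M12.zip -d /tmp/canproco
# reinject moves each received file into the annex when its checksum matches a known pointer
find /tmp/canproco -type f -exec git annex reinject --known {} \;
git annex fsck                      # confirm that all annexed contents match their checksums
```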

We are also open to any other suggestions which may be helpful for your team in streamlining!

@jcohenadad
Member Author

jcohenadad commented Apr 22, 2024

> @jcohenadad UBC doesn't allow us to put the actual data files on GitHub; therefore, we were planning to have just the file pointers and hashes of the zipped data files on GitHub, for tracking purposes. The GitHub repo can still be downloaded on your end and locally combined with the zipped data we send over UBC OneDrive. This does add one extra step for you before you can 'git-annex pull' files locally. We also plan on noting, for each git-annex push, which data zipfile the push corresponds to.

I think there is a misunderstanding. I did not suggest that I would pull the binary files from GH, but that I would clone/checkout/pull the repository that includes the pointers to the actual data (which is the main difference between git and git-annex, ie, the repos points to where the data are). Now, the repos could be public (eg: on GH), while the data require a token to be fetched. See e.g. https://docs.cneuromod.ca/en/latest/ACCESS.html#versioning

Alternatively, the git-annex repos could be hosted on your servers, and SSH permissions could be given to external collaborators to fetch the data via git-annex commands (git-annex is compatible with the SSH protocol).
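For illustration, fetching from a source-hosted repos over SSH could look like this; the host name and paths are hypothetical:

```sh
# Hypothetical host and paths: fetch annexed contents over SSH
git clone ssh://collaborator@data.ubc.example/srv/canproco.git
cd canproco
git annex get sub-001/              # download the actual file contents for one subject
git annex get . --jobs 4            # or fetch everything, with parallel transfers
```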

@aman-s

aman-s commented Apr 23, 2024

From my understanding, the original purpose of having the GH git-annex of the CanProCo spinal data was so we can track any file changes/past errors that have been fixed retrospectively. Errors fixed retrospectively at UBC were prone to being lost in communication without version control, which we are hoping the git-annex will help with, by allowing you to run checksums on received data.

Second, having a git-annex repository that would allow your team to pull specific versions of the datasets could be another benefit of the git-annex. However, setting up a token to fetch data directly over SSH, or hosting a git server, requires more discussion with our lab PIs and IT. Currently, the approved method of data transfer is via UBC OneDrive, with the UBC team sending periodic data updates - but we can think about direct git-annex pulls in the future.

Perhaps I can share the new GH git-annex (for version tracking of the files) and the new combined M0 data structure, so you get a better sense of how it can offer you checksums for past files? And we can continue our discussion about adding remotes to git-annex in the future. Let me know how that sounds!

@jcohenadad
Member Author

Another use case is for researchers to contribute to the dataset, eg, with manual segmentations. Being able to push those segmentations would benefit other researchers (like @leelisae). The modus operandi could be:

  • the external researcher creates a branch on the git-annex repository,
  • the external researcher makes modifications (eg: pushes segmentations),
  • the UBC internal team reviews the modifications and, upon validation, merges them onto the main branch (which is protected, and read-only for external researchers).

From experience, this will save a lot of trouble (ie: minimize human error and systematize procedures). A sketch of this workflow is shown below.
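A minimal sketch, assuming the external researcher can push branches to an annex-aware remote; the branch and file names are hypothetical:

```sh
# Hypothetical branch and file names: external researcher contributes manual segmentations
git checkout -b add-manual-segmentations
cp ~/labels/sub-001_lesion-manual.nii.gz derivatives/labels/
git annex add derivatives/labels/
git commit -m "Add manual lesion segmentation for sub-001"
git push origin add-manual-segmentations    # push the branch (pointer files only)
git annex copy --to origin                  # transfer the annexed file contents as well
# the UBC internal team then reviews and merges onto the protected main branch
```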

One interesting avenue would be to host the repos on OneDrive; some people have done that. A few relevant links:
