Consider moving our storage from AWS to Digital Alliance #162

jcohenadad opened this issue Apr 3, 2024 · 20 comments

@jcohenadad (Member)

The spine-generic dataset is being downloaded more and more often, which comes at a cost. For example, the cost for 2023 was $478, which is not negligible (monthly costs summed below):

>>> 34+33+26+24+37+37+40+90+32+24+45+56
478

I'm wondering how feasible/difficult it would be to move the git-annex server to a Digital Alliance cloud?

@namgo commented Apr 3, 2024

I can't remember off the top of my head what Digital Alliance's policy is for this, but I know I'd talked to Nick about it before. My memory is that they're happy to supply the bandwidth, but I could be mistaken.

I'll submit a ticket about it now.

@mguaypaq (Member) commented Apr 3, 2024

It should be possible, but I'll just point out a technical difference:

  • For our internal dataset server, and for spineimage.ca, we're running NeuroGitea, which has both the repository metadata (file names, directories, commits, etc.) and the large data files (the actual .nii.gz files).
  • For spine-generic, we still want to host the repository metadata on Github. It's just the large data files that we want to host on the Digital Alliance cloud, instead of on Amazon S3.

So it's not exactly the same interface/setup. But I'm sure there exists open-source software that can present the same interface as S3. Or we can also look at the other types of special remotes that git-annex supports.
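
For illustration, pointing git-annex's built-in S3 special remote at a non-Amazon, S3-compatible endpoint might look roughly like this (a sketch only; the host and bucket names below are placeholders):

    # placeholder endpoint and bucket; host= overrides the default Amazon endpoint
    git annex initremote cloud-storage type=S3 \
      host=s3.example.org protocol=https port=443 \
      encryption=none requeststyle=path \
      bucket=spine-generic-data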

@jcohenadad (Member, Author)

Thank you for the clarification @mguaypaq. Would it make sense to move spine-generic to a NeuroGitea server?

@mguaypaq (Member) commented Apr 3, 2024

I think Github still gives us a lot of value:

  • All the existing links keep working.
  • All the issues/pull requests are there.
  • We don't have to manage user accounts for every collaborator.

Probably we just want to change the storage backend for the large data files; that should be a much smaller change from our users' point of view, and it should be very doable.

@mguaypaq (Member) commented Apr 3, 2024

In particular, it looks like the Digital Alliance already has a service that's compatible with S3, so maybe this can be easy:
https://docs.alliancecan.ca/wiki/Arbutus_object_storage

@namgo commented Apr 4, 2024

I heard back from Digital Alliance on Arbutus (way quicker than I expected... 30 minute turnaround time!).

There's no specific policy about using Arbutus for public datasets, but the bandwidth they can provide is pretty limited, unless it's primarily our own users who are downloading the dataset. From the sounds of things, they have fast uplinks to Canadian university networks, but relatively slow uplinks to everyone else.

Do we know how many external people are using the dataset?

@jcohenadad (Member, Author)

> Do we know how many external people are using the dataset?

I'd say ~10 ppl/month? But uplink doesn't matter too much; what matters is downlink. And as long as it is not insanely slow (which it is not), we should be fine.

@namgo commented Apr 4, 2024

I mean, uplink in this case is their upload speed to external networks, which would affect our download speed. But:

> Well the CANARIE research backbone that connects all of Canada together has some pretty good interconnections with Internet2 and GEANT at the very least, so that should cover the USA and most of Europe.

I think that fits our needs, very cool!

@mguaypaq do you have a vision for how you might switch backends in git-annex? I imagine we can just import the dataset from S3 to Arbutus without much trouble, but I'm not sure how this works with git-annex.

@mguaypaq (Member) commented Apr 4, 2024

Just like git, git-annex supports having multiple remotes. So, I imagine we would:

  1. Figure out the right configuration and permissions for Arbutus object storage. (Probably you @namgo? Although maybe I can help since I have admin access to the existing Amazon S3 stuff.)
  2. Configure Arbutus as a new special remote for git-annex, alongside the existing Amazon special remote. (Probably me, since I'm most familiar with git-annex.)
  3. Copy over the files through git-annex, which will double as a test that new files can be uploaded by people with write access.
  4. Test that a new clone can get the files from the Arbutus special remote.
  5. Deconfigure the Amazon special remote from git-annex (but keep the files in place for a little while).
  6. Once everything works, delete the Amazon buckets.

So, a nice gradual transition, with plenty of opportunity to roll back if there are problems.
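
Roughly, steps 3 to 5 might translate to git-annex commands like these (a sketch only; "arbutus" and "amazon" are placeholder remote names, not necessarily what we'll end up calling them):

    # step 3: copy all annexed content to the new special remote
    git annex copy --to arbutus --all
    # step 4: from a fresh clone, check that content can be fetched back
    git annex get --from arbutus .
    # step 5: mark the old remote as dead so git-annex stops using it
    #         (the files themselves stay in the Amazon bucket for now)
    git annex dead amazon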

@namgo commented Apr 4, 2024

@nullnik-0 is becoming our resident expert on ComputeCanada already! I'd be down to work alongside her since I have admin perms on our CC projects. I'll loop all three of us into a Slack convo and we can talk about permissions.

@mguaypaq (Member) commented Apr 4, 2024

Preliminary tests seem to work! We should be able to migrate to Arbutus object storage fairly quickly and reduce our Amazon bandwidth costs.

Steps:

  • Following the Alliance docs for Arbutus object storage, I downloaded and sourced my OpenStack RC from the API access dashboard. Then I ran

    openstack ec2 credentials create

    and saved the resulting access key and secret key in my password manager.

  • In a fresh clone of spine-generic/data-single-subject, I created two special remotes:

    • arbutus-read: this one is world-readable and auto-enabled.

    • arbutus-write: this one has to be enabled manually, and can be used by people with ec2 credentials (like me from the previous point) to upload image files. Git-annex knows that it refers to the same bucket as arbutus-read.

    read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
    # copy-paste the access key and secret key, separated by a space, then press enter
    export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
    env | grep AWS_
    
    git annex initremote arbutus-read type=S3 \
      autoenable=true \
      bucket=def-jcohen-data-single-subject \
      datacenter=CA \
      encryption=none \
      host=object-arbutus.cloud.computecanada.ca \
      port=443 \
      protocol=https \
      public=yes \
      publicurl=https://object-arbutus.cloud.computecanada.ca/def-jcohen-data-single-subject/ \
      requeststyle=path
    
    git annex initremote --sameas=arbutus-read arbutus-write type=S3 \
      bucket=def-jcohen-data-single-subject \
      datacenter=CA \
      host=object-arbutus.cloud.computecanada.ca \
      port=443 \
      protocol=https \
      public=no \
      requeststyle=path
  • Get some data files and push them to Arbutus:

    git annex get sub-douglas
    git annex copy --to arbutus-write sub-douglas
  • In a clone-of-my-clone, try to get the files, without having ec2 credentials:

    unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
    git annex get --from arbutus-read sub-douglas

I'm out of time for this week, but next week I'll try to migrate both data-single-subject and data-multi-subject to Arbutus.
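
As a sanity check along the way (not part of the migration itself), something like this should list arbutus-read among the locations holding a copy of each file once the upload has gone through:

    # show, for each annexed file under sub-douglas, which remotes hold a copy
    git annex whereis sub-douglas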

@bpinsard

Stumbling on that convo...
You do not necessarily need to decommission the AWS storage: you can control which special remote gets chosen by default (while keeping the others as fallbacks) by setting a cost value on the special remote when initializing it (or adding it with enableremote afterwards).
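
For example (a sketch only; "amazon-fallback" is a made-up name for the existing AWS remote):

    # lower cost = tried first; AWS stays configured, but only as a fallback
    git annex enableremote arbutus-read cost=100
    git annex enableremote amazon-fallback cost=200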

Also, I don't think you need to set up two special remotes for read and write respectively; the first one should work for both (writes just require credentials).

@jcohenadad (Member, Author)

Thank you for your insights @bpinsard!

A fallback is definitely useful, but don't we already have a backup of spine-generic @namgo?

@mguaypaq (Member)

@bpinsard have you gotten the single read/write remote to work in the past? I remember trying (in the past year) to use a single special remote for both, but couldn't get it to work. Possibly something to do with this interaction between the config settings.

@bpinsard

We do not use that setup in production (only authenticated access), but I just tested it with a MinIO S3 server (so not the same as Digital Alliance).
There is one caveat, though (which I think occurs with the split setup too): if someone has permanently set up AWS keys in their environment for another server/usage, those keys override the anonymous access and can cause 403 errors, because git-annex sends the credentials (it might depend on how each server deals with policies and anonymous access).

I think a good way to avoid that is to set up the S3 remote for read/write data management only, not autoenabled, and then add an httpalso sameas remote, crafting the https URL from the server, bucket name and requeststyle.

git annex initremote https_download --sameas=s3_remote_name autoenable=true type=httpalso url=https://s3.unf-montreal.ca/test.publicurl/ cost=50 

This can save a lot of user-support headaches.

@mguaypaq (Member)

Oh! I didn't know about the httpalso remote type; that makes a lot of sense. It's still two remotes with a sameas, but probably with fewer corner cases.
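
Adapted to the Arbutus bucket from my earlier test, that could look something like this (untested, just a sketch):

    # public downloads go over plain https; uploads still go through the S3 remote
    git annex initremote arbutus-https --sameas=arbutus-read autoenable=true type=httpalso \
      url=https://object-arbutus.cloud.computecanada.ca/def-jcohen-data-single-subject/ cost=50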

@namgo commented May 13, 2024

> A fallback is definitely useful, but don't we already have a backup of spine-generic @namgo?

We have a backup of what's on Gitea (and what was on gitolite); however, I understand those to be git-annex archives rather than the datasets themselves.

@namgo commented May 13, 2024

Whoops! I misunderstood: we don't have spine-generic backed up on restic. Mathieu helped me remember that that one's on GitHub.

@jcohenadad (Member, Author)

then we should probably create a backup, no?

@namgo commented May 14, 2024

Good point. I made a ticket for getting this put in restic (with some questions for Mathieu); it should be pretty straightforward.
