Consider moving our storage from AWS to Digital Alliance #162

jcohenadad opened this issue Apr 3, 2024 · 20 comments

@jcohenadad (Member)

The spine-generic dataset is being downloaded more and more often, which comes at a cost. For example, the cost for 2023 was $478, which is not negligible (monthly costs summed below):

>>> 34+33+26+24+37+37+40+90+32+24+45+56
478

I'm wondering how feasible/difficult it would be to move the git-annex server to a Digital Alliance cloud?

@namgo commented Apr 3, 2024

I can't remember off the top of my head what Digital Alliance's policy is for this, but I know I'd talked to Nick about it before. My memory is that they're happy to supply the bandwidth, but I could be mistaken.

I'll submit a ticket about it now.

@mguaypaq (Member) commented Apr 3, 2024

It should be possible, but I'll just point out a technical difference:

  • For our internal dataset server, and for spineimage.ca, we're running NeuroGitea, which has both the repository metadata (file names, directories, commits, etc.) and the large data files (the actual .nii.gz files).
  • For spine-generic, we still want to host the repository metadata on Github. It's just the large data files that we want to host on the Digital Alliance cloud, instead of on Amazon S3.

So it's not exactly the same interface/setup. But I'm sure there exists open-source software that can present the same interface as S3. Or we can also look at the other types of special remotes that git-annex supports.
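
For illustration, pointing git-annex's built-in S3 special remote at a non-Amazon, S3-compatible endpoint might look roughly like this (a sketch only; the host and bucket names below are placeholders):

    # placeholder endpoint and bucket; host= overrides the default Amazon endpoint
    git annex initremote cloud-storage type=S3 \
      host=s3.example.org protocol=https port=443 \
      encryption=none requeststyle=path \
      bucket=spine-generic-data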

@jcohenadad (Member, Author)

Thank you for the clarification @mguaypaq. Would it make sense to move spine-generic to a NeuroGitea server?

@mguaypaq (Member) commented Apr 3, 2024

I think Github still gives us a lot of value:

  • All the existing links keep working.
  • All the issues/pull requests are there.
  • We don't have to manage user accounts for every collaborator.

Probably we just want to change the storage backend for the large data files; that should be a much smaller change from our users' point of view, and it should be very doable.

@mguaypaq (Member) commented Apr 3, 2024

In particular, it looks like the Digital Alliance already has a service that's compatible with S3, so maybe this can be easy:
https://docs.alliancecan.ca/wiki/Arbutus_object_storage

@namgo commented Apr 4, 2024

I heard back from Digital Alliance on Arbutus (way quicker than I expected... 30 minute turnaround time!).

There's no specific policy about using Arbutus for public datasets, but the bandwidth they can provide is pretty limited, unless it's primarily our own users who are downloading the dataset. From the sounds of things, they have fast uplinks to Canadian university networks, but relatively slow uplinks to everyone else.

Do we know how many external people are using the dataset?

@jcohenadad (Member, Author)

> Do we know how many external people are using the dataset?

I'd say ~10 ppl/month? But uplink doesn't matter too much; what matters is downlink. And as long as it is not insanely slow (which it is not), we should be fine.

@namgo commented Apr 4, 2024

I mean, uplink in this case is their upload speed to external networks, which would affect our download speed. But:

> Well the CANARIE research backbone that connects all of Canada together has some pretty good interconnections with Internet2 and GEANT at the very least, so that should cover the USA and most of Europe.

I think that fits our needs, very cool!

@mguaypaq do you have a vision for how you might switch backends in git-annex? I imagine we can just import the dataset from S3 to Arbutus without much trouble, but I'm not sure how this works with git-annex.

@mguaypaq (Member) commented Apr 4, 2024

Just like git, git-annex supports having multiple remotes. So, I imagine we would:

  1. Figure out the right configuration and permissions for Arbutus object storage. (Probably you @namgo? Although maybe I can help since I have admin access to the existing Amazon S3 stuff.)
  2. Configure Arbutus as a new special remote for git-annex, alongside the existing Amazon special remote. (Probably me, since I'm most familiar with git-annex.)
  3. Copy over the files through git-annex, which will double as a test that new files can be uploaded by people with write access.
  4. Test that a new clone can get the files from the Arbutus special remote.
  5. Deconfigure the Amazon special remote from git-annex (but keep the files in place for a little while).
  6. Once everything works, delete the Amazon buckets.

So, a nice gradual transition, with plenty of opportunity to roll back if there are problems.
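
Roughly, steps 3 to 5 might translate to git-annex commands like these (a sketch only; "arbutus" and "amazon" are placeholder remote names, not necessarily what we'll end up calling them):

    # step 3: copy all annexed content to the new special remote
    git annex copy --to arbutus --all
    # step 4: from a fresh clone, check that content can be fetched back
    git annex get --from arbutus .
    # step 5: mark the old remote as dead so git-annex stops using it
    #         (the files themselves stay in the Amazon bucket for now)
    git annex dead amazon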

@namgo commented Apr 4, 2024

@nullnik-0 is becoming our resident expert on ComputeCanada already! I'd be down to work alongside her since I have admin perms on our CC projects. I'll loop all three of us into a Slack convo and we can talk about permissions.

@mguaypaq (Member) commented Apr 4, 2024

Preliminary tests seem to work! We should be able to migrate to Arbutus object storage fairly quickly and reduce our Amazon bandwidth costs.

Steps:

  • Following the Alliance docs for Arbutus object storage, I downloaded and sourced my OpenStack RC from the API access dashboard. Then I ran

    openstack ec2 credentials create

    and saved the resulting access key and secret key in my password manager.

  • In a fresh clone of spine-generic/data-single-subject, I created two special remotes:

    • arbutus-read: this one is world-readable and auto-enabled.

    • arbutus-write: this one has to be enabled manually, and can be used by people with ec2 credentials (like me from the previous point) to upload image files. Git-annex knows that it refers to the same bucket as arbutus-read.

    read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
    # copy-paste the access key and secret key, separated by a space, then press enter
    export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
    env | grep AWS_
    
    git annex initremote arbutus-read type=S3 \
      autoenable=true \
      bucket=def-jcohen-data-single-subject \
      datacenter=CA \
      encryption=none \
      host=object-arbutus.cloud.computecanada.ca \
      port=443 \
      protocol=https \
      public=yes \
      publicurl=https://object-arbutus.cloud.computecanada.ca/def-jcohen-data-single-subject/ \
      requeststyle=path
    
    git annex initremote --sameas=arbutus-read arbutus-write type=S3 \
      bucket=def-jcohen-data-single-subject \
      datacenter=CA \
      host=object-arbutus.cloud.computecanada.ca \
      port=443 \
      protocol=https \
      public=no \
      requeststyle=path
  • Get some data files and push them to Arbutus:

    git annex get sub-douglas
    git annex copy --to arbutus-write sub-douglas
  • In a clone-of-my-clone, try to get the files, without having ec2 credentials:

    unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
    git annex get --from arbutus-read sub-douglas

I'm out of time for this week, but next week I'll try to migrate both data-single-subject and data-multi-subject to Arbutus.
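
As a sanity check along the way (not part of the migration itself), something like this should list arbutus-read among the locations holding a copy of each file once the upload has gone through:

    # show, for each annexed file under sub-douglas, which remotes hold a copy
    git annex whereis sub-douglas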

@bpinsard

Stumbling on that convo...
You do not necessarily need to decommission the AWS storage: you can control which special remote gets chosen by default (while keeping the others as fallbacks) by setting a cost value on the special remote when initializing it (or adding it with enableremote afterwards).
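
For example (a sketch only; "amazon-fallback" is a made-up name for the existing AWS remote):

    # lower cost = tried first; AWS stays configured, but only as a fallback
    git annex enableremote arbutus-read cost=100
    git annex enableremote amazon-fallback cost=200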

Also, I don't think you need to set up two special remotes for read and write respectively; the first one should work for both (writes just require credentials).

@jcohenadad (Member, Author)

Thank you for your insights @bpinsard!

A fallback is definitely useful, but don't we already have a backup of spine-generic @namgo?

@mguaypaq (Member)

@bpinsard have you gotten the single read/write remote to work in the past? I remember trying (in the past year) to use a single special remote for both, but couldn't get it to work. Possibly something to do with this interaction between the config settings.

@bpinsard

We do not use that setup in production (only authenticated access), but I just tested it with a MinIO S3 server (so not the same as Digital Alliance).
There is one caveat, though (which I think occurs with the split setup too): if someone has permanently set up AWS keys in their environment for another server/usage, those keys override the anonymous access and can cause 403 errors, because git-annex sends the credentials (it might depend on how each server deals with policies and anonymous access).

I think a good way to avoid that is to set up the S3 remote for read/write data management only, not autoenabled, and then add an httpalso sameas remote, crafting the https URL from the server, bucket name and requeststyle.

git annex initremote https_download --sameas=s3_remote_name autoenable=true type=httpalso url=https://s3.unf-montreal.ca/test.publicurl/ cost=50 

This can save a lot of user-support headaches.

@mguaypaq (Member)

Oh! I didn't know about the httpalso remote type; that makes a lot of sense. It's still two remotes with a sameas, but probably with fewer corner cases.
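
Adapted to the Arbutus bucket from my earlier test, that could look something like this (untested, just a sketch):

    # public downloads go over plain https; uploads still go through the S3 remote
    git annex initremote arbutus-https --sameas=arbutus-read autoenable=true type=httpalso \
      url=https://object-arbutus.cloud.computecanada.ca/def-jcohen-data-single-subject/ cost=50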

@namgo commented May 13, 2024

> A fallback is definitely useful, but don't we already have a backup of spine-generic @namgo?

We have a backup of what's on Gitea (and what was on gitolite); however, I understand those to be git-annex archives rather than the datasets themselves.

@namgo commented May 13, 2024

Whoops! I misunderstood: we don't have spine-generic backed up on restic. Mathieu helped me remember that that one's on GitHub.

@jcohenadad (Member, Author)

then we should probably create a backup, no?

@namgo commented May 14, 2024

Good point. I made a ticket for getting this put in restic (with some questions for Mathieu); it should be pretty straightforward.
