Training checklist #1

justheuristic · 2021-12-21T15:18:12Z

Start some peers

@razaidy starts the initial CPU peer and shares her Peer ID (/ip4/something/tcp/something)
@justheuristic starts another CPU peer for fault-tolerance and shares his Peer ID
Start a few GPU workers
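
Since the peer address is shared by hand, it may help to sanity-check it before pasting it into other peers' configs. Below is a minimal sketch, assuming the shared string has the `/ip4/<host>/tcp/<port>` shape shown above, optionally followed by `/p2p/<peer id>`. The function name `parse_peer_address` and the example address are hypothetical, and this is not hivemind's own parser; it only covers these three components.

```python
import ipaddress


def parse_peer_address(address: str) -> dict:
    """Split a '/ip4/<host>/tcp/<port>[/p2p/<id>]' string into its parts.

    Hypothetical helper for illustration: it handles only the ip4, tcp and
    p2p components and raises ValueError on anything else it cannot read.
    """
    parts = address.strip().split("/")
    # A well-formed address starts with '/' and alternates name/value pairs.
    if parts[0] != "" or len(parts) < 5 or len(parts) % 2 == 0:
        raise ValueError(f"unexpected address layout: {address!r}")
    fields = dict(zip(parts[1::2], parts[2::2]))
    if "ip4" not in fields or "tcp" not in fields:
        raise ValueError(f"address needs both ip4 and tcp parts: {address!r}")
    host = str(ipaddress.IPv4Address(fields["ip4"]))  # raises ValueError if invalid
    port = int(fields["tcp"])  # raises ValueError if not a number
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return {"host": host, "port": port, "peer_id": fields.get("p2p")}


if __name__ == "__main__":
    # 192.0.2.1 is a documentation-only address; the peer id is made up.
    print(parse_peer_address("/ip4/192.0.2.1/tcp/31337/p2p/QmExample"))
```

A placeholder such as `/ip4/something/tcp/something` is rejected with a `ValueError`, which is the point of the check.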

Technical features

Scale sequence length dynamically during training [@justheuristic]
- add 32 tokens every 1000 global steps, up to a maximum of 512
training with TPU
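
The sequence length schedule above can be sketched as a pure function of the global step. This is only an illustration: the function name and the starting length of 128 are assumptions (the checklist does not state the initial length); the increment of 32, the 1000-step interval and the cap of 512 come from the item above.

```python
def sequence_length_at(
    step: int,
    initial_length: int = 128,  # assumption: the starting length is not given in the checklist
    increment: int = 32,
    steps_per_increment: int = 1000,
    max_length: int = 512,
) -> int:
    """Return the sequence length to use at a given global step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    # One increment is added each time a full interval of global steps has passed.
    grown = initial_length + increment * (step // steps_per_increment)
    return min(grown, max_length)


if __name__ == "__main__":
    for step in (0, 999, 1000, 5000, 20000):
        print(step, sequence_length_at(step))
```

Keeping the schedule a function of the global step alone means every peer that knows the current step can compute the same length without extra coordination.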

Volunteer starter kit

draft notebook
create an invite link for volunteers
text explanation for running in Colab
upload notebook to CALM/notebooks?
Discord chatroom
organization page
instructions for training locally, with Kaggle, with SageMaker
email to volunteers (early, ~Wednesday)

After we start

contributors dashboard [@SaulLu]
running evaluation every few hours
set up monitoring & support shifts

Sanity checks

make sure training data looks right (@razaidy @JAWHARAH123 @pr-Mais)
make sure loss is in a reasonable range
verify that volunteer starter kits work
make sure training with TPU does not leak memory during host<->device transfer (@justheuristic)
look at the data again just in case

Milestones

pass loss 9 by ~1000 steps (without stagnating)
do not blow up at peak learning rate (steps 2500-3500)
downstream should be better than random after step 4000
reach full sequence length (10000)
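
The first two milestones can be checked automatically from logged losses. The sketch below is one possible reading, not the project's monitoring code: `check_milestones` is a hypothetical helper, and "blow up" is interpreted here as the loss becoming non-finite or exceeding 1.5x the lowest loss seen before the peak learning rate window, which is an assumption.

```python
import math
from typing import Dict, Optional, Tuple


def check_milestones(
    loss_by_step: Dict[int, float],
    loss_target: float = 9.0,
    target_step: int = 1000,
    peak_lr_window: Tuple[int, int] = (2500, 3500),
    blowup_factor: float = 1.5,  # assumption: what counts as "blowing up"
) -> Dict[str, Optional[bool]]:
    """Evaluate the loss milestones; None means there is not enough data yet."""
    start, end = peak_lr_window
    early = [loss for step, loss in loss_by_step.items() if step <= target_step]
    before = [loss for step, loss in loss_by_step.items() if step < start]
    during = [loss for step, loss in loss_by_step.items() if start <= step <= end]

    # Milestone: the loss drops below the target by the target step.
    passed_target = any(loss < loss_target for loss in early) if early else None

    # Milestone: the loss stays finite and bounded inside the peak LR window.
    if before and during:
        baseline = min(before)
        stable = all(
            math.isfinite(loss) and loss <= blowup_factor * baseline
            for loss in during
        )
    else:
        stable = None
    return {"passed_loss_target": passed_target, "stable_at_peak_lr": stable}


if __name__ == "__main__":
    history = {500: 10.2, 1000: 8.7, 2000: 7.5, 3000: 7.1}  # made-up values
    print(check_milestones(history))
```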

pr-Mais · 2021-12-24T13:00:48Z

I see SageMaker on the list. Could GCE, with or without Colab, be used as well?

justheuristic · 2021-12-24T15:21:20Z

Anything can, as long as it has a free tier :)
SageMaker has a free Studio Lab: https://studiolab.sagemaker.aws/

SaulLu · 2021-12-29T10:38:41Z

Some news about the dashboard 😃

Here is what I have done so far:

1. Created a Streamlit Space named Dashboard in the CALM organization (https://huggingface.co/spaces/CALM/Dashboard)
2. Copied the dashboard repo of training-transformers-together into the new GitHub repo for the CALM dashboard
3. Changed the target HF repository in the GitHub workflow (here)
4. Created a machine user for the organization (https://huggingface.co/ncai-calm-machine-user) and added it to the organization with a WRITE role (we need its token). How should I share its password with you?

To finish setting up this dashboard, we should:
6. Add a secret named HF_TOKEN in the dashboard repository on GitHub, corresponding to a write access token of the machine user (I don't have the necessary rights on GitHub)
7. Add two secrets, WANDB_REPO_INDIVIDUAL_METRICS and WANDB_RUN_URL_MAIN_METRICS, in the dashboard repository on the Hub, corresponding to the links to the W&B pages storing the data.

justheuristic · 2021-12-30T06:36:16Z

Awesome work!

For step #6
@razaidy , you mentioned there is someone on your side who can admin the organization, right?

@SaulLu can they use these instructions for adding a secret?

For step #7, I took the liberty of adding these two secrets, but I'm not entirely sure I got the format right. Since they are public knowledge, I'll copy them here as well:

WANDB_REPO_INDIVIDUAL_METRICS=https://wandb.ai/calm/CALM-hivemind-trainers
WANDB_RUN_URL_MAIN_METRICS=https://wandb.ai/calm/CALM
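
Since the expected format is uncertain, here is a rough sketch of how such values could be read and validated on the consuming side. It assumes the secrets are exposed to the app as environment variables and that each value is a `https://wandb.ai/<entity>/<project>` URL; the helper names are hypothetical and the real dashboard code may parse them differently.

```python
import os
from typing import Mapping, Tuple
from urllib.parse import urlparse

SECRET_NAMES = ("WANDB_REPO_INDIVIDUAL_METRICS", "WANDB_RUN_URL_MAIN_METRICS")


def parse_wandb_url(url: str) -> Tuple[str, str]:
    """Return (entity, project) from a URL like https://wandb.ai/<entity>/<project>."""
    parsed = urlparse(url)
    parts = [part for part in parsed.path.split("/") if part]
    if parsed.netloc != "wandb.ai" or len(parts) < 2:
        raise ValueError(f"not a recognizable W&B URL: {url!r}")
    return parts[0], parts[1]


def load_dashboard_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read both secrets from the environment and split them into entity/project."""
    missing = [name for name in SECRET_NAMES if name not in env]
    if missing:
        raise KeyError(f"missing secrets: {', '.join(missing)}")
    return {name: parse_wandb_url(env[name]) for name in SECRET_NAMES}


if __name__ == "__main__":
    example_env = {
        "WANDB_REPO_INDIVIDUAL_METRICS": "https://wandb.ai/calm/CALM-hivemind-trainers",
        "WANDB_RUN_URL_MAIN_METRICS": "https://wandb.ai/calm/CALM",
    }
    print(load_dashboard_config(example_env))
```

Only the first two path segments are used, so the same check would accept a longer URL that points at a specific run inside a project.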

P.S. This seems way more elegant than what we did with Neuropark (i.e. hardcoding access tokens), thanks!

JAWHARAH123 · 2021-12-30T14:21:05Z

Thank you, SaulLu, for your effort and contribution. Could you please provide me with the write access token of the machine user, so I can add the secret to the repository?


SaulLu · 2022-01-01T15:11:11Z

@JAWHARAH123, I'll be happy to share this secret information with you (both the password of the machine user and its access token)! I think that I can send you a private message on GitHub. How would you like me to reach you? (email? Discord?) 😄

The procedure to add the secret to the repository is indeed exactly the one you shared, @justheuristic!

Thanks for the information regarding step 7, @justheuristic 🤗

JAWHARAH123 · 2022-01-01T15:50:45Z

Whatever you like: you can find me in the CALM Discord channel, or you can send me an email (***@***.***).

SaulLu · 2022-01-01T23:26:19Z

Thank you very much for your answer. Unfortunately, I can't see your email address in your last message, and there are several users with the same nickname as you on Discord.

In case it helps, on Discord I am user SaulLu #0201.

SaulLu · 2022-01-02T10:49:50Z

Thank you all! The dashboard is live at this address: https://hf.co/spaces/CALM/Dashboard
