Training checklist #1

Open
20 of 25 tasks
justheuristic opened this issue Dec 21, 2021 · 9 comments

Comments

@justheuristic
Collaborator

justheuristic commented Dec 21, 2021

Start some peers

  • @razaidy starts initial CPU peer and shares her Peer ID (/ip4/something/tcp/something)
  • @justheuristic starts another CPU peer for fault-tolerance and shares his Peer ID
  • Start a few GPU workers
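
The Peer ID placeholder above has the shape `/ip4/<address>/tcp/<port>`. Before pasting a shared address into a worker's configuration, it can help to check that it is well-formed. The helper below is a minimal sketch: `parse_peer_address` is a hypothetical name, not part of any library, and it assumes the address is a flat sequence of `/protocol/value` pairs (possibly with a trailing `/p2p/<id>` component).

```python
def parse_peer_address(maddr: str) -> dict:
    """Split a multiaddr-style string into {protocol: value} pairs.

    Hypothetical helper for illustration; assumes flat /protocol/value pairs.
    """
    parts = maddr.strip().strip("/").split("/")
    if len(parts) % 2 != 0:
        raise ValueError(f"expected protocol/value pairs, got: {maddr!r}")
    fields = dict(zip(parts[0::2], parts[1::2]))
    if "ip4" not in fields or "tcp" not in fields:
        raise ValueError(f"address must contain ip4 and tcp parts: {maddr!r}")
    # Convert the port to an integer so obviously bad ports fail early.
    fields["tcp"] = int(fields["tcp"])
    return fields
```

A string that is missing the port, or that still contains the literal word "something", raises `ValueError` instead of failing later when a worker tries to connect.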

Technical features

  • Scale sequence length dynamically during training [@justheuristic]
    • add 32 every 1000 global steps up to 512
  • training with TPU
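
The sequence length rule above (add 32 every 1000 global steps, up to 512) can be sketched as a small function. The checklist does not state the starting length, so `initial_length` is a required argument here; the function name and parameter names are illustrative assumptions, not the project's actual code.

```python
def sequence_length(
    global_step: int,
    initial_length: int,  # not given in the checklist; caller must supply it
    increment: int = 32,
    steps_per_increase: int = 1000,
    max_length: int = 512,
) -> int:
    """Sequence length to use at a given global step."""
    if global_step < 0:
        raise ValueError("global_step must be non-negative")
    # One increment for every full block of `steps_per_increase` steps.
    grown = initial_length + increment * (global_step // steps_per_increase)
    return min(grown, max_length)
```

For example, with a hypothetical starting length of 128, the length stays at 128 through step 999, becomes 160 at step 1000, and is capped at 512 from step 12000 onward.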

Volunteer starter kit

  • draft notebook
  • create an invite link for volunteers
  • text explanation for running in colab
  • upload notebook to CALM/notebooks?
  • discord chatroom
  • organization page
  • instructions for training locally, with kaggle, with sagemaker
  • email to volunteers (early, ~Wednesday)

After we start

  • contributors dashboard [@SaulLu]
  • running evaluation every few hours
  • set up monitoring & support shifts

Sanity checks

  • make sure training data looks right (@razaidy @JAWHARAH123 @pr-Mais)
  • make sure loss is in a reasonable range
  • verify that volunteer starter kits work
  • make sure training with TPU does not leak memory during host<->device transfer (@justheuristic)
  • look at the data again just in case

Milestones

  • pass loss 9 by ~1000 steps (without stagnating)
  • do not blow up at peak learning rate (steps 2500-3500)
  • downstream should be better than random after step 4000
  • reach full sequence length (10000)
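
The first two milestones can be checked automatically from a logged loss curve. The sketch below assumes the history is a list of `(global_step, loss)` pairs sorted by step. "Blow up" is not defined in the checklist, so the criterion used here (a non-finite loss, or a loss above `max_ratio` times the last loss before the window) is an assumption, as are the function names and thresholds.

```python
import math


def passed_loss_target(history, target=9.0, by_step=1000):
    """True if some recorded loss at or before `by_step` is below `target`."""
    return any(step <= by_step and loss < target for step, loss in history)


def blew_up(history, start=2500, end=3500, max_ratio=1.5):
    """True if loss in [start, end] is non-finite or jumps above
    `max_ratio` times the last loss recorded before `start`.

    The definition of "blow up" is an assumption made for this sketch.
    """
    before = [loss for step, loss in history if step < start]
    window = [loss for step, loss in history if start <= step <= end]
    if any(not math.isfinite(loss) for loss in window):
        return True
    if not before:
        return False  # no baseline to compare against
    baseline = before[-1]
    return any(loss > max_ratio * baseline for loss in window)
```

Since the checklist says "~1000 steps", `by_step` is best treated as a soft bound and adjusted as needed.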
@pr-Mais
Collaborator

pr-Mais commented Dec 24, 2021

I see SageMaker on the list. Could GCE, with or without Colab, be used as well?

@justheuristic
Collaborator Author

Anything can, as long as it has a free tier :)
SageMaker has a free Studio Lab: https://studiolab.sagemaker.aws/

@SaulLu
Collaborator

SaulLu commented Dec 29, 2021

Some news about the dashboard 😃

Here is what I have done so far:

  1. Created a Streamlit Space named Dashboard in the CALM organization
  2. Copied the dashboard repo of training-transformers-together into the new GitHub repo for the CALM dashboard
  3. Changed the target HF repository in the GitHub workflow (here)
  4. Created a machine user and added it to the organization with a WRITE role (we need its token). How should I share its password with you?

To finish setting up this dashboard, we should:

  6. Add a secret named HF_TOKEN to the dashboard repository on GitHub, set to a write access token of the machine user (I don't have the necessary rights on GitHub)
  7. Add two secrets, WANDB_REPO_INDIVIDUAL_METRICS and WANDB_RUN_URL_MAIN_METRICS, to the dashboard repository on the Hub, set to the links to the W&B projects that store the data.

@justheuristic
Collaborator Author

justheuristic commented Dec 30, 2021

Awesome work!

For step #6:
@razaidy, you mentioned there is someone on your side who can admin the organization, right?

@SaulLu, can they use these instructions for adding a secret?

For step #7, I took the liberty of adding these two secrets, but I'm not entirely sure I got the format right. Since they are public knowledge, I'll copy them here as well:

  • WANDB_REPO_INDIVIDUAL_METRICS=https://wandb.ai/calm/CALM-hivemind-trainers
  • WANDB_RUN_URL_MAIN_METRICS=https://wandb.ai/calm/CALM

p.s. this seems way more elegant than what we did with Neuropark (i.e. hardcode access tokens), thanks!
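
For reference, here is a minimal sketch of how a dashboard could read the two secrets listed above without hardcoding them. It assumes the Space exposes its secrets to the app as environment variables; `load_wandb_secrets` is a hypothetical helper, and the URL prefix check is only a basic sanity test, not the exact format the dashboard code expects.

```python
import os

REQUIRED_SECRETS = (
    "WANDB_REPO_INDIVIDUAL_METRICS",
    "WANDB_RUN_URL_MAIN_METRICS",
)


def load_wandb_secrets(env=None):
    """Return the W&B links from the environment, failing loudly if absent."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        raise KeyError(f"missing secrets: {', '.join(missing)}")
    secrets = {name: env[name] for name in REQUIRED_SECRETS}
    for name, url in secrets.items():
        # Sanity check only; the real dashboard may expect a different format.
        if not url.startswith("https://wandb.ai/"):
            raise ValueError(f"{name} does not look like a W&B URL: {url!r}")
    return secrets
```

Passing a plain dictionary as `env` makes it possible to try the check locally before setting the real secrets.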

@JAWHARAH123
Collaborator

JAWHARAH123 commented Dec 30, 2021 via email

@SaulLu
Collaborator

SaulLu commented Jan 1, 2022

@JAWHARAH123, I'd be happy to share this secret information with you (the password for the machine user and its user access token)! I think I can send you a private message on GitHub. How would you like me to reach you (email? Discord?) 😄

The procedure for adding the secret to the repository is indeed exactly the one you shared, @justheuristic!

Thanks for the information regarding step 7, @justheuristic 🤗

@JAWHARAH123
Collaborator

JAWHARAH123 commented Jan 1, 2022 via email

@SaulLu
Collaborator

SaulLu commented Jan 1, 2022

Thank you very much for your answer. Unfortunately, I can't see your email address in your last message, and there are several users with the same nickname as yours on Discord.

In case it helps, on Discord I am user SaulLu #0201.

@SaulLu
Collaborator

SaulLu commented Jan 2, 2022

Thank you all! The dashboard is live at this address: https://hf.co/spaces/CALM/Dashboard


4 participants