Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CometLogger for metrics logging via comet.ml #1221

Merged
merged 30 commits into from
Aug 12, 2024

Conversation

dzheng256
Copy link
Contributor

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

My company uses torchtune for LLM finetuning but we use comet.ml for experiment tracking, not tensorboard or weights and biases. Adding a CometLogger class to make it easier for other Comet users to use torchtune with Comet.

Changelog

  • Introduce CometLogger class
  • Update docs to include CometLogger
  • Add TestCometLogger test

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
    • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Copy link

pytorch-bot bot commented Jul 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1221

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 27c90e3 with merge base 0057fe7 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Copy link

Hi @dzheng256!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@dzheng256 dzheng256 changed the title Add comet logging Add CometLogger for metrics logging via comet.ml Jul 25, 2024
@dzheng256 dzheng256 marked this pull request as ready for review July 29, 2024 22:03
@joecummings
Copy link
Contributor

@dzheng256 Would you mind signing the CLA agreement?

@dzheng256
Copy link
Contributor Author

Hey @joecummings I actually already had an email exchange with the meta CLA folks about this and I think I'm on the agreement already but it might be my corporate email ([email protected]) rather than the account I'm making this PR from ([email protected]). I've asked them to add my personal email to the corporate CLA as well but haven't heard back.

@dzheng256
Copy link
Contributor Author

It seems like the tests ran and passed, @joecummings does that mean the CLA issue is resolved?

@RdoubleA
Copy link
Contributor

@dzheng256 It seems the CLA check is still failing. Have you confirmed that your personal email was added? Are you still able to sign a second time with your personal email?

@facebook-github-bot
Copy link

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 30, 2024
@facebook-github-bot
Copy link

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@dzheng256
Copy link
Contributor Author

dzheng256 commented Jul 30, 2024

@joecummings sorry about that, I signed again with my personal email and it seems like CLA issues are resolved now.

@SalmanMohammadi
Copy link
Collaborator

Thank you so much for adding this @dzheng256. It's really cool to hear that you're using torchtune in your company.

Could you possibly attach an example comet logging output to the PR to confirm everything looks GOOD?

@SalmanMohammadi
Copy link
Collaborator

Could you also please shout out integration in the README.md here, please?

@RdoubleA
Copy link
Contributor

RdoubleA commented Aug 3, 2024

Thanks for adding this. There was also an earlier PR to add CometLogger that including a quick tutorial on how to use it. Would you be able to add that in this PR as well? #910

@dzheng256
Copy link
Contributor Author

dzheng256 commented Aug 3, 2024

Oh sorry, I hadn't realized there was an open PR implementing pretty much the same thing as this one. Should we stick with #910 then, since it came first? It also seems to be written by the a Comet dev so it might utilize the Comet API better. cc @Lothiraldan

@RdoubleA RdoubleA mentioned this pull request Aug 5, 2024
3 tasks
@RdoubleA
Copy link
Contributor

RdoubleA commented Aug 5, 2024

@dzheng256 I've pinged the other PR to see if he's still willing to work on it. If not, maybe you can incorporate the differences here?

@dzheng256
Copy link
Contributor Author

Yes sounds good, I will pull in the tutorial from the other PR if they no longer have the bandwidth to work on it.

Copy link
Contributor

@Lothiraldan Lothiraldan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far, I will give it a try in the next days to generate an example public project.

@dzheng256 Feel free to pull the tutorial from my PR https:/pytorch/torchtune/pull/910/files#diff-5c3c14a3a87e5106e63f046499c113cd5b5c0771ac548bcb5b097fc40342bd44, I don't think I have the permission to push to your fork. Once I have a new example project, I will share the links to update the tutorial

There is one open question, whether to use our new way of initializing Comet Experiments. This is a new way but it's about to be more globally released, we have it stable for a while now and we already updated several of our integrations to use it, last one was Transformers: huggingface/transformers#31366

pyproject.toml Outdated Show resolved Hide resolved
Comment on lines 371 to 376
self.experiment = Experiment(
project_name=project_name,
workspace=workspace,
log_code=log_code,
**kwargs,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are moving towards a new way of initializing Comet Experiment with a new top-level function: comet_ml.start. This has several benefits:

  • It support online and offline logging
  • It support logging a new experiment or appending to an existing Experiment
  • You can set name and tags directly at initialization time

I'm working right now on finalizing the documentation but I can share the reference doc for that new function already: https://www.comet.com/docs/v2/api-and-sdk/python-sdk/reference/Experiment-Creation/

If there is a chance that users might want to append data to an existing experiment or create log data to an offline Experiment (to upload later), I would recommend using that instead. Please let me know, happy to provide the code corresponding code change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Lothiraldan, I swapped it over to the new comet_ml.start() and verified the latest code works for me. Could you use it to make an example project which I can then link in the deep dive tutorial?

Copy link
Contributor Author

@dzheng256 dzheng256 Aug 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lothiraldan I also added you as a collaborator to my fork, feel free to directly push changes if you would like!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've sent an update and I'm working on running a new example once I get my hands on a EC2 GPU instance

torchtune/utils/metric_logging.py Outdated Show resolved Hide resolved
Co-authored-by:  Boris Feld <[email protected]>
api_key (Optional[str]): Comet API key. It's recommended to configure the API Key from the environment.
workspace (Optional[str]): Comet workspace name. If not provided, uses the default workspace.
project (Optional[str]): Comet project name. Defaults to Uncategorized.
experiment_name (Optional[str]): The name for comet experiment to be used for logging.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
experiment_name (Optional[str]): The name for comet experiment to be used for logging.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still unhappy about this @Lothiraldan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'm working on understanding why is it complaining. It only is complaining with the latest version of pydocstring.

Comment on lines 334 to 339
online (Optional[bool]): If True, the data will be logged to Comet server, otherwise it will be stored locally
in offline experiment. Default is `True`.
mode (Optional[str]): Control how the Comet experiment is started. "get": Continue logging to an existing
experiment identified by the `experiment_key` value. "create": Always creates of a new experiment, useful
for HPO sweeps. "get_or_create" (default): Starts a fresh experiment if required, or persists logging to
an existing one.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
online (Optional[bool]): If True, the data will be logged to Comet server, otherwise it will be stored locally
in offline experiment. Default is `True`.
mode (Optional[str]): Control how the Comet experiment is started. "get": Continue logging to an existing
experiment identified by the `experiment_key` value. "create": Always creates of a new experiment, useful
for HPO sweeps. "get_or_create" (default): Starts a fresh experiment if required, or persists logging to
an existing one.
mode (Optional[str]): Control how the Comet experiment is started. "get": Continue logging to an existing
experiment identified by the `experiment_key` value. "create": Always creates of a new experiment, useful
for HPO sweeps. "get_or_create" (default): Starts a fresh experiment if required, or persists logging to
an existing one.
online (Optional[bool]): If True, the data will be logged to Comet server, otherwise it will be stored locally
in offline experiment. Default is `True`.

nit: ordering

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I was wondering why the linting was failing initially.

@Lothiraldan
Copy link
Contributor

@SalmanMohammadi I got back access to the EC2 instance and I've got the example running: https://www.comet.com/examples/comet-example-torchtune-mistral/fffb3036880e41b5af2df932db4d3578?compareXAxis=step&experiment-tab=panels&showOutliers=true&smoothing=0&xAxis=step. I took a screenshot while it's running, it will probably looks better after it finishes but we can merge with the current screenshot.

I made some small changes and added back logging the configuration as a file, this way it's pretty easy to grab it from Comet and reproduce a previous run. I made several small commits, sorry about that. Let me know if you want me to squash them if you are not gonna squash the PR.

@SalmanMohammadi
Copy link
Collaborator

@SalmanMohammadi I got back access to the EC2 instance and I've got the example running: https://www.comet.com/examples/comet-example-torchtune-mistral/fffb3036880e41b5af2df932db4d3578?compareXAxis=step&experiment-tab=panels&showOutliers=true&smoothing=0&xAxis=step. I took a screenshot while it's running, it will probably looks better after it finishes but we can merge with the current screenshot.

I made some small changes and added back logging the configuration as a file, this way it's pretty easy to grab it from Comet and reproduce a previous run. I made several small commits, sorry about that. Let me know if you want me to squash them if you are not gonna squash the PR.

Amazing!! Thank you.

Feel free to send me an updated screenshot on Discord and I'm happy to add.

experiment_name (Optional[str]): Name of the experiment. If not provided, Comet will auto-generate a name.
tags (Optional[List[str]]): Tags to associate with the experiment.
log_code (bool): Whether to log the source code. Defaults to True.
**kwargs: additional arguments to pass to Comet.start. See
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might need to type this (annoyingly)

Suggested change
**kwargs: additional arguments to pass to Comet.start. See
**kwargs: additional arguments to pass to Comet.start. See

experiment_name: Optional[str] = None,
tags: Optional[List[str]] = None,
log_code: bool = True,
**kwargs,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

Copy link
Contributor

@RdoubleA RdoubleA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall. I have a lot of nit comments around RST docs rendering stuff, hope you don't mind :)

* How to use the :class:`~torchtune.utils.metric_logging.CometLogger`
* How to log configs, metrics, and model checkpoints to Comet

Torchtune supports logging your training runs to `Comet <https://www.comet.com/site/?utm_source=torchtune&utm_medium=docs&utm_content=docs>`_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we lowercase torchtune everywhere, even at the start of a sentence :)

docs/source/deep_dives/comet_logging.rst Show resolved Hide resolved

# enable logging to the built-in CometLogger
metric_logger:
_component_: torchtune.utils.metric_logging.CometLogger
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😍


.. note::

Click on this sample `Comet project to see how it will looks like after fine-tuning <https://www.comet.com/examples/comet-example-torchtune-mistral/>`_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Click on this sample `Comet project to see how it will looks like after fine-tuning <https://www.comet.com/examples/comet-example-torchtune-mistral/>`_.
Click on this sample `Comet project to see the logged metrics after fine-tuning <https://www.comet.com/examples/comet-example-torchtune-mistral/>`_.

torchtune/utils/metric_logging.py Show resolved Hide resolved
experiment_key (Optional[str]): The key for comet experiment to be used for logging. Must be an alphanumeric
string whose length is between 32 and 50 characters.
mode (Optional[str]): Control how the Comet experiment is started. "get": Continue logging to an existing
experiment identified by the `experiment_key` value. "create": Always creates of a new experiment, useful
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
experiment identified by the `experiment_key` value. "create": Always creates of a new experiment, useful
experiment identified by the ``experiment_key`` value. "create": Always creates of a new experiment, useful

for HPO sweeps. "get_or_create" (default): Starts a fresh experiment if required, or persists logging to
an existing one.
online (Optional[bool]): If True, the data will be logged to Comet server, otherwise it will be stored locally
in offline experiment. Default is `True`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
in offline experiment. Default is `True`.
in an offline experiment. Default is ``True``.

experiment_name (Optional[str]): Name of the experiment. If not provided, Comet will auto-generate a name.
tags (Optional[List[str]]): Tags to associate with the experiment.
log_code (bool): Whether to log the source code. Defaults to True.
**kwargs: additional arguments to pass to Comet.start. See
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**kwargs: additional arguments to pass to Comet.start. See
**kwargs: additional arguments to pass to ``Comet.start``. See

>>> logger.close()

Raises:
ImportError: If `comet_ml` package is not installed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ImportError: If `comet_ml` package is not installed.
ImportError: If ``comet_ml`` package is not installed.

You can install it with ``pip install comet_ml``.
You need to set up your Comet.ml API key before using this logger.
You can do this by setting the COMET_API_KEY environment variable
or by calling ``comet_ml.login()`` with your API key.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

personally I find comet login the easiest to do and is more comfortable for folks who don't know much code. What are your thoughts on just always recommending to setup via comet login so folks do not get confused with the different options?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comet login works great in interactive situations but won't help in automated setups. I'm happy to highlight the comet login option and links to other options in the Comet docs if that works for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as in #1221 (comment), if we expect users to be setting up a config and doing tune run ... how are they supposed to use comet.login()? We would have to call that inside the logger, and incorporate some API key via a secrets manager solution which would be different for every user.

As far as I can tell torchtune seems to heavily favor the CLI tune run style usage, rather than importing it and calling tune run programmatically. If this is indeed the case, I think we should actually remove the mention to comet.login() and have people export the api key, what do you guys think?

Copy link
Contributor

@RdoubleA RdoubleA Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I was under the impression you can run comet login from CLI, is this not the case?

Either way I agree with @dzheng256, since CLI is the primary entry point for most users via tune run I would lean towards suggesting whatever fits that workflow best whether that's setting environment variables or a CLI command. I suppose for automated setups environment variable is better?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RdoubleA @dzheng256 Sorry about the confusion, there is bot a comet login command-line script and comet_ml.login that one can call in their Python script. I will clean up and double-check everything to point people to use the CLI comet login.

@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 22.41379% with 45 lines in your changes missing coverage. Please review.

Project coverage is 27.13%. Comparing base (7eb89e2) to head (089adad).
Report is 37 commits behind head on main.

Files Patch % Lines
torchtune/utils/metric_logging.py 25.71% 26 Missing ⚠️
tests/torchtune/utils/test_metric_logging.py 17.39% 19 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (7eb89e2) and HEAD (089adad). Click for more details.

HEAD has 3 uploads less than BASE
Flag BASE (7eb89e2) HEAD (089adad)
5 2
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1221       +/-   ##
===========================================
- Coverage   67.81%   27.13%   -40.68%     
===========================================
  Files         219      255       +36     
  Lines        9908    11799     +1891     
===========================================
- Hits         6719     3202     -3517     
- Misses       3189     8597     +5408     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

class TestCometLogger:
def test_log(self) -> None:
with patch("comet_ml.Experiment") as mock_experiment:
logger = CometLogger(project_name="test_project")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger = CometLogger(project_name="test_project")
logger = CometLogger(project="test_project")


def test_log_dict(self) -> None:
with patch("comet_ml.Experiment") as mock_experiment:
logger = CometLogger(project_name="test_project")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger = CometLogger(project_name="test_project")
logger = CometLogger(project="test_project")


def test_log_config(self) -> None:
with patch("comet_ml.Experiment") as mock_experiment:
logger = CometLogger(project_name="test_project")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger = CometLogger(project_name="test_project")
logger = CometLogger(project="test_project")

@Lothiraldan
Copy link
Contributor

Hi @SalmanMohammadi, I've pushed fixes for the unit tests, updated the links and screenshots, tried to clarify the docstring about experiment_key argument and found the correct format for the mode argument to render as a nested bullet point.

I think that we are all good now but please let me know if I missed anything

README.md Outdated
@@ -30,6 +30,7 @@ torchtune focuses on integrating with popular tools and libraries from the ecosy
- [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html) for distributed training
- [torchao](https:/pytorch-labs/ao) for lower precision dtypes and [post-training quantization](recipes/quantize.py) techniques
- [Weights & Biases](https://wandb.ai/site) for [logging](https://pytorch.org/torchtune/main/deep_dives/wandb_logging.html) metrics and checkpoints, and tracking training progress
- [CometML](https://www.comet.com/site/) as another option for [logging](https://pytorch.org/torchtune/stable/api_ref_utilities.html#metric-logging)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- [CometML](https://www.comet.com/site/) as another option for [logging](https://pytorch.org/torchtune/stable/api_ref_utilities.html#metric-logging)
- [CometML](https://www.comet.com/site/) as another option for [logging](https://pytorch.org/torchtune/main/deep_dives/comet_logging.html)

one last minor nit, I promise

@SalmanMohammadi
Copy link
Collaborator

Thank you so much for your patience in addressing all our comments, and for your contributions @dzheng256, @Lothiraldan. This looks fantastic to me.

One last tiny nit and feel free to merge once CI is green.

@Lothiraldan
Copy link
Contributor

@SalmanMohammadi Good catch, thanks you and @RdoubleA for all of your time and reviews. I hope that my millions of small commits were not spamming your inbox too much 😅

@SalmanMohammadi SalmanMohammadi merged commit 6be89c0 into pytorch:main Aug 12, 2024
20 checks passed
@dzheng256 dzheng256 deleted the add-comet-logging branch August 12, 2024 22:05
@dzheng256
Copy link
Contributor Author

Thanks everyone! Looking forward to deleting my internal CometLogger and switching to torchtune.utils.metric_logging.CometLogger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants