When profile_memory is enabled, also export and save snapshot.pickle for lora_finetune_single_device.py #1382

Merged 2 commits on Aug 21, 2024
11 changes: 11 additions & 0 deletions recipes/lora_finetune_single_device.py

@@ -328,6 +328,11 @@ def _setup_profiler(

log.info(f" Profiler config after instantiation: {profiler_cfg}")

self.profiler_wait_steps = profiler_cfg["wait_steps"]
Contributor:

This seems to be throwing an error in our recipe tests. Can you quickly inspect profiler_cfg after running, e.g.,

pytest tests/recipes/test_lora_finetune_single_device.py -m integration_test -k 'test_loss'

I thought these fields would be defined based on this, but it seems like something weird is happening here.

Contributor Author:

Ah, it's because profiling isn't enabled for tests, so it returns here:

return DummyProfiler(), DictConfig({"enabled": False})
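One way to avoid reading the missing schedule keys when that dummy config is returned would be a small guard around the config access. A minimal sketch, assuming the early return above; the helper name read_profiler_schedule and the fallback values are illustrative only and not part of this PR:

from omegaconf import DictConfig

def read_profiler_schedule(profiler_cfg: DictConfig) -> dict:
    # Read the profiler schedule fields, falling back to inert values when
    # profiling is disabled (e.g. the DummyProfiler path used in tests),
    # since DictConfig({"enabled": False}) has none of the schedule keys.
    if not profiler_cfg.get("enabled", True):
        return {"wait_steps": 0, "warmup_steps": 0, "active_steps": 0, "profile_memory": False}
    return {
        "wait_steps": profiler_cfg["wait_steps"],
        "warmup_steps": profiler_cfg["warmup_steps"],
        "active_steps": profiler_cfg["active_steps"],
        "profile_memory": profiler_cfg["profile_memory"],
    }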

self.profiler_warmup_steps = profiler_cfg["warmup_steps"]
self.profiler_active_steps = profiler_cfg["active_steps"]
self.profiler_profile_memory = profiler_cfg["profile_memory"]

return profiler

def _setup_model(
@@ -579,6 +584,9 @@ def train(self) -> None:
):
break

Contributor:

Suggested change:

# Memory profiling
if curr_epoch == 0 and self.profiler_profile_memory and idx == (self.profiler_wait_steps + self.profiler_warmup_steps):
    torch.cuda.memory._record_memory_history()

if curr_epoch == 0 and self.profiler_profile_memory and idx == self.profiler_wait_steps + self.profiler_warmup_steps:
    torch.cuda.memory._record_memory_history()
Contributor:

The issue is that in the future we may support other backends, like Intel. It makes me think this should be a function instead of being hardcoded as "torch.cuda".

cc: @ebsmothers

Contributor:

I think it's OK; I don't know that the memory snapshot would be supported on XPU anyway. We can do a check on the device + profile_memory combo just to be safe, though.
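For illustration, a minimal sketch of what such a device check could look like; the helper maybe_record_memory_history is hypothetical and not part of this PR:

import torch
import warnings

def maybe_record_memory_history(device: torch.device, profile_memory: bool) -> bool:
    # Start CUDA memory-history recording only when it is actually supported.
    # The memory snapshot API is CUDA-specific, so skip it (with a warning)
    # on other device types such as XPU or CPU.
    if not profile_memory:
        return False
    if device.type != "cuda":
        warnings.warn(
            f"profile_memory=True but device '{device.type}' does not support "
            "torch.cuda memory snapshots; skipping memory history recording."
        )
        return False
    torch.cuda.memory._record_memory_history()
    return True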


batch = {k: v.to(self._device) for k, v in batch.items()}
num_tokens += batch["tokens"].numel()

@@ -626,6 +634,9 @@ def train(self) -> None:
num_tokens = 0
t0 = time.perf_counter()

if curr_epoch == 0 and self.profiler_profile_memory and idx == self.profiler_wait_steps + self.profiler_warmup_steps + self.profiler_active_steps:
Contributor:

Suggested change:

# Stop memory profiling
if curr_epoch == 0 and self.profiler_profile_memory and idx == self.profiler_wait_steps + self.profiler_warmup_steps + self.profiler_active_steps:

    torch.cuda.memory._record_memory_history(enabled=None)

# Step the profiler
# Note we are stepping each batch, which might not include optimizer step in the trace
# if the schedule cycle doesn't align with gradient accumulation.
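As a small standalone illustration of that note (the step counts below are made-up example values, not torchtune defaults):

# Hypothetical schedule, counted in batches: the profiler is stepped per batch,
# while the optimizer only steps every `grad_accum` batches, so the active
# window may contain no optimizer step at all.
wait_steps, warmup_steps, active_steps = 5, 3, 2
grad_accum = 4

active_batches = range(wait_steps + warmup_steps, wait_steps + warmup_steps + active_steps)
optimizer_step_batches = [i for i in active_batches if (i + 1) % grad_accum == 0]
print(list(active_batches))    # [8, 9]
print(optimizer_step_batches)  # [] -> no optimizer step lands in the traced window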
4 changes: 3 additions & 1 deletion torchtune/utils/_profiler.py
@@ -64,7 +64,7 @@ def trace_handler(
The following artifacts are exported:
- chrome / tensorboard trace - viewable through tensorboard or perfetto.dev / chrome::/tracing
- trace event table
- memory timeline if ``profile_memory``
- memory timeline and snapshot.pickle if ``profile_memory``
- stacks if ``with_stack`` (note that ``profile_memory`` requires ``with_stack`` to be ``True``),
viewable as a flamegraph see (https://pytorch.org/docs/stable/profiler.html#torch.profiler._KinetoProfile.export_stacks).

@@ -115,6 +115,8 @@ def trace_handler(
    except Exception as e:
        log.warn(f" Failed to export memory timeline: {e}")

    torch.cuda.memory._dump_snapshot(f"{curr_trace_dir}/rank{rank}_memory_snapshot.pickle")

# Dump stack traces
if prof.with_stack:
prof.export_stacks(f"{curr_trace_dir}/rank{rank}_stacks.txt", metric=metric)
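As a usage note (not part of this diff): the dumped pickle can be inspected by dragging it into PyTorch's memory visualizer at https://pytorch.org/memory_viz, or loaded directly. A minimal sketch; the path below is just an example, not the exact directory layout torchtune produces:

import pickle

# Example path; substitute the actual trace directory and rank from your run.
snapshot_path = "profiler_output/rank0_memory_snapshot.pickle"

with open(snapshot_path, "rb") as f:
    snapshot = pickle.load(f)

# The snapshot is a plain dict; "segments" describes allocator segments/blocks
# and "device_traces" holds the recorded alloc/free events per device.
print(snapshot.keys())
print(len(snapshot["segments"]), "segments recorded")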