Support ASFormer #2692

Open: wants to merge 18 commits into base branch `dev-1.x`
12 changes: 12 additions & 0 deletions configs/_base_/models/asformer.py
# model settings
model = dict(
    type='ASFormer',
    num_layers=10,
    num_f_maps=64,
    input_dim=2048,
    num_decoders=3,
    num_classes=11,
    channel_masking_rate=0.5,
    sample_rate=1,
    r1=2,
    r2=2)
127 changes: 127 additions & 0 deletions configs/segmentation/asformer/README.md
# ASFormer

[ASFormer: Transformer for Action Segmentation](https://arxiv.org/pdf/2110.08568.pdf)

<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

Algorithms for the action segmentation task typically use temporal models to predict what action is occurring at each frame of a minute-long daily activity. Recent studies have shown the potential of Transformers for modeling the relations among elements in sequential data. However, several major concerns arise when directly applying the Transformer to the action segmentation task: the lack of inductive biases with small training sets, the deficit in processing long input sequences, and the limitation of the decoder architecture in utilizing temporal relations among multiple action segments to refine the initial predictions. To address these concerns, we design an efficient Transformer-based model for the action segmentation task, named ASFormer, with three distinctive characteristics: (i) We explicitly bring in the local connectivity inductive priors because of the high locality of features. This constrains the hypothesis space within a reliable scope and helps the model learn a proper target function from small training sets. (ii) We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences. (iii) We carefully design the decoder to refine the initial predictions from the encoder. Extensive experiments on three public datasets demonstrate the effectiveness of our methods.

<!-- [IMAGE] -->

<div align=center>
<img src="https://github.com/open-mmlab/mmaction2/assets/35267818/ea2af27e-0cd9-489d-9c81-02b8a7f29ef1" width="800"/>
</div>

## Results

### GTEA

| split | gpus | pretrain | ACC | EDIT | F1@10 | F1@25 | F1@50 | gpu_mem(M) | config | ckpt | log |
| :----: | :--: | :------: | :---: | :---: | :---: | :---: | :---: | :--------: | :------------------------------------------------: | :-----------------------------------------------: | :----------------------------------------------: |
| split2 | 1 | None | 80.34 | 81.58 | 89.30 | 87.83 | 75.28 | 1500 | [config](/configs/segmentation/asformer/asformer_1xb1-120e_gtea-split2-i3d-feature.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_gtea-split2-i3d-feature_20231011-b5aaf789.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_gtea-split2-i3d-feature.log) |
| split1 | 1 | None | 76.54 | 80.36 | 84.80 | 83.39 | 77.74 | 1500 | - | - | - |
| split3 | 1 | None | 82.41 | 90.03 | 92.13 | 92.37 | 86.26 | 1500 | - | - | - |
| split4 | 1 | None | 79.77 | 91.70 | 92.88 | 92.39 | 81.65 | 1500 | - | - | - |

### 50Salads

| split | gpus | pretrain | ACC | EDIT | F1@10 | F1@25 | F1@50 | gpu_mem(M) | config | ckpt | log |
| :----: | :--: | :------: | :---: | :---: | :---: | :---: | :---: | :--------: | :------------------------------------------------: | :-----------------------------------------------: | :----------------------------------------------: |
| split2 | 1 | None | 87.55 | 79.10 | 85.17 | 83.73 | 77.99 | 7200 | [config](/configs/segmentation/asformer/asformer_1xb1-120e_50salads-split2-i3d-feature.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_50salads-split2-i3d-feature_20231011-25dc57d5.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_50salads-split2-i3d-feature.log) |
| split1 | 1 | None | 81.44 | 73.25 | 82.04 | 80.27 | 71.84 | 7200 | - | - | - |
| split3 | 1 | None | 85.51 | 82.23 | 85.71 | 84.29 | 78.57 | 7200 | - | - | - |
| split4 | 1 | None | 87.27 | 80.46 | 85.99 | 83.14 | 78.86 | 7200 | - | - | - |
| split5 | 1 | None | 87.96 | 75.29 | 84.60 | 83.13 | 76.28 | 7200 | - | - | - |

### Breakfast

| split | gpus | pretrain | ACC | EDIT | F1@10 | F1@25 | F1@50 | gpu_mem(M) | config | ckpt | log |
| :----: | :--: | :------: | :---: | :---: | :---: | :---: | :---: | :--------: | :------------------------------------------------: | :-----------------------------------------------: | :----------------------------------------------: |
| split2 | 1 | None | 74.12 | 76.53 | 77.74 | 72.62 | 60.43 | 8800 | [config](/configs/segmentation/asformer/asformer_1xb1-120e_breakfast-split2-i3d-feature.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_breakfast-split2-i3d-feature_20231011-10e557f3.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_breakfast-split2-i3d-feature.log) |
| split1 | 1 | None | 75.52 | 76.87 | 77.06 | 73.05 | 61.77 | 8800 | - | - | - |
| split3 | 1 | None | 74.86 | 74.33 | 76.17 | 70.85 | 58.07 | 8800 | - | - | - |
| split4 | 1 | None | 70.39 | 71.54 | 73.42 | 66.61 | 52.76 | 8800 | - | - | - |
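The **EDIT** and **F1@{10,25,50}** columns in the tables above are the standard segmental metrics for action segmentation: F1@k counts a predicted segment as a true positive when its temporal IoU with a same-label ground-truth segment exceeds k%. As a rough illustration only (this is a simplified sketch, not MMAction2's `SegmentMetric` implementation), segmental F1 over framewise label sequences can be computed like this:

```python
def get_segments(labels):
    """Collapse a framewise label sequence into (label, start, end) segments."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def f1_at_k(gt, pred, overlap=0.1):
    """Segmental F1: a predicted segment is a true positive when its IoU
    with an unmatched, same-label ground-truth segment exceeds `overlap`."""
    gt_segs = get_segments(gt)
    pred_segs = get_segments(pred)
    used = [False] * len(gt_segs)
    tp = 0
    for label, s, e in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(gt_segs):
            if gl != label or used[j]:
                continue
            inter = max(0, min(e, ge) - max(s, gs))
            union = max(e, ge) - min(s, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou > overlap:
            tp += 1
            used[best_j] = True  # each GT segment can match at most once
    fp = len(pred_segs) - tp
    fn = len(gt_segs) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the metric works on segments rather than frames, it penalizes over-segmentation much more strongly than framewise accuracy (ACC) does.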

1. The **gpus** column indicates the number of GPUs used to obtain the checkpoint.

2. We report results trained on every split, but only provide checkpoints for one split. For experiments with other splits, simply change the paths to the training and testing datasets in the config file, i.e., modify `ann_file_train`, `ann_file_val` and `ann_file_test`.
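For example, a config for another split can inherit the released split2 config and override only the annotation bundles. The following is a hypothetical sketch (the PR ships only the split2 config; the filename and `work_dir` below are assumptions that follow the same naming scheme):

```python
# Hypothetical config for GTEA split1, inheriting the released split2 config.
# All model and schedule settings are reused via MMEngine's `_base_`
# inheritance; only the annotation files and work_dir change.
_base_ = ['./asformer_1xb1-120e_gtea-split2-i3d-feature.py']

ann_file_train = 'data/action_seg/gtea/splits/train.split1.bundle'
ann_file_val = 'data/action_seg/gtea/splits/test.split1.bundle'
ann_file_test = 'data/action_seg/gtea/splits/test.split1.bundle'

# Top-level variables do not propagate automatically, so the dataloaders
# must be overridden as well; MMEngine merges these dicts into the base.
train_dataloader = dict(dataset=dict(ann_file=ann_file_train))
val_dataloader = dict(dataset=dict(ann_file=ann_file_val))
test_dataloader = dict(dataset=dict(ann_file=ann_file_test))

work_dir = './work_dirs/gtea1/'
```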

For more details on data preparation, you can refer to [Preparing Datasets for Action Segmentation](/tools/data/action_seg/README.md).

## Train

Train the ASFormer model on a feature dataset for action segmentation:

```shell
bash tools/dist_train.sh configs/segmentation/asformer/asformer_1xb1-120e_gtea-split2-i3d-feature.py 1
```
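To train every split in sequence, the command can be wrapped in a loop. This sketch only prints the commands (a dry run), and assumes one config file per split; the PR itself ships only the split2 config, so the other filenames are hypothetical and follow the same naming scheme:

```shell
# Dry run: print one training command per GTEA split.
# NOTE: only the split2 config is included in this PR; the other
# config filenames are assumptions following the same naming scheme.
for split in 1 2 3 4; do
  cfg="configs/segmentation/asformer/asformer_1xb1-120e_gtea-split${split}-i3d-feature.py"
  echo "bash tools/dist_train.sh ${cfg} 1"
done
```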

For more details, you can refer to the **Training** part in the [Training and Test Tutorial](/docs/en/user_guides/train_test.md).

## Test

Test the ASFormer model on a feature dataset for action segmentation:

```shell
python3 tools/test.py configs/segmentation/asformer/asformer_1xb1-120e_gtea-split2-i3d-feature.py CHECKPOINT.PTH
```

For more details, you can refer to the **Testing** part in the [Training and Test Tutorial](/docs/en/user_guides/train_test.md).

## Citation

```BibTeX
@inproceedings{chinayi_ASformer,
  author={Fangqiu Yi and Hongyu Wen and Tingting Jiang},
  booktitle={The British Machine Vision Conference (BMVC)},
  title={ASFormer: Transformer for Action Segmentation},
  year={2021},
}
```

<!-- [DATASET] -->

```BibTeX
@inproceedings{fathi2011learning,
  title={Learning to recognize objects in egocentric activities},
  author={Fathi, Alireza and Ren, Xiaofeng and Rehg, James M},
  booktitle={CVPR 2011},
  pages={3281--3288},
  year={2011},
  organization={IEEE}
}
```

```BibTeX
@inproceedings{stein2013combining,
  title={Combining embedded accelerometers with computer vision for recognizing food preparation activities},
  author={Stein, Sebastian and McKenna, Stephen J},
  booktitle={Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing},
  pages={729--738},
  year={2013}
}
```

```BibTeX
@inproceedings{kuehne2014language,
  title={The language of actions: Recovering the syntax and semantics of goal-directed human activities},
  author={Kuehne, Hilde and Arslan, Ali and Serre, Thomas},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={780--787},
  year={2014}
}
```
configs/segmentation/asformer/asformer_1xb1-120e_50salads-split2-i3d-feature.py
_base_ = [
    '../../_base_/models/asformer.py', '../../_base_/default_runtime.py'
]

# dataset settings
dataset_type = 'ActionSegmentDataset'
data_root = 'data/action_seg/50salads/'
data_root_val = 'data/action_seg/50salads/'
ann_file_train = 'data/action_seg/50salads/splits/train.split2.bundle'
ann_file_val = 'data/action_seg/50salads/splits/test.split2.bundle'
ann_file_test = 'data/action_seg/50salads/splits/test.split2.bundle'

model = dict(
    type='ASFormer',
    num_layers=10,
    num_f_maps=64,
    input_dim=2048,
    num_decoders=3,
    num_classes=19,
    channel_masking_rate=0.3,
    sample_rate=2,
    r1=2,
    r2=2)

train_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=(
            'num_classes',
            'actions_dict',
            'index2label',
            'ground_truth',
            'classes',
        ))
]

val_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=('num_classes', 'actions_dict', 'index2label',
                   'ground_truth', 'classes'))
]

test_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=('num_classes', 'actions_dict', 'index2label',
                   'ground_truth', 'classes'))
]

train_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    drop_last=True,
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=dict(video=data_root),
        pipeline=train_pipeline))

val_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=dict(video=data_root_val),
        pipeline=val_pipeline,
        test_mode=True))

test_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(video=data_root_val),
        pipeline=test_pipeline,
        test_mode=True))

max_epochs = 120
train_cfg = dict(
    type='EpochBasedTrainLoop',
    max_epochs=max_epochs,
    val_begin=0,
    val_interval=10)

val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

optim_wrapper = dict(optimizer=dict(type='Adam', lr=0.0005, weight_decay=1e-5))
param_scheduler = [
    dict(
        type='MultiStepLR',
        begin=0,
        end=max_epochs,
        by_epoch=True,
        milestones=[80, 100],
        gamma=0.5)
]

work_dir = './work_dirs/50salads2/'
test_evaluator = dict(
    type='SegmentMetric',
    metric_type='ALL',
    dump_config=dict(out=f'{work_dir}/results.json', output_format='json'))
val_evaluator = test_evaluator
default_hooks = dict(checkpoint=dict(interval=10, max_keep_ckpts=3))
configs/segmentation/asformer/asformer_1xb1-120e_breakfast-split2-i3d-feature.py
_base_ = [
    '../../_base_/models/asformer.py', '../../_base_/default_runtime.py'
]

# dataset settings
dataset_type = 'ActionSegmentDataset'
data_root = 'data/action_seg/breakfast/'
data_root_val = 'data/action_seg/breakfast/'
ann_file_train = 'data/action_seg/breakfast/splits/train.split2.bundle'
ann_file_val = 'data/action_seg/breakfast/splits/test.split2.bundle'
ann_file_test = 'data/action_seg/breakfast/splits/test.split2.bundle'

model = dict(
    type='ASFormer',
    channel_masking_rate=0.3,
    input_dim=2048,
    num_classes=48,
    num_decoders=3,
    num_f_maps=64,
    num_layers=10,
    r1=2,
    r2=2,
    sample_rate=1)

train_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=(
            'num_classes',
            'actions_dict',
            'index2label',
            'ground_truth',
            'classes',
        ))
]

val_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=('num_classes', 'actions_dict', 'index2label',
                   'ground_truth', 'classes'))
]

test_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=('num_classes', 'actions_dict', 'index2label',
                   'ground_truth', 'classes'))
]

train_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    drop_last=True,
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=dict(video=data_root),
        pipeline=train_pipeline))

val_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=dict(video=data_root_val),
        pipeline=val_pipeline,
        test_mode=True))

test_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(video=data_root_val),
        pipeline=test_pipeline,
        test_mode=True))

max_epochs = 120
train_cfg = dict(
    type='EpochBasedTrainLoop',
    max_epochs=max_epochs,
    val_begin=0,
    val_interval=5)

val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

optim_wrapper = dict(optimizer=dict(type='Adam', lr=0.0005, weight_decay=1e-5))
param_scheduler = [
    dict(
        type='MultiStepLR',
        begin=0,
        end=max_epochs,
        by_epoch=True,
        milestones=[80, 100],
        gamma=0.5)
]

work_dir = './work_dirs/breakfast2/'
test_evaluator = dict(
    type='SegmentMetric',
    metric_type='ALL',
    dump_config=dict(out=f'{work_dir}/results.json', output_format='json'))
val_evaluator = test_evaluator
default_hooks = dict(checkpoint=dict(interval=5, max_keep_ckpts=3))