Support ASFormer #2692

Open: wants to merge 18 commits into base branch `dev-1.x`
12 changes: 12 additions & 0 deletions configs/_base_/models/asformer.py
# model settings
model = dict(
    type='ASFormer',
    num_layers=10,
    num_f_maps=64,
    input_dim=2048,
    num_decoders=3,
    num_classes=11,
    channel_masking_rate=0.5,
    sample_rate=1,
    r1=2,
    r2=2)
127 changes: 127 additions & 0 deletions configs/segmentation/asformer/README.md
# ASFormer

[ASFormer: Transformer for Action Segmentation](https://arxiv.org/pdf/2110.08568.pdf)

<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

Algorithms for the action segmentation task typically use temporal models to predict what action is occurring at each frame of a minute-long daily activity. Recent studies have shown the potential of Transformers for modeling the relations among elements in sequential data. However, several major concerns arise when directly applying the Transformer to the action segmentation task: the lack of inductive biases with small training sets, the deficit in processing long input sequences, and the limitation of the decoder architecture in utilizing temporal relations among multiple action segments to refine the initial predictions. To address these concerns, we design an efficient Transformer-based model for the action segmentation task, named ASFormer, with three distinctive characteristics: (i) We explicitly bring in the local connectivity inductive priors because of the high locality of features. This constrains the hypothesis space within a reliable scope and helps the model learn a proper target function from small training sets. (ii) We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences. (iii) We carefully design the decoder to refine the initial predictions from the encoder. Extensive experiments on three public datasets demonstrate the effectiveness of our methods.

<!-- [IMAGE] -->

<div align=center>
<img src="https://github.com/open-mmlab/mmaction2/assets/35267818/ea2af27e-0cd9-489d-9c81-02b8a7f29ef1" width="800"/>
</div>

## Results

### GTEA

| split | gpus | pretrain | ACC | EDIT | F1@10 | F1@25 | F1@50 | gpu_mem(M) | config | ckpt | log |
| :----: | :--: | :------: | :---: | :---: | :---: | :---: | :---: | :--------: | :------------------------------------------------: | :-----------------------------------------------: | :----------------------------------------------: |
| split2 | 1 | None | 80.34 | 81.58 | 89.30 | 87.83 | 75.28 | 1500 | [config](/configs/segmentation/asformer/asformer_1xb1-120e_gtea-split2-i3d-feature.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_gtea-split2-i3d-feature_20231011-b5aaf789.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_gtea-split2-i3d-feature.log) |
| split1 | 1 | None | 76.54 | 80.36 | 84.80 | 83.39 | 77.74 | 1500 | - | - | - |
| split3 | 1 | None | 82.41 | 90.03 | 92.13 | 92.37 | 86.26 | 1500 | - | - | - |
| split4 | 1 | None | 79.77 | 91.70 | 92.88 | 92.39 | 81.65 | 1500 | - | - | - |

### 50Salads

| split | gpus | pretrain | ACC | EDIT | F1@10 | F1@25 | F1@50 | gpu_mem(M) | config | ckpt | log |
| :----: | :--: | :------: | :---: | :---: | :---: | :---: | :---: | :--------: | :------------------------------------------------: | :-----------------------------------------------: | :----------------------------------------------: |
| split2 | 1 | None | 87.55 | 79.10 | 85.17 | 83.73 | 77.99 | 7200 | [config](/configs/segmentation/asformer/asformer_1xb1-120e_50salads-split2-i3d-feature.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_50salads-split2-i3d-feature_20231011-25dc57d5.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_50salads-split2-i3d-feature.log) |
| split1 | 1 | None | 81.44 | 73.25 | 82.04 | 80.27 | 71.84 | 7200 | - | - | - |
| split3 | 1 | None | 85.51 | 82.23 | 85.71 | 84.29 | 78.57 | 7200 | - | - | - |
| split4 | 1 | None | 87.27 | 80.46 | 85.99 | 83.14 | 78.86 | 7200 | - | - | - |
| split5 | 1 | None | 87.96 | 75.29 | 84.60 | 83.13 | 76.28 | 7200 | - | - | - |

### Breakfast

| split | gpus | pretrain | ACC | EDIT | F1@10 | F1@25 | F1@50 | gpu_mem(M) | config | ckpt | log |
| :----: | :--: | :------: | :---: | :---: | :---: | :---: | :---: | :--------: | :------------------------------------------------: | :-----------------------------------------------: | :----------------------------------------------: |
| split2 | 1 | None | 74.12 | 76.53 | 77.74 | 72.62 | 60.43 | 8800 | [config](/configs/segmentation/asformer/asformer_1xb1-120e_breakfast-split2-i3d-feature.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_breakfast-split2-i3d-feature_20231011-10e557f3.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/segmentation/asformer/asformer_1xb1-120e_breakfast-split2-i3d-feature.log) |
| split1 | 1 | None | 75.52 | 76.87 | 77.06 | 73.05 | 61.77 | 8800 | - | - | - |
| split3 | 1 | None | 74.86 | 74.33 | 76.17 | 70.85 | 58.07 | 8800 | - | - | - |
| split4 | 1 | None | 70.39 | 71.54 | 73.42 | 66.61 | 52.76 | 8800 | - | - | - |
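The **EDIT** and **F1@{10,25,50}** columns in the tables above are the standard segmental metrics for action segmentation: F1@k counts a predicted segment as a true positive when its temporal IoU with a same-label ground-truth segment exceeds k%. As a rough illustration only (this is a simplified sketch, not MMAction2's `SegmentMetric` implementation), segmental F1 over framewise label sequences can be computed like this:

```python
def get_segments(labels):
    """Collapse a framewise label sequence into (label, start, end) segments."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start, i))
            start = i
    return segments


def f1_at_k(gt, pred, overlap=0.1):
    """Segmental F1: a predicted segment is a true positive when its IoU
    with an unmatched, same-label ground-truth segment exceeds `overlap`."""
    gt_segs = get_segments(gt)
    pred_segs = get_segments(pred)
    used = [False] * len(gt_segs)
    tp = 0
    for label, s, e in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(gt_segs):
            if gl != label or used[j]:
                continue
            inter = max(0, min(e, ge) - max(s, gs))
            union = max(e, ge) - min(s, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou > overlap:
            tp += 1
            used[best_j] = True  # each GT segment can match at most once
    fp = len(pred_segs) - tp
    fn = len(gt_segs) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the metric works on segments rather than frames, it penalizes over-segmentation much more strongly than framewise accuracy (ACC) does.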

1. The **gpus** column indicates the number of GPUs used to obtain the checkpoint.

2. We report results trained on every split, but only provide checkpoints for one split. For experiments with other splits, simply change the paths to the training and testing datasets in the config file, i.e., modify `ann_file_train`, `ann_file_val` and `ann_file_test`.
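For example, a config for another split can inherit the released split2 config and override only the annotation bundles. The following is a hypothetical sketch (the PR ships only the split2 config; the filename and `work_dir` below are assumptions that follow the same naming scheme):

```python
# Hypothetical config for GTEA split1, inheriting the released split2 config.
# All model and schedule settings are reused via MMEngine's `_base_`
# inheritance; only the annotation files and work_dir change.
_base_ = ['./asformer_1xb1-120e_gtea-split2-i3d-feature.py']

ann_file_train = 'data/action_seg/gtea/splits/train.split1.bundle'
ann_file_val = 'data/action_seg/gtea/splits/test.split1.bundle'
ann_file_test = 'data/action_seg/gtea/splits/test.split1.bundle'

# Top-level variables do not propagate automatically, so the dataloaders
# must be overridden as well; MMEngine merges these dicts into the base.
train_dataloader = dict(dataset=dict(ann_file=ann_file_train))
val_dataloader = dict(dataset=dict(ann_file=ann_file_val))
test_dataloader = dict(dataset=dict(ann_file=ann_file_test))

work_dir = './work_dirs/gtea1/'
```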

For more details on data preparation, you can refer to [Preparing Datasets for Action Segmentation](/tools/data/action_seg/README.md).

## Train

Train the ASFormer model on a feature dataset for action segmentation:

```shell
bash tools/dist_train.sh configs/segmentation/asformer/asformer_1xb1-120e_gtea-split2-i3d-feature.py 1
```
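To train every split in sequence, the command can be wrapped in a loop. This sketch only prints the commands (a dry run), and assumes one config file per split; the PR itself ships only the split2 config, so the other filenames are hypothetical and follow the same naming scheme:

```shell
# Dry run: print one training command per GTEA split.
# NOTE: only the split2 config is included in this PR; the other
# config filenames are assumptions following the same naming scheme.
for split in 1 2 3 4; do
  cfg="configs/segmentation/asformer/asformer_1xb1-120e_gtea-split${split}-i3d-feature.py"
  echo "bash tools/dist_train.sh ${cfg} 1"
done
```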

For more details, you can refer to the **Training** part in the [Training and Test Tutorial](/docs/en/user_guides/train_test.md).

## Test

Test the ASFormer model on a feature dataset for action segmentation:

```shell
python3 tools/test.py configs/segmentation/asformer/asformer_1xb1-120e_gtea-split2-i3d-feature.py CHECKPOINT.PTH
```

For more details, you can refer to the **Testing** part in the [Training and Test Tutorial](/docs/en/user_guides/train_test.md).

## Citation

```BibTeX
@inproceedings{chinayi_ASformer,
  author={Fangqiu Yi and Hongyu Wen and Tingting Jiang},
  booktitle={The British Machine Vision Conference (BMVC)},
  title={ASFormer: Transformer for Action Segmentation},
  year={2021},
}
```

<!-- [DATASET] -->

```BibTeX
@inproceedings{fathi2011learning,
  title={Learning to recognize objects in egocentric activities},
  author={Fathi, Alireza and Ren, Xiaofeng and Rehg, James M},
  booktitle={CVPR 2011},
  pages={3281--3288},
  year={2011},
  organization={IEEE}
}
```

```BibTeX
@inproceedings{stein2013combining,
  title={Combining embedded accelerometers with computer vision for recognizing food preparation activities},
  author={Stein, Sebastian and McKenna, Stephen J},
  booktitle={Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing},
  pages={729--738},
  year={2013}
}
```

```BibTeX
@inproceedings{kuehne2014language,
  title={The language of actions: Recovering the syntax and semantics of goal-directed human activities},
  author={Kuehne, Hilde and Arslan, Ali and Serre, Thomas},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={780--787},
  year={2014}
}
```
configs/segmentation/asformer/asformer_1xb1-120e_50salads-split2-i3d-feature.py
_base_ = [
    '../../_base_/models/asformer.py', '../../_base_/default_runtime.py'
]

# dataset settings
dataset_type = 'ActionSegmentDataset'
data_root = 'data/action_seg/50salads/'
data_root_val = 'data/action_seg/50salads/'
ann_file_train = 'data/action_seg/50salads/splits/train.split2.bundle'
ann_file_val = 'data/action_seg/50salads/splits/test.split2.bundle'
ann_file_test = 'data/action_seg/50salads/splits/test.split2.bundle'

model = dict(
    type='ASFormer',
    num_layers=10,
    num_f_maps=64,
    input_dim=2048,
    num_decoders=3,
    num_classes=19,
    channel_masking_rate=0.3,
    sample_rate=2,
    r1=2,
    r2=2)

train_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=(
            'num_classes',
            'actions_dict',
            'index2label',
            'ground_truth',
            'classes',
        ))
]

val_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=('num_classes', 'actions_dict', 'index2label',
                   'ground_truth', 'classes'))
]

test_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=('num_classes', 'actions_dict', 'index2label',
                   'ground_truth', 'classes'))
]

train_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    drop_last=True,
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=dict(video=data_root),
        pipeline=train_pipeline))

val_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=dict(video=data_root_val),
        pipeline=val_pipeline,
        test_mode=True))

test_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(video=data_root_val),
        pipeline=test_pipeline,
        test_mode=True))

max_epochs = 120
train_cfg = dict(
    type='EpochBasedTrainLoop',
    max_epochs=max_epochs,
    val_begin=0,
    val_interval=10)

val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

optim_wrapper = dict(optimizer=dict(type='Adam', lr=0.0005, weight_decay=1e-5))
param_scheduler = [
    dict(
        type='MultiStepLR',
        begin=0,
        end=max_epochs,
        by_epoch=True,
        milestones=[80, 100],
        gamma=0.5)
]

work_dir = './work_dirs/50salads2/'
test_evaluator = dict(
    type='SegmentMetric',
    metric_type='ALL',
    dump_config=dict(out=f'{work_dir}/results.json', output_format='json'))
val_evaluator = test_evaluator
default_hooks = dict(checkpoint=dict(interval=10, max_keep_ckpts=3))
configs/segmentation/asformer/asformer_1xb1-120e_breakfast-split2-i3d-feature.py
_base_ = [
    '../../_base_/models/asformer.py', '../../_base_/default_runtime.py'
]

# dataset settings
dataset_type = 'ActionSegmentDataset'
data_root = 'data/action_seg/breakfast/'
data_root_val = 'data/action_seg/breakfast/'
ann_file_train = 'data/action_seg/breakfast/splits/train.split2.bundle'
ann_file_val = 'data/action_seg/breakfast/splits/test.split2.bundle'
ann_file_test = 'data/action_seg/breakfast/splits/test.split2.bundle'

model = dict(
    type='ASFormer',
    channel_masking_rate=0.3,
    input_dim=2048,
    num_classes=48,
    num_decoders=3,
    num_f_maps=64,
    num_layers=10,
    r1=2,
    r2=2,
    sample_rate=1)

train_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=(
            'num_classes',
            'actions_dict',
            'index2label',
            'ground_truth',
            'classes',
        ))
]

val_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=('num_classes', 'actions_dict', 'index2label',
                   'ground_truth', 'classes'))
]

test_pipeline = [
    dict(type='LoadSegmentationFeature'),
    dict(
        type='PackSegmentationInputs',
        keys=('classes', ),
        meta_keys=('num_classes', 'actions_dict', 'index2label',
                   'ground_truth', 'classes'))
]

train_dataloader = dict(
    batch_size=1,
    num_workers=1,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    drop_last=True,
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=dict(video=data_root),
        pipeline=train_pipeline))

val_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=dict(video=data_root_val),
        pipeline=val_pipeline,
        test_mode=True))

test_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(video=data_root_val),
        pipeline=test_pipeline,
        test_mode=True))

max_epochs = 120
train_cfg = dict(
    type='EpochBasedTrainLoop',
    max_epochs=max_epochs,
    val_begin=0,
    val_interval=5)

val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

optim_wrapper = dict(optimizer=dict(type='Adam', lr=0.0005, weight_decay=1e-5))
param_scheduler = [
    dict(
        type='MultiStepLR',
        begin=0,
        end=max_epochs,
        by_epoch=True,
        milestones=[80, 100],
        gamma=0.5)
]

work_dir = './work_dirs/breakfast2/'
test_evaluator = dict(
    type='SegmentMetric',
    metric_type='ALL',
    dump_config=dict(out=f'{work_dir}/results.json', output_format='json'))
val_evaluator = test_evaluator
default_hooks = dict(checkpoint=dict(interval=5, max_keep_ckpts=3))