[Bug] Reproducing RTMPose on the COCO dataset: acc_pose has been hovering between 0.4 and 0.5 after 120 epochs of training. How should I solve this? #3136

goalinshi · 2024-10-13T02:12:22Z

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
The bug has not been fixed in the latest version (https:/open-mmlab/mmpose).

Environment

OrderedDict([('sys.platform', 'linux'), ('Python', '3.8.20 (default, Oct 3 2024, 15:24:27) [GCC 11.2.0]'), ('CUDA available', True), ('MUSA available', False), ('numpy_random_seed', 2147483648), ('GPU 0', 'NVIDIA GeForce RTX 3090'), ('CUDA_HOME', '/usr/local/cuda-11.8'), ('NVCC', 'Cuda compilation tools, release 11.8, V11.8.89'), ('GCC', 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0'), ('PyTorch', '2.0.1+cu117'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 9.3\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.7\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n - CuDNN 8.5\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic 
-Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.15.2+cu117'), ('OpenCV', '4.10.0'), ('MMEngine', '0.10.5'), ('MMPose', '1.1.0+')])

Reproduces the problem - code sample

0.210927 loss_kpt: 0.210927 acc_pose: 0.470607
10/13 10:00:56 - mmengine - INFO - Epoch(train) [105][4200/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:06 time: 0.126481 data_time: 0.023456 memory: 3826 loss: 0.207607 loss_kpt: 0.207607 acc_pose: 0.455742
10/13 10:01:03 - mmengine - INFO - Epoch(train) [105][4250/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:00 time: 0.122026 data_time: 0.019312 memory: 3826 loss: 0.207197 loss_kpt: 0.207197 acc_pose: 0.522144
10/13 10:01:07 - mmengine - INFO - Exp name: rtmpose-l_8xb256-420e_coco-256x192_20241012_164653
10/13 10:01:09 - mmengine - INFO - Epoch(train) [105][4300/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:54 time: 0.121489 data_time: 0.018492 memory: 3826 loss: 0.210692 loss_kpt: 0.210692 acc_pose: 0.520829
10/13 10:01:15 - mmengine - INFO - Epoch(train) [105][4350/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:49 time: 0.130391 data_time: 0.027746 memory: 3826 loss: 0.207093 loss_kpt: 0.207093 acc_pose: 0.510383
10/13 10:01:21 - mmengine - INFO - Epoch(train) [105][4400/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:43 time: 0.121266 data_time: 0.018557 memory: 3826 loss: 0.208687 loss_kpt: 0.208687 acc_pose: 0.571073
10/13 10:01:27 - mmengine - INFO - Epoch(train) [105][4450/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:30:37 time: 0.120966 data_time: 0.018265 memory: 3826 loss: 0.207345 loss_kpt: 0.207345 acc_pose: 0.523733
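Individual `acc_pose` values in the log fluctuate from one logging step to the next, so a per-epoch average is easier to judge than single lines. The following is a minimal sketch, assuming the log keeps the line layout shown above. The names `parse_train_log` and `mean_acc_by_epoch` are hypothetical helpers and not part of MMEngine or MMPose.

```python
import re
from statistics import mean

# Matches lines such as:
#   "... Epoch(train) [105][4200/4853] ... loss: 0.207607 ... acc_pose: 0.455742"
LINE_RE = re.compile(
    r"Epoch\(train\)\s+\[(?P<epoch>\d+)\]\[(?P<it>\d+)/(?P<total>\d+)\]"
    r".*?\bloss: (?P<loss>[\d.]+).*?\bacc_pose: (?P<acc>[\d.]+)"
)


def parse_train_log(lines):
    """Return (epoch, iteration, loss, acc_pose) tuples for matching lines."""
    records = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:  # lines such as "Exp name: ..." are skipped
            records.append(
                (int(m["epoch"]), int(m["it"]), float(m["loss"]), float(m["acc"]))
            )
    return records


def mean_acc_by_epoch(records):
    """Average acc_pose over all logged steps of each epoch."""
    by_epoch = {}
    for epoch, _, _, acc in records:
        by_epoch.setdefault(epoch, []).append(acc)
    return {epoch: mean(values) for epoch, values in sorted(by_epoch.items())}


# Three lines copied from the log above (the middle one carries no metrics).
SAMPLE = [
    "10/13 10:00:56 - mmengine - INFO - Epoch(train) [105][4200/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:06 time: 0.126481 data_time: 0.023456 memory: 3826 loss: 0.207607 loss_kpt: 0.207607 acc_pose: 0.455742",
    "10/13 10:01:07 - mmengine - INFO - Exp name: rtmpose-l_8xb256-420e_coco-256x192_20241012_164653",
    "10/13 10:01:03 - mmengine - INFO - Epoch(train) [105][4250/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:00 time: 0.122026 data_time: 0.019312 memory: 3826 loss: 0.207197 loss_kpt: 0.207197 acc_pose: 0.522144",
]

records = parse_train_log(SAMPLE)
epoch_means = mean_acc_by_epoch(records)
```

Running the same functions over the full log file (for example `parse_train_log(open(path))`, with `path` pointing at the training log) would show whether the per-epoch mean is flat or still rising slowly.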

Reproduces the problem - command or script

python train.py config configs/body_2d_keypoint/rtmpose/coco/rtmpose-l_8xb256-420e_coco-256x192.py
--resume work_dirs/cspnext-l_udp-aic-coco_210e-256x192-273b7631_20230130.pth

Reproduces the problem - error message

[4250/4853] base_lr: 4.000000e-03 lr: 4.000000e-03 eta: 2 days, 3:31:00 time: 0.122026 data_time: 0.019312 memory: 3826 loss: 0.207197 loss_kpt: 0.207197 acc_pose: 0.522144

Additional information

1. The dataset is the original COCO dataset plus 2000 additional images.
2. I expected the performance after adding the data to be close to that of the originally provided model.
3. I cannot figure out where the problem is. The data has been verified and shows no problems.
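The report does not say how the data was verified. As an illustration only, a basic consistency check on COCO-style keypoint annotations for the added images could look like the sketch below. It assumes the standard COCO layout (17 joints stored as flat `x, y, v` triplets, a `num_keypoints` count, and an `[x, y, w, h]` bbox). The function name `check_coco_keypoint_annotation` is hypothetical and not an MMPose API.

```python
def check_coco_keypoint_annotation(ann, num_joints=17):
    """Return a list of problems found in one COCO-style keypoint annotation."""
    problems = []

    kpts = ann.get("keypoints", [])
    if len(kpts) != 3 * num_joints:
        problems.append(
            f"expected {3 * num_joints} keypoint values, got {len(kpts)}"
        )
        return problems  # the remaining checks need a well-formed list

    # Every third value is the visibility flag: 0 unlabeled, 1 occluded, 2 visible.
    vis = kpts[2::3]
    if any(v not in (0, 1, 2) for v in vis):
        problems.append("visibility flag outside {0, 1, 2}")

    labeled = sum(1 for v in vis if v > 0)
    if ann.get("num_keypoints") != labeled:
        problems.append(
            f"num_keypoints={ann.get('num_keypoints')} but {labeled} joints are labeled"
        )

    _, _, w, h = ann.get("bbox", (0, 0, 0, 0))
    if w <= 0 or h <= 0:
        problems.append("bbox has non-positive width or height")

    return problems


# Two made-up annotations for illustration (not taken from the dataset).
good = {"keypoints": [10, 20, 2] * 17, "num_keypoints": 17, "bbox": [0, 0, 50, 100]}
bad = {
    "keypoints": [10, 20, 2] * 16 + [0, 0, 0],
    "num_keypoints": 17,
    "bbox": [0, 0, 50, 100],
}

good_problems = check_coco_keypoint_annotation(good)
bad_problems = check_coco_keypoint_annotation(bad)
```

A check like this only catches structural inconsistencies. It cannot tell whether the joint order or the left/right labels of the added images match the COCO convention.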
