incorrect pipeline status when restarted the server #130

Closed

chinyeungli opened this issue Apr 6, 2021 · 12 comments
@chinyeungli
Contributor

I've created a couple of projects and launched a docker pipeline on each of them, but then the server died. I restarted the server, and when I opened the project page again, the "status" next to the pipelines still shows "Running", which is incorrect.

tdruez added a commit that referenced this issue Apr 26, 2021
tdruez added a commit that referenced this issue Apr 26, 2021
tdruez added a commit that referenced this issue Apr 27, 2021
tdruez added a commit that referenced this issue Apr 28, 2021
@tdruez tdruez added this to the 2021-05 milestone Apr 28, 2021
@tdruez
Contributor

tdruez commented Apr 29, 2021

@AvishrantsSh I'd like to get your input on this one. Any ideas on how we could improve this?

To reproduce:

  • Start the webserver: $ make run
  • Start any pipeline execution
  • Kill the webserver process
  • Restart the webserver and look at your pipeline state: it is still displayed as "running"

The pipeline will show as "running" forever even though it is not running anymore. The status display is based on the database state of the pipeline Run: it was marked as "started" at the start of the pipeline execution, but since the worker process was killed, it never got the chance to be marked as "ended/failed".
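
Roughly, the display logic boils down to something like the following stripped-down sketch; the field names are taken from this thread, not from the actual scanpipe code:

# Stripped-down illustration; field names follow this thread, not the actual
# scanpipe.models.Run implementation.
def run_status(run):
    if run.task_start_date and run.task_end_date is None:
        # Marked "started" when execution begins; if the worker process is
        # killed, task_end_date/task_exitcode are never set and this branch
        # is hit forever.
        return "running"
    if run.task_exitcode == 0:
        return "success"
    if run.task_exitcode is not None:
        return "failure"
    return "not started"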

@AvishrantsSh
Collaborator

@tdruez, I've looked into this issue and I believe there are a few workarounds:

  • We can implement a custom signal for server shutdown and perform database cleansing at that time.
  • When the server is started, we can look for entries in the Run model that have no valid task_exitcode and perform cleansing accordingly. It can be implemented in wsgi.py using a filter like task_exitcode__isnull=True.

Personally, I feel that the second solution makes more sense, as it is guaranteed that no celery job will be running at that time that could interfere with the cleansing. Besides, it is also simpler to implement than signals.
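
A minimal sketch of that second option, assuming the task_* fields discussed in this thread and an arbitrary startup hook such as wsgi.py; the import path, field names, and exit-code convention are illustrative, not the actual implementation:

# Hypothetical startup cleanup, e.g. invoked from wsgi.py; not the actual PR code.
from django.utils import timezone

from scanpipe.models import Run  # assumed import path


def flag_stale_runs():
    """Mark runs that were started but never recorded an exit code as failed."""
    stale_runs = Run.objects.filter(
        task_start_date__isnull=False,  # execution was started...
        task_exitcode__isnull=True,     # ...but no exit code was ever recorded
    )
    for run in stale_runs:
        run.task_end_date = timezone.now()
        run.task_exitcode = 1  # arbitrary non-zero value meaning "failed"
        run.save(update_fields=["task_end_date", "task_exitcode"])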

@AvishrantsSh
Collaborator

I tried to implement the second solution and it seems to work perfectly.

@tdruez
Contributor

tdruez commented Apr 29, 2021

@AvishrantsSh thanks for your input. This approach works alright in local "runserver" mode.

But in the full stack mode (redis + celery worker + webserver + ALWAYS_EAGER = False), the "when the server is started" approach would not work, since each service runs separately. The celery worker could still be working on a task while you restart the webserver, for example.

@AvishrantsSh
Collaborator

@tdruez As you rightly said, the approach that I mentioned should work in "offline" mode.
Now, as for the complete solution, I have some doubts. First, can Celery update the db when the django server is shut down? If it can, then I don't think there will be any issue with this approach, as we can restrict the cleansing to local development only.
However, if Celery can't update the record, then it is a different story.

@tdruez
Contributor

tdruez commented Apr 29, 2021

Celery update the db when django server is shut down?

Yes, the celery workers access the database independently from the django app.

Also, we can have multiple celery workers running, so doing the cleanup on starting a worker would not be good enough.

@AvishrantsSh
Collaborator

@tdruez, can you provide me with instructions on how to test the full implementation? I tried it with a dockerized redis container and the make worker command, but for some reason, it always shows the result as a failure.

@tdruez
Contributor

tdruez commented May 4, 2021

@AvishrantsSh there are two ways to run the full ScanCode.io stack:

1. Running each service individually (dev mode, Linux and macOS only)

  • Edit the .env file to disable the eager mode: CELERY_TASK_ALWAYS_EAGER=False
  • Run a Redis server (https://redis.io/download): $ redis-server
  • Run the Django app: $ make run
  • Run the Celery worker: $ make worker

2. Docker (production mode)

@AvishrantsSh
Collaborator

@tdruez, I've opened a pull request #171 for this issue. Please have a look and comment on any changes you'd like. I've used app.control.inspect() to check if any celery worker is online/working before the cleansing operation.
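
For reference, the kind of guard that check boils down to might look like the following sketch; the names (celery_app, flag_stale_runs) are placeholders, not the actual PR #171 code:

# Sketch only: flag stale runs when no celery worker reports any activity.
from celery import current_app as celery_app  # or the project's Celery app instance

def no_workers_busy(app):
    """Return True when no worker reports active or reserved (prefetched) tasks."""
    inspect = app.control.inspect()
    active = inspect.active() or {}      # {worker_name: [task, ...]}, or None if no reply
    reserved = inspect.reserved() or {}
    return not any(active.values()) and not any(reserved.values())

if no_workers_busy(celery_app):
    flag_stale_runs()  # placeholder for the cleanup discussed earlier in this thread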

@tdruez
Contributor

tdruez commented Jun 14, 2021

@AvishrantsSh after looking deeper into this issue, app.control.inspect() is not enough for a complete solution.

The celery inspect tool works at the worker level and only knows about the executing tasks and the scheduled ones (prefetched tasks). It does not provide insight into the full queue of tasks, which are waiting on the broker (redis) side to be picked up by a worker.

The queued tasks can be gathered using the redis API:

import redis

# The queued tasks live in a Redis list keyed by the Celery queue name
# (defaults to "celery"), not by the broker URL.
queue_name = "celery"

redis_connection = redis.StrictRedis(host="redis", port=6379, db=0)
tasks = redis_connection.lrange(queue_name, start=0, end=-1)

# or, reusing the broker connection of the Celery application instance
with celery_app.pool.acquire(block=True) as conn:
    tasks = conn.default_channel.client.lrange(queue_name, 0, -1)

I'm working on a solution that will combine those to recover all the tasks:

  • Running: celery_app.control.inspect().active()
  • Scheduled/reserved: celery_app.control.inspect().scheduled() and celery_app.control.inspect().reserved()
  • Queued: redis lrange()
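
A rough sketch of how those three sources could be combined into a single set of known task ids; the queue name, the message parsing, and the use of current_app are assumptions for illustration, not the final implementation:

# Rough sketch: gather task ids from the workers and from the redis queue.
import json

import redis
from celery import current_app as celery_app  # or the project's Celery app instance

def get_all_known_task_ids():
    task_ids = set()

    # 1. Running and prefetched tasks, as reported by the workers.
    inspect = celery_app.control.inspect()
    for report in (inspect.active(), inspect.reserved(), inspect.scheduled()):
        for tasks in (report or {}).values():
            for task in tasks:
                # scheduled() entries nest the task under a "request" key.
                task_ids.add(task.get("id") or task.get("request", {}).get("id"))

    # 2. Queued tasks still waiting on the broker (redis) side.
    # Assumes the default queue name ("celery") and message protocol v2.
    redis_connection = redis.StrictRedis(host="redis", port=6379, db=0)
    for raw_message in redis_connection.lrange("celery", 0, -1):
        message = json.loads(raw_message)
        task_ids.add(message["headers"]["id"])

    return task_ids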

@ddmesh

ddmesh commented Sep 6, 2021

Hi, I can confirm that the version from today (September 6) still has problems when a scan is running and I just
do "docker-compose stop" and "docker-compose start".

The status for the pipeline stays "running". When trying to "start_pipeline", I'm told that this pipeline is already running,
but scancode.io does not actually restart the scan.

What I would like to have after restarting scancode.io is that:

  • an aborted scan continues at the position where it was interrupted
  • or, at least, the scan is restarted as queued

At the moment I also cannot determine whether such a dead "running" pipeline is actually working or not, so I can't delete it and try again.
This would be very inconvenient when integrating scancode.io into a CI/CD setup.
Thanks

tdruez added a commit that referenced this issue Sep 17, 2021
tdruez added a commit that referenced this issue Nov 9, 2021
tdruez added a commit that referenced this issue Nov 10, 2021
tdruez added a commit that referenced this issue Nov 10, 2021
tdruez added a commit that referenced this issue Nov 10, 2021
tdruez added a commit that referenced this issue Nov 10, 2021
tdruez added a commit that referenced this issue Nov 11, 2021
tdruez added a commit that referenced this issue Nov 11, 2021
tdruez added a commit that referenced this issue Nov 11, 2021
tdruez added a commit that referenced this issue Nov 17, 2021
tdruez added a commit that referenced this issue Nov 19, 2021
tdruez added a commit that referenced this issue Nov 22, 2021
tdruez added a commit that referenced this issue Nov 22, 2021
…357)

* Flag stale runs on app ready in SYNC mode #130
* Enable redis data persistence using AOF #130 (with default policy of fsync every second)
* Make sure the job is found before calling delete in Run.delete_task #130
* Add a sync_with_job method on the run model #130 (synchronise the `self` Run instance with its related RQ Job)
* Synchronizes QUEUED and RUNNING Runs with their related Jobs on app ready #130
* Add unit test for sync_with_job method #130
* Move the synchronization process in a custom Worker class #130
* Simplify the synchronization logic #130
* Reduce the "cleaning lock" ttl from 899 seconds to 60 seconds in ASYNC queue #130
* Add unit tests for better coverage #130
* Add CHANGELOG entry #130
@tdruez
Contributor

tdruez commented Nov 22, 2021

Extract from the v30.1.0 release notes (https://github.com/nexB/scancode.io/releases/tag/v30.1.0):

  • Synchronize QUEUED and RUNNING pipeline runs with their related worker jobs during
    worker maintenance tasks scheduled every 10 minutes.
    If a container was taken down while a pipeline was running, or if the pipeline process
    was killed unexpectedly, that pipeline run's status will be updated to a FAILED state
    during the next maintenance tasks.
    QUEUED pipelines will be restored in the queue, as the worker redis cache backend data
    is now persistent and reloaded on starting the image.
    Note that internally, a running job emits a "heartbeat" every 60 seconds to let all the
    workers know that it is properly running.
    After 90 seconds without any heartbeats, a worker will determine that the job is not
    active anymore and that job will be moved to the failed registry during the worker
    maintenance tasks. The pipeline run will be updated as well to reflect this failure
    in the Web UI, the REST API, and the command line interface.

  • Enable redis data persistence using the "Append Only File" with the default policy of
    fsync every second in the docker-compose.
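
For illustration, a minimal sketch of what synchronizing a Run with its related RQ job could look like, using rq's Job.fetch; the run.task_id field and the set_task_ended helper are assumptions based on this thread, not the actual ScanCode.io code:

# Hypothetical sync between a Run and its related RQ job; not the actual code.
from redis import Redis
from rq.exceptions import NoSuchJobError
from rq.job import Job

def sync_run_with_job(run, connection=None):
    """Flag the run as failed when its related job is gone or has failed."""
    connection = connection or Redis(host="redis", port=6379, db=0)
    try:
        job = Job.fetch(run.task_id, connection=connection)  # run.task_id assumed
    except NoSuchJobError:
        # The job vanished, e.g. the container was taken down mid-run.
        run.set_task_ended(exitcode=1, output="Job not found")  # assumed helper
        return
    if job.is_failed:
        run.set_task_ended(exitcode=1, output="Job failed")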

@tdruez tdruez closed this as completed Nov 22, 2021