incorrect pipeline status when restarted the server #130

Closed

chinyeungli opened this issue Apr 6, 2021 · 12 comments
@chinyeungli
Contributor

I've created a couple of projects and launched a docker pipeline on each of them, but then the server died. I restarted the server, and when I opened the project page again, the "status" next to the pipelines still shows "Running", which is incorrect.

tdruez added a commit that referenced this issue Apr 26, 2021
tdruez added a commit that referenced this issue Apr 26, 2021
tdruez added a commit that referenced this issue Apr 27, 2021
tdruez added a commit that referenced this issue Apr 28, 2021
@tdruez tdruez added this to the 2021-05 milestone Apr 28, 2021
@tdruez
Contributor

tdruez commented Apr 29, 2021

@AvishrantsSh I'd like to get your input on this one. Any ideas on how we could improve this?

To reproduce:

  • Start the webserver: $ make run
  • Start any pipeline execution
  • Kill the webserver process
  • Restart the webserver and look at your pipeline state: it is still displayed as "running"

The pipeline will show as "running" forever even though it is not running anymore. The status display is based on the database state of the pipeline Run: it was marked as "started" at the start of the pipeline execution, but since the worker process was killed, it never got the chance to be marked as "ended/failed".
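
Roughly, the display logic boils down to something like the following stripped-down sketch; the field names are taken from this thread, not from the actual scanpipe code:

# Stripped-down illustration; field names follow this thread, not the actual
# scanpipe.models.Run implementation.
def run_status(run):
    if run.task_start_date and run.task_end_date is None:
        # Marked "started" when execution begins; if the worker process is
        # killed, task_end_date/task_exitcode are never set and this branch
        # is hit forever.
        return "running"
    if run.task_exitcode == 0:
        return "success"
    if run.task_exitcode is not None:
        return "failure"
    return "not started"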

@AvishrantsSh
Collaborator

@tdruez, I've looked into this issue and I believe there are a few workarounds:

  • We can implement a custom signal for server shutdown and perform database cleansing at that time.
  • When the server is started, we can look for entries in the Run model that have no valid task_exitcode and perform cleansing accordingly. It can be implemented in wsgi.py using a filter like task_exitcode__isnull=True.

Personally, I feel that the second solution makes more sense, as it is guaranteed that no celery job will be running at that time that could interfere with the cleansing. Besides, it is also simpler to implement than signals.
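
A minimal sketch of that second option, assuming the task_* fields discussed in this thread and an arbitrary startup hook such as wsgi.py; the import path, field names, and exit-code convention are illustrative, not the actual implementation:

# Hypothetical startup cleanup, e.g. invoked from wsgi.py; not the actual PR code.
from django.utils import timezone

from scanpipe.models import Run  # assumed import path


def flag_stale_runs():
    """Mark runs that were started but never recorded an exit code as failed."""
    stale_runs = Run.objects.filter(
        task_start_date__isnull=False,  # execution was started...
        task_exitcode__isnull=True,     # ...but no exit code was ever recorded
    )
    for run in stale_runs:
        run.task_end_date = timezone.now()
        run.task_exitcode = 1  # arbitrary non-zero value meaning "failed"
        run.save(update_fields=["task_end_date", "task_exitcode"])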

@AvishrantsSh
Collaborator

I tried to implement the second solution and it seems to work perfectly.

@tdruez
Contributor

tdruez commented Apr 29, 2021

@AvishrantsSh thanks for your input. This approach works alright in local "runserver" mode.

But in the full stack mode (redis + celery worker + webserver + ALWAYS_EAGER = False), the "when the server is started" approach would not work, since each service runs separately. The celery worker could still be working on a task while you restart the webserver, for example.

@AvishrantsSh
Collaborator

@tdruez As you rightly said, the approach that I mentioned should work in "offline" mode.
Now, as for the complete solution, I have some doubts. First, can Celery update the db when the django server is shut down? If it can, then I don't think there will be any issue with this approach, as we can restrict the cleansing to local development only.
However, if Celery can't update the record, then it is a different story.

@tdruez
Contributor

tdruez commented Apr 29, 2021

Celery update the db when django server is shut down?

Yes, the celery workers access the database independently from the django app.

Also, we can have multiple celery workers running, so doing the cleanup on starting a worker would not be good enough.

@AvishrantsSh
Collaborator

@tdruez, can you provide me with instructions on how to test the full implementation? I tried it with a dockerized redis container and the make worker command, but for some reason, it always shows the result as a failure.

@tdruez
Contributor

tdruez commented May 4, 2021

@AvishrantsSh there are two ways to run the full ScanCode.io stack:

1. Running each service individually (dev mode, Linux and macOS only)

  • Edit the .env file to disable the eager mode: CELERY_TASK_ALWAYS_EAGER=False
  • Run a Redis server (https://redis.io/download): $ redis-server
  • Run the Django app: $ make run
  • Run the Celery worker: $ make worker

2. Docker (production mode)

@AvishrantsSh
Collaborator

@tdruez, I've opened a pull request #171 for this issue. Please have a look and comment on any changes you'd like. I've used app.control.inspect() to check if any celery worker is online/working before the cleansing operation.
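
For reference, the kind of guard that check boils down to might look like the following sketch; the names (celery_app, flag_stale_runs) are placeholders, not the actual PR #171 code:

# Sketch only: flag stale runs when no celery worker reports any activity.
from celery import current_app as celery_app  # or the project's Celery app instance

def no_workers_busy(app):
    """Return True when no worker reports active or reserved (prefetched) tasks."""
    inspect = app.control.inspect()
    active = inspect.active() or {}      # {worker_name: [task, ...]}, or None if no reply
    reserved = inspect.reserved() or {}
    return not any(active.values()) and not any(reserved.values())

if no_workers_busy(celery_app):
    flag_stale_runs()  # placeholder for the cleanup discussed earlier in this thread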

@tdruez
Contributor

tdruez commented Jun 14, 2021

@AvishrantsSh after looking deeper into this issue, app.control.inspect() is not enough for a complete solution.

The celery inspect tool works at the worker level and only knows about the executing tasks and the scheduled ones (prefetched tasks). It does not provide insight into the full queue of tasks, which are waiting on the broker (redis) side to be picked up by a worker.

The queued tasks can be gathered using the redis API:

import redis

# The queued tasks live in a Redis list keyed by the Celery queue name
# (defaults to "celery"), not by the broker URL.
queue_name = "celery"

redis_connection = redis.StrictRedis(host="redis", port=6379, db=0)
tasks = redis_connection.lrange(queue_name, start=0, end=-1)

# or, reusing the broker connection of the Celery application instance
with celery_app.pool.acquire(block=True) as conn:
    tasks = conn.default_channel.client.lrange(queue_name, 0, -1)

I'm working on a solution that will combine those to recover all the tasks:

  • Running: celery_app.control.inspect().active()
  • Scheduled/reserved: celery_app.control.inspect().scheduled() and celery_app.control.inspect().reserved()
  • Queued: redis lrange()
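
A rough sketch of how those three sources could be combined into a single set of known task ids; the queue name, the message parsing, and the use of current_app are assumptions for illustration, not the final implementation:

# Rough sketch: gather task ids from the workers and from the redis queue.
import json

import redis
from celery import current_app as celery_app  # or the project's Celery app instance

def get_all_known_task_ids():
    task_ids = set()

    # 1. Running and prefetched tasks, as reported by the workers.
    inspect = celery_app.control.inspect()
    for report in (inspect.active(), inspect.reserved(), inspect.scheduled()):
        for tasks in (report or {}).values():
            for task in tasks:
                # scheduled() entries nest the task under a "request" key.
                task_ids.add(task.get("id") or task.get("request", {}).get("id"))

    # 2. Queued tasks still waiting on the broker (redis) side.
    # Assumes the default queue name ("celery") and message protocol v2.
    redis_connection = redis.StrictRedis(host="redis", port=6379, db=0)
    for raw_message in redis_connection.lrange("celery", 0, -1):
        message = json.loads(raw_message)
        task_ids.add(message["headers"]["id"])

    return task_ids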

@ddmesh

ddmesh commented Sep 6, 2021

Hi, I can confirm that the version from today (September 6) still has problems when a scan is running and I just
do "docker-compose stop" and "docker-compose start".

The status for the pipeline stays "running". When trying to "start_pipeline", I'm told that this pipeline is already running,
but scancode.io does not actually restart the scan.

What I would like to have after restarting scancode.io is that:

  • an aborted scan continues at the position where it was interrupted
  • or, at least, the scan is restarted as queued

At the moment I also cannot determine whether such a dead "running" pipeline is actually working or not, so I can't delete it and try again.
This would be very inconvenient when integrating scancode.io into a CI/CD setup.
Thanks

tdruez added a commit that referenced this issue Sep 17, 2021
tdruez added a commit that referenced this issue Nov 9, 2021
tdruez added a commit that referenced this issue Nov 10, 2021
tdruez added a commit that referenced this issue Nov 10, 2021
tdruez added a commit that referenced this issue Nov 10, 2021
tdruez added a commit that referenced this issue Nov 10, 2021
tdruez added a commit that referenced this issue Nov 11, 2021
tdruez added a commit that referenced this issue Nov 11, 2021
tdruez added a commit that referenced this issue Nov 11, 2021
tdruez added a commit that referenced this issue Nov 17, 2021
tdruez added a commit that referenced this issue Nov 19, 2021
tdruez added a commit that referenced this issue Nov 22, 2021
tdruez added a commit that referenced this issue Nov 22, 2021
…357)

* Flag stale runs on app ready in SYNC mode #130
* Enable redis data persistence using AOF #130 (with default policy of fsync every second)
* Make sure the job is found before calling delete in Run.delete_task #130
* Add a sync_with_job method on the run model #130 (synchronise the `self` Run instance with its related RQ Job)
* Synchronizes QUEUED and RUNNING Runs with their related Jobs on app ready #130
* Add unit test for sync_with_job method #130
* Move the synchronization process in a custom Worker class #130
* Simplify the synchronization logic #130
* Reduce the "cleaning lock" ttl from 899 seconds to 60 seconds in ASYNC queue #130
* Add unit tests for better coverage #130
* Add CHANGELOG entry #130
@tdruez
Contributor

tdruez commented Nov 22, 2021

Extract from the v30.1.0 release notes (https://github.com/nexB/scancode.io/releases/tag/v30.1.0):

  • Synchronize QUEUED and RUNNING pipeline runs with their related worker jobs during
    worker maintenance tasks scheduled every 10 minutes.
    If a container was taken down while a pipeline was running, or if the pipeline process
    was killed unexpectedly, that pipeline run's status will be updated to a FAILED state
    during the next maintenance tasks.
    QUEUED pipelines will be restored in the queue, as the worker redis cache backend data
    is now persistent and reloaded on starting the image.
    Note that internally, a running job emits a "heartbeat" every 60 seconds to let all the
    workers know that it is properly running.
    After 90 seconds without any heartbeats, a worker will determine that the job is not
    active anymore and that job will be moved to the failed registry during the worker
    maintenance tasks. The pipeline run will be updated as well to reflect this failure
    in the Web UI, the REST API, and the command line interface.

  • Enable redis data persistence using the "Append Only File" with the default policy of
    fsync every second in the docker-compose.
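
For illustration, a minimal sketch of what synchronizing a Run with its related RQ job could look like, using rq's Job.fetch; the run.task_id field and the set_task_ended helper are assumptions based on this thread, not the actual ScanCode.io code:

# Hypothetical sync between a Run and its related RQ job; not the actual code.
from redis import Redis
from rq.exceptions import NoSuchJobError
from rq.job import Job

def sync_run_with_job(run, connection=None):
    """Flag the run as failed when its related job is gone or has failed."""
    connection = connection or Redis(host="redis", port=6379, db=0)
    try:
        job = Job.fetch(run.task_id, connection=connection)  # run.task_id assumed
    except NoSuchJobError:
        # The job vanished, e.g. the container was taken down mid-run.
        run.set_task_ended(exitcode=1, output="Job not found")  # assumed helper
        return
    if job.is_failed:
        run.set_task_ended(exitcode=1, output="Job failed")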

@tdruez tdruez closed this as completed Nov 22, 2021