
Add query graceful shutdown for rolling updates #10106

Merged

Conversation

illusional
Contributor

Testing

  • Added a "wait n seconds" endpoint that sleeps for n seconds and then returns the value of an environment variable. That variable let me track which version of the deployment each request ran against.
  • Took the deploy.yaml from the deploy query step of the dev deploy, added a TEST_VALUE environment variable with some value, and saved it as new-deploy.yaml.
  • Issued the first wait request (for 50 seconds): https://internal.hail.populationgenomics.org.au/$NAMESPACE/query/api/v1alpha/wait?duration=50
  • Issued the new deploy with:
    kubectl -n $NAMESPACE apply -f new-deploy.yaml
    kubectl -n $NAMESPACE rollout status --timeout=10m deployment query
  • When the new pod was created (visible with kubectl --namespace $NAMESPACE get pod), issued the second request to the wait endpoint.
  • If all goes well, you should see:
    • termination logs like those below,
    • the first request fulfilled with an env value of None (served by the first pod), and
    • the second request fulfilled with the TEST_VALUE you set in new-deploy.yaml (it was routed to the new pod).

Termination logs:

{"severity": "INFO", "levelname": "INFO", "asctime": "2021-02-24 23:22:40,472", "filename": "query.py", "funcNameAndLine": "on_shutdown:253", "message": "On shutdown request received, with 2 tasks left", "hail_log": 1}
++ term
++ kill -TERM 7
+ true
+ '[' no == yes ']'
+ trap - SIGTERM SIGINT
+ wait 7
{"severity": "INFO", "levelname": "INFO", "asctime": "2021-02-24 23:23:26,004", "filename": "hail_logging.py", "funcNameAndLine": "log:40", "message": "https GET /michaelfranklin/query/api/v1alpha/wait done in 50.029999999998836s: 200", "remote_address": "10.28.127.3", "request_start_time": "[24/Feb/2021:23:22:35 +0000]", "request_duration": 50.029999999998836, "response_status": 200, "x_real_ip": "124.170.20.28", "hail_log": 1}
{"severity": "INFO", "levelname": "INFO", "asctime": "2021-02-24 23:23:26,005", "filename": "query.py", "funcNameAndLine": "on_shutdown:255", "message": "Tasks have all completed.", "hail_log": 1}
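The `+ trap` / `+ wait` lines in the trace come from a signal-forwarding container entrypoint. A minimal sketch of that pattern, assuming `sleep` stands in for the real query server and a background self-signal simulates Kubernetes delivering SIGTERM:

```shell
#!/usr/bin/env bash
# Sketch of a signal-forwarding entrypoint like the one in the trace above.
# 'sleep 30' stands in for the real server process (an assumption).
sleep 30 &
pid=$!

# Forward shutdown signals to the child so it can finish in-flight work.
trap 'kill -TERM "$pid"' TERM INT

# Simulate Kubernetes delivering SIGTERM to the entrypoint after 1 second.
( sleep 1; kill -TERM $$ ) &

wait "$pid"        # returns early when the trapped signal arrives
trap - TERM INT
wait "$pid"        # reap the child and collect its exit status
echo "shutdown complete"
```

The double `wait` matters: the first is interrupted by the trapped signal, and the second actually reaps the child after it exits, which is why both appear in the trace.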

Test duration endpoint

import asyncio
import os

from aiohttp import web

@routes.get('/api/v1alpha/wait')
async def wait_seconds(request):
    """
    Wait query.duration seconds before returning the response.
    """
    duration = request.query.get('duration')
    try:
        duration = int(duration)
    except (TypeError, ValueError) as e:
        return web.json_response({
            'error': f'Invalid parameter duration "{duration}": {e}',
        }, status=422)

    await asyncio.sleep(duration)
    env_value = os.getenv('TEST_VALUE', 'None')
    return web.json_response({'d': f"You waited '{duration}' seconds!!", 'env': env_value})
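The graceful-shutdown logic being tested boils down to tracking in-flight handler tasks and awaiting them in an on-shutdown hook. A stdlib-only sketch of that idea (the `GracefulTracker` name and structure are illustrative, not the PR's actual code):

```python
import asyncio

class GracefulTracker:
    """Tracks in-flight handler tasks so shutdown can wait for them."""

    def __init__(self):
        self.tasks = set()

    def track(self, coro):
        # Register a handler coroutine; remove it from the set when done.
        task = asyncio.ensure_future(coro)
        self.tasks.add(task)
        task.add_done_callback(self.tasks.discard)
        return task

    async def on_shutdown(self):
        # Mirrors the log above: report remaining tasks, then wait for them.
        print(f'On shutdown request received, with {len(self.tasks)} tasks left')
        if self.tasks:
            await asyncio.gather(*self.tasks)
        print('Tasks have all completed.')

async def main():
    tracker = GracefulTracker()
    results = []

    async def handler(n):
        await asyncio.sleep(0.01)
        results.append(n)

    tracker.track(handler(1))
    tracker.track(handler(2))
    await tracker.on_shutdown()
    return results

print(sorted(asyncio.run(main())))
```

With this in place, SIGTERM can trigger `on_shutdown` so the pod only exits once the long-running `wait` requests it is still serving have completed.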

@illusional illusional changed the title Merge pull request #35 from populationgenomics/add-query-graceful-shu… Add query graceful shutdown for rolling updates Feb 25, 2021
@daniel-goldstein
Contributor

Sorry to step on your toes here a little bit @illusional! I updated aiohttp to 3.7.4 so you don't have to bump it here.

@illusional
Contributor Author

Too easy, thanks @daniel-goldstein! I've resolved the conflict ☺️

@illusional
Contributor Author

Hey @danking, just wondering if you could assign this PR to yourself :)

@danking danking self-assigned this Mar 9, 2021
@danking
Contributor

danking commented Mar 9, 2021

Thanks for the ping!

Contributor

@danking danking left a comment

An elegant, if zany solution! Looks right to me.

@daniel-goldstein
Contributor

check-services error: on-shutdown needs a # pylint: disable=unused-argument

@illusional
Contributor Author

Thanks @daniel-goldstein, I've fixed the linting in my local hail setup so it now catches this (and the other PRs') issues.

@danking danking merged commit 8e1a9f1 into hail-is:main Mar 18, 2021
illusional added a commit to populationgenomics/hail that referenced this pull request Mar 23, 2021
* [batch] Worker cleanup (hail-is#10155)

* [batch] Worker cleanup

* more changes

* wip

* delint

* additions?

* fix

* [query] Add `source_file_field` to `import_table` (hail-is#10164)

* [query] Add `source_file_field` to `import_table`

CHANGELOG: Add `source_file_field` parameter to `hl.import_table` to allow lines to be associated with their original source file.

* ugh

* [ci] add authorize sha and action items table to user page (hail-is#10142)

* [ci] add authorize sha and action items table to user page

* [ci] track review requested in addition to assigned for PR reviews

* [ci] add CI dropdown with link to user page (hail-is#10163)

* [batch] add more logs and do not wait for asyncgens (hail-is#10136)

* [batch] add more logs and do not wait for asyncgens

I think there is some unresolved issue with asyncgen shutdown that is keeping
workers alive. This is not an issue in worker because worker calls sys.exit
which forcibly stops execution. cc: @daniel-goldstein @jigold.

* fix lint

* [query-service] maybe fix event loop not initialized (hail-is#10153)

* [query-service] maybe fix event loop not initialized

The event loop is supposed to be initialized in the main thread. Sometimes
our tests get placed in the non-main thread (always a thread named Dummy-1).
Hopefully the session-scoped fixture is run in the main thread.

* fix

* [prometheus] add prometheus to track SLIs (hail-is#10165)

* [prometheus] add prometheus to track SLIs

* add wraps

* [query] apply nest-asyncio as early as possible (hail-is#10158)

* [query] apply nest-asyncio as early as possible

* fix

* [grafana] set pod fsGroup to grafana user (hail-is#10162)

* fix linting errors (hail-is#10171)

* [query] Remove verbose print (hail-is#10167)

Looks like this got added in some dndarray work

* [ci] update assignees and reviewers on PR github update (hail-is#10168)

* [query-service] fix receive logic (hail-is#10159)

* [query-service] fix receive logic

Only one coro waits on receive now. We still error if a message is sent before
we make our first response.

* fix

* fix

* CHANGELOG: Fixed incorrect error message when incorrect type specified with hl.loop (hail-is#10174)

* [linting] add curlylint check for any service that renders jinja2 (hail-is#10172)

* [linting] add curlylint check for any service that renders jinja2 templates

* [linting] spaces not tabs

* [website] fix website (hail-is#10173)

* [website] fix website

I build old versions of the docs and use them in new websites. This does not
work for versions of the docs before I introduced the new system. In particular
versions 0.2.63 and before generate old-style docs.

* tutorials are templated

* [ci] change mention for deploy failure (hail-is#10178)

* [gateway] move ukbb routing into gateway (hail-is#10179)

* [query] Fix filter intervals (keep=False) memory leak (hail-is#10182)

* [query-service] remove service backend tests (hail-is#10180)

They are too flaky currently due to the version issue.

* [website] pass response body as kwarg (hail-is#10176)

* Release 0.2.64 (hail-is#10183)

* Bump version number

* Updated changelog

* [nginx] ensure nginx configs dont overwrite each other in build.yaml (hail-is#10181)

* [query-service] teach query service to read MTs and Ts created by Spark (hail-is#10184)

* [query-service] teach query service to read MTs and Ts created by Spark

Hail-on-Spark uses HadoopFS which emulates directories by creating size-zero files with
the name `gs://bucket/dirname/`. Note: the object name literally ends in a slash. Such files
should not be included in `listStatus` (they should always be empty anyway). Unfortunately,
my fix in hail-is#9914 was wrong because `GoogleStorageFileStatus` removes
the trailing slash. This prevented the path from matching `path`, which always ends in a `/`.

* fix

* [website] dont jinja render any of the batch docs (hail-is#10190)

* [googlestoragefs] ignore the directory check entirely (hail-is#10185)

* [googlestoragefs] ignore the directory check entirely

If a file exists with the *same name as the directory we are listing*,
then it must be a directory marker. It does not matter if that file is
a directory or not.

* Update GoogleStorageFS.scala

* [ci] fix focus on slash and search job page for PRs (hail-is#10194)

* [query] Improve file compatibility error (hail-is#10191)

* Call init_service from init based on HAIL_QUERY_BACKEND value. (hail-is#10189)

* [query] NDArray Sum (hail-is#10187)

* Attempt implementing the sum rule in Emit

* Connected the python code, but not working yet

* NDArrayExpression.sum is working now

* Add default arg when no axis is provided

* More comprehensive test

* Unused imports

* Use sum appropriately in linear_regression_rows_nd

* Deleted extra blank line

* Don't use typeToTypeInfo, make NumericPrimitives the source of these decisions

* Better assertions, with tests

* Got the summation index correct

* Add documentation

* [website] fix resource path for non-html files in the docs (hail-is#10196)

* [query] Remove tcode from primitive orderings (hail-is#10193)

* [query] BlockMatrix map (hail-is#10195)

* Add map, but protect users of the spark backend from writing arbitrary maps

* If densify would have been a no-op, that should work

* Densify and Sparsify are no-ops for now

* Rename map to map_dense and map_sparse. Give better implementations for add, multiply, divide, subtract of a scalar

* Make the maps underscore methods

* [query] Remove all uses of .tcode[Boolean] (hail-is#10198)

* [ci] make test hello speak https (hail-is#10192)

* [tls] make hello use tls

* change pylint ignore message

* [query] blanczos_pca dont do extra loading work (hail-is#10201)

* Use the checkpointed table from mt_to_table_of_ndarray to avoid recomputing mt

* Keep extra row fields from being included

* Add query graceful shutdown for rolling updates (hail-is#10106)

* Merge pull request #35 from populationgenomics/add-query-graceful-shutdown

Add query graceful shutdown

* Remove unused argument from query:on_shutdown

* [auth] add more options for obtaining session id for dev credentials (hail-is#10203)

* [auth] add more options for obtaining session id for dev credentials

* [auth] extract userinfo query for use in both userinfo and verify_dev_credentials

* remove unused import

* [query] Default to Spark 3 (hail-is#10054)

* Change hail to use spark3 and scala 2.12 by default, change build_hail_spar3 to instead test spark2 for backwards support

* Update Makefile

* Update dataproc image version

* Scale down the dataproc version, since latest dataproc is using Spark release candidate

* Update pyspark version in requirements.txt

* Bump scala/spark patch versions

* We want to use the newer py4j jar when using spark 3

* Upgrade json4s

* I now want Spark 3.1.1, since it's been released

* Upgrade to 3.1.1 in the Makefile, fix a deprecateed IOUtils method

* Update pyspark as well

* Don't update json4s

* Try upgrading version

* Fixed issue for constructing bufferspecs

* Should at least be using newest one

* Remove abstracts from type hints

* Revert "Remove abstracts from type hints"

This reverts commit 1e0d194.

* Things don't go well if I don't use the same json4s version as Spark

* Mixed a typeHintFieldName

* See if this fixes my BlockMatrixSparsity issue

* json4s can't handle a curried apply method

* This works so long as the jar file is included in the libs directory

* Makefile changes to support pulling elasticsearch

* Use dataproc image for Spark 3.1.1

* Update patch version of dataproc image, no longer uses Spark RC

* Fixed up Makefile, now correctly depends on copying the jar

* Now we just check that the specified version is 7, as that's all we support

* Delete build_hail_spark2, we can't support spark2

* Version checks for Scala and Spark

* Updated installation docs

* Spark versions warning

* Update some old pysparks

* [batch] Add more info to UI pages (hail-is#10070)

* [batch] Add more info to UI pages

* fixes

* addr comment

* addr comments

* Bump jinja2 from 2.10.1 to 2.11.3 in /docker (hail-is#10209)

Bumps [jinja2](https:/pallets/jinja) from 2.10.1 to 2.11.3.
- [Release notes](https:/pallets/jinja/releases)
- [Changelog](https:/pallets/jinja/blob/master/CHANGES.rst)
- [Commits](pallets/jinja@2.10.1...2.11.3)

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [docker][hail] update to latest pytest (hail-is#10177)

* [docker][hail] update to latest pytest

Issues like this https://ci.hail.is/batches/221291/jobs/112 do not appear locally for me,
I suspect this is due to my using a much newer pytest.

* fix many tests incorrectly using pytest

* another one

* remove unnecessary pip installs in service test dockerfiles

* fix

* [gateway] Cut out router and router-resolver from gateway internal routing (hail-is#10207)

* [gateway] cut out router-resolver from internal auth flow

* [gateway] cut out router from internal

* [datasets] add pan-ukb datasets (hail-is#10186)

* add available pan-ukb datasets

* add rst files for schemas

* reference associated variant indices HT in the block matrix descriptions

* [query] Add json warn context to `parse_json` (hail-is#10160)

We don't test the logs, but I did test this manually, it works as
expected.

* [query] fix tmp_dir default in init(), which doesn't work for the service backend (hail-is#10199)

* Fix tmp_dir default, which doesn't work for the service backend.

* Fix type for tmp_dir.

* [gitignore]ignore website and doc files (hail-is#10214)

* Remove duplicate on_shutdown in query service

Co-authored-by: jigold <[email protected]>
Co-authored-by: Tim Poterba <[email protected]>
Co-authored-by: Daniel Goldstein <[email protected]>
Co-authored-by: Dan King <[email protected]>
Co-authored-by: John Compitello <[email protected]>
Co-authored-by: Christopher Vittal <[email protected]>
Co-authored-by: Michael Franklin <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Patrick Cummings <[email protected]>
Co-authored-by: Carolin Diaz <[email protected]>
vladsavelyev pushed a commit to populationgenomics/hail that referenced this pull request Mar 26, 2021
@illusional illusional deleted the add-query-graceful-shutdown-upstream branch June 14, 2023 00:04