
Add query graceful shutdown for rolling updates #10106

Merged

Conversation

illusional
Contributor

Testing

  • Added a "wait n seconds" endpoint that sleeps for n seconds and then returns the value of an environment variable. That variable let me track which version of the deployment each request ran against.
  • Took the deploy.yaml from the deploy query step of the dev deploy, added a TEST_VALUE environment variable with some value, and saved it as new-deploy.yaml.
  • Issued the first wait request (for 50 seconds): https://internal.hail.populationgenomics.org.au/$NAMESPACE/query/api/v1alpha/wait?duration=50
  • Issued the new deploy with:
    kubectl -n $NAMESPACE apply -f new-deploy.yaml
    kubectl -n $NAMESPACE rollout status --timeout=10m deployment query
  • When the new pod was created (visible with kubectl --namespace $NAMESPACE get pod), issued the second request to the wait endpoint.
  • If all goes well, you should see:
    • termination logs like those below,
    • the first request fulfilled with an env value of None (served by the first pod), and
    • the second request fulfilled with the TEST_VALUE you set in new-deploy.yaml (it was routed to the new pod).

Termination logs:

{"severity": "INFO", "levelname": "INFO", "asctime": "2021-02-24 23:22:40,472", "filename": "query.py", "funcNameAndLine": "on_shutdown:253", "message": "On shutdown request received, with 2 tasks left", "hail_log": 1}
++ term
++ kill -TERM 7
+ true
+ '[' no == yes ']'
+ trap - SIGTERM SIGINT
+ wait 7
{"severity": "INFO", "levelname": "INFO", "asctime": "2021-02-24 23:23:26,004", "filename": "hail_logging.py", "funcNameAndLine": "log:40", "message": "https GET /michaelfranklin/query/api/v1alpha/wait done in 50.029999999998836s: 200", "remote_address": "10.28.127.3", "request_start_time": "[24/Feb/2021:23:22:35 +0000]", "request_duration": 50.029999999998836, "response_status": 200, "x_real_ip": "124.170.20.28", "hail_log": 1}
{"severity": "INFO", "levelname": "INFO", "asctime": "2021-02-24 23:23:26,005", "filename": "query.py", "funcNameAndLine": "on_shutdown:255", "message": "Tasks have all completed.", "hail_log": 1}
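The `+ trap` / `+ wait` lines in the trace come from a signal-forwarding container entrypoint. A minimal sketch of that pattern, assuming `sleep` stands in for the real query server and a background self-signal simulates Kubernetes delivering SIGTERM:

```shell
#!/usr/bin/env bash
# Sketch of a signal-forwarding entrypoint like the one in the trace above.
# 'sleep 30' stands in for the real server process (an assumption).
sleep 30 &
pid=$!

# Forward shutdown signals to the child so it can finish in-flight work.
trap 'kill -TERM "$pid"' TERM INT

# Simulate Kubernetes delivering SIGTERM to the entrypoint after 1 second.
( sleep 1; kill -TERM $$ ) &

wait "$pid"        # returns early when the trapped signal arrives
trap - TERM INT
wait "$pid"        # reap the child and collect its exit status
echo "shutdown complete"
```

The double `wait` matters: the first is interrupted by the trapped signal, and the second actually reaps the child after it exits, which is why both appear in the trace.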

Test duration endpoint

import asyncio
import os

from aiohttp import web

@routes.get('/api/v1alpha/wait')
async def wait_seconds(request):
    """
    Wait query.duration seconds before returning the response.
    """
    duration = request.query.get('duration')
    try:
        duration = int(duration)
    except (TypeError, ValueError) as e:
        return web.json_response({
            'error': f'Invalid parameter duration "{duration}": {e}',
        }, status=422)

    await asyncio.sleep(duration)
    env_value = os.getenv('TEST_VALUE', 'None')
    return web.json_response({'d': f"You waited '{duration}' seconds!!", 'env': env_value})
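The graceful-shutdown logic being tested boils down to tracking in-flight handler tasks and awaiting them in an on-shutdown hook. A stdlib-only sketch of that idea (the `GracefulTracker` name and structure are illustrative, not the PR's actual code):

```python
import asyncio

class GracefulTracker:
    """Tracks in-flight handler tasks so shutdown can wait for them."""

    def __init__(self):
        self.tasks = set()

    def track(self, coro):
        # Register a handler coroutine; remove it from the set when done.
        task = asyncio.ensure_future(coro)
        self.tasks.add(task)
        task.add_done_callback(self.tasks.discard)
        return task

    async def on_shutdown(self):
        # Mirrors the log above: report remaining tasks, then wait for them.
        print(f'On shutdown request received, with {len(self.tasks)} tasks left')
        if self.tasks:
            await asyncio.gather(*self.tasks)
        print('Tasks have all completed.')

async def main():
    tracker = GracefulTracker()
    results = []

    async def handler(n):
        await asyncio.sleep(0.01)
        results.append(n)

    tracker.track(handler(1))
    tracker.track(handler(2))
    await tracker.on_shutdown()
    return results

print(sorted(asyncio.run(main())))
```

With this in place, SIGTERM can trigger `on_shutdown` so the pod only exits once the long-running `wait` requests it is still serving have completed.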

@illusional illusional changed the title Merge pull request #35 from populationgenomics/add-query-graceful-shu… Add query graceful shutdown for rolling updates Feb 25, 2021
@daniel-goldstein
Contributor

Sorry to step on your toes here a little bit @illusional! I updated aiohttp to 3.7.4 so you don't have to bump it here.

@illusional
Contributor Author

Too easy, thanks @daniel-goldstein! I've resolved the conflict ☺️

@illusional
Contributor Author

Hey @danking, just wondering if you could assign this PR to yourself :)

@danking danking self-assigned this Mar 9, 2021
@danking
Contributor

danking commented Mar 9, 2021

Thanks for the ping!

Contributor

@danking danking left a comment

An elegant, if zany solution! Looks right to me.

@daniel-goldstein
Contributor

check-services error: on-shutdown needs a # pylint: disable=unused-argument

@illusional
Contributor Author

Thanks @daniel-goldstein, I've fixed the linting in my local hail setup so it now catches this (and the other PRs') issues.

@danking danking merged commit 8e1a9f1 into hail-is:main Mar 18, 2021
illusional added a commit to populationgenomics/hail that referenced this pull request Mar 23, 2021
* [batch] Worker cleanup (hail-is#10155)

* [batch] Worker cleanup

* more changes

* wip

* delint

* additions?

* fix

* [query] Add `source_file_field` to `import_table` (hail-is#10164)

* [query] Add `source_file_field` to `import_table`

CHANGELOG: Add `source_file_field` parameter to `hl.import_table` to allow lines to be associated with their original source file.

* ugh

* [ci] add authorize sha and action items table to user page (hail-is#10142)

* [ci] add authorize sha and action items table to user page

* [ci] track review requested in addition to assigned for PR reviews

* [ci] add CI dropdown with link to user page (hail-is#10163)

* [batch] add more logs and do not wait for asyncgens (hail-is#10136)

* [batch] add more logs and do not wait for asyncgens

I think there is some unresolved issue with asyncgen shutdown that is keeping
workers alive. This is not an issue in worker because worker calls sys.exit
which forcibly stops execution. cc: @daniel-goldstein @jigold.

* fix lint

* [query-service] maybe fix event loop not initialized (hail-is#10153)

* [query-service] maybe fix event loop not initialized

The event loop is supposed to be initialized in the main thread. Sometimes
our tests get placed in the non-main thread (always a thread named Dummy-1).
Hopefully the session-scoped fixture is run in the main thread.

* fix

* [prometheus] add prometheus to track SLIs (hail-is#10165)

* [prometheus] add prometheus to track SLIs

* add wraps

* [query] apply nest-asyncio as early as possible (hail-is#10158)

* [query] apply nest-asyncio as early as possible

* fix

* [grafana] set pod fsGroup to grafana user (hail-is#10162)

* fix linting errors (hail-is#10171)

* [query] Remove verbose print (hail-is#10167)

Looks like this got added in some dndarray work

* [ci] update assignees and reviewers on PR github update (hail-is#10168)

* [query-service] fix receive logic (hail-is#10159)

* [query-service] fix receive logic

Only one coro waits on receive now. We still error if a message is sent before
we make our first response.

* fix

* fix

* CHANGELOG: Fixed incorrect error message when incorrect type specified with hl.loop (hail-is#10174)

* [linting] add curlylint check for any service that renders jinja2 (hail-is#10172)

* [linting] add curlylint check for any service that renders jinja2 templates

* [linting] spaces not tabs

* [website] fix website (hail-is#10173)

* [website] fix website

I build old versions of the docs and use them in new websites. This does not
work for versions of the docs before I introduced the new system. In particular
versions 0.2.63 and before generate old-style docs.

* tutorials are templated

* [ci] change mention for deploy failure (hail-is#10178)

* [gateway] move ukbb routing into gateway (hail-is#10179)

* [query] Fix filter intervals (keep=False) memory leak (hail-is#10182)

* [query-service] remove service backend tests (hail-is#10180)

They are too flaky currently due to the version issue.

* [website] pass response body as kwarg (hail-is#10176)

* Release 0.2.64 (hail-is#10183)

* Bump version number

* Updated changelog

* [nginx] ensure nginx configs dont overwrite each other in build.yaml (hail-is#10181)

* [query-service] teach query service to read MTs and Ts created by Spark (hail-is#10184)

* [query-service] teach query service to read MTs and Ts created by Spark

Hail-on-Spark uses HadoopFS which emulates directories by creating size-zero files with
the name `gs://bucket/dirname/`. Note: the object name literally ends in a slash. Such files
should not be included in `listStatus` (they should always be empty anyway). Unfortunately,
my fix in hail-is#9914 was wrong because `GoogleStorageFileStatus` removes
the trailing slash. This prevented the path from matching `path`, which always ends in a `/`.

* fix

* [website] dont jinja render any of the batch docs (hail-is#10190)

* [googlestoragefs] ignore the directory check entirely (hail-is#10185)

* [googlestoragefs] ignore the directory check entirely

If a file exists with the *same name as the directory we are listing*,
then it must be a directory marker. It does not matter if that file is
a directory or not.

* Update GoogleStorageFS.scala

* [ci] fix focus on slash and search job page for PRs (hail-is#10194)

* [query] Improve file compatibility error (hail-is#10191)

* Call init_service from init based on HAIL_QUERY_BACKEND value. (hail-is#10189)

* [query] NDArray Sum (hail-is#10187)

* Attempt implementing the sum rule in Emit

* Connected the python code, but not working yet

* NDArrayExpression.sum is working now

* Add default arg when no axis is provided

* More comprehensive test

* Unused imports

* Use sum appropriately in linear_regression_rows_nd

* Deleted extra blank line

* Don't use typeToTypeInfo, make NumericPrimitives the source of these decisions

* Better assertions, with tests

* Got the summation index correct

* Add documentation

* [website] fix resource path for non-html files in the docs (hail-is#10196)

* [query] Remove tcode from primitive orderings (hail-is#10193)

* [query] BlockMatrix map (hail-is#10195)

* Add map, but protect users of the spark backend from writing arbitrary maps

* If densify would have been a no-op, that should work

* Densify and Sparsify are no-ops for now

* Rename map to map_dense and map_sparse. Give better implementations for add, multiply, divide, subtract of a scalar

* Make the maps underscore methods

* [query] Remove all uses of .tcode[Boolean] (hail-is#10198)

* [ci] make test hello speak https (hail-is#10192)

* [tls] make hello use tls

* change pylint ignore message

* [query] blanczos_pca dont do extra loading work (hail-is#10201)

* Use the checkpointed table from mt_to_table_of_ndarray to avoid recomputing mt

* Keep extra row fields from being included

* Add query graceful shutdown for rolling updates (hail-is#10106)

* Merge pull request #35 from populationgenomics/add-query-graceful-shutdown

Add query graceful shutdown

* Remove unused argument from query:on_shutdown

* [auth] add more options for obtaining session id for dev credentials (hail-is#10203)

* [auth] add more options for obtaining session id for dev credentials

* [auth] extract userinfo query for use in both userinfo and verify_dev_credentials

* remove unused import

* [query] Default to Spark 3 (hail-is#10054)

* Change hail to use spark3 and scala 2.12 by default, change build_hail_spar3 to instead test spark2 for backwards support

* Update Makefile

* Update dataproc image version

* Scale down the dataproc version, since latest dataproc is using Spark release candidate

* Update pyspark version in requirements.txt

* Bump scala/spark patch versions

* We want to use the newer py4j jar when using spark 3

* Upgrade json4s

* I now want Spark 3.1.1, since it's been released

* Upgrade to 3.1.1 in the Makefile, fix a deprecateed IOUtils method

* Update pyspark as well

* Don't update json4s

* Try upgrading version

* Fixed issue for constructing bufferspecs

* Should at least be using newest one

* Remove abstracts from type hints

* Revert "Remove abstracts from type hints"

This reverts commit 1e0d194.

* Things don't go well if I don't use the same json4s version as Spark

* Mixed a typeHintFieldName

* See if this fixes my BlockMatrixSparsity issue

* json4s can't handle a curried apply method

* This works so long as the jar file is included in the libs directory

* Makefile changes to support pulling elasticsearch

* Use dataproc image for Spark 3.1.1

* Update patch version of dataproc image, no longer uses Spark RC

* Fixed up Makefile, now correctly depends on copying the jar

* Now we just check that the specified version is 7, as that's all we support

* Delete build_hail_spark2, we can't support spark2

* Version checks for Scala and Spark

* Updated installation docs

* Spark versions warning

* Update some old pysparks

* [batch] Add more info to UI pages (hail-is#10070)

* [batch] Add more info to UI pages

* fixes

* addr comment

* addr comments

* Bump jinja2 from 2.10.1 to 2.11.3 in /docker (hail-is#10209)

Bumps [jinja2](https:/pallets/jinja) from 2.10.1 to 2.11.3.
- [Release notes](https:/pallets/jinja/releases)
- [Changelog](https:/pallets/jinja/blob/master/CHANGES.rst)
- [Commits](pallets/jinja@2.10.1...2.11.3)

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [docker][hail] update to latest pytest (hail-is#10177)

* [docker][hail] update to latest pytest

Issues like this https://ci.hail.is/batches/221291/jobs/112 do not appear locally for me,
I suspect this is due to my using a much newer pytest.

* fix many tests incorrectly using pytest

* another one

* remove unnecessary pip installs in service test dockerfiles

* fix

* [gateway] Cut out router and router-resolver from gateway internal routing (hail-is#10207)

* [gateway] cut out router-resolver from internal auth flow

* [gateway] cut out router from internal

* [datasets] add pan-ukb datasets (hail-is#10186)

* add available pan-ukb datasets

* add rst files for schemas

* reference associated variant indices HT in the block matrix descriptions

* [query] Add json warn context to `parse_json` (hail-is#10160)

We don't test the logs, but I did test this manually, it works as
expected.

* [query] fix tmp_dir default in init(), which doesn't work for the service backend (hail-is#10199)

* Fix tmp_dir default, which doesn't work for the service backend.

* Fix type for tmp_dir.

* [gitignore]ignore website and doc files (hail-is#10214)

* Remove duplicate on_shutdown in query service

Co-authored-by: jigold <[email protected]>
Co-authored-by: Tim Poterba <[email protected]>
Co-authored-by: Daniel Goldstein <[email protected]>
Co-authored-by: Dan King <[email protected]>
Co-authored-by: John Compitello <[email protected]>
Co-authored-by: Christopher Vittal <[email protected]>
Co-authored-by: Michael Franklin <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Patrick Cummings <[email protected]>
Co-authored-by: Carolin Diaz <[email protected]>
vladsavelyev pushed a commit to populationgenomics/hail that referenced this pull request Mar 26, 2021
@illusional illusional deleted the add-query-graceful-shutdown-upstream branch June 14, 2023 00:04