
[Test] Add full cluster restart test for Rollup #31533

Merged
merged 2 commits into elastic:master on Jun 26, 2018

Conversation

@tlrx (Member) commented Jun 22, 2018

This pull request adds a full cluster restart test for a Rollup job. The test creates and starts a Rollup job on the cluster, then checks that the job still exists and is correctly started on the upgraded cluster.
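For context, a rough sketch of what the create-and-start half of such a test could look like, assuming an ESRestTestCase-based test using the low-level REST client (client(), entityAsMap()); the job id mirrors the snippet quoted later in the review, while the flag name and the job configuration are illustrative assumptions, not the exact code in this PR:

// Sketch only: assumes a test class extending ESRestTestCase, which provides
// client() and entityAsMap(), plus a hypothetical flag for the "old cluster" phase.
public void testRollupJobSurvivesRestart() throws Exception {
    if (runningAgainstOldCluster) { // hypothetical flag: true before the upgrade
        // create the rollup job (illustrative configuration)
        final Request createRollupJobRequest = new Request("PUT", "_xpack/rollup/job/rollup-job-test");
        createRollupJobRequest.setJsonEntity("{"
            + "\"index_pattern\": \"rollup-docs-*\","
            + "\"rollup_index\": \"results-rollup\","
            + "\"cron\": \"*/30 * * * * ?\","
            + "\"page_size\": 100,"
            + "\"groups\": { \"date_histogram\": { \"field\": \"timestamp\", \"interval\": \"5m\" } },"
            + "\"metrics\": [ { \"field\": \"value\", \"metrics\": [ \"max\" ] } ]"
            + "}");
        Map<String, Object> createRollupJobResponse =
            entityAsMap(client().performRequest(createRollupJobRequest));
        assertThat(createRollupJobResponse.get("acknowledged"), equalTo(Boolean.TRUE));

        // start the rollup job
        final Request startRollupJobRequest = new Request("POST", "_xpack/rollup/job/rollup-job-test/_start");
        Map<String, Object> startRollupJobResponse =
            entityAsMap(client().performRequest(startRollupJobRequest));
        assertThat(startRollupJobResponse.get("started"), equalTo(Boolean.TRUE));
    }
    // the "upgraded cluster" phase then asserts that the job still exists and is started
}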

This test verifies that the persistent task state is correctly parsed from the cluster state after the upgrade, since the status field was renamed to state in #31031.

The test also uncovers a ClassCastException that can be thrown in the RollupIndexer when the timestamp has a very low value that fits into an integer. In that case the value is parsed back as an Integer instead of a Long, and (long) position.get(rollupFieldName) fails.
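To illustrate the failure mode outside of the Rollup code (a minimal, standalone example, not code from this PR): pulling a boxed Integer out of a Map<String, Object> and casting it to long throws, because the cast unboxes through Long.

import java.util.HashMap;
import java.util.Map;

public class LongCastDemo {
    public static void main(String[] args) {
        Map<String, Object> position = new HashMap<>();
        // a small timestamp that a parser mapped back to Integer instead of Long
        position.put("timestamp.date_histogram", 3600000);

        Object value = position.get("timestamp.date_histogram");
        try {
            long lowerBound = (long) value; // unboxes via (Long) value
            System.out.println("lowerBound = " + lowerBound);
        } catch (ClassCastException e) {
            // java.lang.Integer cannot be cast to java.lang.Long
            System.out.println("fails as described above: " + e);
        }
    }
}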

@bleskes This is the follow-up test you asked for in #31031 (comment)

@tlrx tlrx added review v7.0.0 :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data v6.4.0 labels Jun 22, 2018
@tlrx tlrx requested review from bleskes and polyfractal June 22, 2018 13:43
@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

@tlrx tlrx added the >test Issues or PRs that are addressing/adding tests label Jun 22, 2018
assertThat(createRollupJobResponse.get("acknowledged"), equalTo(Boolean.TRUE));

// start the rollup job
final Request startRollupJobRequest = new Request("POST", "_xpack/rollup/job/rollup-job-test/_start");
Contributor
Technically, you don't have to start the job if you just want to make sure the task is running. We start the persistent task when the job is created in the CreateRollupJob API.

Although I wonder if there might be a potentially rare timing issue here? The CreateRollupJob API returns when the persistent task framework acknowledges that task was created. But I believe there might be a lag between that and when the allocated task is actually created on the target node?

So it might be possible for the StartJob API (or assertRollupJob() if starting is skipped) to fail because the allocated task hasn't started yet? Maybe we need an assertBusy checking for the job, like below?

I may be wrong about that though, you're much more familiar with how the persistent tasks work :)

Member Author
Technically, you don't have to start the job if you just want to make sure the task is running. We start the persistent task when the job is created in the CreateRollupJob API.

Ok, I've been fooled by the doc.

Although I wonder if there might be a potentially rare timing issue here?

Good point. I'll take a look at this, thanks.

Contributor
Ah yeah, the docs are a little confusing with regard to how it works internally. The persistent/allocated task is always running; start and stop just toggle internal state. If the task doesn't exist, the job doesn't exist, basically.

@tlrx (Member Author) Jun 25, 2018
I agree. The rollup job is started so that we can ensure that it is correctly restarted after upgrade: it means that the state was correctly persisted on the old cluster and correctly deserialized back on the upgraded cluster, which is what we want to test.
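A minimal sketch of the kind of assertBusy() check discussed above, assuming ESRestTestCase/ESTestCase helpers (client(), entityAsMap(), assertBusy()) and the rollup-job-test job id; it illustrates the technique rather than the exact assertion added to the test:

// wait for the allocated task to show up, to avoid the rare window between
// persistent task registration and task allocation on the target node
assertBusy(() -> {
    Request getRollupJobRequest = new Request("GET", "_xpack/rollup/job/rollup-job-test");
    Map<String, Object> getRollupJobResponse = entityAsMap(client().performRequest(getRollupJobRequest));
    List<?> jobs = (List<?>) getRollupJobResponse.get("jobs");
    assertNotNull(jobs);
    assertThat(jobs.size(), equalTo(1));
}, 30, TimeUnit.SECONDS);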

@@ -415,7 +415,11 @@ private QueryBuilder createBoundaryQuery(Map<String, Object> position) {
DateHistoGroupConfig dateHisto = job.getConfig().getGroupConfig().getDateHisto();
String fieldName = dateHisto.getField();
String rollupFieldName = fieldName + "." + DateHistogramAggregationBuilder.NAME;
long lowerBound = position != null ? (long) position.get(rollupFieldName) : 0;
Contributor
Oi, I would not have thought of this in a million years. Thanks for fixing :)
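For illustration, one defensive way to read the lower bound that tolerates either boxed type (a sketch of the general technique, which may differ from the exact change made in this PR):

// position.get(rollupFieldName) can come back as an Integer or a Long depending
// on how the persisted value was parsed, so go through Number instead of a raw
// (long) cast
Object previousBound = position != null ? position.get(rollupFieldName) : null;
long lowerBound = previousBound instanceof Number ? ((Number) previousBound).longValue() : 0L;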

@polyfractal (Contributor)

Left a question about timing in the test, but otherwise this LGTM! Thanks @tlrx :)

@bleskes (Contributor) commented Jun 22, 2018

Thanks @tlrx for picking this up. It's a very important safety net. @droberts195 can you maybe confirm similar tests exist for ML?

@tlrx (Member Author) commented Jun 25, 2018

@polyfractal I updated the code according to your comments. Can you have another look, please?

@polyfractal (Contributor) left a comment
Changes LGTM :)

@tlrx tlrx merged commit be9292c into elastic:master Jun 26, 2018
@tlrx tlrx deleted the add-rollup-full-restart-test branch June 26, 2018 08:07
tlrx added a commit that referenced this pull request Jun 26, 2018
This pull request adds a full cluster restart test for a Rollup job.
The test creates and starts a Rollup job on the cluster and checks
that the job still exists and is correctly started on the upgraded
cluster.

This test verifies that the persistent task state is correctly
parsed from the cluster state after the upgrade, since the status
field was renamed to state in #31031.

The test also uncovers a ClassCastException that can be thrown in
the RollupIndexer when the timestamp has a very low value that fits
into an integer. In that case the value is parsed back as an
Integer instead of a Long, and (long) position.get(rollupFieldName)
fails.
tlrx added a commit that referenced this pull request Jun 26, 2018
@tlrx (Member Author) commented Jun 26, 2018

Thanks @polyfractal! I also pushed the test to the 6.3 branch (1adf4f1) as I found it useful for 6.3 too.

@tlrx tlrx added the v6.3.1 label Jun 26, 2018
@droberts195 (Contributor)

can you maybe confirm similar tests exist for ML?

Not at the moment - I'll add one.

@bleskes (Contributor) commented Jun 26, 2018

Thx @droberts195 !

dnhatn added a commit that referenced this pull request Jun 26, 2018
* 6.x:
  Fix broken backport of #31578 by adjusting constructor (#31587)
  ingest: Add ignore_missing property to foreach filter (#22147) (#31578)
  Add package pre-install check for java binary (#31343)
  Docs: Clarify sensitive fields watcher encryption (#31551)
  Watcher: Remove never executed code (#31135)
  Improve test times for tests using `RandomObjects::addFields` (#31556)
  Revert "Remove RestGetAllAliasesAction (#31308)"
  REST high-level client: add simulate pipeline API (#31158)
  Get Mapping API to honour allow_no_indices and ignore_unavailable (#31507)
  Fix Mockito trying to mock IOException that isn't thrown by method (#31433) (#31527)
  [Test] Add full cluster restart test for Rollup (#31533)
  Enhance thread context uniqueness assertion
  fix writeIndex evaluation for aliases (#31562)
  Add x-opaque-id to search slow logs (#31539)
  Watcher: Fix put watch action (#31524)
  [DOCS] Significantly improve SQL docs
  turn GetFieldMappingsResponse to ToXContentObject (#31544)
  TEST: Unmute testHistoryUUIDIsGenerated
  Ingest Attachment: Upgrade Tika to 1.18 (#31252)
  TEST: Correct the assertion arguments order (#31540)
dnhatn added a commit that referenced this pull request Jun 26, 2018
* master:
  ingest: Add ignore_missing property to foreach filter (#22147) (#31578)
  Fix a formatting issue in the docvalue_fields documentation. (#31563)
  reduce log level at gradle configuration time
  [TEST] Close additional clients created while running yaml tests (#31575)
  Docs: Clarify sensitive fields watcher encryption (#31551)
  Watcher: Remove never executed code (#31135)
  Add support for switching distribution for all integration tests (#30874)
  Improve robustness of geo shape parser for malformed shapes (#31449)
  QA: Create xpack yaml features (#31403)
  Improve test times for tests using `RandomObjects::addFields` (#31556)
  [Test] Add full cluster restart test for Rollup (#31533)
  Enhance thread context uniqueness assertion
  [DOCS] Fix heading format errors (#31483)
  fix writeIndex evaluation for aliases (#31562)
  Add x-opaque-id to search slow logs (#31539)
  Watcher: Fix put watch action (#31524)
  Add package pre-install check for java binary (#31343)
  Reduce number of raw types warnings (#31523)
  Migrate scripted metric aggregation scripts to ScriptContext design (#30111)
  turn GetFieldMappingsResponse to ToXContentObject (#31544)
  Close xcontent parsers (partial) (#31513)
  Ingest Attachment: Upgrade Tika to 1.18 (#31252)
  TEST: Correct the assertion arguments order (#31540)
Labels
:StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data >test Issues or PRs that are addressing/adding tests v6.3.1 v6.4.0 v7.0.0-beta1