
Increase the amount of request in the stress test, using multiple threads #90

Merged
merged 36 commits into master from multi_threading_stress_test
Nov 24, 2021

Conversation

ahandan
Collaborator

@ahandan ahandan commented Oct 21, 2021


Overview

Enhancement for the stress_test.
Increase the number of requests by using multiple threads, to raise the chance of making requests in parallel as much as possible.
The increased number of requests also increases the chance of forcing the cache to run a refresh cycle by filling it up and then making a request that overlaps the cache refresh cycle, reproducing a known issue (NoTransaction, DetachedInstance).

Changes

  • Multithreaded execution of the tests

Related Issue / Discussion

This enhancement of the stress test references:
Magpie issue (More caching/session re-connect and logging #473, Ouranosinc/Magpie#473)

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@ahandan ahandan changed the title Making the actual test a multitreaded one, increase the number of req… Increase the amount of request in the stress test, using multiple threads Oct 22, 2021
@tlvu
Contributor

tlvu commented Oct 25, 2021

@ahandan if this PR is ready for a review from me, please explicitly set me as the reviewer. Otherwise I won't be touching it.

That said, I was curious and opened it up in ReviewNB (https://app.reviewnb.com/Ouranosinc/PAVICS-e2e-workflow-tests/pull/90/) and I ended up with this error: "Invalid notebook file. Please make sure both before and after version of the file contains valid notebook JSON."

@ahandan ahandan requested review from matprov and tlvu October 25, 2021 19:22
Contributor

@tlvu tlvu left a comment


First pass review, I'll have to come back to it again. I haven't wrapped my head around the full changes yet. Thanks for the fix.

Contributor

@fmigneault fmigneault left a comment


see comments

@ahandan
Collaborator Author

ahandan commented Oct 29, 2021

Changes:

  • Reverted to "pavics.ouranos.ca"

  • Progression is now a class rather than a dictionary.

  • The timestamps are registered at each request rather than at the end of multiple requests (Result).

  • Combined the similar tests into one function "Stress_test", with a test_type specifying which test to run ("wps" or "thredds").

  • 2 new methods:
    - Stress_test: makes multiple requests to the server
    - run_threads: runs a target function in a multithreaded environment (see the sketch below)
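
For context, a minimal sketch of what this run_threads / Stress_test split could look like. The Progression class, its fields, and the exact signatures here are assumptions for illustration, not the notebook's actual code:

```python
import threading

class Progression:
    """Hypothetical shared result collector (an assumption, not the notebook's class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.failed_count = 0
        self.failures = []

    def record_failure(self, message):
        # Lock so several worker threads can record failures safely.
        with self._lock:
            self.failed_count += 1
            self.failures.append(message)

    def fail_report(self):
        return "\n".join(self.failures)


def run_threads(target_test, url, n_threads, runs_per_threads):
    """Run target_test concurrently in n_threads threads and fail on any recorded error."""
    progression = Progression()
    threads = [
        threading.Thread(target=target_test, args=(url, runs_per_threads, progression))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Only the assertion message survives in the Jenkins console log.
    assert progression.failed_count == 0, progression.fail_report()
    print(f"All threads passed {target_test.__name__} function.")
```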

Contributor

@fmigneault fmigneault left a comment


Functionality-wise, all looks good.
I only have a few cosmetic comments regarding the output format; none of them are blocking.

Move the print(f'\n Execution # {i+1}') statement into run_threads, with an input parameter that provides the "run_index".

Remove extra spaces before:

  • " Timestamps" (this is hackish, not how it is intended to be used)
    The field should be w=24 (4 spaces before + 20 from the timestamp length itself) for right align.
    Or, w=14 for left align (4 spaces before + len("Timestamp") == 10).
  • Execution #<index> (to align on the left of output)
  • Thread(T2) Stress Test ... (to align on the left of output)

@ahandan
Collaborator Author

ahandan commented Nov 2, 2021

@fmigneault, I agree about the "Timestamps"; I don't know exactly how to make it look good. I tried to align it on the left, but then it's hard to keep track of the line. For me it's better that the columns are close to each other, but separated enough to distinguish them.
So for now, I've set w = 16 and added 5 spaces before the timestamps so that they are separated from Times.
Let me know what you think about it: is it clear, or should we just align everything on the right?

@fmigneault
Contributor

fmigneault commented Nov 3, 2021

@ahandan

You have to look at the fields this way.
Since you put w=16 and the format uses :>, the fields are right-aligned to that width.

Because Codes, Delta and Times all contain values shorter than 16, the full width is 16.
The spacing between those "headers" is 11 spaces, because 16 - len("Codes") = 11.
Since "Timestamps" is 10 chars (5 more than "Codes", "Delta", "Times"), you need to add 5 (in the header variable, for the last field) to keep the same distance.

The timestamp values themselves are 20 chars. This is >16, so the field overflows to the left.
You could simply use a pseudo left-align for the timestamp values by right-aligning them with 26 (or 20 + any other extra spacing you add to the Timestamps header).
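
To make the alignment mechanics concrete, here is a small standalone sketch of the :>{width} formatting being discussed; the widths and sample values are illustrative only, not the notebook's actual code:

```python
w = 16       # field width used for the Codes / Delta / Times columns
extra = 5    # extra spacing added for the wider Timestamps column

# Both the header and the values use ':>' (right-align) with the chosen widths.
header = (f"{'Run':>5}{'Codes':>{w}}{'Delta':>{w}}{'Times':>{w}}"
          f"{'Timestamps':>{w + extra}}")
row = (f"{1:>5}{'(!) 500':>{w}}{'0.000s':>{w}}{'0.137s':>{w}}"
       f"{'2021-11-22 17:32:38':>{w + extra}}")

print(header)
print(row)
```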

[screenshot: example of the aligned output]

@fmigneault
Contributor

I'm fine with the way the results are reported.
Maybe just check the output format; I'm not sure if it's only on my side or in the real output. It is somewhat hard to read:

19:14:24  �[0;31m---------------------------------------------------------------------------�[0m
19:14:24  �[0;31mAssertionError�[0m                            Traceback (most recent call last)
19:14:24  �[0;32m/tmp/ipykernel_1032/1504455324.py�[0m in �[0;36m<module>�[0;34m�[0m
19:14:24  �[1;32m      7�[0m             �[0murl�[0m �[0;34m=�[0m �[0mthredds_url�[0m�[0;34m,�[0m�[0;34m�[0m�[0;34m�[0m�[0m
19:14:24  �[1;32m      8�[0m             �[0mn_threads�[0m �[0;34m=�[0m �[0mTEST_N_THREADS_THREDDS�[0m�[0;34m,�[0m�[0;34m�[0m�[0;34m�[0m�[0m
19:14:24  �[0;32m----> 9�[0;31m             runs_per_threads = TEST_RUNS_THREDDS)
19:14:24  �[0m
19:14:24  �[0;32m/tmp/ipykernel_1032/4243812779.py�[0m in �[0;36mrun_threads�[0;34m(target_test, url, execution_id, n_threads, runs_per_threads)�[0m
19:14:24  �[1;32m     29�[0m         �[0mthread�[0m�[0;34m.�[0m�[0mjoin�[0m�[0;34m(�[0m�[0;34m)�[0m�[0;34m�[0m�[0;34m�[0m�[0m
19:14:24  �[1;32m     30�[0m �[0;34m�[0m�[0m
19:14:24  �[0;32m---> 31�[0;31m     �[0;32massert�[0m �[0mstressTestProgression�[0m�[0;34m.�[0m�[0mfailed_count�[0m �[0;34m==�[0m �[0;36m0�[0m�[0;34m,�[0m �[0mstressTestProgression�[0m�[0;34m.�[0m�[0mfail_report�[0m�[0;34m(�[0m�[0;34m)�[0m�[0;34m�[0m�[0;34m�[0m�[0m
19:14:24  �[0m�[1;32m     32�[0m     �[0mprint�[0m�[0;34m(�[0m�[0;34mf"All threads passed {target_test.__name__} function."�[0m�[0;34m)�[0m�[0;34m�[0m�[0;34m�[0m�[0m
19:14:24  �[1;32m     33�[0m �[0;34m�[0m�[0m
19:14:24  
19:14:24  �[0;31mAssertionError�[0m: This test failed due to bad code received from the requests.
19:14:24   Number of failed thread(s) : 3/ 5
19:14:24   Number of failed requests : 5/2500
19:14:24   First failed request timestamp : 2021-11-17 00:13:29

@ahandan
Collaborator Author

ahandan commented Nov 18, 2021

@fmigneault, as for the weird outputs, these (�[0;31m) are the colors of the outputs.

Console output

vs

Plain text

Looks like it's a Jenkins problem. Issue

@ahandan ahandan requested a review from tlvu November 18, 2021 20:51
Contributor

@tlvu tlvu left a comment


It is possible to actually show the error, like previously in comment #90 (comment)

The error currently displayed in Jenkins does not have enough information: https://daccs-jenkins.crim.ca/job/PAVICS-e2e-workflow-tests/job/multi_threading_stress_test/46/console:

18:57:14  _____________________ notebooks/stress-tests.ipynb::Cell 4 _____________________
18:57:14  Notebook cell execution failed
18:57:14  Cell 4: Cell execution caused an exception
18:57:14  
18:57:14  Input:
18:57:14  # NBVAL_IGNORE_OUTPUT
18:57:14  
18:57:14  # Executing THREDDS test on multiple threads
18:57:14  thredds_url = f"{TWITCHER_URL}/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc"
18:57:14  
18:57:14  run_threads(target_test = stress_test,
18:57:14              url = thredds_url,
18:57:14              n_threads = TEST_N_THREADS_THREDDS,
18:57:14              runs_per_threads = TEST_RUNS_THREDDS)
18:57:14  
18:57:14  Traceback:
18:57:14  
18:57:14  ---------------------------------------------------------------------------
18:57:14  AssertionError                            Traceback (most recent call last)
18:57:14  /tmp/ipykernel_1129/1504455324.py in <module>
18:57:14        7             url = thredds_url,
18:57:14        8             n_threads = TEST_N_THREADS_THREDDS,
18:57:14  ----> 9             runs_per_threads = TEST_RUNS_THREDDS)
18:57:14  
18:57:14  /tmp/ipykernel_1129/4243812779.py in run_threads(target_test, url, execution_id, n_threads, runs_per_threads)
18:57:14       29         thread.join()
18:57:14       30 
18:57:14  ---> 31     assert stressTestProgression.failed_count == 0, stressTestProgression.fail_report()
18:57:14       32     print(f"All threads passed {target_test.__name__} function.")
18:57:14       33 
18:57:14  
18:57:14  AssertionError: This test failed due to bad code received from the requests.
18:57:14   Number of failed thread(s) : 2/ 5
18:57:14   Number of failed requests : 3/1000
18:57:14   First failed request timestamp : 2021-11-17 18:56:51

We do not know which of the WPS services or THREDDS is/are failing, what bad HTTP code was returned, or how long it took.

You spent a lot of time formatting the output, so why not simply show it? Only for the runs with errors, of course, not all the runs.

@tlvu
Contributor

tlvu commented Nov 18, 2021

In case it's not clear, stress-tests.ipynb has 2 purposes:

  • run manually to quickly reproduce the error
  • run by Jenkins as part of an automated test suite to catch regressions

For the Jenkins purpose, the result displayed when it fails in Jenkins is what matters. All the nice-looking cell outputs do not make it to Jenkins, and there is no way to retrieve them after the fact.

In the previous implementation, the failed outputs were saved and given to the assert error message. That is how we displayed only the failed cell outputs when a failure occurred during a Jenkins run. A sketch of that pattern follows.
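
A minimal sketch of that pattern, with made-up names and data rather than the notebook's actual code:

```python
def check_runs(results, expected_code=200):
    """Keep only failing outputs and surface them through the assert message."""
    failed_reports = [r["report"] for r in results if r["code"] != expected_code]
    # Cell outputs are discarded in the Jenkins run; only the assertion
    # message reaches the console log, so the failure details go there.
    assert not failed_reports, "\n".join(failed_reports)

check_runs([{"code": 200, "report": "run 1 ok"}])  # passes silently
```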

@fmigneault
Contributor

@ahandan

I agree with @tlvu that, for convenience, it would be great to have the failing subset of threads/URLs reported directly in the raised message.
Sorry, this changes the output summary we previously discussed.

Do you have an ETA to implement this?
This PR is blocking #197 and another one coming soon by @ChaamC.
If this adjustment takes too long, I would rather have the current iteration integrated and a subsequent update for the output formatting.

@tlvu
Contributor

tlvu commented Nov 22, 2021

I agree with @tlvu that for convenience, it would be great to have the failing subset of threads/URL reported directly in the raised message.

It's not just a "convenience". Jenkins build failure is useless if we are unable to have the failing info to diagnose !

This PR is blocking bird-house/birdhouse-deploy#197 and another one coming soon by @ChaamC.
If this adjustment takes too long, I would rather have the current iteration integrated and a subsequent update for the output formatting.

If it helps remove pressure and unblock things, I can accept merging this PR as-is as long as a followup PR with proper error reporting when Jenkins fails is done promptly right after.

@fmigneault
Contributor

It's not just a "convenience". Jenkins build failure is useless if we are unable to have the failing info to diagnose !

Everything is available in the notebook outputs. It's just more trouble to load them in Jupyter than to have them already in the logs.

@ahandan
Collaborator Author

ahandan commented Nov 22, 2021

OK, no problem with changing the outputs.
I have already saved the exceptions and failed results in case we change our minds.

I changed the assert printout to print only the failed runs, using the same output as the results one.
Also added an Exceptions section printing the reason,
e.g.: (500 : Internal Server Error)

@tlvu Let me know which you would rather have in the Exceptions area:

  1. the entire Error Message like :
<ExceptionReport version="1.0.0"
    xmlns="http://www.opengis.net/ows/1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
    <Exception exceptionCode="NoApplicableCode" locator="NoApplicableCode">
        <ExceptionText>Unknown Error</ExceptionText>
    </Exception>
</ExceptionReport>
  2. A simple: 500 : Internal Server Error

Here is an example of the output from the test :
Test 1
Test 2

12:32:50  ---------------------------------------------------------------------------
12:32:50  AssertionError                            Traceback (most recent call last)
12:32:50  /tmp/ipykernel_1067/1504455324.py in <module>
12:32:50        7             url = thredds_url,
12:32:50        8             n_threads = TEST_N_THREADS_THREDDS,
12:32:50  ----> 9             runs_per_threads = TEST_RUNS_THREDDS)
12:32:50  
12:32:50  /tmp/ipykernel_1067/1081959450.py in run_threads(target_test, url, execution_id, n_threads, runs_per_threads)
12:32:50       37     assert (
12:32:50       38         stress_test_progression.failed_threads_count == 0
12:32:50  ---> 39     ), stress_test_progression.fail_report()
12:32:50       40 
12:32:50       41     print(f"All threads passed {target_test.__name__} function.")
12:32:50  
12:32:50  AssertionError: This test failed due to bad code received from the requests.
12:32:50   Number of failed thread(s) : 1/ 5
12:32:50   Number of failed requests : 1/500
12:32:50  
12:32:50  
12:32:50   Stress Test:
12:32:50      Test:
12:32:50          code: 200
12:32:50          runs: 100
12:32:50          max-avg-time:  1s
12:32:50          max-err-code:  0
12:32:50          sum-err-code:  1
12:32:50          timeout-abort: 0s
12:32:50          timeout-retry: 0
12:32:50          timeout-count: 0
12:32:50      Request:
12:32:50          method: GET
12:32:50          url:    https://hirondelle.crim.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc
12:32:50          args:   {}
12:32:50      Times:
12:32:50          min: 0.137s
12:32:50          avg: 0.001s
12:32:50          max: 0.137s
12:32:50      Results:
12:32:50          Run                 Codes                 Delta                 Times            Timestamps
12:32:50            1               (!) 500                0.000s                0.137s   2021-11-22 17:32:38
12:32:50      Summary:
12:32:50          Detected 1 erroneous HTTP codes not equal to expected 200.
12:32:50          Test failed (status=-1).
12:32:50   === Exceptions ===
12:32:50   1. Internal Server Error
12:32:50  
12:32:50  =========================== short test summary info ============================

@tlvu
Contributor

tlvu commented Nov 22, 2021

It's not just a "convenience". Jenkins build failure is useless if we are unable to have the failing info to diagnose !

Everything is available in the notebook outputs. Its just more trouble to load them in jupyter than have them already in logs.

The actual notebook outputs are discarded!

The outputs that are being saved are the result of another run, and in the case of an intermittent problem like this caching issue, there is no guarantee we will get the same result. Also, another run means none of the timestamps match.

See

booleanParam(name: 'SAVE_RESULTING_NOTEBOOK', defaultValue: true,
description: '''Check the box to save the resulting notebooks of the run.
Note this is another run, will double the time and no guaranty to have same error as the run from py.test.''')

@tlvu
Contributor

tlvu commented Nov 22, 2021

I changed to assert printout to print only all the failed runs as the same output from the results one. Also added a Exception Section printing the Reason ex : (500 : Internal Server Error)

@tlvu Let me know what would rather have in the Exceptions area

1. the entire Error Message like :
<ExceptionReport version="1.0.0"
    xmlns="http://www.opengis.net/ows/1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
    <Exception exceptionCode="NoApplicableCode" locator="NoApplicableCode">
        <ExceptionText>Unknown Error</ExceptionText>
    </Exception>
</ExceptionReport>
2. A simple : 500 : Internal Server Error

Can we combine both, so we have the bad code and the actual error matching the bad code? This will be particularly useful when we have multiple failing codes. For example, we have seen in the past that both 500 and 408 can happen in the same run. Having the actual error matching the code will speed up the diagnostic.

Here is an example of the output from the test : Test 1 Test 2

12:32:50  ---------------------------------------------------------------------------
12:32:50  AssertionError                            Traceback (most recent call last)
12:32:50  /tmp/ipykernel_1067/1504455324.py in <module>
12:32:50        7             url = thredds_url,
12:32:50        8             n_threads = TEST_N_THREADS_THREDDS,
12:32:50  ----> 9             runs_per_threads = TEST_RUNS_THREDDS)
12:32:50  
12:32:50  /tmp/ipykernel_1067/1081959450.py in run_threads(target_test, url, execution_id, n_threads, runs_per_threads)
12:32:50       37     assert (
12:32:50       38         stress_test_progression.failed_threads_count == 0
12:32:50  ---> 39     ), stress_test_progression.fail_report()
12:32:50       40 
12:32:50       41     print(f"All threads passed {target_test.__name__} function.")
12:32:50  
12:32:50  AssertionError: This test failed due to bad code received from the requests.
12:32:50   Number of failed thread(s) : 1/ 5
12:32:50   Number of failed requests : 1/500
12:32:50  
12:32:50  
12:32:50   Stress Test:
12:32:50      Test:
12:32:50          code: 200
12:32:50          runs: 100
12:32:50          max-avg-time:  1s
12:32:50          max-err-code:  0
12:32:50          sum-err-code:  1
12:32:50          timeout-abort: 0s
12:32:50          timeout-retry: 0
12:32:50          timeout-count: 0
12:32:50      Request:
12:32:50          method: GET
12:32:50          url:    https://hirondelle.crim.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc
12:32:50          args:   {}
12:32:50      Times:
12:32:50          min: 0.137s
12:32:50          avg: 0.001s
12:32:50          max: 0.137s
12:32:50      Results:
12:32:50          Run                 Codes                 Delta                 Times            Timestamps
12:32:50            1               (!) 500                0.000s                0.137s   2021-11-22 17:32:38
12:32:50      Summary:
12:32:50          Detected 1 erroneous HTTP codes not equal to expected 200.
12:32:50          Test failed (status=-1).
12:32:50   === Exceptions ===
12:32:50   1. Internal Server Error
12:32:50  
12:32:50  =========================== short test summary info ============================

Oh, this is great!

So I see that combining both the status code and the actual error for the exception will look like this (a rough generating sketch follows the example):

 === Exceptions ===
1. 500 : Internal Server Error
<ExceptionReport version="1.0.0"
     xmlns="http://www.opengis.net/ows/1.1"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
     <Exception exceptionCode="NoApplicableCode" locator="NoApplicableCode">
         <ExceptionText>Unknown Error</ExceptionText>
     </Exception>
 </ExceptionReport>

3. 408 : Request Timeout
 <ExceptionReport>
    blah blah blah ...
 </ExceptionReport>
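
For illustration, a rough sketch of how such an Exceptions section could be assembled; the failure fields and helper name are assumptions, not the notebook's code:

```python
from http.client import responses

def format_exceptions(failures):
    """Render '<n>. <code> : <reason>' followed by the raw response body."""
    lines = ["=== Exceptions ==="]
    for i, failure in enumerate(failures, start=1):
        reason = responses.get(failure["code"], "Unknown")
        lines.append(f"{i}. {failure['code']} : {reason}")
        lines.append(failure["body"])  # e.g. the OWS <ExceptionReport> XML
    return "\n".join(lines)

print(format_exceptions([
    {"code": 500, "body": "<ExceptionReport>...</ExceptionReport>"},
    {"code": 408, "body": "<ExceptionReport>...</ExceptionReport>"},
]))
```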

@ahandan
Collaborator Author

ahandan commented Nov 23, 2021

@tlvu @fmigneault
Here is the latest test output :
The test Jenkins_Test

10:12:59  ---------------------------------------------------------------------------
10:12:59  AssertionError                            Traceback (most recent call last)
10:12:59  /tmp/ipykernel_1135/1504455324.py in <module>
10:12:59        7             url = thredds_url,
10:12:59        8             n_threads = TEST_N_THREADS_THREDDS,
10:12:59  ----> 9             runs_per_threads = TEST_RUNS_THREDDS)
10:12:59  
10:12:59  /tmp/ipykernel_1135/1081959450.py in run_threads(target_test, url, execution_id, n_threads, runs_per_threads)
10:12:59       37     assert (
10:12:59       38         stress_test_progression.failed_threads_count == 0
10:12:59  ---> 39     ), stress_test_progression.fail_report()
10:12:59       40 
10:12:59       41     print(f"All threads passed {target_test.__name__} function.")
10:12:59  
10:12:59  AssertionError: This test failed due to bad code received from the requests.
10:12:59   Number of failed thread(s) : 3/ 5
10:12:59   Number of failed requests : 7/2500
10:12:59  
10:12:59  
10:12:59   Stress Test:
10:12:59      Test:
10:12:59          code: 200
10:12:59          runs: 500
10:12:59          max-avg-time:  1s
10:12:59          max-err-code:  0
10:12:59          sum-err-code:  2
10:12:59          timeout-abort: 0s
10:12:59          timeout-retry: 0
10:12:59          timeout-count: 0
10:12:59      Request:
10:12:59          method: GET
10:12:59          url:    https://hirondelle.crim.ca/twitcher/ows/proxy/thredds/catalog/birdhouse/testdata/catalog.html?dataset=birdhouse/testdata/ta_Amon_MRI-CGCM3_decadal1980_r1i1p1_199101-200012.nc
10:12:59          args:   {}
10:12:59      Times:
10:12:59          min: 0.064s
10:12:59          avg: 0.000s
10:12:59          max: 0.087s
10:12:59      Results:
10:12:59          Run                 Codes                 Delta                 Times            Timestamps
10:12:59            1               (!) 500                0.052s                0.087s   2021-11-23 15:11:52
10:12:59            2               (!) 500                0.075s                0.064s   2021-11-23 15:12:13
10:12:59      Summary:
10:12:59          Detected 2 erroneous HTTP codes not equal to expected 200.
10:12:59          Test failed (status=-1).
10:12:59  
10:12:59   === Exceptions ===
10:12:59  
10:12:59   1. 500 Internal Server Error
10:12:59   <?xml version="1.0" encoding="utf-8"?>
10:12:59  <ExceptionReport version="1.0.0"
10:12:59      xmlns="http://www.opengis.net/ows/1.1"
10:12:59      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
10:12:59      xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
10:12:59      <Exception exceptionCode="NoApplicableCode" locator="NoApplicableCode">
10:12:59          <ExceptionText>Unknown Error</ExceptionText>
10:12:59      </Exception>
10:12:59  </ExceptionReport>
10:12:59  
10:12:59   2. 500 Internal Server Error
10:12:59   <?xml version="1.0" encoding="utf-8"?>
10:12:59  <ExceptionReport version="1.0.0"
10:12:59      xmlns="http://www.opengis.net/ows/1.1"
10:12:59      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
10:12:59      xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
10:12:59      <Exception exceptionCode="NoApplicableCode" locator="NoApplicableCode">
10:12:59          <ExceptionText>Unknown Error</ExceptionText>
10:12:59      </Exception>
10:12:59  </ExceptionReport>
10:12:59  
10:12:59  

Contributor

@tlvu tlvu left a comment


The new error reporting in Jenkins is very good. I just have comments about what seems to me to be an accidental way of hiding wrong behavior from the server (Magpie, Twitcher or the service behind it).

I'd rather have all the errors exposed clearly so we can fix them, but I do not know the design decisions well enough to make a call.

@@ -330,7 +330,7 @@
" resp = requests.request(method, url, timeout=abort_timeout, **req_kwargs)\n",
" resp.raise_for_status()\n",
" except requests.exceptions.Timeout as errT:\n",
" result.codes.append(code) # read timeout\n",
" result.codes.append(408) # read timeout\n",
Contributor


What was the wrong code for the timeout? Why not expose it instead of hardcoding 408? Was hardcoding the previous behavior that you are trying to maintain?

In general, tests are meant to expose errors, so masking the error is usually the wrong thing to do. My 2 cents, because I do not really know for sure what the design decision of Magpie is.

Collaborator Author


I'll answer both here.

I might be wrong on this, please correct me if that is the case.
Because the request times out after the specified time (timeout=abort_timeout), we don't get any answer from the server. We only know the request that was made, not the response, so we can't have the exact status code from the server; we force a 408 code after the timeout limit. And because there is no answer from the server, I force the "Request Timeout" message and print the requests.exceptions.Timeout(request) as the exception_text message.

Also, the thing with this test is that a 408 error happens randomly, so trying to debug it is like flipping a coin: sometimes it takes 1 minute, sometimes an entire day. So I can try to optimise it on my side, but it would be important that @fmigneault can be unblocked.

Contributor


OHH, I misunderstood: so it's the test that is timing out, it's not the actual response from the server. In that case, the error reporting should not make it look like a 408 was received from the server!

Maybe just put "xxx" as the code, and the reason would be "test timeout after ### seconds"?

So we can differentiate a real timeout returned by Magpie, because the server behind Magpie takes too long to respond, from the test timing out because we enforce an arbitrary abort_timeout.

But I'd suggest you wait to hear from @fmigneault before changing code. He knows best how to test Magpie.

Contributor


@ahandan's interpretation is correct.
We manually set a limit we want to wait for, and if no response from the server is received by then, the client (us) aborts.
So there is no real code from the server.
Code 408 is picked because that's the code for "Client Timeout", in contrast to "Server Timeout", which should be 504 (or some other 5xx).

Contributor


Just to be clear, there is no confusion to be had here.
When the request is initiated, our client establishes a connection with a limit of "wait for x seconds".
The server accepts that connection and keeps it open for that amount of time to receive the request.
If either the client fails to produce the request or the server fails to receive it / read the data in time, a 408 happens.
In other words, the request did not reach the server (completely or at all).
If the server failed to produce a response in time (e.g. it did not itself receive a result in time from the service it was supposed to redirect to), a 504 happens instead. This is based on the timeout value configured in nginx.

Contributor


OK, it's clear for me now. Just so it's clear for others as well, can we simply add "(client side)" to the reason "Request TimeOut (client side)"? Thanks.
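
For illustration, a self-contained sketch of what that client-side timeout labelling could look like; the function and argument names here are assumptions for the example, not the final notebook code:

```python
import requests

def probe(url, abort_timeout, codes, exceptions):
    """One request attempt, labelling a client-side abort explicitly."""
    try:
        resp = requests.get(url, timeout=abort_timeout)
        resp.raise_for_status()
        codes.append(resp.status_code)
    except requests.exceptions.Timeout as err:
        # No response was received: the client aborted after abort_timeout,
        # so this 408 is our own marker, not a code returned by the server.
        codes.append(408)
        exceptions.append(f"Request Timeout (client side): {err}")

codes, exceptions = [], []
probe("https://example.com", abort_timeout=5, codes=codes, exceptions=exceptions)
```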

@ahandan ahandan merged commit 9c63eeb into master Nov 24, 2021
@ahandan ahandan deleted the multi_threading_stress_test branch November 24, 2021 18:34
ChaamC pushed a commit to bird-house/birdhouse-deploy that referenced this pull request Dec 3, 2021
Magpie integrate new feats

## Overview

Update to latest Magpie version 3.19.0
Tested on a local birdhouse setup.

## Changes

- Adds new environment variables to handle new Magpie features, which rely on email usage. These features include user registration and approvals + user assignment to groups with terms and conditions.

The following PR will have to be merged prior to merging this current PR: [pavics-e2e #90](Ouranosinc/PAVICS-e2e-workflow-tests#90)

Also, PR [birdhouse #197](#197) content has been transferred to this PR.