Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solved incident "llm_lazy" - "llm_summary" running against coolstore and llama3 is seeing multiple issues #382

Open
jwmatthews opened this issue Sep 21, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@jwmatthews
Copy link
Member

jwmatthews commented Sep 21, 2024

run_demo.py against coolstore with llama3 served from IBM BAM.

I am running with an alternate configuration where I explicitly enabled the solution_producers "llm_lazy".
As I did this I have not seen a single fix come back to the run_demo.py client, it's been close to 60 minutes now since I first kicked this off.

I am seeing multiple problems:

  • Processing run_demo.py ran for 6 hours: 9:30am - 3:30pm and then stopped. Many timeouts and errors

  • ~10 minutes of no activity in server logs, and then I saw:

    • "The user message is too long. Send a short message or decrease property min_new_tokens" for pom.xml
  • waited an additional ~40 minutes past the prior error and have not see any other activity from other files in the server console

    • I do see DEBUG activity in logs/kai_server.log, but it is sparse, all I see are multiple kai.models.file_solution - [ file_solution.py:23 - guess_language()]
    • Because I do see that 1 line outputted multiple times for various files I can see that for the past 40 minutes the server is processing more files for solved incidents. We do not output any other information to logs to know what is happening.
  • After ~75 minutes, most of the requests from the client run_demo.py have timed out, I have seen 1 file succeed, rest timeouts (10 minute timeout)

    • ERROR - 2024-09-21 10:35:09,636 - __main__ - [ run_demo.py:61 - generate_fix()] - [src/main/java/com/redhat/coolstore/model/ShoppingCart.java] Received exception: HTTPConnectionPool(host='0.0.0.0', port=8080): Read timed out. (read timeout=3600)
  • Seeing [CRITICAL] WORKER TIMEOUT (pid:18881) server side

      [2024-09-21 10:45:19 -0400] [18875] [CRITICAL] WORKER TIMEOUT (pid:18881)
      [2024-09-21 10:45:20 -0400] [18875] [ERROR] Worker (pid:18881) was sent SIGKILL! Perhaps out of memory?
      [2024-09-21 10:45:20 -0400] [20505] [INFO] Booting worker with pid: 20505 
    

Server console output: https://gist.github.com/jwmatthews/a60f2a36b5691b466d9964386c61d04b

Full kai_server log: https://gist.githubusercontent.com/jwmatthews/c93bbd785643422f07d122f10635471a/raw/9423743b874ae61c95dfd2967f69f78e8c7fb190/kai_server.log

  • Note that I kicked off this experiment ~930am, there are some prior runs earlier in the log file which can be ignored.

Full run_demo.py console output:
https://gist.github.com/jwmatthews/5af0c32bc7951fffbf4be6c302a5e7d0

Looking at one example of pom.xml traceback: https://gist.github.com/jwmatthews/7c5a6e4d24bf445c35d972013cad4729

Sample error from pom.xml:

Failed to handle request to https://bam-api.res.ibm.com/v2/text/chat_stream?version=2024-01-10.
{
  "error": "Bad Request",
  "extensions": {
    "code": "INVALID_INPUT",
    "state": null
  },
  "message": "The user message is too long. Send a short message or decrease property min_new_tokens",
  "status_code": 400
}

The config I am running with

$ cat ~/git/jwmatthews/kai/kai/config.toml
# Default configuration file for Kai. For a better understanding of the
# configuration options, please refer to `build/example_config.toml`

log_level = "info"
file_log_level = "debug"
log_dir = "$pwd/logs"
demo_mode = false
trace_enabled = true

solution_consumers = ["diff_only", "llm_summary"]

[incident_store]
solution_detectors = "naive"
solution_producers = "llm_lazy"

[incident_store.args]
provider = "postgresql"
host = "127.0.0.1"
database = "kai"
user = "kai"
password = "dog8code"

[models]
provider = "ChatIBMGenAI"

[models.args]
model_id = "meta-llama/llama-3-70b-instruct"
parameters.max_new_tokens = 2048
@jwmatthews
Copy link
Member Author

Below is an example from above run that completed. It's been ~75 minutes since I kicked off run_demo.py.
Most of the files request to be fixed have timed out and are retrying, below is 1 example that succeeded

$ gist ShoppingCart.java.prompts.md
https://gist.github.com/jwmatthews/481ed0ced17b81588e5a9362c663ec26

$ gist ShoppingCart.java.llm_result.md
https://gist.github.com/jwmatthews/881a6f06dcc630d3b5b12733d8ce9c16

$ gist ShoppingCart.java.llm_response_metadata.json
https://gist.github.com/jwmatthews/8a1d0f55d680c64a4aabd50a01d1c51a

@jwmatthews
Copy link
Member Author

Example of what I'm seeing with run_demo.py

$ time ./run_demo.py
26 files with a total of 150 incidents.
INFO - 2024-09-21 09:33:57,004 - __main__ - [         run_demo.py:227  -             run_demo()] - Running in parallel with 8 workers
INFO - 2024-09-21 09:33:57,004 - __main__ - [         run_demo.py:192  -         process_file()] - File #1 of 26 - Processing src/main/webapp/WEB-INF/web.xml which has 1 incidents.
INFO - 2024-09-21 09:33:57,005 - __main__ - [         run_demo.py:192  -         process_file()] - File #2 of 26 - Processing pom.xml which has 12 incidents.
INFO - 2024-09-21 09:33:57,005 - __main__ - [         run_demo.py:192  -         process_file()] - File #3 of 26 - Processing src/main/java/com/redhat/coolstore/model/Order.java which has 10 incidents.
INFO - 2024-09-21 09:33:57,005 - __main__ - [         run_demo.py:192  -         process_file()] - File #4 of 26 - Processing src/main/java/com/redhat/coolstore/model/OrderItem.java which has 6 incidents.
INFO - 2024-09-21 09:33:57,005 - __main__ - [         run_demo.py:192  -         process_file()] - File #5 of 26 - Processing src/main/webapp/WEB-INF/beans.xml which has 5 incidents.
INFO - 2024-09-21 09:33:57,006 - __main__ - [         run_demo.py:192  -         process_file()] - File #6 of 26 - Processing src/main/resources/META-INF/persistence.xml which has 8 incidents.
INFO - 2024-09-21 09:33:57,006 - __main__ - [         run_demo.py:192  -         process_file()] - File #7 of 26 - Processing src/main/java/com/redhat/coolstore/model/InventoryEntity.java which has 6 incidents.
INFO - 2024-09-21 09:33:57,007 - __main__ - [         run_demo.py:192  -         process_file()] - File #8 of 26 - Processing src/main/java/com/redhat/coolstore/model/ShoppingCart.java which has 1 incidents.
INFO - 2024-09-21 09:34:18,062 - __main__ - [         run_demo.py:208  -         process_file()] - Response StatusCode: 200 for src/main/webapp/WEB-INF/web.xml

INFO - 2024-09-21 09:34:18,062 - __main__ - [         run_demo.py:110  -        write_to_disk()] - Writing updated source code to ./coolstore/src/main/webapp/WEB-INF/web.xml
INFO - 2024-09-21 09:34:18,062 - __main__ - [         run_demo.py:122  -        write_to_disk()] - Writing prompts to ./coolstore/src/main/webapp/WEB-INF/web.xml.prompts.md
INFO - 2024-09-21 09:34:18,063 - __main__ - [         run_demo.py:132  -        write_to_disk()] - Writing llm_response_metadata to ./coolstore/src/main/webapp/WEB-INF/web.xml.llm_response_metadata.json
INFO - 2024-09-21 09:34:18,064 - __main__ - [         run_demo.py:145  -        write_to_disk()] - Writing llm_result to ./coolstore/src/main/webapp/WEB-INF/web.xml.llm_result.md
INFO - 2024-09-21 09:34:18,064 - __main__ - [         run_demo.py:192  -         process_file()] - File #9 of 26 - Processing src/main/java/com/redhat/coolstore/persistence/Resources.java which has 6 incidents.
INFO - 2024-09-21 09:34:18,064 - __main__ - [         run_demo.py:239  -             run_demo()] - Result:  21.05963397026062s to process src/main/webapp/WEB-INF/web.xml with 1 violations
INFO - 2024-09-21 09:34:18,064 - __main__ - [         run_demo.py:246  -             run_demo()] - 25 files remaining from total of 26
INFO - 2024-09-21 09:44:04,445 - __main__ - [         run_demo.py:57   -         generate_fix()] - [pom.xml] Received status code 500
ERROR - 2024-09-21 09:44:04,445 - __main__ - [         run_demo.py:64   -         generate_fix()] - [pom.xml] Failed to get a '200' response from the server.  Retrying 6 more times
INFO - 2024-09-21 09:44:04,948 - __main__ - [         run_demo.py:57   -         generate_fix()] - [pom.xml] Received status code 500
ERROR - 2024-09-21 09:44:04,948 - __main__ - [         run_demo.py:64   -         generate_fix()] - [pom.xml] Failed to get a '200' response from the server.  Retrying 5 more times
INFO - 2024-09-21 09:44:05,278 - __main__ - [         run_demo.py:57   -         generate_fix()] - [pom.xml] Received status code 500
ERROR - 2024-09-21 09:44:05,278 - __main__ - [         run_demo.py:64   -         generate_fix()] - [pom.xml] Failed to get a '200' response from the server.  Retrying 4 more times
INFO - 2024-09-21 09:44:05,614 - __main__ - [         run_demo.py:57   -         generate_fix()] - [pom.xml] Received status code 500
ERROR - 2024-09-21 09:44:05,614 - __main__ - [         run_demo.py:64   -         generate_fix()] - [pom.xml] Failed to get a '200' response from the server.  Retrying 3 more times
INFO - 2024-09-21 09:44:05,956 - __main__ - [         run_demo.py:57   -         generate_fix()] - [pom.xml] Received status code 500
ERROR - 2024-09-21 09:44:05,956 - __main__ - [         run_demo.py:64   -         generate_fix()] - [pom.xml] Failed to get a '200' response from the server.  Retrying 2 more times
INFO - 2024-09-21 09:44:06,287 - __main__ - [         run_demo.py:57   -         generate_fix()] - [pom.xml] Received status code 500
ERROR - 2024-09-21 09:44:06,287 - __main__ - [         run_demo.py:64   -         generate_fix()] - [pom.xml] Failed to get a '200' response from the server.  Retrying 1 more times
INFO - 2024-09-21 09:44:06,288 - __main__ - [         run_demo.py:192  -         process_file()] - File #10 of 26 - Processing src/main/java/com/redhat/coolstore/rest/CartEndpoint.java which has 9 incidents.
INFO - 2024-09-21 10:28:26,933 - __main__ - [         run_demo.py:208  -         process_file()] - Response StatusCode: 200 for src/main/webapp/WEB-INF/beans.xml

INFO - 2024-09-21 10:28:26,935 - __main__ - [         run_demo.py:110  -        write_to_disk()] - Writing updated source code to ./coolstore/src/main/webapp/WEB-INF/beans.xml
INFO - 2024-09-21 10:28:26,935 - __main__ - [         run_demo.py:122  -        write_to_disk()] - Writing prompts to ./coolstore/src/main/webapp/WEB-INF/beans.xml.prompts.md
INFO - 2024-09-21 10:28:26,935 - __main__ - [         run_demo.py:132  -        write_to_disk()] - Writing llm_response_metadata to ./coolstore/src/main/webapp/WEB-INF/beans.xml.llm_response_metadata.json
INFO - 2024-09-21 10:28:26,944 - __main__ - [         run_demo.py:145  -        write_to_disk()] - Writing llm_result to ./coolstore/src/main/webapp/WEB-INF/beans.xml.llm_result.md
INFO - 2024-09-21 10:28:26,944 - __main__ - [         run_demo.py:192  -         process_file()] - File #11 of 26 - Processing src/main/java/com/redhat/coolstore/rest/OrderEndpoint.java which has 8 incidents.
ERROR - 2024-09-21 10:35:09,636 - __main__ - [         run_demo.py:61   -         generate_fix()] - [src/main/java/com/redhat/coolstore/model/ShoppingCart.java] Received exception: HTTPConnectionPool(host='0.0.0.0', port=8080): Read timed out. (read timeout=3600)
ERROR - 2024-09-21 10:35:09,636 - __main__ - [         run_demo.py:61   -         generate_fix()] - [src/main/java/com/redhat/coolstore/model/OrderItem.java] Received exception: HTTPConnectionPool(host='0.0.0.0', port=8080): Read timed out. (read timeout=3600)
ERROR - 2024-09-21 10:35:09,637 - __main__ - [         run_demo.py:64   -         generate_fix()] - [src/main/java/com/redhat/coolstore/model/ShoppingCart.java] Failed to get a '200' response from the server.  Retrying 6 more times
ERROR - 2024-09-21 10:35:09,637 - __main__ - [         run_demo.py:61   -         generate_fix()] - [src/main/resources/META-INF/persistence.xml] Received exception: HTTPConnectionPool(host='0.0.0.0', port=8080): Read timed out. (read timeout=3600)
ERROR - 2024-09-21 10:35:09,638 - __main__ - [         run_demo.py:64   -         generate_fix()] - [src/main/resources/META-INF/persistence.xml] Failed to get a '200' response from the server.  Retrying 6 more times
ERROR - 2024-09-21 10:35:09,637 - __main__ - [         run_demo.py:61   -         generate_fix()] - [src/main/java/com/redhat/coolstore/model/Order.java] Received exception: HTTPConnectionPool(host='0.0.0.0', port=8080): Read timed out. (read timeout=3600)
ERROR - 2024-09-21 10:35:09,640 - __main__ - [         run_demo.py:64   -         generate_fix()] - [src/main/java/com/redhat/coolstore/model/Order.java] Failed to get a '200' response from the server.  Retrying 6 more times
ERROR - 2024-09-21 10:35:09,637 - __main__ - [         run_demo.py:61   -         generate_fix()] - [src/main/java/com/redhat/coolstore/model/InventoryEntity.java] Received exception: HTTPConnectionPool(host='0.0.0.0', port=8080): Read timed out. (read timeout=3600)
ERROR - 2024-09-21 10:35:09,641 - __main__ - [         run_demo.py:64   -         generate_fix()] - [src/main/java/com/redhat/coolstore/model/InventoryEntity.java] Failed to get a '200' response from the server.  Retrying 6 more times
ERROR - 2024-09-21 10:35:09,637 - __main__ - [         run_demo.py:64   -         generate_fix()] - [src/main/java/com/redhat/coolstore/model/OrderItem.java] Failed to get a '200' response from the server.  Retrying 6 more times
ERROR - 2024-09-21 10:35:30,642 - __main__ - [         run_demo.py:61   -         generate_fix()] - [src/main/java/com/redhat/coolstore/persistence/Resources.java] Received exception: HTTPConnectionPool(host='0.0.0.0', port=8080): Read timed out. (read timeout=3600)
ERROR - 2024-09-21 10:35:30,643 - __main__ - [         run_demo.py:64   -         generate_fix()] - [src/main/java/com/redhat/coolstore/persistence/Resources.java] Failed to get a '200' response from the server.  Retrying 6 more times
INFO - 2024-09-21 10:35:57,218 - __main__ - [         run_demo.py:208  -         process_file()] - Response StatusCode: 200 for src/main/java/com/redhat/coolstore/model/ShoppingCart.java

INFO - 2024-09-21 10:35:57,219 - __main__ - [         run_demo.py:110  -        write_to_disk()] - Writing updated source code to ./coolstore/src/main/java/com/redhat/coolstore/model/ShoppingCart.java
INFO - 2024-09-21 10:35:57,219 - __main__ - [         run_demo.py:122  -        write_to_disk()] - Writing prompts to ./coolstore/src/main/java/com/redhat/coolstore/model/ShoppingCart.java.prompts.md
INFO - 2024-09-21 10:35:57,219 - __main__ - [         run_demo.py:132  -        write_to_disk()] - Writing llm_response_metadata to ./coolstore/src/main/java/com/redhat/coolstore/model/ShoppingCart.java.llm_response_metadata.json
INFO - 2024-09-21 10:35:57,222 - __main__ - [         run_demo.py:145  -        write_to_disk()] - Writing llm_result to ./coolstore/src/main/java/com/redhat/coolstore/model/ShoppingCart.java.llm_result.md
INFO - 2024-09-21 10:35:57,222 - __main__ - [         run_demo.py:192  -         process_file()] - File #12 of 26 - Processing src/main/java/com/redhat/coolstore/rest/ProductEndpoint.java which has 3 incidents.
ERROR - 2024-09-21 10:45:18,913 - __main__ - [         run_demo.py:61   -         generate_fix()] - [src/main/java/com/redhat/coolstore/rest/CartEndpoint.java] Received exception: HTTPConnectionPool(host='0.0.0.0', port=8080): Read timed out. (read timeout=3600)
ERROR - 2024-09-21 10:45:18,913 - __main__ - [         run_demo.py:64   -         generate_fix()] - [src/main/java/com/redhat/coolstore/rest/CartEndpoint.java] Failed to get a '200' response from the server.  Retrying 6 more times

@jwmatthews jwmatthews added the bug Something isn't working label Sep 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant