Ingest Crashes Due To Zero Byte WAL File #1603

Closed
tonychoe opened this issue Jul 25, 2022 · 5 comments · Fixed by #1614


tonychoe commented Jul 25, 2022

Describe the bug

We have this environment ready for go-live, but it currently receives no traffic. When the ingester crashes for whatever reason, it goes into a crash loop complaining about zero-byte WAL files. We assume the empty WAL files are legitimate because there was no traffic in this environment. We could revive the ingester by deleting those empty WAL files.

Could this be related to how empty WAL files are handled?

level=info ts=2022-07-25T18:05:15.769048542Z caller=wal.go:145 msg="replay complete" file=507de965-0ac4-487b-9701-cc5310857413:default:v2:snappy:v2 duration=5.291931411s
level=info ts=2022-07-25T18:05:15.769153172Z caller=wal.go:118 msg="beginning replay" file=9402d44d-91ad-4b0a-acb6-02a41a6e0f02:default:v2:snappy:v2 size=0
level=warn ts=2022-07-25T18:05:15.769248463Z caller=wal.go:129 msg="empty wal file. ignoring." file=9402d44d-91ad-4b0a-acb6-02a41a6e0f02:default:v2:snappy:v2 err=null
level=info ts=2022-07-25T18:05:15.770347835Z caller=rescan_blocks.go:35 msg="beginning replay" file=0c692642-e72f-4bb3-ae0f-823c79dac5a2:fa:v2:none: size=0
level=warn ts=2022-07-25T18:05:15.770463073Z caller=rescan_blocks.go:49 msg="empty wal file. ignoring." file=0c692642-e72f-4bb3-ae0f-823c79dac5a2:fa:v2:none: err=null
level=info ts=2022-07-25T18:05:15.770501331Z caller=rescan_blocks.go:35 msg="beginning replay" file=507de965-0ac4-487b-9701-cc5310857413:default:v2:none: size=832314742
level=info ts=2022-07-25T18:05:23.560966754Z caller=rescan_blocks.go:65 msg="replay complete" file=507de965-0ac4-487b-9701-cc5310857413:default:v2:none: duration=7.790460674s
level=info ts=2022-07-25T18:05:23.561048476Z caller=rescan_blocks.go:35 msg="beginning replay" file=9402d44d-91ad-4b0a-acb6-02a41a6e0f02:default:v2:none: size=0
level=warn ts=2022-07-25T18:05:23.561154162Z caller=rescan_blocks.go:49 msg="empty wal file. ignoring." file=9402d44d-91ad-4b0a-acb6-02a41a6e0f02:default:v2:none: err=null
level=info ts=2022-07-25T18:05:23.561614599Z caller=ingester.go:413 msg="wal replay complete"
level=info ts=2022-07-25T18:05:23.561685053Z caller=ingester.go:427 msg="reloading local blocks" tenants=2
level=info ts=2022-07-25T18:05:23.561904241Z caller=instance.go:647 msg="reloaded local block" tenantID=default block=c7b4dbdb-ae9b-4c55-a9d1-bc0f9511da4a flushed=2022-07-24T06:08:31Z
level=info ts=2022-07-25T18:05:23.561987356Z caller=instance.go:647 msg="reloaded local block" tenantID=default block=cbe0c68b-106a-4b36-974e-0819b348d556 flushed=2022-07-24T06:14:38Z
level=error ts=2022-07-25T18:05:23.562392191Z caller=app.go:296 msg="module failed" module=ingester err="invalid service state: Failed, expected: Running, failure: failed to rediscover local blocks: getting local blocks for tenant fa: unexpected end of JSON input"

This is one tenant's block directory under the WAL path, on the PV attached to the crashed ingester.

var/tempo/wal/blocks/fa/f592df09-58aa-434f-9674-f39810d7352f

[root@oke-c5v2ase5lfa-nk5mskxyg2q-sphayxz36iq-4 f592df09-58aa-434f-9674-f39810d7352f]# ls -lah
total 8.0K
drwxr-xr-x. 2 root root 4.0K Jul 24 06:16 .
drwxr-xr-x. 3 root root 4.0K Jul 24 06:16 ..
-rw-r--r--. 1 root root    0 Jul 24 06:16 bloom-0
-rw-r--r--. 1 root root    0 Jul 24 06:16 data
-rw-r--r--. 1 root root    0 Jul 24 06:16 flushed
-rw-r--r--. 1 root root    0 Jul 24 06:16 index
-rw-r--r--. 1 root root    0 Jul 24 06:16 meta.json
-rw-r--r--. 1 root root    0 Jul 24 06:16 search
-rw-r--r--. 1 root root    0 Jul 24 06:16 search-header
-rw-r--r--. 1 root root    0 Jul 24 06:16 search-index
-rw-r--r--. 1 root root    0 Jul 24 06:16 search.meta.json
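
For what it's worth, the "unexpected end of JSON input" error above matches what Go's encoding/json returns when given empty input, which lines up with the zero-byte meta.json in this listing. A minimal standalone illustration (not Tempo code):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var meta map[string]interface{}
	// Parsing the contents of a zero-byte meta.json fails the same way
	// the ingester does on startup.
	err := json.Unmarshal([]byte{}, &meta)
	fmt.Println(err) // unexpected end of JSON input
}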

To Reproduce
Steps to reproduce the behavior:

  1. Run Tempo in distributed mode on Kubernetes
  2. Ingest no new trace data for a while
  3. Crash the ingester pod

Expected behavior
The ingester boots up

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm
  • Tempo: 1.4.0
annanay25 (Contributor) commented:

Hi @tonychoe, thanks for your report.

This is indeed strange behaviour, because we have guard code to ensure zero-length WAL files are never created. The following line ensures that if there is no ingested data, we never create a WAL file:

if i.headBlock == nil || i.headBlock.DataLength() == 0 {
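
For context, a self-contained sketch of what that guard amounts to, using hypothetical types in place of Tempo's ingester internals (not the real implementation):

package main

type headBlock struct{ data []byte }

func (h *headBlock) DataLength() uint64 { return uint64(len(h.data)) }

type instance struct{ headBlock *headBlock }

// cutBlock returns early when there is nothing to persist, so an idle
// tenant should never produce a WAL file in the first place.
func (i *instance) cutBlock() error {
	if i.headBlock == nil || i.headBlock.DataLength() == 0 {
		return nil
	}
	// ... persist the head block to the WAL ...
	return nil
}

func main() {
	var i instance
	_ = i.cutBlock()
}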

I ran the following test on my local machine to simulate "crashing a container" and see whether we write zero-length files in that case, but could not reproduce your issue:

# Running tempo without any ingested trace data
➜  local git:(zero-byte-wal) ✗ docker-compose ps
       Name                     Command               State                                                                                                         Ports
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
local_grafana_1      /run.sh                          Up      0.0.0.0:3000->3000/tcp,:::3000->3000/tcp
local_prometheus_1   /bin/prometheus --config.f ...   Up      0.0.0.0:9090->9090/tcp,:::9090->9090/tcp
local_tempo_1        /tempo -config.file=/etc/t ...   Up      0.0.0.0:14268->14268/tcp,:::14268->14268/tcp, 0.0.0.0:3200->3200/tcp,:::3200->3200/tcp, 0.0.0.0:4317->4317/tcp,:::4317->4317/tcp, 0.0.0.0:4318->4318/tcp,:::4318->4318/tcp,
                                                              0.0.0.0:9411->9411/tcp,:::9411->9411/tcp

# No WAL files created
➜  local git:(zero-byte-wal) ✗ ls -al tempo-data/wal
total 0
drwxr-xr-x  3 annanay  staff   96 Jul 26 15:28 .
drwxr-xr-x  5 annanay  staff  160 Jul 26 15:28 ..
drwxr-xr-x  2 annanay  staff   64 Jul 26 15:28 blocks

# No blocks created
➜  local git:(zero-byte-wal) ✗ ls -al tempo-data/wal/blocks
total 0
drwxr-xr-x  2 annanay  staff  64 Jul 26 15:28 .
drwxr-xr-x  3 annanay  staff  96 Jul 26 15:28 ..


# Kill container
➜  local git:(zero-byte-wal) ✗ docker-compose kill tempo
Killing local_tempo_1 ... done

➜  local git:(zero-byte-wal) ✗ docker-compose ps
       Name                     Command                State                      Ports
---------------------------------------------------------------------------------------------------------
local_grafana_1      /run.sh                          Up         0.0.0.0:3000->3000/tcp,:::3000->3000/tcp
local_prometheus_1   /bin/prometheus --config.f ...   Up         0.0.0.0:9090->9090/tcp,:::9090->9090/tcp
local_tempo_1        /tempo -config.file=/etc/t ...   Exit 137

# Still no WAL files
➜  local git:(zero-byte-wal) ✗ ls -al tempo-data/wal
total 0
drwxr-xr-x  3 annanay  staff   96 Jul 26 15:28 .
drwxr-xr-x  5 annanay  staff  160 Jul 26 15:28 ..
drwxr-xr-x  2 annanay  staff   64 Jul 26 15:28 blocks

# No blocks either
➜  local git:(zero-byte-wal) ✗ ls -al tempo-data/wal/blocks
total 0
drwxr-xr-x  2 annanay  staff  64 Jul 26 15:28 .
drwxr-xr-x  3 annanay  staff  96 Jul 26 15:28 ..

However, from your terminal output, it appears that it's not actually a zero-length WAL file but rather a corrupt block created when replaying a legitimate WAL file. The bloom filters, meta.json, index, etc. are only created on completion of WAL replay. Here is where that happens:

return bytesFlushed, writeBlockMeta(ctx, w, meta, indexBytes, c.bloom)
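
A crash partway through that completion step is one way to end up with empty files like the ones listed above. For illustration only (not Tempo's writeBlockMeta), a completion write can be hardened with a temp-file-plus-fsync-plus-rename pattern so readers never observe a zero-byte or half-written meta.json:

package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// writeMetaAtomic writes meta.json to a temp file, syncs it, then renames it
// into place. Rename is atomic on POSIX filesystems, so a crash leaves either
// the old file or the complete new one, never an empty or partial meta.json.
func writeMetaAtomic(dir string, meta interface{}) error {
	b, err := json.Marshal(meta)
	if err != nil {
		return err
	}
	tmp := filepath.Join(dir, "meta.json.tmp")
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(b); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // needed for durability across node faults
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, "meta.json"))
}

func main() {
	_ = writeMetaAtomic
}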

Are you able to reproduce this crash reliably? If so, could you please attach the entire (recursive) directory structure of the /var/tempo directory?


tonychoe commented Jul 26, 2022

@annanay25 thanks for reviewing and sharing your insights.

This has been a recurring problem in one of our production regions, so we had to implement a temporary fix while writing this issue: an init container for the ingester that finds empty files and deletes them.

So I have only this path copied in my note:
var/tempo/wal/blocks/fa/f592df09-58aa-434f-9674-f39810d7352f

To get the entire directory structure, we need to revert that fix and let the issue recur. We'll get back to you.
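
For reference, the temporary fix boils down to deleting any zero-byte files under the WAL path before the ingester starts. A rough sketch of that cleanup logic (illustrative, not our exact init container):

package main

import (
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

func main() {
	root := "/var/tempo/wal" // assumed WAL mount path
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if info.Size() == 0 {
			log.Printf("removing empty file %s", path)
			return os.Remove(path)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}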


annanay25 commented Jul 27, 2022

@tonychoe I see.

Yes, in that case it looks like there was a bug while replaying the WAL. Is it possible to retrieve the logs from when the ingester pod crashed? Was it crashed manually (the steps to reproduce mention "Crash the ingester pod")? I'm not sure how an empty meta.json could have been written.


mdisibio commented Aug 1, 2022

We have now seen this internally along with another possibly related error:

level=error ts=... caller=app.go:339 msg="module failed" module=ingester err="invalid service state: Failed, expected: Running, failure: failed to rediscover local blocks: getting local blocks for tenant ...: unexpected end of JSON input"

level=warn ts=... caller=rescan_blocks.go:24 msg="failed to open search wal directory" err="open /var/tempo/wal/search: no such file or directory"

The situation was the same: a crash loop until the block was removed (in our case by provisioning a whole new disk). Although we haven't determined the root cause yet, we have a few ideas for error-handling improvements:

  1. Include the block ID or file name in the error message
  2. Skip over this block instead of failing to start, leaving it for investigation and manual removal (see the sketch below). As a reminder, when the replication factor is > 1, it is safe to delete blocks from an ingester without data loss.
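
A hypothetical sketch of point 2, "skip instead of fail": if one local block cannot be reloaded, log it with its ID and move on rather than failing the whole ingester module (names are illustrative, not Tempo's actual rediscovery code):

package main

import (
	"fmt"
	"log"
)

type localBlock struct{ id string }

// reloadBlock stands in for reading a block's meta.json; an empty file
// fails with "unexpected end of JSON input".
func reloadBlock(b localBlock) error {
	return fmt.Errorf("reading meta.json for block %s: unexpected end of JSON input", b.id)
}

func rediscoverLocalBlocks(blocks []localBlock) {
	for _, b := range blocks {
		if err := reloadBlock(b); err != nil {
			// Leave the block on disk for investigation / manual removal;
			// with replication factor > 1 this does not lose data.
			log.Printf("skipping unreadable local block %s: %v", b.id, err)
			continue
		}
	}
}

func main() {
	rediscoverLocalBlocks([]localBlock{{id: "f592df09-58aa-434f-9674-f39810d7352f"}})
}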


mdisibio commented Aug 3, 2022

Internally we determined that the ingester pod was unexpectedly terminated due to a hardware fault on the k8s node, which led to the improperly flushed meta.json. Therefore this should be considered an expected, though rare, occurrence, and the changes to continue starting up seem sufficient. Perhaps in the future Tempo could automatically delete the bad block if it can determine there would be no data loss.
