Ingest Crashes Due To Zero Byte WAL File #1603

Closed
tonychoe opened this issue Jul 25, 2022 · 5 comments · Fixed by #1614


tonychoe commented Jul 25, 2022

Describe the bug

We have this environment ready for go-live, but it currently receives no traffic. When the ingester crashes for whatever reason, it goes into a crash loop complaining about zero-byte WAL files. We assume the empty WAL files are legitimate because there was no traffic in this environment. We could revive the ingester by deleting those empty WAL files.

Could this be related to how empty WAL files are handled?

level=info ts=2022-07-25T18:05:15.769048542Z caller=wal.go:145 msg="replay complete" file=507de965-0ac4-487b-9701-cc5310857413:default:v2:snappy:v2 duration=5.291931411s
level=info ts=2022-07-25T18:05:15.769153172Z caller=wal.go:118 msg="beginning replay" file=9402d44d-91ad-4b0a-acb6-02a41a6e0f02:default:v2:snappy:v2 size=0
level=warn ts=2022-07-25T18:05:15.769248463Z caller=wal.go:129 msg="empty wal file. ignoring." file=9402d44d-91ad-4b0a-acb6-02a41a6e0f02:default:v2:snappy:v2 err=null
level=info ts=2022-07-25T18:05:15.770347835Z caller=rescan_blocks.go:35 msg="beginning replay" file=0c692642-e72f-4bb3-ae0f-823c79dac5a2:fa:v2:none: size=0
level=warn ts=2022-07-25T18:05:15.770463073Z caller=rescan_blocks.go:49 msg="empty wal file. ignoring." file=0c692642-e72f-4bb3-ae0f-823c79dac5a2:fa:v2:none: err=null
level=info ts=2022-07-25T18:05:15.770501331Z caller=rescan_blocks.go:35 msg="beginning replay" file=507de965-0ac4-487b-9701-cc5310857413:default:v2:none: size=832314742
level=info ts=2022-07-25T18:05:23.560966754Z caller=rescan_blocks.go:65 msg="replay complete" file=507de965-0ac4-487b-9701-cc5310857413:default:v2:none: duration=7.790460674s
level=info ts=2022-07-25T18:05:23.561048476Z caller=rescan_blocks.go:35 msg="beginning replay" file=9402d44d-91ad-4b0a-acb6-02a41a6e0f02:default:v2:none: size=0
level=warn ts=2022-07-25T18:05:23.561154162Z caller=rescan_blocks.go:49 msg="empty wal file. ignoring." file=9402d44d-91ad-4b0a-acb6-02a41a6e0f02:default:v2:none: err=null
level=info ts=2022-07-25T18:05:23.561614599Z caller=ingester.go:413 msg="wal replay complete"
level=info ts=2022-07-25T18:05:23.561685053Z caller=ingester.go:427 msg="reloading local blocks" tenants=2
level=info ts=2022-07-25T18:05:23.561904241Z caller=instance.go:647 msg="reloaded local block" tenantID=default block=c7b4dbdb-ae9b-4c55-a9d1-bc0f9511da4a flushed=2022-07-24T06:08:31Z
level=info ts=2022-07-25T18:05:23.561987356Z caller=instance.go:647 msg="reloaded local block" tenantID=default block=cbe0c68b-106a-4b36-974e-0819b348d556 flushed=2022-07-24T06:14:38Z
level=error ts=2022-07-25T18:05:23.562392191Z caller=app.go:296 msg="module failed" module=ingester err="invalid service state: Failed, expected: Running, failure: failed to rediscover local blocks: getting local blocks for tenant fa: unexpected end of JSON input"

This is one tenant's block directory under the WAL path, on the PV attached to the crashed ingester.

var/tempo/wal/blocks/fa/f592df09-58aa-434f-9674-f39810d7352f

[root@oke-c5v2ase5lfa-nk5mskxyg2q-sphayxz36iq-4 f592df09-58aa-434f-9674-f39810d7352f]# ls -lah
total 8.0K
drwxr-xr-x. 2 root root 4.0K Jul 24 06:16 .
drwxr-xr-x. 3 root root 4.0K Jul 24 06:16 ..
-rw-r--r--. 1 root root    0 Jul 24 06:16 bloom-0
-rw-r--r--. 1 root root    0 Jul 24 06:16 data
-rw-r--r--. 1 root root    0 Jul 24 06:16 flushed
-rw-r--r--. 1 root root    0 Jul 24 06:16 index
-rw-r--r--. 1 root root    0 Jul 24 06:16 meta.json
-rw-r--r--. 1 root root    0 Jul 24 06:16 search
-rw-r--r--. 1 root root    0 Jul 24 06:16 search-header
-rw-r--r--. 1 root root    0 Jul 24 06:16 search-index
-rw-r--r--. 1 root root    0 Jul 24 06:16 search.meta.json
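
For what it's worth, the "unexpected end of JSON input" error above matches what Go's encoding/json returns when given empty input, which lines up with the zero-byte meta.json in this listing. A minimal standalone illustration (not Tempo code):

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	var meta map[string]interface{}
	// Parsing the contents of a zero-byte meta.json fails the same way
	// the ingester does on startup.
	err := json.Unmarshal([]byte{}, &meta)
	fmt.Println(err) // unexpected end of JSON input
}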

To Reproduce
Steps to reproduce the behavior:

  1. Run Tempo in distributed mode on Kubernetes
  2. Ingest no new trace data for a while
  3. Crash the ingester pod

Expected behavior
The ingester boots up

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm
  • Tempo: 1.4.0
annanay25 (Contributor) commented:

Hi @tonychoe, thanks for your report.

This is indeed strange behaviour, because we have guard code to ensure zero-length WAL files are never created. The following line ensures that if there is no ingested data, we never create a WAL file:

if i.headBlock == nil || i.headBlock.DataLength() == 0 {
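
For context, a self-contained sketch of what that guard amounts to, using hypothetical types in place of Tempo's ingester internals (not the real implementation):

package main

type headBlock struct{ data []byte }

func (h *headBlock) DataLength() uint64 { return uint64(len(h.data)) }

type instance struct{ headBlock *headBlock }

// cutBlock returns early when there is nothing to persist, so an idle
// tenant should never produce a WAL file in the first place.
func (i *instance) cutBlock() error {
	if i.headBlock == nil || i.headBlock.DataLength() == 0 {
		return nil
	}
	// ... persist the head block to the WAL ...
	return nil
}

func main() {
	var i instance
	_ = i.cutBlock()
}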

I ran the following test on my local machine to simulate "crashing a container" and see whether we write zero-length files in that case, but could not reproduce your issue:

# Running tempo without any ingested trace data
➜  local git:(zero-byte-wal) ✗ docker-compose ps
       Name                     Command               State                                                                                                         Ports
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
local_grafana_1      /run.sh                          Up      0.0.0.0:3000->3000/tcp,:::3000->3000/tcp
local_prometheus_1   /bin/prometheus --config.f ...   Up      0.0.0.0:9090->9090/tcp,:::9090->9090/tcp
local_tempo_1        /tempo -config.file=/etc/t ...   Up      0.0.0.0:14268->14268/tcp,:::14268->14268/tcp, 0.0.0.0:3200->3200/tcp,:::3200->3200/tcp, 0.0.0.0:4317->4317/tcp,:::4317->4317/tcp, 0.0.0.0:4318->4318/tcp,:::4318->4318/tcp,
                                                              0.0.0.0:9411->9411/tcp,:::9411->9411/tcp

# No WAL files created
➜  local git:(zero-byte-wal) ✗ ls -al tempo-data/wal
total 0
drwxr-xr-x  3 annanay  staff   96 Jul 26 15:28 .
drwxr-xr-x  5 annanay  staff  160 Jul 26 15:28 ..
drwxr-xr-x  2 annanay  staff   64 Jul 26 15:28 blocks

# No blocks created
➜  local git:(zero-byte-wal) ✗ ls -al tempo-data/wal/blocks
total 0
drwxr-xr-x  2 annanay  staff  64 Jul 26 15:28 .
drwxr-xr-x  3 annanay  staff  96 Jul 26 15:28 ..


# Kill container
➜  local git:(zero-byte-wal) ✗ docker-compose kill tempo
Killing local_tempo_1 ... done

➜  local git:(zero-byte-wal) ✗ docker-compose ps
       Name                     Command                State                      Ports
---------------------------------------------------------------------------------------------------------
local_grafana_1      /run.sh                          Up         0.0.0.0:3000->3000/tcp,:::3000->3000/tcp
local_prometheus_1   /bin/prometheus --config.f ...   Up         0.0.0.0:9090->9090/tcp,:::9090->9090/tcp
local_tempo_1        /tempo -config.file=/etc/t ...   Exit 137

# Still no WAL files
➜  local git:(zero-byte-wal) ✗ ls -al tempo-data/wal
total 0
drwxr-xr-x  3 annanay  staff   96 Jul 26 15:28 .
drwxr-xr-x  5 annanay  staff  160 Jul 26 15:28 ..
drwxr-xr-x  2 annanay  staff   64 Jul 26 15:28 blocks

# No blocks either
➜  local git:(zero-byte-wal) ✗ ls -al tempo-data/wal/blocks
total 0
drwxr-xr-x  2 annanay  staff  64 Jul 26 15:28 .
drwxr-xr-x  3 annanay  staff  96 Jul 26 15:28 ..

However, from your terminal output, it appears that it's not actually a zero-length WAL file but rather a corrupt block created when replaying a legitimate WAL file. The bloom filters, meta.json, index, etc. are only created on completion of WAL replay. Here is where that happens:

return bytesFlushed, writeBlockMeta(ctx, w, meta, indexBytes, c.bloom)
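
A crash partway through that completion step is one way to end up with empty files like the ones listed above. For illustration only (not Tempo's writeBlockMeta), a completion write can be hardened with a temp-file-plus-fsync-plus-rename pattern so readers never observe a zero-byte or half-written meta.json:

package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// writeMetaAtomic writes meta.json to a temp file, syncs it, then renames it
// into place. Rename is atomic on POSIX filesystems, so a crash leaves either
// the old file or the complete new one, never an empty or partial meta.json.
func writeMetaAtomic(dir string, meta interface{}) error {
	b, err := json.Marshal(meta)
	if err != nil {
		return err
	}
	tmp := filepath.Join(dir, "meta.json.tmp")
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if _, err := f.Write(b); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // needed for durability across node faults
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, "meta.json"))
}

func main() {
	_ = writeMetaAtomic
}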

Are you able to reproduce this crash reliably? If so, could you please attach the entire (recursive) directory structure of the /var/tempo directory?


tonychoe commented Jul 26, 2022

@annanay25 thanks for reviewing and sharing your insights.

This has been a recurring problem in one of our production regions, so we had to implement a temporary fix while writing this issue: an init container for the ingester that finds empty files and deletes them.

So I have only this path copied in my note:
var/tempo/wal/blocks/fa/f592df09-58aa-434f-9674-f39810d7352f

To get the entire directory structure, we need to revert that fix and let the issue recur. We'll get back to you.
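
For reference, the temporary fix boils down to deleting any zero-byte files under the WAL path before the ingester starts. A rough sketch of that cleanup logic (illustrative, not our exact init container):

package main

import (
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

func main() {
	root := "/var/tempo/wal" // assumed WAL mount path
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		if info.Size() == 0 {
			log.Printf("removing empty file %s", path)
			return os.Remove(path)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}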


annanay25 commented Jul 27, 2022

@tonychoe I see.

Yes, in that case it looks like there was a bug while replaying the WAL. Is it possible to retrieve the logs from when the ingester pod crashed? Was it crashed manually (the steps to reproduce mention "Crash the ingester pod")? I'm not sure how an empty meta.json could have been written.


mdisibio commented Aug 1, 2022

We have now seen this internally along with another possibly related error:

level=error ts=... caller=app.go:339 msg="module failed" module=ingester err="invalid service state: Failed, expected: Running, failure: failed to rediscover local blocks: getting local blocks for tenant ...: unexpected end of JSON input"

level=warn ts=... caller=rescan_blocks.go:24 msg="failed to open search wal directory" err="open /var/tempo/wal/search: no such file or directory"

The situation was the same: a crash loop until the block was removed (in our case by provisioning a whole new disk). Although we haven't determined the root cause yet, we have a few ideas for error-handling improvements:

  1. Include the block ID or file name in the error message
  2. Skip over this block instead of failing to start, leaving it for investigation and manual removal (see the sketch below). As a reminder, when the replication factor is > 1, it is safe to delete blocks from an ingester without data loss.
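
A hypothetical sketch of point 2, "skip instead of fail": if one local block cannot be reloaded, log it with its ID and move on rather than failing the whole ingester module (names are illustrative, not Tempo's actual rediscovery code):

package main

import (
	"fmt"
	"log"
)

type localBlock struct{ id string }

// reloadBlock stands in for reading a block's meta.json; an empty file
// fails with "unexpected end of JSON input".
func reloadBlock(b localBlock) error {
	return fmt.Errorf("reading meta.json for block %s: unexpected end of JSON input", b.id)
}

func rediscoverLocalBlocks(blocks []localBlock) {
	for _, b := range blocks {
		if err := reloadBlock(b); err != nil {
			// Leave the block on disk for investigation / manual removal;
			// with replication factor > 1 this does not lose data.
			log.Printf("skipping unreadable local block %s: %v", b.id, err)
			continue
		}
	}
}

func main() {
	rediscoverLocalBlocks([]localBlock{{id: "f592df09-58aa-434f-9674-f39810d7352f"}})
}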


mdisibio commented Aug 3, 2022

Internally we determined that the ingester pod was unexpectedly terminated due to a hardware fault on the k8s node, which led to the improperly flushed meta.json. Therefore this should be considered an expected, though rare, occurrence, and the changes to continue starting up seem sufficient. Perhaps in the future Tempo could automatically delete the bad block if it can determine there would be no data loss.
