
Error in ingester: cannot grow buffer beyond 2 gigabytes #1258

Closed
mvkrishna86 opened this issue Jan 30, 2022 · 13 comments
Labels
stale Used for stale issues / PRs

Comments

@mvkrishna86

Describe the bug
Tempo ingesters are failing with the below error:

panic: cannot grow buffer beyond 2 gigabytes

goroutine 326 [running]:
github.com/google/flatbuffers/go.(*Builder).growByteBuffer(...)
        /drone/src/vendor/github.com/google/flatbuffers/go/builder.go:214
github.com/google/flatbuffers/go.(*Builder).Prep(0x41066185b0, 0x4, 0x36b45)
        /drone/src/vendor/github.com/google/flatbuffers/go/builder.go:268 +0x2c0
github.com/google/flatbuffers/go.(*Builder).CreateString(0x41066185b0, {0x415fa04000, 0x36b44})
        /drone/src/vendor/github.com/google/flatbuffers/go/builder.go:337 +0x4c
github.com/google/flatbuffers/go.(*Builder).CreateSharedString(0x41066185b0, {0x415fa04000, 0x36b44})
        /drone/src/vendor/github.com/google/flatbuffers/go/builder.go:327 +0xbc
github.com/grafana/tempo/pkg/tempofb.writeToBuilder(0x41066185b0, {0x406cbbaa80, 0x37, 0x37}, 0x4041845888)
        /drone/src/pkg/tempofb/searchdatamap.go:159 +0x440
github.com/grafana/tempo/pkg/tempofb.SearchDataMapLarge.WriteToBuilder(0x404b0524e0, 0x41066185b0)
        /drone/src/pkg/tempofb/searchdatamap.go:133 +0x194
github.com/grafana/tempo/pkg/tempofb.(*SearchBlockHeaderMutable).ToBytes(0x4041845ad8)
        /drone/src/pkg/tempofb/SearchBlockHeader_util.go:66 +0xfc
github.com/grafana/tempo/tempodb/search.NewBackendSearchBlock(0x4000928150, {0x227dfc8, 0x400011d5f0}, {0x97, 0x23, 0x35, 0xaf, 0x2a, 0x2f, 0x43, ...}, ...)
        /drone/src/tempodb/search/backend_search_block.go:113 +0x5dc
github.com/grafana/tempo/tempodb.(*readerWriter).CompleteSearchBlockWithBackend(0x4040ca09c0, 0x4000928150, {0x97, 0x23, 0x35, 0xaf, 0x2a, 0x2f, 0x43, 0x4b, ...}, ...)
        /drone/src/tempodb/tempodb.go:266 +0x74
github.com/grafana/tempo/modules/ingester.(*instance).CompleteBlock(0x4000663440, {0x97, 0x23, 0x35, 0xaf, 0x2a, 0x2f, 0x43, 0x4b, 0xa9, ...})
        /drone/src/modules/ingester/instance.go:286 +0x328
github.com/grafana/tempo/modules/ingester.(*Ingester).handleComplete(0x400031ec00, 0x4000797540)
        /drone/src/modules/ingester/flush.go:252 +0x290
github.com/grafana/tempo/modules/ingester.(*Ingester).flushLoop(0x400031ec00, 0x3)
        /drone/src/modules/ingester/flush.go:205 +0xb8
created by github.com/grafana/tempo/modules/ingester.New
        /drone/src/modules/ingester/ingester.go:80 +0x1c8

To Reproduce
Steps to reproduce the behavior:

  • This is happening after a few days of using Tempo 1.2.
  • I can only overcome it by removing the PVC and restarting the ingester pods.

Expected behavior
Ingesters should keep running without crashing.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm, chart version: 0.13.0
  • Version: 1.2.0

Additional Context
Here are the values I am using:

     config: |
      search_enabled: true
      multitenancy_enabled: true
      compactor:
        compaction:
          block_retention: 168h
        ring:
          kvstore:
            store: memberlist

      querier:
        frontend_worker:
          frontend_address: {{ include "tempo.queryFrontendFullname" . }}-discovery:9095
          grpc_client_config:
            max_recv_msg_size: 1.34217728e+08
            max_send_msg_size: 1.34217728e+08
      ingester:
        lifecycler:
          ring:
            replication_factor: 1
            kvstore:
              store: memberlist
          tokens_file_path: /var/tempo/tokens.json
        complete_block_timeout: 10m
      memberlist:
        abort_if_cluster_join_fails: true
        join_members:
          - {{ include "tempo.fullname" . }}-gossip-ring
      overrides:
        max_traces_per_user: 0
        max_bytes_per_trace: 3000000
        ingestion_rate_strategy: global
        ingestion_burst_size_bytes: 40000000
        ingestion_rate_limit_bytes: 30000000
      server:
        grpc_server_max_recv_msg_size: 1.34217728e+08
        grpc_server_max_send_msg_size: 1.34217728e+08
        http_listen_port: 3100
      storage:
        trace:
          backend: {{.Values.storage.trace.backend}}
          s3:
            {{- toYaml .Values.storage.trace.s3 | nindent 6}}
          blocklist_poll: 5m
          local:
            path: /var/tempo/traces
          wal:
            path: /var/tempo/wal
          cache: memcached
          memcached:
            consistent_hash: true
            host: {{ include "tempo.fullname" . }}-memcached
            service: memcached-client
            timeout: 500ms

@mdisibio
Contributor

Hi, thanks for reporting this. It seems that this error is occurring while writing the search-header file to the ingester disk. 2GB is much larger than expected for that file; those files are typically quite small (20MB or less). We will try to protect against this panic in a future release, but for now let's investigate what is different about your traffic and check some relevant config options.

  1. Can you provide a full file listing of /var/tempo/wal? We particularly want to see how large the blocks being flushed by the ingester are, and how large the search and search-header files are.

  2. What is your traffic volume, in spans/s and bytes/s? These are reported by the metrics tempo_distributor_spans_received_total and tempo_distributor_bytes_received_total.

  3. Are there any quite large span attributes? For example, if the trace includes a copy of the HTTP request and response bodies, that could inflate the size of the search file. We can exclude these tags from search by setting the distributor: search_tags_deny_list. Additionally, there might be other tags that are not useful to your workflow; excluding these would also cut down on the file size. For example, here are the tags we exclude ourselves:

    search_tags_deny_list:
        - opencensus.exporterversion
        - sampler.type
        - sampler.param
        - client-uuid
        - component
        - ip
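
If it helps with placement, here is a minimal sketch of where the deny list sits in the config, assuming the same structure as the values posted above; the http.request.body / http.response.body names are only illustrative stand-ins for whatever large attributes your traces actually carry:

    distributor:
      search_tags_deny_list:
        - http.request.body       # illustrative placeholder for a large attribute
        - http.response.body      # illustrative placeholder for a large attribute
        - opencensus.exporterversion
        - sampler.type
        - sampler.param
        - client-uuid
        - component
        - ip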

@mvkrishna86
Author

mvkrishna86 commented Feb 10, 2022

The issue happened again

[screenshots: span count graphs]

When it reached 1.4B spans, it started going for a toss.
We are running 5 ingesters and 5 distributors, and I don't see much CPU or memory usage.

@mvkrishna86
Author

Hi @mdisibio, we are writing large volumes of data, running load tests with thousands of requests per minute continuously. Can you suggest an appropriate configuration to flush the data more frequently?

@mdisibio
Contributor

mdisibio commented Feb 10, 2022

Hi. Could you repost those graphs as rates, e.g. sum(rate(tempo_distributor_spans_received_total[1m]))?

These are the flush-related settings that we use; they are known to work well up to 2M spans/s.

ingester:
    flush_check_period: 5s
    trace_idle_period: 5s
    max_block_duration: 15m0s
    max_block_bytes: 734003200

Of these, max_block_duration is the most relevant; it defaults to 1 hour. If 15m isn't enough, try 5m (our ingesters actually flush blocks even faster than that, about every 2 minutes). Just check that your compactors are able to keep up with the higher number of blocks being generated.
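
For example, a minimal sketch of the faster variant described above (the same values as the snippet, with only max_block_duration dropped to 5m):

    ingester:
        flush_check_period: 5s
        trace_idle_period: 5s
        max_block_duration: 5m        # cut and flush blocks more often; the default is 1h
        max_block_bytes: 734003200    # unchanged from the settings above (~700 MiB)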

@AsishVenakataSai

AsishVenakataSai commented Feb 11, 2022

hi @mdisibio
[screenshot: spans-rate graph]

We have just applied your suggested configuration and will report back.

@AsishVenakataSai

AsishVenakataSai commented Feb 17, 2022

hi @mdisibio ,
We have observed a new issue after the recent changes.
Issue: Status: 404 Not Found Body: trace not found
We can no longer see the trace after a few minutes ([screenshots omitted]).

  • So we restarted the ingesters.
    After restarting the ingesters, we were able to see the missing traces ([screenshot omitted]).

Current configuration

overrides.yaml: |
    overrides: {}
  tempo-query.yaml: |
    backend: 127.0.0.1:3100
  tempo.yaml: |
    search_enabled: true
    multitenancy_enabled: true
    compactor:
      compaction:
        block_retention: 336h
        compacted_block_retention: 30m
        compaction_window: 30m
      ring:
        kvstore:
          store: memberlist
    distributor:
      ring:
        kvstore:
          store: memberlist
      receivers:
        otlp:
          protocols:
            http:
              endpoint: 0.0.0.0:55681
            grpc:
              endpoint: 0.0.0.0:4317
    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend-discovery:9095
        grpc_client_config:
          max_recv_msg_size: 1.34217728e+08
          max_send_msg_size: 1.34217728e+08
    ingester:
      lifecycler:
        ring:
          replication_factor: 1
          kvstore:
            store: memberlist
        tokens_file_path: /var/tempo/tokens.json
      complete_block_timeout: 10m
      flush_check_period: 5s
      trace_idle_period: 5s
      max_block_duration: 15m0s
      max_block_bytes: 734003200
    memberlist:
      abort_if_cluster_join_fails: true
      join_members:
        - tempo-gossip-ring
    overrides:
      max_traces_per_user: 0
      max_bytes_per_trace: 2000000
      ingestion_rate_strategy: local
      ingestion_burst_size_bytes: 40000000
      ingestion_rate_limit_bytes: 30000000
    server:
      grpc_server_max_recv_msg_size: 1.34217728e+08
      grpc_server_max_send_msg_size: 1.34217728e+08
      http_listen_port: 3100
    storage:
      trace:
        backend: s3
        s3:
          bucket: xxxxxxxxxxxxxxxxxxxxxxxx
          endpoint: xxxxxxxxxxxxxxxxxxxxxxxxxx
        pool:
          max_workers: 50
          queue_depth: 20000
        blocklist_poll: 5m
        local:
          path: /var/tempo/traces
        wal:
          path: /var/tempo/wal
        cache: memcached
        memcached:
          consistent_hash: true
          host: tempo-memcached
          service: memcached-client
          timeout: 500ms

         

@mdisibio
Contributor

mdisibio commented Feb 22, 2022

When a trace disappears and then reappears, it is usually explained by a mismatch between component timings, where a querier cannot see a block, for example. To start, I would adjust this setting. The default is 3x blocklist_poll, which gives more buffer for the tenant index to be built and for queriers to pick it up. This addresses the scenario where an ingester pushes a block to the backend and deletes its local copy before queriers see the new block.

    ingester:
      complete_block_timeout: 15m
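
A minimal sketch of how the two settings relate, assuming the blocklist_poll of 5m from the config above (15m is simply 3x that poll interval):

    storage:
      trace:
        blocklist_poll: 5m           # queriers rescan the backend blocklist at this interval
    ingester:
      complete_block_timeout: 15m    # keep completed blocks locally for ~3 poll cycles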

There are a few more ways to dig deeper if it is still occurring.
(1) Search the querier logs for messages like "searching for trace in block" when getting a trace not found. It will list all of the blocks checked. If it is sometimes working and sometimes not, we can see which block(s) contain the trace and whether they are being searched each time.

(2) Capture tracing data for Tempo itself, and look up the failed request that was logged in (1). The log message will contain another trace ID, which is Tempo's own trace. We can inspect the querier and ingester work from there.

@mdisibio
Contributor

mdisibio commented Mar 2, 2022

Hi, this issue seems to have moved on from the originally reported flatbuffer error. If you can provide the requested information we can keep troubleshooting; otherwise I think it makes sense to close this issue and open new ones as needed. We can always offer general help in the forums or the Slack channel.

  1. Can you provide a full file listing of /var/tempo/wal? We particularly want to see how large the blocks being flushed by the ingester are, and how large the search and search-header files are.
  2. Are there any quite large span attributes? For example, if the trace includes a copy of the HTTP request and response bodies, that could inflate the size of the search file. We can exclude these tags from search by setting the distributor: search_tags_deny_list.

@jvilhuber

jvilhuber commented May 6, 2022

We had this happen to us just now, so here's the /var/tempo/wal listing (ls -ltR):

$ kubectl -n tempo exec -it ingester-0 -- ls -ltR /var/tempo/wal
/var/tempo/wal:
total 1062500
-rw-r--r--    1 root     root      12271199 May  6 07:06 20368c0c-67d6-4001-a8db-80378a6cb4b3:single-tenant:v2:snappy:v2
drwxr-xr-x    2 root     root          4096 May  6 07:06 search
-rw-r--r--    1 root     root     1075715540 May  6 06:44 e2ce9212-aae4-4eb4-8a53-a045423d97b1:single-tenant:v2:snappy:v2
drwxr-xr-x    3 root     root          4096 May  6 04:27 blocks

/var/tempo/wal/search:
total 3451792
-rw-r--r--    1 root     root      29068476 May  6 07:06 20368c0c-67d6-4001-a8db-80378a6cb4b3:single-tenant:v2:none:
-rw-r--r--    1 root     root     3505560048 May  6 06:44 e2ce9212-aae4-4eb4-8a53-a045423d97b1:single-tenant:v2:none:

/var/tempo/wal/blocks:
total 4
drwxr-xr-x    7 root     root          4096 May  6 07:06 single-tenant

/var/tempo/wal/blocks/single-tenant:
total 20
drwxr-xr-x    2 root     root          4096 May  6 07:06 8a76fb3e-bd5e-4a2f-97fb-c77e4c245f91
drwxr-xr-x    2 root     root          4096 May  6 06:59 fa9cdc9f-59a3-4929-b83c-075fe08a48ee
drwxr-xr-x    2 root     root          4096 May  6 06:55 06ab2165-1b64-45a7-947e-ea9d9e673e78
drwxr-xr-x    2 root     root          4096 May  6 06:51 6b1feb85-da86-44a8-90ed-e5a2e4528535
drwxr-xr-x    2 root     root          4096 May  6 06:50 a1512bb6-a54a-4459-ba01-c34ea7ff95b3

/var/tempo/wal/blocks/single-tenant/8a76fb3e-bd5e-4a2f-97fb-c77e4c245f91:
total 7092
-rw-r--r--    1 root     root        102424 May  6 07:06 bloom-0
-rw-r--r--    1 root     root       6887984 May  6 07:06 data
-rw-r--r--    1 root     root            30 May  6 07:06 flushed
-rw-r--r--    1 root     root        256000 May  6 07:06 index
-rw-r--r--    1 root     root           381 May  6 07:06 meta.json

/var/tempo/wal/blocks/single-tenant/fa9cdc9f-59a3-4929-b83c-075fe08a48ee:
total 8752
-rw-r--r--    1 root     root        102424 May  6 06:59 bloom-0
-rw-r--r--    1 root     root       8585334 May  6 06:59 data
-rw-r--r--    1 root     root            30 May  6 06:59 flushed
-rw-r--r--    1 root     root        256000 May  6 06:59 index
-rw-r--r--    1 root     root           381 May  6 06:59 meta.json

/var/tempo/wal/blocks/single-tenant/06ab2165-1b64-45a7-947e-ea9d9e673e78:
total 7576
-rw-r--r--    1 root     root            30 May  6 06:55 flushed
-rw-r--r--    1 root     root        102424 May  6 06:54 bloom-0
-rw-r--r--    1 root     root       7382349 May  6 06:54 data
-rw-r--r--    1 root     root        256000 May  6 06:54 index
-rw-r--r--    1 root     root           381 May  6 06:54 meta.json

/var/tempo/wal/blocks/single-tenant/6b1feb85-da86-44a8-90ed-e5a2e4528535:
total 5596
-rw-r--r--    1 root     root        102424 May  6 06:51 bloom-0
-rw-r--r--    1 root     root       5354751 May  6 06:51 data
-rw-r--r--    1 root     root            30 May  6 06:51 flushed
-rw-r--r--    1 root     root        256000 May  6 06:51 index
-rw-r--r--    1 root     root           380 May  6 06:51 meta.json

/var/tempo/wal/blocks/single-tenant/a1512bb6-a54a-4459-ba01-c34ea7ff95b3:
total 6860
-rw-r--r--    1 root     root        102424 May  6 06:50 bloom-0
-rw-r--r--    1 root     root       6650228 May  6 06:50 data
-rw-r--r--    1 root     root            30 May  6 06:50 flushed
-rw-r--r--    1 root     root        256000 May  6 06:50 index
-rw-r--r--    1 root     root           381 May  6 06:50 meta.json

The PV is:
pvc-b816510b-0670-49cd-980b-71bdb2f5183b 100Gi RWO Delete Bound tempo/ingester-data-ingester-0 standard-block-storage 4h11m
and has:

$ kubectl -n tempo exec -it ingester-0 -- du -sm /var/tempo
4680	/var/tempo

So it should have plenty of space.

Some logs:

level=info ts=2022-05-06T07:06:10.183916094Z caller=flush.go:244 msg="completing block" userid=single-tenant blockID=8a76fb3e-bd5e-4a2f-97fb-c77e4c245f91
level=info ts=2022-05-06T07:06:11.089084194Z caller=flush.go:251 msg="block completed" userid=single-tenant blockID=8a76fb3e-bd5e-4a2f-97fb-c77e4c245f91 duration=905.177258ms
level=info ts=2022-05-06T07:06:11.096297342Z caller=flush.go:300 msg="flushing block" userid=single-tenant block=8a76fb3e-bd5e-4a2f-97fb-c77e4c245f91
panic: cannot grow buffer beyond 2 gigabytes
goroutine 289 [running]:
github.com/google/flatbuffers/go.(*Builder).growByteBuffer(...)
        /drone/src/vendor/github.com/google/flatbuffers/go/builder.go:214
github.com/google/flatbuffers/go.(*Builder).Prep(0xc0e5679710, 0x4, 0x5ee7)
        /drone/src/vendor/github.com/google/flatbuffers/go/builder.go:268 +0x2b8
github.com/google/flatbuffers/go.(*Builder).CreateString(0xc0e5679710, {0xc0aa9d0000, 0x5ee6})
        /drone/src/vendor/github.com/google/flatbuffers/go/builder.go:337 +0x4b
github.com/google/flatbuffers/go.(*Builder).CreateSharedString(0xc0e5679710, {0xc0aa9d0000, 0x5ee6})
        /drone/src/vendor/github.com/google/flatbuffers/go/builder.go:327 +0xa5

EDIT: we have

    max_block_bytes: 1073741824,
    max_block_duration: '1h',
    trace_idle_period: '10s',

and flush_check_period is unset (so whatever the default is).

@mdisibio
Contributor

mdisibio commented May 6, 2022

Hi, thanks for the information! Sorry that this is happening; my thought is that it's very traffic-dependent and not necessarily a bug. We haven't seen this ourselves. If this is causing immediate trouble, the best workaround is to disable search and delete the files from /var/tempo/wal/search.
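
For reference, a minimal sketch of that workaround against the config posted above (the flag is top-level, as in your values; re-enable it once the oversized files are cleaned up):

    search_enabled: false    # then remove the files under /var/tempo/wal/search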

  • Was search recently enabled? I notice that there are no search-related files under /var/tempo/wal/blocks/single-tenant/. There should be a search.meta.json file, for example. This question will help us understand whether things are working some of the time. When fully working, it looks like this:
/var/tempo/wal/blocks/1/dd07acdd-7622-49fe-bd73-1117e2e0a752:
total 670M   
-rw-r--r--    1 root     root       97.7K May  6 13:50 bloom-0
-rw-r--r--    1 root     root       97.7K May  6 13:50 bloom-1
-rw-r--r--    1 root     root       97.7K May  6 13:50 bloom-2
-rw-r--r--    1 root     root      590.9M May  6 13:50 data
-rw-r--r--    1 root     root          29 May  6 13:50 flushed
-rw-r--r--    1 root     root      250.0K May  6 13:50 index
-rw-r--r--    1 root     root         375 May  6 13:50 meta.json
-rw-r--r--    1 root     root       59.0M May  6 13:50 search
-rw-r--r--    1 root     root       19.6M May  6 13:50 search-header
-rw-r--r--    1 root     root      100.0K May  6 13:50 search-index
-rw-r--r--    1 root     root          77 May  6 13:50 search.meta.json
  • If you have the complete stack trace, that will be helpful, as it shows which file is erroring. There are 3 locations where flatbuffers are used.

  • The large size of the files under /var/tempo/wal/search is very surprising compared to the size of the matching WAL block. For example, wal/search/e2ce9212-aae4-4eb4-8a53-a045423d97b1:single-tenant:v2:none: is 3.5GB but wal/e2ce9212-aae4-4eb4-8a53-a045423d97b1:single-tenant:v2:snappy:v2 is only 1.1GB. The search files contain a subset of the data, so they are usually smaller (even without compression), not ~300% larger.

Internally our ratio looks like this:

# ls -lRh /var/tempo/wal
/var/tempo/wal:
-rw-r--r--    1 root     root       42.4M May  6 14:03 0ffafe61-c321-46e1-9875-4b76f07afda2:1:v2:snappy:v2
-rw-r--r--    1 root     root      709.8M May  6 14:03 f683d609-c666-45e8-bb99-aa122ed618a8:1:v2:snappy:v2

/var/tempo/wal/search:
-rw-r--r--    1 root     root       17.2M May  6 14:03 0ffafe61-c321-46e1-9875-4b76f07afda2:1:v2:none:
-rw-r--r--    1 root     root      275.3M May  6 14:03 f683d609-c666-45e8-bb99-aa122ed618a8:1:v2:none:

~60% smaller.

Please see question (3) in the comment above. Do your traces include any very large attributes like HTTP request/response bodies? If so, we can stop recording them for search by setting search_tags_deny_list. Removing smaller tags that aren't searched, such as client-uuid, will also help.

The 2GB limit is a hard limit that can't be changed, so the goal is to determine what is using that much space and eliminate it.

@jvilhuber

jvilhuber commented May 9, 2022

We've had search enabled for a while now, but this seems to have started happening either because of the upgrade to 1.4.1, or because we switched to tail-based sampling in otel-collector (those two happened around the same time).

I'm not aware of any huge tags, though I did find one Java command-line attribute that was quite large, and I excluded it. Is there a way to look at the data in the search files to find any large tags and/or verify that my exclusion worked?

The crash just happened again (sadly I deleted the PVs and recreated the ingester StatefulSet to get things working again, so I can't get a directory listing; I have a feeling I can get it later, when it crashes again, though):

level=info ts=2022-05-09T04:42:26.399844575Z caller=flush.go:300 msg="flushing block" userid=single-tenant block=d494cad4-2176-408e-984e-9e1ed1d4bc1e
	
panic: cannot grow buffer beyond 2 gigabytes
goroutine 321 [running]:
github.com/google/flatbuffers/go.(*Builder).growByteBuffer(...)
	/drone/src/vendor/github.com/google/flatbuffers/go/builder.go:214
github.com/google/flatbuffers/go.(*Builder).Prep(0xc000855910, 0x4, 0x5fe1)
	/drone/src/vendor/github.com/google/flatbuffers/go/builder.go:268 +0x2b8
github.com/google/flatbuffers/go.(*Builder).CreateString(0xc000855910, {0xc0b71a6000, 0x5fe0})
	/drone/src/vendor/github.com/google/flatbuffers/go/builder.go:337 +0x4b
github.com/google/flatbuffers/go.(*Builder).CreateSharedString(0xc000855910, {0xc0b71a6000, 0x5fe0})
	/drone/src/vendor/github.com/google/flatbuffers/go/builder.go:327 +0xa5
github.com/grafana/tempo/pkg/tempofb.writeKeyValues(0xc000855910, {0xc0b9bffc90, 0x80}, {0xc0f6f2a000, 0xbf00, 0xec00}, {0x22866c0, 0xc0eee4b4f0}, 0x0)
	/drone/src/pkg/tempofb/searchdatamap.go:133 +0x318
github.com/grafana/tempo/pkg/tempofb.WriteSearchDataMap(0xe48, 0x1dccc98, 0xc)
	/drone/src/pkg/tempofb/searchdatamap.go:89 +0x2b8
github.com/grafana/tempo/pkg/tempofb.(*SearchBlockHeaderMutable).ToBytes(0xc000855b08)
	/drone/src/pkg/tempofb/SearchBlockHeader_util.go:66 +0xd3
github.com/grafana/tempo/tempodb/search.NewBackendSearchBlock(0x40, {0x2286948, 0xc040b06120}, {0x15, 0xb0, 0x86, 0x6d, 0xaa, 0xcb, 0x41, ...}, ...)
	/drone/src/tempodb/search/backend_search_block.go:115 +0x656
github.com/grafana/tempo/tempodb.(*readerWriter).CompleteSearchBlockWithBackend(0xbb41cbaa6d86b015, 0xd1de026f6f7c8880, {0x15, 0xb0, 0x86, 0x6d, 0xaa, 0xcb, 0x41, 0xbb, ...}, ...)
	/drone/src/tempodb/tempodb.go:271 +0x85
github.com/grafana/tempo/modules/ingester.(*instance).CompleteBlock(0xc0007939e0, {0x15, 0xb0, 0x86, 0x6d, 0xaa, 0xcb, 0x41, 0xbb, 0x80, ...})
	/drone/src/modules/ingester/instance.go:311 +0x2a4
github.com/grafana/tempo/modules/ingester.(*Ingester).handleComplete(0xc000282c00, 0xc040f86370)
	/drone/src/modules/ingester/flush.go:250 +0x25c
github.com/grafana/tempo/modules/ingester.(*Ingester).flushLoop(0xc000282c00, 0x1)
	/drone/src/modules/ingester/flush.go:203 +0xc7
created by github.com/grafana/tempo/modules/ingester.New
	/drone/src/modules/ingester/ingester.go:89 +0x1e9

@jvilhuber

jvilhuber commented May 9, 2022

BTW: once an ingester gets into this state, it seems to be hosed until I manually "delete the large file" (i.e. kill the PV everything is stored on, since I don't think I should be messing with ingester-internal files with a scalpel). If nothing else, it might be good to fix it so that ingesters can recover without manual intervention (perhaps that's a tall order, though).

Suggestion 2: perhaps there should be (or perhaps there already is? Let me know!) a setting to reject large tags, or to stop adding to a file that is already too big, so we don't get into this unrecoverable state in the first place.

@github-actions
Contributor

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

github-actions bot added the stale label on Nov 17, 2022
github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Dec 3, 2022