After shard splitting, our log is flooded with warning messages "Cannot find the shard given shardId" #55

Open
xujiaxj opened this issue Jan 12, 2016 · 20 comments

Comments

@xujiaxj

xujiaxj commented Jan 12, 2016

ShardSyncTask runs either on worker initialization or when the worker detects that one of its assigned shards has completed. In the event of a shard split, however, if the child shard lands on a worker that was not previously processing the parent shard, that worker will not run the ShardSyncTask, because none of its previously assigned shards have completed.

Meanwhile, the lease coordinator has timer tasks to sync up with the Dynamo table to assign itself shards to process.

So we end up with the worker processing the child shard while, at the same time, it keeps logging a warning message from line 208 of KinesisProxy:

LOG.warn("Cannot find the shard given the shardId " + shardId);

As far as I understand, the shard info is needed only for de-aggregation, to discard user records that are supposed to be re-routed to other shards during resharding. So we are not experiencing dropped records or anything severe; it's just the flooding of our log, and maybe some duplicates, as we are using KPL aggregation on the producer side.
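For reference, the lookup that emits this warning is just a cache miss against the proxy's shard list; a minimal sketch of its shape (class and field names here are assumed for illustration, not copied from the KCL source):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import com.amazonaws.services.kinesis.model.Shard;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    // Sketch only: the proxy's shard map is populated at startup and refreshed when one
    // of the worker's own shards ends, so a shardId obtained any other way (for example
    // via a child-shard lease picked up by a different worker) may not be present yet.
    class ShardCacheSketch {
        private static final Log LOG = LogFactory.getLog(ShardCacheSketch.class);
        private final Map<String, Shard> cachedShardMap = new ConcurrentHashMap<>();

        Shard getShard(String shardId) {
            Shard shard = cachedShardMap.get(shardId);
            if (shard == null) {
                LOG.warn("Cannot find the shard given the shardId " + shardId);
            }
            return shard;
        }
    }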

@aakavalevich

I have the same warning. Does anyone know how to fix it?

@amanduggal

We are facing a similar issue. Any advice on a solution would be appreciated.

@xujiaxj just curious whether you've come up with anything since filing this bug?

@xujiaxj
Author

xujiaxj commented Jul 28, 2016

@amanduggal we modified our logback configuration to suppress the warning message:

<logger name="com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy" level="ERROR"/>
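
For anyone else doing the same, that element goes inside the logback configuration root; a minimal logback.xml would look roughly like this (the appender and pattern below are placeholders, only the logger element is the actual workaround):

    <configuration>
      <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
          <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
      </appender>

      <!-- raise the threshold so the repeated "Cannot find the shard" WARN is dropped -->
      <logger name="com.amazonaws.services.kinesis.clientlibrary.proxies.KinesisProxy" level="ERROR"/>

      <root level="INFO">
        <appender-ref ref="STDOUT"/>
      </root>
    </configuration>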

@matthewbogner

This is especially annoying when using the KCL to read a DynamoDB stream, which splits its shards every 4 hours according to this blog post by one of the DynamoDB engineers at AWS:

Typically, shards in DynamoDB streams close for writes roughly every four hours after they are created and become completely unavailable 24 hours after they are created.

https://blogs.aws.amazon.com/bigdata/post/TxFCI3UJJJYEXJ/Process-Large-DynamoDB-Streams-Using-Multiple-Amazon-Kinesis-Client-Library-KCL

@pfifer
Contributor

pfifer commented Aug 16, 2016

Just letting people know that we are aware of this. We're looking into fixing this, but I don't have an ETA at this time.

@shawnsmith

I ran into this using DynamoDB Streams without any explicit shard splitting (just the usual DynamoDB shard cycling @matthewbogner described). FWIW, here is the sequence we encountered that triggered the warnings. With DynamoDB Streams this happens pretty often: at any given point in time there's almost always at least one of our servers in this state, logging these warnings every 2 seconds. We've had to turn off WARN for KinesisProxy and ProcessTask.

Assume a DynamoDB stream with shard S1 and two stream workers A and B using the KCL (we aren't using the KPL):

  1. At the start, consumer A owns a lease on shard S1, consumer B is idle because no leases are available.
  2. At some point, DynamoDB closes shard S1 and creates a child shard S2 whose parent is S1.
  3. A reaches the end of S1.
  4. A syncs the shard set with the DynamoDB lease table, creating a new lease for S2.
  5. A obtains the lease for S2. It hasn't yet cleaned up the lease for S1.
  6. B wakes up, notices two leases in the lease table (S1 and S2), both owned by A, and steals the S2 lease from A (code).
  7. A notices that the S2 lease has been lost and becomes idle.
  8. B begins processing records in S2.
  9. B logs warnings because it did not execute a code path that would cause it to re-sync its cached list of shards to include S2 (code).
    • KinesisProxy initializes its cached shard list on startup
    • KinesisProxy cached shard list is refreshed upon reaching the end of a shard
    • KinesisProxy cached shard list is NOT refreshed on lease steal
  10. B continues to log warnings until it reaches the end of shard S2.
  11. ... at which point A may steal the lease for the new child shard S3 and begin logging warnings.
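
The gap is the missing third bullet under step 9. Conceptually (none of these names exist in the KCL; this is only a sketch of the fix direction), closing it would look something like:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Conceptual sketch only: refresh the cached shard list whenever a lease arrives for
    // a shard the proxy has never seen, e.g. a lease stolen from another worker.
    class ShardCacheRefreshSketch {
        private final Set<String> cachedShardIds = ConcurrentHashMap.newKeySet();

        void onLeaseAcquired(String shardId, boolean acquiredByStealing) {
            if (acquiredByStealing && !cachedShardIds.contains(shardId)) {
                refreshCachedShardList();
            }
        }

        private void refreshCachedShardList() {
            // In a real fix this would re-list the stream's shards and rebuild the shard
            // map, the same work that already happens at startup and at end-of-shard.
        }
    }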

@ryanlewis

I've been testing the stack and looking at the sharding, and I've been noticing these errors, although everything appears to continue working.

Forgive my newness to the technology, but is this something that we should be concerned about?

@adrian-baker

Unsure why this is labelled an enhancement?

@klesniewski

Any updates on this? If I understood @shawnsmith's analysis correctly, the solution is to refresh the cached shard list on lease steal?

@adrian-baker

adrian-baker commented Mar 7, 2019

From 2016:

Just letting people know that we are aware of this. We're looking into fixing this, but I don't have an ETA at this time.

Is this still the case?

@mrhota

mrhota commented Feb 28, 2020

Just copying this over from the linked issue. @pfifer Do you have any updates or insight here?

I think I have the same issue, although we also see non-stop ERROR logs like:

ERROR [2020-02-27 13:02:45,382] [RecordProcessor-2873] c.a.s.k.c.lib.worker.InitializeTask: Caught exception: 
com.amazonaws.services.kinesis.clientlibrary.exceptions.internal.KinesisClientLibIOException: Unable to fetch checkpoint for shardId shardId-00000001582460850801-53f6f94b
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.getCheckpointObject(KinesisClientLibLeaseCoordinator.java:286)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitializeTask.call(InitializeTask.java:82)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I also found an AWS Developer Forums thread related to this issue: https://forums.aws.amazon.com/thread.jspa?messageID=913872

@mrhota

mrhota commented Mar 26, 2020

@pfifer any ideas? any updates? any anything?

@aldilaff

@pfifer any updates on this?

@igracia

igracia commented Sep 9, 2020

@aldilaff might want to check #185 too, in case the workerId is also bugging you.

@joshua-kim
Contributor

Cannot find the shard given the shardId

chgenvulgfjlejltgvglhecbucrihrcbbclfj

@igracia

igracia commented Oct 26, 2020

@joshua-kim was that a yubikey press? :-P Otherwise, can you please elaborate on why the issue is being closed and how to solve/prevent it?

@joshua-kim joshua-kim reopened this Oct 26, 2020
@joshua-kim
Contributor

@igracia Sorry, yes, that was a Yubikey press. I was referencing this issue while looking into another cached-shard-map issue in a fork of 1.6. I'm curious, though: are you still seeing this on the latest 2.x/1.x releases? The latest releases no longer use ListShards in most cases, so I'd like to know whether this bug is still present.

@igracia

igracia commented Oct 28, 2020

Thanks @joshua-kim! We have several consumers using the DynamoDB Streams Kinesis adapter on a single shard, and we are still getting this with the following versions:

  • dynamodb-streams-kinesis-adapter 1.5.1
  • amazon-kinesis-client 1.13.3

Bumping those versions makes everything stop working, so we're stuck with them for the time being. Also, as per this issue in dynamodb-streams-kinesis-adapter, we can't use v2. Any suggestions would be appreciated!

@dacevedo12

Same problem 6 years later 😞

I'm using amazon-kinesis-client 1.13.3 with dynamodb-streams-kinesis-adapter 1.5.3.
This is especially annoying in combination with the already spammy MultilangDaemon.

@adrian-skybaker

adrian-skybaker commented Oct 4, 2022

The KCL dev flow has been quite stable in the many years I've been using it.

  1. wire in the KCL library
  2. be surprised about how much boilerplate handling code is required, without much supporting documentation, particularly on how to handle errors safely
  3. be alarmed about sporadic, opaque but continual warnings logged in your production deployments
  4. spend time googling and pursuing old open github issues with unclear resolutions
  5. give up and set log level to ERROR and cross your fingers. Hopefully you're not dealing with a domain where data loss is a serious issue. Or switch to Lambda.
