
"Stuck" Kinesis Shards #185

Open
aloisbarreras opened this issue Jun 28, 2017 · 34 comments

@aloisbarreras

aloisbarreras commented Jun 28, 2017

@pfifer I've been using Kinesis as my real-time data streaming pipeline for over a year now, and I am consistently running into an issue where Kinesis shards seem to randomly stop processing data. I am using the Node.js KCL library to consume the stream.

Here are some screenshots to show what I mean.

[Screenshot: DataBytesProcessed per shard]

You can see here that from roughly 21:00 to 6:00, shards 98 and 100 stop emitting data for DataBytesProcessed. I restarted the application at 6:00 and the consumer immediately started processing data again.

Now here is a graph of the IteratorAgeMilliseconds for the same time period.

[Screenshot: IteratorAgeMilliseconds per shard]

The shards are still emitting IteratorAge from 21:00 to 6:00 and show that the IteratorAgeMilliseconds is still at 0, so it seems like the KCL was still running during that time, but it wasn't consuming more data from the stream. You can see that when I restarted the application at 6:00, the KCL realized it was very far behind and the IteratorAge spiked up to 32.0M instantly. I can confirm that the KCL process was still running during that time period and continuously logged out the usual messages:

INFO: Current stream shard assignments: shardId-000000000098
Jun 28, 2017 5:37:36 PM com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker info
INFO: Sleeping ...

There are 2 other shards in the same stream and they continued to process data as normal. I have been working on debugging this on my end for several months now, but have come up empty. I am catching all exceptions thrown in my application, and I have set timeouts on any network requests or async I/O operations, so I don't think the process is crashing or getting hung up on anything.

I also thought that maybe I didn't have enough resources allocated to the application and that's why the process got hung up. So I made 4 instances of my application (1 for each shard), each with 224GB of RAM and 32 CPUs, but I still run into this issue.

I cannot seem to reliably replicate this issue; it seems to just happen randomly across different shards and can range from happening a few times a week to a few times a day. FYI, I have several other Node applications using the KCL that also experience the same issue.

I have seen this issue, where the reporter seems to have the same problem with the Python library and solved it by just restarting the applications every 2 hours...and also this issue, which seems to be the same problem I have as well.

Is this a known bug with the KCL? And if the problem is on my end, do you have any pointers on how to track down this problem?

@pfifer
Contributor

pfifer commented Jun 30, 2017

I don't believe this is a bug in the core Java KCL, but could be a bug in the MultiLang component.

I also thought that maybe I didn't have enough resources allocated to the application and something weird was happening with that and that's why the process got hung up. So, I made 4 instances of my application (1 for each shard) each with 224GB of RAM and 32 cpus, but I still run into this issue.

You shouldn't need to do this, since each record processor is single threaded. If your record processor does its work on the main thread, the additional capacity will simply be wasted.

What you're seeing is a missing response after the records were dispatched to the language component. The Java record processor waits for a response from the language component before returning from the processRecords call. The MultiLang Daemon implements the Java record processor interface and acts as a forwarder to the language-specific record processor. The dispatch behavior of the Java record processor is to make a call to processRecords and pause activity until that call returns. The MultiLang Daemon implements the same behavior, but across process boundaries: acting as a forwarder, it needs to wait for a response from the native-language record processor. If for some reason the response never comes, it will hang indefinitely.
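
As a rough illustration of that forwarding behaviour (the names and protocol details below are hypothetical, not the actual MultiLang Daemon API): the Java side dispatches the records to the child process and then blocks until the child acknowledges them, so a lost acknowledgement stalls the shard.

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ForwardingRecordProcessorSketch {
    // Stand-in for the real stdin/stdout message protocol between the daemon
    // and the native-language record processor (hypothetical interface).
    interface ChildProcessProtocol {
        void writeProcessRecordsMessage(Object records);
        String readStatusMessage();
    }

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    void processRecords(Object records, ChildProcessProtocol child) {
        Future<String> reply = executor.submit(() -> {
            child.writeProcessRecordsMessage(records);  // forward the batch to the child
            return child.readStatusMessage();           // block until the child replies
        });
        try {
            // The daemon waits here with no upper bound; if the reply never arrives,
            // this call, and therefore the shard, hangs indefinitely. The short-term
            // fix discussed below bounds this wait.
            reply.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("Child record processor did not respond", e);
        }
    }
}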

There isn't currently a good way to see what is actually causing the problem; it could be any of the three components involved (the Java KCL worker, the MultiLang Daemon forwarder, or the native-language record processor).

I think a short-term fix of adding a configuration that limits the maximum time the MultiLang Daemon will wait for a response could limit the impact of this sort of issue. This fix would terminate the record processor if a response isn't received within the configured time.

Longer term, we will need to add some additional logging to understand the state of the components on both sides of the process boundary. This would help us detect which component is losing track of the current state.

For everyone else: please comment or add a reaction to help us know how many people are affected by this issue.

@aloisbarreras
Author

@pfifer thanks for looking into this. Let me know if you need anything from me to help you debug this.

That short term fix would be pretty helpful as well, so if you get that implemented, please let me know.

Thanks!

@aloisbarreras
Author

@pfifer any update on this? Even implementing some sort of short term fix would be extremely helpful.

@perryao

perryao commented Jul 19, 2017

+1

@pfifer
Contributor

pfifer commented Jul 19, 2017

We're currently working on the short term fix for this.

@pfifer
Contributor

pfifer commented Jul 25, 2017

To help us troubleshoot the source of this issue, can you enable more detailed logging using these steps:

If you don't mind using Apache Maven, this is a pom.xml that will provide additional logging, which can help diagnose issues.

To create the launcher script:

  1. Create an empty directory
  2. Create a new file in the directory called pom.xml and copy the contents of this pom.xml in to it.
  3. Install Apache Maven if you don't already have it.
  4. In the directory created in 1, run the command mvn package
  5. If everything works, you should have a script at target/appassembler/bin/multilang-daemon, with all the jars in target/appassembler/repo.
  6. Copy the directory target/appassembler to wherever you need it.
  7. To start your application, run the multilang-daemon script (multilang-daemon.bat on Windows) with the location of your properties file.

The pom.xml above uses Logback and SLF4J to provide logging. It will default to rather verbose console logging without any configuration. I’ve uploaded an example Logback configuration file that can be used to control the amount of logging.

To provide the configuration file for logging follow these steps:

  1. In your application directory, create a directory called logback (you can call it something else if you want to).
  2. Download the Logback configuration file, logback.xml, and save it to logback/logback.xml (a minimal sketch of such a file appears after these steps).
  3. Provide a classpath prefix for your application at startup.
    You can use relative paths; the script will correctly set the current working directory.
    CLASSPATH_PREFIX=<path to the logback directory created in step 1> <path to your application directory>/bin/multilang-daemon <your properties file>
    
    Here is an example of the command assuming:
    • You've copied the contents of target/appassembler to your application directory
    • The name of your application's properties file is config.properties
    • You named the logging configuration directory logback
    • You're starting the MultiLang Daemon in your application directory
    CLASSPATH_PREFIX=logback bin/multilang-daemon config.properties
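
For reference, here is a minimal logback.xml sketch along the lines of what the linked configuration provides (the logger names and levels are assumptions; adjust them to the packages you want to see):

<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Turn up logging for the KCL while keeping everything else at INFO -->
  <logger name="com.amazonaws.services.kinesis" level="DEBUG"/>

  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>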
    

@kenXengineering

@pfifer, I've been working with Alois on this issue as well, just getting back with some more information. We now have the debug logs streaming to our ELK stack, so if we see a case where the Kinesis stream seems to be stuck we can hopefully get logs from our process that processes the stream.

We were also investigating DynamoDB, and we can see that when the Kinesis stream is no longer getting processed, our tables' Get Latency starts reporting values (the graph had been empty the whole time before), and the Scan Latency spikes very high. The Write capacity also dips during the same time. These issues go away once we restart the process that is consuming the Kinesis stream.

We have seen this on all the DynamoDB tables that are created by our processors. We have 10 Kinesis streams that have gotten "stuck" in the last day, and on all of their DynamoDB tables we see high latency at the same time. Some of us are wondering if a request to DynamoDB is getting dropped and the caller waits for a response indefinitely, pausing the process.

We've also noticed that these incidents seem to happen together in a one to two hour window. Last night (7/28/2017) at around 1:30 AM five of our Kinesis streams became "stuck" within 5 min of each other. Then two other streams became "stuck" at 4:50 AM and 5:50 AM respectively.

It took a bit of wrangling to get the logs from KCL since we use the amazon-kinesis-client-nodejs library to run our processes, but with a fork of that I was able to get all the jars setup and logs streaming over. Hopefully we get some useful logs soon! Thanks!

@kenXengineering

@pfifer We finally got some logs for ya! You can see them at https://gist.github.com/chosenken/e79f3eda5eefb5bd6a910ae6e5875d4a#file-kcl-log-L699. Specifically on line 699, I see the multi-lang-daemon-0004 report back that it is waiting on a previous PROCESS task, and it will sit there until we restart the process. The file has one minute of logs, with data being processed at the beginning.

Please let me know if you need any more information, thank you!

@pfifer pfifer added this to the Release 1.8.1 milestone Aug 2, 2017
pfifer added a commit to pfifer/amazon-kinesis-client that referenced this issue Aug 2, 2017
Support timeouts for calls to the MultiLang Daemon
This adds support for setting a timeout when dispatching records to
the client record processor. If the record processor doesn't respond
within the timeout the parent Java process will be terminated. This
is a temporary fix to handle cases where the KCL becomes blocked
while waiting for a client record processor.

The timeout for this can be set by adding
  `timeoutInSeconds = <timeout value>`.
The default is no timeout.

Setting this can cause the KCL to exit suddenly; before using it,
ensure that you have an automated restart for your application.

Related awslabs#195
Related awslabs#185
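
For anyone planning to try this once it's released, here is a sketch of what the commit above describes, as it would appear in the .properties file passed to the MultiLang Daemon (the 60-second value is just an example; by default there is no timeout):

# Terminate the worker if the record processor does not respond within 60 seconds.
# Only enable this if your application is restarted automatically when it exits.
timeoutInSeconds = 60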
@pfifer pfifer mentioned this issue Aug 2, 2017
pfifer added a commit that referenced this issue Aug 2, 2017
Support timeouts for calls to the MultiLang Daemon
@oscarjiv91

I'm not sure if this is the same case, but running the KCL sample project with a stream of 4 shards, the first shard shardId-00000000 is initialized, but processRecords is never called for that shard. In DynamoDB, the checkpoint column always stays at LATEST. Re-running the application doesn't seem to work, and I get no errors. Is this issue related to what you are trying to fix? Thank you!

Is there a workaround in the meantime?

@sahilpalvia
Contributor

@oscarjiv91 I don't think your issue is related to this one, but to be sure I would need some additional information. Could you provide a link to the sample that you are using? Another possibility is that no records are being pushed to your stream; in that case the checkpointer will never advance, because there is no record to checkpoint against. You can check that by enabling callProcessRecordsEvenForEmptyRecordList in the properties file, which makes the KCL call the record processor even when there are no records on the shard (see the sketch below). If you need further assistance, I would recommend opening a thread on the AWS Kinesis Forum.
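
As a minimal illustration, the property would look like this in the KCL .properties file (sketch only, alongside your existing settings):

# Call processRecords even when GetRecords returns no records, so the
# record processor can still checkpoint on an idle shard.
callProcessRecordsEvenForEmptyRecordList = true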

@oscarjiv91

This is the link: https://github.com/awslabs/amazon-kinesis-client-nodejs/tree/master/samples/click_stream_sample

I'm pushing records with the same sample project and have set 4 shards. If I use 3 shards, I have the same problem (shardId-00000000 is stuck), but not with 2 shards, which works well.

Thank you. I will open up a thread on the Kinesis Forum.

@juanstiza

We have a KCL app working (not in production yet), and it showed similar behaviour. Two weeks ago it just stopped fetching records; after a restart it would pick back up. I had to set up a cron job to restart the process every day in order to avoid this.

@sahilpalvia
Contributor

@juanstiza Are you using the Java KCL or the Multilang?

@juanstiza

@sahilpalvia We are using Java.

@xinec

xinec commented Dec 5, 2017

+1, Java KCL

@antgus

antgus commented Dec 18, 2017

+1, Java KCL

Update: It turned out the source of the problem was that all workers had the same workerId. By the way, @pfifer, it would probably be useful to add a comment in SampleConsumer where the workerId is set, since it's easy for a newcomer to miss the requirement for unique worker ids. It might even be wise to concatenate the id with a random string so that in SampleConsumer it would be unique by default.
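
A minimal sketch of that suggestion for the Java KCL (the application and stream names below are placeholders; the constructor shown is the KCL 1.x KinesisClientLibConfiguration):

import java.net.InetAddress;
import java.util.UUID;

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;

class UniqueWorkerIdSketch {
    static KinesisClientLibConfiguration buildConfig() throws Exception {
        // Host name plus a random UUID keeps worker ids unique across instances,
        // and across multiple workers on the same instance.
        String workerId = InetAddress.getLocalHost().getCanonicalHostName()
                + ":" + UUID.randomUUID();
        return new KinesisClientLibConfiguration(
                "SampleConsumerApp",   // application name (placeholder)
                "sample-stream",       // stream name (placeholder)
                new DefaultAWSCredentialsProviderChain(),
                workerId);
    }
}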

@pfifer
Contributor

pfifer commented Dec 18, 2017

In Release 1.8.8 we added the ability to log a message when a processing task is blocked for a long time. For those affected by this, could you enable the log message?

Additionally, if you see the log message, could you retrieve a stack trace from the Java application? On Linux and macOS you can do this with jstack, using a command like jstack <pid of java process>. This will output the stack traces for all threads. You can post it here, or send it to me on the AWS Forums; my account is justintaws.

@shaharck

shaharck commented Feb 2, 2018

@pfifer We're seeing similar behavior on every new deployment of our KCL cluster.
Prior to a new deployment, we see that for quite some time the worker for the shard is not receiving any records from Kinesis: https://gist.github.com/shaharck/25e9fefd0d5d6dca4705bd0f10f899dd

When a new KCL instance comes up and takes the lease, it starts processing records from that shard with a large spike in iterator age.
Post-deploy logs: https://gist.github.com/shaharck/871f0dcc8a26a82bd3c812ea8ce07c2b

[Screenshot: iterator age spike after deployment]
It always seems to be one shard (not the same one each time) that is "getting stuck".

It then processes the records and the delay goes down.

FYI we're using Java KCL v1.7.5

@pfifer
Contributor

pfifer commented Feb 5, 2018

@shaharck The high iterator age can occur for a number of reasons.

If it's caused by processing getting stuck, you should see the metrics for the stuck shard stop being produced. If the processing is actually stuck, a stack trace captured via jstack would be really helpful.

If the metrics are continuous, it indicates that processing is occurring. The common cause of iterator spikes in this case is not having a recent checkpoint. This can happen for a number of reasons, but the usual culprits are the checkpoint strategy or a stream receiving no data. When there is no data on the stream, your application is unable to checkpoint at its current position in time, so if the lease is lost, processing for the shard will restart at the last checkpoint, which will be from the last time data was available on the stream.

There is a third possibility: your record processor is getting stuck. The processing of a shard is synchronous. The KCL waits for the record processor to finish processing a batch of records before it retrieves and dispatches the next batch. If your record processor gets stuck for any reason, this can cause processing on the affected shard to stop. To help detect these situations we added a new log message in Release 1.8.8. The message is emitted whenever processing is paused for an extended period of time while waiting for the record processor to return.

@limawebdev1

Also having this issue with the aws-sdk Node package.
PutRecords suddenly stopped, and only after restarting the server that runs the putRecords method did we see any changes (the data point with 10 records is me running that server locally as a test).
[Screenshot: PutRecords record counts]
And there we see a long latency for putRecords.
[Screenshot: PutRecords latency]
Any insight? Thank you!

@pfifer
Contributor

pfifer commented Aug 14, 2018

This issue has only been reported and confirmed with the Node.js multi-lang daemon. It's possible that it may occur with other multi-lang daemon clients, but we haven't been able to confirm that.

We have no indications that this affects the Java KCL. The most common cause of this behavior in the Java KCL is the record processor blocking. The KCL dispatches batches of records serially, and must wait for the call to IRecordProcessor#processRecords() to return before it will move on. If you're using the prefetch feature, the KCL will fetch some data ahead, but queue it until the record processor is ready to process it. If you're not using prefetch, the KCL will not receive data for that shard until the record processor returns.

In release 1.8.8 we added logWarningForTaskAfterMillis, which will log a message if a task hasn't completed after the configured amount of time. This can be used to detect when a task is blocked for an extended period. The task that most people care about is ProcessTask, which is responsible for retrieving records and dispatching them to the record processor.
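
A sketch of enabling that warning with the Java KCL (assuming release 1.8.8 or later; the application name, stream name, and 30-second threshold are placeholders):

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;

class BlockedTaskWarningSketch {
    static KinesisClientLibConfiguration buildConfig(String workerId) {
        return new KinesisClientLibConfiguration(
                "MyApp",       // application name (placeholder)
                "my-stream",   // stream name (placeholder)
                new DefaultAWSCredentialsProviderChain(),
                workerId)
            // Log a warning whenever a task (e.g. ProcessTask) has been running
            // for more than 30 seconds without completing.
            .withLogWarningForTaskAfterMillis(30_000L);
    }
}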

If your application appears to be stuck, one of the first options is to retrieve thread stack traces from the JVM. The simplest way is to use jstack from the JDK; this may require installing the JDK on the affected host. On Linux and macOS there is a second option: use kill -QUIT or kill -3. These signals will cause the JVM to log the stack trace to stdout.

Here is an example of a blocked thread retrieved using jstack (it's using Thread#sleep to simulate a blocking call, so the thread states for your threads may be different):

"RecordProcessor-0001" #19 prio=5 os_prio=31 tid=0x00007fd96d1ef800 nid=0x1007 waiting on condition [0x00007000101a8000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at com.amazon.aws.kinesis.RecordProcessor.processRecords(RecordProcessor.java:19)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.callProcessRecords(ProcessTask.java:221)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:176)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@pfifer
Contributor

pfifer commented Aug 14, 2018

@limawebdev1

PutRecord/PutRecords isn't related to the KCL. You appear to have run into an issue with the AWS Node SDK. I would recommend opening an issue on the aws-sdk-js repository.

@gonzalodiaz

gonzalodiaz commented Aug 21, 2018

@pfifer we are running a production instance of Kinesis Streams + PySpark Streaming.

From time to time, and recently more often (twice a day), the application stops processing records indefinitely. It can go on like this for hours until we restart the receiver. We now have some alarms to alert us about this situation.

We couldn't find anything in the logs, but the way we recover the Spark application is by manually killing the running receiver. Spark spawns a new receiver and the records start coming along just fine.

Our issue seems to be related to this thread. We are still looking for a solution, or at least a workaround to avoid the manual restart. Any help here would be gratefully accepted!
We are running the latest EMR version (emr-5.16.0).

UPDATE 2018-09-15:
This error appears consistently in the logs when the app stops consuming:

18/09/11 18:20:00 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error while storing block into Spark - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
	at org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
	at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
	at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
	at org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
	at org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:306)
	at org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:357)
	at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
	at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
	at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)

@manoj-maharjan

@pfifer We are using Java KCL version 1.8.7 in a Spring Boot application (with Docker) in our testing and staging environments (going to production soon). We are using a Kinesis stream with 2 shards. We have been noticing the behavior described in this thread: the KCL application stops reading messages from the Kinesis stream even though there are new messages in the stream. We terminated the EC2 instance to spin up a new one, but that didn't help. However, after restarting the Docker container it started processing messages from the Kinesis stream again, which is our temporary workaround for now.
We contacted AWS customer support; we could not find the root cause, but we tried the following things:

  1. Upgraded the KCL version to 1.8.10 (we tried to upgrade to 1.9.1, but it failed because of an incompatible AWS Java SDK version, 1.11.37). We hope KCL 1.8.10 will fix this issue.
  2. Changed the workerId to ensure it is unique:
workerId = InetAddress.getLocalHost().getCanonicalHostName() + ":" + UUID.randomUUID();
  3. We are still testing for this issue. Are there any good AWS metrics to identify this issue and set up alarms?

@htarevern

htarevern commented Oct 26, 2018

@pfifer I'm facing the same issue as other folks here. I'm using the Java implementation of the KCL to read stream records from DynamoDB. Currently we have 1 shard and 2 workers with unique workerIds running in separate instances. Once in a while, I see that the one shard we have reaches its end (the shard expires in 4 hours) and a new shard gets created (verified using the DynamoDB DescribeStream API), but the KCL doesn't create a new lease for the new shard. As a result the application doesn't process any data. The only way to fix the problem is to bounce the instances; the KCL will then create a new lease and start processing the records.

Before bouncing the instances this time, I ran jstack and here is what I got.

Java KCL version: 1.7.6

https://gist.github.com/htarevern/e7429e4d8e27658bc1404d80d09011a9

@pfifer
Contributor

pfifer commented Oct 30, 2018

@htarevern If you're using the DynamoDB Streams plugin, your issue may be related to awslabs/dynamodb-streams-kinesis-adapter#20

@pfifer
Contributor

pfifer commented Oct 30, 2018

@ManojGH I would make sure that your record processor isn't blocking. The KCL will wait for the record processor to return before retrieving the next batch of records.

  1. I would recommend updating your SDK client to the current version. We are unable to support older versions of the Java SDK, and there will be no further releases in the 1.8 line.
  2. The worker id only needs to be unique within the current active workers. It's fine for a worker to reuse its previous worker id, as the worker will need to reacquire the leases anyway.
  3. There are multiple possibilities for alarming:
    - At the shard level, alarm on missing or high MillisBehindLatest, either from the Kinesis service metrics or the KCL metrics (see the sketch after this list).
    - Have a job that periodically scans the lease table to make sure that checkpoints are advancing. This really only works if you have a consistent put rate.
    - Use sentinel records that trigger your record processor to indicate its status in some way.
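
As a sketch of the first alarming option, here is a CloudWatch alarm on the stream-level iterator-age service metric (the alarm name, stream name, and threshold are placeholders; treating missing data as breaching also catches the case where the metric stops being emitted):

aws cloudwatch put-metric-alarm \
  --alarm-name my-stream-iterator-age \
  --namespace AWS/Kinesis \
  --metric-name GetRecords.IteratorAgeMilliseconds \
  --dimensions Name=StreamName,Value=my-stream \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 900000 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data breaching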

@pfifer
Contributor

pfifer commented Oct 30, 2018

@gonzalodiaz Your exception indicates something is timing out in Spark. If it's the KCL, we would need logs from the KCL. For the Spark adapter, the KCL runs on a single node, dispatching work from that single node.

@dharmeshspatel4u

dharmeshspatel4u commented Feb 9, 2019

@pfifer With client v1.8.1 we're having the issue below.


2019-02-09 21:10:01.875  INFO 26971 --- [      Thread-29] c.a.s.k.clientlibrary.lib.worker.Worker  : Worker shutdown requested.
2019-02-09 21:10:01.876  INFO 26971 --- [      Thread-29] c.a.s.k.leases.impl.LeaseCoordinator     : Worker ip-1234. has successfully stopped lease-tracking threads
2019-02-09 21:10:01.877  INFO 26971 --- [dProcessor-0000] c.c.d.v.s.p.KinesisRecordProcessor       : Checkpointing shard shardId-000000000000
2019-02-09 21:10:01.878  INFO 26971 --- [dProcessor-0000] k.c.l.w.KinesisClientLibLeaseCoordinator : Worker ip-1234. could not update checkpoint for shard shardId-000000000000 because it does not hold the lease
2019-02-09 21:10:01.878  INFO 26971 --- [dProcessor-0000] c.c.d.v.s.p.KinesisRecordProcessor       : Caught shutdown exception, skipping checkpoint.

com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:174) ~[amazon-kinesis-client-1.8.1.jar!/:na]

Any clue if this is my issue? I see that sometimes the checkpoint gets updated, and sometimes it throws the above error and delivers those messages to the consumer again.

Appreciate your quick response.

@cenkbircanoglu

+1

@muscovitebob

muscovitebob commented Oct 10, 2019

I am consuming records from a Kinesis stream via the Spark Kinesis receiver library and encountering the same issue described here. However, in my case, record processing does occur once I send a SIGTERM to the Spark process, but only then. I was able to confirm using jstack that my record processors are also stuck in the waiting state:

   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.handleNoRecords(ProcessTask.java:282)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:166)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@aheuermann

Was there any resolution for this issue? We are running into it on KCL 1.9.3 and amazon-kinesis-client-nodejs 0.8.0.

@DeathsPirate

Adding a workerId to the xxx.properties file fixed this issue for me.

@kimberlyamandalu

Hello, it looks like this issue is still open, so I am assuming there is no fix yet?
I recently started using Kinesis data streams and have encountered the same issues. I get errors like the following:
ResourceNotFoundException: shard_id cannot be found because it does not exist in the account...

I also cannot find the DynamoDB table used as the Kinesis metadata store.
I am running a Glue streaming job that consumes from a Kinesis stream.
