
"Stuck" Kinesis Shards #185

Open
aloisbarreras opened this issue Jun 28, 2017 · 34 comments

@aloisbarreras

aloisbarreras commented Jun 28, 2017

@pfifer I've been using Kinesis as my real-time data streaming pipeline for over a year now, and I am consistently running into an issue where Kinesis shards seem to randomly stop processing data. I am using the Node.js KCL library to consume the stream.

Here are some screenshots to show what I mean.

[Screenshot: DataBytesProcessed per shard]

You can see here that from roughly 21:00 to 6:00, shards 98 and 100 stop emitting data for DataBytesProcessed. I restarted the application at 6:00 and the consumer immediately started processing data again.

Now here is a graph of the IteratorAgeMilliseconds for the same time period.

[Screenshot: IteratorAgeMilliseconds per shard]

The shards are still emitting IteratorAge from 21:00 to 6:00 and show that the IteratorAgeMilliseconds is still at 0, so it seems like the KCL was still running during that time, but it wasn't consuming more data from the stream. You can see that when I restarted the application at 6:00, the KCL realized it was very far behind and the IteratorAge spiked up to 32.0M instantly. I can confirm that the KCL process was still running during that time period and continuously logged out the usual messages:

INFO: Current stream shard assignments: shardId-000000000098
Jun 28, 2017 5:37:36 PM com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker info
INFO: Sleeping ...

There are 2 other shards in the same stream and they continued to process data as normal. I have been working on debugging this on my end for several months now, but have come up empty. I am catching all exceptions thrown in my application, and I have set timeouts on any network requests or async I/O operations, so I don't think the process is crashing or getting hung up on anything.

I also thought that maybe I didn't have enough resources allocated to the application and that's why the process got hung up. So I made 4 instances of my application (1 for each shard), each with 224GB of RAM and 32 CPUs, but I still run into this issue.

I cannot seem to reliably replicate this issue; it seems to just happen randomly across different shards and can range from happening a few times a week to a few times a day. FYI, I have several other Node applications using the KCL that also experience the same issue.

I have seen this issue, where the reporter seems to have the same problem with the Python library and solved it by just restarting the applications every 2 hours...and also this issue, which seems to be the same problem I have as well.

Is this a known bug with the KCL? And if the problem is on my end, do you have any pointers on how to track down this problem?

@pfifer
Contributor

pfifer commented Jun 30, 2017

I don't believe this is a bug in the core Java KCL, but could be a bug in the MultiLang component.

I also thought that maybe I didn't have enough resources allocated to the application and something weird was happening with that and that's why the process got hung up. So, I made 4 instances of my application (1 for each shard) each with 224GB of RAM and 32 cpus, but I still run into this issue.

You shouldn't need to do this, since each record processor is single threaded. If your record processor does its work on the main thread, the additional capacity will simply be wasted.

What you're seeing is a missing response after the records were dispatched to the language component. The Java record processor waits for a response from the language component before returning from the processRecords call. The MultiLang Daemon implements the Java record processor interface and acts as a forwarder to the language-specific record processor. The dispatch behavior of the Java record processor is to make a call to processRecords and pause activity until that call returns. The MultiLang Daemon implements the same behavior, but across process boundaries: acting as a forwarder, it needs to wait for a response from the native-language record processor. If for some reason the response never comes, it will hang indefinitely.
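
As a rough illustration of that forwarding behaviour (the names and protocol details below are hypothetical, not the actual MultiLang Daemon API): the Java side dispatches the records to the child process and then blocks until the child acknowledges them, so a lost acknowledgement stalls the shard.

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ForwardingRecordProcessorSketch {
    // Stand-in for the real stdin/stdout message protocol between the daemon
    // and the native-language record processor (hypothetical interface).
    interface ChildProcessProtocol {
        void writeProcessRecordsMessage(Object records);
        String readStatusMessage();
    }

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    void processRecords(Object records, ChildProcessProtocol child) {
        Future<String> reply = executor.submit(() -> {
            child.writeProcessRecordsMessage(records);  // forward the batch to the child
            return child.readStatusMessage();           // block until the child replies
        });
        try {
            // The daemon waits here with no upper bound; if the reply never arrives,
            // this call, and therefore the shard, hangs indefinitely. The short-term
            // fix discussed below bounds this wait.
            reply.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("Child record processor did not respond", e);
        }
    }
}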

There isn't currently a good way to see what is actually causing the problem; it could be any of the three components involved (the Java KCL worker, the MultiLang Daemon forwarder, or the native-language record processor).

I think a short-term fix of adding a configuration that limits the maximum time the MultiLang Daemon will wait for a response could limit the impact of this sort of issue. This fix would terminate the record processor if a response isn't received within the configured time.

Longer term, we will need to add some additional logging to understand the state of the components on both sides of the process boundary. This would help us detect which component is losing track of the current state.

For everyone else: please comment or add a reaction to help us know how many people are affected by this issue.

@aloisbarreras
Author

@pfifer thanks for looking into this. Let me know if you need anything from me to help you debug this.

That short term fix would be pretty helpful as well, so if you get that implemented, please let me know.

Thanks!

@aloisbarreras
Author

@pfifer any update on this? Even implementing some sort of short term fix would be extremely helpful.

@perryao

perryao commented Jul 19, 2017

+1

@pfifer
Contributor

pfifer commented Jul 19, 2017

We're currently working on the short term fix for this.

@pfifer
Contributor

pfifer commented Jul 25, 2017

To help us troubleshoot the source of this issue, can you enable more detailed logging using these steps:

If you don't mind using Apache Maven, this is a pom.xml that will provide additional logging, which can help diagnose issues.

To create the launcher script:

  1. Create an empty directory
  2. Create a new file in the directory called pom.xml and copy the contents of this pom.xml in to it.
  3. Install Apache Maven if you don't already have it.
  4. In the directory created in 1, run the command mvn package
  5. If everything works, you should have a script at target/appassembler/bin/multilang-daemon, with all the jars in target/appassembler/repo.
  6. Copy the directory target/appassembler to wherever you need it.
  7. To start your application, run the multilang-daemon script (multilang-daemon.bat on Windows) with the location of your properties file.

The pom.xml above uses Logback and SLF4J to provide logging. It will default to rather verbose console logging without any configuration. I’ve uploaded an example Logback configuration file that can be used to control the amount of logging.

To provide the configuration file for logging follow these steps:

  1. In your application directory, create a directory called logback (you can call it something else if you want to).
  2. Download the Logback configuration file, logback.xml, and save it to logback/logback.xml (a minimal sketch of such a file appears after these steps).
  3. Provide a classpath prefix for your application at startup.
    You can use relative paths; the script will correctly set the current working directory.
    CLASSPATH_PREFIX=<path to the logback directory created in step 1> <path to your application directory>/bin/multilang-daemon <your properties file>
    
    Here is an example of the command assuming:
    • You've copied the contents of target/appassembler to your application directory
    • The name of your application's properties file is config.properties
    • You named the logging configuration directory logback
    • You're starting the MultiLang Daemon in your application directory
    CLASSPATH_PREFIX=logback bin/multilang-daemon config.properties
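
For reference, here is a minimal logback.xml sketch along the lines of what the linked configuration provides (the logger names and levels are assumptions; adjust them to the packages you want to see):

<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Turn up logging for the KCL while keeping everything else at INFO -->
  <logger name="com.amazonaws.services.kinesis" level="DEBUG"/>

  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>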
    

@kenXengineering

@pfifer, I've been working with Alois on this issue as well, just getting back with some more information. We now have the debug logs streaming to our ELK stack, so if we see a case where the Kinesis stream seems to be stuck we can hopefully get logs from our process that processes the stream.

We were also investigating DynamoDB, and we can see that when the Kinesis stream is no longer getting processed, our tables' Get Latency starts reporting values (the graph had been empty the whole time before), and the Scan Latency spikes very high. The Write capacity also dips during the same time. These issues go away once we restart the process that is consuming the Kinesis stream.

We have seen this on all the DynamoDB tables that are created by our processors. We have 10 Kinesis streams that have gotten "stuck" in the last day, and on all of their DynamoDB tables we see high latency at the same time. Some of us are wondering if a request to DynamoDB is getting dropped and the caller waits for a response indefinitely, pausing the process.

We've also noticed that these incidents seem to happen together in a one to two hour window. Last night (7/28/2017) at around 1:30 AM five of our Kinesis streams became "stuck" within 5 min of each other. Then two other streams became "stuck" at 4:50 AM and 5:50 AM respectively.

It took a bit of wrangling to get the logs from KCL since we use the amazon-kinesis-client-nodejs library to run our processes, but with a fork of that I was able to get all the jars setup and logs streaming over. Hopefully we get some useful logs soon! Thanks!

@kenXengineering

@pfifer We finally got some logs for ya! You can see them at https://gist.github.com/chosenken/e79f3eda5eefb5bd6a910ae6e5875d4a#file-kcl-log-L699. Specifically on line 699, I see the multi-lang-daemon-0004 report back that it is waiting on a previous PROCESS task, and it will sit there until we restart the process. The file has one minute of logs, with data being processed at the beginning.

Please let me know if you need any more information, thank you!

@pfifer pfifer added this to the Release 1.8.1 milestone Aug 2, 2017
pfifer added a commit to pfifer/amazon-kinesis-client that referenced this issue Aug 2, 2017
Support timeouts for calls to the MultiLang Daemon
This adds support for setting a timeout when dispatching records to
the client record processor. If the record processor doesn't respond
within the timeout the parent Java process will be terminated. This
is a temporary fix to handle cases where the KCL becomes blocked
while waiting for a client record processor.

The timeout for this can be set by adding
  `timeoutInSeconds = <timeout value>`.
The default is no timeout.

Setting this can cause the KCL to exit suddenly; before using it,
ensure that you have an automated restart for your application.

Related awslabs#195
Related awslabs#185
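
For anyone planning to try this once it's released, here is a sketch of what the commit above describes, as it would appear in the .properties file passed to the MultiLang Daemon (the 60-second value is just an example; by default there is no timeout):

# Terminate the worker if the record processor does not respond within 60 seconds.
# Only enable this if your application is restarted automatically when it exits.
timeoutInSeconds = 60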
@pfifer pfifer mentioned this issue Aug 2, 2017
pfifer added a commit that referenced this issue Aug 2, 2017
Support timeouts for calls to the MultiLang Daemon
@oscarjiv91

I'm not sure if this is the same case, but running the KCL sample project with a stream of 4 shards, the first shard shardId-00000000 is initialized, but processRecords is never called for that shard. In DynamoDB, the checkpoint column always stays at LATEST. Re-running the application doesn't seem to work, and I get no errors. Is this issue related to what you are trying to fix? Thank you!

Is there a workaround in the meantime?

@sahilpalvia
Contributor

@oscarjiv91 I don't think your issue is related to this one, but to be sure I would need some additional information. Could you provide a link to the sample that you are using? Another possibility is that no records are being pushed to your stream; in that case the checkpointer will never advance, because there is no record to checkpoint against. You can check that by enabling callProcessRecordsEvenForEmptyRecordList in the properties file, which makes the KCL call the record processor even when there are no records on the shard (see the sketch below). If you need further assistance, I would recommend opening a thread on the AWS Kinesis Forum.
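
As a minimal illustration, the property would look like this in the KCL .properties file (sketch only, alongside your existing settings):

# Call processRecords even when GetRecords returns no records, so the
# record processor can still checkpoint on an idle shard.
callProcessRecordsEvenForEmptyRecordList = true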

@oscarjiv91

This is the link: https://github.com/awslabs/amazon-kinesis-client-nodejs/tree/master/samples/click_stream_sample

I'm pushing records with the same sample project and have set 4 shards. If I use 3 shards, I have the same problem (shardId-00000000 is stuck), but not with 2 shards, which works well.

Thank you. I will open up a thread on the Kinesis Forum.

@juanstiza

We have a KCL app working (not in production yet), and it showed similar behaviour. Two weeks ago it just stopped fetching records; after a restart it would pick back up. I had to set up a cron job to restart the process every day in order to avoid this.

@sahilpalvia
Contributor

@juanstiza Are you using the Java KCL or the Multilang?

@juanstiza

@sahilpalvia We are using Java.

@xinec

xinec commented Dec 5, 2017

+1, Java KCL

@antgus

antgus commented Dec 18, 2017

+1, Java KCL

Update: It turned out the source of the problem was that all workers had the same workerId. By the way, @pfifer, it would probably be useful to add a comment in SampleConsumer where the workerId is set, since it's easy for a newcomer to miss the requirement for unique worker ids. It might even be wise to concatenate the id with a random string so that in SampleConsumer it would be unique by default.
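
A minimal sketch of that suggestion for the Java KCL (the application and stream names below are placeholders; the constructor shown is the KCL 1.x KinesisClientLibConfiguration):

import java.net.InetAddress;
import java.util.UUID;

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;

class UniqueWorkerIdSketch {
    static KinesisClientLibConfiguration buildConfig() throws Exception {
        // Host name plus a random UUID keeps worker ids unique across instances,
        // and across multiple workers on the same instance.
        String workerId = InetAddress.getLocalHost().getCanonicalHostName()
                + ":" + UUID.randomUUID();
        return new KinesisClientLibConfiguration(
                "SampleConsumerApp",   // application name (placeholder)
                "sample-stream",       // stream name (placeholder)
                new DefaultAWSCredentialsProviderChain(),
                workerId);
    }
}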

@pfifer
Contributor

pfifer commented Dec 18, 2017

In Release 1.8.8 we added the ability to log a message when a processing task is blocked for a long time. For those affected by this, could you enable the log message?

Additionally, if you see the log message, could you retrieve a stack trace from the Java application? On Linux and macOS you can do this with jstack, using a command like jstack <pid of java process>. This will output the stack traces for all threads. You can post it here, or send it to me on the AWS Forums; my account is justintaws.

@shaharck

shaharck commented Feb 2, 2018

@pfifer We're seeing similar behavior on every new deployment of our KCL cluster.
Prior to a new deployment, we see that for quite some time the worker for the shard is not receiving any records from Kinesis: https://gist.github.com/shaharck/25e9fefd0d5d6dca4705bd0f10f899dd

When a new KCL instance comes up and takes the lease, it starts processing records from that shard with a large spike in iterator age.
Post-deploy logs: https://gist.github.com/shaharck/871f0dcc8a26a82bd3c812ea8ce07c2b

[Screenshot: iterator age spike after deployment]
It always seems to be one shard (not the same one each time) that is "getting stuck".

It then processes the records and the delay goes down.

FYI we're using Java KCL v1.7.5

@pfifer
Contributor

pfifer commented Feb 5, 2018

@shaharck The high iterator age can occur for a number of reasons.

If it's caused by processing getting stuck, you should see the metrics for the stuck shard stop being produced. If the processing is actually stuck, a stack trace captured via jstack would be really helpful.

If the metrics are continuous, it indicates that processing is occurring. The common cause of iterator spikes in this case is not having a recent checkpoint. This can happen for a number of reasons, but the usual culprits are the checkpoint strategy or a stream receiving no data. When there is no data on the stream, your application is unable to checkpoint at its current position in time, so if the lease is lost, processing for the shard will restart at the last checkpoint, which will be from the last time data was available on the stream.

There is a third possibility: your record processor is getting stuck. The processing of a shard is synchronous. The KCL waits for the record processor to finish processing a batch of records before it retrieves and dispatches the next batch. If your record processor gets stuck for any reason, this can cause processing on the affected shard to stop. To help detect these situations we added a new log message in Release 1.8.8. The message is emitted whenever processing is paused for an extended period of time while waiting for the record processor to return.

@limawebdev1

Also having this issue with the aws-sdk Node package.
PutRecords suddenly stopped, and only after restarting the server that runs the putRecords method did we see any changes (the data point with 10 records is me running that server locally as a test).
[Screenshot: PutRecords record counts]
And there we see a long latency for putRecords.
[Screenshot: PutRecords latency]
Any insight? Thank you!

@pfifer
Contributor

pfifer commented Aug 14, 2018

This issue has only been reported and confirmed with the Node.js multi-lang daemon. It's possible that it may occur with other multi-lang daemon clients, but we haven't been able to confirm that.

We have no indications that this affects the Java KCL. The most common cause of this behavior in the Java KCL is the record processor blocking. The KCL dispatches batches of records serially, and must wait for the call to IRecordProcessor#processRecords() to return before it will move on. If you're using the prefetch feature, the KCL will fetch some data ahead, but queue it until the record processor is ready to process it. If you're not using prefetch, the KCL will not receive data for that shard until the record processor returns.

In release 1.8.8 we added logWarningForTaskAfterMillis, which will log a message if a task hasn't completed after the configured amount of time. This can be used to detect when a task is blocked for an extended period. The task that most people care about is ProcessTask, which is responsible for retrieving records and dispatching them to the record processor.
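
A sketch of enabling that warning with the Java KCL (assuming release 1.8.8 or later; the application name, stream name, and 30-second threshold are placeholders):

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;

class BlockedTaskWarningSketch {
    static KinesisClientLibConfiguration buildConfig(String workerId) {
        return new KinesisClientLibConfiguration(
                "MyApp",       // application name (placeholder)
                "my-stream",   // stream name (placeholder)
                new DefaultAWSCredentialsProviderChain(),
                workerId)
            // Log a warning whenever a task (e.g. ProcessTask) has been running
            // for more than 30 seconds without completing.
            .withLogWarningForTaskAfterMillis(30_000L);
    }
}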

If your application appears to be stuck, one of the first options is to retrieve thread stack traces from the JVM. The simplest way is to use jstack from the JDK; this may require installing the JDK on the affected host. On Linux and macOS there is a second option: use kill -QUIT or kill -3. These signals will cause the JVM to log the stack trace to stdout.

Here is an example of a blocked thread retrieved using jstack (it's using Thread#sleep to simulate a blocking call, so the thread states for your threads may be different):

"RecordProcessor-0001" #19 prio=5 os_prio=31 tid=0x00007fd96d1ef800 nid=0x1007 waiting on condition [0x00007000101a8000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at com.amazon.aws.kinesis.RecordProcessor.processRecords(RecordProcessor.java:19)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.callProcessRecords(ProcessTask.java:221)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:176)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@pfifer
Contributor

pfifer commented Aug 14, 2018

@limawebdev1

PutRecord/PutRecords isn't related to the KCL. You appear to have run into an issue with the AWS Node SDK. I would recommend opening an issue on the aws-sdk-js repository.

@gonzalodiaz

gonzalodiaz commented Aug 21, 2018

@pfifer we are running a production instance of Kinesis Streams + PySpark Streaming.

From time to time, and recently more often (twice a day), the application stops processing records indefinitely. It can go on like this for hours until we restart the receiver. We now have some alarms to alert us about this situation.

We couldn't find anything in the logs, but the way we recover the Spark application is by manually killing the running receiver. Spark spawns a new receiver and the records start coming along just fine.

Our issue seems to be related to this thread. We are still looking for a solution, or at least a workaround to avoid the manual restart. Any help here would be gratefully accepted!
We are running the latest EMR version (emr-5.16.0).

UPDATE 2018-09-15:
This error appears consistently in the logs when the app stops consuming:

18/09/11 18:20:00 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error while storing block into Spark - java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
	at org.apache.spark.streaming.receiver.WriteAheadLogBasedBlockHandler.storeBlock(ReceivedBlockHandler.scala:210)
	at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushAndReportBlock(ReceiverSupervisorImpl.scala:158)
	at org.apache.spark.streaming.receiver.ReceiverSupervisorImpl.pushArrayBuffer(ReceiverSupervisorImpl.scala:129)
	at org.apache.spark.streaming.receiver.Receiver.store(Receiver.scala:133)
	at org.apache.spark.streaming.kinesis.KinesisReceiver.org$apache$spark$streaming$kinesis$KinesisReceiver$$storeBlockWithRanges(KinesisReceiver.scala:306)
	at org.apache.spark.streaming.kinesis.KinesisReceiver$GeneratedBlockHandler.onPushBlock(KinesisReceiver.scala:357)
	at org.apache.spark.streaming.receiver.BlockGenerator.pushBlock(BlockGenerator.scala:297)
	at org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:269)
	at org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:110)

@manoj-maharjan

@pfifer We are using Java KCL version 1.8.7 in a Spring Boot application (with Docker) in our testing and staging environments (going to production soon). We are using a Kinesis stream with 2 shards. We have been noticing the behavior described in this thread: the KCL application stops reading messages from the Kinesis stream even though there are new messages in the stream. We terminated the EC2 instance to spin up a new one, but that didn't help. However, after restarting the Docker container it started processing messages from the Kinesis stream again, which is our temporary workaround for now.
We contacted AWS customer support; we could not find the root cause, but we tried the following things:

  1. Upgraded the KCL version to 1.8.10 (we tried to upgrade to 1.9.1, but it failed because of an incompatible AWS Java SDK version, 1.11.37). We hope KCL 1.8.10 will fix this issue.
  2. Changed the workerId to ensure it is unique:
workerId = InetAddress.getLocalHost().getCanonicalHostName() + ":" + UUID.randomUUID();
  3. We are still testing for this issue. Are there any good AWS metrics to identify this issue and set up alarms?

@htarevern

htarevern commented Oct 26, 2018

@pfifer I'm facing the same issue as other folks here. I'm using the Java implementation of the KCL to read stream records from DynamoDB. Currently we have 1 shard and 2 workers with unique workerIds running in separate instances. Once in a while, I see that the one shard we have reaches its end (the shard expires in 4 hours) and a new shard gets created (verified using the DynamoDB DescribeStream API), but the KCL doesn't create a new lease for the new shard. As a result the application doesn't process any data. The only way to fix the problem is to bounce the instances; the KCL will then create a new lease and start processing the records.

Before bouncing the instances this time, I ran jstack and here is what I got.

Java KCL version: 1.7.6

https://gist.github.com/htarevern/e7429e4d8e27658bc1404d80d09011a9

@pfifer
Contributor

pfifer commented Oct 30, 2018

@htarevern If you're using the DynamoDB Streams plugin, your issue may be related to awslabs/dynamodb-streams-kinesis-adapter#20

@pfifer
Contributor

pfifer commented Oct 30, 2018

@ManojGH I would make sure that your record processor isn't blocking. The KCL will wait for the record processor to return before retrieving the next batch of records.

  1. I would recommend updating your SDK client to the current version. We are unable to support older versions of the Java SDK, and there will be no further releases in the 1.8 line.
  2. The worker id only needs to be unique within the current active workers. It's fine for a worker to reuse its previous worker id, as the worker will need to reacquire the leases anyway.
  3. There are multiple possibilities for alarming:
    - At the shard level, alarm on missing or high MillisBehindLatest, either from the Kinesis service metrics or the KCL metrics (see the sketch after this list).
    - Have a job that periodically scans the lease table to make sure that checkpoints are advancing. This really only works if you have a consistent put rate.
    - Use sentinel records that trigger your record processor to indicate its status in some way.
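
As a sketch of the first alarming option, here is a CloudWatch alarm on the stream-level iterator-age service metric (the alarm name, stream name, and threshold are placeholders; treating missing data as breaching also catches the case where the metric stops being emitted):

aws cloudwatch put-metric-alarm \
  --alarm-name my-stream-iterator-age \
  --namespace AWS/Kinesis \
  --metric-name GetRecords.IteratorAgeMilliseconds \
  --dimensions Name=StreamName,Value=my-stream \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 900000 \
  --comparison-operator GreaterThanThreshold \
  --treat-missing-data breaching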

@pfifer
Contributor

pfifer commented Oct 30, 2018

@gonzalodiaz Your exception indicates something is timing out in Spark. If it's the KCL, we would need logs from the KCL. For the Spark adapter, the KCL runs on a single node, dispatching work from that single node.

@dharmeshspatel4u

dharmeshspatel4u commented Feb 9, 2019

@pfifer With client v1.8.1 we're having the issue below.


2019-02-09 21:10:01.875  INFO 26971 --- [      Thread-29] c.a.s.k.clientlibrary.lib.worker.Worker  : Worker shutdown requested.
2019-02-09 21:10:01.876  INFO 26971 --- [      Thread-29] c.a.s.k.leases.impl.LeaseCoordinator     : Worker ip-1234. has successfully stopped lease-tracking threads
2019-02-09 21:10:01.877  INFO 26971 --- [dProcessor-0000] c.c.d.v.s.p.KinesisRecordProcessor       : Checkpointing shard shardId-000000000000
2019-02-09 21:10:01.878  INFO 26971 --- [dProcessor-0000] k.c.l.w.KinesisClientLibLeaseCoordinator : Worker ip-1234. could not update checkpoint for shard shardId-000000000000 because it does not hold the lease
2019-02-09 21:10:01.878  INFO 26971 --- [dProcessor-0000] c.c.d.v.s.p.KinesisRecordProcessor       : Caught shutdown exception, skipping checkpoint.

com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:174) ~[amazon-kinesis-client-1.8.1.jar!/:na]

Any clue if this is my issue? I see that sometimes the checkpoint gets updated, and sometimes it throws the above error and delivers those messages to the consumer again.

Appreciate your quick response.

@cenkbircanoglu

+1

@muscovitebob

muscovitebob commented Oct 10, 2019

I am consuming records from a Kinesis stream via the Spark Kinesis receiver library and encountering the same issue described here. However, in my case, record processing does occur once I send a SIGTERM to the Spark process, but only then. I was able to confirm using jstack that my record processors are also stuck in the waiting state:

   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.handleNoRecords(ProcessTask.java:282)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:166)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
	at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

@aheuermann

Was there any resolution for this issue? We are running into it on KCL 1.9.3 and amazon-kinesis-client-nodejs 0.8.0.

@DeathsPirate

Adding a workerId to the xxx.properties file fixed this issue for me.

@kimberlyamandalu

Hello, it looks like this issue is still open, so I am assuming there is no fix yet?
I recently started using Kinesis data streams and have encountered the same issues. I get errors like the following:
ResourceNotFoundException: shard_id cannot be found because it does not exist in the account...

I also cannot find the DynamoDB table used as the Kinesis metadata store.
I am running a Glue streaming job that consumes from a Kinesis stream.
