Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JTA Transactions fail with XAER_RMFAIL #292

Closed
flauschie opened this issue Oct 12, 2022 · 22 comments
Closed

JTA Transactions fail with XAER_RMFAIL #292

flauschie opened this issue Oct 12, 2022 · 22 comments
Assignees
Labels
bug Something isn't working

Comments

@flauschie
Copy link

Describe the bug

We have the issue that some of our JTA transactions fail with the Exception below.

The same transactions work nicely with the vanilla MySQL driver (8.0.29+).

This happened with v1.1.0.

Caused by: java.sql.SQLException: XAER_RMFAIL: The command cannot be executed when global transaction is in the  PREPARED state
		at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
		at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
		at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1201)
		at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.queryForTopology(AuroraTopologyService.java:231)
		at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.getTopology(AuroraTopologyService.java:200)
		at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.updateTopologyAndConnectIfNeeded(FailoverConnectionPlugin.java:714)
		at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.execute(FailoverConnectionPlugin.java:247)
		at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.ConnectionPluginManager.execute(ConnectionPluginManager.java:138)
		at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.ConnectionProxy$JdbcInterfaceProxy.invoke(ConnectionProxy.java:320)
		at com.awssql//com.sun.proxy.$Proxy265.getResultSet(Unknown Source)
		at com.awssql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXAConnection.dispatchCommand(MysqlXAConnection.java:325)
		... 45 more

Expected Behavior

JTA transactions work the same as with the latest vanilla MySQL drivers.

Current Behavior

Transaction rollback

Reproduction Steps

This happens in specific scenarios with no trivial way to recreate this.

Possible Solution

I suppose this may be related to the rather old MySQL base version used for the AWS Driver.

Additional Information/Context

No response

The AWS JDBC Driver for MySQL version used

1.1.0

JDK version used

JDK 11

Operating System and version

AL2 (hosted in EKS)

@flauschie flauschie added the bug Something isn't working label Oct 12, 2022
@mariokoehler
Copy link

mariokoehler commented Oct 12, 2022

i can confirm that we see the same exceptions in our project. we've had them for at least a month, but usually less than 5 or so per day. but starting with October 6th (which coincides with increased traffic on our system) we saw a steady incline to up to 100 per day, which starts to be concerning. so far we tried to set the jdbc connection URL parameter pinGlobalTxToPhysicalConnection to true, without success. The exceptions also occur "all over the place" and there seems to be no obvious commonality between them, at least none that we could figure out so far.

I realize this is probably not very helpful information to get to the source of the problem, but nonetheless i wanted to confirm that it's not just a "single customer problem". we're open to providing additional information if needed/helpful. let us know how we can support the effort.

@sergiyvamz
Copy link
Contributor

Hello @flauschie, @mariokoehler

Thank you for raising the issue. We understand this issue is causing inconveniences in your workflows. We'll be investigating this issue and will update you when we have a release. We appreciate your patience.

@flauschie
Copy link
Author

FWIW, our JTA transactions can contain a fairly big payload with megabytes of data and intermediate updates to the same data during the transaction. Simple JTA transactions don't seem to be affected.

Maybe it's related or not, but we also had JTA issues with the RDS proxy (which we had to abandon). If the RDS proxy also uses this driver, it could have the same root cause.

@mariokoehler
Copy link

in our case it's much simpler, the only reason we use the XA version of the datasource at all is because we do some of our "less important" queries on the read-only instance of the Aurora RDS cluster (to take some load off of the main instance). So we have two datasources defined in our persistence.xml and effectively have to EntityManagers, one associated with the main instance and one with the readonly instance. And sometimes we use both of them in the same transaction, thats why we need the XA. But we never really have a "typical" XA-scenario, so for example we don't have write-operations that span more than one EntityManager/Database.

@flauschie
Copy link
Author

FWIW, for complicated reasons, our JTA transactions can span across 4 datasources.

@mariokoehler
Copy link

mariokoehler commented Oct 14, 2022

by further trial-and-error we discovered what seems to be a correlation to the connection pool size. by increasing the pool size we were able to reduce the frequency of the exceptions. we also modified the way the connection pool manages connections, so that it doesn't hold a constant size pool of connections but instead only has a steady size of 10 and increases/decreases this as needed. that was done so that we can better monitor the actual usage (%) of connections via the RDS cloudwatch metrics. Where before the metric always showed a flat line of the max pool size, we can now see the actual usage (given it's above 10 connections). By doing this we noticed that the exceptions seem to coincide with peaks of high connection usage. whenever the pool is close to being exhausted (maybe even exhausted, it's hard to tell because of the metrics resolution) the exceptions keep coming. when the pool is only moderately busy we can have several hours without any exceptions.

@flauschie
Copy link
Author

flauschie commented Oct 16, 2022

We always had different min/max settings for our pools. I don't think that the max pool setting was ever hit during our regression testing.

FWIW, we are using Wildfly 26 (Narayana) as the JTA engine.

@mariokoehler
Copy link

yesterday we switched back to the "vanilla" driver and we're free of exceptions ever since. so it really seems to be related to the driver itself.

@karenc-bq karenc-bq self-assigned this Oct 18, 2022
@karenc-bq
Copy link
Contributor

Thank you all for raising this issue. We are currently working on a fix for this issue and will post again once the fix is merged. Thank you for your patience.

@karenc-bq
Copy link
Contributor

The fix has been merged, can you please test out our 1.1.2-Snapshot build here and let us know if the issue persists.
Thank you.

@github-actions github-actions bot added the Stale label Nov 21, 2022
@hsuamz hsuamz removed the Stale label Dec 13, 2022
@hsuamz
Copy link
Contributor

hsuamz commented Dec 13, 2022

Closing this issue since this has been addressed in #297. Please re-open if this is still a problem.

@hsuamz hsuamz closed this as completed Dec 13, 2022
@flauschie
Copy link
Author

flauschie commented Feb 20, 2023

@hsuamz

The issue has not been solved. After switching from the native MySQL driver to the AWS version (1.1.4) we are still getting the same XAER_RMFAIL exception in Wildfly 26 (Narayana JTA):

ARJUNA016045: attempted rollback of < formatId=131077, gtrid_length=44, bqual_length=36, tx_uid=0:ffffac1084d4:-23b1904f:63f2ccd7:17cad, node_name=86c98bcdd6-kwq2f, branch_uid=0:ffffac1084d4:-23b1904f:63f2ccd7:17cb2, subordinatenodename=null, eis_name=java:jboss/datasources/xxxxxxx > (XAResourceWrapperImpl@858080e[xaResource=org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@50d575c7 pad=false overrideRmValue=false productName=MySQL productVersion=8.0.23 jndiName=java:jboss/datasources/xxxx]) failed with exception code XAException.XAER_RMFAIL: software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXAException: Undetermined error occurred in the underlying Connection - check your data for consistency

Caused by: java.sql.SQLException: XAER_RMFAIL: The command cannot be executed when global transaction is in the  IDLE state
ARJUNA016056: TransactionImple.delistResource - caught exception during delist : XAException.XAER_RMFAIL: software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXAException: XAER_RMFAIL: The command cannot be executed when global transaction is in the  IDLE state
ARJUNA016138: Failed to enlist XA resource XAResourceWrapperImpl@7ec9e8f9[xaResource=org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@512d69a1 pad=false overrideRmValue=false productName=MySQL productVersion=8.0.23 jndiName=java:jboss/datasources/xxxx]: java.lang.IllegalStateException: XAER_RMFAIL: The command cannot be executed when global transaction is in the  PREPARED state

Caused by: java.sql.SQLException: XAER_RMFAIL: The command cannot be executed when global transaction is in the  PREPARED state

Driver setting:

                    <driver name="aws" module="com.mysql">
                        <driver-class>software.aws.rds.jdbc.mysql.Driver</driver-class>
                        <xa-datasource-class>software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXADataSource</xa-datasource-class>
                    </driver>

@karenc-bq karenc-bq reopened this Feb 21, 2023
@karenc-bq
Copy link
Contributor

Hi @flauschie, we are looking into this issue.

To help us with our investigation, are you able to provide us with the following?

  • a full stacktrace of the error
  • driver logs (see this on how to enable driver logs)

Are you seeing the issue with the same frequency as before?

Thank you.

@flauschie
Copy link
Author

Hi @karenc-bq

Here's a broader stacktrace - the lines with an empty {} are omitted lines (our own code):
exception.txt

Sorry for the format, but these exceptions are breaking the logging format of the log service ;-)

As for the driver trace ... this is difficult at the moment. The issue only occurs during integration testing, and I can't run the related test locally. Our QA is currently busy with a release. Maybe in 1-2 weeks.

I can't really remember the error frequency during our initial testing (based on 1.1.0). We don't keep logs that long. Probably similar.

The exception doesn't occur during simple tests - only with some substantial load and possibly competition.

However, the original/Oracle MySQL driver doesn't show these JTA-related errors - and our code can recover from "normal" PersistenceExceptions due to competition.

FWIW, the driver produced 8 errors during a 45min test run with thousands of transactions.

Generally: we have 4 MySQL data sources that could, in theory, take part in the JTA transaction. However, I don't think that this error occurred with more than 2 of them involved.

@karenc-bq
Copy link
Contributor

Hi @flauschie, we are still unable to reproduce the issue, however we recently merged fix #364 that may impact this issue.
Is it possible to run your integration tests with our snapshot build?
Thank you.

@hsuamz
Copy link
Contributor

hsuamz commented Jun 15, 2023

Closing this issue due to inactivity and recent changes that went into the driver that addresses the issue. If the issue is still present, please provide @karenc-bq with the information above. Thank you.

@hsuamz hsuamz closed this as completed Jun 15, 2023
@flauschie
Copy link
Author

flauschie commented Jun 30, 2023

Sorry to say but the issue has still not been fixed, including driver version 1.1.7 on Aurora Serverless v2 MySQL:

ARJUNA016041: prepare on < formatId=131077, gtrid_length=44, bqual_length=36, tx_uid=0:ffffac107cdd:-24ffcc5f:649e3775:1becf, node_name=6d58d8745b-jjrzr, branch_uid=0:ffffac107cdd:-24ffcc5f:649e3775:1bed4, subordinatenodename=null, eis_name=java:jboss/datasources/.... > (XAResourceWrapperImpl@37fce128[xaResource=org.jboss.jca.adapters.jdbc.xa.XAManagedConnection@5c31c310 pad=false overrideRmValue=false productName=MySQL productVersion=8.0.23 jndiName=java:jboss/datasources/...]) failed with exception XAException.XAER_RMFAIL: software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXAException: XAER_RMFAIL: The command cannot be executed when global transaction is in the IDLE state

Once again - this doesn't happen with the vanilla MySQL driver 8.xx

FWIW, I can also see this here in the logs:
WFLYJCA0005: Deploying non-JDBC-compliant driver class software.aws.rds.jdbc.mysql.Driver

This is on Wildfly 26.1.3

XA Driver config:

                <drivers>
                    <driver name="aws" module="com.mysql">
                        <driver-class>software.aws.rds.jdbc.mysql.Driver</driver-class>
                        <xa-datasource-class>software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXADataSource</xa-datasource-class>
                    </driver>
                </drivers>
ARJUNA012075: Action Aborting
ARJUNA012073: BasicAction.End() - prepare phase of action-id 0:ffffac107cdd:-24ffcc5f:649e3775:1becf failed.
.....
at com.mysql//com.sun.proxy.$Proxy260.execute(Unknown Source)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.ConnectionProxy$JdbcInterfaceProxy.invoke(ConnectionProxy.java:345)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.ConnectionPluginManager.execute(ConnectionPluginManager.java:138)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.execute(FailoverConnectionPlugin.java:274)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.updateTopologyIfNeeded(FailoverConnectionPlugin.java:735)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.getTopology(AuroraTopologyService.java:192)

at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.queryForTopology(AuroraTopologyService.java:219)
at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1200)

@flauschie
Copy link
Author

flauschie commented Jun 30, 2023

It's the same with 1.1.8

Caused by: java.sql.SQLException: XAER_RMFAIL: The command cannot be executed when global transaction is in the  IDLE state
		at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
		at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
		at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1200)
		at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.queryForTopology(AuroraTopologyService.java:216)
		at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.AuroraTopologyService.getTopology(AuroraTopologyService.java:189)
		at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.updateTopologyIfNeeded(FailoverConnectionPlugin.java:735)
		at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.failover.FailoverConnectionPlugin.execute(FailoverConnectionPlugin.java:275)
		at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.plugins.ConnectionPluginManager.execute(ConnectionPluginManager.java:138)
		at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.ha.ConnectionProxy$JdbcInterfaceProxy.invoke(ConnectionProxy.java:352)
		at com.mysql//com.sun.proxy.$Proxy233.execute(Unknown Source)
		at com.mysql//software.aws.rds.jdbc.mysql.shading.com.mysql.cj.jdbc.MysqlXAConnection.dispatchCommand(MysqlXAConnection.java:328)

It seems as if this is only happening under high load ( and XA data source).

In most test cases, the transactions work just fine.

@karenc-bq karenc-bq reopened this Jun 30, 2023
@karenc-bq
Copy link
Contributor

Hi @flauschie, thank you for the update. We merged a fix in (#432), could you please try out our snapshot build and let us know if the issue persists? Thank you.

@flauschie
Copy link
Author

Hi @karenc-bq

Yes, it seems that this latest fix (aws-mysql-jdbc-1.1.9-20230710.234101-4.jar) is working - at least during a 1h test campaign. Thx!

BTW, I don't know whether this is related - a while back, we also had XA issues with the RDS Proxy (with no fix from the AWS support at the time) - preventing us from going ahead with it. Is this change also going to be incorporated into the RDS Proxy? The symptoms were somewhat different, though.

@davecramer
Copy link
Contributor

BTW, I don't know whether this is related - a while back, we also had XA issues with the RDS Proxy (with no fix from the AWS support at the time) - preventing us from going ahead with it. Is this change also going to be incorporated into the RDS Proxy? The symptoms were somewhat different, though.

RDS Proxy does not use this driver. I doubt the problem is the same problem, thanks!

@karenc-bq
Copy link
Contributor

Hi @flauschie,

Version 1.1.9 of the AWS JDBC Driver for MySQL has been released with the fix for this issue. We'll be closing this issue now. If you have any questions or concerns please don't hesitate to create a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants