Outgoing federation suddenly broken #8747
Comments
That sounds bad. Do you have some room IDs that you've sent messages in that aren't federating? Preferably ones with matrix.org in them, so that we can have a look at its logs.
I just sent a test message. The message as displayed by Element, if that helps:
{
"type": "m.room.message",
"sender": "@levans:safaradeg.net",
"content": {
"msgtype": "m.text",
"body": "Test message, can you receive me ?"
},
"origin_server_ts": 1605175426928,
"unsigned": {
"age": 232,
"transaction_id": "m1605175426260.4"
},
"event_id": "$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo",
"room_id": "!ehXvUhWNASUkSLvAGP:matrix.org"
}
Thanks! That event doesn't seem to have made it to either matrix.org or jki.re 😕

I've looked at both the matrix.org and jki.re logs, and it seems that your server is repeatedly sending the same set of events to matrix.org and jki.re (though a different set to each), which is very weird. The resent events seem to be from 2020-11-09 ~08:00, though both servers do seem to have received a few events since then, even this morning.

This sounds like it could be related to the recent work to retry sending events to other servers after outages. I'd be interested in seeing the contents of some of the federation DB tables if possible; I think the following would be very useful to know:
SELECT * FROM destinations WHERE destination IN ('matrix.org', 'jki.re');
SELECT MAX(stream_ordering) FROM events;
SELECT stream_ordering FROM events WHERE event_id = '$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo';
SELECT * FROM federation_stream_position;
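As a rough illustration of what these values reveal, here is a minimal Python sketch (assuming psycopg2 and placeholder connection parameters, not part of Synapse) that compares each destination's last_successful_stream_ordering with the current maximum stream ordering; a value that lags far behind suggests the destination is stuck in catch-up:

# Minimal diagnostic sketch: a destination whose last_successful_stream_ordering
# lags far behind MAX(stream_ordering) is likely stuck in catch-up mode.
# The connection parameters below are placeholders for the actual Synapse database.
import psycopg2

conn = psycopg2.connect(dbname="synapse", user="synapse_user", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute("SELECT MAX(stream_ordering) FROM events;")
    (current_max,) = cur.fetchone()

    cur.execute(
        "SELECT destination, last_successful_stream_ordering "
        "FROM destinations WHERE destination IN ('matrix.org', 'jki.re');"
    )
    for destination, last_ok in cur.fetchall():
        lag = None if last_ok is None else current_max - last_ok
        print(f"{destination}: last_successful={last_ok}, current_max={current_max}, lag={lag}")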
Thanks! This does look like both matrix.org and jki.re are stuck in "catchup" mode (the stream orderings are significantly earlier than the current stream orderings). @vberger, do you see log lines which contain

Also, if it's easy for you to do, it would be very useful if you could apply the following patch and share the logs after letting it run for a bit; if not, don't worry about it.

diff --git a/synapse/federation/sender/per_destination_queue.py b/synapse/federation/sender/per_destination_queue.py
index db8e456fe..285f767b9 100644
--- a/synapse/federation/sender/per_destination_queue.py
+++ b/synapse/federation/sender/per_destination_queue.py
@@ -240,11 +240,13 @@ class PerDestinationQueue:
# hence why we throw the result away.
await get_retry_limiter(self._destination, self._clock, self._store)
+ logger.info("Catching up %s?", self._destination)
if self._catching_up:
# we potentially need to catch-up first
await self._catch_up_transmission_loop()
if self._catching_up:
# not caught up yet
+ logger.info("Still catching up %s!", self._destination)
return
pending_pdus = []
@@ -452,6 +454,10 @@ class PerDestinationQueue:
# Sadly, this means we can't do anything here as we don't know what
# needs catching up — so catching up is futile; let's stop.
self._catching_up = False
+ logger.info(
+ "Caught up %s, due to null last successful stream ordering!",
+ self._destination,
+ )
return
# get at most 50 catchup room/PDUs
@@ -465,10 +471,15 @@ class PerDestinationQueue:
# of a race condition, so we check that no new events have been
# skipped due to us being in catch-up mode
+ logger.info("No more events for %s...", self._destination)
+
if self._catchup_last_skipped > self._last_successful_stream_ordering:
# another event has been skipped because we were in catch-up mode
+ logger.info("... but got a poke so looping %s!", self._destination)
continue
+ logger.info("Caught up as not poked %s!", self._destination)
+
# we are done catching up!
self._catching_up = False
break
@@ -500,6 +511,9 @@ class PerDestinationQueue:
)
if not success:
+ logger.info(
+ "Catch up failed as failed to poke remote %s!", self._destination
+ )
return
sent_transactions_counter.inc()
Yes, I have quite a lot of lines like
relating to a lot of different servers in my logs of the last few days. I'll try the patch.
So I applied the patch and added
Thanks! Hopefully that is enough to investigate more deeply.
After yet another reboot of my HS, it looks like things suddenly fixed themselves, in all rooms at once.

Before rebooting I was actually adding some more debug logging, and noticed something that surprised me (before I realised the federation was working again): it appeared that this piece of code, which updates the

synapse/federation/sender/per_destination_queue.py, lines 358 to 368 at f31f8e6

was invoked several times in a row with the same value for
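For what it's worth, here is a tiny self-contained sketch (hypothetical, not Synapse's actual code) of the behaviour described above: if the stored last-successful position is never advanced after a send, each pass of the catch-up loop selects and resends exactly the same batch of events.

# Hypothetical simulation of the resend behaviour; not Synapse's actual code.
pending = list(range(100, 160))          # pretend stream orderings of queued PDUs
last_successful_stream_ordering = 110    # value as read back from the database

def next_catchup_batch(last_ok, limit=50):
    # mirrors "get at most 50 catchup room/PDUs" from the patch above
    return [so for so in pending if so > last_ok][:limit]

for attempt in range(3):
    batch = next_catchup_batch(last_successful_stream_ordering)
    print(f"attempt {attempt}: sending {len(batch)} events starting at {batch[0]}")
    # If the stored position is never advanced here (e.g. the update is lost or
    # raced), every iteration re-selects and resends the same events.
    # last_successful_stream_ordering = batch[-1]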
I actually have a similar problem to the OP: today, no messages from my instance were federated any longer, and I don't know what caused it. Right now I cannot chat with any other servers that are not my own; read receipts and typing notifications, as well as joining channels, all still work, it's just my outgoing messages that are not delivered. In my journal I get the following all the time, roughly every 5-10 seconds
It definitely does get some 200s there, which is really weird. I am using https://archlinux.org/packages/community/any/matrix-synapse/
We're doing a bunch of work in this area atm; in particular I'd be intrigued to know whether #9639 helps, which should stop the "could not serialize access due to concurrent update" errors.
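For context, "could not serialize access due to concurrent update" is Postgres aborting a transaction that conflicted with a concurrent one; the application is expected to retry the whole transaction. A generic sketch of that pattern (illustrative only, not Synapse's actual implementation; the helper name is made up):

# Generic retry-on-serialization-failure pattern (illustrative only).
import psycopg2
from psycopg2 import errors

def run_in_transaction_with_retry(conn, fn, attempts=3):
    for attempt in range(attempts):
        try:
            with conn:                       # commit on success, rollback on error
                with conn.cursor() as cur:
                    return fn(cur)
        except errors.SerializationFailure:
            if attempt == attempts - 1:
                raise                        # give up after the final attempt
            # otherwise loop and retry the whole transaction from scratch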
If it happens again I will surely report on this matter, but after I could not fix it myself and nobody knew how to fix it, I decided to create a new database, which is working quite well for now. If it happens again I will of course try this, thanks!
Yeah, I have the feeling this is the exact same error as #9635, looking at logs like this.
Description
For a few days now (I'm not exactly sure since when), my HS seems incapable of sending outgoing federation messages. Incoming messages do arrive correctly.
The weirder part is that my HS apparently manages to send read receipts to other servers (people on Matrix have told me they see my read icon advance in Element). Incoming read receipts arrive as well.
I don't see any error in my synapse log, but whenever I try to send a message in a federated room, the logs are flooded for a few seconds with a lot of lines like
relating to a lot of other servers (way more than the number of servers participating in the room I'm trying to send a message into).
I don't really see what I can do to figure out what is going on, but I remain available to check anything you think would be useful to debug this. I am available for more direct discussion as Levans on Freenode IRC (I'm in #matrix-synapse / #synapse:matrix.org, so you can ping me there for example).

Steps to reproduce
Sadly I don't know what triggered it; it seemingly started by itself at some point.
Restarting synapse or rebooting the server does not help.
Version information
https://packages.matrix.org/debian buster InRelease
on a debian 10.6 serversafaradeg.net
, and the server is delegated to the subdomainmatrix.safaradeg.net