
Outgoing federation suddenly broken #8747

Closed
elinorbgr opened this issue Nov 12, 2020 · 14 comments
Labels
z-bug (Deprecated Label) z-p2 (Deprecated Label)

Comments

@elinorbgr

Description

For a few days now (I'm not exactly sure since when), my HS seems incapable of sending outgoing federation messages. Incoming messages arrive correctly.

The weirder part is that my HS apparently still manages to send read receipts to other servers (people on Matrix have told me they see my read icon advance in Element). Incoming read receipts arrive as well.

I don't see any error in my synapse log, but whenever I try to send a message in a federated room, the logs are flooded for a few seconds with a lot of lines like

2020-11-12 08:15:29,153 - synapse.federation.sender.transaction_manager - 127 - INFO - federation_transaction_transmission_loop-1480260 - TX [matrix.org] {1605094380906} Sending transaction [1605094380906], (PDUs: 3, EDUs: 0)
2020-11-12 08:15:29,703 - synapse.http.matrixfederationclient - 204 - INFO - federation_transaction_transmission_loop-1480260 - {PUT-O-1142924} [matrix.org] Completed request: 200 OK in 0.55 secs - PUT matrix://matrix.org/_matrix/federation/v1/send/1605094380906

relating to a lot of other servers (way more than the number of servers participating in the room I'm trying to send a message into).

I don't really see what I can do to figure out what is going on, but I'm happy to check anything you think would be useful to debug this. I'm also available for more direct discussion as Levans on Freenode IRC (I'm in #matrix-synapse / #synapse:matrix.org, so you can ping me there for example).

Steps to reproduce

Sadly I don't know what triggered it; it seemingly started by itself at some point.

Restarting synapse or rebooting the server does not help.

Version information

  • Synapse 1.22.1
  • installed from https://packages.matrix.org/debian (buster) on a Debian 10.6 server
  • Instance with only a few users (only one of whom, me, participates in federated rooms), running without workers enabled
  • The instance name is safaradeg.net, and the server is delegated to the subdomain matrix.safaradeg.net
@erikjohnston
Member

That sounds bad. Do you have some room IDs that you've sent messages in that aren't federating? Preferably ones with matrix.org in them, so that we can have a look at its logs.

@elinorbgr
Author

elinorbgr commented Nov 12, 2020

I just sent a test message to #synapse:matrix.org (!ehXvUhWNASUkSLvAGP:matrix.org) from my account @levans:safaradeg.net.

The message as displayed by Element, if that helps:

{
  "type": "m.room.message",
  "sender": "@levans:safaradeg.net",
  "content": {
    "msgtype": "m.text",
    "body": "Test message, can you receive me ?"
  },
  "origin_server_ts": 1605175426928,
  "unsigned": {
    "age": 232,
    "transaction_id": "m1605175426260.4"
  },
  "event_id": "$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo",
  "room_id": "!ehXvUhWNASUkSLvAGP:matrix.org"
}

@erikjohnston
Member

Thanks! That event doesn't seem to have made it to either matrix.org or jki.re 😕

I've looked at both matrix.org and jki.re logs, and it seems that your server is repeatedly sending the same set of events to matrix.org and jki.re (though a different set to each), which is very weird. The resent events seem to be from around 2020-11-09 ~08:00, though both servers do seem to have received a few events since then, even this morning at 2020-11-12 08:18:19+00.

This sounds like it could be related to the recent work to retry sending events to other servers after outages.

I'd be interested in seeing the contents of some of the federation DB tables if possible; I think the following would be very useful to know:

SELECT * FROM destinations WHERE destination IN ('matrix.org', 'jki.re');
SELECT MAX(stream_ordering) FROM events;
SELECT stream_ordering FROM events WHERE event_id = '$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo';
SELECT * FROM federation_stream_position;

@elinorbgr
Author

synapse_matrix=# SELECT * FROM destinations WHERE destination IN ('matrix.org', 'jki.re');
 destination | retry_last_ts | retry_interval | failure_ts | last_successful_stream_ordering
-------------+---------------+----------------+------------+---------------------------------
 matrix.org  |             0 |              0 |            |                         7972966
 jki.re      |             0 |              0 |            |                         7972924
(2 rows)

synapse_matrix=# SELECT MAX(stream_ordering) FROM events;
   max
---------
 7982676
(1 row)

synapse_matrix=# SELECT stream_ordering FROM events WHERE event_id = '$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo';
 stream_ordering
-----------------
         7982585
(1 row)

synapse_matrix=# SELECT * FROM federation_stream_position;
    type    | stream_id | instance_name
------------+-----------+---------------
 federation |        -1 | master
 events     |   7972922 | master
(2 rows)

@erikjohnston
Member

Thanks! This does look like both matrix.org and jki.re are stuck in "catchup" mode (their last_successful_stream_ordering values are significantly earlier than the current maximum stream ordering).
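
As a rough way to see how widespread this is, a query along these lines (just a sketch, using the same destinations and events tables as the queries above) should show which destinations are lagging furthest behind the newest local event:

-- Sketch: compare each destination's catch-up position against the newest
-- local event; a large gap suggests the destination is stuck catching up.
SELECT d.destination,
       d.last_successful_stream_ordering,
       (SELECT MAX(stream_ordering) FROM events) - d.last_successful_stream_ordering AS lag
FROM destinations d
WHERE d.last_successful_stream_ordering IS NOT NULL
ORDER BY lag DESC
LIMIT 20;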

@vberger do you see log lines which contain Catching up rooms to?

Also, if it's easy for you to do, it would be very useful if you could apply the following patch and share the logs after letting it run for a bit; if not, don't worry about it.

diff --git a/synapse/federation/sender/per_destination_queue.py b/synapse/federation/sender/per_destination_queue.py
index db8e456fe..285f767b9 100644
--- a/synapse/federation/sender/per_destination_queue.py
+++ b/synapse/federation/sender/per_destination_queue.py
@@ -240,11 +240,13 @@ class PerDestinationQueue:
             # hence why we throw the result away.
             await get_retry_limiter(self._destination, self._clock, self._store)
 
+            logger.info("Catching up %s?", self._destination)
             if self._catching_up:
                 # we potentially need to catch-up first
                 await self._catch_up_transmission_loop()
                 if self._catching_up:
                     # not caught up yet
+                    logger.info("Still catching up %s!", self._destination)
                     return
 
             pending_pdus = []
@@ -452,6 +454,10 @@ class PerDestinationQueue:
             # Sadly, this means we can't do anything here as we don't know what
             # needs catching up — so catching up is futile; let's stop.
             self._catching_up = False
+            logger.info(
+                "Caught up %s, due to null last successful stream ordering!",
+                self._destination,
+            )
             return
 
         # get at most 50 catchup room/PDUs
@@ -465,10 +471,15 @@ class PerDestinationQueue:
                 # of a race condition, so we check that no new events have been
                 # skipped due to us being in catch-up mode
 
+                logger.info("No more events for %s...", self._destination)
+
                 if self._catchup_last_skipped > self._last_successful_stream_ordering:
                     # another event has been skipped because we were in catch-up mode
+                    logger.info("... but got a poke so looping %s!", self._destination)
                     continue
 
+                logger.info("Caught up as not poked %s!", self._destination)
+
                 # we are done catching up!
                 self._catching_up = False
                 break
@@ -500,6 +511,9 @@ class PerDestinationQueue:
             )
 
             if not success:
+                logger.info(
+                    "Catch up failed as failed to poke remote %s!", self._destination
+                )
                 return
 
             sent_transactions_counter.inc()

@elinorbgr
Author

Yes, I have quite a lot of lines like

2020-11-11 11:16:08,632 - synapse.federation.sender.per_destination_queue - 496 - INFO - federation_transaction_transmission_loop-242 - Catching up rooms to t2bot.io: ['!FKxhjkICpjBVzVeCdF:safaradeg.net']

relating to a lot of different servers in my logs of the last few days.

I'll try the patch.

@elinorbgr
Author

So I applied the patch and added a [DEBG] tag in the messages for easier grepping. Grepping for it in the log files just after restarting my HS gives this: https://paste.rs/Ac8

@erikjohnston erikjohnston added z-bug (Deprecated Label) z-p2 (Deprecated Label) and removed info-needed labels Nov 12, 2020
@erikjohnston
Member

Thanks! Hopefully that is enough to investigate more deeply.

@elinorbgr
Author

elinorbgr commented Nov 13, 2020

After yet another reboot of my HS it looks like things suddenly fixed themselves, on all rooms at once.

Before rebooting I was actually adding some more debug logging, and noticed something that surprised me (before I realized federation was working again):

It appeared that this piece of code, which updates the last_successful_stream_ordering for a given destination

if pending_pdus:
    # we sent some PDUs and it was successful, so update our
    # last_successful_stream_ordering in the destinations table.
    final_pdu = pending_pdus[-1]
    last_successful_stream_ordering = (
        final_pdu.internal_metadata.stream_ordering
    )
    assert last_successful_stream_ordering
    await self._store.set_destination_last_successful_stream_ordering(
        self._destination, last_successful_stream_ordering
    )

was invoked several times in a row with the same value for last_successful_stream_ordering. I don't know if it is relevant to the current bug, but I found it odd, so I figure it's worth noting here.
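
If it helps to cross-check, repeatedly running something like this (a sketch against the same destinations table as above) should show whether the stored value actually advances between transactions:

-- Sketch: watch whether the recorded catch-up position moves over time.
SELECT destination, last_successful_stream_ordering
FROM destinations
WHERE destination IN ('matrix.org', 'jki.re');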

@licentiapoetica

I actually have a similar problem to the OP: as of today, no messages from my instance are federated any longer, and I don't know what caused it. Right now I cannot chat with any servers other than my own. Read receipts and typing notifications work, as does joining channels; it's just my outgoing messages that are not delivered.

In my journal I get the following all the time, like every 5-10 seconds

Mar 21 12:46:36 synapse postgres[254]: 2021-03-21 12:46:36.404 UTC [254] ERROR:  could not serialize access due to concurrent update
Mar 21 12:46:36 synapse postgres[254]: 2021-03-21 12:46:36.404 UTC [254] STATEMENT:  INSERT INTO destinations (destination) VALUES ('matrix.kek.fyi') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('isitviral.ca') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('undernet.uy') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.org') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('kawaiilo.li') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.ordoevangelistarum.com') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('nerdsin.space') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('zereg.firewall-gateway.de') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.netcavy.net') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('asra.gr') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('hackerspace.pl') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('theboys.gg') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('200acres.org') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('privacytools.io') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('wub.root.sx') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('nemuneko.me') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('jupiterbroadcasting.com') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('wooot.de') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('badopinion.club') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('n1tsu.freeboxos.fr') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('fard.topplekek.xyz') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('gravitation.me') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('mozilla.org') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.ipv8.pl') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('tchncs.de') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.asger.xyz') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.eientei.org') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('sudo.ems.host') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('midov.pl') ON CONFLICT DO NOTHING;
2021-03-21 00:01:37,902 - synapse.http.matrixfederationclient - 209 - INFO - PUT-23300 - {GET-O-102338} [matrix.org] Completed request: 200 OK in 0.36 secs - GET matrix://matrix.org/_matrix/federation/v1/user/devices/%40nitzu%3Amatrix.org
2021-03-21 00:01:38,099 - synapse.access.http.8008 - 295 - INFO - PUT-23300 - 2a00:1098:84:1c8::157 - 8008 - {matrix.org} Processed request: 0.566sec/-0.000sec (0.003sec, 0.001sec) (0.002sec/0.195sec/8) 11B 200 "PUT /_matrix/federation/v1/send/1615911485007 HTTP/1.0" "Synapse/1.29.0 (b=matrix-org-hotfixes,d8953b34f)" [0 dbevts]
2021-03-21 00:01:38,768 - synapse.federation.transport.server - 406 - INFO - PUT-23301 - Received txn 1615908879035 from linuxbox.ninja.

It definitely does get some 200s there, which is really weird. I am using https://archlinux.org/packages/community/any/matrix-synapse/
and psql (PostgreSQL) 13.2.

@erikjohnston
Member

erikjohnston commented Mar 23, 2021

We're doing a bunch of work in this area at the moment; in particular, I'd be intrigued to know if #9639 helps, which should stop the "could not serialize access due to concurrent update" errors.
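
For context, that PostgreSQL error is a generic serialization failure: under REPEATABLE READ or stricter isolation, two transactions that concurrently update the same row cannot both succeed, and the losing one has to be rolled back and retried by the application. A minimal illustration (not Synapse-specific; it just reuses the destinations table from the statement above):

-- session A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE destinations SET retry_interval = 0 WHERE destination = 'matrix.org';
-- (A does not commit yet)

-- session B, concurrently:
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE destinations SET retry_interval = 0 WHERE destination = 'matrix.org';
-- B blocks on the row lock held by A.

-- session A:
COMMIT;
-- B now fails with:
--   ERROR:  could not serialize access due to concurrent update
-- and has to be rolled back and retried.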

@licentiapoetica

We're doing a bunch of work in this area at the moment; in particular, I'd be intrigued to know if #9639 helps, which should stop the "could not serialize access due to concurrent update" errors.

If it happens again I will surely report back on this, but since I could not fix it myself and nobody knew how to fix it, I decided to create a new database, which is working quite well for now. If it does happen again I will of course try this, thanks!

@ShadowJonathan
Contributor

Yeah, I have the feeling this is the exact same error as #9635, looking at these logs.

@richvdh
Member

richvdh commented Aug 13, 2021

It sounds like we're hoping the original cause of this is fixed by #9639. Since @vberger's problem fixed itself, I don't think there's anything else we can do here.

@richvdh richvdh closed this as completed Aug 13, 2021