
Outgoing federation suddenly broken #8747

Closed
elinorbgr opened this issue Nov 12, 2020 · 14 comments
Labels
z-bug (Deprecated Label) z-p2 (Deprecated Label)

Comments

@elinorbgr

Description

For a few days now (I'm not exactly sure since when), my HS seems incapable of sending outgoing federation messages. Incoming messages arrive correctly.

The weirder part is that my HS apparently still manages to send read receipts to other servers (people on Matrix have told me they see my read icon advance in Element). Incoming read receipts arrive as well.

I don't see any error in my synapse log, but whenever I try to send a message in a federated room, the logs are flooded for a few seconds with a lot of lines like

2020-11-12 08:15:29,153 - synapse.federation.sender.transaction_manager - 127 - INFO - federation_transaction_transmission_loop-1480260 - TX [matrix.org] {1605094380906} Sending transaction [1605094380906], (PDUs: 3, EDUs: 0)
2020-11-12 08:15:29,703 - synapse.http.matrixfederationclient - 204 - INFO - federation_transaction_transmission_loop-1480260 - {PUT-O-1142924} [matrix.org] Completed request: 200 OK in 0.55 secs - PUT matrix://matrix.org/_matrix/federation/v1/send/1605094380906

relating to a lot of other servers (way more than the number of servers participating in the room I'm trying to send a message into).

I don't really see what I can do to figure out what is going on, but I'm happy to check anything you think would be useful to debug this. I'm also available for more direct discussion as Levans on Freenode IRC (I'm in #matrix-synapse / #synapse:matrix.org, so you can ping me there for example).

Steps to reproduce

Sadly I don't know what triggered it; it seemingly started by itself at some point.

Restarting synapse or rebooting the server does not help.

Version information

  • Synapse 1.22.1
  • installed from https://packages.matrix.org/debian (buster) on a Debian 10.6 server
  • Instance with only a few users (only one of whom, me, participates in federated rooms), running without workers enabled
  • The instance name is safaradeg.net, and the server is delegated to the subdomain matrix.safaradeg.net
@erikjohnston
Member

That sounds bad. Do you have some room IDs that you've sent messages in that aren't federating? Preferably ones with matrix.org in them, so that we can have a look at its logs.

@elinorbgr
Author

elinorbgr commented Nov 12, 2020

I just sent a test message to #synapse:matrix.org (!ehXvUhWNASUkSLvAGP:matrix.org) from my account @levans:safaradeg.net.

The message as displayed by Element, if that helps:

{
  "type": "m.room.message",
  "sender": "@levans:safaradeg.net",
  "content": {
    "msgtype": "m.text",
    "body": "Test message, can you receive me ?"
  },
  "origin_server_ts": 1605175426928,
  "unsigned": {
    "age": 232,
    "transaction_id": "m1605175426260.4"
  },
  "event_id": "$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo",
  "room_id": "!ehXvUhWNASUkSLvAGP:matrix.org"
}

@erikjohnston
Member

Thanks! That event doesn't seem to have made it to either matrix.org or jki.re 😕

I've looked at both matrix.org and jki.re logs, and it seems that your server is repeatedly sending the same set of events to matrix.org and jki.re (though a different set to each), which is very weird. The resent events seem to be from around 2020-11-09 ~08:00, though both servers do seem to have received a few events since then, even this morning at 2020-11-12 08:18:19+00.

This sounds like it could be related to the recent work to retry sending events to other servers after outages.

I'd be interested in seeing the contents of some of the federation DB tables if possible; I think the following would be very useful to know:

SELECT * FROM destinations WHERE destination IN ('matrix.org', 'jki.re');
SELECT MAX(stream_ordering) FROM events;
SELECT stream_ordering FROM events WHERE event_id = '$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo';
SELECT * FROM federation_stream_position;

@elinorbgr
Author

synapse_matrix=# SELECT * FROM destinations WHERE destination IN ('matrix.org', 'jki.re');
 destination | retry_last_ts | retry_interval | failure_ts | last_successful_stream_ordering
-------------+---------------+----------------+------------+---------------------------------
 matrix.org  |             0 |              0 |            |                         7972966
 jki.re      |             0 |              0 |            |                         7972924
(2 rows)

synapse_matrix=# SELECT MAX(stream_ordering) FROM events;
   max
---------
 7982676
(1 row)

synapse_matrix=# SELECT stream_ordering FROM events WHERE event_id = '$ycmMNoIU5HeihPB2IJ-Txd_kG9kwh2_tjD0zzzypRIo';
 stream_ordering
-----------------
         7982585
(1 row)

synapse_matrix=# SELECT * FROM federation_stream_position;
    type    | stream_id | instance_name
------------+-----------+---------------
 federation |        -1 | master
 events     |   7972922 | master
(2 rows)

@erikjohnston
Member

Thanks! This does look like both matrix.org and jki.re are stuck in "catchup" mode (their last_successful_stream_ordering values are significantly earlier than the current maximum stream ordering).
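
As a rough way to see how widespread this is, a query along these lines (just a sketch, using the same destinations and events tables as the queries above) should show which destinations are lagging furthest behind the newest local event:

-- Sketch: compare each destination's catch-up position against the newest
-- local event; a large gap suggests the destination is stuck catching up.
SELECT d.destination,
       d.last_successful_stream_ordering,
       (SELECT MAX(stream_ordering) FROM events) - d.last_successful_stream_ordering AS lag
FROM destinations d
WHERE d.last_successful_stream_ordering IS NOT NULL
ORDER BY lag DESC
LIMIT 20;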

@vberger do you see log lines which contain Catching up rooms to?

Also, if it's easy for you to do, it would be very useful if you could apply the following patch and share the logs after letting it run for a bit; if not, don't worry about it.

diff --git a/synapse/federation/sender/per_destination_queue.py b/synapse/federation/sender/per_destination_queue.py
index db8e456fe..285f767b9 100644
--- a/synapse/federation/sender/per_destination_queue.py
+++ b/synapse/federation/sender/per_destination_queue.py
@@ -240,11 +240,13 @@ class PerDestinationQueue:
             # hence why we throw the result away.
             await get_retry_limiter(self._destination, self._clock, self._store)
 
+            logger.info("Catching up %s?", self._destination)
             if self._catching_up:
                 # we potentially need to catch-up first
                 await self._catch_up_transmission_loop()
                 if self._catching_up:
                     # not caught up yet
+                    logger.info("Still catching up %s!", self._destination)
                     return
 
             pending_pdus = []
@@ -452,6 +454,10 @@ class PerDestinationQueue:
             # Sadly, this means we can't do anything here as we don't know what
             # needs catching up — so catching up is futile; let's stop.
             self._catching_up = False
+            logger.info(
+                "Caught up %s, due to null last successful stream ordering!",
+                self._destination,
+            )
             return
 
         # get at most 50 catchup room/PDUs
@@ -465,10 +471,15 @@ class PerDestinationQueue:
                 # of a race condition, so we check that no new events have been
                 # skipped due to us being in catch-up mode
 
+                logger.info("No more events for %s...", self._destination)
+
                 if self._catchup_last_skipped > self._last_successful_stream_ordering:
                     # another event has been skipped because we were in catch-up mode
+                    logger.info("... but got a poke so looping %s!", self._destination)
                     continue
 
+                logger.info("Caught up as not poked %s!", self._destination)
+
                 # we are done catching up!
                 self._catching_up = False
                 break
@@ -500,6 +511,9 @@ class PerDestinationQueue:
             )
 
             if not success:
+                logger.info(
+                    "Catch up failed as failed to poke remote %s!", self._destination
+                )
                 return
 
             sent_transactions_counter.inc()

@elinorbgr
Author

Yes, I have quite a lot of lines like

2020-11-11 11:16:08,632 - synapse.federation.sender.per_destination_queue - 496 - INFO - federation_transaction_transmission_loop-242 - Catching up rooms to t2bot.io: ['!FKxhjkICpjBVzVeCdF:safaradeg.net']

relating to a lot of different servers in my logs of the last few days.

I'll try the patch.

@elinorbgr
Author

So I applied the patch and added a [DEBG] tag in the messages for easier grepping. Grepping for it in the log files just after restarting my HS gives this: https://paste.rs/Ac8

@erikjohnston erikjohnston added z-bug (Deprecated Label) z-p2 (Deprecated Label) and removed info-needed labels Nov 12, 2020
@erikjohnston
Member

Thanks! Hopefully that is enough to investigate more deeply.

@elinorbgr
Author

elinorbgr commented Nov 13, 2020

After yet another reboot of my HS it looks like things suddenly fixed themselves, on all rooms at once.

Before rebooting I was actually adding some more debug logging, and noticed something that surprised me (before I realized federation was working again):

It appeared that this piece of code, which updates the last_successful_stream_ordering for a given destination

if pending_pdus:
    # we sent some PDUs and it was successful, so update our
    # last_successful_stream_ordering in the destinations table.
    final_pdu = pending_pdus[-1]
    last_successful_stream_ordering = (
        final_pdu.internal_metadata.stream_ordering
    )
    assert last_successful_stream_ordering
    await self._store.set_destination_last_successful_stream_ordering(
        self._destination, last_successful_stream_ordering
    )

was invoked several times in a row with the same value for last_successful_stream_ordering. I don't know if it is relevant to the current bug, but I found it odd, so I figure it's worth noting here.
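
If it helps to cross-check, repeatedly running something like this (a sketch against the same destinations table as above) should show whether the stored value actually advances between transactions:

-- Sketch: watch whether the recorded catch-up position moves over time.
SELECT destination, last_successful_stream_ordering
FROM destinations
WHERE destination IN ('matrix.org', 'jki.re');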

@licentiapoetica

I actually have a similar problem to the OP: as of today, no messages from my instance are federated any longer, and I don't know what caused it. Right now I cannot chat with any servers other than my own. Read receipts and typing notifications work, as does joining channels; it's just my outgoing messages that are not delivered.

In my journal I get the following all the time, like every 5-10 seconds

Mar 21 12:46:36 synapse postgres[254]: 2021-03-21 12:46:36.404 UTC [254] ERROR:  could not serialize access due to concurrent update
Mar 21 12:46:36 synapse postgres[254]: 2021-03-21 12:46:36.404 UTC [254] STATEMENT:  INSERT INTO destinations (destination) VALUES ('matrix.kek.fyi') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('isitviral.ca') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('undernet.uy') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.org') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('kawaiilo.li') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.ordoevangelistarum.com') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('nerdsin.space') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('zereg.firewall-gateway.de') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.netcavy.net') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('asra.gr') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('hackerspace.pl') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('theboys.gg') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('200acres.org') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('privacytools.io') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('wub.root.sx') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('nemuneko.me') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('jupiterbroadcasting.com') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('wooot.de') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('badopinion.club') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('n1tsu.freeboxos.fr') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('fard.topplekek.xyz') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('gravitation.me') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('mozilla.org') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.ipv8.pl') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('tchncs.de') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.asger.xyz') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('matrix.eientei.org') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('sudo.ems.host') ON CONFLICT DO NOTHING;;INSERT INTO destinations (destination) VALUES ('midov.pl') ON CONFLICT DO NOTHING;
2021-03-21 00:01:37,902 - synapse.http.matrixfederationclient - 209 - INFO - PUT-23300 - {GET-O-102338} [matrix.org] Completed request: 200 OK in 0.36 secs - GET matrix://matrix.org/_matrix/federation/v1/user/devices/%40nitzu%3Amatrix.org
2021-03-21 00:01:38,099 - synapse.access.http.8008 - 295 - INFO - PUT-23300 - 2a00:1098:84:1c8::157 - 8008 - {matrix.org} Processed request: 0.566sec/-0.000sec (0.003sec, 0.001sec) (0.002sec/0.195sec/8) 11B 200 "PUT /_matrix/federation/v1/send/1615911485007 HTTP/1.0" "Synapse/1.29.0 (b=matrix-org-hotfixes,d8953b34f)" [0 dbevts]
2021-03-21 00:01:38,768 - synapse.federation.transport.server - 406 - INFO - PUT-23301 - Received txn 1615908879035 from linuxbox.ninja.

It definitely does get some 200s there, which is really weird. I am using https://archlinux.org/packages/community/any/matrix-synapse/
and psql (PostgreSQL) 13.2.

@erikjohnston
Member

erikjohnston commented Mar 23, 2021

We're doing a bunch of work in this area at the moment; in particular, I'd be intrigued to know if #9639 helps, which should stop the "could not serialize access due to concurrent update" errors.
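
For context, that PostgreSQL error is a generic serialization failure: under REPEATABLE READ or stricter isolation, two transactions that concurrently update the same row cannot both succeed, and the losing one has to be rolled back and retried by the application. A minimal illustration (not Synapse-specific; it just reuses the destinations table from the statement above):

-- session A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE destinations SET retry_interval = 0 WHERE destination = 'matrix.org';
-- (A does not commit yet)

-- session B, concurrently:
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE destinations SET retry_interval = 0 WHERE destination = 'matrix.org';
-- B blocks on the row lock held by A.

-- session A:
COMMIT;
-- B now fails with:
--   ERROR:  could not serialize access due to concurrent update
-- and has to be rolled back and retried.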

@licentiapoetica

We're doing a bunch of work in this area at the moment; in particular, I'd be intrigued to know if #9639 helps, which should stop the "could not serialize access due to concurrent update" errors.

If it happens again I will surely report back on this, but since I could not fix it myself and nobody knew how to fix it, I decided to create a new database, which is working quite well for now. If it does happen again I will of course try this, thanks!

@ShadowJonathan
Contributor

Yeah, I have the feeling this is the exact same error as #9635, looking at these logs.

@richvdh
Member

richvdh commented Aug 13, 2021

It sounds like we're hoping the original cause of this is fixed by #9639. Since @vberger's problem fixed itself, I don't think there's anything else we can do here.

@richvdh richvdh closed this as completed Aug 13, 2021