Matrix.org homeserver grinds to a halt after RAID meltdown

(2025/09/03)


A RAID failure has taken the Matrix.org homeserver offline, leaving users of the decentralized messaging service unable to send or receive messages while engineers attempt a 55 TB database restore.

To be clear, those with their own homeservers, such as government organizations, are unaffected, but anyone using Matrix.org as their homeserver will have been hearing the sound of silence from the platform while the team works to bring the service back online.

Problems [1]began at 1117 UTC on September 2, when the secondary Matrix.org database lost its file system due to a RAID failure. The primary fell over at 1726 UTC, and a few minutes later, the organization [2]admitted that things were indeed not very healthy.

The Matrix.org homeserver is backed by a large PostgreSQL database, which caused the organization grief in July when a long-gestating corruption of part of a table index caused issues with "rooms" in the system. The [4]result was that attempts to join rooms would fail, messages wouldn't send, and occasional cryptic error messages would appear.

The team was understandably a little cautious when restoring the database and eventually [7]reported: "We haven't been able to restore the DB primary filesystem to a state we're confident in running as a primary (especially given our experiences with slow-burning postgres db corruption)."

The solution is a full 55 TB database snapshot restore followed by a replay of 17 hours' worth of traffic. At the time of writing, the team had managed to restore the snapshot and subsequent incremental backups and was about to embark on the traffic replay.
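
The shape of that recovery (restore a snapshot, then replay everything that happened afterwards) can be shown with a toy sketch. Everything in it is a hypothetical illustration: the `Event` record, the `restore_and_replay` helper, and the in-memory dictionary standing in for a database are assumptions made for this example, not the Matrix.org team's tooling and not PostgreSQL's actual recovery mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A hypothetical logged write: set `key` to `value` at sequence number `seq`."""
    seq: int
    key: str
    value: str


def restore_and_replay(snapshot, snapshot_seq, log):
    """Rebuild state from a snapshot plus every logged write made after it."""
    state = dict(snapshot)  # copy so the snapshot itself stays untouched
    # Apply writes in sequence order; anything at or before snapshot_seq
    # is already reflected in the snapshot and is skipped.
    for event in sorted(log, key=lambda e: e.seq):
        if event.seq > snapshot_seq:
            state[event.key] = event.value
    return state


# Illustrative data only (assumed, not taken from the incident).
snapshot = {"room:1": "hello"}
log = [
    Event(1, "room:1", "hello"),        # already captured by the snapshot
    Event(2, "room:1", "hello again"),  # written after the snapshot
    Event(3, "room:2", "new room"),     # written after the snapshot
]
restored = restore_and_replay(snapshot, snapshot_seq=1, log=log)
print(restored)
```

The replay step is only as good as the log: if any entry after the snapshot is missing or damaged, the rebuilt state diverges from what the primary held when it failed.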

[8]Secure chat darling Matrix admits pair of 'high severity' protocol flaws need painful fixes

[9]Messaging app makers' dilemma: Keeping comms private and funding open source

[10]Open source license challenges part 461: Element plots move to AGPLv3

[11]Element users are asking for protection against government encryption busting

Neil Johnson, chief engineering officer at Element, a messaging platform by the creators of Matrix, told The Register the trouble started with a routine storage upgrade exercise that went badly wrong. "A whole series of things happened at exactly the wrong time in unison, which then led to the situation that we see," he said.

It's not a great look for the organization, as users who rely on the Matrix.org homeserver can't access it. Messages sent to Matrix.org users will be queued until the service is back up and running. "There's not going to be any data loss. Eventually your message will get through," Johnson said.
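
The queue-until-it-comes-back behaviour Johnson describes can be sketched as a simple store-and-forward queue. This is a minimal illustration under stated assumptions: `OutboundQueue`, `send`, and the `link` flag are hypothetical names invented for the example, and the code is not how any real Matrix homeserver implements federation (a real server would also persist the queue and back off between retries).

```python
from collections import deque


class OutboundQueue:
    """Hypothetical per-destination queue on a sending homeserver."""

    def __init__(self, send):
        self._send = send        # callable(message) -> bool, True on delivery
        self._pending = deque()  # messages awaiting delivery, oldest first

    def submit(self, message):
        """Queue a message and attempt delivery straight away."""
        self._pending.append(message)
        return self.flush()

    def flush(self):
        """Deliver queued messages in order; stop at the first failure."""
        while self._pending:
            if not self._send(self._pending[0]):
                return False     # destination still down; keep everything queued
            self._pending.popleft()
        return True

    def __len__(self):
        return len(self._pending)


# Simulated destination: a flag stands in for the remote server being reachable.
link = {"up": False}
delivered = []


def send(message):
    if link["up"]:
        delivered.append(message)
    return link["up"]


queue = OutboundQueue(send)
queue.submit("first")   # destination down: stays queued
queue.submit("second")  # destination down: stays queued
queued_during_outage = len(queue)

link["up"] = True       # destination comes back
queue.flush()           # backlog drains in the original order
```

Stopping at the first failed send, rather than skipping ahead, is what preserves message order once the destination returns.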

There is no charge for using Matrix.org and there is also no service level agreement.

The incident demonstrates the benefits of a decentralized system. Users with their own homeservers aren't affected, nor are organizations such as Element, which have customer deployments that utilize the underlying technology.

One homeserver going down does not affect the rest, even one as visible as Matrix.org.

Matrix has become increasingly important in recent years as public and private sector organizations seek to reduce their dependency on centralized messaging services that might not meet sovereignty or privacy requirements. The Matrix.org outage, while embarrassing, serves to highlight that a decentralized approach can protect users from whoopsies on the part of those who run the service. ®



[1] https://mastodon.matrix.org/@matrix/115136245785561439

[2] https://status.matrix.org/

[4] https://matrix.org/blog/2025/07/postgres-corruption-postmortem/

[7] https://mastodon.matrix.org/@matrix/115136866878237078

[8] https://www.theregister.com/2025/08/13/secure_chat_darling_matrix_admits/

[9] https://www.theregister.com/2024/09/25/element_bosses_on_funding_open/

[10] https://www.theregister.com/2023/11/06/element_moves_to_agplv3/

[11] https://www.theregister.com/2023/10/24/element_spy_clause_protection/



ParlezVousFranglais

"A whole series of things happened at exactly the wrong time in unison, which then led to the situation that we see"

Yep - been there, done that, still have the t-shirt somewhere. I don't envy the inevitable stress of that level of failure, followed by a 55 TB restore and replaying that many logs (just hope they are all intact), but while they may not see it at the moment, that will become a learning exercise, and the team involved will hopefully eventually be better off for the experience. Hope they have a well-deserved beer at the end of it all...

ChoHag

I don't know what's worse. That the central node's failure took the decentralised service down or that it has 55TB of data in it.

SW or HW raid?

m4r35n357

It is sort of relevant here! Otherwise comes across as vague bitching about PostgreSQL.

Re: SW or HW raid?

alain williams

It does not matter what sort of RAID. PostgreSQL sits on top of whatever it is. This fu-bar is nothing to do with the database.

It does show that even if you have RAID you still need backups; they protect you in different ways.

What RAID

Colin Bull 1

Is it RAID 10 or 5 or an even crappier version? WE SHOULD BE TOLD.
