User Details
- User Since
- May 11 2015, 8:31 AM (480 w, 2 d)
- Availability
- Away until Sep 9.
- IRC Nick
- jynus
- LDAP User
- Jcrespo
- MediaWiki User
- JCrespo (WMF)
Tue, Jul 9
0 -> backup1004                 0
1 -> backup1004                 0
2 -> backup1004                 0
3 -> backup1004 -> backup1005   1 (done)
4 -> backup1005 *               1
5 -> backup1005 *               1
6 -> backup1005 * backup1006    2
7 -> backup1005 * backup1006    2
8 -> backup1006                 2
9 -> backup1006 -> backup1007   3 (done)
a -> backup1006 -> backup1007   3 (done)
b -> backup1006 -> backup1007   3 (done)
c -> backup1007 -> backup1011   4 (done)
d -> backup1007 -> backup1011   4 (done)
e -> backup1007 -> backup1011   4 (done)
f -> backup1007 -> backup1011   4 (done)
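For readability, a toy sketch of how such a shard map could be consulted. Sharding on the first hex digit of a SHA-256 and the `SHARD_MAP`/`host_for` names are illustrative assumptions, not the real mediabackups code; the map reflects the target (post-resharding) placement from the table above.

```python
# Toy sketch, NOT the real mediabackups code: route a file to its
# backup host by the first hex digit of its content hash. SHARD_MAP
# mirrors the target (post-resharding) placement in the table above.
import hashlib

SHARD_MAP = {
    '0': 'backup1004', '1': 'backup1004', '2': 'backup1004',
    '3': 'backup1005', '4': 'backup1005', '5': 'backup1005',
    '6': 'backup1006', '7': 'backup1006', '8': 'backup1006',
    '9': 'backup1007', 'a': 'backup1007', 'b': 'backup1007',
    'c': 'backup1011', 'd': 'backup1011', 'e': 'backup1011',
    'f': 'backup1011',
}

def host_for(content: bytes) -> str:
    """Return the backup host owning the shard for this content."""
    first_hex_digit = hashlib.sha256(content).hexdigest()[0]
    return SHARD_MAP[first_hex_digit]

print(host_for(b'example file contents'))
```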
Mon, Jul 8
Resharding completed; the only things pending are 2 purge screens running on ms-backup2001 and ms-backup2002 for purging leftovers. backup1011 & backup2011 will have to be complemented by backup1012 and backup2012 this Q.
Thu, Jul 4
Running it manually, it worked every time, so I am confused; it doesn't seem to be a script issue. Could it be a mailing subsystem issue?
Wed, Jul 3
1 more week left to finish the resharding.
Backlog for when I come back.
5 million files left to recover!
It would be nice to productionize this, but I haven't had the time so far.
This has been worked around with the mini-loader method of restoring backups, so I would call it resolved.
I will skip the "Remove dump user", as I think that may be useful and we will decide how to leave it long term when the es1, es2 & es3 backups are generated (with or without the user).
Let's wait a little bit before deleting the files on the old dbprovs just in case (I will do it when I come back).
@Volans @ABran-WMF FYI
Tue, Jul 2
es4 has already been archived on jobs 574899 and 574900; the two for es5 are running now. When finished, we will be able to close this ticket.
Mon, Jul 1
No action will be needed for backup1010 in the end.
@Davenyi please note you missed the options asked for on the form, as seen above.
If I may @fnegri, the issue is that those hosts are special in a way, because they are pieces (data) of production (meaning here mediawiki) living on the cloud realm, so it may not be easy to solve with the current architecture. If there was an implementation where absolutely all non-public data and configuration was deleted on the production side (e.g. a message protocol that cleans everything up and reconstructs it again on the cloud network), that would solve all concerns, but that would be way more complex and would require a lot of work. And only now is there the start of a proper inventory where each table and column will document its privacy and concerns for global usage and editing.
Thu, Jun 27
I got another error at backup2002 (es5):
2024-06-26 17:07:31 [ERROR] - Could not read data from enwiki.blobs_cluster27: Lost connection to MySQL server during query
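A common mitigation for this class of error on long dump reads is raising the per-session network timeouts on the server side (and/or dumping in smaller chunks). A minimal sketch, assuming pymysql; the host, credentials and 600s values are illustrative, and the actual dump tooling sets its own options:

```python
# Sketch: raise per-session network timeouts before a long streaming
# read, a common mitigation for "Lost connection to MySQL server
# during query". Host, credentials and values are illustrative.
import pymysql

conn = pymysql.connect(host='backup-source.example.wmnet',
                       user='dump', password='...', database='enwiki')
with conn.cursor() as cur:
    # net_write_timeout governs how long the server waits while
    # writing rows out to a slow client; raise it for this session.
    cur.execute('SET SESSION net_write_timeout = 600')
    cur.execute('SET SESSION net_read_timeout = 600')
    cur.execute('SELECT blob_id, blob_text FROM blobs_cluster27 LIMIT 10')
    for blob_id, blob_text in cur.fetchall():
        pass  # stream/process each blob here
conn.close()
```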
Wed, Jun 26
This seems not to be reproducible; maybe it was related to cold caches after the reboot? Lowering the priority as it hasn't happened again since, but I want to trace it at some point.
es2022 finished, all good. I am going to disable bacula for the es hosts so it doesn't run while the ongoing db dumps finish, then reenable it and do the one-time read-only backup.
This is the status now: es2025 completed and repooled, es2022 is about to finish (but still depooled), es1022 + es1025 have just started:
Tue, Jun 25
Question: what went wrong with the dump replica groups? I believe (or at least it used to be the case) that dumps only use databases in the dump replica group. Was it overloaded, and did the load jump onto other dbs? If yes, can that be prevented, to help throttle the dumps?
Is there a procedure for that so we know how to do so?
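To illustrate the concern, a toy sketch of group-based replica selection with a fallback that could explain load jumping onto other dbs. The group names, hosts and weights are invented; this is not the actual MediaWiki load balancer code:

```python
# Toy sketch of group-based replica selection: hosts in the 'dump'
# group carry dump traffic; the fallback to the general pool is the
# kind of behavior being asked about. Hosts/weights are invented.
import random

GROUP_LOADS = {
    'dump': {'db1234': 1},                       # dedicated dump replica
    'default': {'db1111': 100, 'db1112': 100},   # general read pool
}

def pick_replica(group: str) -> str:
    loads = {h: w for h, w in GROUP_LOADS.get(group, {}).items() if w > 0}
    if not loads:
        # If the group is empty (or depooled), traffic falls back to
        # the general pool: dumps would then hit other dbs.
        loads = dict(GROUP_LOADS['default'])
    hosts, weights = zip(*loads.items())
    return random.choices(hosts, weights=weights)[0]

print(pick_replica('dump'))
```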
Jun 24 2024
@Ladsgroup Can you please also fix and clean up the backups that failed because of it?
Jun 21 2024
I didn't get to this this week, but let's try to have it done next week (CC @Marostegui @ABran-WMF @Ladsgroup), as I will be gone in 2 weeks, and that way everything will be where it should be.
The above errors have been solved, and I compared all tables to db1150, now getting the same results.
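For reference, a minimal sketch of this kind of cross-host table comparison using plain `CHECKSUM TABLE`. Hostnames, credentials and the example table are placeholders, and the actual comparison was presumably done with the standard DBA compare tooling:

```python
# Sketch: compare one table between two hosts with CHECKSUM TABLE.
# Hosts, credentials and the example table are placeholders; note
# CHECKSUM TABLE is only comparable across identical row formats.
import pymysql

def table_checksum(host: str, db: str, table: str) -> int:
    conn = pymysql.connect(host=host, user='check', password='...', database=db)
    try:
        with conn.cursor() as cur:
            cur.execute(f'CHECKSUM TABLE `{table}`')
            return cur.fetchone()[1]  # row is (table_name, checksum)
    finally:
        conn.close()

a = table_checksum('db1150.example.wmnet', 'testwiki', 'revision')
b = table_checksum('db1240.example.wmnet', 'testwiki', 'revision')
print('match' if a == b else 'MISMATCH', a, b)
```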
Jun 20 2024
For example, a freshly loaded host from backups (db1240:s3) has it as 169G in size. So the difference may come from 10.6 or some change in defaults, not the lack of optimization (?).
There was nothing other than rebuilds/optimizes
These are the kinds of things I wanted to avoid with a naive approach:
I have reloaded data to db1240:s3 from a backup taken at 2024-06-11 00:00:05. It will require the schema changes that happened since then (but no rebuilds or other non-data changes, as the data has been loaded logically).
Jun 19 2024
We are trying to measure the lag with a pretty high precision
I met with Abran on a quick call; we both shared thoughts and, I think, understood each other. I also encouraged him to talk to the rest of the DBA team about decisions that are open and future improvements.
I wasn't worried about immediate alerts. I know those won't change for now.
@ABran-WMF It will very likely change it, because, as you shared, the exporter does:
I don't see a clear difference with the current icinga/perl implementation.
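For context, both implementations presumably boil down to the classic heartbeat-based lag computation. A minimal sketch, assuming a pt-heartbeat style `heartbeat.heartbeat` table and placeholder credentials:

```python
# Sketch: compute replication lag from a pt-heartbeat style table;
# this is the idea both the perl check and an exporter would share.
# Connection details are placeholders.
import pymysql

LAG_QUERY = """
SELECT TIMESTAMPDIFF(MICROSECOND, MAX(ts), UTC_TIMESTAMP(6)) / 1000000.0
FROM heartbeat.heartbeat
"""

def replication_lag_seconds(host: str) -> float:
    conn = pymysql.connect(host=host, user='monitor', password='...')
    try:
        with conn.cursor() as cur:
            cur.execute(LAG_QUERY)
            (lag,) = cur.fetchone()
            return max(0.0, float(lag))  # clock skew can go negative
    finally:
        conn.close()

print(replication_lag_seconds('db1234.example.wmnet'))
```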
I am also not going to enable any account until the end of the load (including monitoring) to avoid any bad interaction.
The process seems to have failed at the last steps. Retrying with a higher buffer pool and stopping s1.
Jun 18 2024
I am leaving for the day, but there is a chance this is not worth debugging because the hosts are about to be decommissioned (unless it happens on the new ones, too). Filing it in case it could be useful for other perf issues for other hosts.
While technically the host didn't crash (it had an "unscheduled normal shutdown"), given it is the source of s3 backups on eqiad, I am going to recover it from backups.
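For reference, a minimal sketch of what such a logical recovery can look like with myloader. The dump path and connection details are illustrative; the real recovery presumably goes through the standard backup-recovery tooling:

```python
# Sketch: logical recovery of a section with myloader. The dump path
# and connection details are illustrative; the real workflow is the
# standard backup-recovery tooling.
import subprocess

DUMP_DIR = '/srv/backups/dumps/dump.s3.2024-06-11--00-00-05'  # hypothetical

subprocess.run([
    'myloader',
    '--directory', DUMP_DIR,
    '--host', 'localhost',
    '--user', 'root',
    '--threads', '8',       # parallel restore threads
    '--overwrite-tables',   # replace any partially-recovered tables
], check=True)
```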
Jun 17 2024
And this is the wiki distribution:
This is the API request I filed: T267365
@ABran-WMF As you can see, codfw health status is much better (I queried it just before restarting it) ^
Jun 14 2024
Done!
Deleted from zarcillo and stopped.
Jun 12 2024
The alerts should be configurable by lag and by role from puppet (e.g. core db hosts vs misc vs test hosts, etc.). That means I don't want alerts for backup sources with lag < 4h, as I regularly stop those while taking the backups. See the sketch after these host notes.
db1205 is the secondary media backups metadata db server, usually just a standby to db1204. Unless it is the active server because the primary is unavailable, it just has to be checked that replication restarts correctly after maintenance.
backup1011 is a mediabackups storage server. Ideally, mediabackups are paused during the maintenance to avoid backup errors.
backup1009 is the main backup node for bacula on eqiad. Most backups happen during the night- so just monitoring that it came back and new backups happen normally would be enough.
backup1010 is in intermittent usage to support mediabackups disk space, but mostly idle at the moment, so unless its situation changes by July and it finally gets pooled for bacula, it will require no action.
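As mentioned above, a toy sketch of role-aware lag alerting; the roles and thresholds are invented for illustration:

```python
# Toy sketch of role-aware lag alerting: thresholds come from the
# host's puppet role rather than a single global value, so backup
# sources that are stopped for hours while dumping don't page.
# Roles and thresholds are invented for illustration.
THRESHOLDS_SECONDS = {
    'core': 300,                # production core db: page quickly
    'misc': 900,
    'test': None,               # never page
    'backup_source': 4 * 3600,  # regularly stopped while dumping
}

def should_alert(role: str, lag_seconds: float) -> bool:
    threshold = THRESHOLDS_SECONDS.get(role)
    return threshold is not None and lag_seconds > threshold

assert should_alert('core', 600)
assert not should_alert('backup_source', 2 * 3600)
```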
@Marostegui, in order to resolve this ticket, now that read activity is (I assume) lower, do you think I could get a host from es4 and es5 on both dcs depooled for a day, with exclusive usage, in order to take a final, archivable, full backup of those sections? It doesn't have to happen at the same time on the 4 hosts:
@ABran-WMF Thanks for handling it. To confirm: the issue happened at 2024-06-11 13:53:41 (Tuesday), right, or before? Because I may recover the host from backups just to be 100% sure there is no leftover corruption.
Jun 7 2024
This is ready for dc-ops.
This is ready for dc-ops.
This is ready for dc-ops.