User Details
- User Since
- May 11 2015, 8:31 AM (480 w, 2 d)
- Availability
- Away until Sep 9.
- IRC Nick
- jynus
- LDAP User
- Jcrespo
- MediaWiki User
- JCrespo (WMF)
Tue, Jul 9
0 -> backup1004                 0
1 -> backup1004                 0
2 -> backup1004                 0
3 -> backup1004 -> backup1005   1 (done)
4 -> backup1005 *               1
5 -> backup1005 *               1
6 -> backup1005 * backup1006    2
7 -> backup1005 * backup1006    2
8 -> backup1006                 2
9 -> backup1006 -> backup1007   3 (done)
a -> backup1006 -> backup1007   3 (done)
b -> backup1006 -> backup1007   3 (done)
c -> backup1007 -> backup1011   4 (done)
d -> backup1007 -> backup1011   4 (done)
e -> backup1007 -> backup1011   4 (done)
f -> backup1007 -> backup1011   4 (done)
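For readability, a toy sketch of how such a shard map could be consulted. Sharding on the first hex digit of a SHA-256 and the `SHARD_MAP`/`host_for` names are illustrative assumptions, not the real mediabackups code; the map reflects the target (post-resharding) placement from the table above.

```python
# Toy sketch, NOT the real mediabackups code: route a file to its
# backup host by the first hex digit of its content hash. SHARD_MAP
# mirrors the target (post-resharding) placement in the table above.
import hashlib

SHARD_MAP = {
    '0': 'backup1004', '1': 'backup1004', '2': 'backup1004',
    '3': 'backup1005', '4': 'backup1005', '5': 'backup1005',
    '6': 'backup1006', '7': 'backup1006', '8': 'backup1006',
    '9': 'backup1007', 'a': 'backup1007', 'b': 'backup1007',
    'c': 'backup1011', 'd': 'backup1011', 'e': 'backup1011',
    'f': 'backup1011',
}

def host_for(content: bytes) -> str:
    """Return the backup host owning the shard for this content."""
    first_hex_digit = hashlib.sha256(content).hexdigest()[0]
    return SHARD_MAP[first_hex_digit]

print(host_for(b'example file contents'))
```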
Mon, Jul 8
Resharding completed; the only things pending are 2 purge screens running on ms-backup2001 and ms-backup2002 for purging leftovers. backup1011 & backup2011 will have to be complemented by backup1012 and backup2012 this Q.
Thu, Jul 4
Running it manually, it worked every time, so I am confused; it doesn't seem to be a script issue. Could it be a mailing subsystem issue?
Wed, Jul 3
1 more week left to finish the resharding.
Backlog for when I come back.
5 million files left to recover!
It would be nice to productionize this, but I haven't had the time so far.
This has been worked around with the mini-loader method of restoring backups, so I would call it resolved.
I will skip the "Remove dump user", as I think that may be useful and we will decide how to leave it long term when the es1, es2 & es3 backups are generated (with or without the user).
Let's wait a little bit before deleting the files on the old dbprovs just in case (I will do it when I come back).
@Volans @ABran-WMF FYI
Tue, Jul 2
es4 has already been archived on jobs 574899 and 574900; the two for es5 are running now. When finished, we will be able to close this ticket.
Mon, Jul 1
No action will be needed for backup1010 in the end.
@Davenyi please note you missed the options asked for on the form, as seen above.
If I may @fnegri, the issue is that those hosts are special in a way, because they are pieces (data) of production (meaning here mediawiki) living on the cloud realm, so it may not be easy to solve with the current architecture. If there was an implementation where absolutely all non-public data and configuration was deleted on the production side (e.g. a message protocol that cleans everything up and reconstructs it again on the cloud network), that would solve all concerns, but that would be way more complex and would require a lot of work. And only now is there the start of a proper inventory where each table and column will document its privacy and concerns for global usage and editing.
Thu, Jun 27
I got another error at backup2002 (es5):
2024-06-26 17:07:31 [ERROR] - Could not read data from enwiki.blobs_cluster27: Lost connection to MySQL server during query
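A common mitigation for this class of error on long dump reads is raising the per-session network timeouts on the server side (and/or dumping in smaller chunks). A minimal sketch, assuming pymysql; the host, credentials and 600s values are illustrative, and the actual dump tooling sets its own options:

```python
# Sketch: raise per-session network timeouts before a long streaming
# read, a common mitigation for "Lost connection to MySQL server
# during query". Host, credentials and values are illustrative.
import pymysql

conn = pymysql.connect(host='backup-source.example.wmnet',
                       user='dump', password='...', database='enwiki')
with conn.cursor() as cur:
    # net_write_timeout governs how long the server waits while
    # writing rows out to a slow client; raise it for this session.
    cur.execute('SET SESSION net_write_timeout = 600')
    cur.execute('SET SESSION net_read_timeout = 600')
    cur.execute('SELECT blob_id, blob_text FROM blobs_cluster27 LIMIT 10')
    for blob_id, blob_text in cur.fetchall():
        pass  # stream/process each blob here
conn.close()
```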
Wed, Jun 26
This seems not to be reproducible; maybe it was related to cold caches after the reboot? Lowering the priority as it hasn't happened again since, but I want to trace it at some point.
es2022 finished, all good. I am going to disable bacula for the es hosts so it doesn't run while the ongoing db dumps finish, then reenable it and do the one-time read-only backup.
This is the status now: es2025 completed and repooled, es2022 is about to finish (but still depooled), es1022 + es1025 have just started:
Tue, Jun 25
Question: what went wrong with the dump replica groups? I believe (or at least it used to be the case) that dumps only use databases in the dump replica group. Was it overloaded, and did the load jump onto other dbs? If yes, can that be prevented, to help throttle the dumps?
Is there a procedure for that so we know how to do so?
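To illustrate the concern, a toy sketch of group-based replica selection with a fallback that could explain load jumping onto other dbs. The group names, hosts and weights are invented; this is not the actual MediaWiki load balancer code:

```python
# Toy sketch of group-based replica selection: hosts in the 'dump'
# group carry dump traffic; the fallback to the general pool is the
# kind of behavior being asked about. Hosts/weights are invented.
import random

GROUP_LOADS = {
    'dump': {'db1234': 1},                       # dedicated dump replica
    'default': {'db1111': 100, 'db1112': 100},   # general read pool
}

def pick_replica(group: str) -> str:
    loads = {h: w for h, w in GROUP_LOADS.get(group, {}).items() if w > 0}
    if not loads:
        # If the group is empty (or depooled), traffic falls back to
        # the general pool: dumps would then hit other dbs.
        loads = dict(GROUP_LOADS['default'])
    hosts, weights = zip(*loads.items())
    return random.choices(hosts, weights=weights)[0]

print(pick_replica('dump'))
```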
Jun 24 2024
@Ladsgroup Can you please also fix and clean up the backups that failed because of it?
Jun 21 2024
I didn't get to this this week, but let's try to have it done next week (CC @Marostegui @ABran-WMF @Ladsgroup), as I will be gone in 2 weeks, and that way everything will be where it should be.
The above errors have been solved, and I compared all tables to db1150, now getting the same results.
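For reference, a minimal sketch of this kind of cross-host table comparison using plain `CHECKSUM TABLE`. Hostnames, credentials and the example table are placeholders, and the actual comparison was presumably done with the standard DBA compare tooling:

```python
# Sketch: compare one table between two hosts with CHECKSUM TABLE.
# Hosts, credentials and the example table are placeholders; note
# CHECKSUM TABLE is only comparable across identical row formats.
import pymysql

def table_checksum(host: str, db: str, table: str) -> int:
    conn = pymysql.connect(host=host, user='check', password='...', database=db)
    try:
        with conn.cursor() as cur:
            cur.execute(f'CHECKSUM TABLE `{table}`')
            return cur.fetchone()[1]  # row is (table_name, checksum)
    finally:
        conn.close()

a = table_checksum('db1150.example.wmnet', 'testwiki', 'revision')
b = table_checksum('db1240.example.wmnet', 'testwiki', 'revision')
print('match' if a == b else 'MISMATCH', a, b)
```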
Jun 20 2024
For example, a freshly loaded host from backups (db1240:s3) has it as 169G in size. So the difference may come from 10.6 or some change in defaults, not the lack of optimization (?).
There was nothing other than rebuilds/optimizes
These are the kinds of things I wanted to avoid with a naive approach:
I have reloaded data to db1240:s3 from a backup taken at 2024-06-11 00:00:05. It will require the schema changes that happened since then (but no rebuilds or other non-data changes, as the data has been loaded logically).
Jun 19 2024
We are trying to measure the lag with a pretty high precision
I met with Abran on a quick call; we both shared thoughts and, I think, understood each other. I also encouraged him to talk to the rest of the DBA team about decisions that are open and future improvements.
I wasn't worried about immediate alerts. I know those won't change for now.
@ABran-WMF It will very likely change it, because, as you shared, the exporter does:
I don't see a clear difference with the current icinga/perl implementation.
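For context, both implementations presumably boil down to the classic heartbeat-based lag computation. A minimal sketch, assuming a pt-heartbeat style `heartbeat.heartbeat` table and placeholder credentials:

```python
# Sketch: compute replication lag from a pt-heartbeat style table;
# this is the idea both the perl check and an exporter would share.
# Connection details are placeholders.
import pymysql

LAG_QUERY = """
SELECT TIMESTAMPDIFF(MICROSECOND, MAX(ts), UTC_TIMESTAMP(6)) / 1000000.0
FROM heartbeat.heartbeat
"""

def replication_lag_seconds(host: str) -> float:
    conn = pymysql.connect(host=host, user='monitor', password='...')
    try:
        with conn.cursor() as cur:
            cur.execute(LAG_QUERY)
            (lag,) = cur.fetchone()
            return max(0.0, float(lag))  # clock skew can go negative
    finally:
        conn.close()

print(replication_lag_seconds('db1234.example.wmnet'))
```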
I am also not going to enable any account until the end of the load (including monitoring) to avoid any bad interaction.
The process seems to have failed at the last steps. Retrying with a higher buffer pool and stopping s1.
Jun 18 2024
I am leaving for the day, but there is a chance this is not worth debugging because the hosts are about to be decommissioned (unless it happens on the new ones, too). Filing it in case it could be useful for other perf issues for other hosts.
While technically the host didn't crash (it had an "unscheduled normal shutdown"), given it is the source of s3 backups on eqiad, I am going to recover it from backups.
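For reference, a minimal sketch of what such a logical recovery can look like with myloader. The dump path and connection details are illustrative; the real recovery presumably goes through the standard backup-recovery tooling:

```python
# Sketch: logical recovery of a section with myloader. The dump path
# and connection details are illustrative; the real workflow is the
# standard backup-recovery tooling.
import subprocess

DUMP_DIR = '/srv/backups/dumps/dump.s3.2024-06-11--00-00-05'  # hypothetical

subprocess.run([
    'myloader',
    '--directory', DUMP_DIR,
    '--host', 'localhost',
    '--user', 'root',
    '--threads', '8',       # parallel restore threads
    '--overwrite-tables',   # replace any partially-recovered tables
], check=True)
```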
Jun 17 2024
And this is the wiki distribution:
This is the API request I filed: T267365
@ABran-WMF As you can see, codfw health status is much better (I queried it just before restarting it) ^
Jun 14 2024
Done!
Deleted from zarcillo and stopped.
Jun 12 2024
The alerts should be configurable by lag and by role from puppet (e.g. core db hosts vs misc vs test hosts, etc.). That means I don't want alerts for backup sources with lag < 4h, as I regularly stop those while taking the backups. See the sketch after these host notes.
db1205 is the secondary media backups metadata db server, usually just a standby to db1204. Unless it is the active server because the primary is unavailable, it just has to be checked that replication restarts correctly after maintenance.
backup1011 is a mediabackups storage server. Ideally, mediabackups are paused during the maintenance to avoid backup errors.
backup1009 is the main backup node for bacula on eqiad. Most backups happen during the night- so just monitoring that it came back and new backups happen normally would be enough.
backup1010 is in intermittent usage to support mediabackups disk space, but mostly idle at the moment, so unless its situation changes by July and it finally gets pooled for bacula, it will require no action.
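As mentioned above, a toy sketch of role-aware lag alerting; the roles and thresholds are invented for illustration:

```python
# Toy sketch of role-aware lag alerting: thresholds come from the
# host's puppet role rather than a single global value, so backup
# sources that are stopped for hours while dumping don't page.
# Roles and thresholds are invented for illustration.
THRESHOLDS_SECONDS = {
    'core': 300,                # production core db: page quickly
    'misc': 900,
    'test': None,               # never page
    'backup_source': 4 * 3600,  # regularly stopped while dumping
}

def should_alert(role: str, lag_seconds: float) -> bool:
    threshold = THRESHOLDS_SECONDS.get(role)
    return threshold is not None and lag_seconds > threshold

assert should_alert('core', 600)
assert not should_alert('backup_source', 2 * 3600)
```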
@Marostegui, in order to resolve this ticket, now that read activity is (I assume) lower, do you think I could get a host from es4 and es5 on both dcs depooled for a day, with exclusive usage, in order to take a final, archivable, full backup of those sections? It doesn't have to happen at the same time on the 4 hosts:
@ABran-WMF Thanks for handling it. To confirm: the issue happened at 2024-06-11 13:53:41 (Tuesday), right, or before? Because I may recover the host from backups just to be 100% sure there is no leftover corruption.
Jun 7 2024
This is ready for dc-ops.
This is ready for dc-ops.
This is ready for dc-ops.