⚓ T355034 Deploy new block_target schema
Page MenuHomePhabricator

Deploy new block_target schema
Closed, ResolvedPublic

Description

  • Apply schema change maintenance/archives/patch-block_target.sql to all beta wikis
  • Apply schema change to all production wikis
  • QA prepare blocks on beta
  • Switch $wgBlockTargetMigrationStage on beta to SCHEMA_COMPAT_READ_OLD | SCHEMA_COMPAT_WRITE_BOTH
  • QA check block actions are still functional
  • Run migrateBlocks.php on beta
  • Switch beta to SCHEMA_COMPAT_READ_NEW | SCHEMA_COMPAT_WRITE_BOTH
  • Switch beta to SCHEMA_COMPAT_NEW
  • QA final check on beta
  • Deploy WMCS patch to hide private data
  • Switch production to SCHEMA_COMPAT_READ_OLD | SCHEMA_COMPAT_WRITE_BOTH
  • Run migrateBlocks.php on all production wikis
  • Switch production to SCHEMA_COMPAT_READ_NEW | SCHEMA_COMPAT_WRITE_BOTH
  • Deploy WMCS b/c view for ipblocks
  • Switch production to SCHEMA_COMPAT_NEW
  • Move away ipblocks tables
  • Drop ipblocks tables

Rollback plans:

  • It is possible to roll back from SCHEMA_COMPAT_WRITE_BOTH mode to SCHEMA_COMPAT_OLD mode, whether or not migrateBlocks.php has been run. If this is done, the new tables should be truncated before moving forward again.
  • Minor, fully-isolated bugs specific to the SCHEMA_COMPAT_WRITE_BOTH mode can be dealt with by completing the migration. We are not going to use this code again.
  • Rollback from SCHEMA_COMPAT_NEW to SCHEMA_COMPAT_READ_OLD can be done at the cost of losing blocks inserted since the SCHEMA_COMPAT_NEW mode was entered.

Related Objects

StatusSubtypeAssignedTask
OpenNone
OpenBUG REPORTNone
OpenNone
OpenNone
OpenNone
ResolvedAug 21 2018dbarratt
DeclinedNone
OpenNone
Resolvedtstarling
Resolvedtstarling
Resolvedtstarling
OpenNone
OpenNone
ResolvedMusikAnimal
Resolvedtaavi
ResolvedNone
ResolvedLadsgroup
ResolvedPRODUCTION ERRORtstarling
Resolvedtstarling
Resolvedtstarling
Resolvedtstarling
ResolvedPRODUCTION ERRORtstarling
ResolvedMarostegui

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 991105 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] WMCS: add views for block and block_target tables

https://gerrit.wikimedia.org/r/991105

I think the ipblocks view can become a permanent backwards-compatible view into block/block_target. The reasons for changing the schema don't really apply to read-only clients.

Brilliant! That would probably prevent a lot of tools from breaking. I doubt it would have much if any of a performance impact, either, so if that's the case we don't need a grace period for developers to update code. Good stuff!

Backward compatible views on wikireplicas must be only temporary. Let me explain why: The views on wikireplicas have been used to hide private information (that is basically the reason for their existence) and adding layers of backwards compatibility makes those rules much more complicated increasing the chance of PII leak (specially in this case that a view in one table is being used to hide both private data and provide b/c). That's why the wikireplica schema should be as close as possible to the production schema. On top of that, we do many schema changes and these will add up and add a lot of tech debt.

Our schema has never been considered stable and it shouldn't be (given its glaring issues) until we improve it to the point it would become stable in production as well. I don't like to break wikireplica tools either (I maintain tens of them) but keeping b/c forever is not a good idea. I can help with the communication though, done it many times.

Change 991105 merged by Ladsgroup:

[operations/puppet@production] WMCS: add views for block and block_target tables

https://gerrit.wikimedia.org/r/991105

Ran the maintain-views on every clouddb host from 1013 to 1021

I added a read-new/write-both stage, since that allows us to test the read-new code paths while still allowing rollback without data loss.

I was worried about the potential for deadlocks and other atomicity issues in the write-both mode, but after reviewing the code again at @dmaza's request, I think including the stage is the safer option.

Change 998624 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] beta: Switch block schema to read-old/write-both mode

https://gerrit.wikimedia.org/r/998624

Change 998625 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] beta: Switch block schema to read-new/write-both mode

https://gerrit.wikimedia.org/r/998625

Change 998626 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] beta: Switch block schema to read-new/write-new mode

https://gerrit.wikimedia.org/r/998626

Change 998624 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Switch block schema to read-old/write-both mode

https://gerrit.wikimedia.org/r/998624

Mentioned in SAL (#wikimedia-releng) [2024-02-08T22:41:57Z] <TimStarling> on deployment-deploy03 running foreachwiki maintenance/migrateBlocks.php for T355034

@tstarling If an autoblock exists for an IP and I try to create a block for the same IP, it gets inserted into ipblocks but not into block/block_target. On Special:Block, I see the message The block could not be made, but no existing block was found for "<ip>". If this problem persists, please report it. even though the block is created. Also happens when creating a global block and select the option to create a local block (except I don't see the message). Let me know if you would like me to raise a separate bug.

Did the maintenance scripts run correctly on beta? I notice that en_rtlwiki has nothing in block but has 4 (non-expired) blocks in ipblocks.

On enwiki beta, there are a few rows in ipblocks that are not in block. For example, there is a block against RebeccawlqphbcvaeFiora in ipblocks but not block. They all seem to have ipb_user=0, which I assume is wrong. Perhaps leftover from the actor migration project.

Did the maintenance scripts run correctly on beta? I notice that en_rtlwiki has nothing in block but has 4 (non-expired) blocks in ipblocks.

I don't have the log anymore. We'll see what happens when I run it again.

On enwiki beta, there are a few rows in ipblocks that are not in block. For example, there is a block against RebeccawlqphbcvaeFiora in ipblocks but not block. They all seem to have ipb_user=0, which I assume is wrong. Perhaps leftover from the actor migration project.

That's expected, if ipb_address is not an IP address or range.

Mentioned in SAL (#wikimedia-releng) [2024-02-19T02:57:45Z] <TimStarling> on deployment-deploy03 running foreachwiki maintenance/migrateBlocks.php again (T355034, T357366)

Did the maintenance scripts run correctly on beta? I notice that en_rtlwiki has nothing in block but has 4 (non-expired) blocks in ipblocks.

I don't have the log anymore. We'll see what happens when I run it again.

I filed T357877 for this. I ran migrateBlocks.php on en_rtlwiki.

Change 998625 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Switch block schema to read-new/write-both mode

https://gerrit.wikimedia.org/r/998625

Change 1006179 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Switch block schema to read-old/write-both mode

https://gerrit.wikimedia.org/r/1006179

Change 1006180 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Switch block schema to read-new/write-both mode

https://gerrit.wikimedia.org/r/1006180

Change 1006181 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Switch block schema to read-new/write-new mode

https://gerrit.wikimedia.org/r/1006181

Change 998626 merged by jenkins-bot:

[operations/mediawiki-config@master] beta: Switch block schema to read-new/write-new mode

https://gerrit.wikimedia.org/r/998626

Change 1006179 merged by jenkins-bot:

[operations/mediawiki-config@master] Switch block schema to read-old/write-both mode

https://gerrit.wikimedia.org/r/1006179

Mentioned in SAL (#wikimedia-operations) [2024-02-29T04:19:17Z] <tstarling@deploy2002> Synchronized wmf-config/CommonSettings.php: Switch block schema to read-old/write-both mode T355034 (duration: 08m 47s)

migrateBlocks.php has been running in production since 04:30 UTC, now up to itwiki.

I ran @dom_walden's SQL script P58700. It's apparent from the results that the script failed to complete on 34 wikis, exiting with a duplicate key error, which I missed because I didn't capture stderr from the script. The errors can be found in logstash.

The migration script simply doesn't check for rows that were already inserted into block/block_target due to write-both mode activity. It always tries to insert a new row into block even if there's already one there.

Change 1009939 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] migrateBlocks.php: Skip existing IDs

https://gerrit.wikimedia.org/r/1009939

Change 1009939 merged by jenkins-bot:

[mediawiki/core@master] migrateBlocks.php: Skip existing IDs

https://gerrit.wikimedia.org/r/1009939

Change 1010218 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@wmf/1.42.0-wmf.21] migrateBlocks.php: Skip existing IDs

https://gerrit.wikimedia.org/r/1010218

Change 1010218 merged by jenkins-bot:

[mediawiki/core@wmf/1.42.0-wmf.21] migrateBlocks.php: Skip existing IDs

https://gerrit.wikimedia.org/r/1010218

Mentioned in SAL (#wikimedia-operations) [2024-03-12T22:57:45Z] <tstarling@deploy2002> Started scap: Backport for [[gerrit:1010218|migrateBlocks.php: Skip existing IDs (T355034)]]

Mentioned in SAL (#wikimedia-operations) [2024-03-12T23:00:05Z] <tstarling@deploy2002> tstarling: Backport for [[gerrit:1010218|migrateBlocks.php: Skip existing IDs (T355034)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-03-12T23:40:22Z] <tstarling@deploy2002> Finished scap: Backport for [[gerrit:1010218|migrateBlocks.php: Skip existing IDs (T355034)]] (duration: 42m 36s)

I ran migrateBlocks.php again. Now the only mismatches are:

  • Rows with ipb_user=0, from T56458
  • Rows with a leading zero in one part of a dotted-quad address, which were previously misinterpreted as octal for ipb_range_start, now fixed
  • A batch of 100 blocks on eswiki was deleted from ipblocks but not block. I can't find any relevant errors -- probably the request was just killed before it finished deleting the batch. This is fine and will fix itself eventually. There is a backlog of 14800 expired blocks to delete on eswiki.

I checked the logs for errors with DatabaseBlockStore in the backtrace, but I couldn't find any that were bugs in DatabaseBlockStore. I did notice T307738 on gnwiki.

I'm ready to go to the next stage.

Change 1006180 merged by jenkins-bot:

[operations/mediawiki-config@master] Switch block schema to read-new/write-both mode

https://gerrit.wikimedia.org/r/1006180

Mentioned in SAL (#wikimedia-operations) [2024-03-14T04:23:48Z] <tstarling@deploy2002> Synchronized wmf-config/CommonSettings.php: reverting for now due to slow query T355034 (duration: 12m 28s)

I reverted the deployment of read-new mode due to slow queries. I analysed the slow query logs and found four categories of slow query errors, and I filed tasks for each: T360088, T360160, T360163, T360165. I will fix those bugs, and when the fixes reach production, we can continue with the deployment.

Change #1013635 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Switch block schema to read-new/write-both mode (attempt 2)

https://gerrit.wikimedia.org/r/1013635

Change #1013635 merged by jenkins-bot:

[operations/mediawiki-config@master] Switch block schema to read-new/write-both mode (attempt 2)

https://gerrit.wikimedia.org/r/1013635

Mentioned in SAL (#wikimedia-operations) [2024-03-25T02:06:31Z] <tstarling@deploy1002> Synchronized wmf-config/CommonSettings.php: Switch block schema to read-new/write-both mode T355034 (duration: 12m 53s)

Another slow query: T360864.

There was also this ApiQueryBlocks bkcontinue query that took 5.3 seconds, but the explain seems correct (bl_timestamp first). It's fast if you use a limit of 10, but with a limit of 500 it takes a while. It's searching for range blocks created before 2019-01-23, but there are only 398 so it goes right back to the start of the table, filtering the rows. I'm not filing a bug for this, but maybe we could try to tune it if it keeps coming up.

SELECT  bt_auto,bl_id,bl_timestamp,bt_address,bt_user,
	COALESCE(bt_address, bt_user_text) AS `bt_address_or_user_name`,
	actor_user,actor_name,bl_expiry,bl_anon_only,bl_create_account,
	bl_enable_autoblock,bl_block_email,bl_deleted,bl_allow_usertalk,
	bl_sitewide,comment_bl_reason.comment_text AS `bl_reason_text`,
	comment_bl_reason.comment_data AS `bl_reason_data`,
	comment_bl_reason.comment_id AS `bl_reason_cid`  
FROM `block` 
	JOIN `block_target` ON ((bt_id=bl_target)) 
	LEFT JOIN `user` `block_target_user` ON ((user_id=bt_user))
	JOIN `actor` ON ((actor_id=bl_by_actor)) 
	JOIN `comment` `comment_bl_reason` ON ((comment_bl_reason.comment_id = bl_reason_id))   
WHERE (bl_timestamp < '20190123220810' OR (bl_timestamp = '20190123220810' AND (bl_id <= 8799641)))
	AND (bl_expiry != 'infinity') 
	AND (bt_range_end > bt_range_start) 
	AND ((
		SELECT  1  FROM `block_target` `hu_block_target` 
		JOIN `block` `hu_block` ON ((hu_block.bl_target=hu_block_target.bt_id))   
		WHERE (hu_block_target.bt_user=block_target.bt_user) AND hu_block.bl_deleted = 1  
	) IS NULL) 
	AND (bl_expiry > '20240325020353')  
ORDER BY bl_timestamp DESC,bl_id DESC 
LIMIT 501;

There was a slow query from Special:ListUsers with a group filter, but the block table is apparently not at fault, except maybe for constant factors.

I'm leaving it deployed for now since the rate of slow queries is very low -- just one unambiguous log entry in the hour since deployment.

Change #1016066 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] WMCS: Read from the new block/block_target tables

https://gerrit.wikimedia.org/r/1016066

Change #1016066 merged by Tim Starling:

[operations/puppet@production] WMCS: Read from the new block/block_target tables

https://gerrit.wikimedia.org/r/1016066

Change #1016892 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] WMCS: Fix type of ipb_range_start and ipb_range_end in the b/c view

https://gerrit.wikimedia.org/r/1016892

Change #1016892 merged by Tim Starling:

[operations/puppet@production] WMCS: Fix type of ipb_range_start and ipb_range_end in the b/c view

https://gerrit.wikimedia.org/r/1016892

Mentioned in SAL (#wikimedia-operations) [2024-04-04T01:07:41Z] <TimStarling> on clouddb1020 running maintain-views --all-databases --replace-all --auto-depool (T355034)

Mentioned in SAL (#wikimedia-operations) [2024-04-04T02:48:57Z] <TimStarling> ran maintain-views on clouddb1013-1019 (T355034)

tstarling updated the task description. (Show Details)

Change #1006181 merged by jenkins-bot:

[operations/mediawiki-config@master] Switch block schema to read-new/write-new mode

https://gerrit.wikimedia.org/r/1006181

Mentioned in SAL (#wikimedia-operations) [2024-04-08T23:25:33Z] <tstarling@deploy1002> Synchronized wmf-config/CommonSettings.php: stop writing to ipblocks table T355034 (duration: 12m 32s)

There were a few duplicate key errors while the change was half-deployed, the last at 23:23:00.

tstarling updated the task description. (Show Details)