⚓ T357050 editResponseTime's port to statslib is not actually backwards-compatible
Page MenuHomePhabricator

editResponseTime's port to statslib is not actually backwards-compatible
Closed, ResolvedPublic

Description

Pretty exactly synchronized with the rollout of wmf.17 to group1 -- and then with its rollback later -- we saw a decrease in the graphite metric MediaWiki.timing.editResponseTime.sample_rate.
https://grafana.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status?orgId=1&from=1707408780000&to=1707419730000&viewPanel=10

This metric is used on the Edit Count dashboard, on the front page of grafana showing 5 high-level indicators of site functioning status, and is also displayed on our public site status page https://www.wikimediastatus.net.

So we should probably fix whatever issue with the transition patch causes it to not be backwards-compatible by the current clients.

Event Timeline

CDanis triaged this task as High priority.Feb 8 2024, 7:03 PM
CDanis created this task.
brennen raised the priority of this task from High to Unbreak Now!.Feb 8 2024, 7:05 PM
brennen subscribed.

Raising to UBN per train blocker policy.

colewhite changed the task status from Open to In Progress.Feb 8 2024, 7:11 PM
colewhite claimed this task.

My understanding is that sample_rate is generated by statsd rather than our direct code, so I don't understand how we could provide back-compat. with this migration; we could shift the monitoring to a different metric that is emitted in both, perhaps? But that's then measuring something else.

I am unable to replicate this behavior in phpunit tests. It seems something more than a simple bug is occurring. I recommend we revert the transition patch to unblock the train and continue investigating.

	public function testCopyToStatsdAtArray() {
		$m = new StatsFactory( new StatsCache, new NullEmitter, new NullLogger );
		$statsdMock = $this->createMock( IBufferingStatsdDataFactory::class );

		$statsdMock->expects( $this->exactly( 2 ) )
			->method( 'timing' )
			->withConsecutive( [ 'test.timing.1', 1.0 ], [ 'test.timing.2', 1.0 ] );

		$m->withComponent('WikiMediaEvents')
			->withStatsdDataFactory( $statsdMock )
			->getTiming( 'testMetricTiming' )
			->copyToStatsdAt( ['test.timing.1', 'test.timing.2'] )
			->observe( 1 );
	}

Change 998971 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[mediawiki/extensions/WikimediaEvents@master] Revert "Migrate `editResponseTime` metric to Prometheus store"

https://gerrit.wikimedia.org/r/998971

Created revert patch, please check me on that...

Change 998972 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[mediawiki/extensions/WikimediaEvents@wmf/1.42.0-wmf.17] Revert "Migrate `editResponseTime` metric to Prometheus store"

https://gerrit.wikimedia.org/r/998972

Change 998972 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@wmf/1.42.0-wmf.17] Revert "Migrate `editResponseTime` metric to Prometheus store"

https://gerrit.wikimedia.org/r/998972

Mentioned in SAL (#wikimedia-operations) [2024-02-08T20:45:58Z] <brennen@deploy2002> Started scap: Backport for [[gerrit:998972|Revert "Migrate editResponseTime metric to Prometheus store" (T357050)]]

Mentioned in SAL (#wikimedia-operations) [2024-02-08T20:47:30Z] <brennen@deploy2002> brennen: Backport for [[gerrit:998972|Revert "Migrate editResponseTime metric to Prometheus store" (T357050)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-02-08T20:55:15Z] <brennen@deploy2002> Finished scap: Backport for [[gerrit:998972|Revert "Migrate editResponseTime metric to Prometheus store" (T357050)]] (duration: 09m 17s)

Change 998971 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Revert "Migrate `editResponseTime` metric to Prometheus store"

https://gerrit.wikimedia.org/r/998971

brennen lowered the priority of this task from Unbreak Now! to Needs Triage.Feb 8 2024, 9:14 PM

Change 999012 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] stats: ensure statsdDataFactory follows when setting component

https://gerrit.wikimedia.org/r/999012

colewhite triaged this task as High priority.EditedFeb 8 2024, 10:42 PM

Reproducing this issue locally was complicated because DeveloperSettings defaults $wgForceDeferredUpdatesPreSend = true;. After setting this to false, I was able to reproduce the problem locally.

The issue was that StatsFactory did not populate the $statsdDataFactory of the new StatsFactory instance returned from withComponent().

This was the first place we have tried using withComponent() and the bug appears limited to usage of that feature. Also affects wikibase.quality.constraints.* (T354909) - we'll probably want a backport

Change 998973 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/extensions/WikimediaEvents@master] Migrate `editResponseTime` metric to Prometheus store

https://gerrit.wikimedia.org/r/998973

Change 999012 merged by jenkins-bot:

[mediawiki/core@master] Stats: Ensure statsdDataFactory follows when setting component

https://gerrit.wikimedia.org/r/999012

Change 999022 had a related patch set uploaded (by Cwhite; author: Cwhite):

[mediawiki/core@master] stats: amend tests to validate changing statsdDataFactory

https://gerrit.wikimedia.org/r/999022

Change 998973 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Migrate `editResponseTime` metric to Prometheus store

https://gerrit.wikimedia.org/r/998973

Change 999022 merged by jenkins-bot:

[mediawiki/core@master] Stats: Amend tests to validate changing statsdDataFactory

https://gerrit.wikimedia.org/r/999022

Fix was deployed on 2/15 and the graphs still look ok. Considering this resolved!