⚓ T354758 Data not retrieved by IPInfo from IPoid for IPv6 addresses
Page MenuHomePhabricator

Data not retrieved by IPInfo from IPoid for IPv6 addresses
Closed, ResolvedPublicBUG REPORT

Description

What is the problem?

IPInfo does not retrieve info from IPoid for IPv6 addresses. The fields returned in the /w/rest.php/ipinfo/v0/revision/ request are all null.

It seems to make the request to /feed/v1/ip/<ip> using capital letters in IPv6 addresses. IPoid does not recognise these, even if the lower case version of the same IP exists.

It seems like IPoid stores IP addresses in whatever format they are given. So if they are in lower case it stores them as lower case, and upper case as upper case. Perhaps they should be normalised, making sure we are consistent between IPInfo and IPoid in terms of our normalisation.

Steps to reproduce problem
  1. Copy the reproduction data locally, gzip it and save it in your tmp/ directory (e.g. as reprod.json.gz)
  2. In the ipoid directory, start docker (e.g docker compose up -d)
  3. docker compose exec web node -e "require('./create-users.js')();"
  4. docker compose exec web node -e "require('./init-db.js').init(true);"
  5. docker compose exec web ./diff.sh --today tmp/reprod.json.gz
  6. docker compose exec web ./import.sh 0
  7. In the LocalSettings.php file for your test wiki, put $wgIPInfoIpoidUrl = 'http://localhost:6927'; (if this doesn't work, try something like $wgIPInfoIpoidUrl = 'http://172.18.0.1:6927'; instead)
  8. Make an edit with the IP address 2001:0db8:0000:0000:0000:8a2e:0370:7334 (see instructions)
  9. Login as a user who has IPInfo enabled
  10. Open your browser tools and go to Special:Contributions/2001:0db8:0000:0000:0000:8a2e:0370:7334
  11. In the network tab, you will see a request like /w/rest.php/ipinfo/v0/revision/559?dataContext=infobox&language=en. Open this request in a new window (you should be able to right-click and and "copy url").

Expected behaviour: The section ipinfo-source-ipoid will have data in it (such as the screenshot below)

ipoid_data_which_should_be_shown.png (212×420 px, 11 KB)

Observed behaviour: This section will have only null values (such as the screenshot below)
ipoid_data_which_is_actually_shown.png (150×309 px, 7 KB)

Environment

ipoid commit 1814c4b30f20deaf569a6cf41b5d491135a0ae6d
CheckUser version CheckUser 2.5 (c72cefb) 07:32, 10 January 2024

Reproduction data
{"ip": "2001:0db8:0000:0000:0000:8a2e:0370:7334", "as": {}, "organization": "0", "location": {"city": false}, "client": {"behaviors": [null, "foo"]}, "risks": ["CALLBACK_PROXY"]}

Event Timeline

kostajh subscribed.

Seems like an important one to solve sooner than later. I think we want IPv6 searches to be case insensitive.

Note on the urgency and scope

I don't think an urgent fix is required here, because IPInfo and the Spur data seem to be using the same normalization.

But we should probably add support for callers passing in an IP in any format, and we should probably account for the Spur data containing any format.

Case sensitivity

String comparisons are case insensitive by default, but because actor_data.ip is type VARBINARY, the comparison is binary rather than string. For this specific bug, we can update the query in ipoid to do a string comparison.

This solves the case where IPInfo sends 2001:0db8:0000:0000:0000:8a2e:0370:7334 but ipoid has 2001:0DB8:0000:0000:0000:8A2E:0370:7334.

Normalization

However, IPInfo would currently actually send 2001:DB8:0:0:0:8A2F:370:7334, and it looks as though Spur data will contain that format too. (I had a little search around the DB and all the IPv6 addresses I saw were in this form, though the documentation doesn't seem to guarantee it.)

Solutions

We should clear up existing confusion and make this all robust for the future by:

  • Updating our test data in ipoid to use IPv6 addresses in the currently used format
  • Normalizing the IP from the API request
  • Normalizing the IP from the Spur data as we save it into actor_data

MediaWiki has good handling for IP address parsing in util.js, so we could make that into a little library that can be reused in ipoid. Or if there's an existing library that's small enough and secure enough we could use that in ipoid (and maybe later in MW) - e.g. https://www.npmjs.com/package/ipaddr.js

Change 990731 had a related patch set uploaded (by Tchanders; author: Tchanders):

[mediawiki/extensions/IPInfo@master] IPoidInfoRetriever: Look for normalized IP address in IPoid data

https://gerrit.wikimedia.org/r/990731

Change 990731 merged by jenkins-bot:

[mediawiki/extensions/IPInfo@master] IPoidInfoRetriever: Look for normalized IP address in IPoid data

https://gerrit.wikimedia.org/r/990731

The datasets contain mixed formats so we'll also need a maintenance script to fix existing entries in our DB before we can consider this task done. (See comments on https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/212)

This was reverted, then other work took precedence, so moving back to Ready. It's a goal in the next sprint in case we don't get to it before then.

Assigning to @STran, who I think has started work on it (please un-assign if I got that wrong!)

Change 1007290 had a related patch set uploaded (by STran; author: STran):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1007290

Change 1007290 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1007290

Change 1007562 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1007562

Change 1007562 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1007562

@STran I don't think IPInfo is retrieving information about IPv6 properly. With the below test data, if I make an edit as IP 2001:2::b706:5bd3:f2b:9e33 and look at the Special:Contributions, the ipoid data is all null. When I look in the logs I see:

[http] [2024-03-01T11:52:51+00:00] GET http://172.18.0.1:6927/feed/v1/ip/2001:2:0:0:B706:5BD3:F2B:9E33 HTTP/1.1 - 200 NULL
[IPInfo] ipoid results were not in the expected format: {"2001:2:0:0:b706:5bd3:f2b:9e33":{"ip":"2001:2:0:0:b706:5bd3:f2b:9e33","org":"Foobaz","client_count":0,"types":["UNKNOWN"],"conc_city":"","conc_state":"","conc_country":"","countries":0,"location_country":"IN","risks":["CALLBACK_PROXY"],"last_updated":1709293854,"proxies":["OXYLABS_PROXY"],"behaviors":[],"tunnels":[]}}

I see that ipoid has stored the IP 2001:2::b706:5bd3:f2b:9e33 in the database as 2001:2:0:0:b706:5bd3:f2b:9e33. This is different to how that IP is normalised by IPUtils.

The IPv4 IP in the test data (1.2.3.4) does work.

@STran I don't think IPInfo is retrieving information about IPv6 properly. With the below test data, if I make an edit as IP 2001:2::b706:5bd3:f2b:9e33 and look at the Special:Contributions, the ipoid data is all null. When I look in the logs I see:

[http] [2024-03-01T11:52:51+00:00] GET http://172.18.0.1:6927/feed/v1/ip/2001:2:0:0:B706:5BD3:F2B:9E33 HTTP/1.1 - 200 NULL
[IPInfo] ipoid results were not in the expected format: {"2001:2:0:0:b706:5bd3:f2b:9e33":{"ip":"2001:2:0:0:b706:5bd3:f2b:9e33","org":"Foobaz","client_count":0,"types":["UNKNOWN"],"conc_city":"","conc_state":"","conc_country":"","countries":0,"location_country":"IN","risks":["CALLBACK_PROXY"],"last_updated":1709293854,"proxies":["OXYLABS_PROXY"],"behaviors":[],"tunnels":[]}}

I see that ipoid has stored the IP 2001:2::b706:5bd3:f2b:9e33 in the database as 2001:2:0:0:b706:5bd3:f2b:9e33. This is different to how that IP is normalised by IPUtils.

The IPv4 IP in the test data (1.2.3.4) does work.

Maybe instead of ipaddr.js library, we should copy/paste mw.util.prettifyIP and make use of that in ipoid:

prettifyIP: function ( ip ) {
		ip = this.sanitizeIP( ip );
		if ( ip === null ) {
			return null;
		}
		if ( this.isIPv6Address( ip, true ) ) {
			var cidr, matches, ipCidrSplit, i, replaceZeros;
			if ( ip.indexOf( '/' ) !== -1 ) {
				ipCidrSplit = ip.split( '/', 2 );
				ip = ipCidrSplit[ 0 ];
				cidr = ipCidrSplit[ 1 ];
			} else {
				cidr = '';
			}
			matches = ip.match( /(?:^|:)0(?::0)+(?:$|:)/g );
			if ( matches ) {
				replaceZeros = matches[ 0 ];
				for ( i = 1; i < matches.length; i++ ) {
					if ( matches[ i ].length > replaceZeros.length ) {
						replaceZeros = matches[ i ];
					}
				}
			}
			ip = ip.replace( replaceZeros, '::' );

			if ( cidr !== '' ) {
				ip = ip.concat( '/', cidr );
			}
			ip = ip.toLowerCase();
		}
		return ip;
	},

That should have parity with IPUtils' PHP implementation.

Maybe instead of ipaddr.js library, we should copy/paste mw.util.prettifyIP and make use of that in ipoid:

toString() might do the trick instead of toNormalizedString() (https://github.com/whitequark/ipaddr.js/blob/dbd4c4f2f30aa82e166d0dfa25b2f5ccd3fc25be/README.md#object-representation)

Maybe instead of ipaddr.js library, we should copy/paste mw.util.prettifyIP and make use of that in ipoid:

toString() might do the trick instead of toNormalizedString() (https://github.com/whitequark/ipaddr.js/blob/dbd4c4f2f30aa82e166d0dfa25b2f5ccd3fc25be/README.md#object-representation)

Well-spotted. It seems to work for the one use case I looked at. But I am unsure if we might be safer by copy/pasting mw.util.prettifyIP to avoid any further discrepancies.

Chatted with Kosta 1:1 and I'm a fan of removing dependencies wherever possible and pushed up a patch implementing the IPUtils approach.

Change 1008959 had a related patch set uploaded (by Tchanders; author: Tchanders):

[mediawiki/extensions/IPInfo@master] IPoidInfoRetriever: Handle returned IP address in any format

https://gerrit.wikimedia.org/r/1008959

@STran I don't think IPInfo is retrieving information about IPv6 properly. With the below test data, if I make an edit as IP 2001:2::b706:5bd3:f2b:9e33 and look at the Special:Contributions, the ipoid data is all null. When I look in the logs I see:

[http] [2024-03-01T11:52:51+00:00] GET http://172.18.0.1:6927/feed/v1/ip/2001:2:0:0:B706:5BD3:F2B:9E33 HTTP/1.1 - 200 NULL
[IPInfo] ipoid results were not in the expected format: {"2001:2:0:0:b706:5bd3:f2b:9e33":{"ip":"2001:2:0:0:b706:5bd3:f2b:9e33","org":"Foobaz","client_count":0,"types":["UNKNOWN"],"conc_city":"","conc_state":"","conc_country":"","countries":0,"location_country":"IN","risks":["CALLBACK_PROXY"],"last_updated":1709293854,"proxies":["OXYLABS_PROXY"],"behaviors":[],"tunnels":[]}}

I see that ipoid has stored the IP 2001:2::b706:5bd3:f2b:9e33 in the database as 2001:2:0:0:b706:5bd3:f2b:9e33. This is different to how that IP is normalised by IPUtils.

The IPv4 IP in the test data (1.2.3.4) does work.

It seems to me that the main problem here is the inflexibility of IPInfo's IPoidInfoRetriever, which can only handle one IP address format:

if ( $response->isOK() ) {
	$content = json_decode( $request->getContent(), true );
	$ipInIpoidFormat = IPUtils::prettifyIP( $ip );
	if ( is_array( $content ) && is_array( $content[$ipInIpoidFormat] ) ) {
		$data = $content[$ipInIpoidFormat];
	} else {
		...

The patch in https://gerrit.wikimedia.org/r/1008959 should address this.

I think it's worth making this change regardless of whether we make the change to ipoid also, because that way ipoid isn't bound by IPInfo's expectations, in case we ever have a reason to change the format.

In general I think we should expect both the service and those consuming the service to be agnostic of each other's preferred IP formats.

I have an alternate proposal, in which ipoid is flexible in the format of the IPv6 IPs it accepts. I was fine with normalizing ipoid to mediawiki's sanitized IP output, since mediawiki is the only consumer of ipoid data and I think those are all standardized using its utilities but It's probably good to think about future flexibility and therefore it's probably good if:

  1. ipoid should be able to match an IPv6 IP regardless of the format it comes in
  2. ipoid shouldn't return the IP in a different format than what was passed in

This patch makes it so that any IPv6 gets normalized on lookup so ipoid can find it and then on output sends back the IP the user requested.

This isn't an either-or proposal and it might be good to do this in both repos.

  1. ipoid should be able to match an IPv6 IP regardless of the format it comes in

I 100% agree with this. IPoid should be able to handle any valid IPv6 regardless of the format

  1. ipoid shouldn't return the IP in a different format than what was passed in

As long as this is clearly documented, I think this should be okay. I say this because, while in theory a user would be using the same string they provided in the request to parse the response, there could be a situation where they decided to send an IPv6 in a specific format but then not used this format when parsing the response.

there could be a situation where they decided to send an IPv6 in a specific format but then not used this format when parsing the response.

I think that's on the consumer then. IMO it's odder if you ask for 2e91:0:76ea:0:863d:1c62:: and ipoid sends back 2e91:0:76ea:0:863d:1c62:0:0. This is only for the single-IP return. The prefix search would still return IPs in the format ipoid stores them in (atm expanded) and it would still be up to the consumer to condense them if necessary.

Change 1008959 merged by jenkins-bot:

[mediawiki/extensions/IPInfo@master] IPoidInfoRetriever: Handle returned IP address in any format

https://gerrit.wikimedia.org/r/1008959

IPInfo appears to now be able to retrieve information about IPv6 addresses from ipoid.

Test environment: local docker IP Info 0.0.0 (bfe7766) 07:23, 8 March 2024. ipoid commit 8ebe1c299b9052f4852017339acde6b61746edc6.

Just drawing attention that this is still open.

Change 1010524 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1010524

Change 1010524 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1010524