T72899 Search box needs some normalization for Arabic Family languages

Closed, Resolved · Public · 5 Estimated Story Points · Feature

Description

Some languages, such as Arabic, Persian, Urdu, Kurdish, ..., share common letters that have similar glyphs but different Unicode code points. For example:
for ک (Kaf)
ك Arabic U+0643
ڪ Urdu U+06AA
ﻙ Pushtu U+FED9
ﻚ Uyghur U+FEDA
ک Persian U+06A9
for ی (ya)
ی Persian U+06CC
ي Arabic U+064A
ى Urdu U+0649
ۍ Pushtu U+06CD
ې Uyghur U+06D0
for ه (heh)
ہ Pushtu U+06C1
ە Kurdish U+06D5
ه Persian U+0647
These characters have different Unicode code points and are typed with different keyboard layouts.
Many users do not have a Persian or Urdu keyboard available by default in their OS (such as Windows XP, older versions of Android, iOS, ...), so when they search for an article in the Wikipedia search box, they cannot find it, even though it exists under a title written with the local characters.

For example, if you search fa.wikipedia for the article ويليام شكسپير (typed with Arabic ي , ك), you cannot find it, because the article title in Farsi is ویلیام شکسپیر (written with Persian ی , ک).

For Farsi, please make it possible for the search tool to assume:
U+064A or U+0649 or U+06CD or U+06D0 or U+06CC > U+06CC
U+0643 or U+06AA or U+FED9 or U+FEDA > U+06A9
U+06C1 or U+06D5 > U+0647
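
A minimal sketch of the folding requested above, in Python purely for illustration (the search backend is not Python, and `REQUESTED`, `PERSIAN_FOLD` and `fold_persian` are hypothetical names). It applies the three rules character by character and checks them against the Shakespeare example:

```python
# Illustration only: fold look-alike code points to the Persian forms
# requested above. All names here are hypothetical.
REQUESTED = {
    0x06CC: (0x064A, 0x0649, 0x06CD, 0x06D0),  # yeh variants -> U+06CC
    0x06A9: (0x0643, 0x06AA, 0xFED9, 0xFEDA),  # kaf variants -> U+06A9
    0x0647: (0x06C1, 0x06D5),                  # heh variants -> U+0647
}

# str.translate expects {source code point: replacement code point}.
PERSIAN_FOLD = {src: dst for dst, srcs in REQUESTED.items() for src in srcs}


def fold_persian(text: str) -> str:
    """Return text with the listed variants replaced by the Persian forms."""
    return text.translate(PERSIAN_FOLD)


# "William Shakespeare" typed with Arabic yeh (U+064A) and kaf (U+0643).
typed = "\u0648\u064A\u0644\u064A\u0627\u0645 \u0634\u0643\u0633\u067E\u064A\u0631"
# The fa.wikipedia spelling, with Persian yeh (U+06CC) and kaf (U+06A9).
title = "\u0648\u06CC\u0644\u06CC\u0627\u0645 \u0634\u06A9\u0633\u067E\u06CC\u0631"

print(fold_persian(typed) == title)  # prints True
```

For search to benefit, the same folding would have to be applied to both the indexed titles and the query.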


Version: unspecified
Severity: enhancement

Event Timeline

bzimport raised the priority of this task from to Medium. Nov 22 2014, 3:47 AM
bzimport set Reference to bz70899.
bzimport added a subscriber: Unknown Object (MLST).

Yes, we have the same problem on ckb Wikipedia. This would be useful.

Is this request about CirrusSearch or about LuceneSearch (deprecated)?

(In reply to Andre Klapper from comment #4)

Is this request about CirrusSearch or about LuceneSearch (deprecated)?

We need normalization for the search box that is placed at the top of every page.

Is this request about CirrusSearch or about LuceneSearch (deprecated)?

I meant CirrusSearch

I'm having a hard time understanding the scope of this task. Could @TJones help? :-)

We need to specify the list of languages we are trying to do this for. The description mentions Persian, but alludes to Arabic, Urdu, Kurdish, Pushtu, and Uyghur, and the comments mention Sorani (ckb).

For each language, we'd need to figure out the exact mappings. (Persian is listed above, but I'd want to double check that the correspondences work in the other direction for each language, and there may be other glyphs that are relevant to other languages but not relevant to Persian, and so not listed here.) There may be more detail available in the GitHub repos listed, but I haven't looked closely.

Then we have to figure out which language analyzers are being used for each language. Arabic, Persian, and Sorani have their own analyzers. The fallbacks (T147959), though very imperfect, are the status quo, so we'd have to see what's going on there.

For each analyzer being used (possibly including the default), we'd need to unpack the built-in ES analyzer so we can modify it. Doing this for French and others has given unexpected results—generally not bad, mostly improvements, with the few regressions being readily fixable. Figuring all that out requires testing per language, and I'd really want to be careful the first time we did it to an Arabic-script analyzer.

After the unpacking, actually setting up the mapping is very little work.
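
To illustrate why the mapping itself is the easy part: in Elasticsearch, a character mapping like this is typically a `mapping` character filter placed in front of the tokenizer of the unpacked analyzer. The sketch below builds such index settings as a Python dict. The analyzer chain follows the "rebuilt" `persian` analyzer as described in the Elasticsearch documentation (worth verifying against the deployed version); the filter name `variant_fold` and the rule set are assumptions for illustration, not the actual CirrusSearch configuration.

```python
import json

# Assumed rule set (a subset of the task description): source -> target.
FOLD = {0x064A: 0x06CC, 0x0649: 0x06CC, 0x0643: 0x06A9, 0x06C1: 0x0647}


def mapping_rules(fold):
    """Render {src: dst} code points as 'src=>dst' mapping char filter rules."""
    return [f"{chr(src)}=>{chr(dst)}" for src, dst in sorted(fold.items())]


settings = {
    "analysis": {
        "char_filter": {
            # Hypothetical filter name; char filters run before tokenization.
            "variant_fold": {"type": "mapping", "mappings": mapping_rules(FOLD)},
            "zero_width_spaces": {
                "type": "mapping",
                "mappings": ["\\u200C=>\\u0020"],
            },
        },
        "filter": {"persian_stop": {"type": "stop", "stopwords": "_persian_"}},
        "analyzer": {
            "rebuilt_persian": {
                "tokenizer": "standard",
                "char_filter": ["variant_fold", "zero_width_spaces"],
                "filter": [
                    "lowercase",
                    "decimal_digit",
                    "arabic_normalization",
                    "persian_normalization",
                    "persian_stop",
                ],
            }
        },
    }
}

print(json.dumps(settings, indent=2))
```

The one-line `char_filter` entry is the whole mapping; the rest of the settings is the unpacking that has to be tested per language.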

Once that's done, the wikis in question need to be re-indexed. Arabic Wikipedia has already been done for BM25, and others may happen before we work on this, in which case the change going live could take a while—until the next time we re-index. (Though that's something we need to get better at being able to do, and doing it for the projects in a handful of languages is less effort than doing it for almost everything, as we are with BM25.)

Does that help?

TJones lowered the priority of this task from Medium to Low. Aug 27 2020, 8:04 PM
Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:12 AM
Restricted Application added a subscriber: Huji. Feb 4 2022, 11:12 AM
TJones raised the priority of this task from Low to Medium. Jul 28 2022, 3:26 PM
TJones set the point value for this task to 3.
TJones moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.
TJones changed the point value for this task from 3 to 5. Apr 29 2024, 9:27 PM

Change #1024764 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Normalize Arabic variants of kaf, yeh, heh

https://gerrit.wikimedia.org/r/1024764

I looked at all of these as best I could, and I decided on a general mapping to standard Arabic forms internally. Arabic does that to some degree, as does Persian! And for the languages without custom stemmers and stop word filters, the character used internally doesn't matter, as long as the desired words can find each other.
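
A small Python sketch of why the internal target form doesn't matter when there is no stemmer or stop word list that depends on it: as long as indexed text and queries are folded the same way, the variants find each other. The rule set here is an assumed subset chosen for illustration (the real one is in the patch), and `TO_ARABIC` and `key` are hypothetical names.

```python
# Assumed folding toward standard Arabic forms, for illustration only.
TO_ARABIC = str.maketrans({
    "\u06A9": "\u0643",  # keheh -> kaf
    "\u06AA": "\u0643",  # swash kaf -> kaf
    "\u06CC": "\u064A",  # Farsi yeh -> yeh
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u06C1": "\u0647",  # heh goal -> heh
})


def key(text: str) -> str:
    """Internal form used for matching; never shown to the user."""
    return text.translate(TO_ARABIC)


# "Shakespeare" as spelled on fa.wikipedia (Persian keheh and Farsi yeh).
titles = ["\u0634\u06A9\u0633\u067E\u06CC\u0631"]
index = {key(t): t for t in titles}  # fold at index time

# The same word typed on an Arabic keyboard (kaf and yeh).
query = "\u0634\u0643\u0633\u067E\u064A\u0631"
print(index.get(key(query)) == titles[0])  # fold at query time; prints True
```

The stored title is returned unchanged; only the lookup key uses the Arabic forms.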

Things were complicated by the special characters for isolated, initial, medial, and final forms that many Arabic letters have, but I got it under control in the end!
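
For illustration, the positional forms are separate code points in the Arabic Presentation Forms-B block (U+FED9 and U+FEDA from the description are the isolated and final forms of kaf). They are compatibility variants of the base letter, so one standard way to fold them is Unicode NFKC normalization, shown in this Python sketch. This is only a demonstration of the character relationships, not necessarily how the patch handles them.

```python
import unicodedata

# The four positional presentation forms of Arabic kaf (U+0643).
for cp in (0xFED9, 0xFEDA, 0xFEDB, 0xFEDC):
    ch = chr(cp)
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{cp:04X} {unicodedata.name(ch)} -> U+{ord(folded):04X}")

# Persian keheh (U+06A9) is a distinct letter, not a presentation form,
# so NFKC leaves it unchanged and it still needs an explicit mapping.
assert unicodedata.normalize("NFKC", "\u06A9") == "\u06A9"
```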

Sorani/Central Kurdish (ckb) didn't need any normalization. It already normalizes to the Persian forms (though Persian often doesn't!) and didn't have any benefit from the custom normalization I tried to brew up for it—plus a lot of things broke!

More details in my full write up on MediaWiki.

Change #1024764 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Normalize Arabic variants of kaf, yeh, heh

https://gerrit.wikimedia.org/r/1024764