Search box needs some normalization for Arabic Family languages
Closed, Resolved · Public · 5 Estimated Story Points · Feature

Description

We have some languages, such as Arabic, Persian, Urdu, Kurdish, ..., that share common characters whose similar-looking glyphs have different Unicode code points. For example:
for ک (Kaf)
ك Arabic U+0643
ڪ Urdu U+06AA
ﻙ Pushtu U+FED9
ﻚ Uyghur U+FEDA
ک Persian U+06A9
for ی (ya)
ی Persian U+06CC
ي Arabic U+064A
ى Urdu U+0649
ۍ Pushtu U+06CD
ې Uyghur U+06D0
for ه (heh)
ہ Pushtu U+06C1
ە Kurdish U+06D5
ه Persian U+0647
These characters have different Unicode code points and are typed with different keyboard layouts.
Many users do not have access to a Persian or Urdu keyboard by default in their OS (such as Windows XP, older versions of Android, iOS, ...). So when they search for an article, they cannot find it through the Wikipedia search box, even though it exists under a title written in the local characters.

For example, if you search fa.wikipedia for the article ويليام شكسپير (typed with the Arabic characters ي and ك), you cannot find it, because the article title in Farsi is ویلیام شکسپیر (written with the Persian characters ی and ک).

For Farsi, please make it possible for the search tool to treat these as equivalent:
U+064A or U+0649 or U+06CD or U+06D0 or U+06CC > U+06CC
U+0643 or U+06AA or U+FED9 or U+FEDA > U+06A9
U+06C1 or U+06D5 > U+0647
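
The folding requested above can be sketched as a plain character translation table. This is only an illustration of the listed mapping, not how CirrusSearch implements anything; the names `FA_FOLD` and `fold_for_farsi` are made up for this example.

```python
# Illustrative only: fold the variants listed in this task to the Persian forms.
# FA_FOLD and fold_for_farsi are hypothetical names, not CirrusSearch code.
FA_FOLD = str.maketrans({
    # yeh variants -> U+06CC
    "\u064A": "\u06CC",
    "\u0649": "\u06CC",
    "\u06CD": "\u06CC",
    "\u06D0": "\u06CC",
    # kaf variants -> U+06A9
    "\u0643": "\u06A9",
    "\u06AA": "\u06A9",
    "\uFED9": "\u06A9",
    "\uFEDA": "\u06A9",
    # heh variants -> U+0647
    "\u06C1": "\u0647",
    "\u06D5": "\u0647",
})


def fold_for_farsi(text: str) -> str:
    """Replace each listed variant with its Persian target character."""
    return text.translate(FA_FOLD)


# The Shakespeare example from the description, written with code point escapes.
arabic_typed = "\u0648\u064A\u0644\u064A\u0627\u0645 \u0634\u0643\u0633\u067E\u064A\u0631"
persian_title = "\u0648\u06CC\u0644\u06CC\u0627\u0645 \u0634\u06A9\u0633\u067E\u06CC\u0631"
print(fold_for_farsi(arabic_typed) == persian_title)  # prints True
```

Applying the same folding to both the query and the indexed titles would let the Arabic-keyboard spelling match the Persian title.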

Version: unspecified
Severity: enhancement

Details

Reference: bz70899

	Subject	Repo	Branch	Lines +/-
	Normalize Arabic variants of kaf, yeh, heh	mediawiki/extensions/CirrusSearch	master	+7 K -315


Related Objects

Mentioned In: T363734: Reindex all wikis to enable dotted I fix, Yiddish ligatures, maybe Arabic normalization
T272606: [EPIC] Unpack all Elasticsearch analyzers
Mentioned Here: T147959: Generic language fallbacks in Mediawiki should not be used for Elasticsearch language analyzers

Event Timeline

• bzimport raised the priority of this task from to Medium. Nov 22 2014, 3:47 AM

• bzimport added projects: MediaWiki-Search, I18n.

• bzimport set Reference to bz70899.

• bzimport added a subscriber: Unknown Object (MLST).

Yamaha5 created this task. Sep 16 2014, 7:17 PM

Yes, we have the same problem on ckb Wikipedia. This would be useful.

Maybe for fa.wikipedia or ckb.wikipedia we need some normalization like

https://github.com/wikimedia/mediawiki-core/blob/master/languages/classes/LanguageAr.php

and https://github.com/wikimedia/mediawiki-core/blob/master/maintenance/language/generateNormalizerDataAr.php

Is this request about CirrusSearch or about LuceneSearch (deprecated)?

(In reply to Andre Klapper from comment #4)

Is this request about CirrusSearch or about LuceneSearch (deprecated)?

We need normalization for the search box at the top of each page.

Liuxinyu970226 subscribed. Mar 4 2015, 12:37 AM

Yamaha5 added a project: CirrusSearch. Jun 4 2016, 5:26 PM

Restricted Application added projects: Discovery-ARCHIVED, Discovery-Search. Jun 4 2016, 5:26 PM

In T72899#752615, @Aklapper wrote:

Is this request about CirrusSearch or about LuceneSearch (deprecated)?

I meant CirrusSearch

I'm having a hard time understanding the scope of this task. Could @TJones help? :-)

We need to specify the list of languages we are trying to do this for. The description mentions Persian, but alludes to Arabic, Urdu, Kurdish, Pushtu, and Uyghur, and the comments mention Sorani (ckb).

For each language, we'd need to figure out the exact mappings. (Persian is listed above, but I'd want to double check that the correspondences work in the other direction for each language. There may also be other glyphs that are relevant to other languages but not to Persian, and so not listed here.) There may be more detail available in the GitHub repos listed, but I haven't looked closely.

Then we have to figure out which language analyzers are being used for each language. Arabic, Persian, and Sorani have their own analyzers. The fallbacks (T147959), though very imperfect, are the status quo, so we'd have to see what's going on there.

For each analyzer being used (possibly including the default), we'd need to unpack the built-in ES analyzer so we can modify it. Doing this for French and others has given unexpected results—generally not bad, mostly improvements, with the few regressions being readily fixable. Figuring all that out requires testing per language, and I'd really want to be careful the first time we did it to an Arabic-script analyzer.

After the unpacking, actually setting up the mapping is very little work.
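
As a rough sketch of what "setting up the mapping" could look like once an analyzer is unpacked, the snippet below builds index settings that use Elasticsearch's `mapping` character filter. The filter name `kaf_yeh_heh_fold`, the analyzer name `text_unpacked`, the fold table, and the token filter chain are all placeholders for illustration; a real unpacked language analyzer would carry its own stemmer and stop word filters, and this is not the configuration CirrusSearch generates.

```python
import json

# Hypothetical fold table (variant -> target), using a few pairs from the description.
FOLDS = {
    "\u064A": "\u06CC",  # Arabic yeh -> Farsi yeh
    "\u0649": "\u06CC",  # alef maksura -> Farsi yeh
    "\u0643": "\u06A9",  # Arabic kaf -> keheh
    "\u06AA": "\u06A9",  # swash kaf -> keheh
}


def mapping_rules(folds):
    """Render pairs in the "source => target" syntax of the mapping char filter."""
    return [f"{src} => {dst}" for src, dst in folds.items()]


settings = {
    "analysis": {
        "char_filter": {
            # Placeholder name; "mapping" is the built-in char filter type.
            "kaf_yeh_heh_fold": {"type": "mapping", "mappings": mapping_rules(FOLDS)},
        },
        "analyzer": {
            # Placeholder analyzer: a custom analyzer with the fold applied first.
            "text_unpacked": {
                "type": "custom",
                "char_filter": ["kaf_yeh_heh_fold"],
                "tokenizer": "standard",
                "filter": ["lowercase"],  # stand-in for the real filter chain
            },
        },
    }
}

print(json.dumps(settings, indent=2))
```

The character filter runs before tokenization, so queries and indexed text are folded the same way once the index has been rebuilt with these settings.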

Once that's done, the wikis in question need to be re-indexed. Arabic Wikipedia has already been done for BM25, and others may happen before we work on this, in which case the change going live could take a while—until the next time we re-index. (Though that's something we need to get better at being able to do, and doing it for the projects in a handful of languages is less effort than doing it for almost everything, as we are with BM25.)

Does that help?

debt moved this task from needs triage to search-icebox on the Discovery-Search board. Oct 27 2016, 8:47 PM

Amire80 moved this task from Untriaged to Search on the I18n board. Mar 25 2018, 6:29 AM

Restricted Application added a subscriber: alaa. Mar 25 2018, 6:29 AM

TJones moved this task from search-icebox to Language Stuff on the Discovery-Search board. Jan 30 2019, 10:57 PM

TJones lowered the priority of this task from Medium to Low. Aug 27 2020, 8:04 PM

TJones mentioned this in T272606: [EPIC] Unpack all Elasticsearch analyzers. Mar 17 2021, 9:09 PM

Aklapper changed the subtype of this task from "Task" to "Feature Request". Feb 4 2022, 11:12 AM

Aklapper removed subscribers: • wikibugs-l-list, • Manybubbles.

Restricted Application added a subscriber: Huji. Feb 4 2022, 11:12 AM

TJones raised the priority of this task from Low to Medium. Jul 28 2022, 3:26 PM

TJones claimed this task. Apr 24 2024, 1:17 PM

TJones edited projects, added Discovery-Search (Current work); removed Discovery-Search.

TJones set the point value for this task to 3.

TJones moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

TJones changed the point value for this task from 3 to 5. Apr 29 2024, 9:27 PM

Change #1024764 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Normalize Arabic variants of kaf, yeh, heh

https://gerrit.wikimedia.org/r/1024764

gerritbot added a project: Patch-For-Review. Apr 30 2024, 8:03 PM

TJones moved this task from In Progress to Needs review on the Discovery-Search (Current work) board. Apr 30 2024, 8:09 PM

I looked at all of these as best I could, and I decided on a general mapping to standard Arabic forms internally. Arabic does that to some degree, as does Persian! And for the languages without custom stemmers and stop word filters, the character used internally doesn't matter, as long as the desired words can find each other.

Things were complicated by the special characters for isolated, initial, medial, and final forms that many Arabic letters have, but I got it under control in the end!
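
To illustrate the two ideas above (collapsing positional presentation forms, then folding variants to the standard Arabic letters), here is a minimal sketch using Unicode NFKC normalization followed by a translation table. It is an assumption-laden toy, not the merged CirrusSearch change: the table `TO_ARABIC` and the function `fold_to_arabic` are hypothetical names, the variant list is simply the one from the task description, and NFKC also rewrites other compatibility characters, which a production analyzer may not want.

```python
import unicodedata

# Hypothetical table: variants from the task description -> standard Arabic letters.
TO_ARABIC = str.maketrans({
    "\u06A9": "\u0643",  # keheh -> kaf
    "\u06AA": "\u0643",  # swash kaf -> kaf
    "\u06CC": "\u064A",  # Farsi yeh -> yeh
    "\u0649": "\u064A",  # alef maksura -> yeh (as grouped in the description)
    "\u06CD": "\u064A",  # yeh with tail -> yeh
    "\u06D0": "\u064A",  # e -> yeh
    "\u06C1": "\u0647",  # heh goal -> heh
    "\u06D5": "\u0647",  # ae -> heh
})


def fold_to_arabic(text: str) -> str:
    """Collapse presentation forms via NFKC, then fold variants to Arabic letters."""
    # NFKC maps isolated/initial/medial/final presentation forms (for example
    # U+FED9 and U+FEDA) back to their base letter (U+0643).
    base = unicodedata.normalize("NFKC", text)
    return base.translate(TO_ARABIC)


arabic_spelling = "\u0648\u064A\u0644\u064A\u0627\u0645 \u0634\u0643\u0633\u067E\u064A\u0631"
persian_spelling = "\u0648\u06CC\u0644\u06CC\u0627\u0645 \u0634\u06A9\u0633\u067E\u06CC\u0631"
print(fold_to_arabic(arabic_spelling) == fold_to_arabic(persian_spelling))  # prints True
```

Because both spellings end up as the same internal string, it does not matter which form is chosen as the target, as long as the desired words can find each other.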

Sorani/Central Kurdish (ckb) didn't need any normalization. It already normalizes to the Persian forms (though Persian often doesn't!) and didn't have any benefit from the custom normalization I tried to brew up for it—plus a lot of things broke!

More details in my full write up on MediaWiki.

TJones mentioned this in T363734: Reindex all wikis to enable dotted I fix, Yiddish ligatures, maybe Arabic normalization. Apr 30 2024, 9:25 PM

Change #1024764 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Normalize Arabic variants of kaf, yeh, heh

https://gerrit.wikimedia.org/r/1024764

ReleaseTaggerBot added a project: MW-1.43-notes (1.43.0-wmf.4; 2024-05-07). May 1 2024, 8:00 PM

Maintenance_bot removed a project: Patch-For-Review. May 1 2024, 8:30 PM

dr0ptp4kt moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board. May 13 2024, 3:07 PM

EBernhardson moved this task from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board. Jun 4 2024, 7:32 PM

Gehel closed this task as Resolved. Jun 7 2024, 9:25 AM