
1 Introduction

Arabic is the largest Semitic language today. It is used by more than 422 million people around the world as a first or second language, which makes it the fifth most spoken language in the world.

The Arabic writing system consists of 28 letters, represented by 36 characters because two letters have more than one formFootnote 1. Unlike Latin scripts, Arabic is always written in a cursive style, from right to left (RTL), where most of the letters are joined together and there are no upper case letters.

Besides letters, the writing system includes marks carrying phonetic information, known as diacritics: small marks placed above or below most of the letters. They are represented as additional Arabic characters in UTF-8 encoding. There are eight diacritics in Modern Standard Arabic (MSA), arranged into three main groups:

  • Short vowels. Three marks: Fatha, Damma, Kasra.

  • Doubled case endings (Tanween). Three marks: Tanween Fath (Fathatan), Tanween Damm (Dammatan), Tanween Kasr (Kasratan).

  • Syllabification marks. Two marks: Sukoon and Shadda [46].

Shadda is a secondary diacritic indicating that the marked consonant is doubled; it does not represent a primary sound by itself. The Tanween diacritics can appear only at the end of the word, and Sukoon cannot appear on the first letter, while short vowels can be placed in any position. Furthermore, some characters cannot accept any diacritic at all (ex: ), and others cannot in specific grammatical contexts (ex: the definite article at the beginning of the word). The diacritics are essential to indicate the correct pronunciation and the meaning of the word. They are all presented on the letter in Table 1.

Table 1. The diacritics of the Modern Standard Arabic

These marks are dropped from almost all written text today, except for documents intolerant of pronunciation errors, such as religious texts and Arabic teaching materials. Native speakers can generally infer the correct diacritization from their knowledge and from the context of every word. However, this is still not a trivial task for a beginner learner or for NLP applications [12].

Table 2. The diacritizations of and their meanings

The automatic diacritization problem is an essential topic due to the high ambiguity of undiacritized text and the free word order of Arabic grammar. Table 2 illustrates the differences made by the possible diacritizations of the word . As one can see, the diacritization determines many linguistic features, such as the part-of-speech (POS), the active/passive voice, and the grammatical case.

The full diacritization problem includes two sub-problems: morphological diacritization and syntactic diacritization. The first indicates the meaning of the word, and the second shows the grammatical case.

Two metrics are defined to quantify the performance of an automatic diacritics restoration system: the Diacritization Error Rate (DER) and the Word Error Rate (WER). The first measures the ratio of incorrectly diacritized characters to the total number of characters. The second applies the same principle with the whole word as a unit, where a word is counted as incorrect if any of its characters carries a wrong diacritic. Both metrics have two variants: one includes the diacritics of all characters (DER1 and WER1), and the other excludes the diacritic of the last character of every word (DER2 and WER2).
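To make these definitions concrete, the following is a minimal sketch of how DER and WER can be computed from aligned per-character diacritic labels; the function name and the label format are illustrative choices, not part of any of the systems discussed here.

```python
def der_wer(gold_labels, pred_labels, word_boundaries, exclude_last=False):
    """Compute DER and WER from per-character diacritic labels.

    gold_labels, pred_labels: sequences of diacritic labels, one per character.
    word_boundaries: list of (start, end) index pairs, one per word.
    exclude_last: if True, ignore the last character of every word (DER2/WER2).
    """
    char_total = char_errors = 0
    word_total = word_errors = 0
    for start, end in word_boundaries:
        wrong_word = False
        for i in range(start, end):
            if exclude_last and i == end - 1:
                continue
            char_total += 1
            if gold_labels[i] != pred_labels[i]:
                char_errors += 1
                wrong_word = True
        word_total += 1
        word_errors += wrong_word
    der = char_errors / char_total if char_total else 0.0
    wer = word_errors / word_total if word_total else 0.0
    return der, wer
```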

We propose a new approach to restore the diacritics of a raw Arabic text using a combination of deep learning, rule-based, and statistical methods.

2 Related Works

Many works have addressed the automatic restoration of Arabic diacritics using different techniques. They can be classified into three groups.

  • Rule-based approaches. The methods used include cascading Weighted Finite-State Transducers [33], lexicon retrieval, and rule-based morphological analysis [7]. One particular work [9] borrowed diacritized text from other sources to diacritize a highly cited text.

  • Statistical approaches. These include Hidden Markov Models at both the word and character levels [8, 18, 21], N-gram models at the word and character levels as well [10], Dynamic Programming methods [24,25,26], classical machine learning models such as the Maximum-Entropy classifier [46], and Deep Learning methods such as Deep Neural Networks, from the classical Multi-Layer Perceptron to the more advanced Recurrent Neural Networks [6, 14, 32, 36].

  • Hybrid approaches. They are a combination of rule-based methods and statistical methods in the same system. They include hybridization of rules and dictionary retrievals with morphological analysis, N-grams, Hidden Markov Models, Dynamic Programming and Machine Learning methods [5, 15, 17, 20, 23, 31, 35, 37,38,39, 42]. Some Deep Learning models improved by rules [2, 3] have been developed as well.

Despite the large number of works on this topic, the number of available tools for Arabic diacritization is still limited because most researchers do not release their source code or provide any practical application. Therefore, we compare the performance of our system to the following available ones:

  • Farasa [4] is a text processing toolkit which includes an automatic diacritics restoration module, in addition to other tools. It segments words by separating their prefixes and suffixes using SVM-ranking and performs dictionary lookups.

  • MADAMIRA [34] is a complete morphological analyser that generates possible analyses for every word with their diacritization and uses an SVM and n-gram language models to select the most probable one.

  • Mishkal [44] is an application which diacritizes a text by generating the possible diacritized word forms through the detection of affixes and the use of a dictionary, then narrowing them down using semantic relations, and finally choosing the most likely diacritization.

  • Tashkeela-Model [11] uses a basic N-gram language model on character level trained on the Tashkeela corpus [45].

  • Shakkala [13] is a character-level deep learning system made of an embedding layer, three bidirectional LSTM layers, and dense layers. It was trained on the Tashkeela corpus as well. To the best of our knowledge, it achieves the current state-of-the-art results.

3 Dataset

In this work, the Tashkeela corpus [45] was mainly used for training and testing our model. This dataset is made of 97 religious books written in the Classical Arabic style, plus a small part of web-crawled text written in the Modern Standard Arabic style. The original dataset has over 75.6 million words, of which over 67.2 million are diacritized Arabic words.

The structure of the data in this dataset is not consistent since its sources are heterogeneous. Furthermore, it contains some diacritization errors and some useless entities. Therefore, we applied some operations to normalize this dataset and keep the necessary text:

  1. Remove the lines which do not contain any useful data (empty lines or lines without diacritized Arabic text).

  2. Split the sentences at XML tags and ends of lines, then discard these symbols. After that, split the new sentences at some punctuation symbols: dots, commas, semicolons, double dots, interrogation, and exclamation marks, without removing them.

  3. Fix some diacritization errors, such as removing the extra Sukoon on the declarative , reversing the  + Tanween Fath and diacritic + Shadda combinations, removing any diacritic preceded by anything other than an Arabic letter or Shadda, and keeping only the last diacritic when more than one appears (excluding Shadda + diacritic combinations).

  4. Discard any sentence containing undiacritized words or having fewer than 2 Arabic words.

After this process, the resulting dataset is a raw text file with one sentence per line and a single space between every two tokens. This file is then shuffled and divided into a training set containing 90% of the sentences, with the rest distributed equally between the validation and the test setsFootnote 2. After the division, we calculated some statistics and presented them in Table 3.

Table 3. Statistics about the processed Tashkeela dataset

We note that the train-test Out-of-Vocabulary ratio for the unique Arabic words is 9.53% when considering the diacritics and 6.83% when ignoring them.

4 Proposed Method

Our approach is a pipeline of components, each performing a part of the process of diacritizing the undiacritized Arabic text of the input sentence. Only a human-readable, fully diacritized Arabic text is needed to train this architecture, without any additional morphological or syntactic information.

4.1 Preprocessing

At first, only the characters of the sentence which affect the diacritization are kept. These are Arabic characters, numbers, and spaces. The numbers are replaced by 0, since their values will most likely not affect the diacritization of the surrounding words. The other characters are removed before the diacritization process and restored at the end.

Every filtered sentence is then separated into an input and an output. The input is the bare characters of the text, and the output is the corresponding diacritics for every character. Considering that an Arabic letter can have up to two diacritics, one of which is Shadda, the output is represented by two vectors: one indicates the primary diacritic corresponding to every letter, and the other indicates the presence or absence of the Shadda. Figure 1 illustrates this process.

Fig. 1. Transformation of the diacritized text to the input and output labels
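As an illustration only, the sketch below shows one way to perform this separation, assuming the standard Unicode code points of the Arabic diacritics and a sentence already cleaned as described in Sect. 3; the helper name and the ordering of the primary-diacritic labels are our own choices, not necessarily the ones used in our implementation.

```python
# Illustrative label inventory: index 0 means "no diacritic"; the exact class
# ordering used by the system may differ.
PRIMARY = ['', '\u064E', '\u064B', '\u064F', '\u064C', '\u0650', '\u064D', '\u0652']
#          none, Fatha,  Fathatan, Damma,   Dammatan, Kasra,   Kasratan, Sukoon
SHADDA = '\u0651'

def split_diacritics(text):
    chars, primary, shadda = [], [], []
    for c in text:
        if c == SHADDA:
            shadda[-1] = 1                      # Shadda belongs to the previous letter
        elif c in PRIMARY[1:]:
            primary[-1] = PRIMARY.index(c)      # primary diacritic of the previous letter
        else:
            chars.append(c)                     # bare character (letter, space, or 0)
            primary.append(0)
            shadda.append(0)
    return chars, primary, shadda
```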

The input is mapped to a set of 38 numeric labels representing all the Arabic characters in addition to 0 and the white space. It is transformed into a 2D one-hot encoded array, where the size of the first dimension equals the length of the sentence, and the size of the second equals the number of labels. After that, this array is extended to 3 dimensions by inserting the time-steps dimension as the second dimension and moving the label dimension to the third position. The time steps are generated by a sliding window moving with a step of 1 over the first dimension. The number of time steps is fixed to 10 because this number is large enough to cover most Arabic words along with a part of their previous words. The output of the primary diacritics is also transformed from a vector of labels to a 2D one-hot array, while the output of the Shadda marks is left as a binary vector. Figure 2 shows a representation of the input array and the two output arrays after the preprocessing of the previous example. The \(\varnothing \) represents a padding vector (all zeros), and the numbers in the input and the second output indicate the indexes of the cells of the one-hot vectors set to 1.

Fig. 2. Input and output arrays after the transformations
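A rough sketch of this encoding step is given below; the shapes follow the description above, but the placement of the padding vectors and the exact window alignment are our assumptions.

```python
import numpy as np

NUM_LABELS = 38   # Arabic characters + '0' + white space
TIME_STEPS = 10

def encode_input(label_ids):
    """label_ids: list of integer labels, one per character of the sentence."""
    n = len(label_ids)
    one_hot = np.zeros((n, NUM_LABELS), dtype=np.float32)
    one_hot[np.arange(n), label_ids] = 1.0
    # Prepend padding vectors so that every character has TIME_STEPS vectors
    # ending at its own position (covering it and part of the previous words).
    padded = np.vstack([np.zeros((TIME_STEPS - 1, NUM_LABELS), dtype=np.float32), one_hot])
    # Sliding window with a step of 1: final shape is (n, TIME_STEPS, NUM_LABELS).
    return np.stack([padded[i:i + TIME_STEPS] for i in range(n)])
```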

4.2 Deep Learning Model

The following component of this system is an RNN model, composed of a stack of two bidirectional LSTM [22, 28] layers of 64 cells in each direction, followed by two parallel dense layers of sizes 8 and 64. All of the previous layers use the hyperbolic tangent (Tanh) as an activation function. The first parallel layer is connected to a single perceptron with a sigmoid activation function, while the second is connected to 7 perceptrons with a softmax activation function. The first output estimates the probability that the current character carries a Shadda, and the second generates the probabilities of the primary diacritics for that character. A schema of this network is displayed in Fig. 3. The size, type, and number of layers were determined empirically and according to previous research using deep learning approaches [2, 13, 14, 27, 32].

Fig. 3. Architecture of the Deep Learning model
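A possible Keras sketch of this architecture is shown below; it is our reading of the description above (the loss functions and the exact wiring of the two heads are assumptions), not the released implementation, and it is written against a recent tf.keras.

```python
from tensorflow.keras import layers, models, optimizers

def build_model(time_steps=10, num_input_labels=38, num_primary_classes=7):
    inputs = layers.Input(shape=(time_steps, num_input_labels))
    # Stack of two bidirectional LSTM layers with 64 cells in each direction.
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True, activation='tanh'))(inputs)
    x = layers.Bidirectional(layers.LSTM(64, activation='tanh'))(x)
    # Two parallel dense branches (sizes 8 and 64) feeding the two output heads.
    shadda_branch = layers.Dense(8, activation='tanh')(x)
    primary_branch = layers.Dense(64, activation='tanh')(x)
    shadda_out = layers.Dense(1, activation='sigmoid', name='shadda')(shadda_branch)
    primary_out = layers.Dense(num_primary_classes, activation='softmax',
                               name='primary')(primary_branch)
    model = models.Model(inputs, [shadda_out, primary_out])
    # Hyperparameters follow Sect. 5.1; in TensorFlow 1.x the argument is named `lr`.
    model.compile(optimizer=optimizers.Adadelta(learning_rate=0.001, rho=0.95),
                  loss={'shadda': 'binary_crossentropy',
                        'primary': 'categorical_crossentropy'})
    return model
```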

4.3 Rule-Based Corrections

Rule-based corrections are linked to the input and output of the RNN and apply some changes to the output. These rules can select the appropriate diacritic for some characters in some contexts, or exclude wrong choices in other contexts by nullifying their probabilities. Different sets of rules are applied to the outputs to eliminate diacritizations that are impossible according to Arabic rules.

Shadda Corrections. The first output of the DL model representing the probability of the Shadda diacritic is ignored by nullifying its value if any of these conditions are met for the current character:

  • It is a space, 0 or one of the following: .

  • It is the first letter of the Arabic word.

  • It has Sukoon as a predicted primary diacritic.

Primary Diacritics Corrections. The probabilities of the second output of the DL model are also altered by these rules when their respective conditions are met for the current character (a simplified code sketch of some of these rules follows the list):

  • If it is , set the current diacritic to Kasra, by setting the probability of its class to 1 and the others to 0.

  • If it is or , set the diacritic of the previous character to Fatha.

  • If it is and the last letter of the word, allow only Fatha, Fathatan, or no-diacritic choices by zeroing the probabilities of the other classes.

  • If it is and not the last letter of the word, set Fatha on the previous character.

  • If it is the first letter in the word, forbid Sukoon.

  • If it is not the last character of the word, prohibit any Tanween diacritic from appearing on it.

  • If it is the last letter, prohibit Fathatan unless this character is or .

  • If it is a space, 0 or any of the following characters: , set the choice to no-diacritic.
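Below is a simplified sketch of how such corrections can be applied as operations on the probability arrays; the class indices are hypothetical, and only a few of the rules above are shown.

```python
import numpy as np

SUKOON, FATHATAN, DAMMATAN, KASRATAN = 6, 1, 3, 5   # hypothetical class indices

def apply_some_rules(primary_probs, shadda_probs, is_first_letter, is_last_letter):
    """primary_probs: (n, num_classes) array; the other arguments are length-n arrays
    (shadda_probs is float, the two masks are boolean), one entry per character."""
    # Forbid Sukoon on the first letter of a word.
    primary_probs[is_first_letter, SUKOON] = 0.0
    # Prohibit any Tanween diacritic on characters that are not word-final.
    for tanween in (FATHATAN, DAMMATAN, KASRATAN):
        primary_probs[~is_last_letter, tanween] = 0.0
    # Nullify the Shadda probability where Sukoon is the predicted primary diacritic.
    shadda_probs[primary_probs.argmax(axis=1) == SUKOON] = 0.0
    return primary_probs, shadda_probs
```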

4.4 Statistical Corrections

The output and the input of the previous phase are transformed and merged to generate a standard diacritized sentence. The sentence is segmented into space-delimited words and augmented with unique starting and ending entities. Every word in the sentence is checked by up to four levels of correction using data saved from the training set. If an acceptable correction is found at any level, the word is corrected; otherwise, it is forwarded to the next level. When several corrections get the same score in a single level, the first one is chosen. If no correction is found, the diacritization predicted by the previous components is left unchanged.

Word Trigram Correction. In the first stage, trigrams are extracted from the undiacritized sentence, and we check whether a known diacritization of the core word is available. The core word is the second one in the trigram, while the first and the third are considered the previous and following contexts, respectively. If such a trigram is found, the most frequent diacritization of the core word in that context is selected. Despite its high accuracy, especially for the syntactic part, this correction rarely applies, since having the exact same surrounding words in the test data is not common. This correction is entirely independent of the output of the DL model and the rule-based corrections. An example is shown in Fig. 4.

Fig. 4. Selecting the diacritization using the trigrams

Word Bigram Correction. In the second stage, the same processing as the previous one is applied for the remaining words but considering bigrams where the core word is the second one, and the first one represents the previous context. This correction works more often than the trigram-based one since it depends only on the previous word. Similarly, it does not depend on the output of the previous components.
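The two n-gram stages can be condensed into a lookup such as the sketch below; the data-structure layout (keys are undiacritized context tuples, values map candidate diacritizations of the core word to their training frequencies) and the boundary tokens are assumptions made for illustration.

```python
def ngram_correct(words_undiac, predicted, trigram_dict, bigram_dict):
    """words_undiac: undiacritized words; predicted: their predicted diacritizations."""
    corrected = list(predicted)
    padded = ['<s>'] + words_undiac + ['</s>']   # unique starting and ending entities
    for i, word in enumerate(words_undiac):
        prev_w, next_w = padded[i], padded[i + 2]
        candidates = trigram_dict.get((prev_w, word, next_w))   # trigram stage
        if not candidates:
            candidates = bigram_dict.get((prev_w, word))        # bigram stage
        if candidates:
            # Most frequent diacritization of the core word in this context.
            corrected[i] = max(candidates.items(), key=lambda kv: kv[1])[0]
    return corrected
```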

Word Minimum Edit Distance Correction. In the third stage, when the undiacritized word has known compatible diacritizations, the Levenshtein distance [29] is calculated between the predicted diacritization and every saved diacritization for that word. The saved diacritization corresponding to the minimal edit distance is chosen, as shown in Fig. 5. Most predictions are corrected at this stage when the vocabulary of the test set is relatively similar to the training set.

Fig. 5. Selecting the diacritization according to the minimal edit distance
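A compact sketch of this stage is shown below; the stage relies on the Levenshtein distance [29], implemented here in pure Python so that the snippet stays self-contained, and the function names are ours.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def min_edit_correct(predicted_word, saved_forms):
    """saved_forms: known diacritizations of the same undiacritized word."""
    if not saved_forms:
        return predicted_word
    return min(saved_forms, key=lambda form: edit_distance(predicted_word, form))
```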

Pattern Minimum Edit Distance Correction. Finally, if the word was never seen, the pattern of the predicted word is extracted and compared against the saved diacritized forms of that pattern. To generate the word pattern, the following substitutions are applied: are all replaced by . is replaced by . The rest of the Arabic characters except and the long vowels ( ) are substituted by the character . The diacritics and the other characters are not affected. The predicted diacritized pattern is compared to the saved diacritization forms of this pattern when available, and the closest one, according to the Levenshtein distance, is used as a correction, following the same idea of the previous stage. This correction is effective when the test data contains many words not seen in the training data.

5 Experiments

5.1 Implementation Details

The described architecture was developed using Python [41] 3.6 with NumPy [40] 1.16.5 and TensorFlow [1] 1.14.

The training data was transformed into NumPy arrays of input and output. The DL model was implemented using Keras, and each processed sentence of text is considered a single batch of data when fed into the DL model. The optimizer used for adjusting the model weights is ADADELTA [43] with an initial learning rate of 0.001 and \(\rho \) of 0.95.

The rule-based corrections are implemented as algebraic operations working on the arrays of the input and the output.

The statistical corrections use dictionaries as data structures, where the keys are the undiacritized n-grams/patterns, and the values are lists of the possible tuples of the diacritized form along with their frequencies in the training set.
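For instance, the bigram dictionary can be populated along the lines of the following sketch; strip_diacritics and the sentence format are assumptions made for illustration.

```python
from collections import defaultdict

def build_bigram_dict(training_sentences, strip_diacritics):
    """training_sentences: lists of diacritized words; strip_diacritics removes diacritics."""
    bigram_dict = defaultdict(lambda: defaultdict(int))
    for sentence in training_sentences:
        padded = ['<s>'] + sentence               # unique starting entity
        for prev_w, word in zip(padded, sentence):
            key = (strip_diacritics(prev_w), strip_diacritics(word))
            bigram_dict[key][word] += 1           # frequency of this diacritized form
    return bigram_dict
```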

5.2 System Evaluation

The DL model is trained for a few iterations to adjust its weights, while the dictionaries of the statistical corrections are populated in a single pass over the training data.

We report the accuracy of our system using the variants of the DER and WER metrics explained in the introduction. These metrics do not have an agreed exact definition: most previous works followed the definition of Zitouni et al. [46], which takes non-Arabic characters into account, while some newer ones tend to follow the definition of Alansary et al. [7] and Fadel et al. [19], which excludes these characters. In our work, we chose the latter definition, since the former can be significantly biased, as demonstrated in [19]. The calculation of these metrics should include the letters without diacritics, but they can be excluded as well, especially when the text is partially diacritized.

First, we used our testing set to measure the performances of our system. We got DER1 = 4.00%, WER1 = 12.08%, DER2 = 2.80%, and WER2 = 6.22%.

Table 4. Comparison of the performances of our system to the available baselines

The same testing data and testing method of Fadel et al. [19] were used as well in order to compare our system to the others evaluated in that work. The results are summarized in Table 4.

Results show that our system outperforms the best reported system (Shakkala). This can be explained by the fact that Shakkala does not perform any corrections on the output of its deep learning model, while ours includes a cascade of corrections that fix many of these errors.

When training and testing our system on the text extracted from the LDC's ATB part 3 [30], it achieves DER1 = 9.32%, WER1 = 28.51%, DER2 = 6.37% and WER2 = 12.85%. The higher error rates on this dataset are mainly caused by its incomplete diacritization of many words, in addition to its comparatively small size, which prevents our system from generalizing well.

5.3 Error Analysis

To get a deeper understanding of the system's performance, we study the effect of its different components and record the errors committed at each level. We performed the tests on our test part of the Tashkeela dataset, taking all and only Arabic characters into account.

Contribution of the Components. In order to show the contribution in error reduction of every component, two evaluation setups were used.

Firstly, only the DL model and the static rules are enabled; then the following component is enabled at every step, and the values of the metrics are recalculated. Table 5a shows the obtained results.

Secondly, all the components are enabled except one at a time. The same calculations are done and displayed in Table 5b.

Table 5. Reduction of the error rates according to the enabled components

The contributions of the unigram and bigram corrections are the most important, considering their effect on the error rates in both setups. The effect of the trigram correction is more visible on the syntactic diacritization than on the morphological diacritization, since the former is more dependent on the context. The contribution of the pattern correction is not very noticeable due to the position of this component in the pipeline, which limits its effect to the OoV words only.

Error Types. We use our system to generate diacritizations for a subset of the sentences of our testing set. We limit our selection to 200 sentences in which at least one word is wrongly diacritized. We counted and classified a total of 426 errors manually and present the results in Table 6.

Table 6. Diacritization errors count from 200 wrong test sentences

We found that 52.58% of the mistakes committed by our diacritization system are caused by the syntactic diacritization, which specifies the role of the word in the sentence. Syntactic diacritization is so hard that even native Arabic speakers often commit mistakes of this type when speaking. Since this is a manual verification, we do not simply assume that the diacritic of the last character of the Arabic word is the syntactic one, as is done in the calculations of DER2 and WER2; instead, we select the diacritics which have a syntactic role according to Arabic rules, no matter where they appear.

A replacement error occurs when the system generates a diacritization that forms a valid Arabic word which is nevertheless wrong according to the test data. 24.18% of the errors of our system are of this type.

A non-existence error happens when the system generates a diacritization forming a word that does not exist in standard Arabic. 11.27% of our system's errors are of this type.

The remaining error types are prediction missing and label missing, which indicate, respectively, that the system has not predicted any diacritic where it should have, and that the testing set has missing or wrong diacritics. These types are generally caused by diacritization mistakes in the training and testing sets.

6 Conclusion

In this work, we developed and presented our automatic Arabic diacritization system, which follows a hybrid approach combining a deep learning model, rule-based corrections, and two types of statistical corrections. The system was trained and tested on a large part of the Tashkeela corpus after being cleaned and normalized. On our test set, the system scored DER1 = 4.00%, WER1 = 12.08%, DER2 = 2.80% and WER2 = 6.22%. These values were calculated when taking all and only Arabic words into account.

Our method establishes new state-of-the-art results in the diacritization of raw Arabic texts, mainly when the classical style is used. It performs well even on documents that contain unseen words or non-Arabic words and symbols. We have made our code publicly available as wellFootnote 3.

In future work, we will focus on improving the generalization of the system to better handle out-of-vocabulary words, while reducing the time and memory requirements.