Abstract
In the paper, we study one of the fundamental problems of grammatical inference, namely the induction of nondeterministic finite automata (NFA). We consider the induction of NFA consistent with the given sets of examples and counterexamples. We transform the induction problem into a constraint satisfaction problem and propose two variable ordering methods to solve it. We evaluate experimentally the proposed variable ordering methods and compare them with a state-of-the-art method. Additionally, through the experiments, we assess the impact of sample sets sizes on the performance of the induction algorithm using the respective variable ordering methods.
The preliminary version of this paper was presented at ICGI 2018 [13]. In the current paper, we substantially extended the algorithmic descriptions and made the experimental part much more comprehensive.
You have full access to this open access chapter, Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
In the paper, we deal with automata, which are important for numerous practical applications, such as compiler design, bioinformatics [20], and grammatical inference [19]. These automata are finite, nondeterministic and minimal in terms of the number of states. They are given by \(A = (Q, \varSigma , \delta , q_0\), \(Q_F)\), where Q is the finite set of states of the automaton, \(\varSigma \) is the input alphabet, \(\delta : Q \times \varSigma \rightarrow 2^Q\) is the transition function, \(q_0 \in Q\) is the initial state and \(Q_F \subseteq Q\) is the set of final states [11]. The automata are also consistent with the given sample \(S = (S_+, S_-)\), in which \(S_+\) denotes the examples, and \(S_-\) denotes the counterexamples. The automaton A is said to be consistent with the sample S if for each word \(w \in S_+\) there exists a sequence of transitions between state \(q_0\) and at least one state \(q \in Q_F\) and for each word \(w \in S_-\) the condition does not hold.
In order to find a minimal nondeterministic finite automaton (NFA) consistent with the given sample S, we transform the induction problem into a constraint satisfaction problem (CSP) as discussed in [12, 15, 18]. Using the CSP formulation we ask whether for the given sample S and a given positive integer k there exists a k-state automaton consistent with S. By taking \(k = 1, 2, \ldots \), we ensure that we find the minimal NFA for the given sample.
The main contributions of the paper are as follows. Firstly, we discuss two variable ordering methods devised specifically to solve the given CSP formulation of the problem. And secondly, we perform comprehensive experiments on input samples built of amino acid sequences from WaltzDB database [2]. To assess how the sizes of sets \(S_+\) and \(S_-\) affect the performance of the induction algorithms, we consider samples for which \(|S_+| \ll |S_-|\), \(|S_+| = |S_-|\), and \(|S_+| \gg |S_-|\) hold.
The paper is organized into 6 sections. In Sect. 2 we present a brief literature review. In Sect. 3 we recall the formulation of minimal NFA induction problem as a CSP. In Sect. 4 we present the parallel induction algorithm and the two proposed variable ordering methods. We present the results of conducted experiments in Sect. 5. Finally, Sect. 6 contains the conclusions.
2 State of the Art
The problem of finding a minimal NFA consistent with the given sample is hard. In particular, it cannot be done from polynomial time and data [6]. It was also shown that inducing a minimal NFA given a deterministic finite automaton (DFA) is PSPACE-complete [16].
The existing NFA induction methods focus around state merging algorithms, designed initially for DFAs. The merging is performed over a prefix tree acceptor constructed for the sample S. The algorithms remove redundant states keeping the consistency with the sample. Among the existing algorithms there are DeLeTe2 [8], NRPNI [1], and the state merging methods proposed in [5, 9]. Some algorithms transform the induction problem to CSPs [12, 14, 15, 18].
The variable ordering methods are either static – defined before the algorithm begins, or dynamic – changing as the algorithm proceeds. Examples of static methods include deg [7] and ddeg heuristics based on the initial and current number of constraints involving the variable (degree). Dynamic methods include domain-size based method dom [10], as well as dom-deg [3], dom-ddeg [17] or dom-futdeg [4] methods, which follow the dom heuristic but in case of ties use the initial, current or future degree of the variables, respectively.
3 Problem Formulation
Since we assume the fixed number of states k, to find the NFA we only need to find the transition function \(\delta \) and the set of final states \(Q_F\). As the result, the following statements about the given CSP always hold [12, 15, 18]:
-
1.
Let \(l = |\varSigma |\) be the number of symbols in the alphabet. Then we need exactly \(n = k^2l + k\) binary variables x and y.
-
2.
For each pair of states \(p, q \in Q\) and each symbol \(a \in \varSigma \), if \(q \in \delta (p, a)\), then \(x_i = 1\), for some \(i \in [1, k^2l]\). Likewise, \(x_i = 0\) means that \(q \notin \delta (p, a)\) holds.
-
3.
For each state \(p \in Q_F\), \(y_i = 1\) holds for some \(i \in [1, k]\). Otherwise, \(y_i = 0\).
-
4.
Let i, j, \(i \ne j\) be the indices of x variables. Let \(a, b \in \varSigma \), and \(p, q, r \in Q\), \(p \ne r, q \ne r\). Then the x variables are ordered as follows:
-
(a)
If \(a < b\) in lexicographical order, then \(i < j\) holds.
-
(b)
Given the transitions \(p \overset{a}{\rightarrow } r\) and \(q \overset{a}{\rightarrow } r\), if p precedes q in Q, then \(i < j\).
-
(c)
Given the transitions \(r \overset{a}{\rightarrow } p\) and \(r \overset{a}{\rightarrow } q\), if p precedes q in Q, then \(i < j\).
-
(a)
-
5.
If p precedes q in Q, then \(i < j\), where i, j are the y variables indices.
Example 1
Let the set of states be \(Q = \{q_0, q_1\}\), and so \(k = 2\), and let \(\varSigma = \{a, b\}\). Then variables \(y_1\) and \(y_2\) correspond to the possible final states \(q_0\) and \(q_1\). Since \(a < b\) lexicographically, and \(q_0 < q_1\) in set Q, then the following relations hold: variable \(x_1\) corresponds to the transition \(q_0 \overset{a}{\rightarrow } q_0\), variable \(x_2\) corresponds to the transition \(q_0 \overset{a}{\rightarrow } q_1\), \(\ldots \), variable \(x_8\) corresponds to the transition \(q_1 \overset{b}{\rightarrow } q_1\).
Given the above considerations, a path is a product of x variables followed by a y variable. Using Example 1, the product \(x_1x_6x_8y_1\) corresponds to the transitions \(q_0 \overset{a}{\rightarrow } q_0\), \(q_0 \overset{b}{\rightarrow } q_1\), and \(q_1 \overset{b}{\rightarrow } q_1\), i.e., a word abb ending in state \(q_1\).
Let us note that for the given number of states k and for each word w, there are exactly \(k^{|w|}\) paths on which this word can be spelled out. There are also \(k^{|w|}\) products of length \(|w| + 1\) corresponding to these paths. In the sequel, we use \(P^w_{mi}\) to denote the m-th product of x variables corresponding to word w, and ending in variable \(y_i\). Hence an NFA is consistent with the sample S when:
-
1.
for all words \(w \in S_+ \setminus \{\lambda \}\) it holds that (examples acceptance):
$$\begin{aligned} \sum _{i = 1..k} \sum _{m = 1..k^{|w| - 1}} P^w_{mi}y_i = 1, \end{aligned}$$(1) -
2.
for all words \(w \in S_- \setminus \{\lambda \}\) it holds that (counterexamples rejection):
$$\begin{aligned} \sum _{i = 1..k} \sum _{m = 1..k^{|w| - 1}} P^w_{mi}y_i = 0. \end{aligned}$$(2)
Note that the summation is performed according to the rules of Boolean algebra. Note that (1) is satisfied iff there exists a path for word w that ends in a final state. Note also that (2) is satisfied iff either there is no path for word w, or the path ends in a non-final state. Finally, if \(\lambda \in (S_+ \cup S_-)\) holds, then we set:
The above condition allows to reduce the solution space to be searched [14].
4 Algorithms and Methods
Hereinafter, let C be the set of constraints, \(c_+ \in C\) be a constraint given by (1), and \(c_- \in C\) be a constraint given by (2). Let |c| be the number of active (non-zero) products P in (1) or (2), and d(c, x) be the number of active products in c containing variable x. The variable ordering methods pertain to the x variables. The y variables are set first to produce independent instances of the CSP, solved in parallel [12, 14]. It also applies to the deg method used in the experiments.
The induction procedure InduceNFA, executed for each independent instance of the CSP, is given in Fig. 1. The procedure consists of the evaluation phase (represented by the Evaluate procedure in line 2), and the ordering phase (represented by the Reorder procedure in line 5). They both differ depending on the selected variable ordering method.
The min-max-ex Method. The Evaluate procedure (Fig. 2) implements a ‘fail-fast’ behavior, by checking first the constraints that may result in a contradiction (lines 6 and 8). It aims at explicitly satisfying constraints \(c_+ \in C\) (line 9), setting the other \(x_i\) variables to zeros if all \(c_+\) are satisfied (line 10). It updates \(|c_+|\) and \(d(c_+, x_j)\) using the techniques described in [12] (lines 3–5).
The Reorder procedure (Fig. 3) selects a constraint \(c_+' \in C\), for which \(|c_+'|\) is minimal (line 2) and the index of the most frequent variable \(x_i \in c'_+\) (line 3). It sets v to zero (line 4). The ordering makes the evaluation shorter, by choosing constraint with fewer products and enforcing the ‘fail-fast’ behavior.
The min-max-cex Method. The second variable ordering method, called min-max-cex, differs from min-max-ex in that, that in the Evaluate procedure, for \(v = 1\) it only checks for a contradiction in any \(c_- \in C\). For \(v = 0\) it checks for a contradiction in any \(c_+ \in C\), it updates the \(|c_-|\) and \(d(c_-, x_j)\) values as in lines 3–5, and checks for satisfied constraints \(c_-\). If all \(c_-\) are satisfied it sets all unset \(x_j\) variables to 1. In the Reorder procedure, in line 2 we look for a constraint \(c_- \in C\), such that \(|c_-|\) is minimal, and we propose that \(v \leftarrow 1\) in line 4.
The deg Method. In the Evaluate procedure the deg method first checks for any contradictions, either in \(c_+\) or in \(c_-\) (for \(v = 0\) or \(v = 1\), respectively), and then it checks for satisfied constraints. We require that all constraints are explicitly satisfied. Since the method is static, the order of variables is established before the induction begins. So the Reorder procedure merely returns the next variable according to the already known order. It sets \(v \leftarrow 0\) in line 4 of Fig. 3.
Example 2
Due to space limitations, we provide an example trace of the induction algorithm at https://github.com/tjastrzab/iccs.
5 Experimental Evaluation
We generated 450 samples based on the sets of amino acid sequences [2]. The number of sequences in set \(S_+\) (resp. \(S_-\)) was 5 (resp. 45), 25 (resp. 25), and 45 (resp. 5), for the samples, for which \(|S_+| \ll |S_-|\), \(|S_+| = |S_-|\), and \(|S_+| \gg |S_-|\) hold. The algorithms were implemented in Java and run on an Intel Xeon E5-2640 2.60 GHz processor (16 cores, 8 GB RAM). The time limit was 3 hours. We considered two scenarios. In Exp. 1, we sought the first consistent k-state automaton. In Exp. 2, the configuration of the final states was also given to see how the algorithms perform when searching for a specific NFA. All minimal automata had 2 states.
The distribution of the run times, given by the minimum, maximum and average values, is shown in Table 1. Note that in all cases, the average times for the best- and worst-performing methods, differ by an order of magnitude or more. Moreover, the balanced samples are the hardest to solve, since the average times are the longest for both Exp. 1 and Exp. 2. The min-max-ex method prevails when \(|S_+| \ll |S_-|\) holds, while min-max-cex is the fastest for the cases in which \(|S_+| \gg |S_-|\) is true. This is expected since the time spent in the Reorder procedure is then much shorter (as the ordering is based on \(S_+\) or \(S_-\) only). The min-max-cex method is also the fastest for the hardest sample type. This proves that the new methods are competitive with respect to the deg method.
To explain the differences in the run times, we counted the number of calls of the InduceNFA procedure, i.e., the number of visited solution tree nodes. We observed that the number of calls differed by two to four orders of magnitude.
Finally, to observe the relation between the run time performance and the automaton size, we counted the number of transitions in each NFA. We observed that min-max-ex and deg methods produce NFAs of similar size, while the min-max-cex method generates on average 17–30 transitions more. It is so because min-max-ex and deg methods produce mostly the transitions necessary to accept the examples. The min-max-cex method satisfies all counterexamples and makes the remaining variables equal to one, which generates more transitions.
6 Conclusions
In the paper we have investigated the problem of nondeterministic finite automata induction. The experiments have shown that the proposed variable ordering methods perform better than the state-of-the-art one, especially in case of imbalanced sizes of sets \(S_+\) and \(S_-\). The result is important since it is not uncommon that, for instance, we know just a few factors causing a disease (set \(S_+\)) and much more factors that are not responsible for this particular disease (set \(S_-\)). Hence, being able to classify these factors efficiently and correctly using the induced NFAs, can be of help in some bioinformatics tasks.
References
Alvarez, G., Ruiz, J., Cano, A., García, P.: Nondeterministic Regular Positive Negative Inference NRPNI. In: Proceedings of CLEI 2005, pp. 239–249 (2005)
Beerten, J., Van Durme, J., Rousseau, F., Schymkowitz, J.: WALTZ-DB. Database of amyloid forming peptides (2014). http://waltzdb.switchlab.org/
Bessière, C., Régin, J.-C.: MAC and combined heuristics: two reasons to forsake FC (and CBJ?) on hard problems. In: Freuder, E.C. (ed.) CP 1996. LNCS, vol. 1118, pp. 61–75. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-61551-2_66
Brélaz, D.: New methods to color the vertices of a graph. Commun. ACM 22(4), 251–256 (1979)
Coste, F., Fredouille, D.: Unambiguous automata inference by means of state-merging methods. In: Lavrač, N., Gamberger, D., Blockeel, H., Todorovski, L. (eds.) ECML 2003. LNCS (LNAI), vol. 2837, pp. 60–71. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-39857-8_8
de la Higuera, C.: Characteristic sets for polynomial grammatical inference. Mach. Learn. 27, 125–138 (1997)
Dechter, R., Meiri, I.: Experimental evaluation of preprocessing techniques in constraint satisfaction problems. In: Proceedings of IJCAI 1989, pp. 271–277. Morgan Kaufmann Publishers Inc., San Francisco (1989)
Denis, F., Lemay, A., Terlutte, A.: Learning regular languages using RFSAs. Theor. Comput. Sci. 313(2), 267–294 (2004)
García, P., Vázquez Parga, M., Alvarez, G., Ruiz, J.: Universal automata and NFA learning. Theor. Comput. Sci. 407(1–3), 192–202 (2008)
Harallick, R.M., Elliot, G.L.: Increasing tree search efficiency for constraint satisfaction problems. Artif. Intell. 14, 263–313 (1980)
Hopcroft, J., Ullman, J.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company (1979)
Jastrząb, T.: On parallel induction of nondeterministic finite automata. Procedia Comput. Sci. 80, 257–268 (2016)
Jastrząb, T.: Performance evaluation of selected variable ordering methods for NFA induction (2018). http://icgi2018.pwr.edu.pl/public/ex-abstracts/jastrzab18.pdf
Jastrząb, T.: Two parallelization schemes for the induction of nondeterministic finite automata on PCs. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds.) PPAM 2017. LNCS, vol. 10777, pp. 279–289. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78024-5_25
Jastrząb, T., Czech, Z.J., Wieczorek, W.: Parallel induction of nondeterministic finite automata. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds.) PPAM 2015. LNCS, vol. 9573, pp. 248–257. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32149-3_24
Jiang, T., Ravikumar, B.: Minimal NFA problems are hard. SIAM J. Comput. 22(6), 1117–1141 (1993)
Smith, B.M., Grant, S.A.: Trying harder to fail fast. In: Proceedings of ECAI 1998, pp. 249–253. Wiley (1997)
Wieczorek, W.: Induction of non-deterministic finite automata on supercomputers. In: Proceedings of ICGI 2012, vol. 21, pp. 237–242 (2012)
Wieczorek, W.: Grammatical Inference. SCI, vol. 673. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-46801-3
Wieczorek, W., Unold, O.: Use of a novel grammatical inference approach in classification of amyloidogenic hexapeptides. Comput. Math. Methods Med. 2016, 8 (2016). article ID 1782732
Acknowledgment
The research was supported by National Science Centre Poland (NCN), project registration no. 2016/21/B/ST6/02158.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Jastrząb, T. (2019). A Comparison of Selected Variable Ordering Methods for NFA Induction. In: Rodrigues, J., et al. Computational Science – ICCS 2019. ICCS 2019. Lecture Notes in Computer Science(), vol 11540. Springer, Cham. https://doi.org/10.1007/978-3-030-22750-0_73
Download citation
DOI: https://doi.org/10.1007/978-3-030-22750-0_73
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-22749-4
Online ISBN: 978-3-030-22750-0
eBook Packages: Computer ScienceComputer Science (R0)