A TWO-STEP SYSTEM FOR SEARCHING SIMILAR WORDS FOR FAST AND RELIABLE AUTOMATIC CONCATENATION OF RUSSIAN SUB-WORD UNITS




Abstract

In this paper we describe and investigate a two-step system for sorting out inappropriate words when searching for similar words in the lexicon for the automatic concatenation of Russian sub-word units. This two-step system consists of computing the Levenshtein distance at the first stage and computing a similarity coefficient by the relevance function at the second stage. We also compare the performance of the Wagner-Fischer algorithm and the suggested algorithm.

Full text

The lexicon for Russian continuous speech recognition is much larger than that for English. This fact complicates the use of standard, well-developed approaches to language modeling. Quite a common approach to handling an abundant lexicon is the employment of sub-word units, such as syllables or morphemes. The challenge of such an approach is the subsequent concatenation of the recognized sub-units. There is related work addressing this problem, such as [1]; however, further improvement of both the concatenation accuracy and the performance of the algorithms is required. Continuous speech is transformed into a sequence of syllables, but the word boundaries are unknown. The task is to accelerate the automatic concatenation of Russian sub-word units. For this purpose a Genetic Algorithm (GA) can be used; however, the likelihood of the sentence generated by the GA has to be estimated. The problem can be partly solved by accelerating the search for words from the lexicon that are identical or similar to the words produced by the GA, using fuzzy search algorithms.

Fuzzy search. Spell-checkers and various web search engines (such as Google, Yandex, etc.) are also based on fuzzy (string) search algorithms. For example, web search engines use fuzzy search algorithms to generate the results of the "Did you mean ..." suggestion list [2]. The fuzzy search problem can be formulated as follows: "Find in the text or lexicon of size N all the words matching the original word within a maximum of K possible differences" [2]. There exist different fuzzy search algorithms, such as linear search, bitap (Shift-Or or Baeza-Yates-Gonnet, and its modifications by Wu and Manber), the Signature Hashing Method and others [2]. Fuzzy search algorithms are based on some metric, i.e. a distance function between two strings which measures their similarity or difference. One of the best-known metrics is the Levenshtein distance.

The Levenshtein distance and the Wagner-Fischer algorithm.
The Levenshtein distance (edit distance) between two strings is the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one string into the other [3]. Suppose S1 and S2 are strings; then, mathematically, the Levenshtein distance can be described by the following formula (1):

D(i, j) = 0,  if i = 0, j = 0;
D(i, j) = i,  if i > 0, j = 0;
D(i, j) = j,  if i = 0, j > 0;
D(i, j) = min(D(i, j-1) + 1, D(i-1, j) + 1, D(i-1, j-1) + C_substitution),  if i > 0, j > 0,   (1)

where C_substitution = 1 if S1[i] ≠ S2[j] and 0 otherwise.

Different sources suggest setting C_substitution equal to 2 instead of 1 in formula (1); in the following tests C_substitution = 2 was used. There exists a set of algorithms for computing the Levenshtein distance; the most popular is the Wagner-Fischer algorithm [4]. In this work the Wagner-Fischer algorithm was used, with C_substitution given by formula (2):

C_substitution = 2 · WeightFunction(S1[i], S2[j]), if S1[i] ≠ S2[j];
C_substitution = 0, otherwise.   (2)

Вестник СибГАУ. № 4(50). 2013

WeightFunction is a function of weight coefficients for symbol comparison. It provides a set of rules for phonetic comparison and measures the phonetic similarity between two symbols. To accelerate the search, the lexicon is stored in a single tree, and the Levenshtein distance is computed at each node of the tree. The Wagner-Fischer algorithm, however, has a drawback: applied to the tree, it can sort out appropriate words near the beginning (root) of the tree.

The algorithm of search for appropriate words in a tree (ASAWT). The ASAWT was developed to accelerate the search for similar words in the lexicon and to overcome the drawback of the Wagner-Fischer algorithm described above. The ASAWT is worth using only if the maximum allowed distance between strings (here distance is a measure of difference) is rather small, for example 2 or 3. The idea of the ASAWT is as follows.

First of all, suppose that:
1. The maximum allowed distance between two strings is k.
2. S1 of size n is the original string and S2 of size m is the test string. Furthermore, there is a restriction on n and m: |n - m| ≤ k (this restriction saves computational load during the search).
3. The value of the position P in the original string is known (starting from P = 0, corresponding to the first index in the string).
4. The value of the error err is known (starting from err = 0).

Then the distance computation between the two strings consists of the following steps:
1. Denote the i-th character of a string S by S[i] and set j = 0.
2. Ps = i - P, if there exists i ∈ [P, P + k] such that S2[j] = S1[i]; Ps = -1 otherwise.
3. P = P + Ps + 1, if Ps ≠ -1; P is unchanged otherwise.
4. err = err + Ps, if Ps ≠ -1; err = err + 1 otherwise.
5. If err > k, stop the computation: S1 and S2 are different.
6. j = j + 1.
7. If P < n and j < m, go to step 2.
8. err = err + (n - P) + (m - j). If err ≤ k, the strings are decided to be similar; otherwise the strings are different.

For example, suppose S1 is "TRUST" and S2 is "TEST", with k = 1 and k = 3 (see the tables below). It should be mentioned that the cost of insertion and deletion is 1 and that of substitution is 2, as in the Wagner-Fischer algorithm with formula (2), but without the WeightFunction.

Similarity coefficient. To get more appropriate results, a two-step system for sorting out inappropriate words from the lexicon was applied. At the first step, the Levenshtein distance between the original word (generated by the GA) and the words from the lexicon is computed. At the second step, the similarity coefficient between the original word and a word from the lexicon is computed by the Relevance function. The similarity coefficient is a fractional number between 0 and 1: 0 means that two words are absolutely different, and 1 means that two words are identical.
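The two filtering algorithms compared in this paper can be sketched in C++ roughly as follows. This is a minimal illustration of formula (1) with C_substitution = 2 and of the ASAWT steps above, without the WeightFunction or the tree-structured lexicon; the function names are ours, not from the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Wagner-Fischer dynamic programming for the Levenshtein distance,
// formula (1): insertion/deletion cost 1, substitution cost 2
// (no WeightFunction, as in the paper's timing tests).
int wagnerFischer(const std::string& s1, const std::string& s2) {
    const int n = s1.size(), m = s2.size();
    std::vector<std::vector<int>> D(n + 1, std::vector<int>(m + 1, 0));
    for (int i = 0; i <= n; ++i) D[i][0] = i;  // deletions only
    for (int j = 0; j <= m; ++j) D[0][j] = j;  // insertions only
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j) {
            const int c = (s1[i - 1] == s2[j - 1]) ? 0 : 2;  // C_substitution
            D[i][j] = std::min({D[i][j - 1] + 1,        // insertion
                                D[i - 1][j] + 1,        // deletion
                                D[i - 1][j - 1] + c});  // substitution
        }
    return D[n][m];
}

// ASAWT distance check, steps 1-8 above: returns the accumulated error
// if the strings differ by at most k, and -1 otherwise.
int asawt(const std::string& s1, const std::string& s2, int k) {
    const int n = s1.size(), m = s2.size();
    if (std::abs(n - m) > k) return -1;  // restriction |n - m| <= k
    int P = 0, err = 0, j = 0;
    while (P < n && j < m) {             // step 7 condition
        int Ps = -1;                     // step 2: scan the window [P, P + k]
        for (int i = P; i <= P + k && i < n; ++i)
            if (s1[i] == s2[j]) { Ps = i - P; break; }
        if (Ps != -1) { P += Ps + 1; err += Ps; }  // steps 3-4
        else         { err += 1; }
        if (err > k) return -1;          // step 5: strings are different
        ++j;                             // step 6
    }
    err += (n - P) + (m - j);            // step 8: unmatched tails
    return (err <= k) ? err : -1;
}
```

On the paper's example, wagnerFischer("TRUST", "TEST") gives 3; asawt("TRUST", "TEST", 1) rejects the pair early, while asawt("TRUST", "TEST", 3) accepts it with error 3.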
There exist different word similarity coefficients, such as the Sörensen coefficient, the Kulczinsky coefficient, the Ochiai coefficient, the Szymkiewicz-Simpson coefficient and the Braun-Blanquet coefficient [5]. These coefficients reflect the dependence of the similarity coefficient on the lengths of the words. For example, formula (3) presents the Sörensen coefficient:

Ks = 2c / (a + b),   (3)

where a and b are the lengths of the words and c is the number of matching characters, which can be computed by formula (4), where the Levenshtein distance (LD) is computed with the cost of all edit operations equal to 1:

c = max(a, b) - LD.   (4)

The Relevance function. The Relevance function gives a similarity coefficient which takes into account the positions of the differences in the word: a difference at the beginning or at the end of the word is less critical than one in the middle [6]. The similarity coefficient in this case can be computed by formulas (5) and (6):

R = (1/N) · Σ_{i=1..N} r(i),   (5)

r(i) = (Match(Str1, Str2, i) + Match(Str2, Str1, i)) / (Count(Str1, i) + Count(Str2, i)),   (6)

where Match(S1, S2, i) is the number of matches of all substrings of length i from string S1 in string S2; Count(Str, i) = len(Str) - i + 1; and len(S) is the length of string S. For the following tests the Relevance function with N = 2 was used. A certain threshold should be specified for R to distinguish between similar and non-similar words.

Results. The computer program for testing was written in C++ using Microsoft Visual Studio 2010. The lexicon consists of 2 311 465 words. All tests were run on a computer with an Intel(R) Core(TM)2 CPU 6700 @ 2.66 GHz, running Linux. All results depend on the program implementation, the characteristics of the computer, the computer workload, the words chosen for the test and the lexicon. For testing, words of different sizes were randomly selected from the lexicon.
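The second-stage similarity measures above, formulas (3)-(6), can be sketched as follows; the helper names are ours, not from the paper. Match counts, for each length-i substring of one string, whether it occurs anywhere in the other.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Match(a, b, i): number of length-i substrings of a found anywhere in b.
static int matchCount(const std::string& a, const std::string& b, int i) {
    int count = 0;
    for (int p = 0; p + i <= static_cast<int>(a.size()); ++p)
        if (b.find(a.substr(p, i)) != std::string::npos) ++count;
    return count;
}

// Relevance-function similarity coefficient, formulas (5)-(6);
// the paper uses N = 2. Returns a value in [0, 1].
double relevance(const std::string& s1, const std::string& s2, int N = 2) {
    double R = 0.0;
    for (int i = 1; i <= N; ++i) {
        const double num = matchCount(s1, s2, i) + matchCount(s2, s1, i);
        const double den = (static_cast<int>(s1.size()) - i + 1)
                         + (static_cast<int>(s2.size()) - i + 1);  // Count
        R += num / den;  // r(i), formula (6)
    }
    return R / N;        // formula (5)
}

// Sörensen coefficient, formulas (3)-(4): Ks = 2c / (a + b),
// where c = max(a, b) - LD and LD uses unit costs for all edits.
double sorensen(int lenA, int lenB, int unitCostLD) {
    const int c = std::max(lenA, lenB) - unitCostLD;
    return 2.0 * c / (lenA + lenB);
}
```

For "TRUST" and "TEST" the unit-cost LD is 2, so c = 5 - 2 = 3 matching characters and Ks = 6/9 ≈ 0.67; identical words give a relevance of 1.0.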
The results of testing are presented in the following figures.

2nd International Workshop on Mathematical Models and its Applications

ASAWT computation of the distance for S1 = "TRUST" and S2 = "TEST":
- k = 1: 'T' matches (err = 0); 'E' finds no match in its window (err = 1); 'S' finds no match in its window (err = 2 > k), so the computation stops and the strings are rejected as different.
- k = 3: 'T' matches (err = 0); 'E' finds no match (err = 1); 'S' matches after skipping two characters (err = 3); the final 'T' matches, giving distance 3.

Fig. 1. Computational time for the algorithms with k = 1

Fig. 2. Computational time for the algorithms with k = 2

Abbreviations in the figures: L1 and L2 denote the computation of the Levenshtein distance using the Wagner-Fischer algorithm without the WeightFunction, with k = 1 and k = 2 respectively. L1+W and L2+W denote the computation of the Levenshtein distance using the Wagner-Fischer algorithm with the WeightFunction, with k = 1 and k = 2 respectively. NA1 and NA2 denote the computation of the distance using the ASAWT without the Relevance function, with k = 1 and k = 2 respectively. NAR1 and NAR2 denote the computation of the distance using the ASAWT with the Relevance function, with k = 1 and k = 2 respectively. The computational time for 100 runs of the algorithm, in seconds, is on the vertical axis; the length of the original words is on the horizontal axis.

Conclusion. To sum up, the figures above show that the ASAWT, both with and without the Relevance function, works faster than the Wagner-Fischer algorithm. The final lists of similar words are almost identical, so the two-step system keeps the accuracy at the same level. However, the algorithm still has a drawback: there are no weights modeling the closeness of the pronounced sounds. This problem is to be addressed in future work.

Acknowledgements. This work is partly supported by the DAAD (German Academic Exchange Service).

About the authors

Anastasia Spirina

Siberian State Aerospace University named after academician M. F. Reshetnev

Email: s_nastia@mail.ru
Postgraduate student of the Department of System Analysis and Operations Research. 31, Krasnoyarsky Rabochy Av., Krasnoyarsk, 660014, Russian Federation

Sergey Zablotskiy

Ulm University

Email: sergey.zablotskiy@uni-ulm.de
Master’s Degree student of the Department of System Analysis and Management; postgraduate student of the Dialogue Systems Department. 43, Albert-Einstein-Allee, Ulm, 89081, Germany

Maxim Sidorov

Ulm University

Email: maxim.sidorov@uni-ulm.de
Master’s Degree student of the Department of System Analysis and Management; postgraduate student of the Department of Dialogue Systems of the Institute of Communications Engineering. 43, Albert-Einstein-Allee, Ulm, 89081, Germany

References

  1. Zablotskiy S., Shvets A., Sidorov M., Semenkin E., Minker W. Speech and Language Resources for LVCSR of Russian. International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, May 2012.
  2. Smetanin N. Fuzzy string search. Nikita’s blog: search algorithms, software development and so on. March 24, 2011. Retrieved July 30, 2013, from http://ntz-develop.blogspot.ru/.
  3. Levenshtein V. I. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 1966, vol. 10, February.
  4. Wagner R. A., Fischer M. J. The string-to-string correction problem. J. ACM, 1974, vol. 21, no. 1, p. 168–173.
  5. Wikipedia. Similarity index (coefficient). Wikipedia, the free encyclopedia. Retrieved July 30, 2013, from http://ru.wikipedia.org/ (in Russian).
  6. Karakhtanov D. S. Using fuzzy search algorithms in data processing for credit institutions. Audit and Financial Analysis, 2010, vol. 2 (in Russian).


Copyright © Spirina A., Zablotskiy S.G., Sidorov M.Y., 2013

This work is licensed under a Creative Commons Attribution 4.0 International License.