乐胖代购免代理版

エムスリーエンジニアリンググループ AI・機械学習チームでソフトウェアエンジニアをしている中村(po3rin) です。検索とGoが好きです。

今回は文字列界隈を賑わせている高速なRust製パターンマッチングマシンDaachorseをPythonで呼び出して既存の文字列パターンマッチロジックを高速化したお話をします。

Daachorseとは
なぜPythonから呼び出したいのか
パターンマッチングのみのベンチマーク
python-daachorseだけオートマトン構築込みのベンチマーク
まとめ
- We are Hiring!
- その他

Daachorseとは

DaachorseはLegalForceさんで開発運用されている文字列パターンマッチを行うRust製ライブラリです。

github.com

技術的なトピックに関してはLegalForceさんの記事が全て解説しているののでそちらを参照してください。

tech.legalforce.co.jp

なぜPythonから呼び出したいのか

とある用途で文字列パターンマッチのロジックをPythonで組んでいたのですが、いまいちパフォーマンスが悪く、高速化できずに苦しんでいたところDaachorseのリリースがあり、是非とも使ってみたいと思いました。

しかし、AI・機械学習チームではデータ処理やモデル学習のパイプラインにPythonのgokartというモジュールを全面的に利用しており、基本的に何かを実装するときはPythonで開発されることが多いです。gokartに現行のデータ処理ロジックが乗っている以上、全てをRustで書き換えるのはかなりの大工事です。そこでロジック部分だけRustで書き直して高速化できないかと考えました。

そして調べてみるとDaachorseのPythonバインディング公開されていたので、こちらを利用することにしました。

github.com

python-daachorseではPyO3を利用しているようなので、もしPythonロジックをRustで書き直したいという欲求がある場合は使ってみてください。

github.com

僕自身もPythonでRustの処理を呼ぶ方法を勉強したかったので、実際にDaachorseをPyO3経由で呼び出すシンプルなサンプル実装を試してみました。PyO3の入門の参考になれば。

github.com

今回はpython-daachorseを使い、実際にベンチマークを取り、実戦投入できるかを調査しました。

パターンマッチングのみのベンチマーク

弊社の実際にACマシンが使われている処理を対象にpython-daachorseが我々の条件下でもパフォーマンスが出せるかを確認します。DaachorseのベンチマークはWord100K/UniDicデータセット両方ですでに公開されていますが、ベンチマークを改めてとった理由としては下記が挙げられます。

Pythonラッパー経由の呼び出しも含めたパフォーマンスが知りたかった
実際の弊社のパターン集合でもパフォーマンスが出せるか知りたかった
現在使っているpyahocorasickとの比較が知りたかった

ベースラインは弊社の現行ロジックで利用されているpyahocorasickというPythonモジュールと、Pure Python実装のahocorapy、そしてDaachorseと同じRust CrateのPythonバインディングであるahocorasick_rsです。

pyahocorasick

github.com

ahocorapy

github.com

ahocorasick_rs

github.com

今回のベンチマークではオートマトン構築は含めず、純粋なパターンマッチのみのベンチマークをとりました。

ベンチマーク環境

platform darwin -- Python 3.9.0, pytest-7.1.1, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)

モジュールのバージョンはこちら

pyahocorasick = "^1.4.4"
daachorse = "^0.1.3"
ahocorapy = "^1.6.1"
ahocorasick-rs = "^0.12.2"

ベンチマークのデータセットは下記になります。

自社で持つ日本語の医療系データセット
パターン数: 22948(弊社のとある医療辞書)
テキスト数: 5094(弊社のとある文書データ)

また、パターンとテキスト数はそれぞれ下記のような文字列長の分布になっています。

そしてベンチマークコードです。

import daachorse
import ahocorasick
import ahocorasick_rs
from ahocorapy.keywordtree import KeywordTree


def get_data() -> (list[str], list[str]):
    // 何かしら素敵なデータセットを返す


def substr_match_ahocorasick(automaton: any, haystacks: list[str]):
    for haystack in haystacks:
        x = list(automaton.iter(haystack))
    return x


def substr_match_ahocorapy(automaton: any, haystacks: list[str]):
    result = [automaton.search(t) for t in haystacks]
    return result


def substr_match_with_ahocorasick_rs(automaton: any, haystacks: list[str]):
    result = [automaton.find_matches_as_indexes(t) for t in haystacks]
    return result


def substr_match_with_daachorse(automaton: any, haystacks: list[str]):
    result = [automaton.find_overlapping(t) for t in haystacks]
    return result


def test_match_ahocorasick_benchmark(benchmark):
    patterns, haystacks = get_data()

    automaton = ahocorasick.Automaton()
    for idx, key in enumerate(patterns):
        automaton.add_word(key, (idx, key))
    automaton.make_automaton()

    ret = benchmark(substr_match_ahocorasick, automaton=automaton, haystacks=haystacks)
    assert len(ret)!=0


def test_match_ahocorapy_benchmark(benchmark):
    patterns, haystacks = get_data()

    automaton = KeywordTree(case_insensitive=True)
    for idx, key in enumerate(patterns):
        automaton.add(key)
    automaton.finalize()

    ret = benchmark(substr_match_ahocorapy, automaton=automaton, haystacks=haystacks)
    assert len(ret)!=0


def test_match_ahocorasick_rs_benchmark(benchmark):
    patterns, haystacks = get_data()

    automaton = ahocorasick_rs.AhoCorasick(patterns)

    ret = benchmark(substr_match_with_ahocorasick_rs, automaton=automaton, haystacks=haystacks)
    assert len(ret)!=0


def test_match_daachorse_benchmark(benchmark):
    patterns, haystacks = get_data()

    automaton = daachorse.Automaton(patterns)

    ret = benchmark(substr_match_with_daachorse, automaton=automaton, haystacks=haystacks)
    assert len(ret)!=0

結果は下記になります。

-------------------------------------------------------------------------------------------------- benchmark: 4 tests -------------------------------------------------------------------------------------------------
Name (time in ms)                              Min                   Max                  Mean              StdDev                Median                 IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_match_daachorse_benchmark             80.8431 (1.0)        147.8423 (1.0)         93.7325 (1.0)       19.8576 (1.0)         86.4954 (1.0)       11.9653 (1.12)          1;1  10.6687 (1.0)          10           1
test_match_ahocorasick_rs_benchmark       123.6172 (1.53)       412.8799 (2.79)       179.8610 (1.92)     114.3097 (5.76)       136.1129 (1.57)      10.6961 (1.0)           1;1   5.5598 (0.52)          6           1
test_match_ahocorasick_benchmark          736.3777 (9.11)       901.6807 (6.10)       776.5008 (8.28)      70.4611 (3.55)       745.9376 (8.62)      54.7538 (5.12)          1;1   1.2878 (0.12)          5           1
test_match_ahocorapy_benchmark          1,339.0980 (16.56)    3,124.5482 (21.13)    1,908.8495 (20.36)    744.7254 (37.50)    1,565.3466 (18.10)    979.0138 (91.53)         1;0   0.5239 (0.05)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Daachorse素敵です。Meanでの比較では現行のpyahocorasickよりも8倍速いです。また、同じRust実装のPythonバインディングよりも速いという結果になりました。この結果から実践投入で十分活躍できると判断しました。

python-daachorseだけオートマトン構築込みのベンチマーク

弊社ではgokartでパイプラインを構築しているので、1回構築したオートマトンはgokartキャッシュ(pickle)として保存しておきたいところです。しかし、builtins.Automatonはpickleで保存ができません。開発者からコメントをいただいた通り、serialize/deserializeをpython-daachorseがWrapしていないためです。

serialize/deserializeはunsafeなのでラッパーを書きたくないんですよね。daachorseの中では所々get_uncheckedを使っていて、信頼できないデータをdeserializeした際に何が起こるか分からないので。
— 水先案内人＠江戸川 (@vbkaisetsu) 2022年9月25日

unsafe Rustに関しては下記のThe Rust Programming Language 日本語版のガイドが非常に勉強になります。

doc.rust-jp.rs

unsafeなコードをできるだけ分離するために、unsafeなコードを安全な抽象の中に閉じ込め、安全なAPIを提供するのが最善です。... 標準ライブラリの一部は、検査されたunsafeコードの安全な抽象として実装されています。安全な抽象にunsafeなコードを包むことで、 unsafeが、あなたやあなたのユーザがunsafeコードで実装された機能を使いたがる可能性のある箇所全部に漏れ出ることを防ぎます。

unsafe Rustのガイドの通り、python-daachorseでは安全なAPIを提供するためにunsafeを安全な抽象で包んでいくという意図があるようです。

ゆえにpython-daachorseではプロセスを実行するたびに毎回オートマトンを構築し直す必要があります。そのため実践投入のためにはオートマトン構築も含めたベンチマークも取る必要がありました。

先ほどのコードにもう1つベンチマークを追加します。このベンチマークではオートマトン構築も含んでいます。

def substr_match_with_daachorse_build_automaton(patterns: list[str], haystacks: list[str]):
    automaton = daachorse.Automaton(patterns)
    result = [automaton.find_overlapping(t) for t in haystacks]
    return result


def test_match_daachorse_with_build_automaton_benchmark(benchmark):
    patterns, haystacks = get_data()
    ret = benchmark(substr_match_with_daachorse_build_automaton, patterns=patterns, haystacks=haystacks)
    assert len(ret)!=0

結果です。

----------------------------------------------------------------------------------------------------- benchmark: 5 tests ----------------------------------------------------------------------------------------------------
Name (time in ms)                                            Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_match_daachorse_benchmark                           49.9756 (1.0)       66.1039 (1.0)       54.0174 (1.0)       5.1499 (7.58)      51.5605 (1.0)       5.7571 (15.38)         3;1  18.5126 (1.0)          16           1
test_match_ahocorasick_rs_benchmark                      66.9054 (1.34)      68.9367 (1.04)      67.4963 (1.25)      0.6794 (1.0)       67.3037 (1.31)      0.3743 (1.0)           2;2  14.8156 (0.80)         12           1
test_match_daachorse_with_build_automaton_benchmark     112.0056 (2.24)     144.7642 (2.19)     127.7381 (2.36)     13.8674 (20.41)    133.4831 (2.59)     25.1129 (67.09)         3;0   7.8285 (0.42)          7           1
test_match_ahocorasick_benchmark                        497.1756 (9.95)     501.2648 (7.58)     499.6143 (9.25)      1.8371 (2.70)     500.4676 (9.71)      3.1697 (8.47)          1;0   2.0015 (0.11)          5           1
test_match_ahocorapy_benchmark                          650.2134 (13.01)    652.2055 (9.87)     651.1910 (12.06)     0.8694 (1.28)     651.2392 (12.63)     1.5757 (4.21)          2;0   1.5356 (0.08)          5           1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

オートマトン構築込みでも現在利用しているpyahocorasickのオートマトン構築スキップより3倍以上速い！優秀だ！この結果から、プロセスを起動するたびに毎回オートマトン構築してもお釣りが来るので、Daachorse採用に舵を取りました。

まとめ

今回はDaachorseで既存のpyahocorasickを使ったパターンマッチを高速化する話を紹介しました。データによっては他のモジュールの方が高速の場合もあるので、導入前には今回のようにベンチマークをとって調べてみると良いでしょう。個人的にはこの調査の中で、RustをPyO3を使ってPythonで呼び出す方法や、unsafe Rustについて非常に勉強になりました。

We are Hiring!

エムスリーでは文字列処理、自然言語処理で医療に貢献していきたいメンバーを募集しています。「ちょっと話を聞いてみたいかも」という人はこちらから！

jobs.m3.com

その他

カバー画像はUnsplashのBritish Libraryの画像です。ありがとうございます。

エムスリーテックブログ

エムスリー(m3)のエンジニア・開発メンバーによる技術ブログです

Rust製パターンマッチングマシンDaachorseを使ってPythonパイプラインを高速化する話

Daachorseとは

なぜPythonから呼び出したいのか

パターンマッチングのみのベンチマーク

python-daachorseだけオートマトン構築込みのベンチマーク

まとめ

We are Hiring!

その他