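String algorithms are a class of algorithms in computer science for processing text strings. They are used for searching, matching, sorting, compression, encryption, and similar operations. Python ships with many string operations, and the rest are straightforward to write by hand; the sections below briefly introduce some common string algorithms and their Python implementations.
String matching algorithms
String matching algorithms find the position of a given pattern string inside a text string. Common ones include brute-force matching, the KMP (Knuth-Morris-Pratt) algorithm, and the BM (Boyer-Moore) algorithm.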
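As a point of reference for the hand-written matchers, Python's built-in `str.find` already answers the same question: it returns the lowest index at which the pattern occurs, or -1 when it is absent. The wrapper name `find_pattern` below is only an illustrative assumption; a minimal sketch:

```python
def find_pattern(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    # str.find returns -1 for a missing pattern instead of raising,
    # which is the same convention the hand-written matchers use.
    return text.find(pattern)


print(find_pattern("hello world", "world"))  # 6
print(find_pattern("hello world", "xyz"))    # -1
print(find_pattern("hello world", ""))       # 0
```

A baseline like this is handy for checking a hand-written matcher against many inputs.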
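Python implementation of brute-force matching:
def brute_force_match(text, pattern):
    m = len(text)
    n = len(pattern)
    # Try every alignment of the pattern against the text.
    for i in range(m - n + 1):
        j = 0
        # Extend the match one character at a time.
        while j < n and text[i + j] == pattern[j]:
            j += 1
        if j == n:
            return i  # the whole pattern matched at offset i
    return -1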
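Python implementation of the KMP algorithm:
def kmp_match(text, pattern):
    m = len(text)
    n = len(pattern)
    if n == 0:
        return 0
    prefix = compute_prefix(pattern)
    j = 0  # number of pattern characters matched so far
    for i in range(m):
        # On a mismatch, fall back to the longest border of the matched part.
        while j > 0 and text[i] != pattern[j]:
            j = prefix[j - 1]
        if text[i] == pattern[j]:
            j += 1
        if j == n:
            return i - n + 1
    return -1

def compute_prefix(pattern):
    # prefix[i] is the length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    n = len(pattern)
    prefix = [0] * n
    j = 0
    for i in range(1, n):
        while j > 0 and pattern[i] != pattern[j]:
            j = prefix[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        prefix[i] = j
    return prefix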
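To make the prefix table concrete, here is a small self-contained sketch that computes it for a sample pattern. The function name `prefix_function` and the sample pattern are assumptions chosen for illustration; the logic is the standard prefix-function computation used by KMP.

```python
def prefix_function(pattern):
    """prefix[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]."""
    prefix = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        # Shrink the candidate border until it can be extended by pattern[i].
        while j > 0 and pattern[i] != pattern[j]:
            j = prefix[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        prefix[i] = j
    return prefix


print(prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
print(prefix_function("aaaa"))     # [0, 1, 2, 3]
```

For "ababaca", the value 3 at index 4 says that "aba" is both a prefix and a suffix of "ababa", so after a mismatch at the next character the matcher can resume with three characters already matched.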
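Python implementation of the BM algorithm:
def bm_match(text, pattern):
    m = len(text)
    n = len(pattern)
    if n == 0:
        return 0
    bc = bad_character_table(pattern)
    suffix, prefix = good_suffix_table(pattern)
    i = 0  # text index aligned with pattern[0]
    while i <= m - n:
        # Compare from the right end of the pattern.
        j = n - 1
        while j >= 0 and text[i + j] == pattern[j]:
            j -= 1
        if j < 0:
            return i
        # Bad-character rule: align the mismatched text character with its
        # rightmost occurrence in the pattern (may be non-positive).
        bad_shift = j - bc.get(text[i + j], -1)
        # Good-suffix rule: only applies when at least one character matched.
        good_shift = 0
        if j < n - 1:
            good_shift = good_suffix_shift(j, n, suffix, prefix)
        i += max(bad_shift, good_shift)
    return -1

def bad_character_table(pattern):
    # Rightmost index of every character in the pattern.
    bc = {}
    for i in range(len(pattern)):
        bc[pattern[i]] = i
    return bc

def good_suffix_table(pattern):
    # suffix[k]: start index of another occurrence of the length-k suffix, or -1.
    # prefix[k]: True if the length-k suffix is also a prefix of the pattern.
    n = len(pattern)
    suffix = [-1] * n
    prefix = [False] * n
    for i in range(n - 1):
        j = i
        k = 0
        while j >= 0 and pattern[j] == pattern[n - 1 - k]:
            j -= 1
            k += 1
            suffix[k] = j + 1
        if j == -1:
            prefix[k] = True
    return suffix, prefix

def good_suffix_shift(j, n, suffix, prefix):
    # j is the index of the mismatched pattern character.
    k = n - 1 - j  # length of the suffix that matched
    if suffix[k] != -1:
        return j - suffix[k] + 1
    # Otherwise look for the longest matched sub-suffix that is also a prefix.
    for r in range(j + 2, n):
        if prefix[n - r]:
            return r
    return n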
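The bad-character rule is the easier of the two Boyer-Moore heuristics to see in isolation. The sketch below computes only that shift for a single mismatch; the helper name `bad_character_shift` and its parameters are assumptions made for illustration, and the clamp to 1 stands in for the good-suffix rule that a full implementation would combine it with.

```python
def bad_character_shift(pattern, j, text_char):
    """Shift suggested by the bad-character rule after a mismatch at pattern[j]
    against text_char."""
    last = {ch: i for i, ch in enumerate(pattern)}  # rightmost index of each character
    # A character that never occurs in the pattern maps to -1, giving shift j + 1.
    # On its own the rule can suggest a non-positive shift, so clamp it to 1.
    return max(1, j - last.get(text_char, -1))


print(bad_character_shift("abd", 2, "c"))   # 3: 'c' is not in the pattern
print(bad_character_shift("abd", 2, "a"))   # 2: align text 'a' with pattern[0]
print(bad_character_shift("abda", 1, "a"))  # 1: rightmost 'a' lies right of j
```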
String sorting algorithms
String sorting algorithms put a collection of strings in order. Common ones include radix sort, quicksort, and merge sort.
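Before looking at hand-written sorts, it helps to see what "sorted order" means for Python strings. The built-in `sorted` compares strings character by character using code points, so uppercase letters come before lowercase ones and a prefix sorts before any longer string that starts with it. The sample words are arbitrary; a small sketch:

```python
words = ["banana", "apple", "Cherry", "app"]

# Default ordering compares Unicode code points: 'C' (67) sorts before 'a' (97).
print(sorted(words))                 # ['Cherry', 'app', 'apple', 'banana']

# A key function changes the comparison without changing the stored strings.
print(sorted(words, key=str.lower))  # ['app', 'apple', 'banana', 'Cherry']
```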
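Python implementation of radix sort:
def radix_sort(strings):
    # LSD radix sort. Assumes every character has a code point below 256.
    if not strings:
        return []
    RADIX = 256
    max_length = max(len(s) for s in strings)

    def key(s, d):
        # Bucket 0 means "no character at position d", so shorter strings
        # sort before longer strings that share the same prefix.
        return ord(s[d]) + 1 if d < len(s) else 0

    # Process character positions from right to left.
    for d in range(max_length - 1, -1, -1):
        counts = [0] * (RADIX + 1)
        for s in strings:
            counts[key(s, d)] += 1
        # Turn counts into end positions for each bucket.
        for i in range(1, RADIX + 1):
            counts[i] += counts[i - 1]
        temp = [None] * len(strings)
        # Walk backwards so that each pass stays stable.
        for s in reversed(strings):
            k = key(s, d)
            counts[k] -= 1
            temp[counts[k]] = s
        strings = temp
    return strings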
Python implementation of quicksort:
def quick_sort(strings):
    if len(strings) <= 1:
        return strings
    # Use the first string as the pivot and partition the rest around it.
    pivot = strings[0]
    less = [s for s in strings[1:] if s < pivot]
    greater = [s for s in strings[1:] if s >= pivot]
    return quick_sort(less) + [pivot] + quick_sort(greater)
Python implementation of merge sort:
def merge_sort(strings):
    if len(strings) <= 1:
        return strings
    mid = len(strings) // 2
    left = merge_sort(strings[:mid])
    right = merge_sort(strings[mid:])
    return merge(left, right)

def merge(left, right):
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Using <= takes equal items from the left half first, which keeps the sort stable.
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # At most one of these two slices is non-empty.
    result.extend(left[i:])
    result.extend(right[j:])
    return result
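The merge step on its own is also available in the standard library as `heapq.merge`, which lazily merges inputs that are already sorted. The wrapper name `merge_sorted` and the sample lists are assumptions for illustration; a minimal sketch:

```python
import heapq


def merge_sorted(left, right):
    """Merge two already-sorted lists of strings into one sorted list."""
    # heapq.merge returns an iterator, so wrap it in list() to materialize it.
    return list(heapq.merge(left, right))


print(merge_sorted(["apple", "cherry"], ["banana", "date"]))
# ['apple', 'banana', 'cherry', 'date']
```

The inputs must already be sorted; `heapq.merge` does not sort them.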
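String compression algorithms
String compression algorithms encode a string in a shorter form to save storage space. Common ones include Huffman coding and the LZW algorithm.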
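For comparison with the hand-written coders, the standard library's `zlib` module offers a ready-made lossless compressor (DEFLATE, which combines LZ77 with Huffman coding). The helper names `compress_text` and `decompress_text` and the sample text are assumptions for illustration; a minimal round-trip sketch:

```python
import zlib


def compress_text(text):
    """Compress a string to bytes (UTF-8 encoded, then DEFLATE)."""
    return zlib.compress(text.encode("utf-8"))


def decompress_text(data):
    """Invert compress_text."""
    return zlib.decompress(data).decode("utf-8")


sample = "abracadabra " * 50
packed = compress_text(sample)
print(len(sample.encode("utf-8")), "->", len(packed), "bytes")
print(decompress_text(packed) == sample)  # True
```

Highly repetitive input like this compresses well; short or random strings may not shrink at all.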
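Python implementation of Huffman coding:
from heapq import heappush, heappop, heapify
from collections import defaultdict

def huffman_encode(text):
    # Count how often each character occurs.
    freq = defaultdict(int)
    for c in text:
        freq[c] += 1
    if not freq:
        return "", {}
    if len(freq) == 1:
        # A single distinct symbol would otherwise receive an empty code.
        codes = {next(iter(freq)): "0"}
        return "0" * len(text), codes
    # Each heap entry is [weight, [symbol, code], [symbol, code], ...].
    heap = [[wt, [sym, ""]] for sym, wt in freq.items()]
    heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        lo = heappop(heap)
        hi = heappop(heap)
        for pair in lo[1:]:
            pair[1] = '0' + pair[1]
        for pair in hi[1:]:
            pair[1] = '1' + pair[1]
        heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    codes = dict(heappop(heap)[1:])
    encoded_text = "".join(codes[c] for c in text)
    return encoded_text, codes

def huffman_decode(encoded_text, codes):
    rev_codes = {v: k for k, v in codes.items()}
    decoded = []
    current = ""
    # Huffman codes are prefix-free, so the first match is always the right one.
    for bit in encoded_text:
        current += bit
        if current in rev_codes:
            decoded.append(rev_codes[current])
            current = ""
    if current:
        raise ValueError("Encoded text ends with an incomplete code")
    return "".join(decoded)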
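To get a feel for how much Huffman coding saves, the size of the encoded output can be computed without building the codes at all: each merge of two subtrees adds one bit to every symbol beneath it, so the total bit count is the sum of all merged weights. The function name `huffman_encoded_bits` is an assumption for illustration, and the single-symbol case assumes a one-bit code per character; a minimal sketch:

```python
import heapq
from collections import Counter


def huffman_encoded_bits(text):
    """Number of bits a Huffman code needs to encode text (code table excluded)."""
    weights = list(Counter(text).values())
    if len(weights) <= 1:
        # Assumption: a lone symbol is given a 1-bit code, so one bit per character.
        return len(text)
    heapq.heapify(weights)
    total = 0
    while len(weights) > 1:
        # Merging two subtrees lengthens every code beneath them by one bit.
        merged = heapq.heappop(weights) + heapq.heappop(weights)
        total += merged
        heapq.heappush(weights, merged)
    return total


print(huffman_encoded_bits("abracadabra"))  # 23
print(8 * len("abracadabra"))               # 88 bits at one byte per character
```

The figure ignores the cost of storing the code table, which matters for short inputs.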
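Python implementation of the LZW algorithm:
def lzw_encode(text):
    # Assumes every character has a code point below 256.
    code_dict = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""  # longest phrase seen so far that is already in the dictionary
    for c in text:
        if current + c in code_dict:
            current += c
        else:
            yield code_dict[current]
            # Register the new phrase and restart from the current character.
            code_dict[current + c] = next_code
            next_code += 1
            current = c
    if current:
        yield code_dict[current]

def lzw_decode(codes):
    code_dict = {i: chr(i) for i in range(256)}
    next_code = 256
    codes = iter(codes)  # accept lists as well as generators
    try:
        previous = code_dict[next(codes)]
    except StopIteration:
        return ""
    result = [previous]
    for c in codes:
        if c in code_dict:
            entry = code_dict[c]
        elif c == next_code:
            # The code refers to the phrase that is being defined right now.
            entry = previous + previous[0]
        else:
            raise ValueError("Bad compressed code")
        result.append(entry)
        code_dict[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(result)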
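String search algorithms
String search algorithms find where, or how many times, a substring occurs in a string. Common ones include the brute-force algorithm, the KMP algorithm, and the Boyer-Moore algorithm.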
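Counting occurrences needs one decision that locating the first match does not: whether overlapping matches count. The sketch below collects every start position, overlaps included, by repeatedly calling the built-in `str.find`; the function name `find_all` and the choice to return an empty list for an empty pattern are assumptions for illustration.

```python
def find_all(text, pattern):
    """Return the start index of every (possibly overlapping) occurrence."""
    if not pattern:
        return []  # assumption: an empty pattern is treated as having no matches
    positions = []
    i = text.find(pattern)
    while i != -1:
        positions.append(i)
        # Resume one character later so overlapping matches are kept.
        i = text.find(pattern, i + 1)
    return positions


print(find_all("abababa", "aba"))  # [0, 2, 4]
print("abababa".count("aba"))      # 2, because str.count skips overlaps
```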
Python implementation of the brute-force algorithm:
def brute_force_search(text, pattern):
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        # Compare the window of length m starting at i with the pattern.
        if text[i:i + m] == pattern:
            return i
    return -1
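Python implementation of the KMP algorithm:
def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return 0  # without this guard, pattern[0] below raises IndexError
    fail = compute_fail(pattern)
    j = 0  # number of pattern characters matched so far
    for i in range(n):
        while j > 0 and pattern[j] != text[i]:
            j = fail[j - 1]
        if pattern[j] == text[i]:
            j += 1
        if j == m:
            return i - m + 1
    return -1

def compute_fail(pattern):
    # fail[i] is the length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    m = len(pattern)
    fail = [0] * m
    j = 0
    for i in range(1, m):
        while j > 0 and pattern[j] != pattern[i]:
            j = fail[j - 1]
        if pattern[j] == pattern[i]:
            j += 1
        fail[i] = j
    return fail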
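Python implementation of the Boyer-Moore algorithm:
def boyer_moore_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Rightmost index of every character in the pattern.
    last = {}
    for i in range(m):
        last[pattern[i]] = i
    i = m - 1  # index into the text
    j = m - 1  # index into the pattern
    while i < n:
        if text[i] == pattern[j]:
            if j == 0:
                return i  # matched all the way back to the first character
            else:
                i -= 1
                j -= 1
        else:
            # Bad-character rule: k is -1 when the character is not in the pattern.
            k = last.get(text[i], -1)
            i += m - min(j, k + 1)
            j = m - 1  # restart the comparison at the end of the pattern
    return -1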
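The above are Python implementations of three common string search algorithms. With n the length of the text and m the length of the pattern, their worst-case time complexities are O(nm), O(n + m), and O(nm) respectively. The Boyer-Moore version shown here uses only the bad-character rule, which is why its worst case matches brute force even though it often examines far fewer characters. In practice, different algorithms suit different scenarios. For example, with a short pattern and a long text, the brute-force algorithm may be faster than KMP, because it needs no preprocessing and most mismatches occur on the first character or two; with a long pattern, Boyer-Moore may be faster than KMP, because a single mismatch can shift the pattern by up to m positions.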
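The quadratic worst case of brute-force search is easy to observe by counting character comparisons on an adversarial input, where every alignment matches all but the last pattern character. The function name `count_naive_comparisons` and the input sizes are assumptions for illustration; a minimal sketch:

```python
def count_naive_comparisons(text, pattern):
    """Count character comparisons made by brute-force search until the first
    match, or until the text is exhausted."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break
        else:
            return comparisons  # inner loop never broke: full match at i
    return comparisons


# 16 alignments, each comparing all 5 pattern characters before failing.
print(count_naive_comparisons("a" * 20, "aaaab"))  # 80
```

On typical text most alignments fail after one or two comparisons, which is why the brute-force approach is often competitive despite its worst case.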
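Besides these three, there are other string search algorithms, such as the Sunday algorithm and the Rabin-Karp algorithm. Choosing the algorithm that fits the workload improves search efficiency and, with it, overall program performance.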
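As an illustration of one of the alternatives just mentioned, here is a minimal sketch of Rabin-Karp, which compares a rolling hash of each text window with the hash of the pattern and only falls back to a character comparison when the hashes agree. The function name and the `base` and `modulus` defaults are assumptions picked for the example, not canonical values.

```python
def rabin_karp_search(text, pattern, base=256, modulus=1_000_000_007):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, modulus)  # weight of the window's leading character
    p_hash = t_hash = 0
    for k in range(m):
        p_hash = (p_hash * base + ord(pattern[k])) % modulus
        t_hash = (t_hash * base + ord(text[k])) % modulus
    for i in range(n - m + 1):
        # Equal hashes can be a collision, so confirm with a real comparison.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Roll the hash: drop text[i], shift, and append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base + ord(text[i + m])) % modulus
    return -1


print(rabin_karp_search("hello world", "world"))  # 6
print(rabin_karp_search("abracadabra", "cad"))    # 4
print(rabin_karp_search("abc", "d"))              # -1
```

The expected running time is O(n + m) when collisions are rare, but the worst case is still O(nm) if many windows collide with the pattern's hash.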