Replacing String Content in R: Find and Replace in R

Reposted

mob64ca14116c53 2023-08-17 07:32:01


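Requirements

To make fuzzy matching or tag generation easier, you often need to check whether a long string contains a specific short string, find the position of that short string within the long string, or check/find which strings in a vector contain it.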
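The article names "finding the position" as a requirement, so here is a minimal sketch of one way to do it in base R with regexpr. The tag string is an English stand-in for the article's data, and fixed = TRUE is an assumption that the pattern should be matched as literal text.

```r
# English stand-in for a comma-separated tag string (illustrative data)
tags <- "other,dress-up game,anime card game"

# fixed = TRUE treats the pattern as literal text, not as a regular expression
pos <- regexpr("dress-up game", tags, fixed = TRUE)

start <- as.integer(pos)            # position of the first match, or -1 if none
len   <- attr(pos, "match.length")  # number of characters matched

# Cut the matched substring back out to confirm the position
found <- substr(tags, start, start + len - 1)
print(start)
print(found)
```

gregexpr works the same way but reports every match rather than only the first one.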

%in%

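A very handy operator, but it treats each whole string as the smallest unit of comparison, so it cannot tell you whether a long string contains a specific short string.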

Example

# Tags: 其他 = other, 换装游戏 = dress-up game, 二次元卡牌 = anime card game
"换装游戏" %in% "其他,换装游戏,二次元卡牌"
[1] FALSE
"换装游戏" %in% c("其他","换装游戏","二次元卡牌")
[1] TRUE
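If the tags arrive as one comma-separated string, one option is to split it first so that %in% can compare whole elements. This is a small sketch with English stand-in data; the comma separator is an assumption about the input format.

```r
# English stand-in for a comma-separated tag string (illustrative data)
tags <- "other,dress-up game,anime card game"

# strsplit returns a list with one element per input string; take the first
parts <- strsplit(tags, ",", fixed = TRUE)[[1]]

# %in% now compares against each tag separately
is_member <- "dress-up game" %in% parts
print(parts)
print(is_member)
```

This gives an exact tag match, which differs from a substring match: "game" alone would not be found in parts.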

grep

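Syntax:

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)

Description:
Searches the vector x for elements that contain the string given by pattern (interpreted as a regular expression unless fixed = TRUE) and returns their indices in x.

Arguments
invert → if TRUE, returns the indices of the elements that do not contain pattern
value → if TRUE, returns the matching elements themselves instead of their indices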

Example

grep("换装游戏", "其他,换装游戏,二次元卡牌")
[1] 1
grep("换装游戏", c("其他","换装游戏","二次元卡牌"))
[1] 2
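The value and invert arguments described above can be seen side by side in a short sketch. The vector holds English stand-ins for the article's tags.

```r
# English stand-ins for the tag vector (illustrative data)
tags <- c("other", "dress-up game", "anime card game")

idx_match    <- grep("game", tags)                               # indices of matching elements
val_match    <- grep("game", tags, value = TRUE)                 # the matching elements themselves
idx_nomatch  <- grep("game", tags, invert = TRUE)                # indices of non-matching elements
val_nomatch  <- grep("game", tags, invert = TRUE, value = TRUE)  # the non-matching elements

print(idx_match)
print(val_match)
print(idx_nomatch)
print(val_nomatch)
```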

grepl

Syntax:

grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

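Description:
Like grep, but returns a logical vector saying whether each element contains pattern.

Arguments
Unlike grep, grepl has no invert or value argument. To select the elements that do not match, negate the result with !.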

Example

grepl("换装游戏", "其他,换装游戏,二次元卡牌")
[1] TRUE
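Because grepl is vectorized over x, a single call can test a whole vector, and the logical result can be used directly for subsetting or negated with !. The sketch below uses English stand-in data; fixed = TRUE is an assumption that the pattern is literal text, which also lets a pattern contain a bare parenthesis.

```r
# English stand-ins for the tag vector; "(beta)" is added for illustration
tags <- c("other", "dress-up game", "anime card game (beta)")

has_game <- grepl("game", tags, fixed = TRUE)  # one TRUE/FALSE per element
no_game  <- !has_game                          # negation replaces grep's invert

matching <- tags[has_game]                     # subset with the logical vector

# With fixed = TRUE an unmatched "(" is just an ordinary character
has_beta <- grepl("(beta", tags, fixed = TRUE)

print(has_game)
print(matching)
print(has_beta)
```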

str_detect

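Description:
Checks whether a string contains a given substring, returning TRUE if it does and FALSE otherwise. str_detect comes from the stringr package.
It does the same job as grepl. In the author's experience it raised errors in practice, so the author prefers to use grepl directly. Both functions treat pattern as a regular expression by default, so a pattern containing regex metacharacters such as an unmatched parenthesis makes either one fail unless it is marked as literal text (fixed() in stringr, fixed = TRUE in grepl).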

Usage
str_detect(string, pattern, negate = FALSE)

Arguments
string: Input vector. Either a character vector, or something coercible to one.

pattern: Pattern to look for.

negate: If TRUE, return non-matching elements.

Value
A logical vector.

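Example

# "感冒" means "a cold"; "感？冒" splits the word with a question mark, so it does not match
library(stringr)
test <- c("这里有天气热敏感冒","好天气","感冒了，也要加油","感？冒","","不是","感冒？不是！")
str_detect(test, "感冒")
[1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE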

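Related error

str_detect sometimes fails with the error below. It appears when the pattern, which is interpreted as a regular expression by default, contains an unmatched parenthesis: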

Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
 Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
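The same class of failure can be reproduced in base R, which may help to see where it comes from. In this sketch the game name and the tag string are hypothetical; the point is only that an unmatched "(" is invalid as a regular expression but fine as literal text.

```r
# Hypothetical game name containing an unmatched "(" (illustrative data)
name <- "Game (CN"
text <- "Other,Game (CN,Card"

# As a regular expression the pattern is invalid, so grepl raises an error;
# tryCatch turns that error into a marker value instead of stopping the script
regex_result <- tryCatch(
  suppressWarnings(grepl(name, text)),
  error = function(e) "regex error"
)

# As literal text the same pattern matches without any problem
literal_result <- grepl(name, text, fixed = TRUE)

print(regex_result)
print(literal_result)
```

The wording of the error differs between base R and stringr, but the cause is the same: the pattern is being parsed as a regular expression.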

The code in use at the time was the following:

> for ( i in 1:nrow(user_group)){
+   index = 0
+   if (str_detect(group_tag$game[1], user_group$game_name[i])){ index = 1}
+   if (str_detect(group_tag$game[2], user_group$game_name[i])){ index = 2}
+   if (str_detect(group_tag$game[3], user_group$game_name[i])){ index = 3}
+   if (str_detect(group_tag$game[4], user_group$game_name[i])){ index = 4}
+   if (str_detect(group_tag$game[5], user_group$game_name[i])){ index = 5}
+   user_group$type[i] = index
+ }

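There is a fix: wrap the second argument in fixed(), as shown below, so that it is matched as literal text instead of as a regular expression. The error then disappears, but in the author's run the loop was very slow (although it was not fast without fixed() either...), most likely because of the row-by-row loop rather than because of fixed().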

> for ( i in 1:nrow(user_group)){
+   index = 0
+   if (str_detect(group_tag$game[1], fixed(user_group$game_name[i]))) { index = 1}
+   if (str_detect(group_tag$game[2], fixed(user_group$game_name[i]))) { index = 2}
+   if (str_detect(group_tag$game[3], fixed(user_group$game_name[i]))) { index = 3}
+   if (str_detect(group_tag$game[4], fixed(user_group$game_name[i]))) { index = 4}
+   if (str_detect(group_tag$game[5], fixed(user_group$game_name[i]))) { index = 5}
+   user_group$type[i] = index
+ }
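As a rough alternative to the loop, the same assignment can be sketched with base grepl and fixed = TRUE, testing all groups in one call per game name. The data frames reuse the names user_group and group_tag from the code above, but their contents here are hypothetical (three groups instead of five), and find_group is a helper name introduced only for this illustration.

```r
# Hypothetical stand-ins for the article's data frames (illustrative data)
group_tag <- data.frame(
  game = c("Other,Game (CN", "Dress-Up,Fashion", "Card A,Card B"),
  stringsAsFactors = FALSE
)
user_group <- data.frame(
  game_name = c("Fashion", "Game (CN", "Unknown", "Card B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index of the last group whose tag string contains
# the name, or 0 if none does (the loop above also keeps the last match)
find_group <- function(name, groups) {
  hits <- which(grepl(name, groups, fixed = TRUE))
  if (length(hits) == 0) 0L else max(hits)
}

user_group$type <- vapply(
  user_group$game_name, find_group, integer(1),
  groups = group_tag$game, USE.NAMES = FALSE
)
print(user_group)
```

Whether this is faster on real data depends on its size; it mainly removes the repeated if statements and the per-row assignment into the data frame.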

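In summary, for substring lookups and fuzzy matching, the author still recommends grepl from the base package: it is fast and, in the author's experience, less error-prone than str_detect. Remember to pass fixed = TRUE when the pattern is literal text, and to use %in% and grep where they fit.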
