Representing text data with character vectors
> x<- "Hello,World"
> is.character(x)
[1] TRUE
> length(x) # number of elements in the vector
[1] 1
> nchar(x) # number of characters in the element
[1] 11
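The difference between length and nchar is easier to see on a vector with more than one element. The following is a minimal sketch; the vector fruits is made-up example data, not part of the original notes.

```r
# Hypothetical example data: a character vector with three elements
fruits <- c("apple", "fig", "banana")

n_elements <- length(fruits)  # how many elements the vector holds
n_chars    <- nchar(fruits)   # how many characters each element holds

print(n_elements)
print(n_chars)
```

length counts elements, while nchar is vectorized and returns one count per element.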
Working with text
> letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
[18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
> LETTERS[24:25] # extract elements 24 to 25
[1] "X" "Y"
> tail(LETTERS,7) # the last 7 elements of LETTERS
[1] "T" "U" "V" "W" "X" "Y" "Z"
> head(LETTERS,7) # the first 7 elements of LETTERS
[1] "A" "B" "C" "D" "E" "F" "G"
> islands
          Africa       Antarctica             Asia        Australia
           11506             5500            16988             2968
    Axel Heiberg           Baffin            Banks           Borneo
              16              184               23              280
         Britain          Celebes            Celon             Cuba
              84               73               25               43
           Devon        Ellesmere           Europe        Greenland
              21               82             3745              840
          Hainan       Hispaniola         Hokkaido           Honshu
              13               30               30               89
         Iceland          Ireland             Java           Kyushu
              40               33               49               14
           Luzon       Madagascar         Melville         Mindanao
              42              227               16               36
        Moluccas      New Britain       New Guinea  New Zealand (N)
              29               15              306               44
 New Zealand (S)     Newfoundland    North America    Novaya Zemlya
              58               43             9390               32
 Prince of Wales         Sakhalin    South America      Southampton
              13               29             6795               16
     Spitsbergen          Sumatra           Taiwan         Tasmania
              15              183               14               26
Tierra del Fuego            Timor        Vancouver         Victoria
              19               13               12               82
> str(islands) # inspect the structure of islands
Named num [1:48] 11506 5500 16988 2968 16 ...
- attr(*, "names")= chr [1:48] "Africa" "Antarctica" "Asia" "Australia" ...
> islands[c("Asia","Africa")]
Asia Africa
16988 11506
> names(islands)[1:12]
[1] "Africa" "Antarctica" "Asia" "Australia"
[5] "Axel Heiberg" "Baffin" "Banks" "Borneo"
[9] "Britain" "Celebes" "Celon" "Cuba"
> names(sort(islands,decreasing = TRUE))[1:3] # sort in decreasing order, take all the names, then keep the first three
[1] "Asia" "Africa" "North America"
> names(sort(islands,decreasing = TRUE)[1:3]) # sort in decreasing order, keep the first three elements, then take their names
[1] "Asia" "Africa" "North America"
> names(islands[islands==16988]) # name of the element whose value is 16988
[1] "Asia"
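Looking up a name by a hard-coded value such as 16988 only works when the value is already known. As a hedged alternative, the sketch below uses which.max and a logical condition on the built-in islands dataset; the threshold 5000 is an arbitrary choice for illustration.

```r
# Name of the largest landmass, without hard-coding its area
largest <- names(which.max(islands))

# Names of every landmass whose area exceeds an arbitrary threshold
big <- names(islands[islands > 5000])

print(largest)
print(big)
```

Both expressions index the named vector first and then ask for the names, the same pattern used in the transcript above.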
Manipulating text
Splitting text
> pangram <- "The quick brown fox jumps over the lazy dog"
> pangram
[1] "The quick brown fox jumps over the lazy dog"
> strsplit(pangram," ") # split the text at each space; the result is a list
[[1]]
[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
> words <- strsplit(pangram," ")[[1]]
> words
[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
> unique(tolower(words)) # to get a vector of unique words, convert to lower case first, then apply unique
[1] "the" "quick" "brown" "fox" "jumps" "over" "lazy" "dog"
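strsplit returns a list because it accepts a vector of strings and produces one character vector per input string, which is why the transcript needs [[1]]. A small sketch with two made-up sentences illustrates this:

```r
# Hypothetical input: two strings instead of one
sentences <- c("The quick brown fox", "jumps over the lazy dog")

pieces      <- strsplit(sentences, " ")  # a list with one vector per sentence
word_counts <- lengths(pieces)           # number of words in each sentence
all_words   <- unlist(pieces)            # flatten into one character vector

print(word_counts)
print(all_words)
```

unlist is a convenient way to pool the words when the sentence boundaries no longer matter.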
Concatenating text
> paste("The","quick","brown","fox")
[1] "The quick brown fox"
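> y <- 1:4 # shorter than words, so it is recycled
> paste(words,y, sep = "—",collapse = " ") # sep joins the corresponding elements of the input vectors; collapse then joins the resulting elements into a single string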
[1] "The—1 quick—2 brown—3 fox—4 jumps—1 over—2 the—3 lazy—4 dog—1"
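To separate the roles of sep and collapse, the following sketch applies them one at a time. The vectors first and second are illustrative names, not from the original notes.

```r
first  <- c("a", "b", "c")
second <- 1:3

# sep only: elements are paired up, the result is still a vector of length 3
pairs <- paste(first, second, sep = "-")

# sep and collapse: the paired elements are then joined into one string
joined <- paste(first, second, sep = "-", collapse = " ")

# paste0 is shorthand for paste with sep = ""
glued <- paste0(first, second)

print(pairs)
print(joined)
print(glued)
```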
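Sorting text
R does not sort strings by ASCII code; it uses the collation order of the current locale. For example, when R runs in a Danish locale, "aa" is treated as a single letter and sorts after "z".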
> sort(letters,decreasing = TRUE)
[1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j"
[18] "i" "h" "g" "f" "e" "d" "c" "b" "a"
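Because the default ordering depends on the locale, results can differ between machines. One way to get a locale-independent order is sort with method = "radix", which for character vectors collates as in the C locale (by byte value, so upper case comes before lower case). A short sketch with made-up input:

```r
mixed <- c("b", "A", "a", "B")

# Default method: the order depends on the collation rules of the current locale
locale_sorted <- sort(mixed)

# Radix method: C-locale collation, the same on every machine
c_sorted <- sort(mixed, method = "radix")

print(locale_sorted)
print(c_sorted)
```

Only the radix result is predictable here; the default result is left unannotated because it varies with the locale.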
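Finding content within text
(1) Searching by position
> head(substr(state.name,start = 3,stop = 6)) # extract characters 3 to 6 of each state name with substr, then show the first six results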
[1] "abam" "aska" "izon" "kans" "lifo" "lora"
> sort(head(substr(state.name,start = 3,stop = 6)),decreasing = FALSE) # combined with the sort function
[1] "abam" "aska" "izon" "kans" "lifo" "lora"
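(2) Searching by pattern
Use the grep function (short for Global Regular Expression Print), which searches for matches to a regular expression.
> grep("New",state.name) # returns the positions of the matching elements; matching is case-sensitive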
[1] 29 30 31 32
> state.name[grep("New",state.name)] # extract the elements that match the pattern
[1] "New Hampshire" "New Jersey" "New Mexico" "New York"
> grep(" ",state.name) # search for a space, an indirect way to find state names made up of two words
[1] 29 30 31 32 33 34 39 40 41 48
> state.name[grep(" ",state.name)]
[1] "New Hampshire" "New Jersey" "New Mexico" "New York"
[5] "North Carolina" "North Dakota" "Rhode Island" "South Carolina"
[9] "South Dakota" "West Virginia"
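The indexing step state.name[grep(...)] can be shortened, and there are related forms that return logical values or ignore case. The sketch below uses only base R arguments (value, ignore.case) and the grepl function on the same state.name data.

```r
# value = TRUE returns the matching elements instead of their positions
new_states <- grep("New", state.name, value = TRUE)

# grepl returns one TRUE/FALSE per element, handy for counting or filtering
has_new   <- grepl("New", state.name)
n_matches <- sum(has_new)

# ignore.case = TRUE turns off the case sensitivity
any_case <- grep("new", state.name, ignore.case = TRUE, value = TRUE)

print(new_states)
print(n_matches)
```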
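Replacing text
The sub function searches for a pattern and replaces only the first match in each element, while gsub replaces every match. Neither function changes the original vector; to keep the result, assign it to a name.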
> A<-"He is a girl"
> gsub("girl","boy",A)
[1] "He is a boy"
> A
[1] "He is a girl"
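The example above has only one match, so it does not show how sub and gsub differ. A minimal sketch with a made-up string that contains the pattern twice:

```r
phrase <- "one fish two fish"

first_only <- sub("fish", "cat", phrase)   # replaces the first match only
every_one  <- gsub("fish", "cat", phrase)  # replaces all matches

# The original is untouched until the result is assigned back
unchanged <- phrase
phrase    <- gsub("fish", "cat", phrase)

print(first_only)
print(every_one)
```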
Using regular expressions
> words<-c("bach","back","beech","beach","black")
> grep("beach|beech",words)
[1] 3 4
> grep("be(e|a)ch",words)
[1] 3 4
> grep("b(e*|a*)ch",words)
[1] 1 3
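The last pattern does not match "beach" because (e*|a*) allows a run of e or a run of a, but not a mixture of the two. As a sketch of one way to allow a mixture, a character class can be used instead; the pattern b[ea]*ch is an illustration added here, not from the original notes.

```r
words <- c("bach", "back", "beech", "beach", "black")

# A run of only "e" or only "a" between "b" and "ch"
runs_only <- grep("b(e*|a*)ch", words, value = TRUE)

# Any mixture of "e" and "a" between "b" and "ch"
any_mix <- grep("b[ea]*ch", words, value = TRUE)

print(runs_only)
print(any_mix)
```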
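Creating, converting, and using factors
The factor function has two main arguments, levels and labels. levels specifies the set of values the input can take; labels gives the names shown for those levels in the output.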
> direction<-c("North","East","South","South")
# the order of the levels determines the integer codes that are stored
> str(factor(direction,levels = c("North","East","South","West")))
Factor w/ 4 levels "North","East",..: 1 2 3 3
# labels rename the levels; the stored integer codes stay the same
> str(factor(direction,levels = c("North","East","South","West"),labels = c("N","E","S","W")))
Factor w/ 4 levels "N","E","S","W": 1 2 3 3
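When an integer vector is converted to a factor, its values become character levels. Calling as.numeric directly returns the internal codes, so convert to character first and then to numeric to recover the original values.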
> as.character(factor(2:6))
[1] "2" "3" "4" "5" "6"
> as.numeric(factor(2:6))
[1] 1 2 3 4 5
> as.numeric(as.character(factor(2:6)))
[1] 2 3 4 5 6
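The pitfall is easier to spot when the values are not consecutive. The sketch below uses a made-up numeric vector and also shows the as.numeric(levels(f))[f] idiom, which converts each level once and then indexes by the codes.

```r
# Hypothetical data with repeated, non-consecutive values
f <- factor(c(10, 20, 20, 40))

codes   <- as.numeric(f)                # internal codes, not the original values
values  <- as.numeric(as.character(f))  # original values, via character
values2 <- as.numeric(levels(f))[f]     # same result, converting each level once

print(codes)
print(values)
```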
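Levels are stored separately from a factor's integer codes; use the levels function to inspect them.
The table() function counts how many times each level occurs.
> word <- factor(direction,levels = c("North","East","South","West"),labels = c("N","E","S","W")) # the labelled factor built earlier
> levels(word)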
[1] "N" "E" "S" "W"
> nlevels(word)
[1] 4
> length(levels(word))
[1] 4
> table(word)
word
N E S W
1 1 2 0
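table reports a count of 0 for W because the level is declared even though it never occurs in the data. If unused levels are not wanted, droplevels removes them. A self-contained sketch follows; the name heading_f is chosen here for illustration.

```r
direction <- c("North", "East", "South", "South")
heading_f <- factor(direction,
                    levels = c("North", "East", "South", "West"),
                    labels = c("N", "E", "S", "W"))

counts <- table(heading_f)       # includes the unused level W with count 0
used   <- droplevels(heading_f)  # same data, unused levels removed

print(counts)
print(levels(used))
```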
Using ordered factors
> status<-c("Lo","Hi","Med","Med","Hi")
> ordered.status<-factor(status,levels = c("Lo","Med","Hi"),ordered = TRUE)
> ordered.status
[1] Lo Hi Med Med Hi
Levels: Lo < Med < Hi
> table(status)
status
Hi Lo Med
2 1 2
> table(ordered.status) # with an ordered factor the counts follow the level order, not alphabetical order; compare with the output above
ordered.status
Lo Med Hi
1 2 2
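Beyond the ordering of table output, an ordered factor also supports comparisons between levels. A minimal sketch reusing the status data from above:

```r
status <- c("Lo", "Hi", "Med", "Med", "Hi")
ordered.status <- factor(status, levels = c("Lo", "Med", "Hi"), ordered = TRUE)

# Comparison operators respect the declared level order
at_least_med <- ordered.status >= "Med"

# max and min also work on ordered factors
highest <- as.character(max(ordered.status))

print(at_least_med)
print(highest)
```

The same comparison on an unordered factor is not meaningful; R returns NA with a warning in that case.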