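# Basic operations
## Working directory
getwd()  # show the current working directory
setwd()  # change the working directory
.Rprofile: startup file that holds default settings
Assignment: <- (shortcut Alt + - in RStudio)
<<- assigns to a variable that already exists in an enclosing environment (creating a global variable if none is found) instead of creating a local one; used mainly when writing functions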
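The difference between `<-` and `<<-` inside a function can be sketched as follows. The function and variable names (`make_counter`, `add_local`, `add_global`, `total`) are hypothetical and exist only for this illustration.

```r
# A closure: `count` lives in the environment of make_counter()
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # updates `count` in the enclosing function, not a local copy
    count
  }
}
counter <- make_counter()
counter()  # 1
counter()  # 2

total <- 0
add_local  <- function(x) { total <- total + x; invisible(NULL) }   # changes a local copy only
add_global <- function(x) { total <<- total + x; invisible(NULL) }  # changes the outer `total`
add_local(5)
total  # still 0
add_global(5)
total  # 5
```

Because `<<-` changes state outside the function, it is usually reserved for cases like the counter above where that side effect is the point.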
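## Installing R packages
R Package Documentation: packages can be installed online or from source code.
Online installation resolves the dependencies between packages automatically; a small number of packages have to be installed from source.
install.packages(): the package name must be quoted; to install several packages at once, use install.packages(c("", ""))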
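## Getting help
help(): shows the documentation of a function, similar to man on Linux
args(): quickly prints the arguments of a function
example(): runs the usage examples of a function; especially useful when learning plotting functions
demo(): runs demonstration scripts, such as example plots
apropos(): lists the objects whose names contain a keyword; the mode argument restricts the type, e.g. mode = "function"
summary(): returns descriptive statistics: minimum, maximum, quartiles and mean for numeric variables, and frequency counts for factors and logical vectors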
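A few of these helpers in action, as a small sketch that relies only on functions from base R and the default packages (`sd`, `read.csv`, the `women` data set):

```r
# Inspect the arguments of a function without opening its help page
args(sd)
arg_names <- names(formals(sd))   # "x" "na.rm"

# Find functions whose names match a pattern (a regular expression)
hits <- apropos("^read\\.csv", mode = "function")
hits

# Descriptive statistics for every column of a data frame
summary(women)
```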
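# Data structures
### Basic data types
1. **Numeric**: Python separates integers from floats; in R a plain number is a double by default (integers need the L suffix, e.g. 1L), so numbers can be used in calculations directly
2. **Character (string)**: can be concatenated, converted, extracted from, and so on
3. **Logical**: TRUE or FALSE
4. **Date**

### General data structures
**Vector**, scalar, **matrix, array, list, data frame, factor**, time series, ... (some structures are specific to other languages: Python dictionaries, C pointers, Perl hashes). Every data structure in R can be treated as an object.

## Vectors
**The most basic piece of data in R is a collection rather than a scalar, which is what makes vectorized programming possible.** Vectors are created with c(). They are not vectors in the strict mathematical sense; they are closer to an ordered collection, like a list or array in other programming languages. **All elements of a vector must have the same type**; mixed input is coerced to a common type: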
> x <- c(1,'hello')
> x
[1] "1" "hello"
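The coercion shown above follows a fixed order (logical, then integer, then double, then character). A short sketch, using only base R, makes the rule visible and shows that a list is the structure to use when types must stay mixed:

```r
class(c(1, "hello"))   # "character": the number becomes a string
class(c(1, TRUE))      # "numeric":   TRUE becomes 1
class(c(1L, 2.5))      # "numeric":   the integer is promoted to double
class(c(TRUE, "a"))    # "character"

# A list keeps each element's own type
mixed <- list(1, "hello")
types <- sapply(mixed, class)
types                  # "numeric" "character"
```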
Some commonly used functions:
seq(from=1,to=100,by=1)
> x <- c(1,2,3)
> rep(x,each=3,times=2)
[1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
> x <- c(1:5)
> y <- seq(1,10,2)
> y
[1] 1 3 5 7 9
> x
[1] 1 2 3 4 5
> x*2+y
[1] 3 7 11 15 19
> x>3
[1] FALSE FALSE FALSE TRUE TRUE
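Vectorized arithmetic also applies when the operands differ in length: the shorter vector is recycled. The following sketch uses only base R; the values are arbitrary.

```r
x <- 1:6
recycled <- x + c(10, 20)   # c(10, 20) is reused three times
recycled                    # 11 22 13 24 15 26

# Comparisons are vectorized too, and TRUE counts as 1 in sums
n_large <- sum(x > 3)       # 3

# seq() by number of points instead of step size
seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# rep() with a per-element repeat count
rep(c("a", "b"), times = c(2, 1))   # "a" "a" "b"
```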
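### Vector indexing
<1>
Positive integer index: like array indexing in other programming languages,
except that R starts counting at 1.
Negative integer index:
drops the element at that position.
> x <- 1:100
> length(x)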
[1] 100
> x[100]
[1] 100
> x[-100]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
[70] 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
[93] 93 94 95 96 97 98 99
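An index vector can also
return several elements at once, or the same element more than once (a position beyond the length returns NA)
<2>
Logical index: returns the elements at the positions that are TRUE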
> x[4:18]
[1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> x[c(1,23,45,67,89)]
[1] 1 23 45 67 89
> x[c(11,11,23,234,5,90,2)]
[1] 11 11 23 NA 5 90 2
> # logical index on a vector
> y <- c(1:10)
> y[c(T,F,T,T,F,F,T,T,T,F,T)] # 11 logical values for 10 elements: the extra TRUE yields NA
[1] 1 3 4 7 8 9 NA
> y[y>5 & y<9]
[1] 6 7 8
> # character vector
> z <- c("one","two","three","four","five")
> z[z %in% c("one","two")]
[1] "one" "two"
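Logical indices follow the same recycling rule as arithmetic, which explains both the convenient short forms and the trailing NA seen above. A minimal sketch in base R:

```r
y <- 1:10

odd_pos <- y[c(TRUE, FALSE)]   # pattern is recycled: positions 1, 3, 5, ...
odd_pos                        # 1 3 5 7 9

evens <- y[y %% 2 == 0]        # condition built from the vector itself
evens                          # 2 4 6 8 10

# A logical index longer than the vector asks for a position that does not exist
too_long <- c(rep(TRUE, 10), TRUE)
padded <- y[too_long]
padded                         # 1 2 ... 10 NA
```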
<3>
Name index: access elements by their names
> # name index on a vector
> names(y) <- c("one","two","three","four","five","six","seven","eight","nine","ten")
> y
one two three four five six seven eight nine ten
1 2 3 4 5 6 7 8 9 10
> y['one']
one
1
Name-based access is used heavily with data frames later on.
Finding the position of a value:
> # get index
> t <- c (1,2,2,5,7,9,6)
> which.max (t)
[1] 6
> which.min(t)
[1] 1
> which(t==7)
[1] 5
> which(t>5)
[1] 5 6 7
> t[which (t>5)]
[1] 7 9 6
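Indexing with `which()` and indexing with the logical vector directly give the same result here, but they differ once missing values are involved. A small sketch (the vector `vals` is made up for this illustration):

```r
vals <- c(1, NA, 7, 9)

by_logical <- vals[vals > 5]          # the NA comparison is kept as NA
by_logical                            # NA 7 9

by_which <- vals[which(vals > 5)]     # which() keeps only positions that are TRUE
by_which                              # 7 9
```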
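### Modifying elements
> v <- c(1,2,3,4,5,6) # assumed starting value; the original notes build v in the next subsection
> v[2]=15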
> v
[1] 1 15 3 4 5 6
### Adding elements
> # change vector value
# assign directly to new positions
> x <- 1:100
> x[101] <- 101
> v <- c(1,2,3)
> v[c(4,5,6)] <- c(4,5,6)
> v
[1] 1 2 3 4 5 6
> append(v,99,after = 5) # insert after the fifth element
[1] 1 2 3 4 5 99 6
### Removing elements
Use negative integer indices:
> y[-c(1:3)]
four five six seven eight nine ten
4 5 6 7 8 9 10
> y <- y[-c(1:3)] # reassign to make the removal permanent
### Vector operations
Arithmetic (element by element):
> x <- 2:11
> y <- seq(1,100,length.out = 10)
> x
[1] 2 3 4 5 6 7 8 9 10 11
> y
[1] 1 12 23 34 45 56 67 78 89 100
> x+y
[1] 3 15 27 39 51 63 75 87 99 111
> x*y
[1] 2 36 92 170 270 392 536 702 890 1100
> y%%x # remainder
[1] 1 0 3 4 3 0 3 6 9 1
> y%/%x # integer division
[1] 0 4 5 6 7 8 8 8 8 9
Logical operations:
> x <- 1:10
> y <- seq(1,100,length.out = 10)
> x==y # == tests equality, unlike = which assigns
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> c(1,2,3) %in% c(1,2,2,4,5,6)
[1] TRUE TRUE FALSE
Math functions:
> # vector functions
> x <- -5:5
> abs(x)
[1] 5 4 3 2 1 0 1 2 3 4 5
> sqrt(x) # square root
[1] NaN NaN NaN NaN NaN 0.000000 1.000000
[8] 1.414214 1.732051 2.000000 2.236068
Warning message:
In sqrt(x) : NaNs produced
> log(16,base = 2) # logarithm of 16 to base 2
[1] 4
> exp(x) # exponential
[1] 6.737947e-03 1.831564e-02 4.978707e-02 1.353353e-01 3.678794e-01
[6] 1.000000e+00 2.718282e+00 7.389056e+00 2.008554e+01 5.459815e+01
[11] 1.484132e+02
> ceiling (c(-2.3,3.1415)) # smallest integer not less than x
[1] -2 4
> floor(c(-2.3,3.1415)) # largest integer not greater than x
[1] -3 3
> trunc(c(-2.3,3.1415)) # keep the integer part (truncate toward zero)
[1] -2 3
> round (c(-0.618,3.1415),digits=2) # round to 2 decimal places
[1] -0.62 3.14
Statistical functions:
> vec <- 1:100
> sum(vec)
[1] 5050
> max(vec)
[1] 100
> min(vec)
[1] 1
> range(vec)
[1] 1 100
> mean(vec)
[1] 50.5
> var(vec) # variance
[1] 841.6667
> sd(vec) # standard deviation
[1] 29.01149
> median(vec) # median
[1] 50.5
> quantile(vec) # quantiles (quartiles by default)
0% 25% 50% 75% 100%
1.00 25.75 50.50 75.25 100.00
> quantile (vec,c(0.4,0.5,0.8))
40% 50% 80%
40.6 50.5 80.2
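## Matrices and arrays
A matrix is a vector with a dimension attribute (two dimensions); an array generalizes this to any number of dimensions.
All elements must have the same type.
> x <- 1:20
m <- matrix(x,nrow = 4,ncol = 5,byrow = TRUE) # fill the matrix by row; the default is by column
### Naming rows and columns
Method 1: dimnames()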
> rnames <- c("R1","R2","R3","R4")
> cnames <- c("C1","C2","C3","C4","C5")
> dimnames(m)=list (rnames,cnames)
> m
C1 C2 C3 C4 C5
R1 1 2 3 4 5
R2 6 7 8 9 10
R3 11 12 13 14 15
R4 16 17 18 19 20
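dimnames() takes a list with one vector of names per dimension,
so the names follow the way the dimensions are divided.
The object here is a two-dimensional array,
so the list holds row names and column names.
For a three-dimensional array the list needs a third vector of names (e.g. hnames).
The names can also be passed directly to array():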
> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> z <- array(1:24, c(2,3,4), dimnames=list(dim1, dim2, dim3))
> z
, , C1
B1 B2 B3
A1 1 3 5
A2 2 4 6
, , C2
B1 B2 B3
A1 7 9 11
A2 8 10 12
, , C3
B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
B1 B2 B3
A1 19 21 23
A2 20 22 24
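With dimension names in place, elements and slices of the array can be addressed by name in every dimension. A minimal sketch that rebuilds the same array:

```r
z <- array(1:24, c(2, 3, 4),
           dimnames = list(c("A1", "A2"),
                           c("B1", "B2", "B3"),
                           c("C1", "C2", "C3", "C4")))

one_cell <- z["A1", "B2", "C3"]   # a single element: 15
slab     <- z[, , "C1"]           # a 2 x 3 matrix
dim(slab)

# Sum over each slice along the third dimension
slab_totals <- apply(z, 3, sum)
slab_totals
```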
Method 2: colnames() and rownames()
> colnames(m) <- c(1:5)
> m
1 2 3 4 5
R1 1 2 3 4 5
R2 6 7 8 9 10
R3 11 12 13 14 15
R4 16 17 18 19 20
colnames() and rownames() set the names directly.
The same approach also works for data frames.
### Matrix indexing
As with vectors, by numeric index or by name.
> # Using matrix subscripts
> m <- matrix(x,nrow = 4,ncol = 5)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
> m[1,2] # row 1, column 2 (unlike the m[1][2] syntax in Python)
[1] 5
> m[1,c(2,3,4)]
[1] 5 9 13
> m[c(2,4),c(2,3)]
[,1] [,2]
[1,] 6 10
[2,] 8 12
> m[2,]
[1] 2 6 10 14 18
> m[2] # without the comma this is the 2nd element in column-major order; m[2,] is the 2nd row, so the comma matters
[1] 2
> m[-1,2]
[1] 6 7 8
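Two details of matrix indexing are easy to miss: a single row or column is returned as a plain vector unless `drop = FALSE` is given, and a single index without a comma counts through the matrix column by column. A short sketch on the same 4 x 5 matrix:

```r
m <- matrix(1:20, nrow = 4, ncol = 5)

row2 <- m[2, ]                    # plain vector, dimensions dropped
is.matrix(row2)                   # FALSE

row2_mat <- m[2, , drop = FALSE]  # still a 1 x 5 matrix
dim(row2_mat)

m[6]                              # 6: sixth element in column-major order
m[2, 2]                           # 6: the same cell addressed by row and column
```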
> state.x77[,"Income"]
Alabama Alaska Arizona Arkansas
3624 6315 4530 3378
California Colorado Connecticut Delaware
5114 4884 5348 4809
Florida Georgia Hawaii Idaho
4815 4091 4963 4119
Illinois Indiana Iowa Kansas
5107 4458 4628 4669
Kentucky Louisiana Maine Maryland
3712 3545 3694 5299
Massachusetts Michigan Minnesota Mississippi
4755 4751 4675 3098
Missouri Montana Nebraska Nevada
4254 4347 4508 5149
New Hampshire New Jersey New Mexico New York
4281 5237 3601 4903
North Carolina North Dakota Ohio Oklahoma
3875 5087 4561 3983
Oregon Pennsylvania Rhode Island South Carolina
4660 4449 4558 3635
South Dakota Tennessee Texas Utah
4167 3821 4188 4022
Vermont Virginia Washington West Virginia
3907 4701 4864 3617
Wisconsin Wyoming
4468 4566
> state.x77["Alabama",]
Population Income Illiteracy Life Exp Murder HS Grad
3615.00 3624.00 2.10 69.05 15.10 41.30
Frost Area
20.00 50708.00
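> state.x77[Alabama,] # remember the quotes
Error: object 'Alabama' not found
> state.x77["Alabama"] # without the comma this is vector-style indexing by element name, not a row; no element has that name, so the result is NA (compare m[2] above)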
[1] NA
> state.x77["Alabama","Income"]
[1] 3624
### Matrix operations
Arithmetic:
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
> # Matrix operations
> m+1
[,1] [,2] [,3] [,4] [,5]
[1,] 2 6 10 14 18
[2,] 3 7 11 15 19
[3,] 4 8 12 16 20
[4,] 5 9 13 17 21
# there are two kinds of multiplication: element-wise (*) and the matrix product (%*%)
> n <- matrix (1:9,3,3)
> t <- matrix (2:10,3,3)
> n
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
> t
[,1] [,2] [,3]
[1,] 2 5 8
[2,] 3 6 9
[3,] 4 7 10
> n*t # element-wise (Hadamard) product
[,1] [,2] [,3]
[1,] 2 20 56
[2,] 6 30 72
[3,] 12 42 90
> n%*%t # matrix product
[,1] [,2] [,3]
[1,] 42 78 114
[2,] 51 96 141
[3,] 60 114 168
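To keep the different products apart, the following sketch computes each one and checks a single entry by hand. The second matrix is called `p` here (a name chosen for this example) so that it does not share a name with the transpose function `t()`.

```r
n <- matrix(1:9, 3, 3)
p <- matrix(2:10, 3, 3)

elementwise <- n * p      # multiplies matching cells
matprod     <- n %*% p    # rows of n times columns of p

# Entry [1, 1] of the matrix product is row 1 of n dotted with column 1 of p
matprod[1, 1] == sum(n[1, ] * p[, 1])   # TRUE

# For plain vectors: inner product (a single number) and outer product (a matrix)
u <- 1:3
v <- 4:6
inner <- drop(u %*% v)    # 1*4 + 2*5 + 3*6 = 32
outer_uv <- u %o% v       # 3 x 3 matrix with entries u[i] * v[j]
```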
Functions:
> colSums(m)
[1] 10 26 42 58 74
> rowSums(m)
[1] 45 50 55 60
> colMeans(m)
[1] 2.5 6.5 10.5 14.5 18.5
> rowMeans(m)
[1] 9 10 11 12
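# these work much like the vector functions above
diag(n) # diagonal elements
diag(m) # also works for a non-square matrix
a <- matrix(rnorm(16),4,4)
solve(a) # inverse of a
eigen(a) # eigenvalues and eigenvectors
dist(a) # distances between the rows of a
# ...
# for more matrix operations and transformations, refer to linear algebra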
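The results of `solve()` and `eigen()` are easy to sanity-check. The sketch below swaps the random matrix for a small fixed symmetric matrix (an assumption made here so that the example is reproducible and certainly invertible).

```r
# Fixed, diagonally dominant symmetric matrix (illustrative values)
a <- matrix(c(2, 1, 0, 0,
              1, 3, 1, 0,
              0, 1, 4, 1,
              0, 0, 1, 5), nrow = 4, ncol = 4)

a_inv <- solve(a)
inverse_ok <- isTRUE(all.equal(a %*% a_inv, diag(4)))  # a times its inverse is the identity

# solve(a, b) solves the linear system a %*% x = b without forming the inverse
b <- c(1, 2, 3, 4)
x <- solve(a, b)
system_ok <- isTRUE(all.equal(drop(a %*% x), b))

ev <- eigen(a)$values      # real eigenvalues, since a is symmetric

# diag() on a non-square matrix returns the entries with equal row and column index
m <- matrix(1:20, nrow = 4, ncol = 5)
diag(m)                    # 1 6 11 16
```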
## Lists
A list is the most complex data structure in R:
its elements can be of different types, including other data structures.
> a <- 1:20
> b <- matrix(1:24,4,6)
> c=mtcars # data frame
> d <- "This is a test list" # scalar (a character vector of length 1)
>
> mlist <- list(a,b,c,d)
> mlist <- list(first=a,second=b,third=c,fourth=d) # names can be given directly
state.center # a built-in list
$x
[1] -86.7509 -127.2500 -111.6250 -92.2992 -119.7730 -105.5130 -72.3573
[8] -74.9841 -81.6850 -83.3736 -126.2500 -113.9300 -89.3776 -86.0808
[15] -93.3714 -98.1156 -84.7674 -92.2724 -68.9801 -76.6459 -71.5800
[22] -84.6870 -94.6043 -89.8065 -92.5137 -109.3200 -99.5898 -116.8510
[29] -71.3924 -74.2336 -105.9420 -75.1449 -78.4686 -100.0990 -82.5963
[36] -97.1239 -120.0680 -77.4500 -71.1244 -80.5056 -99.7238 -86.4560
[43] -98.7857 -111.3300 -72.5450 -78.2005 -119.7460 -80.6665 -89.9941
[50] -107.2560
$y
[1] 32.5901 49.2500 34.2192 34.7336 36.5341 38.6777 41.5928 38.6777
[9] 27.8744 32.3329 31.7500 43.5648 40.0495 40.0495 41.9358 38.4204
[17] 37.3915 30.6181 45.6226 39.2778 42.3645 43.1361 46.3943 32.6758
[25] 38.3347 46.8230 41.3356 39.1063 43.3934 39.9637 34.4764 43.1361
[33] 35.4195 47.2517 40.2210 35.5053 43.9078 40.9069 41.5928 33.6190
[41] 44.3365 35.6767 31.3897 39.1063 44.2508 37.5630 47.4231 38.4204
[49] 44.5937 43.0504
### Accessing list elements
<1,2> Again by integer index and by name
# numeric index
> mlist[1]
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> mlist[c(1,4)]
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
$fourth
[1] "This is a test list"
> mlist[1,4]
Error in mlist[1, 4] : incorrect number of dimensions
# name index
> mlist["first"]
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> mlist[c('first','fourth')]
$first
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
$fourth
[1] "This is a test list"
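Note that lists can be accessed
not only with single brackets [] but also with double brackets [[]] <3>
The two forms behave differently:
[] always returns a list (which hides the complexity of the different element types)
[[]] returns the element itself, with its own type
> class(mlist[1])
[1] "list" # a list
> class(mlist[[1]])
[1] "integer" # the type of the element itself
> class(mlist[first])
Error: object 'first' not found # an unquoted name is looked up as a variable; do not forget the quotes
> class(mlist['first'])
[1] "list" # a list
> class(mlist[['first']])
[1] "integer" # the type of the element itself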
> class(mlist$first)
[1] "integer"
Lists, and the data frames covered later, can also be accessed with $ <4>
Once double brackets or $ have returned the element itself,
it can be indexed further with the access methods of its own type:
> mlist$first[c(1:4)]
[1] 1 2 3 4
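The practical consequence of the three access forms can be sketched as follows, on a smaller list with the same element names (the variable `key` is introduced only for this example):

```r
mlist <- list(first = 1:20, fourth = "This is a test list")

sub <- mlist["first"]      # a list of length 1
vec <- mlist[["first"]]    # the integer vector itself

total <- sum(vec)          # works: 210
# sum(sub) would fail, because sub is still a list

# [[ ]] accepts a name stored in a variable; $ takes the name literally
key <- "fourth"
via_brackets <- mlist[[key]]   # "This is a test list"
via_dollar   <- mlist$key      # NULL: there is no element called "key"
```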
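### Modifying a list
> mlist[5] <- iris
Warning message:
In mlist[5] <- iris :
number of items to replace is not a multiple of replacement length
# single brackets replace one element per item on the right, and iris is itself a list of 5 columns; use double brackets to store a whole object
> mlist[[5]] <- iris
# to delete, use a negative index and reassign, or assign NULL
> mlist[5] <- NULL
> View(mlist)
Otherwise lists are modified in the same way as vectors.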
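A compact sketch of adding and removing list elements; the element names `df` and `df2` are made up for this example.

```r
mlist <- list(first = 1:3)

# Double brackets store the whole object as one element
mlist[["df"]] <- head(iris, 2)

# Single brackets work too, if the right-hand side is wrapped in a list
mlist["df2"] <- list(head(iris, 2))
length(mlist)          # 3

# Assigning NULL removes an element
mlist[["df2"]] <- NULL
names(mlist)           # "first" "df"
```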
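## Data frames
The columns of a data frame are vectors.
Every column must have the same length and must be named (the names form a character vector).
Within a column all values have the same type (it is a vector), while different columns may have different types.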
state <- data.frame(state.name,state.abb,state.region,state.x77)
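### Accessing a data frame
By integer index, by name, or with $
state[1] # first column (as a one-column data frame)
state[c(2,4)] # columns 2 and 4
state[,"state.abb"] # by column name; with the comma the result is a vector
state["Alabama",] # by row name
women$height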
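The access forms differ in what they return. A short sketch on the built-in `women` data set shows the distinction, plus row selection by a condition:

```r
one_col_df <- women["height"]     # still a data frame
as_vector1 <- women[["height"]]   # the column itself
as_vector2 <- women[, "height"]   # also the column itself
as_vector3 <- women$height        # same again

class(one_col_df)                 # "data.frame"
class(as_vector1)                 # "numeric"

# Rows are selected before the comma, here by a logical condition
tall <- women[women$height > 70, ]
nrow(tall)
```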
Once data is organized as a data frame, many operations become convenient:
> plot(women$height,women$weight)
> lm (weight~height,data = women)
Call:
lm(formula = weight ~ height, data = women)
Coefficients:
(Intercept) height
-87.52 3.45
For more convenient access to the columns of a data frame,
use attach() and detach():
> attach(mtcars)
> mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
> hp
[1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 66
[19] 52 65 97 150 150 245 175 66 91 113 264 175 335 109
> detach(mtcars)
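An alternative worth knowing is `with()`, which gives the same short column names for a single expression without changing the search path. This is offered as a sketch of a common base R idiom, not as what the notes above prescribe:

```r
# Columns are visible only inside the expression passed to with()
avg_with   <- with(mtcars, mean(mpg))
avg_dollar <- mean(mtcars$mpg)

# Several columns can be combined in one expression
ratio <- with(mtcars, hp / wt)
length(ratio)
```

Unlike `attach()`, nothing needs to be detached afterwards, and variables in the workspace with the same names as columns are not silently shadowed for the rest of the session.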
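## Factors
Types of variables:
continuous (height);
ordinal (good, better, best);
nominal (province).
Ordinal and nominal variables are represented in R as factors.
cut() can also bin a continuous variable into groups: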
> num <- c(1:100)
> cut (num,c(seq(0,100,10)))
[1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10]
[9] (0,10] (0,10] (10,20] (10,20] (10,20] (10,20] (10,20] (10,20]
[17] (10,20] (10,20] (10,20] (10,20] (20,30] (20,30] (20,30] (20,30]
[25] (20,30] (20,30] (20,30] (20,30] (20,30] (20,30] (30,40] (30,40]
[33] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40] (30,40]
[41] (40,50] (40,50] (40,50] (40,50] (40,50] (40,50] (40,50] (40,50]
[49] (40,50] (40,50] (50,60] (50,60] (50,60] (50,60] (50,60] (50,60]
[57] (50,60] (50,60] (50,60] (50,60] (60,70] (60,70] (60,70] (60,70]
[65] (60,70] (60,70] (60,70] (60,70] (60,70] (60,70] (70,80] (70,80]
[73] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80] (70,80]
[81] (80,90] (80,90] (80,90] (80,90] (80,90] (80,90] (80,90] (80,90]
[89] (80,90] (80,90] (90,100] (90,100] (90,100] (90,100] (90,100] (90,100]
[97] (90,100] (90,100] (90,100] (90,100]
10 Levels: (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] ... (90,100]
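The interval labels show that each bin is open on the left and closed on the right. The sketch below counts the members of each bin and shows what happens to a value sitting exactly on the lowest break:

```r
num <- 1:100
brks <- seq(0, 100, 10)

groups <- cut(num, breaks = brks)
counts <- table(groups)
counts                      # 10 values in each of the 10 bins

# 0 lies outside (0,10], so it is not assigned to any bin
outside <- cut(0, breaks = brks)
is.na(outside)              # TRUE

# include.lowest = TRUE closes the first interval on the left
inside <- cut(0, breaks = brks, include.lowest = TRUE)
as.character(inside)        # "[0,10]"
```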
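Each possible value of such a categorical variable is called a level (good, better, best),
and a vector made up of these level values is a factor.
The main use of a factor is classification,
for example computing frequencies and proportions: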
> mtcars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
> table(mtcars$cyl)
4 6 8
11 7 14
> f <- factor(c("red","red","green","red","blue","green","blue","blue"))
> f
[1] red red green red blue green blue blue
Levels: blue green red
> week <- factor(c("Mon","Fri","Thu","Wed","Mon","Fri","Sun"),ordered = TRUE,
+ levels = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))
# levels fixes the set and order of the levels; ordered = TRUE makes the factor ordinal
> week
[1] Mon Fri Thu Wed Mon Fri Sun
Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
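What the ordering buys is the ability to compare and sort by level rather than alphabetically. A minimal sketch with a shorter vector:

```r
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
week <- factor(c("Mon", "Fri", "Thu"), levels = days, ordered = TRUE)

before_thu <- week < "Thu"       # comparison follows the level order
before_thu                       # TRUE FALSE FALSE

latest <- as.character(max(week))  # "Fri"
codes  <- as.integer(week)         # positions in the level vector: 1 5 4

# Levels that do not occur still appear in the frequency table, with count 0
table(week)
```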
The most basic application is drawing a bar chart of the frequencies:
> plot(f)
## Missing data
Missing values are written NA.
NA means "not available": comparing or computing with NA yields NA.
The na.rm argument skips missing values.
> 1+NA
[1] NA
> NA==0
[1] NA
> a <- c(NA,1:49)
> sum(a)
[1] NA
> mean(a)
[1] NA
# set na.rm = TRUE to skip missing values
> sum(a,na.rm = TRUE)
[1] 1225
> mean(a,na.rm = TRUE)
[1] 25
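> mean(1:49) # na.rm drops the NA entirely, so it is not counted in n either
[1] 25
is.na(): logical test for missing values
na.omit(): removes missing values
colSums(is.na(sleep)) # sleep here appears to be the mammal sleep data from the VIM package (data(sleep, package = "VIM")), not datasets::sleep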
BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred
0 0 14 12 4 4 4 0
Exp Danger
0 0
# the call above counts the NAs in each column
> colSums(sleep, na.rm = T)
BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred
12324.98 17554.32 416.30 98.60 610.90 1152.90 8256.50 178.00
Exp Danger
150.00 162.00
> colSums(sleep)
BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred
12324.98 17554.32 NA NA NA NA NA 178.00
Exp Danger
150.00 162.00
> c <- c(NA,1:20,NA,NA)
> d <- na.omit(c) # remove the missing values
> d # the NAs are gone, and some attributes recording the removal have been added
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
attr(,"na.action")
[1] 1 22 23
attr(,"class")
[1] "omit"
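For a vector, na.omit() removes every NA;
for a data frame, it removes every row that contains a missing value
(rows are the observations/samples, columns are the measured variables).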
> head(sleep)
BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
1 6654.000 5712.0 NA NA 3.3 38.6 645 3 5 3
2 1.000 6.6 6.3 2.0 8.3 4.5 42 3 1 3
3 3.385 44.5 NA NA 12.5 14.0 60 1 1 1
4 0.920 5.7 NA NA 16.5 NA 25 5 2 3
5 2547.000 4603.0 2.1 1.8 3.9 69.0 624 3 5 4
6 10.550 179.5 9.1 0.7 9.8 27.0 180 4 4 4
> head(na.omit(sleep))
BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
2 1.000 6.6 6.3 2.0 8.3 4.5 42 3 1 3
5 2547.000 4603.0 2.1 1.8 3.9 69.0 624 3 5 4
6 10.550 179.5 9.1 0.7 9.8 27.0 180 4 4 4
7 0.023 0.3 15.8 3.9 19.7 19.0 35 1 1 1
8 160.000 169.0 5.2 1.0 6.2 30.4 392 4 5 4
9 3.300 25.6 10.9 3.6 14.5 28.0 63 1 2 1
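The same row-wise behaviour can be reproduced with `complete.cases()`. Since the `sleep` data used above is not part of base R, this sketch assumes the built-in `airquality` data set instead, which also contains missing values:

```r
# Number of NAs per column
na_per_col <- colSums(is.na(airquality))
na_per_col

# Logical vector: TRUE for rows without any NA
ok <- complete.cases(airquality)

kept_by_omit  <- nrow(na.omit(airquality))
kept_by_cases <- nrow(airquality[ok, ])

# Both approaches keep exactly the same rows
kept_by_omit == kept_by_cases
```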
> 1/0
[1] Inf
> -1/0
[1] -Inf
> 0/0
[1] NaN
> is.nan(0/0)
[1] TRUE
> is.infinite(1/0)
[1] TRUE
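NA, NaN and Inf are related but distinct, and the test functions do not treat them symmetrically. A quick sketch in base R:

```r
is.na(NaN)     # TRUE:  NaN counts as missing
is.nan(NA)     # FALSE: NA is not "not a number"

# is.finite() is FALSE for all three special values
finite_flags <- is.finite(c(1, Inf, NA, NaN))
finite_flags   # TRUE FALSE FALSE FALSE

# na.rm = TRUE removes NaN as well as NA
mean(c(1, 3, NA, NaN), na.rm = TRUE)   # 2
```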
is.na() and table() together give a count of missing values:
> r <- is.na(oc$rsid)
> table(r)
r
FALSE TRUE
9368169 2065842
> r <- is.na(oc$SNP)
> table(r)
r
FALSE
11434011
## Strings
Some common string operations
# get length information
> nchar ("Hello World")
[1] 11
> month.name
[1] "January" "February" "March" "April" "May" "June"
[7] "July" "August" "September" "October" "November" "December"
> nchar(month.name) # number of characters in each string
[1] 7 8 5 5 3 4 4 6 9 7 8 8
> length(month.name) # number of strings
[1] 12
# concatenate strings
> paste("Everybody","loves","stats") # join several strings; Python: 'a' + 'b'
[1] "Everybody loves stats"
> paste("Everybody","loves","stats",sep = '-')
[1] "Everybody-loves-stats"
# note that paste is vectorized: it pastes element by element
> names <- c("Moe","Larry","Curly")
> paste(names,"love stats")
[1] "Moe love stats" "Larry love stats" "Curly love stats"
# change case
> Mon <- substr(month.name,1,3)
> toupper(Mon)
[1] "JAN" "FEB" "MAR" "APR" "MAY" "JUN" "JUL" "AUG" "SEP" "OCT" "NOV" "DEC"
> tolower(Mon)
[1] "jan" "feb" "mar" "apr" "may" "jun" "jul" "aug" "sep" "oct" "nov" "dec"
# to capitalize the first letter,
# use a regular expression (TODO)
# g = global: sub replaces only the first match, gsub replaces every match
> gsub("^(\\w)", "\\U\\1",tolower(Mon),perl = TRUE)
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> gsub("^(\\w)", "\\L\\1",toupper(Mon),perl = TRUE)
[1] "jAN" "fEB" "mAR" "aPR" "mAY" "jUN" "jUL" "aUG" "sEP" "oCT" "nOV" "dEC"
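The difference between `sub` and `gsub`, and a regex-free way to capitalize the first letter, can be sketched as follows. The helper `cap_first` is a hypothetical name defined only for this example.

```r
# sub() changes the first match only, gsub() changes all of them
first_only <- sub("a", "A", "banana")    # "bAnana"
all_hits   <- gsub("a", "A", "banana")   # "bAnAnA"

# Capitalize the first letter without a regular expression:
# upper-case character 1, then append the rest unchanged
cap_first <- function(s) {
  paste0(toupper(substr(s, 1, 1)), substring(s, 2))
}
capped <- cap_first(c("jan", "feb", "mar"))
capped                                    # "Jan" "Feb" "Mar"
```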
# matching and searching
> x <- c("b","A+","AC")
> grep ("A+",x,fixed=TRUE)
[1] 2
> match("AC",x)
[1] 3
> grep ("A+",x,fixed=FALSE) # fixed = FALSE: the pattern is treated as a regular expression
[1] 2 3
# split strings
> path <- "/usr/local/bin/R"
> strsplit(path,"/")
[[1]]
[1] "" "usr" "local" "bin" "R"
# note: strsplit returns a list, so several strings can be split at once and the pieces processed conveniently
> strsplit(c(path,path),"/")
[[1]]
[1] "" "usr" "local" "bin" "R"
[[2]]
[1] "" "usr" "local" "bin" "R"
> class(strsplit(path,"/"))
[1] "list"
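Because the result is a list, the usual next step is to apply a function to each element. The sketch below extracts the last component of each path; the second path is a made-up value for illustration.

```r
paths <- c("/usr/local/bin/R", "/usr/bin/python3")   # second path is hypothetical

parts <- strsplit(paths, "/", fixed = TRUE)
n_parts <- lengths(parts)            # pieces per path: 5 4

# Take the last piece of each split
last_piece <- sapply(parts, function(p) p[length(p)])
last_piece                           # "R" "python3"

# For file paths specifically, basename() gives the same answer
basename(paths)
```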
Combining strings: element-wise pairs, or all possible combinations
> a <- c('xiaoming','xiaohong')
> b <- c('12','18')
> paste(a,b)
[1] "xiaoming 12" "xiaohong 18"
> outer(a,b,FUN = paste)
[,1] [,2]
[1,] "xiaoming 12" "xiaoming 18"
[2,] "xiaohong 12" "xiaohong 18"
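Strings can also be processed in a scripting language such as Python (convenient), with R used afterwards for the statistical analysis.
Accessing strings: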
> month.name[1]
[1] "January"
> month.name[c(1:3)]
[1] "January" "February" "March"
# vectorized: e.g. take the first three letters of each month name as its abbreviation
> substr(month.name,1,3)
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
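## Dates and times
The analysis of time-based data involves two basic questions:
1. describing the time series;
2. using that description to forecast.
Creating date objects: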
> a="2017-01-01"
> as.Date(a)
[1] "2017-01-01"
# the format argument specifies the date format (see strftime)
> b=as.Date(a,format="%Y-%m-%d")
> ?strftime
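The `format` argument matters most when the input is not in the default year-month-day form. A brief sketch with an arbitrary day/month/year string, followed by date arithmetic:

```r
# Parse a non-default layout: day/month/year
d <- as.Date("01/02/2017", format = "%d/%m/%Y")   # 1 February 2017

# format() goes the other way: Date to string
iso   <- format(d, "%Y-%m-%d")   # "2017-02-01"
month <- format(d, "%m")         # "02"

# Subtracting dates gives a difference in days
gap <- as.numeric(as.Date("2017-03-01") - as.Date("2017-02-01"))
gap                              # 28
```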
Date objects work with many functions,
e.g. seq() generates a sequence of dates, and
ts() turns data into a time series
> seq(as.Date("2017-01-01"),as.Date("2017-07-05"),by=5)
[1] "2017-01-01" "2017-01-06" "2017-01-11" "2017-01-16" "2017-01-21"
[6] "2017-01-26" "2017-01-31" "2017-02-05" "2017-02-10" "2017-02-15"
[11] "2017-02-20" "2017-02-25" "2017-03-02" "2017-03-07" "2017-03-12"
[16] "2017-03-17" "2017-03-22" "2017-03-27" "2017-04-01" "2017-04-06"
[21] "2017-04-11" "2017-04-16" "2017-04-21" "2017-04-26" "2017-05-01"
[26] "2017-05-06" "2017-05-11" "2017-05-16" "2017-05-21" "2017-05-26"
[31] "2017-05-31" "2017-06-05" "2017-06-10" "2017-06-15" "2017-06-20"
[36] "2017-06-25" "2017-06-30" "2017-07-05"
> sales <- round(runif(48,min=50,max=100))
> ts(sales,start = c(2010,5),end = c(2014,4),frequency = 1) # yearly; with frequency = 1 the second number is an offset in years, hence 2014 to 2017
Time Series:
Start = 2014
End = 2017
Frequency = 1
[1] 52 72 54 52
> ts(sales,start = c(2010,5),end = c(2014,4),frequency = 4) # quarterly; c(2010,5) is read as the 5th quarter from 2010, i.e. 2011 Q1
Qtr1 Qtr2 Qtr3 Qtr4
2011 52 72 54 52
2012 96 83 89 87
2013 67 58 79 59
2014 69 53 58 51
> ts(sales,start = c(2010,5),end = c(2014,4),frequency = 12) # monthly: May 2010 to April 2014
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2010 52 72 54 52 96 83 89 87
2011 67 58 79 59 69 53 58 51 57 88 63 54
2012 64 66 94 88 80 66 77 83 71 69 84 88
2013 63 64 87 93 81 89 51 68 89 95 60 60
2014 80 52 87 78
# ts() offers no direct by-day (calendar date) split
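Two follow-ups, sketched under the assumption that the monthly series is the one of interest: `window()` extracts a sub-period from a `ts` object, and daily data can be kept as a plain data frame with a Date column built by `seq()`. The seed and the data frame `daily` are choices made for this example.

```r
set.seed(42)                                   # arbitrary seed for reproducibility
sales <- round(runif(48, min = 50, max = 100))

# 48 monthly values starting in May 2010; the end is derived from the length
monthly <- ts(sales, start = c(2010, 5), frequency = 12)
end(monthly)                                   # 2014 4

# Extract one calendar year
y2011 <- window(monthly, start = c(2011, 1), end = c(2011, 12))
length(y2011)                                  # 12

# Daily observations: store real dates next to the values
daily <- data.frame(
  date  = seq(as.Date("2017-01-01"), by = "day", length.out = 7),
  value = sales[1:7]
)
daily
```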