How to read a TSV file larger than 5 MB in R

Reposted

互联网小墨风 2024-10-31 13:21:17


Representing text data with character vectors

> x <- "Hello,World"
> is.character(x)
[1] TRUE
> length(x)   # number of elements in the vector
[1] 1
> nchar(x)    # number of characters in the element
[1] 11
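
The difference between length and nchar is easier to see on a vector with more than one element. The following is a minimal sketch; the vector greetings and the result names are hypothetical and not part of the original example.

```r
# Hypothetical example data: a character vector with three elements
greetings <- c("Hello", "World", "R")

n_elements <- length(greetings)  # how many elements the vector holds
n_chars <- nchar(greetings)      # characters in each element, one count per element

print(n_elements)
print(n_chars)
```

length counts elements, while nchar is vectorised and returns one count per element.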

Working with text

> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
[18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
> LETTERS[24:25]   # elements 24 through 25
[1] "X" "Y"
> tail(LETTERS,7)   # last 7 elements of LETTERS
[1] "T" "U" "V" "W" "X" "Y" "Z"
> head(LETTERS,7)   # first 7 elements of LETTERS
[1] "A" "B" "C" "D" "E" "F" "G"
> islands
          Africa       Antarctica             Asia        Australia 
           11506             5500            16988             2968 
    Axel Heiberg           Baffin            Banks           Borneo 
              16              184               23              280 
         Britain          Celebes            Celon             Cuba 
              84               73               25               43 
           Devon        Ellesmere           Europe        Greenland 
              21               82             3745              840 
          Hainan       Hispaniola         Hokkaido           Honshu 
              13               30               30               89 
         Iceland          Ireland             Java           Kyushu 
              40               33               49               14 
           Luzon       Madagascar         Melville         Mindanao 
              42              227               16               36 
        Moluccas      New Britain       New Guinea  New Zealand (N) 
              29               15              306               44 
 New Zealand (S)     Newfoundland    North America    Novaya Zemlya 
              58               43             9390               32 
 Prince of Wales         Sakhalin    South America      Southampton 
              13               29             6795               16 
     Spitsbergen          Sumatra           Taiwan         Tasmania 
              15              183               14               26 
Tierra del Fuego            Timor        Vancouver         Victoria 
              19               13               12               82 
> str(islands)   # inspect the structure of islands
 Named num [1:48] 11506 5500 16988 2968 16 ...
 - attr(*, "names")= chr [1:48] "Africa" "Antarctica" "Asia" "Australia" ...
> islands[c("Asia","Africa")]
  Asia Africa 
 16988  11506 
> names(islands)[1:12]
 [1] "Africa"       "Antarctica"   "Asia"         "Australia"   
 [5] "Axel Heiberg" "Baffin"       "Banks"        "Borneo"      
 [9] "Britain"      "Celebes"      "Celon"        "Cuba"        
> names(sort(islands,decreasing = TRUE))[1:3]   # sort in descending order, take the names, then keep the first three
[1] "Asia"          "Africa"        "North America"
> names(sort(islands,decreasing = TRUE)[1:3])   # sort in descending order, keep the first three elements, then take their names
[1] "Asia"          "Africa"        "North America"
> names(islands[islands==16988])   # name of the element whose value is 16988
[1] "Asia"
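
Looking up a name by an exact value requires knowing that value in advance. As an alternative sketch, the built-in islands data can be queried with which.max or a logical condition; the variable names largest and big and the threshold 5000 are assumptions chosen for illustration.

```r
# islands is a named numeric vector that ships with R (package datasets)
largest <- names(which.max(islands))   # name of the element with the largest value

# Names of all elements whose value exceeds a hypothetical threshold of 5000
big <- names(islands)[islands > 5000]

print(largest)
print(big)
```

which.max returns the position of the maximum and keeps its name, so no exact value has to be typed in.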

Manipulating text

Splitting text

> pangram <- "The quick brown fox jumps over the lazy dog"
> pangram
[1] "The quick brown fox jumps over the lazy dog"
> strsplit(pangram," ")   # split the text on spaces
[[1]]
[1] "The"   "quick" "brown" "fox"   "jumps" "over"  "the"   "lazy"  "dog"  

> words <- strsplit(pangram," ")[[1]]
> words
[1] "The"   "quick" "brown" "fox"   "jumps" "over"  "the"   "lazy"  "dog"  
> unique(tolower(words))   # to get a vector of unique words, convert to lowercase first, then apply unique
[1] "the"   "quick" "brown" "fox"   "jumps" "over"  "lazy"  "dog"
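
strsplit returns a list because it accepts a vector of strings and produces one result per input string, which is why [[1]] is needed above. A small sketch with two hypothetical sentences (the names sentences, pieces, word_counts, and all_words are illustrative):

```r
# Hypothetical input: two strings instead of one
sentences <- c("The quick brown fox", "jumps over the lazy dog")

pieces <- strsplit(sentences, " ")  # a list with one character vector per input string
word_counts <- lengths(pieces)      # number of words in each sentence
all_words <- unlist(pieces)         # flatten the list into a single character vector

print(word_counts)
print(all_words)
```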

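Concatenating text

> paste("The","quick","brown","fox")
[1] "The quick brown fox"
> y <- 1:4   # recycled across the nine words
> paste(words,y, sep = "-",collapse = "  ")   # sep joins the corresponding elements of each argument; collapse then joins the resulting strings into a single string
[1] "The-1  quick-2  brown-3  fox-4  jumps-1  over-2  the-3  lazy-4  dog-1"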
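
The roles of sep and collapse can be separated by running paste with and without collapse. This is a sketch on hypothetical vectors parts and ids, not on the data above.

```r
# Hypothetical inputs of equal length
parts <- c("a", "b", "c")
ids <- 1:3

pairwise <- paste(parts, ids, sep = "-")                 # element-wise join, still three strings
joined <- paste(parts, ids, sep = "-", collapse = ", ")  # the three strings merged into one
no_sep <- paste0(parts, ids)                             # paste0 is paste with sep = ""

print(pairwise)
print(joined)
print(no_sep)
```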

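Sorting text

R does not sort strings by ASCII code; it uses the collation order of the current locale. For example, when R runs under a Danish locale, "aa" is treated as a single letter and sorts after "z".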

> sort(letters,decreasing = TRUE)
 [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j"
[18] "i" "h" "g" "f" "e" "d" "c" "b" "a"
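
When a locale-independent order is needed, one option is sort with method = "radix", which collates character vectors in the C locale (byte order). A sketch on a hypothetical vector mixed:

```r
# Hypothetical data mixing upper and lower case
mixed <- c("banana", "Apple", "cherry", "Banana")

locale_order <- sort(mixed)               # result depends on the current locale
c_order <- sort(mixed, method = "radix")  # C-locale collation: uppercase sorts before lowercase

print(locale_order)
print(c_order)
```

Only c_order is reproducible across machines; locale_order may differ depending on the language settings of the session.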

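Finding content within text

(1) Searching by position

> head(substr(state.name,start = 3,stop = 6))   # extract characters 3 to 6 of each name with substr, then show the first six results
[1] "abam" "aska" "izon" "kans" "lifo" "lora"
> sort(head(substr(state.name,start = 3,stop = 6)),decreasing = FALSE)   # combined with the sort function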
[1] "abam" "aska" "izon" "kans" "lifo" "lora"

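(2) Searching by pattern
The grep function can be used for this; its name is short for Global Regular Expression Print, and it matches a regular expression against each element.

> grep("New",state.name)   # returns the positions of the matching elements; matching is case-sensitive
[1] 29 30 31 32
> state.name[grep("New",state.name)]   # look up and extract the elements that match the pattern
[1] "New Hampshire" "New Jersey"    "New Mexico"    "New York" 
> grep(" ",state.name)   # an indirect way to find state names made up of two words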
 [1] 29 30 31 32 33 34 39 40 41 48
> state.name[grep(" ",state.name)]
 [1] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
 [5] "North Carolina" "North Dakota"   "Rhode Island"   "South Carolina"
 [9] "South Dakota"   "West Virginia"
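
The same searches can be written more compactly with arguments and relatives of grep. The sketch below uses value = TRUE, ignore.case = TRUE, and grepl on the built-in state.name vector; the result names are illustrative.

```r
# value = TRUE returns the matching elements instead of their positions
new_states <- grep("New", state.name, value = TRUE)

# ignore.case = TRUE makes the match case-insensitive
lower_match <- grep("new", state.name, ignore.case = TRUE, value = TRUE)

# grepl returns one TRUE/FALSE per element, which is convenient for counting
has_space <- grepl(" ", state.name)
n_two_word <- sum(has_space)

print(new_states)
print(n_two_word)
```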

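Replacing text

sub finds a text pattern and replaces only its first match in each element, while gsub replaces every match. Neither function modifies the original vector; to keep the result, assign it to a name.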

> A<-"He is a girl"
> gsub("girl","boy",A)
[1] "He is a boy"
> A
[1] "He is a girl"
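
The example above has only one match, so sub and gsub would give the same result. A sketch with a hypothetical string s that contains the pattern three times shows the difference:

```r
# Hypothetical input with three occurrences of "fish"
s <- "one fish, two fish, red fish"

first_only <- sub("fish", "cat", s)    # replaces the first match only
all_matches <- gsub("fish", "cat", s)  # replaces every match

print(first_only)
print(all_matches)
print(s)  # the original string is unchanged
```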

Using regular expressions

> words<-c("bach","back","beech","beach","black")
> grep("beach|beech",words)
[1] 3 4
> grep("be(e|a)ch",words)
[1] 3 4
> grep("b(e*|a*)ch",words)
[1] 1 3
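
In the last pattern, e* and a* each match zero or more repetitions of a single letter, so "beach" (an e followed by an a) is not matched. As a further sketch, anchors and character classes give other ways to narrow a match; the vector samples repeats the words above, and the two patterns are illustrative choices rather than part of the original.

```r
samples <- c("bach", "back", "beech", "beach", "black")

# ^ and $ anchor the pattern to the start and end of the string:
# starts with "b", ends with "ch", anything in between
b_to_ch <- grep("^b.*ch$", samples, value = TRUE)

# [ae] matches exactly one character that is either "a" or "e"
one_vowel <- grep("b[ae]ch", samples, value = TRUE)

print(b_to_ch)
print(one_vowel)
```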

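Creating, converting, and using factors

factor() has two main arguments, levels and labels: levels specifies the allowed values and their order, and labels gives the names displayed for those levels.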

> direction<-c("North","East","South","South") 

# the stored integer codes depend on the levels specified
> str(factor(direction,levels = c("North","East","South","West")))
 Factor w/ 4 levels "North","East",..: 1 2 3 3

# labels change the displayed names; the stored codes stay the same
> str(factor(direction,levels = c("North","East","South","West"),labels = c("N","E","S","W")))
 Factor w/ 4 levels "N","E","S","W": 1 2 3 3

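When an integer vector is converted to a factor, its values are stored as character levels. To recover the original numbers, convert the factor to character first and then to numeric; as.numeric alone returns the internal codes.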

> as.character(factor(2:6))
[1] "2" "3" "4" "5" "6"
> as.numeric(factor(2:6))
[1] 1 2 3 4 5
> as.numeric(as.character(factor(2:6)))
[1] 2 3 4 5 6
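
An equivalent idiom converts the levels once and then indexes them by the factor, which avoids converting every element through character. The sketch below uses a hypothetical factor f with unsorted, repeated values to make the codes visibly different from the data.

```r
# Hypothetical factor built from numbers; levels are sorted numerically: "5", "10", "20"
f <- factor(c(10, 20, 20, 5))

codes <- as.integer(f)                # internal codes, not the original numbers
values <- as.numeric(levels(f))[f]    # convert the levels, then index by the factor

print(codes)
print(values)
```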

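Levels are stored separately from a factor's integer codes; use levels() to view them.
table() counts how many times each level occurs.

> word <- factor(direction,levels = c("North","East","South","West"),labels = c("N","E","S","W"))
> levels(word)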
[1] "N" "E" "S" "W"
> nlevels(word)
[1] 4
> length(levels(word))
[1] 4
> table(word)
word
N E S W 
1 1 2 0
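
The count of 0 for W appears because the level exists even though no element uses it. If unused levels are not wanted, droplevels can remove them; the following sketch rebuilds the factor from the example so that it runs on its own, and the name trimmed is illustrative.

```r
direction <- c("North", "East", "South", "South")
word <- factor(direction,
               levels = c("North", "East", "South", "West"),
               labels = c("N", "E", "S", "W"))

trimmed <- droplevels(word)   # keeps only the levels that actually occur
counts <- table(trimmed)      # no zero-count entry for "W" any more

print(levels(trimmed))
print(counts)
```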

Using ordered factors

> status<-c("Lo","Hi","Med","Med","Hi")
> ordered.status<-factor(status,levels = c("Lo","Med","Hi"),ordered = TRUE)
> ordered.status
[1] Lo  Hi  Med Med Hi 
Levels: Lo < Med < Hi
> table(status)
status
 Hi  Lo Med 
  2   1   2 
> table(ordered.status)   # with an ordered factor the counts follow the level order; compare with the table(status) output above
ordered.status
 Lo Med  Hi 
  1   2   2
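
Beyond the order of the table output, an ordered factor also supports comparisons between levels. A brief sketch, rebuilding the factor under the hypothetical name ordered_status so that the block is self-contained:

```r
status <- c("Lo", "Hi", "Med", "Med", "Hi")
ordered_status <- factor(status, levels = c("Lo", "Med", "Hi"), ordered = TRUE)

at_least_med <- ordered_status >= "Med"  # compares each element against the level "Med"
highest <- max(ordered_status)           # the largest level that occurs in the data

print(at_least_med)
print(highest)
```

These comparisons are not meaningful for an unordered factor, which is one reason to set ordered = TRUE when the levels have a natural ranking.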
