Goal

  • Learn the ways to read files in R

Study notes

  1. Data Input in the utils package
  2. readLines in the base package
  3. stri_read_lines in the stringi package
  4. Data input in the xlsx package
  5. Read a delimited file in the readr package

1. Data Input in the utils package

read.table is used as the example.

For a detailed description of the read.table parameters, see http://www.360doc.com/showweb/0/0/1029326103.aspx

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = FALSE,
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

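Parameter file

Form 1: "file name". If no path is given, the file is read from the current working directory. Use getwd() to get the current working directory and setwd("path") to change it.
Form 2: absolute path plus file name, for example "D:\…\test.xlsx". In R code, write each backslash as \\ or use / instead.
Form 3: "clipboard". Copy the data first, then read it from the clipboard.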

getwd()
setwd("....\\...")  # enter the path you want to set
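
The first two forms of the file argument can be sketched as follows. The temporary directory and the file name test.txt are assumptions made for illustration, not the author's setup. file.path() builds the path with forward slashes, which R accepts on Windows as well, so no backslashes have to be escaped.

```r
# Assumption: a throwaway directory stands in for your real working directory.
demo_dir <- tempfile("demo_dir")
dir.create(demo_dir)

old_wd <- setwd(demo_dir)            # setwd() returns the previous directory
writeLines("1\t2\t3", "test.txt")    # create a tiny file in the new working directory

# Form 1: bare file name, resolved against getwd()
by_name <- read.table("test.txt")

# Form 2: full path, built with file.path() instead of hand-written backslashes
full_path <- file.path(getwd(), "test.txt")
by_path <- read.table(full_path)

setwd(old_wd)                        # restore the original working directory
print(identical(by_name, by_path))   # both forms read the same file
```

Saving the return value of setwd() makes it easy to go back to where you started.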

Create a table in the working directory for testing and name it test.xlsx.

[Screenshot: the test table in test.xlsx]

x1<-read.table('test.xlsx')
View(x1)

x1 looks like this:

[Screenshot: x1 in the data viewer]

> x1 <- read.table('test.xlsx')
Warning messages:
1: In read.table("test.xlsx") : line 1 appears to contain embedded nulls
2: In read.table("test.xlsx") :
  incomplete final line found by readTableHeader on 'test.xlsx'

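The warning "incomplete final line" means read.table() could not find a line terminator at the end of the file. The real cause is that an .xlsx file is a zip-compressed binary format rather than plain text, which is also why line 1 "appears to contain embedded nulls". read.table() only parses text, so I recommend not using it to read Excel files directly.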

Workaround: copy the data into a text file and name it test.txt.

x2<-read.table('test.txt')
print(x2)

x2 looks like this:

[Screenshot: printed x2]


Notice that the first line was not read. Why? That brings us to the comment.char parameter.


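Parameter comment.char

This parameter sets the character that starts a comment. The default is "#", so the line in my txt file that begins with # was treated as a comment and not read. Let's set comment.char = "" and try again.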

x2<-read.table('test.txt',comment.char = "")

x2 looks like this:

[Screenshot: x2 with the # line kept as the first row]
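
The effect of comment.char can be reproduced without any file by passing the data through the text argument. The three-line string below is an assumed, simplified stand-in for test.txt, not the author's actual file.

```r
# Assumption: a three-line, tab-separated stand-in for test.txt.
txt <- "#\tA\tB\n1\t2\t3\n4\t5\t6"

# Default comment.char = "#": everything from "#" to the end of the line is dropped,
# so the first line becomes empty and is skipped.
with_default <- read.table(text = txt, sep = "\t")

# comment.char = "" turns comment handling off, so the "#" line is kept as data.
no_comments <- read.table(text = txt, sep = "\t", comment.char = "")

print(nrow(with_default))  # 2
print(nrow(no_comments))   # 3
```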


Now I want the first line to be used as the header, which is what the header parameter is for.


Parameter header

The default is FALSE, meaning the first line is not used as the header. To use the first line as the header, set it to TRUE.

x2<-read.table('test.txt',comment.char = "",header = TRUE)

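x2 is shown below. The # in the header is not a valid column name, so it is converted to "X.".

[Screenshot: x2 with the first line used as the header]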
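
The renaming comes from check.names, which appears in the read.table signature with a default of TRUE. A minimal sketch, again using an assumed stand-in string instead of the real test.txt:

```r
# Assumption: a simplified stand-in for test.txt whose header starts with "#".
txt <- "#\tA\tB\n1\t2\t3\n4\t5\t6"

# check.names = TRUE (the default) passes the header through make.names(),
# which rewrites names that are not syntactically valid.
checked <- read.table(text = txt, sep = "\t", comment.char = "", header = TRUE)

# check.names = FALSE keeps the header text exactly as written.
raw <- read.table(text = txt, sep = "\t", comment.char = "",
                  header = TRUE, check.names = FALSE)

print(names(checked))
print(names(raw))
```

Keeping the raw names means columns such as `#` must be accessed with backticks or with raw[["#"]].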

To specify the column names or row names yourself, use the row.names and col.names parameters.


Parameters row.names and col.names

Taking column names as the example:

x2<-read.table('test.txt',comment.char = "",header = TRUE,
               col.names=c("a","b","c"))

x2 looks like this:

[Screenshot: x2 with columns named a, b, c]


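The column names were changed successfully.

Why use c() here? c() combines its arguments into a vector or list, and col.names expects a vector of names. It is also simply what I am used to.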
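
The example above only changes column names. As a sketch of the row.names side, the snippet below uses a small made-up table (the id column and the values are assumptions for illustration) and passes row.names = 1 so that the first column becomes the row names.

```r
# Assumption: a made-up table whose first column holds row labels.
txt <- "id\tA\tB\nr1\t2\t3\nr2\t5\t6"

# row.names = 1 uses the first column as row names instead of keeping it as data.
x <- read.table(text = txt, sep = "\t", header = TRUE, row.names = 1)
print(rownames(x))
print(x["r2", "B"])

# col.names takes a character vector with one name per column, built here with c().
y <- read.table(text = txt, sep = "\t", header = TRUE,
                col.names = c("key", "first", "second"))
print(names(y))
```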

You can use class() to check the type of the data after reading it.

class(x2)
[1] "data.frame"

So read.table() is mainly for reading tabular data, and the result is an object of class "data.frame".

That concludes the look at read.table.


Besides read.table, the utils package has the following functions for reading files. Their parameters are similar, but the default values differ.

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
            dec = ",", fill = TRUE, comment.char = "", ...)
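
To make the difference in defaults concrete, here is a small sketch with made-up values. The same two numbers are written once with "," as separator and "." as decimal mark, and once with ";" as separator and "," as decimal mark.

```r
# Assumption: made-up data, written in the two conventions.
comma_style     <- "x,y\n1.5,2.25"   # fields split on ",", decimals use "."
semicolon_style <- "x;y\n1,5;2,25"   # fields split on ";", decimals use ","

a <- read.csv(text = comma_style)        # sep = ",", dec = "."
b <- read.csv2(text = semicolon_style)   # sep = ";", dec = ","

print(a)
print(b)
print(isTRUE(all.equal(a, b)))   # both calls give the same numeric data frame
```

Because header = TRUE is the default for these functions, the first line becomes the column names without any extra argument.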

2. readLines in the base package

readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE,
          encoding = "unknown", skipNul = FALSE)

> x3 <- readLines('test.txt')
> x3
[1] "#\t中文\tEnglish" "1\t2\t3"
[3] "4\t5\t6" "中文\t8\t9"
[5] "13\tEnglish\t9" "13\t14\t%"
[7] "16\t17\t18" "19\t20\t21"
> class(x3)
[1] "character"

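For tabular data, readLines returns each line as a single string in a character vector. The tab characters between cells are kept and are printed as "\t".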
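
If the individual cells are needed, the lines returned by readLines can be split on the tab character. This is a minimal sketch using a temporary two-line file (an assumption for illustration), not the full test.txt.

```r
# Assumption: a temporary two-line file stands in for test.txt.
path <- tempfile(fileext = ".txt")
writeLines(c("1\t2\t3", "4\t5\t6"), path)

lines <- readLines(path)                        # one string per line, tabs included
fields <- strsplit(lines, "\t", fixed = TRUE)   # list with one character vector per line

# Bind the pieces into a character matrix, one row per line of the file.
m <- do.call(rbind, fields)
print(dim(m))
print(m[2, 3])

unlink(path)                                    # remove the temporary file
```

Everything stays character data here, whereas read.table() would also guess a type for each column.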

3. stri_read_lines in the stringi package

stri_read_lines(con, encoding = NULL, fname = con, fallback_encoding = NULL)

First install the stringi package.

install.packages("stringi")
library(stringi)
> x3 <- stri_read_lines('test.txt')
> x3
[1] "#\t中文\tEnglish" "1\t2\t3"
[3] "4\t5\t6" "中文\t8\t9"
[5] "13\tEnglish\t9" "13\t14\t%"
[7] "16\t17\t18" "19\t20\t21"
> class(x3)
[1] "character"

For tabular data, stri_read_lines likewise returns one string per line, with the tab characters printed as "\t".

4. Data input in the xlsx package

First install the xlsx package with install.packages(), then load it with library().

install.packages("xlsx")
library(xlsx)

If Java is not installed on the computer, loading the package fails with this error:

Error: package or namespace load failed for 'xlsx':
 .onLoad failed in loadNamespace() for 'rJava', details:
  call: fun(libname, pkgname)
  error: JAVA_HOME cannot be determined from the Registry

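So Java has to be installed from the official site: https://www.oracle.com/java/technologies/javase-downloads.html

However, loading the package still failed with an error. I will continue this section once I have solved the problem.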
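
Since the error message mentions JAVA_HOME, one thing worth inspecting is whether that environment variable is visible to R at all. The snippet below is only a rough diagnostic sketch and not a confirmed fix for the error above; it uses base R functions only.

```r
# Sys.getenv() returns "" when the variable is not set.
java_home <- Sys.getenv("JAVA_HOME")

if (nzchar(java_home) && dir.exists(java_home)) {
  message("JAVA_HOME points to an existing directory: ", java_home)
} else if (nzchar(java_home)) {
  message("JAVA_HOME is set to ", java_home, " but that directory does not exist.")
} else {
  message("JAVA_HOME is not set, so rJava has to locate Java some other way.")
}
```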


5. Read a delimited file in the readr package

read_delim(
  file,
  delim = NULL,
  quote = "\"",
  escape_backslash = FALSE,
  escape_double = TRUE,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  comment = "",
  trim_ws = FALSE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)

read_csv(
  file,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  quote = "\"",
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  name_repair = "unique",
  num_threads = readr_threads(),
  progress = show_progress(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)

read_csv2(
  file,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  quote = "\"",
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  progress = show_progress(),
  name_repair = "unique",
  num_threads = readr_threads(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)

read_tsv(
  file,
  col_names = TRUE,
  col_types = NULL,
  col_select = NULL,
  id = NULL,
  locale = default_locale(),
  na = c("", "NA"),
  quoted_na = TRUE,
  quote = "\"",
  comment = "",
  trim_ws = TRUE,
  skip = 0,
  n_max = Inf,
  guess_max = min(1000, n_max),
  progress = show_progress(),
  name_repair = "unique",
  num_threads = readr_threads(),
  show_col_types = should_show_types(),
  skip_empty_rows = TRUE,
  lazy = should_read_lazy()
)
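
The signatures above can be tried out with a short sketch. It assumes readr 2.0 or later is installed (the show_col_types argument comes from the signatures listed here) and uses a temporary tab-separated file with made-up contents, so the call is wrapped in a check and is simply skipped when readr is missing.

```r
# Assumption: readr may not be installed, so guard the whole example.
if (requireNamespace("readr", quietly = TRUE)) {
  # Assumption: a temporary tab-separated file with made-up contents.
  path <- tempfile(fileext = ".tsv")
  writeLines(c("a\tb\tc", "1\t2\t3", "4\t5\t6"), path)

  # col_names = TRUE is the default, so the first line becomes the column names.
  x5 <- readr::read_tsv(path, show_col_types = FALSE)

  print(class(x5))   # a tibble, which is also a data.frame
  print(names(x5))
  print(nrow(x5))

  unlink(path)       # remove the temporary file
}
```

Unlike read.table(), the readr functions treat the first line as the header by default, and their default comment = "" means a leading # is not treated as a comment.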