读取大型数据文件的指定行和指定列

在实际数据分析过程中,我们经常会遇到体积比较大的数据文件,有些甚至大至几个G,每次读取大型数据文件的时候都需要耗费大量时间。另外因为R是使用内存进行计算的,所以有些大型数据文件,个人电脑是无法读取的。如果我们只是想熟悉数据的结构,并不需要将数据的完全读取,这时候读取部分行列就会很有用。

1
2
3
4
5
6
7
8
9
10
if(!file.exists("read")){
dir.create("read")
}# 创建工作路径文件夹
setwd("./read") # 设置工作路径
if(!file.exists("data")){
dir.create("data")
} # 新建data文件夹

1. 读取大型*.txt文件的指定行和指定列

首先我们创建一个包含12个变量(列),和100,000行的数据集,并保存成df1.txt,作为我们后面的示例数据

1.1 数据

1
2
3
4
5
set.seed(1024)
# 随机产生10^5个服从Uniform的随机数,重复12次
df1 <- data.frame(replicate(12, sample(runif(1e5), 1e5, replace = FALSE)))
names(df1)<-LETTERS[1:12] # 变量名
print(object.size(df1), units="Mb") # Data Size
1
## 9.2 Mb # Data Size
1
write.table(df1, "./data/df1.txt", sep = "\t", row.names = FALSE) #保存成`.txt`文件

DF1

1.2 读取指定行

1.2.1 方法1: read.table("file.txt",skip= ,nrows= )

第一种方法使用read.table的函数,参见https://stat.ethz.ch/pipermail/r-help/2005-April/070203.html

  • skip = m是从第几行开始

  • nrows = n是需要提取的数据的行数

1
read.table("./data/df1.txt", skip = 5, nrow = 10)# 从第五行读取10行
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
## V1 V2 V3 V4 V5 V6
## 1 0.7055347 0.12920632 0.4295947 0.00357990 0.49670788 0.01362666
## 2 0.6613471 0.95351707 0.8739549 0.10166125 0.30094111 0.22257004
## 3 0.5567262 0.89433791 0.2537335 0.17371310 0.20480512 0.43330278
## 4 0.4425508 0.46954000 0.2236883 0.50466840 0.05181082 0.20466480
## 5 0.5981411 0.83194728 0.3669091 0.79297566 0.27411663 0.02809154
## 6 0.9747350 0.77751011 0.8816688 0.75316621 0.70018055 0.23151788
## 7 0.6845900 0.03759691 0.8489364 0.74017064 0.78245960 0.49188055
## 8 0.2858485 0.61155436 0.5811817 0.49490496 0.89603403 0.25230591
## 9 0.1705107 0.75533076 0.6896211 0.04542131 0.20055274 0.07782909
## 10 0.6238470 0.77953256 0.5553935 0.68813005 0.09442922 0.94930740
## V7 V8 V9 V10 V11 V12
## 1 0.08933755 0.4630112 0.56212971 0.79979308 0.83367055 0.7287266
## 2 0.02664011 0.5798631 0.05194099 0.81347679 0.64757159 0.4758128
## 3 0.32301409 0.3054171 0.58960042 0.59590144 0.91273150 0.9210145
## 4 0.01953665 0.1504671 0.30098137 0.62692098 0.59301203 0.6033279
## 5 0.80829949 0.5033749 0.85286500 0.46670481 0.34482590 0.4594105
## 6 0.26011167 0.7453824 0.34376219 0.33137941 0.03101849 0.3981119
## 7 0.76514235 0.4330745 0.13669238 0.08658989 0.92115323 0.9402725
## 8 0.75182503 0.7115402 0.04356411 0.20478518 0.13023461 0.6082353
## 9 0.55976961 0.6797878 0.76038988 0.19407901 0.67365844 0.5217162
## 10 0.46805855 0.5972887 0.22406577 0.85538688 0.06790307 0.4130284
1
2
3
read.table("./data/df1.txt", skip = 5, nrow = 10,
col.names = colnames(read.table("./data/df1.txt", nrow = 1,
header = TRUE))) #保留原列名
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
## A B C D E F
## 1 0.7055347 0.12920632 0.4295947 0.00357990 0.49670788 0.01362666
## 2 0.6613471 0.95351707 0.8739549 0.10166125 0.30094111 0.22257004
## 3 0.5567262 0.89433791 0.2537335 0.17371310 0.20480512 0.43330278
## 4 0.4425508 0.46954000 0.2236883 0.50466840 0.05181082 0.20466480
## 5 0.5981411 0.83194728 0.3669091 0.79297566 0.27411663 0.02809154
## 6 0.9747350 0.77751011 0.8816688 0.75316621 0.70018055 0.23151788
## 7 0.6845900 0.03759691 0.8489364 0.74017064 0.78245960 0.49188055
## 8 0.2858485 0.61155436 0.5811817 0.49490496 0.89603403 0.25230591
## 9 0.1705107 0.75533076 0.6896211 0.04542131 0.20055274 0.07782909
## 10 0.6238470 0.77953256 0.5553935 0.68813005 0.09442922 0.94930740
## G H I J K L
## 1 0.08933755 0.4630112 0.56212971 0.79979308 0.83367055 0.7287266
## 2 0.02664011 0.5798631 0.05194099 0.81347679 0.64757159 0.4758128
## 3 0.32301409 0.3054171 0.58960042 0.59590144 0.91273150 0.9210145
## 4 0.01953665 0.1504671 0.30098137 0.62692098 0.59301203 0.6033279
## 5 0.80829949 0.5033749 0.85286500 0.46670481 0.34482590 0.4594105
## 6 0.26011167 0.7453824 0.34376219 0.33137941 0.03101849 0.3981119
## 7 0.76514235 0.4330745 0.13669238 0.08658989 0.92115323 0.9402725
## 8 0.75182503 0.7115402 0.04356411 0.20478518 0.13023461 0.6082353
## 9 0.55976961 0.6797878 0.76038988 0.19407901 0.67365844 0.5217162
## 10 0.46805855 0.5972887 0.22406577 0.85538688 0.06790307 0.4130284

1.2.2 方法2 :data.table::fread

我们也可以使用data.tablefread 函数读取特定行,方法和read.table类似, 读取的数据默认是data.table的格式的

1
if (!require("data.table")) install.packages("data.table")
1
2
library(data.table)
fread("./data/df1.txt", nrow = 10, col.names = LETTERS[1:12])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
## A B C D E F
## 1: 0.67670367 0.3026637 0.08774587 0.003039472 0.24752087 0.98302481
## 2: 0.03264569 0.5847326 0.42416403 0.847138941 0.24562266 0.04392178
## 3: 0.09310092 0.9051813 0.81617685 0.866701763 0.39628155 0.59205108
## 4: 0.95241243 0.6262199 0.49515652 0.600203589 0.82854606 0.20061278
## 5: 0.70553467 0.1292063 0.42959473 0.003579900 0.49670788 0.01362666
## 6: 0.66134711 0.9535171 0.87395493 0.101661252 0.30094111 0.22257004
## 7: 0.55672622 0.8943379 0.25373346 0.173713098 0.20480512 0.43330278
## 8: 0.44255077 0.4695400 0.22368835 0.504668399 0.05181082 0.20466480
## 9: 0.59814106 0.8319473 0.36690913 0.792975662 0.27411663 0.02809154
## 10: 0.97473503 0.7775101 0.88166877 0.753166205 0.70018055 0.23151788
## G H I J K L
## 1: 0.42874957 0.79947582 0.78018846 0.4417669 0.57603008 0.58735506
## 2: 0.50474417 0.41183691 0.83649693 0.4562639 0.42943533 0.04432009
## 3: 0.72369268 0.95217577 0.63295196 0.5959357 0.35415262 0.24978755
## 4: 0.33481795 0.07543831 0.45455761 0.6379352 0.91188111 0.83265795
## 5: 0.08933755 0.46301119 0.56212971 0.7997931 0.83367055 0.72872657
## 6: 0.02664011 0.57986306 0.05194099 0.8134768 0.64757159 0.47581280
## 7: 0.32301409 0.30541714 0.58960042 0.5959014 0.91273150 0.92101453
## 8: 0.01953665 0.15046710 0.30098137 0.6269210 0.59301203 0.60332790
## 9: 0.80829949 0.50337487 0.85286500 0.4667048 0.34482590 0.45941051
## 10: 0.26011167 0.74538236 0.34376219 0.3313794 0.03101849 0.39811192

1.3 读取指定列

1.3.1 方法1 read.table(file.txt, colClasses=)

使用read.tablecolClasses参数可以将不想读取的列的类型设成NULL同样可以达到效果,但是要知道每个变量的类型。
参见http://stackoverflow.com/questions/5788117/only-read-limited-number-of-columns-in-r

1
2
3
temp<-read.table("./data/df1.txt",
colClasses = c(rep("numeric", 6), rep("NULL", 6)),
header = TRUE)

1.3.2 方法2 data.table::fread

使用 data.table::freadselect参数和drop参数可以选择或丢弃你想要的列,这种方法相对前者更快更方便,因此我们推荐这一个。

1
2
3
library(data.table)
tmp1 <- fread("./data/df1.txt", select = 1:6)
head(tmp1)
1
2
3
4
5
6
7
## A B C D E F
## 1: 0.67670367 0.3026637 0.08774587 0.003039472 0.2475209 0.98302481
## 2: 0.03264569 0.5847326 0.42416403 0.847138941 0.2456227 0.04392178
## 3: 0.09310092 0.9051813 0.81617685 0.866701763 0.3962816 0.59205108
## 4: 0.95241243 0.6262199 0.49515652 0.600203589 0.8285461 0.20061278
## 5: 0.70553467 0.1292063 0.42959473 0.003579900 0.4967079 0.01362666
## 6: 0.66134711 0.9535171 0.87395493 0.101661252 0.3009411 0.22257004
1
2
tmp2 <- fread("./data/df1.txt", drop = 7:12)
head(tmp2)
1
2
3
4
5
6
7
## A B C D E F
## 1: 0.67670367 0.3026637 0.08774587 0.003039472 0.2475209 0.98302481
## 2: 0.03264569 0.5847326 0.42416403 0.847138941 0.2456227 0.04392178
## 3: 0.09310092 0.9051813 0.81617685 0.866701763 0.3962816 0.59205108
## 4: 0.95241243 0.6262199 0.49515652 0.600203589 0.8285461 0.20061278
## 5: 0.70553467 0.1292063 0.42959473 0.003579900 0.4967079 0.01362666
## 6: 0.66134711 0.9535171 0.87395493 0.101661252 0.3009411 0.22257004

2 读取EXCEL的指定行和列

2.1 数据

我们选取的数据是美国1954年以来的有效联邦基金利率(Effective Federal Funds Rate (DFF))的日度时间序列数据,下载地址:https://research.stlouisfed.org/fred2/series/DFF/downloaddata

2.2 导入想要的日度数据

源文件包含一些数据描述信息,这些不是我们关心,我们关心的只有日期和利率。我们使用XLConnect读取想要的xls区域。

  • startRow = ,starCol =,起始行列
  • endCol =, endCol = . 末行列
1
if (!require("XLConnect")) install.packages("XLConnect")
1
2
3
4
5
6
7
8
9
10
11
library(XLConnect)
library(ggplot2)
library(ggthemes)
filePath <- "./data/DFF.xls"
# 载入文件
wb <- loadWorkbook(filePath, create = TRUE)
# 读取工作簿(sheet)
DFF <- readWorksheet(wb, sheet = "DFF", header = TRUE,startRow = 59,
endCol =2)
head(DFF)
1
2
3
4
5
6
7
## DATE VALUE
## 1 1954-07-01 1.13
## 2 1954-07-02 1.25
## 3 1954-07-03 1.25
## 4 1954-07-04 1.25
## 5 1954-07-05 0.88
## 6 1954-07-06 0.25
1
2
3
4
# 使用数据
ggplot(DFF, aes(x=DATE, y=VALUE))+geom_line()+
theme_wsj()+
labs(list(title = "Effective Federal Funds Rate"))

DFF

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
##
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggthemes_2.2.1 ggplot2_1.0.1 XLConnect_0.2-11
## [4] XLConnectJars_0.2-9 data.table_1.9.6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.1 knitr_1.11 magrittr_1.5 MASS_7.3-43
## [5] munsell_0.4.2 colorspace_1.2-6 stringr_1.0.0 plyr_1.8.3
## [9] tools_3.2.2 grid_3.2.2 gtable_0.1.2 htmltools_0.2.6
## [13] yaml_2.1.13 digest_0.6.8 rJava_0.9-7 reshape2_1.4.1
## [17] evaluate_0.8 rmarkdown_0.8 labeling_0.3 stringi_1.0-1
## [21] scales_0.3.0 chron_2.3-47 proto_0.3-10