读取大型数据文件的指定行和指定列

在实际数据分析过程中，我们经常会遇到体积比较大的数据文件，有些甚至大至几个G，每次读取大型数据文件的时候都需要耗费大量时间。另外因为R是使用内存进行计算的，所以有些大型数据文件，个人电脑是无法读取的。如果我们只是想熟悉数据的结构，并不需要将数据的完全读取，这时候读取部分行列就会很有用。

if(!file.exists("read")){
  dir.create("read")
}# 创建工作路径文件夹
setwd("./read") # 设置工作路径
if(!file.exists("data")){
  dir.create("data")
} # 新建data文件夹

1. 读取大型`*.txt`文件的指定行和指定列

首先我们创建一个包含12个变量(列)，和100,000行的数据集，并保存成df1.txt,作为我们后面的示例数据

1.1 数据

set.seed(1024)
# 随机产生10^5个服从Uniform的随机数，重复12次
df1 <- data.frame(replicate(12, sample(runif(1e5), 1e5, replace = FALSE)))
names(df1)<-LETTERS[1:12] # 变量名
print(object.size(df1), units="Mb") # Data Size

1	## 9.2 Mb # Data Size

1	write.table(df1, "./data/df1.txt", sep = "\t", row.names = FALSE) #保存成`.txt`文件

DF1

1.2 读取指定行

1.2.1 方法1： `read.table("file.txt",skip= ,nrows= )`

第一种方法使用read.table的函数,参见https://stat.ethz.ch/pipermail/r-help/2005-April/070203.html

skip = m是从第几行开始
nrows = n是需要提取的数据的行数

1	read.table("./data/df1.txt", skip = 5, nrow = 10)# 从第五行读取10行

##           V1         V2        V3         V4         V5         V6
## 1  0.7055347 0.12920632 0.4295947 0.00357990 0.49670788 0.01362666
## 2  0.6613471 0.95351707 0.8739549 0.10166125 0.30094111 0.22257004
## 3  0.5567262 0.89433791 0.2537335 0.17371310 0.20480512 0.43330278
## 4  0.4425508 0.46954000 0.2236883 0.50466840 0.05181082 0.20466480
## 5  0.5981411 0.83194728 0.3669091 0.79297566 0.27411663 0.02809154
## 6  0.9747350 0.77751011 0.8816688 0.75316621 0.70018055 0.23151788
## 7  0.6845900 0.03759691 0.8489364 0.74017064 0.78245960 0.49188055
## 8  0.2858485 0.61155436 0.5811817 0.49490496 0.89603403 0.25230591
## 9  0.1705107 0.75533076 0.6896211 0.04542131 0.20055274 0.07782909
## 10 0.6238470 0.77953256 0.5553935 0.68813005 0.09442922 0.94930740
##            V7        V8         V9        V10        V11       V12
## 1  0.08933755 0.4630112 0.56212971 0.79979308 0.83367055 0.7287266
## 2  0.02664011 0.5798631 0.05194099 0.81347679 0.64757159 0.4758128
## 3  0.32301409 0.3054171 0.58960042 0.59590144 0.91273150 0.9210145
## 4  0.01953665 0.1504671 0.30098137 0.62692098 0.59301203 0.6033279
## 5  0.80829949 0.5033749 0.85286500 0.46670481 0.34482590 0.4594105
## 6  0.26011167 0.7453824 0.34376219 0.33137941 0.03101849 0.3981119
## 7  0.76514235 0.4330745 0.13669238 0.08658989 0.92115323 0.9402725
## 8  0.75182503 0.7115402 0.04356411 0.20478518 0.13023461 0.6082353
## 9  0.55976961 0.6797878 0.76038988 0.19407901 0.67365844 0.5217162
## 10 0.46805855 0.5972887 0.22406577 0.85538688 0.06790307 0.4130284

1
2
3

read.table("./data/df1.txt", skip = 5, nrow = 10,
           col.names = colnames(read.table("./data/df1.txt", nrow = 1,
                                           header = TRUE))) #保留原列名

##            A          B         C          D          E          F
## 1  0.7055347 0.12920632 0.4295947 0.00357990 0.49670788 0.01362666
## 2  0.6613471 0.95351707 0.8739549 0.10166125 0.30094111 0.22257004
## 3  0.5567262 0.89433791 0.2537335 0.17371310 0.20480512 0.43330278
## 4  0.4425508 0.46954000 0.2236883 0.50466840 0.05181082 0.20466480
## 5  0.5981411 0.83194728 0.3669091 0.79297566 0.27411663 0.02809154
## 6  0.9747350 0.77751011 0.8816688 0.75316621 0.70018055 0.23151788
## 7  0.6845900 0.03759691 0.8489364 0.74017064 0.78245960 0.49188055
## 8  0.2858485 0.61155436 0.5811817 0.49490496 0.89603403 0.25230591
## 9  0.1705107 0.75533076 0.6896211 0.04542131 0.20055274 0.07782909
## 10 0.6238470 0.77953256 0.5553935 0.68813005 0.09442922 0.94930740
##             G         H          I          J          K         L
## 1  0.08933755 0.4630112 0.56212971 0.79979308 0.83367055 0.7287266
## 2  0.02664011 0.5798631 0.05194099 0.81347679 0.64757159 0.4758128
## 3  0.32301409 0.3054171 0.58960042 0.59590144 0.91273150 0.9210145
## 4  0.01953665 0.1504671 0.30098137 0.62692098 0.59301203 0.6033279
## 5  0.80829949 0.5033749 0.85286500 0.46670481 0.34482590 0.4594105
## 6  0.26011167 0.7453824 0.34376219 0.33137941 0.03101849 0.3981119
## 7  0.76514235 0.4330745 0.13669238 0.08658989 0.92115323 0.9402725
## 8  0.75182503 0.7115402 0.04356411 0.20478518 0.13023461 0.6082353
## 9  0.55976961 0.6797878 0.76038988 0.19407901 0.67365844 0.5217162
## 10 0.46805855 0.5972887 0.22406577 0.85538688 0.06790307 0.4130284

1.2.2 方法2 ：`data.table::fread`

我们也可以使用data.table的fread 函数读取特定行,方法和read.table类似, 读取的数据默认是data.table的格式的

1	if (!require("data.table")) install.packages("data.table")

1 2	library(data.table) fread("./data/df1.txt", nrow = 10, col.names = LETTERS[1:12])

##              A         B          C           D          E          F
##  1: 0.67670367 0.3026637 0.08774587 0.003039472 0.24752087 0.98302481
##  2: 0.03264569 0.5847326 0.42416403 0.847138941 0.24562266 0.04392178
##  3: 0.09310092 0.9051813 0.81617685 0.866701763 0.39628155 0.59205108
##  4: 0.95241243 0.6262199 0.49515652 0.600203589 0.82854606 0.20061278
##  5: 0.70553467 0.1292063 0.42959473 0.003579900 0.49670788 0.01362666
##  6: 0.66134711 0.9535171 0.87395493 0.101661252 0.30094111 0.22257004
##  7: 0.55672622 0.8943379 0.25373346 0.173713098 0.20480512 0.43330278
##  8: 0.44255077 0.4695400 0.22368835 0.504668399 0.05181082 0.20466480
##  9: 0.59814106 0.8319473 0.36690913 0.792975662 0.27411663 0.02809154
## 10: 0.97473503 0.7775101 0.88166877 0.753166205 0.70018055 0.23151788
##              G          H          I         J          K          L
##  1: 0.42874957 0.79947582 0.78018846 0.4417669 0.57603008 0.58735506
##  2: 0.50474417 0.41183691 0.83649693 0.4562639 0.42943533 0.04432009
##  3: 0.72369268 0.95217577 0.63295196 0.5959357 0.35415262 0.24978755
##  4: 0.33481795 0.07543831 0.45455761 0.6379352 0.91188111 0.83265795
##  5: 0.08933755 0.46301119 0.56212971 0.7997931 0.83367055 0.72872657
##  6: 0.02664011 0.57986306 0.05194099 0.8134768 0.64757159 0.47581280
##  7: 0.32301409 0.30541714 0.58960042 0.5959014 0.91273150 0.92101453
##  8: 0.01953665 0.15046710 0.30098137 0.6269210 0.59301203 0.60332790
##  9: 0.80829949 0.50337487 0.85286500 0.4667048 0.34482590 0.45941051
## 10: 0.26011167 0.74538236 0.34376219 0.3313794 0.03101849 0.39811192

1.3 读取指定列

1.3.1 方法1 `read.table(file.txt, colClasses=)`

使用read.table的colClasses参数可以将不想读取的列的类型设成NULL同样可以达到效果，但是要知道每个变量的类型。
参见http://stackoverflow.com/questions/5788117/only-read-limited-number-of-columns-in-r

1
2
3

temp<-read.table("./data/df1.txt", 
                 colClasses = c(rep("numeric", 6), rep("NULL", 6)), 
                 header = TRUE)

1.3.2 方法2 `data.table::fread`

使用 data.table::fread的select参数和drop参数可以选择或丢弃你想要的列，这种方法相对前者更快更方便，因此我们推荐这一个。

1
2
3

library(data.table)
tmp1 <- fread("./data/df1.txt", select = 1:6)
head(tmp1)

##             A         B          C           D         E          F
## 1: 0.67670367 0.3026637 0.08774587 0.003039472 0.2475209 0.98302481
## 2: 0.03264569 0.5847326 0.42416403 0.847138941 0.2456227 0.04392178
## 3: 0.09310092 0.9051813 0.81617685 0.866701763 0.3962816 0.59205108
## 4: 0.95241243 0.6262199 0.49515652 0.600203589 0.8285461 0.20061278
## 5: 0.70553467 0.1292063 0.42959473 0.003579900 0.4967079 0.01362666
## 6: 0.66134711 0.9535171 0.87395493 0.101661252 0.3009411 0.22257004

1 2	tmp2 <- fread("./data/df1.txt", drop = 7:12) head(tmp2)

##             A         B          C           D         E          F
## 1: 0.67670367 0.3026637 0.08774587 0.003039472 0.2475209 0.98302481
## 2: 0.03264569 0.5847326 0.42416403 0.847138941 0.2456227 0.04392178
## 3: 0.09310092 0.9051813 0.81617685 0.866701763 0.3962816 0.59205108
## 4: 0.95241243 0.6262199 0.49515652 0.600203589 0.8285461 0.20061278
## 5: 0.70553467 0.1292063 0.42959473 0.003579900 0.4967079 0.01362666
## 6: 0.66134711 0.9535171 0.87395493 0.101661252 0.3009411 0.22257004

2 读取EXCEL的指定行和列

2.1 数据

我们选取的数据是美国1954年以来的有效联邦基金利率(Effective Federal Funds Rate (DFF))的日度时间序列数据,下载地址：https://research.stlouisfed.org/fred2/series/DFF/downloaddata

2.2 导入想要的日度数据

源文件包含一些数据描述信息，这些不是我们关心，我们关心的只有日期和利率。我们使用XLConnect读取想要的xls区域。

startRow = ,starCol =,起始行列
endCol =, endCol = . 末行列

1	if (!require("XLConnect")) install.packages("XLConnect")

library(XLConnect) 
library(ggplot2)
library(ggthemes)
filePath <- "./data/DFF.xls"
# 载入文件
wb <- loadWorkbook(filePath, create = TRUE)
# 读取工作簿(sheet)
DFF <- readWorksheet(wb, sheet = "DFF", header = TRUE,startRow = 59,
                             endCol =2)
head(DFF)

##         DATE VALUE
## 1 1954-07-01  1.13
## 2 1954-07-02  1.25
## 3 1954-07-03  1.25
## 4 1954-07-04  1.25
## 5 1954-07-05  0.88
## 6 1954-07-06  0.25

# 使用数据
ggplot(DFF, aes(x=DATE, y=VALUE))+geom_line()+
  theme_wsj()+
  labs(list(title = "Effective Federal Funds Rate"))

DFF

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.1 (El Capitan)
## 
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggthemes_2.2.1      ggplot2_1.0.1       XLConnect_0.2-11   
## [4] XLConnectJars_0.2-9 data.table_1.9.6   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1      knitr_1.11       magrittr_1.5     MASS_7.3-43     
##  [5] munsell_0.4.2    colorspace_1.2-6 stringr_1.0.0    plyr_1.8.3      
##  [9] tools_3.2.2      grid_3.2.2       gtable_0.1.2     htmltools_0.2.6 
## [13] yaml_2.1.13      digest_0.6.8     rJava_0.9-7      reshape2_1.4.1  
## [17] evaluate_0.8     rmarkdown_0.8    labeling_0.3     stringi_1.0-1   
## [21] scales_0.3.0     chron_2.3-47     proto_0.3-10

CHANGSHUN的BLOG

Changshun

读取大型数据文件的指定行和指定列

Nov 9, 2015

读取大型数据文件的指定行和指定列

1. 读取大型`*.txt`文件的指定行和指定列

1.1 数据

1.2 读取指定行

1.2.1 方法1： `read.table("file.txt",skip= ,nrows= )`

1.2.2 方法2 ：`data.table::fread`

1.3 读取指定列

1.3.1 方法1 `read.table(file.txt, colClasses=)`

1.3.2 方法2 `data.table::fread`

2 读取EXCEL的指定行和列

2.1 数据

2.2 导入想要的日度数据

CHANGSHUN的BLOG

Changshun

读取大型数据文件的指定行和指定列

Nov 9, 2015

读取大型数据文件的指定行和指定列

1. 读取大型*.txt文件的指定行和指定列

1.1 数据

1.2 读取指定行

1.2.1 方法1： read.table("file.txt",skip= ,nrows= )

1.2.2 方法2 ：data.table::fread

1.3 读取指定列

1.3.1 方法1 read.table(file.txt, colClasses=)

1.3.2 方法2 data.table::fread

2 读取EXCEL的指定行和列

2.1 数据

2.2 导入想要的日度数据

1. 读取大型`*.txt`文件的指定行和指定列

1.2.1 方法1： `read.table("file.txt",skip= ,nrows= )`

1.2.2 方法2 ：`data.table::fread`

1.3.1 方法1 `read.table(file.txt, colClasses=)`

1.3.2 方法2 `data.table::fread`