data.table Way

This is a learning note for the course Data Analysis in R, the data.table Way.
This course introduces the package ‘data.table’ which is a very powerful tool for data manipulation. I borrow two built-in data frames: mtcars and iris throughout the entire article.

1
2
3
library(data.table)
library(ggplot2)
library(plyr)

第一个 data.table

1
2
3
df1 <- data.frame(A = letters[1:6], B = 1:2, C = TRUE)
dt1 <- data.table(A = letters[1:6], B = 1:2, C = TRUE)
dt1
1
2
3
4
5
6
7
## A B C
## 1: a 1 TRUE
## 2: b 2 TRUE
## 3: c 1 TRUE
## 4: d 2 TRUE
## 5: e 1 TRUE
## 6: f 2 TRUE

创建的data.table的类型有两个:

1
class(dt1)
1
## [1] "data.table" "data.frame"

data.frame继承了base包里data.frame类,可以看作是一个次级类
在其他不能识别data.table的package里,它被当作data.frame使用,适用于data.frame的功能也适用于它。

data.frame或者matrix转化成data.fame

1
2
dt2 <- as.data.table(df1) # 也可以使用`data.table()`
dt3 <- as.data.table(matrix(letters[-26], ncol = 5))

Basketball Reference: boxscore

1
2
3
boxscore<-fread("https://s3.amazonaws.com/demo-datasets/basketball/boxscore.csv")# 读取的慢的话可以这个数据集下载下来
boxscore[, ("V1") := NULL] # 删除重复变量
head(boxscore)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
## id date game_id team_home_abbr team player
## 1: 1 2012-10-30 201210300CLE CLE WAS A.J. Price
## 2: 2 2012-10-30 201210300CLE CLE WAS Emeka Okafor
## 3: 3 2012-10-30 201210300CLE CLE WAS Trevor Ariza
## 4: 4 2012-10-30 201210300CLE CLE WAS Bradley Beal
## 5: 5 2012-10-30 201210300CLE CLE WAS Trevor Booker
## 6: 6 2012-10-30 201210300CLE CLE WAS Jordan Crawford
## player_id minutes fgm fga 3pm 3pa ftm fta offreb defreb totreb assists
## 1: priceaj01 29.40 2 13 2 9 1 1 1 1 2 6
## 2: okafoem01 24.58 4 10 0 0 2 4 5 2 7 0
## 3: arizatr01 24.58 3 8 2 4 1 2 1 2 3 4
## 4: bealbr01 21.55 2 8 2 4 2 2 0 3 3 3
## 5: booketr01 16.75 2 9 0 1 0 0 1 0 1 1
## 6: crawfjo02 28.70 4 13 0 6 3 4 1 2 3 5
## steals blocks turnovers fouls points plusminus
## 1: 0 0 1 1 7 -11
## 2: 0 4 1 1 10 -5
## 3: 3 2 0 0 9 -9
## 4: 1 0 2 1 8 -16
## 5: 1 1 4 4 4 -15
## 6: 1 1 1 1 11 0

iris

1
2
DT_iris <- as.data.table(iris)
tail(DT_iris)
1
2
3
4
5
6
7
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 6.7 3.3 5.7 2.5 virginica
## 2: 6.7 3.0 5.2 2.3 virginica
## 3: 6.3 2.5 5.0 1.9 virginica
## 4: 6.5 3.0 5.2 2.0 virginica
## 5: 6.2 3.4 5.4 2.3 virginica
## 6: 5.9 3.0 5.1 1.8 virginica

常用功能

  1. DT[i, i, by] 标准subsetting语句
  2. .()list()的缩写
  3. () : vector的缩写
  4. := 赋值函数
  5. data.table 内置变量
    • .N 行数
    • .SD subset data.table
    • .SDcols 列index

取子集(subsetting)

1
2
3
dtb <- copy(boxscore)
dtb[, id:=NULL]
dti <- copy(DT_iris)

取row

一般方法

1
dtb[3] # return 3rd row
1
2
3
4
5
6
## date game_id team_home_abbr team player player_id
## 1: 2012-10-30 201210300CLE CLE WAS Trevor Ariza arizatr01
## minutes fgm fga 3pm 3pa ftm fta offreb defreb totreb assists steals
## 1: 24.58 3 8 2 4 1 2 1 2 3 4 3
## blocks turnovers fouls points plusminus
## 1: 2 0 0 9 -9
1
dtb[2:3] # return 2-3 rows
1
2
3
4
5
6
7
8
9
## date game_id team_home_abbr team player player_id
## 1: 2012-10-30 201210300CLE CLE WAS Emeka Okafor okafoem01
## 2: 2012-10-30 201210300CLE CLE WAS Trevor Ariza arizatr01
## minutes fgm fga 3pm 3pa ftm fta offreb defreb totreb assists steals
## 1: 24.58 4 10 0 0 2 4 5 2 7 0 0
## 2: 24.58 3 8 2 4 1 2 1 2 3 4 3
## blocks turnovers fouls points plusminus
## 1: 4 1 1 10 -5
## 2: 2 0 0 9 -9
1
dtb[c(1,4,5)] # return 1,4,5 rows
1
2
3
4
5
6
7
8
9
10
11
12
## date game_id team_home_abbr team player player_id
## 1: 2012-10-30 201210300CLE CLE WAS A.J. Price priceaj01
## 2: 2012-10-30 201210300CLE CLE WAS Bradley Beal bealbr01
## 3: 2012-10-30 201210300CLE CLE WAS Trevor Booker booketr01
## minutes fgm fga 3pm 3pa ftm fta offreb defreb totreb assists steals
## 1: 29.40 2 13 2 9 1 1 1 1 2 6 0
## 2: 21.55 2 8 2 4 2 2 0 3 3 3 1
## 3: 16.75 2 9 0 1 0 0 1 0 1 1 1
## blocks turnovers fouls points plusminus
## 1: 0 1 1 7 -11
## 2: 0 2 1 8 -16
## 3: 1 4 4 4 -15

注意:和data.frame取row不同是, 只要给出i(row)就可以了,不需要写,

内部变量.N

.N返回row数

1
dtb[.N-1] #return penultimate(倒数第二) row
1
2
3
4
5
6
## date game_id team_home_abbr team player player_id
## 1: 2005-06-23 200506230SAS SAS SAS Brent Barry barrybr01
## minutes fgm fga 3pm 3pa ftm fta offreb defreb totreb assists steals
## 1: 29 2 3 1 1 0 0 0 4 4 2 2
## blocks turnovers fouls points plusminus
## 1: 1 2 2 5 14

取column(j)

取column时,行可以用一个,跳过,但是不能省区,取多个column时使用list(var1name, var2name),往往简写成.(var1name, var2name)

1
dtb[, .(date,player, points)] #return points data each player every day
1
2
3
4
5
6
7
8
9
10
11
12
## date player points
## 1: 2012-10-30 A.J. Price 7
## 2: 2012-10-30 Emeka Okafor 10
## 3: 2012-10-30 Trevor Ariza 9
## 4: 2012-10-30 Bradley Beal 8
## 5: 2012-10-30 Trevor Booker 4
## ---
## 236199: 2005-06-23 Manu Ginobili 23
## 236200: 2005-06-23 Nazr Mohammed 0
## 236201: 2005-06-23 Robert Horry 15
## 236202: 2005-06-23 Brent Barry 5
## 236203: 2005-06-23 Devin Brown 0

比较 .(var1),和var1

前者保留data.table结构, 后者丢弃结构

1
class(dtb[1:10, .(points)])
1
## [1] "data.table" "data.frame"
1
class(dtb[1:10, points])
1
## [1] "integer"

条件取值

回忆下data.frame条件取值

1
head(as.data.frame(boxscore)[boxscore$player=="LeBron James",])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
## id date game_id team_home_abbr team player
## 54 54 2012-10-30 201210300MIA MIA MIA LeBron James
## 440 440 2012-11-02 201211020NYK NYK MIA LeBron James
## 671 671 2012-11-03 201211030MIA MIA MIA LeBron James
## 958 958 2012-11-05 201211050MIA MIA MIA LeBron James
## 1266 1266 2012-11-07 201211070MIA MIA MIA LeBron James
## 1431 1431 2012-11-09 201211090ATL ATL MIA LeBron James
## player_id minutes fgm fga 3pm 3pa ftm fta offreb defreb totreb
## 54 jamesle01 28.87 10 16 2 4 4 5 1 9 10
## 440 jamesle01 36.68 8 16 2 3 5 6 0 7 7
## 671 jamesle01 39.20 8 17 0 1 4 4 1 8 9
## 958 jamesle01 30.18 10 17 2 3 1 3 3 8 11
## 1266 jamesle01 29.98 7 12 2 3 4 4 4 8 12
## 1431 jamesle01 35.67 10 17 0 2 1 3 2 9 11
## assists steals blocks turnovers fouls points plusminus
## 54 3 2 0 0 2 26 12
## 440 5 0 3 5 2 23 -21
## 671 11 0 2 0 2 20 7
## 958 1 0 0 3 2 23 25
## 1266 8 1 0 2 1 20 18
## 1431 9 1 0 2 2 21 0

使用data.table这就会很简单

1
head(dtb[player=='LeBron James'])
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
## date game_id team_home_abbr team player player_id
## 1: 2012-10-30 201210300MIA MIA MIA LeBron James jamesle01
## 2: 2012-11-02 201211020NYK NYK MIA LeBron James jamesle01
## 3: 2012-11-03 201211030MIA MIA MIA LeBron James jamesle01
## 4: 2012-11-05 201211050MIA MIA MIA LeBron James jamesle01
## 5: 2012-11-07 201211070MIA MIA MIA LeBron James jamesle01
## 6: 2012-11-09 201211090ATL ATL MIA LeBron James jamesle01
## minutes fgm fga 3pm 3pa ftm fta offreb defreb totreb assists steals
## 1: 28.87 10 16 2 4 4 5 1 9 10 3 2
## 2: 36.68 8 16 2 3 5 6 0 7 7 5 0
## 3: 39.20 8 17 0 1 4 4 1 8 9 11 0
## 4: 30.18 10 17 2 3 1 3 3 8 11 1 0
## 5: 29.98 7 12 2 3 4 4 4 8 12 8 1
## 6: 35.67 10 17 0 2 1 3 2 9 11 9 1
## blocks turnovers fouls points plusminus
## 1: 0 0 2 26 12
## 2: 3 5 2 23 -21
## 3: 2 0 2 20 7
## 4: 0 3 2 23 25
## 5: 0 2 1 20 18
## 6: 0 2 2 21 0

选择所有LALBOS的行:

1
2
3
temp <- dtb[team %in% c("LAL","BOS")] #调出湖人和凯尔特人的数据
count<-temp[, .(`出场次数`=.N), by=player] # 为每名球员计数
head(count[order(-`出场次数`)],10) #排序
1
2
3
4
5
6
7
8
9
10
11
## player 出场次数
## 1: Kobe Bryant 779
## 2: Paul Pierce 764
## 3: Lamar Odom 608
## 4: Rajon Rondo 567
## 5: Kendrick Perkins 512
## 6: Luke Walton 492
## 7: Sasha Vujacic 485
## 8: Kevin Garnett 480
## 9: Andrew Bynum 466
## 10: Pau Gasol 462

选择所有得分(points)大于50的数据:

1
2
3
temp<-dtb[points>=50]
count <- temp[, .(`50+次数`=.N),by=player]
count[order(-`50+次数`)]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
## player 50+次数
## 1: Kobe Bryant 20
## 2: LeBron James 9
## 3: Allen Iverson 5
## 4: Carmelo Anthony 3
## 5: Dwyane Wade 3
## 6: Gilbert Arenas 3
## 7: Dirk Nowitzki 3
## 8: Kevin Durant 2
## 9: Jamal Crawford 2
## 10: Ray Allen 2
## 11: Michael Redd 2
## 12: Stephen Curry 1
## 13: Deron Williams 1
## 14: Kevin Love 1
## 15: Brandon Jennings 1
## 16: Andre Miller 1
## 17: Tony Parker 1
## 18: Brandon Roy 1
## 19: Kevin Martin 1
## 20: Richard Hamilton 1
## 21: Vince Carter 1
## 22: Paul Pierce 1
## 23: Amar'e Stoudemire 1
## 24: Jermaine O'Neal 1
## 25: Damon Stoudamire 1
## player 50+次数

操作data.table

添加

data.frame way

1
dtb1 <- data.table(letter=LETTERS, dtb)
1
2
## Warning in data.table(letter = LETTERS, dtb): Item 1 is of size 26 but
## maximum size is 236203 (recycled leaving remainder of 19 items)
1
dtb1$num <- 1:dim(dtb1)[1]

:=

:=可以为data.table添加项

1
2
dtb2 <-copy(dtb)
dtb2[, c("letter", "num"):=.(LETTERS, 1:.N)]
1
2
3
## Warning in `[.data.table`(dtb2, , `:=`(c("letter", "num"), .(LETTERS,
## 1:.N))): Supplied 26 items to be assigned to 236203 items of column
## 'letter' (recycled leaving remainder of 19 items).

选择

1
dtb[1:10, .(player, assists)]
1
2
3
4
5
6
7
8
9
10
11
## player assists
## 1: A.J. Price 6
## 2: Emeka Okafor 0
## 3: Trevor Ariza 4
## 4: Bradley Beal 3
## 5: Trevor Booker 1
## 6: Jordan Crawford 5
## 7: Martell Webster 1
## 8: Jan Vesely 1
## 9: Chris Singleton 2
## 10: Jannero Pargo 3
1
dtb[1:10]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
## date game_id team_home_abbr team player player_id
## 1: 2012-10-30 201210300CLE CLE WAS A.J. Price priceaj01
## 2: 2012-10-30 201210300CLE CLE WAS Emeka Okafor okafoem01
## 3: 2012-10-30 201210300CLE CLE WAS Trevor Ariza arizatr01
## 4: 2012-10-30 201210300CLE CLE WAS Bradley Beal bealbr01
## 5: 2012-10-30 201210300CLE CLE WAS Trevor Booker booketr01
## 6: 2012-10-30 201210300CLE CLE WAS Jordan Crawford crawfjo02
## 7: 2012-10-30 201210300CLE CLE WAS Martell Webster webstma02
## 8: 2012-10-30 201210300CLE CLE WAS Jan Vesely veselja01
## 9: 2012-10-30 201210300CLE CLE WAS Chris Singleton singlch01
## 10: 2012-10-30 201210300CLE CLE WAS Jannero Pargo pargoja01
## minutes fgm fga 3pm 3pa ftm fta offreb defreb totreb assists steals
## 1: 29.40 2 13 2 9 1 1 1 1 2 6 0
## 2: 24.58 4 10 0 0 2 4 5 2 7 0 0
## 3: 24.58 3 8 2 4 1 2 1 2 3 4 3
## 4: 21.55 2 8 2 4 2 2 0 3 3 3 1
## 5: 16.75 2 9 0 1 0 0 1 0 1 1 1
## 6: 28.70 4 13 0 6 3 4 1 2 3 5 1
## 7: 23.42 4 6 1 3 0 0 0 3 3 1 2
## 8: 21.23 3 4 0 0 1 4 3 1 4 1 0
## 9: 17.18 2 7 0 1 0 0 2 2 4 2 2
## 10: 16.35 2 6 1 4 2 2 0 1 1 3 0
## blocks turnovers fouls points plusminus
## 1: 0 1 1 7 -11
## 2: 4 1 1 10 -5
## 3: 2 0 0 9 -9
## 4: 0 2 1 8 -16
## 5: 1 4 4 4 -15
## 6: 1 1 1 11 0
## 7: 0 0 1 9 -1
## 8: 1 1 3 7 -2
## 9: 0 1 2 4 1
## 10: 0 1 3 7 7
1
dtb[1:10, .(player, asspts =c(points, assists))]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
## player asspts
## 1: A.J. Price 7
## 2: Emeka Okafor 10
## 3: Trevor Ariza 9
## 4: Bradley Beal 8
## 5: Trevor Booker 4
## 6: Jordan Crawford 11
## 7: Martell Webster 9
## 8: Jan Vesely 7
## 9: Chris Singleton 4
## 10: Jannero Pargo 7
## 11: A.J. Price 6
## 12: Emeka Okafor 0
## 13: Trevor Ariza 4
## 14: Bradley Beal 3
## 15: Trevor Booker 1
## 16: Jordan Crawford 5
## 17: Martell Webster 1
## 18: Jan Vesely 1
## 19: Chris Singleton 2
## 20: Jannero Pargo 3

删除

DT[, c('var1name','var2name',...)]:=NULL

1
2
3
4
5
6
7
dtb2[, c('letter', 'num'):=NULL]
# OR
temp <- copy(dtb)
temp[, (1:2):=NULL]
varnames <- c('player','points')
temp[, (varnames) :=NULL]
names(temp)
1
2
3
4
5
## [1] "team_home_abbr" "team" "player_id" "minutes"
## [5] "fgm" "fga" "3pm" "3pa"
## [9] "ftm" "fta" "offreb" "defreb"
## [13] "totreb" "assists" "steals" "blocks"
## [17] "turnovers" "fouls" "plusminus"
1
2
temp[,grep('^team', names(temp)):=NULL]
names(temp)
1
2
3
4
## [1] "player_id" "minutes" "fgm" "fga" "3pm"
## [6] "3pa" "ftm" "fta" "offreb" "defreb"
## [11] "totreb" "assists" "steals" "blocks" "turnovers"
## [16] "fouls" "plusminus"

更改

1
2
temp <- copy(dtb)
temp[, c("firstName", "givenName"):=tstrsplit(player, " ")] #tstrsplit将player分成名和姓氏
1
2
3
## Warning in `[.data.table`(temp, , `:=`(c("firstName", "givenName"),
## tstrsplit(player, : Supplied 2 columns to be assigned a list (length 4) of
## values (2 unused)
1
tail(temp)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
## date game_id team_home_abbr team player player_id
## 1: 2005-06-23 200506230SAS SAS SAS Tony Parker parketo01
## 2: 2005-06-23 200506230SAS SAS SAS Manu Ginobili ginobma01
## 3: 2005-06-23 200506230SAS SAS SAS Nazr Mohammed mohamna01
## 4: 2005-06-23 200506230SAS SAS SAS Robert Horry horryro01
## 5: 2005-06-23 200506230SAS SAS SAS Brent Barry barrybr01
## 6: 2005-06-23 200506230SAS SAS SAS Devin Brown brownde02
## minutes fgm fga 3pm 3pa ftm fta offreb defreb totreb assists steals
## 1: 38 3 11 1 3 1 2 0 2 2 3 0
## 2: 35 8 13 2 2 5 5 0 5 5 4 1
## 3: 22 0 3 0 0 0 0 1 6 7 0 0
## 4: 32 4 7 2 4 5 6 2 3 5 1 1
## 5: 29 2 3 1 1 0 0 0 4 4 2 2
## 6: 1 0 0 0 0 0 0 0 0 0 0 0
## blocks turnovers fouls points plusminus firstName givenName
## 1: 0 1 4 8 -1 Tony Parker
## 2: 0 3 3 23 5 Manu Ginobili
## 3: 2 0 3 0 -4 Nazr Mohammed
## 4: 1 1 1 15 17 Robert Horry
## 5: 1 2 2 5 14 Brent Barry
## 6: 0 0 0 0 0 Devin Brown
1
2
3
temp <- copy(dtb)
temp[, `:=`(points=points+1, turnovers=turnovers-1)]
head(temp)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
## date game_id team_home_abbr team player player_id
## 1: 2012-10-30 201210300CLE CLE WAS A.J. Price priceaj01
## 2: 2012-10-30 201210300CLE CLE WAS Emeka Okafor okafoem01
## 3: 2012-10-30 201210300CLE CLE WAS Trevor Ariza arizatr01
## 4: 2012-10-30 201210300CLE CLE WAS Bradley Beal bealbr01
## 5: 2012-10-30 201210300CLE CLE WAS Trevor Booker booketr01
## 6: 2012-10-30 201210300CLE CLE WAS Jordan Crawford crawfjo02
## minutes fgm fga 3pm 3pa ftm fta offreb defreb totreb assists steals
## 1: 29.40 2 13 2 9 1 1 1 1 2 6 0
## 2: 24.58 4 10 0 0 2 4 5 2 7 0 0
## 3: 24.58 3 8 2 4 1 2 1 2 3 4 3
## 4: 21.55 2 8 2 4 2 2 0 3 3 3 1
## 5: 16.75 2 9 0 1 0 0 1 0 1 1 1
## 6: 28.70 4 13 0 6 3 4 1 2 3 5 1
## blocks turnovers fouls points plusminus
## 1: 0 0 1 8 -11
## 2: 4 0 1 11 -5
## 3: 2 -1 0 10 -9
## 4: 0 1 1 9 -16
## 5: 1 3 4 5 -15
## 6: 1 0 1 12 0

Summarize

1
library(lubridate)
1
2
3
4
5
6
7
8
9
10
##
## Attaching package: 'lubridate'
##
## The following object is masked from 'package:plyr':
##
## here
##
## The following objects are masked from 'package:data.table':
##
## hour, mday, month, quarter, wday, week, yday, year
1
2
3
temp<- copy(dtb)
temp[, year:=year(date)] # 选取2012的数据
temp2012 <- temp[year==2012]

####计算助攻失误比这个数据

1
2
3
ass2turn<- temp2012[, .(ass.sum=sum(assists),turn.sum=sum(turnovers), ass2turn=sum(assists)/sum(turnovers)), by=player]
ass2turn <- ass2turn[!grep("Inf", ass2turn)]# 除去失误为0的数据
head(ass2turn[order(-ass2turn)],10) # 助攻/失误比前10名
1
2
3
4
5
6
7
8
9
10
11
## player ass.sum turn.sum ass2turn
## 1: Josh Childress 37 8 4.625000
## 2: Jose Calderon 680 157 4.331210
## 3: Kent Bazemore 4 1 4.000000
## 4: Tony Battie 16 4 4.000000
## 5: Mike Bibby 87 22 3.954545
## 6: Chris Paul 890 226 3.938053
## 7: Matt Bonner 78 22 3.545455
## 8: Julyan Stone 39 11 3.545455
## 9: Brandon Roy 23 7 3.285714
## 10: Rajon Rondo 1113 342 3.254386

by 函数

我们想获得所有的球员的年均数据

1
2
3
#dtb_3 <- dtb[, c(24, 5, 7:23), with = FALSE]
#stat.boxscore <- ddply(dtb_3, .variables = c("year", "player"), .fun = function(dtb_sub) { round(colMeans(dtb_sub[, -c(1, 2)]), 2)})
#head(stat.boxscore)
1
2
dtb[, year:=year(date)]
head(dtb[,.(Avg_pts=mean(points)),by=.(player,year)])
1
2
3
4
5
6
7
## player year Avg_pts
## 1: A.J. Price 2012 4.907692
## 2: Emeka Okafor 2012 8.711538
## 3: Trevor Ariza 2012 10.132075
## 4: Bradley Beal 2012 12.115385
## 5: Trevor Booker 2012 8.482143
## 6: Jordan Crawford 2012 15.102273
1
2
3
ass2turn.summary<-ass2turn[, .(count=.N), by=.(ass.turn=round(ass2turn,1))]
ass.order<- ass2turn.summary[order(-ass.turn)]
head(ass.order,10)
1
2
3
4
5
6
7
8
9
10
11
## ass.turn count
## 1: 4.6 1
## 2: 4.3 1
## 3: 4.0 3
## 4: 3.9 1
## 5: 3.5 2
## 6: 3.3 2
## 7: 3.2 1
## 8: 3.1 1
## 9: 3.0 5
## 10: 2.9 2
1
ggplot(ass.order, aes(x=ass.turn,y=count))+geom_histogram(stat="identity")
1
## Warning: Removed 1 rows containing missing values (position_stack).

1
2
#stat.boxscore <- data.table(stat.boxscore)
#stat.boxscore[order(-points)][, .(player=head(player,10), pts=head(points,10)), by = .(year)][order(year)]
1
2
dtm <- data.table(mtcars)
dtm[, .(cumulative_su = cumsum(wt)), by = .(cyl, gear)]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
## cyl gear cumulative_su
## 1: 6 4 2.620
## 2: 6 4 5.495
## 3: 6 4 8.935
## 4: 6 4 12.375
## 5: 4 4 2.320
## 6: 4 4 5.510
## 7: 4 4 8.660
## 8: 4 4 10.860
## 9: 4 4 12.475
## 10: 4 4 14.310
## 11: 4 4 16.245
## 12: 4 4 19.025
## 13: 6 3 3.215
## 14: 6 3 6.675
## 15: 8 3 3.440
## 16: 8 3 7.010
## 17: 8 3 11.080
## 18: 8 3 14.810
## 19: 8 3 18.590
## 20: 8 3 23.840
## 21: 8 3 29.264
## 22: 8 3 34.609
## 23: 8 3 38.129
## 24: 8 3 41.564
## 25: 8 3 45.404
## 26: 8 3 49.249
## 27: 4 3 2.465
## 28: 4 5 2.140
## 29: 4 5 3.653
## 30: 8 5 3.170
## 31: 8 5 6.740
## 32: 6 5 2.770
## cyl gear cumulative_su
1
2
dtm[,model:=row.names(mtcars)]
dtm[, .(model, mpg), by = .(cyl, am)][order(-mpg)][, .(model = head(model, 2), mpg = head(mpg, 2)), by = .(cyl, am)]
1
2
3
4
5
6
7
8
9
10
11
12
13
## cyl am model mpg
## 1: 4 1 Toyota Corolla 33.9
## 2: 4 1 Fiat 128 32.4
## 3: 4 0 Merc 240D 24.4
## 4: 4 0 Merc 230 22.8
## 5: 6 0 Hornet 4 Drive 21.4
## 6: 6 0 Merc 280 19.2
## 7: 6 1 Mazda RX4 21.0
## 8: 6 1 Mazda RX4 Wag 21.0
## 9: 8 0 Pontiac Firebird 19.2
## 10: 8 0 Hornet Sportabout 18.7
## 11: 8 1 Ford Pantera L 15.8
## 12: 8 1 Maserati Bora 15.0

by可以链式(chaining)使用
2004-2013年每队赛季得分榜前两名

1
2
teamleader<- dtb[, .(player, points), by=.(year,team)][order(-points)][, .(player=head(player,2), points=head(points,2)), by =.(year, team)][order(year,team)]
head(teamleader)
1
2
3
4
5
6
7
## year team player points
## 1: 2004 ATL Antoine Walker 36
## 2: 2004 ATL Antoine Walker 31
## 3: 2004 BOS Paul Pierce 37
## 4: 2004 BOS Paul Pierce 35
## 5: 2004 BRK Richard Jefferson 42
## 6: 2004 BRK Richard Jefferson 38

.SD.SDcols

回忆前面求所有球员的年均数据,我们可以使用下面的方法实现

1
2
3
#data.table(dtb_3)[, lapply(.SD, function(x)round(mean(x),2)), by = .(year,player)][order(year)]
#OR
dtb[, lapply(.SD, function(x)round(mean(x),2)),.SDcols=7:23, by=.(year,player)]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
## year player minutes fgm fga 3pm 3pa ftm fta
## 1: 2012 A.J. Price 16.10 1.74 4.94 0.80 2.52 0.63 0.77
## 2: 2012 Emeka Okafor 26.02 3.79 7.65 0.00 0.00 1.13 2.04
## 3: 2012 Trevor Ariza 30.63 3.83 9.47 0.68 2.23 1.79 2.25
## 4: 2012 Bradley Beal 29.97 4.19 11.88 1.12 4.15 2.62 3.04
## 5: 2012 Trevor Booker 25.87 3.73 7.16 0.02 0.07 1.00 1.68
## ---
## 4985: 2005 Vlade Divac 13.29 1.43 3.00 0.00 0.00 1.14 1.71
## 4986: 2005 Glenn Robinson 12.27 2.27 5.55 0.23 0.73 1.59 1.82
## 4987: 2005 Omar Cook 14.80 2.00 4.80 0.00 0.80 0.60 1.20
## 4988: 2005 Ndudi Ebi 27.00 5.50 10.50 0.00 0.50 2.50 4.50
## 4989: 2005 Lawrence Funderburke 7.00 0.57 1.14 0.00 0.00 0.57 1.00
## offreb defreb totreb assists steals blocks turnovers fouls points
## 1: 0.35 1.35 1.71 2.63 0.46 0.05 0.88 0.83 4.91
## 2: 2.37 4.85 7.21 0.94 0.60 1.12 1.23 2.13 8.71
## 3: 1.02 4.06 5.08 3.02 1.75 0.57 1.77 1.60 10.13
## 4: 0.65 2.92 3.58 2.50 0.92 0.42 1.62 1.92 12.12
## 5: 2.41 4.18 6.59 0.93 1.04 0.86 1.25 2.77 8.48
## ---
## 4985: 1.71 1.57 3.29 2.14 0.29 0.00 1.43 2.71 4.00
## 4986: 0.27 1.77 2.05 0.41 0.32 0.45 0.59 1.77 6.36
## 4987: 0.40 1.00 1.40 4.40 1.20 0.20 1.20 1.20 4.60
## 4988: 3.00 5.00 8.00 0.50 0.50 0.50 1.50 3.50 13.50
## 4989: 0.14 0.71 0.86 0.14 0.00 0.00 0.00 1.43 1.71
## plusminus
## 1: -0.60
## 2: -4.71
## 3: -4.68
## 4: -7.04
## 5: -4.14
## ---
## 4985: -4.71
## 4986: 0.59
## 4987: 4.20
## 4988: 7.00
## 4989: 1.71

2004-2013年NBA得分榜

1
dtb[, lapply(.SD, function(x) round(mean(x),2)),.SDcols=22, by=.(year,player)][order(year, -points)][, .(rank =1:10, player=head(player,10),pts=head(points,10)), by=year]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
## year rank player pts
## 1: 2004 1 Allen Iverson 28.89
## 2: 2004 2 Kobe Bryant 28.41
## 3: 2004 3 Dirk Nowitzki 26.29
## 4: 2004 4 Amar'e Stoudemire 25.59
## 5: 2004 5 Metta World Peace 24.57
## 6: 2004 6 LeBron James 24.41
## 7: 2004 7 Tracy McGrady 23.96
## 8: 2004 8 Dwyane Wade 23.93
## 9: 2004 9 Ray Allen 23.82
## 10: 2004 10 Richard Jefferson 23.21
## 11: 2005 1 Allen Iverson 32.58
## 12: 2005 2 Kobe Bryant 29.40
## 13: 2005 3 LeBron James 29.33
## 14: 2005 4 Gilbert Arenas 27.18
## 15: 2005 5 Amar'e Stoudemire 27.08
## 16: 2005 6 Vince Carter 26.69
## 17: 2005 7 Tracy McGrady 26.40
## 18: 2005 8 Dirk Nowitzki 25.70
## 19: 2005 9 Dwyane Wade 25.54
## 20: 2005 10 Ray Allen 24.54
## 21: 2006 1 Kobe Bryant 33.52
## 22: 2006 2 Allen Iverson 31.65
## 23: 2006 3 LeBron James 30.33
## 24: 2006 4 Gilbert Arenas 30.00
## 25: 2006 5 Carmelo Anthony 28.30
## 26: 2006 6 Dwyane Wade 27.72
## 27: 2006 7 Paul Pierce 26.93
## 28: 2006 8 Michael Redd 26.51
## 29: 2006 9 Dirk Nowitzki 25.77
## 30: 2006 10 Vince Carter 25.08
## 31: 2007 1 Kobe Bryant 31.01
## 32: 2007 2 LeBron James 27.25
## 33: 2007 3 Carmelo Anthony 26.78
## 34: 2007 4 Gilbert Arenas 26.46
## 35: 2007 5 Dwyane Wade 25.87
## 36: 2007 6 Tracy McGrady 25.37
## 37: 2007 7 Allen Iverson 25.01
## 38: 2007 8 Michael Redd 24.48
## 39: 2007 9 Ray Allen 23.95
## 40: 2007 10 Vince Carter 23.72
## 41: 2008 1 LeBron James 29.35
## 42: 2008 2 Kobe Bryant 28.47
## 43: 2008 3 Dwyane Wade 26.65
## 44: 2008 4 Dirk Nowitzki 25.31
## 45: 2008 5 Amar'e Stoudemire 24.95
## 46: 2008 6 Carmelo Anthony 23.78
## 47: 2008 7 Chris Bosh 23.71
## 48: 2008 8 Allen Iverson 23.51
## 49: 2008 9 Danny Granger 22.87
## 50: 2008 10 Kevin Martin 22.83
## 51: 2009 1 LeBron James 29.77
## 52: 2009 2 Dwyane Wade 29.28
## 53: 2009 3 Kobe Bryant 28.75
## 54: 2009 4 Kevin Durant 27.41
## 55: 2009 5 Carmelo Anthony 26.97
## 56: 2009 6 Kevin Martin 26.09
## 57: 2009 7 Danny Granger 25.82
## 58: 2009 8 Dirk Nowitzki 25.72
## 59: 2009 9 Brandon Roy 22.90
## 60: 2009 10 Chris Bosh 22.73
## 61: 2010 1 Kevin Durant 29.76
## 62: 2010 2 LeBron James 27.89
## 63: 2010 3 Dwyane Wade 26.29
## 64: 2010 4 Carmelo Anthony 26.00
## 65: 2010 5 Kobe Bryant 25.88
## 66: 2010 6 Monta Ellis 25.45
## 67: 2010 7 Amar'e Stoudemire 24.96
## 68: 2010 8 Dirk Nowitzki 24.84
## 69: 2010 9 Derrick Rose 23.24
## 70: 2010 10 Danny Granger 22.84
## 71: 2011 1 Kevin Durant 27.67
## 72: 2011 2 LeBron James 27.29
## 73: 2011 3 Carmelo Anthony 26.20
## 74: 2011 4 Derrick Rose 25.73
## 75: 2011 5 Dwyane Wade 25.36
## 76: 2011 6 Kobe Bryant 24.97
## 77: 2011 7 Dirk Nowitzki 23.93
## 78: 2011 8 Dwight Howard 23.84
## 79: 2011 9 Blake Griffin 23.48
## 80: 2011 10 Amar'e Stoudemire 23.32
## 81: 2012 1 Kobe Bryant 28.96
## 82: 2012 2 Kevin Durant 28.26
## 83: 2012 3 LeBron James 27.36
## 84: 2012 4 Carmelo Anthony 24.60
## 85: 2012 5 Kevin Love 24.40
## 86: 2012 6 Russell Westbrook 23.29
## 87: 2012 7 Derrick Rose 22.00
## 88: 2012 8 Dwyane Wade 21.90
## 89: 2012 9 LaMarcus Aldridge 21.51
## 90: 2012 10 Dirk Nowitzki 21.08
## 91: 2013 1 Carmelo Anthony 28.76
## 92: 2013 2 Kevin Durant 28.47
## 93: 2013 3 LeBron James 26.71
## 94: 2013 4 James Harden 25.89
## 95: 2013 5 Kobe Bryant 25.62
## 96: 2013 6 Stephen Curry 24.51
## 97: 2013 7 Russell Westbrook 24.17
## 98: 2013 8 Kyrie Irving 22.32
## 99: 2013 9 LaMarcus Aldridge 21.06
## 100: 2013 10 Tony Parker 21.00
## year rank player pts

需要注意的事情

赋值

如果我们要新建某个data.table的备份,我们使用copy(),而不能简单的使用<-,因为后者可能会造成联动,你改变新的变量的时候,前者也会发生变化

1
2
3
4
5
DT <- as.data.table(mtcars[1:5, 1:5])
DT1 <- DT
# Check DT1
DT1
1
2
3
4
5
6
## mpg cyl disp hp drat
## 1: 21.0 6 160 110 3.90
## 2: 21.0 6 160 110 3.90
## 3: 22.8 4 108 93 3.85
## 4: 21.4 6 258 110 3.08
## 5: 18.7 8 360 175 3.15
1
2
3
DT1 <- DT[, model:=rownames(mtcars)[1:5]]
# Check DT
DT

=:=的区别

两者都会创建新的列,前者新建这个列,不改变原来的data.table,后者会在原来的data.table添加列。