一、Scipy概述

SciPy基于NumPy提供了更为丰富和高级的功能扩展,在统计、优化、插值、数值积分、时频转换等方面提供了大量的可用函数,基本覆盖了基础科学计算相关的问题。

1
2
3
import numpy as np
import scipy.stats as stats
import scipy.optimize as opt

二、统计部分

2.1 生成随机数

  • rv_continuous.rvs(size=n): 其中rv_continuous表示连续型的随机分布,如均匀分布(uniform)、正态分布(norm)、贝塔分布(beta)等
  • rv_discrete.rvs(size=n): rv_discrete表示离散型的随机分布,如伯努利分布(bernoulli)、几何分布(geom)、泊松分布(poisson)等。

见:http://docs.scipy.org/doc/scipy/reference/stats.html

1
2
r_unif = stats.uniform.rvs( size = 10)
print(r_unif)
[ 0.2924269   0.29822465  0.48178216  0.22004046  0.90243864  0.37951304
  0.63613836  0.64975187  0.40374609  0.31864229]
1
2
rv_beta = stats.beta.rvs(size = 10, a = 4, b =2)
print(rv_beta)
[ 0.7330703   0.79941641  0.84296951  0.96649249  0.58666597  0.90552661
  0.81672749  0.62606183  0.53123816  0.33406139]
1
2
beta = stats.beta(a = 4, b = 2)
beta.rvs(size = 10)
array([ 0.54825343,  0.48858131,  0.78255051,  0.66667343,  0.59489605,
        0.74490324,  0.72473494,  0.54107229,  0.84224194,  0.64791601])

2.2 假设检验

1
2
3
4
norm_dist = stats.norm(loc = 0.5, scale = 2) # loc mean, scale
n = 200
dat = norm_dist.rvs(size = n)
np.mean(dat)
0.40692986915948354
1
np.median(dat)
0.36653201314410355
1
np.std(dat)
1.8750494924213752

使用K-S检验( Kolmogorov-Smirnov test)检验正态分布性,使用kstest

1
2
3
mu = np.mean(dat)
sigma = np.std(dat)
stat_val, p_val = stats.kstest(dat, 'norm', (mu, sigma))
1
print('KS-statistic D = %6.3f p-value = %6.4f' % (stat_val, p_val))
KS-statistic D =  0.047 p-value = 0.7663

t-test检验均值是否为0

1
2
stat_val, p_val = stats.ttest_1samp(dat, 0)
print('One-sample t-statistic D = %6.3f p-value = %6.4f' % (stat_val, p_val))
One-sample t-statistic D =  3.061 p-value = 0.0025

拒绝原假设: 数据的均值为0

双样本t-test(ttest_ind)

1
2
3
4
norm_dist2 = stats.norm(loc=-0.2, scale=1.2)
dat2 = norm_dist2.rvs(size=n/2)
stat_val, p_val = stats.ttest_ind(dat, dat2, equal_var=False)
print('Two-sample t-statistic D = %6.3f, p-value = %6.4f' % (stat_val, p_val))
Two-sample t-statistic D =  3.151, p-value = 0.0018


/Users/shihchosen/anaconda/lib/python3.5/site-packages/scipy/stats/_continuous_distns.py:128: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  return self._random_state.standard_normal(self._size)
/Users/shihchosen/anaconda/lib/python3.5/site-packages/numpy/core/fromnumeric.py:225: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  return reshape(newshape, order=order)

2.3 其他函数

  • cdf
1
2
g_dist = stats.gamma(a = 2)
print(g_dist.cdf([2,4,5]))
[ 0.59399415  0.90842181  0.95957232]
  • pdf
1
print(g_dist.pdf([0.25,.5,.95]))
[ 0.1947002   0.30326533  0.36740397]

对于一个给定的分布,可以用moment很方便的查看分布的矩信息,例如我们查看N(0,1)的六阶原点矩

1
stats.norm.moment(6, loc = 0, scale = 1)
15.000000000895332

describe函数提供对数据集的统计描述分析,包括数据样本大小,极值,均值,方差,偏度和峰度

1
2
3
norm_dist = stats.norm(loc=0, scale=1.8)
dat = norm_dist.rvs(size = 100)
info = stats.describe(dat)
1
2
3
4
5
6
7
print("Data size is: " + str(info[0]))
print("Minimum value is: " + str(info[1][0]))
print("Maximum value is: " + str(info[1][1]))
print("Arithmetic mean is: " + str(info[2]))
print("Unbiased variance is: " + str(info[3]))
print("Biased skewness is: " + str(info[4]))
print("Biased kurtosis is: " + str(info[5]))
Data size is: 100
Minimum value is: -4.28963191875
Maximum value is: 3.39969135101
Arithmetic mean is: -0.0583104535814
Unbiased variance is: 2.68846427562
Biased skewness is: 0.04732288942546752
Biased kurtosis is: -0.379124958678446

MLE

1
2
3
4
5
norm_dist = stats.norm(loc=0, scale=1.8)
dat = norm_dist.rvs(size=100)
mu, sigma = stats.norm.fit(dat)
print("MLE of data mean:" + str(mu))
print("MLE of data standard deviation:" + str(sigma))
MLE of data mean:-0.0461510505165
MLE of data standard deviation:1.72294901553
1
2
3
4
5
6
7
8
norm_dist = stats.norm()
dat1 = norm_dist.rvs(size=100)
exp_dist = stats.expon()
dat2 = exp_dist.rvs(size=100)
cor, pval = stats.pearsonr(dat1, dat2)
print("Pearson correlation coefficient: " + str(cor))
cor, pval = stats.pearsonr(dat1, dat2)
print("Spearman's rank correlation coefficient: " + str(cor))
Pearson correlation coefficient: -0.0299016363187
Spearman's rank correlation coefficient: -0.0299016363187

线性回归

1
2
3
4
5
6
x = stats.chi2.rvs(3, size=50)
y = 2.5 + 1.2 * x + stats.norm.rvs(size=50, loc=0, scale=1.5)
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("Slope of fitted model is:" , slope)
print ("Intercept of fitted model is:", intercept)
print("R-squared:", r_value**2)
Slope of fitted model is: 1.15543693896
Intercept of fitted model is: 2.75126872589
R-squared: 0.770651967105
1
stats.linregress(x, y)
LinregressResult(slope=1.1554369389622545, intercept=2.7512687258945392, rvalue=0.87786785287120206, pvalue=5.8204930077311082e-17, stderr=0.090979593726512847)

三、优化部分

3.1 无约束优化问题

下面的就是无约束优化,
$$
\mathrm{min}\quad f(x) = x^2 - 4.8x + 1.2
$$

下面的是约束优化:
$$
\mathrm{min}\quad f(x) = x^2 - 4.8x + 1.2\
s.t.\quad x \geq 0
$$

考虑Rosenbrock函数:
$$
f(\mathrm{x}) = \sum_{i =1}^{N-1} 100(x_i - xi^2)^2 + (1 - x{i-1})^2
$$

1
2
3
def rosen(x):
"""The Rosenbrock function"""
return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

3.1.1 Nelder-Mead单纯形法

1
x_0 = np.array([0.5, 1.6, 1.1, 0.8, 1.2])
1
2
res = opt.minimize(rosen, x_0, method='nelder-mead', options={'xtol': 1e-8, 'disp': True})
print(res)
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 436
         Function evaluations: 706
 final_simplex: (array([[ 1.        ,  1.        ,  1.        ,  1.        ,  1.        ],
       [ 1.        ,  1.        ,  1.        ,  1.        ,  1.        ],
       [ 1.        ,  1.        ,  1.        ,  1.        ,  0.99999999],
       [ 1.        ,  1.        ,  1.        ,  1.        ,  1.        ],
       [ 1.        ,  1.        ,  1.        ,  1.        ,  1.00000001],
       [ 1.        ,  1.        ,  1.        ,  1.        ,  1.00000001]]), array([  1.66149699e-17,   6.32117429e-17,   7.44105349e-17,
         8.24396866e-17,   9.53208876e-17,   1.07882961e-16]))
           fun: 1.6614969876635003e-17
       message: 'Optimization terminated successfully.'
          nfev: 706
           nit: 436
        status: 0
       success: True
             x: array([ 1.,  1.,  1.,  1.,  1.])
1
2
res = opt.minimize(rosen, x_0, method='powell', options={'xtol': 1e-8, 'disp': True})
print(res)
Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 24
         Function evaluations: 1924
   direc: array([[  4.55396239e-04,   1.36370090e-03,   2.24582768e-03,
          4.12213272e-03,   7.99477493e-03],
       [ -1.91529419e-03,  -3.00633609e-03,  -6.77652553e-03,
         -1.34919171e-02,  -2.67196797e-02],
       [ -3.76393598e-02,  -2.30587398e-02,   1.05038412e-02,
          3.43628871e-05,   7.36675942e-05],
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          1.00000000e+00,   0.00000000e+00],
       [  4.00791294e-06,   1.15417958e-05,   2.01270425e-05,
          5.08729704e-05,   1.09096801e-04]])
     fun: 2.041828615178145e-21
 message: 'Optimization terminated successfully.'
    nfev: 1924
     nit: 24
  status: 0
 success: True
       x: array([ 1.,  1.,  1.,  1.,  1.])