Mars: A Parallel and Distributed Accelerator for NumPy and pandas. Xuye Qin (秦续业)
2020-03-01
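- 1. Mars: A Parallel and Distributed Accelerator for NumPy and pandas. Xuye Qin (秦续业)
- 2. Contents: Mars in 30 seconds • Background and motivation • What Mars can do and how it does it • Performance and outlook
- 3. Part 1: Mars in 30 seconds • From NumPy to Mars tensor • From pandas to Mars DataFrame • From scikit-learn to Mars learn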
- 4. From NumPy to Mars tensor

  NumPy version (run time: 11.9 s; peak memory: 5479.47):

      import numpy as np
      from scipy.special import erf


      def black_scholes(P, S, T, rate, vol):
          # P: current price, S: strike price, T: time to expiry
          a = np.log(P / S)
          b = T * -rate
          z = T * (vol * vol * 2)
          c = 0.25 * z
          y = 1.0 / np.sqrt(z)
          w1 = (a - b + c) * y
          w2 = (a - b - c) * y
          d1 = 0.5 + 0.5 * erf(w1)
          d2 = 0.5 + 0.5 * erf(w2)
          Se = np.exp(b) * S
          call = P * d1 - Se * d2
          put = call - P + Se
          return call, put


      N = 50000000
      price = np.random.uniform(10.0, 50.0, N)
      strike = np.random.uniform(10.0, 50.0, N)
      t = np.random.uniform(1.0, 2.0, N)
      print(black_scholes(price, strike, t, 0.1, 0.2))

  Mars tensor version (run time: 5.48 s; peak memory: 1647.85):

      import mars.tensor as mt
      from mars.tensor.special import erf


      def black_scholes(P, S, T, rate, vol):
          # Same formula as above, with np replaced by mt
          a = mt.log(P / S)
          b = T * -rate
          z = T * (vol * vol * 2)
          c = 0.25 * z
          y = 1.0 / mt.sqrt(z)
          w1 = (a - b + c) * y
          w2 = (a - b - c) * y
          d1 = 0.5 + 0.5 * erf(w1)
          d2 = 0.5 + 0.5 * erf(w2)
          Se = mt.exp(b) * S
          call = P * d1 - Se * d2
          put = call - P + Se
          return call, put


      N = 50000000
      price = mt.random.uniform(10.0, 50.0, N)
      strike = mt.random.uniform(10.0, 50.0, N)
      t = mt.random.uniform(1.0, 2.0, N)
      # Nothing is computed until execute() is called on the result tuple
      print(mt.ExecutableTuple(black_scholes(price, strike, t, 0.1, 0.2)).execute())
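The Mars version above only produces numbers once `.execute()` is called, which suggests the expression is recorded first and evaluated later. The following is a toy, pure-Python sketch of that deferred-evaluation pattern. The `Lazy` class and `constant` helper are hypothetical names invented for this illustration; they are not part of Mars and say nothing about how Mars represents or schedules its computations.

```python
class Lazy:
    """Toy deferred value: records an operation instead of running it.

    Hypothetical illustration only; this is not a Mars class.
    """

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __add__(self, other):
        return Lazy(lambda x, y: x + y, self, other)

    def __mul__(self, other):
        return Lazy(lambda x, y: x * y, self, other)

    def execute(self):
        # Evaluate any deferred arguments first, then apply the recorded operation.
        values = [a.execute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*values)


def constant(value):
    """Wrap a plain value so it can take part in deferred expressions."""
    return Lazy(lambda: value)


# Building the expression performs no arithmetic yet.
expr = constant(2.0) * 3.0 + constant(1.0)

# The arithmetic happens only here.
result = expr.execute()
print(result)
```

Deferring the work in this way gives a framework the chance to look at the whole expression before running it, for example to split it into pieces or to avoid materializing intermediates.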
- 5. From pandas to Mars DataFrame

  pandas version (run time: 18.7 s; peak memory: 3430.29):

      import numpy as np
      import pandas as pd

      # 100 million rows, 4 columns of uniform random values
      df = pd.DataFrame(np.random.rand(100000000, 4),
                        columns=list('abcd'))
      print(df.sum())

  Mars DataFrame version (run time: 5.25 s; peak memory: 2007.92):

      import mars.tensor as mt
      import mars.dataframe as md

      # Same data shape, generated as a Mars tensor
      df = md.DataFrame(mt.random.rand(100000000, 4),
                        columns=list('abcd'))
      print(df.sum().execute())
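A column sum such as `df.sum()` is a natural fit for parallel execution, because partial sums over row chunks can be computed independently and then added together. The sketch below shows that split-and-combine idea with plain pandas and a thread pool. It is an assumption-laden illustration: `chunked_sum` and `n_chunks` are hypothetical names, the data size is reduced for a quick run, and this is not presented as how Mars computes the result.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def chunked_sum(df, n_chunks=4):
    """Sum each column by summing row chunks separately, then combining.

    Hypothetical helper for illustration; not a Mars or pandas API.
    """
    # Split the row positions into contiguous, roughly equal chunks.
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

    # Map step: one partial column sum per chunk.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(lambda chunk: chunk.sum(), chunks))

    # Reduce step: add the partial sums column by column.
    return pd.concat(partials, axis=1).sum(axis=1)


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100_000, 4)), columns=list("abcd"))
result = chunked_sum(df)
print(np.allclose(result.to_numpy(), df.sum().to_numpy()))
```

Threads are used here only to keep the example short; the same map and reduce steps could run in separate processes or on separate machines.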
- 6. From scikit-learn to Mars learn

  scikit-learn version (run time: 19.1 s; peak memory: 7314.82):

      from sklearn.datasets import make_blobs
      from sklearn.decomposition import PCA

      # 100 million 3-D samples drawn around four cluster centers
      X, y = make_blobs(n_samples=100000000, n_features=3,
                        centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                        cluster_std=[0.2, 0.1, 0.2, 0.2],
                        random_state=9)
      pca = PCA(n_components=3)
      pca.fit(X)
      print(pca.explained_variance_ratio_)
      print(pca.explained_variance_)

  Mars learn version (run time: 12.8 s; peak memory: 3814.32):

      from sklearn.datasets import make_blobs
      from mars.learn.decomposition import PCA

      # Data generation is unchanged; only the PCA import differs
      X, y = make_blobs(n_samples=100000000, n_features=3,
                        centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                        cluster_std=[0.2, 0.1, 0.2, 0.2],
                        random_state=9)
      pca = PCA(n_components=3)
      pca.fit(X)
      print(pca.explained_variance_ratio_.execute())
      print(pca.explained_variance_.execute())
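- 7. Part 2: Background and motivation • Python keeps growing in popularity • AI is booming, and data processing is often the bottleneck in the machine learning lifecycle • The importance of NumPy and pandas • Habit is productivity: no new API to learn • Current problems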
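- 9. The machine learning lifecycle (diagram). Stages: Data • Data processing / data analysis (often takes up 80% of the time) • Feature engineering / model training • Trained model • Model deployment / maintenance / improvement. Inputs and outputs: { new data }, { predictions }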
- 10. Google Trends (worldwide)
- 11. The ever-growing data science technology stack
- 12. NumPy
  • ndarray: a multidimensional array
  • Fast mathematical functions that operate on entire arrays (no loops needed)
  • Tools for reading and writing data on disk and for working with memory-mapped files
  • Linear algebra, random number generation, and Fourier transforms
  • The foundation of pandas, SciPy, scikit-learn, TensorFlow, and PyTorch

      In [78]: %%time
          ...: from math import sqrt
          ...: # `np` and `a` come from earlier cells; the loops index
          ...: # a[0..999, 0..9], so `a` needs at least that shape
          ...: b = np.empty((1000, 1000))
          ...: for i in range(1000):
          ...:     for j in range(1000):
          ...:         acc = 0.0
          ...:         for k in range(10):
          ...:             acc += (a[i, k] - a[j, k]) ** 2
          ...:         b[i, j] = sqrt(acc)
          ...: print(b)
          ...:
      [[0.         1.34092722 1.03167424 ... 1.28462032 1.10915735 1.30688143]
       [1.34092722 0.         1.54184418 ... 1.36835935 1.46841795 1.50831086]
       [1.03167424 1.54184418 0.         ... 1.18083129 0.86086453 1.16513362]
       ...
       [1.28462032 1.36835935 1.18083129 ... 0.         0.93105133 1.23865215]
       [1.10915735 1.46841795 0.86086453 ... 0.93105133 0.         1.19356203]
       [1.30688143 1.50831086 1.16513362 ... 1.23865215 1.19356203 0.        ]]
      CPU times:
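The triple loop above computes pairwise Euclidean distances between rows one element at a time. As a sketch of the "no loops needed" point, the same quantity can be written with NumPy broadcasting. The function names and the small random input are assumptions made for this example (the slide does not show how `a` was created), and the loop version is repeated only to check that both approaches agree.

```python
import numpy as np


def pairwise_distances_loop(a):
    """Reference version: explicit Python loops, in the style of the slide."""
    n, d = a.shape
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(d):
                acc += (a[i, k] - a[j, k]) ** 2
            out[i, j] = np.sqrt(acc)
    return out


def pairwise_distances_vectorized(a):
    """Same distances via broadcasting, with no Python-level loops."""
    # diff[i, j, k] = a[i, k] - a[j, k], shape (n, n, d)
    diff = a[:, np.newaxis, :] - a[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


rng = np.random.default_rng(0)
a_small = rng.random((50, 10))  # hypothetical small input for a quick check

loop_result = pairwise_distances_loop(a_small)
vec_result = pairwise_distances_vectorized(a_small)
print(np.allclose(loop_result, vec_result))
```

The broadcast version allocates an intermediate array of shape `(n, n, d)`, so it trades memory for speed; for large `n` that intermediate can become the limiting factor.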