清华大学计算机科学与技术系章明星，高品，艾智远，陈康——图计算优化技术探索

2020-02-27 59浏览

1.图计算优化技术探索章明星，高品，艾智远，陈康清华大学计算机科学与技术系 2017年5月19日
2.内容组织图计算优化技术探索 • 图计算简要介绍 • 使用体系结构局部性加速图计算（IEEE Transactions on Computers） • 图的三维划分加速计算(USENIX OSDI 2016) • 外存图计算的加速方法(USENIX ATC 2017)
3.图计算的简要介绍图数据的广泛存在Google:> 1 trillion indexed pages Web Graph 31 billion RDF triples in 2011 Information NetworkFacebook:> 800 million active users Social Graph DeBruijn:4k nodes (k = 20, … , 40) Biological Network
4.图计算的简要介绍图数据规模的迅速增长 100M(108) Social Scale 100B (1011) Web Scale 1T (1012) Brain Scale, 100T (1014) US Road Internet Web graph (Google) Human Connectome, The Human Connectome Project, NIHAcknowledgement:Y. Wu, WSU
5.图计算的简要介绍从数据并行到图并行 6. Before 数据并行 Map Reduce Feature Extraction Cross Validation Computing Sufficient Statistics 图并行 7. After Graphical Models Gibbs Sampling Belief Propagation Variational Opt. Collaborative Filtering Tensor Factorization 8. After Semi-Supervised Learning Label Propagation CoEM Data-Mining PageRank Triangle Counting
6.图计算的简要介绍图计算的特点 • 四大特点 • 高访存计算比 • 数据局部性不好 • 结构不规则 • 受数据驱动
7.图计算的简要介绍图计算的特点 • 四大特点 • 高访存计算比 • 数据局部性不好 • 结构不规则 • 受数据驱动
8.图计算的简要介绍一个典型的图计算PageRankAcknowledgement:I. Mele, Web Information Retrieval
9.图计算的简要介绍图的顶点排序：PageRank åV1 V3 PRk +1 (u) = vÎBu PRk (v) Fv PR(u): Page Rank of node uFu:Out-neighbors of node u V2 V4Bu:In-neighbors of node u Sergey Brin, Lawrence Page, “The Anatomy of Large-Scale Hypertextual Web Search Engine”, WWW ‘98 9
10.图计算的简要介绍 PageRank的迭代计算 V1 V3 åPRk +1 (u) = vÎBu PRk (v) Fv V2 V4 K=0 PR(V1) 0.25 PR(V2) 0.25 PR(V3) 0.25 PR(V4) 0.25 10
11.图计算的简要介绍 PageRank的迭代计算 V1 V3 åPRk +1 (u) = vÎBu PRk (v) Fv V2 V4 K=0 K=1 PR(V1) 0.25 ? PR(V2) PR(V3) PR(V4) 0.25 0.25 0.25 11
12.图计算的简要介绍 PageRank的迭代计算 0.25 V1 V3 åPRk +1 (u) = vÎBu PRk (v) Fv 0.12 0.12 V2 V4 K=0 K=1 PR(V1) 0.25 ? PR(V2) PR(V3) PR(V4) 0.25 0.25 0.25 12
13.图计算的简要介绍 PageRank的迭代计算 V1 V3 åPRk +1 (u) = vÎBu PRk (v) Fv 0.12 0.12 V2 V4 K=0 K=1 PR(V1) PR(V2) PR(V3) PR(V4) 0.25 0.25 0.25 0.25 0.37
14.图计算的简要介绍 PageRank的迭代计算 V1 V3 åPRk +1 (u) = vÎBu PRk (v) Fv V2 V4 K=0 K=1 PR(V1) PR(V2) PR(V3) PR(V4) 0.25 0.25 0.25 0.25 0.37 0.08 0.33 0.20 14
15.图计算的简要介绍 PageRank的迭代计算 V1 V3 åPRk +1 (u) = vÎBu PRk (v) Fv V2 V4 K=0 K=1 K=2 PR(V1) 0.25 0.37 0.43 PR(V2) 0.25 0.08 0.12 PR(V3) 0.25 0.33 0.27 PR(V4) 0.25 0.20 0.16 15
16.图计算的简要介绍 PageRank的迭代计算 V1 V3 åPRk +1 (u) = vÎBu PRk (v) Fv V2 V4 K=0 K=1 K=2 K=3 PR(V1) 0.25 0.37 0.43 0.35 PR(V2) 0.25 0.08 0.12 0.14 PR(V3) 0.25 0.33 0.27 0.29 PR(V4) 0.25 0.20 0.16 0.20 16
17.图计算的简要介绍 PageRank的迭代计算 V1 V3 V2 V4 PR(V1) PR(V2) PR(V3) PR(V4) K=0 0.25 0.25 0.25 0.25 åPRk +1 (u) = vÎBu PRk (v) Fv K=1 K=2 K=3 K=4 0.37 0.43 0.35 0.39 0.08 0.12 0.14 0.11 0.33 0.27 0.29 0.29 0.20 0.16 0.20 0.19 17
18.图计算的简要介绍 PageRank的迭代计算 V1 V3 V2 PR(V1) PR(V2) PR(V3) PR(V4) V4 K=0 0.25 0.25 0.25 0.25 åPRk +1 (u) = vÎBu PRk (v) Fv K=1 K=2 K=3 K=4 K=5 0.37 0.43 0.35 0.39 0.39 0.08 0.12 0.14 0.11 0.13 0.33 0.27 0.29 0.29 0.28 0.20 0.16 0.20 0.19 0.19 18
19.图计算的简要介绍 PageRank的迭代计算 V1 V3 åPRk +1 (u) = vÎBu PRk (v) Fv V2 V4 FixPoint K=0 K=1 K=2 K=3 K=4 K=5 K=6 PR(V1) 0.25 0.37 0.43 0.35 0.39 0.39 0.38 PR(V2) 0.25 0.08 0.12 0.14 0.11 0.13 0.13 PR(V3) 0.25 0.33 0.27 0.29 0.29 0.28 0.28 PR(V4) 0.25 0.20 0.16 0.20 0.19 0.19 0.19 19
20.图计算的简要介绍图计算系统的计算框架 • 计算框架的作用 • 便于编程，性能扩展，自动容错 • 以顶点为中心的计算框架 • 以边为中心的计算框架
21.图计算的简要介绍以顶点为中心的图计算系统 • “Think like a vertex” • Pregel以及GraphLab开始的编程思想 Data Data Data Data Data Data Data MyFunc(vertex) { // modify neighborhood } Data Data GatherData Apply Scatter
22.使用体系结构局部性加速图计算
23.针对数据局部性的图计算系统设计 • 属性图（Property Graph） • 点和边上存在多个属性数据 • 有多个属性的图算法 • Trust Rank[VLDB’04] name=“Peter” age=21 1 name=“Jim” age=23 2 • Topic Sensitive PageRank[WWW’02] • SVD++[SIGKDD’08] • AdPredictor [ICML’10] 4 name=“John” age=20 3 name=“Alex” age=29
24.图计算的局部性优化：属性数据局部性 “Think Like a Vertex” 属性数据按照点/边的方式组织在一起（Vertex View） TrustRank算法为例子，在某个计算阶段只访问 Score属性。 Score Rank Memory Address Order Vertex
25.图计算的局部性优化：属性数据局部性 “Think Like a Vertex” 属性数据按照点/边的方式组织在一起（Vertex View） Score Rank Memory Address Order TrustRank算法为例子， Vertex S在co某re个属计性算I。n阶te段r只le访a问ved Memory Access!!
26.DTLB : Data TLB Reads DCA : Data Cache Access DCR : Data Cache Read DCW : Data Cache Write
27.图计算的局部性优化：属性数据局部性按照属性组织数据，不同的属性分开存储（Property View） Score Property Array Rank Property Array 只load需要的数据到 cache line，避免内存的间隔访问。 Score Rank Memory Address Order Vertex
28.图计算的局部性优化：图结构访问局部性以点为中心的图结构访问（Vertex Centric） AC D B E 对u局部性最好，对v局部性最差。
29.图计算的局部性优化：图结构访问局部性以点为中心的图结构访问（Vertex Centric） Cache AC AC D B E 对u局部性最好，对v局部性最差。
30.图计算的局部性优化：图结构访问局部性以点为中心的图结构访问（Vertex Centric） Cache AC AC D B E 对u局部性最好，对v局部性最差。
31.图计算的局部性优化：图结构访问局部性以点为中心的图结构访问（Vertex Centric） Cache AE AC D B E 对u局部性最好，对v局部性最差。
32.图计算的局部性优化：图结构访问局部性以点为中心的图结构访问（Vertex Centric） Cache BC AC D B E 对u局部性最好，对v局部性最差。
33.图计算的局部性优化：图结构访问局部性以点为中心的图结构访问（Vertex Centric） Cache BD AC D B E 对u局部性最好，对v局部性最差。
34.图计算的局部性优化：图结构访问局部性以点为中心的图结构访问（Vertex Centric） Cache BD AC D B E 对u局部性最好，对v局部性最差。
35.图计算的局部性优化：图结构访问局部性以边为中心的图结构访问（Edge Centric）通过对边进行合理的排序，达到对所有点访问的局部性都好。
36.图计算的局部性优化：图结构访问局部性以边为中心的图结构访问（Edge Centric）通过对边进行合理的排序，达到对所有点访问的局部性都好。实现时采用Hilbert Order。
37.性能提升（Property View）
38.性能提升（Edge-Centric Execution Engine）
39.性能提升
40.图的三维划分加速计算
41.研究现状：一维划分一维划分：数据图被以点为粒度地划分给各个计算节点原图 V0 V3 使用一维划分划分 Node 0 给四个计算节点 V0 Node 1 V3 V3 V1 V2 四张子图共六个副本 Node 2 V0 V2 Node 3 V3 V3 V1 V2 V2
42.研究现状：一维划分点程序在一维划分的情况下，任务划分的粒度也为点，与数据划分的粒度相同点程序每一个点程序都可以读取或写入其对应点领域范围内的数据通讯通过消息传递 (e.g. Pregel) 通过共享状态 (e.g. GraphLab) 并行通过同时执行多个领域不相交的点程序缺点负载不均衡!!!
43.研究现状：二维划分二维划分：数据图被以边为粒度地划分给各个计算节点原图 V0 V3 使用二维划分划分 Node 0 给四个计算节点 V0 Node 1 V3 V3 V1 V2 四张子图共六个副本 Node 2 V0 V2 Node 3 V3 V2 V1 V1 V2
44.研究现状：二维划分 PowerGraph 的编程模型 Gather 用户自定义: Gather( Y ) àΣ Σ1 + Σ2 à Σ3 Apply 用户自定义: Apply( Y , Σ) à Y’ Scatter 用户自定义: Scaber( Y’ ) à ΣY Y Y’ Y’ 在二维划分的情况下，任务划分的粒度也为边，与数据划分的粒度相同 Y
45.背景：图计算种类丰富传统图分析类应用许多图应用目的在于分析图上的拓扑关系等图论性质示例最短路径三角形个数 PageRank 联通分量
46.背景：图计算种类丰富传统图分析类应用许多图应用目的在于分析图上的拓扑关系等图论性质示例最短路径三角形个数 PageRank 联通分量建模成图计算的机器学习应用同时，许多机器学习和数据挖掘类的应用也可以被建模成图计算应用进行计算示例协同过滤稀疏矩阵相乘神经网络矩阵分解
47.示例：协同过滤通过矩阵描述N:# of usersM:# of items R R[u,v] ≈ D QT P * Q[v] P[u] D R[u,v] ≈建模成图计算 P0 P1 P2 …… Q0 Q1 Q2 …… Pu …… R[u,v] PN Qv …… QM 协同过滤通过给出的用户对物品的打分情况估测未给出的打分建模成图计算后每一个点的点权为长度 D 的向量，是可再划分单元这一现象是很多图计算应用共有的模式
48.研究内容：三维划分三维划分：将数据图中的点切分成子点，然后对由子点构成的图层进行二维划分原图 1 2 V0 V3 1 2 使用三维划分划分 Node 0,0 给四个计算节点 1 2 V0 V3 1 2 Node 0,1 1 2 V0 V3 1 2 1 2 V1 V2 1 2 每层两张子图，三个副本 1 2 V1 1 2 V1 V2 1 2 Node 1,0 1 2 V0 V3 1 2 Node 1,1 1 2 V0 V3 1 2 1 2 V1 1 2 V1 V2 1 2
49.研究内容：两类通讯 Node 0,0 1 2 V0 V3 1 2 1 2 V1 Node 1,0 1 2 V0 V3 1 2 1 2 V1 Node 0,1 1 2 V0 V3 1 2 1 2 V1 V2 1 2 Node 1,1 1 2 V0 V3 1 2 1 2 V1 V2 1 2 层内通讯更少的子图更少的副本更少的通讯！
50.研究内容：两类通讯 Node 0,0 1 2 V0 V3 1 2 1 2 V1 Node 1,0 1 2 V0 V3 1 2 1 2 V1 Node 0,1 1 2 V0 V3 1 2 1 2 V1 V2 1 2 Node 1,1 1 2 V0 V3 1 2 1 2 V1 V2 1 2 层内通讯更少的子图更少的副本更少的通讯！层间通讯之前并不存在不断增大！
51.小结 • 我们找到了一个新的划分图计算任务的维度 • 发现 Ø 点权中的向量上进行的计算经常是对位地 Ø 很容易并行化 v 通过将下标相同的对位数据划分到同一个计算节点上，可以不引发任何通讯！
52.研究内容：新系统 • 现有的分布式图计算系统无法建模层间通讯 • CUBE • 采用三维划分 • 基于新型的计算模型 UPPS • 使用矩阵执行引擎
53.研究内容：三维划分方法 CUBE 三维划分方法 (L, P): L 是层数，P 则是一个已有的二维划分方法 Node 0,0 S 1 2 V0 L S 1 2 V1 Node 0,1 2 S 1 2 V0 S 1 2 V1 V3 1 2 S V3 1 2 S Node 1,0 S 1 2 V0 S 1 2 V1 Node 1,1 S 1 2 V0 S 1 2 V1 V3 1 2 S V2 1 2 S V3 1 2 S V2 1 2 S 共享部分分层部分 = 使用 P 进行划分
54.研究内容：矩阵执行引擎 CUBE 以点为中心 ü 简单易用 ü 与现有系统兼容 ✖ 效率不高 MAP 矩阵执行引擎 ✖ 编程复杂度高 ü 高效 v 简单的寻址方法 v Hilbert order 我们通过自动地将以点为中心的程序转换成矩阵操作执行，同时拥有了两方的好处；该方法由 GraphMat 系统刷先提出。
55.实验环境 CUBE • 比较对象: PowerGraph 和 PowerLyra Ø 缺省的二维划分方法 oblivious 被用于测试 PowerGraph Ø 相对的，PowerLyra 的测试枚举了所有种类的划分方法 Ø CUBE 系统使用的 P 与 PowerLyra 所用的方法保持一致 • 实验平台: 八个计算节点; 通过 1Gb 以太网互联 • 数据集代号 Libimseti Last.fm Netflix U 135,359 359,349 17,770 V 168,791 211,067 480,189 E 17,359,346 17,559,530 100,480,507 最好的二维划分方法 Hybrid-cut Bi-cut Bi-cut
56.微型测试集：SpMM SpMM D M Q Q[v] = P[u] P P[w] * R[u,v] R R[w,v] D Q[v] = R[u,v]*P[u] + R[w,v]*P[w] N 50 40 30 20 10 0 1 通讯开销正比于层数反比于 SpMM 的执行时间 Libimseti, Sc = 256 Lastfm, Sc = 256 Libimseti, Sc = 1024 9 17 25 33 41 49 57 # of layers 副本数子图数负相关
57.微型测试集：SumV & SumE 50 40 30 20 10 0 1 SumV 的执行时间 Libimseti, Sc = 256 Lastfm, Sc = 256 Libimseti, Sc = 1024 9 17 25 33 41 49 57 # of layersSumV:对点权向量累和 50 40 30 20 10 0 1 SumE 的执行时间 Libimseti, Sc = 256 Lastfm, Sc = 256 Libimseti, Sc = 1024 9 17 25 33 41 49 57 # of layersSumE:对边权向量累和通讯开销正比于 !"! #, L 为层数
58.测试结果：PowerLyra & PowerGraph CUBE 数据集执行时间 D GD ALS PowerGraph PowerLyra CUBE PowerGraph PowerLyra CUBE 64 Libimseti 128 6.82 11.64 6.89 11.62 2.59 (4) 3.33 (8) 87.0 331 86.8 28 (64) 331 109 (64) Lastfm 64 128 10.4 18.6 9.86 2.48 (4) 158 17.8 3.47 (8) Failed 111 Failed 57 (64) 230 (64) Netflix 64 128 18.3 30.6 7.42 11.3 4.16 (1) 6.55 (2) 179 Failed 66.0 42.5 (8) 239 118 (8) ü 较之 PowerLyra 提速最高 4.7 倍 ü 较之 PowerGraph 提速最高 7.3 倍
59.测试结果：通讯量减少 GD 算法上的通讯量减少 CUBE ALS 算法上的通讯量减少 ü 对于 Lastfm 和 Libimseti 数据集一半的加速来自于通讯减少 ✖ 对 Netflix 数据集效果不佳 v 如果设定 D 为 2048，即使是对 Netflix 数据集也能提速 2.5 倍 ü 几乎全部的加速来自于通讯减少. v ALS 的计算需要执行复杂度为 O(N3) 的计算单元
60.测试结果：内存消耗 30 25 20 15 10 5 0 1 总内存消耗 11 21 31 41 51 61 层数在 64 个计算节点上执行 ALS 算法的总内存消耗，其中 D=32, SC=D2+D=1056 CUBE 内存消耗比 L=1 时要小最佳的执行时间不一定同时有最小的内存开销
61.外存图计算的加速
62.单机的外存图计算现有的单机的计算规模已经可以处理大规模的图（这种情况下，磁盘的IO的操作将成为瓶颈）单机的图计算系统 2012 GraphChi 2013 2015 X-Stream FlashGraph Our work 2015 GridGraph Wrongtrade-off:They improve the disk I/O locality at the cost of increasing the total amount of disk I/O.Clip:新型单机图计算系统特征 l More flexible processing order l Breaking the neighborhood constraint l More efficient algorithms can be used Primary concern Reducing the total amount of disk I/O
63.现有工作的限制 Edge block-1 Each edge is processed once ??????# (??????#, ??????#, ??????#) ??????- (??????-, ??????-, ??????-) ??????. (??????., ??????., ??????.) ??????/ (??????/, ??????/, ??????/) ⋮ Memory Load Disk User-defined function ??????# ??????(??????) ??????- v ??????4 ??????[??????] e ??????[??????] ??????(??????) Vertex-centric Edge-centric Neighborhood constraint vvv Vertex part-1 Part-1 Part-2 ⋯ ⋯ Part-n Vertexes Block-1 Block-2 One by one ⋯ ⋯ Block-n Edges Next iteration 每一条边都只被处理一次，之后就被换出用户定义的函数只能作用在邻边上和邻接点上增加了循环的次数增加了IO的数目，降低性能
64.解决的方法 ↑ ?????????????????????????????????????????? ↓ ?????????????????????????????????????????????????????? Memory Load ??????8, ??????# (??????-, ??????#) ⋮ (??????9, ??????:) ??????(??????) S Block-1 Block-2 ⋯⋯ Block-n Edges (Sequential Read) eeee Reentry vv Disk Block-1 Block-2 ⋯ ⋯ Block-n All vertices (Random R/W)CLIP:Squeezing out all the value of loaded data 装载所有节点数据到内存，尽可能多使用已经装载到内存中的所有数据 1 尽可能在装载的数据上多次进行计算 2 更新节点数据的时候，可以更新任何在内存中的节点的数据
65.节点信息存放在内存中？ Fromhttps://www.amazon.com/
66.典型的图数据集的大小情况 The real-world graph datasets LiveJournal Dimacs Twitter Friendster Yahoo Twitter-Big Facebook-Big Vertexes 4.85M 23.9M 41.7M 65.6M 1.4B 288M 1.39B Edges 69.0M 58.3M 1.47B 1.8B 6.64B 60B 400B Vertexes Size 37MB 183MB 317MB 501MB 10.5GB 2.25GB 10.4G Edges Size 0.53GB 0.67GB 10.9GB 13.5GB 49.4GB 480GB 3.1TB The biggest real-worldgraph:l The edges size are much larger than vertexes size l 32G memory is sufficient for vertexes size l 32G memory cannot hold all edges l 32G DDR3 chips are very cheap and prevalent nowTips:Some single-machine graph systems have similar design, such as GraphChi and FlashGraph
67.计算方式的改进Example:Calculating single source shortest path (SSSP) Start 1 6 Priorsystem:process once 25 34 ?????? ?????? ???????????????????????? ???????????????????????? 0 ∞ ∞ ∞ ∞ ∞ ???????????????????????????????????? ???????????????????????????????????? ???????????????????????????????????? e1 1 6 1 e2 2 1 1 e3 3 2 1 e4 4 3 1 e5 5 4 1 e6 6 5 1 Edge list on disk Disk Memory e1 e2 e3 e4 once e5 e6 ???????????? ???????????? ???????????? ???????????? ???????????? ???????????? Iteration 1 0 ∞ ∞ ∞ ∞?????? ∞??????Clip:process multiple times Disk Memory e1 e2 e3 e4 e5 e6
68.计算方式的改进Example:Calculating single source shortest path (SSSP) Start 1 6 Priorsystem:process once 25 34 ?????? ?????? ???????????????????????? ???????????????????????? 0 ∞ ∞ ∞ ∞ ∞ ???????????????????????????????????? ???????????????????????????????????? ???????????????????????????????????? e1 1 6 1 e2 2 1 1 e3 3 2 1 e4 4 3 1 e5 5 4 1 e6 6 5 1 Edge list on disk Disk Memory e1 e2 e3 e4 once e5 e6 Iteration 1 ???????????? ???????????? ???????????? ???????????? ???????????? ???????????? 0 ∞ ∞ ∞ ∞?????? ∞?????? Iteration 2 0 ∞ ∞ ∞?????? 2 1Clip:process multiple times Disk Memory e1 e2 e3 e4 e5 e6
69.计算方式的改进Example:Calculating single source shortest path (SSSP) Start 1 6 Priorsystem:process once 25 34 ?????? ?????? ???????????????????????? ???????????????????????? 0 ∞ ∞ ∞ ∞ ∞ ???????????????????????????????????? ???????????????????????????????????? ???????????????????????????????????? e1 1 6 1 e2 2 1 1 e3 3 2 1 e4 4 3 1 e5 5 4 1 e6 6 5 1 Edge list on disk Disk Memory e1 e2 e3 e4 once e5 e6 Iteration 1 ???????????? ???????????? ???????????? ???????????? ???????????? ???????????? 0 ∞ ∞ ∞ ∞?????? ∞?????? Iteration 2 0 ∞ ∞ ∞?????? 2 1 Iteration 3 0 ∞ ∞?????? 3 2 1Clip:process multiple times Disk Memory e1 e2 e3 e4 e5 e6
70.计算方式的改进Example:Calculating single source shortest path (SSSP) Start 1 6 Priorsystem:process once 25 34 ?????? ?????? ???????????????????????? ???????????????????????? 0 ∞ ∞ ∞ ∞ ∞ ???????????????????????????????????? ???????????????????????????????????? ???????????????????????????????????? e1 1 6 1 e2 2 1 1 e3 3 2 1 e4 4 3 1 e5 5 4 1 e6 6 5 1 Edge list on disk Disk Memory e1 e2 e3 e4 once e5 e6 Iteration 1 Iteration 2 Iteration 3 Iteration 4 ???????????? ???????????? ???????????? ???????????? ???????????? ???????????? 0 ∞ ∞ ∞ ∞?????? ∞?????? 0 ∞ ∞ ∞?????? 2 1 0 ∞ ∞?????? 3 2 1 0 ∞?????? 4 3 2 1 ?????????????????????????????? ????????????????????????????????????: ?????? ∗ ?????? ∗ ????????????????????????????????????(??????)Clip:process multiple times Disk Memory e1 e2 e3 e4 e5 e6
71.计算方式的改进Example:Calculating single source shortest path (SSSP) Start 1 6 Priorsystem:process once 25 34 ?????? ?????? ???????????????????????? ???????????????????????? 0 ∞ ∞ ∞ ∞ ∞ ???????????????????????????????????? ???????????????????????????????????? ???????????????????????????????????? e1 1 6 1 e2 2 1 1 e3 3 2 1 e4 4 3 1 e5 5 4 1 e6 6 5 1 Edge list on disk Disk Memory e1 e2 e3 e4 e5 e6 Iteration 1 Iteration 2 Iteration 3 Iteration 4 ???????????? ???????????? ???????????? ???????????? ???????????? ???????????? 0 ∞ ∞ ∞ ∞?????? ∞?????? 0 ∞ ∞ ∞?????? 2 1 0 ∞ ∞?????? 3 2 1 0 ∞?????? 4 3 2 1 ?????????????????????????????? ????????????????????????????????????: ?????? ∗ ?????? ∗ ????????????????????????????????????(??????)Clip:process multiple times Disk Memory e1 e2 e3 e4 Pass towneo e5 e6 ???????????? ???????????? ???????????? ???????????? ???????????? ???????????? Iteration 1 0 ∞ ∞ ∞?????? ∞?????? ∞??????
72.计算方式的改进Example:Calculating single source shortest path (SSSP) Start 1 6 Priorsystem:process once 25 34 ?????? ?????? ???????????????????????? ???????????????????????? 0 ∞ ∞ ∞ ∞ ∞ ???????????????????????????????????? ???????????????????????????????????? ???????????????????????????????????? e1 1 6 1 e2 2 1 1 e3 3 2 1 e4 4 3 1 e5 5 4 1 e6 6 5 1 Edge list on disk Disk Memory e1 e2 e3 e4 e5 e6 ???????????? ???????????? ???????????? ???????????? ???????????? ???????????? Iteration 1 0 ∞ ∞ ∞ ∞?????? ∞?????? Iteration 2 0 ∞ ∞ ∞?????? 2 1 Iteration 3 0 ∞ ∞?????? 3 2 1 Iteration 4 0 ∞?????? 4 3 2 1 ?????????????????????????????? ????????????????????????????????????: ?????? ∗ ?????? ∗ ????????????????????????????????????(??????)Clip:process multiple times Disk Memory e1 e2 e3 e4 Pass otwneo e5 e6 ???????????? ???????????? ???????????? ???????????? ???????????? ???????????? Iteration 1 0 ∞ ∞ ∞?????? ∞?????? ∞?????? Iteration 2 0 ∞?????? ∞?????? 3 2 1 ?????????????????????????????? ????????????????????????????????????: ?????? ∗ ?????? ∗ ????????????????????????????????????(??????)
73.计算方式的改进Example:Calculating weakly connected component (WCC) 12 34 Undirected graph ???????????????????????????????????? ???????????????????????????????????? e1 2 3 e2 2 1 e3 1 2 e4 4 3 e5 3 4 e6 3 2 Edge list on disk Priorsystem:label propagation based Disk ?????? → ?????? ?????? → ?????? ?????? → ?????? ?????? → ?????? ?????? → ?????? ?????? → ?????? Memory v12 v32 v34 ???????????? ???????????? ???????????? ???????????? Iteration 1 1 12 23 24Clip:disjoint set (all vertexes in memory) Disk Memory ?????? → ?????? ?????? → ?????? ?????? → ?????? v1 v2 v3 v4 ?????? → ?????? ?????? → ?????? ?????? → ?????? Iteration 1 ???????????? ???????????? ???????????? ???????????? 1234
74.计算方式的改进Example:Calculating weakly connected component (WCC) 12 34 Undirected graph ???????????????????????????????????? ???????????????????????????????????? e1 2 3 e2 2 1 e3 1 2 e4 4 3 e5 3 4 e6 3 2 Edge list on disk Priorsystem:label propagation based Disk ?????? → ?????? ?????? → ?????? ?????? → ?????? ?????? → ?????? ?????? → ?????? ?????? → ?????? Memory v12 v32 v34 ???????????? ???????????? ???????????? ???????????? Iteration 1 1 12 23 24 Iteration 2 1 1 31 31 ?????????????????????????????? ????????????????????????????????????: ?????? ∗ ?????? ∗ ????????????????????????????????????(??????)Clip:disjoint set (all vertexes in memory) Disk Memory ???????????? → ?????? ???→??? →?????? ?????? ???????????? → ?????? → ?????? ???????????? → ?????? ???→??? →?????? ?????? ?????? → ?????? v1 v2 v3 v4 ???????????? ???????????? ???????????? ???????????? Iteration 1 1 2 3 4 ?????????????????????????????? ????????????????????????????????????: ?????? ∗ ?????? ∗ ????????????????????????????????????(??????)
75.编程模型 Programming model of CLIP All vertexes
76.编程模型的效果 The user only write a very short code———SSSP on CLIP vs SSSP by userCLIP:50 lines of code are enoughSorting:440 linesExec:614 lines The user need to write more than 1000 lines of code to implement the optimized SSSP algorithm, which consists of sorting, overlapping, multi-thread…
77.系统实现：排序 An Example for Sorting ??????# ??????- ??????## ??????#??????8 ??????# ??????- Sorted ??????-# ??????-??????8 ??????# ??????- (??????8, ??????8 , ??????#, ??????-, ??????# , ??????-) e ⊕ ????????????(??????) [????????????, ????????????) [????????????, ????????????) [????????????, ????????????) [????????????, ????????????) [????????????, ????????????] block- block- block- block- block1234 5 Only need 3 iterations to read edges Step 1 Ø Reading all edges once and determining the bucket boundaries Step 2 Ø Sorting bucket boundaries and creating ?????? ∗ ?????? + 1 − 1 buckets Ø Reading edges and splitting them into buckets Step 3 Ø Reading each buckets again and sorting the edges in memory
78.系统实现：优化 l Disk I/O—Overlapping Task queue ⋯ File Read Main s Proces s l Concurrency control (Multi-thread) l Neighborhoodconstraint:fined-grained locking l Many iterative algorithms can tolerate updateoverwriting:SSSP (Hogwild[NIPS’11]) l Selective scheduling v v v ⋯ v Active bitmap Buffer block Process Edge active? eYes:load Active? ⋯block block block Memory Disk
79.实验设置 The real-world graph datasets Vertexes Edges Vertexes Size Edges Size LiveJournal 4.85M 69.0M 37MB 0.53GB Dimacs 23.9M 58.3M 183MB 0.67GB Twitter 41.7M 1.47B 317MB 10.9GB Friendster 65.6M 1.8B 501MB 13.5GB Yahoo 1.4B 6.64B 10.5GB 49.4GBTips:Dimacs and Yahoo have a large diameter Single machine Test environmentCPU:Two Intel(R) Xeon(R) CPU E5- 2640 v2 @ 2.00GHz (each has 8-cores)DRAM:32GB, L3 cache 20MBDisk:Standard 1TB SSD drive. The average throughput is about 250MB/s for sequential read. Test benchmarks Relaxing-based applications SSSP (Single Source Shortest Path) BFS (Breadth-first Search) Beyond-neighborhood applications WCC (Weakly connected McoImS p(Monaexnimt)al Independent Set ) MCST (Minimum Cost Spanning Tree) Beyond the Neighborhood PageRank SpMV (Multiply the sparse adjacency matrix of a directed graph with a vector of values, one per vertex)
80.数据复用的效果 Execution time (in seconds) I/O amount of SSSP (GB) Twitter Friends ter Dimacs 20.4 XStream GridGra ph LiveJou rnal 6 28.649 0 2000 4000 6000 8000 10000 Iteration Count SSSPspeedup:'>speedup: