Some big data analytics projects, such as Spark, Hive, and Presto, plus data visualization (e.g., Superset)
Java distributed applications
Detailed Introduction
To 一亩三分地: I chose this repository to show my ongoing learning. Although my day-to-day work rarely touches the big-data stack, I keep studying it in my spare time and always stay eager to learn. Thanks!
Big Data Learning on Hive, Spark, Presto, and Superset (Data Visualization)
Learning Part 1
Basic operations on Hive
Create table and load data
- Create a non-partitioned table stored as a text file
hive> create table student_nopart(name string, age int, gpa double, gender string, state string) row format delimited fields terminated by '\t' stored as textfile;
Check whether the table was created:
hive> show tables;
2018-03-30T19:07:18,959 INFO [pool-11-thread-1] org.apache.hadoop.hive.metastore.HiveMetaStore - 1: source:127.0.0.1 get_database: default
2018-03-30T19:07:19,105 INFO [pool-11-thread-1] org.apache.hadoop.hive.metastore.HiveMetaStore - 1: source:127.0.0.1 get_database: default
2018-03-30T19:07:19,122 INFO [pool-11-thread-1] org.apache.hadoop.hive.metastore.HiveMetaStore - 1: source:127.0.0.1 get_tables: db=default pat=.*
OK
student_nopart
Time taken: 0.229 seconds, Fetched: 1 row(s)
We can see that the table student_nopart was created successfully.
- Load local data into Hive
hive> load data local inpath 'studentgenderstatetab10k' into table student_nopart;
....
OK
Time taken: 1.505 seconds
Check that the data was loaded into the Hive warehouse:
hive> dfs -ls /user/hive/warehouse/student_nopart;
Found 1 items
-rwxr-xr-x   1 hadoop supergroup     268219 2018-03-30 19:10 /user/hive/warehouse/student_nopart/studentgenderstatetab10k
Another way to check
hive> select COUNT(*) from student_nopart;
.....
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.55 sec   HDFS Read: 276442 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 550 msec
OK
10000
Time taken: 35.192 seconds, Fetched: 1 row(s)
Dynamic partitioning during insertion
- Create a new table partitioned by the columns gender and state.
hive> create table student_part(name string, age int, gpa double) partitioned by (gender string, state string) row format delimited fields terminated by '\t' stored as textfile;
....
2018-03-30T19:48:16,514 INFO [pool-11-thread-4] org.apache.hadoop.hive.common.FileUtils - Creating directory if it doesn't exist: hdfs://localhost:9000/user/hive/warehouse/student_part
OK
Time taken: 0.698 seconds
- Load data from student_nopart into student_part with dynamic partitioning. Hive's dynamic partition mode defaults to strict, which requires at least one static partition key, so we first switch it to nonstrict.
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into student_part partition(gender,state) select * from student_nopart;
Check the directory structure in HDFS:
hive> dfs -ls /user/hive/warehouse/student_part;
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:01 /user/hive/warehouse/student_part/gender=F
drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:01 /user/hive/warehouse/student_part/gender=M
hive> dfs -ls /user/hive/warehouse/student_part/gender=M;
Found 4 items
drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=AZ
drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=CA
drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=FL
drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=M/state=WA
hive> dfs -ls /user/hive/warehouse/student_part/gender=F;
Found 4 items
drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=AZ
drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=CA
drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=FL
drwxr-xr-x   - hadoop supergroup          0 2018-03-30 20:00 /user/hive/warehouse/student_part/gender=F/state=WA
- Do a simple query
hive> select COUNT(name) from student_part where gender = 'F' and state = 'AZ';
......
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.92 sec   HDFS Read: 45557 HDFS Write: 104 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 920 msec
OK
1693
Time taken: 25.845 seconds, Fetched: 1 row(s)
Storage formats
- Store the table as an ORC file
hive> create table student_orc(name string, age int, gpa double, gender string, state string) row format delimited fields terminated by '\t' stored as orcfile;
Insert data into table
hive> insert into table student_orc select * from student_part;
- Check the ORC file
hadoop@1053d6dea9e8:/src/week3/lab1/data$ hive --orcfiledump /user/hive/warehouse/student_orc/000000_0
Hive UDF
- Write UDFs in Eclipse. There are two UDFs:
- Hello from Daniel's example:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple UDF that prepends "Hello " to its input
public class Hello extends UDF {
    public Text evaluate(Text input) {
        return new Text("Hello " + input.toString());
    }
}
- MyUpper, written by myself:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple UDF that upper-cases its input
public class MyUpper extends UDF {
    public Text evaluate(Text input) {
        return new Text(input.toString().toUpperCase());
    }
}
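To compile these UDFs, the Maven project needs Hive's UDF base class on the classpath. A hedged sketch of the relevant pom.xml dependency, assuming the version matches the Hive 2.3.2 instance seen later in this writeup:

<dependency>
    <!-- provides org.apache.hadoop.hive.ql.exec.UDF; version assumed to match the cluster -->
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.3.2</version>
    <scope>provided</scope>
</dependency>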
- Use the UDFs in Hive. First, add the jar to the classpath and create temporary functions
hive> add jar target/evalfunc-0.0.1-SNAPSHOT.jar;
....
Added resources: [target/evalfunc-0.0.1-SNAPSHOT.jar]
hive> create temporary function hello as 'com.example.hive.evalfunc.Hello';
OK
Time taken: 0.558 seconds
hive> create temporary function myupper as 'com.example.hive.evalfunc.MyUpper';
OK
Time taken: 0.011 seconds
Use the UDFs in a query:
hive> select hello(myupper(name)) from student_nopart limit 10;
.....
OK
Hello ULYSSES THOMPSON
Hello KATIE CARSON
Hello LUKE KING
Hello HOLLY DAVIDSON
Hello FRED MILLER
Hello HOLLY WHITE
Hello LUKE STEINBECK
Hello NICK UNDERHILL
Hello HOLLY DAVIDSON
Hello CALVIN BROWN
Time taken: 3.137 seconds, Fetched: 10 row(s)
HiveServer2
- Start the HiveServer2 service
hadoop@1053d6dea9e8:/src/week3/lab1/evalfunc$ hive --service hiveserver2 &
- Connect to HiveServer2 using Beeline
beeline> !connect jdbc:hive2://localhost:10000 hadoop hadoop
Connecting to jdbc:hive2://localhost:10000
2018-03-30T21:04:15,672 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source: 127.0.0.1 get_database: default
2018-03-30T21:04:15,713 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: Opening raw store with implementation class:org.apache.hadoop.hive.metastore.ObjectStore
Connected to: Apache Hive (version 2.3.2)
Driver: Hive JDBC (version 2.3.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
- Show tables
0: jdbc:hive2://localhost:10000> show tables;
2018-03-30T21:04:24,479 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source:127.0.0.1 get_database: default
2018-03-30T21:04:24,847 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source:127.0.0.1 get_database: default
2018-03-30T21:04:24,862 INFO [pool-11-thread-12] org.apache.hadoop.hive.metastore.HiveMetaStore - 12: source:127.0.0.1 get_tables: db=default pat=.*
OK
+-----------------+
|    tab_name     |
+-----------------+
| student_nopart  |
| student_orc     |
| student_part    |
+-----------------+
3 rows selected (1.858 seconds)
I have to say, a nicely formatted table makes me comfortable and happy. :)
- Create a new table
jdbc:hive2://localhost:10000> create table voter(name string, age int, registration string, contribution double) row format delimited fields terminated by '\t' stored as textfile;
......
OK
No rows affected (0.188 seconds)
Load data into the table:
jdbc:hive2://localhost:10000> load data inpath 'votertab10k' into table voter;
- Do some simple queries
- Query how many students come from California
jdbc:hive2://localhost:10000> select COUNT(DISTINCT(name)) as numOfCaliPerson from student_part where state = 'CA';
......
OK
+------------------+
| numofcaliperson  |
+------------------+
| 595              |
+------------------+
1 row selected (23.091 seconds)
- Join tables student_orc and voter
jdbc:hive2://localhost:10000> select * from student_orc s join voter v on s.name = v.name limit 10;
......
OK
+------------------+--------+--------+-----------+----------+------------------+--------+-----------------+-----------------+
|      s.name      | s.age  | s.gpa  | s.gender  | s.state  |      v.name      | v.age  | v.registration  | v.contribution  |
+------------------+--------+--------+-----------+----------+------------------+--------+-----------------+-----------------+
| luke king        | 65     | 0.73   | F         | AZ       | luke king        | 26     | libertarian     | 604.71          |
| luke king        | 65     | 0.73   | F         | AZ       | luke king        | 62     | democrat        | 281.26          |
| luke king        | 65     | 0.73   | F         | AZ       | luke king        | 56     | libertarian     | 177.84          |
| luke king        | 65     | 0.73   | F         | AZ       | luke king        | 61     | republican      | 239.53          |
| holly davidson   | 57     | 2.43   | F         | AZ       | holly davidson   | 59     | democrat        | 957.45          |
| holly davidson   | 57     | 2.43   | F         | AZ       | holly davidson   | 62     | green           | 934.61          |
| holly davidson   | 57     | 2.43   | F         | AZ       | holly davidson   | 18     | democrat        | 980.63          |
| holly davidson   | 57     | 2.43   | F         | AZ       | holly davidson   | 61     | green           | 320.68          |
| rachel robinson  | 62     | 2.25   | F         | AZ       | rachel robinson  | 36     | republican      | 704.71          |
| rachel robinson  | 62     | 2.25   | F         | AZ       | rachel robinson  | 59     | democrat        | 296.66          |
+------------------+--------+--------+-----------+----------+------------------+--------+-----------------+-----------------+
10 rows selected (23.826 seconds)
- Hive Window Functions
I want to find, in every state, the students holding the highest GPA among that state's students.
hive> select T.name, T.state, T.gpa
> from
> (select name, state, gpa, rank() over(partition by state order by gpa desc) as ranking from student_orc) T
> where T.ranking = 1;
.....
OK
nick falkner AZ 4.0
wendy king AZ 4.0
mike johnson AZ 4.0
calvin brown AZ 4.0
david allen CA 3.99
wendy underhill CA 3.99
sarah polk CA 3.99
alice johnson CA 3.99
david underhill CA 3.99
holly brown CA 3.99
katie thompson CA 3.99
tom thompson FL 4.0
calvin carson FL 4.0
quinn van buren FL 4.0
ulysses carson FL 4.0
tom xylophone WA 4.0
wendy zipper WA 4.0
Time taken: 25.336 seconds, Fetched: 17 row(s)
Basic operations on Presto
- Start the Presto server and connect to it via
presto --server localhost:8080 --catalog hive --schema default
We specified the hive catalog to Presto; Presto accesses Hive data through its Hive connector.
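For reference, Presto discovers the catalog through a properties file. A minimal sketch, assuming the 2018-era hive-hadoop2 connector and a metastore on its default Thrift port (both details are assumptions about this setup):

# etc/catalog/hive.properties -- registers the "hive" catalog (connector name and port assumed)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083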
- Show tables
presto:default> show tables;
.....
     Table
----------------
 bookinfo
 student_nopart
 student_orc
 student_part
 voter
(5 rows)

Query 20180403_214104_00008_sr2td, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:01 [5 rows, 135B] [7 rows/s, 197B/s]
- Do a simple query
presto:default> select * from student_orc limit 10;
......
        name         | age | gpa  | gender | state
---------------------+-----+------+--------+-------
 luke king           |  65 | 0.73 | F      | AZ
 holly davidson      |  57 | 2.43 | F      | AZ
 rachel robinson     |  62 | 2.25 | F      | AZ
 gabriella van buren |  68 | 1.32 | F      | AZ
 bob ichabod         |  56 | 2.81 | F      | AZ
 bob zipper          |  40 | 0.28 | F      | AZ
 gabriella zipper    |  77 | 1.51 | F      | AZ
 zach nixon          |  74 | 2.75 | F      | AZ
 irene davidson      |  27 | 1.22 | F      | AZ
 ulysses polk        |  58 | 0.01 | F      | AZ
(10 rows)

Query 20180403_214327_00009_sr2td, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:01 [1.02K rows, 46.9KB] [1.53K rows/s, 70.2KB/s]
- Join operation
presto:default> select s.name, AVG(s.age) as AvgAge from student_orc s join voter v on s.name = v.name group by s.name limit 10;
.....
       name        |       AvgAge
-------------------+--------------------
 fred garcia       |  58.46666666666667
 quinn garcia      | 53.473684210526315
 yuri robinson     |  48.07142857142857
 rachel laertes    |  53.13333333333333
 nick polk         |  47.38461538461539
 alice robinson    |  46.42857142857143
 priscilla king    |  52.36363636363637
 mike robinson     | 41.642857142857146
 fred thompson     | 49.2
 ulysses hernandez | 43.588235294117645
(10 rows)

Query 20180403_214915_00014_sr2td, FINISHED, 1 node
Splits: 83 total, 83 done (100.00%)
0:01 [11K rows, 79.9KB] [11.1K rows/s, 80.5KB/s]
Basic operations on Spark SQL
- Start the Spark master and Spark slaves, then start Spark SQL via
spark-sql --master spark://1053d6dea9e8:7077
- Show tables
spark-sql> show tables;
.....
default	bookinfo	false
default	student_nopart	false
default	student_orc	false
default	student_part	false
default	voter	false
Time taken: 0.12 seconds, Fetched 5 row(s)
- Run a simple query
spark-sql> select * from voter where age > 50 limit 10;
.......
tom van buren	68	socialist	390.19
ethan johnson	70	socialist	804.81
irene falkner	53	republican	45.34
victor xylophone	70	green	141.12
nick ichabod	58	libertarian	306.16
quinn white	52	independent	63.89
alice van buren	63	independent	260.44
jessica brown	53	independent	568.9
oscar laertes	67	green	984.62
rachel thompson	63	socialist	244.44
Time taken: 5.102 seconds, Fetched 10 row(s)
- Join query
spark-sql> select s.name, AVG(s.age) as AvgAge from student_orc s join voter v on s.name = v.name group by s.name limit 10;
........
xavier white	51.84615384615385
irene robinson	48.21052631578947
wendy underhill	49.90909090909091
nick falkner	45.63636363636363
gabriella quirinius	55.0
wendy quirinius	47.523809523809526
yuri robinson	48.07142857142857
nick ellison	55.833333333333336
ulysses nixon	46.46666666666667
zach falkner	59.5
Time taken: 4.063 seconds, Fetched 10 row(s)
- Spark SQL can reuse Hive UDFs; let's try it
spark-sql> add jar /src/week3/lab1/evalfunc/target/evalfunc-0.0.1-SNAPSHOT.jar;
spark-sql> create temporary function Hello as 'com.example.hive.evalfunc.Hello';
spark-sql> create temporary function Upper as 'com.example.hive.evalfunc.MyUpper';
spark-sql> select Hello(Upper(name)) from student_orc limit 10;
......
Hello LUKE KING
Hello HOLLY DAVIDSON
Hello RACHEL ROBINSON
Hello GABRIELLA VAN BUREN
Hello BOB ICHABOD
Hello BOB ZIPPER
Hello GABRIELLA ZIPPER
Hello ZACH NIXON
Hello IRENE DAVIDSON
Hello ULYSSES POLK
Time taken: 1.004 seconds, Fetched 10 row(s)
Real-time Log Ingestion using Flink
- First, create a table named access_log_partitioned in Hive
hive> create table access_log_partitioned(country string, dt timestamp, cat string, count int) partitioned by (d date) row format delimited fields terminated by '\t' stored as textfile;
Set a SERDEPROPERTIES entry on access_log_partitioned so that it knows which format to use when parsing the timestamp field:
hive> alter table access_log_partitioned set SERDEPROPERTIES ('timestamp.formats'='yyyy-MM-dd HH:mm:ss');
Add a new partition, into which Flink will feed data:
hive> alter table access_log_partitioned add partition(d='2017-11-01');
Check Hive warehouse in HDFS
hive> dfs -ls /user/hive/warehouse/access_log_partitioned;
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-04-03 23:19 /user/hive/warehouse/access_log_partitioned/d=2017-11-01
- Run Flink and submit the job
hadoop@1053d6dea9e8:/src/week3/lab2/flink$ flink run --class com.example.flink.Ingestion --classpath file:///$PWD/resource/geoip-api-1.3.1.jar target/hourlyinjestion-0.0.1-SNAPSHOT.jar
- Simulate log growth using a Python script
hadoop@1053d6dea9e8:/src/week3/lab2/flink/log$ python process.py sample.txt > /mnt/access_log/localhost_access_log.2018-04-03.txt
- Check whether data is arriving in HDFS
hive> dfs -ls /user/hive/warehouse/access_log_partitioned/d=2017-11-01;
Found 1904 items
-rw-r--r--   1 hadoop supergroup         37 2018-04-03 23:35 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AE_02_real-estate
-rw-r--r--   1 hadoop supergroup         34 2018-04-03 23:35 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AE_02_shopping
-rw-r--r--   1 hadoop supergroup         33 2018-04-03 23:36 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AE_10_society
-rw-r--r--   1 hadoop supergroup         32 2018-04-03 23:36 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AE_10_travel
-rw-r--r--   1 hadoop supergroup         43 2018-04-03 23:37 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AE_12_social-networking
-rw-r--r--   1 hadoop supergroup         33 2018-04-03 23:37 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AE_12_unknown
-rw-r--r--   1 hadoop supergroup         43 2018-04-03 23:38 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AE_14_financial-service
-rw-r--r--   1 hadoop supergroup         33 2018-04-03 23:34 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AM_00_unknown
-rw-r--r--   1 hadoop supergroup         33 2018-04-03 23:37 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AM_11_unknown
-rw-r--r--   1 hadoop supergroup         46 2018-04-03 23:36 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AP_09_business-and-economy
-rw-r--r--   1 hadoop supergroup         33 2018-04-03 23:37 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AP_11_unknown
-rw-r--r--   1 hadoop supergroup         34 2018-04-03 23:34 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AR_00_auctions
-rw-r--r--   1 hadoop supergroup         46 2018-04-03 23:34 /user/hive/warehouse/access_log_partitioned/d=2017-11-01/AR_00_business-and-economy
......
- Yeah, we can see lots of data flowing into the Hive warehouse
hive> select * from access_log_partitioned limit 50;
OK
AE	2017-11-01 02:00:00	real-estate	1	2017-11-01
AE	2017-11-01 02:00:00	shopping	1	2017-11-01
AM	2017-11-01 00:00:00	unknown	1	2017-11-01
AR	2017-11-01 00:00:00	auctions	1	2017-11-01
AR	2017-11-01 00:00:00	business-and-economy	1	2017-11-01
AR	2017-11-01 00:00:00	computer-information	1	2017-11-01
AR	2017-11-01 00:00:00	healthy-and-medicine	1	2017-11-01
AR	2017-11-01 00:00:00	search-engine	2	2017-11-01
AR	2017-11-01 00:00:00	shopping	1	2017-11-01
AR	2017-11-01 00:00:00	software-download	1	2017-11-01
AR	2017-11-01 00:00:00	travel	1	2017-11-01
AR	2017-11-01 00:00:00	unknown	2	2017-11-01
AR	2017-11-01 01:00:00	business-and-economy	2	2017-11-01
AR	2017-11-01 01:00:00	entertainment-and-art	1	2017-11-01
AR	2017-11-01 01:00:00	internet-communication	1	2017-11-01
AR	2017-11-01 01:00:00	internet-portal	1	2017-11-01
AR	2017-11-01 01:00:00	search-engine	3	2017-11-01
AR	2017-11-01 01:00:00	unknown	2	2017-11-01
AR	2017-11-01 02:00:00	government	1	2017-11-01
AR	2017-11-01 02:00:00	unknown	1	2017-11-01
AR	2017-11-01 03:00:00	business-and-economy	1	2017-11-01
AR	2017-11-01 03:00:00	computer-information	1	2017-11-01
AR	2017-11-01 03:00:00	financial-service	1	2017-11-01
AR	2017-11-01 03:00:00	search-engine	2	2017-11-01
AR	2017-11-01 03:00:00	unknown	1	2017-11-01
AR	2017-11-01 04:00:00	unknown	1	2017-11-01
AR	2017-11-01 05:00:00	search-engine	1	2017-11-01
AR	2017-11-01 05:00:00	unknown	1	2017-11-01
AR	2017-11-01 06:00:00	search-engine	1	2017-11-01
AU	2017-11-01 00:00:00	entertainment-and-art	1	2017-11-01
AU	2017-11-01 00:00:00	search-engine	1	2017-11-01
AU	2017-11-01 01:00:00	search-engine	1	2017-11-01
AU	2017-11-01 02:00:00	web-mail	1	2017-11-01
AU	2017-11-01 03:00:00	business-and-economy	1	2017-11-01
AU	2017-11-01 03:00:00	computer-information	1	2017-11-01
AU	2017-11-01 03:00:00	healthy-and-medicine	1	2017-11-01
AU	2017-11-01 03:00:00	real-estate	1	2017-11-01
AU	2017-11-01 04:00:00	financial-service	1	2017-11-01
AU	2017-11-01 05:00:00	entertainment-and-art	1	2017-11-01
AU	2017-11-01 05:00:00	unknown	1	2017-11-01
BB	2017-11-01 02:00:00	search-engine	1	2017-11-01
BD	2017-11-01 01:00:00	personal-site-and-blog	1	2017-11-01
BD	2017-11-01 04:00:00	computer-information	1	2017-11-01
BD	2017-11-01 04:00:00	financial-service	1	2017-11-01
BD	2017-11-01 05:00:00	unknown	1	2017-11-01
BD	2017-11-01 06:00:00	online-advertisement	1	2017-11-01
BE	2017-11-01 04:00:00	unknown	1	2017-11-01
BH	2017-11-01 04:00:00	entertainment-and-art	1	2017-11-01
BN	2017-11-01 00:00:00	unknown	1	2017-11-01
BN	2017-11-01 03:00:00	computer-information	1	2017-11-01
Time taken: 0.306 seconds, Fetched: 50 row(s)
Data analysis by Hive
As Flink processes tons of data every second, our Hive access log table grows very quickly. Next, we will use Hive to do some analysis on that data.
- Load the history log into an access_log staging table, then move all of its rows into access_log_partitioned (a sketch of this step follows)
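The exact statements for this step aren't recorded in the transcript. A minimal sketch, assuming a tab-delimited staging table access_log with the same columns as the partitioned table; the history file name is a placeholder:

hive> create table access_log(country string, dt timestamp, cat string, count int) row format delimited fields terminated by '\t' stored as textfile;
hive> load data local inpath 'history_access_log.txt' into table access_log;
--- 'history_access_log.txt' is an assumed file name
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> insert into access_log_partitioned partition(d) select country, dt, cat, count, to_date(dt) from access_log;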
- Create the country mapping table and load local data into it (again, a sketch follows)
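The DDL for this table isn't shown either. A sketch assuming a two-column, tab-delimited file of ISO country codes and names; the columns code and country match the later join queries, while the file name is a placeholder:

hive> create table country_mapping(code string, country string) row format delimited fields terminated by '\t' stored as textfile;
hive> load data local inpath 'country_codes.txt' into table country_mapping;
--- 'country_codes.txt' is an assumed file name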
- Find the busiest hour
--- First, group by hour, count each hour's frequency, and sort by that frequency in descending order.
--- Then take the top record.
hive> select hour(dt), count(*) cnt from access_log_partitioned group by hour(dt) order by cnt desc limit 1;
....
OK
14	128765
- Find countries with more than the average total visits
--- First, find the average total visits per country
hive> select AVG(a.visits)
    > from
    > (select country, SUM(count) as visits from access_log_partitioned group by country) a;
......
OK
2356731.371681416
--- Then, find the countries whose total visits exceed that average
hive> select country, SUM(count) visits
    > from access_log_partitioned
    > group by country
    > having visits > 2356731.371681416;
...
OK
AR	12022976
AU	3568988
BO	17605652
BR	4867771
CA	7542249
CL	5227759
CO	21391952
DE	7828091
DO	2438458
DZ	6484983
EC	12835092
EG	5296202
ES	8281273
FR	11461862
GB	8044593
GT	2424293
ID	10870138
IL	2570654
IN	12021301
IT	18015359
MX	23807685
NL	3147264
PE	85283216
PH	50572854
PL	6718559
RO	3447958
RU	7740110
TR	4086138
US	90079276
VE	14599674
- Join with country_mapping
hive> select c.country, v.visits
    > from (select country, SUM(count) visits from access_log_partitioned group by country having visits > 2356731.371681416) v
    > join country_mapping c on v.country = c.code
    > order by v.visits desc;
......
OK
United States	90079276
Peru	85283216
Philippines	50572854
Mexico	23807685
Colombia	21391952
Italy	18015359
Bolivia	17605652
Venezuela	14599674
Ecuador	12835092
Argentina	12022976
India	12021301
France	11461862
Indonesia	10870138
Spain	8281273
United Kingdom	8044593
Germany	7828091
Russia	7740110
Canada	7542249
Poland	6718559
Algeria	6484983
Egypt	5296202
Chile	5227759
Brazil	4867771
Turkey	4086138
Australia	3568988
Romania	3447958
Netherlands	3147264
Israel	2570654
Dominican Republic	2438458
Guatemala	2424293
Data Visualization using Superset
Built some visualization charts and timelines based on the access_log_partitioned table, and built a dashboard to display them.
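Superset reaches these engines through SQLAlchemy URIs. Assuming PyHive is installed, connection strings along these lines would register our stack (host names and ports taken from the earlier steps; everything else is an assumption):

hive://hadoop@localhost:10000/default    (through HiveServer2)
presto://localhost:8080/hive/default     (through Presto)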
Learning Part 2
- Write a Hive windowing-function query that ranks countries by visit count within each day
First, we need to count every country's daily visits, which a simple group by handles.
Then we apply the window function rank() over a window partitioned by date, ordered by each country's daily visits in descending order.
jdbc:hive2://localhost:10000> select T.d, c.country, T.visits, rank() over(partition by T.d order by T.visits desc) as rank
. . . > from (select d, country, SUM(count) visits from access_log_partitioned group by d,country) T
. . . > join country_mapping c on c.code=T.country;
......
OK
+-------------+-----------------------------------+-----------+-------+
|     t.d     |             c.country             | t.visits  | rank  |
+-------------+-----------------------------------+-----------+-------+
| 2017-09-30  | Peru                              | 35        | 1     |
| 2017-09-30  | United States                     | 9         | 2     |
| 2017-09-30  | Germany                           | 4         | 3     |
| 2017-09-30  | Colombia                          | 4         | 3     |
| 2017-09-30  | Argentina                         | 4         | 3     |
| 2017-09-30  | Ecuador                           | 3         | 6     |
| 2017-09-30  | Mexico                            | 3         | 6     |
| 2017-09-30  | Venezuela                         | 3         | 6     |
| 2017-09-30  | Philippines                       | 3         | 6     |
| 2017-09-30  | Bolivia                           | 3         | 6     |
| 2017-09-30  | Canada                            | 2         | 11    |
| 2017-09-30  | United Kingdom                    | 2         | 11    |
| 2017-09-30  | Malaysia                          | 2         | 11    |
| 2017-09-30  | Chile                             | 2         | 11    |
| 2017-09-30  | Israel                            | 1         | 15    |
| 2017-09-30  | Italy                             | 1         | 15    |
| 2017-09-30  | Morocco                           | 1         | 15    |
| 2017-09-30  | Singapore                         | 1         | 15    |
| 2017-09-30  | Romania                           | 1         | 15    |
| 2017-09-30  | Australia                         | 1         | 15    |
| 2017-10-01  | Peru                              | 2690525   | 1     |
| 2017-10-01  | Philippines                       | 1750802   | 2     |
| 2017-10-01  | United States                     | 1287871   | 3     |
| 2017-10-01  | Mexico                            | 549741    | 4     |
| 2017-10-01  | Colombia                          | 544513    | 5     |
| 2017-10-01  | Ecuador                           | 502115    | 6     |
| 2017-10-01  | Bolivia                           | 453835    | 7     |
| 2017-10-01  | France                            | 387293    | 8     |
| 2017-10-01  | India                             | 370591    | 9     |
| 2017-10-01  | Venezuela                         | 331311    | 10    |
| 2017-10-01  | Argentina                         | 322516    | 11    |
| 2017-10-01  | Indonesia                         | 290125    | 12    |
| 2017-10-01  | Spain                             | 256080    | 13    |
| 2017-10-01  | Germany                           | 254847    | 14    |
| 2017-10-01  | United Kingdom                    | 250834    | 15    |
| 2017-10-01  | Algeria                           | 237420    | 16    |
| 2017-10-01  | Russia                            | 230438    | 17    |
| 2017-10-01  | Canada                            | 215608    | 18    |
| 2017-10-01  | Italy                             | 210959    | 19    |
| 2017-10-01  | Poland                            | 209299    | 20    |
| 2017-10-01  | Egypt                             | 165588    | 21    |
| 2017-10-01  | Turkey                            | 136573    | 22    |
.......
- Write a Hive query to find the second-largest category visited during the busiest day of the month
First, we need to find the busiest day of every month:
jdbc:hive2://localhost:10000> select T2.m, T2.day
. . . > from (
. . . >   select T1.m, T1.day, T1.visits, rank() over(partition by T1.m order by T1.visits desc) as rank
. . . >   from (select month(dt) m, day(dt) day, SUM(count) visits from access_log_partitioned group by month(dt),day(dt)) T1
. . . > ) T2
. . . > where rank = 1;
....
OK
+-------+---------+
| t2.m  | t2.day  |
+-------+---------+
| 9     | 30      |
| 10    | 16      |
| 11    | 1       |
+-------+---------+
We can store this SQL as a view named busistDay and run further queries on top of it.
jdbc:hive2://localhost:10000> create VIEW busistDay
. . . > as
. . . > select T2.m, T2.day
. . . > from (
. . . >   select T1.m, T1.day, T1.visits, rank() over(partition by T1.m order by T1.visits desc) as rank
. . . >   from (select month(dt) m, day(dt) day, SUM(count) visits from access_log_partitioned group by month(dt),day(dt)) T1
. . . > ) T2
. . . > where rank = 1;
Second, we need to retrieve all visiting records that happened on the busiest day of each month; to reuse the result in further queries, we store it as a view again.
jdbc:hive2://localhost:10000> create VIEW visitingRecords
. . . > as
. . . > select a.cat cat, a.count count, b.m month, b.day day
. . . > from access_log_partitioned a join busistDay b
. . . > on month(a.dt)=b.m and day(a.dt)=b.day;
Then we can query this view and find the second-largest category visited via the window function dense_rank(), so tied first places don't push the second rank out of reach:
0: jdbc:hive2://localhost:10000> select T2.cat, T2.visits, T2.month, T2.day
. . . > from (
. . . >   select T1.cat, T1.visits, T1.month, T1.day, dense_rank() over(partition by T1.month, T1.day order by T1.visits desc) as rank
. . . >   from (select cat, SUM(count) as visits, month, day from visitingRecords group by cat,month,day) T1
. . . > ) T2
. . . > where T2.rank = 2;
.....
OK
+-----------------------+------------+-----------+---------+
|        t2.cat         | t2.visits  | t2.month  | t2.day  |
+-----------------------+------------+-----------+---------+
| search-engine         | 14         | 9         | 30      |
| business-and-economy  | 2872456    | 10        | 16      |
| unknown               | 8          | 11        | 1       |
+-----------------------+------------+-----------+---------+