在线教育仓库示例-访问与咨询看板
建模分析
提取指标维度
根据主题的需求,我们可以看出,包含的指标有一些是可以提取合并的:
- 地区独立访客热力图、总访问客户量、时间段访问客户量趋势、来源渠道访问量占比、搜索来源访问量占比、活跃页面排行榜的指标都可以合并为一个:访问客户量。
- 合并后的访问客户量指标,维度不同,而且数据来源也不同。
- 访客咨询率趋势、客户访问量和访客咨询率双轴趋势,都包含了访客咨询率指标。
- 访客咨询率=发起咨询的人数/访问客户量,分母访问客户量,可以复用前面的指标。因此只需要计算出分子:咨询客户量。
由此我们可以推断出,指标有两个:访问客户量和咨询客户量。
访问客户量
虽然访问客户量的时间维度只有年、季度、月,但是展示粒度要具体到天,所以统计时也要包含日维度。同时要求能够下钻到小时数据,所以维度中也要包含小时。
来源渠道访客量占比,虽然最终要的是占比的比值,但是这个比值是可以通过具体的访客量计算出来,所以我们只需要提供不同来源渠道的访客量数据,柱状图前端就能够自动的计算比值。因此这里的指标也归类于访问客户量,维度为来源渠道。
活跃页面排行榜,字面看是和页面相关的指标,实际统计的却是每个页面的访问客户量,然后再进行排序后得出的排行榜。所以这里的指标也归类于访问客户量,统计的维度是具体的页面。但是要注意这里的数据来源有变化。
维度包括:年、季度、月、天、小时(天区间内小时段)、地区、来源渠道、搜索来源、会话来源页面、总访问量。
咨询客户量
访客咨询率趋势统计中,访客咨询率=发起咨询的人数/访问客户量,分母访问客户量我们可以复制上面的指标数据,而分子咨询客户量则是我们要统计的新指标。
维度包括:年、季度、月、日、地区、来源渠道。
分层设计
访问客户量
抽取咨询原始数据到ODS层以后,如何分析出对应维度的数据呢?
我们可以采取结果导向的方式来进行倒推:
- 最终的数据维度:年、季度、月、日、小时、天区间内小时、地区、来源渠道、搜索来源、会话来源页面、总访问量;
- 首先要有DWD层对ODS原始数据进行清洗和转换,作为我们的明细数据;
- 维度我们可以分为两类:时间维度(年、季度、月、日、小时)和产品属性维度(地区、来源渠道、搜索来源、会话来源页面、总访问量);
- 我们可以将产品属性维度和最小的时间粒度(小时)来统计,作为共享表,放置在DWM层;
- 在DWM层小时数据的基础上,进行上卷sum统计(年、季度、月、日、小时),即可得到DWS层的数据集市;
- 注意:最终的数据要求在统计之前,要先根据客户进行去重,这也对我们的中间层进行了限制,不能简单的先按天去重count,然后再按月和年sum,因为不同天的客户可能存在重复,直接sum会导致结果不正确;
- 所以DWM层不能进行count,那么如果将DWD客户去重后的数据,直接保存为中间表呢?也会存在问题,因为如果DWM层把全表的用户去重后,在DWS层数据会存在丢失减少的情况,比如小时数据和天数据。
- DWS层直接根据DWD的数据进行统计,得出数据集市;
- 将宽表数据导出到mysql,由FineBI灵活选择APP数据字段进行展示。
- ODS——》DWD——》DWS。
咨询客户量
- 最终的数据维度:年、季度、月、日、地区、来源渠道;
- 统计的数据和访问客户量指标相似,唯一的不同点是,多了一个条件:和客服有聊天信息;
- 因为咨询客户量的数据来源和访问客户量相同,所以ODS层可以复用;
- DWD层对ODS原始数据进行清洗和转换,可以复用;
- DWM层先去重,再在DWS中sum的结果是不正确的;所以跳过DWM层;
- DWS层直接在DWD层的基础上,加上聊天信息的条件后,按照维度进行统计。
列式存储和行式存储
- 行存储的特点: 查询满足条件的一整行(所有列)数据的时候,列存储则需要去每个聚集的字段找到对应的每个列的值,行存储只需要找到其中一个值,其余的值都在相邻地方,所以此时行存储查询的速度更快。
- 列存储的特点: 因为每个字段的数据聚集存储,在查询只需要少数几个字段的时候,能大大减少读取的数据量;每个字段的数据类型一定是相同的,列式存储可以针对性的设计更好的设计压缩算法。
TEXTFILE
默认格式,行式存储。可结合Gzip、Bzip2使用(系统自动检查,执行查询时自动解压),但使用这种方式,hive不会对数据进行切分,从而无法对数据进行并行操作。并且反序列化过程中,必须逐个字符判断是不是分隔符和行结束符,性能较差。
ORCFILE
使用ORC文件格式可以提高hive读、写和处理数据的能力。ORCFile是RCFile的升级版。
在ORC格式的hive表中,数据按行分块,每块按列存储。结合了行存储和列存储的优点。记录首先会被横向的切分为多个stripes,然后在每一个stripe内数据以列为单位进行存储,所有列的内容都保存在同一个文件中。
每个stripe的默认大小为256MB,相对于RCFile每个4MB的stripe而言,更大的stripe使ORC可以支持索引,数据读取更加高效。
zlib压缩
优点:压缩率比较高;hadoop本身支持,在应用中处理gzip格式的文件就和直接处理文本一样。
缺点:压缩性能一般。
snappy压缩
优点:高速压缩速度和合理的压缩率。
缺点:压缩率比zlib要低;hadoop本身不支持,需要安装(CDH版本已自动支持,可忽略)。
系统采用的格式
因为ORCFILE的压缩快、存取快,而且拥有特有的查询优化机制,所以系统采用ORCFILE存储格式(RCFILE升级版),压缩算法采用orc支持的ZLIB和SNAPPY。
在ODS数据源层,因为数据量较大,可以采用orcfile+ZLIB的方式,以节省磁盘空间;
而在计算的过程中(DWD、DWM、DWS、APP),为了不影响执行的速度,可以浪费一点磁盘空间,采用orcfile+SNAPPY的方式,提升hive的执行速度。
存储空间足够的情况下,推荐采用SNAPPY压缩。
动态分区
set hive.exec.dynamic.partition=true; 是开启动态分区
set hive.exec.dynamic.partition.mode=nonstrict; 这个属性默认值是strict,就是要求分区字段必须有一个是静态的分区值。全部动态分区插入,需要设置为nonstrict非严格模式。
代码中标红的部分,partition(year,month,day,hour) 就是要动态插入的分区。对于大批量数据的插入分区,动态分区相当方便。
静态分区和动态分区混用
insert overwrite table device_open partition(year='2017',month='05',day,hour)
partition(year='2017', month='05', day, hour),year和month是静态分区字段,day和hour是动态分区字段,这里指将2017年5月份的数据插入分区表,对应底层的物理操作就是将2017年5月份的数据load到hdfs上对应2017年5月份下的所有day和hour目录中去。
注意混用的情况下,静态分区的上层必须也是静态分区,如果partition(year, month, day=’05’, hour=’08’),则会报错:FAILED: SemanticException [Error 10094]: Line 1:50 Dynamic partition cannot be the parent of a static partition ''day''。
有序动态分区
可以设置hive.optimize.sort.dynamic.partition进行避免内存溢出问题:
设置为true后,当启用动态分区时,reducer仅随时保持一个记录写入程序,从而降低对 reducer产生的内存压力。但同时也会使查询性能变慢。
动态分区其他相关属性设置:
-- 原始数据导入mysql
create database nev default character set utf8mb4 collate utf8mb4_unicode_ci;
mysql -h 172.16.100.42 -P 4000 -uroot -p'84^j1TZ-+2b9A@S7Hs' nev < /mnt/nev.sql
-- hive ods层建库表
CREATE DATABASE IF NOT EXISTS `itcast_ods`;
-- 注意,设置ORC压缩格式前一定要先设置hive.exec.orc.compression.strategy,否则压缩不生效:
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 建表web_chat_ems
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.web_chat_ems (
id INT comment '主键',
create_date_time STRING comment '数据创建时间',
session_id STRING comment '七陌sessionId',
sid STRING comment '访客id',
create_time STRING comment '会话创建时间',
seo_source STRING comment '搜索来源',
seo_keywords STRING comment '关键字',
ip STRING comment 'IP地址',
area STRING comment '地域',
country STRING comment '所在国家',
province STRING comment '省',
city STRING comment '城市',
origin_channel STRING comment '投放渠道',
user_match STRING comment '所属坐席',
manual_time STRING comment '人工开始时间',
begin_time STRING comment '坐席领取时间 ',
end_time STRING comment '会话结束时间',
last_customer_msg_time_stamp STRING comment '客户最后一条消息的时间',
last_agent_msg_time_stamp STRING comment '坐席最后一下回复的时间',
reply_msg_count INT comment '客服回复消息数',
msg_count INT comment '客户发送消息数',
browser_name STRING comment '浏览器名称',
os_info STRING comment '系统名称')
comment '访问会话信息表'
PARTITIONED BY(starts_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_ods.db/web_chat_ems_ods'
TBLPROPERTIES ('orc.compress'='ZLIB');
-- 建表web_chat_text_ems
CREATE EXTERNAL TABLE IF NOT EXISTS itcast_ods.web_chat_text_ems (
id INT COMMENT '主键来自MySQL',
referrer STRING comment '上级来源页面',
from_url STRING comment '会话来源页面',
landing_page_url STRING comment '访客着陆页面',
url_title STRING comment '咨询页面title',
platform_description STRING comment '客户平台信息',
other_params STRING comment '扩展字段中数据',
history STRING comment '历史访问记录'
) comment 'EMS-PV测试表'
PARTITIONED BY(start_time STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_ods.db/web_chat_text_ems_ods'
TBLPROPERTIES ('orc.compress'='ZLIB');
-- DWD层
-- 维度:
-- 时间维度:年、季度、月、天、小时
-- 业务属性维度:地区、来源渠道、搜索来源、会话来源页面、总访问量。
-- 建库:
CREATE DATABASE IF NOT EXISTS `itcast_dwd`
WITH DBPROPERTIES ( 'creator' = 'tang', 'create_date' = '2024-10-14');
-- 将ODS层数据,进行清洗转换,并且将web_chat_ems主表和web_chat_text_ems附表的内容根据id合并在一起。数据粒度保持不变。
-- 数据清洗:空数据、不满足业务需求的数据处理。
-- 数据转换:数据格式和数据形式的转换,比如时间类型可以转换为同样的展现形式“yyyy-MM-dd HH:mm:ss”或者时间戳类型,金钱类型的数据可以统一转换为以元为单位或以分为单位的数值。
create table if not exists itcast_dwd.visit_consult_dwd(
session_id STRING comment '七陌sessionId',
sid STRING comment '访客id',
create_time bigint comment '会话创建时间',
seo_source STRING comment '搜索来源',
ip STRING comment 'IP地址',
area STRING comment '地域',
msg_count int comment '客户发送消息数',
origin_channel STRING COMMENT '来源渠道',
referrer STRING comment '上级来源页面',
from_url STRING comment '会话来源页面',
landing_page_url STRING comment '访客着陆页面',
url_title STRING comment '咨询页面title',
platform_description STRING comment '客户平台信息',
other_params STRING comment '扩展字段中数据',
history STRING comment '历史访问记录',
hourinfo string comment '小时',
quarterinfo string comment '季度'
)
comment '访问咨询DWD表'
partitioned by(yearinfo String,monthinfo String,dayinfo string)
row format delimited fields terminated by '\t'
stored as orc
location '/user/hive/warehouse/itcast_dwd.db/visit_consult_dwd'
tblproperties ('orc.compress'='SNAPPY');
-- DWS层
-- 在DWD层的基础上,按照业务的要求进行统计分析;时间和业务属性三个维度分类,可以在模型中增加对应的属性标识:
-- 时间维度:1.年、2.季度、3.月、4.天、5.小时
-- 业务属性维度:1.地区、2.来源渠道、3.搜索来源、4.会话来源页面、5.总访问量
-- 建库:
CREATE DATABASE IF NOT EXISTS `itcast_dws`
WITH DBPROPERTIES ( 'creator' = 'tang', 'create_date' = '2024-10-14');
-- 访问量统计结果表
CREATE TABLE IF NOT EXISTS itcast_dws.visit_dws (
sid_total INT COMMENT '根据sid去重求count',
sessionid_total INT COMMENT '根据sessionid去重求count',
ip_total INT COMMENT '根据IP去重求count',
area STRING COMMENT '区域信息',
seo_source STRING COMMENT '搜索来源',
origin_channel STRING COMMENT '来源渠道',
hourinfo STRING COMMENT '创建时间,统计至小时',
time_str STRING COMMENT '时间明细',
from_url STRING comment '会话来源页面',
groupType STRING COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.总访问量',
time_type STRING COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;')
comment 'EMS访客日志dws表'
PARTITIONED BY(yearinfo STRING,quarterinfo STRING,monthinfo STRING,dayinfo STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
stored as orc
location '/user/hive/warehouse/itcast_dws.db/visit_dws'
TBLPROPERTIES ('orc.compress'='SNAPPY');
-- 咨询量统计结果表
CREATE TABLE IF NOT EXISTS itcast_dws.consult_dws
(
sid_total INT COMMENT '根据sid去重求count',
sessionid_total INT COMMENT '根据sessionid去重求count',
ip_total INT COMMENT '根据IP去重求count',
area STRING COMMENT '区域信息',
origin_channel STRING COMMENT '来源渠道',
hourinfo STRING COMMENT '创建时间,统计至小时',
time_str STRING COMMENT '时间明细',
groupType STRING COMMENT '产品属性类型:1.地区;2.来源渠道',
time_type STRING COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;'
)
COMMENT '咨询量DWS宽表'
PARTITIONED BY (yearinfo string,quarterinfo STRING, monthinfo STRING, dayinfo string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC
LOCATION '/user/hive/warehouse/itcast_dws.db/consult_dws'
TBLPROPERTIES ('orc.compress'='SNAPPY');
sqoop导入原始数据至hive
-- 导入数据的SQL语句:
-- 访问咨询主表:
SELECT
id,
create_date_time,
session_id,
sid,
create_time,
seo_source,
seo_keywords,
ip,
AREA,
country,
province,
city,
origin_channel,
USER AS user_match,
manual_time,
begin_time,
end_time,
last_customer_msg_time_stamp,
last_agent_msg_time_stamp,
reply_msg_count,
msg_count,
browser_name,
os_info,
'2021-09-24' AS starts_time
FROM
web_chat_ems_2019_07
-- 访问咨询附属表
SELECT
*, '2021-09-24' AS start_time
FROM web_chat_text_ems_2019_07
# 执行sqoop脚本, 完成数据采集
# mysql -h 172.16.100.42 -P 4000 -uroot -p'84^j1TZ-+2b9A@S7Hs' new < /mnt/nev.sql
# -- 访问咨询主表
sqoop import \
--connect jdbc:mysql://172.16.100.42:4000/nev \
--username root \
--password 84^j1TZ-+2b9A@S7Hs \
--query 'SELECT
id,create_date_time,session_id,sid,create_time,seo_source,seo_keywords,ip,
AREA,country,province,city,origin_channel,USER AS user_match,manual_time,begin_time,end_time,
last_customer_msg_time_stamp,last_agent_msg_time_stamp,reply_msg_count,
msg_count,browser_name,os_info, "2021-09-24" AS starts_time
FROM web_chat_ems_2019_07 where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_ems \
-m 1
# -- 访问咨询附属表
sqoop import \
--connect jdbc:mysql://172.16.100.42:4000/nev \
--username root \
--password 84^j1TZ-+2b9A@S7Hs \
--query 'SELECT
*, "2021-09-24" AS start_time
FROM web_chat_text_ems_2019_07 where 1=1 and $CONDITIONS' \
--hcatalog-database itcast_ods \
--hcatalog-table web_chat_text_ems \
-m 1
# 验证数据
SELECT * from itcast_ods.web_chat_ems limit 10;
SELECT * FROM itcast_ods.web_chat_text_ems limit 10;
数据清洗转换
目的: 将ODS层数据导入到DWD层
DWD层作用:
1) 清洗转换操作
2) 少量维度退化操作
思考1: 是否需要做清洗转换操作, 如果需要做什么呢?
- 清洗操作: 不需要
- 转换操作: 将create_time日期 转换为 yearinfo quarterinfo monthinfo dayinfo hourinfo
- 额外加一个转换: 将create_time日期数据转换为时间戳
思考2: 是否需要进行维度退化操作, 如果需要做什么呢?
- 需要的, 将两个事实表合并称为一个事实表
-- SQL的实现: 未完成转换操作
select
wce.session_id,
wce.sid,
wce.create_time, -- 此处需要转换: 将字符串日期转换时间戳
wce.seo_source,
wce.ip,
wce.area,
wce.msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
wce.create_time as hourinfo, -- 此处需求转换
wce.create_time as yearinfo, -- 此处需求转换
wce.create_time as quarterinfo, -- 此处需求转换
wce.create_time as monthinfo, -- 此处需求转换
wce.create_time as dayinfo -- 此处需求转换
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
on wce.id = wcte.id;
-- 转换1: 将create_time 转换为 int类型的数据 (说白: 转换为时间戳)
-- 方案: 日期 转换时间戳的函数 unix_timestamp(string date, string pattern)
-- 案例:
select unix_timestamp('2019-07-01 23:45:00', 'yyyy-MM-dd HH:mm:ss')
-- 转换2: 将create_time转换为 yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo:
-- 方案一: 通过 year() quarter() month() day() hour()
select year('2019-07-01 23:45:00') ; -- 2019
select month('2019-07-01 23:45:00') ; -- 7
select day('2019-07-01 23:45:00') ; -- 1
select hour('2019-07-01 23:45:00') ; -- 23
select quarter('2019-07-01 23:45:00'); -- 3
-- 方案二: 通过字符串的截取操作 substr('字符串',从第几个截取, 截取多少个)
select substr('2019-07-01 23:45:00',1,4); --2019
select substr('2019-07-01 23:45:00',6,2); -- 07
select substr('2019-07-01 23:45:00',9,2); -- 01
select substr('2019-07-01 23:45:00',12,2); -- 23
-- 实现最终转换的SQL
select
wce.session_id,
wce.sid,
unix_timestamp(wce.create_time) as create_time,
wce.seo_source,
wce.ip,
wce.area,
wce.msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
substr(wce.create_time,12,2) as hourinfo,
substr(wce.create_time,1,4) as yearinfo,
quarter(wce.create_time) as quarterinfo,
substr(wce.create_time,6,2) as monthinfo,
substr(wce.create_time,9,2) as dayinfo
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
on wce.id = wcte.id;
-- 注意:
-- 在执行转换操作的时候, 由于需要进行二表联查操作, 其中一个表数据量比较少, 此时hive会对其优化,
-- 采用map join的方案进行处理, 而map join需要将小表的数据加载到内存中, 但是内存不足, 导致出现内存溢出错误, 此错误的报错可能会两个信息:
-- 第一个错误信息: return code 1
-- 第二个错误信息: return code -137 (等待一会会爆出来)
-- 解决方案:
-- 关闭掉map join 让其采用reduce join即可
-- 如何关闭呢?
-- set hive.auto.convert.join= false;
-- 将结果数据灌入到DWD层的表中
beeline -u jdbc:hive2://kube-40:10000 -n admin
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.exec.max.created.files=100000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
insert into table itcast_dwd.visit_consult_dwd partition(yearinfo,monthinfo,dayinfo)
select
wce.session_id,
wce.sid,
unix_timestamp(wce.create_time) as create_time,
wce.seo_source,
wce.ip,
wce.area,
wce.msg_count,
wce.origin_channel,
wcte.referrer,
wcte.from_url,
wcte.landing_page_url,
wcte.url_title,
wcte.platform_description,
wcte.other_params,
wcte.history,
substr(wce.create_time,12,2) as hourinfo,
quarter(wce.create_time) as quarterinfo,
substr(wce.create_time,1,4) as yearinfo,
substr(wce.create_time,6,2) as monthinfo,
substr(wce.create_time,9,2) as dayinfo
from itcast_ods.web_chat_ems wce join itcast_ods.web_chat_text_ems wcte
on wce.id = wcte.id;
数据分析
目的: 将DWD层数据灌入到DWS层
DWS层作用: 细化维度统计操作
-- 如何计算访问量:
-- 访问量:
-- 固有维度:
-- 时间维度: 年 季度 月 天 小时
-- 产品属性维度:
-- 地区维度
-- 来源渠道
-- 搜索来源
-- 受访页面
-- 总访问量
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.exec.max.created.files=100000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计每年的总访问量
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'-1' as from_url,
'5' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo;
-- 统计每年每季度的总访问量
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo;
-- 统计每年每季度每月的总访问量
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo;
-- 统计每年每季度每月每天的总访问量
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo;
-- 统计每年每季度每月每天每小时的总访问量
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'-1' as from_url,
'5' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.exec.max.created.files=100000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计每年各个受访页面的访问量
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
from_url,
'4' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,from_url;
-- 统计每年,每季度各个受访页面的访问量
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
from_url,
'4' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,from_url;
-- 统计每年,每季度,每月各个受访页面的访问量
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
from_url,
'4' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,from_url;
-- 统计每年,每季度,每月.每天各个受访页面的访问量
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
from_url,
'4' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,from_url;
-- 统计每年,每季度,每月.每天,每小时各个受访页面的访问量
insert into table itcast_dws.visit_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as seo_source,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
from_url,
'4' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,from_url;
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.exec.max.created.files=100000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计每年的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'3' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo;
-- 统计每年每季度的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'3' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo;
-- 统计每年每季度每月的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'3' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo;
-- 统计每年每季度每月每天的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'3' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo,dayinfo;
-- 统计每年每季度每月每天每小时的总咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
'-1' as area,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'3' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo;
--动态分区配置
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=1000;
set hive.exec.max.created.files=100000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
-- 统计每年各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
yearinfo as time_str,
'1' as grouptype,
'5' as time_type,
yearinfo,
'-1' as quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,area;
-- 统计每年每季度各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'_',quarterinfo) as time_str,
'1' as grouptype,
'4' as time_type,
yearinfo,
quarterinfo,
'-1' as monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,area;
-- 统计每年每季度每月各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo) as time_str,
'1' as grouptype,
'3' as time_type,
yearinfo,
quarterinfo,
monthinfo,
'-1' as dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo,area;
-- 统计每年每季度每月每天各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
'-1' as hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo) as time_str,
'1' as grouptype,
'2' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo,dayinfo,area;
-- 统计每年每季度每月每天每小时各个地区的咨询量
insert into table itcast_dws.consult_dws partition(yearinfo,quarterinfo,monthinfo,dayinfo)
select
count(distinct sid) as sid_total,
count(distinct session_id) as sessionid_total,
count(distinct ip) as ip_total,
area,
'-1' as origin_channel,
hourinfo,
concat(yearinfo,'-',monthinfo,'-',dayinfo,' ',hourinfo) as time_str,
'1' as grouptype,
'1' as time_type,
yearinfo,
quarterinfo,
monthinfo,
dayinfo
from itcast_dwd.visit_consult_dwd where msg_count >= 1
group by yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,area;
数据导出
目的: 从hive的DWS层将数据导出到mysql中对应目标表中
-- 第一步: 在mysql中创建目标表:
create database scrm_bi default character set utf8mb4 collate utf8mb4_general_ci;
-- 访问量的结果表:
CREATE TABLE IF NOT EXISTS scrm_bi.visit_dws (
sid_total INT COMMENT '根据sid去重求count',
sessionid_total INT COMMENT '根据sessionid去重求count',
ip_total INT COMMENT '根据IP去重求count',
area varchar(32) COMMENT '区域信息',
seo_source varchar(32) COMMENT '搜索来源',
origin_channel varchar(32) COMMENT '来源渠道',
hourinfo varchar(32) COMMENT '创建时间,统计至小时',
time_str varchar(32) COMMENT '时间明细',
from_url varchar(256) comment '会话来源页面',
groupType varchar(32) COMMENT '产品属性类型:1.地区;2.搜索来源;3.来源渠道;4.会话来源页面;5.总访问量',
time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
yearinfo varchar(32) COMMENT '年' ,
quarterinfo varchar(32) COMMENT '季度',
monthinfo varchar(32) COMMENT '月',
dayinfo varchar(32) COMMENT '天'
)comment 'EMS访客日志dws表';
-- 咨询量的结果表:
CREATE TABLE IF NOT EXISTS scrm_bi.consult_dws
(
sid_total INT COMMENT '根据sid去重求count',
sessionid_total INT COMMENT '根据sessionid去重求count',
ip_total INT COMMENT '根据IP去重求count',
area varchar(32) COMMENT '区域信息',
origin_channel varchar(32) COMMENT '来源渠道',
hourinfo varchar(32) COMMENT '创建时间,统计至小时',
time_str varchar(32) COMMENT '时间明细',
groupType varchar(32) COMMENT '产品属性类型:1.地区;2.来源渠道',
time_type varchar(32) COMMENT '时间聚合类型:1、按小时聚合;2、按天聚合;3、按月聚合;4、按季度聚合;5、按年聚合;',
yearinfo varchar(32) COMMENT '年' ,
quarterinfo varchar(32) COMMENT '季度',
monthinfo varchar(32) COMMENT '月',
dayinfo varchar(32) COMMENT '天'
)COMMENT '咨询量DWS宽表';
# 第二步执行sqoop的数据导出
# -- 先导出 咨询量数据
sqoop export \
--connect jdbc:mysql://172.16.100.42:4000/scrm_bi \
--username root \
--password 84^j1TZ-+2b9A@S7Hs \
--table consult_dws \
--hcatalog-database itcast_dws \
--hcatalog-table consult_dws \
-m 1
# 解决乱码:
sqoop export \
--connect "jdbc:mysql://172.16.100.42:4000/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 84^j1TZ-+2b9A@S7Hs \
--table consult_dws \
--hcatalog-database itcast_dws \
--hcatalog-table consult_dws \
-m 1
# 完成访问量数据导出
sqoop export \
--connect "jdbc:mysql://172.16.100.42:4000/scrm_bi?useUnicode=true&characterEncoding=utf-8" \
--username root \
--password 84^j1TZ-+2b9A@S7Hs \
--table visit_dws \
--hcatalog-database itcast_dws \
--hcatalog-table visit_dws \
-m 1