子查询
接下来将统计每天每个渠道的所有事件,
最后一步是在创建的事件列中计算平均值,为此,我们需要在上述查询结果中进行再好查询。将查询放在括号中,赋一个名字(就像为表格添加别名一样),作为外查询的FROM来源,那么它就成了查询中的查询,即子查询。
总体流程,先运行内层查询,内层查询必须独立运行,然后外查询从内查询的结果集中再查询。
整个流程如下:
- 首先,我们需要按照日期和渠道分组。然后按事件数(第三列)排序,这样可以快速得出每个渠道每天的事件发生次数。
SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2
ORDER BY 3 DESC;
- 现在创建一个子查询,只是用来提供第一个查询中的所有数据。可以看出,要获得这一结果,提供了整个原始表格。查询的附加部分包括
*
,并且我们需要为表格设置别名。此外,是在SELECT
语句中(而不是FROM
)中提供表格。
SELECT *
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2
ORDER BY 3 DESC) sub;
- 最后,我们在以下语句中能够获得显示每个渠道一天的平均事件数的表格。
SELECT channel, AVG(events) AS average_events
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2) sub
GROUP BY channel
ORDER BY 2 DESC;
练习
子查询的格式
子查询格式
在编写子查询时,查询很容易就看起来很复杂。为了便于阅读,其实日后经常只是你自己要阅读:
要记住的重要事项是,在使用子查询时,要让读者能够轻松地判断查询的哪个部分将一起执行。大部分人的做法是按照某种方式缩进子查询,上一页面的解决方案就是这么做的。
这节课的示例缩进很明显,一直到小括号。如果你嵌套了很多的子查询,则不适用,一般法则就是思考下如何以便于阅读的方式编写查询。下面给出了以多种方式编写同一查询的示例。你会发现,某些示例明显比其他的容易阅读。
格式糟糕的查询
虽然这些格式糟糕的查询和格式清晰的查询一样会执行,但是却不容易让人理解查询的作用!
以下是第一个示例,根本无法判断查询的作用:
SELECT * FROM (SELECT DATE_TRUNC('day',occurred_at) AS day, channel, COUNT(*) as events FROM web_events GROUP BY 1,2 ORDER BY 3 DESC) sub;
下面的第二个示例不是太糟糕,但是你会发现最后一个示例依然更容易读懂。
SELECT *
FROM (
SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2
ORDER BY 3 DESC) sub;
格式清晰的查询
与之前的示例相比,在这个格式清晰的示例中,我们很容易就看出要从哪个表格中获取数据。此外,如果在子查询后面有 GROUP BY、ORDER BY、WHERE、HAVING 或任何其他语句,则按照外部查询的同一级别缩进,正如最后一个示例所显示的,它是上个练习的最后一个解决方案。
SELECT *
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2
ORDER BY 3 DESC) sub;
下面的查询很相似,但是向外部查询应用了其他逻辑,因此按照外部查询的级别缩进。而内部查询逻辑的缩进级别与内部表格匹配。
SELECT *
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2
ORDER BY 3 DESC) sub
GROUP BY channel
ORDER BY 2 DESC;
最后两个查询容易读懂多了!
子查询
在你写的第一个子查询中,你编写了一个子查询来创建表格,然后可以在 FROM 语句中查询该表格。但是,如果只返回一个值,则可以在逻辑语句中使用该值,例如 WHERE、HAVING,甚至 SELECT,该值可以嵌套在 CASE 语句中。
在下一页面中,我们将讲解这个示例,然后你将尝试自己回答一些问题。
提示
注意,在条件语句中编写子查询时,不能包含别名。这是因为该子查询会被当做单个值(或者对于 IN 情况是一组值),而不是一个表格。
同时注意,这里的查询对应的是单个值。如果我们返回了整个列,则需要使用 IN 来执行逻辑参数。如果我们要返回整个表格,则必须为该表格使用别名,并对整个表格执行其他逻辑。
类似的查询方式:
以下是从 orders 表格中获取第一个订单的年/月信息的查询。
SELECT DATE_TRUNC('month', MIN(occurred_at))
FROM orders;
然后,为了获取每个订单的平均值,我们可以在一个查询中执行所有的任务。但是为了便于阅读,我在下面提供了两个查询,单独执行每一步。
SELECT AVG(standard_qty) avg_std, AVG(gloss_qty) avg_gls, AVG(poster_qty) avg_pst
FROM orders
WHERE DATE_TRUNC('month', occurred_at) =
(SELECT DATE_TRUNC('month', MIN(occurred_at)) FROM orders);
SELECT SUM(total_amt_usd)
FROM orders
WHERE DATE_TRUNC('month', occurred_at) =
(SELECT DATE_TRUNC('month', MIN(occurred_at)) FROM orders);
练习 爱上子查询
提供每个区域拥有最高销售额 (total_amt_usd) 的销售代表的姓名。
- 首先,我要算出与每个销售代表相关的总销售额 (total_amt_usd),并且要得出他们所在的区域。以下查询提供了这一信息。
SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC;
- 接着,得出每个区域的最高销售额,然后使用该信息从最终结果中获取这些行。
SELECT region_name, MAX(total_amt) total_amt
FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1, 2) inner1
GROUP BY 1;
- 本质上,这是两个表格的连接,其中区域和销售额相匹配。
SELECT t1.rep_name, t1.region_name, t1.total_amt
FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC) t1
JOIN (SELECT region_name, MAX(total_amt) total_amt
FROM(SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1, 2) inner1
GROUP BY 1) t2
ON t1.region_name = t2.region_name AND t1.total_amt = t2.total_amt;
- 对于具有最高销售额 (total_amt_usd) 的区域,总共下了多少个订单?
我写的第一个查询是获取每个区域的 total_amt_usd。
SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name;
然后,我们仅从该表格中获取销售额最高的区域。可以通过两种方法来获取,一种是使用子查询后的最大值,另一种是按降序排序,然后获取最高值。
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name) sub;
最终,我们要获取具有该区域销售额的总订单量:
SELECT r.name, SUM(o.total) total_orders
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (
SELECT MAX(total_amt)
FROM (SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name) sub);
结果就是 Northeast,总订单为 1230378 个。
- 对于购买标准纸张数量 (standard_qty) 最多的客户(在作为客户的整个时期内),有多少客户的购买总数依然更多?
首先,我们要得出购买标准纸张数量 (standard_qty) 最多的客户。以下查询获取了该客户,以及总消费:
SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1;
现在,我将使用上述信息获取总消费更高的所有客户:
SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total
FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1) sub);
上述查询列出了具有更多订单的客户列表。我们还可以使用另一个简单的子查询获取数量。
SELECT COUNT(*)
FROM (SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total
FROM (SELECT a.name act_name, SUM(o.standard_qty) tot_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1) inner_tab)
) counter_tab;
- 对于(在作为客户的整个时期内)总消费 (total_amt_usd) 最多的客户,他们在每个渠道上有多少 web_events?
我们首先需要获取在整个客户时期内消费最多的客户。
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1;
现在,我们要获取该企业(可以使用 id 进行匹配)在每个渠道上的事件数。
SELECT a.name, w.channel, COUNT(*)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id AND a.id = (SELECT id
FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1) inner_table)
GROUP BY 1, 2
ORDER BY 3 DESC;
我添加了 ORDER BY,其实并没特别的理由,并添加了客户名称,确保仅从一个客户那获取数据。
- 对于总消费前十名的客户,他们的平均终身消费 (total_amt_usd) 是多少?
首先,我们需要找出总消费 (total_amt_usd) 在前十名的客户。
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10;
现在计算这十个客户的平均消费。
SELECT AVG(tot_spent)
FROM (SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10) temp;
- 比所有客户的平均消费高的企业平均终身消费 (total_amt_usd) 是多少?
首先,算出所有客户的总消费 (total_amt_usd) 平均值:
SELECT AVG(o.total_amt_usd) avg_all
FROM orders o
JOIN accounts a
ON a.id = o.account_id;
然后,只获取高于这一平均值的客户。
SELECT o.account_id, AVG(o.total_amt_usd)
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
FROM orders o
JOIN accounts a
ON a.id = o.account_id);
最后,算出这些值的平均值。
SELECT AVG(avg_amt)
FROM (SELECT o.account_id, AVG(o.total_amt_usd) avg_amt
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT AVG(o.total_amt_usd) avg_all
FROM orders o
JOIN accounts a
ON a.id = o.account_id)) temp_table;
WITH
WITH 语句经常称为公用表表达式(简称 CTE)。虽然这些表达式和子查询的目的完全一样,但是实际更常用,因为对未来的读者来说,更容易看懂其中的逻辑。
练习
复习
问题:你需要算出每个渠道每天的平均事件数。
解决方案:
SELECT channel, AVG(events) AS average_events
FROM (SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2) sub
GROUP BY channel
ORDER BY 2 DESC;
我们使用 WITH 语句重新编写查询。
注意:你可以获取内部查询:
SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2
我们在此部分放入 WITH 语句。注意,在下面我们将表格的别名设为 events:
WITH events AS (
SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2)
现在,我们可以像对待数据库中的任何其他表格一样使用这个新创建的 events 表格:
WITH events AS (
SELECT DATE_TRUNC('day',occurred_at) AS day,
channel, COUNT(*) as events
FROM web_events
GROUP BY 1,2)
SELECT channel, AVG(events) AS average_events
FROM events
GROUP BY channel
ORDER BY 2 DESC;
对于上述示例,我们只需一个额外的表格,但是想象下我们要创建第二个表格来从中获取数据。我们可以按照以下方式来创建额外的表格并从中获取数据:
WITH table1 AS (
SELECT *
FROM web_events),
table2 AS (
SELECT *
FROM accounts)
SELECT *
FROM table1
JOIN table2
ON table1.account_id = table2.id;
然后,你可以按照相同的方式使用 WITH 语句添加越来越多的表格。底部的练习将确保你掌握了这些新查询的所有必要组成部分。
练习
WITH 练习
本质上,WITH 语句和子查询执行的任务相同。因此,你可以使用 WITH 编写之前的任何一个查询。尝试使用 WITH(而不是子查询)再次执行之前的每个查询。
以上是数据库的 ERD,当你处理下面的练习时,可能会用到该 ERD。你应该将答案写成 WITH 语句,而不是得出一个答案并复制输出。这么做的重要性是查询能够动态地回答问题,即使数据变化了,依然能获得正确的答案。
以下是使用 WITH 条件重新编写的之前问题的答案。这些查询通常更容易读懂。
提供每个区域拥有最高销售额 (total_amt_usd) 的销售代表的姓名。
WITH t1 AS (
SELECT s.name rep_name, r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY 1,2
ORDER BY 3 DESC),
t2 AS (
SELECT region_name, MAX(total_amt) total_amt
FROM t1
GROUP BY 1)
SELECT t1.rep_name, t1.region_name, t1.total_amt
FROM t1
JOIN t2
ON t1.region_name = t2.region_name AND t1.total_amt = t2.total_amt;
对于具有最高销售额 (total_amt_usd) 的区域,总共下了多少个订单?
WITH t1 AS (
SELECT r.name region_name, SUM(o.total_amt_usd) total_amt
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name),
t2 AS (
SELECT MAX(total_amt)
FROM t1)
SELECT r.name, SUM(o.total) total_orders
FROM sales_reps s
JOIN accounts a
ON a.sales_rep_id = s.id
JOIN orders o
ON o.account_id = a.id
JOIN region r
ON r.id = s.region_id
GROUP BY r.name
HAVING SUM(o.total_amt_usd) = (SELECT * FROM t2);
- 对于购买标准纸张数量 (standard_qty) 最多的客户(在作为客户的整个时期内),有多少客户的购买总数依然更多?
WITH t1 AS (
SELECT a.name account_name, SUM(o.standard_qty) total_std, SUM(o.total) total
FROM accounts a
JOIN orders o
ON o.account_id = a.id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 1),
t2 AS (
SELECT a.name
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY 1
HAVING SUM(o.total) > (SELECT total FROM t1))
SELECT COUNT(*)
FROM t2;
- 对于(在作为客户的整个时期内)总消费 (total_amt_usd) 最多的客户,他们在每个渠道上有多少 web_events?
```sql
WITH t1 AS (
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 1)
SELECT a.name, w.channel, COUNT(*)
FROM accounts a
JOIN web_events w
ON a.id = w.account_id AND a.id = (SELECT id FROM t1)
GROUP BY 1, 2
ORDER BY 3 DESC;
- 对于总消费前十名的客户,他们的平均终身消费 (total_amt_usd) 是多少?
WITH t1 AS (
SELECT a.id, a.name, SUM(o.total_amt_usd) tot_spent
FROM orders o
JOIN accounts a
ON a.id = o.account_id
GROUP BY a.id, a.name
ORDER BY 3 DESC
LIMIT 10)
SELECT AVG(tot_spent)
FROM t1;
- 比所有客户的平均消费高的企业平均终身消费 (total_amt_usd) 是多少?
WITH t1 AS (
SELECT AVG(o.total_amt_usd) avg_all
FROM orders o
JOIN accounts a
ON a.id = o.account_id),
t2 AS (
SELECT o.account_id, AVG(o.total_amt_usd) avg_amt
FROM orders o
GROUP BY 1
HAVING AVG(o.total_amt_usd) > (SELECT * FROM t1))
SELECT AVG(avg_amt)
FROM t2;
哇!太夸张了,如果你都答对了,那么很棒!