카일스쿨 BigQuery 바짝스터디/주차별 과제

인프런 빅쿼리 빠짝스터디 2주차 과제

gooreumsea 2024. 11. 3. 22:18

대체적으로 연습 및 실무에서 윈도우 함수 관련 쿼리문을 많이 작성해봤던터라, 강의 이해가 잘되었던 것 같다.

 

FIRST VALUE, LAST VALUE, QUALIFY 연습문제 등을 통해서 몰랐던 새로운 함수들을 알게되었고

특히 QUALIFY 함수는 유용하게 쓰일 수 있을 것 같다.!!!

 

 

 

탐색함수 연습문제 1~3


문제 1) user들의 다음 접속 월과 다다음 접속 월을 구하는 쿼리를 작성해주세요.

LEAD는 반드시 정렬이 먼저 되어야 함 -> ORDER BY 추가

SELECT *,
       LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS lead_visit_month,
       LEAD(visit_month,2) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS lead2_visit_month



FROM advanced.analytics_function_01


문제 2) user들의 다음 접속 월과 다다음 접속 월, 이전 접속 월을 구하는 쿼리를 
작성해주세요.


SELECT *,
       LEAD(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS after_visit_month,
       LEAD(visit_month,2) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS after2_visit_month,
       LAG(visit_month) OVER (PARTITION BY user_id ORDER BY visit_month ASC) AS before_visit_month

ORDER BY user_id, visit_month       
FROM advanced.analytics_function_01


문제 3) 유저가 접속 했을 때, 다음 접속까지의 간격


SELECT *,
       after_visit_month - visit_month AS diff

FROM
(       
    SELECT *,
    LEAD(visit_month, 1) OVER (PARTITION BY user_id ORDER BY visit_month) AS after_visit_month

    FROM advanced.analytics_function_01
)

 

 

 

 

FIRST VALUE -> 첫번째 값 구하기 (NULL 값 포함) \, NULL 값을 제외하고 싶으면, OVER 앞 IGNORE NULLS 추가
LAST VALUE -> 마지막 값 구하기 (NULL 값 포함), NULL 값을 제외하고 싶으면, OVER 앞 IGNORE NULLS 추가


집계함수는 NULL을 제외 후 연산, COUNT * 제외


* 파티션이 없을 때
프레임 범위 기본값
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING


* 파티션이 있을 때
프레임 범위 기본값
파티션 내에서 현재 로우 까지가 기본값

---------------------------------------------------------------------

---------------------------------------------------------------------


Frame 설정하는 2가지 방법

1) ROWS
- 물리적인 행 수를 기준으로 경계를 지정
- 이전 행, 이후 3개의 행
- ROWS Frame를 더 많이 사용

2) RANGE
- 논리적인 값의 범위를 기준으로 지정
- 값의 3일 전, 3일 후




EX) 

AVG(col) OVER (PARTITION BY product_type ORDER BY timestamp

ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
Frame의 시작과 끝 지점 명시하기
- PRECEDING : 현재 행 기준으로 이전 행
- CURRENT ROW : 현재 행
- FOLLOWING : 현재 행 기준으로 이후 행
- UNBOUNDED : 처음부터 또는 끝까지(사전적 의미 : 묶이지 않고 제한되지 않음)




AVG(col) OVER (PARTITION BY product_type ORDER BY timestamp
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
 = 현재 행과 그 앞뒤 한 행씩을 포함해서 평균

AVG(col) OVER (PARTITION BY product_type ORDER BY timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
 = 파티션의 처음부터 현재 행을 포함한 평균

 

 

 

 

FRAME 함수 연습문제

-- amount_total: 전체 SUM
-- cumulative_sum : row 시점에 누적 SUM
-- cumulative_sum_by_user : row 시점에 유저별 누적 SUM
-- last_5_orders_avg_amount : order_id 기준으로 정렬하고, 직전 5개 주문의 평균 amount
-- 집계분석함수() OVER(PARTITION BY ~~~ ORDER BY ~~~ ROWS BETWEEN A AND B)



SELECT *,
       SUM(amount) OVER() AS amount_total,
-- OVER()에 아무것도 들어가지 않을 수 있음
       SUM(amount) OVER(ORDER BY order_id) AS cumulative_sum,
       SUM(amount) OVER(PARTITION BY user_id ORDER BY order_id) AS cumulative_sum_by_user,
       AVG(amount) OVER(ORDER BY order_id ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING)
       

FROM advanced.orders      
ORDER BY user_id, order_date

 

 

 

 

QUALIFY 연습문제


WHERE 대신 QUALIFY를 사용하면 윈도우 함수의 결과에 대해 필터링할 수 있음
WHERE과 같이 사용하는 경우엔 WHERE 아래에 작성하면 됨

 SELECT
 order_id,
 order_date,
 user_id,
 amount,
 SUM(amount) OVER (PARTITION BY user_id) AS amount_total
 FROM advanced.orders
 WHERE 1=1
 QUALIFY amount_total >= 500

 

 

 

 

 

윈도우 함수 연습문제


문제 1

SELECT user,
       team,
       query_date,
       COUNT(query_date) OVER (PARTITION BY user) AS total_query_cnt
FROM advanced.query_logs
ORDER BY user, query_date




문제 2 

WITH base AS (
    SELECT 
          EXTRACT(week FROM query_date) AS week_number,
          user,
          team,
          COUNT(query_date) AS total_query_cnt
  FROM advanced.query_logs
  GROUP BY week_number, user, team
)

SELECT 
      week_number,
      team,
      user,
      total_query_cnt,
      RANK() OVER (PARTITION BY team ORDER BY total_query_cnt DESC) AS ranking_in_team
FROM base
QUALIFY ranking_in_team = 1
ORDER BY week_number, team




문제 3

WITH base AS (
    SELECT 
          EXTRACT(week FROM query_date) AS week_number,
          user,
          team,
          COUNT(query_date) AS query_cnt
  FROM advanced.query_logs
  GROUP BY week_number, user, team
)

SELECT 
      user,
      team,
      week_number,
      query_cnt,
      LAG(query_cnt) OVER (PARTITION BY user ORDER BY week_number) AS prev_week_query_cnt
FROM base 
ORDER BY user, week_number




문제 4

SELECT 
      user,
      team,
      query_date,
      query_count,
      SUM(query_count) OVER (PARTITION BY user ORDER BY query_date,
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_query_count
FROM (
      SELECT 
          user, 
          team,
          query_date,
          COUNT(query_date) AS query_count
  FROM advanced.query_logs
  GROUP BY 1,2,3
)
ORDER BY user, query_date




문제 5

WITH raw_data AS (
     SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL
     SELECT DATE '2024-05-02', 13 UNION ALL
     SELECT DATE '2024-05-03', NULL UNION ALL
     SELECT DATE '2024-05-04', 16 UNION ALL
     SELECT DATE '2024-05-05', NULL UNION ALL
     SELECT DATE '2024-05-06', 18 UNION ALL
     SELECT DATE '2024-05-07', 20 UNION ALL
     SELECT DATE '2024-05-08', NULL UNION ALL
     SELECT DATE '2024-05-09', 13 UNION ALL
     SELECT DATE '2024-05-10', 14 UNION ALL
     SELECT DATE '2024-05-11', NULL UNION ALL
     SELECT DATE '2024-05-12', NULL
)

SELECT 
      date,
      LAST_VALUE(number_of_orders ignore nulls) OVER (ORDER BY date) AS number_of_orders
FROM raw_data




문제 6


WITH raw_data AS (
     SELECT DATE '2024-05-01' AS date, 15 AS number_of_orders UNION ALL
     SELECT DATE '2024-05-02', 13 UNION ALL
     SELECT DATE '2024-05-03', NULL UNION ALL
     SELECT DATE '2024-05-04', 16 UNION ALL
     SELECT DATE '2024-05-05', NULL UNION ALL
     SELECT DATE '2024-05-06', 18 UNION ALL
     SELECT DATE '2024-05-07', 20 UNION ALL
     SELECT DATE '2024-05-08', NULL UNION ALL
     SELECT DATE '2024-05-09', 13 UNION ALL
     SELECT DATE '2024-05-10', 14 UNION ALL
     SELECT DATE '2024-05-11', NULL UNION ALL
     SELECT DATE '2024-05-12', NULL
)

SELECT 
      date, 
      number_of_orders,
      AVG(number_of_orders) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg
FROM (
      SELECT 
            date,
            LAST_VALUE(number_of_orders ignore nulls) OVER (ORDER BY date) AS number_of_orders
      FROM raw_data
)




문제 7


WITH base AS (
  SELECT 
    event_date,
    event_timestamp,
    DATETIME(TIMESTAMP_MICROS(event_timestamp),'Asia/Seoul') AS event_datetime,
    event_name,
    user_id,
    user_pseudo_id,
    LAG(DATETIME(TIMESTAMP_MICROS(event_timestamp),'Asia/Seoul')) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp) AS before_event_datetime
  FROM advanced.app_logs
  WHERE 1=1
  AND event_date = '2022-08-18'
)

SELECT
    *,
    DATETIME_DIFF(event_datetime, before_event_datetime, SECOND) AS second_diff,
    CASE WHEN 
      DATETIME_DIFF(event_datetime, before_event_datetime, SECOND) IS NULL
          OR DATETIME_DIFF(event_datetime, before_event_datetime, SECOND) >= 20 THEN 1 
      ELSE 0 
    END AS session_start,
    SUM(CASE WHEN DATETIME_DIFF(event_datetime, before_event_datetime, SECOND) >= 20 THEN 1 ELSE 0 END)
    OVER (PARTITION BY user_pseudo_id ORDER BY event_datetime) + 1 AS session_temp
FROM base