BigQuery - SQL Joins (Photo by Resource Database on Unsplash) |
SQL joins are used to combine columns from multiple tables to get desired result set. In a typical Relational model we use normalized tables, each table represents an entity (example: employee, department, etc) and its relationships and when we need to get data from more than one tables, for example employee name and employee department, we use joins to combine employee name column from employee table, department name column from department table based on employee number key column, which is available on both the tables.
Similarly, typical data warehouse setup follows Star or Snowflake schema consisting of a primary fact table and satellite dimension tables. Fact tables represents events (example: orders table in a ecommerce business) and dimension table represents attributes and slowly changing information (example: customer, product tables).
Star Schema Example (Source: Wikipedia by SqlPac) |
Fact tables are denormalized for faster reads and dimension tables are normalized. In most cases, your analytics query need to join a fact table with multiple dimension tables to get the desired results, for example finding top 10 customers whose transaction value is greater than $100 and their ordered products. These type of queries requires join operation on transaction, customer and product tables to get desired result sets.
BigQuery as a serverless enterprise data warehouse support ANSI SQL joins types such as inner join, left outer join, right outer join, full outer join and cross join. Let's discuss about how join works in BigQuery, basic join types and try it out on BigQuery public dataset: `bigquery-public-data.thelook_ecommerce`
How join operations are performed in BigQuery?
- Create two result set items: In BigQuery join operation can be performed on tables, subquery, Arrays or WITH statement.
- Identify a join condition: The join condition does not need to be an equality condition; any Boolean condition can be used. The join condition specified using ON clause or USING clause.
- Select the column list you wanted in final result set: If the joining result set/table consists of identical column names, use alias to differentiate the columns.
- Specify the join type: if no joining type is specified, by default considered as inner join.
Inner Join
Inner Join returns only rows that meets the join condition. Effectively calculates Cartesian product of two tables based on a common values and discard all rows that do not meet the join condition.
BigQuery - SQL Inner Join |
Joining table A and B and joining B and A are same. If the column names are same, join condition can be specified using USING clause.
Example:
List product names and its order status.
Product schema:
❯ bq query --nouse_legacy_sql 'SELECT column_name, data_type, is_nullable FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.COLUMNS`WHERE table_name = "products";'+------------------------+-----------+-------------+| column_name | data_type | is_nullable |+------------------------+-----------+-------------+| id | INT64 | YES || cost | FLOAT64 | YES || category | STRING | YES || name | STRING | YES || brand | STRING | YES || retail_price | FLOAT64 | YES || department | STRING | YES || sku | STRING | YES || distribution_center_id | INT64 | YES |+------------------------+-----------+-------------+
Order Items schema:
❯ bq query --nouse_legacy_sql 'SELECT column_name, data_type, is_nullable FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.COLUMNS`WHERE table_name = "order_items";'+-------------------+-----------+-------------+| column_name | data_type | is_nullable |+-------------------+-----------+-------------+| id | INT64 | YES || order_id | INT64 | YES || user_id | INT64 | YES || product_id | INT64 | YES || inventory_item_id | INT64 | YES || status | STRING | YES || created_at | TIMESTAMP | YES || shipped_at | TIMESTAMP | YES || delivered_at | TIMESTAMP | YES || returned_at | TIMESTAMP | YES || sale_price | FLOAT64 | YES |+-------------------+-----------+-------------+
Selecting Product name and Order status using inner join:
SELECTp.name as `Product Name`,oi.status as `Order Status`FROM`bigquery-public-data.thelook_ecommerce.order_items` as oiINNER JOIN`bigquery-public-data.thelook_ecommerce.products` as pONoi.product_id = p.idLIMIT 10;+--------------------------------------------------------------------------------------------------------+--------------+| Product Name | Order Status |+--------------------------------------------------------------------------------------------------------+--------------+| Elegant PASHMINA SCARF WRAP SHAWL STOLE | Shipped || Elegant PASHMINA SCARF WRAP SHAWL STOLE | Shipped || Elegant PASHMINA SCARF WRAP SHAWL STOLE | Complete || Elegant PASHMINA SCARF WRAP SHAWL STOLE | Complete || Elegant PASHMINA SCARF WRAP SHAWL STOLE | Returned || Elegant PASHMINA SCARF WRAP SHAWL STOLE | Returned || Nice Shades Black One Size Canvas Military Web Belt With Black Slider Buckle. Many Colors Available 56 | Complete || Nice Shades Black One Size Canvas Military Web Belt With Black Slider Buckle. Many Colors Available 56 | Complete || Nice Shades Black One Size Canvas Military Web Belt With Black Slider Buckle. Many Colors Available 56 | Returned || Nice Shades Black One Size Canvas Military Web Belt With Black Slider Buckle. Many Colors Available 56 | Returned |+--------------------------------------------------------------------------------------------------------+--------------+
Left Outer Join
Returns all rows from the left side of the join even if the rows not matching join predicate and only matching rows from the right side of the join. All the remaining rows from the right side of join returned as NULL.
Example
List all the products which never ordered (products not ordered even once).
SELECTp.name as `Product Name`,oi.order_id as `Order ID`FROM`bigquery-public-data.thelook_ecommerce.products` as pLEFT JOIN`bigquery-public-data.thelook_ecommerce.order_items` as oiONoi.product_id = p.idWHERE oi.status IS NULLLIMIT 10;+-------------------------------------------------------------------------------------------+----------+| Product Name | Order ID |+-------------------------------------------------------------------------------------------+----------+| Cloudveil Men's Kahuna Short | NULL || 30 inch Inseam Smith's Denim Fleece - lined Jeans Dark Stonewash | NULL || RUDE Dark Vintage Skinny Jeans | NULL || Nike Golf Men's Flat Front Tech Short | NULL || Hurley Men's Poolside Boardwalk Walkshort | NULL || Polo Ralph Lauren Men's Walk Golf Shorts Red | NULL || Men's Cashmere Cardigan | NULL || Lucky Brand Men's Boxer Set | NULL || ililily Simple Basic 100% Cotton Baseball 3/4 raglan sleeve T-shirt for Men (tshirts-008) | NULL || Pendleton Men's Classic Fit Board Shirt | NULL |+-------------------------------------------------------------------------------------------+----------+
The above query can rewritten using subqueries with EXISTS condition to fetch the results without using join as we just need the Product name, however we will continue with this syntax for explaining joins.
Right Outer Join
Returns all rows from the right side of join even if the rows are not matching the join predicate and returns only matching rows from the left side of the join. All the remaining rows from left side of the side returned as NULL.
BigQuery - SQL Right Join |
Example
List all the orders with no products (where products removed from the product table, but there were previous orders exists for those products).
SELECTp.name as `Product Name`,oi.order_id as `Order ID`,FROM`bigquery-public-data.thelook_ecommerce.products` as pRIGHT JOIN`bigquery-public-data.thelook_ecommerce.order_items` as oiONoi.product_id = p.idWHERE p.name IS NULLLIMIT 10;+--------------+----------+| Product Name | Order ID |+--------------+----------+| NULL | 39411 || NULL | 75248 || NULL | 76724 || NULL | 117507 || NULL | 32556 || NULL | 42489 || NULL | 57019 || NULL | 92339 || NULL | 49356 || NULL | 75177 |+--------------+----------+
Full Outer Join
Full join returns all rows from both side of join for matching conditions and If a row from one side does not join any other row in other side, the row returns with NULL value for other side of the table.
BigQuery - SQL Full Outer Join |
Example
SELECTp.name as `Product Name`,oi.order_id as `Order ID`,FROM`bigquery-public-data.thelook_ecommerce.products` as pFULL OUTER JOIN`bigquery-public-data.thelook_ecommerce.order_items` as oiONoi.product_id = p.idWHERE p.name IS NULL OR oi.order_id IS NULL;+------------------------------------------------------------------------------------------------------------------------+----------+| Product Name | Order ID |+------------------------------------------------------------------------------------------------------------------------+----------+| Lucky Brand Men's Boxer Set | NULL || NULL | 39411 || NULL | 75248 || NULL | 76724 || NULL | 117507 || NULL | 32556 || NULL | 42489 || NULL | 57019 || NULL | 92339 || So So Happy Junior Ozzie Hoodie in Blue | NULL || ililily Simple Basic 100% Cotton Baseball 3/4 raglan sleeve T-shirt for Men (tshirts-008) | NULL || Cloudveil Men's Kahuna Short | NULL || Allegra K Woman Faux Crystal Detail Leopard Print Strapped Tank Top M | NULL || Fox Womens Juniors Trials Pullover Hoody Sweater | NULL || Russell Athletic Women's Dri-Power Fleece Mid Rise Pant | NULL || A. Byer Juniors Tropical Cambridge Pant | NULL || Lily Of France Womens Sport In Action Sport Bra | NULL |+------------------------------------------------------------------------------------------------------------------------+----------+
CROSS JOIN
Example:
SELECTCONCAT(u.first_name, " ", u.last_name) as `Customer Name`,p.name as `Product Name`FROM`bigquery-public-data.thelook_ecommerce.users` as uCROSS JOIN`bigquery-public-data.thelook_ecommerce.products` as pLIMIT 10;+---------------+---------------------------------------------------------------+| Customer Name | Product Name |+---------------+---------------------------------------------------------------+| Michelle Rowe | TYR Alliance Team Splice Jammer || Michelle Rowe | TYR Sport Men's Solid Racer Swim Suit || Michelle Rowe | TYR Sport Men's Square Leg Short Swim Suit || Michelle Rowe | TYR Sport Men's 4-Inch Nylon Trainer-A Swim Suit || Michelle Rowe | TYR Sport Men's Swim Short/Resistance Short Swim Suit || Michelle Rowe | TYR Sport Men's Poly Mesh Trainer Swim Suit || Michelle Rowe | 2XU Men's Swimmers Compression Long Sleeve Top || Michelle Rowe | TYR Sport Men's Solid Durafast Jammer Swim Suit || Michelle Rowe | TYR Sport Men's Alliance Durafast Splice Square Leg Swim Suit || Michelle Rowe | TYR Sport Men's Solid Jammer Swim Suit |+---------------+---------------------------------------------------------------+
SELECTCONCAT(u.first_name, " ", u.last_name) as `Customer Name`,p.name as `Product Name`FROM`bigquery-public-data.thelook_ecommerce.users` as u, `bigquery-public-data.thelook_ecommerce.products` as pORDER BY1 DESC,2 DESCLIMIT 10;+---------------+------------------------------------------------------------------------------------------------------------+| Customer Name | Product Name |+---------------+------------------------------------------------------------------------------------------------------------+| Zoe Wood | Â Â Exclusive Hawaiian Tropic Sunset In Paradise Aloha Shirt || Zoe Wood | Â Â Exclusive Hawaiian Sunset In Paradise Aloha Shirt || Zoe Wood | Â Â Exclusive Hawaiian Orchid (100% cotton) Aloha Shirt || Zoe Wood | Â Â Exclusive Hawaiian Flowers In Paradise Aloha Shirt || Zoe Wood | Â Â Exclusive Hawaiian All New Hibiscus In Paradise Aloha Shirt || Zoe Wood | white Half slip Culotte Underworks Pettipants nylon for women plus size trim double lace knickers Lingerie || Zoe Wood | under.me The Leggings || Zoe Wood | tokidoki X Marvel Women Retro Spidey Dark Red Hoody || Zoe Wood | tokidoki X Marvel Women Captain Kitty Hoodie || Zoe Wood | tokidoki The Marvel Pop Zip Hoody in Dark Gray Heather |+---------------+------------------------------------------------------------------------------------------------------------+
BigQuery Joining Strategies
Broadcast Join
❯ bq show --project_id=bigquery-public-data thelook_ecommerce.distribution_centersTable bigquery-public-data:thelook_ecommerce.distribution_centersLast modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Total Logical Bytes Total Physical Bytes Labels----------------- --------------------- ------------ ------------- ------------ ------------------- ------------------ --------------------- ---------------------- --------23 Mar 10:06:38 |- id: integer 10 409 409 13988|- name: string|- latitude: float|- longitude: float
Inventory Items:
❯ bq show --project_id=bigquery-public-data thelook_ecommerce.inventory_itemsTable bigquery-public-data:thelook_ecommerce.inventory_itemsLast modified Schema Total Rows Total Bytes Total Logical Bytes Total Physical Bytes Labels----------------- -------------------------------------------- ------------ ------------- --------------------- ---------------------- --------23 Mar 10:06:47 |- id: integer 491237 81455014 81455014 53738166|- product_id: integer|- created_at: timestamp|- sold_at: timestamp|- cost: float|- product_category: string|- product_name: string|- product_brand: string|- product_retail_price: float|- product_department: string|- product_sku: string|- product_distribution_center_id: integer
SELECTiv.product_name as `Product Name`,dc.name as `Distribution Center`FROM`bigquery-public-data.thelook_ecommerce.inventory_items` as ivINNER JOIN`bigquery-public-data.thelook_ecommerce.distribution_centers` as dcONiv.product_distribution_center_id = dc.id;+----------------------------------------------------------------------------+---------------------+| Product Name | Distribution Center |+----------------------------------------------------------------------------+---------------------+| Quiksilver Waterman Men's On The Rise | Houston TX || Quiksilver Waterman Men's On The Rise | Houston TX || Quiksilver Waterman Men's On The Rise | Houston TX || Quiksilver Waterman Men's On The Rise | Houston TX || KEEN Women Bellingham Low Ultralite Sock | Houston TX || KEEN Women Bellingham Low Ultralite Sock | Houston TX || KEEN Women Bellingham Low Ultralite Sock | Houston TX || Husky Animal Hat with Mittens | Houston TX || Husky Animal Hat with Mittens | Houston TX || Husky Animal Hat with Mittens | Houston TX || Husky Animal Hat with Mittens | Houston TX || Volcom Men's Argyle Socks | Houston TX || Volcom Men's Argyle Socks | Houston TX || Volcom Men's Argyle Socks | Houston TX || Volcom Men's Argyle Socks | Houston TX || Volcom Men's Argyle Socks | Houston TX || Volcom Men's Argyle Socks | Houston TX || Wendy Glez Rose Cami Black | Houston TX || Wendy Glez Rose Cami Black | Houston TX || Wendy Glez Rose Cami Black | Houston TX || Carhartt Women's Hooded Knit Jacket | Houston TX || Carhartt Women's Hooded Knit Jacket | Houston TX || Carhartt Women's Hooded Knit Jacket | Houston TX || Carhartt Women's Hooded Knit Jacket | Houston TX || Carhartt Women's Hooded Knit Jacket | Houston TX || NEW Aluminum Credit Card Wallet - RFID Blocking Case - Pink (New Products) | Houston TX || NEW Aluminum Credit Card Wallet - RFID Blocking Case - Pink (New Products) | Houston TX || NEW Aluminum Credit Card Wallet - RFID Blocking Case - Pink (New Products) | Houston TX |+----------------------------------------------------------------------------+---------------------+
BigQuery - Broadcast join - input stage |
In S02 stage, two READ operations, 10 rows from S01 and 4,91, 247 rows from inventory items table and BigQuery broadcasted all 10 rows to ALL the rows from the right side of the table for matching. This is specified by the keyword INNER HASH JOIN EACH WITH ALL and output is written to S03 with columns Product Name and Distribution Center.
When joining two large tables BigQuery uses Hash joins. This is computationally more expensive join operation. Broadcasting larger table rows to every worker is inefficient, therefore BigQuery hashes table rows from both side of the join operation, this ensures rows with same key (based on hash value) routed and end up in the same buckets, enabling more efficient local join operation.
Let's understand hash join with an example, we will cross join products and users table to illustrate hash join. This output of cross join is all possible combination of user and products.
❯ bq show --project_id=bigquery-public-data thelook_ecommerce.users;Table bigquery-public-data:thelook_ecommerce.usersLast modified Schema Total Rows Total Bytes Total Logical Bytes Total Physical Bytes----------------- --------------------------- ------------ ------------- --------------------- ----------------------24 Mar 10:23:38 |- id: integer 100000 15816853 15816853 25734386|- first_name: string|- last_name: string|- email: string|- age: integer|- gender: string|- state: string|- street_address: string|- postal_code: string|- city: string|- country: string|- latitude: float|- longitude: float|- traffic_source: string|- created_at: timestamp
❯ bq show --project_id=bigquery-public-data thelook_ecommerce.products;Table bigquery-public-data:thelook_ecommerce.productsLast modified Schema Total Rows Total Bytes Total Logical Bytes Total Physical Bytes----------------- ------------------------------------ ------------ ------------- --------------------- ----------------------24 Mar 10:23:37 |- id: integer 29120 4285975 4285975 9762698|- cost: float|- category: string|- name: string|- brand: string|- retail_price: float|- department: string|- sku: string|- distribution_center_id: integer
WITHusers as (SELECTid,CONCAT(first_name, " ", last_name) as `CustomerName`FROM`bigquery-public-data.thelook_ecommerce.users`),products as (SELECTid,name,distribution_center_idFROM`bigquery-public-data.thelook_ecommerce.products`)SELECT * FROM users CROSS JOIN products;+-------+------------------+-------+---------------------------------------------------------------+------------------------+| id | CustomerName | id_1 | name | distribution_center_id |+-------+------------------+-------+---------------------------------------------------------------+------------------------+| 96585 | Michael Anderson | 27457 | TYR Sport Men's Solid Durafast Jammer Swim Suit | 1 || 96585 | Michael Anderson | 27466 | TYR Sport Men's Swim Short/Resistance Short Swim Suit | 1 || 96585 | Michael Anderson | 27529 | TYR Sport Men's Poly Mesh Trainer Swim Suit | 1 || 96585 | Michael Anderson | 27552 | TYR Sport Men's Solid Racer Swim Suit | 1 || 96585 | Michael Anderson | 27487 | TYR Sport Men's 4-Inch Nylon Trainer-A Swim Suit | 1 || 96585 | Michael Anderson | 27481 | TYR Alliance Team Splice Jammer | 1 || 96585 | Michael Anderson | 27537 | TYR Sport Men's Alliance Durafast Splice Square Leg Swim Suit | 1 || 96585 | Michael Anderson | 27510 | TYR Sport Men's Solid Jammer Swim Suit | 1 || 96585 | Michael Anderson | 27569 | 2XU Men's Swimmers Compression Long Sleeve Top | 1 || 96585 | Michael Anderson | 27445 | TYR Sport Men's Square Leg Short Swim Suit | 1 |+-------+------------------+-------+---------------------------------------------------------------+------------------------+
BigQuery - Hash Join - Execution Plan |
We are interested in S03 Join stage, here BigQuery performs EACH WITH EACH matching, i.e every row from left side of table is matched with every row from right side of the table. Additionally, we are limiting the number of rows to 100 to avoid CROSS JOIN output errors.
BigQuery - Hash Join - EACH WITH EACH |
How to optimize BigQuery Join operations?
Avoiding Self-joins
Avoid joins that generate more output than input
Typically, when use Cross join two big tables, every row from first table is joined with every row in second table, you may observed the query may not be completing as well. It is recommended to avoid cross joins not just in terms of performance, it impacts the overall cost/slot usage as well.To avoid performance issues associated with joins that generate more output than input, use aggregate functions to pre-aggregate the data or use window functions.
Join on integer data type columns
Broadcast Joins
Hash Joins
Reduce the size of data being joined
Performing join operation on smaller dataset is better, whenever possible filter the dataset before joining. Rewriting queries to fetch only required data from both the joining tables / subqueries helps in reducing heavy shuffling and other such complex operation on larger datasets and improves overall query performance. When there is a need to perform GROUP BY and JOIN operation on a query, perform aggregation earlier in the query - this will reduce the number of rows joined.
Experimenting with BigQuery Nested and Repeated Structures
Generally speaking reading from a flattened table is much faster and efficient than reading from joining tables. Denormalizing schema helps us to achieve flatten table structure where events and metrics available in a same table. In BigQuery we can use repeated and nested fields to denormalize a table.
Let's analyse the denormalized BigQuery tables using nested and repeated fields and joining normalized table performance with an example.
We have two tables Orders and Orders Nested. In Orders table, we have user_id column which references id column of Users table and orders_nested table is a denormalized table, where users details are stored in a nested column with STRUCT data type.
Orders:
❯ bq show --project_id=bigquery-public-data thelook_ecommerce.orders;Table bigquery-public-data:thelook_ecommerce.ordersLast modified Schema Total Rows Total Bytes Total Logical Bytes Total Physical Bytes----------------- ---------------------------- ------------ ------------- --------------------- ----------------------25 Mar 09:46:02 |- order_id: integer 125402 6776727 6776727 14873107|- user_id: integer|- status: string|- gender: string|- created_at: timestamp|- returned_at: timestamp|- shipped_at: timestamp|- delivered_at: timestamp|- num_of_item: integer
Orders Nested:
❯ bq show BigQueryLab.orders_nested;Table bpo-bu-prod-datawarehouse:BigQueryLab.orders_nestedLast modified Schema Total Rows Total Bytes Total Logical Bytes Total Physical Bytes----------------- ---------------------------- ------------ ------------- --------------------- ----------------------25 Mar 13:19:24 |- order_id: integer 125402 10172371 10172371 2812454|- status: string|- gender: string|- created_at: timestamp|- returned_at: timestamp|- shipped_at: timestamp|- delivered_at: timestamp|- num_of_item: integer+- users: record| |- user_id: integer| |- first_name: string| |- last_name: string| |- city: string
We will get all the orders and corresponding user_id, first_name, last_name and city columns using above tables and analyse its execution plan.
Joining Orders and Users table:
SELECTorders.*,users.first_name,users.last_name,users.cityFROM`bigquery-public-data.thelook_ecommerce.orders` as ordersINNER JOIN`bigquery-public-data.thelook_ecommerce.users` as usersONorders.user_id = users.id;+----------+---------+---------+--------+---------------------+----------------------+------------+-----------+----------------+| order_id | user_id | status | gender | created_at | shipped_at | first_name | last_name | city |+----------+---------+---------+--------+---------------------+----------------------+------------+-----------+----------------+| 15 | 15 | Shipped | F | 2023-03-24 13:06:16 | 2023-03-24 18:29:16 | Jennifer | Morris | City of Yantai || 21 | 20 | Shipped | F | 2021-07-27 03:41:00 | 2021-07-28 00:11:00 | Alice | Davis | Fort Worth || 22 | 20 | Shipped | F | 2022-08-24 03:41:00 | 2022-08-25 11:06:00 | Alice | Davis | Fort Worth || 26 | 23 | Shipped | F | 2020-10-12 16:54:00 | 2020-10-13 08:14:00 | Kristy | Miller | Bella Vista || 27 | 23 | Shipped | F | 2021-03-19 16:54:00 | 2021-03-20 17:46:00 | Kristy | Miller | Bella Vista || 32 | 27 | Shipped | F | 2022-03-06 09:01:00 | 2022-03-07 10:43:00 | Lindsey | Singh | London || 51 | 44 | Shipped | F | 2022-05-13 12:38:00 | 2022-05-13 23:22:00 | Jeanne | Lamb | Pontal || 54 | 46 | Shipped | F | 2022-07-06 14:51:00 | 2022-07-08 07:26:00 | Deborah | Nunez | Chengdu || 56 | 46 | Shipped | F | 2022-10-03 14:51:00 | 2022-10-04 18:25:00 | Deborah | Nunez | Chengdu || 57 | 47 | Shipped | F | 2022-04-05 15:26:00 | 2022-04-07 04:52:00 | Maria | Li | Chicago |+----------+---------+---------+--------+---------------------+----------------------+------------+-----------+----------------+
Some of the query statistics given below:
"totalBytesProcessed": "10258525","totalBytesBilled": "20971520","totalSlotMs": "2208","finalExecutionDurationMs": "1431"
When joining users and orders table, the query processed around 9 MB of data and returned results in 1.4 seconds and used 2208 milliseconds of slot time and total billable bytes is 20MB. Let's get the same result set from running SELECT statement from orders_nested table.
Selecting from orders_nested table:
SELECT * FROM BigQueryLab.orders_nested;+----------+---------+--------+---------------------+---------------------+-----------------------------------------------------------------------------------------+| order_id | status | gender | created_at | shipped_at | users |+----------+---------+--------+---------------------+---------------------+-----------------------------------------------------------------------------------------+| 2028 | Shipped | F | 2021-05-14 16:37:00 | 2021-05-17 00:45:00 | {"user_id":"1585","first_name":"Mariah","last_name":"Johnson","city":"Xingtai"} || 2842 | Shipped | F | 2021-09-07 01:22:00 | 2021-09-08 10:18:00 | {"user_id":"2215","first_name":"Jody","last_name":"Smith","city":"Detroit"} || 4134 | Shipped | F | 2020-05-19 01:02:00 | 2020-05-20 22:22:00 | {"user_id":"3247","first_name":"Beth","last_name":"Clark","city":"Shanghai"} || 4817 | Shipped | F | 2020-12-03 05:35:00 | 2020-12-06 00:00:00 | {"user_id":"3799","first_name":"Donna","last_name":"Spencer","city":"South Hill"} || 5259 | Shipped | F | 2023-02-20 00:45:00 | 2023-02-21 09:56:00 | {"user_id":"4147","first_name":"Kristi","last_name":"Luna","city":"Millington"} || 5279 | Shipped | F | 2022-09-08 00:55:00 | 2022-09-08 06:33:00 | {"user_id":"4160","first_name":"Caroline","last_name":"Gillespie","city":"Bloomington"} || 6412 | Shipped | F | 2022-07-29 02:08:00 | 2022-07-31 02:04:00 | {"user_id":"5065","first_name":"Rachael","last_name":"Wilson","city":"Cape Coral"} || 7139 | Shipped | F | 2022-05-12 13:37:00 | 2022-05-14 13:41:00 | {"user_id":"5636","first_name":"Kaitlin","last_name":"Christian","city":"Houston"} || 7385 | Shipped | F | 2021-02-09 03:35:00 | 2021-02-11 20:08:00 | {"user_id":"5823","first_name":"Jasmine","last_name":"Stewart","city":"Zhangzhou"} || 7692 | Shipped | F | 2020-11-19 02:29:00 | 2020-11-21 15:42:00 | {"user_id":"6074","first_name":"Kim","last_name":"Carey","city":"Pinheiro"} |+----------+---------+--------+---------------------+---------------------+-----------------------------------------------------------------------------------------+
Some of the above query statistics given below:
"totalBytesProcessed": "10172371","totalBytesBilled": "10485760","totalSlotMs": "1141","finalExecutionDurationMs": "1229"
When selecting from the nested table directly without any join, the query processed around 9 MB of data and returned results in 1.2 seconds and used 1141 milliseconds of slot time and total billable bytes is 10 MB.
Following observations made from this experimentation:
- Read from nested table is comparatively faster than joining from multiple tables.
- In both scenarios bytes processed is same, but billable bytes processed is higher when joining the tables.
- Total slot time consumed is higher when joining tables and comparatively low when reading from a nested table.
- The tables used in this experiment is smaller in size with few number of rows - so the difference is not big, you can see few milliseconds of difference in execution time, but when testing with bigger tables, this results may vary.
- The query used is very simple and straightforward, data shuffling is may not be a major consideration here, my guess is when joining bigger tables with uneven distribution, data shuffling will be more.
- When using nested columns, the table size is increased as we store the additional columns in the same table, this increases the storage cost.
0 thoughts:
Post a Comment