Explains what the feature is or what its benefits are to the user or customer.
For the best performance, you should split (or shard) very large tables across nodes. If you have a large dimension table, you might choose to co-shard it with the fact table it will be joined with.
Sharding a fact table
Use sharding to split large tables into parts for distribution across nodes. This is typically done with large fact tables, to provide optimal performance.
When sharding, you'll choose a column to be the distribution key. This column should contain a value that has a good distribution (roughly similar number of rows with each value in that column). This is typically the primary key, but it can be any single column or a set of columns.
TQL> CREATE TABLE ... ...PARTITION BY HASH (96) KEY ("customer_id");
If no sharding is specified or the number of shards specified is one, the table is assumed to be unsharded (i.e. the table physically exists on each node).
If no sharding key is specified, but the number of shards is greater than one, the table is assumed to be sharded randomly. The system does not use primary keys as sharding keys by default.
TQL> CREATE TABLE "supplier" ( "s_suppkey" BIGINT, "s_name" VARCHAR(255), "s_address" VARCHAR(255), "s_city" VARCHAR(255), "s_phone" VARCHAR(255), CONSTRAINT PRIMARY KEY ("s_suppkey") ) PARTITION BY HASH (96) KEY ("s_suppkey");
Sharded (distributed) dimension tables
In a typical schema, you'd have a sharded fact table with foreign keys to replicated dimension tables (which exist on every node). This works best where dimension tables are small (under 50MB). So if your dimension tables are small, you should shard the fact tables and not shard the dimension tables they will be joined with.
- The dimension table is large (over 50MB).
- The tables are always joined using the same columns.
- The tables are related by a primary key and foreign key.
- The tables are partitioned on the same primary key/foreign key.
- The tables have the same number of regions (or shards).
When a fact and its dimension table are co-sharded:
- The two tables will always be joined on the sharding key.
- Data skew can develop if a very large proportion of the rows have the same sharding key.
This example shows the CREATE TABLE statements that meet the criteria for co-sharding of a fact table and a distributed dimension table:
TQL> CREATE TABLE products_dim ( "id" int, "prod_name" varchar(30), "prod_desc" varchar(100), PRIMARY KEY ("id") ) PARTITION BY HASH (96) KEY ("id") ; TQL> CREATE TABLE retail_fact ( "trans_id" int, "product_id" int, "amount" double, FOREIGN KEY ("product_id") REFERENCES products_dim ("id") ) PARTITION BY HASH (96) KEY ("product_id") ;