Regular indexed keys and the
STORING
clause
serve very similar purposes in Google Cloud Spanner.
For example, let's consider this table which describes completable tasks.
CREATE TABLE tasks
created_by_user_id STRING(36) NOT NULL,
# e.g. "Buy milk"
content STRING(MAX),
# "HIGH" | "LOW"
priority STRING(MAX),
completed_at TIMESTAMP,
Let's imagine we want to show the current user all their tasks for a given priority. One way we can speed up this query is by creating the following index.
CREATE INDEX
active_tasks_by_priority_idx
ON
user (
created_by_user_id,
completed_at,
priority
)
Alternatively, we could have priority
in a STORING
clause instead
of in the index keys.
CREATE INDEX
active_tasks_by_priority_idx
ON
user (
created_by_user_id,
completed_at
) STORING (
priority
)
The advantage of adding STORING
over an index with just the other
two keys is that it removes the need to join on the base tasks
table
which makes the query cheaper as priority
is colocated with the
index. If we did not store priority
, Spanner would have to look at
the values in our index then go to the base tasks
table to look up
the priority
value.
But adding priority
to the indexed keys would also avoid having to
look up the value in the base table. Let's imagine our app limits the
user to three active (compelted_at IS NOT NULL
) HIGH priority tasks
and five active MEDIUM priority tasks (focus is important). This means
that when we create a new task, we need to first do a read inside a
read-write transaction to avoid a task being created which exceeds the
task limit, then do our write to create the task. What is the
difference between using indexed keys and STORING
if we create one
HIGH priority task then one MEDIUM priority task?
If STORING
is used and we create a new HIGH priority task, the
transaction will lock all rows matching the specified
created_by_user_id
and completed_at
values. i.e. All user task
creation will be locked, regardless of priority. This means that the
subsequent MEDIUM priority task creation will be blocked.
If priority
is an indexed key, on the other hand, the transaction
will only lock rows which match the queried priority
. Consequently,
the MEDIUM priority task creation will not be blocked.
This is a trivial example where the blocking is minimal. However, with larger amounts of data, this difference can cause serious performance issues.
Also note that read-only transactions do not acquire locks so this difference only applies to read-write transactions.
STORING
comes at the cost of extra storage space vs not using
STORING
. However, this is also true of additional index keys. What
is the difference in cost here?INTERLEAVE
fit in here? Like STORING
, it is an approach
to reduce join costs.