Queries

SELECT Statements

SELECT Statement Clauses

The WHERE clause cannot refer to expressions defined in the SELECT clause. A SQL engine executes the clauses in this order:

  1. FROM
  2. JOIN
  3. WHERE
  4. GROUP BY
  5. HAVING
  6. SELECT
  7. ORDER BY
  8. LIMIT
  9. OFFSET

The only difference from the order in which the clauses appear in a SELECT statement is that the SELECT clause is moved up to the top, allowing us to quickly see the column definitions of the result set.

Remember that the SELECT expressions are computed after the HAVING and before the ORDER BY.

The SELECT clause is the only required clause. If the FROM clause is omitted, the query returns a single row.

How to Format a SELECT Statement

A common convention is to write SQL keywords and functions in all-caps and table and column names in lowercase.

Write each top level clause of the SELECT statement (e.g. SELECT, FROM, JOIN, WHERE, GROUP BY, HAVING, ORDER BY, OFFSET, LIMIT) on its own line. At first glance it seems there can be at most 9 lines, but in fact there can be multiple JOIN clauses.

If it is necessary to break one of the clause lines, all lines after a break are indented one level deeper than the head of the clause.

INFORMATION_SCHEMA

The SQL standard calls for an information_schema with tables containing information about the entities in a database. The columns table provides all information needed for writing SELECT statements:

SELECT table_name, column_name, data_type, column_default, is_nullable
FROM information_schema.columns;

SQLite does not have support for the information_schema. Use the schema command to get the DDL for tables:

> schema
> schema foo

In the MySQL command line client, one can also use show and describe commands to get schema information:

> show tables;
> describe foo;

In the PostgreSQL command line client, one can also use the \d command to get schema information:

> \d
> \d foo

Types and Expressions

Types

These type names are short and reasonable portable:

  • BOOLEAN
  • INTEGER
  • REAL
  • DOUBLE
  • DECIMAL(total digits, fractional digits)
  • DECIMAL(total digits)
  • CHAR(length)
  • VARCHAR(length)
  • TEXT
  • TIMESTAMP
  • DATE
  • TIME

PostgreSQL does not DOUBLE. Use DOUBLE PRECISION.

Athena does not have TEXT. Use VARCHAR without a length;

Casts

This is the portable way to cast to another type:

SELECT CAST('7' AS INTEGER) + 3;

PostgreSQL and Snowflake have double colon syntax for casting:

SELECT '7'::INTEGER + 3;

NULL

All data types can be NULL unless declared as "NOT NULL".

Test whether a value is NULL with "IS NULL". "NULL IS NULL" evaluates to TRUE, but "NULL = NULL" evaluates to NULL.

The COALESCE function takes two or more arguments and returns the first non-NULL argument, or NULL if all the arguments are NULL. It is a convenient way to remove NULL values.

NULLIF takes two arguments and returns NULL if the first is equal to the second. It is loosely the inverse of the COALESCE function.

Boolean

The boolean literals are TRUE and FALSE.

The logical operators are AND, OR, and NOT.

Numeric

arithmetic operators + - * / %
power, square root, and logarithm power(base, exponent) sqrt, log(base, exponent)
natural exponent and logarithm exp ln
trigonometic sin cos tan asin acos atan atan2(x, y)
absolute value abs
float truncation floor, ceil, round, cast(real as integer)

SQLite, PostgreSQL, and Athena: the / operator performs integer division on integers.

MySQL and Snowflake: the / operator performs real division on integers.

SQLite has no transcendental arithmetic functions. It has abs and round, both which return floats.

Athena and Snowflake: the log function must be provided two arguments.

MySQL and PostgreSQL: if the log function is given a single argument, it computes the natural logarithm.

Strings

CHAR(100) is always 100 characters in length; VARCHAR(100) is up to 100 characters in length; TEXT is variable length with no upper limit.

String literals are single quote delimited. Inside a string literal single quotes are represented by doubling them.

The double pipe || operator is used for string concatenation.

The LIKE operator is used to compare a string with a simple pattern in which underscore "_" matches a single character and percent sign "%" matches zero or more characters. All other characters in the pattern match themselves. To match only an underscore or percent sign, use a preceding backslash character to escape them.

Some widely supported string functions:

  • LENGTH(s)
  • UPPER(s)
  • LOWER(s)
  • POSITION(s1 IN s2)
  • SUBSTRING(s FROM n FOR m)
  • TRIM(LEADING [s1] FROM s2)
  • TRIM(TRAILING [s1] FROM s2)
  • TRIM(BOTH [s1] FROM s2)
  • LPAD(s1, n, [s2])
  • RPAD(s1, n, [s2])

Regular Expressions

mysql postgresql athena snowflake
test for substring
matching pattern
'foo' regexp 'f.+' 'foo' ~ 'f.+' regexp_like(
  'foo',
  'f.+')
regexp_like(
  'foo',
  'f.+')
test if string
matches pattern
'foo' regexp '^f.+$' 'foo' ~ '^f.+$' regexp_like(
  'foo',
  '^f.+$')
regexp_like(
  'foo',
  '^f.+$')
single
substitution
regexp_replace(
  'do re mi mi',
  'mi',
  'ma')
regexp_replace(
  'do re mi mi',
  'mi',
  'ma',
  1,
  1);
global
substitution
regexp_replace(
  'do re mi mi',
  'mi',
  'ma')
regexp_replace(
  'do re mi mi',
  'mi',
  'ma',
  'g')
regexp_replace(
  'do re mi mi',
  'mi',
  'ma')
regexp_replace(
  'do re mi mi',
  'mi',
  'ma')
capture group
extraction
(regexp_matches(
  'do re mi',
  '(r.)'))[1]
regexp_extract(
  'do re mi',
  '(r.)',
  1);
regexp_substr(
  'do re mi',
  'r.',
  1,
  1,
  'e');
character class
abbrevations
. .
\d \s \w
\D \S \W
.
\d \h \s \v \w
\D \H \S \V \W
.
\b \d \s \w
\B \D \S W

SQLite does not have regular expressions.

Times and Dates

mysql postgresql athena snowflake
today and now current_date
current_time
current_timestamp
current_date
current_time
current_timestamp
current_date
current_time
current_timestamp
current_date
current_time
current_timestamp
to string date_format(
  current_timestamp,
  '%Y-%m-%d')

date_format(
  current_timestamp,
  '%H:%i:%S')
to_char(
  current_timestamp,
  'yyyy-mm-dd')

to_char(
  current_timestamp,
  'hh24:mi:ss')
to_char(
  current_timestamp,
  'yyyy-mm-dd')

to_char(
  current_timestamp,
  'hh24:mi:ss')
to_char(
  current_timestamp,
  'yyyy-mm-dd')

to_char(
  current_timestamp,
  'hh24:mi:ss')
from string str_to_date(
  '2019-10-12',
  '%Y-%m-%d')
to_timestamp(
  '2019-10-12',
  'yyyy-mm-dd')
to_timestamp(
  '2019-10-12',
  'yyyy-mm-dd')
to_timestamp(
  '2019-10-12',
  'yyyy-mm-dd')
implicit conversions current_timestamp
  '2019-10-12
cast('2019-10-12'
  as timestamp)

current_timestamp
  '2019-10-12'
cast('2019-10-12'
  as timestamp)
cast('2019-10-12'
  as timestamp)

current_timestamp
  > '2019-10-12'
time intervals date_add(
  current_timestamp,
  interval 1 day)
current_timestamp +
  interval '1' day
current_timestamp +
  interval '1' day
current_timestamp +
  interval '1' day

In PostgreSQL and Snowflake, date and time format string specifiers can be uppercase or lowercase. In Athena they must be lowercase. In MySQL they are case sensitive.

Comparison Tests

equality test =
non-equality test <> !=
relational operators < > <= >=

<> and != are synonyms in all SQL engines described here.

One can use the IN operator to test whether a value is equal to one of the values in a list; e.g. the following expressions are synonyms:

> SELECT 1 IN (1, 2, 4);
> SELECT 1 = 1 OR 1 = 2 OR 1 = 4;

All types define an order for their values. As a result the relational operators and the BETWEEN operator can be used with them and they can appear as arguments in the ORDER BY clause.

CASE

An example of a CASE expression:

SELECT CASE WHEN some_value > 0 THEN 1 WHEN some_value = 0 THEN 0 ELSE -1 END;

Identifiers and AS

Identifiers which start with a letter and only contain letters, digits, and the underscore "_" do not need to be quoted. Other identifiers must be quoted with double quotes.

Column expressions in the SELECT clauses can be renamed with an AS clause at the end. The AS keyword is optional.

Table names in FROM and JOIN clauses can be renamed with AS clauses as well.

Wildcard: *

The asterisk provides a concise way to select all the columns from the tables in the FROM and JOIN clauses.

The asterisk can be used with other expressions, e.g.:

SELECT *, clicks / impressions AS ctr
FROM web_log_summary;

The qualified wildcard can be used to select columns from a specific table:

SELECT a.*
FROM alpha AS a
JOIN beta AS b ON a.id = b.id;

DISTINCT

SELECT DISTINCT name, age FROM customers;

The DISTINCT keyword is used to remove duplicate rows from the output. It is best thought of as a modifier of all the SELECT clauses expressions, and not just the first. One could think of it as a modifier of the SELECT keyword itself, but then how would one explain this query:

SELECT COUNT(DISTINCT *) FROM customers;

Aggregation and Window Functions

Aggregation Functions

The most widespread are COUNT, SUM, MIN, MAX, and AVG.

When some expressions in the SELECT clause contain aggregation functions, other expressions must be in the GROUP BY clause.

putting the arguments of the GROUP BY clause first in the SELECT clause: the grain of an aggegation (the primary key of the result set)

what is going on with the arguments of the COUNT function

Window Functions

The SELECT clause is a list of expressions which define the columns of the result set. Any of these expressions can contain window functions, which are a small set of functions that are always followed by the OVER keyword. The OVER keyword is followed by a clause in parens consisting of an optional PARTITION BY clause and and an optional ORDER BY clause.

The aggregration functions COUNT, SUM, MIN, MAX, and AVG can all be used as window functions. If the OVER clause is empty, the aggregation function is computed over the entire result set and the value is repeated for each row in the result set.

When the PARTITION BY clause is present, the results are grouped by the values of expression following PARTITION BY and the aggretation function is computed for each partition and that value is repeated for each row that belongs to the partition.

When the ORDER BY clause is present, the results are ordered by the expression and the aggregation function is computed for all rows up to and including the current row. Thus SUM(...) OVER(ORDER BY ...) can be used to compute a cumulative sum.

The PARTITION BY and ORDER BY clauses can both be used, in which case the ORDER BY applies to each partition.

RANK() and ROW_NUMBER() are window functions which require and ORDER BY clause. Both count starting at 1. In the case of ties, RANK() will assign multiple rows the same rank and there will be a gap to the next rank. ROW_NUMBER() is similar to RANK() but gives each row a unique number with no gaps.

Joins and Subqueries

Joins

One can specify multiple tables in the FROM clause, separated by commas. One gets a Cartesian join which can then be narrowed down by a condition in the WHERE clause.

Alternatively one can put additional tables after the first in JOIN clauses. The join condition is moved from the WHERE clause into the ON clause of the JOIN clause. The JOIN syntax makes it easier to write LEFT, RIGHT, or FULL OUTER joins.

A LEFT join is where rows in the FROM table that don't match any rows in the JOIN table are preserved in the result set. All columns that would have been derived from the JOIN column are set to NULL. This would be tedious to implement without the JOIN clause though it could be done with a UNION statement.

Subqueries

Subqueries can appear in FROM, WHERE and HAVING clausess. They are always surrounded by parans.

In FROM clauses, they can be followed by AS <alias> to aid referencing the columns in other clauses.

In WHERE and HAVING clauses they can be used in three ways

  • EXISTS, NOT EXISTS:
  • IN: singlue column, multiple row
  • single column and row

Subqueries can have expressions referring to the FROM clause of the query they are embedded in.

WITH

WITH query1 AS (
  SELECT ...
),
query2 AS (
  SELECT ...
),
...
queryN AS (
  SELECT ...
)
SELECT ...

Normal Form and Semi-structured Data

Normal Form

1NF: no repeating values in columns

Rectangular data is in first normal form (1NF) if not of the columns contain arrays or lists. Traditionally relational databases did not provide an array type, but list data can always be serialized as a string using a delimiter such as a pipe or semicolon.

2NF: no columns dependent on proper subset of candidate key

Rectangular data is in second normal form (2NF) if it is in 1NF and no columns can be a function of a proper subset of the candidate key. That is, it must not be possible to remove the columns from the table and lookup them up in an auxillary table where the proper subset of the candidate key is the primary key.

3NF: no transitive dependencies

Rectangular data is in third normal form (3NF) if it is in 2NF and no columns can be determined by other columns which are not part of the candidate key. That is, it must not be possible to remove the columns from the table and lookup them up in an auxiary table where the other columns make up the primary key.

Splitting a String

mysql postgresql athena snowflake
array type
split and select element SUBSTRING_INDEX(
  SUBSTRING_INDEX(
    'foo,bar,baz', ',', 2),
  ',', -1)
(STRING_TO_ARRAY(
  'foo,bar,baz', ','))[2]

PostgreSQL: string_to_array

Athena: SPLIT(string, separator)

Snowflake: SPLIT(string, separator)

JSON

mysql postgresql athena snowflake
datatype JSON JSONB VARCHAR VARIANT
extract json value by path JSON_EXTRACT(doc, path) jsonb_extract_path()
extract string value by path JSON_EXTRACT(doc, path) jsonb_extract_path_text()
  • how are errors (e.g. element not found) handled
  • what is the type of the extracted element

Expanding List Data in a field to Multiple Rows

Suppose that you have a table called "events" with a column "ads" that contains a list of data. The table is thus not in 1NF and can be hard to query.

In Snowflake, assuming the data is an array type, you can normalize the data like this:

SELECT a.value, a.this, e.*
FROM reviews.events AS e
JOIN LATERAL FLATTEN (input => ads) as a

For each row in the table, there are as many rows in the result set as there are ads in that row. The ad appears in the "a.value" column, all of the ads appear in the "a.this" column, and all of the columns from original row appear in the "e.*" columns. Since one of those is the "ads" column, that column is repeated twice in the result set.

DDL and DML

It is better to log in to the database with a readonly account. After
all, the purpose of the relational database is to issue SELECT
statements. A readonly account allows one to do this in a carefree
manner with few consequences. Still someone must undertake the burden
of issuing the DDL and DML statements or there will be nothing to
query. Limiting the accounts, services, and code that can issue DDL
and DML statements makes debugging easier. The number of suspects to
investigate when a crime is committed is smaller, as it were.

WHERE clauses in UPDATE and DELETE statements

UPDATE and DELETE statements have optional WHERE clauses. If the WHERE
clause is omitted, the statement applies to all rows in the
TABLE. Think about the consequences of omitting a WHERE clause by
accident.

Idempotent DDL and DML

Mysql, Snowflake have "CREATE OR REPLACE TABLE".

PostgreSQL: "DROP TABLE IF EXISTS" followed by "CREATE TABLE"

DELETE and UPDATE are already idempotent, usually. (WHERE clause could refer to values being deleted or modified).

INSERT is not idempotent. UPSERT or MERGE?

Sequences

  • syntax of sequences by database
  • distributed sequences
  • joining on integers vs joining on strings or dates. Encoding dates as integers.
  • hash collisions

Database Setup

Installing Databases

On a Mac with Homebrew:

$ brew install mariadb postgresql sqlite

On Ubuntu:

$ sudo apt-get install -y mariadb-server postgresql sqlite3

Running the Servers

On a Mac with Homebrew, these commands give information on how to start the server, or to schedule to server to be started at startup:

$ brew info mariadb
$ brew info postgresql

Creating Users and Databases

SQLite does not have user accounts. To create a database, specify the file on the command line when starting sqlite. If the file does not exist it is created.

$ sqlite3 /tmp/my_database_file.db

Log in to the database as root to create a MySQL database and user:

$ mysql -uroot
> CREATE DATABASE bob;
> CREATE USER bob;
> GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, INDEX, DROP, ALTER, CREATE TEMPORARY TABLES, LOCK TABLES ON bob.* TO bob;

PostgreSQL provides command line tools for creating users and databases.

$ sudo -u postgres createuser -s $USER
$ createdb $USER
$ psql