tsv-sql

tsv-sql is a relational database which stores tables in TSV files. tsv-sql has no server and no indices: all operations are table scans. The tsv-sql tool converts a SQL SELECT statement to a pipeline of awk, sort, and join commands.

rename to sqawkl?

overview

$ tsv-sql 'SELECT * FROM foo JOIN bar ON foo.name = bar.name WHERE foo.city = "Denver";' foo.tsv bar.tsv > baz.tsv
  • tab and newline delimited
  • file name root becomes the table name
  • uses double quotes in place of single quotes for string; this makes tsv-sql easier to use on the command line
  • headers define columns name
  • column type can optionally be specified in braces: name{text} \t age{integer} \t city{text}
  • dates are in this format: YYYY-MM-DD HH24:MI:SS
  • true and false are 1 and 0
  • null is the same as empty string
  • sql keywords case insensitive, other identifiers case sensitive
  • subselects not supported; multiple files in the FROM clause not supported
  • join conditions must be equality conditions
  • errors: no header in file, line with too few, too many fields
  • strings and files are 8 bit ASCII: UTF-8 might work, but invalid UTF-8 is not rejected
  • PostgreSQL used as model for SQL dialect
  • SELECT: selected rows written to stdout

todo

  • just get it to read the table names and join them
  • identifiers must support TABLE.COLUMN
  • the * operator and count(*) [other args cannot take * as param]
  • understand distinct
  • stdin -> string
  • figure out execution structure
  • to_char(x, fmt): x can be integer, numeric, timestamp, interval
  • to_date
  • to_number(text, fmt)
  • date intervals
  • ascii() and chr()
  • a between x and y
  • a not between x and y
  • a in (x, …)
  • a not in (x, …)
  • type checking
  • locations
  • single quote strings? Make double quote strings honor JSON backslashes instead of duplication?
  • -f flag (like awk for a select stmt in a file)

grammar

we seem to be using a mix of BNF and EBNF notation…

also a mix of regex/lex/yacc notation…

do we need to explicitly add whitespace?

select statement

<select_stmt> ::= <select> [ <distinct> ] <select_list>
                  [ <from> <table_spec> [ <join_clause> {  <join_clause> } ] ]
                  [ <where> <boolean_expr> ]
                  [ <group_by> ( <column_spec> | <postitive_int> ) { , <column_spec> | <positive_int> } [ <having> <boolean_expr> ] ]
                  [ <order_by> <order_spec> { , <order_spec> } ]
                  [ <limit> <nonnegative_int> ]
                  [ <offset> <nonnegative_int> ]

<join_clause> ::=  [ <join_type> ] <join> <table_spec> <on> <column_spec> = <column_spec>
                  | [ <join_type> ] <join> <table_spec> <using> <column>

<join_type> ::  [ <inner> ] ( <left> | <right> | <full> ) [ <outer> ]  

<order_spec> ::= ( <column_spec> | <integer> ) [ <asc> | <desc> ]

<select_list> ::= <select_term> {',' <select_term> }

<select_term> ::= <expression> [ <as> <name> ]

<table_spec> ::= <table> | <table> <as> <name>

<column_spec> ::= <column> | <table>'.'<column>

<column> ::= <name>

<table> ::= <name>

<name> ::= [a-zA-Z_][a-zA-Z0-9]*

<number> ::= <integer> | <numeric>

<integer> ::= <column_spec> | <integer_literal> | <function_invocation>

<numeric> ::= <column_spec> | <numeric_literal> | <function_invocation>

<string> ::= <column_spec> | <string_literal> | <function_invocation>

<numeric_literal> ::= [ '-' | '+' ] ( [0-9]+ '.' [0-9]* | [0-9]* '.' [0-9]+ )

<integer_literal> ::= [ '-' | '+' ] [0-9]+

<nonnegative_int> ::= '0' | <positive_int>

<positive_int> ::= [1-9][0-9]*

expression

<expression> ::= <arithmetic_expr> | <boolean_expr> | <string_expr> | <null>

<boolean_expr> ::= <boolean_expr2> | <boolean_expr> <or> <boolean_expr2>

<boolean_expr2> ::= <boolean_expr3> | <boolean_expr2> <and> <boolean_expr3>

<boolean_expr3> ::= <boolean_expr4> | <not> <boolean_expr3>

<boolean_expr4> ::= <comparison_expr> | <is_expr> | <like_expr> | <similar_expr> | <true> | <false> | '(' <boolean_expr> ')'

<is_expr> ::= <expression> <is> <null> | <expression> <is> <not> <null>

<like_expr> ::= <expression> <like> <string> | <expression> <not> <like> <string>

<similar_expr> ::= <expression> <similar_to> <string> | <expression> <not> <similar_to> <string>

<comparison_expr> ::=  <expression> <comparison_op> <expression>

<comparison_op> ::= '=' | '!=' | '<' | '>' | '<>' | '<=' | '>='

<arithmetic_expr> ::= <arithmetic_expr2>
                      | <arithmetic_expr2> '+' <arithmetic_expr>
                      | <arithmetic_expr2> '-' <arithmetic_expr>

<arithmetic_expr2> ::= <arithmetic_expr3>
                      | <arithmetic_expr3> '*' <arithmetic_expr2>
                      | <arithmetic_expr3> '/' <arithmetic_expr2>
                      | <arithmetic_expr3> '%' <arithmetic_expr2>

<arithmetic_expr3> ::= <arithmetic_expr4>
                      | <arithmetic_expr4> '^' <arithmetic_expr3>

<arithmetic_expr4> ::= <number>
                      | '(' <arithmetic_expr> ')'

<string_op> ::= '||'

<string_expr> ::= <string_expr2> | <string_expr2> <string_op> <string_expr>

<string_expr2> ::= <string> | '(' <string_expr> ')'

(* string chunks cannot be whitespace separated, perhaps this is wrong *)
<string> ::= <string_chunk> { <string_chunk> }

<string_chunk> ::= '"' { <string_char> } '"'

<string_char> ::= [ASCII except for " and control characters 0-31, 127]

<function_invocation> ::= <function> '(' <expression_list> ')'

<function> ::= [LIST FUNCTIONS]

<expression_list> ::= <expression> { ',' <expression_list> }

keywords

keywords are case insensitive

AND
AS
ASC
BY
CASE
CURRENT_TIMESTAMP
DESC
DISTINCT
ELSE
END
FALSE
FULL
GROUP
HAVING
INNER
IS
JOIN
LEFT
LIKE
LIMIT
NOT
NULL
OFFSET
OR
ORDER
OUTER
RIGHT
SELECT
SIMILAR
THEN
TO
TRUE
WHEN
WHERE

engine

execution stages

optional pass/clause description tools
Y from/join join conditions must be quality; sort each table except the first by the join column and put the results in a temp table; sort the initial table in a pipeline with join statements sort + join once for each additional table
Y where expr apply where clause filter and compute expressions in select, group by, and order by clauses; where clause should be applied as soon as all the tables for it have been joined awk
Y group by apply group by and compute aggregate arguments sort + awk
Y aggregate apply aggregate functions and expressions containing them; compute having clause targets awk
Y having apply having clause filter awk
Y order by sort the rows sort
select rearrange columns per select clause and drop unmentioned columns awk
Y distinct sort -u
Y limit/offset awk

context

typedef int column_type;

struct column {
  char *name;
  column_type type;
  column *next;
};

struct target {
  int id;
  expression expr;
  target *next;
};

struct table {
  char *filename;
  char *name;
  char *alias;
  column *columns;
  table *next;
};

enum expression_type { ADD, SUBTRACT, ... }

/* must be able to convert this to awk;
 column names must be replaced by column numbers
*/
struct expression {
  expression_type head;
  expression *left_child;
  expression *right_child;
  ??? value;  /* for constants */
}

enum aggregration_type { COUNT, SUM, MIN, MAX, AVG, STDDEV };

struct aggregation {
  aggregation_type agg_type;
  expression arg;
}

struct select {
  table *tables;
  target *targets;
  expression where;
  char *where_table_names[]; /* so that we know how soon we can apply where filter */
  int order_by_target_ids[];
  int group_by_targets_ids[];
}

awk function definitions

Note: for current_timestamp to work, must export CURRENT_TIMESTAMP=‘date -u +’%Y-%m-%d %H:%M:%S'`

function abs(x) { if ( x > 0 ) y = 3; else y = -x; return y }
function trim(c, s) { gsub("^[" c "]+","",s); gsub("[" c "]+$","",s);  return s }
function ltrim(c, s) { gsub("^[" c "]+","",s); return s }
function rtrim(c, s) { gsub("[" c "]+$","",s); return s }
function case(a,b,c) { if (a) return b; else return c }
function ceil(x) { if (x > 0) return int(x+1); else return int(x) }
function floor(x) { if (x > 0) return int(x); else return int(x-1) }
function round(x) { if (x > 0) return int(x+.5); else return int(x-.5) }
function sign(x) { if (x > 0) return 1; else if (x < 0) return -1; else return 0 }
function nextval(seq) { if (!sequences[seq]) sequences[seq] = 0; sequences[seq] += 1; return sequences[seq] }
function currval(seq) { return sequences[seq] }
function setval(seq, val) { sequences[seq] = val }
function nullif(a,b) { if ( a == b ) return ""; else return a }
function current_timestamp() {  return ENVIRON["CURRENT_TIMESTAMP"] }

sql

We use PostgreSQL as a model for our dialect SQL. We make the changes:

  • "" and NULL are the same.
  • true, false, 0, and 1 are the same.
  • strings are double quoted instead of single quoted.
  • quoted identifiers are not supported.
  • subqueries are not supported.
  • we have a reduced set of types.
  • many functions are not implemented.

operators

Arranged from lowest precedence to highest. Evaluated from left to right unless otherwise stated.

level sql awk
logic
1 or ||
2 and &&
3 not !
comparison
a = true|false true(a), true(!a)
true|false = a true(a), true(!a)
a != true|false !true(a), !true(!a)
true|false != a !true(a), !true(!a)
> >= <= <> != > >= <= != !=
a like b match(a,like_regex(b))
a not like b !match(a,like_regex(b))
a similar to b match(a,b)
a not similar to !match(a,b)
exists
in
not in
arithmetic and concatenation
0 || whitespace
1 |/ sqrt
1 @ abs()
2 + - + -
3 unary - + - nothing
3 * % * %
3 / / or trunc(a / b)
4 ^ ^
pseudo function
1 substr(a [from b] [for c]) substring(a,b,c) or substring(a,0,c) or substring(a,b)
1 position(a in b) index(b,a)
1 trim(c from s) trim(c,s)
1 trim(leading c from s) ltrim(c,s)
1 trim(trailing c from s) rtrim(c,s)
1 trim(leading from s) ltrim(" ",s)
1 trim(trailing from s) rtrim(" ",s)
1 case when a1 then b1 [ when a2 then b2 … ] [ else c ] end case(a1,b1,""), case(a1,b1,c), case(a1,b1,case(a2,b2,"")), …
1 coalesce(a,b[,c…]) coalesce(a,b), coalesce(a,coalesce(b,c)), …

functions

sql awk
abs() abs()
ceil(), ceiling() ceil()
div() / or trunc(a / b)
exp() exp()
floor() floor()
ln() log()
log(a) log(a)/log(10)
log(a,b) log(a)/log(b)
mod() %
power() ^
random() rand()
round() round()
setseed() srand()
sign() sign()
sqrt() sqrt()
trunc(x) int(x)
trunc(x,n) int(x * 10^n) / 10^n
cos() cos()
sin() sin()
atan2() atan2()
length() length()
lower() tolower()
trim(s) trim(" ",s)
upper() toupper()
regexp_replace(a, b, c) sub(b,c,a)
regexp_replace(a, b, c,'g') gsub(b,c,a)
nextval(a) nextval(a)
currval(a) currval(a)
setval(a,v) setval(a,v)
nullif(a,b) nullif(a,b)
count()
sum()
avg()
max()
min()
current_timestamp() current_timestamp()

precedence tests for logic

postgres=# select false and false or true;
 ?column? 
----------
 t

postgres=# select not false and false;
 ?column? 
----------
 f

postgres=# select not true or true;
 ?column? 
----------
 t

these show that like has higher precedence that and

clark=# select null is true and null;
 ?column? 
----------
 f
(1 row)

clark=# select true and null;
 ?column? 
----------

(1 row)

clark=# select null is null;
 ?column? 
----------
 t
(1 row)

precedence and associativity tests for arithmetic

postgres=# select |/ -4 ^ 2;
 ?column? 
----------
        4

postgres=# select |/ -4;
ERROR:  cannot take square root of a negative number

postgres=# select |/ 4 + 4;
     ?column?     
------------------
 2.82842712474619

postgres=# select @ 2 + -3;
 ?column? 
----------
        1

postgres=# select 2 * 3 + 4;
 ?column? 
----------
       10

postgres=# select 2 ^ 3 ^ 4;
 ?column? 
----------
     4096

postgres=# select 1 - 2 - 3;
 ?column? 
----------
       -4

postgres=# select 8 / 2 / 2;
 ?column? 
----------
        2

postgres=# select 2 ^ 3 * 4;
 ?column? 
----------
       32

postgres=# select - 3 - 2;
 ?column? 
----------
       -5

postgres=# select - 3 + 2;
 ?column? 
----------
       -1

postgres=# select - 3 ^ 2;
 ?column? 
----------
        9
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License