Very cool! A couple questions/suggestions off the top of my head:
1. Did you consider using a keyword like `let` for column declarations, e.g. `let gross_salary = salary + payroll_tax` instead of just `gross_salary = salary + payroll_tax`? It's nice to be able to scan for keywords along the left side of the window, even if it's a bit more verbose.
2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?
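For reference, the awkward dplyr/pandas shape of that pattern looks something like the following (a pandas sketch with invented column names): two intermediate aggregations at different granularities, at least one of which must be named before the join.

```python
import pandas as pd

employees = pd.DataFrame({
    "dept": ["eng", "eng", "sales"],
    "region": ["us", "eu", "us"],
    "salary": [100, 90, 80],
})

# "Subquery" 1: aggregated to department granularity
by_dept = employees.groupby("dept", as_index=False)["salary"].mean()

# "Subquery" 2: aggregated to region granularity
by_region = employees.groupby("region", as_index=False)["salary"].sum()

# The awkward part: joining the two intermediate results forces you to
# name at least one of them, breaking the single-pipeline style.
merged = by_dept.merge(
    employees[["dept", "region"]].drop_duplicates(), on="dept"
).merge(by_region, on="region", suffixes=("_dept_avg", "_region_total"))
```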
> Did you consider using a keyword like `let` for column declarations
Yeah, the current design for that is not nice. Good point re the keyword scanning. I actually listed `let` as an option in the notes section. Kusto uses `extend`; dplyr uses `mutate`; pandas uses `assign`.
Awesome that you're responding to feedback like this!
Another suggestion around `let`: consider splitting it into two operations, one for creating a new column and one for modifying an existing one, e.g. `let` and `set`. Those are in effect quite different operations: you need to know which one is happening to know how many columns the table will have, and under your current system, renaming a table column can change which operation is happening.
Splitting them into separate operations would make things easier on the reader: they can tell what's happening without having to know all the column names of the table. And it shouldn't really be harder for the writer, who ought to already know which they're doing.
I encountered something like this at my previous job. We had a DSL with an operation that could either create or modify a value. This made the code harder to read, because you had to carry extra state in your head to know what the code was doing. When I rewrote the DSL (the rewrite was sorely needed for other reasons), I split the operation in two. I was worried people would be too used to the old language, but in practice everyone was happy with the change.
> Another suggestion around `let`: consider splitting it into two operations, for creating a new column and for modifying an existing one. E.g. called `let` and `set`.
Couldn't we just not allow modifying an existing column? I.e., we would not allow

`count = count + 1`

but force the use of a new variable name:

`new_count = count + 1`

I think this makes for much more readable code, since the value of a variable no longer depends on which line you're looking at.
> 2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?
I don't have an example on the Readme, but I was thinking of something like (toy example):
table newest_employees = (
  from employees
  sort tenure
  take 50
)

from newest_employees
join salary [id]
select [name, salary]
Or were you thinking something more sophisticated? I'm keen to get difficult examples!
There, each table variable can be referenced by any downstream step. Usually a step references the immediately prior one: without table variables, your language implicitly pipes the most recent result. With explicit table references, you can pipe from any earlier step, which is exactly what a join needs, since it references two prior results at once.
I haven't thought through that fully, so there may be gotchas in compiling such an approach down to SQL, but you can already do something similar with SQL CTEs anyway, so it should probably work.
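For what it's worth, such table variables map pretty directly onto named CTEs. A minimal sketch via sqlite3 (table contents and SQL translation invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, tenure INTEGER);
    CREATE TABLE salary (id INTEGER, salary INTEGER);
    INSERT INTO employees VALUES (1, 'ada', 2), (2, 'bob', 10);
    INSERT INTO salary VALUES (1, 120), (2, 100);
""")

# `table newest_employees = (...)` becomes a named CTE, and the final
# pipeline references it like any other table.
rows = conn.execute("""
    WITH newest_employees AS (
        SELECT * FROM employees ORDER BY tenure LIMIT 50
    )
    SELECT e.name, s.salary
    FROM newest_employees AS e
    JOIN salary AS s USING (id)
""").fetchall()
```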