Very cool! A couple questions/suggestions off the top of my head:
1. Did you consider using a keyword like `let` for column declarations, e.g. `let gross_salary = salary + payroll_tax` instead of just `gross_salary = salary + payroll_tax`? It's nice to be able to scan for keywords along the left side of the window, even if it's a bit more verbose.
2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?
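For reference, the awkward dplyr/pandas shape of that pattern looks something like the following (a pandas sketch with invented column names): two intermediate aggregations at different granularities, at least one of which must be named before the join.

```python
import pandas as pd

employees = pd.DataFrame({
    "dept": ["eng", "eng", "sales"],
    "region": ["us", "eu", "us"],
    "salary": [100, 90, 80],
})

# "Subquery" 1: aggregated to department granularity
by_dept = employees.groupby("dept", as_index=False)["salary"].mean()

# "Subquery" 2: aggregated to region granularity
by_region = employees.groupby("region", as_index=False)["salary"].sum()

# The awkward part: joining the two intermediate results forces you to
# name at least one of them, breaking the single-pipeline style.
merged = by_dept.merge(
    employees[["dept", "region"]].drop_duplicates(), on="dept"
).merge(by_region, on="region", suffixes=("_dept_avg", "_region_total"))
```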
> Did you consider using a keyword like `let` for column declarations
Yeah, the current design for that is not nice. Good point re the keyword scanning. I actually listed `let` as an option in the notes section. Kusto uses `extend`; dplyr uses `mutate`; pandas uses `assign`.
Awesome that you're responding to feedback like this!
Another suggestion around `let`: consider splitting it into two operations, one for creating a new column and one for modifying an existing one, e.g. `let` and `set`. Those are in effect quite different operations: you need to know which one is happening to know how many columns the table will have, and under your current system, renaming a table column can change which operation is happening.
Splitting them into separate operations would make things easier on the reader: they can tell what's happening without having to know all the column names of the table. And it shouldn't really be harder for the writer, who ought to already know which they're doing.
I encountered something like this at my previous job. We had a DSL with an operation that could either create or modify a value. This made the code harder to read, because you had to carry extra state in your head to know what the code was doing. When I rewrote the DSL (the rewrite was sorely needed for other reasons), I split the operation in two. I was worried people would be too used to the old language, but in practice everyone was happy with the change.
> Another suggestion around `let`: consider splitting it into two operations, for creating a new column and for modifying an existing one. E.g. called `let` and `set`.
Couldn't we just not allow modifying an existing column? I.e., we would not allow

`count = count + 1`

but force the use of a new variable name:

`new_count = count + 1`

I think this makes for much more readable code, since the value of a variable no longer depends on which line you're looking at.
> 2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?
I don't have an example on the Readme, but I was thinking of something like (toy example):
table newest_employees = (
  from employees
  sort tenure
  take 50
)

from newest_employees
join salary [id]
select [name, salary]
Or were you thinking something more sophisticated? I'm keen to get difficult examples!
There, each table variable can be referenced by any downstream step. Usually a step references the immediately prior one: without table variables, your language implicitly pipes the most recent result. With explicit table references, you can pipe from any earlier step, which is exactly what a join needs, since it references two prior results at once.
I haven't thought through that fully, so there may be gotchas in compiling such an approach down to SQL, but you can already do something similar with SQL CTEs anyway, so it should probably work.
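For what it's worth, such table variables map pretty directly onto named CTEs. A minimal sketch via sqlite3 (table contents and SQL translation invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT, tenure INTEGER);
    CREATE TABLE salary (id INTEGER, salary INTEGER);
    INSERT INTO employees VALUES (1, 'ada', 2), (2, 'bob', 10);
    INSERT INTO salary VALUES (1, 120), (2, 100);
""")

# `table newest_employees = (...)` becomes a named CTE, and the final
# pipeline references it like any other table.
rows = conn.execute("""
    WITH newest_employees AS (
        SELECT * FROM employees ORDER BY tenure LIMIT 50
    )
    SELECT e.name, s.salary
    FROM newest_employees AS e
    JOIN salary AS s USING (id)
""").fetchall()
```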