
Very cool! A couple questions/suggestions off the top of my head:

1. Did you consider using a keyword like `let` for column declarations, e.g. `let gross_salary = salary + payroll_tax` instead of just `gross_salary = salary + payroll_tax`? It's nice to be able to scan for keywords along the left side of the window, even if it's a bit more verbose.

2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?
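For reference, the SQL shape I have in mind is roughly this (toy example, table and column names made up):

  with dept_totals as (
    select department, sum(salary) as total_salary
    from employees
    group by department
  ),
  dept_counts as (
    select department, count(*) as n_employees
    from employees
    group by department
  )
  select t.department, t.total_salary / c.n_employees as avg_salary
  from dept_totals t
  join dept_counts c on t.department = c.department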




Thanks!

> Did you consider using a keyword like `let` for column declarations

Yeah, the current design for that is not nice. Good point re the keyword scanning. I actually listed `let` as an option in the notes section. Kusto uses `extend`; dplyr uses `mutate`; pandas uses `assign`.

I opened an issue here: https://github.com/max-sixty/prql/issues/2


I've added the `let` keyword given a few people commented on this.


Awesome that you're responding to feedback like this!

Another suggestion around `let`: consider splitting it into two operations, one for creating a new column and one for modifying an existing one, e.g. called `let` and `set`. Those are in effect pretty different operations: you need to know which one is happening to know how many columns the table will have, and with your current system, renaming a table column can silently change which operation is happening.

Splitting them into separate operations would make things easier on the reader: they can tell what's happening without having to know all the column names of the table. And it shouldn't really be harder for the writer, who ought to already know which one they're doing.
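To make it concrete, something like this (hypothetical syntax, not anything PRQL has today):

  from employees
  let gross_salary = salary + payroll_tax  # error if gross_salary already exists
  set salary = salary * 1.1                # error if salary doesn't exist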

I encountered something like this at my previous job. We had a DSL with an operation that could either create or modify a value. This made the code harder to read, because you had to keep extra state in your head to know what the code was doing. When I rewrote the DSL (the rewrite was sorely needed for other reasons), I split the operation in two. I was worried people would be too used to the old language, but in practice everyone was happy with it.


Yes, this is a good idea. dplyr has something similar with `mutate` & `transmute`.

This could _mostly_ be enforced by PRQL. There's a case where we transpile to:

  select *, x+1 as x_plus_one
...where we don't know whether or not we're overwriting an existing column. But it's a minority of cases, and the contract could stand within PRQL.
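To spell out the ambiguity: `select *` plus an alias doesn't replace an existing column in SQL, it appends another column with that name, so the result shape depends on the input schema:

  -- if t has columns (x), this yields: x, x_plus_one
  -- if t already has (x, x_plus_one), it yields a duplicate:
  --   x, x_plus_one, x_plus_one
  select *, x + 1 as x_plus_one
  from t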

I opened an issue here: https://github.com/max-sixty/prql/issues/6


> Another suggestion around `let`: consider splitting it into two operations, for creating a new column and for modifying an existing one. E.g. called `let` and `set`.

Couldn't we just disallow modifying an existing column? I.e. we would not allow

  count = count + 1
But force the use of a new variable name:

  new_count = count + 1
I think this makes for much more readable code, since the value of a variable does not depend on line number.


Arquero uses `derive` (https://uwdata.github.io/arquero/api/verbs#derive), which I rather like (it's better than `mutate`, IMO)


> 2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?

I don't have an example on the Readme, but I was thinking of something like (toy example):

  table newest_employees = (
    from employees
    sort tenure
    take 50
  )
  
  from newest_employees
  join salary [id]
  select [name, salary]
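I'd guess that compiles down to a CTE along these lines (hand-written sketch, not actual compiler output):

  with newest_employees as (
    select *
    from employees
    order by tenure
    limit 50
  )
  select name, salary
  from newest_employees
  join salary using(id)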

Or were you thinking something more sophisticated? I'm keen to get difficult examples!

Edit: formatting


When you add the ability to reference other tables like that to the piping syntax, it starts to remind me of the M query language: https://docs.microsoft.com/en-us/powerquery-m/quick-tour-of-...

There, each variable can be referenced by downstream steps; generally, each step references the prior one. Without table variables, your language implicitly pipes the most recent result. With table references, you can explicitly pipe any prior one, so a join step can draw on multiple earlier steps.

I haven't thought through that fully, so there may be gotchas in compiling such an approach down to SQL, but you can already do something similar in SQL CTEs anyway, so it should probably work.


My gut reaction is that if we have `from` first, then maybe we should have `as` last:

  from employees
  sort tenure
  take 50
  as newest_employees
  
  from newest_employees
  join salary [id]
  select [name, salary]



