Glowdust is a new kind of database management system (codeberg.org/glowdust)
89 points by todsacerdoti on Feb 15, 2024 | 20 comments


This feels a lot like many Prolog-likes, tuple-based DBs, and a bunch of other things.

The README doesn't explain how it solves the real DBMS problems: consistency, isolation, etc. Without looking at the code, my bet is that this is mostly playing with the data/language itself.


Reminds me of datalog, but there's no mention of it in the readme. I wonder what the big difference is (apart from the syntax).


The big difference I see at a skim is that, in classical datalog, facts are only allowed to contain domain attributes, not value attributes. E.g., you can express facts like Raining(12:00) (it's raining at 12:00), but not Rain(12:00) = 5 in (at 12:00, 5 inches of rain had accumulated).
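To make that concrete, here's a minimal SQL sketch of the two kinds of facts (the tables and columns are my own illustration, not anything from Glowdust or a particular datalog dialect):

    -- classical datalog-style fact: the presence of a row asserts
    -- Raining(t); there is nothing to aggregate but the rows themselves
    CREATE TABLE raining (t TIME PRIMARY KEY);

    -- value attribute: 'inches' is functionally dependent on the key,
    -- i.e. Rain(t) = inches, so aggregation over it is direct
    CREATE TABLE rainfall (t TIME PRIMARY KEY, inches NUMERIC);
    SELECT MAX(inches) FROM rainfall;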

Value attributes make it much easier to express most forms of aggregation (sum, min, max), so you'll find very similar patterns in practical datalog variants, e.g., RelationalAI's Rel [1], DBToaster's AGCA [2], etc.

Apart from that, and a syntax that seems to resemble map-style collection programming a bit more than datalog, yeah, this basically looks like datalog.

[1] https://docs.relational.ai/getting-started/rel/my-first-rel-...
[2] https://dbtoaster.github.io/


I always thought you modelled this as Rain(12:00, 5) (which is similar to how a relational db table would look, with two columns, etc.).

But I'm not a huge expert in that; is there a profound difference between the two?


From a practical standpoint for most database systems, sort of? One might say that there's a functional dependency from the 'time' to the 'precipitation' attribute, and providing that information to the optimizer might affect its decisions... but at the level of data storage and query evaluation runtimes, there's not a huge difference.

From a data modeling and query optimization perspective, however, there's some value in distinguishing attributes uniquely related to identity (e.g., keys, or group-by attributes) from attributes that we're only interested in computing statistics over. This makes it easier to automatically create, e.g., data cubes or similar indexes, and many useful statistics can be modeled using a nice mathematical structure like a ring or semiring [1], whose properties (commutativity, associativity, distributivity) are very helpful when optimizing queries.
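As a hedged illustration of why those properties matter (standard SQL over hypothetical orders/rates tables): because SUM is commutative and associative, and multiplication distributes over it, an optimizer can push partial aggregation below a join:

    -- original query: aggregate after the join
    SELECT r.region, SUM(o.amount * r.fx_rate) AS total
    FROM orders o JOIN rates r ON o.currency = r.currency
    GROUP BY r.region;

    -- equivalent rewrite: pre-aggregate per join key first; this is only
    -- valid because (+, *) behave as a semiring over the amounts
    SELECT r.region, SUM(p.subtotal * r.fx_rate) AS total
    FROM (SELECT currency, SUM(amount) AS subtotal
          FROM orders GROUP BY currency) AS p
    JOIN rates r ON p.currency = r.currency
    GROUP BY r.region;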

Classical Datalog, in particular, is entirely based on the former type of attribute; value (dependent) attributes always need to be hacked in, in some way.

[1] https://dl.acm.org/doi/10.1145/1265530.1265535


At first glance, this looks semantically equivalent to SPARQL, except you can add function expressions. I like combining data and code, but would like to know more about the inspiration and the differences from triplestores.

> Of course, temperature is not defined at every possible time value. For such values, Glowdust defaults to returning no value

Makes it sound like you can add interpolation expressions to functions with defined triples, but I don't see an example/confirmation of that. Can you?


I'd rather have data separated from the operations done on it. I am very happy that I no longer have to deal with stored procedures.


Honestly, the main issue I have with stored procedures is mostly the tooling, which is stuck in the 80s…

For many things, having an “application server” running code that merely chains SQL queries (often hiding them behind an ORM and issuing far too many queries, destroying performance in the process) is not a particularly good idea… The main reason we do it is that we get a proper debugger, unit tests, an IDE, etc., but there's no fundamental reason why stored procedures could not give the same developer experience, except that RDBMS vendors don't give a sh*t.
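For example (a minimal sketch in PostgreSQL's dialect; the schema and function are hypothetical), a single stored function can replace an ORM loop that would otherwise fire one query per order:

    -- one round trip computes all order totals for a customer,
    -- instead of the ORM's 1 + N separate queries
    CREATE FUNCTION order_totals(p_customer INT)
    RETURNS TABLE (order_id INT, total NUMERIC)
    LANGUAGE sql AS $$
      SELECT o.id, SUM(li.qty * li.unit_price)
      FROM orders o
      JOIN line_items li ON li.order_id = o.id
      WHERE o.customer_id = p_customer
      GROUP BY o.id;
    $$;

    SELECT * FROM order_totals(42);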


This is true; I'd love to see a truly modern application development process built entirely around hosting in a database.

However, one other concern is that scaling out databases is much harder than scaling out stateless processing layers, and therefore at scale there's a huge benefit to separating as much out of the storage layer as possible.



Yeah this is sort of what I'm thinking of, thanks!


Horrifying, yet brilliant.


In addition to sibling comments, there is also https://postgrest.org/


Something like CouchDB "applications"? [0]

0 - https://docs.couchdb.com/en/latest/ddocs/index.html


There is the operational aspect that managing stateless components is usually much more pleasant from an ops point of view. And a DBMS is the very opposite of stateless.

So in many cases it is preferable to extract as much of the computational parts of an application as possible from the DBMS into separate stateless components that can then be scaled/updated/maintained/etc. independently.


This is an interesting point, but imo it ignores the facilities afforded by an RDBMS and possibly even misunderstands the notion of fully hosting applications on an RDBMS.

Some sort of checkpointing is inevitable for recoverable stateless systems (that naturally react to a stream of data/queries), and that is ultimately delegated to either a streaming/messaging system (e.g. Kafka) or a database (of some sort).

Components of an application hosted in an RDBMS are tables, indexes, views, and code. The code is stateless; they're just functions. Application state is in tables. Replication is supposed to handle your more hairy operational concerns. What's the problem?


> Replication is supposed to handle your more hairy operational concerns. What’s the problem?

Replication is slow, requires each node to have a lot of resources, and generally prefers nodes to be relatively homogeneous. For non-trivial applications, replicating the whole application (or some shard of it) to all nodes is simply impractical, and it is useful to be able to scale and adjust different parts of the system independently.

For example, on AWS Lambda you can scale out by up to 1000 new instances per 10 seconds (and scale in at a similar rate). Can you imagine any DBMS replication working effectively under such circumstances?


While this Glowdust obviously implements a query language very different from SQL, whatever differences exist between it and relational databases are not because, as it claims, "It uses functions as a model for storing data".

The so-called relational model and relational databases do not have an appropriate name; a much more suitable name would be the relational-functional model and relational-functional databases.

Even if, by the usual definitions, functions are a subset of relations, for designing and using databases efficiently it is important to be very aware of the differences between functions and non-functional relations, a.k.a. many-to-many relations.

Any data table in a database is composed of two sub-tables: one that consists of relational columns, i.e. the so-called primary key columns, and another, which may be empty, that consists of functional columns, i.e. columns that are each a function of the primary key.

There are many relational databases that do not contain any relation, because all the tables have primary keys consisting of a single column. Such databases contain only functions. Only tables whose primary key includes multiple columns contain relations.
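A hedged SQL sketch of this terminology (both tables are illustrative): with a single-column primary key every other column is a function of the key, while a composite key is what actually encodes a relation:

    -- only functions: name = f(id), born = g(id); no many-to-many relation
    CREATE TABLE person (
      id   INT PRIMARY KEY,
      name TEXT,
      born DATE
    );

    -- the composite key encodes a many-to-many relation between students
    -- and courses; 'grade' is a functional column over that key
    CREATE TABLE enrollment (
      student_id INT,
      course_id  INT,
      grade      CHAR(1),
      PRIMARY KEY (student_id, course_id)
    );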

The most frequently encountered joins are not true relational joins (which create a new relation, i.e. a table whose primary key is the concatenation of the primary keys of the joined tables); they just extend a table with functional columns that correspond to functions defined in tabular form instead of by formulae.

True relational joins are those that match columns that are not primary keys in any of the joined tables. The most frequently used joins match a foreign key in one table with the primary key of another; such joins are just tabular function computations and do not produce any new relation.
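In hedged SQL terms (hypothetical tables again), the contrast looks like this:

    -- FK -> PK join: each order row is merely extended with columns that
    -- are functions of its customer_id; no new relation is produced
    SELECT o.id, c.name
    FROM orders o JOIN customers c ON o.customer_id = c.id;

    -- non-key join: rows can multiply, and the result is a genuinely new
    -- relation keyed by the concatenation of both tables' keys
    SELECT s.id AS shipment_id, p.id AS pickup_id
    FROM shipments s JOIN pickups p ON s.city = p.city;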

So Glowdust might be implemented in an unusual way for a database management system, but if it is restricted to functions, that does not make its use different from the many databases that are implemented with a relational DBMS yet do not actually use relations.

Any relation can be modeled by a function, i.e. by a predicate defined on a Cartesian product, but this may result in less efficient implementations than using a true relational DBMS.


Is there any data interpolation/generation if I run a "function" with a new argument?


I gave a talk on this very subject, the future of cloud databases, many years ago at CloudCamp (love you Dave, you're the best).

A broad DBMS service fabric that fully integrates containerization into its core product is such an obvious evolution, and it's awesome to see this experiment moving forward.

It has so many advantages over existing systems as an architecture, but the core one is that it makes building fault-tolerant infrastructure incredibly easy and could take a lot of the front-end business logic away, helping us move back to a simpler, faster web.

Neat post. Nice work. Ty for sharing.



