Cost of a Join (brianlikespostgres.com)
208 points by DerpyBaby123 on Aug 14, 2018 | 72 comments



I love a good writeup backed up with some performance benchmarks.

But I'm a little befuddled as to why the three possible solutions to putting a "status" column in your Products table were:

    1. Add a status_id column to the product table and reference a new status table
    2. Add a status_id column to the product table and let the app define the mapping of what each status_id is
    3. Add a status column of type text to the product table
What about:

    4. Create an enum column with your possible statuses.
If you ever reach more than 65k possible statuses, you've got a different problem.
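
In Postgres that might look something like this (a minimal sketch; the type name and labels are hypothetical):

    CREATE TYPE product_status AS ENUM ('active', 'discontinued', 'backordered');
    ALTER TABLE products ADD COLUMN status product_status NOT NULL DEFAULT 'active';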

Perhaps the status example is just hypothetical? Although the author says they would use option 1 in this case, so what gives? Seems crazy to create a join table for 'status'.


I was really excited when enums were added to PostgreSQL, but after a couple years of experience I stopped recommending them. Enums are a type, which is fine, but if you use several different enum types across different tables, it complicates moving data between databases: you can't simply pg_dump/pg_restore one table to a different database, you also have to create all the types used by that table, i.e. the enums. That turns a simple

    pg_dump -t atable | psql -d otherdb

into an exercise in querying pg_depend, pg_enum, and the other catalogs, or in hand-parsing pg_dump schema output. If your operation does any amount of ad-hoc ETL, you may find enums more trouble than they are worth.
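
Concretely, restoring that table into a fresh database only works once you recreate the type by hand first (a sketch; the type name and labels are made up):

    -- on otherdb, before the dump will load cleanly:
    CREATE TYPE item_status AS ENUM ('new', 'shipped', 'returned');
    -- then the one-liner works again:
    -- pg_dump -t atable | psql -d otherdb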


Because enums only work for the simplest use cases.

If you replace status_id with country_id, where each country_id has more than one property (country name, ISO alpha-2 code, ISO alpha-3 code, currency_id, etc.), you can see why an enum isn't good enough.


You might not even need most of that country data in your database, though; it could live in your application in many cases (the fastest join your database can do is the one you do in frontend code instead).


I don't know why people think this. In the specific example of country, I _have_ in fact tested using a country class with hard-coded maps of country codes to arrays of data. Even so, benchmarking showed the single database call with a join (MySQL 5.1 or 5.5, I think) to be faster, as I was already hitting the database anyway.

Don't prematurely optimize away your database's flexibility (managed content) before testing a properly-coded query.


I've seen MySQL do a lot of fairly dumb stuff with temp tables and order by. I've seen a lot of cases where just moving sorting to the frontend took a 3 second query down to 10ms, and the frontend sort itself only cost a few ms. There were too many sorts available in the UI to add matching indexes in the db.

If I can move CPU to the frontend from the db, that's a win, because scaling database servers is harder than frontends.


> If I can move CPU to the frontend from the db, that's a win, because scaling database servers is harder than frontends.

This may only be true for data manipulation that's necessarily [1] more CPU-intensive/bound than I/O-intensive.

It's difficult to imagine a situation where that would actually be true, however, in the context of a manipulation like a join.

Given how much faster CPU power has increased relative to I/O over the entire history of computing, CPU on database servers has become progressively less of an issue.

This isn't to say that CPU power is infinite, and I've seen naive analyses attribute performance problems to inadequate CPU power, even when it's an I/O issue masquerading as (system, not user) CPU time.

In that respect, it's hard to scale a database server, in that it requires now-rare, undervalued-by-programmers Ops knowledge of things like hardware and I/O and offloading [2], but that difficulty doesn't actually go away by moving the data around in a partly-distributed or even fully distributed system. It just drives up the cost.

See also Amdahl's Law and the Fallacies of Distributed Computing (in particular the ones about bandwidth and latency).

[1] as opposed to merely as-implemented, such as due to PEBKAC, as a sibling comment points out

[2] e.g. ZFS might have wonderful features, but hardware RAID cards can offload a remarkably significant pressure from the CPU and main memory, when it matters most


> In that respect, it's hard to scale a database server, in that it requires now-rare, undervalued-by-programmers Ops knowledge of things like hardware and I/O and offloading [2], but that difficulty doesn't actually go away by moving the data around in a partly-distributed or even fully distributed system. It just drives up the cost.

In a cloud environment - which is relevant since he was using AWS - it’s much cheaper and easier to autoscale app servers than database servers based on load. It could also conceivably be cheaper if you’re doing non-time-sensitive online analytics processing instead of something that requires a real-time response. On AWS, you could set up autoscaling and use spot instances to scale more when the rates are cheaper for compute.

Also, when scaling up a database, there is more of a potential for wasting money because of idle capacity when the resources aren’t needed. It’s painless to take an app VM out of service.

Yes, I know you can also use DynamoDB, but it’s non-relational, and the cost of adding additional global indexes and paying for extra read and write capacity gets expensive.

There’s also Serverless Aurora that auto scales but it would also get really expensive fast.


> which is relevant since he was using AWS

Of course, in that case, one is so limited by Amazon's specific hardware choices, if not by virtualization [1]. The OC didn't specify AWS but did mention MySQL, instead of Postgres, which further limits one's ability to implement basic performance optimizations, like separating data and index I/O [2].

> it’s much cheaper and easier to autoscale app servers than database servers based on load

Yet I'm still not convinced this is true if the "load" is actually I/O-bound rather than CPU-bound [3]. It may be easier to scale (especially autoscale), but, with cloud providers, that doesn't necessarily mean cheaper. In practice, distributed computing never scales as well as linearly, and people tend not to even measure for diminishing returns or for how Amdahl's Law affects them.

> It could also conceivably be cheaper if you’re doing non-time-sensitive online analytics processing instead of something that requires a real-time response.

I believe GCP has introduced single-instance pricing that's essentially pay-what-you-use in terms of number of cores, so that's even possible with a database server on that provider, but, again, only if the issue is CPU, which I maintain it generally isn't.

> Also, when scaling up a database, there is more of a potential for wasting money because of idle capacity when the resources aren’t needed.

I recall this sort of thing being the original argument for virtualization, when it first came out, but it only held water if one was grossly over-paying for the resources to begin with (which most "enterprise" shops certainly were, and still are).

To me, it's an argument for why, beyond modest size, buying cloud infrastructure is just over-paying.

> paying for extra read and write capacity gets expensive

> Serverless Aurora that auto scales but it would also get really expensive fast

Par for the course for any cloud service, in that they may seem convenient or "auto scale" but the price premium is tremendous, especially at scale. How do they compare with paying for that capacity over EBS?

[1] Yes, I know they now have "bare metal" instances, or is it still only instance (singular)?

[2] Not that it matters much if one is limited to 10Gb/s or even 40Gb/s (if you can get it) EBS volumes.

[3] Which the OC's sort very likely would be but the OP's join would not. To that extent, perhaps the OC's whole premise is a red herring, because I doubt anyone is suggesting not offloading to the app an ORDER BY that's just the last step before presentation to the user.


> in that they may seem convenient or "auto scale" but the price premium is tremendous, especially at scale. How do they compare with paying for that capacity over EBS?

Like a lot of things dealing with autoscaling - lambda, Fargate (serverless Docker), serverless Aurora - autoscaling saves money when your demand is really spikey and even then you have to do a cost comparison.

For Linux instances, you get per-second (or per-minute?) billing, so the cost of bringing instances up and down isn't that expensive. For Windows instances, you are always billed by the hour.

> [2] Not that it matters much if one is limited to 10Gb/s or even 40Gb/s (if you can get it) EBS volumes.

The fastest throughput that I am aware of with EBS is 500MB/sec without doing RAID (https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-eb...)

Of course instance storage is faster (https://gist.github.com/ktheory/3c3616fca42a3716346b), but is that a thing people do for databases?


> autoscaling saves money when your demand is really spikey

Indeed. I've found the spikiness ratio (if that coined term makes sense) to be 2x-10x for AWS in general, and I was wondering if you (or anyone else) had a sense of how much that increased for these more premium services.

> and even then you have to do a cost comparison.

I have rarely seen this done, and, when I've talked to people attempting it, the common complaint is that AWS doesn't make it easy.

> The fastest throughput that I am aware of with EBS is 500MB/sec without doing RAID

Yes, per volume. I was, perhaps, too terse in my footnote. I was referring to the maximum throughput per EC2 node. Considering 40Gb/s would be saturated by about 10 of these, that's not very much. Real hardware can handle quite a bit more than 10 such SSDs attached.

> Of course instance storage is faster (https://gist.github.com/ktheory/3c3616fca42a3716346b), but is that a thing people do for databases?

I would hope so, but probably not in general practice, as it would require too much specialized knowledge. The main problems are that it's ephemeral and very limited in quantity. The impact of the former is reduced by the latter, of course.


> I was wondering if you (or anyone else) had a sense of how much that increased for these more premium services.

I’m first and foremost a developer; I play the role of the “resident certified AWS Architect” when necessary. But the only time I concern myself with cost is when the CTO - who is also pretty knowledgeable about AWS - calls me into his office to complain...


I'm confident that situation is far from unusual, and it's my theory as to why AWS is so wildly successful.

For all the lip service given to "you have to do a cost comparison", it doesn't happen in practice.


I don't see AWS saving money over a well-run, self-hosted data center as far as compute resource costs go.

The issue is, there are relatively few people who can run a data center well compared to the number of people who need servers to run their business. The cost of hiring people to manage local servers is higher. For a small company, you can get away with one or two devs who also have AWS training, outsourcing your heavy lifting to an APN that charges around 10% of your monthly AWS cost. They usually outsource the grunt work overseas or do "rural sourcing" if regulations don't allow overseas access to the data.

Also, you don't spend as much up front. If you are running a well-run business, your expenses on AWS should scale a lot slower than your revenue.


> The issue is, there are relatively few people who can run a data center well compared to the number of people who need servers to run their business.

I don't think that's true, other than as a result of a self-fulfilling prophecy.

> The cost of hiring people to manage local servers is higher.

That, there, is the falsehood at the root of the prophecy. I keep reading comments on HN of the form "and then we'll have to hire 5 sysadmins for those 200 physical servers" but they're invariably from people who have done nothing of the sort.

From personal experience, the incremental effort of dealing with the physical part takes up around a quarter of a full-time Ops person's time at that scale [1]. Even at high salaries, that's less than any cloud premium. Perhaps more importantly, the vast majority of those skills aren't difficult and could easily be imparted to a lower-cost employee. Of course, the idea of providing mentorship or career paths for junior employees is anathema in today's environment, especially in smaller companies.

None of the other Ops labor costs go away, and, judging by salary ranges in SFBA job postings, AWS-expert Ops folks are 10-15% more expensive.

> Also, you don't spend as much up front.

I routinely hear and read this argument, as well, but, given the huge price premium of AWS [2], it doesn't withstand scrutiny.

If the price premium is 6x [3], that means paying the equivalent of the "up front" cost to AWS in 6 months. It might just feel emotionally easier to have it spread out over half a year rather than having to write a big check all at once.

All that said, I wouldn't advocate an early startup (or other small company) doing anything but cloud or VPS.

> Your expenses on AWS should scale a lot slower than your revenue.

The issue is, there are relatively few people who can run AWS well compared to the number of people who use AWS to run their business.

[1] and tends to scale very sub-linearly, often logarithmically if there's low diversity in model/configuration, with number of servers

[2] GCP has slightly better pricing and some VPS providers are remarkably better, but I posit cost (up front or not) is never a real consideration and so these aren't even brought up in conversation

[3] For various workloads, my analyses have fallen in the range 2x-10x, assuming 3 year contracts, 3 years of use out of the hardware, and zero residual value. Yes, that's all-inclusive, even of my own labor. 6x for just the hardware (up front) portion is about right for typical workloads.


> That, there, is the falsehood at the root of the prophecy. I keep reading comments on HN of the form "and then we'll have to hire 5 sysadmins for those 200 physical servers" but they're invariably from people who have done nothing of the sort. From personal experience, the incremental effort of dealing with the physical part takes up around a quarter of a full-time Ops person's time at that scale [1].

But you also have the turnaround time to actually get the physical boxes, and you can’t run experiments as easily. Just the other day I was working on a project to see how much throughput I could get processing a queue of messages, based on the number of instances, the optimal size of the instance, and the correct criteria for scaling in and out before other constraints started hitting us. I kept playing around with it, rerunning the CloudFormation script with different parameters, until I got my answer; at one point I got up to 20 instances with 8GB of RAM and 4 vCPUs each. We settled on running one instance all of the time, scaling up to 8 based on load, and just for grins and giggles I set up a separate ElasticSearch node just to run analytics. Can you imagine the red tape and the time it would have taken to do that self-hosted? We also would have been paying for 14 instances (two similar processes that work on different queues) to sit idle most of the time just to handle the load.

I also would have had to provision VMs for ElasticSearch and Kibana, RabbitMQ, and some type of alerting system. All of this would have needed to be redundant across two data centers for DR once we moved into production.

I didn’t even mention the build server that would sit idle most of the time. CodeBuild is also “serverless”: you either use one of the prebuilt Docker images or use your own, and you only get charged for use.

> Even at high salaries, that's less than any cloud premium. Perhaps more importantly, the vast majority of those skills aren't difficult and could easily be imparted to a lower-cost employee.

You can’t get much lower cost than outsourcing your netops to a separate company that charges around 10% of your monthly bill. They then outsource to either cheaper foreign labor or to some place in middle America.

> Of course, the idea of providing mentorship or career paths for junior employees is anathema in today's environment, especially in smaller companies.

Who is going to train the trainer? It wasn’t until 4 years in that my company had any onsite AWS expertise. They outsourced support.

> None of the other Ops labor costs go away, and, judging by salary ranges in SFBA job postings, AWS-expert Ops folks are 10-15% more expensive.

They don’t go away; they get spread out among multiple customers.

We spend very little time doing “netops”, and a lot of the things we pay our APN for, we now have the expertise to do ourselves.

In my experience, an “AWS Architect” who doesn’t understand the development and devops side will do suboptimal “consulting”. The consultancy never suggested better, cheaper, more efficient ways to use AWS, because all they knew was netops. Developers who have really gotten into AWS can go back and forth between netops, devops, and development, and can build much more efficient systems. Of course, you still need that one “benevolent dictator” who really knows their way around AWS (raises hand).

> Also, you don't spend as much up front.

> I routinely hear and read this argument, as well, but, given the huge price premium of AWS [2], it doesn't withstand scrutiny. If the price premium is 6x [3], that means paying the equivalent of the "up front" cost to AWS in 6 months. It might just feel emotionally easier to have it spread out over half a year rather than having to write a big check all at once.

If you’re paying six times the cost, I would suspect:

1. You’re not taking into account the total cost of an equivalent redundant system within a data center and between geographically redundant data centers (OMG our load balancer went down in the middle of the night).

2. You’re overprovisioning and not architecting with the mindset of elasticity.

3. You’re not taking into account the ability to move a lot faster by not having the lead time between an idea and provisioning hardware and getting things to market. The whole project I spoke about earlier - which at maximum load included 16 VMs with a total of 64 hyperthreads and 128GB of RAM, separate servers to run ElasticSearch, a performant queueing system, some type of monitoring system, etc. - could never have been stood up in a week.

That’s not even to mention that we had to duplicate the same setup in a Dev environment - which could be done on both Dev and production by just running a CF script with different parameters.

My CxO came to me and said we needed to prove that we could handle this load before we could get this contract. It took me less than two weeks to create the entire architecture - that included some rework on the programs. I didn’t have to work with netops and devops, who then would have had to work with the finance department, who would have had to worry about what we would do with all of the hardware we provisioned just as a proof of concept.

Hypothetically, if we moved on to something else and didn’t need the resources from the project, I would just delete the CloudFormation stack and all of the resources would be gone - as opposed to having all of this hardware sitting around.

4. You’re not taking into account either a global workforce or a global customer base. At one company we had people working overseas, and performance was horrible.

We created another account, duplicated our dev infrastructure using CloudFormer, and put everything in their region. Of course I didn’t want to deal with it, so we paid our Amazon Partner for a one-time bucket of hours. Even our netops and infrastructure support is “elastic”.

> The issue is, there are relatively few people who can run AWS well compared to the number of people who use AWS to run their business.

I blame “AWS Architects” who take their on-prem netops mindset to the cloud, developers who don’t know how to architect with elasticity in mind, and devops folks who don’t know how to use AWS’s built-in devops tools. There is no reason that costs shouldn’t go up linearly (at a lower slope) as the number of customers increases.

> [1] and tends to scale very sub-linearly, often logarithmically if there's low diversity in model/configuration, with number of servers

If you have a stable workload, with predictable growth, and you’re a typical large enterprise, the cloud may not make sense (see GE’s Predix platform).

Also if you are a low margin B2C company it may not make sense at a certain scale.


> But you also have the turnaround time to actually get the physical boxes, and you can’t run experiments as easily.

That's a false dichotomy. Nothing says that, when one has physical infrastructure, one cannot also use cloud infrastructure in cases where it's needed (and transition off when it's not; it's the transition part that's usually missing).

> Can you imagine the red tape and the time it would have taken to do that self-hosted?

I don't have to imagine, as I have experience doing things self-hosted. "Red tape" is just a political problem within a company, not a technical one. In some cases, even the time is zero, if I happen to have spare hardware to lend.

Regardless, the whole experimentation situation is a strawman. It is, effectively, indistinguishable from early startup.

> 1. You’re not taking into account the total cost of an equivalent redundant system within a data center and between geographically redundant data centers

Nope. Always apples-to-apples. Physical DC redundancy costs something, but it's approximately linear (double for twice the price). Multi-cloud gets very costly in engineering time, too, if you want true redundancy in that context.

> 2. You’re overprovisioning and not architecting with the mindset of elasticity.

Overprovisioning goes against the definition of comparable. The "mindset of elasticity" is entirely too hand-wavy, as it's only a savings mechanism, and only for part of the infrastructure. It also only works if the workload itself is actually elastic enough.

That's part of why my range for the AWS premium is 2x-10x. The other part is that not everything is about CPU+memory (i.e. EC2); I/O, including EBS and data transfer costs, can play a big role.

> 3. You’re not taking into account the ability to move a lot faster by not having the lead time between an idea and provisioning hardware and getting things to market.

Besides being a strawman argument, that has nothing to do with the infrastructure cost. After bridging the initial lead time, AWS will still cost multiple times more, over years, if not transitioned off of.

> 4. You’re not taking into account either a global workforce or a global customer base. At one company we had people working overseas, and performance was horrible.

Again, nothing to do with the cost of the underlying infrastructure. It's just an argument of the form "<benefit of cloud> makes it worth the premium price", which isn't the same as the premium not existing.

I'm willing to concede that, for certain locations, AWS might actually be cost-competitive, due to, for example, otherwise insufficient competition in commodity hardware vendors or internet service. However, I'm skeptical that such a place exists and is served by AWS.

> I blame “AWS Architects” who take their on-prem netops mindset to the cloud,

I'm not sure how you can legitimately blame the existence of those you consider incompetent for the non-existence of those who are competent.

> developers who don’t know how to architect with elasticity in mind,

I'd argue there's nothing wrong with that, as they shouldn't have to care. Plus, some workloads just aren't easily elasticized, just as some (many) aren't easily parallelizable in software, no matter how much you might personally wish it were so. That not-easiness can translate to huge engineering costs (which I've seen for parallelization, though not elasticity, for the very reason you mentioned).

> and devops folks who don’t know how to use AWS’s built-in devops tools.

That sounds like a recipe for vendor lock-in, which is another way of saying higher-cost-lock-in.

> There is no reason that costs shouldn’t go up linearly (at a lower slope) as the number of customers increases.

There are plenty of reasons, not the least of which is naive/inefficient distributed systems design. Notwithstanding the fact that No True Scotsman would ever create such a system, I've witnessed more than a few.

> with predictable growth

Everyone has predictable growth, given a short enough timescale. That's part of the point of using commodity hardware, rather than something like "enterprise" hardware, where sometimes just the sales cycle can be prohibitively long.

Extrapolating lead times from one's experience in a giant corporation or from the dot-com era is unrealistic.

> Also if you are a low margin B2C company it may not make sense at a certain scale.

I posit that neither the size of the margin nor the nature of the business makes any difference. Having a vastly too high COGS doesn't make sense for any business.


> just moving sorting to the frontend took a 3 second query down to 10ms

I've seen this too, but the problem can often be optimized in SQL as well. For instance, if you expect that only one user will match a specific username, then adding LIMIT 1 to the query instructs the DB to return after finding a single hit, which can significantly shorten query time. But then adding an ORDER BY clause will cause (at least MySQL) to scan the entire table anyway, even though you've got the LIMIT clause. This is just one example; one really needs to know their database and their application to make these kinds of decisions.
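
For instance, something like this (the users table and columns here are hypothetical):

    -- can return as soon as one matching row is found:
    SELECT * FROM users WHERE username = 'bob' LIMIT 1;

    -- the ORDER BY can force examining every matching row first, despite the LIMIT:
    SELECT * FROM users WHERE username = 'bob' ORDER BY created_at LIMIT 1;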

But the premature-optimization caveat applies to SQL as well: if the results can be sorted/filtered/transformed in application memory better than in the database, so be it.


What about analytics? Hosting dimensional data outside your DB sounds terrible.


> I've seen MySQL do a lot of fairly dumb stuff with temp tables and order by.

I too have seen the front end of an application do a lot of dumb stuff with the state of my requests.

The point is, don't confuse PEBKAC with an issue with a tool.


You are only thinking of a single application. No offense, but this is a classic case of app devs not realizing that databases exist for more than just storing data.

Ever thought about how databases are also used to generate reports and dashboards? What about a data warehouse? OLAP? Cubes? Aggregates? What about feeding other systems (e.g. a third-party downstream system)?


Isn't that basically just syntactic sugar for #1? When you create an enum, it creates entries in a table with numeric values for enumsortorder and a string field for enumlabel. It seems like effectively the same thing.

edit: It is different, as all enum values are rows in the same table and there is a third column to identify which type they belong to. The actual key used to refer to an enum value is not in the table; it's the oid of the table entry. So, I guess retrieving the data from the enum table is technically not a join.


It is logically a join on oid to the pg_enum table. The implementation takes a few shortcuts, but the basic join machinery has been hand-polished for decades, so that doesn't make much difference.

Because enums are stored in pg_enum, which is just a table like the rest of the catalogs, you can join to it with your own SQL queries. If you are a very bold and slightly bad superuser, you can even update them.
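
For example, a query along these lines lists every enum type and its labels in sort order (pg_enum's enumtypid, enumlabel, and enumsortorder are standard catalog columns):

    SELECT t.typname, e.enumlabel, e.enumsortorder
    FROM pg_enum e
    JOIN pg_type t ON t.oid = e.enumtypid
    ORDER BY t.typname, e.enumsortorder;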


Except an enum abstracts away the need to write a join. Plus, enum performance is a tad better.


How is performance when adding a new enumerated value once the table already has millions of rows?


I was using enums, but then I found out that you can't alter an enum inside a transaction... very painful in a knex environment, where up/down migrations all run within a transaction.


If knex doesn't provide facilities for running migrations outside of a transaction, you're going to have an awful time.

Useful operations like CREATE INDEX CONCURRENTLY can't be run in transactions.
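
A sketch of both failure modes (hypothetical names; this is the behavior on the Postgres versions being discussed here):

    BEGIN;
    -- ERROR: ALTER TYPE ... ADD cannot run inside a transaction block
    ALTER TYPE product_status ADD VALUE 'archived';
    -- ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
    CREATE INDEX CONCURRENTLY products_status_idx ON products (status);
    COMMIT;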


This is exactly the problem I found when wanting to use enums. They do not play nicely with transactional DDL.


Enums add complexity that is better avoided by keeping the data relational, unless you have a really strong reason otherwise. If it were, for example, a statistical data table likely to have many billions of rows, I could see it, but then you're getting outside of the realm where you have truly relational data.


I would argue enums are less complex and easier to use. Imagine we were talking about a date field instead: which is more complex?

A products table with a column called date (of type date)

Or a products table with a date_id column and a dates table with the columns date_id, date_year, date_month, date_day

Sure, the second option may seem more relational, but it is more complex. It allows you to do more relational things later if you want to attach things to a date like an is_holiday or a was_raining column.

So the question is: is there any value in adding the extra complexity now? If you foresee having to add more information to a status in a relational way, like an is_active column, then maybe a joined table is better.
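
For concreteness, the two shapes being compared would be something like this (a sketch; the names are hypothetical, and the two products definitions are alternatives, not meant to coexist):

    -- option A: just a column
    CREATE TABLE products (
        product_id serial PRIMARY KEY,
        date       date NOT NULL
    );

    -- option B: the "more relational" version
    CREATE TABLE dates (
        date_id    serial PRIMARY KEY,
        date_year  int NOT NULL,
        date_month int NOT NULL,
        date_day   int NOT NULL
    );
    CREATE TABLE products (
        product_id serial PRIMARY KEY,
        date_id    int NOT NULL REFERENCES dates
    );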


> It allows you to do more relational things later if you want to attach things to a date like an is_holiday or a was_raining column.

While I get that calendar tables are fun and all, if you have a date type you can just add a “holidays” table for that if you need it later, without a table of all dates.


Precisely. This is exactly why decomposing things into relations is important. The naive approach is to "keep all the things and all their details in one big table", and that's what causes so many problems.


This thread is rehashing the trade-offs of each form or “level” of normalization.

Various normal forms serve various purposes. Use what makes the most sense for your read and write demands.


One big caveat with enums in Postgres is that adding values can't be done in a transaction: https://www.postgresql.org/docs/9.4/static/sql-altertype.htm...

It's modifying a type, rather than values (as it would be with a lookup table) or schema (as it would be if we added a column).


It's annoying of course, but when is this a big problem in practice? I can imagine how it would be in theory, but in all cases where I needed to add an additional value to an enum so far, I could just do it independently of other schema changes and well in advance of deploying code that made use of the new value.


Why do that when you can define it as a foreign key into a proper relation and modify it with full transactional guarantees? I just don't see the advantage of bending over backwards.

Remember, relations are "types" with existence checking built in and a whole powerful algebra around them. By dropping down to a more primitive type you gain very little and lose strong guarantees.
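
A minimal sketch of that approach (hypothetical names):

    CREATE TABLE statuses (status text PRIMARY KEY);
    CREATE TABLE products (
        product_id serial PRIMARY KEY,
        status     text NOT NULL REFERENCES statuses
    );

    -- adding a new allowed value is ordinary, fully transactional DML:
    BEGIN;
    INSERT INTO statuses VALUES ('archived');
    COMMIT;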


Convenience. Having a table mapping id to string is logically equivalent and plays nicer with transactions. But it's much less convenient.

In principle SQL databases could have some sugar for foreign keys to tables that just bijectively map ids to human-readable identifiers, but they don't. I always wondered why not.

With enums I get readable, type-checked values I can auto-complete on and that I can use "in-place". With a table I don't get auto-completion and I need to do a join for readability if my enums are "true" enums (and have an ordering).


An enum is a relation; it's just a relation that is not stored in the same kind of table as other data.

Your example is really a fairly poor strawman. I would suggest something more like a product with a "release date" column, which would be decomposed into a "releases" table with a foreign key. That's the kind of decomposition I've done a lot, and it's usually a useful one.

Another example would be having an "address" field, versus having number, street, apartment/suite, city, state, zip, country. I generally go with address data in a separate, relational table, instead of inline (user_addresses vs on the user table directly), and I've found that to be very useful at times.

In a lot of cases, you actually want a separate table (keyed on the real value, of course, not a generated foreign key) that stores things like zip codes or countries, because there is a lot of ancillary data to associate with them.

If you value homoiconicity, you want all your data to be in as similar a format as possible. Enums are a special case, and so you should avoid them unless you have a strong reason otherwise.


> An enum is a relation

This is TRUE. Most people ignore the relational model despite it being so simple to learn. But one important thing to note is that, similar to how you can say "anything is an object", you can say "anything is a relation".

This is valid SQL:

    SELECT * FROM (SELECT 1) a
And because "1" is a relation, a enum is.

Normal RDBMSs fail at operating on inner values (like the array feature in PostgreSQL) and don't allow you to put tables into cells. This is valid at the relational level, but you need to do hacky stuff in common RDBMSs.

    SELECT Head =(SELECT * FROM INVOICE)
I'm working on a relational language, and this will be valid:

    for i in 1 do
and

    select value:Some(1).... where value.isSome


Enums are not relations but data types, as indicated in the first paragraph here: https://www.postgresql.org/docs/current/static/datatype-enum..., and by the related command ‘CREATE TYPE ...’.

The term ‘relation’ is commonly used when referring to tables and views.

What do you gain by calling types relations, while introducing confusion and diminishing the practicality of the relation concept?


Finite data types are, or are isomorphic to, (constant) relations; they aren't relvars like tables/views (relations include both constant relations and relvars).

> What do you gain by calling types relations

The fact that everything true of other relations is true of finite types.


That is true in PostgreSQL.

Not in the relational model.


I don't buy "enum is a relation" any more than "VARCHAR is a relation" or "REAL is a relation". These just data types that represent the domain or range of possible values that may participate in a relation.


> that represent the domain or range of possible values that may participate in a relation

Like a foreign key constraint?


Enums are much easier to deal with in application code, because you don't have to deal with joins and keep track of a separate entity. You could get the same effect using a string pkey, if performance isn't an issue. Enums are also ordered, which can be useful in certain circumstances.


> Enums are also ordered, which can be useful in certain circumstances.

Are they ordered in a way that's different from an integer value that's a foreign key into another table?

The main difference I can perceive is that the ordering is non-obvious to a human, unlike integers, which seems less useful, not more.


In MySQL at least, removing enums from a list requires running some expensive `alter table` operations.

It's fine for lists where the possible values definitely, absolutely, for sure won't change, but for things like attributes on elements of a set, enums are evil.
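
In MySQL, changing the value list means redefining the column, which can rewrite the whole table (a sketch with hypothetical names):

    -- dropping 'returned' from the list can force MySQL to rebuild the table:
    ALTER TABLE items MODIFY COLUMN status ENUM('new', 'shipped') NOT NULL;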


But it's just removing values that causes the problem, which doesn't seem like an important feature. I almost always keep old values for historical purposes/backwards compatibility with backups.

You can add values without issue.


It can be considerably easier to manage adding, removing, and editing rows in a table than enum options (plus you can store other metadata), particularly if you have some sort of more-or-less autogenerated CRUD system.

Also, while this may be something of a "schema smell", a lookup table could be shared. In a small, fast-moving project this could be useful.


enum does exist in PostgreSQL (which is in the article's URL). But an enum still requires a lookup for the associated value when setting and getting rows; it's not free.


It's got to be way closer to free than a separate table join.

But I did find a good answer here about the pitfalls of enum in MySQL, some/most of which would apply to PostgreSQL as well.


Postgres joins against small tables are really really efficient. It wouldn't shock me if enums actually were implemented as tables.


Does an enum have any performance or consistency benefits over 1 or 3? (I discount 2 because it seems like the worst possible option.)

It would seem like 1 is essentially an ad hoc enum with the same guarantees, and 3 is just a bit easier to use, as you can have the enum defined in your code rather than in SQL.


I feel like status ought to go in its own table, since it future-proofs against other things having the same statuses: it's easier to refactor application logic and make the table polymorphic than to pull statuses out of a potentially large table and do the same.


A table is easier to update, add to, and delete from. Using an enum out of the gate is likely not the best first option, in my opinion.


Possibly because ENUM column types don't seem to exist in all SQL databases? I could be wrong though...


ENUM is in MySQL/MariaDB. Not sure about Oracle or some of the more obscure ones.



Yea I assumed Postgres was covered by OP's suggestion to use ENUM.


Just poking around in the docs, it seems it was first added in 8.3?


You can just make another table and map a few integers to the statuses.

It's another join, but I wouldn't expect much difference versus the 'lookup' required for proper ENUM support.


I have the same idea; I like to use enums or flags. I don't need a separate table for it.


The submitted URL is not stable, it just links to the front page. So for the benefit of later-visitors, here's the direct link:

https://www.brianlikespostgres.com/cost-of-a-join.html


Thanks!


Hmm, I think these results are somewhat suspect. Postgres isn't even really hitting its stride yet at a million rows. You have to start looking at 100GB+ database sizes; otherwise most everything fits in working memory, or at least the indexes do.

My experience has been that joins are, in fact, cheap in larger databases but they do scale on the size of tables, so you should be cautious about excessive joining.

In the extreme case, I wouldn't make a table for each attribute unless there were some overriding reason (at which point you've essentially rediscovered column-oriented databases). I do sometimes go up to fifth normal form or DKNF, though, which I think is substantially beyond what most people designing or revamping a database would do.


Normalize till it hurts... Denormalize till it works :)


Golden words. This is the cycle that keeps repeating in complex production systems.


This is actually terrific advice. Thanks!


Great to see this analysis. Especially with the proliferation of ORMs that create relationship tables like confetti, it is very useful (and comforting) to know that needing to join a large number of tables is not, in and of itself, a performance disaster. It does not alleviate all my concerns with join-happy designs, but it is definitely good to know.


There are two general (potential) problems due to the use of (multiple) joins:

* run time: performance disaster

* design time: conceptual chaos

Some of these are analyzed in [1], which essentially argues that joins are harmful and describes a join-free (column-oriented) approach. This approach has been implemented in [2], a library for batch and stream data processing, and in [3], a library for data analysis. Their main common distinguishing feature is that they rely on column operations; that is, data processing is described as a graph of operations on columns, as opposed to set operations like join.

[1] https://www.researchgate.net/publication/301764816_Joins_vs_... Joins vs. Links or Relational Join Considered Harmful

[2] https://github.com/asavinov/bistro - A general-purpose data analysis engine radically changing the way batch and stream data is processed

[3] https://github.com/asavinov/lambdo - A column-oriented approach to feature engineering. Feature engineering and machine learning: together at last!

(Disclaimer: I am an author)


I really don't understand the reasoning for the 3 choices:

1. Add a status_id column to the product table and reference a new status table

2. Add a status_id column to the product table and let the app define the mapping of what each status_id is

3. Add a status column of type text to the product table

3 is unacceptable under any circumstances, and 2 really just makes no sense from a database perspective (unexplained values in a database are just wrong).

Choice 1 is pretty normal, but I am really curious about the whole idea that any app needs to join with that status table to get anything but the flags, values, state, or labels associated with a status. Using the status table as the starting point of a join is just wrong and shows some very poor data management in the relationship between the app folks and the database folks. I can see using a text field to look up and store the id, but really, the mechanism that populates your status table (with any flags, values, state, or text associated with each status) should generate the equivalent code or config for the app.

Indexes on status are often painful because status generally takes only a small number of values, and an index that doesn't cut down the number of rows significantly is often a problem. My big rule about this: if the app asks for a status field on an object, expect someone (probably your report writer) to want all the objects with the same value of the status field. A typical example is open invoices. You really need to look at where the index is sparse and whether it helps in each situation. This is especially true if you track statuses through time (e.g. itemId, statusId, effBegUTC, effEndUTC) [1]. It is definitely worth the time to go over work tables, views, and each index with someone with experience (yes, a good DBA).

1) If you are mostly looking to get all the items currently in a single status, then leave effBegUTC out of the index and always put a single sentinel value in effEndUTC for "current" (e.g. 9999-12-31). An index on (effEndUTC, statusId) is better than (effBegUTC, effEndUTC, statusId). Whether to include itemId in the index really depends on the optimizer and how covering indexes are handled.
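
As a sketch of that last point (Postgres syntax; the item_status table is hypothetical):

    CREATE INDEX item_status_current_idx ON item_status (effEndUTC, statusId);

    -- or, if queries only ever ask for "current" rows, a partial index:
    CREATE INDEX item_status_open_idx ON item_status (statusId)
        WHERE effEndUTC = '9999-12-31';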


Would love to see an analysis that depends on the width (number of columns) of the table.



