Fast Way to Load Data into PostgreSQL Using Python (hakibenita.com)
246 points by gilad on July 10, 2019 | 29 comments



As a former DBA and closet python super fan I found this post to follow all the things I would have done. Kudos to the op.

Loading to a temp non-logged table is the right way to go, and using built-in bulk loading options is smart, too. The type annotations looked really good, and kudos for using modern Python here (although for the examples I'm not sure they were needed; given they could have been used in a larger context of which this article was just a microcosm, I could be wrong). Very good article. I've bookmarked the blog to consume other such articles.


What about translating the data on the fly into the one big COPY statement?


I could see that working. Just depends on your use case and what you want to accomplish.

The approach the author takes has the advantage that when data is in the load table it can then be used to load into a real table via some transform in batches and be done in the background.


I haven't done any benchmarking, but for around 200,000 rows of data, which I am handling daily at work, the following line of python code takes around 10 seconds.

pandas.read_json(json_obj).to_sql()

Any transformation of the data before writing into DB is split second with vectorized pandas operation.

Pandas is still my first choice when it comes to tasks like this.
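
For anyone curious what a fuller version of that one-liner looks like, here's a minimal sketch; the connection string, table name and JSON are made up, and method/chunksize are just common knobs to try:

  import io
  import pandas as pd
  from sqlalchemy import create_engine

  engine = create_engine("postgresql://user:pass@localhost/testdb")  # hypothetical DSN

  json_obj = '[{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]'
  df = pd.read_json(io.StringIO(json_obj))

  # method="multi" packs many rows into each INSERT; chunksize bounds memory use
  df.to_sql("staging_table", engine, if_exists="append", index=False,
            method="multi", chunksize=1000)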


Not sure what the python code translates to, so maybe this is covered, but in my experience, if you can't use COPY, you should:

1 - Generate a static SQL to insert N rows (say 10 to 50): insert into blah (id, name) values ($1, $2), ($3, $4),

2 - Loop through the data in batch size

3 - Generate a dynamic sql for M rows where M is whatever's left over

Wrap it in a transaction. You can have an SQL builder where you give it the list of columns and how many rows you want to generate, and use that for both steps 1 and 3 (see the sketch below). You can prepare statement 1 (and keep reusing it), but I'm not sure how much that'll buy you.

If you need to deal with conflicts, inserting the above into a temp table (ON COMMIT DROP) and then doing an insert + update from the temp table into the real table works well.

Then there's a handful of PostgreSQL configs to tweak, but if you're willing to risk a bit of data loss (but not corruption), set synchronous_commit=off.
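
A minimal Python sketch of the above, assuming psycopg2 (so %s placeholders rather than $1/$2); the table and column names are made up:

  import psycopg2

  def insert_sql(table, cols, nrows):
      # e.g. INSERT INTO blah (id, name) VALUES (%s, %s), (%s, %s), ...
      row = "(" + ", ".join(["%s"] * len(cols)) + ")"
      return f"INSERT INTO {table} ({', '.join(cols)}) VALUES " + ", ".join([row] * nrows)

  def bulk_insert(conn, table, cols, rows, batch_size=50):
      full_batch_sql = insert_sql(table, cols, batch_size)  # reused for every full batch
      with conn.cursor() as cur:
          for i in range(0, len(rows), batch_size):
              batch = rows[i:i + batch_size]
              sql = full_batch_sql if len(batch) == batch_size else insert_sql(table, cols, len(batch))
              cur.execute(sql, [v for r in batch for v in r])  # flatten the batch into one param list
      conn.commit()  # one transaction around all the batches

  # hypothetical usage:
  # conn = psycopg2.connect("dbname=testdb")
  # bulk_insert(conn, "blah", ["id", "name"], [(1, "a"), (2, "b"), (3, "c")])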


Seems like a pretty good overview, but I'm not surprised at all that in the end it's COPY TO/FROM that won out by a good margin.

Listing all the columns in order seems a bit superfluous and error prone, though. I'd recommend creating the list of columns to be inserted (or copied; copy_from has a 'columns' arg) and slicing the input hash with the same list (dict comprehension in Py).
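
Something like this, perhaps; the column list and table name are made up, and real code would also need NULL handling and tab/newline escaping:

  import io
  import psycopg2

  COLUMNS = ["id", "name", "abv"]  # single source of truth for the column order

  def rows_to_tsv(dicts):
      buf = io.StringIO()
      for d in dicts:
          picked = {c: d[c] for c in COLUMNS}  # slice each input dict with the same list
          buf.write("\t".join(str(picked[c]) for c in COLUMNS) + "\n")
      buf.seek(0)
      return buf

  # hypothetical usage:
  # conn = psycopg2.connect("dbname=testdb")
  # with conn.cursor() as cur:
  #     cur.copy_from(rows_to_tsv(records), "staging_beers", columns=COLUMNS)
  # conn.commit()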


I am looking for ways to quickly load data into a PG database for local testing. An initial seed run is ok, but I want test suites to run on identical data sets, and since PG doesn't (yet) support nested transactions, I currently use my seed database as a TEMPLATE for suite-specific databases.

It works reasonably well but creating the database with a template takes several seconds, even though my test data is fairly simple. I imagine it should be faster than anything I can do from my programming language (nodejs in my case), or is creating with templates slow for some reason that I don't know of?


Not an expert, but I'm curious - what use case is not sufficiently solved by savepoints?


They might well be faster, I haven't tried. I just remember looking into them and thinking that it was more involved than what I had now so I decided against them.


PG supports nested transactions via savepoints. My team uses that extensively in their tests.
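
For illustration, a bare-bones savepoint-per-test sketch with psycopg2 (the connection string and test harness are made up):

  import psycopg2

  conn = psycopg2.connect("dbname=testdb")  # one long-lived connection holding the seeded data

  def run_test(test_fn):
      with conn.cursor() as cur:
          cur.execute("SAVEPOINT test_case")
          try:
              test_fn(cur)  # the test can mutate data freely
          finally:
              cur.execute("ROLLBACK TO SAVEPOINT test_case")  # restore the seed state for the next test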


Ok, so perhaps I should take a second look at those.


I concur - we load a fair amount of data into Postgres using Python and we've learned these lessons the hard way. We ended up feeding generators into execute_batch.

We don't use execute_values, because we want to generate dictionaries (since there are usually multiple generators along the way) and we don't use copy_from, because the cognitive overhead of data prep is just way too high for the benefit (for us! your mileage may vary).
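
For reference, the execute_batch-plus-generator pattern looks roughly like this (the table, columns and source iterable are made up):

  import psycopg2
  from psycopg2.extras import execute_batch

  def row_dicts(source):
      # hypothetical generator stage producing dicts
      for rec in source:
          yield {"id": rec["id"], "name": rec["name"].strip()}

  INSERT = "INSERT INTO staging_beers (id, name) VALUES (%(id)s, %(name)s)"

  # conn = psycopg2.connect("dbname=testdb")
  # with conn.cursor() as cur:
  #     execute_batch(cur, INSERT, row_dicts(source), page_size=1000)
  # conn.commit()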

Great stuff, I've already circulated it at work.


For old-school CSV and Excel files that require any processing before inserting, I have found Perl with the relevant CPAN modules for CSV and Excel to be the best solution.


Very nice article! Explores all the tricks I know of, and a few more. Never heard of execute_values before; it's only a bit slower (3x in the example) than using COPY, but with much simpler code.
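
For anyone who hasn't seen it, execute_values boils down to this (connection, table and rows are made up):

  import psycopg2
  from psycopg2.extras import execute_values

  rows = [(1, "alpha"), (2, "beta")]  # made-up data

  # conn = psycopg2.connect("dbname=testdb")
  # with conn.cursor() as cur:
  #     # the single %s is expanded into (...), (...), ... batches by execute_values
  #     execute_values(cur, "INSERT INTO staging_beers (id, name) VALUES %s",
  #                    rows, page_size=1000)
  # conn.commit()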


Why do you drop and recreate an unlogged table? Why not create a temporary table, which is also not WAL-logged and automatically gets cleaned up? It looks like you're not closing the connection. The only advantage is that the staging table remains in place until the next execution, but presumably so does the data file.

Alternately, if you have an unlogged table already created, why not simply truncate the table if it exists instead of recreating it on every execution? Drop and truncate should take about the same amount of time, but creating a table is going to be slightly more expensive. If you're worried about making sure the db table schema matches the file format, I don't think I would be. File formats change, you're already going to be manually working on it. You can drop the table so the updated script recreates it.

I suppose it doesn't really matter, but I'm wondering if your choice was one of intent or one of convenience.
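
The two alternatives side by side, as a rough sketch (the table definition is made up):

  import psycopg2

  conn = psycopg2.connect("dbname=testdb")  # hypothetical connection
  with conn.cursor() as cur:
      # Option A: temporary table - not WAL-logged, cleaned up automatically at session end
      cur.execute("CREATE TEMPORARY TABLE staging_beers (id int, name text)")

      # Option B: keep a persistent unlogged table around and just empty it on each run
      # cur.execute("CREATE UNLOGGED TABLE IF NOT EXISTS staging_beers (id int, name text)")
      # cur.execute("TRUNCATE staging_beers")
  conn.commit()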


I followed a similar path with TEMPORARY tables and using pgcopy [1] to insert data quickly - the copy operation was extremely quick and didn't require passing everything as strings.

[1] https://github.com/altaurog/pgcopy
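
Roughly, pgcopy usage looks like this - the table, columns and records here are made up:

  import psycopg2
  from pgcopy import CopyManager

  conn = psycopg2.connect("dbname=testdb")  # hypothetical
  cols = ("id", "name", "abv")
  records = [(1, "alpha", 4.5), (2, "beta", 5.0)]  # native Python types, no string munging

  mgr = CopyManager(conn, "staging_beers", cols)  # streams COPY ... BINARY under the hood
  mgr.copy(records)
  conn.commit()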


You can go faster than this. (Mind you, the following suggestions will only matter if the DB is still your real bottleneck. Once you’re using COPY, you start to have to optimize the rest of your ETL pipeline as well.)

I’ve done all of the following (except the last bit about pg_bulkload) in my own ETL stack. They work!

——

1. COPY supports a binary format in place of CSV. It’s proprietary (it’s not even exposed in libpq, only statically linked into the postmaster, pg_dump and pg_restore binaries!) but also pretty easy to implement. Here’s a Python package implementing it: https://pypi.org/project/pgcopy/. As long as you’ve got the rows in memory at some point, just generate that instead of CSV. The rest of the architecture is the same.

2. Run concurrent SQL COPY commands on separate connections for each [staging] table you’re populating. This allows PG to populate the tables in parallel.

3. Create (or TRUNCATE) the staging tables in the same transaction that contains the COPY. This way, the COPY won’t need to operate on a differential MVCC state, but rather can operate on the “raw” table. (And, if this is the “base” data load for a given table, rather than an incremental one against an already-populated table, then you don’t need a staging table; the target table itself is now your staging table. Just ensure that any CREATE INDEX or CREATE PRIMARY KEY or ADD CONSTRAINT happens after you populate the table with data. No need for a TEMPORARY/unlogged intermediate; the performance would actually be worse if you added one.)

4. Pre-bake your .sql, embedding the `FORMAT binary` data, then stream it once you’re done. I can’t tell you how many times the problem with an ETL pipeline wasn’t PG’s ingestion speed, but rather the speed at which the ETL transformer process was writing to the socket. The rest of your pipeline can be asynchronous, but actually streaming into your PG instance takes up valuable shared resources on the instance. Assuming that the instance being ETLed to isn’t just a replication master, and actually receives read traffic, you’ll want writes to that instance to complete as quickly as possible—even at the expense of making the rest of your ETL pipeline more complicated. So pre-bake your .sql files, and then stream them. (One advantage, once you do this: you don’t need a runtime with an SQL client library any more. You can just spawn instances of `psql foo.sql`. These processes don’t even have to coincide with your transform stage any more! Push your `.sql` files to object storage, and then run `psql` on a dedicated loader VM—or from an object-storage-lifecycle-event-triggered Lambda function, if you like.)
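
A minimal sketch of points 2 and 3 above, using psycopg2 and one thread per staging table (the DSN, table names and CSV payloads are made up):

  import io
  import psycopg2
  from concurrent.futures import ThreadPoolExecutor

  DSN = "dbname=testdb"  # hypothetical

  def load_table(table, csv_text):
      # one connection per staging table, so PG can ingest the tables in parallel (#2)
      conn = psycopg2.connect(DSN)
      try:
          with conn.cursor() as cur:
              cur.execute(f"TRUNCATE {table}")  # in the same transaction as the COPY (#3)
              cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)",
                              io.StringIO(csv_text))
          conn.commit()
      finally:
          conn.close()

  # jobs = {"staging_a": csv_a, "staging_b": csv_b}  # made-up inputs
  # with ThreadPoolExecutor() as pool:
  #     list(pool.map(lambda kv: load_table(*kv), jobs.items()))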

Fun fact: when you `pg_dump --format=directory`, pg_restore(1) will do techniques #1 - #4 for you automatically.

Key insight: the pg_dump(1) and pg_restore(1) pair of binaries are optimized to heck and back. If pg_dump(1) or pg_restore(1) has some special way of doing something, it’s probably for performance; look into doing that thing in your pipeline. (Even if there’s no library for your runtime to allow you to do that thing. Write one!)

——

You can go faster than even pgdump+pgrestore will “naively” do, while still being compatible with “standard Postgres”, if you’re willing to mangle your data architecture a bit. Specifically, you can take advantage of Postgres’s table partitioning support. (Yes, now you have to deal with partitions. But hey, you probably already had to at the data sizes we’re talking about.)

5. Turn your incremental loads into table foo (via fresh staging tables) into base loads of fresh partitions of table foo. Each load creates its own partition. Techniques #1 and #2 will accelerate and parallelize the loading of sibling partitions of a parent table, just as well as they’ll accelerate the loading of separate tables.

Once you do this, you don’t need any staging tables, because every load is a “base” load of a table. (Yes, you’ll have a lot of tables. You’ll need some agent sitting around to do async rollups using `CREATE TABLE AS ...` to consolidate these. If you have access to the TimescaleDB extension, you might be able to trick it into doing this for you automatically.)

#6 (or #3b). Include your CREATE INDEX statements in the same transaction that CREATEs and COPYs into the partitions (after the data is populated, though). One of the worst things for the production performance of a data warehouse is contention between application reads and an index rebuild. If you always create new partitions, and only ever index them when you’re creating them, then your indices will be created against tables that aren’t visible to new clients yet, and so clients and the indexer will never get in one another’s way.
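
A compressed sketch of #5/#6; the partitioned table foo, its schema, the partition bounds and the CSV row are all made up:

  import io
  import psycopg2

  conn = psycopg2.connect("dbname=testdb")  # hypothetical
  with conn.cursor() as cur:
      # the new partition, its data and its index all live in one transaction;
      # nothing is visible to readers until COMMIT
      cur.execute("""
          CREATE TABLE foo_20190710 PARTITION OF foo
          FOR VALUES FROM ('2019-07-10') TO ('2019-07-11')
      """)
      cur.copy_expert("COPY foo_20190710 FROM STDIN WITH (FORMAT csv)",
                      io.StringIO("1,2019-07-10 12:00:00,alpha\n"))
      cur.execute("CREATE INDEX ON foo_20190710 (created_at)")
  conn.commit()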

——

If, after all this, you’re looking for the absolute fastest way to load data into Postgres, and you have control over your PG instance (i.e. it’s not a hosted DBaaS), you could try https://pgxn.org/dist/pg_bulkload/. It’s a PG extension that serves as a faster replacement for the COPY command, by bypassing most of the buffers and locks related to PG’s individual request-handling. It shouldn’t cause any problems with MVCC, but only because it assumes/requires that you’re also doing all of the above, so you’re never trying to load into a table users can “see.” (I haven’t tried it personally, but if PG’s ingest speed becomes my bottleneck again, I just might.)


Somewhat related: how does one insert a Pandas DataFrame into a Postgres table, in which most fields are merely integers being foreign keys of other tables? The DataFrame itself contains the real values (e.g. Cities, Countries etc.).


I use to_sql, but in this case you would have to resolve the object values to their integer primary keys first.
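
One rough way to do that is a merge against the dimension tables before writing; the engine, table and column names here are made up:

  import pandas as pd
  from sqlalchemy import create_engine

  engine = create_engine("postgresql://user:pass@localhost/testdb")  # hypothetical

  df = pd.DataFrame({"city": ["Paris", "Berlin"], "amount": [10, 20]})  # real values
  cities = pd.read_sql("SELECT id AS city_id, name AS city FROM cities", engine)

  # swap the text value for its foreign key, then write only the ids
  out = df.merge(cities, on="city", how="left").drop(columns=["city"])
  out.to_sql("facts", engine, if_exists="append", index=False)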


Very nice and educational read! One note though: it's usually good practice in performance testing to repeat your measurements more than once; it will mitigate local problems, or at least point to problems in the measurement.


Would also like this with asyncpg. psycopg2 is kind of obsolete today.


Bold claim, since psycopg2 is used overwhelmingly more than any other library. Though asyncpg (or something like it) is likely the future, it’s a long way to go before calling psycopg obsolete.


Asyncpg is not the future, it is the present. We use it in production and are quite happy with it.


That still doesn't really challenge parent's claims.


If asyncpg is qualitatively better across the board and psycopg2 is only being used preferentially due to historical precedent, I'd say calling it obsolete is fair.


Doesn't asyncpg require asyncio? That's a fairly niche market, especially given the alternative async libraries even in that space (twisted, trio). Why would it be faster anyway, on a use case like this?


There’s also copy_expert in psycopg2. I believe csv with copy_expert should be the fastest but have not tested in as much detail as op. Great work on the article.


IIRC copy_expert just lets you manually enter the Postgres COPY command that's used to start the transfer, which by itself doesn't change anything.

It does allow you to use the BINARY transfer format, though, and that's quite often the fastest version, at the cost of a more complicated data transformation step.
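
For illustration, a bare copy_expert call (connection, table and data are made up):

  import io
  import psycopg2

  csv_data = io.StringIO("1,alpha,4.5\n2,beta,5.0\n")  # made-up rows
  conn = psycopg2.connect("dbname=testdb")  # hypothetical
  with conn.cursor() as cur:
      # you spell out the COPY command yourself, e.g. to pick CSV vs BINARY,
      # add HEADER, FREEZE, etc.
      cur.copy_expert("COPY staging_beers FROM STDIN WITH (FORMAT csv)", csv_data)
  conn.commit()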


Plus freeze for use cases like the example in the post.



