
Tips for Importing Data

I’m currently importing a large amount of spatial data into a PostgreSQL/PostGIS database and realized others could learn from my experience. Most of the advice is not specific to PostgreSQL or PostGIS.

Know the basic techniques

Know the basic techniques for loading bulk data. Use COPY if you can. If not, use batched statements. Failing that, turn off auto-commit before making individual INSERT or UPDATE calls and only commit every Nth call. Use auto-committed INSERT and UPDATE statements as an absolute last resort; unfortunately, auto-commit is the usual default for JDBC connections.
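
Here is a minimal JDBC sketch of the batched approach, assuming a hypothetical city(name, population) table and a placeholder batch size:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class BatchUploader {
        private static final int BATCH_SIZE = 1000;   // placeholder; tune for your data

        // Hypothetical record type standing in for one row of input.
        record City(String name, int population) { }

        public static void upload(Connection conn, List<City> cities) throws Exception {
            conn.setAutoCommit(false);   // no commit per statement
            String sql = "INSERT INTO city (name, population) VALUES (?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int count = 0;
                for (City c : cities) {
                    ps.setString(1, c.name());
                    ps.setInt(2, c.population());
                    ps.addBatch();
                    if (++count % BATCH_SIZE == 0) {
                        ps.executeBatch();   // send the accumulated rows in one round trip
                        conn.commit();       // commit every Nth call, not every call
                    }
                }
                ps.executeBatch();           // flush the final partial batch
                conn.commit();
            }
        }
    }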

Another standard technique is to drop any indexes before loading large amounts of data and recreate them afterwards. This isn't always possible, e.g., if you're updating a live table, but it can mean a big performance boost.

Finally, remember to update your planner statistics afterwards. In PostgreSQL this is done with VACUUM ANALYZE.
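
A sketch of the drop/reload/recreate sequence through JDBC, assuming a hypothetical city table and city_name_idx index:

    import java.sql.Connection;
    import java.sql.Statement;

    public class IndexRebuild {
        // Hypothetical table and index names.
        public static void reload(Connection conn) throws Exception {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("DROP INDEX IF EXISTS city_name_idx");

                // ... bulk-load the data here (COPY or batched INSERTs) ...

                stmt.execute("CREATE INDEX city_name_idx ON city (name)");
                // VACUUM can't run inside a transaction block, so the
                // connection must be in auto-commit mode at this point.
                stmt.execute("VACUUM ANALYZE city");
            }
        }
    }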

Know your database

PostgreSQL allows tables to be created as UNLOGGED. That means data can be lost if the system crashes while (or shortly after) you upload your data, but so what? If that happens you'll probably want to restart the upload from scratch anyway.

I haven’t done performance testing (yet) but it’s another trick to keep in mind.
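
A sketch of the idea, assuming a hypothetical staging_city table; ALTER TABLE ... SET LOGGED requires PostgreSQL 9.5 or later:

    import java.sql.Connection;
    import java.sql.Statement;

    public class UnloggedStaging {
        // Hypothetical staging table. UNLOGGED skips the write-ahead log;
        // ALTER TABLE ... SET LOGGED needs PostgreSQL 9.5 or later.
        public static void createStaging(Connection conn) throws Exception {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE UNLOGGED TABLE staging_city (name text, population integer)");
                // ... load and verify the data ...
                stmt.execute("ALTER TABLE staging_city SET LOGGED");   // make it crash-safe afterwards
            }
        }
    }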

Know how to use the ExecutorService for multithreading

Every call to the database has dead time on the client side due to network latency and the time the database needs to complete the call. On the flip side, the database sits idle while it waits for the desktop to prepare and upload the next call. You can fill that dead time with a multithreaded uploader.

The ExecutorService makes it easy to break the work into meaningful chunks. E.g., each thread uploads a single table or a single data file. The ExecutorService also allows you to be more intelligent about how you upload your data.
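
A minimal sketch of that pattern, with one task per shapefile and a hypothetical uploadOneFile() method standing in for the real work; the pool size and timeout are placeholders to tune:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ParallelUploader {
        public static void uploadAll(List<String> shapefilePaths) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4);   // pool size is a guess to tune
            for (String path : shapefilePaths) {
                pool.submit(() -> uploadOneFile(path));   // one task per data file
            }
            pool.shutdown();                              // no new tasks
            pool.awaitTermination(8, TimeUnit.HOURS);     // wait for the uploads to finish
        }

        private static void uploadOneFile(String path) {
            // ... open the shapefile and upload its records in batches ...
        }
    }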

Upload to a dedicated schema

If possible, upload to a dedicated schema. This gives you more flexibility later.

In PostgreSQL this can be transparent to most users: call ALTER DATABASE database SET search_path = schema1, schema2, schema3 and specify both public and the schema containing the uploaded data. In other databases you can only have one active schema at a time, so you will need to explicitly specify the schema later.
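
A sketch of the PostgreSQL setup, assuming hypothetical imported_gis schema and mydb database names:

    import java.sql.Connection;
    import java.sql.Statement;

    public class DedicatedSchema {
        // Hypothetical schema (imported_gis) and database (mydb) names.
        public static void prepare(Connection conn) throws Exception {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE SCHEMA IF NOT EXISTS imported_gis");
                // Only affects new sessions, and requires sufficient privileges
                // on the database.
                stmt.execute("ALTER DATABASE mydb SET search_path = public, imported_gis");
            }
        }
    }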

Upload unprocessed data

I've sometimes found it faster to upload raw data and process it in the database (with queries and stored procedures) than to process it on the desktop and upload the cooked data. This is especially true with fixed-format records. There are benefits besides performance; see the following items.
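
For instance, a sketch of cooking fixed-format records inside the database, assuming the raw lines were already loaded (e.g., with COPY) into a hypothetical staging_raw(line text) table; the column positions are made up:

    import java.sql.Connection;
    import java.sql.Statement;

    public class CookInDatabase {
        // Hypothetical staging table and column positions.
        public static void processRawRecords(Connection conn) throws Exception {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute(
                    "INSERT INTO city (name, population) "
                    + "SELECT trim(substr(line, 1, 40)), "
                    + "       cast(trim(substr(line, 41, 10)) AS integer) "
                    + "FROM staging_raw");
            }
        }
    }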

Keep track of your source

Few things are more frustrating than having questions about what's in the database and no way to trace it back to its source. (Obviously this applies to data you uploaded from elsewhere, not data your application generated itself.) Adding a couple of fields, e.g., filename and record number, can save a lot of headaches later.
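
A sketch of what that looks like, with hypothetical table and column names:

    import java.sql.Connection;
    import java.sql.Statement;

    public class TraceableTable {
        // Hypothetical table; the last two columns trace every row to its source.
        public static void create(Connection conn) throws Exception {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute(
                    "CREATE TABLE city ("
                    + "  name          text,"
                    + "  population    integer,"
                    + "  source_file   text,"        // the shapefile's pathname
                    + "  source_record integer)");   // record number within that file
            }
        }
    }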

Plan your work

In my current app one of my first steps is to decompress and open each shapefile (data file), get the number of records, and then close it and delete the uncompressed file. This seems like pointless work, but it gives me a solid estimate of the time required to upload each file. That, in turn, allows me to make intelligent decisions about how to schedule the work. E.g., one standard approach is to use a priority queue where you always grab the most expensive work item (e.g., the one with the most records to process) when a thread becomes available. This generally results in the fastest overall upload.

This has two other benefits. First, it lets me verify that I can open and read all of the shapefiles, so I won't waste hours before running into a fatal problem. Second, it lets me verify that I wrote everything I expected. There's a problem if I found 813 records while planning but could only find 811 rows in the table later. Or, worse, 815 rows.
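
Coming back to the scheduling idea above, here is a rough sketch of largest-first scheduling, assuming a hypothetical WorkItem record holding the path and the record count gathered during planning; submitting the biggest files first to a FIFO-backed pool approximates the priority-queue approach:

    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.ExecutorService;

    public class WorkPlanner {
        // Hypothetical work item: one shapefile plus the record count found while planning.
        record WorkItem(String shapefilePath, int recordCount) { }

        // Submit the most expensive files first so the pool stays busy to the end.
        // (A fixed thread pool hands out queued tasks in submission order.)
        public static void schedule(List<WorkItem> items, ExecutorService pool) {
            items.stream()
                 .sorted(Comparator.comparingInt(WorkItem::recordCount).reversed())
                 .forEach(item -> pool.submit(() -> upload(item)));
        }

        private static void upload(WorkItem item) {
            // ... decompress, reopen and upload the shapefile ...
        }
    }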

Log everything

I write to a log table after every database call. I write to a log table after every exception. It’s expensive but it’s a lot easier to query a database table or two later than to parse log files. It’s also less expensive than you think if you have a large batch size.

I log the time, the shapefile's pathname, the number of records read, and the number of records written. The exception log records the time, the shapefile's pathname, and the exception messages. I log all exception messages, following both the 'cause' and the 'nextException' chains (for SQLExceptions).
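
A sketch of the per-batch logging call, assuming a hypothetical upload_log table like the one shown in the comment:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;

    public class UploadLog {
        // Assumes a hypothetical log table such as:
        //   CREATE TABLE upload_log (logged_at timestamptz, shapefile text,
        //                            records_read integer, records_written integer)
        public static void logBatch(Connection conn, String shapefile,
                                    int recordsRead, int recordsWritten) throws Exception {
            String sql = "INSERT INTO upload_log (logged_at, shapefile, records_read, records_written) "
                       + "VALUES (?, ?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setTimestamp(1, new Timestamp(System.currentTimeMillis()));
                ps.setString(2, shapefile);
                ps.setInt(3, recordsRead);
                ps.setInt(4, recordsWritten);
                ps.executeUpdate();
            }
        }
    }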

Perform quality checks

Check that you've written all expected records and had no exceptions.

Check the data for self-consistency. Add unique indexes. Add referential integrity constraints. There’s a real cost to adding indexes when inserting and updating data but there’s no downside if these will be read-only tables.

Check for null values. Check for reasonable values, e.g., populations and ages can never be negative.

You can often perform checks specific to the data. E.g., I know that states/provinces must be WITHIN their respective countries, and cities must be WITHIN their respective states/provinces. (Some US states require cities to be WITHIN a single county, but other states allow cities to cross county boundaries.)
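
A sketch of one such PostGIS check, assuming hypothetical city(name, state_id, geom) and state(id, geom) tables:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class QualityChecks {
        // Hypothetical tables: city(name, state_id, geom) and state(id, geom).
        public static void reportCitiesOutsideTheirState(Connection conn) throws Exception {
            String sql = "SELECT c.name FROM city c "
                       + "JOIN state s ON c.state_id = s.id "
                       + "WHERE NOT ST_Within(c.geom, s.geom)";
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println("QC failure: " + rs.getString(1));
                }
            }
        }
    }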

There are two possibilities if the QC fails. First, your upload could be faulty. This is easy to check if you recorded the source of each record. Second, the data itself could be faulty. In either case you want to know this as soon as possible.

Backup your work

This should be a no-brainer: perform a database backup of these tables immediately after the QC passes.

Lock down the data

This only applies if the tables will never, or only rarely, be modified after loading. (Think of tables containing information about states or area codes.)

Lock down the tables. You can use both belt and suspenders: REVOKE INSERT, UPDATE, TRUNCATE ON TABLE table FROM PUBLIC and ALTER TABLE table SET READ ONLY. (Some people prefer VACUUM FREEZE table.) If you uploaded the data into a dedicated schema you can use the shorthand REVOKE INSERT, UPDATE, TRUNCATE ON ALL TABLES IN SCHEMA schema FROM PUBLIC. You can also REVOKE CREATE permissions on the schema.

Reference: Tips for Importing Data from our JCG partner Bear Giles at the Invariant Properties blog.