Quick PostgreSQL Optimization

  • warning: include(/tmp/fortune.txt): failed to open stream: No such file or directory in /home/mohawksoft/org/www/htdocs/includes/common.inc(1696) : eval()'d code on line 1.
  • warning: include(): Failed opening '/tmp/fortune.txt' for inclusion (include_path='.:/usr/share/php:/usr/share/pear') in /home/mohawksoft/org/www/htdocs/includes/common.inc(1696) : eval()'d code on line 1.

A Quick Look at How To Optimize PostgreSQL Queries

Introduction

PostgreSQL is, if not the most, one of the most advanced open source database systems. With amazing features that rival the most advanced commercial databases, PostgreSQL is an enterprise database that is open source and free.

As advanced as PostgreSQL is, some of the very features that make it such a great system may confuse the database novice and result in sub-optimal performance.

This document is not a full analysis of PostgreSQL. PostgreSQL is a large software package with a lot of options and nuanced configuration details. Most of which should be left to the DBA. This document is intended for the novice or casual database developer who just wants their system to respond better, leaving the higher level database configuration to the DBA.

EXPLAIN

PostgreSQL has a utility called EXPLAIN, it is used to examine how a query will be executed. It is similar to Oracle's EXPLAIN PLAN, but easier to use. Think of EXPLAIN as your query debugger, it shows you how long it takes to get the first row of data, how long it will take to get every row, how many rows will be returned, and how wide a row is in bytes. Before you get too excited, everything is an merely an estimate and the units of time are an arbitrary number that loosely represents the time it takes to fetch a disk page.

Even though everything is an estimate, it is an amazing tool that shows you how PostgreSQL will execute your query will execute, and if you pay attention, it will show you why your queries don't perform like they should.

Query Analyzer and Planner

One of the things that stands out in PostgreSQL over packages like Oracle is a seeming lack of control over how queries are executed. This is because the PostgreSQL developers have taken the approach that PostgreSQL should be able to analyze and plan queries efficiently without user direction. Oracle on the other hand allows “hints” to be passed to the query planner to help guide its operation. While there are great points to be made for each approach, PostgreSQL focuses on its analyzer and planner and that's where the query writer needs to direct his or her energy.

Before you feel completely lost, there are some variables that can be set prior to submitting a query that will affect how the analyzer and planner operate. While they do not have the degree of control that Oracle hints do, they will help you out of a tricky situation if you just can't get the query to operate efficiently. Their operation is not part of this document and are recommended reading in the PostgreSQL documentation.

An Example: Zipcity

Since this is not an in-depth dissection of PostgreSQL, but merely a quick guide to basic optimization, we'll dispense with the long detail analysis and dive right in to an example.

The Zipcity example is a precursor to a geographical location system where by locations can be found by proximity to a given address. The first step, presented here, is to find the geographical location, in longitude and latitude, based on a street address.

There are four tables: addrs, streets, zipcity, and zipcode.

ADDRS

Table "public.addrs"

Column | Type | Modifiers

----------+---------+-----------

streetid | integer |

fraddr | integer |

toaddr | integer |

frlat | numeric |

frlong | numeric |

tolat | numeric |

tolong | numeric |

The ADDRS table contains the geographical information for a range of street addresses. The STREETID is a unique ID of a particular street in the USA. The FRADDR and the TOADDR columns represent the high and low numerical address on a street, i.e. 1060 West Addison would fall between a FRADDR of 1059 and a TOADDR of 1062. The FRLAT and FRLONG as well as the TOLAT and TOLONG represent the upper and lower longitude and latitude points of a rectangle that encompasses the locations.

The are 16,989,536 rows in the ADDRS table.

STREETS

Table "public.streets"

Column | Type | Modifiers

----------+---------+-------------------------------------------

street | text |

type | text |

zipcode | integer |

streetid | integer |



The STREETS table contain an entry for every street in the U.S.A. The STREET column is the name of the street, like MASSACHUSETTS. The TYPE is an abbreviation of the street type, like AVE or ST. The ZIPCODE is an integer that represents the 5 digit zip code, i.e. 2186 instead of 02186. The STREETID is an arbitrary unique identification of the street that links the STREETS table with the ADDRS table.

There are 4,475,533 rows in the STREETS table.

ZIPCODE

Table "public.zipcode"

Column | Type | Modifiers

----------+-------------------+-----------

zipcode | integer |

city | character varying |

state | character varying |

lat | numeric |

lng | numeric |

timezone | integer |

dst | integer |

orig | integer |

The ZIPCODE table is publicly available data that you can find easily with Google. The columns of interest are ZIPCODE, CITY, STATE, LNG, and LAT. The ZIPCODE, CITY, and STATE are pretty obvious. The LAT and LNG are the geographical center of the area encompassed by the ZIPCODE.

There are 43,191 entries in the ZIPCODE table.

The zipcode table is not currently used in this example but will be referenced in subsequent examples.

ZIPCITY

Table "public.zipcity"

Column | Type | Modifiers

---------+---------+-----------

zipcode | integer |

city | text |

state | text |

type | integer |

The ZIPCITY table is used to correct a deficiency in the ZIPCODE table. The ZIPCODE table has the “official” city names and not the commonly used canonical city names. For instance, Boston, MA has over 50 entries. The ZIPCODE table does not contain names like “Dorchester” or “Hyde Park.” The ZIPCITY table contains alternate city names. The TYPE column indicates whether or not the name is an “official” name or a canonical alternate name. An “official” name has a TYPE of '0' and an alternate name has a TYPE of '1.'

There are 79,000 entries in the ZIPCITY table.

Indexing and ANALYZE

Before we start writing and optimizing queries, we need to go over some inter-related aspects of the PostgreSQL query analyzer, table indexes, and table statistics. Specifically, how and when PostgreSQL decides to use an index and plan your query.

If you are a complete novice at using SQL you may be confused about what an index is, an index is a device that facilitates quick access to a table by creating an ordered reference to the data based on a column within the table. For instance, “CREATE INDEX foobar on ZIPCITY(city);” will create a quick access to the data in the ZIPCITY table based on the CITY column.

While it may seem counter intuitive, but using an index is not always the right choice. PostgreSQL attempts to guess when it is best to use an index and when it is best not. Indexes are not free, in fact, they can be pretty expensive. They only reason to use one is when the index reduces the amount of work that is required to find one or more records from a table. For instance, using the ZIPCITY example, this select statement “SELECT * FROM ZIPCITY ORDER BY CITY,” on the surface, it would seem that you should just scan the index you created previously and output the data in that order. Simple, right?

One most database systems the index file is structured similarly to the disk tables themselves, as a set of disk blocks. What's difference is that the structure of the data in the index isn't rows and columns of data, but a set of entries for each row in the data table it indexes. Each unique entries with have one or more references to a row in the data table. Typically structured as some binary tree variant, a number of blocks must be read in order to find the correct entry. Reading a large amount of the index in addition to a substantial amount of the data table, could end up being more disk I/O than simply reading all of the data table without the index.

To effectively use an index (or not) PostgreSQL needs statistics to be gathered on a table. Without proper statistics, PostgreSQL's query analyzer and planner will not be able to calculate the proper query paths. So, it is important to run ANALYZE or VACUUM ANALYZE periodically to ensure that the optimization steps you take are handled properly by PostgreSQL.

Additionally, the statistics gathered by ANALYZE are used for far more than just choosing an index. The statistics describe the data contained in a table and the analyzer and planner use them to choose strategies used in the query for operations such as joining and sorting.

Finding an Address in ADDRS

Lets start looking for an address in the ADDRS table. For now, we'll just use a random number for STREETID.

zipcity=# select * from addrs where streetid = 123;

streetid | fraddr | toaddr | frlat | frlong | tolat | tolong

----------+--------+--------+-----------+------------+-----------+------------

123 | 8851 | 8881 | 45.863189 | -87.063913 | 45.864212 | -87.064164

123 | 8835 | 8849 | 45.864212 | -87.064164 | 45.864746 | -87.064249

123 | 8727 | 8833 | 45.864746 | -87.064249 | 45.868630 | -87.065551

(3 rows)

The first thing you'll notice is that it took a long time. That's because the database had to scan the whole table looking for all the rows where STREETID equals 123.

The EXPLAIN of the query looks like this:

zipcity=# explain select * from addrs where streetid = 123;

QUERY PLAN

----------------------------------------------------------------

Seq Scan on addrs (cost=0.00..437676.20 rows=84948 width=140)

Filter: (streetid = 123)

(2 rows)

As you can see, it has a HUGE cost, 437676! Now, lets create an index:

zipcity=# create index addrs_streetid on addrs(streetid) ;

CREATE INDEX

Now, lets re-run that EXPLAIN and see the difference.

zipcity=# explain select * from addrs where streetid = 123;

QUERY PLAN

------------------------------------------------------------------------------------

Bitmap Heap Scan on addrs (cost=1408.09..167596.78 rows=84948 width=140)

Recheck Cond: (streetid = 123)

-> Bitmap Index Scan on addrs_streetid (cost=0.00..1386.85 rows=84948 width=0)

Index Cond: (streetid = 123)

(4 rows)



Notice that the “up front” cost is 1408 and the overall cost is 167596! That's because with the index, there is some minimum amount of work to be done to find the records, and the maximum cost is much higher (remember we said indexes were expensive?)

Lets run the original query again. You'll see the same results, but the database responded almost instantaneously.

Now, lets find a specific address.

zipcity=# select * from addrs where streetid = 123 and fraddr <= 8727 and toaddr >= 8727;

streetid | fraddr | toaddr | frlat | frlong | tolat | tolong

----------+--------+--------+-----------+------------+-----------+------------

123 | 8727 | 8833 | 45.864746 | -87.064249 | 45.868630 | -87.065551

(1 row)

That worked pretty fast, but as you know, a high performance web site has to perform much faster than a single person can see, so lets see what happened with EXPLAIN.

zipcity=# explain select * from addrs where streetid = 123 and fraddr <= 8727 and toaddr >= 8727;

QUERY PLAN

------------------------------------------------------------------------------------

Bitmap Heap Scan on addrs (cost=1389.21..168002.64 rows=9439 width=140)

Recheck Cond: (streetid = 123)

Filter: ((fraddr <= 8727) AND (toaddr >= 8727))

-> Bitmap Index Scan on addrs_streetid (cost=0.00..1386.85 rows=84948 width=0)

Index Cond: (streetid = 123)

(5 rows)

As you can see, it found the list of records using the same method as before, but it added a filter for the fraddr and toaddr. Now, what if our street was a long street with a lot of buildings? For instance, StreetID 357872 has 251 entries in the table. To satisfy this query we would have to scan all 251 entries to find the correct entry. So, lets create an index on fraddr and toaddr.

zipcity=# create index addrs_addrs on addrs(fraddr, toaddr) ;

CREATE INDEX

Now lets run the explain again

zipcity=# explain select * from addrs where streetid = 357172 and fraddr <= 8727 and toaddr >= 8727;

QUERY PLAN

-------------------------------------------------------------------------------------------

Bitmap Heap Scan on addrs (cost=120141.72..151671.83 rows=9439 width=140)

Recheck Cond: ((streetid = 357172) AND (fraddr <= 8727) AND (toaddr >= 8727))

-> BitmapAnd (cost=120141.72..120141.72 rows=9439 width=0)

-> Bitmap Index Scan on addrs_streetid (cost=0.00..1386.85 rows=84948 width=0)

Index Cond: (streetid = 357172)

-> Bitmap Index Scan on addrs_addrs (cost=0.00..118749.90 rows=1887726 width=0)

Index Cond: ((fraddr <= 8727) AND (toaddr >= 8727))

(7 rows)

See how it is using the index to find the addresses, then a second index to check the streetid? In my opinion, that seems broken, its cost estimates are higher than without using the new index. What went wrong? We forgot to ANALYZE the table!! (You should ANALYZE periodically or when ever the table changes dramatically. In this example, the database had not been analyzed since before it had been populated.)

zipcity=# vacuum analyze addrs ;

VACUUM

Now, lets try the EXPLAIN again:

zipcity=# explain select * from addrs where streetid = 357172 and fraddr <= 8727 and toaddr >= 8727;

QUERY PLAN

--------------------------------------------------------------------------------

Bitmap Heap Scan on addrs (cost=14.66..2532.62 rows=90 width=64)

Recheck Cond: (streetid = 357172)

Filter: ((fraddr <= 8727) AND (toaddr >= 8727))

-> Bitmap Index Scan on addrs_streetid (cost=0.00..14.64 rows=653 width=0)

Index Cond: (streetid = 357172)

(5 rows)

That's certainly better, lets look at what's happening. PostgreSQL knows that getting the list of the streets by streetid is more efficient than using the addresses. Notice that this plan completely ignores the address index we created. That's because the index of addresses is less exclusionary than is streetid. Since the addresses index creates more work than the streetid index, the index of addresses is ignored in the query.

Lets create a new composite index for ADDRS.

zipcity=# create index addrs_strtofr on addrs(streetid,fraddr, toaddr) ;

CREATE INDEX

Now, lets run that EXPLAIN again

zipcity=# explain select * from addrs where streetid = 357172 and fraddr <= 8727 and toaddr >= 8727;

QUERY PLAN

-------------------------------------------------------------------------------------

Bitmap Heap Scan on addrs (cost=17.40..373.58 rows=90 width=64)

Recheck Cond: ((streetid = 357172) AND (fraddr <= 8727) AND (toaddr >= 8727))

-> Bitmap Index Scan on addrs_strtofr (cost=0.00..17.38 rows=90 width=0)

Index Cond: ((streetid = 357172) AND (fraddr <= 8727) AND (toaddr >= 8727))

(4 rows)

Look at the costs of the query now. The startup cost is slightly higher, but the total cost is far far lower. But, hey, we already have an index on streetid, do we have to keep both? The answer is no. Since we put “streetid” first in our index, PostgreSQL can us the composite index “ADDRS_STFRTO” for streetid based queries. So, lets drop our unused indexes.

zipcity=# drop index addrs_addrs;

DROP INDEX

zipcity=# drop index addrs_streetid;

DROP INDEX

Let's double check that our streetid query still works:

zipcity=# explain select * from addrs where streetid = 357172;

QUERY PLAN

-------------------------------------------------------------------------------

Bitmap Heap Scan on addrs (cost=19.55..2534.25 rows=653 width=64)

Recheck Cond: (streetid = 357172)

-> Bitmap Index Scan on addrs_strtofr (cost=0.00..19.39 rows=653 width=0)

Index Cond: (streetid = 357172)

(4 rows)

Yup, everything is working just fine.

Finding Streets in STREETS

Now that we can find geographical coordinates of an address based on the streetid and numerical street address, it is time to find the streetid. Lets find a pretty generic street: Maple Street. As you can imagine, there are a lot of towns and cities that have a “Maple Street.” According to this database, there's about 4500 of them, or about 1% of the streets in the U.S.A are named maple street, and over 12% of the zipcodes have a Maple Street. There are only 34160 distinct zip codes in the streets table and 447533 rows. That's means about 131 streets per zipcode, on average, with many cities having well over a thousand.

Just as a baseline, lets use EXPLAIN to show the query:

zipcity=# explain select * from streets where street = 'Maple' and type = 'St' and zipcode = 2186;

QUERY PLAN

-------------------------------------------------------------------------------------

Seq Scan on streets (cost=0.00..111725.83 rows=1 width=72)

Filter: ((street = 'Maple'::text) AND ("type" = 'St'::text) AND (zipcode = 2186))

(2 rows)

Let's create an index:

zipcity=# create index streets_zip_str_type on streets (zipcode, street, "type") ;

CREATE INDEX

Lets look at that query again:

zipcity=# explain select * from streets where street = 'Maple' and type = 'St' and zipcode = 2186;

QUERY PLAN

-----------------------------------------------------------------------------------------

Index Scan using streets_zip_str_type on streets (cost=0.00..9.19 rows=1 width=72)

Index Cond: ((zipcode = 2186) AND (street = 'Maple'::text) AND ("type" = 'St'::text))

(2 rows)

That was pretty easy.

Finding a ZIPCODE

With STREETS and ADDRS we can find the geographical location of an address based on street, address, and zipcode. This is good for most queries, but what if you don't know the zipcode? Since most people don't use the official names of their city, we'll use the ZIPCITY table to find the zipcodes based on city and state names.

zipcity=# select * from zipcity where city='DORCHESTER' and state = 'MA';

zipcode | city | state | type

---------+------------+-------+------

2121 | DORCHESTER | MA | 1

2122 | DORCHESTER | MA | 1

2124 | DORCHESTER | MA | 1

2125 | DORCHESTER | MA | 1

(4 rows)

An important thing to notice is that some cities have multiple zipcodes and it follows that streets that run for any length of the city may also have multiple zipcodes.

zipcity=# explain select * from zipcity where city='DORCHESTER' and state = 'MA';

QUERY PLAN

------------------------------------------------------------------

Seq Scan on zipcity (cost=0.00..1797.00 rows=2 width=72)

Filter: ((city = 'DORCHESTER'::text) AND (state = 'MA'::text))

(2 rows)

As before, we'll create a composite index for our query conditions.

zipcity=# create index zipcity_city_state on zipcity (city,state) ;

CREATE INDEX

Now, we use EXPLAIN again.

zipcity=# explain select * from zipcity where city='DORCHESTER' and state = 'MA';

QUERY PLAN

-----------------------------------------------------------------------------------

Index Scan using zipcity_city_state on zipcity (cost=0.00..8.29 rows=1 width=27)

Index Cond: ((city = 'DORCHESTER'::text) AND (state = 'MA'::text))

(2 rows)

(You did remember to run ANALYZE right?)

Putting it All Together

Now we have the basis of our application, lets put in all together:

Query by City, State, Street, and Address

The following query is three table implicit inner join that links addrs, streets, and zipcity.

zipcity=# select * from addrs, streets, zipcity where zipcity.city = 'DORCHESTER' and zipcity.state = 'MA' and zipcity.zipcode = streets.zipcode and streets.street = 'Maple' and streets.type = 'St' and streets.streetid = addrs.streetid and addrs.fraddr <= 97 and addrs.toaddr >= 97;

streetid | fraddr | toaddr | frlat | frlong | tolat | tolong | street | type | zipcode | streetid | zipcode | city | state | type

----------+--------+--------+-----------+------------+-----------+------------+--------+------+---------+----------+---------+------------+-------+------

2644797 | 94 | 104 | 42.307677 | -71.086603 | 42.307376 | -71.086858 | Maple | St | 2121 | 2644797 | 2121 | DORCHESTER | MA | 1

(1 row)

zipcity=# explain select * from addrs, streets, zipcity where zipcity.city = 'DORCHESTER' and zipcity.state = 'MA' and zipcity.zipcode = streets.zipcode and streets.street = 'Maple' and streets.type = 'St' and streets.streetid = addrs.streetid and addrs.fraddr <= 97 and addrs.toaddr >= 97;

QUERY PLAN

----------------------------------------------------------------------------------

Nested Loop (cost=0.00..345.33 rows=1 width=116)

-> Nested Loop (cost=0.00..17.49 rows=1 width=52)

-> Index Scan using zipcity_city_state on zipcity (cost=0.00..8.29 rows=1 width=27)

Index Cond: ((city = 'DORCHESTER'::text) AND (state = 'MA'::text))

-> Index Scan using streets_zip_str_type on streets (cost=0.00..9.19 rows=1 width=25)

Index Cond: ((zipcity.zipcode = streets.zipcode) AND (streets.street = 'Maple'::text) AND (streets."type" = 'St'::text))

-> Index Scan using addrs_strtofr on addrs (cost=0.00..326.82 rows=82 width=64)

Index Cond: ((streets.streetid = addrs.streetid) AND (addrs.fraddr <= 97) AND (addrs.toaddr >= 97))

(8 rows)

This query looks pretty good. It utilizes all the indexes we created, has a pretty low cost, and we don't see any obvious problems. Is it perfect? probably not. The reader should try to re-write the query using the JOIN syntax and see if they can create a query which produces the same data but performs better.

Query by Zipcode, Street, and Address

The following query is three table implicit inner join that links addrs, streets, and zipcity like before, except we are not searching the zipcity table by city and state, but by zipcode.

zipcity=# select * from addrs, streets, zipcity where zipcity.zipcode = 2121 and zipcity.type = 0 and zipcity.zipcode = streets.zipcode and streets.street = 'Maple' and streets.type = 'St' and streets.streetid = addrs.streetid and addrs.fraddr <= 97 and addrs.toaddr >= 97;

streetid | fraddr | toaddr | frlat | frlong | tolat | tolong | street | type | zipcode | streetid | zipcode | city | state | type

----------+--------+--------+-----------+------------+-----------+------------+--------+------+---------+----------+---------+--------+-------+------

2644797 | 94 | 104 | 42.307677 | -71.086603 | 42.307376 | -71.086858 | Maple | St | 2121 | 2644797 | 2121 | BOSTON | MA | 0

zipcity=# explain select * from addrs, streets, zipcity where zipcity.zipcode = 2121 and zipcity.type = 0 and zipcity.zipcode = streets.zipcode and streets.street = 'Maple' and streets.type = 'St' and streets.streetid = addrs.streetid and addrs.fraddr <= 97 and addrs.toaddr >= 97;

QUERY PLAN

-------------------------------------------------------------------------------------------------------------------

Nested Loop (cost=0.00..2143.94 rows=2 width=116)

-> Nested Loop (cost=0.00..346.92 rows=1 width=89)

-> Index Scan using streets_zip_str_type on streets (cost=0.00..9.19 rows=1 width=25)

Index Cond: ((zipcode = 2121) AND (street = 'Maple'::text) AND ("type" = 'St'::text))

-> Index Scan using addrs_strtofr on addrs (cost=0.00..336.71 rows=82 width=64)

Index Cond: ((streets.streetid = addrs.streetid) AND (addrs.fraddr <= 97) AND (addrs.toaddr >= 97))

-> Seq Scan on zipcity (cost=0.00..1797.00 rows=2 width=27)

Filter: ((zipcode = 2121) AND ("type" = 0))

(8 rows)

While this query generates the same basic data, but performs much worse. First, it scans the zipcity table by zipcode not city and state. This forces a sequential scan of the whole zipcity table which increases our overall costs. Since zipcodes should be more specific than the city and state (the same city and state may have multiple entries.), this query should be more efficient than the previous one.

-> Seq Scan on zipcity (cost=0.00..1797.00 rows=2 width=27)

Filter: ((zipcode = 2121) AND ("type" = 0))

Lets create a new index based on zipcode:

zipcity=# create index zipcity_zipcode on zipcity (zipcode) ;

CREATE INDEX

zipcity=# explain select * from addrs, streets, zipcity where zipcity.zipcode = 2121 and zipcity.type = 0 and zipcity.zipcode = streets.zipcode and streets.street = 'Maple' and streets.type = 'St' and streets.streetid = addrs.streetid and addrs.fraddr <= 97 and addrs.toaddr >= 97;

QUERY PLAN

-------------------------------------------------------------------------------------------

Nested Loop (cost=7.60..350.68 rows=2 width=116)

-> Nested Loop (cost=7.60..342.34 rows=1 width=89)

-> Index Scan using streets_zip_str_type on streets (cost=0.00..9.19 rows=1 width=25)

Index Cond: ((zipcode = 2121) AND (street = 'Maple'::text) AND ("type" = 'St'::text))

-> Bitmap Heap Scan on addrs (cost=7.60..332.13 rows=82 width=64)

Recheck Cond: ((streets.streetid = addrs.streetid) AND (addrs.fraddr <= 97) AND (addrs.toaddr >= 97))

-> Bitmap Index Scan on addrs_strtofr (cost=0.00..7.58 rows=82 width=0)

Index Cond: ((streets.streetid = addrs.streetid) AND (addrs.fraddr <= 97) AND (addrs.toaddr >= 97))

-> Index Scan using zipcity_zipcode on zipcity (cost=0.00..8.32 rows=2 width=27)

Index Cond: (zipcode = 2121)

Filter: ("type" = 0)

(11 rows)

That's a lot better, we eliminated the sequential scan and reduced the total estimated cost of the query, but the cost is still higher than the query by city and state, do you know why? Can you create an index that will fix that problem?

To Be Continued!!

Geographical queries!

Conclusion

This article is not a comprehensive tutorial on writing and optimizing SQL queries, it is just a brief introduction to the process. The above queries are probably sub-optimal in some respects, but perform far better than without examination.

Further Reading

If you read no other PostgreSQL documentation, you should consult the on-line PostgreSQL manual, specifically “Performance Tips.” You are also encouraged to read the postgresql.conf file, typically found in your data directory for the various logging options. It is heavily commented and contains a lot of useful information.