A Quick Look at How To Optimize PostgreSQL
Queries
Introduction
PostgreSQL is, if not the most, one of the most advanced open
source database systems. With amazing features that rival the most
advanced commercial databases, PostgreSQL is an enterprise database
that is open source and free.
As advanced as PostgreSQL is, some of the very features that make
it such a great system may confuse the database novice and result in
sub-optimal performance.
This document is not a full analysis of PostgreSQL. PostgreSQL is
a large software package with a lot of options and nuanced
configuration details. Most of which should be left to the DBA. This
document is intended for the novice or casual database developer who
just wants their system to respond better, leaving the higher level
database configuration to the DBA.
EXPLAIN
PostgreSQL has a utility called EXPLAIN, it is used to examine how
a query will be executed. It is similar to Oracle's EXPLAIN PLAN, but
easier to use. Think of EXPLAIN as your query debugger, it shows you
how long it takes to get the first row of data, how long it will take
to get every row, how many rows will be returned, and how wide a row
is in bytes. Before you get too excited, everything is an merely an
estimate and the units of time are an arbitrary number that loosely
represents the time it takes to fetch a disk page.
Even though everything is an estimate, it is an amazing tool that
shows you how PostgreSQL will execute your query will execute, and if
you pay attention, it will show you why your queries don't perform
like they should.
Query Analyzer and Planner
One of the things that stands out in PostgreSQL over packages like
Oracle is a seeming lack of control over how queries are executed.
This is because the PostgreSQL developers have taken the approach
that PostgreSQL should be able to analyze and plan queries
efficiently without user direction. Oracle on the other hand allows
“hints” to be passed to the query planner to help guide its
operation. While there are great points to be made for each approach,
PostgreSQL focuses on its analyzer and planner and that's where the
query writer needs to direct his or her energy.
Before you feel completely lost, there are some variables that can
be set prior to submitting a query that will affect how the analyzer
and planner operate. While they do not have the degree of control
that Oracle hints do, they will help you out of a tricky situation if
you just can't get the query to operate efficiently. Their operation
is not part of this document and are recommended reading in the
PostgreSQL documentation.
An Example: Zipcity
Since this is not an in-depth dissection of PostgreSQL, but merely
a quick guide to basic optimization, we'll dispense with the long
detail analysis and dive right in to an example.
The Zipcity example is a precursor to a geographical location
system where by locations can be found by proximity to a given
address. The first step, presented here, is to find the geographical
location, in longitude and latitude, based on a street address.
There are four tables: addrs, streets, zipcity, and zipcode.
ADDRS
Table "public.addrs"
Column | Type | Modifiers
----------+---------+-----------
streetid | integer |
fraddr | integer |
toaddr | integer |
frlat | numeric |
frlong | numeric |
tolat | numeric |
tolong | numeric |
The ADDRS table contains the geographical information for a range
of street addresses. The STREETID is a unique ID of a particular
street in the USA. The FRADDR and the TOADDR columns represent the
high and low numerical address on a street, i.e. 1060 West Addison
would fall between a FRADDR of 1059 and a TOADDR of 1062. The FRLAT
and FRLONG as well as the TOLAT and TOLONG represent the upper and
lower longitude and latitude points of a rectangle that encompasses
the locations.
The are 16,989,536 rows in the ADDRS table.
STREETS
Table "public.streets"
Column | Type |
Modifiers
----------+---------+-------------------------------------------
street | text |
type | text |
zipcode | integer |
streetid | integer |
The STREETS table contain an entry for every street in the U.S.A.
The STREET column is the name of the street, like MASSACHUSETTS. The
TYPE is an abbreviation of the street type, like AVE or ST. The
ZIPCODE is an integer that represents the 5 digit zip code, i.e. 2186
instead of 02186. The STREETID is an arbitrary unique identification
of the street that links the STREETS table with the ADDRS table.
There are 4,475,533 rows in the STREETS table.
ZIPCODE
Table "public.zipcode"
Column | Type | Modifiers
----------+-------------------+-----------
zipcode | integer |
city | character varying |
state | character varying |
lat | numeric |
lng | numeric |
timezone | integer |
dst | integer |
orig | integer |
The ZIPCODE table is publicly available data that you can find
easily with Google. The columns of interest are ZIPCODE, CITY, STATE,
LNG, and LAT. The ZIPCODE, CITY, and STATE are pretty obvious. The
LAT and LNG are the geographical center of the area encompassed by
the ZIPCODE.
There are 43,191 entries in the ZIPCODE table.
The zipcode table is not currently used in this example but
will be referenced in subsequent examples.
ZIPCITY
Table "public.zipcity"
Column | Type | Modifiers
---------+---------+-----------
zipcode | integer |
city | text |
state | text |
type | integer |
The ZIPCITY table is used to correct a deficiency in the ZIPCODE
table. The ZIPCODE table has the “official” city names and not
the commonly used canonical city names. For instance, Boston, MA has
over 50 entries. The ZIPCODE table does not contain names like
“Dorchester” or “Hyde Park.” The ZIPCITY table contains
alternate city names. The TYPE column indicates whether or not the
name is an “official” name or a canonical alternate name. An
“official” name has a TYPE of '0' and an alternate name has a
TYPE of '1.'
There are 79,000 entries in the ZIPCITY table.
Indexing and ANALYZE
Before we start writing and optimizing queries, we need to go over
some inter-related aspects of the PostgreSQL query analyzer, table
indexes, and table statistics. Specifically, how and when PostgreSQL
decides to use an index and plan your query.
If you are a complete novice at using SQL you may be confused
about what an index is, an index is a device that facilitates quick
access to a table by creating an ordered reference to the data based
on a column within the table. For instance, “CREATE INDEX foobar on
ZIPCITY(city);” will create a quick access to the data in the
ZIPCITY table based on the CITY column.
While it may seem counter intuitive, but using an index is not
always the right choice. PostgreSQL attempts to guess when it is best
to use an index and when it is best not. Indexes are not free, in
fact, they can be pretty expensive. They only reason to use one is
when the index reduces the amount of work that is required to find
one or more records from a table. For instance, using the ZIPCITY
example, this select statement “SELECT * FROM ZIPCITY ORDER BY
CITY,” on the surface, it would seem that you should just scan the
index you created previously and output the data in that order.
Simple, right?
One most database systems the index file is structured similarly
to the disk tables themselves, as a set of disk blocks. What's
difference is that the structure of the data in the index isn't rows
and columns of data, but a set of entries for each row in the data
table it indexes. Each unique entries with have one or more
references to a row in the data table. Typically structured as some
binary tree variant, a number of blocks must be read in order to find
the correct entry. Reading a large amount of the index in addition to
a substantial amount of the data table, could end up being more disk
I/O than simply reading all of the data table without the index.
To effectively use an index (or not) PostgreSQL needs statistics
to be gathered on a table. Without proper statistics, PostgreSQL's
query analyzer and planner will not be able to calculate the proper
query paths. So, it is important to run ANALYZE or VACUUM ANALYZE
periodically to ensure that the optimization steps you take are
handled properly by PostgreSQL.
Additionally, the statistics gathered by ANALYZE are used for far
more than just choosing an index. The statistics describe the data
contained in a table and the analyzer and planner use them to choose
strategies used in the query for operations such as joining and
sorting.
Finding an Address in ADDRS
Lets start looking for an address in the ADDRS table. For now,
we'll just use a random number for STREETID.
zipcity=# select * from addrs where streetid
= 123;
streetid | fraddr | toaddr | frlat |
frlong | tolat | tolong
----------+--------+--------+-----------+------------+-----------+------------
123 | 8851 | 8881 | 45.863189 |
-87.063913 | 45.864212 | -87.064164
123 | 8835 | 8849 | 45.864212 |
-87.064164 | 45.864746 | -87.064249
123 | 8727 | 8833 | 45.864746 |
-87.064249 | 45.868630 | -87.065551
(3 rows)
The first thing you'll notice is that it took a long time. That's
because the database had to scan the whole table looking for all the
rows where STREETID equals 123.
The EXPLAIN of the query looks like this:
zipcity=# explain select * from addrs where
streetid = 123;
QUERY PLAN
----------------------------------------------------------------
Seq Scan on addrs (cost=0.00..437676.20
rows=84948 width=140)
Filter: (streetid = 123)
(2 rows)
As you can see, it has a HUGE cost, 437676! Now, lets create an
index:
zipcity=# create index addrs_streetid on
addrs(streetid) ;
CREATE INDEX
Now, lets re-run that EXPLAIN and see the difference.
zipcity=# explain select * from addrs where
streetid = 123;
QUERY
PLAN
------------------------------------------------------------------------------------
Bitmap Heap Scan on addrs
(cost=1408.09..167596.78 rows=84948 width=140)
Recheck Cond: (streetid = 123)
-> Bitmap Index Scan on addrs_streetid
(cost=0.00..1386.85 rows=84948 width=0)
Index Cond: (streetid = 123)
(4 rows)
Notice that the “up front” cost is 1408 and the overall cost
is 167596! That's because with the index, there is some minimum
amount of work to be done to find the records, and the maximum cost
is much higher (remember we said indexes were expensive?)
Lets run the original query again. You'll see the same results,
but the database responded almost instantaneously.
Now, lets find a specific address.
zipcity=# select * from addrs where streetid
= 123 and fraddr <= 8727 and toaddr >= 8727;
streetid | fraddr | toaddr | frlat |
frlong | tolat | tolong
----------+--------+--------+-----------+------------+-----------+------------
123 | 8727 | 8833 | 45.864746 |
-87.064249 | 45.868630 | -87.065551
(1 row)
That worked pretty fast, but as you know, a high performance web
site has to perform much faster than a single person can see, so lets
see what happened with EXPLAIN.
zipcity=# explain select * from addrs where
streetid = 123 and fraddr <= 8727 and toaddr >= 8727;
QUERY
PLAN
------------------------------------------------------------------------------------
Bitmap Heap Scan on addrs
(cost=1389.21..168002.64 rows=9439 width=140)
Recheck Cond: (streetid = 123)
Filter: ((fraddr <= 8727) AND (toaddr
>= 8727))
-> Bitmap Index Scan on addrs_streetid
(cost=0.00..1386.85 rows=84948 width=0)
Index Cond: (streetid = 123)
(5 rows)
As you can see, it found the list of records using the same method
as before, but it added a filter for the fraddr and toaddr. Now, what
if our street was a long street with a lot of buildings? For
instance, StreetID 357872 has 251 entries in the table. To satisfy
this query we would have to scan all 251 entries to find the correct
entry. So, lets create an index on fraddr and toaddr.
zipcity=# create index addrs_addrs on
addrs(fraddr, toaddr) ;
CREATE INDEX
Now lets run the explain again
zipcity=# explain select * from addrs where
streetid = 357172 and fraddr <= 8727 and toaddr >= 8727;
QUERY
PLAN
-------------------------------------------------------------------------------------------
Bitmap Heap Scan on addrs
(cost=120141.72..151671.83 rows=9439 width=140)
Recheck Cond: ((streetid = 357172) AND
(fraddr <= 8727) AND (toaddr >= 8727))
-> BitmapAnd
(cost=120141.72..120141.72 rows=9439 width=0)
-> Bitmap Index Scan on
addrs_streetid (cost=0.00..1386.85 rows=84948 width=0)
Index Cond: (streetid =
357172)
-> Bitmap Index Scan on
addrs_addrs (cost=0.00..118749.90 rows=1887726 width=0)
Index Cond: ((fraddr <=
8727) AND (toaddr >= 8727))
(7 rows)
See how it is using the index to find the addresses, then a second
index to check the streetid? In my opinion, that seems broken, its
cost estimates are higher than without using the new index. What went
wrong? We forgot to ANALYZE the table!! (You should ANALYZE
periodically or when ever the table changes dramatically. In this
example, the database had not been analyzed since before it had been
populated.)
zipcity=# vacuum analyze addrs ;
VACUUM
Now, lets try the EXPLAIN again:
zipcity=# explain select * from addrs where
streetid = 357172 and fraddr <= 8727 and toaddr >= 8727;
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on addrs
(cost=14.66..2532.62 rows=90 width=64)
Recheck Cond: (streetid = 357172)
Filter: ((fraddr <= 8727) AND (toaddr
>= 8727))
-> Bitmap Index Scan on addrs_streetid
(cost=0.00..14.64 rows=653 width=0)
Index Cond: (streetid = 357172)
(5 rows)
That's certainly better, lets look at what's happening.
PostgreSQL knows that getting the list of the streets by streetid is
more efficient than using the addresses. Notice that this plan
completely ignores the address index we created. That's because the
index of addresses is less exclusionary than is streetid. Since the
addresses index creates more work than the streetid index, the index
of addresses is ignored in the query.
Lets create a new composite index for ADDRS.
zipcity=# create index addrs_strtofr on
addrs(streetid,fraddr, toaddr) ;
CREATE INDEX
Now, lets run that EXPLAIN again
zipcity=# explain select * from addrs where
streetid = 357172 and fraddr <= 8727 and toaddr >= 8727;
QUERY
PLAN
-------------------------------------------------------------------------------------
Bitmap Heap Scan on addrs
(cost=17.40..373.58 rows=90 width=64)
Recheck Cond: ((streetid = 357172) AND
(fraddr <= 8727) AND (toaddr >= 8727))
-> Bitmap Index Scan on addrs_strtofr
(cost=0.00..17.38 rows=90 width=0)
Index Cond: ((streetid = 357172) AND
(fraddr <= 8727) AND (toaddr >= 8727))
(4 rows)
Look at the costs of the query now. The startup cost is slightly
higher, but the total cost is far far lower. But, hey, we already
have an index on streetid, do we have to keep both? The answer is no.
Since we put “streetid” first in our index, PostgreSQL can us the
composite index “ADDRS_STFRTO” for streetid based queries. So,
lets drop our unused indexes.
zipcity=# drop index addrs_addrs;
DROP INDEX
zipcity=# drop index addrs_streetid;
DROP INDEX
Let's double check that our streetid query still works:
zipcity=# explain select * from addrs where
streetid = 357172;
QUERY PLAN
-------------------------------------------------------------------------------
Bitmap Heap Scan on addrs
(cost=19.55..2534.25 rows=653 width=64)
Recheck Cond: (streetid = 357172)
-> Bitmap Index Scan on addrs_strtofr
(cost=0.00..19.39 rows=653 width=0)
Index Cond: (streetid = 357172)
(4 rows)
Yup, everything is working just fine.
Finding Streets in STREETS
Now that we can find geographical coordinates of an address based
on the streetid and numerical street address, it is time to find the
streetid. Lets find a pretty generic street: Maple Street. As you can
imagine, there are a lot of towns and cities that have a “Maple
Street.” According to this database, there's about 4500 of them, or
about 1% of the streets in the U.S.A are named maple street, and over
12% of the zipcodes have a Maple Street. There are only 34160
distinct zip codes in the streets table and 447533 rows. That's
means about 131 streets per zipcode, on average, with many cities
having well over a thousand.
Just as a baseline, lets use EXPLAIN to show the query:
zipcity=# explain select * from streets where
street = 'Maple' and type = 'St' and zipcode = 2186;
QUERY
PLAN
-------------------------------------------------------------------------------------
Seq Scan on streets (cost=0.00..111725.83
rows=1 width=72)
Filter: ((street = 'Maple'::text) AND
("type" = 'St'::text) AND (zipcode = 2186))
(2 rows)
Let's create an index:
zipcity=# create index streets_zip_str_type
on streets (zipcode, street, "type") ;
CREATE INDEX
Lets look at that query again:
zipcity=# explain select * from streets where
street = 'Maple' and type = 'St' and zipcode = 2186;
QUERY
PLAN
-----------------------------------------------------------------------------------------
Index Scan using streets_zip_str_type on
streets (cost=0.00..9.19 rows=1 width=72)
Index Cond: ((zipcode = 2186) AND (street
= 'Maple'::text) AND ("type" = 'St'::text))
(2 rows)
That was pretty easy.
Finding a ZIPCODE
With STREETS and ADDRS we can find the geographical location of an
address based on street, address, and zipcode. This is good for most
queries, but what if you don't know the zipcode? Since most people
don't use the official names of their city, we'll use the ZIPCITY
table to find the zipcodes based on city and state names.
zipcity=# select * from zipcity where
city='DORCHESTER' and state = 'MA';
zipcode | city | state | type
---------+------------+-------+------
2121 | DORCHESTER | MA | 1
2122 | DORCHESTER | MA | 1
2124 | DORCHESTER | MA | 1
2125 | DORCHESTER | MA | 1
(4 rows)
An important thing to notice is that some cities have multiple
zipcodes and it follows that streets that run for any length of the
city may also have multiple zipcodes.
zipcity=# explain select * from zipcity where
city='DORCHESTER' and state = 'MA';
QUERY PLAN
------------------------------------------------------------------
Seq Scan on zipcity (cost=0.00..1797.00
rows=2 width=72)
Filter: ((city = 'DORCHESTER'::text) AND
(state = 'MA'::text))
(2 rows)
As before, we'll create a composite index for our query
conditions.
zipcity=# create index zipcity_city_state on
zipcity (city,state) ;
CREATE INDEX
Now, we use EXPLAIN again.
zipcity=# explain select * from zipcity where
city='DORCHESTER' and state = 'MA';
QUERY
PLAN
-----------------------------------------------------------------------------------
Index Scan using zipcity_city_state on
zipcity (cost=0.00..8.29 rows=1 width=27)
Index Cond: ((city = 'DORCHESTER'::text)
AND (state = 'MA'::text))
(2 rows)
(You did remember to run ANALYZE right?)
Putting it All Together
Now we have the basis of our application, lets put in all
together:
Query by City, State, Street, and Address
The following query is three table implicit inner join that links
addrs, streets, and zipcity.
zipcity=# select * from addrs, streets,
zipcity where zipcity.city = 'DORCHESTER' and zipcity.state = 'MA'
and zipcity.zipcode = streets.zipcode and streets.street = 'Maple'
and streets.type = 'St' and streets.streetid = addrs.streetid and
addrs.fraddr <= 97 and addrs.toaddr >= 97;
streetid | fraddr | toaddr | frlat |
frlong | tolat | tolong | street | type | zipcode |
streetid | zipcode | city | state | type
----------+--------+--------+-----------+------------+-----------+------------+--------+------+---------+----------+---------+------------+-------+------
2644797 | 94 | 104 | 42.307677 |
-71.086603 | 42.307376 | -71.086858 | Maple | St | 2121 |
2644797 | 2121 | DORCHESTER | MA | 1
(1 row)
zipcity=# explain select * from addrs,
streets, zipcity where zipcity.city = 'DORCHESTER' and zipcity.state
= 'MA' and zipcity.zipcode = streets.zipcode and streets.street =
'Maple' and streets.type = 'St' and streets.streetid =
addrs.streetid and addrs.fraddr <= 97 and addrs.toaddr >= 97;
QUERY PLAN
----------------------------------------------------------------------------------
Nested Loop (cost=0.00..345.33 rows=1
width=116)
-> Nested Loop (cost=0.00..17.49
rows=1 width=52)
-> Index Scan using
zipcity_city_state on zipcity (cost=0.00..8.29 rows=1 width=27)
Index Cond: ((city =
'DORCHESTER'::text) AND (state = 'MA'::text))
-> Index Scan using
streets_zip_str_type on streets (cost=0.00..9.19 rows=1 width=25)
Index Cond: ((zipcity.zipcode
= streets.zipcode) AND (streets.street = 'Maple'::text) AND
(streets."type" = 'St'::text))
-> Index Scan using addrs_strtofr on
addrs (cost=0.00..326.82 rows=82 width=64)
Index Cond: ((streets.streetid =
addrs.streetid) AND (addrs.fraddr <= 97) AND (addrs.toaddr >=
97))
(8 rows)
This query looks pretty good. It utilizes all the indexes we
created, has a pretty low cost, and we don't see any obvious
problems. Is it perfect? probably not. The reader should try to
re-write the query using the JOIN syntax and see if they can create a
query which produces the same data but performs better.
Query by Zipcode, Street, and Address
The following query is three table implicit inner join that links
addrs, streets, and zipcity like before, except we are not searching
the zipcity table by city and state, but by zipcode.
zipcity=# select * from addrs, streets,
zipcity where zipcity.zipcode = 2121 and zipcity.type = 0 and
zipcity.zipcode = streets.zipcode and streets.street = 'Maple' and
streets.type = 'St' and streets.streetid = addrs.streetid and
addrs.fraddr <= 97 and addrs.toaddr >= 97;
streetid | fraddr | toaddr | frlat |
frlong | tolat | tolong | street | type | zipcode |
streetid | zipcode | city | state | type
----------+--------+--------+-----------+------------+-----------+------------+--------+------+---------+----------+---------+--------+-------+------
2644797 | 94 | 104 | 42.307677 |
-71.086603 | 42.307376 | -71.086858 | Maple | St | 2121 |
2644797 | 2121 | BOSTON | MA | 0
zipcity=# explain select * from addrs,
streets, zipcity where zipcity.zipcode = 2121 and zipcity.type = 0
and zipcity.zipcode = streets.zipcode and streets.street = 'Maple'
and streets.type = 'St' and streets.streetid = addrs.streetid and
addrs.fraddr <= 97 and addrs.toaddr >= 97;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..2143.94 rows=2
width=116)
-> Nested Loop (cost=0.00..346.92
rows=1 width=89)
-> Index Scan using
streets_zip_str_type on streets (cost=0.00..9.19 rows=1 width=25)
Index Cond: ((zipcode = 2121)
AND (street = 'Maple'::text) AND ("type" = 'St'::text))
-> Index Scan using
addrs_strtofr on addrs (cost=0.00..336.71 rows=82 width=64)
Index Cond: ((streets.streetid
= addrs.streetid) AND (addrs.fraddr <= 97) AND (addrs.toaddr >=
97))
-> Seq Scan on zipcity
(cost=0.00..1797.00 rows=2 width=27)
Filter: ((zipcode = 2121) AND
("type" = 0))
(8 rows)
While this query generates the same basic data, but performs much
worse. First, it scans the zipcity table by zipcode not city and
state. This forces a sequential scan of the whole zipcity table
which increases our overall costs. Since zipcodes should be more
specific than the city and state (the same city and state may have
multiple entries.), this query should be more efficient than the
previous one.
-> Seq Scan on zipcity
(cost=0.00..1797.00 rows=2 width=27)
Filter: ((zipcode = 2121) AND
("type" = 0))
Lets create a new index based on zipcode:
zipcity=# create index zipcity_zipcode on
zipcity (zipcode) ;
CREATE INDEX
zipcity=# explain select * from addrs,
streets, zipcity where zipcity.zipcode = 2121 and zipcity.type = 0
and zipcity.zipcode = streets.zipcode and streets.street = 'Maple'
and streets.type = 'St' and streets.streetid = addrs.streetid and
addrs.fraddr <= 97 and addrs.toaddr >= 97;
QUERY
PLAN
-------------------------------------------------------------------------------------------
Nested Loop (cost=7.60..350.68 rows=2
width=116)
-> Nested Loop (cost=7.60..342.34
rows=1 width=89)
-> Index Scan using
streets_zip_str_type on streets (cost=0.00..9.19 rows=1 width=25)
Index Cond: ((zipcode = 2121)
AND (street = 'Maple'::text) AND ("type" = 'St'::text))
-> Bitmap Heap Scan on addrs
(cost=7.60..332.13 rows=82 width=64)
Recheck Cond:
((streets.streetid = addrs.streetid) AND (addrs.fraddr <= 97) AND
(addrs.toaddr >= 97))
-> Bitmap Index Scan on
addrs_strtofr (cost=0.00..7.58 rows=82 width=0)
Index Cond:
((streets.streetid = addrs.streetid) AND (addrs.fraddr <= 97) AND
(addrs.toaddr >= 97))
-> Index Scan using zipcity_zipcode on
zipcity (cost=0.00..8.32 rows=2 width=27)
Index Cond: (zipcode = 2121)
Filter: ("type" = 0)
(11 rows)
That's a lot better, we eliminated the sequential scan and reduced
the total estimated cost of the query, but the cost is still higher
than the query by city and state, do you know why? Can you create an
index that will fix that problem?
To Be Continued!!
Geographical queries!
Conclusion
This article is not a comprehensive tutorial on writing and
optimizing SQL queries, it is just a brief introduction to the
process. The above queries are probably sub-optimal in some
respects, but perform far better than without examination.
Further Reading
If you read no other PostgreSQL documentation, you should consult
the on-line PostgreSQL manual, specifically “Performance Tips.”
You are also encouraged to read the postgresql.conf file, typically
found in your data directory for the various logging options. It is
heavily commented and contains a lot of useful information.