Recently in PostgreSQL Category

postgreslogo.png
Earlier this year, blogger and PostgreSQL Committer Andrew Dunstan drew up a list of individual Committers to the PostgreSQL Project. We are proud to say that this list featured some of our employees.


In May, PostgreSQL's Andrew Dunstan published some data about the productivity of PostgreSQL Committers at 30,000 commits and still going strong, detailing the number of commits made by developers with commit rights. Incidentally, becoming a Committer is no mean feat; although there is no set procedure for acquiring the right to commit, it will generally follow a candidate having sent numerous good patches over a long period of time. Existing Committers, or the core team will then propose and approve assigning Committer's rights to the candidate.

credativ can claim involvement with many other Open Source Projects in addition to PostgreSQL. Community involvement is taken seriously at credativ, as is evident from Andrew Dunstan's statistics. A few of the Committers mentioned work at various international credativ offices; Michael Meskes, Joe Conway and Dave Cramer. What is not clear from Dunstan's list is the number of credativ employees who contribute large amounts of code but are not actually Committers; take Bernd Helmle, for example, who readers of this blog will be familiar with from his PostgreSQL articles not only as author but also as a developer, yet he does not feature in Andrew's statistics.

Nevertheless credativ's presence on this list is indicative of our achievements as well as our employees' connections with Open Source; if you would like to know more about our Open Source involvement simply leave us a comment here... and if you are interested in Open Source Support, please contact us.

postgreslogo.png
I've been asked on at least three separate occasions lately how to know if changing a particular postgresql.conf item requires a restart, or a reload, of PostgreSQL. Here is my quick and dirty favorite way to answer this question:

-- configs requiring postgresql restart
select name, setting, context
  from pg_settings where context = 'postmaster';

-- configs requiring postgresql reload
select name, setting, context
 from pg_settings where context = 'sighup';

postgreslogo.pngRlogo.jpgWhen you pass large amounts of data to and from PL/R, quite a lot of time is needed for converting. A change is being tested which treats arrays of 4 byte integers and 8 byte floating point values as a special case, resulting in a dramatic performance improvement.

In a recent post, I discussed PL/R performance related to seismic timeseries data stored as an array of floats that are all recorded during some seismic event at a constant sampling rate. The problem was that when dealing with, say, 14000 arrays of floats, each having on the order of 16000 elements, passing the data to and from PL/R proved slower than hoped.

My ultimate solution was to show how a significant performance improvement could be achieved by importing the arrays into Postgres tables directly as raw R objects, and then operating on those objects later using PL/R. The problem with this approach is that in some, if not most, cases, you may want to access that same data from other procedural languages or hand off the arrays to some client other than R. In this case the raw R object does not meet your needs.

So I thought about it a bit and researched the source code on the Postgres and R sides of PL/R, and concluded that for certain special cases it was possible to dramatically improve speed by skipping the one-at-a-time element conversion as arrays are processed going between PostgreSQL and R. Specifically, the in-memory storage of the array data is binary compatible in the following circumstances:

  1. pgsql -> R
    • Argument is integer or double precision array
    • Element data type is pass-by-value for given Postgres version and architecture
    • No NULL elements
    • Array is one dimensional
  2. R -> pgsql
    • Integer vector returned with integer array return type
    • Real vector returned with double precision array return type
    • No NA elements
    • One dimensional vector

Pass-by-value is most likely true for double precision (float8) if PostgreSQL is at least version 8.4 and was built with a 64 bit system architecture. If these conditions are met, PL/R now simply copies en masse the in-memory array data from the PostgreSQL array data structure to the R vector data structure. This avoids all the overhead associated with iterating over the array element by element. Although I am not a fan of special case code such as this, the use case is important (if you are crunching numbers, they are likely stored as double precision elements), and the performance benefit is huge. Here is the timing difference with the patched PL/R versus the unpatched PL/R:

DROP TABLE IF EXISTS test_ts;
CREATE TABLE test_ts
(
  dataid bigint NOT NULL,
  data double precision[],
  CONSTRAINT pk_data PRIMARY KEY (dataid)
);

CREATE OR REPLACE FUNCTION load_test(int) RETURNS text AS $$
 DECLARE
  i    int;
 BEGIN
  FOR i IN 1..$1 LOOP
   --16789 double precision elements in the data array
   INSERT INTO test_ts (dataid, data) VALUES (i, '{-0.0205086770285039,...}');
  END LOOP;
  RETURN 'OK';
 END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION
filt_r_nothing(ts double precision[])
RETURNS double precision[] AS $$
  return(ts);
$$ LANGUAGE 'plr' IMMUTABLE;

CREATE OR REPLACE FUNCTION
filt_r_avg(ts double precision[])
RETURNS double precision AS $$
  return(mean(ts));
$$ LANGUAGE 'plr' IMMUTABLE;

-- INSERT 14000 rows of 16789 element arrays
SELECT load_test(14000);

-- unpatched code
UPDATE test_ts SET data = filt_r_nothing(data);
UPDATE 14000
Time: 1224087.064 ms

-- patched code
UPDATE test_ts SET data = filt_r_nothing(data);
UPDATE 14000
Time: 225591.429 ms

-- unpatched code
contrib_regression=# select filt_r_avg(data) from test_ts;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
[...]
(14000 rows)
Time: 441573.619 ms

-- patched code
contrib_regression=# select filt_r_avg(data) from test_ts;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
[...]
(14000 rows)
Time: 6541.039 ms

-- unpatched code
select array_upper(filt_r_nothing(data),1) from test_ts;
 array_upper 
-------------
       16879
       16879
[...]
(14000 rows)
Time: 1108651.349 ms

-- patched code
select array_upper(filt_r_nothing(data),1) from test_ts;
 array_upper 
-------------
       16879
       16879
[...]
(14000 rows)
Time: 23101.602 ms


So to summarize:

TestCaseTime (ms)Improvement
UPDATE NOOPUnpatched1224087.064--
UPDATE NOOPPatched225591.42982%
SELECT AVGUnpatched441573.619--
SELECT AVGPatched6541.03998%
SELECT NOOPUnpatched1108651.349--
SELECT NOOPPatched23101.60298%

Pretty substantial improvement in these particular, but I think common, use cases. The UPDATE test sees less overall benefit because the time to write out the changes would be significant and the same regardless of array handling in PL/R. The difference between SELECT NOOP and SELECT AVG is driven by the fact that the latter returns a scalar result, while the former returns the entire array. The reason SELECT NOOP does array_upper() on the returned array, is that otherwise all that array data (something like 4 GB) gets materialized in memory by psql, which of course greatly slows things further and is not what we are trying to test.

Please give the changes a try and provide feedback before I release another PL/R version. You can grab the new code from github and sign up for the PL/R mailing list to post your results or report any questions/problems. And of course visit the PL/R homepage and PL/R wiki for more general information about PL/R -- particularly to watch for these changes in the next official release. Finally, don't hesitate to contact me directly if the other choices don't suit you for some reason.

postgreslogo.pngRlogo.jpgWhen you pass large amounts of data to and from PL/R, quite a lot of time is needed for converting. It's better to directly store the data as R objects.

I had been planning to continue with timeseries aggregation, but decided to take a side-road based on a recent question on the PL/R mailing list.

The question was related to seismic data, which is in fact timeseries data. However, I guess the data is normally stored as an array of floats that are all recorded during some seismic event at a constant sampling rate. The arrays are available from online sources in an individual file for each event being analyzed. The problem was that when dealing with, say, 14000 arrays of floats, each having on the order of 16000 elements, passing the data to and from PL/R proved slower than hoped.

So we start with loading of sample data for a performance test:

DROP TABLE IF EXISTS test_ts;
CREATE TABLE test_ts
(
  dataid bigint NOT NULL,
  data double precision[],
  CONSTRAINT pk_data PRIMARY KEY (dataid)
);

CREATE OR REPLACE FUNCTION filt_r_nothing(ts double precision[])
RETURNS double precision[] AS $$
 return(ts);
$$ LANGUAGE 'plr' IMMUTABLE;

CREATE OR REPLACE FUNCTION load_test(int) RETURNS text AS $$
  DECLARE
   i    int;
  BEGIN
    FOR i IN 1..$1 LOOP
      INSERT INTO test_ts(dataid,data) VALUES (i,'{-0.0205086770285039, ...'})
    END LOOP;
    RETURN 'OK';
  END;
$$ LANGUAGE plpgsql;

SELECT load_test(14000);
 load_test 
-----------
 OK
(1 row)

Time: 123861.362 ms

The array in the VALUES clause of that function actually contains 16879 float8 elements. You can see that it takes over two minutes on my development machine to load the table with 14000 rows of this array. Note that on my development machine I have done no tuning of PostgreSQL configs, and I built with --enable-debug, --enable-cassert, and CFLAGS='-O0 -g3'.

Next, we update the data column with filt_r_nothing() which does nothing other than returning the same array it was passed.

UPDATE test_ts SET data = filt_r_nothing(data);
UPDATE 14000
Time: 1224087.064 ms

Not pretty. Over 20 minutes. I did some profiling of PL/R and concluded most of the time was being spent converting 16879 PostgreSQL array elements from float8 datums to R vector elements one at a time while processing the function argument, and then repeating the process in reverse while creating the returned result. Perhaps there are optimizations that can be made to that process, but since PostgreSQL and R each have their own binary representation of this data, there is no avoiding the conversion overhead.

However, what is the point of the proposed performance test? The comparison was being made to another procedural language, which apparently does not convert the array elements if they are not used. A real function is presumably going to do some calculation over the array elements, requiring that they be individually accessed.

I decided to see how PL/pgSQL performs if forced to modify and return the passed array. The difference between this test and the PL/R one will give some insight on the time spent converting elements from PostgreSQL to R native form.

CREATE OR REPLACE FUNCTION
filt_plpgsql_nothing(ts double precision[])
RETURNS double precision[] AS $$
 BEGIN
  RETURN ts || 3.14159::float8;
 END
$$ LANGUAGE 'plpgsql' IMMUTABLE;

UPDATE test_ts SET data = filt_plpgsql_nothing(data);
UPDATE 14000
Time: 239054.580 ms

About 6 minutes. Much better. But let's see what happens if we do some more meaningful, if simple, calculations on the array elements.

CREATE OR REPLACE FUNCTION
filt_plpgsql_avg(ts double precision[])
RETURNS double precision AS $$
 DECLARE
  i int;
  numts int = array_upper(ts,1);
  ts_sum float8 = 0.0;
 BEGIN
  FOR i IN 1..numts LOOP
    ts_sum := ts_sum + ts[i];
  END LOOP;
  RETURN (ts_sum/numts::float8);
 END
$$ LANGUAGE 'plpgsql' IMMUTABLE;

select filt_plpgsql_avg(data) from  test_ts;
--killed after > 1 hour

CREATE OR REPLACE FUNCTION filt_r_avg(ts double precision[])
RETURNS double precision AS $$
 return(mean(ts));
$$ LANGUAGE 'plr' IMMUTABLE;

contrib_regression=# select filt_r_avg(data) from test_ts;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
[...]
(14000 rows)
Time: 441573.619 ms

Although the PL/R function still took over 7 minutes to process 14000 rows with 16879 elements, PL/pgSQL took long enough that I killed it out of impatience.

It occurred to me that a feature I added to PL/R within the past year or so might come in handy about now. Namely, it is possible to directly store R objects in PostgreSQL tables. This means that when the datum is passed to a PL/R function, it is all ready to go -- no conversion needed. Let's take a look at that scenario.

DROP TABLE IF EXISTS test_ts_obj;
CREATE TABLE test_ts_obj
(
  dataid serial PRIMARY KEY,
  data bytea
);

CREATE OR REPLACE FUNCTION make_r_object(fname text)
RETURNS bytea AS $$
 myvar<-scan(fname,sep=",")
 return(myvar);
$$ LANGUAGE 'plr' IMMUTABLE;

INSERT INTO test_ts_obj (data) SELECT make_r_object('array-data.csv') from generate_series(1,14000);
INSERT 0 14000
Time: 44182.598 ms

CREATE OR REPLACE FUNCTION filt_r_avg(ts bytea)
RETURNS double precision AS $$
 return(mean(ts));
$$ LANGUAGE 'plr' IMMUTABLE;

select filt_r_avg(data) from  test_ts_obj;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
 [...]
 0.656530643017027
(14000 rows)

Time: 12828.331 ms

This results in 44 seconds to load the same 14000 rows of array data as before, but
directly as R objects. Compare that to the 2 minutes to load as PostgreSQL arrays as seen at the beginning of this article. And now it only takes 13 seconds to operate on the 14000 R objects compared to 442 seconds. That's a nice improvement!

But PL/R gives you access to the full power of the R environment for statistical computing and graphics. Just for fun, here is a PL/R function that calculates the "Power Spectrum" of the seismic data, and returns the result as a JPEG of the plot.

CREATE OR REPLACE FUNCTION
filt_r_ps(ts bytea)
RETURNS bytea AS $$
  library(quantmod)
  library(cairoDevice)
  library(RGtk2)

  fourier<-fft(ts)
  magnitude<-Mod(fourier)
  y2 <- magnitude[1:(length(magnitude)/10)]
  x2 <- 1:length(y2)/length(magnitude)
  mydf <- data.frame(x2,y2)

  pixmap <- gdkPixmapNew(w=500, h=500, depth=24)
  asCairoDevice(pixmap)

  plot(mydf,type="l")
  plot_pixbuf <- gdkPixbufGetFromDrawable(NULL, pixmap,
                                                        pixmap$getColormap(),
                                                        0, 0, 0, 0, 500, 500)
  buffer <- gdkPixbufSaveToBufferv(plot_pixbuf,
                                                       "jpeg",
                                                        character(0),
                                                        character(0))$buffer
  return(buffer)
$$ LANGUAGE 'plr' IMMUTABLE;

This is now not about performance so much as it is about analytical power. About half of the lines in this function are setting up to capture the output graph. The "meat" of the function can be contained in these few lines:

fourier<-fft(ts)
magnitude<-Mod(fourier)
plot(x=1:length(y2)/length(magnitude),
       y=magnitude[1:(length(magnitude)/10)],
       type="l")

Compliment that PL/R function with a bit of PHP code...

<?php
function hex2bin($data)
{
	$data = ltrim($data, "\x");
	$len = strlen($data);
	return pack("H" . $len, $data);
} 

$dbconn = pg_connect("dbname=contrib_regression");
$rs = pg_query( $dbconn, "select plr_get_raw(filt_r_ps(data))
                                    from test_ts_obj where dataid = 42");
$hexpic = pg_fetch_array($rs);
$cleandata = hex2bin($hexpic[0]);

header("Content-Type: image/jpeg");
header("Last-Modified: " .
date("r", filectime($_SERVER['SCRIPT_FILENAME'])));
header("Content-Length: " . strlen($cleandata));
echo $cleandata;
?>

...and the output looks like:plr-blog.jpg

Fairly sophisticated output for relatively little effort! For more information or assistance with respect to PostgreSQL, PL/R, and/or advanced analytics, don't hesitate to contact us.

postgreslogo.pngFrequently when dealing with parametric data, you need to "roll up" the data in summary fashion as it ages in order to reduce the volume kept on hand, or maybe because the summary statistics are what really interests you. There are several ways to do that, and this post highlights four different approaches.

I was reminded of this kind of "roll ups" today by a question on the pgsql-novice list. This is actually quite a large topic, so I this tip will likely just scratch the surface. The question was related to storing min, max, and avg summaries on an hourly, daily, and weekly basis. The basic idea, for example, is that you can keep raw data for maybe a week, hourly summaries for 6 months, daily summaries for 3 years, and weekly summaries forever. As I mentioned in my reply, I have done this kind of thing over the years using at least 4 approaches:

  1. Aggregate on demand
  2. Batch aggregate on a periodic basis -- e.g. run your aggregate query with a cron job which truncates and rebuilds a table (i.e. a "materialized view")
  3. Write a C based trigger that does "continuous aggregation" to a materialized table
  4. Write a C based bulk loader that aggregates as it bulk loads the raw data into a materialized table

The first approach is simply to run an aggregate query whenever you need the summarized data. Obviously this does not really satisfy the stated desire to discard aged raw data, but I mention it for completeness. In some cases you have sufficient storage given your data volume, and performance of the aggregate is "good enough".

The second is the rough equivalent of a materialized view. In other words, run a batch job via cron or something similar that TRUNCATEs and then repopulates a table used for storage of the aggregate result. Particularly for daily or weekly summary data, when the consumers of the data are 9-5 folk, this approach works pretty well. This also fits in nicely with common partitioning schemes.

The third is one where you want summary statistics to be updated live. In this case you actually want the summary data for the current hour/day/week to all be constantly updated as new raw data comes in. Otherwise you are stuck always looking at last hour's, or yesterday's, or last weeks, data. The way to do this is through a trigger. A while back I implemented a continuous aggregation trigger in C that used prepared queries to update my aggregate table for every INSERT/UPDATE/DELETE occurring on the target table. However even with the trigger written in C and using prepared queries, the performance impact of the trigger firing for every DML event was significant.

Finally, the forth method can be used when your reporting needs are such that the raw data can be collected for some period before storing in your database. Let's say the summary reports are never run against the current hour. What you can do is build up a file in suitable format for bulk loading via COPY. Then process the data as it is bulk loaded to calculate and insert the summary at the same time. Again, I had done that in the past using a C program that read in the stored files, generated the summary data while building a string buffer, and finally using libpq's PQputCopyData() to populate the tables.

More than likely some combination of the above is what you really want. Perhaps use method 2 to maintain your weekly and daily aggregate materialized views, and use method 4 to update your hourly aggregate data.

This post was a lot of discussion and no code -- perhaps tomorrow I will continue with some more concrete examples.

postgreslogo.pngSomeone posted a dilemma to the pgsql-sql list today that involved many if not all of his sequences getting out of sync with their respective "serial" columns. In other words, something like "SELECT max(id) FROM sometable" yields 42, but the sequence nextval for sometable.id is currently set to 36. This is obviously bad (for reasons left as an exercise for the reader). So besides trying to figure out how the database ended up in this state, he needed a script to reset all of his sequences to the correct next value.

I had run into a similar need not too long ago. Namely, when setting up multi-master replication with Bucardo you need your sequences to draw different values on either master so as not to conflict. One solution is to set up all your sequences to jump by 2, and use even numbers on one master and odd numbers on the other. Again, a script makes this easier to deal with, and I had developed one for this situation. So I modified it for the problem mentioned above.

Both versions follow:

-- create "odd" and "even" sequences in multi-master scenario
CREATE OR REPLACE FUNCTION adjust_seqs(namespace text, starteven bool)
  RETURNS text AS $$
DECLARE
  rec         record;
  startval   bigint;
  sql          text;
  fqname  text;
BEGIN
  FOR rec in EXECUTE 'select relname from pg_class where relkind = ''S''
                      and relnamespace = (select oid from pg_namespace
                      where nspname=''' || namespace || ''')' LOOP
    fqname :=  namespace || '.' ||  rec.relname;
    IF starteven THEN
      EXECUTE 'SELECT ((last_value / 2) * 2) + 2 from ' || fqname INTO startval;
    ELSE
      EXECUTE 'SELECT ((last_value / 2) * 2) + 1 from ' || fqname INTO startval;
    END If;
    sql := 'ALTER SEQUENCE ' || fqname || ' INCREMENT BY 2 RESTART WITH ' || startval;
    EXECUTE sql;
    RAISE NOTICE '%', sql;
  END LOOP;
  RETURN 'OK';
END;
$$ LANGUAGE plpgsql STRICT;
SELECT adjust_seqs('public', true);  -- in master1 (even)
SELECT adjust_seqs('public', false); -- in master2 (odd)


-- update sequences that have gotten out-of-sync with the
-- PK field for which they normally provide the default
CREATE OR REPLACE FUNCTION adjust_seqs(namespace text)
  RETURNS text AS $$
DECLARE
  rec           record;
  startval     bigint;
  sql            text;
  seqname  text;
BEGIN
  FOR rec in EXECUTE 'select table_name, column_name, column_default
                      from information_schema.columns
                      where table_schema = ''' || namespace || '''
                      and column_default like ''nextval%''' LOOP

    seqname := pg_get_serial_sequence(rec.table_name, rec.column_name);
    sql := 'select max(' || rec.column_name || ') + 1 from ' || rec.table_name;
    EXECUTE sql INTO startval;
    IF startval IS NOT NULL THEN
      sql := 'ALTER SEQUENCE ' || seqname || ' RESTART WITH ' || startval;
      EXECUTE sql;
      RAISE NOTICE '%', sql;
    END IF;
  END LOOP;
  RETURN 'OK';
END;
$$ LANGUAGE plpgsql STRICT;
select adjust_seqs('public');

Neither of these is heavily tested, and both make certain assumptions, so please test and modify to suit your own needs. Caveat emptor!

I was given a Postgres database dump to analyze today created by "pg_dump -Fc". The source database included PostGIS 1.3.x extensions. I'm not sure if this is standard with PostGIS, but the related database objects were all dumped with a hard-coded library path, specifically /usr/lib/postgresql/8.3/lib. On my machine, I have many PostgreSQL clusters (essentially at least one for every supported branch dating back to 7.3.x), but they are not located under /usr/lib/postgresql.

As such, I needed a quick fix. To wit:

pg_restore database.with.postgis.tgz > db.w.postgis.dmp
sed 's|/usr/lib/postgresql/8.3/lib|$libdir|g' < db.w.postgis.dmp > db.w.postgis.dmp.new

The first line extracts the dump file from the compressed "custom" format into a human readable text SQL file. The second line replaces the hard-coded library path with the special PostgreSQL $libdir variable. This will always point to the correct location for any given PostgreSQL cluster. You can always discover where this is by running:

pg_config --libdir

PostgreSQL 9.0 is now in Betaphase

| No Comments

postgreslogo.png
The PostgreSQL developers' community recently published the first Beta version of the new 9.0 release. Over 200 new functions and improvements feature in this new version.

With this new release, PostgreSQL now amongst other features claims an inbuilt replication solution as well as the ability to access and read standby nodes, continuously being updated by Log Shipping (Hot Standby). Streaming replication allows the sending of transaction logs directly to one or more standby nodes, which considerably reduces the amount of time lost compared with the more common, file-based log shipping. Combining these two features makes for an extremely efficient solution for high availability or loadbalanced systems.

The all new PostgreSQL version also offers the following innovations:

  • Memory based LISTEN/NOTIFY: this replaces the previous table based implementation and is much faster.
  • Exclusion Constraints: broadens constraints to be able to deal with the complex datatypes of overlapping constraints.

  • Procedural code such as PL/pgSQL, PL/Perl and PL/Python can now be done inline per DO command.
  • This means there is no longer need to define a function with CREATE FUNCTION.

  • Triggers on columns

  • Triggers can now be tied to conditions

  • Named argument lists for procedures

  • Parameters can now be flexibly linked to rolls/databases

As always, anyone interested is invited to share their test results with the developers. Information on the procedure for testing and filing of error messages can be found in the Wiki.

All blog articles which fall into the PostgreSQL category are grouped in their own feed, and if you find you need support and services for PostgreSQL, you've come to the right place at credativ.

[Howto] PostgreSQL and Linux Memory Management

| No Comments

postgreslogo.pngThe OOM-Killer can cause nasty surprises on machines with a heavy memory load; processes are cancelled or terminated without warning. Fortunately, this behaviour can be adjusted with some clever kernel tweaks.

Administrators of Linux machines with a very high RAM-Usage are sometimes faced with a terrifying scenario: the Linux OOM-Killer (OOM = Out Of Memory). In situations such as a crashed PostgreSQL instance, the following entry can typically be found in the server log:

Out of Memory: Killed process PID (Prozessname)

Why is this?

Virtual Memory and Overcommit

Virtual Memory used by Linux can be allocated in a number of ways: malloc(), mmap(), Swap, Shared Memory, to mention some examples. It is possible to overcommit virtual memory by allocating more than is actually available in the system. If this happens, a so-called "OOM-Condition" occurs; that is, your system no longer has any available space in the virtual memory area and cannot allocate any more. This is when the OOM-Killer is activated - and does what its name suggests: kills any processes which meet certain conditions in order to free memory.

If you have an environment where servers are running PostgreSQL in parallel with other memory-intensive processes on the same machine, it's likely that the OOM-Killer will kill certain PostgreSQL processes. Due to the amount of allocated shared memory and the memory usage of each backend, the OOM-Killer will target PostgreSQL by preference since it counts the complete addressed shared memory area of all backends into summary.

The amount of committed memory of your system at a given time can be examined with the /proc-Filesystem:

$ grep Commit /proc/meminfo 
CommitLimit:    376176 kB
Committed_AS:   265476 kB

This example shows the current amount of committed memory at 265476 kB (Committed_AS). Is this equal or even larger than the amount of Committed_AS the OOM-Killer is likely to be woken up.

However, the kernel provides some interfaces to adjust the behaviour of the OOM-Killer and Overcommit with regard to PostgreSQL installations.

Turn off Overcommit

A radical method is to turn overcommit off entirely, although this is only recommended on systems dedicated to PostgreSQL. The overcommit feature can be configured within three categories with the following kernel parameter:

vm.overcommit_memory = 0

This can hold three different kinds of categories:

  • 0: Allow a careful strategy of overcommitting memory: small and reasonable amounts of overcommitting allocations are allowed, but heavy and wild allocations will be denied. In this mode, root can allocate more space than unprivileged users. This is also the kernel default setting.
  • 1: Allow overcommit without any constraints
  • 2: Turn off overcommit. The effective allocatable memory space cannot be larger than swap + a configurable percentage of physical RAM.

The fraction of physical RAM used by category 2 is defined by the parameter:

vm.overcommit_ratio = 50

While vm.overcommit_memory=1 is useful when tuning certain applications, the categories 0 or 2 are the best ones to use most of the time. If you turn off overcommit with vm.overcommit_memory=2, a process will get an "out of memory"-Exception (depending of vm_overcommit_ratio) if allocating memory when no more free space is available. Depending on the distribution you are using, we recommend that you save those settings in the configuration file /etc/sysctl.conf to ensure that they are activated on server reboot.

$ echo "vm.overcommit_memory=2 >> /etc/sysctl.conf
$ echo "vm.overcommit_ratio=60 >> /etc/sysctl.conf
$ sysctl -p /etc/sysctl.conf

Changes to those parameters are activated immediately. You can recheck this by consulting /proc/meminfo:

$ grep Commit /proc/meminfo 
CommitLimit:    401440 kB
Committed_AS:   266456 kB

The machine has 249848 kB of swap and 252656 kB physical RAM.
According to the formula swap + vm.overcommit_ratio * RAM this results in a CommitLimit of 401440 kB

Configure OOM-Killer per process

Where PostgreSQL is running without dedicated server hardware and in parallel with memory-intensive middleware (e.g. JBoss- or Tomcat-Installations), most admins would prefer to be able to control the OOM-Killer on a per-process basis and allow overcommitting of memory allocations. Since kernel 2.6.1, Linux has been providing an interface for tuning the OOM-Score of a process, which will in turn increase or decrease the affinity of the process to be killed when running in an OOM-Situation. This interface allows a very flexible configuration of processes in such environments regarding their memory requirements. The interface is exposed by the /proc-Filesystem, for example here on a PostgreSQL-Installation on Debian:

$ cat /proc/$(cat /var/run/postgresql/8.4-main.pid)/oom_adj
0

Values allowed range from -17 to +15, a negative value decreases, while a positive value increases the likelihood of being killed by the OOM-Killer. -17 is a special value and turns killing the process in an OOM-Situation off.
The settings are inherited from parent to child processes; in PostgreSQL you'll have to set this one to the PostgreSQL master process:

$ echo -17 >> /proc/$(cat /var/run/postgresql/8.4-main.pid)/oom_adj
$ psql -q postgres
test=# SELECT pg_backend_pid();
 pg_backend_pid 
----------------
           3429
(1 line)

test=# 
[1]+  Stopped                 psql -q test
$ cat /proc/3429/oom_adj
-17

The disadvantage of this method is that all child processes will now be excluded from the OOM-Killer, which is not generally what DBAs prefer. For example, where you want to protect the PostgreSQL system processes (like background writer oder autovacuum) from being killed by the OOM-Killer, but still kill ordinary database connections when running out of memory.

To set the OOM-Score you need to have a privileged user, so the best way to implement this setting is to put it into your PostgreSQL start script.

Enhancements in PostgreSQL 9.0

PostgreSQL 9.0 will have additional support for the pictured /proc-Interface. On one hand PostgreSQL 9.0 will come with a new Linux start script, which supports setting the oom_adj value before starting up PostgreSQL; on the other hand it is possible to build PostgreSQL with the special C-Macro LINUX_OOM_ADJ defined, which will allow DBAs to limit the inheritance of the OOM-Score to backend childs as shown in this example:

$ ./configure CC="ccache gcc" CFLAGS="-DLINUX_OOM_ADJ=0"

This method will save the PostgreSQL system process but will allow the OOM-Killer to kill database backend processes running amok.

Alternatives

An alternative solution is available by an additional kernel patch. This extends the existing /proc-Filesystem with a list of process names which should be excluded from the OOM-Killer. However, this patch is an unoffical extension to the Linux kernel and you may have to maintain your own builds of Linux kernels. In addition, it is not nearly as flexible as adjusting the OOM-Score and process names are not useful for uniquely identifying processes (e.g. Java- or Perlbased processes).

Summary

The Linuxkernel provides a comprehensive interface to adjust processes regarding their memory usage and the OOM-Killer. The most flexible method is the introduced /proc-Filesystem with the oom_adj-Interface. PostgreSQL 9.0 will have additional support for this interface. Dedicated PostgreSQL-Systems can be configured to avoid overcommit at all, but will need a deeper understanding of the number of memory resources the database system demands and the requirements of the VM of the kernel.

PostgreSQL 9.0alpha4 released

| No Comments

postgreslogo.pngThe PostgreSQL project just released the Alpha 4 of its upcoming PostgreSQL 9.0.

The Alpha4 version of the upcoming PostgreSQL 9.0 release is ready for download. It is planned that Alpha4 will be the last Alpha version before the Beta release cycle for PostgreSQL 9.0. Some highlights of this release are:


  • Reworked LISTEN/NOTIFY infrastructure: the performance has improved massively compared to the old table-based implementation, due to a pure main memory solution. Additionally, the new solution supports so called "Payloads" which makes it possible to transport messages.

  • Streaming Replication: an integrated solution for replication which has noticeable lower latency times than the usual, WAL-shipping-based solutions.

  • Procedural code with plpqsql and plperl can now be executed with the DO statement without the need to call a CREATE FUNCTION first.

You are very much welcome to download and test the Alpha version and play with it. The developers are interested in Bugs and test results; you can find the work flow to publish these outlined in their Wiki.