CIOZone Interview

| No Comments | No TrackBacks

mme-ciozone.pngCIOZone, a social network for CIOs, recently interviewed our Founder and CEO Dr. Michael Meskes.

CIOZone is a central place where CIOs can network. In this video, Roger Green takes the time to drop in to our office in Moenchengladbach, Germany to interview Michael Meskes, the founder of credativ about the history of the company, how Open Source has developed and how the business is different today from what it was 10 years ago.

This discussion is followed by analysis of current development and future challenges; the difference between Open Source vendors and proprietary global players; virtualisation and cloud computing in relation to Open Source and what to keep in mind when migrating to Open Source software.

Read on or watch the video at ciozone.com.

If you're looking for support, services and training for Open Source software, you've come to the right place at credativ!

postgreslogo.pngRlogo.jpgWhen you pass large amounts of data to and from PL/R, quite a lot of time is needed for converting. A change is being tested which treats arrays of 4 byte integers and 8 byte floating point values as a special case, resulting in a dramatic performance improvement.

In a recent post, I discussed PL/R performance related to seismic timeseries data stored as an array of floats that are all recorded during some seismic event at a constant sampling rate. The problem was that when dealing with, say, 14000 arrays of floats, each having on the order of 16000 elements, passing the data to and from PL/R proved slower than hoped.

My ultimate solution was to show how a significant performance improvement could be achieved by importing the arrays into Postgres tables directly as raw R objects, and then operating on those objects later using PL/R. The problem with this approach is that in some, if not most, cases, you may want to access that same data from other procedural languages or hand off the arrays to some client other than R. In this case the raw R object does not meet your needs.

So I thought about it a bit and researched the source code on the Postgres and R sides of PL/R, and concluded that for certain special cases it was possible to dramatically improve speed by skipping the one-at-a-time element conversion as arrays are processed going between PostgreSQL and R. Specifically, the in-memory storage of the array data is binary compatible in the following circumstances:

  1. pgsql -> R
    • Argument is integer or double precision array
    • Element data type is pass-by-value for given Postgres version and architecture
    • No NULL elements
    • Array is one dimensional
  2. R -> pgsql
    • Integer vector returned with integer array return type
    • Real vector returned with double precision array return type
    • No NA elements
    • One dimensional vector

Pass-by-value is most likely true for double precision (float8) if PostgreSQL is at least version 8.4 and was built with a 64 bit system architecture. If these conditions are met, PL/R now simply copies en masse the in-memory array data from the PostgreSQL array data structure to the R vector data structure. This avoids all the overhead associated with iterating over the array element by element. Although I am not a fan of special case code such as this, the use case is important (if you are crunching numbers, they are likely stored as double precision elements), and the performance benefit is huge. Here is the timing difference with the patched PL/R versus the unpatched PL/R:

DROP TABLE IF EXISTS test_ts;
CREATE TABLE test_ts
(
  dataid bigint NOT NULL,
  data double precision[],
  CONSTRAINT pk_data PRIMARY KEY (dataid)
);

CREATE OR REPLACE FUNCTION load_test(int) RETURNS text AS $$
 DECLARE
  i    int;
 BEGIN
  FOR i IN 1..$1 LOOP
   --16789 double precision elements in the data array
   INSERT INTO test_ts (dataid, data) VALUES (i, '{-0.0205086770285039,...}');
  END LOOP;
  RETURN 'OK';
 END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION
filt_r_nothing(ts double precision[])
RETURNS double precision[] AS $$
  return(ts);
$$ LANGUAGE 'plr' IMMUTABLE;

CREATE OR REPLACE FUNCTION
filt_r_avg(ts double precision[])
RETURNS double precision AS $$
  return(mean(ts));
$$ LANGUAGE 'plr' IMMUTABLE;

-- INSERT 14000 rows of 16789 element arrays
SELECT load_test(14000);

-- unpatched code
UPDATE test_ts SET data = filt_r_nothing(data);
UPDATE 14000
Time: 1224087.064 ms

-- patched code
UPDATE test_ts SET data = filt_r_nothing(data);
UPDATE 14000
Time: 225591.429 ms

-- unpatched code
contrib_regression=# select filt_r_avg(data) from test_ts;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
[...]
(14000 rows)
Time: 441573.619 ms

-- patched code
contrib_regression=# select filt_r_avg(data) from test_ts;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
[...]
(14000 rows)
Time: 6541.039 ms

-- unpatched code
select array_upper(filt_r_nothing(data),1) from test_ts;
 array_upper 
-------------
       16879
       16879
[...]
(14000 rows)
Time: 1108651.349 ms

-- patched code
select array_upper(filt_r_nothing(data),1) from test_ts;
 array_upper 
-------------
       16879
       16879
[...]
(14000 rows)
Time: 23101.602 ms


So to summarize:

TestCaseTime (ms)Improvement
UPDATE NOOPUnpatched1224087.064--
UPDATE NOOPPatched225591.42982%
SELECT AVGUnpatched441573.619--
SELECT AVGPatched6541.03998%
SELECT NOOPUnpatched1108651.349--
SELECT NOOPPatched23101.60298%

Pretty substantial improvement in these particular, but I think common, use cases. The UPDATE test sees less overall benefit because the time to write out the changes would be significant and the same regardless of array handling in PL/R. The difference between SELECT NOOP and SELECT AVG is driven by the fact that the latter returns a scalar result, while the former returns the entire array. The reason SELECT NOOP does array_upper() on the returned array, is that otherwise all that array data (something like 4 GB) gets materialized in memory by psql, which of course greatly slows things further and is not what we are trying to test.

Please give the changes a try and provide feedback before I release another PL/R version. You can grab the new code from github and sign up for the PL/R mailing list to post your results or report any questions/problems. And of course visit the PL/R homepage and PL/R wiki for more general information about PL/R -- particularly to watch for these changes in the next official release. Finally, don't hesitate to contact me directly if the other choices don't suit you for some reason.

debianlogo.pngThe vast majority of Debian installations are simplified with the use of Preseeding and Netboot. Friedrich Weber, a school student on a work experience placement with us at our German office has observed the process and captured it in a Howto here:

Imagine the following situation: you find yourself with ten to twenty brand new Notebooks and the opportunity to install them with Debian and customise to your own taste. In any case it would be great fun to manually perform the Debian installation and configuration on each Notebook. This is where Debian Preseed comes into play.

The concept is simple and self-explanatory; usually, whoever is doing the installation will be faced with a number of issues during the process (e.g. language, partitioning, packages, Bootloader, etc.) In terms of Preseed, all of these issues can now be resolved. Only those which are not already accounted for in Preseed remain for the Debian installer. In the ideal situation these would become apparent at the outset of the installation, where the solution would differ depending on the target system and which the administrator must deal with manually - only when these have been dealt with can the installation be left to run unattended.

Preseed functions on some simple inbuilt configuration data: preseed.cfg. It includes, as detailed above, the questions which must be answered during installation, and in debconf-format. Data such as this consists of several rows, each row of which defines a debconf configuration option - a response to a question - for example:

    d-i debian-installer/local	string de_DE.UTF-8

The first element of these lines is the name of the package, which is configured (d-i is here an abbreviation of debian installer), the second element is the name of the option, which is set, as the third element of the type of option (a string) and the rest is the value of the option. In this example, we set the language to German using UTF-8-coding.

You can put lines like this together yourself, even simpler with the tool debconf-get-selections: these commands provide straight forward and simple options, which can be set locally. From the selection you can choose your desired settings, adjusted if necessary and copied into preseed.cfg.

Here is an example of preseed.cfg:

    d-i debian-installer/locale string de_DE.UTF-8
    d-i debian-installer/keymap select de-latin1
    d-i console-keymaps-at/keymap select de
    d-i languagechooser/language-name-fb select German
    d-i countrychooser/country-name select Germany
    d-i console-setup/layoutcode string de_DE

    d-i clock-setup/utc boolean true
    d-i time/zone string Europe/Berlin
    d-i clock-setup/ntp boolean true
    d-i clock-setup/ntp-server string ntp1

    tasksel tasksel/first multiselect standard, desktop, gnome-desktop, laptop
    d-i pkgsel/include string openssh-client vim less rsync

In addition to language and timezone settings, selected tasks and packages are also set with these options. If left competely unattended, the installation will not complete, but will make a good start.

Now onto the question of where Preseed pulls its data from. It is in fact possible to use Preseed with CD and DVD images or USB sticks, but generally more comfortable to use a Debian Netboot Image, essentially an installer, which is started across the network and which can cover its Preseed configuration. This boot across the network is implemented with PXE and requires a system that can boot from a network card.

Next, the system depends on booting from the network card. It travels from a DHCO server to an IP address per broadcast. This DHCP server transmits not only a suitable IP, but also to the IP of a so-called Bootserver. A Bootserver is a TFTP-Server, which provides a Bootloader to assist the Administrator with the desired Debian Installer. At the same time the Debian Installer can be shared with the Boot options that Preseed should use and where he can find the Preseed configuration. Here is a snippet of the PXELINUX configuration data pxelinux.cfg/default:

    label i386
        kernel debian-installer/i386/linux
        append vga=normal initrd=debian-installer/i386/initrd.gz netcfg/choose_interface=eth0 domain=example.com locale=de_DE debian-installer/country=DE debian-installer/language=de debian-installer/keymap=de-latin1-nodeadkeys console-keymaps-at/keymap=de-latin1-nodeadkeys auto-install/enable=false preseed/url=http://$server/preseed.cfg DEBCONF_DEBUG=5 -- quiet 

When the user types i386, the tt>debian-installer/i386/linux kernel (found on the TFTP server) is downloaded and run. This is in addition to a whole load of bootoptions given along the way. The debian installer allows the provision of debconf options as boot parameters. It is good practice for the installer to somehow communicate where to find the Preseed communication on the network (preseed/url). In order to download this Preseed configuration, it must also be somehow built into the network.

The options for that will be handed over (the options for the hostnames would be deliberately omitted here, as every target system has its own Hostname). auto-install/enable would delay the language set up so that it is only enabled after the network configuration, in order that these installations are read through preseed.cfg. It is not necessary as the language set up will also be handed over to the kernel options to ensure that the network configuration is German.

The examples and configuration excerpts mentioned here are obviously summarised and shortened. Even so, this blog post should have given you a glimpse into the concept of Preseed in connection with netboot. Finally, here is a complete version of preseed.cfg:

    d-i debian-installer/locale string de_DE.UTF-8
    d-i debian-installer/keymap select de-latin1
    d-i console-keymaps-at/keymap select de
    d-i languagechooser/language-name-fb select German
    d-i countrychooser/country-name select Germany
    d-i console-setup/layoutcode string de_DE

    # Network
    d-i netcfg/choose_interface select auto
    d-i netcfg/get_hostname string debian
    d-i netcfg/get_domain string example.com

    # Package mirror
    d-i mirror/protocol string http
    d-i mirror/country string manual
    d-i mirror/http/hostname string debian.example.com
    d-i mirror/http/directory string /debian
    d-i mirror/http/proxy string
    d-i mirror/suite string lenny

    # Timezone
    d-i clock-setup/utc boolean true
    d-i time/zone string Europe/Berlin
    d-i clock-setup/ntp boolean true
    d-i clock-setup/ntp-server string ntp.example.com

    # Root-Account
    d-i passwd/make-user boolean false
    d-i passwd/root-password password secretpassword
    d-i passwd/root-password-again password secretpassword

    # Further APT-Options
    d-i apt-setup/non-free boolean false
    d-i apt-setup/contrib boolean false
    d-i apt-setup/security-updates boolean true

    d-i apt-setup/local0/source boolean false
    d-i apt-setup/local1/source boolean false
    d-i apt-setup/local2/source boolean false

    # Tasks
    tasksel tasksel/first multiselect standard, desktop
    d-i pkgsel/include string openssh-client vim less rsync
    d-i pkgsel/upgrade select safe-upgrade

    # Popularity-Contest
    popularity-contest popularity-contest/participate boolean true

    # Command to be followed after the installation. `in-target` means that 
         the following
    # Command is followed in the installed environment, rather than in 
        the installation environment.
    # Here http://$server/skript.sh nach /tmp is downloaded, enabled and 
        implemented.
    d-i preseed/late_command string in-target wget -P /tmp/ http://$server/skript.sh; 
  in-target chmod +x /tmp/skript.sh; in-target /tmp/skript.sh

All Howtos of this blog are grouped together in the Howto category - and if you happen to be looking for Support and Services for Debian you've come to the right place at credativ.

290px-OpenGL_logo.svg.png
nVidia has made it known that the free nv Driver is no longer going to be supported. Starting with the newly publicised Fermi Graphics Cards, future cards from nVidia will no longer work with nv Drivers.

Originally, nv Drivers from nVidia were developed to provide basic 2D performance. 3D support was neither wanted nor planned in the open driver, users with 3D requirements have always been referred to proprietary drivers from nVidia.

In future, the nv driver will no longer be maintained and all users are referred to the proprietary drivers. The basic Linux Vesa Driver should be sufficient for a first installation under Linux, according to a nVidia statement, after which the users can continue by installing the proprietary driver. This works as the nv Driver hardly provides any more functions than the Vesa Driver.

This announcement does not mention the now quite advanced nouveau driver at all, which just a few weeks ago was taken into the kernel and will be standard in future distributions when using nVidia cards. This driver offers 2D as well as 3D functions so it was predictable that in the long term nv drivers would become obsolete.

From this point of view, the advertisement is hardly news, except maybe for the fact that the nouveau driver isn't mentioned! It would have been much more exciting if Nvidia had announced support or a cooperation with the nouveau developers... and it would have been nice to see a graphics driver follow-up of the forcedeth-story.

All our blog posts dealing with X11 are grouped together as Category X11 in their own feed... and if you are looking for Support and Services for Linux, you've come to the right place at credativ.

postgreslogo.pngRlogo.jpgWhen you pass large amounts of data to and from PL/R, quite a lot of time is needed for converting. It's better to directly store the data as R objects.

I had been planning to continue with timeseries aggregation, but decided to take a side-road based on a recent question on the PL/R mailing list.

The question was related to seismic data, which is in fact timeseries data. However, I guess the data is normally stored as an array of floats that are all recorded during some seismic event at a constant sampling rate. The arrays are available from online sources in an individual file for each event being analyzed. The problem was that when dealing with, say, 14000 arrays of floats, each having on the order of 16000 elements, passing the data to and from PL/R proved slower than hoped.

So we start with loading of sample data for a performance test:

DROP TABLE IF EXISTS test_ts;
CREATE TABLE test_ts
(
  dataid bigint NOT NULL,
  data double precision[],
  CONSTRAINT pk_data PRIMARY KEY (dataid)
);

CREATE OR REPLACE FUNCTION filt_r_nothing(ts double precision[])
RETURNS double precision[] AS $$
 return(ts);
$$ LANGUAGE 'plr' IMMUTABLE;

CREATE OR REPLACE FUNCTION load_test(int) RETURNS text AS $$
  DECLARE
   i    int;
  BEGIN
    FOR i IN 1..$1 LOOP
      INSERT INTO test_ts(dataid,data) VALUES (i,'{-0.0205086770285039, ...'})
    END LOOP;
    RETURN 'OK';
  END;
$$ LANGUAGE plpgsql;

SELECT load_test(14000);
 load_test 
-----------
 OK
(1 row)

Time: 123861.362 ms

The array in the VALUES clause of that function actually contains 16879 float8 elements. You can see that it takes over two minutes on my development machine to load the table with 14000 rows of this array. Note that on my development machine I have done no tuning of PostgreSQL configs, and I built with --enable-debug, --enable-cassert, and CFLAGS='-O0 -g3'.

Next, we update the data column with filt_r_nothing() which does nothing other than returning the same array it was passed.

UPDATE test_ts SET data = filt_r_nothing(data);
UPDATE 14000
Time: 1224087.064 ms

Not pretty. Over 20 minutes. I did some profiling of PL/R and concluded most of the time was being spent converting 16879 PostgreSQL array elements from float8 datums to R vector elements one at a time while processing the function argument, and then repeating the process in reverse while creating the returned result. Perhaps there are optimizations that can be made to that process, but since PostgreSQL and R each have their own binary representation of this data, there is no avoiding the conversion overhead.

However, what is the point of the proposed performance test? The comparison was being made to another procedural language, which apparently does not convert the array elements if they are not used. A real function is presumably going to do some calculation over the array elements, requiring that they be individually accessed.

I decided to see how PL/pgSQL performs if forced to modify and return the passed array. The difference between this test and the PL/R one will give some insight on the time spent converting elements from PostgreSQL to R native form.

CREATE OR REPLACE FUNCTION
filt_plpgsql_nothing(ts double precision[])
RETURNS double precision[] AS $$
 BEGIN
  RETURN ts || 3.14159::float8;
 END
$$ LANGUAGE 'plpgsql' IMMUTABLE;

UPDATE test_ts SET data = filt_plpgsql_nothing(data);
UPDATE 14000
Time: 239054.580 ms

About 6 minutes. Much better. But let's see what happens if we do some more meaningful, if simple, calculations on the array elements.

CREATE OR REPLACE FUNCTION
filt_plpgsql_avg(ts double precision[])
RETURNS double precision AS $$
 DECLARE
  i int;
  numts int = array_upper(ts,1);
  ts_sum float8 = 0.0;
 BEGIN
  FOR i IN 1..numts LOOP
    ts_sum := ts_sum + ts[i];
  END LOOP;
  RETURN (ts_sum/numts::float8);
 END
$$ LANGUAGE 'plpgsql' IMMUTABLE;

select filt_plpgsql_avg(data) from  test_ts;
--killed after > 1 hour

CREATE OR REPLACE FUNCTION filt_r_avg(ts double precision[])
RETURNS double precision AS $$
 return(mean(ts));
$$ LANGUAGE 'plr' IMMUTABLE;

contrib_regression=# select filt_r_avg(data) from test_ts;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
[...]
(14000 rows)
Time: 441573.619 ms

Although the PL/R function still took over 7 minutes to process 14000 rows with 16879 elements, PL/pgSQL took long enough that I killed it out of impatience.

It occurred to me that a feature I added to PL/R within the past year or so might come in handy about now. Namely, it is possible to directly store R objects in PostgreSQL tables. This means that when the datum is passed to a PL/R function, it is all ready to go -- no conversion needed. Let's take a look at that scenario.

DROP TABLE IF EXISTS test_ts_obj;
CREATE TABLE test_ts_obj
(
  dataid serial PRIMARY KEY,
  data bytea
);

CREATE OR REPLACE FUNCTION make_r_object(fname text)
RETURNS bytea AS $$
 myvar<-scan(fname,sep=",")
 return(myvar);
$$ LANGUAGE 'plr' IMMUTABLE;

INSERT INTO test_ts_obj (data) SELECT make_r_object('array-data.csv') from generate_series(1,14000);
INSERT 0 14000
Time: 44182.598 ms

CREATE OR REPLACE FUNCTION filt_r_avg(ts bytea)
RETURNS double precision AS $$
 return(mean(ts));
$$ LANGUAGE 'plr' IMMUTABLE;

select filt_r_avg(data) from  test_ts_obj;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
 [...]
 0.656530643017027
(14000 rows)

Time: 12828.331 ms

This results in 44 seconds to load the same 14000 rows of array data as before, but
directly as R objects. Compare that to the 2 minutes to load as PostgreSQL arrays as seen at the beginning of this article. And now it only takes 13 seconds to operate on the 14000 R objects compared to 442 seconds. That's a nice improvement!

But PL/R gives you access to the full power of the R environment for statistical computing and graphics. Just for fun, here is a PL/R function that calculates the "Power Spectrum" of the seismic data, and returns the result as a JPEG of the plot.

CREATE OR REPLACE FUNCTION
filt_r_ps(ts bytea)
RETURNS bytea AS $$
  library(quantmod)
  library(cairoDevice)
  library(RGtk2)

  fourier<-fft(ts)
  magnitude<-Mod(fourier)
  y2 <- magnitude[1:(length(magnitude)/10)]
  x2 <- 1:length(y2)/length(magnitude)
  mydf <- data.frame(x2,y2)

  pixmap <- gdkPixmapNew(w=500, h=500, depth=24)
  asCairoDevice(pixmap)

  plot(mydf,type="l")
  plot_pixbuf <- gdkPixbufGetFromDrawable(NULL, pixmap,
                                                        pixmap$getColormap(),
                                                        0, 0, 0, 0, 500, 500)
  buffer <- gdkPixbufSaveToBufferv(plot_pixbuf,
                                                       "jpeg",
                                                        character(0),
                                                        character(0))$buffer
  return(buffer)
$$ LANGUAGE 'plr' IMMUTABLE;

This is now not about performance so much as it is about analytical power. About half of the lines in this function are setting up to capture the output graph. The "meat" of the function can be contained in these few lines:

fourier<-fft(ts)
magnitude<-Mod(fourier)
plot(x=1:length(y2)/length(magnitude),
       y=magnitude[1:(length(magnitude)/10)],
       type="l")

Compliment that PL/R function with a bit of PHP code...

<?php
function hex2bin($data)
{
	$data = ltrim($data, "\x");
	$len = strlen($data);
	return pack("H" . $len, $data);
} 

$dbconn = pg_connect("dbname=contrib_regression");
$rs = pg_query( $dbconn, "select plr_get_raw(filt_r_ps(data))
                                    from test_ts_obj where dataid = 42");
$hexpic = pg_fetch_array($rs);
$cleandata = hex2bin($hexpic[0]);

header("Content-Type: image/jpeg");
header("Last-Modified: " .
date("r", filectime($_SERVER['SCRIPT_FILENAME'])));
header("Content-Length: " . strlen($cleandata));
echo $cleandata;
?>

...and the output looks like:plr-blog.jpg

Fairly sophisticated output for relatively little effort! For more information or assistance with respect to PostgreSQL, PL/R, and/or advanced analytics, don't hesitate to contact us.

postgreslogo.pngFrequently when dealing with parametric data, you need to "roll up" the data in summary fashion as it ages in order to reduce the volume kept on hand, or maybe because the summary statistics are what really interests you. There are several ways to do that, and this post highlights four different approaches.

I was reminded of this kind of "roll ups" today by a question on the pgsql-novice list. This is actually quite a large topic, so I this tip will likely just scratch the surface. The question was related to storing min, max, and avg summaries on an hourly, daily, and weekly basis. The basic idea, for example, is that you can keep raw data for maybe a week, hourly summaries for 6 months, daily summaries for 3 years, and weekly summaries forever. As I mentioned in my reply, I have done this kind of thing over the years using at least 4 approaches:

  1. Aggregate on demand
  2. Batch aggregate on a periodic basis -- e.g. run your aggregate query with a cron job which truncates and rebuilds a table (i.e. a "materialized view")
  3. Write a C based trigger that does "continuous aggregation" to a materialized table
  4. Write a C based bulk loader that aggregates as it bulk loads the raw data into a materialized table

The first approach is simply to run an aggregate query whenever you need the summarized data. Obviously this does not really satisfy the stated desire to discard aged raw data, but I mention it for completeness. In some cases you have sufficient storage given your data volume, and performance of the aggregate is "good enough".

The second is the rough equivalent of a materialized view. In other words, run a batch job via cron or something similar that TRUNCATEs and then repopulates a table used for storage of the aggregate result. Particularly for daily or weekly summary data, when the consumers of the data are 9-5 folk, this approach works pretty well. This also fits in nicely with common partitioning schemes.

The third is one where you want summary statistics to be updated live. In this case you actually want the summary data for the current hour/day/week to all be constantly updated as new raw data comes in. Otherwise you are stuck always looking at last hour's, or yesterday's, or last weeks, data. The way to do this is through a trigger. A while back I implemented a continuous aggregation trigger in C that used prepared queries to update my aggregate table for every INSERT/UPDATE/DELETE occurring on the target table. However even with the trigger written in C and using prepared queries, the performance impact of the trigger firing for every DML event was significant.

Finally, the forth method can be used when your reporting needs are such that the raw data can be collected for some period before storing in your database. Let's say the summary reports are never run against the current hour. What you can do is build up a file in suitable format for bulk loading via COPY. Then process the data as it is bulk loaded to calculate and insert the summary at the same time. Again, I had done that in the past using a C program that read in the stored files, generated the summary data while building a string buffer, and finally using libpq's PQputCopyData() to populate the tables.

More than likely some combination of the above is what you really want. Perhaps use method 2 to maintain your weekly and daily aggregate materialized views, and use method 4 to update your hourly aggregate data.

This post was a lot of discussion and no code -- perhaps tomorrow I will continue with some more concrete examples.

postgreslogo.pngSomeone posted a dilemma to the pgsql-sql list today that involved many if not all of his sequences getting out of sync with their respective "serial" columns. In other words, something like "SELECT max(id) FROM sometable" yields 42, but the sequence nextval for sometable.id is currently set to 36. This is obviously bad (for reasons left as an exercise for the reader). So besides trying to figure out how the database ended up in this state, he needed a script to reset all of his sequences to the correct next value.

I had run into a similar need not too long ago. Namely, when setting up multi-master replication with Bucardo you need your sequences to draw different values on either master so as not to conflict. One solution is to set up all your sequences to jump by 2, and use even numbers on one master and odd numbers on the other. Again, a script makes this easier to deal with, and I had developed one for this situation. So I modified it for the problem mentioned above.

Both versions follow:

-- create "odd" and "even" sequences in multi-master scenario
CREATE OR REPLACE FUNCTION adjust_seqs(namespace text, starteven bool)
  RETURNS text AS $$
DECLARE
  rec         record;
  startval   bigint;
  sql          text;
  fqname  text;
BEGIN
  FOR rec in EXECUTE 'select relname from pg_class where relkind = ''S''
                      and relnamespace = (select oid from pg_namespace
                      where nspname=''' || namespace || ''')' LOOP
    fqname :=  namespace || '.' ||  rec.relname;
    IF starteven THEN
      EXECUTE 'SELECT ((last_value / 2) * 2) + 2 from ' || fqname INTO startval;
    ELSE
      EXECUTE 'SELECT ((last_value / 2) * 2) + 1 from ' || fqname INTO startval;
    END If;
    sql := 'ALTER SEQUENCE ' || fqname || ' INCREMENT BY 2 RESTART WITH ' || startval;
    EXECUTE sql;
    RAISE NOTICE '%', sql;
  END LOOP;
  RETURN 'OK';
END;
$$ LANGUAGE plpgsql STRICT;
SELECT adjust_seqs('public', true);  -- in master1 (even)
SELECT adjust_seqs('public', false); -- in master2 (odd)


-- update sequences that have gotten out-of-sync with the
-- PK field for which they normally provide the default
CREATE OR REPLACE FUNCTION adjust_seqs(namespace text)
  RETURNS text AS $$
DECLARE
  rec           record;
  startval     bigint;
  sql            text;
  seqname  text;
BEGIN
  FOR rec in EXECUTE 'select table_name, column_name, column_default
                      from information_schema.columns
                      where table_schema = ''' || namespace || '''
                      and column_default like ''nextval%''' LOOP

    seqname := pg_get_serial_sequence(rec.table_name, rec.column_name);
    sql := 'select max(' || rec.column_name || ') + 1 from ' || rec.table_name;
    EXECUTE sql INTO startval;
    IF startval IS NOT NULL THEN
      sql := 'ALTER SEQUENCE ' || seqname || ' RESTART WITH ' || startval;
      EXECUTE sql;
      RAISE NOTICE '%', sql;
    END IF;
  END LOOP;
  RETURN 'OK';
END;
$$ LANGUAGE plpgsql STRICT;
select adjust_seqs('public');

Neither of these is heavily tested, and both make certain assumptions, so please test and modify to suit your own needs. Caveat emptor!

I was given a Postgres database dump to analyze today created by "pg_dump -Fc". The source database included PostGIS 1.3.x extensions. I'm not sure if this is standard with PostGIS, but the related database objects were all dumped with a hard-coded library path, specifically /usr/lib/postgresql/8.3/lib. On my machine, I have many PostgreSQL clusters (essentially at least one for every supported branch dating back to 7.3.x), but they are not located under /usr/lib/postgresql.

As such, I needed a quick fix. To wit:

pg_restore database.with.postgis.tgz > db.w.postgis.dmp
sed 's|/usr/lib/postgresql/8.3/lib|$libdir|g' < db.w.postgis.dmp > db.w.postgis.dmp.new

The first line extracts the dump file from the compressed "custom" format into a human readable text SQL file. The second line replaces the hard-coded library path with the special PostgreSQL $libdir variable. This will always point to the correct location for any given PostgreSQL cluster. You can always discover where this is by running:

pg_config --libdir

bash.pngThe text editor vim offers several tools for automation. This howto describes a way to auto-include text modules when creating new files.

Often during programming or administration you need the same text modules again and again. The editor vim is very helpful here, as it can detect a file type while it is being created and insert pre-defined text modules accordingly. This behaviour can be configured in the file .vim/plugin/autoinsert.vim, for example with:

if has("autocmd")
augroup autoinsert
  au!
  autocmd BufNewFile *.c call s:Template("c")
  autocmd BufNewFile Makefile call s:Template("make")
  autocmd BufNewFile makefile call s:Template("make-simple")
augroup END
endif

function s:Template(argument)
        if (a:argument == "help")
                echo "Currently available templates:"
                echo " c                - Plain C Template"
                echo " make             - Makefile Template"
                echo " make-simple      - Simple Variant of the Makefile Template"
        else
                " First delete all in the current buffer
                %d

                " The Makefile variants
                if (a:argument == "make")
                        0r ~/.vim/skeletons/template.make
                        set ft=make
                elseif (a:argument == "make-simple")
                        0r ~/.vim/skeletons/template.make_simple
                        set ft=make
                elseif (a:argument == "make-simple-cpp")
                        0r ~/.vim/skeletons/template.make_simple_cpp
                        set ft=make

                " Stuff for plain C
                elseif (a:argument == "c")
                        0r ~/.vim/skeletons/template.c
                        set ft=c
                endif

                silent %!~/.vim/do_header %
        endif
endfunction

command! -nargs=1 Template call s:Template(<f-args>)

The lines 21-35 clearly show the template names and include the text modules. The template for make_simple, ~/.vim/skeletons/template.make_simple, for example includes the compiler flags for building C/C++ programs with gcc:

CC := gcc
CFLAGS := -Wall -pedantic -O3
LDFLAGS :=

PROG := main
OBJS := main.o

all: $(PROG)

$(PROG): $(OBJS)
        $(CC) $(LDFLAGS) -o $@ $^

clean:
        rm -rf $(PROG) $(OBJS)

.PHONY: all clean

Another template is shown below: ~/.vim/skeletons/template.c includes text modules for not only the obligatory GPL header but also includes a basic code structure:

/*
 * %%FILENAME%% - description
 *
 * Copyright (C) %%YEAR%% %%AUTHOR%%
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2, or (at your option)
 * any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA.
 */

#include <stdio.h>

int
main (int argc, char **argv)
{
  return 0;
}

/**This must remain at the end of the file.**********
 * vim600:set sw=2 ts=8 fdm=marker fmr=«««,»»»:     *
 * vim600:set cindent cinoptions={1s,>2s,^-1s,n-1s: *
 ****************************************************/


Variables like %%FILENAME%% or %%AUTHOR%% can also automatically be replaced by a small shell script running during file creation: the script ~/.vim/do_header labelled with the file name as argument detects the full name with getent or /etc/passwd, respectively. Other variables are gathered using the default GNU tools as the listing shows:

#!/usr/bin/env zsh

if which getent > /dev/null; then
        REALNAME=$(getent passwd $USER|awk -F : '{print $5}' | awk -F , '{print $1}')
else
        REALNAME=$(grep $USER /etc/passwd|awk -F : '{print $5}' | awk -F , '{print $1}')
        if which nidump > /dev/null && [ -z "$REALNAME" ]; then
                REALNAME=$(nidump passwd / | grep $USER|awk -F : '{print $8}')
        fi
fi
DATE=$(date)
YEAR=$(date +%Y)
FILENAME=$(echo $1 | sed 's/[^/]*\///')
FILE=$(echo $FILENAME | sed 's/\..*//')
FILEBIG=$(echo $FILE | tr '[:lower:]' '[:upper:]')
sed     "s/%%AUTHOR%%/$REALNAME/g;
        s/%%DATE%%/$DATE/g;
        s/%%YEAR%%/$YEAR/g;
        s/%%FILENAME%%/$FILENAME/g;
        s/%%FILE%%/$FILE/g;
        s/%%FILEBIG%%/$FILEBIG/g;"


Besides the examples shown here other templates can be generated to deal with HTML files, python or whatnot, among others - the possibilities are endless.

This howto has just touched on one small aspect of the many automations possible with vim. You can find other howtos in this blog in the howtos category, which has its own feed - if you need more in-depth support or services for GNU tools or Linux, you've come to the right place at credativ.

PostgreSQL 9.0 is now in Betaphase

| No Comments | No TrackBacks

postgreslogo.png
The PostgreSQL developers' community recently published the first Beta version of the new 9.0 release. Over 200 new functions and improvements feature in this new version.

With this new release, PostgreSQL now amongst other features claims an inbuilt replication solution as well as the ability to access and read standby nodes, continuously being updated by Log Shipping (Hot Standby). Streaming replication allows the sending of transaction logs directly to one or more standby nodes, which considerably reduces the amount of time lost compared with the more common, file-based log shipping. Combining these two features makes for an extremely efficient solution for high availability or loadbalanced systems.

The all new PostgreSQL version also offers the following innovations:

  • Memory based LISTEN/NOTIFY: this replaces the previous table based implementation and is much faster.
  • Exclusion Constraints: broadens constraints to be able to deal with the complex datatypes of overlapping constraints.

  • Procedural code such as PL/pgSQL, PL/Perl and PL/Python can now be done inline per DO command.
  • This means there is no longer need to define a function with CREATE FUNCTION.

  • Triggers on columns

  • Triggers can now be tied to conditions

  • Named argument lists for procedures

  • Parameters can now be flexibly linked to rolls/databases

As always, anyone interested is invited to share their test results with the developers. Information on the procedure for testing and filing of error messages can be found in the Wiki.

All blog articles which fall into the PostgreSQL category are grouped in their own feed, and if you find you need support and services for PostgreSQL, you've come to the right place at credativ.