PostgreSQL topic of the Day - advanced analytics

| No Comments

postgreslogo.pngRlogo.jpgWhen you pass large amounts of data to and from PL/R, quite a lot of time is needed for converting. It's better to directly store the data as R objects.

I had been planning to continue with timeseries aggregation, but decided to take a side-road based on a recent question on the PL/R mailing list.

The question was related to seismic data, which is in fact timeseries data. However, I guess the data is normally stored as an array of floats that are all recorded during some seismic event at a constant sampling rate. The arrays are available from online sources in an individual file for each event being analyzed. The problem was that when dealing with, say, 14000 arrays of floats, each having on the order of 16000 elements, passing the data to and from PL/R proved slower than hoped.

So we start with loading of sample data for a performance test:

DROP TABLE IF EXISTS test_ts;
CREATE TABLE test_ts
(
  dataid bigint NOT NULL,
  data double precision[],
  CONSTRAINT pk_data PRIMARY KEY (dataid)
);

CREATE OR REPLACE FUNCTION filt_r_nothing(ts double precision[])
RETURNS double precision[] AS $$
 return(ts);
$$ LANGUAGE 'plr' IMMUTABLE;

CREATE OR REPLACE FUNCTION load_test(int) RETURNS text AS $$
  DECLARE
   i    int;
  BEGIN
    FOR i IN 1..$1 LOOP
      INSERT INTO test_ts(dataid,data) VALUES (i,'{-0.0205086770285039, ...'})
    END LOOP;
    RETURN 'OK';
  END;
$$ LANGUAGE plpgsql;

SELECT load_test(14000);
 load_test 
-----------
 OK
(1 row)

Time: 123861.362 ms

The array in the VALUES clause of that function actually contains 16879 float8 elements. You can see that it takes over two minutes on my development machine to load the table with 14000 rows of this array. Note that on my development machine I have done no tuning of PostgreSQL configs, and I built with --enable-debug, --enable-cassert, and CFLAGS='-O0 -g3'.

Next, we update the data column with filt_r_nothing() which does nothing other than returning the same array it was passed.

UPDATE test_ts SET data = filt_r_nothing(data);
UPDATE 14000
Time: 1224087.064 ms

Not pretty. Over 20 minutes. I did some profiling of PL/R and concluded most of the time was being spent converting 16879 PostgreSQL array elements from float8 datums to R vector elements one at a time while processing the function argument, and then repeating the process in reverse while creating the returned result. Perhaps there are optimizations that can be made to that process, but since PostgreSQL and R each have their own binary representation of this data, there is no avoiding the conversion overhead.

However, what is the point of the proposed performance test? The comparison was being made to another procedural language, which apparently does not convert the array elements if they are not used. A real function is presumably going to do some calculation over the array elements, requiring that they be individually accessed.

I decided to see how PL/pgSQL performs if forced to modify and return the passed array. The difference between this test and the PL/R one will give some insight on the time spent converting elements from PostgreSQL to R native form.

CREATE OR REPLACE FUNCTION
filt_plpgsql_nothing(ts double precision[])
RETURNS double precision[] AS $$
 BEGIN
  RETURN ts || 3.14159::float8;
 END
$$ LANGUAGE 'plpgsql' IMMUTABLE;

UPDATE test_ts SET data = filt_plpgsql_nothing(data);
UPDATE 14000
Time: 239054.580 ms

About 6 minutes. Much better. But let's see what happens if we do some more meaningful, if simple, calculations on the array elements.

CREATE OR REPLACE FUNCTION
filt_plpgsql_avg(ts double precision[])
RETURNS double precision AS $$
 DECLARE
  i int;
  numts int = array_upper(ts,1);
  ts_sum float8 = 0.0;
 BEGIN
  FOR i IN 1..numts LOOP
    ts_sum := ts_sum + ts[i];
  END LOOP;
  RETURN (ts_sum/numts::float8);
 END
$$ LANGUAGE 'plpgsql' IMMUTABLE;

select filt_plpgsql_avg(data) from  test_ts;
--killed after > 1 hour

CREATE OR REPLACE FUNCTION filt_r_avg(ts double precision[])
RETURNS double precision AS $$
 return(mean(ts));
$$ LANGUAGE 'plr' IMMUTABLE;

contrib_regression=# select filt_r_avg(data) from test_ts;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
[...]
(14000 rows)
Time: 441573.619 ms

Although the PL/R function still took over 7 minutes to process 14000 rows with 16879 elements, PL/pgSQL took long enough that I killed it out of impatience.

It occurred to me that a feature I added to PL/R within the past year or so might come in handy about now. Namely, it is possible to directly store R objects in PostgreSQL tables. This means that when the datum is passed to a PL/R function, it is all ready to go -- no conversion needed. Let's take a look at that scenario.

DROP TABLE IF EXISTS test_ts_obj;
CREATE TABLE test_ts_obj
(
  dataid serial PRIMARY KEY,
  data bytea
);

CREATE OR REPLACE FUNCTION make_r_object(fname text)
RETURNS bytea AS $$
 myvar<-scan(fname,sep=",")
 return(myvar);
$$ LANGUAGE 'plr' IMMUTABLE;

INSERT INTO test_ts_obj (data) SELECT make_r_object('array-data.csv') from generate_series(1,14000);
INSERT 0 14000
Time: 44182.598 ms

CREATE OR REPLACE FUNCTION filt_r_avg(ts bytea)
RETURNS double precision AS $$
 return(mean(ts));
$$ LANGUAGE 'plr' IMMUTABLE;

select filt_r_avg(data) from  test_ts_obj;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
 [...]
 0.656530643017027
(14000 rows)

Time: 12828.331 ms

This results in 44 seconds to load the same 14000 rows of array data as before, but
directly as R objects. Compare that to the 2 minutes to load as PostgreSQL arrays as seen at the beginning of this article. And now it only takes 13 seconds to operate on the 14000 R objects compared to 442 seconds. That's a nice improvement!

But PL/R gives you access to the full power of the R environment for statistical computing and graphics. Just for fun, here is a PL/R function that calculates the "Power Spectrum" of the seismic data, and returns the result as a JPEG of the plot.

CREATE OR REPLACE FUNCTION
filt_r_ps(ts bytea)
RETURNS bytea AS $$
  library(quantmod)
  library(cairoDevice)
  library(RGtk2)

  fourier<-fft(ts)
  magnitude<-Mod(fourier)
  y2 <- magnitude[1:(length(magnitude)/10)]
  x2 <- 1:length(y2)/length(magnitude)
  mydf <- data.frame(x2,y2)

  pixmap <- gdkPixmapNew(w=500, h=500, depth=24)
  asCairoDevice(pixmap)

  plot(mydf,type="l")
  plot_pixbuf <- gdkPixbufGetFromDrawable(NULL, pixmap,
                                                        pixmap$getColormap(),
                                                        0, 0, 0, 0, 500, 500)
  buffer <- gdkPixbufSaveToBufferv(plot_pixbuf,
                                                       "jpeg",
                                                        character(0),
                                                        character(0))$buffer
  return(buffer)
$$ LANGUAGE 'plr' IMMUTABLE;

This is now not about performance so much as it is about analytical power. About half of the lines in this function are setting up to capture the output graph. The "meat" of the function can be contained in these few lines:

fourier<-fft(ts)
magnitude<-Mod(fourier)
plot(x=1:length(y2)/length(magnitude),
       y=magnitude[1:(length(magnitude)/10)],
       type="l")

Compliment that PL/R function with a bit of PHP code...

<?php
function hex2bin($data)
{
	$data = ltrim($data, "\x");
	$len = strlen($data);
	return pack("H" . $len, $data);
} 

$dbconn = pg_connect("dbname=contrib_regression");
$rs = pg_query( $dbconn, "select plr_get_raw(filt_r_ps(data))
                                    from test_ts_obj where dataid = 42");
$hexpic = pg_fetch_array($rs);
$cleandata = hex2bin($hexpic[0]);

header("Content-Type: image/jpeg");
header("Last-Modified: " .
date("r", filectime($_SERVER['SCRIPT_FILENAME'])));
header("Content-Length: " . strlen($cleandata));
echo $cleandata;
?>

...and the output looks like:plr-blog.jpg

Fairly sophisticated output for relatively little effort! For more information or assistance with respect to PostgreSQL, PL/R, and/or advanced analytics, don't hesitate to contact us.

Leave a comment