PostgreSQL topic of the Day - PL/R performance improvements

| 4 Comments

postgreslogo.pngRlogo.jpgWhen you pass large amounts of data to and from PL/R, quite a lot of time is needed for converting. A change is being tested which treats arrays of 4 byte integers and 8 byte floating point values as a special case, resulting in a dramatic performance improvement.

In a recent post, I discussed PL/R performance related to seismic timeseries data stored as an array of floats that are all recorded during some seismic event at a constant sampling rate. The problem was that when dealing with, say, 14000 arrays of floats, each having on the order of 16000 elements, passing the data to and from PL/R proved slower than hoped.

My ultimate solution was to show how a significant performance improvement could be achieved by importing the arrays into Postgres tables directly as raw R objects, and then operating on those objects later using PL/R. The problem with this approach is that in some, if not most, cases, you may want to access that same data from other procedural languages or hand off the arrays to some client other than R. In this case the raw R object does not meet your needs.

So I thought about it a bit and researched the source code on the Postgres and R sides of PL/R, and concluded that for certain special cases it was possible to dramatically improve speed by skipping the one-at-a-time element conversion as arrays are processed going between PostgreSQL and R. Specifically, the in-memory storage of the array data is binary compatible in the following circumstances:

  1. pgsql -> R
    • Argument is integer or double precision array
    • Element data type is pass-by-value for given Postgres version and architecture
    • No NULL elements
    • Array is one dimensional
  2. R -> pgsql
    • Integer vector returned with integer array return type
    • Real vector returned with double precision array return type
    • No NA elements
    • One dimensional vector

Pass-by-value is most likely true for double precision (float8) if PostgreSQL is at least version 8.4 and was built with a 64 bit system architecture. If these conditions are met, PL/R now simply copies en masse the in-memory array data from the PostgreSQL array data structure to the R vector data structure. This avoids all the overhead associated with iterating over the array element by element. Although I am not a fan of special case code such as this, the use case is important (if you are crunching numbers, they are likely stored as double precision elements), and the performance benefit is huge. Here is the timing difference with the patched PL/R versus the unpatched PL/R:

DROP TABLE IF EXISTS test_ts;
CREATE TABLE test_ts
(
  dataid bigint NOT NULL,
  data double precision[],
  CONSTRAINT pk_data PRIMARY KEY (dataid)
);

CREATE OR REPLACE FUNCTION load_test(int) RETURNS text AS $$
 DECLARE
  i    int;
 BEGIN
  FOR i IN 1..$1 LOOP
   --16789 double precision elements in the data array
   INSERT INTO test_ts (dataid, data) VALUES (i, '{-0.0205086770285039,...}');
  END LOOP;
  RETURN 'OK';
 END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION
filt_r_nothing(ts double precision[])
RETURNS double precision[] AS $$
  return(ts);
$$ LANGUAGE 'plr' IMMUTABLE;

CREATE OR REPLACE FUNCTION
filt_r_avg(ts double precision[])
RETURNS double precision AS $$
  return(mean(ts));
$$ LANGUAGE 'plr' IMMUTABLE;

-- INSERT 14000 rows of 16789 element arrays
SELECT load_test(14000);

-- unpatched code
UPDATE test_ts SET data = filt_r_nothing(data);
UPDATE 14000
Time: 1224087.064 ms

-- patched code
UPDATE test_ts SET data = filt_r_nothing(data);
UPDATE 14000
Time: 225591.429 ms

-- unpatched code
contrib_regression=# select filt_r_avg(data) from test_ts;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
[...]
(14000 rows)
Time: 441573.619 ms

-- patched code
contrib_regression=# select filt_r_avg(data) from test_ts;
    filt_r_avg     
-------------------
 0.656530643017027
 0.656530643017027
[...]
(14000 rows)
Time: 6541.039 ms

-- unpatched code
select array_upper(filt_r_nothing(data),1) from test_ts;
 array_upper 
-------------
       16879
       16879
[...]
(14000 rows)
Time: 1108651.349 ms

-- patched code
select array_upper(filt_r_nothing(data),1) from test_ts;
 array_upper 
-------------
       16879
       16879
[...]
(14000 rows)
Time: 23101.602 ms


So to summarize:

TestCaseTime (ms)Improvement
UPDATE NOOPUnpatched1224087.064--
UPDATE NOOPPatched225591.42982%
SELECT AVGUnpatched441573.619--
SELECT AVGPatched6541.03998%
SELECT NOOPUnpatched1108651.349--
SELECT NOOPPatched23101.60298%

Pretty substantial improvement in these particular, but I think common, use cases. The UPDATE test sees less overall benefit because the time to write out the changes would be significant and the same regardless of array handling in PL/R. The difference between SELECT NOOP and SELECT AVG is driven by the fact that the latter returns a scalar result, while the former returns the entire array. The reason SELECT NOOP does array_upper() on the returned array, is that otherwise all that array data (something like 4 GB) gets materialized in memory by psql, which of course greatly slows things further and is not what we are trying to test.

Please give the changes a try and provide feedback before I release another PL/R version. You can grab the new code from github and sign up for the PL/R mailing list to post your results or report any questions/problems. And of course visit the PL/R homepage and PL/R wiki for more general information about PL/R -- particularly to watch for these changes in the next official release. Finally, don't hesitate to contact me directly if the other choices don't suit you for some reason.

4 Comments

Pass-by-value-ness shouldn't really need to be a consideration here. The array contents will be the same either way.

Did you check that both PgSQL and FS caches were cleared before running patched version tests?
Otherwise benchmark results wouldn't be realistic (i.e. will be too good).

Leave a comment