Rlabkey performance: /home/Support/LabKey Support Forum

Rlabkey performance

LabKey Support Forum (Inactive)

View Message

Rlabkey performance

Leo Dashevskiy

2012-08-14 16:09

Status: Closed

The table is like this:

CREATE TABLE qualifier.stats
(
    sId INT NOT NULL,
    Container ENTITYID NOT NULL,
    id INT NOT NULL,
    gsId INT NOT NULL,
    population TEXT NOT NULL,
    stats TEXT NOT NULL,
    node TEXT NOT NULL,
    channel TEXT NOT NULL,
    value NUMERIC NOT NULL,

    CONSTRAINT PK_stats PRIMARY KEY (sId),
    CONSTRAINT FK_stats_gsTbl FOREIGN KEY (gsId)
      REFERENCES qualifier.gsTbl (gsId)
      ON DELETE CASCADE
)

where gsId references:

CREATE TABLE qualifier.gsTbl
(
    gsId INT NOT NULL,
    Container ENTITYID NOT NULL,
    gsName TEXT NOT NULL,
    objLink TEXT NOT NULL,

    CONSTRAINT PK_gsTbl PRIMARY KEY (gsId)
);

and

id references the FCSFiles virtual table.

Inserting about 15K rows into that first table with 8 columns from an R session on one machine at the Hutch to my machine (a modern desktop), where Labkey is running, also at the Hutch, takes about 5 minutes, is that as expected?

Thanks.

Brian Connolly responded:	2012-08-14 16:52
Leo, I would expect better performance from the database server running on a modern desktop. My assumption is that you are running on a PostgreSQL database server that you installed on your local machine. If so, did you change any of the configuration settings (in the file postgresql.conf) after you installed. If not, then that is your problem (the default configuration of PG is designed to work on a server with only 256MB of memory). Thanks Brian

Matthew Bellew responded:	2012-08-14 16:58
Is this being inserted one row at a time (15000 api calls) or as one large patch (1 call with 15000 rows)? It would be interesting to see how long it takes to import the same data formatted as tab-separated file.

Leo Dashevskiy responded:	2012-08-14 18:18
Thanks for your responses. Matthew, we do a one API call to insertRows() with the argument being a data.frame with 15K rows - this usually is the correct approach to avoid overhead associated with each function call. Is it not the case here? Brian, I am running on PostgreSQL, yes, sorry forgot to mention that. And indeed it looks like I have not tweaked manually (m)any of the settings in the postgresql.conf file. Am I correct in assuming to look at the "RESOURCE USAGE" section? Any quick fixes or other suggestions from the pro about modifying these? (FYI, I currently have 8GB of RAM.) Thanks.

Brian Connolly responded:	2012-08-14 18:49
Leo, If you were running just a LabKey Server on your workstation and not using it as a desktop too, I would recommend the following configuring the following setting - shared_buffers = 25% of system memory up to max of 2GB - work_mem = 10MB - maintenance_work_mem = 1024MB - effective_cache_size = 75% of system memory - wal_buffers = 20MB - checkpoint_segments = 15 - checkpoint_timeout = 15min - enable_material = off - random_page_cost = 1.5 - join_collapse_limit = 10 - track_counts = 1 - autovacuum = on Now for a workstation, like yours, where you will be doing many other things on it, you want to balance performance with resource usage. I recommend the following settings - shared_buffers = 512MB - work_mem = 10MB - maintenance_work_mem = 256MB - effective_cache_size = 75% of system memory - wal_buffers = 4MB - checkpoint_segments = 15 - [comment this out] checkpoint_timeout = 15min - enable_material = off - random_page_cost = 1.5 - join_collapse_limit = 10 - track_counts = 1 - autovacuum = on By default, most of these settings are commented out in the file postgresql.conf. To change the values, go to the file and uncomment the setting and change the value. Once you have made the changes, you will need to restart the postgresql server to pick up the changes. Also, if your workstation is running linux, you may need to set the kernel's Shared Memory setting. If you are on linux, and the postgresql server fails to start, after making these changes, then respond to this message and I will send you the instructions for increasing the Shared Memory settings on your work station. -Brian

Leo Dashevskiy responded:	2012-08-14 20:56
All right, Brian, I followed your recommendations, by tweaking the parameters of the db. As you anticipated, the shmmax param got exceeded. I changed that param to 1GB in the /etc/sysctl.conf file - hopefully it will be loaded properly on next reboot, do you know anything about that (I'm running Ubuntu 10 or 11)? And then I managed to restart the db. Tomorrow, when my colleague is back, we will test out the new performance and let you know, how it is.

Leo Dashevskiy responded:	2012-08-15 13:05
Update: After running the insert query my colleague says that he did not feel a noticeable difference. I saw it first hand now, it may not be 5 minutes, it may take like a minute of so, but I guess our expectations were for it to be under 10 seconds... I will also try running it from my own machine and then I will also try eliminating the primary/foreign key constraints (for experimentation sake)

Peter responded:	2012-08-27 16:52
Hi Leo, I worked on the RLabkey project most recently, and I think what is going on here is that the rjson package is pretty inefficient in the functions that convert from R objects to JSON data packets on the wire ( an insert scenario). It used to be equally slow in converting from JSON to R objects (select scenario) until Martin Morgan rewrote the fromJSON function in C and it sped up by a 1000x or so.

Leo Dashevskiy responded:	2012-08-27 18:29
I see, ok, well thanks for the informative explanation, Peter. If it becomes an issue for us, I guess, we will have to rewrite the 'R to JSON' part in C as well...