I have a module-based assay that is currently in use, with 2.5k runs and 399k data (result) rows stored. The .data table contains an FK column, molecule, pointing to hplc.molecule, a small hard table with ~40 rows. If I run the following query from within LabKey:
PARAMETERS (FOO INTEGER)
SELECT DISTINCT molecule FROM assay.hplc.hplc01.data WHERE run = FOO;
The expanded query that is actually sent to the database server is (comment lines stripped and re-indented):
SELECT *
FROM
    (
    SELECT
        MoleculesRead.molecule AS molecule,
        MoleculesRead$molecule$.name AS "molecule$name"
    FROM
        (
        SELECT DISTINCT Data_2.molecule AS molecule
        FROM
            (
            SELECT * FROM assayresult.c9d43_hplc_minus_assay_result_fields
            WHERE ((SELECT d.Container FROM exp.Data d WHERE d.RowId = DataId)='b2db3469-9848-1032-a179-c669927523e7')
            ) Data_2
        WHERE ((SELECT RunId FROM exp.data d WHERE d.RowId = Data_2.DataId)=CAST(? AS INTEGER))
        ) MoleculesRead
        LEFT OUTER JOIN hplc.molecule MoleculesRead$molecule$
            ON (MoleculesRead.molecule = MoleculesRead$molecule$.rowid)
    ) x
ORDER BY molecule ASC
LIMIT 101;
This takes about 3.7 seconds to run on my development machine and this is the Postgres execution plan:
Limit  (cost=6638817.83..6638817.93 rows=1 width=13) (actual time=3723.912..3723.924 rows=8 loops=1)
  ->  Merge Left Join  (cost=6638817.83..6638817.93 rows=1 width=13) (actual time=3723.910..3723.920 rows=8 loops=1)
        Merge Cond: (c9d43_hplc_minus_assay_result_fields.molecule = "moleculesread$molecule$".rowid)
        ->  Sort  (cost=6638816.39..6638816.40 rows=1 width=4) (actual time=3723.878..3723.879 rows=8 loops=1)
              Sort Key: c9d43_hplc_minus_assay_result_fields.molecule
              Sort Method: quicksort  Memory: 25kB
              ->  HashAggregate  (cost=6638816.36..6638816.37 rows=1 width=4) (actual time=3723.864..3723.866 rows=8 loops=1)
                    ->  Seq Scan on c9d43_hplc_minus_assay_result_fields  (cost=0.00..6638816.34 rows=10 width=4) (actual time=1769.616..3723.833 rows=32 loops=1)
                          Filter: ((((SubPlan 1))::text = 'b2db3469-9848-1032-a179-c669927523e7'::text) AND ((SubPlan 2) = 817))
                          Rows Removed by Filter: 399097
                          SubPlan 1
                            ->  Index Scan using pk_data on data d  (cost=0.28..8.30 rows=1 width=37) (actual time=0.003..0.004 rows=1 loops=399129)
                                  Index Cond: (rowid = c9d43_hplc_minus_assay_result_fields.dataid)
                          SubPlan 2
                            ->  Index Scan using pk_data on data d_1  (cost=0.28..8.30 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=399129)
                                  Index Cond: (rowid = c9d43_hplc_minus_assay_result_fields.dataid)
        ->  Sort  (cost=1.44..1.48 rows=15 width=13) (actual time=0.027..0.028 rows=15 loops=1)
              Sort Key: "moleculesread$molecule$".rowid
              Sort Method: quicksort  Memory: 25kB
              ->  Seq Scan on molecule "moleculesread$molecule$"  (cost=0.00..1.15 rows=15 width=13) (actual time=0.003..0.008 rows=15 loops=1)
I was surprised at how slowly this runs. The execution plan shows that almost all of the time is spent on lookups into exp.data via the pk_data index: each of the two SubPlan index scans executes once per row in my .data table (loops=399129), so at ~0.003 ms per lookup that's roughly 2 x 399,129 x 0.003 ms ≈ 2.4 s, the bulk of the 3.7 seconds. So the problem isn't a lack of an index. I then tried changing the run filter from a sub-query into a JOIN with a WHERE clause, as shown here (all changes from the above query are called out in the comment lines):
EXPLAIN ANALYZE SELECT *
FROM
    (
    SELECT
        MoleculesRead.molecule AS molecule,
        MoleculesRead$molecule$.name AS "molecule$name"
    FROM
        (
        SELECT DISTINCT Data_2.molecule AS molecule
        FROM
            (
            SELECT * FROM assayresult.c9d43_hplc_minus_assay_result_fields
            WHERE ((SELECT d.Container FROM exp.Data d WHERE d.RowId = DataId)='b2db3469-9848-1032-a179-c669927523e7')
            ) Data_2
        -- Replace the following line:
        -- WHERE ((SELECT RunId FROM exp.data d WHERE d.RowId = Data_2.DataId)=CAST(? AS INTEGER))
        -- with the following block of lines:
        JOIN
            exp.data AS d
        ON
            d.RowId = Data_2.DataId
        WHERE
            d.RunId = CAST(? AS INTEGER)
        -- end block of lines
        ) MoleculesRead
        LEFT OUTER JOIN hplc.molecule MoleculesRead$molecule$
            ON (MoleculesRead.molecule = MoleculesRead$molecule$.rowid)
    ) x
ORDER BY molecule ASC
LIMIT 101;
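Reduced to its essentials, the change is from a correlated sub-query in the WHERE clause to a plain join. A sketch, using the table and column names from above and with the bound parameter written out as 817 (the run id visible in the plans), dropping the container filter for brevity:

-- Slow form: the correlated sub-query is evaluated once per result row:
SELECT DISTINCT r.molecule
FROM assayresult.c9d43_hplc_minus_assay_result_fields r
WHERE (SELECT d.RunId FROM exp.data d WHERE d.RowId = r.DataId) = 817;

-- Fast form: the join lets the planner start from the run's few exp.data rows:
SELECT DISTINCT r.molecule
FROM assayresult.c9d43_hplc_minus_assay_result_fields r
JOIN exp.data d ON d.RowId = r.DataId
WHERE d.RunId = 817;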
This takes about 0.6 ms to run on my development machine and this is the Postgres execution plan:
Limit  (cost=23384.52..23384.61 rows=1 width=13) (actual time=0.333..0.347 rows=8 loops=1)
  ->  Merge Left Join  (cost=23384.52..23384.61 rows=1 width=13) (actual time=0.331..0.343 rows=8 loops=1)
        Merge Cond: (c9d43_hplc_minus_assay_result_fields.molecule = "moleculesread$molecule$".rowid)
        ->  Sort  (cost=23383.08..23383.08 rows=1 width=4) (actual time=0.307..0.308 rows=8 loops=1)
              Sort Key: c9d43_hplc_minus_assay_result_fields.molecule
              Sort Method: quicksort  Memory: 25kB
              ->  HashAggregate  (cost=23383.05..23383.06 rows=1 width=4) (actual time=0.295..0.298 rows=8 loops=1)
                    ->  Nested Loop  (cost=13.99..23383.04 rows=1 width=4) (actual time=0.076..0.271 rows=32 loops=1)
                          ->  Index Scan using idx_cl_data_runid on data d  (cost=0.28..8.40 rows=3 width=4) (actual time=0.013..0.016 rows=4 loops=1)
                                Index Cond: (runid = 817)
                          ->  Bitmap Heap Scan on c9d43_hplc_minus_assay_result_fields  (cost=13.70..7791.51 rows=4 width=8) (actual time=0.013..0.057 rows=8 loops=4)
                                Recheck Cond: (dataid = d.rowid)
                                Filter: (((SubPlan 1))::text = 'b2db3469-9848-1032-a179-c669927523e7'::text)
                                ->  Bitmap Index Scan on c1427504204us_assay_result_fields_dataid  (cost=0.00..13.70 rows=704 width=0) (actual time=0.006..0.006 rows=8 loops=4)
                                      Index Cond: (dataid = d.rowid)
                                SubPlan 1
                                  ->  Index Scan using pk_data on data d_1  (cost=0.28..8.30 rows=1 width=37) (actual time=0.004..0.004 rows=1 loops=32)
                                        Index Cond: (rowid = c9d43_hplc_minus_assay_result_fields.dataid)
        ->  Sort  (cost=1.44..1.48 rows=15 width=13) (actual time=0.021..0.022 rows=15 loops=1)
              Sort Key: "moleculesread$molecule$".rowid
              Sort Method: quicksort  Memory: 25kB
              ->  Seq Scan on molecule "moleculesread$molecule$"  (cost=0.00..1.15 rows=15 width=13) (actual time=0.004..0.007 rows=15 loops=1)
So on my development machine, this resulted in a ~6000x speedup (3.7 s vs 0.6 ms) and a ~283x reduction in estimated plan cost (6,638,818 vs 23,385).
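One thing worth noting: the join form can only win like this because exp.data already has an index on RunId (idx_cl_data_runid in the plan above). On another deployment that's easy to confirm with a standard Postgres catalog query; the schema and table names below are taken from the plans:

-- List the indexes on exp.data and check that RunId is covered:
SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'exp' AND tablename = 'data';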
I am running Postgres v9.3.4.2 and LabKey v15.3.
Obviously the LabKey code that translates internal SQL queries into the queries sent to the database server is a generalized system that can't be expected to give optimal results across all possible inputs. However, I wonder if the relative simplicity of this test case might shed some light on possible improvements that could be made. I also realize this could just be a case of bad optimization by Postgres, and that other database servers may not show such a large performance differential between the two cases. Unfortunately, I don't have any other database server up and running.
And of course, if anyone has thoughts on how to get this query to run faster from within LabKey v15.3, I'd love to hear them.
thanks,
-Will

jeckels responded: 2016-01-04 17:51
Hi Will,
Thanks for the info. Query execution plans are hard to predict, and can be very sensitive to the data in the tables, the version of the database, etc.
I've seen some other cases where switching from a subselect to a join greatly improves Postgres performance, so I made a change so that the server will generate SQL similar to your revised query. It will be part of 16.1, but I'm attaching the patch here in case you want to try it against 15.3 on a development machine.
Thanks,
Josh

Will Holtz responded: 2016-01-07 11:23
Hi Josh,
Thanks for the fast response. I'm running your patch on my development machine and I'm seeing a nice (~2x) performance bump on many of my module-based assay page loads -- pages that don't have the specific query discussed above. Looking forward to putting this on the production system later this year.
-Will