Sequence Analysis | How to upload Human reference genome

Installation Forum (Inactive)
che.l.martin  2015-09-11 17:03
Status: Closed
 
Good Evening all,

Does anyone know how to set up the human reference genome in Sequence Analysis? I clicked Reference Genomes > More Actions > Load from NCBI and entered the following info:

NCBI subfolder = /H_Sapien
Genome prefix = hs_ref
Genome name = GRCh38
Species = Human

But the download is empty. Am I doing this correctly?

Che
 
 
Ben Bimber responded:  2015-09-11 17:49
Hello,

This feature is still very experimental. Honestly, I don't fully understand how NCBI organizes its genomes well enough to build a consistent cross-species download tool. However, in your case I *think* all you need to do is remove the leading slash and use "H_sapiens" (note the plural, and the lowercase "s" as in the FTP path: ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/). I can make the tool a little more forgiving in the future.
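
The path fix above can be sketched as a small shell helper. This is only an illustration: it assumes the loader simply appends the subfolder to the FTP root quoted above, and `ncbi_genome_url` is a hypothetical name, not something the module provides.

```shell
#!/bin/sh
# Hypothetical helper: build the NCBI genomes FTP URL from a subfolder name.
# Assumption: the subfolder is appended to the FTP root quoted in this thread.
ncbi_genome_url() {
    subfolder=$1
    # Strip any leading slashes ("/H_sapiens" -> "H_sapiens").
    while [ "${subfolder#/}" != "$subfolder" ]; do
        subfolder=${subfolder#/}
    done
    # Strip any trailing slashes so exactly one is added below.
    while [ "${subfolder%/}" != "$subfolder" ]; do
        subfolder=${subfolder%/}
    done
    printf 'ftp://ftp.ncbi.nlm.nih.gov/genomes/%s/\n' "$subfolder"
}
```

Running `curl --list-only "$(ncbi_genome_url /H_sapiens)"` is one way to see whether the resulting directory exists before starting the import. Whether that listing still matches what the loader expects is not guaranteed.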

Also, we end up loading the majority of our genomes by downloading the FASTA file and using the 'import from FASTA' option right above the NCBI loader you're using.

If the suggestion above doesn't work, would you be able to post the log from the pipeline job? It has more detail about the FTP paths it's trying to load.
 
che.l.martin responded:  2015-09-12 06:14
Thank you for your response, Ben. I seem to be having problems uploading any genomes. I tried two locations:
1) Site Settings For DISCVR Modules page
2) Sequence Analysis page

On the Sequence Analysis page, when I click Reference Sequences > More Actions > Import from FASTA, I select the .fa file (for example chr2.fa) and click upload. After loading for about 2 minutes, I get an empty box saying "error" with no other message.

Also, on the DISCVR page, when I click upload sequence I get the error:

"Could not load records due to the following error:
Could not find schema: sequenceanalysis"

Any suggestions on which page or settings I should adjust to upload genomes?

Lastly, the log file for the reference genome upload is below:

###############################################################################################################
 Starting to run task 'org.labkey.sequenceanalysis.pipeline.CreateReferenceLibraryTask' at location 'webserver'
12 Sep 2015 13:11:01,835 INFO : there are 0 sequences to process
12 Sep 2015 13:11:02,684 INFO : Building index for FASTA: /usr/local/labkey/files/home/@files/.referenceLibraries/5/5_GRCh38.p2.fasta
12 Sep 2015 13:11:02,684 INFO : samtools faidx /usr/local/labkey/files/home/@files/.referenceLibraries/5/5_GRCh38.p2.fasta
12 Sep 2015 13:11:02,684 DEBUG: using path: /usr/local/labkey/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
12 Sep 2015 13:11:02,685 WARN : Unable to create FASTA index
12 Sep 2015 13:11:02,687 INFO : Creating dictionary for: /usr/local/labkey/files/home/@files/.referenceLibraries/5/5_GRCh38.p2.fasta
12 Sep 2015 13:11:02,687 INFO : java -jar /usr/local/labkey/bin/picard-tools/CreateSequenceDictionary.jar VALIDATION_STRINGENCY=LENIENT MAX_RECORDS_IN_RAM=2000000 REFERENCE=/usr/local/labkey/files/home/@files/.referenceLibraries/5/5_GRCh38.p2.fasta OUTPUT=/usr/local/labkey/files/home/@files/.referenceLibraries/5/5_GRCh38.p2.dict
12 Sep 2015 13:11:02,687 DEBUG: using path: /usr/local/labkey/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
12 Sep 2015 13:11:02,703 DEBUG: Error: Unable to access jarfile /usr/local/labkey/bin/picard-tools/CreateSequenceDictionary.jar
12 Sep 2015 13:11:02,703 WARN : process exited with non-zero value: 1
12 Sep 2015 13:11:02,703 WARN : Unable to create sequence dictionary
12 Sep 2015 13:11:02,718 INFO : creation complete
12 Sep 2015 13:11:02,719 INFO : Successfully completed task 'org.labkey.sequenceanalysis.pipeline.CreateReferenceLibraryTask'
 
Ben Bimber responded:  2015-09-12 16:13
Hi Che,

I think there are 2 things at work here:

1) Any time there is an error along the lines of 'cannot find schema X', that usually means the module that provides that schema isn't turned on in that folder. Can you click the Admin menu, then pick Folder->Management? From there, go to the 'folder type' tab and make sure Sequence Analysis is checked. If not, check it and hit 'update' to apply.

2) The second issue is that this module uses picard tools for some of the basic sequence processing. To make this work, you can download/unzip picard into /usr/local/labkey/bin (at least that's where your server seems to be installed, based on the log). Try something like:

mkdir -p /usr/local/labkey/bin/picard-tools
cd /usr/local/labkey/bin/picard-tools
wget --read-timeout=10 https://github.com/broadinstitute/picard/releases/download/1.135/picard-tools-1.135.zip
unzip picard-tools-1.135.zip
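
To check that the unzip left the JAR where the pipeline looks for it, something like the following may help. It is a sketch: `find_picard_jar` and `picard_jar_in_place` are made-up helper names, and the directory and JAR name are taken from the log above. If the archive unpacks into a versioned subfolder (an assumption worth checking), the JAR will show up one level deeper than the pipeline expects.

```shell
#!/bin/sh
# Hypothetical helper: print the path of every copy of a Picard JAR under a directory.
# Args: $1 = directory to search, $2 = JAR file name (defaults to the one in the log).
find_picard_jar() {
    dir=$1
    jar=${2:-CreateSequenceDictionary.jar}
    find "$dir" -type f -name "$jar"
}

# Hypothetical helper: succeed only if the JAR sits directly in the directory,
# which is where the log shows the pipeline looking.
picard_jar_in_place() {
    [ -f "$1/${2:-CreateSequenceDictionary.jar}" ]
}
```

For example, `find_picard_jar /usr/local/labkey/bin/picard-tools` shows where the JAR actually landed, and `picard_jar_in_place /usr/local/labkey/bin/picard-tools && echo ok` confirms it is at the top level.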

I'll clean up the module docs to include external tools. Note: if you plan to use this module for more serious sequence processing (alignments, variant calling, etc), there are more tools you will need to install. I can point you to the script for this.

Note: I think you might be able to just hit 'retry' on the failed pipeline job, instead of starting the upload from scratch. If that isn't an option, I would try loading the genomes table, selecting your genome, and picking 're-process selected'. That forces the system to re-create the FASTA file, index and dictionary. The dictionary is what failed.
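
To see which step failed without reading the whole log, filtering for warnings and errors is usually enough. A minimal sketch, assuming the log has been saved to a file and uses the level format seen in the log above (`WARN :`, plus messages containing `Error:`); the `ERROR:` level is a guess at how errors would be printed, and `show_log_problems` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print only the warning/error lines of a pipeline log.
# Assumption: levels appear as "WARN :" or "ERROR:", or the message contains "Error:".
show_log_problems() {
    grep -E 'WARN :|ERROR:|Error:' "$1"
}
```

On the log posted above this would surface the failed FASTA index, the "Unable to access jarfile" message, the non-zero exit, and the failed sequence dictionary, while skipping the INFO lines.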
 
che.l.martin responded:  2015-09-13 08:39

Hello Ben,

Thank you for your feedback. I think I am just going to have to use your second suggestion and upload the genome from FASTA files. I have downloaded each human chromosome .fa file. How do I proceed to get them into LabKey as a reference genome? You mentioned you and your team did it this way. Could you please tell me how to go about this? I have samtools and Picard already installed and added to the PATH variable.

Thanks,
Che
 
Ben Bimber responded:  2015-09-13 13:16
Hi Che,

On the same screen where you found the 'load from NCBI' option, right above it I believe you'll find a 'load from FASTA' menu item.

Also note: I think in 15.2 you will need to be sure the picard JARs are located in the folder referenced above (/usr/local/labkey/bin/picard-tools). Also note that 15.2 is expecting picard to be split into many JARs (the older style of those tools), as opposed to the single picard.jar. That will change in 15.3. The commands I posted above will let you download a suitable version of picard.
 
che.l.martin responded:  2015-09-13 13:56
Hello Ben,


I have picard installed as instructed above, but I am not seeing that option on the menu. I have only 3 options:
re-process selected
load genome from NCBI
Create blast Database

How can I get that option to show on my menu?


Are the versions of SequenceAnalysis from https://github.com/bbimber/discvr/wiki/Getting-Started and the Labkey-15.2-ExtraModules link you sent me different?
 
Ben Bimber responded:  2015-09-14 08:35
Hey,

Yes, I think the official 15.2 release is behind where I thought it was. I haven't finished the docs, but this will be on github relatively soon.

However, if instead of going to the 'reference genomes' link you pick 'reference sequences', you will find the 'import FASTA' option. It's the exact same one that will show on the genomes page in 15.3 as well. Also be aware: because your NCBI import probably succeeded up until the final step that requires picard, it's possible the reference sequences table already holds your chromosomes. If that is true, you can select the sequences you want and then pick 'create genome' from the More Actions button. There's no reason to double-upload your FASTA if that is the case.

Sorry for not having cleaner instructions on this part of the module.

-Ben
 
che.l.martin responded:  2015-09-14 11:48
Hello Ben,

I tried loading my reference chromosome, chr1.fa, downloaded from UCSC. I first downloaded and unpacked the .gz file to get the plain chr1.fa file, then used the option you suggested. A box popped up saying "loading" for ~2 mins, then once the upload reached 100% the loading box changed into an error pop-up with no further details. The funny thing is that if I go to FileContent I see the chr1.fa file. How can I get this to show up in the sequence or reference list?

Regards,
Che
 
Ben Bimber responded:  2015-09-14 13:19
Hi Che - I just followed up offline. It will probably be quicker to talk there.
 
Ben Bimber responded:  2015-09-15 08:56
To follow up here in case someone reads this in the future:

The issue was server memory. When this module loads FASTA data, it reads each chromosome into memory temporarily. Che's server was initially running with "-Xmx1g". We increased this to 2g, and it seemed to work fine. He was able to upload the full human GRCh38 FASTA.
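
Since each chromosome is read into memory in turn, the largest record in a FASTA gives a rough idea of which sequence will put the most pressure on the heap. The sketch below reports it. `largest_fasta_record` is a hypothetical helper, and how sequence length translates into actual heap use inside the module is not known here, so treat the number only as a relative guide.

```shell
#!/bin/sh
# Hypothetical helper: print "<name> <length>" for the longest record in a FASTA file.
# Length counts sequence characters only (header lines and line breaks excluded).
largest_fasta_record() {
    awk '
        /^>/ {
            # A new header closes the previous record; keep it if it is the longest so far.
            if (name != "" && len > max) { max = len; best = name }
            name = substr($1, 2); len = 0; next
        }
        { len += length($0) }
        END {
            # Account for the final record, which has no header after it.
            if (name != "" && len > max) { max = len; best = name }
            if (best != "") print best, max
        }
    ' "$1"
}
```

For example, `largest_fasta_record genome.fa` (file name is illustrative) prints the name and length of the biggest sequence in the file.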

Here's a link to labkey's docs on webapp memory:

https://www.labkey.org/wiki/home/Documentation/page.view?name=configWebappMemory&_docid=wiki%3A1a849b6e-f952-102c-b5c7-d104f9cd25d3
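
To check which heap limit a server is actually running with, the `-Xmx` value can be pulled out of the Java command line. A minimal sketch: `extract_xmx` is a hypothetical helper, and it reports the last `-Xmx` on the line on the assumption that the JVM honors the final occurrence.

```shell
#!/bin/sh
# Hypothetical helper: print the last -Xmx setting found in a JVM command line.
# The last one is taken on the assumption that the JVM honors the final occurrence.
extract_xmx() {
    printf '%s\n' "$1" | grep -oE -- '-Xmx[0-9]+[kKmMgG]?' | tail -n 1
}
```

On a Linux server, `extract_xmx "$(ps -eo args | grep '[j]ava')"` would show the setting of a running Java process. If several Java processes are running, only the last match is printed, so narrow the `grep` as needed.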