cluster error: can't open output file

CPAS Forum (Inactive)
wnels2  2007-10-17 12:36
Status: Closed
 
I'm getting mail about this error (below). I think the error message may be truncated. The other logs don't have anything useful. The permissions on these directories are 777. Is the SGE script saved anywhere so I can use it to troubleshoot?
Thanks for any suggestions,
Bill

Message 3:
From root@msfcluster.gws.uky.edu Wed Oct 17 14:06:05 2007
X-Original-To: root@msfcluster.gws.uky.edu
Delivered-To: root@msfcluster.gws.uky.edu
To: root@msfcluster.gws.uky.edu
Subject: Job 70 (SGE.CAexample.tandem.sge) Set in error state
Date: Wed, 17 Oct 2007 14:06:05 -0400 (EDT)
From: root@local (root)

Job 70 (SGE.CAexample.tandem.sge) Set in error state
 Exit Status = -1
 Signal = unknown signal
 User = root
 Queue = labkey@compute-0-3.local
 Host = compute-0-3.local
 Start Time = <unknown>
 End Time = <unknown>
 CPU = NA
 Max vmem = NA
failed opening input/output file because:
10/17/2007 14:06:04 [0:22582]: error: can't open output file "/home/massspec/cpas/pipeline/projects/
Use "qmod -c <jobid>" to clear job error state
once the problem is fixed.
 
 
brendanx responded:  2007-10-17 15:15
If you use "--v --v --v" (or level-3 verbose), the scripts will indeed be left in the directory with the *.log files.

You might also want to actually get on one of the cluster nodes in interactive mode, and try accessing the file listed. Sometimes that can be very enlightening.
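
One way to run that check from a shell on the node is sketched below. It prints the mode, owner, and group of the target directory and each of its parents, then reports whether the current user can write to the target. The function name check_output_dir is hypothetical, and the sketch assumes GNU stat, as found on typical Linux cluster nodes.

```shell
#!/bin/sh
# Hypothetical helper (not part of CPAS or SGE): show mode and ownership for a
# directory and each of its parents, then report whether the current user can
# write to it. Assumes GNU stat (the -c format option).
check_output_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # Walk from the target directory up to the filesystem root.
    p=$(cd "$dir" && pwd)
    while :; do
        stat -c '%A %U:%G %n' "$p"
        [ "$p" = "/" ] && break
        p=$(dirname "$p")
    done
    # Creating a file needs both write and search permission on the directory.
    if [ -w "$dir" ] && [ -x "$dir" ]; then
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
        return 1
    fi
}
```

Pass it the directory the job is supposed to write to, and run it as the account the job runs under. A root shell will typically report "writable" regardless of the mode bits.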

Good luck!

--Brendan
 
wnels2 responded:  2007-10-18 06:25
Hi Brendan,
I am using the labkey.org documentation, so pipe.pl is set to use --v --v --v.
I cannot find any clues in the log files (see below). The job goes into an error state right away.
The error message in the email is truncated, so I don't know which file it is looking for, but all of the directories and files should be rwxrwxrwx (chmod -R 777 <projectDir>). I can access the directories from the nodes, and I can run XTandem interactively through SGE.
I'm starting to look at SGE.pm to get it to output the SGE scripts (so I can try them manually) and to find out why the email error message is truncated, unless anyone has experience with this and can point me to a more efficient path.

Thanks,
Bill
_______________________________________________________________________________________________________________
pipe-processing.log
_______________________________________________________________________________________________________________

2007-10-17 16:51:56

LOG: Running X!Tandem for CAexample
LOG: job command: /opt/gridengine/bin/lx26-x86/qsub /home/massspec/cpas/pipeline/projects/bill/2007/08/test1/xtandem/test9/SGE.CAexample.tandem.sge
LOG: Submitted CAexample.tandem job 78
LOG: job command: /opt/gridengine/bin/lx26-x86/qsub -hold_jid 78 /home/massspec/cpas/pipeline/projects/bill/2007/08/test1/xtandem/test9/SGE.CAexample.summary.sge
LOG: Submitted CAexample.summary job 79
LOG: Sleeping 2 seconds for job scheduler.

2007-10-17 16:52:28

LOG: Checking job status CAexample
     78 0.50000 SGE.CAexam root Eqw 10/17/2007 16:51:56 1
     79 0.00000 SGE.CAexam root hqw 10/17/2007 16:51:56 1

...repeated every 30 sec.

__________________________________________________________________________________________________________
CAexample.log
_____________________________________________________________________________________________________________
X!Tandem search for CAexample.mzXML
=======================================
 
brendanx responded:  2007-10-18 10:04
Sure looks like SGE is failing to write its output log. You should have some output from running the job on the cluster no matter how badly it chokes. At least here at the Hutch, we would reboot the machine causing the failure, expecting that it has lost connectivity with shared storage. And, my guess would be that the quick failure was due to a failure to read the inputs.

Not knowing your set-up, it could also mean that the cluster nodes are not correctly mounting your /home/massspec/cpas/... directory structure. Or the cluster node does not have permission to the directory, which it sounds like you've ruled out.

Again, best way to look at this is in interactive mode on the cluster node in question.

--Brendan
 
wnels2 responded:  2007-10-18 10:39
Hi Brendan,
Sorry, I only had 2 --v's in pipe.sh. After I added the third --v, the SGE.scripts were no longer deleted, and I could see where it was having trouble writing.
It turns out that on my system, when the xtandem/ProtocolName folder is created, it has write permission only for the owner. SGE is running under its own special userid. I'm hoping there is a Samba setting that lets me change the default permissions on new folders.
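
A minimal sketch for finding other folders in the same state is shown below. It lists every directory under a root that lacks the group-write bit. The function name is hypothetical, and it assumes the SGE account reaches these folders through group membership, which this thread does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: list directories under a root that are not
# group-writable. "-perm -020" matches entries whose group-write bit is set;
# the leading "!" negates the match.
list_dirs_without_group_write() {
    find "$1" -type d ! -perm -020 -print
}
```

Called with the pipeline root as its argument, it prints one path per line for each directory a group member could not write to.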

Thanks,
Bill
 
brendanx responded:  2007-10-18 10:44
Make sure the setgid bit (sometimes loosely called the group "sticky" bit) is set, along with group write, on all directories in your pipeline root:

chmod g+ws (directories only)

This is how we are set up, and it gives group write privileges to all new folders created by CPAS or from Windows Explorer.
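
As a sketch of the "directories only" part, the chmod can be driven by find so that regular files are left alone. The function name set_group_inherit is hypothetical. In chmod g+ws, the w adds group write and the s adds the setgid bit, which on Linux makes new entries inherit the directory's group.

```shell
#!/bin/sh
# Hypothetical helper: add group write and the setgid bit to every directory
# under the given root. Regular files are skipped by "-type d".
set_group_inherit() {
    find "$1" -type d -exec chmod g+ws {} +
}
```

Call it with the pipeline root as the only argument, as a user allowed to change modes on those directories.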

--Brendan