Message boards : Number crunching : Nearly every WU crashes - what's wrong here ?

capeITLabs
Joined: 17 Nov 12
Posts: 30
Credit: 111,887,025
RAC: 0
Message 31554 - Posted: 16 Jul 2013 | 11:12:54 UTC

Hi @all,

Does anyone know what's happening here? Nearly all of the GPUGRID WUs crash on my machine.

http://www.gpugrid.net/results.php?userid=93083

It's a Q8200 CPU with 3 GPUs:

- GTX260
- GTX460
- GTX560Ti

The 260 is excluded from GPUGRID. Over the last few days I've read a lot about crashing GPUGRID tasks, so I've done the following so far:

- disabled the screensaver and all energy-saving features
- set BOINC project switching to 1440 mins (24h) to prevent GPUGRID WUs from being suspended
- installed BOINC 7.2.4
- checked the cooling - everything's fine (according to HWINFO64 and GPU-Z)
- disabled all other BOINC projects on the machine - no effect
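
For readers wondering how a single card can be kept away from one project: BOINC supports an exclude_gpu entry in cc_config.xml. The sketch below only builds such a file as text. The device number (0) is an assumption, since the thread never says which index the GTX260 has, and the project URL has to match the one your BOINC client shows for GPUGRID.

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu_config(project_url: str, device_num: int) -> str:
    """Return cc_config.xml text that excludes one GPU from one project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    return ET.tostring(root, encoding="unicode")


# Assumption: the card to exclude is device 0; check the BOINC event log
# at startup to see how your GPUs are actually numbered.
config_text = build_exclude_gpu_config("http://www.gpugrid.net/", 0)
print(config_text)
```

The resulting text would go into cc_config.xml in the BOINC data directory, followed by a client restart or a re-read of the config files.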

Hope someone can help here.

Thanks in advance

Rene

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 31555 - Posted: 16 Jul 2013 | 11:38:38 UTC - in response to Message 31554.
Last modified: 16 Jul 2013 | 11:39:36 UTC

9 valids and 55 errors tell me that your system can do work but isn't set up well. Some WUs have recently been more troublesome, but you have completed work from different queues at different times.

I would suggest you stop running CPU tasks to see if your system performs better. What are the GPU and CPU temperatures?

Also, if you don't use the 260, remove it; it's quite power hungry.

Running many different GPU projects can bring its own set of problems. If you must, I suggest a very small cache/buffer of work and setting BOINC to switch between apps every 999 minutes.

Should you not see any improvement, try the 314 drivers (advanced/clean install).
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

capeITLabs
Message 31556 - Posted: 16 Jul 2013 | 11:58:34 UTC

Thanks for the quick reply. The 260 is crunching for POEM at the moment. But let's give it a try. I will disable all the other projects and only run GPUGRID. The 260 will stay excluded for now. Since the machine is located in a remote server room, it will be a bit difficult to remove the 260 on the fly ;)

According to HWINFO, the CPU core temps are around 55 deg, the GPU temps are

- 78 deg for the GTX460
- 69 deg for the GTX260
- 72 deg for the GTX560Ti

That should be no problem, IMHO.

capeITLabs
Message 31561 - Posted: 16 Jul 2013 | 19:12:40 UTC
Last modified: 16 Jul 2013 | 19:13:51 UTC

Ok, I've disabled all other projects on this machine, did a complete driver cleanup (incl. Driver Cleaner PE) and reinstalled the NVIDIA drivers 314.22 (only the graphics driver). The GTX260 is still excluded from GPUGRID (makes no sense performance-wise IMHO).

I've just received a couple of short runs...let's see what happens...

capeITLabs
Message 31564 - Posted: 16 Jul 2013 | 21:08:34 UTC - in response to Message 31561.

Hmmm...looks like the first short run crashed again after about 2 hours on the 560Ti... :(

This isn't funny anymore...

skgiven
Message 31567 - Posted: 16 Jul 2013 | 23:49:06 UTC - in response to Message 31564.

I suggest you try using fan controlling software to keep the GPU temps below 70°C, if possible.
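
A rough way to keep an eye on that 70°C limit from a script is sketched below. It parses the CSV output of nvidia-smi and lists the cards at or above the limit. Whether nvidia-smi reports temperatures for older GeForce cards depends on the driver, so treat read_gpu_temps() as an assumption; the sample text simply reuses the temperatures reported earlier in this thread, with card names written the way I would expect the tool to print them.

```python
import subprocess

TEMP_LIMIT_C = 70  # limit suggested in this thread


def parse_temps(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, temperature' lines into (name, temp_c) pairs."""
    readings = []
    for line in csv_text.strip().splitlines():
        name, temp = line.rsplit(",", 1)
        readings.append((name.strip(), int(temp)))
    return readings


def too_hot(readings, limit=TEMP_LIMIT_C):
    """Return the names of all cards at or above the limit."""
    return [name for name, temp in readings if temp >= limit]


def read_gpu_temps() -> str:
    """Ask nvidia-smi for names and temperatures (needs a supporting driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Sample input built from the temperatures posted earlier in the thread.
sample = "GeForce GTX 460, 78\nGeForce GTX 260, 69\nGeForce GTX 560 Ti, 72\n"
hot_cards = too_hot(parse_temps(sample))
print(hot_cards)
```

With the sample values this flags the GTX 460 and the GTX 560 Ti; on a live machine you would pass read_gpu_temps() to parse_temps() instead of the sample.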

capeITLabs
Message 31572 - Posted: 17 Jul 2013 | 8:08:26 UTC - in response to Message 31567.

In a fit of deep frustration, I've just re-installed Linux on this machine (which it was running before Win7) ;) Let's see if this runs a bit more stable again.

But I'll keep your suggestion in mind.

capeITLabs
Message 31576 - Posted: 17 Jul 2013 | 10:47:31 UTC - in response to Message 31572.

Hmmmm...this bloody thing keeps crashing... :(

But only GPUGRID tasks crash on this machine. All other projects (even GPU) are working fine.

But is it normal that each GPUGRID task requests 16 GB of virtual memory?

capeITLabs
Message 31580 - Posted: 17 Jul 2013 | 11:42:42 UTC - in response to Message 31576.

Ok, next try:

- removed the 260
- added 4 GB more physical RAM (6 GB total)

started crunching 1 short and 1 long...

capeITLabs
Message 31582 - Posted: 17 Jul 2013 | 12:11:49 UTC - in response to Message 31580.

It's absolutely unbelievable and totally annoying... the GPUGRID tasks still crash after an indeterminate time... :(

Is there something like a "boinc task debug mode"?

Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Message 31589 - Posted: 17 Jul 2013 | 15:16:22 UTC

Why don't you try removing all other GPUs from the machine? Maybe you're having other issues, like heat, power, etc. With all the rest you've done, I think it's only a little more trouble to go through.

capeITLabs
Message 31591 - Posted: 17 Jul 2013 | 15:22:42 UTC - in response to Message 31589.

The thing is that only GPUGRID tasks are crashing. All the other projects (CPU or GPU) are working fine.

But I will try that as well.

capeITLabs
Message 31612 - Posted: 18 Jul 2013 | 8:38:54 UTC - in response to Message 31591.

ok, next try:

- removed 4GB of RAM (the two modules left are working for sure, according to 48h memtest86)
- added 2 more 120mm fans (in total: 1 intake, 3 exhausts, each @2000 rpm)

crunching 1 short and 1 long now.

CPU temps for all cores are 42 deg, GPU temps are at 73 deg.

let's see what happens...

FoldingNator
Joined: 1 Dec 12
Posts: 24
Credit: 60,122,950
RAC: 0
Message 31618 - Posted: 18 Jul 2013 | 10:13:28 UTC - in response to Message 31612.

Do you have an overclock on your CPU or any FSB increase?
In my case it helped when I set it back to stock clocks (Q6600). CPU times are higher now, but it's stable! I think my northbridge was running too hot with the CPU OC + folding @ PrimeGrid and 2 downclocked GPUs + folding @ GPUGRID.

It's really a pain in the ass when WUs error out again and again. I know what it feels like. Very frustrating.

Have you also seen this thread?

capeITLabs
Message 31619 - Posted: 18 Jul 2013 | 10:52:40 UTC - in response to Message 31618.

No OC in place, neither on the CPU nor on the GPUs, except for the 560Ti, which is a Palit Sonic and therefore has some factory OC.

The WUs are still crashing...

I think I can rule out any heat issues now. RAM also works fine, if I can trust memtest86. This leaves the mainboard (MSI P7N SLI) and the PSU (650W LCPower) on the table. I know the LCPower isn't high end, but it's also working in my other BOINC machines and they don't show any issues. And a 560Ti + GTX460 + Q8200 should be no problem for this PSU. In another one of my machines it powers an i7-3770 + two HD6950. The Radeons are running 3 MilkyWay WUs in parallel and are at 100%, and all CPU cores are at 100% too. So the PSU should work with this setup.

Hmmm...

skgiven
Message 31620 - Posted: 18 Jul 2013 | 10:53:00 UTC - in response to Message 31618.
Last modified: 18 Jul 2013 | 10:56:42 UTC

"GPU temps are at 73 deg."

That's not disastrous but it's still slightly on the high side.

I suggest you use reference settings for the 560Ti (822/1645/4008).

If you haven't done it yet, try tweaking the clocks down a bit. It might be the case that the voltage isn't ideal for the clocks, so by dropping the clocks a notch you might find stability. I would start by reducing both by ~5% and test it on the short tasks.

As FoldingNator said, overclocking the FSB can cause issues, and you shouldn't OC the PCIe bus for this project. If you haven't OC'ed, and have tried everything else, you might want to reduce the FSB and possibly even the PCIe (though that might cause as many problems as it resolves).
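
To put numbers on the ~5% reduction, here is a small worked example. It assumes the three reference values quoted above are core / shader / memory clocks in MHz, which the post does not spell out, and it applies the same percentage to all three just to show the arithmetic.

```python
# Assumption: 822/1645/4008 are core, shader and (effective) memory clocks in MHz.
REFERENCE_CLOCKS_MHZ = {"core": 822, "shader": 1645, "memory": 4008}


def reduce_clocks(clocks: dict, percent: float) -> dict:
    """Return the clocks lowered by the given percentage, rounded to whole MHz."""
    factor = 1 - percent / 100
    return {name: round(mhz * factor) for name, mhz in clocks.items()}


reduced = reduce_clocks(REFERENCE_CLOCKS_MHZ, 5)
for name, mhz in reduced.items():
    print(f"{name}: {REFERENCE_CLOCKS_MHZ[name]} MHz -> {mhz} MHz")
```

That works out to roughly 781 / 1563 / 3808 MHz as a starting point for a test run on short tasks. The actual values you can set depend on the tuning tool and the card.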

capeITLabs
Message 31621 - Posted: 18 Jul 2013 | 10:56:34 UTC - in response to Message 31620.

Ok, thanks for the advice. I will try that. Interestingly, all the latest crashes were caused by segmentation violations...

How much physical RAM should be in the box to run 2 or 3 GPUGRID WUs in parallel?
At the moment the box has 2 GB physical and plenty of swap...

skgiven
Message 31622 - Posted: 18 Jul 2013 | 11:02:45 UTC - in response to Message 31621.

The GPUGrid tasks use about 140MB of system memory each, but if you are doing anything else (CPU crunching) 2GB is probably not enough, especially on Windows 7 - the OS is likely reading and writing to the drive a lot.
Task Manager will tell you what you are using. I suggest you put the other 2GB back in, and maybe give it a quick test by swapping it with the existing modules first.
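
As a quick sanity check on those figures, the sketch below multiplies the ~140 MB per task by the 2 or 3 parallel tasks asked about and compares the result with 2 GB (taken as 2048 MB). It deliberately ignores the OS and any CPU tasks, which is where most of the memory would actually go.

```python
TASK_MEMORY_MB = 140   # approximate per-task system memory quoted above
TOTAL_RAM_MB = 2048    # the 2 GB currently in the box


def gpugrid_memory_mb(task_count: int, per_task_mb: int = TASK_MEMORY_MB) -> int:
    """System memory used by the GPUGRID tasks alone."""
    return task_count * per_task_mb


def share_of_ram(task_count: int, total_ram_mb: int = TOTAL_RAM_MB) -> float:
    """Fraction of physical RAM taken by the GPUGRID tasks alone."""
    return gpugrid_memory_mb(task_count) / total_ram_mb


for tasks in (2, 3):
    print(f"{tasks} tasks: {gpugrid_memory_mb(tasks)} MB "
          f"({share_of_ram(tasks):.0%} of {TOTAL_RAM_MB} MB)")
```

So the GPUGRID tasks themselves need only a few hundred MB; the pressure on a 2 GB machine comes from everything else running alongside them.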

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 31637 - Posted: 18 Jul 2013 | 20:08:31 UTC

- power down, remove the power cord, wait at least 15 minutes and try again - this has solved some weird sh*t for me in the past!

- run GPU-Grid only on the 460 or 560 - let's see if either one can be singled out

- I agree with slightly lower GPU core and memory clocks - the cards may have degraded slightly over time and might no longer be stable at stock clocks under high load in the summer (not much beats GPU-Grid except Furmark)

- try regular 3D tests like some 3D Mark

- removing the 260 should have ruled out the PSU.. but I'd try another one anyway, since they do break and can cause weird errors

MrS
____________
Scanning for our furry friends since Jan 2002

capeITLabs
Message 31652 - Posted: 19 Jul 2013 | 11:43:01 UTC - in response to Message 31637.
Last modified: 19 Jul 2013 | 11:44:06 UTC

ok, did the following now:

- re-installed Win7 (because of better tool support)
- down-clocked the GPUs to NVIDIA defaults
- fixed fan rpm to 75% for both cards

started crunching 1 short and 1 long again...let's see what happens...

GPU temps are @ 69 deg (GTX460) and 58 deg (GTX560Ti)

capeITLabs
Message 31675 - Posted: 20 Jul 2013 | 7:41:05 UTC - in response to Message 31652.
Last modified: 20 Jul 2013 | 7:44:53 UTC

Ok, seems like the issue has been solved :) After down-clocking the GPUs from the factory OC to NVIDIA's default clocks, everything is running smoothly now.

Next week I will receive a GTX480 to replace the 460. Hopefully the box keeps working and there are no more issues.

Thanks a lot guys.
