Message boards : Number crunching : Lotf of errors for two weeks

Author Message
sis651
Joined: 25 Nov 13
Posts: 66
Credit: 162,605,097
RAC: 33,128
Level
Ile
Message 41714 - Posted: 30 Aug 2015 | 1:47:06 UTC
Last modified: 30 Aug 2015 | 1:48:03 UTC

I crunch short runs on my notebook. It has an Nvidia GT 740M, which is not very fast but could finish long runs in two days.

However, for about two weeks there have been lots of errors. Some tasks run for a short time and then error out, and some get past 90% and then error out!
All the tasks that ended with errors are short-run Noelia tasks. When I check them, some also ended with errors for other users and some finished successfully, so I don't think the tasks themselves are faulty. It seems I have a problem with my system.

I use Kubuntu 15 with Nvidia driver 346.59, running via the optirun command, which is a way to use the Nvidia GPU on Optimus notebooks. The driver seems old, but it is the latest one in the Kubuntu repo. I may need to test the latest drivers from the xorg-edgers repo, but they sometimes break Bumblebee and CUDA, so I haven't hurried to install them. Also, it had been crunching fine ever since I installed this OS.

The dmesg errors I get are these:
[ 7007.280414] NVRM: GPU at PCI:0000:07:00: GPU-932eafe5-81a1-4d7e-b0cb-1f8f59518f5c
[ 7007.280427] NVRM: Xid (PCI:0000:07:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1): Out Of Range Address
[ 7007.280437] NVRM: Xid (PCI:0000:07:00): 13, Graphics SM Global Exception on (GPC 0, TPC 1): Physical Multiple Warp Errors
[ 7007.280444] NVRM: Xid (PCI:0000:07:00): 13, Graphics Exception: ESR 0x504e48=0x1000e 0x504e50=0x4 0x504e44=0x13eff2 0x504e4c=0x7f
[ 7007.280459] NVRM: Xid (PCI:0000:07:00): 13, Graphics Exception: ChID 0005, Class 0000a1c0, Offset 00001b0c, Data 00000000
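
The dmesg lines above follow the NVIDIA kernel driver's `NVRM: Xid (PCI:...): <code>, <message>` layout, so they can be grouped mechanically when comparing failures across tasks. A minimal Python sketch, assuming the log text has been saved or piped into a string; the function name `parse_xid_lines` and the regular expression are illustrative and not part of any NVIDIA tool:

```python
import re
from collections import Counter

# Matches lines such as:
# [ 7007.280427] NVRM: Xid (PCI:0000:07:00): 13, Graphics SM Warp Exception ...
XID_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), (?P<msg>.*)"
)

def parse_xid_lines(text):
    """Return one dict (time, pci, xid, msg) per Xid line found in text."""
    events = []
    for line in text.splitlines():
        m = XID_RE.search(line)
        if m:
            events.append({
                "time": float(m.group("time")),
                "pci": m.group("pci"),
                "xid": int(m.group("xid")),
                "msg": m.group("msg").strip(),
            })
    return events

# Three of the lines quoted in the post above.
SAMPLE = """\
[ 7007.280414] NVRM: GPU at PCI:0000:07:00: GPU-932eafe5-81a1-4d7e-b0cb-1f8f59518f5c
[ 7007.280427] NVRM: Xid (PCI:0000:07:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 1): Out Of Range Address
[ 7007.280437] NVRM: Xid (PCI:0000:07:00): 13, Graphics SM Global Exception on (GPC 0, TPC 1): Physical Multiple Warp Errors
"""

events = parse_xid_lines(SAMPLE)
counts = Counter(e["xid"] for e in events)  # how often each Xid code occurs
for e in events:
    print(e["pci"], e["xid"], e["msg"])
```

NVIDIA's Xid documentation lists code 13 as a graphics engine exception, which can originate in the application as well as in the hardware or driver, so the count alone does not identify the cause.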


What problems can cause this? Is the notebook hardware worn out from running GPUGrid 24/7?

Recently I drilled some holes under the laptop and put heatsinks on the GPU VRMs and some other parts. I'm also running a NotePal U3+ to push some air over the VRMs; they get really hot, too hot to touch. Most of the time the GPU runs below 70°C, with a maximum of 73°C. It used to crunch problem-free even at 90°C. So what could have been causing this for two weeks? Games and other crunching projects run normally on the GPU.

alvin
Joined: 13 Mar 12
Posts: 9
Credit: 112,964,646
RAC: 0
Level
Cys
Message 41716 - Posted: 31 Aug 2015 | 4:59:47 UTC

I have a huge ratio of errors to tasks, the same as the previous poster. I updated the drivers to 355 today; the BOINC version has been 7.6.6 for the last month or so. Could it be an issue with version 7.6.6, the driver, or something else?
As a task runs for days, it's not good to lose that much completed work.
Thanks

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,512,539
RAC: 953,866
Level
His
Message 41732 - Posted: 4 Sep 2015 | 19:33:23 UTC

Please post the computer ID that is having the problem, and also the details reported by GPU-Z. It seems possible that it is clocked too high.

alvin
Joined: 13 Mar 12
Posts: 9
Credit: 112,964,646
RAC: 0
Level
Cys
Message 41736 - Posted: 5 Sep 2015 | 0:13:39 UTC - in response to Message 41732.
Last modified: 5 Sep 2015 | 0:14:04 UTC

All GPUs are running at default clocks:
https://www.gpugrid.net/show_host_detail.php?hostid=172749
https://www.gpugrid.net/show_host_detail.php?hostid=172862

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,512,539
RAC: 953,866
Level
His
Message 41750 - Posted: 6 Sep 2015 | 4:45:58 UTC

For your GTS 450 computer, you may need to lower the clocks. You should try running the Heaven benchmark for several minutes, to see if it's stable at current clocks, and if not, you'll need to lower the GPU Offset using a tool like MSI Afterburner.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2686
Credit: 1,164,361,299
RAC: 432,706
Level
Met
Message 41755 - Posted: 6 Sep 2015 | 21:25:06 UTC
Last modified: 6 Sep 2015 | 21:25:19 UTC

The same BOINC & driver combination is working fine on my main PC, and there have been no general problem reports. So something on your side seems to be wrong. The task logs of the 450 show lines like this:

# Simulation unstable. Flag 9 value 8175
# Simulation unstable. Flag 10 value 8175
# The simulation has become unstable. Terminating to avoid lock-up

Which means serious garbage has been computed. A few things I'd do, in addition to what Jacob said:

- check if GPU fan still works
- power off, remove power cord, wait 10+ minutes and try again
- also lower the GPU memory clock
- try a different project
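
The "Simulation unstable" lines quoted above can be counted across a host's task logs to see whether its failures share this signature. A minimal Python sketch, assuming the log text has been copied from the task page into a string; `find_instability_flags` is a hypothetical helper, and the meaning of the flag numbers and values is not documented in this thread:

```python
import re

# Matches lines such as "# Simulation unstable. Flag 9 value 8175".
FLAG_RE = re.compile(r"#\s*Simulation unstable\. Flag (\d+) value (\d+)")
TERMINATED = "The simulation has become unstable"

def find_instability_flags(log_text):
    """Return ([(flag, value), ...], terminated) for one task log."""
    flags = [(int(flag), int(value)) for flag, value in FLAG_RE.findall(log_text)]
    terminated = TERMINATED in log_text
    return flags, terminated

# The three lines quoted in the post above.
SAMPLE_LOG = """\
# Simulation unstable. Flag 9 value 8175
# Simulation unstable. Flag 10 value 8175
# The simulation has become unstable. Terminating to avoid lock-up
"""

flags, terminated = find_instability_flags(SAMPLE_LOG)
print(flags, terminated)
```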

MrS
____________
Scanning for our furry friends since Jan 2002

alvin
Joined: 13 Mar 12
Posts: 9
Credit: 112,964,646
RAC: 0
Level
Cys
Message 41756 - Posted: 7 Sep 2015 | 1:13:58 UTC - in response to Message 41755.

Thanks.
My question was about your end: possible issues with drivers or a new version of the project. I have 5 hosts with Nvidia cards, and 3 of them, with older cards, have stopped getting tasks or fail to execute them.
Those hosts have been left untouched for the last couple of years, doing only their designated BOINC tasks; nothing is wrong with the fans and nothing has been changed.
Your advice is good, but meanwhile be aware of possible issues with drivers, BOINC client updates, your project file updates, Windows updates, etc.

Betting Slip
Joined: 5 Jan 09
Posts: 669
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 41760 - Posted: 7 Sep 2015 | 10:02:00 UTC - in response to Message 41756.
Last modified: 7 Sep 2015 | 10:37:47 UTC

The fact is, Costa, your hardware is too slow for this project now (at least for Long Runs). I'll bet that responders have thought it but daren't say it for fear of offending.

alvin
Joined: 13 Mar 12
Posts: 9
Credit: 112,964,646
RAC: 0
Level
Cys
Message 41763 - Posted: 7 Sep 2015 | 11:28:11 UTC - in response to Message 41760.

So I just cut long runs then?
Also, where is ATI support? I have a couple of cards.

Betting Slip
Joined: 5 Jan 09
Posts: 669
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 41765 - Posted: 7 Sep 2015 | 11:41:31 UTC - in response to Message 41763.
Last modified: 7 Sep 2015 | 12:00:24 UTC

Yes, cut long runs for sure, and then only run your 560 for short runs.

To put things into perspective: if you obtained a second-hand GTX 660 Ti and got rid of the rest, your RAC would go up by approximately 300%, you would be able to do Long Runs, and you would save a lot of money on electricity costs. You should be able to get a 660 Ti off eBay for no more than £60; I don't know how much that is in Australian dollars.

As for ATI cards, unfortunately this project has made the decision not to develop an app for anything but CUDA, which means Nvidia only. I can't see that changing anytime soon.

Good luck to you.

Richard


This might be a good buy http://www.ebay.com.au/itm/Asus-Geforce-GTX-660-Ti-2GB-Graphics-Card-/141767298182?hash=item2101fd4c86

alvin
Joined: 13 Mar 12
Posts: 9
Credit: 112,964,646
RAC: 0
Level
Cys
Message 41766 - Posted: 7 Sep 2015 | 12:27:39 UTC - in response to Message 41765.

thanks mate

Killersocke
Joined: 18 Oct 13
Posts: 45
Credit: 246,484,695
RAC: 264,192
Level
Leu
Message 41769 - Posted: 7 Sep 2015 | 15:46:36 UTC
Last modified: 7 Sep 2015 | 15:47:37 UTC

I agree with him.
I have also been seeing a lot of errors for the past 2 weeks.
And no, it's not the GPU overclocking that is so often blamed here.
All other programs run without errors on my system.
System: Windows 10
CPU: i7
GPU: NVIDIA GTX 760

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,512,539
RAC: 953,866
Level
His
Message 41770 - Posted: 7 Sep 2015 | 16:47:20 UTC
Last modified: 7 Sep 2015 | 16:49:25 UTC

What is the exact make and model of your GPU? And what does GPU-Z show for values "GPU Clock" and "Default clock"?

Believe it or not, we are trying to help, but you have to be willing to provide details, and be willing to investigate/troubleshoot by doing things like downclocking when requested.

GPUGrid puts a certain kind of stress on GPUs that other distributed apps and games cannot replicate. I've seen firsthand that I've had to downclock in order to get stability in GPUGrid, whereas the same GPU can run other apps/games at higher clocks. So downclocking to attain GPUGrid stability is normal here.
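
The two GPU-Z values requested above can be compared directly to see whether a card is running above its default clock. A minimal Python sketch; the helper `clock_offset_percent` and the example readings are hypothetical and are not taken from any host in this thread:

```python
def clock_offset_percent(gpu_clock_mhz, default_clock_mhz):
    """Percent by which the running clock exceeds (+) or trails (-) the default."""
    if default_clock_mhz <= 0:
        raise ValueError("default clock must be positive")
    return 100.0 * (gpu_clock_mhz - default_clock_mhz) / default_clock_mhz

# Hypothetical (GPU Clock, Default Clock) readings in MHz.
readings = {"card A": (1150, 1000), "card B": (783, 783)}

for name, (gpu_clock, default_clock) in readings.items():
    offset = clock_offset_percent(gpu_clock, default_clock)
    status = "overclocked" if offset > 0 else "at or below default"
    print(f"{name}: {offset:+.1f}% ({status})")
```

A positive offset covers factory overclocks as well as user ones, which is why both values are worth posting.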

Killersocke
Joined: 18 Oct 13
Posts: 45
Credit: 246,484,695
RAC: 264,192
Level
Leu
Message 41772 - Posted: 7 Sep 2015 | 16:56:18 UTC - in response to Message 41770.
Last modified: 7 Sep 2015 | 16:57:05 UTC

Jacob,
I'm NOT interested in restarting an old discussion with you about overclocking.

https://www.gpugrid.net/forum_thread.php?id=4097&nowrap=true#41152

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,512,539
RAC: 953,866
Level
His
Message 41773 - Posted: 7 Sep 2015 | 17:01:10 UTC
Last modified: 7 Sep 2015 | 17:03:27 UTC

So, you can't be bothered to provide the details, or to try things to fix the problem? Well, then you can't be helped.

Yes, sometimes batches have problems. But if you're having problems with multiple batches, then it's time to start troubleshooting, if you want to fix it. If you want to have a bad attitude, and not troubleshoot, then you can't expect it to fix itself.

alvin
Joined: 13 Mar 12
Posts: 9
Credit: 112,964,646
RAC: 0
Level
Cys
Message 41775 - Posted: 7 Sep 2015 | 17:17:51 UTC

Failing crunchers:
Nvidia GTX 275, 633/633 MHz
Nvidia GTX 450, 783/783 MHz

Drivers updated to latest.
Win7 32-bit

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,512,539
RAC: 953,866
Level
His
Message 41776 - Posted: 7 Sep 2015 | 17:42:17 UTC - in response to Message 41775.
Last modified: 7 Sep 2015 | 17:45:35 UTC

GTX 275: 633 MHz reference clock, compute capability 1.3 (Tesla), 55 nm fab
GTS 450: 783 MHz reference clock, compute capability 2.1 (Fermi)

GPUGrid relies on newer compute capabilities, and newer driver/Cuda versions, and can't easily support older GPUs.
https://www.gpugrid.net/forum_thread.php?id=2507
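
The compute capabilities listed above can be compared as (major, minor) pairs. A minimal Python sketch; the card data comes from this post, while `MIN_CC` is a hypothetical threshold chosen only for illustration, since this thread does not state the minimum that GPUGrid requires:

```python
# Compute capability per card, as given in the post above.
CARDS = {
    "GTX 275": (1, 3),  # Tesla
    "GTS 450": (2, 1),  # Fermi
}

# Hypothetical minimum, for illustration only; check the project's
# requirements for the real value.
MIN_CC = (2, 0)

def meets_minimum(cc, minimum=MIN_CC):
    # Tuples compare element by element, so (2, 1) >= (2, 0) and (1, 3) < (2, 0).
    return cc >= minimum

for name, cc in CARDS.items():
    verdict = "meets minimum" if meets_minimum(cc) else "below minimum"
    print(f"{name}: compute capability {cc[0]}.{cc[1]} ({verdict})")
```

Comparing tuples rather than floats avoids mistakes such as treating a capability of 1.10 as smaller than 1.3.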

NVIDIA no longer develops drivers for the pre-Fermi GPUs. They try to support them, but it's very limited support.
http://nvidia.custhelp.com/app/answers/detail/a_id/3473

My recommendation would be:
- Have the GTX 275 work on some other project, like maybe Folding@Home or SETI@Home
- Continue to troubleshoot the GTS 450 if you want to do GPUGrid work, by doing things like cleaning the fans, using MSI Afterburner to set a custom fan curve to keep the temperature below 70°C, doing clean driver installs, lowering clocks even below reference, etc.

https://en.wikipedia.org/wiki/CUDA
https://en.wikipedia.org/wiki/GeForce_200_series
https://en.wikipedia.org/wiki/GeForce_400_series

Regards,
Jacob

Betting Slip
Joined: 5 Jan 09
Posts: 669
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 41780 - Posted: 7 Sep 2015 | 22:21:11 UTC - in response to Message 41776.
Last modified: 7 Sep 2015 | 22:22:32 UTC

My advice on this occasion only would be to shut up.

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,512,539
RAC: 953,866
Level
His
Message 41781 - Posted: 7 Sep 2015 | 22:24:02 UTC - in response to Message 41780.
Last modified: 7 Sep 2015 | 22:25:44 UTC

Wow. Have a better day.

Killersocke
Joined: 18 Oct 13
Posts: 45
Credit: 246,484,695
RAC: 264,192
Level
Leu
Message 41782 - Posted: 8 Sep 2015 | 0:02:03 UTC - in response to Message 41773.

Very helpful answer :-(
Do you really think I would post about errors if the same batches only ever failed for me?

Ex:
http://www.gpugrid.net/workunit.php?wuid=11166870
http://www.gpugrid.net/workunit.php?wuid=11175897

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,512,539
RAC: 953,866
Level
His
Message 41783 - Posted: 8 Sep 2015 | 0:48:14 UTC
Last modified: 8 Sep 2015 | 0:48:44 UTC

The error you are getting appears to be different from the other errors for those work units. So, yes, I still believe you should be making changes to your system in an attempt to resolve the problem you are having. Sorry, but it's the truth.

Betting Slip
Joined: 5 Jan 09
Posts: 669
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 41784 - Posted: 8 Sep 2015 | 8:23:11 UTC - in response to Message 41780.

Sorry Jacob, I can't even remember why I posted this last night after I got in from a night out. So once again, sorry!

My advice to MYSELF is: when you've had a few too many, step away from the keyboard.

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,512,539
RAC: 953,866
Level
His
Message 41785 - Posted: 8 Sep 2015 | 12:18:16 UTC

Hehehe... Too funny! Apology accepted. I know I come across as a loser sometimes, but I genuinely am trying to help.

alvin
Joined: 13 Mar 12
Posts: 9
Credit: 112,964,646
RAC: 0
Level
Cys
Message 41786 - Posted: 8 Sep 2015 | 13:06:38 UTC

Hahah guys, clearly those were the remains of some kind of fun party time.
Would you mind just removing that mess? But technically it still perfectly fits our topic, "Lotf of errors for two weeks" (and mistakes, I would say).

John
Joined: 15 Oct 11
Posts: 17
Credit: 81,085,378
RAC: 0
Level
Thr
Message 41788 - Posted: 8 Sep 2015 | 16:51:03 UTC
Last modified: 8 Sep 2015 | 16:51:50 UTC

I currently run 2 GTS 450's.
Win 7 Ult. x64
Both run stock.
Use driver 347.52.
Had some problems a while back with both cards. They were overclocked at the time.
Jacob suggested I go stock and even less.
Have not had a problem since.
Only use them for short runs as these take between 22-24hrs per task.
I know they are getting old for GPUGRID but are great space heaters here in The Great White North (Canada).
Just trying to make my 25 million before I consider retiring them.....

My 2 cents worth....

JR

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2048
Credit: 14,826,285,069
RAC: 2,412,335
Level
Trp
Message 41789 - Posted: 8 Sep 2015 | 16:54:35 UTC - in response to Message 41782.

Do you really think I would post about errors if the same batches only ever failed for me?

Ex:
http://www.gpugrid.net/workunit.php?wuid=11166870
http://www.gpugrid.net/workunit.php?wuid=11175897

These work units were finished successfully on the last host, so all you have demonstrated with them is that there are malfunctioning systems other than yours; it actually disproves that the batch itself is erroneous.
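
The reasoning above can be written down as a small rule: a work unit that any host completed successfully cannot be intrinsically faulty, so its earlier failures point at the hosts that failed. A minimal Python sketch with made-up result lists; the host names and the `classify_workunit` helper are hypothetical and do not reflect the actual results of the two work units linked above:

```python
def classify_workunit(results):
    """results: list of (host, succeeded) tuples for one work unit.

    Returns (verdict, failed_hosts). One success is enough to show the
    work unit can be computed correctly; without any success the cause
    stays undetermined.
    """
    failed_hosts = [host for host, succeeded in results if not succeeded]
    if any(succeeded for _, succeeded in results):
        return "work unit ok", failed_hosts
    return "undetermined", failed_hosts

# Hypothetical history: two hosts fail, a third one finishes the task.
history = [("host-1", False), ("host-2", False), ("host-3", True)]
verdict, suspects = classify_workunit(history)
print(verdict, suspects)

# Hypothetical history with no success: nothing can be concluded yet.
print(classify_workunit([("host-1", False), ("host-4", False)]))
```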

sis651
Joined: 25 Nov 13
Posts: 66
Credit: 162,605,097
RAC: 33,128
Level
Ile
Message 41803 - Posted: 13 Sep 2015 | 1:01:57 UTC

Finished a long run and am now 80% through a short one. Waiting for errors. Also, no errors on Einstein@Home tasks...

sis651
Joined: 25 Nov 13
Posts: 66
Credit: 162,605,097
RAC: 33,128
Level
Ile
Message 41805 - Posted: 13 Sep 2015 | 11:27:25 UTC

Finished a short unit, but not the next one! Now crunching another unit.
