Message boards : Number crunching : WU not completing

MichaelMac
Joined: 2 Sep 12
Posts: 16
Credit: 609,890,687
RAC: 0
Message 40913 - Posted: 20 Apr 2015 | 13:18:19 UTC

I've been running GPUGRID for almost 3 years with no problems. I have two rigs that I run this project on (the two with the best GPU cards that I have). My computer, 133536, is running tasks but they never end... The card is a GTX660.

Boinc is 7.4.42
Driver is 350.12

Is there a log file that I can check for errors? I'm not seeing any errors in the event log. The monitor for the video card shows the increased GPU activity that I expect to see, but the corresponding rise in temperature isn't there. It's at 33 degrees C, which is its normal idle temperature.

I've uninstalled and reinstalled the video driver. It ran one task, then started doing it again. It's running cool enough, and I have a nice Gold-rated power supply on it which is maintaining power just fine.

Thanks for any help!

MichaelMac
Joined: 2 Sep 12
Posts: 16
Credit: 609,890,687
RAC: 0
Message 40944 - Posted: 24 Apr 2015 | 16:40:46 UTC

Ok, an update. I found the log in the task link for the error. Does this mean that there's trouble with the driver or with the card?


Stderr output

<core_client_version>7.4.42</core_client_version>
<![CDATA[
<message>
aborted by user
</message>
<stderr_txt>
# GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1058MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# GPU 0 : 48C
# GPU 0 : 52C
# GPU 0 : 55C
# GPU 0 : 56C
# GPU 0 : 57C
# GPU 0 : 58C
# GPU 0 : 59C
# GPU 0 : 60C
# GPU 0 : 61C
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965.
# SWAN swan_assert 0
# GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1058MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# GPU 0 : 47C
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965.
# SWAN swan_assert 0
# GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1058MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# GPU 0 : 34C
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965.
# SWAN swan_assert 0
# GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1058MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# GPU 0 : 33C
# GPU 0 : 34C
# GPU 0 : 35C
# GPU 0 : 36C
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965.
# SWAN swan_assert 0
# GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1058MHz
# Memory clock : 3004MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# GPU 0 : 34C

</stderr_txt>
]]>


Thanks.

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,991,617,060
RAC: 146,649
Message 40966 - Posted: 27 Apr 2015 | 7:51:33 UTC - in response to Message 40944.
Last modified: 27 Apr 2015 | 7:54:26 UTC

On your XP/GTX660 system you had a problem with this WU,
e1s4_2-GERARD_FXCXCL12_LIG_11675311-0-1-RND5477_1

The Std Err says User aborted and SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965.
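
As a rough aid for reading Stderr output like the one quoted above, here is a minimal Python sketch that counts how often the application (re)started, which CUDA driver errors were logged, and the highest temperature reported. The function name `summarize_stderr` and the regular expressions are assumptions derived only from the log lines pasted in this thread, not from any documented GPUGRID log format.

```python
import re

# Patterns inferred from the log excerpts in this thread (an assumption,
# not a documented format).
START_RE = re.compile(r"^# GPU \[(?P<name>[^\]]+)\] Platform")
TEMP_RE = re.compile(r"^# GPU (?P<idx>\d+)\s*:\s*(?P<temp>\d+)C$")
CUDA_ERR_RE = re.compile(r"Cuda driver error (?P<code>\d+)")
UNSTABLE_RE = re.compile(r"simulation has become unstable", re.IGNORECASE)


def summarize_stderr(text):
    """Summarize restarts, errors and peak temperature in a stderr dump."""
    summary = {"starts": 0, "cuda_errors": [], "unstable": 0, "max_temp_c": None}
    for raw in text.splitlines():
        line = raw.strip()
        if START_RE.match(line):
            # Each banner line marks one start (or restart) of the app.
            summary["starts"] += 1
            continue
        temp = TEMP_RE.match(line)
        if temp:
            value = int(temp.group("temp"))
            if summary["max_temp_c"] is None or value > summary["max_temp_c"]:
                summary["max_temp_c"] = value
            continue
        err = CUDA_ERR_RE.search(line)
        if err:
            summary["cuda_errors"].append(int(err.group("code")))
        if UNSTABLE_RE.search(line):
            summary["unstable"] += 1
    return summary


SAMPLE = """\
# GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
# GPU 0 : 60C
# GPU 0 : 61C
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965.
# GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
# GPU 0 : 47C
"""

print(summarize_stderr(SAMPLE))
```

Several start banners with a fatal error between them, as in the log above, suggest the task kept restarting and failing rather than making progress.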

I also have a problem with a GERARD_FXCXCL12 WU on an XP system. The WU has run for 60h on a GTX670. It should have taken less than a day and should have finished several days ago. My WU has not check-pointed, so there is probably a design fault. While it's at 98.00% complete it is progressing at a very slow rate (0.01% every few minutes) and if it reaches 100% it might continue to run without actually completing...
If you experience this again, see if the WU has been checkpointing.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Richard Haselgrove
Joined: 11 Jul 09
Posts: 912
Credit: 2,197,798,745
RAC: 837,678
Message 40967 - Posted: 27 Apr 2015 | 8:17:32 UTC - in response to Message 40966.

skgiven wrote:
I also have a problem with a GERARD_FXCXCL12 WU on an XP system. The WU has run for 60h on a GTX670. It should have taken less than a day and should have finished several days ago. My WU has not check-pointed, so there is probably a design fault. While it's at 98.00% complete it is progressing at a very slow rate (0.01% every few minutes) and if it reaches 100% it might continue to run without actually completing...
If you experience this again, see if the WU has been checkpointing.

I have had several like this since upgrading to the cuda65 driver v347.88 on my Windows XP machines. Another symptom is that CPU usage drops to zero (drops to 0.0000 CPUs on the monitoring tool I use).

I find that suspending the task for a few seconds, then allowing it to restart, resumes normal progress and allows the task to complete and validate.
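
For anyone who wants to script the suspend-and-resume workaround rather than click through BOINC Manager, here is a minimal Python sketch built on BOINC's `boinccmd` tool and its `--task URL task_name {suspend|resume}` operation. It assumes `boinccmd` is on the PATH and allowed to talk to the local client; the helper names, the five-second pause, and the project URL you pass in are illustrative assumptions, so use the URL and task name exactly as your own client reports them.

```python
import subprocess
import time


def task_command(project_url, task_name, operation):
    """Build a boinccmd argument list for one task operation."""
    if operation not in ("suspend", "resume"):
        raise ValueError("operation must be 'suspend' or 'resume'")
    return ["boinccmd", "--task", project_url, task_name, operation]


def nudge_task(project_url, task_name, pause_seconds=5.0,
               run=subprocess.run, sleep=time.sleep):
    """Suspend a task, wait briefly, then resume it.

    `run` and `sleep` are injectable so the sequence can be dry-run
    without touching a real BOINC client.
    """
    run(task_command(project_url, task_name, "suspend"), check=True)
    sleep(pause_seconds)
    run(task_command(project_url, task_name, "resume"), check=True)
```

Calling `nudge_task(url, name)` with real values issues the two commands in order; nothing runs until you call it.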

MichaelMac
Joined: 2 Sep 12
Posts: 16
Credit: 609,890,687
RAC: 0
Message 40969 - Posted: 27 Apr 2015 | 12:02:08 UTC - in response to Message 40967.

I'll give that a go, and report back.

Should I consider downgrading my driver? Now that you mention it... I did start having problems after upgrading.

Thanks!

MichaelMac
Joined: 2 Sep 12
Posts: 16
Credit: 609,890,687
RAC: 0
Message 40972 - Posted: 28 Apr 2015 | 13:44:32 UTC

Ok, here's an update.

I uninstalled the driver and related software (version 350.12).

I rebooted in safe mode and used DDU v15.0.0.1 to do a more thorough (so I was told) uninstall.

I then rebooted back to Windows XP, and installed driver version 344.75 (the previous version that my system was working on).

It has now run two short-run tasks and one long-run task successfully. Previously I was only able to run a single task before the problems started. So far, it looks like it's working properly.

I'll update if there are any changes.

-MichaelMac

Richard Haselgrove
Joined: 11 Jul 09
Posts: 912
Credit: 2,197,798,745
RAC: 837,678
Message 40982 - Posted: 29 Apr 2015 | 16:04:31 UTC - in response to Message 40972.

Same here. 344.75 seems to be much better - haven't had a task stall since I downgraded (either here or at other projects), and it's new enough to run cuda65 - which was the reason for upgrading in the first place.

Robert Gammon
Joined: 28 May 12
Posts: 63
Credit: 714,377,621
RAC: 6,231
Message 41010 - Posted: 2 May 2015 | 16:10:10 UTC - in response to Message 40913.

e1s748_2-NOELIA_l690330-0-3-RND3494

This is a SHORT run that ran for 20 hours and was shown as less than 0.5% complete, with almost 5000 hours to completion.

Most SHORT runs on my GTX760 take 3-4 hours to complete.

This is the WORST of the errors I have seen from GPUGrid.

I am MERELY pointing out that this is NOT a problem UNIQUE to one user.

There is, to me, a fault in some of the work units.
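
The figures in this post can be sanity-checked with a simple linear extrapolation. Here is a minimal Python sketch, assuming progress is roughly linear in time; the function names and the `factor` threshold are hypothetical, chosen only for illustration. At exactly 0.5% after 20 hours the projection is about 3,980 hours remaining, so "less than 0.5%" and "almost 5000 hours" are consistent with each other.

```python
def estimated_hours_remaining(elapsed_hours, percent_done):
    """Linear extrapolation of remaining run time from progress so far."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    total_hours = elapsed_hours * 100.0 / percent_done
    return total_hours - elapsed_hours


def looks_stalled(elapsed_hours, percent_done, expected_hours, factor=5.0):
    """Flag a task whose projected total run time exceeds factor x expected."""
    projected_total = elapsed_hours + estimated_hours_remaining(
        elapsed_hours, percent_done)
    return projected_total > factor * expected_hours


# 20 hours elapsed at 0.5% done, for a task expected to take about 4 hours.
print(estimated_hours_remaining(20, 0.5))
print(looks_stalled(20, 0.5, expected_hours=4))
```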

Robert Gammon
Joined: 28 May 12
Posts: 63
Credit: 714,377,621
RAC: 6,231
Message 41011 - Posted: 2 May 2015 | 16:11:36 UTC - in response to Message 40913.

OOPS, this is on a GTX550, not a GTX760

John C MacAlister
Joined: 17 Feb 13
Posts: 180
Credit: 144,701,536
RAC: 1,539
Message 41100 - Posted: 17 May 2015 | 7:40:37 UTC

Hi, GPUGrid Folks:

Could someone please tell me whether the two recent failures are due to a problem with my PC or the server?

Thank you,

Workunit 10930521


Name: e13s4_e11s9f92-GERARD_FXCXCL12_LIG_23157812-0-1-RND0535_0
Workunit: 10930521
Created: 15 May 2015 | 15:21:14 UTC
Sent: 16 May 2015 | 8:33:20 UTC
Received: 17 May 2015 | 4:36:41 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -97 (0xffffffffffffff9f) Unknown error number
Computer ID: 214484
Report deadline: 21 May 2015 | 8:33:20 UTC
Run time (sec): 52,871.45
CPU time (sec): 10,006.15
Validate state: Invalid
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v8.47 (cuda65)
Stderr output
<core_client_version>7.4.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1110MHz
# Memory clock : 3304MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# GPU 0 : 74C
# GPU 1 : 63C
# GPU 1 : 64C
# GPU 1 : 66C
# GPU 1 : 67C
# GPU 0 : 75C
# GPU 1 : 68C
# GPU 0 : 76C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1110MHz
# Memory clock : 3304MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# GPU 0 : 50C
# GPU 1 : 49C
# GPU 0 : 56C
# GPU 1 : 54C
# GPU 0 : 60C
# GPU 1 : 57C
# GPU 0 : 64C
# GPU 1 : 61C
# GPU 0 : 66C
# GPU 1 : 62C
# GPU 0 : 68C
# GPU 0 : 69C
# GPU 1 : 63C
# GPU 0 : 70C
# GPU 1 : 64C
# GPU 0 : 71C
# GPU 1 : 65C
# GPU 0 : 72C
# GPU 1 : 66C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 1 : 67C
# GPU 0 : 75C
# GPU 1 : 68C
# GPU 0 : 76C
# BOINC suspending at user request (exit)
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1110MHz
# Memory clock : 3304MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# GPU 0 : 56C
# GPU 1 : 51C
# GPU 0 : 62C
# GPU 1 : 57C
# GPU 0 : 65C
# GPU 1 : 61C
# GPU 0 : 67C
# GPU 1 : 63C
# GPU 0 : 69C
# GPU 1 : 64C
# GPU 0 : 71C
# GPU 0 : 72C
# GPU 1 : 66C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 0 : 75C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 0 : 76C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 14195000)
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1110MHz
# Memory clock : 3304MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# The simulation has become unstable. Terminating to avoid lock-up (1)

</stderr_txt>
]]>

John C MacAlister
Joined: 17 Feb 13
Posts: 180
Credit: 144,701,536
RAC: 1,539
Message 41101 - Posted: 17 May 2015 | 7:42:21 UTC

Second failure

Workunit 10932386

Name: e17s7_e16s4f41-GERARD_FXCXCL12_LIG_15494362-0-1-RND8820_0
Workunit: 10932386
Created: 16 May 2015 | 22:06:23 UTC
Sent: 17 May 2015 | 3:40:23 UTC
Received: 17 May 2015 | 6:28:16 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -97 (0xffffffffffffff9f) Unknown error number
Computer ID: 214484
Report deadline: 22 May 2015 | 3:40:23 UTC
Run time (sec): 6,888.55
CPU time (sec): 1,263.70
Validate state: Invalid
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v8.47 (cuda65)
Stderr output
<core_client_version>7.4.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1110MHz
# Memory clock : 3304MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# GPU 0 : 75C
# GPU 1 : 63C
# GPU 1 : 66C
# GPU 1 : 67C
# GPU 1 : 68C
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 1770000)
# GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 660 Ti
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1110MHz
# Memory clock : 3304MHz
# Memory width : 192bit
# Driver version : r349_00 : 35012
# The simulation has become unstable. Terminating to avoid lock-up (1)

</stderr_txt>
]]>
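
The task page and the stderr message write the same exit status three ways: -97, 0xffffff9f and 0xffffffffffffff9f. These are one value shown as a signed integer and as 32-bit and 64-bit two's-complement patterns. A small Python sketch (helper names are hypothetical) shows the conversion; it does not say anything about what the application means by -97.

```python
def as_unsigned_hex(value, bits):
    """Render a signed integer as its two's-complement hex pattern."""
    mask = (1 << bits) - 1
    return hex(value & mask)


def as_signed(pattern, bits):
    """Interpret an unsigned bit pattern as a signed two's-complement integer."""
    sign_bit = 1 << (bits - 1)
    return (pattern ^ sign_bit) - sign_bit


print(as_unsigned_hex(-97, 32))   # 32-bit pattern, as in the <message> block
print(as_unsigned_hex(-97, 64))   # 64-bit pattern, as on the task page
print(as_signed(0xFFFFFF9F, 32))  # back to the signed exit code
```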

Betting Slip
Joined: 5 Jan 09
Posts: 669
Credit: 2,498,095,550
RAC: 0
Message 41102 - Posted: 17 May 2015 | 10:11:37 UTC - in response to Message 41101.

It looks like you have your card overclocked. Ease down the OC or increase power.

The clue is "The simulation has become unstable"

Robert Gammon
Joined: 28 May 12
Posts: 63
Credit: 714,377,621
RAC: 6,231
Message 41103 - Posted: 17 May 2015 | 11:38:24 UTC - in response to Message 40913.

I just aborted a short-run Noelia task that was stuck at 0.88%, with an indicated time to completion of over 25,000 hours.

I saw NOTHING updated in the log file.

This SAME card is running a long-run Gerard task normally: the percentage is incrementing and the estimated run time is dropping every few seconds.

This is a GTX550Ti running on a Xeon 2620 under Ubuntu Linux 15.04 with all updates installed and Nvidia drivers 246.59.

This is not an overclocked card; it is at stock settings, with no options in Nvidia Settings (I did have the fan control and overclock options at one time, but even then I could NOT overclock, only REDUCE the clock).

And the PS is 1050W rated and the CPU is running at 51C

John C MacAlister
Joined: 17 Feb 13
Posts: 180
Credit: 144,701,536
RAC: 1,539
Message 41106 - Posted: 18 May 2015 | 13:59:19 UTC

Hi, everyone:

ALL three of my recent failures have 'GERARD' in common in the WU name. Two failures occurred, one on each of my two GTX 660 Ti devices, and one occurred today on one of my 650 Ti devices. None of my cards or CPUs (AMD FX-8350 for the 660 Ti cards and AMD Phenom 1090T for the 650 Ti) is overclocked.

Any suggestions?

Thanks,

John

John C MacAlister
Joined: 17 Feb 13
Posts: 180
Credit: 144,701,536
RAC: 1,539
Message 41108 - Posted: 18 May 2015 | 17:09:25 UTC

Issue found & fixed: too many CPU tasks running with GPUGrid.

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,587,539
RAC: 893,726
Message 41113 - Posted: 20 May 2015 | 0:06:39 UTC - in response to Message 41103.

Robert Gammon wrote:
I just aborted a short-run Noelia task that was stuck at 0.88%, with an indicated time to completion of over 25,000 hours.

I saw NOTHING updated in the log file.

This SAME card is running a long-run Gerard task normally: the percentage is incrementing and the estimated run time is dropping every few seconds.

This is a GTX550Ti running on a Xeon 2620 under Ubuntu Linux 15.04 with all updates installed and Nvidia drivers 246.59.

This is not an overclocked card; it is at stock settings, with no options in Nvidia Settings (I did have the fan control and overclock options at one time, but even then I could NOT overclock, only REDUCE the clock).

And the PS is 1050W rated and the CPU is running at 51C


Robert:
What is the exact make and model of the GPU devices that are leading to "Simulation has become unstable"?

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,587,539
RAC: 893,726
Message 41114 - Posted: 20 May 2015 | 0:08:10 UTC - in response to Message 41108.

John C MacAlister wrote:
Issue found & fixed: too many CPU tasks running with GPUGrid.


I don't think that can be an actual cause. I may be wrong, but I'd be interested in knowing how you came to that conclusion.

Killersocke
Joined: 18 Oct 13
Posts: 45
Credit: 246,484,695
RAC: 264,192
Message 41121 - Posted: 21 May 2015 | 11:46:03 UTC

Here are my results:
all tasks abort with the same error code.
No overclocking, etc.

BOINC: 7.4.42
NVIDIA Driver: 347.88

9 May 2015 - 21 May 2015

e1s468_4-NOELIA_ETQ_bound-1-2-RND6921
e3s206_e1s24f53-NOELIA_l6903301-2-3-RND6031
e1s958_8-NOELIA_ETQ_bound-0-2-RND8280
e3s78_e1s24f24-NOELIA_l6903301-2-3-RND0130
e2s136_e1s99f53-NOELIA_l6903301-2-3-RND5303
e2s499_e1s122f52-NOELIA_l6903301-0-3-RND2621

-97 (0xffffffffffffff9f) Unknown error number

The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 1050000)
# GPU [GeForce GTX 760] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 760
# ECC : Disabled
# Global mem : 2048MB
# Capability : 3.0
# PCI ID : 0000:01:00.0
# Device clock : 1071MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r346_00 : 34788
# The simulation has become unstable. Terminating to avoid lock-up (1)

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,587,539
RAC: 893,726
Message 41122 - Posted: 21 May 2015 | 12:15:52 UTC - in response to Message 41121.
Last modified: 21 May 2015 | 12:18:18 UTC

Killersocke:
What is the exact make and model of the GPU that is giving you problems? The reason I ask is so we can determine whether the GPU is factory-overclocked. If it is, then the next step would be to use a tool like PrecisionX to apply a negative GPU clock offset, so that the clock matches the reference clock... and then retest.

Killersocke
Joined: 18 Oct 13
Posts: 45
Credit: 246,484,695
RAC: 264,192
Message 41124 - Posted: 22 May 2015 | 11:43:15 UTC - in response to Message 41122.

Hi Jacob,

NVIDIA System Information report created on: 05/21/2015 15:34:56
System name: Killersocke

[Display]
Operating system: Windows 8.1 Pro, 64-bit
DirectX version: 11.0
GPU processor: GeForce GTX 760 GPU GK104
Driver version: 347.88
Direct3D API version: 11.2
Direct3D feature level: 11_0
CUDA cores: 1152
Core clock: 1006 MHz
Memory data rate: 6008 MHz
Memory interface: 256-bit
Memory bandwidth: 192.26 GB/s
Total available graphics memory: 4096 MB
Dedicated video memory: 2048 MB GDDR5
System video memory: 0 MB
Shared system memory: 2048 MB
Video BIOS version: 80.04.BF.00.06
IRQ: 16
Bus: PCI Express x16 Gen3
Device ID: 10DE 1187 84721043
Part number: 2004 0010

[Components]

nvui.dll 8.17.13.4788 NVIDIA User Experience Driver Component
nvxdsync.exe 8.17.13.4788 NVIDIA User Experience Driver Component
nvxdplcy.dll 8.17.13.4788 NVIDIA User Experience Driver Component
nvxdbat.dll 8.17.13.4788 NVIDIA User Experience Driver Component
nvxdapix.dll 8.17.13.4788 NVIDIA User Experience Driver Component
NVCPL.DLL 8.17.13.4788 NVIDIA User Experience Driver Component
nvCplUIR.dll 8.1.800.0 NVIDIA Control Panel
nvCplUI.exe 8.1.800.0 NVIDIA Control Panel
nvWSSR.dll 6.14.13.4788 NVIDIA Workstation Server
nvWSS.dll 6.14.13.4788 NVIDIA Workstation Server
nvViTvSR.dll 6.14.13.4788 NVIDIA Video Server
nvViTvS.dll 6.14.13.4788 NVIDIA Video Server
nvDispSR.dll 6.14.13.4788 NVIDIA Display Server
NVMCTRAY.DLL 8.17.13.4788 NVIDIA Media Center Library
nvDispS.dll 6.14.13.4788 NVIDIA Display Server
PhysX 09.14.0702 NVIDIA PhysX
NVCUDA.DLL 8.17.13.4788 NVIDIA CUDA 7.0.29 driver
nvGameSR.dll 6.14.13.4788 NVIDIA 3D Settings Server
nvGameS.dll 6.14.13.4788 NVIDIA 3D Settings Server

John C MacAlister
Joined: 17 Feb 13
Posts: 180
Credit: 144,701,536
RAC: 1,539
Message 41125 - Posted: 22 May 2015 | 14:21:52 UTC - in response to Message 41114.

I tried running all CPU cores on BOINC WUs with two GPUGrid WUs. GPUGrid WUs failed every time.

I find I must run 5 BOINC CPU WUs max on my AMD FX-8350 8 core PC with two GPUGrid WUs to prevent failures.
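
BOINC's computing preferences let you cap the share of processors it uses, so the limit described here can be expressed as a percentage. A minimal Python sketch of that arithmetic (the function name is hypothetical; how the client rounds fractional values is not covered here):

```python
def cpu_percent_for(max_cpu_tasks, total_cores):
    """Percentage of cores to allow so that at most max_cpu_tasks run."""
    if not 0 < max_cpu_tasks <= total_cores:
        raise ValueError("max_cpu_tasks must be between 1 and total_cores")
    return 100.0 * max_cpu_tasks / total_cores


# 5 CPU work units on an 8-core FX-8350 leaves 3 cores free.
print(cpu_percent_for(5, 8))
```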

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,587,539
RAC: 893,726
Message 41145 - Posted: 26 May 2015 | 1:13:42 UTC - in response to Message 41124.
Last modified: 26 May 2015 | 1:15:05 UTC

So... as far as I can understand, you don't know the exact make/model/manufacturer. But we're able to see:

GPU processor: GeForce GTX 760 GPU GK104
Core clock: 1006 MHz
Memory interface: 256-bit
Dedicated video memory: 2048 MB GDDR5


... which means that you have a 256-bit GTX 760 with a factory clock of 1006 MHz.

Looking at the wiki listing here:
http://en.wikipedia.org/wiki/GeForce_700_series
... The base core clock for your GPU is actually 980 MHz. This means, to my knowledge, that your GPU is factory-overclocked, and could be causing the failures.

I recommend installing EVGA Precision X and using it to set the GPU Clock Offset value to -26 (1006 MHz - 980 MHz = 26 MHz, so you would be running at the 980 MHz reference clock), then seeing if that helps at all. You could even try lower values, like -100 or -200, to test.
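
The offset needed to reach the reference clock is plain subtraction. As a minimal Python sketch (the function name is hypothetical), using the 1006 MHz core clock from the system report and the 980 MHz reference base clock cited above:

```python
def clock_offset_mhz(reported_mhz, reference_mhz):
    """Offset to apply so the reported clock drops to the reference clock."""
    return reference_mhz - reported_mhz


print(clock_offset_mhz(1006, 980))  # -26
```

A negative result means the card runs above reference and has to be clocked down by that many MHz.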

Regards,
Jacob

Killersocke
Joined: 18 Oct 13
Posts: 45
Credit: 246,484,695
RAC: 264,192
Message 41159 - Posted: 27 May 2015 | 7:28:36 UTC - in response to Message 41145.

I don't think that is the underlying problem.
All other applications, programs and BOINC projects
are stable here.

https://www.gpugrid.net/forum_thread.php?id=4097

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,587,539
RAC: 893,726
Message 41163 - Posted: 27 May 2015 | 12:29:30 UTC
Last modified: 27 May 2015 | 12:30:24 UTC

Can you please remove the overclock, to at least test and rule that out?

I have had overclocks where everything works great, except GPUGrid tasks, because they work parts of the GPU in different ways. So, when testing, it's best to remove the overclock (or even put it at -200), to confirm that it resolves the problems.

Gerard
Volunteer moderator
Project developer
Project scientist
Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 41165 - Posted: 27 May 2015 | 12:56:54 UTC

If the problems you were having were with any of the workunits named "EQUI_26Apr_CXCL", most likely the problem was ours. These workunits were cancelled this morning (Spain time). Thanks for your understanding, and sorry for any inconvenience caused.

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,587,539
RAC: 893,726
Message 41167 - Posted: 27 May 2015 | 13:20:20 UTC

The problems in this thread ... are different than the "EQUI_26Apr_CXCL" TDR tasks.

Any time it says "has become unstable", I continue to recommend taking the base clock down to attempt to resolve it. I wish people would listen :)

Gerard
Volunteer moderator
Project developer
Project scientist
Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 41205 - Posted: 29 May 2015 | 18:20:42 UTC - in response to Message 41167.

In my experience, when the simulation gives an error of the type "has become unstable", it is mainly because of some misconfiguration of the molecular system (usually it means that an explosion occurred in some molecule due to extreme forces; this is what was happening in my case).

On the other hand, I have also noticed errors of the type "cuda errors", which are usually unsolvable and are related to some specific cards or to some random error in the calculation.

The third type is when no error is found and the simulation seems to run on indefinitely. I got some of these recently with the corrupted "EQUI_26Apr_CXCL" batch.

Do you think "has become unstable" errors could also be caused by overclocking? I doubt it; I would rather expect a "cuda error".

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2048
Credit: 14,826,576,669
RAC: 2,426,205
Message 41206 - Posted: 29 May 2015 | 18:37:23 UTC - in response to Message 41205.
Last modified: 29 May 2015 | 18:56:03 UTC

Gerard wrote:
Do you think "has become unstable" errors could be also caused by overclocking?

It definitely does. (EDIT2: it's like the older error: "energies have become NAN")

EDIT3: these two tasks errored out on my new Palit JetStream GTX980, because it is factory overclocked and MSI Afterburner raised its GPU clock by the same number of MHz as I had set for my standard GTX980. 14198435 14198172
Now both cards run fine at 1420MHz.

EDIT: There should be some safety-check calculation (with known results) built into the client, which would regularly check the condition of the GPU (say, every 20 minutes).

EDIT4: the client should detect the real clock of the GPU somehow.
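
The idea of a periodic check against a known result can be sketched as follows. This is only an illustration of the control flow in Python on the CPU, not anything the GPUGRID client implements: a real check would have to run its reference workload on the GPU itself, and every name here (`reference_workload`, `check_due`, `hardware_ok`) and the choice of workload are assumptions.

```python
import hashlib
import struct


def reference_workload(n=10000):
    """Deterministic floating-point workload; returns a digest of the result."""
    acc = 0.0
    for i in range(1, n + 1):
        acc += (i % 7) * 0.5 / i
    # Hash the exact bit pattern so any single-bit deviation is detected.
    return hashlib.sha256(struct.pack("<d", acc)).hexdigest()


def check_due(now_s, last_check_s, interval_s=20 * 60):
    """True when at least interval_s seconds have passed since the last check."""
    return now_s - last_check_s >= interval_s


def hardware_ok(compute, expected_digest):
    """Re-run the workload and compare it with the known-good digest."""
    return compute() == expected_digest


# In practice the expected digest would be precomputed on known-good hardware.
EXPECTED = reference_workload()
print(hardware_ok(reference_workload, EXPECTED))
```

A mismatch would indicate that the hardware produced a wrong result at its current clocks, which is the condition such a check would be meant to catch.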

Richard Haselgrove
Joined: 11 Jul 09
Posts: 912
Credit: 2,197,798,745
RAC: 837,678
Message 41207 - Posted: 29 May 2015 | 19:09:31 UTC - in response to Message 41206.

Hardware can also go out of tolerance if it's not given the correct supply voltage, either from the host PSU or via the BIOS/regulators on the card itself. It's possibly more likely that power components, rather than the calculation components, will suffer from aging when subjected to the continuous stress of GPGPU work.

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,587,539
RAC: 893,726
Message 41208 - Posted: 29 May 2015 | 19:25:43 UTC - in response to Message 41206.
Last modified: 29 May 2015 | 19:26:11 UTC

Gerard wrote:
Do you think "has become unstable" errors could be also caused by overclocking?

Retvari Zoltan wrote:
It definitely does.


Yes. Definitely. Based on my own experience with 3 factory-overclocked GPUs.

skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,991,617,060
RAC: 146,649
Message 41212 - Posted: 31 May 2015 | 8:23:23 UTC - in response to Message 41208.
Last modified: 31 May 2015 | 8:31:04 UTC

If the power, voltage, GPU clocks or GDDR5 clocks are too high for any given task then the task can fail. This is more commonly seen on smaller cards, which can be weaker by design yet more fully used/pushed to their max (especially on XP & Linux, where GPU usage is often 99%).
GPU usage varies by task type/batch. This is why one setup or OC might work for one batch but not another, and I too had issues with some factory clocks on some smaller cards in the past.

I've even seen situations where running some CPU WUs causes the CPU to run hot enough to raise the temperature of GPU0 by several degrees C. Just running climate models, for example, can increase power usage by 30W, and that mostly ends up as heat in the case if you have a basic heatsink and fan cooler.

However, on a decent setup (with a GPU fan profile), GPU core clocks or temps may or may not be the reason tasks fail.


skgiven
Volunteer moderator
Project tester
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,991,617,060
RAC: 146,649
Message 41269 - Posted: 7 Jun 2015 | 10:27:40 UTC - in response to Message 41212.

Tried 353.06 on an XPx86 system with a GTX770 (GPU0) and a GTX670 (GPU1).
While the GTX670 ran at ~98% Power and ~95% GPU usage, the GTX770's power remained at 48% constantly - it had downclocked and wouldn't be coaxed back to what it should be (despite restarts and task swapping).
I've tried other recent drivers too but had similar experiences.
Half expecting a GPU power related issue I went back to 344.75 to see if it's the driver, the mobo, a connector...
Now both the GTX770 and GTX670 are running at ~95% GPU usage. The power usage for the GTX670 is 98% while the power usage of the GTX770 is ~78%.

I think it's fair to conclude that the 344.75 driver works well on Windows XP while the more recent drivers do not.
