
Message boards : Number crunching : Error on ethTRYP

Profile ritterm
Joined: 31 Jul 09
Posts: 88
Credit: 244,413,897
RAC: 0
Message 23543 - Posted: 19 Feb 2012 | 13:09:09 UTC

I recently returned to GPUGrid after a long absence and haven't been having any problems (that I know of) until this ethTRYP workunit errored out overnight. Stderr out is:

<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 570"
# Clock rate: 1.46 GHz
# Total amount of global memory: 1275658240 bytes
# Number of multiprocessors: 15
# Number of cores: 120
MDIO: cannot open file "restart.coor"
ERROR: get_Dvec() element 0 (b)
called boinc_finish

</stderr_txt>
]]>

I'm not sure if these are to be expected every once in a while or if it's indicative of a problem with the host or GPU.
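For anyone comparing several failed tasks, here is a minimal Python sketch that pulls the exit code, GPU name, and error lines out of a stderr dump like the one above. The function name `parse_stderr` and the shortened `SAMPLE` text are only illustrative, and the line patterns are assumed from this single log, so other app versions may print something different.

```python
import re

# Shortened copy of the stderr dump quoted above (illustrative sample only).
SAMPLE = """<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# Device 0: "GeForce GTX 570"
MDIO: cannot open file "restart.coor"
ERROR: get_Dvec() element 0 (b)
called boinc_finish
</stderr_txt>
"""


def parse_stderr(text):
    """Extract the exit code, device name, and error lines from a stderr dump.

    The patterns are assumptions based on the one log in this thread.
    """
    info = {"exit_code": None, "exit_hex": None, "device": None, "errors": []}

    # Example: "- exit code 98 (0x62)"
    match = re.search(r"exit code (-?\d+) \((0x[0-9a-fA-F]+)\)", text)
    if match:
        info["exit_code"] = int(match.group(1))
        info["exit_hex"] = int(match.group(2), 16)

    # Example: '# Device 0: "GeForce GTX 570"'
    match = re.search(r'# Device \d+: "([^"]+)"', text)
    if match:
        info["device"] = match.group(1)

    # Keep the lines that look like error reports.
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(("ERROR:", "MDIO:")):
            info["errors"].append(line)

    return info


if __name__ == "__main__":
    result = parse_stderr(SAMPLE)
    print(result["exit_code"], result["device"])
    for err in result["errors"]:
        print(err)
```

The decimal and hex values are parsed separately so they can be checked against each other (98 and 0x62 are the same number).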

Thanks for any words of advice.

MarkR

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 23545 - Posted: 19 Feb 2012 | 13:42:28 UTC - in response to Message 23543.

From the cruncher's perspective, it's just a fact that you get the odd task failure. There are some generic things you can try if the number of failures starts increasing, but one failure in, say, 50 isn't a big issue; overall performance is reduced by no more than 2%.
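The 2% figure is simple arithmetic, sketched below in Python. It assumes all tasks are about the same length and that a failed task wastes at most its full runtime; the function names and the 0.5 example value are hypothetical, not anything from the project.

```python
def worst_case_loss(failed, total):
    """Upper bound on wasted work: every failed task is assumed to have
    run to full length before erroring out."""
    return failed / total


def expected_loss(failed, total, mean_fraction_completed):
    """Wasted work if failed tasks die, on average, after completing
    `mean_fraction_completed` of their runtime (a value between 0 and 1)."""
    return failed / total * mean_fraction_completed


if __name__ == "__main__":
    # One failure in 50 tasks, the example from the post above.
    print(f"worst case: {worst_case_loss(1, 50):.1%}")
    # Hypothetical: failures happen halfway through on average.
    print(f"halfway:    {expected_loss(1, 50, 0.5):.1%}")
```

Since tasks usually fail partway through, the real loss should sit below the worst-case bound.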

It might be your setup (overheating, clocks too high, or something else you can change), a bug (for Ignasi to fix), or deprecated clients that introduced problems into results from earlier generations of tasks, silently corrupting the WU on upload.

ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Message 23566 - Posted: 20 Feb 2012 | 10:17:50 UTC - in response to Message 23543.

Unfortunately, these tasks have a failure rate of 22%. There are several causes, from tasks aborted via the GUI to unknown app errors.

Most of these WUs are running fine, so I'd rule out a bug as the cause. You may want to check skgiven's suggestions.

cheers,
i

Profile ritterm
Joined: 31 Jul 09
Posts: 88
Credit: 244,413,897
RAC: 0
Message 23567 - Posted: 20 Feb 2012 | 13:09:07 UTC

Thanks for the feedback, you guys.

I don't overclock and I don't believe I have any overheating issues in this host. If I see more errors, I'll try some of the other things like dedicating a CPU, changing the driver, etc. Otherwise, I'll just chalk this one up to bad luck.

MarkR

Profile ritterm
Joined: 31 Jul 09
Posts: 88
Credit: 244,413,897
RAC: 0
Message 23599 - Posted: 21 Feb 2012 | 21:40:39 UTC
Last modified: 21 Feb 2012 | 22:36:57 UTC

Just an FYI in case it's worthwhile, the WU I referenced in my original post resulted in compute errors for two other hosts before being successfully completed.

And, although it's technically off-topic, this metTRYP errored out for my host and two others and has been sent to a fourth.

[edit]Now four errors and sent to a fifth host... :-)[/edit]

Simba123
Joined: 5 Dec 11
Posts: 147
Credit: 69,970,684
RAC: 0
Message 23601 - Posted: 22 Feb 2012 | 1:19:23 UTC - in response to Message 23599.

Just an FYI in case it's worthwhile, the WU I referenced in my original post resulted in compute errors for two other hosts before being successfully completed.

And, although it's technically off-topic, this metTRYP errored out for my host and two others and has been sent to a fourth.

[edit]Now four errors and sent to a fifth host... :-)[/edit]



Yeah, I was one of them :(. My second error out of 60-odd tasks.
I didn't mind this one too much, as it failed after only 6500 seconds.

This one (http://www.gpugrid.net/workunit.php?wuid=3174532), though,
failed after 37000 seconds :/
It took 5 errors and 1 abort before finally being completed by the 6th user.

ignasi
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Message 23604 - Posted: 22 Feb 2012 | 10:28:12 UTC - in response to Message 23602.

Some tasks become corrupted, error out on one host after another, and eventually die. Again, the causes are a mystery, but the impact on the batch overall is very small.

I apologize for the inconvenience anyway.
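As a rough sanity check on why serial failures point at the task rather than the hosts, here is a small Python sketch. It assumes the 22% failure rate mentioned earlier in the thread applies to each attempt and that attempts on different hosts are independent; both are simplifications, and `prob_all_fail` is just an illustrative name.

```python
def prob_all_fail(p_fail, attempts):
    """Probability that `attempts` independent tries all fail,
    given a per-attempt failure probability `p_fail`."""
    return p_fail ** attempts


if __name__ == "__main__":
    # 0.22 is the batch-wide failure rate quoted earlier in the thread.
    for k in range(1, 6):
        print(f"{k} failure(s) in a row: {prob_all_fail(0.22, k):.4%}")
```

Under those assumptions, five failures in a row would happen by chance only about 0.05% of the time, so a WU that fails on host after host most likely has something wrong with it specifically.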

i
