Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 9 Dec 08 Posts: 1006 Credit: 5,068,599 RAC: 0
|
The new acemd3 app should fix the issue.
Thanks for all the reporting!
Note that one still can't restart between different types of cards.
|
|
Aurum
Joined: 12 Jul 17 Posts: 401 Credit: 16,797,728,127 RAC: 2,406,961
|
That's great news, Toni. I hope you'll send a BOINC notice out when Linux is back in production so those of us on walkabout know to return.
|
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1358 Credit: 7,897,244,095 RAC: 6,517,334
|
Just change the "Switch between tasks every" setting in your computing preferences to something like 360 minutes, and the task should start and finish on the same card, avoiding the issue of restarting on a dissimilar card. If all your cards are the same brand and type (maybe only the same type), you can restart on a different card and finish with no errors.
|
|
|
Only two of the three Windows apps were updated: cuda92 and cuda101. Why not the cuda100 app too?
____________
Reno, NV
Team: SETI.USA
|
|
|
|
I managed to successfully complete a task after suspending and restarting it on the Windows 7 computer with an RTX 2080 Ti card. When I suspended the task, the wrapper and acemd3 disappeared from the Task Manager, and they reappeared when the task restarted:
http://www.gpugrid.net/result.php?resultid=21429745
On the Windows 10 computer, I ran a task on the GTX 980 Ti card and suspended and resumed it successfully. I suspended it again and rebooted the computer, and it restarted successfully. I received another task, which ran on the RTX 2080 Ti card successfully, side by side with the other task. I then suspended both tasks and restarted the task that had been running on the 980 Ti on the 2080 Ti instead, and it crashed.
http://www.gpugrid.net/result.php?resultid=21429733
I restarted the other task, which started on the 2080 Ti and is running well there right now:
http://www.gpugrid.net/result.php?resultid=21429800
You can't start a task on one card and restart it successfully on another card. I haven't yet tried rebooting the computer without first suspending the task.
|
|
|
|
...
On the Windows 10 computer, I ran a task on the GTX 980 Ti card and suspended and resumed it successfully. I suspended it again and rebooted the computer, and it restarted successfully. I received another task, which ran on the RTX 2080 Ti card successfully, side by side with the other task. I then suspended both tasks and restarted the task that had been running on the 980 Ti on the 2080 Ti instead, and it crashed.
http://www.gpugrid.net/result.php?resultid=21429733
I have this task now, and it's not loading the GPU at all. It's the second task like that I've had in the last couple days.
http://www.gpugrid.net/result.php?resultid=21429805
The other one failed on a suspend / restart, when I paused the client.
http://www.gpugrid.net/result.php?resultid=21429344
That one did validate on another machine, and it looks like the one I have now is slowly making progress, so I'll let it run at least overnight to give it a chance to complete.
____________
Team USA forum | Team USA page
Join us and #crunchforcures. We are now also folding: join team ID 236370!
|
|
rod4x4
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
Received a "New version of ACEMD v2.08 (cuda101)" Work Unit on a Win8.1 (update 1) Host with a GTX 750 Ti GPU.
Work Unit Name: e40s11_e37s6p1f279-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_4-0-2-RND1517_1
- Suspended Work Unit after 18 minutes (2.2% complete)
- Wrapper and ACEMD tasks disappeared from Task Manager.
- Resumed Work Unit, Wrapper and ACEMD tasks reappeared and Work Unit continued to process.
Then rebooted the PC after the Work Unit had been running for 21 minutes (without suspending the WU).
Work Unit successfully restarted and continues to process.
NOTES
- The Remaining (estimated) time does not seem to change or indicate an accurate run time. (only a small issue)
- Checkpoint seems to be every 90 seconds
- GTX750ti is running at 98% utilization and 94% power according to nvidia-smi. This GPU does not reach these figures on the old tasks.
This Work Unit may take another 13 hours to complete at the current rate.
Work Unit is here: http://www.gpugrid.net/result.php?resultid=21429814
It has not completed yet, but the results so far are encouraging!
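The "another 13 hours" figure is consistent with a simple linear extrapolation from the progress noted above (2.2% complete after 18 minutes). A minimal sketch, assuming progress stays linear for the whole run (which may not hold); the helper name is made up for illustration:

```python
def estimate_remaining_hours(elapsed_minutes, fraction_done):
    """Linearly extrapolate the remaining run time from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    # Projected total run time if the task keeps its current pace.
    total_minutes = elapsed_minutes / fraction_done
    return (total_minutes - elapsed_minutes) / 60


# Figures from the post: 2.2% complete after 18 minutes.
remaining = estimate_remaining_hours(elapsed_minutes=18, fraction_done=0.022)
print(f"Estimated time remaining: {remaining:.1f} hours")
```

This works out to a little over 13 hours, which matches the estimate in the post rather than the Remaining (estimated) time shown by the client.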
|
|
|
Hi,
Acemd3 WU in error at the end ... same GPU, no suspend/resume action ...
http://www.gpugrid.net/result.php?resultid=21430068
K.
____________
Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King)
|
|
rod4x4
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
Acemd3 WU in error at the end ... same GPU, no suspend/resume action ...
http://www.gpugrid.net/result.php?resultid=21430068
Your host has returned 2 "New Version ACEMD" work units, and both have ended in "upload failure".
Other WU: http://www.gpugrid.net/result.php?resultid=21428934
Failure Message:
<message>
upload failure: <file_xfer_error>
<file_name>e39s4_e33s7p1f250-ADRIA_FOLDUBQ_BANDIT_crystal_ss_contacts_50_ubiquitin_0-1-2-RND0503_0_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
The WUs appear to complete successfully:
Exit State: 0
Very curious, as all the "old" work units upload fine.
I think the key to the error is in the error_code: stat() failed
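As a rough illustration of what a failed stat() generally means: the caller asked the operating system for a file's metadata and the call returned an error, most commonly because no file exists at that path. The sketch below is not BOINC code. The helper name and file names are made up, and whether a missing or differently named output file is the actual cause here is only an assumption.

```python
import os
import tempfile


def check_output_file(path):
    """Return a short status string for a file that is about to be uploaded."""
    try:
        info = os.stat(path)
    except OSError as exc:
        # Raised when the file is missing, unreadable, and so on.
        return f"stat() failed: {exc.strerror}"
    return f"ok: {info.st_size} bytes"


# Demonstration with one file that exists and one that does not.
with tempfile.TemporaryDirectory() as workdir:
    present = os.path.join(workdir, "result_present")
    missing = os.path.join(workdir, "result_missing")
    with open(present, "wb") as handle:
        handle.write(b"12345")
    present_status = check_output_file(present)
    missing_status = check_output_file(missing)

print(present_status)
print(missing_status)
```

If something like this is what happens, the task itself can exit with state 0 while the upload still fails, because the check is made on the output file after the science app has finished.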
|
|
|
I rebooted the computer (without suspending the tasks), and these 2 tasks were able to restart and finish successfully afterwards:
http://www.gpugrid.net/result.php?resultid=21429991
http://www.gpugrid.net/result.php?resultid=21430924
There is still an issue with getting ACEMD v2.06 (cuda100) tasks. It happened on my Windows 10 machine, which has a Maxwell card and a Turing card. I wonder if there's a connection there?
http://www.gpugrid.net/result.php?resultid=21429972
The error was due to suspend and restart.
I also have an unexplained error:
http://www.gpugrid.net/result.php?resultid=21429886
It was running on the 980 Ti card. I suspended and restarted it successfully, and it was running fine when I left it. The next morning, I found that it had crashed. The 2080 Ti was running either Einstein or Milkyway tasks. Every once in a long while, the Einstein gamma-ray pulsar task causes the NVIDIA driver to crash momentarily and then restart. Maybe that, and the fact that it was running on a non-Turing card, are the reasons for this crash. After the task crashed, it caused Afterburner to crash, so I had to restart that as well.
|
|
|
rod4x4
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
There is still an issue with getting ACEMD v2.06 (cuda100) tasks. It happened on my Windows 10 machine, which has a Maxwell card and a Turing card. I wonder if there's a connection there?
http://www.gpugrid.net/result.php?resultid=21429972
The error was due to suspend and restart.
From what I understand, I think only ACEMD v2.08 survives the suspend/restart.
I also have an unexplained error:
http://www.gpugrid.net/result.php?resultid=21429886
Assuming no issues with the host or other projects, this looks like a new error. (Those assumptions would also need to be explored.)
From the Stderr output:
# Engine failed: Error invoking kernel: CUDA_ERROR_UNKNOWN (999)
The error appears after 7 minutes (2 minutes after the task was suspended and resumed).
|
|
rod4x4
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
Received another New version ACEMD v2.08 work unit on a Win7 Host with GTX750 GPU.
Suspended and resumed the work unit.
Work unit proceeded fine after suspend/resume and completed successfully.
Work unit here:
http://gpugrid.net/result.php?resultid=21433334
|
|
|
The new acemd3 app should fix the issue.
Thanks for all the reporting!
Note that one still can't restart between different types of cards.
When do you plan to release the ACEMD3 client for Linux?
I thought you would do it after the Windows client was fixed.
I know it still has (at least) one problem (restarting a task on a different card), but the Linux client probably has the same problem.
Alternatively, you could move part of the long workunits to the ACEMD3 queue (it has the new client for both platforms).
|
|
|
I'm not getting any work yet for any of my Linux machines, and I don't have Windows, so I have been out of luck since May, but I am sure E@H is pleased. I had been getting some of the beta tests, but nothing recently since the restart problems appear to have been resolved on the Windows systems.
|
|
|
Got a new ACEMD 2.06 WU (http://www.gpugrid.net/result.php?resultid=21443611) on one of my Linux machines. It is about 53% finished and was suspended and restarted once without issue. The wingman's WU failed on a Win10 machine with a little less than 2 min completed, reason unknown. Both systems used GTX 1060s. I'm using driver version 430.50 and BOINC 7.16.1 (Fedora distro). All is good with Linux so far.
|
|
|
I noticed that when I am running an acemd3 task on each of the cards (Maxwell and Turing), they run slower than when I am running one acemd3 task on one card and another type of task on the other card.
Acemd3 tasks running on both cards simultaneously:
21443009 16813137 12 Oct 2019 | 12:46:46 UTC 12 Oct 2019 | 17:53:01 UTC Completed and validated 17,927.90 17,526.83 75,000.00 New version of ACEMD v2.06 (cuda100)
21443008 16813136 12 Oct 2019 | 12:46:46 UTC 12 Oct 2019 | 17:08:35 UTC Completed and validated 7,590.81 7,420.91 75,000.00 New version of ACEMD v2.06 (cuda100)
21443007 16813135 12 Oct 2019 | 12:46:46 UTC 12 Oct 2019 | 19:14:35 UTC Completed and validated 7,557.09 7,356.45 75,000.00 New version of ACEMD v2.06 (cuda100)
21443006 16813134 12 Oct 2019 | 12:46:46 UTC 12 Oct 2019 | 15:01:54 UTC Completed and validated 7,589.88 7,461.47 75,000.00 New version of ACEMD v2.06 (cuda100)
Acemd3 tasks running side by side with non-acemd3 tasks:
21442682 16812889 12 Oct 2019 | 2:13:51 UTC 12 Oct 2019 | 4:09:23 UTC Completed and validated 6,604.02 6,413.02 75,000.00 New version of ACEMD v2.08 (cuda101)
21430016 16803567 8 Oct 2019 | 11:27:46 UTC 8 Oct 2019 | 13:22:21 UTC Completed and validated 6,498.40 6,246.75 75,000.00 New version of ACEMD v2.08 (cuda101)
21429800 16803495 8 Oct 2019 | 1:41:48 UTC 8 Oct 2019 | 3:40:53 UTC Completed and validated 6,519.76 6,327.67 75,000.00 New version of ACEMD v2.08 (cuda101)
21429286 16803018 7 Oct 2019 | 1:23:48 UTC 7 Oct 2019 | 11:50:24 UTC Completed and validated 16,540.09 16,367.14 75,000.00 New version of ACEMD v2.06 (cuda100)
Either Einstein, Milkyway or Long Runs are running on the other card.
I also noticed an issue with the scheduler not asking for GPU tasks when I had all these lines in my cc_config.xml file:
<exclude_gpu>
<url>http://www.gpugrid.net/</url>
<device_num>0</device_num>
<app>acemdlong</app>
</exclude_gpu>
<exclude_gpu>
<url>http://www.gpugrid.net/</url>
<device_num>0</device_num>
<app>acemdshort</app>
</exclude_gpu>
<exclude_gpu>
<url>http://www.gpugrid.net/</url>
<device_num>1</device_num>
<app>acemd3</app>
</exclude_gpu>
What I am telling BOINC is not to run long and short tasks on the Turing card and not to run acemd3 tasks on the Maxwell card. The logic works, but again the scheduler doesn't ask for GPU tasks, no matter what I set the cache size to, even though I have fewer than 2 tasks per card downloaded. (I downloaded the tasks before I ran this test.)
But if I delete this from the file and of course save it:
<exclude_gpu>
<url>http://www.gpugrid.net/</url>
<device_num>1</device_num>
<app>acemd3</app>
</exclude_gpu>
Everything works fine. The scheduler asks for GPU tasks.
Is this a BOINC problem or a GPUGRID problem?
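For anyone wanting to reproduce this, here is a small sketch that writes out the three exclusions listed above as one complete file. It assumes the fragments sit inside the `<options>` section of cc_config.xml, which is where BOINC documents `<exclude_gpu>`. The function name is made up, and the device numbering (0 = Turing, 1 = Maxwell) is taken from the description in the post.

```python
import xml.etree.ElementTree as ET

PROJECT_URL = "http://www.gpugrid.net/"

# (device_num, app) pairs from the post: keep the long/short apps off
# device 0 and keep acemd3 off device 1.
EXCLUSIONS = [(0, "acemdlong"), (0, "acemdshort"), (1, "acemd3")]


def build_cc_config(exclusions, url=PROJECT_URL):
    """Build cc_config.xml text with one <exclude_gpu> block per pair."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    for device_num, app in exclusions:
        block = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(block, "url").text = url
        ET.SubElement(block, "device_num").text = str(device_num)
        ET.SubElement(block, "app").text = app
    if hasattr(ET, "indent"):  # pretty-printing needs Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


xml_text = build_cc_config(EXCLUSIONS)
print(xml_text)
```

Dropping the last pair from `EXCLUSIONS` gives the configuration that let the scheduler ask for GPU tasks again.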
|
|
|
|
I noticed that when I am running an acemd3 task on each of the cards (Maxwell and Turing), they run slower than when I am running one acemd3 task on one card and another type of task on the other card.
Acemd3 tasks running on both cards simultaneously:
21443009 16813137 12 Oct 2019 | 12:46:46 UTC 12 Oct 2019 | 17:53:01 UTC Completed and validated 17,927.90 17,526.83 75,000.00 New version of ACEMD v2.06 (cuda100)
21443008 16813136 12 Oct 2019 | 12:46:46 UTC 12 Oct 2019 | 17:08:35 UTC Completed and validated 7,590.81 7,420.91 75,000.00 New version of ACEMD v2.06 (cuda100)
21443007 16813135 12 Oct 2019 | 12:46:46 UTC 12 Oct 2019 | 19:14:35 UTC Completed and validated 7,557.09 7,356.45 75,000.00 New version of ACEMD v2.06 (cuda100)
21443006 16813134 12 Oct 2019 | 12:46:46 UTC 12 Oct 2019 | 15:01:54 UTC Completed and validated 7,589.88 7,461.47 75,000.00 New version of ACEMD v2.06 (cuda100)
Acemd3 tasks running side by side with non-acemd3 tasks:
21442682 16812889 12 Oct 2019 | 2:13:51 UTC 12 Oct 2019 | 4:09:23 UTC Completed and validated 6,604.02 6,413.02 75,000.00 New version of ACEMD v2.08 (cuda101)
21430016 16803567 8 Oct 2019 | 11:27:46 UTC 8 Oct 2019 | 13:22:21 UTC Completed and validated 6,498.40 6,246.75 75,000.00 New version of ACEMD v2.08 (cuda101)
21429800 16803495 8 Oct 2019 | 1:41:48 UTC 8 Oct 2019 | 3:40:53 UTC Completed and validated 6,519.76 6,327.67 75,000.00 New version of ACEMD v2.08 (cuda101)
21429286 16803018 7 Oct 2019 | 1:23:48 UTC 7 Oct 2019 | 11:50:24 UTC Completed and validated 16,540.09 16,367.14 75,000.00 New version of ACEMD v2.06 (cuda100)
Either Einstein, Milkyway or Long Runs are running on the other card.
Here is an observation that contradicts my previous observation:
21444555 16814402 13 Oct 2019 | 14:27:59 UTC 13 Oct 2019 | 17:43:04 UTC Completed and validated 7,623.39 7,471.55 75,000.00 New version of ACEMD v2.06 (cuda100)
21444501 16814357 13 Oct 2019 | 12:40:33 UTC 13 Oct 2019 | 15:35:54 UTC Completed and validated 7,577.33 7,431.94 75,000.00 New version of ACEMD v2.06 (cuda100)
These two tasks ran on the Turing card while the Maxwell was running a long task, and they still ran slower. So what is causing this? The tasks being v2.06, the tasks themselves, or something else? I don't know! OK, no more theories, at least for a while.
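One way to test the "tasks being v2.06" idea is to group the run times quoted above by app version instead of by what the other card was doing. A rough sketch using only the numbers from the rows above: the 10,000-second cutoff is an assumption used to set aside the two long runs (presumably the ones on the slower Maxwell card), and the names are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# (task ID, run time in seconds, app version), copied from the rows above.
TASKS = [
    (21443009, 17927.90, "2.06"),
    (21443008, 7590.81, "2.06"),
    (21443007, 7557.09, "2.06"),
    (21443006, 7589.88, "2.06"),
    (21442682, 6604.02, "2.08"),
    (21430016, 6498.40, "2.08"),
    (21429800, 6519.76, "2.08"),
    (21429286, 16540.09, "2.06"),
    (21444555, 7623.39, "2.06"),
    (21444501, 7577.33, "2.06"),
]

# Assumption: runs longer than this were on the slower card and are skipped.
SLOW_CARD_CUTOFF = 10_000


def mean_runtime_by_version(tasks, cutoff=SLOW_CARD_CUTOFF):
    """Average run time per app version, ignoring runs above the cutoff."""
    grouped = defaultdict(list)
    for _task_id, run_time, version in tasks:
        if run_time < cutoff:
            grouped[version].append(run_time)
    return {version: mean(times) for version, times in grouped.items()}


averages = mean_runtime_by_version(TASKS)
for version, average in sorted(averages.items()):
    print(f"v{version}: {average:,.0f} s")
```

Grouped this way, the v2.06 runs average about 7,590 s and the v2.08 runs about 6,540 s, so in this small sample the difference lines up with the app version rather than with what the other card was running. That is only a correlation over ten tasks, not proof.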
|
|
|
Keith Myers
Joined: 13 Dec 17 Posts: 1358 Credit: 7,897,244,095 RAC: 6,517,334
|
Everything works fine. The scheduler asks for GPU tasks.
Is this a boinc problem or a GPUGRID problem?
Both. You are running a client that does not handle excludes well. The latest development version does much better.
Also, the project runs very old server software that should have been updated but has not been.
|
|