
Message boards : Number crunching : New version of ACEMD requires libboost v1.74

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,011,851
RAC: 8,755,489
Message 57206 - Posted: 17 Jul 2021 | 20:19:42 UTC

So, late on a Saturday evening, I get sent WU 27077711. Initial replication 2, quorum 1. Mine was _6, sent after six previous failures.

I could have handled that, because I've installed the required library. But it was 'cancelled by server', because nobody else has.

Everyone, please refer to Message 57067. It only takes seconds, and you don't even have to reboot.
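For anyone who wants to check a host before the next task arrives, here is a minimal sketch. It assumes the missing file is `libboost_filesystem.so.1.74.0`; the thread only says "libboost v1.74", so substitute whatever name shows up in your own task's stderr. The helper name `can_load` is made up for illustration.

```python
import ctypes


def can_load(soname: str) -> bool:
    """Return True if the dynamic loader can find and open `soname`."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        # Raised when the library is missing or cannot be opened.
        return False


if __name__ == "__main__":
    # Assumed file name; the thread itself only mentions "libboost v1.74".
    name = "libboost_filesystem.so.1.74.0"
    print(name, "found" if can_load(name) else "NOT found")
```

If this reports "NOT found", the library is not visible to the loader on that host and the workaround from Message 57067 has probably not been applied there.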

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,331,959
RAC: 6,516,137
Message 57209 - Posted: 18 Jul 2021 | 3:51:40 UTC - in response to Message 57206.

I managed to pick up 3 cryptic scout resends today and successfully crunched and validated them.

888
Joined: 28 Jan 21
Posts: 6
Credit: 106,022,917
RAC: 0
Message 57213 - Posted: 18 Jul 2021 | 19:19:37 UTC

I think the solution is for the app to be updated - which should have happened already, not for thousands of users to try and install something extra.
I've installed libboost v1.74 and am getting a different error.
I shouldn't have to diagnose, alter and tweak my system, when the issue is with the app itself - Hopefully the developers will fix this soon.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,603,807,483
RAC: 72,688,420
Message 57214 - Posted: 18 Jul 2021 | 22:24:51 UTC - in response to Message 57213.

> I think the solution is for the app to be updated - which should have happened already, not for thousands of users to try and install something extra.
> I've installed libboost v1.74 and am getting a different error.
> I shouldn't have to diagnose, alter and tweak my system, when the issue is with the app itself - Hopefully the developers will fix this soon.


I definitely agree. The app should be updated to include the necessary package.

ServicEnginIC
Joined: 24 Sep 10
Posts: 566
Credit: 5,943,102,024
RAC: 10,881,026
Message 57215 - Posted: 19 Jul 2021 | 7:28:35 UTC

>> I think the solution is for the app to be updated - which should have happened already, not for thousands of users to try and install something extra.
>> I've installed libboost v1.74 and am getting a different error.
>> I shouldn't have to diagnose, alter and tweak my system, when the issue is with the app itself - Hopefully the developers will fix this soon.
>
> I definitely agree. The app should be updated to include the necessary package.

+1

On top of the missing libboost v1.74 library, there is a second problem: many WUs in the current batch appear to be wrongly constructed.
Even with the libboost workaround applied locally, these tasks fail anyway with this other error:

EXCEPTIONAL CONDITION: /home/user/conda/conda-bld/acemd3_1618916459379/work/src/mdio/bincoord.c, line 193: "nelems != 1"

...and they eventually die out once the "max # of errors" limit is reached.

The number of tasks in progress is down to 3 at this time, which is consistent with the project flushing the task buffer before a new app version is launched.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,603,807,483
RAC: 72,688,420
Message 57219 - Posted: 19 Jul 2021 | 19:56:38 UTC - in response to Message 57215.

>>> I think the solution is for the app to be updated - which should have happened already, not for thousands of users to try and install something extra.
>>> I've installed libboost v1.74 and am getting a different error.
>>> I shouldn't have to diagnose, alter and tweak my system, when the issue is with the app itself - Hopefully the developers will fix this soon.
>>
>> I definitely agree. The app should be updated to include the necessary package.
>
> +1
>
> On top of the missing libboost v1.74 library, there is a second problem: many WUs in the current batch appear to be wrongly constructed.
> Even with the libboost workaround applied locally, these tasks fail anyway with this other error:
>
> EXCEPTIONAL CONDITION: /home/user/conda/conda-bld/acemd3_1618916459379/work/src/mdio/bincoord.c, line 193: "nelems != 1"
>
> ...and they eventually die out once the "max # of errors" limit is reached.
>
> The number of tasks in progress is down to 3 at this time, which is consistent with the project flushing the task buffer before a new app version is launched.


yeah i've seen that on all new tasks that have come through recently.

they seem to be malformed from the project. not a problem with missing packages here, just a problem with the WU itself.
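To tell the two failure modes apart when skimming task stderr, something like the following sketch could help. The assertion pattern is taken from the error line quoted in this thread; the missing-library pattern assumes the usual glibc loader wording ("error while loading shared libraries: ...: cannot open shared object file"), which is not quoted here. The function name `classify` is hypothetical.

```python
import re

# Matches the assertion quoted in this thread; path and line number may vary.
ASSERT_RE = re.compile(
    r'EXCEPTIONAL CONDITION: (?P<file>\S+), line (?P<line>\d+): "(?P<cond>[^"]*)"'
)
# Assumed wording: the standard glibc loader message for a missing shared object.
MISSING_LIB_RE = re.compile(
    r"error while loading shared libraries: (?P<lib>\S+): cannot open shared object file"
)


def classify(stderr_text: str) -> str:
    """Label a stderr dump as a host-side or a work-unit-side failure."""
    m = MISSING_LIB_RE.search(stderr_text)
    if m:
        return f"missing library: {m.group('lib')}"
    m = ASSERT_RE.search(stderr_text)
    if m:
        return f"app assertion: {m.group('cond')} ({m.group('file')}:{m.group('line')})"
    return "unknown"
```

A "missing library" result points at the host setup, while an "app assertion" result like the `bincoord.c` one points at the task input, which no local patch will fix.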


Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,605,011,851
RAC: 8,755,489
Message 57224 - Posted: 23 Jul 2021 | 11:51:42 UTC

I've just got a new ADRIA task:

e1s5_I6-ADRIA_test_acemd3_update_KIXCMYB-1-2-RND1396

'test_acemd3_update' sounds hopeful, but:
1) Although created today, it's still being sent out with the ACEMD v2.12 (cuda1121) application.
2) The first user spat their copy out with the familiar libboost error. I got the second copy: I've patched my own machine, and it's running normally.

So, the test doesn't seem to be telling us anything we didn't know already.

trigggl
Joined: 6 Mar 09
Posts: 25
Credit: 102,324,681
RAC: 0
Message 57248 - Posted: 8 Aug 2021 | 19:58:27 UTC


I'm done until it gets fixed on the server side. My only boost option is 1.76.0 in gentoo.
1.74.0 isn't available.
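Why a newer Boost does not help: Boost's Linux shared libraries carry the full version in their file name, and the loader looks for the exact name recorded in the binary, so 1.76.0 is not accepted in place of 1.74.0. A small sketch of that check follows; the file names are examples, and the helpers `installed_versions` and `has_exact` are hypothetical.

```python
import re

# e.g. "libboost_filesystem.so.1.76.0" -> component "filesystem", version "1.76.0"
SONAME_RE = re.compile(r"^libboost_(?P<component>\w+)\.so\.(?P<version>\d+\.\d+\.\d+)$")


def installed_versions(filenames, component):
    """Return the sorted Boost versions found for one component."""
    found = set()
    for name in filenames:
        m = SONAME_RE.match(name)
        if m and m.group("component") == component:
            found.add(m.group("version"))
    return sorted(found)


def has_exact(filenames, component, required_version):
    """True only on an exact version match; a newer release does not count."""
    return required_version in installed_versions(filenames, component)
```

With only `libboost_filesystem.so.1.76.0` on disk, `has_exact(..., "filesystem", "1.74.0")` is false, which matches what is described above.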

ServicEnginIC
Joined: 24 Sep 10
Posts: 566
Credit: 5,943,102,024
RAC: 10,881,026
Message 57268 - Posted: 4 Sep 2021 | 11:26:51 UTC - in response to Message 57248.
Last modified: 4 Sep 2021 | 11:31:22 UTC

> I'm done until it gets fixed on the server side. My only boost option is 1.76.0 in gentoo.
> 1.74.0 isn't available.

This problem seems to have been corrected in the new ACEMD version 2.17 tasks.
I've seen several computers that previously failed for lack of libboost v1.74 now succeeding with v2.17.

Also, the problem caused by tasks restarting on a different device in mixed multi-GPU systems is avoided (rather than corrected) in the new ACEMD version 2.17.
But the way this known problem is avoided can waste performance on this kind of system: when "N" tasks are received simultaneously, all of them are effectively executed on device #0 only, thus multiplying the execution times by "N" while "N-1" GPUs stay idle...

Keith Myers
Joined: 13 Dec 17
Posts: 1284
Credit: 4,926,331,959
RAC: 6,516,137
Message 57269 - Posted: 4 Sep 2021 | 18:56:25 UTC

That's interesting. I'm betting the fix for the errors when restarting on a different device is what is causing all tasks to start on device #0.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1031
Credit: 35,603,807,483
RAC: 72,688,420
Message 57270 - Posted: 4 Sep 2021 | 19:43:13 UTC - in response to Message 57269.

It must be hard-coded to gpu0, or whatever check or communication it has with BOINC to see which GPU is available isn't working properly, so it somehow always thinks gpu0 is available. So every task runs there.
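To illustrate the second possibility, here is a hypothetical sketch, not ACEMD's actual code. It assumes the BOINC client hands the assigned GPU to the app as a `--device N` command-line argument; if the app never reads that argument, or reads it incorrectly, a fallback of 0 would produce exactly the behaviour described. The function name `pick_device` is made up.

```python
def pick_device(argv, default=0):
    """Return the GPU index given as `--device N`, or `default` if absent."""
    for i, arg in enumerate(argv):
        if arg == "--device" and i + 1 < len(argv):
            return int(argv[i + 1])
    # Nothing was passed (or it was ignored upstream): every task lands here.
    return default
```

With this logic, `pick_device(["acemd3", "--device", "2"])` selects GPU 2, while any task whose device argument is dropped ends up on GPU 0 alongside all the others.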

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 14 Mar 07
Posts: 1957
Credit: 629,356
RAC: 0
Message 57271 - Posted: 5 Sep 2021 | 0:44:47 UTC - in response to Message 57270.

Thanks for reporting.
