Message boards : Number crunching : Major SNAFU in Effect

Author Message
Nick Name
Joined: 3 Sep 13
Posts: 21
Credit: 908,936,769
RAC: 1,273,889
Level
Glu
Message 51786 - Posted: 14 May 2019 | 1:14:37 UTC
Last modified: 14 May 2019 | 1:20:02 UTC

I noticed a ton of errors on a previously 100% reliable host tonight. Looks like a bad batch of WUs got pushed out, both IDP and KIX jobs are affected.
IDP
http://www.gpugrid.net/workunit.php?wuid=16483464
http://www.gpugrid.net/workunit.php?wuid=16480175
http://www.gpugrid.net/workunit.php?wuid=16480417
http://www.gpugrid.net/workunit.php?wuid=16453242
KIX
http://www.gpugrid.net/workunit.php?wuid=16483553
http://www.gpugrid.net/workunit.php?wuid=16474311
http://www.gpugrid.net/workunit.php?wuid=16483548


I have 25 bad jobs in total that also have failed on numerous other hosts.

[edit]I should have said mine is a Linux host, and I just noticed most of the other hosts where work failed are also Linux machines.[/edit]
____________
Team USA forum | Team USA page
Always crunching / Always recruiting

PappaLitto
Joined: 21 Mar 16
Posts: 485
Credit: 4,101,445,401
RAC: 4,238,724
Level
Arg
Message 51789 - Posted: 14 May 2019 | 1:24:29 UTC

http://www.gpugrid.net/results.php?hostid=490728

Above is my host with erroring WUs

Bedrich Hajek
Joined: 28 Mar 09
Posts: 360
Credit: 4,673,105,389
RAC: 1,183,982
Level
Arg
Message 51790 - Posted: 14 May 2019 | 1:59:34 UTC - in response to Message 51789.

http://www.gpugrid.net/results.php?hostid=490728

Above is my host with erroring WUs



Did someone forget to renew a license?





Keith Myers
Joined: 13 Dec 17
Posts: 225
Credit: 234,017,213
RAC: 9,992
Level
Leu
Message 51791 - Posted: 14 May 2019 | 3:29:32 UTC

I'm getting nothing but comp errors on these new tasks also.

Jim1348
Joined: 28 Jul 12
Posts: 683
Credit: 1,371,521,768
RAC: 31,817
Level
Met
Message 51792 - Posted: 14 May 2019 | 4:18:58 UTC

Same here, of course. But I haven't seen anyone from the project around here for a while. Is anyone at home?

STARBASEn
Joined: 17 Feb 09
Posts: 64
Credit: 1,001,671,751
RAC: 17,902
Level
Met
Message 51793 - Posted: 14 May 2019 | 4:42:08 UTC

Same here as well. Error 212 on WUs that were running fine up to 4-5 hours ago. Sounds like a license thing to me as well. Suspended project until the issue is resolved.

DRSMT
Joined: 23 Feb 17
Posts: 18
Credit: 584,999,372
RAC: 142,321
Level
Lys
Message 51795 - Posted: 14 May 2019 | 6:00:57 UTC

Have the same issues on two Linux machines, so not sure if this is a license thing.

rod4x4
Joined: 4 Aug 14
Posts: 48
Credit: 1,376,806,719
RAC: 1,882,120
Level
Met
Message 51796 - Posted: 14 May 2019 | 6:18:42 UTC

For the last 2 years, the License error usually comes after July 1st. 12 month license, I am assuming.

Keith Myers
Joined: 13 Dec 17
Posts: 225
Credit: 234,017,213
RAC: 9,992
Level
Leu
Message 51798 - Posted: 14 May 2019 | 7:19:59 UTC

Every task I had in my cache on 4 hosts errored out today. Since I don't run very high resource allotment, some tasks had been running a couple of hours a day with no issues until today. The hosts are processing other projects without any errors during this time. I'd have to guess a license expired today.

Azmodes
Joined: 7 Jan 17
Posts: 9
Credit: 536,785,815
RAC: 506,304
Level
Lys
Message 51799 - Posted: 14 May 2019 | 7:56:34 UTC
Last modified: 14 May 2019 | 8:02:17 UTC

Same. I have two Ubuntu machines that throw up nothing but immediate errors now. My two Windows crunchers are fine, though.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2006
Credit: 14,641,725,819
RAC: 1,612,281
Level
Trp
Message 51801 - Posted: 14 May 2019 | 8:03:13 UTC

The Linux app is broken (most probably its license expired).
All of my Linux hosts run immediately into this error with every single workunit:

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 212 (0xd4, -44)</message>
<stderr_txt>
</stderr_txt>
]]>
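
For anyone puzzled by the three numbers in that message: they appear to be the same exit status written three ways (decimal, hex, and the low byte read as a signed 8-bit value). A minimal sketch to check that reading; the helper name `describe_exit_code` is made up for illustration and is not part of BOINC:

```python
from typing import Tuple


def describe_exit_code(code: int) -> Tuple[str, int]:
    """Return the hex form and the signed 8-bit reading of an exit code.

    Hypothetical helper for illustration only; not a BOINC function.
    """
    low_byte = code & 0xFF
    # Values of 128 and above wrap to negative when read as a signed byte.
    signed = low_byte - 256 if low_byte >= 128 else low_byte
    return hex(low_byte), signed


hex_form, signed_form = describe_exit_code(212)
print(hex_form, signed_form)
```

This only decodes the number. It says nothing about why the app chose to exit with 212.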

However, my Windows hosts are crunching happily, so I switched back to Windows on my Linux hosts.

The GPUGrid staff need to act on this without delay.

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 55
Credit: 582,539,698
RAC: 14,343
Level
Lys
Message 51802 - Posted: 14 May 2019 | 8:12:57 UTC
Last modified: 14 May 2019 | 8:13:21 UTC

Same over here:
http://www.gpugrid.net/forum_thread.php?id=4909&nowrap=true#51794

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 894
Credit: 2,076,424,820
RAC: 1,253,092
Level
Phe
Message 51804 - Posted: 14 May 2019 | 10:15:56 UTC

The Linux ACEMD v9.19 apps were deployed on 13/14 February 2018 - so it possibly looks like a 15 month licence expiry.

The Windows v9.22 apps were deployed on 26 July 2018, so with luck we have until late October for those...
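
The date arithmetic behind that guess can be checked with a short sketch. The 15-month term is only the hypothesis from this post, not a confirmed licence length, and `add_months` is a hypothetical helper written for the example:

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add calendar months to a date, clamping the day to the month's end."""
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


ASSUMED_TERM_MONTHS = 15  # assumption taken from the post, not confirmed

linux_deploy = date(2018, 2, 14)    # Linux v9.19 deployment date quoted above
windows_deploy = date(2018, 7, 26)  # Windows v9.22 deployment date quoted above

print("Linux expiry guess:  ", add_months(linux_deploy, ASSUMED_TERM_MONTHS))
print("Windows expiry guess:", add_months(windows_deploy, ASSUMED_TERM_MONTHS))
```

Under that assumption the Linux date lands on 14 May 2019, the day the errors started, and the Windows date lands on 26 October 2019.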

Applications

rod4x4
Joined: 4 Aug 14
Posts: 48
Credit: 1,376,806,719
RAC: 1,882,120
Level
Met
Message 51808 - Posted: 14 May 2019 | 12:22:42 UTC
Last modified: 14 May 2019 | 12:51:24 UTC

A temporary fix for Linux users is to set your system date back 1 year.

EDIT: Setting time back 1 year caused certificate errors with other projects. So I have now set time back 1 month. This seems to work better.

This has allowed me to start GPUgrid jobs successfully.

You may need to stop time sync services so the system does not reset the time back to current time.

For systemd-based distros (e.g. Ubuntu), sudo timedatectl set-ntp 0 will turn time sync off.

EDIT: you will need to reissue this command and reset time after each reboot. If this licensing issue persists, I will post a more permanent time sync fix

This was the temporary fix last year when license issues occurred.
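
A dry-run sketch of the workaround, for anyone who wants to see the exact commands before touching the clock. It assumes a systemd distro where `timedatectl` is available; the helper names are hypothetical, and the script only prints the commands rather than running them:

```python
import calendar
import shlex
from datetime import datetime


def one_month_back(now: datetime) -> datetime:
    """Return the same time one calendar month earlier (day clamped if needed)."""
    if now.month > 1:
        year, month = now.year, now.month - 1
    else:
        year, month = now.year - 1, 12
    day = min(now.day, calendar.monthrange(year, month)[1])
    return now.replace(year=year, month=month, day=day)


def build_commands(now: datetime) -> list:
    """Build (but do not run) the two commands for the clock workaround."""
    target = one_month_back(now)
    return [
        # Stop time sync first, otherwise the clock gets reset to the current time.
        ["sudo", "timedatectl", "set-ntp", "0"],
        ["sudo", "timedatectl", "set-time", target.strftime("%Y-%m-%d %H:%M:%S")],
    ]


for cmd in build_commands(datetime(2019, 5, 14, 12, 22, 42)):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Keep in mind the caveats already mentioned: a wrong clock can cause certificate errors with other projects, and the change has to be redone after each reboot.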

James C. Owens
Joined: 16 Apr 09
Posts: 4
Credit: 1,926,345,907
RAC: 818,635
Level
His
Message 51810 - Posted: 14 May 2019 | 14:35:36 UTC - in response to Message 51808.

Is project leadership aware of the licensing expiration? Seems like someone should be keeping a tickler file for this so that renewals could happen before WU's start erroring out.
____________

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2006
Credit: 14,641,725,819
RAC: 1,612,281
Level
Trp
Message 51811 - Posted: 14 May 2019 | 16:25:44 UTC - in response to Message 51810.
Last modified: 14 May 2019 | 16:25:53 UTC

Is project leadership aware of the licensing expiration?
Apparently not. That's why this SNAFU.

Seems like someone should be keeping a tickler file for this so that renewals could happen before WU's start erroring out.
True.

Jim1348
Joined: 28 Jul 12
Posts: 683
Credit: 1,371,521,768
RAC: 31,817
Level
Met
Message 51814 - Posted: 14 May 2019 | 17:37:27 UTC - in response to Message 51811.

There wasn't any notification of the pending shutdown of the Quantum Chemistry (CPU) work units either, or when they might be restarted.
I am not sure that there is any project leadership at the moment.

Keith Myers
Joined: 13 Dec 17
Posts: 225
Credit: 234,017,213
RAC: 9,992
Level
Leu
Message 51816 - Posted: 14 May 2019 | 19:51:19 UTC

I'm going to just suspend the project on all my hosts. The fact I have to exclude my Turing cards makes it difficult to work with the project anyway.

I'll just check back in occasionally and see if a new Linux app is available with current licensing.

Erich56
Joined: 1 Jan 15
Posts: 538
Credit: 2,927,727,519
RAC: 1,202,912
Level
Phe
Message 51817 - Posted: 14 May 2019 | 20:25:16 UTC - in response to Message 51810.

Seems like someone should be keeping a tickler file for this so that renewals could happen before WU's start erroring out.

Also in the past, license renewals were not done in time and tasks failed. Too bad, but it really seems that the people at GPUGRID simply forget about these things.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 894
Credit: 2,076,424,820
RAC: 1,253,092
Level
Phe
Message 51821 - Posted: 15 May 2019 | 6:46:56 UTC

Just in case anyone is still wondering, I've been sent WU 16485663.

Failed three times on Linux v9.19 hosts, now running normally under Windows v9.22

Confirms that it's an application problem, not a data problem.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 894
Credit: 2,076,424,820
RAC: 1,253,092
Level
Phe
Message 51822 - Posted: 15 May 2019 | 10:06:39 UTC

Got a PM reply from Toni:

Oh gosh, thanks ...

:-)

rod4x4
Joined: 4 Aug 14
Posts: 48
Credit: 1,376,806,719
RAC: 1,882,120
Level
Met
Message 51823 - Posted: 15 May 2019 | 13:54:40 UTC - in response to Message 51822.

Got a PM reply from Toni:

Hey Richard, thanks for raising this with admins.
Much appreciated!

STARBASEn
Joined: 17 Feb 09
Posts: 64
Credit: 1,001,671,751
RAC: 17,902
Level
Met
Message 51824 - Posted: 15 May 2019 | 15:34:35 UTC

So hopefully we will be back up and running shortly :). Thanks for bringing it to Toni's attention.

Aurum
Joined: 12 Jul 17
Posts: 86
Credit: 7,193,903,068
RAC: 326,321
Level
Tyr
Message 51825 - Posted: 15 May 2019 | 16:32:52 UTC

Will someone tell us when the FUBAR has finished???
____________

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 55
Credit: 582,539,698
RAC: 14,343
Level
Lys
Message 51828 - Posted: 16 May 2019 | 6:14:49 UTC

The problem is still not resolved...

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Erich56
Joined: 1 Jan 15
Posts: 538
Credit: 2,927,727,519
RAC: 1,202,912
Level
Phe
Message 51830 - Posted: 16 May 2019 | 8:38:08 UTC - in response to Message 51823.

Got a PM reply from Toni:

Hey Richard, thanks for raising this with admins.
Much appreciated!

What surprises me though is that no one from GPUGRID found out by themselves :-(

STARBASEn
Joined: 17 Feb 09
Posts: 64
Credit: 1,001,671,751
RAC: 17,902
Level
Met
Message 51838 - Posted: 16 May 2019 | 20:47:42 UTC

I aborted all my GPU WUs to let someone with Windows run them. I was hoping the certificate would be renewed by now so I could finish the ones I had time invested in and had suspended before they failed. No such luck :-(. Barely enough calendar time left to finish them anyway.

Matt Kowal
Joined: 27 May 14
Posts: 9
Credit: 74,343,001
RAC: 116,377
Level
Thr
Message 51839 - Posted: 16 May 2019 | 20:52:30 UTC

Toni responded in this thread: http://www.gpugrid.net/forum_thread.php?id=4925&nowrap=true#51834

We are aware of the problem. We'd like to do a major version upgrade rather than continue fixing the old one. For the time being, I'm deprecating the app for linux so crunching goes on on Windows rather than erroring out.

Zalster
Joined: 26 Feb 14
Posts: 171
Credit: 4,013,368,076
RAC: 1,929,938
Level
Arg
Message 51841 - Posted: 16 May 2019 | 22:01:32 UTC - in response to Message 51839.

So it looks like time to find a new project for the majority of my machines. Only have 1 that still runs M$
____________

mmonnin
Joined: 2 Jul 16
Posts: 230
Credit: 647,700,389
RAC: 16,535
Level
Lys
Message 51844 - Posted: 16 May 2019 | 22:41:24 UTC

I came back from the Pent to this. :( Thought my computers borked.

Keith Myers
Joined: 13 Dec 17
Posts: 225
Credit: 234,017,213
RAC: 9,992
Level
Leu
Message 51845 - Posted: 17 May 2019 | 1:01:26 UTC

So does anyone want to explain how a BOINC wrapper works? The docs don't really say anything about the mechanics involved.

What pre-requisites are there?

Anyone running a BOINC wrapper on other projects and care to elaborate?

tullio
Joined: 8 May 18
Posts: 157
Credit: 35,533,995
RAC: 4,234
Level
Val
Message 51846 - Posted: 17 May 2019 | 2:46:52 UTC

LHC@home uses a boincwrapper. All Windows, MAC OSX and other Linux distros can run their programs written in Scientific Linux. You must have VirtualBox installed.
Tullio
____________

mmonnin
Joined: 2 Jul 16
Posts: 230
Credit: 647,700,389
RAC: 16,535
Level
Lys
Message 51859 - Posted: 17 May 2019 | 22:04:04 UTC - in response to Message 51846.
Last modified: 17 May 2019 | 22:05:14 UTC

LHC@home uses a boincwrapper. All Windows, MAC OSX and other Linux distros can run their programs written in Scientific Linux. You must have VirtualBox installed.
Tullio


Nope, that's even more separation from the client, including the OS and environment details like specific libc versions. In the case of LHC they give the choice of VBox or setting up CVMFS and Singularity on your own, both of which are included in the vbox.vdi file.

https://boinc.berkeley.edu/trac/wiki/WrapperApp

If you DON'T want to include progress % complete, checkpointing, and GPU device # within your app, then the wrapper can do that.

Don't expect it to be as efficient as there is now another layer between the exe doing the calculations and hardware.

bcavnaugh
Joined: 8 Nov 13
Posts: 54
Credit: 702,969,950
RAC: 1,282,490
Level
Lys
Message 51863 - Posted: 18 May 2019 | 4:00:45 UTC
Last modified: 18 May 2019 | 4:01:40 UTC

So we can no longer run this BOINC GPU Project under BOINC version 7.9.3 on Ubuntu 18.04.2 LTS [4.15.0-51-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)] Running NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 390.11?

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2006
Credit: 14,641,725,819
RAC: 1,612,281
Level
Trp
Message 51866 - Posted: 18 May 2019 | 9:59:43 UTC - in response to Message 51863.

So we can no longer run this BOINC GPU Project under BOINC version 7.9.3 on Ubuntu 18.04.2 LTS [4.15.0-51-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)] Running NVIDIA GeForce GTX 1080 Ti (4095MB) driver: 390.11?
Correction:
We can not run this BOINC GPU Project (GPUGrid) on any Linux distro for a who-knows-how-long time period.

Aurum
Joined: 12 Jul 17
Posts: 86
Credit: 7,193,903,068
RAC: 326,321
Level
Tyr
Message 51875 - Posted: 18 May 2019 | 15:27:10 UTC

I bet it won't be long before we get Linux WUs again. In the meantime there's asteroids, einstein, milkyway & seti to keep one busy.
____________

MarkJ
Volunteer moderator
Project tester
Volunteer tester
Joined: 24 Dec 08
Posts: 732
Credit: 197,194,445
RAC: 45
Level
Ile
Message 51883 - Posted: 19 May 2019 | 3:43:46 UTC - in response to Message 51845.
Last modified: 19 May 2019 | 3:45:54 UTC

So does anyone want to explain how a BOINC wrapper works? The docs don't really say anything about the mechanics involved.

From what I understand, it's a wrapper program they put around their normal (non-BOINC) science app and use to invoke it. No pre-reqs. No need for vbox. That way the wrapper handles the BOINC interaction and allows the use of a non-BOINC app.

See https://boinc.berkeley.edu/trac/wiki/WrapperApp for docs.
____________
BOINC blog

Keith Myers
Joined: 13 Dec 17
Posts: 225
Credit: 234,017,213
RAC: 9,992
Level
Leu
Message 51886 - Posted: 19 May 2019 | 17:56:49 UTC - in response to Message 51883.

Thanks, I had already read that document and was and still am confused. I gather it is not a VM. So assume you don't need virtualization on the cpu?

Why does BOINC offer versions of BOINC+Virtual Box if this mechanism does not require VBox?

Does VBox do more or less than a wrapper? What are the limitations of a wrapper compared to VBox?

Does the application wrapped in a wrapper have to be native code for the platform? With a VM you could run an app not native to the platform.

mmonnin
Joined: 2 Jul 16
Posts: 230
Credit: 647,700,389
RAC: 16,535
Level
Lys
Message 51887 - Posted: 19 May 2019 | 20:45:18 UTC - in response to Message 51886.

Thanks, I had already read that document and was and still am confused. I gather it is not a VM. So assume you don't need virtualization on the cpu?

Why does BOINC offer versions of BOINC+Virtual Box if this mechanism does not require VBox?

Does VBox do more or less than a wrapper? What are the limitations of a wrapper compared to VBox?

Does the application wrapped in a wrapper have to be native code for the platform? With a VM you could run an app not native to the platform.


The wrapper does not need VBox. It's just another interface that performs the BOINC-related functions, while the project's 'math.exe' or whatever does the crunching and ONLY performs calculations.

VBox can set up the entire OS environment to satisfy all the specifics needed to crunch. If a project needs extra programs that do not typically come with an OS, or that people do not normally install, then those can be included in the vbox image. Again using LHC as the example, Singularity and CVMFS are included in the image. They can also make one vbox image that serves both Windows and Linux host OSs.
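
To make the wrapper idea concrete, here is a purely conceptual sketch of that split: a supervising process that launches a plain worker program and does the bookkeeping for it. This is NOT the BOINC wrapper's actual code (the real one is configured through a job file and talks to the client through the BOINC API, see the WrapperApp docs); `run_wrapped` and its parameters are hypothetical names for illustration:

```python
import subprocess
import sys
import time


def run_wrapped(command, poll_seconds=0.1, report=print):
    """Launch a worker that knows nothing about BOINC and supervise it.

    Hypothetical sketch: the worker only computes, the supervisor does the
    reporting. No virtual machine is involved, so the worker must be a native
    program for the host platform.
    """
    worker = subprocess.Popen(command)
    started = time.monotonic()
    while worker.poll() is None:
        # A real wrapper would report progress to the client at this point.
        report(f"worker running, {time.monotonic() - started:.1f}s elapsed")
        time.sleep(poll_seconds)
    report(f"worker exited with code {worker.returncode}")
    return worker.returncode


if __name__ == "__main__":
    # Stand-in for the science app: a child Python process doing some arithmetic.
    run_wrapped([sys.executable, "-c", "print(sum(range(10**6)))"])
```

The worker runs directly on the host OS as an ordinary child process, which is why a wrapper adds very little memory compared with a full VBox image.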

Aurum
Joined: 12 Jul 17
Posts: 86
Credit: 7,193,903,068
RAC: 326,321
Level
Tyr
Message 51888 - Posted: 19 May 2019 | 23:36:21 UTC

Is the BOINC wrapper a memory hog like virtualbox???
____________

mmonnin
Joined: 2 Jul 16
Posts: 230
Credit: 647,700,389
RAC: 16,535
Level
Lys
Message 51889 - Posted: 20 May 2019 | 0:07:34 UTC
Last modified: 20 May 2019 | 0:09:35 UTC

I'm trying to think of projects that use it. Going through project folders, it looks like DrugDiscovery CPU Goofy, MindModeling and CAS used it. DHEP, Gerasium, Moo, SRBase, Enigma, YoYo and Yafu are active projects that have a wrapper in the exe name. Some Yoyo ECM tasks can use like 8GB, but I think that's the data, as it's limited to certain types. But nothing like LHC Atlas using 10GB for the other projects. VBox apps are huge because it's an entire image.

It seems like most GPUGrid crunching is done in Windows as the stats have only gone down from about 600m to 400m per day.

Keith Myers
Joined: 13 Dec 17
Posts: 225
Credit: 234,017,213
RAC: 9,992
Level
Leu
Message 51890 - Posted: 20 May 2019 | 2:31:55 UTC

That still shows the Linux hosts responsible for 1/3 of the total credit. And since the percentage of Linux hosts is 37% compared to 54% for Windows hosts, the Linux hosts are showing a greater percentage of higher production hosts compared to Windows hosts.

It would benefit the project to return the Linux hosts to participation.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 894
Credit: 2,076,424,820
RAC: 1,253,092
Level
Phe
Message 51891 - Posted: 20 May 2019 | 11:24:17 UTC - in response to Message 51890.

It would benefit the project to return the Linux hosts to participation.

Which is why the PM which got Toni's attention had the subject line

Research being delayed - Linux apps broken

:-)

bcavnaugh
Joined: 8 Nov 13
Posts: 54
Credit: 702,969,950
RAC: 1,282,490
Level
Lys
Message 52084 - Posted: 14 Jun 2019 | 1:02:09 UTC

Been a while, any news?
