Message boards : Number crunching : Cause of quantum chemistry task failures: md5sum errors

Michael H.W. Weber
Joined: 9 Feb 16
Posts: 55
Credit: 582,539,698
RAC: 14,343
Message 51129 - Posted: 28 Dec 2018 | 16:44:19 UTC
Last modified: 28 Dec 2018 | 16:46:39 UTC

I have a single Ubuntu Linux machine participating in GPUGRID using its CPU. Apart from a few correctly completed QC tasks, by now this machine has produced 28 "compute errors" after just a few seconds of run time each (0 secs CPU time). Checking the error logs yields the following message for all 28 tasks:

WARNING: md5sum mismatch of tar archive
expected: 75a9f0faa822a01dfe0e0e5c43400ed0
got: dfc9f09eb6b6771c69d6cf10b91bc6c9 -

bunzip2: Data integrity error when decompressing.

I noticed that WU downloads (and communication with the server in general) are extremely slow. Could this be causing corrupted bytes during transfer, so that the WU archives arrive broken and fail their checksum on extraction?

In effect, this machine is barred from downloading additional tasks for 24 hours, which makes it rather pointless to continue participating in the current GPUGRID team challenge and in GPUGRID QC task computation in general.

Maybe an upgrade of the GPUGRID server infrastructure would help improve the situation?

Michael.
____________
President of Rechenkraft.net - Germany's first and largest distributed computing organization.

Zalster
Joined: 26 Feb 14
Posts: 171
Credit: 4,013,368,076
RAC: 1,929,938
Message 51130 - Posted: 28 Dec 2018 | 17:09:27 UTC - in response to Message 51129.

I seem to remember something about those errors in the past, but I can't recall the details.

Your computers are hidden, so we cannot check the error you report. If you unhide them, we can look at the entire stderr report and hopefully get an idea of what is occurring.

Michael H.W. Weber
Message 51132 - Posted: 28 Dec 2018 | 19:21:16 UTC

Here is an example stderr log:

Name m0000040872_65a1af79_n00050-SDOERR_QMML50_4-0-1-RND5714_1
Workunit 15707811
Created 27 Dec 2018 | 17:23:04 UTC
Sent 27 Dec 2018 | 18:30:32 UTC
Received 27 Dec 2018 | 18:32:06 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 195 (0xc3) EXIT_CHILD_FAILED
Computer ID 428878
Report deadline 1 Jan 2019 | 18:30:32 UTC
Run time 2.72
CPU time 0.00
Validate state Invalid
Credit 0.00
Application version Quantum Chemistry v3.31 (mt)
Stderr output

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
19:30:37 (6677): wrapper (7.7.26016): starting
19:30:37 (6677): wrapper (7.7.26016): starting
19:30:37 (6677): wrapper: running /usr/bin/flock (/var/lib/boinc-client/projects/www.gpugrid.net/miniconda.lock -c "/bin/bash ./miniconda-installer.sh -b -u -p /var/lib/boinc-client/projects/www.gpugrid.net/miniconda &&
/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install -m -y -p qmml3 --override-channels -c defaults -c gpugrid --file requirements.txt ")
WARNING: md5sum mismatch of tar archive
expected: 75a9f0faa822a01dfe0e0e5c43400ed0
got: dfc9f09eb6b6771c69d6cf10b91bc6c9 -

bunzip2: Data integrity error when decompressing.
Input file = /var/lib/boinc-client/projects/www.gpugrid.net/miniconda/preconda.tar.bz2, output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.

tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
19:30:38 (6677): /usr/bin/flock exited; CPU time 0.185019
19:30:38 (6677): app exit status: 0x1
19:30:38 (6677): called boinc_finish(195)

</stderr_txt>
]]>
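
For anyone who wants to check locally whether the cached archive named in the log is damaged, here is a minimal Python sketch. The function names `md5_of_file` and `bz2_is_intact` are made up for illustration and are not part of BOINC or the GPUGRID application. The checksum the installer expects may be computed over a different byte range than the whole file on disk, so treat a direct comparison with the "expected" value above as an assumption rather than a definitive test. The decompression check is the more reliable of the two.

```python
import bz2
import hashlib

CHUNK = 1 << 20  # read 1 MiB at a time so large archives do not fill memory


def md5_of_file(path):
    """Return the hex MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bz2_is_intact(path):
    """Return True if the whole bzip2 stream decompresses without error.

    A truncated file raises EOFError and corrupted data raises OSError,
    so both are treated as "not intact".
    """
    try:
        with bz2.open(path, "rb") as fh:
            while fh.read(CHUNK):
                pass  # discard the output, only the error status matters
    except (OSError, EOFError):
        return False
    return True
```

Calling `bz2_is_intact("/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/preconda.tar.bz2")` (the path taken from the log) would report whether the file on disk decompresses cleanly. Reading that directory may require root or membership in the BOINC user's group.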

Michael.

Michael H.W. Weber
Message 51135 - Posted: 29 Dec 2018 | 13:32:38 UTC

44 tasks are now affected...

Michael.

Michael H.W. Weber
Message 51368 - Posted: 23 Jan 2019 | 21:24:40 UTC

Has this issue been resolved?

Michael.

Toni
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 9 Dec 08
Posts: 787
Credit: 4,294,282
RAC: 139
Message 51417 - Posted: 2 Feb 2019 | 16:49:03 UTC - in response to Message 51368.
Last modified: 2 Feb 2019 | 16:49:17 UTC

This is not something we can fix from our end. Try resetting the project, which should clear the local files.
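
On a headless Linux host, a project reset can also be triggered through `boinccmd` instead of the BOINC Manager. The sketch below only wraps that command. The project URL is an assumption inferred from the `www.gpugrid.net` directory name in the log, and the helper names `reset_command` and `reset_project` are hypothetical. Use the URL that `boinccmd --get_project_status` reports for your own installation.

```python
import subprocess

# Assumption: inferred from the project directory name in the stderr log.
PROJECT_URL = "http://www.gpugrid.net/"


def reset_command(project_url=PROJECT_URL):
    """Build the boinccmd argument list for a project reset."""
    return ["boinccmd", "--project", project_url, "reset"]


def reset_project(project_url=PROJECT_URL, dry_run=True):
    """Reset the project, or only print the command when dry_run is True."""
    cmd = reset_command(project_url)
    if dry_run:
        print("would run:", " ".join(cmd))
        return cmd
    # check=True raises CalledProcessError if boinccmd reports a failure.
    subprocess.run(cmd, check=True)
    return cmd
```

The default is a dry run, so nothing is changed until `reset_project(dry_run=False)` is called. Depending on the setup, `boinccmd` may need to be run from the BOINC data directory or given the GUI RPC password to authenticate.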
