Message boards : Number crunching : 1000's of strange event messages

JStateson
Joined: 31 Oct 08
Posts: 163
Credit: 2,628,420,728
RAC: 856,646
Message 52093 - Posted: 18 Jun 2019 | 2:03:35 UTC
Last modified: 18 Jun 2019 | 2:06:08 UTC

Not sure what is causing this, but I have over 3000 messages that are in pairs as shown:

GPUGRID 6/17/2019 8:55:53 PM [coproc] NVIDIA instance 0; 1.000000 pending for e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0
GPUGRID 6/17/2019 8:55:53 PM [coproc] NVIDIA instance 0: confirming 1.000000 instance for e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0


There are two GTX 1070s in this system but only one has a job, and I see the following:


3760 GPUGRID 6/17/2019 8:59:07 PM Requesting new tasks for NVIDIA GPU
3761 GPUGRID 6/17/2019 8:59:09 PM Scheduler request completed: got 0 new tasks
3762 GPUGRID 6/17/2019 8:59:09 PM No tasks sent
3763 GPUGRID 6/17/2019 8:59:09 PM Project has no tasks available
3764 GPUGRID 6/17/2019 9:00:01 PM [coproc] NVIDIA instance 0; 1.000000 pending for e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0
3765 GPUGRID 6/17/2019 9:00:01 PM [coproc] NVIDIA instance 0: confirming 1.000000 instance for e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0
3766 GPUGRID 6/17/2019 9:01:01 PM [coproc] NVIDIA instance 0; 1.000000 pending for e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0
3767 GPUGRID 6/17/2019 9:01:01 PM [coproc] NVIDIA instance 0: confirming 1.000000 instance for e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0
..etc...
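
To gauge how many of these paired messages a given task is producing, the event log can be saved to a text file and tallied. The following is only a rough sketch: the regular expression is inferred from the handful of lines quoted above (it is not taken from the BOINC source), and `count_coproc_messages` and `SAMPLE_LOG` are hypothetical names used for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the two message shapes quoted in this thread;
# other [coproc] messages may look different and would not be counted.
COPROC_RE = re.compile(
    r"\[coproc\]\s+NVIDIA instance (?P<gpu>\d+)[;:]\s+"
    r"(?P<confirming>confirming\s+)?[\d.]+\s+(?:pending|instance)\s+for\s+(?P<task>\S+)"
)


def count_coproc_messages(lines):
    """Count 'pending' and 'confirming' [coproc] messages per task name."""
    counts = Counter()
    for line in lines:
        match = COPROC_RE.search(line)
        if match:
            kind = "confirming" if match.group("confirming") else "pending"
            counts[(match.group("task"), kind)] += 1
    return counts


# Sample lines copied from the log excerpt above.
SAMPLE_LOG = [
    "3763 GPUGRID 6/17/2019 8:59:09 PM Project has no tasks available",
    "3764 GPUGRID 6/17/2019 9:00:01 PM [coproc] NVIDIA instance 0; 1.000000 pending for e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0",
    "3765 GPUGRID 6/17/2019 9:00:01 PM [coproc] NVIDIA instance 0: confirming 1.000000 instance for e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0",
    "3766 GPUGRID 6/17/2019 9:01:01 PM [coproc] NVIDIA instance 0; 1.000000 pending for e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0",
    "3767 GPUGRID 6/17/2019 9:01:01 PM [coproc] NVIDIA instance 0: confirming 1.000000 instance for e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0",
]

if __name__ == "__main__":
    for (task, kind), n in sorted(count_coproc_messages(SAMPLE_LOG).items()):
        print(f"{n:5d}  {kind:10s}  {task}")
```

If every "pending" line has a matching "confirming" line for the same task, the messages are most likely routine bookkeeping rather than a sign of a fault.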

JStateson
Message 52095 - Posted: 18 Jun 2019 | 14:15:50 UTC - in response to Message 52093.

There are now two tasks running, but only the second one is making progress. I noticed another 3,500 of the same strange messages have been added since my last post, so I restarted BOINC to see if that fixes the problem. If it does not help, I will abort the stuck task and it will become someone else's problem.

Zalster
Joined: 26 Feb 14
Posts: 181
Credit: 4,144,164,976
RAC: 626,664
Message 52096 - Posted: 18 Jun 2019 | 16:28:59 UTC - in response to Message 52095.

reboot the system?

JStateson
Message 52101 - Posted: 19 Jun 2019 | 3:08:52 UTC - in response to Message 52096.

reboot the system?


I shut it down when you first posted and only just got around to booting it back up. There seem to be other problems:


52 GPUGRID 6/18/2019 9:41:42 PM [error] no project URL in task state file
65 6/18/2019 9:41:47 PM [error] Inconsistent signing key from account manager


I occasionally see the missing-URL error on other projects, where it seems to be ignored, but the one about the signing key is new.

The system is crunching and another two GPUGRID work units showed up, but I also see the first pair of those strange warnings.

Got another 20 of them in the time it took to write this. I hate losing tasks, especially GPUGRID ones, but I am going to detach.

Could not detach through BAM! (the log never showed the sync with the account manager), but a reset worked. I made a note of the names of the work units that were lost and will check whether the problem shows up elsewhere, but I assume something just got corrupted here.

JStateson
Message 52108 - Posted: 19 Jun 2019 | 12:40:58 UTC

The strange messages were my "the sky is falling" moment, as I didn't realize the debug flag was set in cc_config.
Anyway, things seem to be working after the project reset was done.
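
For anyone hitting the same flood of messages, a quick way to see which logging flags are switched on is to read the `<log_flags>` section of `cc_config.xml`. The snippet below is a minimal sketch, not an official tool: `enabled_log_flags` and `SAMPLE_CONFIG` are hypothetical names, the sample file is made up for illustration (it is not the poster's actual config), and it assumes the flag behind the `[coproc]` lines is `coproc_debug`, which the post does not state explicitly.

```python
import xml.etree.ElementTree as ET


def enabled_log_flags(xml_text):
    """Return the names of all <log_flags> entries set to 1, sorted."""
    root = ET.fromstring(xml_text)
    flags = root.find("log_flags")
    if flags is None:
        return []
    return sorted(
        flag.tag for flag in flags if (flag.text or "").strip() == "1"
    )


# Illustrative config only; a real cc_config.xml lives in the BOINC data directory.
SAMPLE_CONFIG = """<cc_config>
  <log_flags>
    <task>1</task>
    <sched_ops>1</sched_ops>
    <coproc_debug>1</coproc_debug>
    <cpu_sched_debug>0</cpu_sched_debug>
  </log_flags>
</cc_config>"""

if __name__ == "__main__":
    print(enabled_log_flags(SAMPLE_CONFIG))
```

Setting the unwanted flag to 0 and having the client re-read its config files should stop the extra output without touching any tasks.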

I did have one observation: the pair of aborted tasks is missing from the error list under my account. I do have the names of the two tasks that were running:

e93s49_e72s18p0f35-PABLO_v3Q86UU0_MOR_6_IDP-1-2-RND0688_0
e97s47_e59s104p1f3l9-PABLO_v3075376_MOR_58_IDP-0-2-RND6678_0


but without the workunit name it is difficult to see if anyone else had a problem.

One of the above two was hung and the other chugged along fine, but I failed to make a note of which one had the problem.
