
Message boards : News : Experimental Python tasks (beta)

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 55588 - Posted: 13 Oct 2020 | 6:07:19 UTC

I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches.

They may use a relatively large amount of disk space (order of 1-10 GB) which persists between runs, and is cleared if you reset the project.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 55590 - Posted: 13 Oct 2020 | 7:44:18 UTC - in response to Message 55588.
Last modified: 13 Oct 2020 | 8:24:54 UTC

I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches.

They may use a relatively large amount of disk space (order of 1-10 GB) which persists between runs, and is cleared if you reset the project.



Preference Ticked, ready and waiting...

EDIT: Received some already
https://www.gpugrid.net/result.php?resultid=29466771
https://www.gpugrid.net/result.php?resultid=29466770

Conda warnings reported. Will you push out an update to the app, or are these safe to ignore?

Also, warnings about a path not found:
WARNING conda.core.envs_manager:register_env(50): Unable to register environment. Path not writable or missing. environment location: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda registry file: /root/.conda/environments.txt

The registry file location ( /root/ ) will not be accessible to the boinc user unless conda is already installed on the host (by the root user) and the conda file is world-readable.

Otherwise the task status is Completed and Validated

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 55591 - Posted: 13 Oct 2020 | 9:25:38 UTC - in response to Message 55590.

Looks harmless, thanks for reporting. It's because the "boinc" user doesn't have a HOME directory I think.

rod4x4
Send message
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 55592 - Posted: 13 Oct 2020 | 11:14:14 UTC - in response to Message 55591.
Last modified: 13 Oct 2020 | 11:17:49 UTC

Looks harmless, thanks for reporting. It's because the "boinc" user doesn't have a HOME directory I think.


Agreed

Perhaps adding a "./envs" argument to the end of the command:

/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda install

may help with setting up the environment.

This option should create the environment files in the current directory from which the command is executed.
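
For illustration, a minimal sketch (in Python, not the project's actual wrapper) of what installing the conda environment into the task's own directory could look like. The conda option that does this is -p/--prefix; the conda path and the package spec below are placeholders.

# Hypothetical sketch: create a conda environment inside the current
# (task) directory so nothing has to be registered under the user's HOME.
import subprocess
from pathlib import Path

conda = "/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda"
env_dir = Path.cwd() / "envs"            # environment lives next to the task files

subprocess.run(
    [conda, "create", "--yes",
     "--prefix", str(env_dir),           # -p/--prefix: use this path instead of ~/.conda
     "python=3.8"],                      # placeholder package spec
    check=True,
)
# The same --prefix option can be passed to "conda install" to add packages
# to that environment later.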

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55724 - Posted: 12 Nov 2020 | 1:59:01 UTC

I got one of these tasks which confused me as I have not set "accept beta applications" in my project preferences.

Failed after 1200 seconds.

Any idea why I got this task even when I have not accepted the app through beta settings?

https://www.gpugrid.net/result.php?resultid=30508976

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 55920 - Posted: 9 Dec 2020 | 19:42:43 UTC

What is the difference between these test Python apps and the standard one? Is it just that this application is coded in Python? what language are the default apps coded in?
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55926 - Posted: 9 Dec 2020 | 23:40:13 UTC - in response to Message 55920.

Both apps run under the BOINC wrapper. One is the stock acemd3, which I assume is written in some form of C.

The new Anaconda Python task is a conda application, written in Python.

I think Toni is going to have to explain what these new tasks do and how the application works.

Very strange behavior. I think the conda and python parts run first and communicate with the project, doing some intermediary calculation/configuration/formatting or something. Lots of upstream network activity, yet nothing going on in the client's Transfers screen.

I saw the tasks get to 100% progress with no time remaining and then stall out. No upload of the finished task.

I looked away from the machine and looked again, and now both tasks have reset their progress and show 3 hours to run.

I first saw conda show up in the process list; now that has disappeared, replaced by an acemd3 and a python process for each task.

They must be doing something other than insta-failing like the previous tries did.

sph
Send message
Joined: 22 Oct 20
Posts: 4
Credit: 34,434,982
RAC: 0
Level
Val
Scientific publications
wat
Message 55933 - Posted: 10 Dec 2020 | 5:22:30 UTC

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.


I am receiving this error in the stderr output for Experimental Python tasks on all my hosts.

This is probably because all my PCs are behind a proxy. Can you please set the Python tasks to use the proxy defined in the BOINC client?

Work Units here:
https://www.gpugrid.net/result.php?resultid=31672354
https://www.gpugrid.net/result.php?resultid=31668427
https://www.gpugrid.net/result.php?resultid=31665961

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55936 - Posted: 10 Dec 2020 | 8:30:18 UTC

Boy, mixing both regular acemd3 and the python anaconda tasks sure F*s up the APR for both tasks. The insanely low APR for the Python tasks is forcing all GPUGrid tasks into High Priority.

The regular acemd3 tasks are getting 3-6 day estimated completions.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 55945 - Posted: 10 Dec 2020 | 15:25:26 UTC - in response to Message 55936.
Last modified: 10 Dec 2020 | 15:41:38 UTC

Boy, mixing both regular acemd3 and the python anaconda tasks sure F*s up the APR for both tasks. The insanely low APR for the Python tasks is forcing all GPUGrid tasks into High Priority.

The regular acemd3 tasks are getting 3-6 day estimated completions.


I'm seeing that too, lol. But it doesn't seem to be causing too much trouble for me since I don't run more than one GPU project concurrently; I only have a primary and a backup.

Copying my message from another thread with my observations about these tasks, in case Toni doesn't check the other threads:

Looks like I have 11 successful tasks, and 2 failures.

the two failures both failed with "196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED" after a few mins and on different hosts.
https://www.gpugrid.net/result.php?resultid=31680145
https://www.gpugrid.net/result.php?resultid=31678136

curious, since both systems have plenty of free space, and I've allowed BOINC to use 90% of it.

these tasks also have much different behavior compared to the default new version acemd tasks. and they don't seem well optimized yet.
-less reliance on PCIe bandwidth, seeing 2-8% PCIe 3.0 bus utilization
-more reliance on GPU VRAM, seeing 2-3GB memory used
-less GPU utilization, seeing 65-85% GPU utilization. (maybe more dependent on a fast CPU/mem subsystem. my 3900X system gets better GPU% than my slower EPYC systems)

contrast that with the default acemd3 tasks:
-25-50% PCIe 3.0 bus utilization
-about 500MB GPU VRAM used
-95+% GPU utilization

thinking about the GPU utilization being dependent on CPU speed. It could also have to do with the relative speed between the GPU:CPU. just something I observed on my systems. slower GPUs seem to tolerate slower CPUs better, which makes sense if the CPU speed is a limiting factor.

Ryzen 3900X @4.20GHz w/ 2080ti = 85% GPU Utilization
EPYC 7402P @3.30GHz w/ 2080ti = 65% GPU Utilization
EPYC 7402P @3.30GHz w/ 2070 = 76% GPU Utilization
EPYC 7642 @2.80GHz w/ 1660Super = 71% GPU Utilization

needs more optimization IMO. the default app sees much better performance keeping the GPU fully loaded.
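
For anyone wanting to reproduce these numbers, a simple way to sample GPU utilization and VRAM use while a task runs (standard nvidia-smi query fields; the polling loop is just illustrative):

# Sample GPU utilization and memory use once per second for 10 seconds.
import subprocess, time

QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

for _ in range(10):
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    for line in out.stdout.strip().splitlines():
        idx, util, mem = [x.strip() for x in line.split(",")]
        print(f"GPU {idx}: {util}% busy, {mem} MiB VRAM in use")
    time.sleep(1)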

____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55946 - Posted: 10 Dec 2020 | 16:03:34 UTC - in response to Message 55936.

Boy, mixing both regular acemd3 and the python anaconda tasks sure F*s up the APR for both tasks. The insanely low APR for the Python tasks is forcing all GPUGrid tasks into High Priority.

The regular acemd3 tasks are getting 3-6 day estimated completions.

Actually, that won't be the cause. The APRs are kept separately for each application, and once you have an 'active' APR (11 or more 'completions' - validated tasks for that app), they should keep out of each others way.

What will F* things up is that this project still allows DCF to run free - and that's a single value which is applied to both task types.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55947 - Posted: 10 Dec 2020 | 16:07:55 UTC - in response to Message 55946.

Yeah, after I wrote that I realized I meant the DCF is what is messing up the runtime estimations.

I wonder if the regular acemd3 tasks will ever get their DCFs back to normal.

I haven't run ANY of my other GPU project tasks since these Anaconda Python tasks have shown up. I will eventually, when the other projects' deadlines approach, of course.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 55948 - Posted: 10 Dec 2020 | 16:09:51 UTC - in response to Message 55946.

what's DCF?
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55949 - Posted: 10 Dec 2020 | 16:29:16 UTC - in response to Message 55948.

what's DCF?

Task Duration Correction Factor.
The older BOINC server versions use it like Einstein.
It messes up gpu tasks of different apps there too.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55951 - Posted: 10 Dec 2020 | 17:11:20 UTC - in response to Message 55947.

You can't talk about 'their DCFs' - there is only one (there could have been more than one, but that's the way David chose to play it)

You can see it in BOINC Manager, on the Projects|Properties dialog. If it gets really, really high (above 90), it'll inch downwards at 1% per task. Below 90, it'll speed up to 10% per task. The standard advice used to be "two weeks to stabilise", but with modern machines (multi-core, multi-GPU, and faster), the tasks fly by, and it should be quicker.
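
As a toy model of the behaviour described above (following this description, not the BOINC client source), the correction factor drifting toward the observed actual/estimated runtime ratio looks roughly like this:

# DCF drifts toward actual/estimated runtime: 1% per task while very high
# (above 90), 10% per task once below 90, per the description above.
def adjust_dcf(dcf, actual_s, estimated_s):
    target = actual_s / estimated_s
    step = 0.01 if dcf > 90 else 0.10
    return dcf + step * (target - dcf)

dcf = 95.0
for task in range(30):                      # thirty well-estimated tasks in a row
    dcf = adjust_dcf(dcf, actual_s=2000, estimated_s=2000)
    print(task + 1, round(dcf, 2))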

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55953 - Posted: 10 Dec 2020 | 17:28:15 UTC - in response to Message 55951.

What is also messed up is the estimated computation size of the Anaconda Python tasks, as shown in the task properties.

The ones I crunched were only set for 3,000 GFLOPS.

The regular acemd3 tasks are set for 5,000,000 GFLOPS.

This also probably influenced the wildly inaccurate DCFs for the new Python tasks.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55954 - Posted: 10 Dec 2020 | 17:33:17 UTC - in response to Message 55951.

You can't talk about 'their DCFs' - there is only one (there could have been more than one, but that's the way David chose to play it)

You can see it in BOINC Manager, on the Projects|properties dialog. If it gets really, really high (above 90), it'll inch downwards at 1% per task. Below 90, it'll speed up to 10% par task. The standard advice used to be "two weeks to stabilise", but with modern machines (multi-core, multi-GPU, and faster), the tasks fly by, and it should be quicker.

This daily driver has GPUGrid DCF Project properties currently at 85 and change.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 55955 - Posted: 10 Dec 2020 | 17:33:49 UTC - in response to Message 55953.
Last modified: 10 Dec 2020 | 17:37:11 UTC

What is also messed up is the size of the Anaconda Python task estimated computation size shown in the task properties.

The ones I crunched were only set for 3,000 GFLOPS.

The regular acemd3 tasks are set for 5,000,000 GFLOPS.

This also probably influenced the wildly inaccurate DCF's for the new python tasks.

can confirm.

could this be why the credit reward is so high too?

I wonder what the flop estimate was on this one from Kevvy:
https://www.gpugrid.net/result.php?resultid=31679003
he got wrecked on this one, over 5hrs on a 2080ti, and got a mere 20 credits lol.
____________

biodoc
Send message
Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 96,702
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55956 - Posted: 10 Dec 2020 | 18:20:14 UTC

I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a Ryzen 3900X to support the GPU and that thread is running at 100%. This computer has completed 3 of the new python tasks successfully.

Linux Mint 20; Driver Version: 440.95.01; CUDA Version: 10.2

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 55957 - Posted: 10 Dec 2020 | 18:25:08 UTC - in response to Message 55956.

I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a ryzen 3900X to support the GPU and that thread is running at 100%. This computer has complete 3 of the new python tasks successfully.

Linux Mint 20; Driver Version: 440.95.01; CUDA Version: 10.2


what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable?

what is the clock speed of your 3900X and memory speed as well?

try letting there be 2 spare free threads (so you have one doing nothing) to avoid maxing out the CPU to 100% utilization on all threads. this is known to slow down GPU work. this might increase your GPU utilization a bit.

____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55958 - Posted: 10 Dec 2020 | 19:05:05 UTC - in response to Message 55955.

There's an explanation for 20 credit tasks over at Rosetta.
Has to do with a task being interrupted in calculation and restarted if I remember correctly.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55959 - Posted: 10 Dec 2020 | 19:07:58 UTC - in response to Message 55957.
Last modified: 10 Dec 2020 | 19:15:47 UTC

what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable?


That was one of the questions I wanted to ask Mr. Kevvy, since he seems to be the first cruncher to successfully crunch a ton of them without errors.

I wondered if his BOINC was a service install or a standalone.

[Edit] OK, so Mr. Kevvy is still using the AIO. I wondered since a lot of our team seem to have dropped the AIO and gone back to the service install.

So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 55961 - Posted: 10 Dec 2020 | 19:17:41 UTC - in response to Message 55959.

I'm almost positive he's running a standalone install.
____________

biodoc
Send message
Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 96,702
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55962 - Posted: 10 Dec 2020 | 19:28:54 UTC - in response to Message 55957.

I've got one running now on an RTX 2070S and the only real issue is low GPU utilization (60-70%). The current task is using ~2 GB of VRAM and ~3 GB of system RAM. I have one thread free on a ryzen 3900X to support the GPU and that thread is running at 100%. This computer has complete 3 of the new python tasks successfully.

Linux Mint 20; Driver Version: 440.95.01; CUDA Version: 10.2


what kind of BOINC install do you have? does it run as a service? or a standalone install that runs from an executable?

what is the clock speed of your 3900X and memory speed as well?

try letting there be 2 spare free threads (so you have one doing nothing) to avoid maxing out the CPU to 100% utilization on all threads. this is known to slow down GPU work. this might increase your GPU utilization a bit.


BOINC runs as a service and was installed from the Mint repository (version 7.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 55963 - Posted: 10 Dec 2020 | 19:31:00 UTC - in response to Message 55959.
Last modified: 10 Dec 2020 | 19:39:47 UTC

So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running.


difference in what sense?

you and I both have glibc 2.31 and we both have a bunch of successful completions. looks like Kevvy's Ubuntu 20 systems also have 2.31. all of us with these Ubuntu 20.04 systems have successful completions.

But of all of his Linux Mint (based on Ubuntu 19) systems, none have completed a single Python task successfully. I'm not sure if it's a problem with Linux Mint or what. I'm not sure it's necessarily anything to do with the GLIBC, since his error messages are varied and none mention GLIBC as being the cause. It could just be that the app has some bugs to work out when running in different environments. I also don't know if he's using service installs on his Mint systems; he's got a lot of different BOINC versions across all his systems.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 55964 - Posted: 10 Dec 2020 | 19:36:51 UTC - in response to Message 55962.

Boinc runs as a service and was installed from the Mint repository (version 17.16.6). The CPU clock speed is 3.9 GHz and the RAM is DDR4 3200 CL16. I did free up another thread but I didn't see an obvious difference in GPU utilization.


thanks for the clarification. it was worth a shot on the GPU utilization with the free thread, low hanging fruit.

I run my memory at 3600 CL14, but I've never seen memory matter that much even for CPU tasks on other projects, let alone GPU tasks. (I saw no difference when changing from 3200CL16 to 3600CL14), but anything's possible I guess.

____________

biodoc
Send message
Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 96,702
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 55965 - Posted: 10 Dec 2020 | 19:44:14 UTC - in response to Message 55963.

So, then likely the main difference is that Mr. Kevvy is using the older glibc 2.29 instead of the glibc 2.31 that we Ubuntu 20 users are running.


difference in what sense?

you and I both have glibc 2.31 and we both have a bunch of successful completions. looks like Kevvy's Ubuntu 20 systems also have 2.31. all of us with these Ubuntu 20.04 systems have successful completions.

but of all of his Linux Mint (based on Ubuntu 19) systems, none have completed a single Python task successfully. I'm not sure if it's a problem with Linux Mint or what. I'm not sure its necessarily anything to do with the GLIBC since his error messages are varied, and none mention GLIBC as being the cause. It could just be that the app has some bugs to work out when running in different environments.


Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 55966 - Posted: 10 Dec 2020 | 20:01:37 UTC - in response to Message 55965.
Last modified: 10 Dec 2020 | 20:02:13 UTC

Mint 20 is based on Ubuntu 20.04 and has glibc 2.31. The 2 computers I have running GPUGrid have Mint 20 installed and the RTX cards on those computers are completing the new python tasks successfully.


Yes, I know. But my point was that there are many differences between Mint 19 and 20, not just GLIBC version, and usually when GLIBC is an issue that shows up as the reason for the error in the task results, but that hasn't been the case.

and conversely we have several examples of tasks hitting Ubuntu 20.04 systems with GLIBC of 2.31 and they still fail.

I think it's just buggy.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 55969 - Posted: 10 Dec 2020 | 22:05:44 UTC - in response to Message 55966.

Yes, I had over half a dozen failed tasks before the first successful task.
That's why I was wondering whether the failed tasks report the failed configuration upstream and change the configuration of future tasks.

Pretty sure lots of prerequisite software is downloaded first from conda and configured on the system before it finally starts real crunching.

And the configuration downloads happen for each task, I think.

It's not just some initial download after which all the files are static.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 55996 - Posted: 12 Dec 2020 | 20:10:00 UTC

FYI, these tasks don't checkpoint properly.

If you need to stop BOINC or the system experiences a power outage, the tasks restart from the beginning (10%), but the task timer still carries on from where it left off even though the task restarted. If the tasks were short like MDAD (but MDAD checkpoints properly) it wouldn't be a huge problem, but when they run for 4-5 hrs and need to start over after any interruption, it's a bit of a kick in the pants. Even worse when these restarted tasks only get 20 credits for up to 2x the total run time; not worth finishing them at that point.

Additionally, as has been mentioned in the other thread, these tasks wreak havoc on the system's DCF, since it seems to be set incorrectly for these tasks. You get tasks that make BOINC think they will complete in 10 seconds, and they end up taking 4 hrs, so BOINC counters by inflating the estimated run time of normal tasks to 10+ days when they only take 20-40 min, lol. And it swings wildly back and forth depending on how many of each type you've completed.

And the credit reward, other than being about 10x normal for tasks of this runtime, seems tied only to FLOPS and runtime, without accounting for efficiency at all.

My 3900X/2080 Ti completes tasks on average much faster than my EPYC/2080 Ti system, since the 3900X system runs at higher GPU utilization, allowing faster run times. But the 3900X system earns proportionally less credit, so both systems end up earning the same amount of credit per card. The 3900X/2080 Ti should be earning more credit since it's doing more tasks; the reward is being over-inflated for tasks that have longer run times due to inefficiency. It seems tied only to raw runtime and estimated FLOPS. I understand that tasks can have varying run times, but if you won't account for efficiency, you need a static reward not dependent on runtime at all. For reference, a static reward of about 175,000 would, on average, bring these tasks near MDAD for credit per unit time.
____________

Greger
Send message
Joined: 6 Jan 15
Posts: 76
Credit: 24,920,057,682
RAC: 38,971,249
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 55997 - Posted: 12 Dec 2020 | 22:03:58 UTC
Last modified: 12 Dec 2020 | 22:06:56 UTC

My host switched to another project's task and then resumed, and after a while I had to update the system and restart. It did indeed fail to resume from the last state, so it looks like the checkpoint was far behind or there was no checkpoint at all. The elapsed time stayed at around 2 hours, which was hours behind, and the estimated percentage was locked at 10%.

I aborted it the next day as it reached 14 hours.

https://www.gpugrid.net/result.php?resultid=31701824

I would expect checkpointing not to be fully working yet and to be added later on. There is a lot of testing going on here but still little information for us, so we need to take it for what it is and deal with it if the tasks don't work.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 56007 - Posted: 15 Dec 2020 | 15:52:05 UTC - in response to Message 55997.

The Python app runs ACEMD, but uses additional libraries to compute additional force terms. These libraries are distributed as Conda (Python) packages.

For this to work, I had to make an App which installs a self-contained Conda install in the project dir. The installation is re-used from one run to the other.

This is rather finicky (for example, downloads are large, and I have to be careful with concurrent installs).

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).
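
A minimal sketch (an assumption about the general approach, not the project's actual app code) of what "install once into the project directory, reuse it, and guard against concurrent installs" could look like. The paths and installer invocation are placeholders; the stderr output later in this thread shows /usr/bin/flock being used for the locking, replaced here by Python's fcntl for a self-contained example.

# Reuse a self-contained miniconda in the project dir; serialize installs.
import fcntl, subprocess
from pathlib import Path

PROJECT_DIR = Path("/var/lib/boinc-client/projects/www.gpugrid.net")
CONDA_DIR = PROJECT_DIR / "miniconda"
LOCK_FILE = PROJECT_DIR / "miniconda.lock"

with open(LOCK_FILE, "w") as lock:
    fcntl.flock(lock, fcntl.LOCK_EX)         # only one task installs at a time
    if not (CONDA_DIR / "bin" / "conda").exists():
        # one-time, large download + install; every later task reuses it
        subprocess.run(["bash", "Miniconda3-latest-Linux-x86_64.sh",
                        "-b", "-p", str(CONDA_DIR)], check=True)
    fcntl.flock(lock, fcntl.LOCK_UN)

print("using", CONDA_DIR / "bin" / "conda")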

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 4,226,786,456
RAC: 12,117,166
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56008 - Posted: 15 Dec 2020 | 17:04:43 UTC - in response to Message 56007.

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).


Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.
____________
Reno, NV
Team: SETI.USA

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56009 - Posted: 15 Dec 2020 | 17:14:18 UTC - in response to Message 56007.

Thanks for the details.

The flops estimate

Yes, the "size" of the tasks, as expressed by <rsc_fpops_est> in the workunit template. The current value is 3,000 GFLOPS: all other GPUGrid task types are 5,000,000 GFLOPS.
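
Rough arithmetic for why that mismatch wrecks the estimates. The client's initial runtime estimate is roughly rsc_fpops_est divided by the device's estimated speed, scaled by DCF (shown here as a simplification, not the exact client code); the device speed below is illustrative.

# Why a 3,000 GFLOPS estimate collides with multi-hour actual runtimes.
def estimated_seconds(fpops_est, device_flops, dcf=1.0):
    return fpops_est / device_flops * dcf

gpu_flops = 14e12                                   # roughly RTX 2080 Ti class

print(estimated_seconds(3_000e9, gpu_flops))        # python beta: ~0.2 s
print(estimated_seconds(5_000_000e9, gpu_flops))    # acemd3: ~360 s
# When a "0.2 second" task actually runs for hours, the single shared DCF is
# blown up by orders of magnitude, which then inflates the acemd3 estimates to days.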

An App which installs a self-contained Conda install

We are encountering an unfortunate clash with the security of BOINC running as a systemd service under Linux. Useful bits of BOINC (pausing computation when the computer's user is active on the mouse or keyboard) rely on having access to the public /tmp/ folder structure. The conda installer wants to make use of a temporary folder.

systemd allows us to have either public tmp folders (read only, for security), or private tmp folders (write access). But not both at the same time. We're exploring how to get the best of both worlds...

Discussions in
https://www.gpugrid.net/forum_thread.php?id=5204
https://github.com/BOINC/boinc/issues/4125

over-crediting

We're enjoying it while it lasts!

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56010 - Posted: 15 Dec 2020 | 17:19:51 UTC - in response to Message 56008.

Over-crediting?

OK, make that 'inconsistent crediting'. Mine are all in the 600,000 - 900,000 range, for much the same runtime on a 1660 Ti.

Host 508381

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56011 - Posted: 15 Dec 2020 | 17:50:02 UTC - in response to Message 56010.
Last modified: 15 Dec 2020 | 18:13:03 UTC

Over-crediting?

OK, make that 'inconsistent crediting'. Mine are all in the 600,000 - 900,000 range, for much the same runtime on a 1660 Ti.

Host 508381


The 20-credits thing seems to only happen with restarted tasks, from what I've seen. Not sure if anything else triggers it.

But I can say with certainty that the credit allocation is "questionable", and only appears to be related to the flops of device 0 in BOINC, as well as runtime. Slow devices masked behind a fast device 0 will earn credit at the rate of the faster device...
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56012 - Posted: 15 Dec 2020 | 17:53:40 UTC - in response to Message 56008.

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).


Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.


This happens when the task is interrupted, i.e. stopped and resumed. You can't interrupt these tasks at all.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56013 - Posted: 15 Dec 2020 | 18:09:15 UTC

We should perhaps mention the lack of effective checkpointing while we have Toni's attention. Even though the tasks claim to checkpoint every 0.9% (after the initial 10% allowed for the setup), the apps are unable to resume from the point previously reached.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 4,226,786,456
RAC: 12,117,166
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56014 - Posted: 15 Dec 2020 | 18:22:52 UTC - in response to Message 56012.

Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.


this happens when the task is interrupted. started and resumed. you can't interrupt these tasks at all.


I'll check that out. But I have not suspended or otherwise interrupted any tasks. Unless BOINC is doing that without my knowledge. But I don't think so.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56015 - Posted: 15 Dec 2020 | 18:28:46 UTC - in response to Message 56014.

Over-crediting? I am seeing the opposite problem.

https://www.gpugrid.net/result.php?resultid=31902208

20.83 credits for 4.5 hours of run time on an RTX 2080 Ti. That is practically nothing. And this is not a one-off. All my tasks so far are similar.


this happens when the task is interrupted. started and resumed. you can't interrupt these tasks at all.


I'll check that out. But I have not suspended or otherwise interrupted any tasks. Unless BOINC is doing that without my knowledge. But I don't think so.


you also appear to have your hosts setup to ONLY crunch these beta tasks. is there a reason for that?

does your system process the normal tasks fine? maybe it's something going on with your system as a whole.

____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 4,226,786,456
RAC: 12,117,166
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56016 - Posted: 15 Dec 2020 | 18:57:24 UTC - in response to Message 56015.

you also appear to have your hosts setup to ONLY crunch these beta tasks. is there a reason for that?

I have reached my wuprop goals for the other apps. So I am interested in only this particular app (for now).

does your system process the normal tasks fine? maybe it's something going on with your system as a whole.

Yep, all the other apps run fine, both here and on other projects.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56017 - Posted: 15 Dec 2020 | 20:40:18 UTC - in response to Message 56016.
Last modified: 15 Dec 2020 | 21:09:19 UTC

you also appear to have your hosts setup to ONLY crunch these beta tasks. is there a reason for that?

I have reached my wuprop goals for the other apps. So I am interested in only this particular app (for now).

does your system process the normal tasks fine? maybe it's something going on with your system as a whole.

Yep, all the other apps run fine, both here and on other projects.


I have a theory, but not sure if it's correct or not.

Can you tell me the peak_flops value reported in your coproc_info.xml file for the 2080 Ti?

Basically, you are using such an old version of BOINC (7.9.3) that it pre-dates the fixes implemented in 7.14.2 to properly calculate the peak flops of Turing cards. So I'm willing to bet that your version of BOINC is over-estimating your peak flops by a factor of 2. A 2080 Ti should read somewhere between 13.5 and 15 TFlops, and I'm guessing your old version of BOINC thinks it's closer to double that (25-30 TFlops).

The second half of the theory is that there is some kind of hard limit (maybe an anti-cheat mechanism?) that blocks any credit reward somewhere above roughly 2,000,000. Maybe 1.8 million, maybe 1.9 million? I haven't observed ANYONE getting a task earning that much, and all tasks that would reach that level based on runtime seem to get this 20-credit value.

That's my theory; I could be wrong. If you try a newer version of BOINC that properly measures the flops on a Turing card, and you start getting real credit, then it might hold water.
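
Back-of-envelope numbers behind this (the core count and boost clock are the card's published specs; the "times 2" is one fused multiply-add per core per clock):

# Theoretical peak for an RTX 2080 Ti vs. the ~2x value an old client reports.
cuda_cores = 4352
boost_clock_ghz = 1.545

peak_gflops = cuda_cores * boost_clock_ghz * 2
print(f"correct peak:  {peak_gflops / 1000:.1f} TFLOPS")      # ~13.4
print(f"doubled value: {2 * peak_gflops / 1000:.1f} TFLOPS")  # ~26.9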
____________

sph
Send message
Joined: 22 Oct 20
Posts: 4
Credit: 34,434,982
RAC: 0
Level
Val
Scientific publications
wat
Message 56018 - Posted: 15 Dec 2020 | 23:13:08 UTC - in response to Message 56007.
Last modified: 15 Dec 2020 | 23:15:51 UTC

Two outstanding issues are over-crediting (I am using some default BOINC formula) and, as far as i understand, the flops estimate (?).


Toni, one more issue to add to the list.

The download from the Anaconda website does not work for hosts behind a proxy. Can you please add a check for the proxy settings in the BOINC client so external software can be downloaded through it?
I have other hosts that are not behind a proxy and they download and run the Experimental tasks fine.

Issue here:
CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

This error repeats itself until it eventually gives up after 5 minutes and fails the task.

Happens on 2 hosts sitting behind a Web Proxy (Squid)
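
As a possible stop-gap (an assumption about how the download step could be pointed at the proxy, not something the project ships): conda normally honours the standard HTTP_PROXY/HTTPS_PROXY environment variables, so a wrapper that exports them before the conda step should let the downloads go through Squid. The proxy address below is a placeholder.

# Hypothetical sketch: pass the site's proxy to any conda command we launch.
import os, subprocess

proxy = "http://proxy.example.lan:3128"      # placeholder: your Squid proxy
os.environ["HTTP_PROXY"] = proxy
os.environ["HTTPS_PROXY"] = proxy

conda = "/var/lib/boinc-client/projects/www.gpugrid.net/miniconda/bin/conda"
subprocess.run([conda, "info"], check=False)  # child processes inherit the proxy settings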

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 4,226,786,456
RAC: 12,117,166
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56019 - Posted: 16 Dec 2020 | 1:19:31 UTC - in response to Message 56017.

A second, identical machine, except it has dual RTX 1660 Ti cards, finally got some work. The tasks reported and were awarded the large credits. So that rules out the question WRT BOINC version. FWIW, that version of BOINC is the latest available from the repository.

So maybe it is due to interruptions after all, and I am just unaware? I am running some more tasks now, and will check again in the morning.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56020 - Posted: 16 Dec 2020 | 2:57:24 UTC - in response to Message 56019.
Last modified: 16 Dec 2020 | 3:01:29 UTC

A second, identical machine, except it has dual RTX 1660 Ti cards, finally got some work. The tasks reported and were awarded the large credits. So that rules out the question WRT BOINC version. FWIW, that version of BOINC is the latest available from the repository.

So maybe it is due to interruptions after all, and I am just unaware? I am running some more tasks now, and will check again in the morning.


It doesn't rule it out, because a 1660 Ti has a much lower flops value, like 5.5 TFlops. So with the old BOINC version, it's estimating ~11 TFlops, and that's not high enough to trigger the issue. You're only seeing it on the 2080 Ti because it's a much higher performing card: ~14 TFlops by default, and the old BOINC version is scaling it all the way up to 28+ TFlops. This causes the calculated credit to be MUCH higher than that of the 1660 Ti, hence triggering the 20-credit issue, according to my theory of course. But your 1660 Ti tasks are well below the 2,000,000-credit threshold that I'm estimating. The highest I've seen is ~1.7 million, so the line can't be much higher. I'm willing to bet that if one of your tasks on that 1660 Ti system runs for ~30,000-40,000 seconds, it gets hit with 20 credits. ¯\_(ツ)_/¯

You really should try to get your hands on a newer version of BOINC. I use a custom-compiled version of BOINC, and have usually used custom-compiled builds from newer versions of the source code. Maybe one of the other guys here can point you to a different repository that has a newer version of BOINC that can properly handle Turing cards.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56021 - Posted: 16 Dec 2020 | 3:13:29 UTC - in response to Message 56020.

I also verified that restarting ALONE won't necessarily trigger the 20-credit reward.

It depends WHEN you restart it. If you restart the task early enough that the combined runtime won't come close to the 2-million-credit mark, you'll get the normal points.

This task here: https://www.gpugrid.net/result.php?resultid=31934720

I restarted this task about 10-15 mins into it. It started over from the 10% mark, ran to completion, and still got normal crediting, well below the threshold.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56023 - Posted: 16 Dec 2020 | 14:36:25 UTC - in response to Message 56019.

A second, identical machine, except it has dual RTX 1660 Ti cards, finally got some work. The tasks reported and were awarded the large credits. So that rules out the question WRT BOINC version. FWIW, that version of BOINC is the latest available from the repository.

So maybe it is due to interruptions after all, and I am just unaware? I am running some more tasks now, and will check again in the morning.


I see you changed BOINC to 7.17.0.

Another thing I noticed was that the change didn't take effect until new tasks were downloaded after it, so tasks that were already there and tagged with the over-inflated flops value will probably still get 20 credits. Only the tasks downloaded after the change should work better.

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56027 - Posted: 16 Dec 2020 | 18:10:19 UTC - in response to Message 56023.

aaaand your 2080ti just completed a task and got credit with the new BOINC version. called it.

http://www.gpugrid.net/result.php?resultid=31951281
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56028 - Posted: 16 Dec 2020 | 18:13:53 UTC - in response to Message 56020.

I'm willing to bet that if one of your tasks on that 1660ti system runs for ~30,000-40,000 seconds, it gets hit with 20 credits. ¯\_(ツ)_/¯


Looks like just 25,000 s was enough to trigger it.

http://www.gpugrid.net/result.php?resultid=31946707

It'll even out over time, since your other tasks have been earning 2x as much credit as they should, because the old version of BOINC was doubling your peak_flops value.

____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 4,226,786,456
RAC: 12,117,166
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56030 - Posted: 17 Dec 2020 | 0:43:46 UTC

After upgrading all the BOINC clients, the tasks are erroring out. Ugh.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56031 - Posted: 17 Dec 2020 | 0:54:19 UTC - in response to Message 56030.

They were working fine on your 2080 Ti system when you had 7.17.0. Why change it?

But the issue you're having now looks like the same issue that Richard was dealing with here: https://www.gpugrid.net/forum_thread.php?id=5204

That thread has the steps they took to fix it. It's a permissions issue.
____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 4,226,786,456
RAC: 12,117,166
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56033 - Posted: 17 Dec 2020 | 4:47:44 UTC - in response to Message 56031.

they were working fine on your 2080ti system when you had 7.17.0. why change it?

but the issue you're having now looks like the same issue that richard was dealing with here: https://www.gpugrid.net/forum_thread.php?id=5204

that thread has the steps they took to fix it. it's a permissions issue.


That was a kludge. There is no such thing as 7.17.0. =;^) Once I verified that the newer version worked, I updated all my machines with the latest repository version, so it would be clean and updated going forward.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56036 - Posted: 17 Dec 2020 | 5:05:48 UTC - in response to Message 56033.

There is such a thing. It’s the development branch. All of my systems use a version of BOINC based on 7.17.0 :)
____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 4,226,786,456
RAC: 12,117,166
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56037 - Posted: 17 Dec 2020 | 5:23:58 UTC

Well sure. I meant a released version.
____________
Reno, NV
Team: SETI.USA

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 2,118,133
Level
Tyr
Scientific publications
watwatwatwatwat
Message 56046 - Posted: 18 Dec 2020 | 11:24:17 UTC
Last modified: 18 Dec 2020 | 11:24:46 UTC

So long start-to-end run times cause the 20-credit issue, not the restarts themselves. But tasks that are interrupted restart at 0, and thus have a longer start-to-end run time.

1070 or 1070 Ti:
27,656.18 s received 1,316,998.40 credits
42,652.74 s received 20.83 credits

1080 Ti:
21,508.23 s received 1,694,500.25 credits
25,133.86 s, 29,742.04 s, and 38,297.41 s tasks received 20.83 credits

I doubt they were interrupted with the tasks being High Priority and nothing else but GPUGrid in the BOINC queue.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56049 - Posted: 18 Dec 2020 | 14:57:21 UTC - in response to Message 56046.

Yup, I confirmed this. I manually restarted a task that didn't run very long, and it didn't have the issue.

The issue only happens if your credit reward would be greater than about 1.9 million.

Take some of your completed tasks and divide the total credit by the runtime in seconds to figure out how much credit you earn per second. Then figure out how many seconds you need to hit 1.9 million; that's the runtime limit for your system, and anything over that gets the 20-credit bug.
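
Worked through with the 1080 Ti figures mmonnin posted above (a sketch; the ~1.9 million cutoff is just the estimate from this thread):

# Credit per second from a completed task, then the runtime that would cross
# the ~1.9 M mark and trigger the 20-credit bug.
credit = 1_694_500.25        # completed 1080 Ti task from the list above
runtime_s = 21_508.23

rate = credit / runtime_s                 # ~78.8 credit/s
threshold_s = 1_900_000 / rate            # ~24,100 s (~6.7 h)

print(f"rate: {rate:.1f} credit/s, threshold: {threshold_s:.0f} s")
# Consistent with the 25,133 s task above being paid 20.83 credits.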
____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 4,226,786,456
RAC: 12,117,166
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56148 - Posted: 24 Dec 2020 | 15:33:20 UTC

Why is the number of tasks in progress dwindling? Are no new tasks being issued?
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56149 - Posted: 24 Dec 2020 | 15:48:21 UTC - in response to Message 56148.
Last modified: 24 Dec 2020 | 15:49:07 UTC

Most of the Python tasks I've received in the last 3 days have been "_0", which indicates brand new, plus a few resends here and there.

The rate at which they are being created has likely slowed, and demand is high since points chasers have come to try to snatch them up. It's also possible that the recent new (_0) ones are only re-creations of earlier failed tasks that had some bug that needed fixing. It does seem that this run is concluding.
____________

Profile trigggl
Send message
Joined: 6 Mar 09
Posts: 25
Credit: 102,324,681
RAC: 0
Level
Cys
Scientific publications
watwatwatwatwatwatwatwat
Message 56151 - Posted: 25 Dec 2020 | 16:41:49 UTC - in response to Message 55590.

...
Also Warnings about path not found:
WARNING conda.core.envs_manager:register_env(50): Unable to register environment. Path not writable or missing. environment location: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda registry file: /root/.conda/environments.txt

Registry file location ( /root/ ) will not be accessible to boinc user unless conda is already installed on the host (by root user) and conda file is world readable
...

I had the same error message except that mine was trying to go to
/opt/boinc/.conda/environments.txt

Profile trigggl
Send message
Joined: 6 Mar 09
Posts: 25
Credit: 102,324,681
RAC: 0
Level
Cys
Scientific publications
watwatwatwatwatwatwatwat
Message 56152 - Posted: 25 Dec 2020 | 16:43:36 UTC - in response to Message 55590.
Last modified: 25 Dec 2020 | 16:59:59 UTC

...
Also Warnings about path not found:
WARNING conda.core.envs_manager:register_env(50): Unable to register environment. Path not writable or missing. environment location: /var/lib/boinc-client/projects/www.gpugrid.net/miniconda registry file: /root/.conda/environments.txt

Registry file location ( /root/ ) will not be accessible to boinc user unless conda is already installed on the host (by root user) and conda file is world readable
...

I had the same error message except that mine was trying to go to...
/opt/boinc/.conda/environments.txt
Looks harmless, thanks for reporting. It's because the "boinc" user doesn't have a HOME directory I think.

Gentoo put the home for boinc at /opt/boinc.
I updated the user file to change it to /var/lib/boinc.

ALAIN_13013
Avatar
Send message
Joined: 11 Sep 08
Posts: 18
Credit: 1,551,929,462
RAC: 9
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56177 - Posted: 29 Dec 2020 | 6:50:04 UTC - in response to Message 55588.

I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches.

They may use a relatively large amount of disk space (order of 1-10 GB) which persists between runs, and is cleared if you reset the project.



What is the minimum card for this app? My 980 Ti doesn't load any WUs.
____________

rod4x4
Send message
Joined: 4 Aug 14
Posts: 266
Credit: 2,219,935,054
RAC: 0
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 56181 - Posted: 29 Dec 2020 | 13:08:38 UTC - in response to Message 56177.
Last modified: 29 Dec 2020 | 13:10:12 UTC

I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches.

They may use a relatively large amount of disk space (order of 1-10 GB) which persists between runs, and is cleared if you reset the project.



What type of card minimum for this app. My 980Ti don't load WU.

In "GPUGRID Preferences", ensure you select "Python Runtime (beta)" and "Run test applications?"
Your GPU, driver and OS should run these tasks fine

ALAIN_13013
Avatar
Send message
Joined: 11 Sep 08
Posts: 18
Credit: 1,551,929,462
RAC: 9
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56182 - Posted: 29 Dec 2020 | 13:32:58 UTC - in response to Message 56181.
Last modified: 29 Dec 2020 | 13:33:30 UTC

I'm creating some experimental tasks for the Python app (made Beta). They are Linux and CUDA specific and serve in preparation for future batches.

They may use a relatively large amount of disk space (order of 1-10 GB) which persists between runs, and is cleared if you reset the project.



What type of card minimum for this app. My 980Ti don't load WU.

In "GPUGRID Preferences", ensure you select "Python Runtime (beta)" and "Run test applications?"
Your GPU, driver and OS should run these tasks fine


Thanks, I just forgot "Run test applications" :)
____________

jiipee
Send message
Joined: 4 Jun 15
Posts: 19
Credit: 8,548,418,963
RAC: 2,508,736
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 56183 - Posted: 29 Dec 2020 | 13:35:30 UTC

All of these now seem to error out after the computation has finished, on several computers:

<message>
upload failure: <file_xfer_error>
<file_name>2p95312000-RAIMIS_NNPMM-0-1-RND8920_1_0</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>

</message>


What causes this, and how can it be fixed?
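
For what it's worth, the -131 error means the uploaded file is larger than the <max_nbytes> limit the server set for it. A quick diagnostic sketch (paths are placeholders; the actual workaround is the one described in the Number Crunching thread linked in the next reply) to compare the two values on your host:

# Compare an output file's size with the max_nbytes recorded in client_state.xml.
import os
import xml.etree.ElementTree as ET

STATE = "/var/lib/boinc-client/client_state.xml"
NAME = "2p95312000-RAIMIS_NNPMM-0-1-RND8920_1_0"     # file name from the error above

root = ET.parse(STATE).getroot()
for tag in ("file", "file_info"):                    # tag name differs between client versions
    for finfo in root.iter(tag):
        if finfo.findtext("name") == NAME:
            limit = float(finfo.findtext("max_nbytes", default="0"))
            path = f"/var/lib/boinc-client/projects/www.gpugrid.net/{NAME}"
            size = os.path.getsize(path) if os.path.exists(path) else None
            print(f"max_nbytes={limit:.0f}, actual size={size}")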

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56185 - Posted: 29 Dec 2020 | 14:24:17 UTC - in response to Message 56183.

What causes this and how it can be fixed?

I've just posted instructions in the Anaconda Python 3 Environment v4.01 failures thread (Number Crunching).

Read through the whole post. If you don't understand anything, or you don't know how to do any of the steps I've described - back away. Don't even attempt it until you're sure. You have to edit a very important, protected, file - and that needs care and experience.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56186 - Posted: 29 Dec 2020 | 14:33:52 UTC - in response to Message 56185.

What causes this and how it can be fixed?

I've just posted instructions in the Anaconda Python 3 Environment v4.01 failures thread (Number Crunching).

Read through the whole post. If you don't understand anything, or you don't know how to do any of the steps I've described - back away. Don't even attempt it until you're sure. You have to edit a very important, protected, file - and that needs care and experience.


This really needs to be fixed server-side (or it would be nice if it were configurable via cc_config, but that doesn't look to be the case either).

Stopping and starting the client is a recipe for instant errors, and where it succeeds, the process will need to be repeated every time you download new tasks. Not really a viable option unless you want to babysit the system all day.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56187 - Posted: 29 Dec 2020 | 14:45:32 UTC - in response to Message 56186.

Stopping and starting the client is a recipe for instant errors, and where successful, this process will need to be repeated for every time you download new tasks. not really a viable option unless you want to babysit the system all day.

By itself, it's fairly safe - provided you know and understand the software on your own system well enough. But you do need to have that experience and knowledge, which I why I put the caveats in.

I agree about having to re-do it for every new task, but I'd like to get my APR back up to something reasonable - and I'm happy to help nudge the admins one more step along the way to a fully-working, 'set and forget', application.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56189 - Posted: 29 Dec 2020 | 16:39:50 UTC - in response to Message 56187.

They're working on something...

WU 26917726

jiipee
Send message
Joined: 4 Jun 15
Posts: 19
Credit: 8,548,418,963
RAC: 2,508,736
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 56208 - Posted: 31 Dec 2020 | 8:59:22 UTC - in response to Message 56186.

What causes this and how it can be fixed?

I've just posted instructions in the Anaconda Python 3 Environment v4.01 failures thread (Number Crunching).

Read through the whole post. If you don't understand anything, or you don't know how to do any of the steps I've described - back away. Don't even attempt it until you're sure. You have to edit a very important, protected, file - and that needs care and experience.


really needs to be fixed server side (or would be nice if it were configurable via cc_config but that doesnt look to be the case either).

stopping and starting the client is a recipe for instant errors, and where successful, this process will need to be repeated for every time you download new tasks. not really a viable option unless you want to babysit the system all day.

Exactly so. I don't know about others, but I have no time to sit and watch my hosts working. A host works 10 hours to get the task done, and then everything turns out to be just a waste of time and energy because of this file size limitation. This is somewhat frustrating.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56209 - Posted: 31 Dec 2020 | 10:16:49 UTC - in response to Message 56208.

Opt out of the Beta test programme if you don't want to encounter those problems.

But as it happens, I haven't had a single over-run since they cancelled the one I highlighted in the post before yours.

jiipee
Send message
Joined: 4 Jun 15
Posts: 19
Credit: 8,548,418,963
RAC: 2,508,736
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 56210 - Posted: 31 Dec 2020 | 12:02:22 UTC - in response to Message 56209.

Opt out of the Beta test programme if you don't want to encounter those problems.

But as it happens, I haven't had a single over-run since they cancelled the one I highlighted in the post before yours.

Yes, I agree - something has changed.

It looks like the last full-length (successful) computation on my hosts that produced a too-large output file was WU 26900019, which ended 29 Dec 2020 | 15:00:52 UTC after 31,056 seconds of run time.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56864 - Posted: 7 May 2021 | 12:33:53 UTC
Last modified: 7 May 2021 | 12:46:38 UTC

I see some new Python tasks have gone out. However, they seem to be erroring for everyone.

https://www.gpugrid.net/results.php?userid=552015&offset=0&show_names=0&state=0&appid=31

They always seem to error with this "os" not defined error. GPU load stays at 0%.

Environment
Traceback (most recent call last):
File "run.py", line 5, in <module>
for key, value in os.environ.items():
NameError: name 'os' is not defined
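
The traceback points at run.py itself: line 5 iterates os.environ before the os module has been imported, so this looks like a packaging bug on the project side rather than anything on our hosts. A minimal sketch of what the fixed top of run.py presumably needs (we can't see the real script, so this is only an illustration):

# hypothetical sketch - the fix is simply importing os before touching os.environ
import os

for key, value in os.environ.items():
    print(key, "=", value)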

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56865 - Posted: 7 May 2021 | 14:14:09 UTC - in response to Message 56864.
Last modified: 7 May 2021 | 14:16:10 UTC

now seeing this:


==> WARNING: A newer version of conda exists. <==
current version: 4.8.3
latest version: 4.10.1

Please update conda by running

$ conda update -n base -c defaults conda


10:07:30 (341141): /usr/bin/flock exited; CPU time 42.091445
application ./gpugridpy/bin/python missing


and this:

09:57:32 (340085): wrapper (7.7.26016): starting
[input.zip]
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of input.zip or
input.zip.zip, and cannot find input.zip.ZIP, period.
boinc_unzip() error: 9

____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56866 - Posted: 7 May 2021 | 14:30:42 UTC - in response to Message 56865.

Just had my first two successful completions. It doesn't look like they ran any GPU work though; the GPU was never loaded. They just unpacked the WU, ran the setup, then exited. Marked as complete with no error. Only ran for about 45 seconds.

https://www.gpugrid.net/result.php?resultid=32570561
____________

klepel
Send message
Joined: 23 Dec 09
Posts: 189
Credit: 4,737,070,908
RAC: 565,372
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56867 - Posted: 7 May 2021 | 15:04:46 UTC - in response to Message 56866.

Just had my first two successful completions. It doesn't look like they ran any GPU work though; the GPU was never loaded. They just unpacked the WU, ran the setup, then exited. Marked as complete with no error. Only ran for about 45 seconds.

https://www.gpugrid.net/result.php?resultid=32570561

Did you have to update conda for the two successful tasks? I received a few new WUs, but they all errored. I will not have access to this computer until tomorrow.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56868 - Posted: 7 May 2021 | 15:09:16 UTC - in response to Message 56867.

Just had my first two successful completions. It doesn't look like they ran any GPU work though; the GPU was never loaded. They just unpacked the WU, ran the setup, then exited. Marked as complete with no error. Only ran for about 45 seconds.

https://www.gpugrid.net/result.php?resultid=32570561

Did you have to update conda for the two successful tasks? I received a few new WUs, but they all errored. I will not have access to this computer until tomorrow.


I didn't make any changes to my system between the failed tasks and the successful ones. AFAIK the project is sending conda packaged into these WUs, so it doesn't matter what you have installed; they contain everything you should need.

Looks like roughly testrun93 and higher are OK, but test runs in the 80s and lower all fail with some form of error like the ones I listed above.
____________

jiipee
Send message
Joined: 4 Jun 15
Posts: 19
Credit: 8,548,418,963
RAC: 2,508,736
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 56874 - Posted: 19 May 2021 | 12:03:14 UTC
Last modified: 19 May 2021 | 12:05:01 UTC

All of these Python WUs seem to fail. A pair of examples with different problems:

http://www.gpugrid.net/result.php?resultid=32583864

http://www.gpugrid.net/result.php?resultid=32583210

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56875 - Posted: 19 May 2021 | 12:39:26 UTC - in response to Message 56874.

Some succeed, but very few. Out of the 94 Python tasks I've received recently, only 4 succeeded.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56876 - Posted: 19 May 2021 | 15:11:55 UTC

I see some new tasks going out.

Still broken.

https://www.gpugrid.net/result.php?resultid=32584011

11:06:39 (1387708): /usr/bin/flock exited; CPU time 281.233647
11:06:39 (1387708): wrapper: running ./gpugridpy/bin/python (run.py)
WARNING: ray 1.3.0 does not provide the extra 'debug'
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.4.1 requires flatbuffers~=1.12.0, but you have flatbuffers 20210226132247 which is incompatible.
tensorflow 2.4.1 requires gast==0.3.3, but you have gast 0.4.0 which is incompatible.
tensorflow 2.4.1 requires grpcio~=1.32.0, but you have grpcio 1.36.1 which is incompatible.
tensorflow 2.4.1 requires opt-einsum~=3.3.0, but you have opt-einsum 3.1.0 which is incompatible.
/home/icrum/BOINC/slots/41/gpugridpy/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
"update your install command.", FutureWarning)
Traceback (most recent call last):
File "run.py", line 296, in <module>
main()
File "run.py", line 35, in main
args = get_args()
File "run.py", line 283, in get_args
config_file = open(config_path, 'rt', encoding='utf8')
FileNotFoundError: [Errno 2] No such file or directory: '/home/icrum/BOINC/slots/41/data/conf.yaml'
11:07:04 (1387708): ./gpugridpy/bin/python exited; CPU time 20.831556
11:07:04 (1387708): app exit status: 0x1
11:07:04 (1387708): called boinc_finish(195)

____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,271,529,776
RAC: 16,004,213
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56878 - Posted: 19 May 2021 | 17:57:38 UTC - in response to Message 56875.

Some succeed, but very few. Out of the 94 Python tasks I've received recently, only 4 succeeded.

65 received / 64 errored / 1 successful is my current balance

Greger
Send message
Joined: 6 Jan 15
Posts: 76
Credit: 24,920,057,682
RAC: 38,971,249
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 56879 - Posted: 19 May 2021 | 21:08:09 UTC

204 failed, 5 succeeded.

Got one that ran for a while but hit a RuntimeError:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

https://www.gpugrid.net/result.php?resultid=32584418

I read on the BOINC Discord today that MLC@Home is also testing PyTorch, and it looks like it causes some issues:
PyTorch uses SIGALRM internally, which seems to conflict with the libboinc API's usage of SIGALRM.
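
A toy sketch of why that clashes (not the actual BOINC or PyTorch code; the handler names below are made up): a process only gets one handler per signal, so whichever library registers a SIGALRM handler last silently replaces the other's.

import signal
import time

def boinc_heartbeat_handler(signum, frame):
    print("libboinc-style handler")

def pytorch_timeout_handler(signum, frame):
    print("pytorch-style handler")

# first library installs its SIGALRM handler...
signal.signal(signal.SIGALRM, boinc_heartbeat_handler)
# ...then the second one silently replaces it
signal.signal(signal.SIGALRM, pytorch_timeout_handler)

signal.alarm(1)   # schedule a SIGALRM in one second
time.sleep(2)     # only pytorch_timeout_handler ever fires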


I hope Toni gets this working soon; it looks to be a complex setup.

Greger
Send message
Joined: 6 Jan 15
Posts: 76
Credit: 24,920,057,682
RAC: 38,971,249
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 56880 - Posted: 20 May 2021 | 15:34:27 UTC

Most of the tasks for Anaconda Python 3 worked well today. Some changes have been made.

e1a1-ABOU_testzip13-0-1-RND2694_0 and higher appear to be good.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56881 - Posted: 20 May 2021 | 16:37:34 UTC

It seems that these Python tasks are being used to train some kind of AI/Machine Learning model.

Can any of the admins or researchers comment on this? I'd like to know more about the work being done.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56882 - Posted: 20 May 2021 | 16:59:41 UTC - in response to Message 56880.

Most of the tasks for Anaconda Python 3 worked well today. Some changes have been made.

e1a1-ABOU_testzip13-0-1-RND2694_0 and higher appear to be good.


Side note: you should set 'no new tasks' or remove GPUGRID from your RTX 30-series hosts. The applications here do not work with RTX 30-series Ampere cards and always produce errors.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 56883 - Posted: 21 May 2021 | 1:06:02 UTC

Looks like he only let one acemd3 task slip through to an Ampere card.

I don't think the Python tasks care much about the gpu architecture.

If the tasks are formatted correctly they appear to run fine on Ampere cards.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56884 - Posted: 21 May 2021 | 4:50:05 UTC - in response to Message 56883.

Looks like he only let one acemd3 task slip through to an Ampere card.

I don't think the Python tasks care much about the gpu architecture.

If the tasks are formatted correctly they appear to run fine on Ampere cards.


They care. They are still CUDA 10.0, and were compiled without the proper configuration for Ampere. They will all still fail on an Ampere card.

The Python tasks they’ve been pushing out recently never actually run any work on the GPU. They do a little bit of CPU processing and then complete or error. Even the few that succeed never touch the GPU.
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56926 - Posted: 2 Jun 2021 | 20:58:15 UTC
Last modified: 2 Jun 2021 | 21:30:03 UTC

I see a bunch of Python tasks went out again.

I allowed my hosts to pick up one. I don't have high hopes for it though: it's a _6 already, with constant errors from all the hosts before, so I'm expecting it'll fail as well.

Anyone have a successful run?

Maybe an admin could comment on why they keep sending out tasks that mostly fail and never seem to use the GPU?

-edit-
I was right: the Python task failed at right around 2 minutes and never ran anything on the GPU. It's like they aren't even bothering to test these tasks before sending them out.
____________

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 2,118,133
Level
Tyr
Scientific publications
watwatwatwatwat
Message 56927 - Posted: 2 Jun 2021 | 21:37:13 UTC
Last modified: 2 Jun 2021 | 21:38:07 UTC

All junk for me. None have completed, though I'm pretty sure some did before. All around 525-530 seconds. Nice ETA of 646 days, so BOINC freaks out.

CPU usage reported in BOINCTasks goes up to 4 threads' worth before leveling off a bit. BOINC reports CPU time equal to run time, even though that doesn't match what I see; the actual run time is half of what is reported.

Profile trigggl
Send message
Joined: 6 Mar 09
Posts: 25
Credit: 102,324,681
RAC: 0
Level
Cys
Scientific publications
watwatwatwatwatwatwatwat
Message 56934 - Posted: 6 Jun 2021 | 13:49:38 UTC

I have 3 of these valid over the past couple of days. None of them used the GPU. Did they complete any work?
https://www.gpugrid.net/result.php?resultid=32619357

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,765,428,051
RAC: 2,118,133
Level
Tyr
Scientific publications
watwatwatwatwat
Message 56935 - Posted: 6 Jun 2021 | 18:51:11 UTC

This one worked for me after that same PC failed earlier in the week
https://www.gpugrid.net/result.php?resultid=32619337

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56936 - Posted: 6 Jun 2021 | 19:07:10 UTC - in response to Message 56934.

I have 3 of these valid over the past couple of days. None of them used the GPU. Did they complete any work?
https://www.gpugrid.net/result.php?resultid=32619357


I agree. It's weird that these tasks are marked as GPU tasks with CUDA 10.0 and make the GPU otherwise unavailable for other tasks in BOINC, yet they never touch the GPU.

According to the stderr.txt, they seem to spend most of their time extracting and installing packages, then do "something" for a few seconds and complete. It's obvious that they are exploring some kind of machine learning approach, based on the packages used (pytorch, tensorflow, etc.) and the references to model training. Maybe they are still working out how to properly package the WUs so they have the right configuration for future real tasks.

Would be cool to hear what they are actually trying to accomplish with these tasks.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1142
Credit: 10,922,655,840
RAC: 22,476,589
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 56938 - Posted: 9 Jun 2021 | 4:46:24 UTC - in response to Message 56936.

Would be cool to hear what they are actually trying to accomplish with these tasks.

I guess you will never hear any details from them.
As we know, the GPUGRID people are very taciturn on everything.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,271,529,776
RAC: 16,004,213
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56939 - Posted: 9 Jun 2021 | 5:48:30 UTC - in response to Message 56938.
Last modified: 9 Jun 2021 | 5:49:21 UTC

Would be cool to hear what they are actually trying to accomplish with these tasks.

I guess you will never hear any details from them.
As we know, the GPUGRID people are very taciturn on everything.

In earlier times, when the GPUGRID project ran smoothly, they used to be better about giving some feedback to contributors.
I guess there must be serious reasons for the current lack of communication.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Send message
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Scientific publications
watwatwatwat
Message 56940 - Posted: 10 Jun 2021 | 10:12:43 UTC - in response to Message 56939.

For the time being we are perfecting the WU machinery so as to support ML packages + CUDA. All tasks are Linux beta for now. Thanks!

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,271,529,776
RAC: 16,004,213
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56941 - Posted: 10 Jun 2021 | 10:28:56 UTC - in response to Message 56940.

Thank you for this pearl!
Nice to know that things are moving along...

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56942 - Posted: 10 Jun 2021 | 12:59:40 UTC - in response to Message 56940.

For the time being we are perfecting the WU machinery so as to support ML packages + CUDA. All tasks are Linux beta for now. Thanks!


Thanks, Toni. Can you explain why these tasks are not using the GPU at all? They only run on the CPU; GPU utilization stays at 0%.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 56943 - Posted: 11 Jun 2021 | 17:19:49 UTC

I would like to know whether we are supposed to do the things requested in the output file. Things like updating the various packages that are called out.

Or are we supposed to do nothing and let the app/task packagers sort it out before generation?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56944 - Posted: 11 Jun 2021 | 17:44:37 UTC - in response to Message 56943.

I would like to know whether we are supposed to do the things requested in the output file. Things like updating the various packages that are called out.

Or are we supposed to do nothing and let the app/task packagers sort it out before generation?


I'm relatively sure these tasks are sandboxed. The packages being referenced (tensorflow, etc.) are part of the WU itself; they are installed by the extraction phase at the beginning of the WU. If you check your system, most likely you will find that you do not have tensorflow installed.

The package updates need to happen on the project side before distribution to us.
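
If you want to confirm that on your own machine, here is a quick local check (just a sketch; the package names are the ones that show up in the task output) asking your system Python whether those packages are importable:

# local check: are the ML packages visible to the *system* Python?
# (the task's own copies live inside its slot's gpugridpy environment instead)
import importlib.util

for pkg in ("tensorflow", "torch", "ray"):
    found = importlib.util.find_spec(pkg) is not None
    print(pkg, "is installed" if found else "is not installed", "in this Python")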
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 56945 - Posted: 11 Jun 2021 | 18:23:02 UTC - in response to Message 56944.

I wonder if I should add the project to my Nvidia Nano. It has Tensorflow installed by default in the distro.

But I wonder if the app would even run on the Maxwell card, even though it is mainly a CPU application for the time being and never seems to touch the GPU.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56946 - Posted: 11 Jun 2021 | 18:53:07 UTC - in response to Message 56945.
Last modified: 11 Jun 2021 | 18:54:07 UTC

I wonder if I should add the project to my Nvidia Nano. It has Tensorflow installed by default in the distro.

But I wonder if the app would even run on the Maxwell card even though it is mainly a cpu application for the time being and never touches the gpu it seems.


you can try, but I don't think it'll run because of the ARM CPU. there's no app for that here.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 56947 - Posted: 11 Jun 2021 | 19:15:29 UTC - in response to Message 56946.

Ah, yes . . . . forgot about that small matter . . . .

Profile trigggl
Send message
Joined: 6 Mar 09
Posts: 25
Credit: 102,324,681
RAC: 0
Level
Cys
Scientific publications
watwatwatwatwatwatwatwat
Message 56949 - Posted: 12 Jun 2021 | 17:09:44 UTC - in response to Message 56944.

I would like to know whether we are supposed to do the things requested in the output file. Things like updating the various packages that are called out.

Or are we supposed to do nothing and let the app/task packagers sort it out before generation?


I'm relatively sure these tasks are sandboxed. The packages being referenced (tensorflow, etc.) are part of the WU itself; they are installed by the extraction phase at the beginning of the WU. If you check your system, most likely you will find that you do not have tensorflow installed.

The package updates need to happen on the project side before distribution to us.

Furthermore, I checked on my Gentoo system what it would take to install Tensorflow. The only version available to me required Python 3.8. I didn't even bother to check it out, because my system is using Python 3.9 stable.

Things may become easier with the app if they are able to upgrade to Python 3.8. I don't know how this will work with Python 3.7. Is it just Gentoo taking the 3.7 option away?

emerge -v1p tensorflow

These are the packages that would be merged, in order:

Calculating dependencies .....

!!! Problem resolving dependencies for sci-libs/tensorflow
... done!

!!! The ebuild selected to satisfy "tensorflow" has unmet requirements.
- sci-libs/tensorflow-2.4.0::gentoo USE="cuda python -mpi -xla" ABI_X86="(64)" CPU_FLAGS_X86="avx avx2 fma3 sse sse2 sse3 sse4_1 sse4_2 -fma4" PYTHON_TARGETS="-python3_8"

The following REQUIRED_USE flag constraints are unsatisfied:
python? ( python_targets_python3_8 )

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 56950 - Posted: 12 Jun 2021 | 21:38:59 UTC

My Ubuntu 20.04.2 LTS distro has Python 3.8.5 installed, so it should satisfy the tensorflow requirements.

I'm curious enough to experiment and install tensorflow to see if the tasks will actually do something other than unpack the packages.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56958 - Posted: 15 Jun 2021 | 11:11:25 UTC

I was trying to catch some ACEMD tasks to test the oversized upload file report, but got a block of Pythons instead.

http://www.gpugrid.net/result.php?resultid=32623625

Can anyone advise on the multitude of gcc failures, starting with

gcc: fatal error: cannot execute ‘cc1plus’: execvp: No such file or directory
compilation terminated.

Machine is Linux Mint 20.1, freshly updated today (including BOINC v7.16.17, which is an auto-build test for the Mac release last week - not relevant here, but useful to keep an eye on to make sure they haven't broken anything else).

I have a couple of spare tasks suspended - I'll look through the actual runtime packaging to see what they're trying to achieve.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 56960 - Posted: 15 Jun 2021 | 13:58:15 UTC - in response to Message 56958.

Richard, I haven't had any GCC errors with any of the Python tasks on my hosts.

Never see it invoked.

Just see lots of unpacking and attempts to run other ML programs.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56962 - Posted: 15 Jun 2021 | 14:54:08 UTC - in response to Message 56960.

Interestingly, the task that failed to run gcc was re-issued to a computer on ServicEnginIC's account - and ran successfully. That gives me a completely different stderr_txt file to compare with mine. I'll make a permanent copy of both for reference, and try to work out what went wrong.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 10,271,529,776
RAC: 16,004,213
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56963 - Posted: 15 Jun 2021 | 15:27:38 UTC - in response to Message 56962.

Interestingly, the task that failed to run gcc was re-issued to a computer on ServicEnginIC's account - and ran successfully. That gives me a completely different stderr_txt file to compare with mine. I'll make a permanent copy of both for reference, and try to work out what went wrong.

I remember that I applied to all my hosts the remedy you kindly suggested in your message #55967.
I described it in message #55986.
Thank you again.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56964 - Posted: 15 Jun 2021 | 15:48:43 UTC - in response to Message 56963.

Thanks for the kind words. Yes, that's necessary, but not sufficient. My host 132158 got a block of four tasks when I re-enabled work fetch this morning.

The first task I ran - ID _621 - got

[1008937] INTERNAL ERROR: cannot create temporary directory!
11:23:17 (1008908): /usr/bin/flock exited; CPU time 0.132604

- that same old problem. I stopped the machine, did a full update and restart, and verified that the new .service file had the fix for that bug.

Then I fired off task ID _625 - that's the one with the cpp errors.

Unfortunately, we only get the last 64 KB of the file, and it's not enough in this case - we can't see what stage it's reached. But since the first task only ran 3 seconds, and the second lasted for 190 seconds, I assume we fell at the second hurdle.

My second Linux machine has just picked up two of the ADRIA tasks I was hunting for - I'll sort those out next.

valterc
Send message
Joined: 21 Jun 10
Posts: 21
Credit: 8,942,364,672
RAC: 13,844,304
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56975 - Posted: 16 Jun 2021 | 12:11:07 UTC - in response to Message 56958.

I was trying to catch some ACEMD tasks to test the oversized upload file report, but got a block of Pythons instead.

http://www.gpugrid.net/result.php?resultid=32623625

Can anyone advise on the multitude of gcc failures, starting with

gcc: fatal error: cannot execute ‘cc1plus’: execvp: No such file or directory
compilation terminated.

Machine is Linux Mint 20.1, freshly updated today (including BOINC v7.16.17, which is an auto-build test for the Mac release last week - not relevant here, but useful to keep an eye on to make sure they haven't broken anything else).

I have a couple of spare tasks suspended - I'll look through the actual runtime packaging to see what they're trying to achieve.

I got a similar error some time ago. A memory module was faulty and I started to get segmentation fault errors; eventually my compiling environment (gcc etc.) became messed up. I solved the situation by removing the bad module and completely reinstalling the compiling environment. What I might suggest is to verify whether gcc/g++ are actually working by compiling something of your choice.


Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56980 - Posted: 17 Jun 2021 | 16:29:54 UTC

Finally got time to google my gcc error. Simples: turns out the app requires g++, and it's not installed by default on Ubuntu - and, one assumes, derivatives like mine.

All it needed was

sudo apt-get install g++

No restart needed, of either BOINC or Linux, and task 32623619 completed successfully.

Not much sign of any checkpointing: one update at 10%, then nothing until the end.

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56981 - Posted: 17 Jun 2021 | 16:40:25 UTC - in response to Message 56980.

Finally got time to google my gcc error. Simples: turns out the app requires g++, and it's not installed by default on Ubuntu - and, one assumes, derivatives like mine.

Hummm...
I have run a few on Ubuntu 20.04.2 and did not do anything special, unless maybe something else I was working on required it.
http://www.gpugrid.net/results.php?hostid=452287

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56982 - Posted: 17 Jun 2021 | 17:47:38 UTC - in response to Message 56980.
Last modified: 17 Jun 2021 | 17:50:31 UTC

Finally got time to google my gcc error. Simples: turns out the app requires g++, and it's not installed by default on Ubuntu - and, one assumes, derivatives like mine.

All it needed was

sudo apt-get install g++

No restart needed, of either BOINC or Linux, and task 32623619 completed successfully.

Not much sign of any checkpointing: one update at 10%, then nothing until the end.


I think this just might be your distribution. I never installed this (specifically) on my Ubuntu 20.04 systems; if it's there, it was there by default or was pulled in by some other package as a dependency.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1626
Credit: 9,387,266,723
RAC: 19,004,805
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 56983 - Posted: 17 Jun 2021 | 18:20:34 UTC - in response to Message 56982.

This was a fairly recent (February 2021) clean installation of Linux Mint 20.1 'Ulyssa' - I decided to throw away my initial fumblings with Mint 19.1, and start afresh. So, let this be a warning: not every distro is as complete as you might expect.

Anyway, the solution is in public now, in case anyone else needs it.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 56999 - Posted: 22 Jun 2021 | 12:11:05 UTC

errors on the Python tasks again.

https://www.gpugrid.net/result.php?resultid=32626713
____________

Jim1348
Send message
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57001 - Posted: 22 Jun 2021 | 15:39:11 UTC - in response to Message 56999.

errors on the Python tasks again.

I see them too.

http://www.gpugrid.net/results.php?hostid=452287
UnsatisfiableError: The following specifications were found to be incompatible with each other:

That will give them something to work on.

Profile robertmiles
Send message
Joined: 16 Apr 09
Posts: 503
Credit: 762,859,535
RAC: 89,385
Level
Glu
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57006 - Posted: 23 Jun 2021 | 0:51:34 UTC

I'm now preferentially using my GPU for tasks related to medical research. Could you mention whether the Python tasks are related to medical research and whether they use the GPU?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1078
Credit: 40,231,533,983
RAC: 24
Level
Trp
Scientific publications
wat
Message 57007 - Posted: 23 Jun 2021 | 3:07:49 UTC - in response to Message 57006.

I'm now preferentially using my GPU for tasks related to medical research. Could you mention whether the Python tasks are related to medical research and whether they use the GPU?


Right now these Python tasks are using machine learning to do what we assume is some kind of medical research, but the admins haven't given many specifics on exactly how, or what type of medical research specifically is being done. GPUGRID as a whole does various types of medical research; see more info about it in the other thread here: https://www.gpugrid.net/forum_thread.php?id=5233

The tasks are labelled as GPU tasks, and they do reserve the GPU in BOINC (i.e. other tasks won't run on it), but in reality the GPU is not actually used: it sits idle, and all the computation happens on the CPU thread that's assigned to the job. The admins have stated that these early tasks are still in testing and will use the GPU in the future, but right now they don't.
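
If you want to check that yourself, one rough way (just a sketch, assuming the pynvml bindings from the nvidia-ml-py package are installed) is to poll GPU utilization while one of these tasks is running:

# sample GPU utilization once a second for a minute (requires pynvml / nvidia-ml-py)
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust the index if needed
for _ in range(60):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    print(f"GPU utilization: {util.gpu}%")
    time.sleep(1)
pynvml.nvmlShutdown()

Anything other than a flat 0% would mean the task is actually touching the card.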

The other thing to keep in mind: the Python application is Linux-only (at least right now), so you won't be able to get these tasks on your Windows system.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1358
Credit: 7,897,244,095
RAC: 6,517,334
Level
Tyr
Scientific publications
watwatwatwatwat
Message 57036 - Posted: 29 Jun 2021 | 22:56:36 UTC

Just finished a new Python task that didn't error out. Hope this is the start of a trend.
