Message boards : Number crunching : ATMML

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,209,900,137
RAC: 14,573,666
Level
Trp
Message 61582 - Posted: 5 Jul 2024 | 14:15:37 UTC

I just finished crunching a task for this new application successfully.

https://www.gpugrid.net/result.php?resultid=35379717

What exactly are we crunching here?

Keith Myers
Joined: 13 Dec 17
Posts: 1352
Credit: 7,753,570,448
RAC: 11,869,104
Level
Tyr
Message 61583 - Posted: 5 Jul 2024 | 15:08:01 UTC

Judging by the name of the app, it somehow uses machine learning.

Erich56
Joined: 1 Jan 15
Posts: 1132
Credit: 10,443,547,676
RAC: 26,958,523
Level
Trp
Message 61584 - Posted: 6 Jul 2024 | 16:38:31 UTC - in response to Message 61582.

I just finished crunching a task for this new application successfully.

how did you manage to download such a task?
The list in which you can choose from the various subprojects does NOT include ATMML

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 21 Dec 23
Posts: 46
Credit: 0
RAC: 0
Message 61585 - Posted: 6 Jul 2024 | 18:36:43 UTC

This app is in testing mode, so it does not yet appear in the list of applications you can select. You will only get the WUs if you have opted in to run test applications. It is a different version of the existing ATM app that includes machine-learning-based force fields for the molecular dynamics.

Keith Myers
Joined: 13 Dec 17
Posts: 1352
Credit: 7,753,570,448
RAC: 11,869,104
Level
Tyr
Message 61586 - Posted: 6 Jul 2024 | 19:59:54 UTC - in response to Message 61585.

Thanks for the progress update and explanation of just what kind of ML is being used for the ATM tasks, Steve.


I also see you released a new beta ATM app yesterday to go along with the ATMML app. I already ran one of those today.

Drago
Joined: 3 May 20
Posts: 18
Credit: 871,817,770
RAC: 4,242,718
Level
Glu
Message 61594 - Posted: 15 Jul 2024 | 9:34:47 UTC - in response to Message 61586.

Is it Windows, Linux or both?

ServicEnginIC
Joined: 24 Sep 10
Posts: 581
Credit: 9,945,862,024
RAC: 20,331,153
Level
Tyr
Message 61595 - Posted: 15 Jul 2024 | 10:32:05 UTC

You can verify OS compatibility for the different applications on the GPUGRID apps page.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,209,900,137
RAC: 14,573,666
Level
Trp
Message 61602 - Posted: 17 Jul 2024 | 1:10:16 UTC

I noticed that this batch of ATMML units takes almost 3 times longer to complete than the previous batches. I suspended one of them, and when I resumed it, it would not restart. I kept it "running" for over an hour with no progress, so I had no option but to abort it.


Pascal
Joined: 15 Jul 20
Posts: 78
Credit: 1,667,472,303
RAC: 11,682,892
Level
His
Message 61604 - Posted: 17 Jul 2024 | 7:45:34 UTC - in response to Message 61602.

Indeed, they take a very long time to compute. I am going to stop them too.
9h20 on my RTX 4060 and 14h20 on my RTX A2000.
____________

Pascal
Joined: 15 Jul 20
Posts: 78
Credit: 1,667,472,303
RAC: 11,682,892
Level
His
Message 61607 - Posted: 17 Jul 2024 | 20:37:40 UTC

I cancelled the 4 ATMML tasks I had because they take too long to compute:
between 16 and 24 hours. PROGRAMMERS, I hope you will look into this problem?
____________

Keith Myers
Joined: 13 Dec 17
Posts: 1352
Credit: 7,753,570,448
RAC: 11,869,104
Level
Tyr
Message 61609 - Posted: 18 Jul 2024 | 16:30:26 UTC

Didn't have any issues with the new ATMML tasks I received. Rescued one at "the last chance saloon" as the _7 wingman.

They don't seem to have an "unreasonable" crunch time for the hardware used: about 7 hours or so. I've had acemd tasks that ran for 12-14 hours before.

mmonnin
Joined: 2 Jul 16
Posts: 337
Credit: 7,711,257,660
RAC: 10,801,436
Level
Tyr
Message 61611 - Posted: 20 Jul 2024 | 1:46:11 UTC

I don't recall a larger executable from a BOINC project. 4.67 GB! That is larger than some LHC VDI files.

roundup
Joined: 11 May 10
Posts: 63
Credit: 9,648,842,244
RAC: 57,094,537
Level
Tyr
Message 61612 - Posted: 20 Jul 2024 | 4:15:20 UTC

I had 57 units so far without a single error. Great!
The fastest unit took 4,197 seconds (1.17 hours) on a 4080 Super; the longest one took a bit over 30,000 seconds (8.33 hours) on a 4060 Ti.
More than reasonable.

Drago
Joined: 3 May 20
Posts: 18
Credit: 871,817,770
RAC: 4,242,718
Level
Glu
Message 61615 - Posted: 22 Jul 2024 | 13:05:22 UTC

Hello everyone! My four hosts running 3060, 3060 Ti and 3070 Ti cards have not been able to complete a single unit so far. They all fail at the very beginning with the following STDERR output: "Error loading CUDA module". I am running Linux Mint and Ubuntu with Nvidia driver 470. The newer drivers produce errors in other projects, so I decided to stick with that driver version. I noticed that a lot of my wingmen successfully crunch the units with driver 530 or 535. Is that a driver issue? All other projects run just fine on version 470.


Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
Traceback (most recent call last):
  File "/var/lib/boinc-client/slots/24/bin/rbfe_explicit_sync.py", line 10, in <module>
    rx.setupJob()
  File "/var/lib/boinc-client/slots/24/lib/python3.11/site-packages/sync/atm.py", line 85, in setupJob
    self.worker = OMMWorkerATM(ommsystem, self.config, self.logger)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/24/lib/python3.11/site-packages/sync/worker.py", line 34, in __init__
    self.simulation = Simulation(self.topology, self.ommsystem.system, self.integrator, platform, properties)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/24/lib/python3.11/site-packages/openmm/app/simulation.py", line 106, in __init__
    self.context = mm.Context(self.system, self.integrator, platform, platformProperties)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/24/lib/python3.11/site-packages/openmm/openmm.py", line 12171, in __init__
    _openmm.Context_swiginit(self, _openmm.new_Context(*args))
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
openmm.OpenMMException: Error loading CUDA module: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Message 61616 - Posted: 22 Jul 2024 | 15:07:38 UTC - in response to Message 61615.

with that error, yes i would assume the old driver version is the issue.

CUDA historically has not been forward compatible. as in, a CUDA10 binary could not run on a system with only CUDA 8 drivers. but the opposite was true in most cases, that backward compatibility is fine and you can run even very old CUDA code with the latest drivers.

only starting with CUDA 11.1 was forward compatibility introduced, and only within the same major version. So a system with only CUDA 11.1 drivers could still run up to CUDA 11.8 binaries. Same goes for CUDA12, where all CUDA 12 drivers will be compatible with all CUDA 12+ binaries.

I have a feeling that some parts of this new ATMML app, and probably in particular OpenMM (based on what's throwing the error) actually requires CUDA 12+ drivers. and the app is misidentified at the project as being CUDA 11 compatible.

you could test this by installing the newer drivers and see if they then run.

what other project has issue with the newer drivers?
____________

Pascal
Joined: 15 Jul 20
Posts: 78
Credit: 1,667,472,303
RAC: 11,682,892
Level
His
Message 61617 - Posted: 22 Jul 2024 | 16:10:31 UTC - in response to Message 61616.

On my machine the system's stock drivers work very well. They are the 535 drivers supplied with the Linux Mint install.

https://www.gpugrid.net/results.php?userid=563937
____________

Drago
Joined: 3 May 20
Posts: 18
Credit: 871,817,770
RAC: 4,242,718
Level
Glu
Message 61619 - Posted: 23 Jul 2024 | 12:25:52 UTC - in response to Message 61617.

On my machine the system's stock drivers work very well. They are the 535 drivers supplied with the Linux Mint install.

https://www.gpugrid.net/results.php?userid=563937


I tried to install the 535 driver, but after that my GPU is no longer recognised by Amicable, Einstein and Asteroids. GPUgrid lets me start new WUs, but they fail after 43 seconds saying that no Nvidia GPU was found. Do I have to install additional libraries or something like that? I also noticed that there is an open driver package from Nvidia, a regular meta package, and a server version of that driver. Which one are you guys using?

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Message 61620 - Posted: 23 Jul 2024 | 13:13:11 UTC - in response to Message 61619.

On my machine the system's stock drivers work very well. They are the 535 drivers supplied with the Linux Mint install.

https://www.gpugrid.net/results.php?userid=563937


I tried to install the 535 driver, but after that my GPU is no longer recognised by Amicable, Einstein and Asteroids. GPUgrid lets me start new WUs, but they fail after 43 seconds saying that no Nvidia GPU was found. Do I have to install additional libraries or something like that? I also noticed that there is an open driver package from Nvidia, a regular meta package, and a server version of that driver. Which one are you guys using?


if you're running opencl applications then yes you need additional opencl package.

sudo apt install ocl-icd-libopencl1

535 drivers work fine for einstein, most of my hosts are on that driver and I contribute to einstein primarily.
____________

Pascal
Joined: 15 Jul 20
Posts: 78
Credit: 1,667,472,303
RAC: 11,682,892
Level
His
Message 61621 - Posted: 23 Jul 2024 | 15:40:38 UTC - in response to Message 61620.

I don't use any extra packages.
I installed Linux Mint normally and applied the system and driver updates.
I installed the 535 drivers through the driver manager and everything works very well.
BOINC recognizes my RTX 4060, my RTX A2000 and my GTX 1650 in the same PC.
I crunch for GPUGRID and Amicable Numbers without problems.
Either your system installation is faulty or you have a hardware problem.
____________

Pascal
Joined: 15 Jul 20
Posts: 78
Credit: 1,667,472,303
RAC: 11,682,892
Level
His
Message 61622 - Posted: 23 Jul 2024 | 15:55:08 UTC

To start with, I advise you to test your RAM modules with the free version of MemTest and not another program. It works very well and is reliable.

https://www.memtest86.com/
____________

Pascal
Joined: 15 Jul 20
Posts: 78
Credit: 1,667,472,303
RAC: 11,682,892
Level
His
Message 61623 - Posted: 23 Jul 2024 | 15:57:56 UTC

When you install driver 535 under Linux, it also installs everything needed for computing,
that is, Nvidia OpenCL and CUDA, so there is nothing else to install.
Just driver 535 and that's it.
____________

Pascal
Joined: 15 Jul 20
Posts: 78
Credit: 1,667,472,303
RAC: 11,682,892
Level
His
Message 61624 - Posted: 23 Jul 2024 | 16:46:34 UTC

I also advise you to test your hard drive to see whether it has any bad clusters.
Run a surface test with the free HD Tune 2.55 or another program such as CrystalDiskInfo.

https://www.hdtune.com/download.html

____________

Pascal
Joined: 15 Jul 20
Posts: 78
Credit: 1,667,472,303
RAC: 11,682,892
Level
His
Message 61625 - Posted: 23 Jul 2024 | 16:48:16 UTC

To troubleshoot a PC, I always start with these two things.
____________

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Message 61626 - Posted: 23 Jul 2024 | 17:45:43 UTC - in response to Message 61623.

it may be true for Mint that opencl components are installed with the normal driver package. but that is not the case for Ubuntu, and you do need to install the opencl components separately with the command in my previous post.
____________

Drago
Joined: 3 May 20
Posts: 18
Credit: 871,817,770
RAC: 4,242,718
Level
Glu
Message 61631 - Posted: 24 Jul 2024 | 13:58:43 UTC
Last modified: 24 Jul 2024 | 13:59:35 UTC

OK, I managed to get one of my Ubuntu hosts running with the 535 driver and the additional OpenCL libraries installed, as you said, Ian&Steve. It has been crunching an ATMML unit for 5 hours so far and it's looking good! I am surprised to see that you seemingly need OpenCL to run CUDA code, because that is the only difference from my previous attempts.

So far no luck with my Mint Laptop. If I change the driver to any other version than 470 the GPU is no longer detected. That may also have to do with the AMD driver from the on-board AMD GPU. Maybe they interfere with each other. Will try my other hosts this weekend. Thanks for the help.

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Message 61632 - Posted: 24 Jul 2024 | 14:13:39 UTC - in response to Message 61631.

you don't need opencl driver components to run true cuda code. but a lot of other projects are not running apps compiled in cuda, but rather OpenCL.
____________

Aurum
Joined: 12 Jul 17
Posts: 401
Credit: 16,755,010,632
RAC: 99,681
Level
Trp
Message 61633 - Posted: 25 Jul 2024 | 6:20:50 UTC - in response to Message 61615.

...running 3060, 3060ti and 3070ti were not able to complete a single unit so far.

It's hit or miss for 1080 Ti, 2070 Ti, and 3060 Ti to successfully complete ATMML WUs. They have no problem with 1.05 QC so I dedicate them to QC.
My 2080 Ti, 3080, and 3080 Ti GPUs have no problem running ATMML so they're dedicated to ATMML.
All running Linux Mint with Nvidia 550.54.14.

____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 21 Dec 23
Posts: 46
Credit: 0
RAC: 0
Message 61634 - Posted: 25 Jul 2024 | 8:14:34 UTC
Last modified: 25 Jul 2024 | 8:24:13 UTC

Hello.

This app is actually built with cudatoolkit version 11.8

For reasons indicated here:
https://docs.nvidia.com/deploy/cuda-compatibility/#application-considerations-for-minor-version-compatibility

The minimum driver version is not 450.80.02, as stated in Table 1 of that link, but most likely 520, which was released with 11.8: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id5

This app uses OpenMM, which uses PTX code. Hence, if your driver is too old, you will see the error: CUDA_ERROR_UNSUPPORTED_PTX_VERSION (222)

For reference my test machine is a GTX 1080 with driver version 545
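
For anyone who wants to check a host before fetching work, here is a minimal sketch. It assumes the "most likely 520" minimum mentioned above, which is not a confirmed figure, and the helper names are made up for illustration. The nvidia-smi query used to read the installed driver version is the standard one.

```python
import subprocess

# Assumption: 520 is the "most likely" minimum driver quoted above, not a confirmed limit.
ASSUMED_MIN_DRIVER_MAJOR = 520


def driver_major(version: str) -> int:
    """Return the major part of a driver version string such as '535.183.01'."""
    return int(version.strip().split(".")[0])


def is_new_enough(version: str, minimum: int = ASSUMED_MIN_DRIVER_MAJOR) -> bool:
    """True if the driver's major version is at least the assumed minimum."""
    return driver_major(version) >= minimum


def installed_driver_versions() -> list[str]:
    """Ask nvidia-smi for the driver version of each GPU (one output line per GPU)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a host with an Nvidia driver installed, `[is_new_enough(v) for v in installed_driver_versions()]` gives one answer per GPU. A 470 driver fails the check, while 535 and 545 pass.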

Drago
Joined: 3 May 20
Posts: 18
Credit: 871,817,770
RAC: 4,242,718
Level
Glu
Message 61646 - Posted: 31 Jul 2024 | 11:34:25 UTC

Until yesterday I received the ATMML WUs and my hosts crunched them successfully with driver version 535. Today I am not getting any new ones. The server says there are none available, but on the server main page there are always around 300 available for download. That number fluctuates, so others are getting them. Does anybody else have that problem?

roundup
Joined: 11 May 10
Posts: 63
Credit: 9,648,842,244
RAC: 57,094,537
Level
Tyr
Message 61647 - Posted: 31 Jul 2024 | 12:54:39 UTC - in response to Message 61646.

Does anybody else have that problem?

No. Since the new batch arrived I have constant supply on two machines with drivers 535 and 550.
The new work units seem to take much longer to calculate (with higher credits):
11700s on 4080 Super
12000s on 4080
14700s on 4070Ti

Drago
Joined: 3 May 20
Posts: 18
Credit: 871,817,770
RAC: 4,242,718
Level
Glu
Message 61648 - Posted: 31 Jul 2024 | 15:28:07 UTC

Huh... Do you know how much vram they need? My GPUs are limited to 8 GB.

Keith Myers
Joined: 13 Dec 17
Posts: 1352
Credit: 7,753,570,448
RAC: 11,869,104
Level
Tyr
Message 61649 - Posted: 31 Jul 2024 | 15:45:03 UTC - in response to Message 61648.

Huh... Do you know how much vram they need? My GPUs are limited to 8 GB.

8 GB is plenty. They use 3-4 GB at most, and only some of the time.

Speedy
Joined: 19 Aug 07
Posts: 43
Credit: 40,991,082
RAC: 601,560
Level
Val
Message 61650 - Posted: 1 Aug 2024 | 3:23:45 UTC - in response to Message 61594.

Is it Windows, Linux or both?

Unfortunately it is only for Linux according to the application page.

mrchips
Joined: 9 May 21
Posts: 16
Credit: 1,393,597,617
RAC: 1,775,061
Level
Met
Message 61651 - Posted: 2 Aug 2024 | 11:25:00 UTC

still waiting and wanting for some Windows tasks
____________

Dmit
Joined: 12 Sep 10
Posts: 8
Credit: 162,918,469
RAC: 459,311
Level
Ile
Message 61670 - Posted: 11 Aug 2024 | 23:25:47 UTC - in response to Message 61619.
Last modified: 11 Aug 2024 | 23:31:13 UTC

I tried to install the 535 driver but after that my GPU is no longer recognised by Amicable, Einstein and Asteroids. GPUgrid lets me start new wus but they fail after 43 seconds saying that no Nvidia GPU was found.

It looks like OpenCL is not installed with the distro's 535 drivers. You need to download the official *.run installer from the Nvidia website and install it manually, rather than through the distro's driver manager.
Something like this method, but with a newer *.run version of course:
https://askubuntu.com/questions/66328/how-do-i-install-the-latest-nvidia-drivers-from-the-run-file

Opolis
Joined: 19 Feb 12
Posts: 3
Credit: 1,236,053,370
RAC: 6,698,446
Level
Met
Message 61685 - Posted: 22 Aug 2024 | 16:59:03 UTC

These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080ti, driver version 535.183.01.

Greger
Joined: 6 Jan 15
Posts: 76
Credit: 24,274,812,783
RAC: 10,706,439
Level
Trp
Message 61686 - Posted: 22 Aug 2024 | 18:07:41 UTC - in response to Message 61685.

The credit is lower because it took longer between receiving the task and returning the result.

19 Aug 2024 | 19:41:09 UTC 21 Aug 2024 | 20:35:33 UTC
Those are the times. That is more than 24 hours, and therefore the points were reduced.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,209,900,137
RAC: 14,573,666
Level
Trp
Message 61687 - Posted: 22 Aug 2024 | 18:13:19 UTC - in response to Message 61685.

These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080ti, driver version 535.183.01.


The points are accurate. You get a 50% bonus, if you finish the task successfully and return the results within 24 hours from downloading it. There is a 25% bonus if you do it within 48 hours. No bonus if you return it after 48 hours. This is an incentive for quick return of results.
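
As a rough illustration of that rule, here is a small sketch. It assumes that "within 24 hours" and "within 48 hours" include the exact boundary, and the function name and example timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def bonus_multiplier(downloaded: datetime, returned: datetime) -> float:
    """Credit multiplier under the quick-return rule described above."""
    turnaround = returned - downloaded
    if turnaround <= timedelta(hours=24):
        return 1.5   # 50% bonus
    if turnaround <= timedelta(hours=48):
        return 1.25  # 25% bonus
    return 1.0       # no bonus


# Hypothetical task: downloaded at 08:00, returned 26 hours later.
downloaded = datetime(2024, 8, 1, 8, 0, 0)
returned = datetime(2024, 8, 2, 10, 0, 0)
print(bonus_multiplier(downloaded, returned))  # 1.25
```

So a task that sits in the queue overnight can drop from the 50% tier to the 25% tier even if its actual run time is only a few hours.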


Richard
Joined: 13 Jan 24
Posts: 2
Credit: 28,764,406
RAC: 183,900
Level
Val
Message 61689 - Posted: 22 Aug 2024 | 18:24:03 UTC

This task has been "downloading" for almost 4 hrs, has 0% completion and estimated run time of 400+ days.

Looks like it needs killing?

Running on Win 11, high-end machine.

Richard

==============================

Application
ATMML: Free energy with neural networks 1.01 (cuda1121)

Name TYK2_A02_A09_r0_2-QUICO_ATM_AF_04_Benchmark-12-20-RND2746
State Downloading
Received 2024-08-22 7:40:17 AM
Report deadline 2024-08-27 7:40:19 AM
Resources 0.996 CPUs + 1 NVIDIA GPU
Estimated computation size 1,000,000,000 GFLOPs
Executable wrapper_6.1_windows_x86_64.exe

Keith Myers
Joined: 13 Dec 17
Posts: 1352
Credit: 7,753,570,448
RAC: 11,869,104
Level
Tyr
Message 61690 - Posted: 22 Aug 2024 | 18:29:45 UTC - in response to Message 61689.

Patience . . . . grasshopper. This is a new app for Windows hosts so you are competing for download bandwidth with the cohort of other Windows hosts. Which are many.

The task runtime estimation is not accurate until your host has returned 11 valid tasks of that type to develop a correct and accurate APR rate.

Only then will the estimated runtimes be correct.
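
To see why the first estimate can be so far off, here is a back-of-the-envelope sketch. It assumes the client's estimate is simply the task's stated computation size divided by the speed it currently assumes for the app. The 1,000,000,000 GFLOPs figure is from the task properties quoted above; the assumed speed is a made-up value chosen to land near 400 days.

```python
SECONDS_PER_DAY = 86_400


def estimated_runtime_days(fpops_est: float, assumed_flops: float) -> float:
    """Estimated run time in days: computation size divided by assumed speed."""
    return fpops_est / assumed_flops / SECONDS_PER_DAY


# "Estimated computation size 1,000,000,000 GFLOPs" from the task properties.
fpops_est = 1_000_000_000 * 1e9

# Hypothetical speed the client assumes before any valid results exist (29 GFLOPS).
assumed_flops = 2.9e10

print(f"{estimated_runtime_days(fpops_est, assumed_flops):.1f} days")
```

Under this assumption the estimate only becomes realistic once the assumed speed is replaced by a rate measured from the host's own completed tasks.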

Richard
Joined: 13 Jan 24
Posts: 2
Credit: 28,764,406
RAC: 183,900
Level
Val
Message 61692 - Posted: 22 Aug 2024 | 20:11:48 UTC - in response to Message 61689.

Looks like it finally started and ran for a few minutes, then uploaded...

Richard

Opolis
Joined: 19 Feb 12
Posts: 3
Credit: 1,236,053,370
RAC: 6,698,446
Level
Met
Message 61693 - Posted: 22 Aug 2024 | 22:21:42 UTC - in response to Message 61687.

These tasks are running fine for me so far. The only thing I noticed was that the points awarded seem off. The second task I completed took an hour longer than the first but received 900k fewer points. So far they have been taking 5-6 hours on a 3080ti, driver version 535.183.01.


The points are accurate. You get a 50% bonus, if you finish the task successfully and return the results within 24 hours from downloading it. There is a 25% bonus if you do it within 48 hours. No bonus if you return it after 48 hours. This is an incentive for quick return of results.




Ah you are correct. I had the one task stuck in "downloading" for a while and I didn't run it until the next day.

WPrion
Joined: 30 Apr 13
Posts: 96
Credit: 2,845,934,111
RAC: 21,329,667
Level
Phe
Message 61694 - Posted: 23 Aug 2024 | 1:13:04 UTC

Are there no checkpoints on ATMML tasks?

I was about 30% complete when I had to suspend the task and shut down the computer. When I restarted both the % done and elapsed time were zero.

Bedrich Hajek
Joined: 28 Mar 09
Posts: 485
Credit: 11,209,900,137
RAC: 14,573,666
Level
Trp
Message 61695 - Posted: 23 Aug 2024 | 2:00:00 UTC - in response to Message 61694.

Are there no checkpoints on ATMML tasks?

I was about 30% complete when I had to suspend the task and shut down the computer. When I restarted both the % done and elapsed time were zero.



No, there are not. Same goes for quantum chemistry and ATM. They haven't figured out how to do it, yet.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Message 61696 - Posted: 23 Aug 2024 | 7:14:23 UTC

I hope this doesn't backfire. This morning I see 800 tasks in progress, but zero ready to send.

My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first.

I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years:


  • small cache, especially with slower GPUs.
  • run continuously, don't allow interruptions (especially auto-updates)
  • don't swap to a different GPU type mid-run

ServicEnginIC
Joined: 24 Sep 10
Posts: 581
Credit: 9,945,862,024
RAC: 20,331,153
Level
Tyr
Message 61697 - Posted: 23 Aug 2024 | 10:32:49 UTC - in response to Message 61696.

I do hope new Windows users pay attention to the 'tricks of the trade' we've learned over the years:

Thank you for your ever-sharing expertise


My last two downloads have been replica _3 tasks, each WU having failed on three Windows machines first.

Despite this, there is a noticeable increase in the number of users returning ATMML results.
This is likely the effect of Windows users now being added to the previous Linux ones.
Before the new Windows ATMML app was released, users/24h was consistently about 80 - 100.
Currently it is more than 230, as can be seen on the Server status page.

WPrion
Joined: 30 Apr 13
Posts: 96
Credit: 2,845,934,111
RAC: 21,329,667
Level
Phe
Message 61698 - Posted: 23 Aug 2024 | 10:55:09 UTC - in response to Message 61595.
Last modified: 23 Aug 2024 | 10:58:27 UTC

Re: the Apps Page: https://www.gpugrid.net/apps.php

I wish, for consistency, it would state:

ATMML: Free energy with neural networks for GPU

Also, when selecting projects in project preferences, it would be nice if it stated:

ATMML on GPU

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Message 61699 - Posted: 23 Aug 2024 | 11:21:33 UTC - in response to Message 61698.

Re: the Apps Page: https://www.gpugrid.net/apps.php

I wish, for consistency, it would state:

ATMML: Free energy with neural networks for GPU

Also, when selecting projects in project preferences, it would be nice if it stated:

ATMML on GPU


this is GPUgrid. all tasks are for GPU
____________

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Message 61700 - Posted: 23 Aug 2024 | 12:41:38 UTC - in response to Message 61697.

Despite this, there is a noticeable increase in the number of users returning ATMML results.

Indeed. But the question is: are those completed, end-of-run, scientifically useful results - or are they early crashes, resulting only in the creation and issue of another replica, to take its place in the 'in progress' count?

We can't tell from the outside. But runtimes starting at 0.04 hours don't look too good.

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 21 Dec 23
Posts: 46
Credit: 0
RAC: 0
Message 61701 - Posted: 23 Aug 2024 | 12:57:06 UTC - in response to Message 61700.
Last modified: 23 Aug 2024 | 12:59:22 UTC

Hi, the Windows hosts are working successfully. There are more errors than on Linux, as expected, but plenty are working well.

Unfortunately, some WUs affected by the bug that gives a very short run time but a validated status are still in circulation. (Each WU runs in a chain of 5 steps; when a step finishes, it launches a new job with the same settings.) New WUs do not have this bug.
This is the bug I am talking about: https://www.gpugrid.net/forum_thread.php?id=5468&nowrap=true#61682

WPrion
Joined: 30 Apr 13
Posts: 96
Credit: 2,845,934,111
RAC: 21,329,667
Level
Phe
Message 61702 - Posted: 23 Aug 2024 | 16:55:08 UTC - in response to Message 61696.


* small cache, especially with slower GPUs.


Which cache? Where is it set?? What should it be set at???

WPrion
Joined: 30 Apr 13
Posts: 96
Credit: 2,845,934,111
RAC: 21,329,667
Level
Phe
Message 61703 - Posted: 23 Aug 2024 | 17:03:59 UTC

I just started ATMML yesterday. Out of seven starts only one completed. The rest errored-out after 1-1.5 hours. Windows11/RTX4090.

I'd like to get some actual work done...

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Message 61704 - Posted: 23 Aug 2024 | 17:19:31 UTC - in response to Message 61702.


* small cache, especially with slower GPUs.


Which cache? Where is it set?? What should it be set at???


He's talking about the work cache on the host. you can (kind of) control that in the BOINC Manager Options->"Computing Preferences" menu. set it to something less than 1 day probably.

you'll be limited to 4 tasks from the project (per GPU) anyway.
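
if you'd rather set it in a file than in the Manager, the same two settings can go in a global_prefs_override.xml in the BOINC data directory. a minimal sketch that generates the snippet follows; the 0.1-day values are only an example of "less than 1 day", and the function name is made up.

```python
import xml.etree.ElementTree as ET


def work_cache_override(min_days: float, additional_days: float) -> str:
    """Build a global_prefs_override.xml body that sets the work cache size."""
    root = ET.Element("global_preferences")
    # "Store at least N days of work"
    ET.SubElement(root, "work_buf_min_days").text = str(min_days)
    # "Store up to an additional N days of work"
    ET.SubElement(root, "work_buf_additional_days").text = str(additional_days)
    return ET.tostring(root, encoding="unicode")


# Example values only: a small cache of roughly a fifth of a day in total.
print(work_cache_override(0.1, 0.1))
```

the client has to re-read its preferences (or be restarted) before the file takes effect.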

____________

Farscape
Joined: 1 Feb 09
Posts: 5
Credit: 1,655,352,119
RAC: 6,458,177
Level
His
Message 61705 - Posted: 23 Aug 2024 | 17:30:31 UTC

The Windows tasks ARE NOT working as advertised....

On two 3090 Ti computers and one 3090, 11 work units have errored out after 2-4 hours of run time.

Previous successful task run times went between 17000-18500 seconds.

Errored tasks are 5000-8500 seconds.

I am killing the app in preferences until it sorts itself out....
____________

WPrion
Send message
Joined: 30 Apr 13
Posts: 96
Credit: 2,845,934,111
RAC: 21,329,667
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61706 - Posted: 23 Aug 2024 | 18:22:30 UTC - in response to Message 61705.

Thanks. There are too many caches out there. Let's call this the work queue.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 209
Credit: 4,095,161,456
RAC: 10,116,174
Level
Arg
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61707 - Posted: 23 Aug 2024 | 20:14:08 UTC - in response to Message 61705.

The Windows tasks ARE NOT working as advertised....

On two 3090ti computers and one 3090, 11 work units have errored out after 2-4 hours of run time.

Previous successful task run times went between 17000-18500 seconds.

Errored tasks are 5000-8500 seconds.

I am killing the app in preferences until it sorts itself out....


All 8 of 8 tasks I have completed and returned were also categorized as errors. This is on win10 with 4080 and 4090 GPUs. Here is a sample:
http://www.gpugrid.net/result.php?resultid=35743812
____________
Reno, NV
Team: SETI.USA

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61708 - Posted: 23 Aug 2024 | 21:09:54 UTC - in response to Message 61707.

Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

That's going to be a difficult one to overcome unless the project addresses its job estimation. You need to 'complete' (which includes a successful finish plus validation) 11 tasks before the estimates are normalised - and if every task fails, you'll never get there.

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 46
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61709 - Posted: 24 Aug 2024 | 8:17:34 UTC - in response to Message 61708.

Hello. I apologise for the time limit exceeded errors. I did not expect this. The jobs run for the same time as the Linux ones, which have all been working, so I don't really understand what is happening.

Unfortunately the way boinc deals with "runtime" is completely inadequate for gpu projects. In a WU we have to estimate the flop use, which is a difficult thing to do for a gpu app. The boinc client then somehow estimates the flops performance of your computer in a way I don't understand. I cannot simply put a runtime limit of x hours as would be typical.


Does anyone know where the denominator comes from in this line?:

<message>
exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)</message>
<stderr_txt>


The numerator I believe is the fpops_bound that is set in the WU template which is controlled by us.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61710 - Posted: 24 Aug 2024 | 9:16:26 UTC - in response to Message 61709.
Last modified: 24 Aug 2024 | 9:18:42 UTC

Does anyone know where the denominator comes from in this line?:

<message>
exceeded elapsed time limit 5454.20 (10000000000.00G/1712015.37G)</message>
<stderr_txt>

The numerator I believe is the fpops_bound that is set in the WU template which is controlled by us.

Yes. It's the current estimated speed for the task, which should be 'learned' by BOINC for the individual computer running this particular task type ('app_version').

It's a complex three-stage process, and unfortunately it doesn't go down to the granularity of individual GPU types - all GPUs are considered equal.

1) When a new app version is created, the server will set a first, initial, value for GPU speeds for that version. I'm afraid I don't know how that initial value is estimated, but I'll try to find out.

2) Once the app version is up and running, the server monitors the runtime of the successful tasks returned. That's done at both the project level, and the individual host level. The first critical point is probably when the project has received 100 results: the calculated average speed from those 100 is used to set the expected speed for all tasks issued from that point forward. [aside - 'obviously' the first results received will be from the fastest machines, so that value is skewed]

3) Later, as each individual host reports tasks, once 11 successful tasks have been returned, future tasks assigned to that host are assigned the running average value for that host.
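
The three stages above can be sketched roughly as follows. This is only a simplification of the description in this post, not the BOINC server code: the function and argument names are made up, and the thresholds of 100 project results and 11 host results are the ones quoted above.

```python
def pick_speed_estimate(initial_flops, project_avg_flops, project_results,
                        host_avg_flops, host_results,
                        project_threshold=100, host_threshold=11):
    """Return the speed (in FLOPS) assumed for the next task sent to a host."""
    if host_results >= host_threshold:
        # Stage 3: the host has returned enough successful tasks,
        # so its own running average is used.
        return host_avg_flops
    if project_results >= project_threshold:
        # Stage 2: the project-wide average from the first results.
        return project_avg_flops
    # Stage 1: the server's initial value for a brand-new app version.
    return initial_flops


# A host with only 3 successful tasks still gets the project-wide average.
print(pick_speed_estimate(1e12, 2e12, 150, 7e11, 3))
```

The sketch makes the trap visible: a host whose tasks all fail never reaches stage 3, so it keeps being judged by a number learned from other machines.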

The current speed estimate can be seen in the application_details page for each host, where it is listed as the 'Average processing rate'. zombie67 hasn't completed an ATMML task yet, so no 'Average processing rate' is shown for his machine for ATMML (at the bottom), but you can see it for other task types.

Phew. That's probably more than enough for now, so I'll leave you to digest it.

wujj123456
Send message
Joined: 9 Jun 10
Posts: 19
Credit: 2,233,932,323
RAC: 475
Level
Phe
Scientific publications
watwatwatwat
Message 61711 - Posted: 24 Aug 2024 | 20:16:04 UTC - in response to Message 61710.

I'm curious why we even bother to intentionally error out a task based on runtime at all. Usually a wrong runtime estimate just messes with local client scheduling a bit, but tasks finish fine eventually. It's not like GPUGrid had accurate runtime estimation before, but previous tasks didn't fail.

Does this batch/app have a bug that could cause it to get stuck computing forever, which is why we need additional protection to abort tasks after a certain runtime?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1352
Credit: 7,753,570,448
RAC: 11,869,104
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61712 - Posted: 24 Aug 2024 | 21:30:21 UTC - in response to Message 61711.

Probably the decision is because this project depends on fast turnaround and turnover for tasks.

Science can't proceed till the earlier result is returned, validated and then iterated into the next task.

Better to fail fast and send out the next wingman task until the task gets retired at 8 fails.

The deadline is always 5 days to get through 8 tries, which is why they reward a 50% credit bonus for returning results within 24 hours.

wujj123456
Send message
Joined: 9 Jun 10
Posts: 19
Credit: 2,233,932,323
RAC: 475
Level
Phe
Scientific publications
watwatwatwat
Message 61713 - Posted: 24 Aug 2024 | 22:25:16 UTC - in response to Message 61712.
Last modified: 24 Aug 2024 | 22:34:54 UTC

I've seen people misuse this "fail fast" philosophy very often. "Fail fast" makes sense only when it's going to be a failure anyway. Turning a successful result into a failure proactively is the opposite of making progress.

Look at the errors on this host. It took ~20K seconds for my host to finish, but all the prematurely killed results ended up wasting way more compute time and on average suffered another half a day delay before getting a successful return. That's not speeding up but slowing down.

That's why I'm curious what this limit is trying to protect against. Such a limit would make sense only if we know there is a chance that a task can get stuck computing indefinitely. That would generally indicate some bug that needs to be fixed. Even then, given how long the turnaround would be after killing an otherwise successful task, the project should have set a floor of a few hours for the limit.

In addition, "science" does not equal this project alone. Wasting hours of compute that could have been used by other projects isn't advancing "science". It's advancing this project at the cost of other science. That would be a bit disrespectful to fellow scientists if it's done intentionally. However, I'd rather assume good intentions here and that this is just a misguided optimization. I know software isn't easy, so the project owner should take a look at the resulting data and try to make more efficient use of available compute by reducing waste, which would also speed up progress for this project.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1352
Credit: 7,753,570,448
RAC: 11,869,104
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61714 - Posted: 25 Aug 2024 | 7:40:25 UTC - in response to Message 61713.

I'm pretty sure the "exceeded elapsed time limit" is not because the project scientists just decided on a whim to utilize it.

It's part of the Boinc code and nothing they have control over. It's present for all projects that use the Boinc code unmodified.

Only the Boinc developers have the knowledge of how that function is implemented.

The project scientist already stated he was surprised by the errors when the exact same task template was used for the Linux tasks and they have not had any issues with elapsed time limit errors.

It's something specific to Windows. And they do not develop for Windows first, since all the tools and software they use are primarily Linux-based, which is where their expertise is greatest.

Some of the toolchains they use have never had Windows versions, which is why it has taken so long for Windows versions of some of the native Linux apps to appear.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61716 - Posted: 25 Aug 2024 | 11:49:19 UTC - in response to Message 61714.

I agree - the runtime errors are mainly an issue of the BOINC software, but they are appearing because the GPUGrid teams - admin and research - have over the years failed to fully come to terms with the changes introduced by BOINC around 2010. We are running a very old copy of the BOINC server code here, which includes the beginnings of the 2010 changes, but which makes it very difficult for us to dig our way out of the hole we're in.

But I don't agree that only the BOINC developers understand the code. It's all open-source, and other projects manage to control it reasonably well. The finer points are indeed cloaked in obscure language, but the resulting data is visible on all our machines.

Let's play with a current worked example.

I've just downloaded a new ATMML task on host 508381. That's a Linux machine, with 2 GPUs. They are in fact identical, so for once the 'Coprocessors' line is true. It has completed 52 ATMML tasks so far, so it has had plenty of time to reach a steady state. [BOINC loves steady states - it's the edge cases, like deploying a new app_version, which cause the problems]

My key objective is to see how the runtime estimate was derived, and to see what was done well, and what was done badly. BOINC works out the runtime from the size and the speed of the task. In dimensional terms, that's

{size} / {size per second}

The sizes cancel out, and duration is the inverse of speed.

In the case of my new task, I have:

<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est> (size)
<flops>698258637765.176392</flops> (speed)

My calculator makes that

Duration 1,432,134 seconds, or about 16.5 days.

But our BOINC clients have a trick up their sleeves for coping with that - it's called the DCF, or duration correction factor. For this machine, it's settled to 0.045052. Popping that into the calculator, that comes down to:

Runtime estimate 64,520 seconds, or 17.92 hours. BOINC Manager displays 17:55:20, and that's about right for these tasks (although they do vary).
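
For anyone who wants to redo the calculator work, here is a minimal sketch of the same arithmetic. The three input values are the ones quoted above; the variable names and the hms helper are just made up for this example, and the only formula assumed is size / speed, scaled by DCF.

```python
rsc_fpops_est = 1_000_000_000_000_000_000.0  # task "size", from <rsc_fpops_est>
flops = 698_258_637_765.176392               # host "speed", from <flops>
dcf = 0.045052                               # duration correction factor on this host

raw_seconds = rsc_fpops_est / flops          # size / (size per second)
estimate_seconds = raw_seconds * dcf         # what the client ends up displaying


def hms(seconds):
    """Format a duration in whole seconds as h:mm:ss."""
    s = int(seconds)
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"


print(f"raw duration: {raw_seconds:,.0f} s ({raw_seconds / 86400:.2f} days)")
print(f"with DCF:     {int(estimate_seconds):,} s = {hms(estimate_seconds)}")
```

Swapping in the values from your own client_state.xml shows how far your DCF has had to drift from 1 to make the estimate believable.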

CONCLUSION
The task sizes set by the project for this app are unrealistically high, and the runtime estimates only approach sanity through the heavy application of DCF - which should normally hover around 1.

DCF is self-adjusting, but very slowly for these extreme limits. And you have to do the work first, which may not be possible.

Volunteers with knowledge and skill can adjust their own DCFs, but I wouldn't advise it for novices.

@ Steve
That's even more indigestible than the essay I wrote you yesterday. Please don't jump into changing things until they've both sunk in fully: meddling (and in particular, meddling with one task type at a project with multiple applications) can cause even more problems than it cures.

Mull it over, discuss it with your administrators and fellow researchers, and above all - ask as many questions as you need.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,443,547,676
RAC: 26,958,523
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61721 - Posted: 25 Aug 2024 | 19:56:36 UTC

"transient upload error: server out of disk space"

the old problem which has been occurring over the years :-(
Unbelievable that this is still happening.

wujj123456
Send message
Joined: 9 Jun 10
Posts: 19
Credit: 2,233,932,323
RAC: 475
Level
Phe
Scientific publications
watwatwatwat
Message 61724 - Posted: 26 Aug 2024 | 0:56:52 UTC - in response to Message 61716.

So this runtime exceeded failure is actually related to the absurd rsc_fpops_est setting. This number is obviously not accurate. Could the project fix the number to be more reasonable, instead of relying on client's trial and error for adjustment that wastes lots of compute power?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61730 - Posted: 26 Aug 2024 | 7:52:36 UTC - in response to Message 61724.

So this runtime exceeded failure is actually related to the absurd rsc_fpops_est setting. This number is obviously not accurate. Could the project fix the number to be more reasonable, instead of relying on client's trial and error for adjustment that wastes lots of compute power?

Not so fast, please. The rsc_fpops_est figure is obviously wrong, but that's the result of many years of twiddling knobs without really understanding what they do.

Two flies in that pot of ointment:
If they reduce rsc_fpops_est by itself, the time limit will reduce, and more tasks will fail.
There's a second value - rsc_fpops_bound - which actually triggers the failure. In my worked example, that was set to 1,000x the estimate, or several years. That was one of the knobs they twiddled some years ago: the default is 10x. So something else is seriously wrong as well.
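
A rough sketch of why those two knobs interact, under two assumptions: that the bound is set as a multiple of the estimate (1,000x in my worked example), and that the client's limit is simply bound / speed, which is how the "exceeded elapsed time limit" message quoted earlier in the thread appears to be laid out. The function name is made up, and this is not the client's actual code.

```python
SECONDS_PER_YEAR = 365.25 * 86400


def elapsed_time_limit(rsc_fpops_est, bound_multiplier, flops):
    """Seconds a task may run before it is aborted, assuming limit = bound / speed."""
    rsc_fpops_bound = rsc_fpops_est * bound_multiplier
    return rsc_fpops_bound / flops


flops = 698_258_637_765.176392  # speed from the worked example

# Worked example: estimate of 1e18 with the bound at 1,000x the estimate.
limit = elapsed_time_limit(1e18, 1000, flops)
print(f"limit: {limit / SECONDS_PER_YEAR:.1f} years")

# Cutting the estimate by 10 with the same multiplier cuts the limit by 10 as well.
smaller = elapsed_time_limit(1e17, 1000, flops)
print(f"after reducing the estimate alone: {smaller / SECONDS_PER_YEAR:.1f} years")
```

On a host whose speed figure has been inflated by bogus short results, the same division gives a limit of hours rather than years, which is where lowering the estimate alone would hurt.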

Soon after the Windows app was launched, I saw tasks with very high replication numbers, which had failed on multiple machines - up to 7, the limit here. But very few of them were 'time limit exceeded'. The tasks I'm running now have low replication numbers, so we may be over the worst of it.

I repeat my plea to Steve - please take your time to think, discuss, understand what's going on. Don't change anything until you've worked out what it'll do.

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 46
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61732 - Posted: 26 Aug 2024 | 8:32:33 UTC - in response to Message 61730.

Thank you for the explanation.

The time limit exceeded error therefore happened because:
- we had a bug in some circulating WUs where certain errors would not trigger a proper error code. The result would then be validated with short runtimes.
- these fast runtime results then skewed the correction factors for the newly released windows app version.

To fix the problem I 10x'ed the rsc_fpops_bound value while leaving the rsc_fpops_est unchanged.

This appears to have worked and hosts that previously had the time limit exceeded errors now do not.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61733 - Posted: 26 Aug 2024 | 9:05:27 UTC - in response to Message 61732.

Yes, I found host 591089. That succeeded on its first task, but then failed five in a row on the time limit.

It's had the current one for two days now, so hopefully it'll work. One to watch.

EA6LE
Send message
Joined: 28 Dec 20
Posts: 7
Credit: 22,199,787,011
RAC: 189,712,498
Level
Trp
Scientific publications
wat
Message 61735 - Posted: 26 Aug 2024 | 13:16:48 UTC - in response to Message 61733.

I found a way to get the Windows WUs to finish without errors.
After you get a WU, go to the Projects tab and select "No new tasks". Be sure you don't have other projects running at the same time. Once it is finished and uploaded, you can allow another WU and repeat.

EA6LE
Send message
Joined: 28 Dec 20
Posts: 7
Credit: 22,199,787,011
RAC: 189,712,498
Level
Trp
Scientific publications
wat
Message 61739 - Posted: 27 Aug 2024 | 22:05:37 UTC - in response to Message 61735.

WUs starting with "MCL1" are all erroring out in windows or linux.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 337
Credit: 7,711,257,660
RAC: 10,801,436
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61740 - Posted: 27 Aug 2024 | 22:21:52 UTC

Quite a few of my tasks starting with PTP1B are failing in both OSs

EA6LE
Send message
Joined: 28 Dec 20
Posts: 7
Credit: 22,199,787,011
RAC: 189,712,498
Level
Trp
Scientific publications
wat
Message 61741 - Posted: 27 Aug 2024 | 22:31:58 UTC - in response to Message 61740.
Last modified: 27 Aug 2024 | 22:32:11 UTC

Quite a few of my tasks starting with PTP1B are failing in both OSs

Those worked fine for me under linux. took shorter time to finish them.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 485
Credit: 11,209,900,137
RAC: 14,573,666
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61742 - Posted: 28 Aug 2024 | 0:37:26 UTC

Same issue here:

https://www.gpugrid.net/result.php?resultid=35798577



Tue 27 Aug 2024 08:34:58 PM EDT | GPUGRID | Computation for task MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1 finished
Tue 27 Aug 2024 08:34:58 PM EDT | GPUGRID | Output file MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1_0 for task MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1 absent
Tue 27 Aug 2024 08:34:59 PM EDT | GPUGRID | Started upload of MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1_1


roundup
Send message
Joined: 11 May 10
Posts: 63
Credit: 9,648,842,244
RAC: 57,094,537
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61743 - Posted: 28 Aug 2024 | 3:47:48 UTC - in response to Message 61742.

That is a bad batch. All units error out after several weeks of trouble-free calculation under Linux. Example:
https://www.gpugrid.net/result.php?resultid=35800268

Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
[W output_modules.py:45] Warning: CUDA graph capture will lock the batch to the current number of samples (2). Changing this will result in a crash (function )
[W output_modules.py:45] Warning: CUDA graph capture will lock the batch to the current number of samples (2). Changing this will result in a crash (function )
Traceback (most recent call last):
  File "/var/lib/boinc-client/slots/28/bin/rbfe_explicit_sync.py", line 11, in <module>
    rx.scheduleJobs()
  File "/var/lib/boinc-client/slots/28/lib/python3.11/site-packages/sync/atm.py", line 142, in scheduleJobs
    if isample % int(self.config['CHECKPOINT_FREQUENCY']) == 0 or isample == num_samples:
                     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/28/lib/python3.11/site-packages/configobj/__init__.py", line 554, in __getitem__
    val = dict.__getitem__(self, key)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'CHECKPOINT_FREQUENCY'
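
The traceback reads as if the batch's config file simply lacks the CHECKPOINT_FREQUENCY entry. A small sketch of that failure mode, using a plain dict in place of the real config object; the config contents and the helper name are made up for illustration and are not the project's code.

```python
# Hypothetical stand-in for a work unit config that is missing the key.
config = {"MAX_SAMPLES": "20"}


def checkpoint_frequency(cfg, default=None):
    """Read CHECKPOINT_FREQUENCY as an int, with a clearer failure if it is absent."""
    try:
        return int(cfg["CHECKPOINT_FREQUENCY"])
    except KeyError:
        if default is None:
            raise ValueError("config is missing CHECKPOINT_FREQUENCY") from None
        return default


# With a fallback the lookup survives; without one it fails with a clear message.
print(checkpoint_frequency(config, default=5))
```

Either way, nothing on the volunteer side can supply the missing value; it has to be fixed in the batch itself.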

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61747 - Posted: 28 Aug 2024 | 9:31:22 UTC

Yes, I had 90 failed ATMML tasks overnight. The earliest was issued just after 18:00 UTC yesterday, but was created at 27 Aug 2024 | 13:28:15 UTC.

I've switched to helping with the quantum chemistry backlog for the time being.

Billy Ewell 1931
Send message
Joined: 22 Oct 10
Posts: 40
Credit: 1,556,701,081
RAC: 2,410,164
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61757 - Posted: 3 Sep 2024 | 18:30:20 UTC - in response to Message 61694.
Last modified: 3 Sep 2024 | 18:37:38 UTC

I experienced the same problems over several days when I was suspending GPU processing because of very hot temps in Texas; the result was the loss of many hours of processing until I discovered that LTIWS (leave tasks in memory while suspended) was apparently not working. I am suspending ATMML tasks until cooler weather arrives in the fall. BET

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1352
Credit: 7,753,570,448
RAC: 11,869,104
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61758 - Posted: 3 Sep 2024 | 19:46:19 UTC - in response to Message 61757.

I can't remember all of the task types that allow suspending or exiting Boinc without erroring out.

The acemd tasks properly checkpoint, but you also can't allow a restarted WU to start again on a different gpu or it will also error out.

Best practice for GPUGrid generally has always been to let all tasks run to completion before exiting Boinc. No guarantees that any task will resume without loss of prior work done or just error out.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61760 - Posted: 3 Sep 2024 | 21:07:48 UTC - in response to Message 61757.

LTI[M]WS only applies to CPU tasks. GPUs don't have that much spare memory.

Billy Ewell 1931
Send message
Joined: 22 Oct 10
Posts: 40
Credit: 1,556,701,081
RAC: 2,410,164
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61761 - Posted: 4 Sep 2024 | 17:12:09 UTC - in response to Message 61760.

I appreciate the informative responses of BOTH Keith and Richard immediately below!!

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1352
Credit: 7,753,570,448
RAC: 11,869,104
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61762 - Posted: 4 Sep 2024 | 18:08:37 UTC

Generally, if I know I must reboot the system in the near future, I will just wait till the current tasks are finished, or reboot shortly after a new task starts, so I won't begrudge the little time it has already spent crunching and which it will have to repeat after the reboot.

It is generally safe to stop a task soon after it starts because, with the exception of the acemd tasks, all the rest of the task types need several minutes to unpack the python environment in the slots and haven't actually started calculating anything yet.

You can get away with interrupting the startup process with a reboot I have found and you won't throw away the task or error it out.

Life v lies: Dont be a DN...
Send message
Joined: 14 Feb 20
Posts: 16
Credit: 27,395,983
RAC: 643
Level
Val
Scientific publications
wat
Message 61763 - Posted: 4 Sep 2024 | 21:35:14 UTC - in response to Message 61687.
Last modified: 4 Sep 2024 | 22:07:43 UTC

The time bonus system has been in place w/ GPUGrid for years. (And yes, the GG tasks download several GIG of data... [WHY? well, another issue] and the download time does count against the deadline)
BUT the points awarded are nonetheless - shall one say - unfathomable.
Case in point: ATMML has a very, very high failure rate [yet another issue, AND an important one], and when completed usually award 300,000 points, at least to my NVIDIA which is better in some ways than this guy's... HOWEVER, host 621740 has had seven successful ATMML tasks (see below) in the last six days with EACH being awarded 2,700,000 points .... SO, what gives?? WHY a 9-fold difference???
WUid       Other task results
29271283   1 error, 1 abort
29270516   3 errors
29265238   1 error, 1 abort
29204796   1 time out, 5 errors (1 of these after 50,905 sec = 14+ hrs)
29268456   n/a
29267692   3 errors
29267146   2 errors
29267146 2 errors

621740's specs:
GenuineIntel
Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz [Family 6 Model 26 Stepping 5]
Number of processors 8
Coprocessors NVIDIA NVIDIA GeForce RTX 3060 (12287MB) driver: 560.81
Operating System Microsoft Windows 10 Professional
Memory 12279.11 MB
Cache 256 KB

My NVIDIA has
Memory 16316.07 MB
Cache 512 KB
PLUS
Swap space 45668.07 MB
Total disk space 464.49 GB

Lasslo P, PhD, Prof Engr.

Life v lies: Dont be a DN...
Send message
Joined: 14 Feb 20
Posts: 16
Credit: 27,395,983
RAC: 643
Level
Val
Scientific publications
wat
Message 61764 - Posted: 4 Sep 2024 | 22:00:06 UTC - in response to Message 61762.

Good point, but winDoze update does not make it easy to avoid IT's "decision" about updating and the time to restart my system. (Don't you hate it when big tech is so much more brilliant and all-knowing than you?)

I turn OFF updates for the 5 weeks max allowed, then as the month ends, I pick the time when I will download and install OS updates.
Even then, I set the "active hours" to the times LEAST likely my PC is in use, usually including late PM to early AM

Lasslo P, PhD, Prof Engr

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Scientific publications
wat
Message 61765 - Posted: 4 Sep 2024 | 22:08:10 UTC - in response to Message 61763.
Last modified: 4 Sep 2024 | 22:11:37 UTC

you're comparing multiple levels of apples vs oranges.

a GTX1660Ti is in no ways better than a RTX3060, it's older gen, less than half the CUDA cores, no tensor cores (which ATMML will use), slower clock speed, slower memory speed, truly basically every performance metric favors the 3060.

your task you completed for 300,000cr was ACEMD3, not ATMML, and you also need to consider that the ATMML tasks run much longer than ACEMD3 and use more resources, so the higher credit reward is appropriate.

your 1x ACEMD3 task if you were to crunch 24/7 would come up with a production of a little over 1,000,000 points per day.

your competitor also completed one ACEMD3 task recently, and scaling that to 24/7 production comes out to around 1,500,000 points per day.

it takes them about 4x longer to run ATMML. and based on their recent production, including the failure, is about 3,600,000 ppd.
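
the scaling used for those per-day figures is just credit per task times how many tasks fit in 24 hours. a quick sketch, where the 300,000 credit is the figure from this thread but the 25,920 second runtime is made up purely so the numbers come out round. it ignores failed tasks and idle time, and the function name is only for this example.

```python
SECONDS_PER_DAY = 86_400


def points_per_day(credit_per_task, runtime_seconds):
    """Scale one task's credit to a full day of back-to-back crunching."""
    return credit_per_task * SECONDS_PER_DAY / runtime_seconds


# 300,000 credits per task (from the thread) at an illustrative 25,920 s per task.
print(points_per_day(300_000, 25_920))
```

plug in the real runtime from a task's result page to compare hosts on equal footing.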

the project admins have resolved a lot of the problems with ATMML, if you want to have better success with this project, and GPUGRID in general, you can consider switching to Linux. otherwise, maybe investigate what's going wrong with your system to cause the failures. looks like a permissions issue to me since your failed tasks have a bunch of Access is denied errors in the WU log. possibly an over zealous AV software. that could be the reason for your download errors also, or just spotty internet
____________

Profile Life v lies. Dont be a DN...
Send message
Joined: 7 Feb 12
Posts: 5
Credit: 333,581,143
RAC: 254,451
Level
Asp
Scientific publications
wat
Message 61766 - Posted: 4 Sep 2024 | 23:19:20 UTC - in response to Message 61765.

GTX1660Ti is in no ways better

"in no way" is an absolute statement, and is false: my NVIDIA has 33% more memory and double the cache.
But, admittedly it is not in general as powerful

Profile Life v lies. Dont be a DN...
Send message
Joined: 7 Feb 12
Posts: 5
Credit: 333,581,143
RAC: 254,451
Level
Asp
Scientific publications
wat
Message 61767 - Posted: 4 Sep 2024 | 23:34:23 UTC - in response to Message 61765.
Last modified: 4 Sep 2024 | 23:47:58 UTC

maybe investigate what's going wrong with your system to cause the failures.

how bizarre ... Batting 1000, the GPUGrid tasks which fail on my system have ALSO failed on several, perhaps even 6 or 7, other systems (except when I take a newly issued task, and when I check those they also fail after bombing on my system.) So, if the problem is on my end and not in any way on GPUGrid's end, then there must be dozens and dozens (and dozens) of other systems which apparently need to "investigate what's going wrong" with them...

that could be the reason for your download errors also

I have no "Download errors" except when I abort the download of a task which has already had repeated compute errors. GPUGrid needs 8 failures before figuring out that there are 'too many errors (may have bug)'. If I can, I'd rather give them this insight before I waste 5-10 minutes of time on my GPU, such as it is.

Anyway, thanks for your feedback

Oh, and by the way, I run 12-13 other projects, including at least three others where I run GPU tasks. This very high error rate of tasks is NOT an issue whatsoever with any of them.

LLP, PhD, Prof Engr

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Scientific publications
wat
Message 61768 - Posted: 4 Sep 2024 | 23:34:57 UTC - in response to Message 61766.
Last modified: 4 Sep 2024 | 23:41:05 UTC

GTX1660Ti is in no ways better

"in no way" is an absolute statement, and is false: my NVIDIA has 33% more memory and double the cache.
But, admittedly it is not in general as powerful


read up. my "absolute" statement is correct.

your 1660Ti has half the memory of a 3060.
your 1660Ti also has half the cache of the 3060.

GTX 1660Ti has 6GB
RTX 3060 has 12GB (you can see this in the host details you referenced)

not sure where you're getting your information, but it's wrong or missing context (like comparing to a laptop GPU or something)
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Scientific publications
wat
Message 61769 - Posted: 4 Sep 2024 | 23:38:05 UTC - in response to Message 61767.
Last modified: 4 Sep 2024 | 23:41:39 UTC

maybe investigate what's going wrong with your system to cause the failures.

how bizarre ... Batting 1000, the GPUGrid tasks which fail on my system have ALSO failed on several, perhaps even 6 or 7, other systems (except when I take a newly issued task, and when I check those they also fail after bombing on my system.) So, if the problem is on my end and not in any way on GPUGrid's end, then there must be dozens and dozens (and dozens) of other systems which apparently need to "investigate what's going wrong" with them...

that could be the reason for your download errors also

I have no "Download errors" except when I abort the download of a task which has already had repeated compute errors.

Anyway, thanks for your feedback


there are more people running Windows. higher probability for resends to land on another problematic windows host.
it's more common for Windows users to be running AV software.
it's common for windows users to have issues with BOINC projects and AV software.

not hard to imagine that these factors mean that a large number of people would have problems when they're all coming to play.

check your AV settings, whitelist the BOINC data directories and try again.
____________

Profile Life v lies. Dont be a DN...
Send message
Joined: 7 Feb 12
Posts: 5
Credit: 333,581,143
RAC: 254,451
Level
Asp
Scientific publications
wat
Message 61770 - Posted: 5 Sep 2024 | 0:07:54 UTC - in response to Message 61768.
Last modified: 5 Sep 2024 | 0:08:32 UTC

your 1660Ti has half the memory of a 3060.
your 1660Ti also has half the cache of the 3060.
GTX 1660Ti has 6GB
RTX 3060 has 12GB (you can see this in the host details you referenced)
not sure where you're getting your information, but it's wrong or missing context (like comparing to a laptop GPU or something)

My information is from GPUGrid's host information,
https://gpugrid.net/show_host_detail.php?hostid=613323 which states 16GB, but this may be unreliable as TechPowerUp GPU-Z does give the NVIDIA site's number of 6GB
My numbers for the cache also come from gpugrid.net/show_host_detail.php, as indeed do all the memory figures in my original post, so I guess my mistake was trusting the gpugrid.net/show_host_detail info.

And no, this is not a laptop, but a 12-core desktop.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Scientific publications
wat
Message 61771 - Posted: 5 Sep 2024 | 0:18:58 UTC - in response to Message 61770.
Last modified: 5 Sep 2024 | 0:24:01 UTC

your 1660Ti has half the memory of a 3060.
your 1660Ti also has half the cache of the 3060.
GTX 1660Ti has 6GB
RTX 3060 has 12GB (you can see this in the host details you referenced)
not sure where you're getting your information, but it's wrong or missing context (like comparing to a laptop GPU or something)

My information is from GPUGrid's host information,
https://gpugrid.net/show_host_detail.php?hostid=613323 which states 16GB, but this may be unreliable as TechPowerUp GPU-Z does give the NVIDIA site's number of 6GB
My numbers for the cache also come from gpugrid.net/show_host_detail.php, as indeed do all the memory figures in my original post, so I guess my mistake was trusting the gpugrid.net/show_host_detail info.

And no, this is not a laptop, but a 12-core desktop.


TPU is not out of date, and probably one of the most reliable databases for GPU (and other) specifications.

there lies the issue. you're looking at system memory, not the GPU memory. system memory has little to do with GPUGRID tasks, which run on the GPUs and not the CPU. at all BOINC projects, the GPU VRAM is listed in parentheses next to the GPU model name on the Coprocessors line. for further context, there was a long-standing bug in BOINC versions older than about 7.18 that capped the reported (not actual) Nvidia memory at 4GB. so old versions were wrong in what they reported for a long time.

so still, the 3060 beats the 1660Ti in every metric. you just happened to have populated more system memory on the motherboard, but that has nothing to do with comparing the GPUs themselves.
____________

Profile Life v lies. Dont be a DN...
Send message
Joined: 7 Feb 12
Posts: 5
Credit: 333,581,143
RAC: 254,451
Level
Asp
Scientific publications
wat
Message 61772 - Posted: 5 Sep 2024 | 0:39:58 UTC - in response to Message 61769.

windows users to have issues with BOINC projects
Again, I run 12-13 other projects, including at least three others where I run GPU tasks.
I have a zero error rate on other projects.

But I do appreciate your suggestion, as I like the science behind GPUGrid and would very much like to RUN tasks rather than have them error out.
I have searched PC settings and Control Panel settings as well as file options for "AV" and do not get any relevant hits.
Could you please elaborate on what you mean by AV settings and whitelisting the BOINC directories?
Thanks.
LLP, PhD, Prof Engr.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1072
Credit: 40,231,533,983
RAC: 239
Level
Trp
Scientific publications
wat
Message 61773 - Posted: 5 Sep 2024 | 0:43:07 UTC - in response to Message 61772.

AV = Anti Virus software.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61774 - Posted: 5 Sep 2024 | 12:21:27 UTC

Switching back to BOINC software and (specifically) ATMML tasks. I've posted extensively in this thread about the problems of task duration estimation at this project. I've got some new data, which I can't explain.

Last week, I added a new Linux host (host 625407). It's a pretty plain vanilla Intel i5 with a single RTX 3060 - should be fairly average for this project. It's completed 17 tasks so far, with number 18 in progress - around 3 per day. I attached it to a venue with only ATMML tasks allowed.

Given my interest in BOINC's server-side task duration estimation for GPUs, I've been logging the stats. Here's what I've got so far:

Task number  rsc_fpops_est  rsc_fpops_bound  flops               DCF     Runtime estimate (secs)  Runtime estimate
1            1E+18          1E+21            20,146,625,396,909  1.0000  49636                    13.79 hours
2
3
4            1E+18          1E+21            20,218,746,342,900  0.8351  41301                    11.47 hours
5
6
7            1E+18          1E+21            19,777,581,461,665  0.9931  50214                    13.95 hours
8
9
10           1E+18          1E+21            19,446,193,249,403  0.8926  45900                    12.75 hours
11           1E+18          1E+21            19,506,082,146,580  0.8247  42279                    11.74 hours
12           1E+18          1E+21            19,522,515,301,144  0.7661  39242                    10.90 hours
13
14           1E+18          1E+20            99,825,140,137      0.7585  7598256                  87.94 days
15
16           1E+18          1E+21            99,825,140,137      0.7360  7373243                  85.34 days
17           1E+18          1E+21            99,825,140,137      0.7287  7300045                  84.49 days
18           1E+18          1E+21            99,825,140,137      0.7215  7227478                  83.65 days

My understanding of the BOINC server code is that, for a mature app_version (Linux ATMML has been around for 2 months), the initial estimates should be based on the average speed of the tasks so far across the project as a whole. So it seems reasonable that the initial estimates were for 10-12 hours - that's about what I expected for this GPU.

Then, after the first 11 tasks have been reported successful, it should switch to the average for this specific host. So why does it appear that this particular host is reporting a speed something like 1/200th of the project average? So now, it's frantically attempting to compensate by driving my DCF through the floor, as with my two older machines.
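The runtime estimates in the table are consistent with a simple relationship: rsc_fpops_est divided by the flops figure, scaled by DCF. Here is a minimal Python sketch, assuming that relationship holds. The function name is made up for illustration and this is not BOINC's own code:

```python
def estimate_runtime_secs(rsc_fpops_est: float, flops: float, dcf: float) -> float:
    """Estimated runtime in seconds, assuming estimate = fpops_est / flops * DCF."""
    return rsc_fpops_est / flops * dcf


# (task number, flops, DCF, logged estimate in seconds), copied from the table above
logged = [
    (1, 20_146_625_396_909, 1.0000, 49636),
    (12, 19_522_515_301_144, 0.7661, 39242),
    (18, 99_825_140_137, 0.7215, 7227478),
]

for task, flops, dcf, logged_secs in logged:
    est = estimate_runtime_secs(1e18, flops, dcf)
    print(f"task {task}: computed {est:,.0f} s, logged {logged_secs:,} s")
```

Small differences from the logged values are expected, since the DCF is only shown to four decimal places.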

The absolute values are interesting too. The initial (project-wide) flops estimates are hovering around 20,000 GFlops - does that sound right, for those who know the hardware in detail? And they are fluctuating a bit, as might be expected for an average with variable task durations for completions.

After the transition, my card dropped to below 100 GFlops - and has remained rock-steady. That's not in the script. The APR for the card (which should match the flops figure for allocated tasks) is 35599.725995644 GFlops - which doesn't match any of the figures above.
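For comparison, each flops figure can be turned back into the runtime it would imply for a 1E+18 task, before any DCF correction. This is a sketch under the same assumption (runtime = rsc_fpops_est / flops); the labels and variable names are illustrative only:

```python
RSC_FPOPS_EST = 1e18

# Figures quoted in this post, in GFlops
figures_gflops = {
    "project-wide estimate (task 1)": 20_146.625396909,
    "post-transition host figure": 99.825140137,
    "APR shown for the card": 35_599.725995644,
}

# Hours a 1E+18 task would take at each speed
implied_hours = {
    label: RSC_FPOPS_EST / (gflops * 1e9) / 3600
    for label, gflops in figures_gflops.items()
}

for label, hours in implied_hours.items():
    print(f"{label}: {hours:,.1f} hours")
```

On this arithmetic the APR implies a little under 8 hours per task, which is at least in the same range as the "around 3 per day" seen on this host, while the post-transition figure implies thousands of hours.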

Where does this take us? And what, if anything, can we do about it? I'll try to get my head round the published BOINC server code on GitHub, but this area is notoriously complex. And the likelihood is that the current code differs to a greater or lesser extent from the code in use at this project.

I invite others of similarly inquisitive mind to join in with suggestions.

pututu
Send message
Joined: 8 Oct 16
Posts: 25
Credit: 4,153,801,869
RAC: 5,369,807
Level
Arg
Scientific publications
watwatwatwat
Message 61775 - Posted: 5 Sep 2024 | 14:38:15 UTC

I haven't run ATMML, but I'm currently running Qchem on a Tesla P100 with short run times (averaging somewhere around 12 mins per task). I see a similar pattern when starting a new instance. If I were to guess, your DCF will eventually go down from the last value of 0.7215 to 0.01 after running 100+ tasks, and your final estimated run time could be about 1.16 days, which is still higher than the expected average run time for your card. However, if you run the CPU benchmark, the DCF will go up from 0.01 to something higher and will take another 100+ tasks to come back down to 0.01, but this time the estimated run time will go below 1.16 days.

I didn't take any notes and am going on observation only, so I could be wrong. My wild guess is that a GPU task has an associated CPU share needed to run it, and running the benchmark takes care of that CPU portion. My one cent.
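The 1.16 days figure can be checked against the numbers logged earlier in the thread. This is a rough Python sketch, assuming the estimate is rsc_fpops_est / flops * DCF and that the DCF bottoms out at 0.01 as guessed above; the variable names are illustrative:

```python
SECONDS_PER_DAY = 86_400

rsc_fpops_est = 1e18
host_flops = 99_825_140_137   # flops figure logged for tasks 14-18
dcf_floor = 0.01              # the DCF value guessed in this post

estimate_days = rsc_fpops_est / host_flops * dcf_floor / SECONDS_PER_DAY
print(f"{estimate_days:.2f} days")
```

With those inputs the estimate comes out at about 1.16 days.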

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 46
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61776 - Posted: 5 Sep 2024 | 14:38:59 UTC - in response to Message 61774.

This is very interesting, thank you for the numbers.

I still don't understand where the flops number for a machine comes from.
does it use the data of your hardware?
or is it purely based on maths done from the rsc_fpops_est number we have set and the time taken for WUs?

I am also unsure how I would set this rsc_fpops_est number to be more accurate.

given one of these WUs takes maybe an hour on a 4090:
A 4090 is about 80 TFLOPS. x 1 hour (3,600 s) = ~3x10^17 floating point operations, which is not actually that far off the estimated value of 1x10^18.
Of course the WUs will not be using all the TFLOPS of the 4090, and there is no sane way for me to calculate the number of floating point operations the program uses.
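The back-of-envelope sum above can be written out as follows. This is only a sketch using the round numbers from this post (80 TFLOPS peak, a one-hour task), not measured values:

```python
peak_flops_4090 = 80e12   # ~80 TFLOPS, the round figure quoted above
runtime_secs = 3600       # "maybe an hour"
rsc_fpops_est = 1e18      # current estimate set on the WUs

# Operations the card could do in that hour if it ran at peak the whole time
ops_at_peak = peak_flops_4090 * runtime_secs
ratio = rsc_fpops_est / ops_at_peak

print(f"operations at peak for one hour: {ops_at_peak:.2e}")
print(f"rsc_fpops_est is {ratio:.1f}x that figure")
```

So the current estimate sits within an order of magnitude of the peak-rate figure, which is the sense in which it is "not that far off".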

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61777 - Posted: 5 Sep 2024 | 17:07:17 UTC - in response to Message 61776.

Well, I said that this is going to be difficult ...

My knowledge and understanding comes from already being an active volunteer at SETI@home on 18 Dec 2008, when BOINC was first announced as being able to manage and use CUDA on GPUs for scientific computing - GPUGrid is also mentioned as becoming CUDA-enabled on the same day. We spent the following months and years knocking the rough edges off the initial launch code. I think the names and features I was referring to in my post were introduced in a sort-of relaunch around 2011. That's still a long way back in the memory bank, and that makes it difficult to find precise references in code or documentation.

My understanding is that the system was designed to be as easy as possible for researchers to implement. I believe the only key information required is the rsc_fpops_est - the estimated size of the task. From your comments on the 4090, and my logging of the early tasks, I think we can accept the current figure as being 'near enough right', and that's all it needs to be.

I think that the flops value is - since 2011-ish - reverse-engineered from your estimate of fpops_est and the measured runtime of the task on the volunteer's hardware. Pre 2011, BOINC took more notice of the 'peak flops' calculated by our computers from the speed and internal geometry of the GPU in use. BOINC guessed a 'fiddle factor' - I think something like 5% - as the ratio between the maximum usable speed on real-life jobs, and the calculated peak speed. But that was abandoned, except possibly for use as an initial seeding value.

Once the real-life data is available from the initial tasks run on a new computer, the server should maintain a running average value for flops for each computer attached to the project. That should wobble with small changes in actual task duration, which is why I was surprised to see it remained identical to the last significant digit in my run so far.
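To illustrate why a per-host average would be expected to wobble, here is a small Python sketch. It assumes flops for each task is simply rsc_fpops_est divided by the measured runtime, and uses a plain running mean. The server's real averaging may well be weighted differently, the function names are invented, and the runtimes are made up for illustration rather than taken from my logs:

```python
def per_task_flops(rsc_fpops_est: float, elapsed_secs: float) -> float:
    """Speed implied by one finished task (assumed: fpops_est / runtime)."""
    return rsc_fpops_est / elapsed_secs


def running_mean(values):
    """Yield the mean of all values seen so far, one per input value."""
    total = 0.0
    for count, value in enumerate(values, start=1):
        total += value
        yield total / count


# Made-up runtimes in seconds, for illustration only (not measured data)
elapsed = [27_500, 28_900, 28_100, 27_800]

flops_history = list(running_mean(per_task_flops(1e18, t) for t in elapsed))
for n, flops in enumerate(flops_history, start=1):
    print(f"after {n} task(s): {flops / 1e9:,.0f} GFlops")
```

Any variation in task duration shifts the average a little after each report, so a value that stays identical to the last digit suggests it is not being updated this way.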

All the data necessary to calculate flops is returned as each task is reported complete. It's stored in the result table on the server, and should be transferred/averaged to the host table. I should be able to point you to the current code for managing that transfer, and the variable names and db field names used - though I may not be able to post them until after the weekend. Perhaps we could compare notes once I've found them?


Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61778 - Posted: 8 Sep 2024 | 20:09:24 UTC

08/09/2024 21:08:27 | GPUGRID | [error] Error reported by file upload server: Server is out of disk space

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,443,547,676
RAC: 26,958,523
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61783 - Posted: 9 Sep 2024 | 2:39:37 UTC - in response to Message 61778.

08/09/2024 21:08:27 | GPUGRID | [error] Error reported by file upload server: Server is out of disk space

this has happened in irregular intervals over all the years - last time about 2 weeks ago.
Hard to believe how difficult it must be to take measures against it.

Maxxina
Send message
Joined: 6 Mar 15
Posts: 2
Credit: 165,500,150
RAC: 103,176
Level
Ile
Scientific publications
watwatwatwatwat
Message 61790 - Posted: 11 Sep 2024 | 20:27:28 UTC

Well. How does one manage to complete a unit in 5 min?

I'm sitting on a more than OK PC with a decent 4090 card, and my units take close to 5 hours :)

pututu
Send message
Joined: 8 Oct 16
Posts: 25
Credit: 4,153,801,869
RAC: 5,369,807
Level
Arg
Scientific publications
watwatwatwat
Message 61791 - Posted: 11 Sep 2024 | 21:36:50 UTC - in response to Message 61790.

Well. How does one manage to complete a unit in 5 min?

I'm sitting on a more than OK PC with a decent 4090 card, and my units take close to 5 hours :)


If you are referring to this post https://www.gpugrid.net/forum_thread.php?id=5468&nowrap=true#61786, Steve posted in the gpugrid discord that there are still tasks that will be generated by the older batch before the code was updated. I don't know how long before these older tasks will be flushed out from the system but it has now been more than 21 days.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,443,547,676
RAC: 26,958,523
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61799 - Posted: 12 Sep 2024 | 12:45:01 UTC

Could it be that my Quadro P5000 is unable to crunch ATMMLs?
Several days ago, I tried it twice, and each time the tasks errored out after a few minutes (I guess, but cannot tell for sure: at the moment the GPU was supposed to start working after the initial steps).
BTW: the CPU is Intel Xeon E5 2667 v4 (two such CPUs are in the box).

Any ideas ?

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 46
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61800 - Posted: 12 Sep 2024 | 13:10:01 UTC - in response to Message 61799.

I think a P5000 should work, given that it is the same generation as a 1080 which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,443,547,676
RAC: 26,958,523
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61801 - Posted: 12 Sep 2024 | 18:00:32 UTC - in response to Message 61800.

I think a P5000 should work, given that it is the same generation as a 1080 which I have confirmed to work. It may be that your drivers are too old. I have a 1080 with driver version 536.

thanks, Steve, for your quick reply.
Some 5 hours ago, I started another task - and it is still running :-)
So I keep my fingers crossed that it will finish successfully.
No idea why the other two ones before failed.
BTW: the driver is 537.99

TofPete
Send message
Joined: 17 Mar 24
Posts: 7
Credit: 52,389,304
RAC: 111,695
Level
Thr
Scientific publications
wat
Message 61816 - Posted: 18 Sep 2024 | 13:47:47 UTC

Hi,

Why have I been receiving error messages like this in ATMML tasks recently?

Stderr output
<core_client_version>8.0.2</core_client_version>
<![CDATA[
<message>
(unknown error) (0) - exit code 195 (0xc3)</message>
<stderr_txt>
09:59:48 (19024): wrapper (7.9.26016): starting
09:59:48 (19024): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
aceforce_dft_v0.4.ckpt

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61817 - Posted: 18 Sep 2024 | 19:18:18 UTC - in response to Message 61816.

You have to read a long way further down to find the real answer to your question!

In the one I picked, I see:

Traceback (most recent call last):
  File "D:\ProgramData\BOINC\slots\2\Scripts\rbfe_explicit_sync.py", line 11, in <module>
    rx.scheduleJobs()
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\atm.py", line 126, in scheduleJobs
    self.worker.run(replica)
  File "D:\ProgramData\BOINC\slots\2\Lib\site-packages\sync\worker.py", line 124, in run
    raise RuntimeError(f"Simulation failed {ntry} times!")
RuntimeError: Simulation failed 5 times!

That looks like something is wrong in the way that job was set up by the project - that's not your fault, and there's nothing you can do about it except report it here - and move on to the next one.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,443,547,676
RAC: 26,958,523
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61834 - Posted: 27 Sep 2024 | 12:53:24 UTC

Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.
If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 485
Credit: 11,209,900,137
RAC: 14,573,666
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61835 - Posted: 27 Sep 2024 | 12:57:15 UTC - in response to Message 61834.

Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.
If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.


I agree.


Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61836 - Posted: 27 Sep 2024 | 12:59:50 UTC - in response to Message 61834.
Last modified: 27 Sep 2024 | 13:02:17 UTC

Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.
If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.

I came here to report exactly the same thing. In the last hour, I've had 6 ATMML tasks which went wrong, and I only have 5 video cards!

Two were a relatively quick 'Error while computing' - perhaps around the 5% mark.
Four were 'Cancelled by server', after runs from 14 ksec to 50 ksec.

I'm switching to Quantum Chemistry for the moment, until we get a handle on what the problem is.

Edit - there goes another one: 'Error while computing' around the 5% mark.

WPrion
Send message
Joined: 30 Apr 13
Posts: 96
Credit: 2,845,934,111
RAC: 21,329,667
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61837 - Posted: 27 Sep 2024 | 13:10:42 UTC - in response to Message 61836.
Last modified: 27 Sep 2024 | 13:11:34 UTC

Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.
If the team decides that certain types of tasks are no longer needed, they should delete them from the distribution pool before they are sent out, instead of having us volunteers crunch them for hours before stopping them.

I came here to report exactly the same thing.

Two were a relatively quick 'Error while computing' - perhaps around the 5% mark.


Something is strange. The work queue was over 800 tasks yesterday, now it's 7.

Freewill
Send message
Joined: 18 Mar 10
Posts: 20
Credit: 32,035,232,894
RAC: 163,063,555
Level
Trp
Scientific publications
watwatwatwatwat
Message 61838 - Posted: 27 Sep 2024 | 13:15:55 UTC - in response to Message 61837.

There was a message from Quico posted on the GPUGrid Discord server. They cancelled some parts of the project as they need to finish other parts quickly. They will resend those later.

I also lost quite a bit of run time, but they didn't have a better way to do it.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61839 - Posted: 27 Sep 2024 | 13:26:49 UTC - in response to Message 61838.

They cancelled some parts of the project as they need to finish other parts quickly.

Let's hope the ones they need to finish quickly aren't the most recent ones that end with 'Error while computing'.

My 'Error while computing' today are all replication _0, type CDK2 or - a recent one - NEW_CDK2.

Freewill
Send message
Joined: 18 Mar 10
Posts: 20
Credit: 32,035,232,894
RAC: 163,063,555
Level
Trp
Scientific publications
watwatwatwatwat
Message 61841 - Posted: 27 Sep 2024 | 14:59:39 UTC - in response to Message 61839.

They cancelled some parts of the project as they need to finish other parts quickly.

Let's hope the ones they need to finish quickly aren't the most recent ones that end with 'Error while computing'.

My 'Error while computing' today are all replication _0, type CDK2 or - a recent one - NEW_CDK2.


If you're using MPS with Nvidia cards, I have seen that killing or stopping tasks while loaded into the GPU memory can really screw things up and cause newly started tasks to fail as well. That was happening after their server cancels, so I am restarting my PCs.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61842 - Posted: 27 Sep 2024 | 18:15:21 UTC - in response to Message 61841.

It doesn't really feel like that sort of problem, but I'll keep an eye on it and restart if it happens again.

At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching, judging by the SSP. When I see the RTS queue filling up again, I'll try one or two to see what happens.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,443,547,676
RAC: 26,958,523
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61843 - Posted: 27 Sep 2024 | 19:27:21 UTC - in response to Message 61842.

At the moment, the machines seem happy on QC tasks - and that seems to be the work that they want crunching

but this strategy does not work the way it was probably supposed to - as QC tasks are not available for Windows crunchers (and those are still the majority).
Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 % - as I noticed with the last few tasks that were not aborted by server and hence could finish - see here:
https://www.gpugrid.net/result.php?resultid=36020678

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 581
Credit: 9,945,862,024
RAC: 20,331,153
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61844 - Posted: 27 Sep 2024 | 20:13:59 UTC - in response to Message 61834.

Today, I've so far had 3 tasks which were "aborted by server" after several hours' runtime. Which is not really nice.

I have added up the processing time on my hosts for 11 ATMML tasks "Aborted by server" over the past three days.
About 135 hours. Really not nice.
The tradition used to be that only tasks not yet started were aborted, and started ones were allowed to finish.
There must be weighty reasons for breaking this tradition, I suppose.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1620
Credit: 8,985,148,370
RAC: 18,179,019
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61845 - Posted: 27 Sep 2024 | 21:17:54 UTC - in response to Message 61843.

Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 %

I've seen that well before the cancellations started - but only for tasks which ran significantly quicker than the ones we're used to. I think that's fair.

KeithBriggs
Send message
Joined: 29 Aug 24
Posts: 10
Credit: 950,400,000
RAC: 14,436,087
Level
Glu
Scientific publications
wat
Message 61846 - Posted: 27 Sep 2024 | 21:18:28 UTC - in response to Message 61844.

135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,443,547,676
RAC: 26,958,523
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61849 - Posted: 28 Sep 2024 | 5:58:40 UTC - in response to Message 61845.

Another indication for this intention seems to be that they reduced the credit points for ATMML by 50 %

I've seen that well before the cancellations started - but only for tasks which ran significantly quicker than the ones we're used to. I think that's fair.

I agree that for significantly shorter tasks the credit is lower, no question.
But the task which I cited was one of the "long ones".

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,443,547,676
RAC: 26,958,523
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61850 - Posted: 28 Sep 2024 | 5:59:23 UTC - in response to Message 61846.

135 hours is a drag! Sorry to hear. We want our GPUs productive and not just hamster wheels.

+ 1

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 135
Credit: 121,597,647
RAC: 27,598
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 61860 - Posted: 1 Oct 2024 | 20:59:28 UTC
Last modified: 1 Oct 2024 | 21:00:26 UTC

I just processed my first one after getting my computer back online.
It errored out.
There is a lot of deprecation going on in the task.
What is that all about?

12.9 hours run time.
https://www.gpugrid.net/results.php?userid=107556&offset=0&show_names=0&state=5&appid=

Erich56
Send message
Joined: 1 Jan 15
Posts: 1132
Credit: 10,443,547,676
RAC: 26,958,523
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61939 - Posted: 15 Nov 2024 | 19:39:23 UTC

In the past few days, there have been a lot of ATMML tasks which failed after a few minutes. It's the sub-type PTP1B...
Clicking on the working package reveals that this happens not only on my hosts, but likewise with other volunteers.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1352
Credit: 7,753,570,448
RAC: 11,869,104
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61941 - Posted: 15 Nov 2024 | 19:49:08 UTC

Yes, that batch was badly configured. Seems to have been cleared out and expired relatively soon. I'm on to a different MCL1 batch that is working correctly.
