Message boards : Server and website : Optimized bandwidth
We have optimized the network so that bandwidth to the server should double. Hopefully this will make the download/upload better.
ID: 54125
Thank you!!
ID: 54127
"We have optimized the network so that bandwidth to the server should double. Hopefully this will make the download/upload better." I can confirm the site is far more responsive to browse via the web. Many thanks for your efforts!
ID: 54129
"We have optimized the network so that bandwidth to the server should double. Hopefully this will make the download/upload better." Indeed, it's much faster now. Thank you!
ID: 54137
I still woke up this morning to a long queue of GG WUs needing to move up and down. When they're moving, the transfer rate seems faster.
ID: 54139
On March 27th 2020, GDF wrote: "We have optimized the network so that bandwidth to the server should double..." Since March 14th, due to the coronavirus crisis, here in Spain all citizens are under a government home-confinement order. Gianni, Toni, and all of the GPUGrid team backstage: thank you very much for your continued support in these hard times!!! Hoping everybody stays healthy,
ID: 54141
I've been watching my BoincTasks Transfers page for the last several hours, wondering when it will clear. When I got up, almost no GG WUs were actually running since the UL queue was full. I've been using these options in my cc_config file for a year or so, and at first they seemed to help, but now I'm not so sure. Maybe they would if everyone used them, or if they could be enforced by the server:
<max_file_xfers>9</max_file_xfers>
<max_file_xfers_per_project>3</max_file_xfers_per_project>
From https://boinc.berkeley.edu/wiki/Client_configuration :
<max_file_xfers>N</max_file_xfers> - maximum number of simultaneous file transfers (default 8).
<max_file_xfers_per_project>N</max_file_xfers_per_project> - maximum number of simultaneous file transfers per project (default 2).
But it does not behave as described, maybe because things are actually happening faster in the computer than what's being displayed on the screen. These options also lump CPU & GPU WUs together and treat DLs the same as ULs. My Charter Spectrum ISP limits UL speeds to 10% of DL speeds, so UL is always the choke point. I just ran a speed test with a GG transfer backlog trying to clear: 53.3 Mbps download and 4.66 Mbps upload. It seems it would be better if the GG server could limit the number of ULs from a given IP address. For now I'm going to switch to 64 & 1 and see how that runs through the next couple of days of server backups:
<max_file_xfers>64</max_file_xfers>
<max_file_xfers_per_project>1</max_file_xfers_per_project>
Note: I don't know the first thing about how servers work.
ID: 54590
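For readers wanting to try the same settings: the two options above go inside the <options> section of cc_config.xml in the BOINC data directory. A minimal sketch (the values shown are the poster's, not recommendations):

```xml
<cc_config>
  <options>
    <!-- cap on simultaneous file transfers across all projects -->
    <max_file_xfers>9</max_file_xfers>
    <!-- cap on simultaneous file transfers per project -->
    <max_file_xfers_per_project>3</max_file_xfers_per_project>
  </options>
</cc_config>
```

The file can be re-read without restarting the client with `boinccmd --read_cc_config`.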
What I have noticed today, since this morning, is what looks like a GPUGRID server outage a few times.
ID: 54591
Transfers are really sluggish.
ID: 54592
Something doesn't seem right. I am wondering whether the GPUGRID people are aware of the problem? No comment from their side here so far :-(
ID: 54593
Downloads are currently being limited to a single connection from the project to any host.
ID: 54595
I see nothing obviously wrong, so I hope it's some international connectivity issue.
ID: 54596
"Downloads are currently being limited by a single connection from the project to any host." Is there a source for this?
ID: 54597
"I see nothing obviously wrong, so I hope it's some international connectivity issue." Of course, as soon as I post something about it, all the stalled uploads and downloads cleared out. The only thing of consequence now is that the project requested a 1-hour backoff.
ID: 54598
"Downloads are currently being limited by a single connection from the project to any host." No, that's just what I was observing, and after I posted it, the rest of the connections picked up and all the stalled tasks moved to the project on both hosts. Toni says he sees nothing wrong on his end and thinks there might be international connection issues that we were seeing.
ID: 54599
I'm amazed that we've gone this long through an international crisis with connectivity staying up. It should be no surprise that elements of the net start going down. It will probably get worse before it gets better.
ID: 54600
Well, I'm back to stalled up/downloads again that I can't persuade to get moving.
ID: 54601
The problem seems to be that my hosts never receive an ACK from the project about successful uploads.
ID: 54603
"I see nothing obviously wrong, so I hope it's some international connectivity issue." I don't know what you're able to look at, but it's been particularly bad at certain times of day for the last 24 hours. Yesterday morning, most attempts at most connections were failing until about 10:00 UTC. Then, suddenly, the floodgates opened, and I managed to get all tasks uploaded, reported, and replaced over about 20 minutes. I went out for the day, but when I returned in the evening, most machines were queuing again and were still in backlog when I went to bed.
Starting this morning at about 06:05 UTC, most machines were running, but two were in local backoff. After a single 'retry', both uploaded, reported, and downloaded at full normal speed. There was a small glitch around 06:45 UTC, but the rot set in an hour ago, just after 07:00 UTC. A few tasks have crept through, but I now have 9 tasks uploading and 3 tasks downloading. Each task requires 16 separate server connections: 6 to upload, 1 scheduler contact, and 9 downloads. Most of the delays seem to be failures to connect, so I'm not sure whether they would show up in internal monitoring - possibly only in slower turnaround and reduced research throughput.
With the mothballing of SETI@Home, you will have the opportunity (extra volunteers) to complete much more bioscience research. But I would urge you to, perhaps, commission a network traffic audit from a networking specialist to try to locate the cause of these problems. Otherwise, you may find that the additional volunteers float away as suddenly as they arrived. One additional problem is that every type of connection has to pass through the same bottleneck. Now to try connection number 17, to post this message. Failed - "This site can't be reached. The connection was reset." Take 2...
ID: 54605
Today's floodgates opened a little earlier. Just completed this morning's big exchange - I'm good for a few more hours.
ID: 54606
I had connection problems with these DC projects:
ID: 54609
"I see nothing obviously wrong, so I hope it's some international connectivity issue." I'm in the USA. These are European BOINC projects that have never behaved like GG: Ibercivis, LHC, TN-GRID, Asteroids, YoYo, Yafu, Universe, QuChemPedIA. I would look at your server configuration some more.
ID: 55300
If I stop babysitting GPUGrid for a couple of hours, i.e. clicking Retry All on the BoincTasks Transfer tab, this is what greets me:
<config>
  <refresh>
    <uploads>60</uploads>
    <downloads>60</downloads>
  </refresh>
</config>
or
<config>
  <refresh>
    <auto>60</auto>
  </refresh>
</config>
But sadly it only works on localhost and does not help my headless fleet. Does anyone know a way to get either BOINC (Retry pending transfers) or BoincTasks to issue the "Retry All" command hourly??? It would also help to eliminate the stifling 2-WU-per-GPU limitation.
ID: 55387
"If I stop babysitting GPUGrid, i.e. clicking Retry All on the BoincTasks Transfer tab, for a couple of hours this is what greets me..." You could write a script to issue the retry transfers, then just have it run locally on each system, looping with a timed wait. Your systems are hidden, so I can't really be more specific since I don't know what kind of setups you have.
ID: 55389
"Your systems are hidden so I can't really be more specific since I don't know what kind of setups you have." They're naked as a jaybird now :-)
ID: 55392
OK, since you are running Linux, try this. You can use the boinccmd tool to retry transfers:
#!/bin/bash
for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
    boinccmd --file_transfer https://gpugrid.net "$i" retry
done
Name the script something like "update_transfers.sh" and change its permissions to make it executable:
sudo chmod +x update_transfers.sh
Run it with the following command from the same directory where the script is saved:
watch -n 600 ./update_transfers.sh
(Replace the value 600 with whatever wait, in seconds, you want.) If you have a user-install version of BOINC (i.e. one that does not need to be "installed" and just runs from your home folder), then you need to put the script in the same directory as your boinccmd executable and modify the script, replacing "boinccmd" with "./boinccmd". You'll have to do this on each of your 40-ish hosts.
ID: 55393
But the real problem is the short cooldown time from the project combined with a block on communications from the same IP (it seems they have a 30-second timer on that). They need to fix that.
ID: 55394
"They need to fix one or both of these settings: either by shortening or turning off the IP-block timer that they have set up, or by changing their project cooldown to something much longer, like 10 minutes." +100
ID: 55399
Thanks for the script. I installed it, but I also upgraded the Nvidia driver to 455.23.04, and when I rebooted I lost the headless computer.
ID: 55402
"cooldown" is an unofficial term. I don't know what it's called on the project server side maybe "requested delay"?. you can see it in your event log where it says "Project requested a delay of.." and then "Deferring communications for..." basically when you communicate with the project for a schedule request. after the request is completed, the project always tells your system to wait some amount of time before trying to communicate again. this is standard BOINC behavior and every project has a different delay pre-set on their server configuration. SETI was 303 seconds. Einstein is like 60 seconds. in the case of GPUGRID, that time is 31 seconds. which is much too short when you have many fast systems at the same IP. you dont need to be asking the project for more work every 30 seconds when it takes 20-60+ minutes to run a single WU. ____________ | |
ID: 55403 | Rating: 0 | rate: / Reply Quote | |
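For context: on a stock BOINC server, the deferral sent back to clients is configurable in the project's config.xml. A hedged sketch of the relevant entry (the option name is from the BOINC server documentation; the 31-second value is inferred from client event logs, not confirmed by GPUGRID):

```xml
<boinc>
  <config>
    <!-- minimum seconds a host must wait between scheduler RPCs;
         reported to the client as the "requested delay" -->
    <min_sendwork_interval>31</min_sendwork_interval>
  </config>
</boinc>
```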
I'm not sure it is a BOINC server setting stopping the communications, as it also affects access to the entire web site. It is more likely to be at the perimeter of the network, probably part of a network-defence strategy against DDoS and similar attacks. The settings may be out of GPUGrid's hands and controlled by the network administrators at UPF.
ID: 55404
It's a project server setting that controls the requested delay; the IP-blocking timeout must be something set up on their network. It's the combination of the short deferral time and the IP-block timeout - either thing by itself doesn't cause a problem. I've forced my systems to go into a cooldown for 10 minutes after each scheduler request (using a custom BOINC client), and that fixed all issues for my systems. The IP-block timer is still in effect; my hosts no longer get stuck idle because the chance that one is trying to communicate at the same time as another is drastically reduced. The way things are by default almost guarantees that if you have more than one fast computer, you will have issues. But the requested delay is absolutely in their control to change. I've seen many other projects adjust this value when they wanted to; there's no reason GPUGRID can't change it either, since they are using the BOINC server software.
ID: 55405
Some projects are even more "busy" than GPUGrid. When I joined Universe, I found it had a project delay time of 11 seconds. Totally ridiculous - no host or download server needs to be polled that often.
ID: 55406
Not a peep from GG staff. You'd think lengthening the cooldown period would be trivial to try. This problem is just getting worse for me.
ID: 55419
All I hear are crickets ;-(
ID: 55432
Will GPUGrid ever outgrow the need for a babysitter???
ID: 55532
"Will GPUGrid ever outgrow the need for a babysitter???" While we're waiting (probably in vain) for that, I've figured out the only way to mitigate this problem on our side: you should reduce your work cache settings on all of your hosts to roughly match the shortest workunits your host crunches. In this way the host will ask for a new task only when the server will actually send a new (spare) one, so there will be no futile requests for work. That results in a lower chance of getting your WAN IP address "banned" for a time period, so your other hosts (behind the same WAN IP) have a bigger chance of getting work as well. As there is plenty of work available at the moment, a new WU will be sent for sure, provided your IP is not "banned". (Note that the GPUGrid server will send only two workunits per GPU for a given host.) The actual value depends on the GPU and the workunit too; as the present workunits are quite short, we should set a very short queue. I've set 0.01 (+0) days on my host with a 2080 Ti. This made my connection to the GPUGrid servers much less "laggy". 0.01 days is the lowest you can set; this is also the 'unit' for the size of the cache (so you can't set 0.015 days). With lesser cards you can try higher values:
days   seconds    h:m:s
0.01       864    14:24
0.02      1728    28:48
0.03      2592    43:12
0.04      3456    57:36
0.05      4320  1:12:00
0.06      5184  1:26:24
0.07      6048  1:40:48
0.08      6912  1:55:12
0.09      7776  2:09:36
0.10      8640  2:24:00
As my hosts (except for one) are crunching Folding@home at the moment (btw, my team is 789th as I write this), I haven't tested it with getting work on multiple hosts, only by browsing the GPUGrid forums on my other PC. (But it should have an effect on getting work too.) I'm curious whether that fix works for you or not.
ID: 55535
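The table above is just a days-to-seconds conversion (1 day = 86,400 s). It can be reproduced, or extended to other cache sizes, with a small shell helper; `days_to_hms` is a hypothetical name introduced here for illustration:

```shell
#!/bin/bash
# Convert a BOINC work-cache setting in days to seconds and hh:mm:ss.
days_to_hms() {
    awk -v d="$1" 'BEGIN {
        s = int(d * 86400 + 0.5)   # 1 day = 86400 s, rounded to whole seconds
        printf "%s\t%d\t%02d:%02d:%02d\n", d, s, s/3600, (s%3600)/60, s%60
    }'
}

# Print the whole table from the post above.
for d in 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10; do
    days_to_hms "$d"
done
```

Running the loop matches the table (e.g. 0.01 days is 864 s, i.e. 14 minutes 24 seconds).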
"While we're waiting (probably in vain) for that, I've figured out the only way to mitigate this problem on our side..." This approach definitely has merit, but it would rely on a large percentage of GPUGrid users applying this method for any results to be seen. Let's see how many people try this.
ID: 55539
"This approach definitely has merit, but would rely on a large percentage of GPUGrid users applying this method for any results to be seen." No. When my WAN IP gets blocked by GPUGrid's DDoS prevention because my hosts issue too many requests in rapid succession to www.gpugrid.net, it does not have any effect on any other user's WAN IP blocking.
ID: 55544
"When my WAN IP gets blocked by GPUGrid's DDoS prevention due to my hosts issuing too many requests in rapid succession, it does not have any effect on any other user's WAN IP blocking." So are we dealing with DDoS protection, contention from a saturated link, rate limiting on an under-resourced link, a badly configured router, QoS putting our connections at the bottom of the list, or a combination of these factors? Comments by volunteers in the forum indicate DDoS protection and a saturated link are quite likely; the other options listed above also rate consideration. The title of this thread also suggests the rules on the network edge equipment have been modified to change bandwidth allocation. I guess we will never really know unless it is identified by GPUGrid. We are really just hypothesizing. It passes the time... and gives us a distraction.
ID: 55545
If you can find a way to prevent your system from communicating with the project for longer than the default 31-second comms delay, you will solve the problem.
ID: 55546
"Increasing the [default 31-second communications] delay will not change the presence of the DDoS protections, but it will prevent the users from hitting them." I wonder about the ideal length of that delay. We don't know the exact rules of the DDoS protection hitting us, which, in combination with the number of hosts a given user has behind a single WAN IP, would decide the ideal delay length. This delay can't be longer than the shortest workunit on the fastest GPU, because it would make those hosts starve (so it wouldn't be better than the DDoS protection making those hosts starve). Taking the signs of the present DDoS protection and the present short workunits into consideration, I think there is a maximum number of hosts behind a single WAN IP which can work without some of them starving. Above that number, some random one of the hosts behind that WAN IP will inevitably be hit by the DDoS protection. So I think the delay should be around 600 seconds (10 minutes); but also the length of the workunits should be (at least) doubled, even quadrupled. The present workunits should be in the short queue (for lesser GPUs).
ID: 55547
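The trade-off described above can be put into rough numbers. A back-of-envelope sketch under stated assumptions (the 30-second block window is a guess repeated from earlier posts, and the independence of the hosts' request timing is idealized; nothing here is a measured value):

```shell
#!/bin/bash
# Assume the IP-level protection tolerates at most one scheduler request
# per BLOCK_WINDOW seconds from a given WAN IP, and that hosts poll
# independently every DELAY seconds.
BLOCK_WINDOW=30   # assumed protection window (seconds)
DELAY=600         # proposed requested delay (seconds)

# N hosts produce on average one request every DELAY/N seconds, so the
# largest fleet that stays above the window is roughly DELAY / BLOCK_WINDOW.
MAX_HOSTS=$((DELAY / BLOCK_WINDOW))
echo "rough safe host count behind one WAN IP: $MAX_HOSTS"
```

With these assumed numbers, a 600-second delay would comfortably cover a fleet of around 20 hosts, versus a single host at the default 31 seconds.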
A 10-min delay seems pretty ideal to me. The fastest WUs I see on my 2080 Ti take about 15 mins. The 2080 Ti is about the fastest card right now (the Titan RTX is barely faster, the RTX 8000 a little slower) until Ampere support is added and we can properly gauge how the 30-series cards will perform.
ID: 55551
"You should reduce your work cache settings on all of your hosts to roughly match the shortest workunits your host crunches." I've tried this before, but not to the extreme of 0.01/0.01. As I recall, it reduces the chances of getting big WUs such as ARP & HST from WCG. I'll give it a try on all my computers today, but it'll probably take a day to see if it's working. I'm certain that 0.5/0.1 does not help GG.
"Note that the GPUGrid server will send only two workunits per GPU for a given host." This is part of the problem.
ID: 55553
"10 mins is the delay I run. And it definitely solved the problem." Does this mean you set your preferences like this?
Store at least 0.01 days of work.
Store up to an additional 0.01 days of work. {I have no idea what this line does or why it even exists.}
ID: 55554
"...you have to script it since the boinccmd tool only seems to have the ability to use the retry command on a single transfer. There is no 'all' option." This wouldn't work for the strangest reason: it must use the exact URL as returned by boinccmd --get_project_urls. I thought Apache rendered "www." useless a couple of decades ago. I can run it manually and it works well, but I cannot get it to run from my crontab.
"Name the script something like 'update_transfers.sh'" The script cannot use a dot, so I changed the name to use an underscore: BOINC_Retry_sh. The script cannot be writable by a user other than root (aurum), so I did this:
sudo chmod 700 BOINC_Retry_sh
https://manpages.debian.org/stretch/cron/cron.8.en.html
So my script BOINC_Retry_sh is now this:
#!/bin/bash
for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
    boinccmd --host localhost --passwd mypw --file_transfer https://www.gpugrid.net "$i" retry
done
It wouldn't work without including --host localhost --passwd mypw, but that might be because I didn't store the script in the right folder.
"Run it with the following command from the same directory where the script is saved:" I forgot the watch command and will revisit that now.
ID: 55555
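The run-parts naming rules quoted above apply to the system directories (/etc/cron.d, /etc/cron.daily, etc.). A per-user crontab entry (edited with `crontab -e`) is not run through run-parts and has no such filename restrictions, so an alternative is a direct entry; the paths below are the poster's, shown as an example:

```shell
# crontab -e
# run the retry script every 10 minutes; append output to a log for debugging
*/10 * * * * /home/aurum/BOINC_Retry_sh >> /home/aurum/boinc_retry.log 2>&1
```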
It only took an hour to remind me of the problem with using a very short work queue (0.01/0.01). I believe I saw the same thing when I previously tried 0.1/0.1, which proved unworkable. I always run Milkyway along with GPUGrid, since GPUGrid dries up so quickly and I abhor idle computers.
Rig-02 13907 GPUGRID 12-10-2020 08:31 Not requesting tasks: don't need (CPU: ; NVIDIA GPU: job cache full)
So if I have MW WUs then I cannot DL a replacement GG WU. I will not run GG exclusively just to implement a kludge. GDF needs to fix this issue from his server side.
ID: 55558
"...you have to script it since the boinccmd tool only seems to have the ability to use the retry command on a single transfer. There is no 'all' option." There should be no reason you can't run a ".sh" - it's just a file extension. You could name it .anything, or give it no extension at all as you did. It will execute the same either way; it's really inconsequential.
ID: 55562
"There should be no reason you can't run a '.sh' - it's just a file extension. You could name it .anything or with no extension at all as you did. It will execute the same either way, it's really inconsequential." There is if you need to run it from a crontab. As described above, the files under these directories have to pass some sanity checks, including the following: be executable; be owned by root; not be writable by group or other; and, if symlinks, point to files owned by root. Additionally, the file names must conform to the filename requirements of run-parts: they must be entirely made up of letters and digits, and can only contain the special signs underscore ('_') and hyphen ('-'). Any file that does not conform to these requirements will not be executed by run-parts. For example, any file containing dots will be ignored. This is done to prevent cron from running any of the files that are left by the Debian package management system when handling files in /etc/cron.d/ as configuration files (i.e. files ending in .dpkg-dist, .dpkg-orig, and .dpkg-new).
ID: 55571
Well, that's why my instructions were designed around running it in an open terminal ;) Just open the terminal and run it there with the watch command.
ID: 55572
"10 mins is the delay I run. And it definitely solved the problem." I have a custom client, developed by a team member, which overrides the default comms delay and forces a longer timeout of whatever length you wish. This is how I KNOW that the issue is solved with a longer timeout, because I've done it (as have several other teammates). This software is locked to our team, however, so even if I were to give you the BOINC client software, it won't work unless you are on our team. It doesn't sound like you use anything but service installs anyway. This is a custom BOINC client that runs standalone from wherever you have it on your system. The benefit is that you don't have to "install" anything; you just copy the folder wherever you want and run the executable from there. The downside is that it won't auto-run when you boot the system, but when you have a stable system with failover projects, it's not too much hassle. I reboot maybe every few months due to power outages or system upgrades.
ID: 55573
I have another (much more sophisticated, yet not implemented) idea:
ID: 55579
"I have another (much more sophisticated, yet not implemented) idea:" Just make it wait a set amount of time rather than wait for x number of tasks. That would be much simpler:
boinccmd project update (to initiate send/receive)
wait 20-30 secs (to allow the project update to complete)
boinccmd set NNT
wait 10 mins
boinccmd allow NT
And just loop that.
ID: 55583
Here, I wrote it.
#!/bin/bash
while :
do
    ./boinccmd --project https://www.gpugrid.net update
    sleep 20
    ./boinccmd --project https://www.gpugrid.net nomorework
    sleep 10m
    ./boinccmd --project https://www.gpugrid.net allowmorework
    sleep 1
done
Easy. Put this script in whatever directory contains your boinccmd executable, and edit it to whatever suits your needs. This is an infinite loop, so it's best not to run it as a cron job; just run it in a terminal and Ctrl+C it if you want to kill it. I'm unsure whether this will totally fix the problem, though, since it will still do a scheduler request to report any finished work on 31-second cycles. Setting NNT only stops asking for new work; it doesn't stop reporting of completed work, and doesn't stop scheduler requests (there is no boinccmd option to do that, other than shutting off network comms to all projects, which is likely not desired). But since you won't finish WUs faster than 30 s anyway, maybe it works? BOINC stops trying after a while when there's nothing to do.
ID: 55584
Since GPUGRID is back, I'm running my script on my computers. I've removed my custom 10-min timer via the custom BOINC client, so GPUGRID is running with the default comms delay of 31 seconds.
ID: 55593
Zoltan, I left all rigs with 0.01/0.01 overnight and awoke to the usual idle GPUs and a long list of WUs with (Project Backoff: x:x:x). Once they get tagged with Project Backoff, they never seem to restart on their own.
ID: 55594
"Zoltan, I left all rigs with 0.01/0.01 overnight and awoke to the usual idle GPUs and a long list of WUs with (Project Backoff: x:x:x)." As you'll find out, you probably will need to remove the "./" prefix on the boinccmd lines, since my implementation is with a user install of BOINC that runs the executable directly.
ID: 55595
It does not seem to get WUs suffering from the Project Backoff syndrome to move, but for WUs that finish after your script starts running, it works. So I added training wheels & invoked your Retry script:
#!/bin/bash
while :
do
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
    /home/aurum/BOINC_Retry.sh
    echo "Update & Retry GPUGrid then sleep for 20 seconds"
    sleep 20
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
    echo "NoMoreWork GPUGrid then sleep for 10 minutes"
    sleep 10m
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
    echo "AllowMoreWork GPUGrid then sleep for 1 second"
    sleep 1
done
It's working well so far on 2 rigs. I'll be adding it to more.
ID: 55596
Yeah, nothing in this script will retry the stuck transfers, and if you have too many stuck transfers, the scheduler requests won't even get new work (you'll see in the event log that you have too many stuck uploads, or whatever).
ID: 55597
I think that Aurum needs the "retry transfers" script if all of his hosts are behind the same WAN IP.
ID: 55598
It won't work without a Retry. I've seen WUs that completed after BOINC_Nap.sh started that went into the fatal Download pending (Project backoff) mode. E.g.:
#!/bin/bash
while :
do
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
    for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
        boinccmd --host localhost --passwd pw --file_transfer https://www.gpugrid.net "$i" retry
    done
    echo "Update & Retry GPUGrid then sleep for 20 seconds"
    sleep 20
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
    echo "NoMoreWork GPUGrid then sleep for 10 minutes"
    sleep 10m
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
    echo "AllowMoreWork GPUGrid then sleep for 1 second"
    sleep 1
done
It's a bit unnerving to look at my Projects tab and see most of my GG hosts set to "No new work", but watch for 10 minutes and they turn on and off. The downside is that if one wants to gracefully shut down ARPs, with their 2-hour checkpoints, it turns them back on when they need to stay off. Suspending a WU stops new DLs.
ID: 55602
I saw a few instances of transfers getting stuck, but they usually clear on the next attempt in a few minutes. The first backoff from a stuck transfer is rather short; then they get longer and longer with each successive retry failure. Having 1 stuck task doesn't prevent downloads of more work, but having a lot of stuck ones does. I ran my script without automatic transfer retries on my systems for over 24 hrs, and even though one would occasionally get stuck, it always eventually cleared itself without intervention. That was my point: getting stuck occasionally isn't a problem if it eventually gets uploaded, and in my case they always did. You just have to trust it a bit and not get too anxious if you see a stuck one. I can see how having 40+ systems might be a different situation, though. So if you absolutely need it, then do what works for you.
ID: 55603
ARP is one of the many projects of World Community Grid.
ID: 55604
OK, that only seems to further the confusion. My script won't change anything with WCG or its projects, nor do I know what he means by suspending WUs, since my script doesn't do that either; it just stops getting new work for GPUGRID. So I don't see the issue or the connection between this script and WCG/ARP, or suspending WUs, or whatever.
ID: 55605
The Africa Rainfall Project has 2- to 3-hour checkpoints. If one wants to avoid discarding all that work, an orderly shutdown is required: one selects "No New Work" for all projects, waits for everything to checkpoint, and then shuts down. But
/usr/bin/boinccmd --project https://www.gpugrid.net allowmorework
switches GG back to Allow New Work long enough to start additional 1-2 hour GG WUs going. Just a small occasional nuisance. I still have not heard a peep out of GDF or Toni about whether they intend to fix their server issue.
ID: 55608
Now that COVID Moonshot sprint 5 is finished at Folding@home, my hosts have run out of work.
ID: 55819
There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:
26/11/2020 09:29:07 | GPUGRID | [http] [ID#12984] Info: Connection #7366 to host www.gpugrid.net left intact
26/11/2020 09:29:08 | GPUGRID | Finished upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_0
26/11/2020 09:29:08 | GPUGRID | Started upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_2
26/11/2020 09:29:08 | GPUGRID | [http] [ID#12985] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
26/11/2020 09:29:18 | GPUGRID | Sending scheduler request: To report completed tasks.
26/11/2020 09:29:18 | GPUGRID | Reporting 1 completed tasks
26/11/2020 09:29:18 | GPUGRID | [http] [ID#1] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
26/11/2020 09:29:21 | GPUGRID | Started download of 2hy5B00_320_0-TONI_MDADex2sh-33-conf_file_enc
26/11/2020 09:29:22 | GPUGRID | [http] [ID#12990] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
That's a very short extract from a very long log, but connection #7366 was used for uploads, reporting, and downloads without needing to be re-established. By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours.
ID: 55825
There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:
I didn't examine the http_debug log before, so that's the reason for the flaw in my logic; however...
By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours.
I thought that raising the "file transfers per project" limit would help, because I saw the same thing happen when the "per project" limit is 2 (or 5): some of the files are downloaded, some of the downloads get stuck. After a few unsuccessful retries, the project backoff kicks in, even when the "per project" limit is low. My point is that this unknown DDoS protection is triggered even if the BOINC client reuses the open HTTP connection(s). In the meantime it has turned out that this method is not an adequate workaround: the uploads/downloads still get stuck on my hosts. So the only working method is the "file transfer retry" script. | |
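For anyone wanting to build such a retry loop, boinccmd exposes `--get_file_transfers` and `--file_transfer <URL> <file> retry`. Below is only a sketch, not the poster's actual script: the `name:`-line parsing is an assumption about the listing format, so verify it against your own client's output before relying on it.

```shell
#!/bin/sh
# Sketch of a "file transfer retry" script. Assumes boinccmd's
# --get_file_transfers listing contains "   name: <file>" lines
# (verify against your client version) and that
# "--file_transfer <URL> <file> retry" is supported.
BOINCCMD="${BOINCCMD:-/usr/bin/boinccmd}"
PROJECT_URL="https://www.gpugrid.net"

# Extract the file names from a --get_file_transfers listing on stdin
pending_files() {
    sed -n 's/^ *name: //p'
}

# Ask the client to retry every pending transfer for the project;
# something like this could run from cron while transfers are stuck.
retry_all() {
    "$BOINCCMD" --get_file_transfers | pending_files |
    while IFS= read -r f; do
        "$BOINCCMD" --file_transfer "$PROJECT_URL" "$f" retry
    done
}
```

Nothing is executed at the top level, so the file is safe to source and test; `retry_all` is the piece you would schedule.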
ID: 55827 | Rating: 0 | rate: / Reply Quote | |
My experience is slightly different. I have seven machines attached, made up of | |
ID: 55828 | Rating: 0 | rate: / Reply Quote | |
My experience is slightly different. I have seven machines attached, made up of
Take your Windows machines to Folding. Their core_22 now has a CUDA version that works well. I will bring my Linux machines here (mostly GTX 1070s). Their control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing.) | |
ID: 55848 | Rating: 0 | rate: / Reply Quote | |
From what the admins have posted, GPUGRID ships the whole Python package with the application, so the host environment doesn't matter. I run all my systems on Ubuntu 20.04 with no issues. | |
ID: 55849 | Rating: 0 | rate: / Reply Quote | |
Or did you mean “they” as in FAH? Yes, it is the Folding control program that has the problem. I am using Ubuntu 20.04 here too. | |
ID: 55852 | Rating: 0 | rate: / Reply Quote | |
Their [FAH] control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing.)
If you install Ubuntu 18.04 first, then upgrade it to 20.04, it will leave Python 2 on the system, and FAH will work. If you do a clean install of Ubuntu 20.04, FAH won't work. | |
ID: 55862 | Rating: 0 | rate: / Reply Quote | |
If you install Ubuntu 18.04 first, then upgrade it to 20.04, it will leave Python 2 on the system, and FAH will work.
Good thought, but whenever I do an upgrade it never works; I always end up having to do a clean install anyway. So I will just keep some machines on Ubuntu 18.04 for the time being. By the way, I just did my usual efficiency tests on GPUGrid and found that the GTX 1660 Ti and GTX 1650 Super are the best, a little ahead of both the GTX 1060 and GTX 1070, so those are the ones I will use here. | |
ID: 55865 | Rating: 0 | rate: / Reply Quote | |
I've set my hosts to a 0.01-day work buffer, but then I realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so a given host contacts the GPUGrid servers 5 times to download a task.
Good idea. I routinely set that to 4, but 10 is better. It is remarkable what you have to do (in most projects) to get work. | |
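Raising that limit is done with the same cc_config.xml options quoted earlier in the thread. A sketch with example values, assuming 10 is enough to cover all nine files of a workunit in one pass; the client must re-read the config file (or be restarted) to pick the change up:

```xml
<cc_config>
  <options>
    <!-- example values: allow 10 simultaneous transfers overall and per project -->
    <max_file_xfers>10</max_file_xfers>
    <max_file_xfers_per_project>10</max_file_xfers_per_project>
  </options>
</cc_config>
```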
ID: 55869 | Rating: 0 | rate: / Reply Quote | |
Message boards : Server and website : Optimized bandwith