Message boards : Server and website : SOS-Downloads stuck
Author | Message |
---|---|
After not having run this project for months I return to find the same problem that was here when I left. Absolutely unconscionable. Goodbye again. | |
ID: 44717 | Rating: 0 | |
What is the problem and why are you having it? Maybe we can help you figure it out. The project's uploads and downloads are working just fine and we are all getting tons of tasks at the moment... some every few minutes, with these SDOERR_CASP tasks being handed out that are literally running in less than 5-10 minutes on many hosts. In your cc_config file, try changing <http_transfer_timeout>3000</http_transfer_timeout> to <http_transfer_timeout>60</http_transfer_timeout> That will make it so that if something does interrupt the transfer, it will retry the connection after 60 seconds and not wait 3000 seconds (50 minutes). As far as I can tell, all the servers are running fine (from the Server Status page), all my uploads and downloads are running smoothly, and nobody else has complained about this issue in weeks, not since a server-full issue that lasted a weekend. ____________ 1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org | |
ID: 44718 | Rating: 0 | |
The problem is the downloads start and then stop and after whatever time elapses it tries again and it keeps doing the same thing over and over until all the files are received sometimes taking hours. This happens on all 4 of my machines with Nvidia cards and this project is the ONLY one that I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file but I just added one. We'll see what happens but I'm not very optimistic. | |
ID: 44721 | Rating: 0 | |
The problem is the downloads start and then stop and after whatever time elapses it tries again and it keeps doing the same thing over and over until all the files are received sometimes taking hours. This happens on all 4 of my machines with Nvidia cards and this project is the ONLY one that I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file but I just added one. We'll see what happens but I'm not very optimistic. Note that the default, as stated in http://boinc.berkeley.edu/wiki/Client_configuration#Options, is actually 300. <http_transfer_timeout>seconds</http_transfer_timeout> | |
ID: 44723 | Rating: 0 | |
The problem is the downloads start and then stop and after whatever time elapses it tries again and it keeps doing the same thing over and over until all the files are received sometimes taking hours. This happens on all 4 of my machines with Nvidia cards and this project is the ONLY one that I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file but I just added one. We'll see what happens but I'm not very optimistic. This is the only project with that problem. Have no idea what they have set wrong but we've complained about it a number of times. Anyway, to address this problem I use the switch above: <http_transfer_timeout>60</http_transfer_timeout> That helps but in order to make it acceptable I also have to start BOINC from the command line and use this argument: --pers_retry_delay_max 60 It still starts and stops but retries much more quickly. Downloads went from sometimes taking hours to now a maximum of 7-8 minutes. We shouldn't have to jump through these hoops but I really don't think there's anyone anymore on the project that knows how to configure the system. There are other easy fixes that we've asked for that never get addressed. | |
ID: 44724 | Rating: 0 | |
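
For anyone who would rather script the timeout change than edit the file by hand, here is a minimal sketch in Python. It assumes the client configuration is an XML document with a `<cc_config>` root and an `<options>` section, as described on the BOINC client configuration wiki page linked above. The helper name `set_cc_option` is invented for this illustration, and the client typically has to re-read its config files (or be restarted) before a change takes effect.

```python
import xml.etree.ElementTree as ET


def set_cc_option(xml_text: str, name: str, value: int) -> str:
    """Return cc_config XML text with <options><name>value</name></options> set.

    Hypothetical helper for illustration only. It creates <options> and the
    option element if they are missing and overwrites an existing value.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    element = options.find(name)
    if element is None:
        element = ET.SubElement(options, name)
    element.text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    original = "<cc_config><options></options></cc_config>"
    print(set_cc_option(original, "http_transfer_timeout", 60))
```

The function works on text rather than on a file path so that it can be tried out without touching a live BOINC data directory; reading and writing the real file is left to the caller.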
I just don't get these problems except at times when everyone does because of server failures. I have systems at 2 different locations on 2 different internet providers (Comcast and RCN). | |
ID: 44736 | Rating: 0 | |
You've complained about it previously as have many others. It happens here on virtually every GPUGrid download (Centurylink). It never happens on any other downloads, BOINC or otherwise. | |
ID: 44742 | Rating: 0 | |
Maybe I'm overreacting but there are just too many issues with seemingly simple remedies for me. Since I'm already in the process of scaling back my DC operation it doesn't really matter. | |
ID: 44751 | Rating: 0 | |
While I don't think the staff of GPUGrid could do anything about your HTTP timeout problem, out of curiosity I ask you to run a very basic network diagnostic: ping www.gpugrid.net -n 100 You can do it on Linux also, but I'm not familiar with its command syntax (the -n 100 parameter tells the ping command to try 100 times). You'll see a lot of (exactly 100, if everything's going well) messages like: Reply from 84.89.134.145: bytes=32 time=83ms TTL=49 Then, at the end: Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 83ms, Maximum = 88ms, Average = 83ms These are the actual results from my host; I'm curious about your statistics. I expect your packet loss and round trip times to be significantly higher than what I experience. Unfortunately these numbers do not reveal the device which is responsible for your problem, but I'm quite confident that it's closer to your end (most probably it's at your ISP) than to the GPUGrid site (if it were at the GPUGrid end, many more users would have such difficulties). You could also try a traceroute command: tracert www.gpugrid.net Which gives you a list of the devices between your end and grosso.upf.edu (on which the gpugrid.net project resides). Perhaps this list could help us to figure out what's wrong, especially if it gives you very different results when you run it multiple times. In some cases these errors are simply caused by network congestion (when the ISP has limited bandwidth to certain destinations), but that could depend on the time of day. On your end, however, P2P file sharing applications or appliances, or a faulty router/switch, could cause such strange errors (but I'm sure in that case there would be problems with other sites as well). | |
ID: 44756 | Rating: 0 | |
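
On Linux (and macOS) the count flag for ping is `-c` rather than `-n`, and the route-tracing tool is called `traceroute` rather than `tracert`. Below is a small sketch of a wrapper that picks the right form for the current platform. The function names are invented for this example, and the output is returned unparsed because its format differs between systems.

```python
import platform
import subprocess


def ping_command(host: str, count: int = 100, system: str = "") -> list:
    """Build the ping command line: "-n" on Windows, "-c" elsewhere."""
    system = system or platform.system()
    count_flag = "-n" if system == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def traceroute_command(host: str, system: str = "") -> list:
    """Build the route-tracing command: tracert on Windows, traceroute elsewhere."""
    system = system or platform.system()
    tool = "tracert" if system == "Windows" else "traceroute"
    return [tool, host]


def run(command: list) -> str:
    """Run a command and return its raw text output (blocks until it exits)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Only print the commands here; call run(...) on them to execute.
    print(" ".join(ping_command("www.gpugrid.net")))
    print(" ".join(traceroute_command("www.gpugrid.net")))
```

Calling `run(ping_command("www.gpugrid.net"))` would take roughly as long as 100 pings do, so the demo at the bottom only prints the command lines.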
Actually Retvari, I experience the same problems with downloads sticking, not the whole package just one or two files that stick. | |
ID: 44757 | Rating: 0 | |
I am also having issues downloading individual files. Usually take two or three retries. My trace route says: | |
ID: 44758 | Rating: 0 | |
After a quiet period when most downloads completed at the first attempt, in the last few days I've seen a marked increase in download delays - as Betting Slip says, usually just one file dropping to zero speed, while nominally still 'active'. That's coincided with more work being downloaded (and re-downloaded - see Pascal thread): I doubt that's a coincidence. | |
ID: 44760 | Rating: 0 | |
My trace route looks very similar after the first couple of hops: Tracing route to www.gpugrid.net [84.89.134.145]
over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms 192.168.11.254 [192.168.11.254]
2 16 ms 16 ms 16 ms lo1.bsr0-zugliget.net.telekom.hu [145.236.238.178]
3 16 ms 16 ms 16 ms 81.183.3.4
4 17 ms 16 ms 17 ms 81.183.3.4
5 19 ms 16 ms 16 ms 81.183.3.145
6 24 ms 23 ms 23 ms 80.157.202.125
7 22 ms 22 ms 22 ms 80.150.171.74
8 28 ms 28 ms 28 ms be2974.ccr21.muc03.atlas.cogentco.com [154.54.58.5]
9 33 ms 34 ms 34 ms be3072.ccr21.zrh01.atlas.cogentco.com [130.117.0.17]
10 46 ms 46 ms 45 ms be3080.ccr21.mrs01.atlas.cogentco.com [130.117.49.1]
11 58 ms 58 ms 57 ms be2354.ccr21.vlc02.atlas.cogentco.com [130.117.0.150]
12 62 ms 61 ms 62 ms be2339.ccr22.mad05.atlas.cogentco.com [130.117.49.81]
13 63 ms 62 ms 63 ms be2853.rcr11.b015537-1.mad05.atlas.cogentco.com [154.54.56.62]
14 63 ms 62 ms 63 ms 149.11.68.50
15 159 ms 74 ms 74 ms CIEMAT.AE1.cica.rt1.and.red.rediris.es [130.206.245.38]
16 78 ms 77 ms 77 ms CICA.AE1.uv.rt1.val.red.rediris.es [130.206.245.34]
17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70]
18 * * * Request timed out.
19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145]
20 84 ms 83 ms 83 ms grosso.upf.edu [84.89.134.145]
21 83 ms 91 ms 84 ms grosso.upf.edu [84.89.134.145]
Trace complete. I suspect that one of my hosts has had a stalled download, and that made it crunch for Einstein@home for a while. But these glitches happen to my hosts almost only when new workunits become available after a near-empty period. That's when the ghost workunits appear too. Probably too many hosts are connected / trying to connect to the server during these periods. Perhaps it looks like a DDoS attack to some firewall/router on the way. | |
ID: 44762 | Rating: 0 | |
Ping statistics for 84.89.134.145: | |
ID: 44763 | Rating: 0 | |
Nanoprobe, check your messages ^^ | |
ID: 44764 | Rating: 0 | |
Download / upload issues have been around for a while now. We have discussed them to sufficient length to come to the conclusion that neither we nor GPUGRID staff have a clue as to their cause. :( | |
ID: 44765 | Rating: 0 | |
For the record, I too have downloads / uploads stall, succeed only after a number of retries, and all this only for GPUGRID! Same here, but only downloads stall for me, and again: only for GPUGrid. Ping statistics for 84.89.134.145: Packets: Sent = 100, Received = 100, Lost = 0 (0% loss), Approximate round trip times in milli-seconds: Minimum = 170ms, Maximum = 215ms, Average = 175ms It seems that everyone (including me) has this happening: 17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70] 18 * * * Request timed out. 19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145] Is it the problem? Again, the ONLY project this happens to is GPUGrid and never on any other downloads of any kind. | |
ID: 44767 | Rating: 0 | |
For me, it's only one file (or rarely two) that stalls, out of a dozen or more for a typical task. And it gets partway through the download before stalling. | |
ID: 44768 | Rating: 0 | |
If there is an internal network issue it's likely something the university needs to sort out, rather than the group. | |
ID: 44770 | Rating: 0 | |
Just processed a number of short tasks. Many of them had issues downloading files. In going back through the event log, all of the interrupts happened with the "*-psf_file" and "*-pdb_file" files. | |
ID: 44771 | Rating: 0 | |
Same here, but only downloads stall for me, and again: only for GPUGrid. After checking the logs a little closer, I concur, it is only downloads that exhibit this symptom not uploads, for example: 18-Oct-2016 23:54:49 [GPUGRID] Requesting new tasks for NVIDIA GPU 18-Oct-2016 23:54:52 [GPUGRID] Scheduler request completed: got 1 new tasks 18-Oct-2016 23:54:54 [GPUGRID] Started download of ... ... 19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file: transient HTTP error 19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:03:05 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file 19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file: transient HTTP error 19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:02:47 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file 19-Oct-2016 00:00:04 [GPUGRID] Started download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-psf_file 19-Oct-2016 00:00:04 [GPUGRID] Started download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-par_file ... The download for these two files kept failing and retrying, it took them about 10 minutes to download. ____________ | |
ID: 44772 | Rating: 0 | |
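
Since several posters are comparing which file types stall, a short script can tally the failures from a saved event log. This is a sketch only: it relies on the exact wording "Temporarily failed download of ...: transient HTTP error" seen in the log lines quoted in this thread, and it assumes the file type is whatever follows the last hyphen in the file name. The function name is made up for the example.

```python
import re
from collections import Counter

# Matches the client message quoted in the thread; "name" captures the
# file name that precedes ": transient HTTP error".
FAILED = re.compile(
    r"Temporarily failed download of (?P<name>\S+?): transient HTTP error"
)


def failed_downloads_by_type(log_text: str) -> Counter:
    """Count failed downloads per file type (coor_file, pdb_file, ...)."""
    counts = Counter()
    for match in FAILED.finditer(log_text):
        # The type is assumed to be the part after the last "-".
        file_type = match.group("name").rsplit("-", 1)[-1]
        counts[file_type] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of "
        "e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file: transient HTTP error\n"
        "19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of "
        "e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file: transient HTTP error\n"
    )
    for file_type, count in failed_downloads_by_type(sample).items():
        print(file_type, count)
```

Run over a few days of logs, a tally like this would show whether the stalls really concentrate on particular file types or simply on the largest files.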
It seems that everyone (including me) has this happening: I assume you refer to #18: It's quite normal that some routers don't reply to requests which come from random computers on the internet. I hoped to get some clues, but we're still just guessing at the problem. To investigate this issue, the network admins at the campus should do some packet-level network traffic analysis, and then decide whether to take countermeasures locally or to contact other ISPs for a solution. But frankly I think this issue doesn't have that much impact on the project's throughput. I don't know how many sites are hosted on this server (besides ps3grid.net and gpugrid.net). I presume there are a lot of servers hosting a lot of webpages at the campus which are routed through the same devices. Their traffic may interfere with GPUGrid's traffic, but it can't be analysed from outside. | |
ID: 44774 | Rating: 0 | |
I think this is more likely a gateway / firewall / reverse proxy issue. The connections are not closed, they are just stalled. Force-closing a connection raises an error on both sides immediately, and clearly this does not happen with our downloads. I think some network component (hardware or software), through which our connections are routed, intervenes and stalls them. Perhaps some network traffic limiter? | |
ID: 44775 | Rating: 0 | |
transient HTTP error Transient HTTP errors can be diagnosed further by setting the <http_debug> event log flag in BOINC. I'll do that next time I'm due to download a new task (if I remember to notice in time), but my expectation is that it will turn out to be simply BOINC's own timeout, which doesn't get us much further forward. But it would confirm that reducing the timeout to 60 seconds is likely to help. | |
ID: 44776 | Rating: 0 | |
Well, I've downloaded and logged a new task, and - wouldn't you believe - it didn't get stuck. 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 2736 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] http op done; retval 0 (Success) 19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] file transfer status 0 (Success) 19-Oct-2016 10:46:45 [GPUGRID] Finished download of e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-vel_file 19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] Throughput 0 bytes/sec 19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file 19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle 'D:\BOINC\ca-bundle.crt' 19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle set 19-Oct-2016 10:46:45 [GPUGRID] Started download of e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file 19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] URL: http://www.gpugrid.org/PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info: Found bundle for host www.gpugrid.org: 0x40b89e0 [can pipeline] 19-Oct-2016 
10:46:45 [GPUGRID] [http] [ID#1273] Info: Re-using existing connection! (#1191) with host www.gpugrid.org 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#1191) 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: GET /PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file HTTP/1.1 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Host: www.gpugrid.org 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.7.0) 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept: */* 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept-Encoding: deflate, gzip 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Content-Type: application/x-www-form-urlencoded 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept-Language: en_GB 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 12696 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 
1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes Most of the time the download jogged along writing 1368 bytes at a time: I'm interpreting that as individual packets being received in the right order, and being sent to the disk-writing queue immediately. "wrote 2736 bytes" appears a lot of times too - probably two packets arriving in reverse order, and both needing to be processed before being written. But when a new file was being requested, the writes increased to 16384 bytes, and stayed that way for some time. That suggests to me that something in one or other system - server or client - is having problems walking and chewing gum at the same time. Since this is the only project where it happens, I'd suggest that possibly the server is the one on the verge of being overloaded. Connection [ID#1271] was downloading: e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-coor_file - 924 KB with throughput 730677 bytes/sec (over 500 packets/sec, if my analysis is right). That's going to be really hard to diagnose from outside the lab, and even inside it without specialist equipment and skills. But one thing comes to mind - restricting BOINC to one file being transferred at a time might ease the pressure caused by that hiccup in the middle. | |
ID: 44780 | Rating: 0 | |
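
The "over 500 packets/sec" estimate can be checked with a few lines of arithmetic. The sketch below takes the throughput reported for the coor_file and adopts the post's interpretation that each 1368-byte write corresponds to one received packet; that interpretation is an assumption, not something the log states.

```python
THROUGHPUT_BYTES_PER_SEC = 730677  # reported by the client for the coor_file
PACKET_PAYLOAD_BYTES = 1368        # size of the typical "wrote ... bytes" entry


def packets_per_second(throughput: float, payload: int) -> float:
    """Packets needed per second to sustain a throughput at a given payload size."""
    return throughput / payload


if __name__ == "__main__":
    rate = packets_per_second(THROUGHPUT_BYTES_PER_SEC, PACKET_PAYLOAD_BYTES)
    print(f"{rate:.0f} packets/sec")
```

This comes out at roughly 534 packets per second, consistent with the figure in the post.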
But it would confirm that reducing the timeout to 60 seconds is likely to help. It does help. I'll post again what made these GPUGrid downloads acceptable for me: <http_transfer_timeout>60</http_transfer_timeout> That helps but in order to make it more acceptable I also have to start BOINC from the command line and use this argument: --pers_retry_delay_max 60 It still starts and stops but retries much more quickly. Downloads went from sometimes taking hours to now around 7-8 minutes. We shouldn't have to jump through these hoops but at least these workarounds help (a lot). | |
ID: 44782 | Rating: 0 | |
OK, I was able to capture a DEBUG-level section of the log with a failed download. It confirms (of course) the experience we have: the file download begins, a part of the file is downloaded, then the download stalls. | |
ID: 44790 | Rating: 0 | |
Maybe this is down to data cost management or network limitations but if not I guess the apparent network issues could stem from a disk issue (simply can't read from one of the drives/arrays fast enough - say when lots of users are downloading simultaneously). If that's the situation then the only real solution is faster disks/arrays, if it's really needed. | |
ID: 44830 | Rating: 0 | |
It also may be a consideration that this is on a university campus. When I worked for my last company, we worked with municipalities and schools to get them streaming television channels online and also allow online access to transfer MPEG-2 files into the servers from around the campus and off campus. Many times the university television staff would be at constant odds with the IT department because many departments wanted the bandwidth and there was only so much to go around. The IT department would act like they were working with us and the department to improve speed or cut down on interruptions, but then we would catch them by doing ongoing pings between us and the station computers and giving the data to IT and they would deny for a while and then say, "Oh yeah, that limiter parameter! We forgot about that! We'll 'loosen' that for you to get a better stream." Then we would still get calls from them asking why "our stream" was cutting out on people and always traced it back to IT giving bandwidth to other departments and putting limiters on the bandwidth that would make the signal intermittent. So maybe the department needs to battle for a more steady bandwidth, even if they have to trade speed for stability. If everybody was a few K slower up and downloading but the signal never broke, maybe we could live with that easier? | |
ID: 44831 | Rating: 0 | |
Maybe this is down to data cost management or network limitations but if not I guess the apparent network issues could stem from a disk issue (simply can't read from one of the drives/arrays fast enough - say when lots of users are downloading simultaneously). If that's the situation then the only real solution is faster disks/arrays, if it's really needed. I don't think it's a traffic issue as it stalls here on every WU download, often several times before the download is complete. Doesn't matter what time of day. Again, this never happens on any other download of any kind. Only GPUGrid. | |
ID: 44833 | Rating: 0 | |
FWIW, I see the same thing on almost every download. But I have two GTX 960s on the same machine, which I just started up again. One of them downloaded all the files, while the second one got stuck as usual. It is always the longest file (or maybe the second-longest), and I concluded some time ago that it must be a problem with the server rather than the network. It seems to pause the long ones to give preference to the shorter ones, and then can't start up again. | |
ID: 44835 | Rating: 0 | |
Is there something we need to take a look at or is it an individual issue? Can someone give me a tl;dr? | |
ID: 44837 | Rating: 0 | |
Is there something we need to take a look at or is it an individual issue? Can someone give me a tl;dr? All I can say is that it is reproducible, and so I don't see how it can be a network issue, unless it is a router or switch on your own network. It could be some sort of traffic-shaping that a router might do; I don't really know that it is a server per se, but it is not at all random. About the only time I don't see it is when re-attaching to the project after an absence, though I have not done rigorous tests on that. | |
ID: 44838 | Rating: 0 | |
I allowed both GTX 960s to run dry, then requested a download and got a SDOERR-CASP20M and a SDOERR-CASP1XX. Both downloaded without a pause. So maybe allowing the connection to go idle for a while helps? | |
ID: 44846 | Rating: 0 | |
I allowed both GTX 960s to run dry, then requested a download and got a SDOERR-CASP20M and a SDOERR-CASP1XX. Both downloaded without a pause. So maybe allowing the connection to go idle for a while helps? I don't think that's the reason. Right now the network traffic at GPUGrid is very low, because there's plenty of work available in both queues, so there are no constantly unfulfilled work requests. However, network statistics from calm and disturbed periods could prove or disprove it. | |
ID: 44847 | Rating: 0 | |
That may be so, but I wonder if it is related to the fact that I run GPUGrid with a zero resource share? As one work unit ends and starts to upload, a new work unit starts to download. That is when I see the pauses, sometimes both on the upload and download. | |
ID: 44848 | Rating: 0 | |
So I wonder whether other people who are having the problem use a zero resource share also? I'm using a high resource share and almost always have stalls/pauses. | |
ID: 44850 | Rating: 0 | |
I just started another download three hours after the last upload finished, and it got stuck again on the longest file: | |
ID: 44852 | Rating: 0 | |
Hm this sounds suspiciously familiar. We are having issues with another webservice of ours getting stuck at loading from time to time these days. The two could be related if the network of the university is having problems. I will report this to our guys just in case. | |
ID: 44856 | Rating: 0 | |
The Computing Preferences tab of the BOINC Options list has upload and download rate limiting. On some of my systems I set this to less than half of what the connection can push, and I have seen these user-side limited connections pause and time out less often, if at all. It may be that the university is limiting the bandwidth and one of the triggers is a noticeable spike for a single connection. | |
ID: 44857 | Rating: 0 | |
The Computing Preferences tab of the BOINC Options list has upload and download rate limiting. On some of my systems I set this to less than half of what the connection can push, and I have seen these user-side limited connections pause and time out less often, if at all. It may be that the university is limiting the bandwidth and one of the triggers is a noticeable spike for a single connection. Doubt it. My max DL speed is 5 Mbps. That can't tax anybody's server. (centurylink monopoly dsl. We don't care, we don't have to. We're the phone company...) | |
ID: 44861 | Rating: 0 | |
Stefan wrote: You can start by asking the university / campus IT people whether they are doing any form of traffic shaping on incoming connections to servers in the university. If there is some traffic shaping going on, you can tell them your contributors have reported problems downloading files (tasks) from certain servers (grosso??) and ask them to monitor the traffic shaping for incoming connections to your servers. Finally, ask them to report any findings to you and, if you do find we are victim to any bandwidth / number of connections limiting mechanism, start to exercise the fine art of negotiating for "MOAR BANDWIDTH!!" :D ____________ | |
ID: 44866 | Rating: 0 | |
University staff have a "won't bother looking into it till you prove it" attitude, so right now Jose is running a script from home testing the connection over a few days. Then we can throw the hard cold data at them and tell them to fix it. | |
ID: 45060 | Rating: 0 | |
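
The thread doesn't show what the test script looks like, so the following is only a guess at the general shape such a probe could take, not the project's actual script. It downloads one URL, counts the bytes received, and records an error if the transfer goes silent for longer than the timeout. The function name is invented, and the `opener` parameter exists only so the logic can be exercised without a network connection.

```python
import time
import urllib.request


def probe_download(url: str, timeout: float = 60.0,
                   opener=urllib.request.urlopen) -> dict:
    """Download url once and report bytes received, duration and any error.

    With urllib the timeout applies to each blocking socket operation, so a
    transfer that goes silent for `timeout` seconds ends with an error
    instead of hanging indefinitely.
    """
    started = time.monotonic()
    received = 0
    error = None
    try:
        with opener(url, timeout=timeout) as response:
            while True:
                chunk = response.read(16384)
                if not chunk:
                    break
                received += len(chunk)
    except OSError as exc:  # covers URLError, socket timeouts, connection resets
        error = repr(exc)
    return {
        "url": url,
        "bytes": received,
        "seconds": time.monotonic() - started,
        "error": error,
    }
```

Calling this in a loop against a known file on the server and appending each result to a log, with a pause between attempts, would give timestamped evidence of stalls (a non-empty `error` with `bytes` short of the file size) over several days.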
Does anyone notice download problems on weekends? | |
ID: 45063 | Rating: 0 | |
Stephan asked Does anyone notice download problems on weekends? Yes Sun 30 Oct 2016 11:45:55 PM CDT | | Project communication failed: attempting access to reference site Sun 30 Oct 2016 11:45:55 PM CDT | GPUGRID | Temporarily failed download of e26s11_e22s4p0f35-SDOERR_CASP11_crystal_ss_20ns_ntl9_0-0-psf_file: transient HTTP error Sun 30 Oct 2016 11:45:55 PM CDT | GPUGRID | Backing off 00:06:21 on download of e26s11_e22s4p0f35-SDOERR_CASP11_crystal_ss_20ns_ntl9_0-0-psf_file Sun 30 Oct 2016 11:45:57 PM CDT | | Internet access OK - project servers may be temporarily down. Sun 30 Oct 2016 04:42:45 PM CDT | | Project communication failed: attempting access to reference site Sun 30 Oct 2016 04:42:45 PM CDT | GPUGRID | Temporarily failed download of e12s17_e4s21p0f210-PABLO_SH2TRIPEP_Q_TRI_2-0-pdb_file: transient HTTP error Sun 30 Oct 2016 04:42:45 PM CDT | GPUGRID | Backing off 00:04:07 on download of e12s17_e4s21p0f210-PABLO_SH2TRIPEP_Q_TRI_2-0-pdb_file Sun 30 Oct 2016 04:42:46 PM CDT | | Internet access OK - project servers may be temporarily down. | |
ID: 45066 | Rating: 0 | |
Does anyone notice download problems on weekends? Yes, here too. Examples from just one machine: 29-Oct-2016 01:51:04 [GPUGRID] Temporarily failed download of e10s3_e9s8p0f10-SDOERR_CASP11_crystal_contacts_20ns_a3D_0-0-coor_file: transient HTTP error 29-Oct-2016 05:08:31 [GPUGRID] Temporarily failed download of e16s12_e9s18p0f486-GERARD_CXCL12CHALCLD_mol0_2-0-coor_file: transient HTTP error 29-Oct-2016 10:31:01 [GPUGRID] Temporarily failed download of e6s1_e5s2p0f181-SDOERR_CASP11_crystal_ss_50ns_a3D_0-0-pdb_file: transient HTTP error 30-Oct-2016 14:03:47 [GPUGRID] Temporarily failed download of e28s4_e27s3p0f1-SDOERR_CASP11_crystal_ss_20ns_ntl9_1-0-psf_file: transient HTTP error 30-Oct-2016 22:11:57 [GPUGRID] Temporarily failed download of e13s11_e10s4p0f159-SDOERR_CASP11_crystal_ss_contacts_20ns_a3D_1-0-pdb_file: transient HTTP error I do have a copy of Wireshark available and I can try to capture a log, if that would be helpful? | |
ID: 45071 | Rating: 0 | |
I do have a copy of Wireshark available and I can try to capture a log, if that would be helpful? You can have a try, but we'll see similar events: some http requests remain unanswered, but we won't know which device blocked/dropped that packet (and why). Perhaps if it's a packet fragmentation issue we'll see something useful in the log. | |
ID: 45073 | Rating: 0 | |
While I don't think the staff of GPUGrid could do anything about your HTTP timeout problem, out of curiosity I ask you to run a very basic network diagnostics: Nanoprobe's network/setup/config is NOT the issue, I've experienced this issue many times on different machines with different OS and connections. The issue is exactly as he describes, usually the larger file will hang up and some of the smaller ones will finish. Then after timing out the big one will restart and make a small amount of progress and hang again. This is unique to this project, I have no issues elsewhere. It IS on gpugrid's side, not sure if it's the project or their provider. People regularly mention download/upload problems around here. Currently experiencing this on win7 64bit and linux 64bit | |
ID: 45074 | Rating: 0 | |
University staff have a "won't bother looking into it till you prove it" attitude, so right now Jose is running a script from home testing the connection over a few days. Then we can throw the hard cold data at them and tell them to fix it. Tell them to install the BOINC client and add GPUGrid to it. I bet they'll get some hangups. Currently I have a stalled transfer of libcufft.so.6.5; after it stalls and times out, I can watch my network activity when I retry it. It spikes up but immediately comes back down and stalls; it looks like 180 degrees of a sine wave. I'm not an IT guy. I could do tracerts and whatnot, but I have no idea how to diagnose this, as everything points to the server side for the following reasons. -This is the only project I have this problem on -Many other people have posted about "transient http error" for over 6 months; this kind of error is almost unheard of on other projects. -But most importantly, veteran crunchers with years of BOINC experience are telling you that they cannot contribute. And what kind of IT department doesn't do the IT work? Sounds like they said we don't want to figure it out, you figure it out. That's pretty messed up. edit: felt I should follow up since I made some progress after I got ranty. I edited my cc_config on a Linux machine to allow a maximum of 1 transfer per project and I was able to get all the files without interruptions. It feels like there's something at play affecting parallel downloads to the same IP/host? | |
ID: 45075 | Rating: 0 | rate: / Reply Quote | |
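For anyone who wants to try the one-transfer-per-project workaround from post 45075, here is a minimal sketch of a cc_config.xml that would do it. It assumes no cc_config.xml exists yet in the BOINC data directory; if one does, only the single option line should be merged into the existing <options> section. The poster did not show the exact file, so this is a reconstruction and not a copy of their config.

```xml
<!-- cc_config.xml, placed in the BOINC data directory -->
<cc_config>
  <options>
    <!-- Allow only one simultaneous file transfer per project (the client default is 2). -->
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>
```

The client has to re-read its config files (or be restarted) before the change takes effect. Like the timeout tweak discussed earlier in the thread, this is a workaround only and does not address whatever is causing the stalls.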
Nanoprobe's network/setup/config is NOT the issue, ... I'm aware of that. ISPs do some traffic shaping (or QoS), which could result in issues like this one. Most probably the campus's ISP (or WAN operator, or IT staff) is to blame. This issue began when there was a change in the network at the campus about a year ago. It was much worse in the beginning than it is now, but it seems that something still escaped their attention. I've experienced this issue many times on different machines with different OSes and connections. The issue is exactly as he describes: usually the larger file will hang up and some of the smaller ones will finish. Then, after timing out, the big one will restart, make a small amount of progress, and hang again. This is probability at work: large files are divided into many more packets than smaller ones, so if a packet gets lost from time to time, a larger file has a higher probability of getting stuck (even many times). This is unique to this project; I have no issues elsewhere. It IS on GPUGrid's side, though I'm not sure if it's the project or their provider. Perhaps GPUGrid's BOINC server log (compared to the user's log) could help in deciding this. | |
ID: 45077 | Rating: 0 | rate: / Reply Quote | |
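The "probability at work" argument in post 45077 can be made concrete with a rough back-of-the-envelope model. This is only a sketch: it assumes each packet independently triggers a stall with some small probability $p$, and the values of $p$ and the packet counts $n$ below are hypothetical, not measurements from GPUGrid.

```latex
% Probability that a transfer of n packets hits at least one stall,
% assuming independent per-packet stall probability p (an assumption):
P_{\text{stall}}(n) = 1 - (1 - p)^{n}

% Hypothetical example with p = 10^{-4}:
%   small file, n = 100 packets:
P_{\text{stall}}(100) = 1 - (1 - 10^{-4})^{100} \approx 0.01
%   large file, n = 50\,000 packets:
P_{\text{stall}}(50\,000) = 1 - (1 - 10^{-4})^{50\,000} \approx 0.99
```

Under these assumed numbers the small file almost always gets through while the large one almost always stalls at least once, which matches the pattern people describe in this thread. Real TCP retransmits lost packets, so a single loss does not normally stall a transfer; the model only illustrates why file size matters if something on the path occasionally breaks a connection.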
Hello, | |
ID: 45079 | Rating: 0 | rate: / Reply Quote | |
We sent them our tests which show the timeouts and now they are looking into it. Let's hope we get some news soon. | |
ID: 45132 | Rating: 0 | rate: / Reply Quote | |
Hello, The problem is solved for me. I edited my cc_config file as caffeineyellow5 said in the second message. Anthony It doesn't really solve the problem; it simply masks it somewhat so that DLs don't hang for hours. BTW, this was first suggested by Richard Haselgrove. A more complete workaround is the one I posted in the 5th message: https://www.gpugrid.net/forum_thread.php?id=4399&nowrap=true#44724 Realize that these are only workarounds and not a real solution. The bad news is that they might tend to hammer the server with more requests than would be necessary if everything were working correctly. Personally, I wouldn't go under 60 for http_transfer_timeout. The same DL problem is also evident when trying to access long threads on the message board: the thread DL stalls, especially on threads with many graphics (like the crunchathlon thread, for instance). Quite irritating. Again, this happens on no other projects except GPUGrid. | |
ID: 45232 | Rating: 0 | rate: / Reply Quote | |
Have you received any new info on this issue? I can confirm that the problem occurs when downloading files with sizes <1MB. In fact, I had to attach a backup project just to keep my GPUs working. | |
ID: 45253 | Rating: 0 | rate: / Reply Quote | |
I still suspect a cache/packet size, open active connection limit, timeout, or throttling issue at the University's IT level, which stands between the project and the world. | |
ID: 45277 | Rating: 0 | rate: / Reply Quote | |
Downloads still stalling here... | |
ID: 45283 | Rating: 0 | rate: / Reply Quote | |
I too am now wrestling with the download issue, manually trying to force libcufft through... This really reminds me of how the internet worked 20 years ago when it was hopelessly overloaded. I agree with caffeineyellow5 that this all points to some overloaded or malfunctioning network component on the campus causing it to randomly drop packets. This is something the network guys on the campus need to solve... | |
ID: 45811 | Rating: 0 | rate: / Reply Quote | |
If you have a 'modern' internet connection, you can download the full darn official toolkit for CUDA 8.0 direct from NVidia: | |
ID: 45812 | Rating: 0 | rate: / Reply Quote | |
Downloads are hanging anywhere from 0.00% to 92.71% of the download. I've aborted several transfers that remain hung for an hour and keep cycling between "Download: active" (with nothing downloading), "Download: pending" (ditto), and "Download: retry in {time}." Very frustrating. Unproductive, too. This problem has not occurred in the three other BOINC projects I'm subscribed to. My location: NH, USA. | |
ID: 45815 | Rating: 0 | rate: / Reply Quote | |
I've aborted several transfers that remain hung for an hour and keep cycling between "Download: active" (with nothing downloading), "Download: pending" (ditto), and "Download: retry in {time}". Next time you can select "Suspend network activity" in the manager, wait a few seconds, and then resume it. This pauses the download, closes the stalled TCP connection, and then starts a fresh one. | |
ID: 45818 | Rating: 0 | rate: / Reply Quote | |
Clever. Better than smacking "Retry now." Thanks. I'll have to remember that... for "normal" network problems. What's going on these past few weeks with GPUGRID isn't "normal." <Sigh> | |
ID: 45832 | Rating: 0 | rate: / Reply Quote | |
I have a laptop connected through WiFi to the same network as my desktops. I can't download the cufft64_80.dll with my laptop, however I can download it with my desktops. | |
ID: 45835 | Rating: 0 | rate: / Reply Quote | |
Should be solved - https://gpugrid.net/forum_thread.php?id=4466&nowrap=true#45967 | |
ID: 45978 | Rating: 0 | rate: / Reply Quote | |
To paraphrase "My Fair Lady": By George, I think they've got it! | |
ID: 45996 | Rating: 0 | rate: / Reply Quote | |
Should be solved - https://gpugrid.net/forum_thread.php?id=4466&nowrap=true#45967 https://gpugrid.net/forum_thread.php?id=4466&nowrap=true#45967 | |
ID: 46042 | Rating: 0 | rate: / Reply Quote | |