Message boards : Number crunching : GPUgrid disk writes 100GB+ a day, any way to reduce checkpoint frequency?

Author Message
doobedy
Joined: 25 Jul 15
Posts: 7
Credit: 19,649,500
RAC: 0
Message 41646 - Posted: 10 Aug 2015 | 22:45:36 UTC

So basically, my logs look like this:

8/10/2015 2:56:14 PM | GPUGRID | [checkpoint] result e2s1_e1s14f430-GERARD_FXCXCL12_LIG_15353472-0-1-RND8208_1 checkpointed
8/10/2015 2:56:30 PM | GPUGRID | [checkpoint] result e2s1_e1s14f430-GERARD_FXCXCL12_LIG_15353472-0-1-RND8208_1 checkpointed
8/10/2015 2:56:47 PM | GPUGRID | [checkpoint] result e2s1_e1s14f430-GERARD_FXCXCL12_LIG_15353472-0-1-RND8208_1 checkpointed
8/10/2015 2:57:03 PM | GPUGRID | [checkpoint] result e2s1_e1s14f430-GERARD_FXCXCL12_LIG_15353472-0-1-RND8208_1 checkpointed

I'm wondering if this is intended behavior. Checkpoints every 15 seconds? I don't baby my SSDs and have never really paid attention to what writes to disk when, but I get a little nervous when I see GPUgrid writing 120 GB worth of data every 24 hours. That's a lot.

I didn't even believe the numbers I was seeing in Resource Monitor at first, but it showed restart.coor, restart.vel and restart.idx in the GPUgrid directory open and being written to constantly, and the BOINC logs backed it up. So I decided to run a day with just WCG, and a day with WCG and GPUgrid, and compare the S.M.A.R.T. data. I went from 500 MB to 118 GB per day with GPUgrid crunching.

I'm new to BOINC, but I tried adjusting the checkpoint frequency settings and it had no effect. Is there perhaps a config file setting? I mean I can always just set up a RAM disk and move on, but I guess the real question is whether this is really the intended behavior or not. That's a lot of thrashing drives out there if it is.
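
As a rough sanity check on the figures in this post, the sketch below works out how much data each checkpoint would have to write to account for the jump from 500 MB to 118 GB per day. It rests on assumptions the thread does not confirm: that all of the extra 117.5 GB/day comes from checkpoint writes, that checkpoints land about every 16 seconds (the spacing in the log excerpt), and that GB means 10^9 bytes. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def checkpoints_per_day(interval_s):
    """Number of checkpoints in 24 hours at a fixed interval (seconds)."""
    return SECONDS_PER_DAY / interval_s


def mb_per_checkpoint(extra_gb_per_day, interval_s):
    """Average MB written per checkpoint, assuming checkpoints cause all extra writes."""
    return extra_gb_per_day * 1000 / checkpoints_per_day(interval_s)


def gb_per_day(mb_each, interval_s):
    """Daily write volume in GB if each checkpoint writes mb_each MB."""
    return mb_each * checkpoints_per_day(interval_s) / 1000


# Figures from the post: 118 GB/day with GPUgrid, 0.5 GB/day without it.
extra = 118 - 0.5
size = mb_per_checkpoint(extra, interval_s=16)  # ~16 s spacing seen in the log
print(f"{checkpoints_per_day(16):.0f} checkpoints/day, about {size:.1f} MB each")
print(f"same checkpoint size at a 300 s interval: {gb_per_day(size, 300):.1f} GB/day")
```

Under those assumptions each checkpoint would be on the order of 20 MB, and a 5-minute interval would bring checkpoint traffic down to a few GB per day.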

Jim1348
Joined: 28 Jul 12
Posts: 695
Credit: 1,371,992,468
RAC: 3
Message 41647 - Posted: 11 Aug 2015 | 1:55:48 UTC - in response to Message 41646.
Last modified: 11 Aug 2015 | 1:56:33 UTC

My GERARD_FXCXCL12_LIG have been check-pointing about every 5.5 minutes. I have the minimum checkpoint interval set to 300 seconds (5 minutes) for whatever good that does, using BOINC 7.6.6 x64 (Win7 64-bit). Maybe your version of BOINC needs updating?

But I agree 120 GB per day is too much; I use a write-cache (PrimoCache) to protect the SSDs in those machines, though it was not for GPUGrid but because I run WCG/CEP2 on the CPUs.

doobedy
Message 41649 - Posted: 11 Aug 2015 | 16:49:40 UTC - in response to Message 41647.

Thanks Jim, I am using the latest stable version. It's interesting that you are using the same OS as me but with the development client, and are not seeing the issue. I will try that out.

I noticed CEP2 being completely nuts as well, but not due to constant checkpointing. It may be programmed inefficiently (or not, no idea), but this checkpointing business looks like a bug in BOINC or GPUgrid.

Actually I happen to have a CEP2 WU going right now, and it's written 30 GB in the last 30 minutes. Sheesh.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2686
Credit: 1,164,361,299
RAC: 432,706
Message 41651 - Posted: 11 Aug 2015 | 20:39:15 UTC - in response to Message 41649.

That's interesting, doobedy! I have the same BOINC version as you and am seeing frequent file changes in a GPU-Grid slot directory as well. In Resource Monitor the data rate of the GPU-Grid process is really low, though, never exceeding 100 B/s (watched ~ 1 min).
How do you check how much a process has already written?

MrS
____________
Scanning for our furry friends since Jan 2002

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2048
Credit: 14,826,285,069
RAC: 2,412,335
Message 41652 - Posted: 11 Aug 2015 | 23:36:13 UTC - in response to Message 41646.

I've checked it on my hosts (with the built-in Task Manager) and found that at 50% completion the acemd.847-65.exe app had written nearly 3.5 GB. On a host with a GTX 980 and Windows XP x64 it will write ~23.26 GB every day. This much data will not shorten an SSD's lifetime significantly. However, the GPUGrid app does not actually checkpoint according to the period set in BOINC Manager.

doobedy
Message 41653 - Posted: 12 Aug 2015 | 1:04:58 UTC - in response to Message 41652.


I've checked it on my hosts (with the built-in Task Manager) and found that at 50% completion the acemd.847-65.exe app had written nearly 3.5 GB. On a host with a GTX 980 and Windows XP x64 it will write ~23.26 GB every day. This much data will not shorten an SSD's lifetime significantly. However, the GPUGrid app does not actually checkpoint according to the period set in BOINC Manager.


Like I said, I'm very laid back when it comes to SSDs. I get similar numbers to yours for acemd.847-65.exe; it's not the main culprit. What's causing an extra 100 GB on top of that can be seen (at least on my machine) by looking at the event log with checkpoint_debug enabled. It's logging checkpoints 4x a minute.

In Resource Monitor, on the Disk tab, a constant stream of writes to restart.coor, restart.idx, and restart.vel shows up. Sort by Write (Bytes/sec) and they will be at the top, belonging to the System process.

doobedy
Message 41654 - Posted: 12 Aug 2015 | 1:18:07 UTC - in response to Message 41651.
Last modified: 12 Aug 2015 | 1:25:36 UTC

That's interesting, doobedy! I have the same BOINC version as you and am seeing frequent file changes in a GPU-Grid slot directory as well. In Resource Monitor the data rate of the GPU-Grid process is really low, though, never exceeding 100 B/s (watched ~ 1 min).
How do you check how much a process has already written?

MrS


See what I said above for some hints about tracking down this particular issue, but because it's showing up under System I/O, I think that's the only place it will show up in Task Manager, where you can sort by total writes. System will include lots of other I/O as well, like Windows logs, but it will give you an idea.

If you want to do a quick sanity check, you can run something like http://www.ssdready.com/, which I used to confirm something weird was going on. Just run it, hit start, and it shows you total writes and estimated writes per day. I'd ignore the estimated SSD life, it's very conservative, but it can give you a ballpark. I compared it with the S.M.A.R.T. data the drive reports, and they matched.

ExtraTerrestrial Apes
Message 41658 - Posted: 12 Aug 2015 | 19:41:16 UTC
Last modified: 12 Aug 2015 | 20:09:06 UTC

I used SSDReady to check my write volume: about 140 GB/day. With GPU-Grid on hold this reduced to 70 GB/day. I upgraded from 7.4.42 to 7.6.6 and the daily write load with GPU-Grid (and everything else, as before) reduced to 80 GB/day.

Those numbers were only measured for a few minutes each, so they have at least 10% uncertainty. But it seems clear that some bug in 7.4.42 was removed and the write load reduced! Well, the projection just shot up to 500 GB/day... but it looks like that was some write spike which doesn't happen often.

Edit: a WCG CEP2 WU had been messing with the measurements. I put it on hold and am now at a very steady 33-34 GB/day with a checkpoint setting of 5 minutes (measured for 11 minutes).
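
To illustrate why short measurements like these are so sensitive to spikes, here is a minimal sketch of a linear projection from a measurement window to a daily figure. How SSDReady actually computes its estimate is not stated in this thread, so the linear extrapolation, the function names, and the 1000 MB burst are all assumptions chosen for illustration; only the 34 GB/day and 11-minute figures come from the post.

```python
MINUTES_PER_DAY = 24 * 60


def projected_gb_per_day(mb_written, window_minutes):
    """Linearly extrapolate MB written during a short window to GB per day."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return mb_written / 1000 * MINUTES_PER_DAY / window_minutes


def mb_in_window(gb_per_day, window_minutes):
    """Inverse of the projection: MB expected in a window at a steady daily rate."""
    return gb_per_day * 1000 * window_minutes / MINUTES_PER_DAY


# A steady 34 GB/day corresponds to roughly this much data in 11 minutes.
steady_mb = mb_in_window(34, 11)
print(f"steady load: {steady_mb:.0f} MB in 11 minutes")

# Hypothetical one-off burst of 1000 MB (e.g. another task flushing results)
# landing inside the same 11-minute window.
with_burst = projected_gb_per_day(steady_mb + 1000, 11)
print(f"projection with a single burst: {with_burst:.0f} GB/day")
```

With a window this short, one burst of about a gigabyte is enough to multiply the projected daily figure several times over, which fits a projection that jumps and then settles again.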

MrS

doobedy
Message 41659 - Posted: 12 Aug 2015 | 20:50:26 UTC - in response to Message 41658.

Interesting. Did you actually see checkpoints spaced 5 minutes apart in the event log? Are your preferences set to 300 seconds?

I installed 7.6.6 fresh, re-added my projects, and it's still humming along making checkpoints every 15 seconds.

I set up a RAM disk, which I have mixed feelings about. If I lose WUs because of unexpected shutdowns, I'll have to give up on that. I already lost some after upgrading.

Jim1348
Message 41660 - Posted: 12 Aug 2015 | 22:17:45 UTC - in response to Message 41659.

I set up a RAM disk, which I have mixed feelings about. If I lose WUs because of unexpected shutdowns, I'll have to give up on that. I already lost some after upgrading.

It helps to have an uninterruptible power supply (UPS), especially in the spring and summer months when lightning can produce brief outages. In my area, they last for only a second or two, but I use UPS on all my machines to bridge the gap. And if the machines are not stable but crash for other reasons, I would attend to that first before putting BOINC data on a ramdisk.

ExtraTerrestrial Apes
Message 41661 - Posted: 13 Aug 2015 | 21:20:32 UTC

I'm still seeing frequent writes of those files. Just the overall transfer volume somehow went down. I haven't performed any further checks, though.

MrS

Jacob Klein
Joined: 11 Oct 08
Posts: 1111
Credit: 1,813,512,539
RAC: 953,866
Message 41662 - Posted: 14 Aug 2015 | 0:22:55 UTC

I keep my BOINC Data directory on a mechanical drive, even though the OS is installed on an SSD.

ExtraTerrestrial Apes
Message 41663 - Posted: 14 Aug 2015 | 7:38:36 UTC

Whatever is left in my 33-34 GB/day is not influenced by the checkpoint time I set in BOINC. I tried 300 and 180 s, each over a whole night and without any WCG CEP2 WUs. The amount transferred was pretty much the same.

MrS

doobedy
Message 41665 - Posted: 14 Aug 2015 | 18:30:06 UTC - in response to Message 41663.

Whatever is left in my 33-34 GB/day is not influenced by the checkpoint time I set in BOINC. I tried 300 and 180 s, each over a whole night and without any WCG CEP2 WUs. The amount transferred was pretty much the same.

MrS


Funny, in the last 28 hours of uptime, my RAM disk process shows 218 GB written in Task Manager. That's with WCG going, so it's not an attempt at a scientific measurement. I suspect GPUgrid ignores the checkpoint settings, but is behaving well on your machine since the upgrade to 7.6.6. Which is fine, as long as it's aiming for 5 minutes and not 15 seconds. I don't even know what percentage of projects use the checkpoint setting.

Who would know how all of this works? Is it the Acemd executable that calls a checkpoint? Maybe it's using the checkpoint mechanism to do necessary work and it's not a bug at all. I really have no idea. Who is the developer who would know?

doobedy
Message 41666 - Posted: 14 Aug 2015 | 18:58:54 UTC - in response to Message 41660.
Last modified: 14 Aug 2015 | 19:23:09 UTC


It helps to have an uninterruptible power supply (UPS), especially in the spring and summer months when lightning can produce brief outages. In my area, they last for only a second or two, but I use UPS on all my machines to bridge the gap. And if the machines are not stable but crash for other reasons, I would attend to that first before putting BOINC data on a ramdisk.


I have a UPS, although despite Los Angeles's reputation for brownouts, we've had one power outage in my neighborhood in the past 5 years, and it was the classic Mylar balloon hitting a transformer.

Even more tangentially, this particular computer was stress tested pretty hard. I don't usually tinker, and have never overclocked, but these modern chipsets make it easy to tweak useful things. I wanted my media/gaming computer in the living room to crunch 24/7, and do it while staying cool and silent. In a mini ITX case.

So I undervolted the CPU core about 10% and dropped the turbo speed by 0.2 GHz. It's 5% slower but 10°C cooler, and CPU power usage dropped from 62 W to 50 W under load, measured at the wall. The fans stay quiet, and combined with a Maxwell GPU, I'm only using 170 W for the entire computer, and wish I had bought a GTX 980 instead of a 960. But GPU crunching wasn't on my radar at the time.

Edited to add: I forgot my original point, which is that computers are computers. Can't trust them too much, especially when they are used to do more than crunch. I stream Netflix and play games while BOINC remains at 100% load in the background. It's happily done that for weeks, but not all programs are good neighbors. Lockups happen.

Jim1348
Message 41668 - Posted: 15 Aug 2015 | 2:20:20 UTC - in response to Message 41666.

I stream Netflix and play games while BOINC remains at 100% load in the background. It's happily done that for weeks, but not all programs are good neighbors. Lockups happen.

I do most BOINC work on dedicated machines (both Ivy Bridge and Haswell), and they are quite stable. But on my main machine (Haswell), I record and edit video, play it back over the LAN, do BOINC (both CPU and GPU) and numerous other things. I have spent two years trying to get it stable. It is one thing after another. When you add too much stuff, there is always some straw that breaks the camel's back.

Jacob Klein
Message 41670 - Posted: 15 Aug 2015 | 12:21:36 UTC
Last modified: 15 Aug 2015 | 12:28:20 UTC

You can turn on the <checkpoint_debug> flag, in your cc_config.xml file or by changing "Event Log Options" on the user interface, to see the checkpoints happen, per http://boinc.berkeley.edu/wiki/Client_configuration ... and you'll see the lines in the Event Log that the original post shows.
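
For anyone who prefers to script it, here is a minimal sketch that builds a cc_config.xml with only that flag set. It assumes the usual layout from the Client configuration page linked above (a cc_config root with a log_flags section); the function name is made up, and the script only prints the XML rather than writing into the BOINC data directory. Check the linked page for where the file belongs and how the client re-reads it, and merge by hand if you already have a cc_config.xml with other options.

```python
import xml.etree.ElementTree as ET


def build_cc_config(checkpoint_debug=True):
    """Return cc_config.xml text with only the checkpoint_debug log flag set."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    flag = ET.SubElement(log_flags, "checkpoint_debug")
    flag.text = "1" if checkpoint_debug else "0"
    # ET.indent only exists on Python 3.9+; skip pretty-printing on older versions.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


print(build_cc_config())
```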

The BOINC checkpoint setting has actually been renamed in the 7.6.x version (not publicly released yet), to better indicate that it is only a request. The setting is now described as "Request tasks to checkpoint at most every x seconds". I have that setting set to 60.

From my experience, GPUGrid ignores that request, at least for most of its task types. Looking at the data, it seems that different GPUGRID task types checkpoint at different intervals -- some as often as every 15 seconds, some as sporadically as every 60 seconds. You can turn on the debug flag to measure it properly, dumping the data into Excel and sorting it to see what you need to see.
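
If you would rather not use Excel, the measurement can be sketched in a few lines of Python. This assumes the Event Log text looks like the lines in the original post (a US-style 12-hour timestamp, then the project, then a "[checkpoint] result ... checkpointed" message, separated by " | ") and that your system parses AM/PM in English; other locales will need a different timestamp format. The function name is made up, and the sample uses the four lines from the first post with the task name written with a hyphen throughout.

```python
from collections import defaultdict
from datetime import datetime

SAMPLE_LOG = """\
8/10/2015 2:56:14 PM | GPUGRID | [checkpoint] result e2s1_e1s14f430-GERARD_FXCXCL12_LIG_15353472-0-1-RND8208_1 checkpointed
8/10/2015 2:56:30 PM | GPUGRID | [checkpoint] result e2s1_e1s14f430-GERARD_FXCXCL12_LIG_15353472-0-1-RND8208_1 checkpointed
8/10/2015 2:56:47 PM | GPUGRID | [checkpoint] result e2s1_e1s14f430-GERARD_FXCXCL12_LIG_15353472-0-1-RND8208_1 checkpointed
8/10/2015 2:57:03 PM | GPUGRID | [checkpoint] result e2s1_e1s14f430-GERARD_FXCXCL12_LIG_15353472-0-1-RND8208_1 checkpointed
"""

PREFIX = "[checkpoint] result "
SUFFIX = " checkpointed"


def checkpoint_intervals(log_text):
    """Map each task name to the seconds between its consecutive checkpoints."""
    stamps_by_task = defaultdict(list)
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # not a "time | project | message" line
        when, _project, message = parts
        if not (message.startswith(PREFIX) and message.endswith(SUFFIX)):
            continue  # some other event log message
        task = message[len(PREFIX):-len(SUFFIX)]
        stamps_by_task[task].append(datetime.strptime(when, "%m/%d/%Y %I:%M:%S %p"))

    intervals = {}
    for task, stamps in stamps_by_task.items():
        stamps.sort()
        intervals[task] = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return intervals


for task, gaps in checkpoint_intervals(SAMPLE_LOG).items():
    print(task, gaps)
```

Run over a longer log, grouping by task name like this should make it easy to see whether different task types really do checkpoint at different intervals.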

I think it'd be nice if GPUGrid did honor the setting.
