ipfs freezes (memleak?) #8195

Version information:

go-ipfs version: 0.10.0-dev-041de2aed
Repo version: 11
System version: amd64/linux
Golang version: go1.16.4

Description:

I have a script that modifies the MFS and then runs an ipfs repo stat to have a look at the space consumption. If it's too high, it runs a manual GC to clean up. This is a workaround for the corruption of the MFS by the GC.

I noticed that it's hanging after it got stuck for 4 days while running on (what I consider) a high-performance server with SSD storage:

I also checked the remaining storage for the badger datastore:
The corresponding code looks like this:

ipfs_api_host="/ip4/127.0.0.1/tcp/5001"
function get_timestamp() {
date --utc -Iseconds
}
function ipfs_api() {
local -a cmd=(ipfs --api="$ipfs_api_host")
"${cmd[@]}" "$@"
return $?
}
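# Note: fail() is used below but not shown in this excerpt; it's assumed to
# print its first argument as an error message and exit with the second as the code.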
printf ':: checking diskspace... @ %s\n' "$(get_timestamp)"
repo_current_size=-1
repo_maxsize=-1
while IFS= read -r -d $'\n' line; do
if [[ $line =~ ^RepoSize.* ]]; then
repo_current_size=$(echo "$line" | awk '{ print $2 }')
elif [[ $line =~ ^StorageMax.* ]]; then
repo_maxsize=$(echo "$line" | awk '{ print $2 }')
fi
done < <(ipfs_api repo stat)
if [ -z "$repo_maxsize" ] || [ -z "$repo_current_size" ] || [ "$repo_maxsize" -eq -1 ] || [ "$repo_current_size" -eq -1 ]; then
fail "Could not read the repo sizeafter completing the import" 1233
elif [ "$repo_current_size" -gt "$repo_maxsize" ]; then
printf ':: diskspace usage exceeded maxsize; starting GC... @ %s\n' "$(get_timestamp)"
ipfs_api repo gc > /dev/null || fail "Could not run the GC after completing the import" 1232
printf ':: GC operation completed @ %s\n' "$(get_timestamp)"
else
printf ':: diskspace usage ok @ %s\n' "$(get_timestamp)"
fi
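For reference, the loop above keys off the RepoSize and StorageMax lines of ipfs repo stat, which prints something like this (the byte values here are illustrative):

NumObjects: 238
RepoSize:   412316860416
StorageMax: 500000000000
RepoPath:   /home/ipfs/.ipfs
Version:    fs-repo@11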
I tried to shut down the instance, but IPFS wasn't able to shut down within 5 minutes. So I killed it with TERM. I think this might be related to the badgerds?
I've updated to the latest master (f05f84a) and rebooted the machine. IPFS is still unable to report the repository stats within 15 minutes:
Does the command eventually return? If you ran the command and it hung for a day, that'd be surprising, and getting a pprof dump (https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md) could be helpful. Side note: the built-in GC algorithm requires going through the blockstore; if you have enough data that just listing it takes a long time, GC will be slow as well (e.g. ...).
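The debug guide amounts to pulling profiles off the daemon's pprof endpoints. A minimal sketch, assuming the API listens on the default 127.0.0.1:5001:

curl 'http://localhost:5001/debug/pprof/goroutine?debug=2' > ipfs.stacks   # stacks of all goroutines
curl 'http://localhost:5001/debug/pprof/heap' > ipfs.heap                  # heap profile
curl 'http://localhost:5001/debug/pprof/profile' > ipfs.cpuprof            # ~30s CPU profile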
I downgraded the server to 0.8.0 and I could get a stat within 15 minutes (it took something like 13 minutes). I ran the GC, but the consumption on the disk was still at 400 GB (measured with ...).
I converted the datastore to flatfs and the consumption of diskspace dropped to 78 GB.
So the issue is clearly somewhere in the badgerds.
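A quick way to reproduce such on-disk measurements is du against the datastore directories; the paths below assume a default repo location and are illustrative:

du -sh ~/.ipfs/badgerds   # on-disk size with the badger datastore
du -sh ~/.ipfs/blocks     # on-disk size of the flatfs blockstore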
Okay, so I'm now using .... I stopped the .... I tried to get the debug stuff via curl, as described in the debug guide; there were zero bytes received within 2 minutes. I tried to shut down IPFS and it wouldn't shut down within 10 minutes, so I killed it with .... Here's the printout:
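The exact signal used above isn't recorded here, but Go programs like go-ipfs dump the stacks of all goroutines to stderr when they receive SIGQUIT, which is one way to get a printout like the one referenced:

kill -QUIT "$(pidof ipfs)"   # Go runtime prints all goroutine stacks, then exits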
There's not a ton to go off of here. It seems pretty strange that curling the pprof endpoint just hangs, or that the stacktrace printout was slow, without there being any significant resource usage. The stacktrace being slow to output doesn't really seem like something under go-ipfs's control. Did you have any output from the daemon? If this happens again, it would be great to get a more complete stacktrace, since it's pretty hard to otherwise see what happened. What is your config?
Hey @aschmahmann, it happened again. I forgot to upgrade, so the server is still running 0.9-rc2. It's running some kind of memory-I/O-heavy task and won't respond to anything in a reasonable amount of time. I tried to follow the debug instructions, which leads to stuck curls that won't receive anything within several minutes. Here's an ... (if you're not familiar, that's a pretty new Linux API called pressure stall information). The process is currently shutting down with a ....
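For reference, pressure stall information is exposed under /proc/pressure on Linux 4.20 and later and can be read directly:

cat /proc/pressure/memory
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# "some": share of time at least one task was stalled on memory
# "full": share of time all non-idle tasks were stalled at once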
I'm sorry for the delay, but the machine is still dumping the core of the ipfs process...
@RubenKelevra v0.9.0-rc2 has some issues compared with the release (e.g. the experimental DHT has much higher CPU usage in the RC). If you've got any logs/dumps around the high memory usage, though, that'd be helpful.
@aschmahmann CPU usage is basically zero and the experimental DHT is not activated. I had these issues with badgerdb and expected the problem to be badger-related, but it now appears with flatfs as well. I think the only unusual config on this machine is ZFS as the file system. I'm currently mobile, but some hours ago it was still dumping its core 😂 To be exact: this is not about high memory consumption but about extremely high amounts of memory I/O. The machine still has plenty of memory left.
Here's a partly dumped core. Maybe this is already useful? Not sure. The process continues to dump. https://ipfs.io/ipfs/QmNzettEdAoD5qX4JgRDqpZz9UDWocWZVxN1ft64PLnfkf |
@RubenKelevra perhaps unrelated, but it looks like you were triggering some panics. Those may have been fixed in the final release, but I'm not sure at the moment.
Yeah, I saw that too. Not sure if it's related (:
Okay, I aborted the core dump now; it just creates too much downtime. I'll update to master on this machine and write back if this error persists there.
I can confirm this for the recent master too. I think I have an idea what's happening: I have memory restrictions for the ipfs service in place to prevent the process from running the kernel into OOM. That said, they're set to 12 GB of memory and 64 GB of address space. Neither was ever hit previously - not even close. Is a memory leak suspected in 0.9.x? 🤔
@aschmahmann do you run a somewhat recent systemd/Linux somewhere, and can you test out the following options?
I used all of them; the commented-out ones are my approach to binary-searching the issue.
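The option list itself doesn't appear above. Based on the limits described earlier (12 GB of memory, 64 GB of address space), a sketch of such a unit drop-in might look like this; the exact directives are an assumption:

[Service]
# Hard memory cap, enforced via cgroups (requires the unified cgroup hierarchy)
MemoryMax=12G
# Cap on the virtual address space (setrlimit RLIMIT_AS)
LimitAS=64G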
And memory usage is now above the MemoryMax set previously:
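One way to compare a unit's current memory accounting against its cap, assuming the service is named ipfs.service:

systemctl show ipfs.service -p MemoryCurrent -p MemoryMax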
The config on this machine:

$ ipfs --api='/ip4/127.0.0.1/tcp/5001' config show
{
"API": {
"HTTPHeaders": {}
},
"Addresses": {
"API": "/ip4/127.0.0.1/tcp/5001",
"Announce": [],
"Gateway": "/ip4/127.0.0.1/tcp/80",
"NoAnnounce": [
"/ip4/10.0.0.0/ipcidr/8",
"/ip4/100.64.0.0/ipcidr/10",
"/ip4/169.254.0.0/ipcidr/16",
"/ip4/172.16.0.0/ipcidr/12",
"/ip4/192.0.0.0/ipcidr/24",
"/ip4/192.0.0.0/ipcidr/29",
"/ip4/192.0.0.8/ipcidr/32",
"/ip4/192.0.0.170/ipcidr/32",
"/ip4/192.0.0.171/ipcidr/32",
"/ip4/192.0.2.0/ipcidr/24",
"/ip4/192.168.0.0/ipcidr/16",
"/ip4/198.18.0.0/ipcidr/15",
"/ip4/198.51.100.0/ipcidr/24",
"/ip4/203.0.113.0/ipcidr/24",
"/ip4/240.0.0.0/ipcidr/4",
"/ip6/100::/ipcidr/64",
"/ip6/2001:2::/ipcidr/48",
"/ip6/2001:db8::/ipcidr/32",
"/ip6/fc00::/ipcidr/7",
"/ip6/fe80::/ipcidr/10"
],
"Swarm": [
"/ip4/0.0.0.0/tcp/443",
"/ip6/::/tcp/443"
]
},
"AutoNAT": {},
"Bootstrap": [
"/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
"/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
"/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
"/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
],
"DNS": {
"Resolvers": null
},
"Datastore": {
"BloomFilterSize": 0,
"GCPeriod": "1h",
"HashOnRead": false,
"Spec": {
"mounts": [
{
"child": {
"path": "blocks",
"shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
"sync": true,
"type": "flatfs"
},
"mountpoint": "/blocks",
"prefix": "flatfs.datastore",
"type": "measure"
},
{
"child": {
"compression": "none",
"path": "datastore",
"type": "levelds"
},
"mountpoint": "/",
"prefix": "leveldb.datastore",
"type": "measure"
}
],
"type": "mount"
},
"StorageGCWatermark": 90,
"StorageMax": "500GB"
},
"Discovery": {
"MDNS": {
"Enabled": false,
"Interval": 10
}
},
"Experimental": {
"AcceleratedDHTClient": false,
"FilestoreEnabled": false,
"GraphsyncEnabled": false,
"Libp2pStreamMounting": false,
"P2pHttpProxy": false,
"ShardingEnabled": false,
"StrategicProviding": false,
"UrlstoreEnabled": false
},
"Gateway": {
"APICommands": [],
"HTTPHeaders": {
"Access-Control-Allow-Headers": [
"X-Requested-With",
"Range",
"User-Agent"
],
"Access-Control-Allow-Methods": [
"GET"
],
"Access-Control-Allow-Origin": [
"*"
]
},
"NoDNSLink": false,
"NoFetch": false,
"PathPrefixes": [],
"PublicGateways": null,
"RootRedirect": "",
"Writable": false
},
"Identity": {
"PeerID": "QmVoV4RiGLcxAfhA181GXR867bzVxmRTWwaubvhUyFrBwB"
},
"Ipns": {
"RecordLifetime": "96h",
"RepublishPeriod": "",
"ResolveCacheSize": 128
},
"Migration": {
"DownloadSources": null,
"Keep": ""
},
"Mounts": {
"FuseAllowOther": false,
"IPFS": "/ipfs",
"IPNS": "/ipns"
},
"Peering": {
"Peers": null
},
"Pinning": {},
"Plugins": {
"Plugins": null
},
"Provider": {
"Strategy": ""
},
"Pubsub": {
"DisableSigning": false,
"Router": "gossipsub"
},
"Reprovider": {
"Interval": "12h",
"Strategy": "all"
},
"Routing": {
"Type": "dht"
},
"Swarm": {
"AddrFilters": [
"/ip4/10.0.0.0/ipcidr/8",
"/ip4/100.64.0.0/ipcidr/10",
"/ip4/169.254.0.0/ipcidr/16",
"/ip4/172.16.0.0/ipcidr/12",
"/ip4/192.0.0.0/ipcidr/24",
"/ip4/192.0.0.0/ipcidr/29",
"/ip4/192.0.0.8/ipcidr/32",
"/ip4/192.0.0.170/ipcidr/32",
"/ip4/192.0.0.171/ipcidr/32",
"/ip4/192.0.2.0/ipcidr/24",
"/ip4/192.168.0.0/ipcidr/16",
"/ip4/198.18.0.0/ipcidr/15",
"/ip4/198.51.100.0/ipcidr/24",
"/ip4/203.0.113.0/ipcidr/24",
"/ip4/240.0.0.0/ipcidr/4",
"/ip6/100::/ipcidr/64",
"/ip6/2001:2::/ipcidr/48",
"/ip6/2001:db8::/ipcidr/32",
"/ip6/fc00::/ipcidr/7",
"/ip6/fe80::/ipcidr/10"
],
"ConnMgr": {
"GracePeriod": "3m",
"HighWater": 12000,
"LowWater": 11800,
"Type": "basic"
},
"DisableBandwidthMetrics": false,
"DisableNatPortMap": true,
"EnableAutoRelay": false,
"EnableRelayHop": false,
"Transports": {
"Multiplexers": {},
"Network": {
"QUIC": false
},
"Security": {}
}
}
}
I think this and #8219 are the same issue. Closing this one.
This was fixed in #8263 |