ipfs freezes (memleak?) #8195

Closed
RubenKelevra opened this issue Jun 16, 2021 · 22 comments
Labels
kind/bug A bug in existing code (including security flaws) · need/triage Needs initial labeling and prioritization

Comments

@RubenKelevra
Contributor

Version information:

go-ipfs version: 0.10.0-dev-041de2aed
Repo version: 11
System version: amd64/linux
Golang version: go1.16.4

Description:

I have a script that modifies the MFS and then runs ipfs repo stat to check the space consumption. If it's too high, the script runs a manual GC to clean up.

This is a workaround for the corruption of the MFS by the GC.

I noticed that it's hanging after it had been stuck for 4 days, while running on (what I consider) a high-performance server with SSD storage:

  • Six cores of an AMD EPYC 7702P
  • 32 Gigs of memory

I also checked the remaining storage for the badger datastore:

  • 281G free
@RubenKelevra RubenKelevra added the kind/bug and need/triage labels Jun 16, 2021
@RubenKelevra
Contributor Author

The corresponding code looks like this:

ipfs_api_host="/ip4/127.0.0.1/tcp/5001"

function get_timestamp() {
        date --utc -Iseconds
}

function ipfs_api() {
        local -a cmd=(ipfs --api="$ipfs_api_host")

        "${cmd[@]}" "$@"
        return $?
}

printf ':: checking diskspace... @ %s\n' "$(get_timestamp)"

repo_current_size=-1
repo_maxsize=-1
while IFS= read -r -d $'\n' line; do
        if [[ $line =~ ^RepoSize.* ]]; then
                repo_current_size=$(echo "$line" | awk '{ print $2 }')
        elif [[ $line =~ ^StorageMax.* ]]; then
                repo_maxsize=$(echo "$line" | awk '{ print $2 }')
        fi
done < <(ipfs_api repo stat)

if [ -z "$repo_maxsize" ] || [ -z "$repo_current_size" ] || [ "$repo_maxsize" -eq -1 ] || [ "$repo_current_size" -eq -1 ]; then
        fail "Could not read the repo sizeafter completing the import" 1233
elif [ "$repo_current_size" -gt "$repo_maxsize" ]; then
        printf ':: diskspace usage exceeded maxsize; starting GC... @ %s\n' "$(get_timestamp)"
        ipfs_api repo gc > /dev/null || fail "Could not run the GC after completing the import" 1232
        printf ':: GC operation completed @ %s\n' "$(get_timestamp)"
else
        printf ':: diskspace usage ok @ %s\n' "$(get_timestamp)"
fi

@RubenKelevra
Contributor Author

I tried to shut down the instance, but IPFS wasn't able to shut down within 5 minutes. So I killed it with TERM.

I think this might be related to the badgerds?

@RubenKelevra
Contributor Author

I've updated to the latest master (f05f84a) and rebooted the machine. IPFS is still unable to report the repository stats within 15 minutes: ipfs repo stat --timeout 15m returns Error: context deadline exceeded after 15 minutes.

@aschmahmann
Contributor

aschmahmann commented Jun 16, 2021

Does ipfs repo stat --size-only return quickly? ipfs repo stat iterates through every object in the repo which means that if you have a lot of blocks it'll take a long time. If your repo is on a remote HDD or something I suspect performance would decrease as well. Posting your config (ipfs config show) could be helpful for debugging as well.

If you ran the command and it hung for a day that'd be surprising, and getting a pprof dump (https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md) could be helpful.

Side note: The built-in GC algorithm requires going through the blockstore, so if you have enough data that just listing it takes a long time (e.g. ipfs repo stat or ipfs refs local), then running GC is likely to be quite slow as well.
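
For reference, the debug guide essentially boils down to pulling the standard Go pprof endpoints off the API port. A minimal sketch (assuming the default API address of 127.0.0.1:5001; adjust if your Addresses.API differs):

# Goroutine stacks - usually the most useful dump when the daemon hangs.
curl -s "http://127.0.0.1:5001/debug/pprof/goroutine?debug=2" -o ipfs.stacks

# Heap profile - useful when a memory leak is suspected.
curl -s "http://127.0.0.1:5001/debug/pprof/heap" -o ipfs.heap

# 30-second CPU profile.
curl -s "http://127.0.0.1:5001/debug/pprof/profile" -o ipfs.cpuprof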

@RubenKelevra
Contributor Author

RubenKelevra commented Jun 17, 2021

I downgraded the server to 0.8.0 and I could get a stat within 15 minutes (something like 13 minutes).

I ran the GC but the consumption on disk was still at 400 GB (measured with du), while I knew it should be more like 80 GB.

ipfs repo stat also reported a value of around 300 GB and 270'000 objects.

I converted the datastore to flatfs and the consumption of diskspace dropped to 78 GB.
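
For anyone following along: a conversion like that is typically done with the ipfs-ds-convert tool. A rough sketch, assuming the daemon is stopped and the repo config's Datastore.Spec has already been edited by hand to the flatfs/levelds layout shown later in this issue:

# Rough sketch of a badgerds -> flatfs migration (assumes ipfs-ds-convert is installed).
ipfs shutdown || true        # make sure no daemon is holding the repo lock
ipfs-ds-convert convert      # migrates the blocks into the new datastore layout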

ipfs repo stat now returns in 640ms with the master version of go-ipfs.

So the issue is clearly somewhere in the badgerds.

@RubenKelevra
Contributor Author

Okay, so I'm now using flatfs and still running master. A simple ipfs files stat /foldername has now hung for 12 hours, while everything else worked fine up to this point.

I stopped the ipfs files stat /foldername and tried an ipfs repo stat, which doesn't return either within several minutes.

I tried to get the debug data via curl, as described in the debug guide - zero bytes were received within 2 minutes.

I tried to shut down IPFS and it didn't shut down within 10 minutes.

So I killed it with kill -ABRT <pid>, and while the machine was basically idle (1-2% CPU usage and no disk usage to speak of), the stack trace was printed out EXTREMELY slowly. Sadly it got interrupted by systemd timing out while stopping the service.
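
For anyone reproducing this: the signal-based dump approach looks roughly like the sketch below. SIGQUIT is the conventional way to make the Go runtime print all goroutine stacks; SIGABRT (used here) behaves similarly and, with GOTRACEBACK=crash, also produces a core dump.

# Dump goroutine stacks of a stuck Go process via signals (the process exits afterwards).
kill -QUIT "$(pidof ipfs)"    # goroutine dump to stderr / the journal
kill -ABRT "$(pidof ipfs)"    # what was used here; with GOTRACEBACK=crash it also dumps core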

Here's the printout:

-- Journal begins at Sun 2021-05-30 13:56:24 CEST, ends at Fri 2021-06-18 12:30:48 CEST. --
Jun 18 12:04:29 odin.pacman.store systemd[1]: Stopping InterPlanetary File System (IPFS) daemon...
Jun 18 12:13:34 odin.pacman.store ipfs[3598897]: SIGABRT: abort
Jun 18 12:13:49 odin.pacman.store ipfs[3598897]: PC=0x55dde6863661 m=0 sigcode=0
Jun 18 12:13:57 odin.pacman.store ipfs[3598897]: goroutine 0 [idle]:
Jun 18 12:14:12 odin.pacman.store ipfs[3598897]: runtime.futex(0x55dde8baee70, 0x80, 0x0, 0x0, 0x55dd00000000, 0x7ffcce832650, 0x55dde6835c1f, 0xc000058ed8, 0x7ffcce832678, 0x55dde67f9885, ...)
Jun 18 12:14:25 odin.pacman.store ipfs[3598897]:         go/src/runtime/sys_linux_amd64.s:579 +0x21
Jun 18 12:14:27 odin.pacman.store ipfs[3598897]: runtime.futexsleep(0x55dde8baee70, 0x55dd00000000, 0xffffffffffffffff)
Jun 18 12:14:29 odin.pacman.store ipfs[3598897]:         go/src/runtime/os_linux.go:44 +0x46
Jun 18 12:14:34 odin.pacman.store ipfs[3598897]: runtime.notesleep(0x55dde8baee70)
Jun 18 12:14:36 odin.pacman.store ipfs[3598897]:         go/src/runtime/lock_futex.go:159 +0xa5
Jun 18 12:14:40 odin.pacman.store ipfs[3598897]: runtime.mPark()
Jun 18 12:14:44 odin.pacman.store ipfs[3598897]:         go/src/runtime/proc.go:1340 +0x3d
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: runtime.stoplockedm()
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]:         go/src/runtime/proc.go:2495 +0x79
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: runtime.schedule()
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]:         go/src/runtime/proc.go:3103 +0x48c
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: runtime.park_m(0xc000103b00)
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]:         go/src/runtime/proc.go:3318 +0xab
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: runtime.mcall(0x0)
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]:         go/src/runtime/asm_amd64.s:327 +0x53
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: goroutine 1 [chan receive, 1967 minutes]:
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: main.daemonFunc(0xc00017e070, 0x55dde7fa9b68, 0xc000912a20, 0x55dde7e31960, 0xc0009129c0, 0x0, 0x0)
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/src/go-ipfs/cmd/ipfs/daemon.go:545 +0x1025
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: github.com/ipfs/go-ipfs-cmds.(*executor).Execute(0xc000010700, 0xc00017e070, 0x55dde7fa9b68, 0xc000912a20, 0x55dde7e31960, 0xc0009129c0, 0xc000912a20, 0x0)
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/ipfs/[email protected]/executor.go:77 +0x16b
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: github.com/ipfs/go-ipfs-cmds/cli.Run(0x55dde7fa9778, 0xc00088ec00, 0x55dde8badf60, 0xc000142000, 0x4, 0x4, 0xc00013e000, 0xc00013e008, 0xc00013e010, 0x55dde7f575c0, ...)
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/ipfs/[email protected]/cli/run.go:137 +0xac5
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: main.mainRet(0x0)
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/src/go-ipfs/cmd/ipfs/main.go:168 +0x45e
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: main.main()
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/src/go-ipfs/cmd/ipfs/main.go:71 +0x25
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: goroutine 19 [select, 1967 minutes]:
Jun 18 12:14:46 odin.pacman.store ipfs[3598897]: github.com/ipfs/go-log/writer.(*MirrorWriter).logRoutine(0xc00011be00)
Jun 18 12:14:53 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/ipfs/[email protected]/writer/writer.go:71 +0x108
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]: created by github.com/ipfs/go-log/writer.NewMirrorWriter
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/ipfs/[email protected]/writer/writer.go:36 +0xbb
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]: goroutine 6 [runnable]:
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]: go.opencensus.io/stats/view.(*worker).start(0xc000312000)
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/[email protected]/stats/view/worker.go:276 +0xcf
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]: created by go.opencensus.io/stats/view.init.0
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/[email protected]/stats/view/worker.go:34 +0x6a
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]: goroutine 1025 [IO wait, 607 minutes]:
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]: internal/poll.runtime_pollWait(0x7f0550fdacb8, 0x72, 0xffffffffffffffff)
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]:         go/src/runtime/netpoll.go:222 +0x65
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]: internal/poll.(*pollDesc).wait(0xc000ef0b18, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]:         go/src/internal/poll/fd_poll_runtime.go:87 +0x47
Jun 18 12:14:57 odin.pacman.store ipfs[3598897]: internal/poll.(*pollDesc).waitRead(...)
Jun 18 12:15:03 odin.pacman.store ipfs[3598897]:         go/src/internal/poll/fd_poll_runtime.go:92
Jun 18 12:15:14 odin.pacman.store ipfs[3598897]: internal/poll.(*FD).Read(0xc000ef0b00, 0xc000ccb000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 18 12:15:27 odin.pacman.store ipfs[3598897]:         go/src/internal/poll/fd_unix.go:166 +0x1d8
Jun 18 12:15:35 odin.pacman.store ipfs[3598897]: net.(*netFD).Read(0xc000ef0b00, 0xc000ccb000, 0x1000, 0x1000, 0x0, 0x1000, 0x121a)
Jun 18 12:15:42 odin.pacman.store ipfs[3598897]:         go/src/net/fd_posix.go:55 +0x51
Jun 18 12:16:05 odin.pacman.store ipfs[3598897]: net.(*conn).Read(0xc00013e4b8, 0xc000ccb000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 18 12:16:18 odin.pacman.store ipfs[3598897]:         go/src/net/net.go:183 +0x93
Jun 18 12:16:50 odin.pacman.store ipfs[3598897]: bufio.(*Reader).Read(0xc001c32000, 0xc0002406c0, 0x2, 0x2, 0xc000e76c18, 0x55dde6e1cfe5, 0xc000eee190)
Jun 18 12:16:50 odin.pacman.store ipfs[3598897]:         go/src/bufio/bufio.go:227 +0x224
Jun 18 12:16:50 odin.pacman.store ipfs[3598897]: io.ReadAtLeast(0x55dde7f81bd8, 0xc001c32000, 0xc0002406c0, 0x2, 0x2, 0x2, 0x1000, 0x0, 0x0)
Jun 18 12:16:50 odin.pacman.store ipfs[3598897]:         go/src/io/io.go:328 +0x89
Jun 18 12:16:50 odin.pacman.store ipfs[3598897]: io.ReadFull(...)
Jun 18 12:16:50 odin.pacman.store ipfs[3598897]:         go/src/io/io.go:347
Jun 18 12:16:52 odin.pacman.store ipfs[3598897]: github.com/libp2p/go-libp2p-noise.(*secureSession).readNextInsecureMsgLen(0xc000240630, 0xc0019bf000, 0x0, 0x1000)
Jun 18 12:16:52 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/libp2p/[email protected]/rw.go:134 +0x68
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]: github.com/libp2p/go-libp2p-noise.(*secureSession).Read(0xc000240630, 0xc0019bf000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/libp2p/[email protected]/rw.go:53 +0x20e
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]: bufio.(*Reader).fill(0xc000e242a0)
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]:         go/src/bufio/bufio.go:101 +0x108
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]: bufio.(*Reader).ReadByte(0xc000e242a0, 0xdd79cb32a01d7153, 0x0, 0x55dde6dfd565)
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]:         go/src/bufio/bufio.go:253 +0x3b
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]: github.com/multiformats/go-varint.ReadUvarint(0x55dde7f81bb8, 0xc000e242a0, 0x4300000000002983, 0x0, 0xc09f2c83d0)
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/multiformats/[email protected]/varint.go:80 +0x79
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]: github.com/libp2p/go-mplex.(*Multiplex).readNextHeader(0xc000ef0180, 0xc000e1c1b0, 0xc000e76e90, 0x55dde8be0380, 0x0)
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/libp2p/[email protected]/multiplex.go:508 +0x45
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]: github.com/libp2p/go-mplex.(*Multiplex).handleIncoming(0xc000ef0180)
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/libp2p/[email protected]/multiplex.go:365 +0xe5
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]: created by github.com/libp2p/go-mplex.NewMultiplex
Jun 18 12:16:54 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/libp2p/[email protected]/multiplex.go:105 +0x20b
Jun 18 12:17:29 odin.pacman.store ipfs[3598897]: goroutine 356984053 [runnable]:
Jun 18 12:18:33 odin.pacman.store ipfs[3598897]: github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).runQueue(0xc07b69a000)
Jun 18 12:18:43 odin.pacman.store ipfs[3598897]:         /home/ruben/.cache/yay/go-ipfs-git/go/pkg/mod/github.com/ipfs/[email protected]/internal/messagequeue/messagequeue.go:404 +0x231
Jun 18 12:19:26 odin.pacman.store ipfs[3598897]: created by github.com/ipfs/go-bitswap/internal/messagequeue.(*MessageQueue).Startup
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: State 'stop-sigterm' timed out. Killing.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598897 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598898 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598899 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598901 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598902 (n/a) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598903 (n/a) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598904 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598905 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598906 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598907 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3598909 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3599107 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3601147 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3601170 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3603730 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3605532 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3744214 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846735 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846736 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846740 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846742 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846744 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846745 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846764 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846859 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846861 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846932 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846933 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846934 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846935 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846936 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846937 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846938 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846939 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846940 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846941 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846942 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846943 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3846949 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3847300 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3847301 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3847302 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3847303 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3847321 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3847322 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3849153 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3849869 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3849870 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3849871 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3849873 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3849874 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3849875 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 3849876 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 4034926 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 4034927 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 4034928 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 4034929 (n/a) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 4034931 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 4034932 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 4034946 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 4034950 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 4034951 (ipfs) with signal SIGKILL.
Jun 18 12:19:29 odin.pacman.store systemd[1]: ipfs.service: Killing process 866680 (ipfs) with signal SIGKILL.
Jun 18 12:19:30 odin.pacman.store systemd[1]: ipfs.service: Main process exited, code=killed, status=9/KILL
Jun 18 12:19:30 odin.pacman.store systemd[1]: ipfs.service: Failed with result 'timeout'.
Jun 18 12:19:30 odin.pacman.store systemd[1]: Stopped InterPlanetary File System (IPFS) daemon.
Jun 18 12:19:30 odin.pacman.store systemd[1]: ipfs.service: Consumed 1d 2h 41min 8.850s CPU time.

@RubenKelevra RubenKelevra changed the title from "ipfs never returns on ipfs repo stat" to "ipfs freezes" Jun 18, 2021
@aschmahmann
Contributor

There's not a ton to go on here. It seems pretty strange that curling the pprof endpoint just hangs, or that the stacktrace printout was slow, without there being any significant resource usage. The stacktrace being slow to output doesn't really seem like something under go-ipfs's control.

Did you have any output from the daemon? If this happens again it would be great to get a more complete stacktrace since it's pretty hard to otherwise see what happened.

What is your config (ipfs config show)?

@RubenKelevra
Contributor Author

RubenKelevra commented Jul 4, 2021

Hey @aschmahmann, it's happening again. I forgot to upgrade, so the server is still running 0.9-rc2.

It's running some kind of memory-I/O-heavy task and won't respond to anything in a reasonable amount of time.

I tried to follow the debug instructions, which leads to stuck curls that don't receive anything within several minutes.

Here's an atop 1 screenshot; note the high percentages for memsome and memfull, while there's plenty of memory left.

[Screenshot: atop output showing high memsome and memfull percentages]

If you're not familiar with those: that's a fairly new Linux API called pressure stall information (PSI).
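
For context, PSI is exposed under /proc/pressure/, so it can be watched without atop; a quick sketch (assumes a kernel >= 4.20 with PSI enabled):

# "some" = share of time at least one task was stalled waiting on memory,
# "full" = share of time all non-idle tasks were stalled at once.
watch -n1 cat /proc/pressure/memory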

The process is currently shutting down with a kill -ABRT. Once finished I will add the log here (might take a while).

@RubenKelevra
Contributor Author

I'm sorry for the delay, but the machine is still dumping the core of the ipfs process...

@aschmahmann
Contributor

@RubenKelevra v0.9.0-rc2 has some issues compared with the release (e.g. the experimental DHT has much higher CPU usage in the RC). If you've got any logs/dumps around the high memory usage, though, that'd be helpful.

@RubenKelevra
Contributor Author

RubenKelevra commented Jul 6, 2021

@aschmahmann CPU usage is basically zero and the experimental DHT is not activated.

I had these issues with badgerds and expected the problem to be badger-related, but now it appears with flatfs as well.

I think the only unusual config on this machine is ZFS as the file system.

I'm currently mobile, but some hours ago it was still dumping its core 😂

To be exact: this is not about high memory consumption but about extremely high amounts of memory I/O.

The machine has still plenty of memory left.

@RubenKelevra
Contributor Author

Here's a partially dumped core. Maybe this is already useful? Not sure.

The process continues to dump.

https://ipfs.io/ipfs/QmNzettEdAoD5qX4JgRDqpZz9UDWocWZVxN1ft64PLnfkf

@aschmahmann
Contributor

@RubenKelevra perhaps unrelated but it looks like you were triggering some panics. Those may have been fixed in the final release, but I'm not sure at the moment.

@RubenKelevra
Contributor Author

Yeah I saw that too. Not sure if related (:

@RubenKelevra
Contributor Author

Okay, I aborted the core dump now; it just creates too much downtime.

I'll update to master on this machine and write back if this error persists there.

@RubenKelevra
Contributor Author

I can confirm this for the recent master too.

I think I have an idea what's happening: I have memory restrictions for the ipfs service in place to prevent the process from running the kernel into an OOM situation.

That said, it's set to 12 GB of memory and 64 GB of address space.

Neither was ever hit previously - not even close.

Is there a memory leak suspected in 0.9.x? 🤔

@RubenKelevra
Contributor Author

@aschmahmann do you run a somewhat recent systemd/Linux somewhere where you can test out the following options?

CPUSchedulingPolicy=idle
Nice=3
IOSchedulingClass=idle
IOSchedulingPriority=1
MemorySwapMax=0
#MemoryMax=12G
#MemoryHigh=8G
#MemoryMin=1G
#MemoryLow=2G
#LimitAS=64G:128G
LimitNOFILE=750000:1000000

I used all of them; the commented-out ones are my approach to binary-searching the issue.
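
To see what the unit actually ends up with at runtime, the cgroup accounting can be queried directly; a small sketch (assumes the unit is called ipfs.service):

# Show the current memory usage and the effective limits for the unit.
systemctl show ipfs.service -p MemoryCurrent -p MemoryHigh -p MemoryMax
# Live per-cgroup view, ordered by memory usage.
systemd-cgtop -m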

@RubenKelevra
Contributor Author

And the memory usage is now above the MemoryMax set previously: Memory: 15.3G (swap max: 0B)

@RubenKelevra
Contributor Author

This looks like a memory leak to me:

[Screenshot: memory usage of the ipfs process over time]

The current process has been running since:
Sat 2021-07-10 01:45:38 CEST; 20h ago

Debug data:

/ipfs/bafykbzacecrv5ykohkefyhvmqrjbsxyh2q2kcitvcj4amyz7o76h7h4jwgq4s
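
If that dump contains the heap profile from the debug endpoints, the usual way to see where the memory sits is go tool pprof; a sketch (ipfs.heap is an assumed filename for the heap profile):

# Top consumers of in-use memory from a heap profile.
go tool pprof -top -inuse_space ipfs.heap
# Or explore it interactively in a browser.
go tool pprof -http=:8081 ipfs.heap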

@RubenKelevra RubenKelevra changed the title from "ipfs freezes" to "ipfs freezes (memleak?)" Jul 10, 2021
@RubenKelevra
Contributor Author

The config on this machine:

$ ipfs --api='/ip4/127.0.0.1/tcp/5001' config show
{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": [],
    "Gateway": "/ip4/127.0.0.1/tcp/80",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.0.0/ipcidr/29",
      "/ip4/192.0.0.8/ipcidr/32",
      "/ip4/192.0.0.170/ipcidr/32",
      "/ip4/192.0.0.171/ipcidr/32",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/443",
      "/ip6/::/tcp/443"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
  ],
  "DNS": {
    "Resolvers": null
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "500GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false,
      "Interval": 10
    }
  },
  "Experimental": {
    "AcceleratedDHTClient": false,
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "ShardingEnabled": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "QmVoV4RiGLcxAfhA181GXR867bzVxmRTWwaubvhUyFrBwB"
  },
  "Ipns": {
    "RecordLifetime": "96h",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Migration": {
    "DownloadSources": null,
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {},
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": "gossipsub"
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "Routing": {
    "Type": "dht"
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.0.0/ipcidr/29",
      "/ip4/192.0.0.8/ipcidr/32",
      "/ip4/192.0.0.170/ipcidr/32",
      "/ip4/192.0.0.171/ipcidr/32",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "3m",
      "HighWater": 12000,
      "LowWater": 11800,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "EnableAutoRelay": false,
    "EnableRelayHop": false,
    "Transports": {
      "Multiplexers": {},
      "Network": {
        "QUIC": false
      },
      "Security": {}
    }
  }
}

@RubenKelevra
Contributor Author

I think this and #8219 are the same issue. Closing this one.

@RubenKelevra
Contributor Author

This was fixed in #8263.
