Segfault when creating many pods simultaneously #219

Closed
amohoste opened this issue Apr 21, 2021 · 2 comments · Fixed by #337
Labels
bug Something isn't working

Comments

@amohoste
Contributor

Describe the bug
When many pods are created simultaneously, vHive segfaults in createUserContainer.

To Reproduce

  1. Set up a vHive cluster as specified in the quickstart guide (one master and one worker should be sufficient)
  2. Modify examples/deployer/functions.json such that only helloworld will be deployed
  3. Deploy helloworld by calling go run examples/deployer/client.go
  4. Call watch kubectl get pods and wait until all pods of the helloworld deployment get deleted
  5. Call go run examples/invoker/client.go -rps 100

Logs

WARN[2021-04-21T10:07:55.170793024-04:00] Using google dns 8.8.8.8
ERRO[2021-04-21T10:07:55.315457829-04:00] stock containerd failed to start UC           error="rpc error: code = Unknown desc = failed to create container io: failed to open /run/containerd/io.containerd.grpc.v1.cri/containers/ae4016be685c7b69fdab28942aeba1df87602b2115c6e5163a933556097af716/io/084543282/ae4016be685c7b69fdab28942aeba1df87602b2115c6e5163a933556097af716-stdout with O_PATH: open /run/containerd/io.containerd.grpc.v1.cri/containers/ae4016be685c7b69fdab28942aeba1df87602b2115c6e5163a933556097af716/io/084543282/ae4016be685c7b69fdab28942aeba1df87602b2115c6e5163a933556097af716-stdout: too many open files"
ERRO[2021-04-21T10:07:55.376246971-04:00] stock containerd failed to start UC           error="rpc error: code = Unknown desc = failed to create container io: failed to open /run/containerd/io.containerd.grpc.v1.cri/containers/a16675f95a3f3c6b61307357be3c32be7e70463a5e37b703722d79c833d296ff/io/182064948/a16675f95a3f3c6b61307357be3c32be7e70463a5e37b703722d79c833d296ff-stdout with O_PATH: open /run/containerd/io.containerd.grpc.v1.cri/containers/a16675f95a3f3c6b61307357be3c32be7e70463a5e37b703722d79c833d296ff/io/182064948/a16675f95a3f3c6b61307357be3c32be7e70463a5e37b703722d79c833d296ff-stdout: too many open files"
WARN[2021-04-21T10:07:55.435388176-04:00] Failed to Fetch k8s dns clusterIP exit status 1
The connection to the server localhost:8080 was refused - did you specify the right host or port?

WARN[2021-04-21T10:07:55.435428998-04:00] Using google dns 8.8.8.8
WARN[2021-04-21T10:07:55.616783855-04:00] Failed to Fetch k8s dns clusterIP exit status 1
The connection to the server localhost:8080 was refused - did you specify the right host or port?

WARN[2021-04-21T10:07:55.616812815-04:00] Using google dns 8.8.8.8
ERRO[2021-04-21T10:07:55.676765035-04:00] stock containerd failed to start UC           error="rpc error: code = Unknown desc = failed to create container io: failed to open /run/containerd/io.containerd.grpc.v1.cri/containers/4e09456ef058db9794e0ab37610a6359e004776c84c7559ac00229fa3b519b44/io/133637766/4e09456ef058db9794e0ab37610a6359e004776c84c7559ac00229fa3b519b44-stdout with O_PATH: open /run/containerd/io.containerd.grpc.v1.cri/containers/4e09456ef058db9794e0ab37610a6359e004776c84c7559ac00229fa3b519b44/io/133637766/4e09456ef058db9794e0ab37610a6359e004776c84c7559ac00229fa3b519b44-stdout: too many open files"
WARN[2021-04-21T10:07:55.877455054-04:00] Failed to Fetch k8s dns clusterIP exit status 1
The connection to the server localhost:8080 was refused - did you specify the right host or port?

WARN[2021-04-21T10:07:55.877491851-04:00] Using google dns 8.8.8.8
WARN[2021-04-21T10:07:55.962490807-04:00] Failed to Fetch k8s dns clusterIP exit status 1
The connection to the server localhost:8080 was refused - did you specify the right host or port?

WARN[2021-04-21T10:07:55.962533852-04:00] Using google dns 8.8.8.8
ERRO[2021-04-21T10:07:56.225404440-04:00] VM config for pod 2ecfef2261526a5afc22069926c7e928d9c2e9d55a1517f114d24e735cd6c127 does not exist
ERRO[2021-04-21T10:07:56.225433001-04:00]                                               error="VM config for pod does not exist"
ERRO[2021-04-21T10:07:56.234538488-04:00] VM config for pod 884e10c32b080de86f8f7b390f68e2e94faabe8a12d3ed52dfb51f199a67eb7d does not exist
ERRO[2021-04-21T10:07:56.234564359-04:00]                                               error="VM config for pod does not exist"
ERRO[2021-04-21T10:07:56.292045020-04:00] stock containerd failed to start UC           error="rpc error: code = Unknown desc = failed to create container io: failed to open /run/containerd/io.containerd.grpc.v1.cri/containers/d7276f89220d2d8bd91309606e7e826d3a87c431b041fa71b1514a08cf8dde0c/io/005718738/d7276f89220d2d8bd91309606e7e826d3a87c431b041fa71b1514a08cf8dde0c-stderr with O_PATH: open /run/containerd/io.containerd.grpc.v1.cri/containers/d7276f89220d2d8bd91309606e7e826d3a87c431b041fa71b1514a08cf8dde0c/io/005718738/d7276f89220d2d8bd91309606e7e826d3a87c431b041fa71b1514a08cf8dde0c-stderr: too many open files"
WARN[2021-04-21T10:07:56.429950650-04:00] Failed to Fetch k8s dns clusterIP exit status 1
The connection to the server localhost:8080 was refused - did you specify the right host or port?

WARN[2021-04-21T10:07:56.429995899-04:00] Using google dns 8.8.8.8
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x112a116]

goroutine 8132 [running]:
github.com/ease-lab/vhive/cri.(*Service).createUserContainer(0xc000480360, 0x14077a0, 0xc000fb9cb0, 0xc000fb9ce0, 0xc000f6ab10, 0x2, 0x2)
        /users/ahoste2/vhive/cri/container_create.go:94 +0x3d6
github.com/ease-lab/vhive/cri.(*Service).CreateContainer(0xc000480360, 0x14077a0, 0xc000fb9cb0, 0xc000fb9ce0, 0xc000480360, 0xc000fb9cb0, 0xc000f6aba0)
        /users/ahoste2/vhive/cri/container_create.go:53 +0x279
k8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_CreateContainer_Handler(0x12d90c0, 0xc000480360, 0x14077a0, 0xc000fb9cb0, 0xc0016b2cc0, 0x0, 0x14077a0, 0xc000fb9cb0, 0xc000f31600, 0xd72)
        /users/ahoste2/go/pkg/mod/k8s.io/[email protected]/pkg/apis/runtime/v1alpha2/api.pb.go:7699 +0x214
google.golang.org/grpc.(*Server).processUnaryRPC(0xc0004ae000, 0x140f7e0, 0xc00049fc80, 0xc000e20100, 0xc000482660, 0x1abaaf0, 0x0, 0x0, 0x0)
        /users/ahoste2/go/pkg/mod/google.golang.org/[email protected]/server.go:1210 +0x522
google.golang.org/grpc.(*Server).handleStream(0xc0004ae000, 0x140f7e0, 0xc00049fc80, 0xc000e20100, 0x0)
        /users/ahoste2/go/pkg/mod/google.golang.org/[email protected]/server.go:1533 +0xd05
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc00019e070, 0xc0004ae000, 0x140f7e0, 0xc00049fc80, 0xc000e20100)
        /users/ahoste2/go/pkg/mod/google.golang.org/[email protected]/server.go:871 +0xa5
created by google.golang.org/grpc.(*Server).serveStreams.func1
        /users/ahoste2/go/pkg/mod/google.golang.org/[email protected]/server.go:869 +0x1fd
@amohoste amohoste added the bug Something isn't working label Apr 21, 2021
@amohoste amohoste changed the title from "Segfault when creating lot's of pods" to "Segfault when creating many pods simultaneously" Apr 21, 2021
@ustiugov
Member

ustiugov commented May 4, 2021

I think that we should limit the number of concurrently booting VMs on each worker to some number, say 10 (because why not 10) so that we avoid contention on the disk that is likely to cause this problem.

As a fix, I suggest using a global semaphore (i.e., a buffered channel) here to limit the number of concurrent StartVM calls.

I think that the limit for snapshot-based starts (i.e., LoadVM calls) should be higher than the limit for boot-based starts. However, in my experience, it has to be adjusted according to the host's storage performance.

This should be a new knob for vHive.
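
To make the suggestion above concrete, here is a minimal sketch of a counting semaphore built on a buffered channel. It is not vHive code: vmStartLimiter, newVMStartLimiter, do, and peakConcurrency are hypothetical names, the limit of 10 is only the example figure from this comment, and a short sleep stands in for the real StartVM call.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// vmStartLimiter is a hypothetical counting semaphore backed by a buffered
// channel. The channel capacity is the maximum number of concurrent starts.
type vmStartLimiter struct {
	slots chan struct{}
}

func newVMStartLimiter(maxConcurrent int) *vmStartLimiter {
	return &vmStartLimiter{slots: make(chan struct{}, maxConcurrent)}
}

// do blocks until a slot is free, runs start, then releases the slot.
func (l *vmStartLimiter) do(start func() error) error {
	l.slots <- struct{}{}        // acquire: blocks while the channel is full
	defer func() { <-l.slots }() // release
	return start()
}

// peakConcurrency pushes n simulated starts through a limiter with the given
// limit and returns the highest number that were observed running at once.
func peakConcurrency(n, limit int) int64 {
	limiter := newVMStartLimiter(limit)
	var running, peak int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = limiter.do(func() error {
				cur := atomic.AddInt64(&running, 1)
				// Record a new peak if this start raised it.
				for {
					old := atomic.LoadInt64(&peak)
					if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
						break
					}
				}
				time.Sleep(time.Millisecond) // stand-in for the real StartVM call
				atomic.AddInt64(&running, -1)
				return nil
			})
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	fmt.Println("peak concurrent starts:", peakConcurrency(100, 10))
}
```

A second limiter with a larger capacity could wrap LoadVM calls in the same way, and both capacities could be exposed as configuration so they can be tuned to the host's storage.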

@amohoste
Contributor Author

amohoste commented May 5, 2021

I agree that disk contention due to booting lots of VMs concurrently is a genuine concern.

However, for the "too many open files" error, which seems to be the underlying cause of the segfault, closing the SSH connection and logging in again after running setup_node.sh seems to resolve the problem. Apparently, all active sessions must be closed for the updated file handle limits to take effect.
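
One way to check whether a process actually picked up the new limit is to read RLIMIT_NOFILE from inside it. The sketch below assumes Linux and uses only the standard syscall package; openFileLimit is a hypothetical helper and is not part of vHive. A process inherits its limits from the session that launched it, which would be consistent with the need to log in again.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit returns the soft and hard limits on open file descriptors
// (RLIMIT_NOFILE) for the current process.
func openFileLimit() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("could not read RLIMIT_NOFILE:", err)
		return
	}
	fmt.Printf("open file limit: soft=%d hard=%d\n", soft, hard)
}
```

Running this from the same shell that starts vHive, before and after logging in again, should show whether the limit changed for that session.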

HermioneKT added a commit that referenced this issue Jan 31, 2024