Multiparty, big file upload and memory consumption #82

Closed · frankrousseau opened this issue Jul 21, 2014 · 21 comments

@frankrousseau

Hello there,

Thank you for multiparty, which does a great job of handling file uploads! I have a question about how to use streams properly with your library.

First, let me give you a little context. I'm working on a file application called Cozy Files; the usage is basically the same as Dropbox, and I use multiparty to handle file uploads. Uploading works well... until I upload big files (> 500 MB). When the file is big, it looks like most of the file ends up held in memory.

Previously we were using the bodyParser middleware from Express 3. While upgrading to Express 4, I switched to multiparty. At first I used your 'file' event, but then I read that it is not a good idea for big files. Moreover, in my case it's almost useless to store the file on disk (I push it directly to our storage layer instead), so I decided to use the 'part' event and handle the stream myself.

I thought that using Node.js streams would be enough to keep memory consumption low, but it doesn't seem to change things much, even when I simplify the problem by streaming the file directly to disk myself. What can I do to limit the memory consumption of streams? Do you have any streaming best practices to recommend? Did I miss something?

Regards,

Frank

NB: Here are two examples where I use multiparty. The first one corresponds to the case described above; the second one is much simpler:
https://github.com/frankrousseau/cozy-files/blob/e594ba6de666536079f916210e16e1a860591654/server/controllers/files.coffee#L102-L225
https://github.com/frankrousseau/cozy-data-system/blob/4a0afd385ba474bd2e44b73f8bb0d932a30cb961/server/controllers/attachments.coffee#L27-L56

@andrewrk (Collaborator)

If you can provide a simple JavaScript file that demonstrates parsing a large file input taking up too much memory, then you have found a bug. Otherwise, you're using the API incorrectly and buffering when you should be streaming.

@frankrousseau (Author)

OK, that's what I'm asking you: how can I figure out what makes the app buffer? From my point of view, I only pipe streams together. Is there something else I should do, like carefully managing the stream's .pause() and .resume() methods?

This example is quite simple, but I will try to give you an even simpler version:
https://github.com/frankrousseau/cozy-data-system/blob/4a0afd385ba474bd2e44b73f8bb0d932a30cb961/server/controllers/attachments.coffee#L27-L56

@andrewrk (Collaborator)

I'm not willing to debug your code. Your next step in resolving this issue is to provide self-contained JavaScript code that demonstrates a bug in multiparty.

@frankrousseau (Author)

What I mean is that I'm not sure there is any bug in multiparty, and I'm not asking you to debug my code. I just want to know whether you can share helpers or best practices for handling streams properly with big file uploads.

If I can identify that the problem comes from multiparty, I will of course report it with a cleaner example.

@andrewrk (Collaborator)

In my experience, you can always use .pipe() and everything works correctly.

@frankrousseau (Author)

OK. I will try to reproduce it with an Express server that does only one thing: store a file on disk when a POST request is sent to a given route.

@dougwilson (Contributor)

Basic example of handling parts to disk correctly:

form.on('part', function onpart(part) {
  if (part.filename === null) return part.resume(); // drain non-file fields
  part.pipe(fs.createWriteStream(/* your args */)); // stream file data to disk
});

It's also important to note that I cannot read CoffeeScript (I'm not sure about @andrewrk), so if you post code, you may want to post it in JavaScript.

I'm not sure what db.saveAttachment does in your code, but it may not be a stream that is properly applying back-pressure within the Node.js core system.

@frankrousseau (Author)

Sure, I will post an example written in JS. The saveAttachment method comes from a third-party library called Cradle. Here is the code:
https://github.com/flatiron/cradle/blob/master/lib/cradle/database/attachments.js#L48

What do you mean by properly applying back-pressure? Is using .pipe() not enough?
By the way, what tool do you recommend for producing a good memory-consumption trace?

@dougwilson (Contributor)

What do you mean by properly applying back-pressure?

It's hard to explain, I'm sorry, I don't have the time to write about it :(

Is using .pipe() not enough?

No, because it just "pipes" the data. The destination you are piping into needs to "apply back-pressure" by telling the source when to stop reading; otherwise it'll just balloon your memory if the destination (that Cradle code) stops writing for a bit.

The Cradle code itself is heavily abstracted as to where the data goes (it looks like it ends up in the request library).
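
A minimal sketch of a destination that applies back-pressure (SlowWriter and the 100 ms delay are hypothetical; the mechanism is the standard Writable contract):

var stream = require('stream');
var util = require('util');

// Hypothetical slow destination: each chunk takes 100 ms to "persist".
function SlowWriter(options) {
  stream.Writable.call(this, options);
}
util.inherits(SlowWriter, stream.Writable);

SlowWriter.prototype._write = function (chunk, encoding, callback) {
  // Back-pressure happens here: until callback fires, the Writable buffers
  // incoming chunks, write() returns false once the highWaterMark is hit,
  // and pipe() pauses the source part stream in response.
  setTimeout(callback, 100);
};

// part.pipe(new SlowWriter()); // the part stream pauses while the writer is busy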

By the way, what tool do you recommend for producing a good memory-consumption trace?

Periodic heap dumps? You can take heap dumps using https://github.com/bnoordhuis/node-heapdump and inspect them with Chrome's DevTools.
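
For example, a minimal sketch of the periodic approach (writeSnapshot is the node-heapdump API; the interval and filename are illustrative):

var heapdump = require('heapdump'); // npm install heapdump

// Write a snapshot every 30 seconds; open the .heapsnapshot files in
// Chrome's DevTools (Profiles tab) to compare allocations over time.
setInterval(function () {
  heapdump.writeSnapshot(Date.now() + '.heapsnapshot');
}, 30000);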

@frankrousseau (Author)

No, because it just "pipes" the data. The destination you are piping into needs to "apply back-pressure" by telling the source when to stop reading; otherwise it'll just balloon your memory if the destination (that Cradle code) stops writing for a bit.

That's probably the source of the problem. Thank you. I will work on it on Wednesday and provide you with better feedback then.

Thank you as well for the node-heapdump resource. It looks very interesting.

@andrewrk (Collaborator)

pipe will handle backpressure correctly if the streams it is piping handle it correctly. The streams that multiparty provides handle backpressure correctly.

@frankrousseau (Author)

Hello. As suggested by @dougwilson and @andrewrk, I wrote a simple example.

Client: https://github.com/frankrousseau/multiparty-memory-consumption-tests/blob/master/client.js
Server: https://github.com/frankrousseau/multiparty-memory-consumption-tests/blob/master/server.js

When I upload files, I print the memory usage from the process's point of view, and everything seems correct. But if you run top during the upload, you will see your free memory drop drastically.

It looks like the problem is more related to Node.js. What do you think?
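
For reference, a minimal sketch of the kind of per-process logging described above (the one-second interval is illustrative):

// Log the process's own view of its memory once per second.
setInterval(function () {
  var usage = process.memoryUsage();
  console.log('rss:', usage.rss, 'heapUsed:', usage.heapUsed); // bytes
}, 1000);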

@dougwilson (Contributor)

It looks like the problem is more related to Node.js. What do you think?

It may be the typical behavior of V8's garbage collector, which doesn't keep freeing memory back to the system the moment it's no longer in use, because it assumes the memory may suddenly come back into use. You can try to force its hand by adding a gc() call above this line: https://github.com/frankrousseau/multiparty-memory-consumption-tests/blob/master/server.js#L69, running the script with node --expose-gc server.js, and seeing how your memory looks.
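
A sketch of that suggestion (gc() only exists when Node is started with --expose-gc, hence the guard; the logging line stands in for server.js#L69):

// Run with: node --expose-gc server.js
if (global.gc) {
  global.gc(); // force a full collection so the next reading reflects live objects
}
console.log(process.memoryUsage());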

@frankrousseau (Author)

I'm still getting the same problem. The client seems to be keeping most of the garbage: when I stop the server, 60 MB are freed; when I stop the client, 400-500 MB are freed.

@dougwilson (Contributor)

We are not concerned with the memory use of the client, as that has nothing to do with multiparty. Can you re-state that, but ignoring the memory use of the client.js file, please? :)

@dougwilson (Contributor)

Also, just so I know for sure, can you tell me what version of Node.js you are running this on?

@frankrousseau (Author)

Node version is 0.10.26.
I tried with an empty server, and it looks like the problem comes from request and/or form-data. Sorry for the inconvenience. You can close the issue. If I have new information related to multiparty, I will reopen it.

Thank you for the support!

@andrewrk (Collaborator)

multiparty depends neither on node-form-data nor on node-combined-stream.

@dougwilson (Contributor)

@andrewrk I believe @frankrousseau has acknowledged there is only a memory leak in the client script and not in the server script.

@andrewrk (Collaborator)

Ahh ok I understand.
