
Replace the ioutil.ReadAll usage #36151

Closed
alexsapran opened this issue Jul 25, 2023 · 17 comments

@alexsapran
Contributor

alexsapran commented Jul 25, 2023

While working on the offsite and analyzing some memory profiles of Filebeat, we noticed that the Elasticsearch output was consuming memory to parse the buffer from the _bulk response.

Looking further into it, we see that when executing the bulk request and consuming the response we call ioutil.ReadAll(resp.Body), which is deprecated and consumes more memory, especially for big responses.

We should replace this with a bytes.Buffer and possibly io.Copy, copying the response body into a consumable, reusable buffer.
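
A minimal sketch of the idea, assuming resp is the *http.Response from the _bulk request (illustrative names, not the actual libbeat code):

import (
	"bytes"
	"io"
	"net/http"
)

// readBody drains resp.Body into a caller-owned, reusable buffer
// instead of allocating a fresh slice via ioutil.ReadAll.
func readBody(resp *http.Response, buf *bytes.Buffer) ([]byte, error) {
	buf.Reset() // reuse the same buffer across requests
	if _, err := io.Copy(buf, resp.Body); err != nil {
		return nil, err
	}
	// note: the returned slice is only valid until the next Reset
	return buf.Bytes(), nil
}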

While this issue points to one specific usage of ReadAll, a look into the codebase of the other Beats suggests we can optimize those call sites as well.

The desired output is a Go benchmark that compares the before and after.

/cc @AndersonQ @pierrehilbert

@alexsapran added the Filebeat and help wanted labels Jul 25, 2023
@botelastic bot added the needs_team label Jul 25, 2023
@pierrehilbert added the Team:Elastic-Agent label Jul 25, 2023
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@botelastic bot removed the needs_team label Jul 25, 2023
@amitkanfer
Collaborator

trying something out here: #36215

@rdner
Member

rdner commented Feb 7, 2024

Let's get it straight first:

Looking further into it, we see that when we are executing the bulk request and consuming the response we call the ioutil.ReadAll(resp.Body) which is deprecated and consumes more memory, especially for those big responses.

The only reason it's deprecated is that it was moved to another package; it's now io.ReadAll.

We should replace this with the Buffer and maybe the io.Copy to copy the response Buffer into a consumable reusable Buffer.

Okay, I was not sure about this, so I collected some data:

Analysis

io.ReadAll allocates a buffer of 512 bytes and then re-allocates it to a larger size if the content exceeds this size. I believe in most of our cases it would probably re-allocate.

Using io.Copy and bytes.Buffer would work in a similar way: the first time you write to a bytes.Buffer it allocates a slice to accommodate the chunk being written; if the chunk is smaller than 64 bytes, the minimal allocation is 64 bytes. The next written chunk would probably cause another allocation. However, the buffer growth algorithm is quite sophisticated and probably more efficient in some edge cases.

Benchmarking

To prove the hypothesis I created this benchmark:

package main

import (
	"bytes"
	"fmt"
	"io"
	"io/ioutil"
	"testing"
)

func BenchmarkReadBytes(b *testing.B) {
	sizes := []int{
		100,         // 100 bytes
		100 * 1024,  // 100KB
		1024 * 1024, // 1MB
	}
	for _, size := range sizes {
		b.Run(fmt.Sprintf("size: %d", size), func(b *testing.B) {

			// emulate a file or an HTTP response
			generated := bytes.Repeat([]byte{'a'}, size)
			content := bytes.NewReader(generated)
			b.ResetTimer()

			b.Run("deprecated ioutil.ReadAll", func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					content.Seek(0, io.SeekStart) // reset
					data, err := ioutil.ReadAll(content)
					if err != nil {
						b.Fatal(err)
					}
					if len(data) != size {
						b.Fatalf("size does not match, expected %d, actual %d", size, len(data))
					}
				}
			})
			b.Run("bytes.Buffer+io.Copy", func(b *testing.B) {
				for i := 0; i < b.N; i++ {
					content.Seek(0, io.SeekStart) // reset
					buf := bytes.NewBuffer(nil)
					_, err := io.Copy(buf, content)
					if err != nil {
						b.Fatal(err)
					}
					if buf.Len() != size {
						b.Fatalf("size does not match, expected %d, actual %d", size, buf.Len())
					}
				}
			})
		})
	}
}

And indeed, bytes.Buffer + io.Copy is significantly faster at every size I tested (100 bytes, 100KB, 1MB), and it allocates almost 5 times less memory:

go test -run=none -bench='BenchmarkReadBytes/' -benchmem -benchtime=1000x
goos: darwin
goarch: arm64
pkg: github.com/rdner/tar-test
BenchmarkReadBytes/size:_100/deprecated_ioutil.ReadAll-10                   1000               226.6 ns/op           512 B/op          1 allocs/op
BenchmarkReadBytes/size:_100/bytes.Buffer+io.Copy-10                        1000               173.9 ns/op           160 B/op          2 allocs/op
BenchmarkReadBytes/size:_102400/deprecated_ioutil.ReadAll-10                1000             55907 ns/op          514308 B/op         18 allocs/op
BenchmarkReadBytes/size:_102400/bytes.Buffer+io.Copy-10                     1000              6875 ns/op          106544 B/op          2 allocs/op
BenchmarkReadBytes/size:_1048576/deprecated_ioutil.ReadAll-10               1000            349308 ns/op         5241107 B/op         27 allocs/op
BenchmarkReadBytes/size:_1048576/bytes.Buffer+io.Copy-10                    1000             72474 ns/op         1048626 B/op          2 allocs/op
PASS
ok      github.com/rdner/tar-test       0.700s

In terms of CPU (smaller is better):

[benchmark chart omitted]

In terms of memory (smaller is better):

[benchmark chart omitted]

Conclusion

Even a simple replacement of ioutil.ReadAll (io.ReadAll) with bytes.Buffer + io.Copy (no buffer re-use) can give us a significant boost in both CPU and memory performance. So, it's worth spending time on this change.

Additionally, we can investigate places where we might know the buffer size up front and re-use the same buffer multiple times, like we do in filestream fingerprinting. However, this is more effort and will take significantly more time.
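
One common shape for that kind of reuse is a sync.Pool of buffers; a minimal sketch (illustrative only, not the filestream code):

import (
	"bytes"
	"io"
	"sync"
)

// bufPool hands out reusable buffers so repeated reads don't
// allocate a fresh slice every time.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// readWithPool drains r into a pooled buffer and returns a copy of
// the bytes, returning the buffer to the pool for the next caller.
func readWithPool(r io.Reader) ([]byte, error) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	if _, err := io.Copy(buf, r); err != nil {
		return nil, err
	}
	// copy out: the buffer's backing array is reused after Put
	return append([]byte(nil), buf.Bytes()...), nil
}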

@rdner self-assigned this Feb 7, 2024
@amitkanfer
Collaborator

@rdner well done, great summary.

@rdner
Member

rdner commented Feb 12, 2024

I estimated how big of a change it would be:

@rdner
Member

rdner commented Feb 13, 2024

Just realised that we need to fix this in the https://github.com/elastic/elastic-agent-libs repository too (10 files).

@alexsapran
Contributor Author

I think we should prioritize the input side of Beats when consuming HTTP endpoints, especially for Metricbeat. During EAH I talked with some folks from platform SRE who were complaining about the memory usage of the "agent", which was mostly consuming metrics while running as a pod inside MKI.

Given the impact that @rdner showed here, I think we should start by replacing that part of the code, then talk with SRE-o11y and canary it to measure the impact.

@rdner
Member

rdner commented Mar 5, 2024

@alexsapran I was planning to contribute to Go itself, replacing the ReadAll implementation there with a more efficient one. Then it's just a matter of updating to the next Go version and gaining all this performance for free. I'll keep you updated here.

@rowlandgeoff
Contributor

@rdner I'm browsing around looking at perf issues... any progress on the fix to Go itself? Or should we go ahead and fix our ReadAll usages and get those performance gains soon? Or if we want to break this down further, are there particular usages which may give us a better ROI on our time spent refactoring?

I think we've already got SRE-o11y ready to monitor any perf gains, as Alex mentioned above.

@pierrehilbert
Collaborator

Hey @rowlandgeoff
As @rdner is currently focused on our CI reliability, he hasn't made progress on this for a while.
If we can have your team's help to automate the triaging he is doing and to fix some of the issues (not related to the tests themselves) we are having, it will allow us to come back to this important topic.

@mauri870
Member

mauri870 commented Jul 4, 2024

@rdner I'm browsing around looking at perf issues... any progress on the fix to Go itself? Or should we go ahead and fix our ReadAll usages and get those performance gains soon? Or if we want to break this down further, are there particular usages which may give us a better ROI on our time spent refactoring?

I think we've already got SRE-o11y ready to monitor any perf gains, as Alex mentioned above.

I found the optimizations discussed here to be incredibly insightful. Great work on the findings and investigation! Considering the current development cycle, Go 1.23 is frozen for the August release. Therefore, the most realistic timeframe for these changes to be included would be the Go 1.24 release in February 2025, assuming a CL with the improvements is approved and merged by then. The issue at go.dev/issues/50774 seems to have been stale for two years, so it will likely remain that way unless someone pushes the effort forward.

Readers and writers are excellent abstractions and two of the three most important interfaces in Go (alongside the empty interface). However, covering and optimizing all possible implementations and permutations of these is very difficult and time-consuming, often involving trade-offs. The Go standard library leverages these abstractions extensively, and if a particular case presents too many downsides, it is preferable to maintain stable performance in the standard library rather than optimize for a specific scenario.

That said, I would advise not waiting for this to land in the stdlib; the performance gains seem more than enough to justify keeping an optimized version of io.ReadAll around.

@pierrehilbert assigned mauri870 and unassigned rdner Aug 29, 2024
@ycombinator changed the title from "Replace the ioutils.ReadAll usage" to "Replace the ioutil.ReadAll usage" Sep 3, 2024
@mauri870
Member

I spent some time looking into this (probably more time than I should have) and the conclusion I reached is a big "it depends".

It is clear that io.ReadAll uses more resources than a bytes.Buffer + io.Copy in general, but this is not always the case.

TL;DR:

Most Readers in the standard library implement io.WriterTo; I'll leave a list here for posterity. This seems to be the main optimization that makes bytes.Buffer + io.Copy fast. The io.ReadAll function cannot take advantage of it.
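
The fast path lives in io.Copy itself, which checks these interfaces before falling back to a chunked read loop. Conceptually (a simplified rendition, not the verbatim stdlib source):

// simplified version of what io.Copy does internally
func copyConceptually(dst io.Writer, src io.Reader) (int64, error) {
	// fast path: the source writes itself to dst directly,
	// skipping the intermediate copy buffer entirely
	if wt, ok := src.(io.WriterTo); ok {
		return wt.WriteTo(dst)
	}
	// symmetric fast path on the destination side
	if rf, ok := dst.(io.ReaderFrom); ok {
		return rf.ReadFrom(src)
	}
	// slow path: shuttle bytes through a temporary buffer
	buf := make([]byte, 32*1024)
	var written int64
	for {
		nr, rerr := src.Read(buf)
		if nr > 0 {
			nw, werr := dst.Write(buf[:nr])
			written += int64(nw)
			if werr != nil {
				return written, werr
			}
		}
		if rerr == io.EOF {
			return written, nil
		}
		if rerr != nil {
			return written, rerr
		}
	}
}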

Third-party io.Readers might not implement this optimization, which makes < 512 byte reads via bytes.Buffer + io.Copy 4x slower than io.ReadAll, because io.ReadAll preallocates a 512-byte buffer. For some use cases you might want faster performance in this scenario. That makes me skeptical about replacing io.ReadAll everywhere, because we might not know the underlying reader and context. But in general it seems safe to do this replacement as long as we pay attention to these details.

$ cd $GOROOT/api; rg '\) WriteTo\('
go1.22.txt
118:pkg net, method (*TCPConn) WriteTo(io.Writer) (int64, error) #58808
125:pkg os, method (*File) WriteTo(io.Writer) (int64, error) #58808

go1.5.txt
585:pkg go/types, method (*Scope) WriteTo(io.Writer, int, bool)

go1.8.txt
211:pkg net, method (*Buffers) WriteTo(io.Writer) (int64, error)

go1.txt
197:pkg bytes, method (*Buffer) WriteTo(io.Writer) (int64, error)
4428:pkg net, method (*IPConn) WriteTo([]uint8, Addr) (int, error)
4479:pkg net, method (*UDPConn) WriteTo([]uint8, Addr) (int, error)
4498:pkg net, method (*UnixConn) WriteTo([]uint8, Addr) (int, error)
5756:pkg runtime/pprof, method (*Profile) WriteTo(io.Writer, int) error

go1.1.txt
29:pkg bufio, method (*Reader) WriteTo(io.Writer) (int64, error)
37:pkg bufio, method (ReadWriter) WriteTo(io.Writer) (int64, error)
47:pkg bytes, method (*Reader) WriteTo(io.Writer) (int64, error)
2636:pkg strings, method (*Reader) WriteTo(io.Writer) (int64, error)

Small Reads (< 512 bytes)

For smaller reads (up to 512 bytes) io.ReadAll is faster, since it preallocates a single 512-byte chunk of memory by default and does not need to grow it until it needs more space. The downside is that it will always allocate 512 bytes even if the io.Reader you have only yields 32 bytes of data; it trades a slightly larger memory footprint for CPU.

Also, there are io.Readers and there are io.Readers. Some readers implement io.WriterTo, an optimization that allows io.Copy to avoid creating an intermediary buffer. Some readers, let's call them "dumb readers", only implement Read. For io.ReadAll this optimization does not matter, as it only relies on Read anyway.
If the source is a dumb reader, then the performance of bytes.Buffer + io.Copy becomes much worse than io.ReadAll for < 512 bytes, since it cannot take advantage of this optimization:

BenchmarkReadAll/size_64B/io.ReadAll/WriterTo-32                 7117320               166.3 ns/op           512 B/op          1 allocs/op
BenchmarkReadAll/size_64B/io.ReadAll/Reader-32                   7319155               167.4 ns/op           512 B/op          1 allocs/op
BenchmarkReadAll/size_64B/ReadAll/WriterTo-32                   10699069               113.4 ns/op           112 B/op          2 allocs/op
BenchmarkReadAll/size_64B/ReadAll/Reader-32                      2122534               551.0 ns/op          1584 B/op          3 allocs/op
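
For benchmarks like this, a "dumb reader" can be simulated by wrapping any source in a type that exposes only Read, hiding its WriterTo method (a hypothetical helper, not necessarily what the benchmark above used):

// onlyReader hides any io.WriterTo the wrapped reader implements,
// forcing io.Copy onto its slow path.
type onlyReader struct {
	r io.Reader
}

func (o onlyReader) Read(p []byte) (int, error) {
	return o.r.Read(p)
}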

So in conclusion, for readers yielding < 512 bytes:

  • io.ReadAll never takes the io.WriterTo/io.ReaderFrom optimizations into account.
  • For dumb readers, io.ReadAll is 4x faster than bytes.Buffer + io.Copy due to the preallocated buffer.
  • If the source implements io.WriterTo, then bytes.Buffer + io.Copy is faster and uses less memory than io.ReadAll.

Remaining Reads (> 512 bytes)

For these sizes it does not really matter whether the reader is dumb or not, since io.ReadAll never takes advantage of the optimization. bytes.Buffer + io.Copy, on the other hand, can use io.WriterTo from the source and read it in one go (no pun intended), which is a huge perf gain. If the reader is dumb, then we rely on the bytes.Buffer growth algorithm, which is more efficient than the runtime's append growth ratio.

BenchmarkReadAll/size_100KB/io.ReadAll/WriterTo-32                 13582             88505 ns/op          514304 B/op         18 allocs/op
BenchmarkReadAll/size_100KB/io.ReadAll/Reader-32                   13425             89588 ns/op          514304 B/op         18 allocs/op
BenchmarkReadAll/size_100KB/ReadAll/WriterTo-32                    53968             22920 ns/op          106544 B/op          2 allocs/op
BenchmarkReadAll/size_100KB/ReadAll/Reader-32                      22172             54702 ns/op          261680 B/op         10 allocs/op

In conclusion, for > 512 byte reads:

  • io.ReadAll is slower and more resource-intensive in all cases; its only advantage was the 512-byte preallocated buffer, but at these sizes it needs to regrow multiple times, following the append growth ratio.
  • bytes.Buffer + io.Copy is the better choice for both dumb and optimized readers.

@amitkanfer
Collaborator

Thanks @mauri870 !
So what should our next step here be? Should we look into the different places where we use io.ReadAll and assess whether the average allocation is smaller or bigger than 512 bytes?

@mauri870
Member

Given that most readers in the standard library do implement io.WriterTo, it seems wise for us to take advantage of this whenever we can. Unfortunately, there is such a multitude of io.Readers in the stdlib and third-party packages that it is hard to track the details; this starts to blur into micro-optimization territory IMHO. Generally, when you deal with readers you just want to read from them, not spend time figuring out which kind of reader it is or whether it yields x bytes of data.

This is one of the reasons it is very unlikely that this would land in the stdlib: there are too many corner cases and details, and there is no way to make it faster in all scenarios; it is very dependent on the situation.

My plan is to add this optimized ReadAll to elastic-agent-libs alongside tests and benchmarks, like what @rdner did with the optimized HTTP ReadAll. But in order to actually use it we need to identify the places where it has no downsides. That means careful analysis of the surrounding context and preferably a benchmark to confirm it is actually a speedup.

For example:

f, _ := os.Open("x.txt") // error handling omitted for brevity
b, _ := io.ReadAll(f)    // reads the whole file into memory

We could just blindly replace this with bytes.Buffer + io.Copy, but before Go 1.22 os.File did not implement io.WriterTo. This means that if the file was < 512 bytes, this change would make the read 4x slower.
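
For reference, a size-hinted helper might look roughly like this (my own sketch and naming; the actual implementation in elastic/elastic-agent-libs#229 may differ):

// readAllHinted reads r to EOF into a slice preallocated to sizeHint
// bytes, avoiding both io.ReadAll's fixed 512-byte starting buffer
// and repeated regrowth for large payloads.
func readAllHinted(r io.Reader, sizeHint int) ([]byte, error) {
	if sizeHint < 0 {
		sizeHint = 0 // fall back to bytes.Buffer's own growth
	}
	buf := bytes.NewBuffer(make([]byte, 0, sizeHint))
	_, err := io.Copy(buf, r)
	return buf.Bytes(), err
}

For HTTP responses, resp.ContentLength could serve as the hint whenever the server sets it.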

@mauri870
Member

mauri870 commented Oct 1, 2024

I believe this issue has deviated slightly from its original purpose. The initial goal was to improve the usage of io.ReadAll in the libbeat Elasticsearch client, which has already been addressed in #36275. This is confirmed by the benchmarks introduced in #40885.

While there are still many instances of io.ReadAll across multiple projects, I think it's impractical to replace all occurrences with a buffer without first confirming that doing so doesn't introduce a slowdown in the scenarios previously mentioned. I've added an optimized version of ReadAll with accompanying benchmarks in elastic/elastic-agent-libs#229.

If we want to spend more time investigating each occurrence, we should open a separate issue for that. Perhaps those more familiar with the codebase can highlight areas where these optimizations would provide significant benefits, allowing us to focus our efforts more effectively.

@amitkanfer
Collaborator

Ack. Thank you @mauri870 .
@alexsapran @jeniawhite @rdner - are there specific places we need to focus on, or should we close this issue as done? (or both ;) )

@alexsapran
Contributor Author

Given the focus on OTel, it's safe to say we can close this.
