
Significant network throughput drop after upgrading from 1.27.0 to 1.31.0 (Apache v2) #820

Closed
chanseokoh opened this issue Sep 12, 2019 · 11 comments
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@chanseokoh
Contributor

chanseokoh commented Sep 12, 2019

I've filed several issues here, and 1.31.0 is the first release that contains all the fixes I needed. We finally managed to upgrade from 1.27.0 to 1.31.0 (Apache v2) and to get rid of all deprecated usage.

However, after making a release of our product, we suddenly got numerous reports of slow network operations (slow writes at least), often slow to the point of hanging or taking a couple of hours. At first I thought these were isolated issues in their environments, but we soon realized the problem is universal and consistent. The throughput drop is already significant on the fast network at the office (for example, 9s vs. 67s). The symptom gets dramatically worse when you are on a not-so-fast network or an operation takes longer to complete (I will explain why later). For example, in my own testing at home, the same task takes ~45s with 1.27.0, but it hangs indefinitely and times out after 17 minutes with 1.31.0.

Our team members individually verified the performance drop. We were able to track it down to the library upgrade: downgrading to 1.27.0 resolves the issue, so for now we decided to go back to 1.27.0. We don't know which component at which level is causing this.

However, we cannot hold off upgrading the library forever, so this issue must eventually be tracked down for us to move forward.

Here's one interesting observation from my own experiment, which I hope will shed some light:

When uploading ~40MB of data in a fresh, single HTTP request, the throughput is always steadily >1MB/s with 1.27.0, and the whole operation takes ~30 seconds or more. But with 1.31.0, the throughput starts out lower, say ~800KB/s, and gradually wanes over time to 600KB/s, 400KB/s, ..., down to absolute 0, with no activity in the end. After 15 minutes or so, it throws java.net.SocketException: Connection timed out (Write failed).

@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Sep 13, 2019
@chanseokoh
Contributor Author

chanseokoh commented Sep 13, 2019

I came up with some code that reproduces this issue.

It wasn't easy to find an external server where I could dump some random content and measure average latency/throughput from the office, so I decided to mimic what our product actually does: pushing a blob to Docker Hub. The slowdown is already significant from our Google office.

  • 1.32.0
start timing com.google.api.client.http.apache.v2...
elapsed (s): 47.1
  • 1.27.0
start timing com.google.api.client.http.apache...
elapsed (s): 3.1

The code below writes 40MB of data to Docker Hub. It requires an actual Docker Hub account, which you can create in a second. Or I can share my password privately.

import com.google.api.client.http.*;
import com.google.api.client.http.apache.v2.ApacheHttpTransport;
import com.google.common.io.CharStreams;
import com.google.common.net.MediaType;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.regex.*;

public class Main {

  static class RandoContent implements HttpContent {
    private final int megaBytes;
    public RandoContent(int megaBytes) { this.megaBytes = megaBytes; }
    @Override public long getLength() throws IOException { return megaBytes * 1024 * 1024; }
    @Override public String getType() { return MediaType.OCTET_STREAM.toString(); }
    @Override public boolean retrySupported() { return true; }
    @Override public void writeTo(OutputStream out) throws IOException {
      byte[] rando = new byte[1024];
      for (int i = 0; i < megaBytes * 1024; i++) out.write(rando);
      out.flush();
    }
  }

  public static void main(String[] args) throws IOException {
    String username = "francium25";
    String password = ...
    String dockerHubRepo = "francium25/test";

    String authToken = getAuthToken(dockerHubRepo, username, password); // don't show this in public
    String uploadUrl = getUploadUrl(dockerHubRepo, authToken);

    System.out.println("start timing " + ApacheHttpTransport.class.getPackage().getName() + "...");
    long started = System.nanoTime();
    new ApacheHttpTransport().createRequestFactory()
        .buildPatchRequest(new GenericUrl(uploadUrl), new RandoContent(40))
        .setHeaders(new HttpHeaders().setAuthorization("Bearer " + authToken))
        .execute();
    System.out.println("elapsed (s): " + (System.nanoTime() - started) / 100000000L / 10.);
  }

  private static String getAuthToken(String repository, String username, String password) throws IOException {
    String authUrl = "https://auth.docker.io/token?service=registry.docker.io&scope=repository:" + repository + ":pull,push";
    HttpResponse res = new ApacheHttpTransport().createRequestFactory()
        .buildGetRequest(new GenericUrl(authUrl))
        .setHeaders(new HttpHeaders().setBasicAuthentication(username, password))
        .execute();

    String json = CharStreams.toString(new InputStreamReader(res.getContent(), StandardCharsets.UTF_8));
    Matcher m = Pattern.compile("\"token\"\\s*:\\s*\"([^\"]+)\"").matcher(json);
    m.find();
    return m.group(1); // don't show this in public
  }

  private static String getUploadUrl(String repository, String authToken) throws IOException {
    String uploadInitUrl = "https://registry-1.docker.io/v2/" + repository + "/blobs/uploads/";
    HttpResponse res =
        new ApacheHttpTransport().createRequestFactory()
            .buildPostRequest(new GenericUrl(uploadInitUrl), new RandoContent(0))
            .setHeaders(new HttpHeaders().setAuthorization("Bearer " + authToken))
            .execute();
    return res.getHeaders().getLocation();
  }
}

@elharo
Contributor

elharo commented Sep 16, 2019

Do any of these look like likely causes:

https://issues.apache.org/jira/projects/HTTPCORE/versions/12344865

@elharo
Contributor

elharo commented Sep 16, 2019

1.27.0 used <project.httpclient.version>4.5.5</project.httpclient.version>

so there's been a lot of time for performance regressions to sneak in.

We might need to do some HTTP debugging and packet inspection here. It might also be worthwhile pinning the versions of httpclient and httpcore to the ones used by 1.27.0 in your own project and seeing if that fixes it.

@chanseokoh
Contributor Author

Did some quick tests:

  • 1.30.1 is OK; 1.30.2 is slow. Something happened between them.
  • 1.30.2 is still slow even when forcing Apache httpcore:4.4.9 and httpclient:4.5.5, the ones used by 1.27.0.
  • 1.30.2 is still slow when forcing the latest httpcore:4.4.12 and httpclient:4.5.10.

Looks like it's not the Apache library but some change on our side in 1.30.2.

@chanseokoh
Contributor Author

One thing that comes to my mind is #715. 1.30.1 wasn't setting the values the code was supposed to configure. Could fixing that bug, so that the values are actually set, have adversely affected performance?

@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Sep 17, 2019
chingor13 pushed a commit that referenced this issue Sep 17, 2019
@chingor13 chingor13 added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Sep 18, 2019
@chingor13 chingor13 self-assigned this Sep 18, 2019
@yoshi-automation yoshi-automation removed 🚨 This issue needs some love. triage me I really want to be triaged. labels Sep 18, 2019
@chanseokoh
Contributor Author

I think this deserves priority. People (including us) can't and shouldn't use recent versions of this library, because performance has degraded to the point of being unusable.

@elharo
Contributor

elharo commented Oct 21, 2019

It's worth listing out all the socket options that are being set, either directly or by default. Then we can run it by a TCP guru to see if any of them jump out as suspicious. If not, we can do more detailed testing and experiments.

Based on your experiments so far, I'd guess that one or more of the TCP options, previously ignored, is now being set to a bad value. If we can identify the culprit, it should be a simple case of setting it to a good value for your use case.
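
As a starting point, here is a minimal sketch (not from the thread; SocketOptionDump is a hypothetical helper) that prints the JVM/OS socket-option defaults, which is one way to begin the listing suggested above; it does not show whatever the Apache client overrides on top of them.

import java.net.Socket;

public class SocketOptionDump {
  public static void main(String[] args) throws Exception {
    // An unconnected socket reflects only the JVM/OS defaults,
    // not options the HTTP client sets when it opens connections.
    try (Socket socket = new Socket()) {
      System.out.println("SO_SNDBUF    = " + socket.getSendBufferSize());
      System.out.println("SO_RCVBUF    = " + socket.getReceiveBufferSize());
      System.out.println("TCP_NODELAY  = " + socket.getTcpNoDelay());
      System.out.println("SO_KEEPALIVE = " + socket.getKeepAlive());
      System.out.println("SO_LINGER    = " + socket.getSoLinger());
      System.out.println("SO_TIMEOUT   = " + socket.getSoTimeout());
    }
  }
}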

@elharo
Contributor

elharo commented Oct 21, 2019

Some of the defaults for TCP were set back in the dark ages of the 1980s and don't make a lot of sense for today's networks...

@elharo
Contributor

elharo commented Oct 21, 2019

In particular, try bumping the send and receive buffer sizes from 8K to 64K and see if that helps.
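
For anyone who wants to try that, a minimal sketch, assuming the transport is built from a custom Apache HttpClient (the 64 KB value comes from the suggestion above; the class name is hypothetical):

import com.google.api.client.http.apache.v2.ApacheHttpTransport;
import org.apache.http.config.SocketConfig;
import org.apache.http.impl.client.HttpClients;

public class BiggerBuffers {
  public static void main(String[] args) {
    // Ask for 64 KB send/receive socket buffers instead of the 8K mentioned above.
    SocketConfig socketConfig = SocketConfig.custom()
        .setSndBufSize(64 * 1024)
        .setRcvBufSize(64 * 1024)
        .build();
    // Hand the customized Apache client to the transport, then use
    // transport.createRequestFactory() exactly as in the repro code above.
    ApacheHttpTransport transport = new ApacheHttpTransport(
        HttpClients.custom().setDefaultSocketConfig(socketConfig).build());
  }
}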

@elharo
Contributor

elharo commented Oct 22, 2019

I see an issue with your sample code: for large files you should use request/response entity streaming.
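
If "entity streaming" here means streaming the request body from a source rather than generating it in memory, one possible sketch with this library uses InputStreamContent (the helper and file are hypothetical; this is not necessarily what was meant above):

import com.google.api.client.http.HttpContent;
import com.google.api.client.http.InputStreamContent;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class StreamingContentExample {
  // Streams a large file as the request body without buffering it in memory.
  // Setting the length lets the client send Content-Length rather than
  // falling back to chunked transfer encoding.
  static HttpContent fromFile(File blob) throws IOException {
    return new InputStreamContent("application/octet-stream", new FileInputStream(blob))
        .setLength(blob.length());
  }
}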

@elharo
Contributor

elharo commented Oct 24, 2019

Fixed by #855.
