Nonlinear performance with just a few bad tags, among a sea of valid HTML. #161
Comments
Can you post a zip with the test HTML files or a code snippet to generate the test files?
This was due to inefficient parsing. Improved in 4.0.210 and I'm seeing linear performance now.
Thanks for the quick turnaround! We'll be testing that shortly.
Fantastic - 100 MB in 52 s, 8 MB in ~4 s. Brilliant stuff. Thank you, bravo!
Great component, loving it, but...
While using HtmlSanitizer, I noticed some poor, non-linear performance figures. I understand my data is not representative of HTML pages, but when data comes from external systems, which may not have been guarded or sanitized, it might not be small. My fear is that the poor performance could expose this component to a denial of service attack. I understand this could all be AngleSharp's problem, but wondered if this component could mitigate or prevent these issues...
To test: I created HTML simply containing `<br/>` tags, and injected one or two bad `<br>` tags at position 196 and 1736. And in the extreme case, 100 bad tags scattered randomly. Here are the figures for performance.
? = Didn't wait to find out, but longer than I was prepared to wait.
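For anyone wanting to reproduce this, test input of the kind described above could be generated with something like the following. This is only a sketch: Python is used purely to build the files, `build_test_html` and `random_bad_positions` are made-up helper names, the tag count in the example is arbitrary, and "position" is assumed to mean a 0-based tag index rather than a character offset.

```python
import random

GOOD_TAG = "<br/>"  # well-formed, self-closing tag
BAD_TAG = "<br>"    # the "bad" tag injected into the stream


def build_test_html(total_tags, bad_positions):
    """Return total_tags tags in a row, using BAD_TAG at the given tag indices.

    Assumption: positions are 0-based indices into the sequence of tags.
    """
    bad = set(bad_positions)
    return "".join(BAD_TAG if i in bad else GOOD_TAG for i in range(total_tags))


def random_bad_positions(total_tags, count, seed=0):
    """Pick `count` distinct tag indices at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_tags), count))


if __name__ == "__main__":
    # Two bad tags at the positions mentioned in the post.
    with open("two_bad_tags.html", "w", encoding="utf-8") as f:
        f.write(build_test_html(200_000, [196, 1736]))

    # The extreme case: 100 bad tags scattered randomly.
    with open("hundred_bad_tags.html", "w", encoding="utf-8") as f:
        f.write(build_test_html(200_000, random_bad_positions(200_000, 100)))
```

Varying `total_tags` while keeping the bad positions fixed gives a series of inputs for checking whether the sanitizing time grows linearly with size.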
Is this what you'd expect? All we do that's special is whitelist 60 tags, allow "face" attributes and disallow "src" attributes.
Breaking the debugger usually shows that all the work is being done in AngleSharp.Dom.Node.RemoveChild, but I haven't run a perfview to find out more.
If there are size, speed, memory limits to this module, can they be published?
At the moment, my own plan for mitigation is that once any embedded `img data` tags are stripped, if the size of the HTML is over 1 MB, I won't bother sanitizing it, and may reject it.
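Roughly, the size check I have in mind looks like the sketch below. It is written in Python only to show the logic (the real code would sit in front of the .NET sanitizer call). `MAX_HTML_BYTES`, `strip_embedded_images` and `check_size` are hypothetical names, 1 MB is taken as 1,000,000 bytes, and "embedded img data tags" is assumed to mean `<img>` tags whose `src` is a `data:` URI.

```python
import re

# Assumption: 1 MB means 1,000,000 bytes of UTF-8 encoded HTML.
MAX_HTML_BYTES = 1_000_000

# Rough pre-filter, not an HTML parser: matches <img ...> tags whose src
# attribute starts with "data:" (quoted or unquoted).
_DATA_URI_IMG = re.compile(
    r"<img\b[^>]*\bsrc\s*=\s*['\"]?data:[^>]*>",
    re.IGNORECASE,
)


def strip_embedded_images(html):
    """Remove <img> tags that carry their image inline as a data: URI."""
    return _DATA_URI_IMG.sub("", html)


def check_size(html):
    """Return (ok, stripped_html).

    ok is False when the HTML is still over the limit after embedded
    images are removed, in which case the caller may reject the input
    instead of passing it to the sanitizer.
    """
    stripped = strip_embedded_images(html)
    ok = len(stripped.encode("utf-8")) <= MAX_HTML_BYTES
    return ok, stripped
```

The regex is only there to get a fair size estimate before sanitizing. It is not meant to replace the sanitizer's own handling of `src` attributes.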