Encoding issue #158

Sicos1977 · 2019-01-25T14:06:42Z

Hi,

Is it somehow possible to let HtmlSanitizer detect the encoding of the string that it is sanitizing?

I currently load the file with File.ReadAllText and then feed it to the .Sanitize method, but after that, words like kopieën end up as kopiee?n.

In this case there is encoding information in the header of the HTML file:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

When that encoding is used, everything is OK.


mganss · 2019-01-25T14:49:32Z

Can you post a demo file?

Sicos1977 · 2019-01-25T15:50:57Z

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]--><style><!--
/* Font Definitions */
@font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0cm;
	margin-bottom:.0001pt;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;
	mso-fareast-language:EN-US;}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:blue;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:purple;
	text-decoration:underline;}
p
	{mso-style-priority:99;
	mso-margin-top-alt:auto;
	margin-right:0cm;
	mso-margin-bottom-alt:auto;
	margin-left:0cm;
	font-size:12.0pt;
	font-family:"Times New Roman",serif;}
p.msonormal0, li.msonormal0, div.msonormal0
	{mso-style-name:msonormal;
	mso-style-priority:99;
	mso-margin-top-alt:auto;
	margin-right:0cm;
	mso-margin-bottom-alt:auto;
	margin-left:0cm;
	font-size:12.0pt;
	font-family:"Times New Roman",serif;}
span.E-mailStijl19
	{mso-style-type:personal;
	font-family:"Calibri",sans-serif;
	color:windowtext;}
span.E-mailStijl20
	{mso-style-type:personal;
	font-family:"Calibri",sans-serif;
	color:#1F497D;}
span.E-mailStijl21
	{mso-style-type:personal-reply;
	font-family:"Calibri",sans-serif;
	color:#1F497D;}
.MsoChpDefault
	{mso-style-type:export-only;
	font-size:10.0pt;}
@page WordSection1
	{size:612.0pt 792.0pt;
	margin:70.85pt 70.85pt 70.85pt 70.85pt;}
div.WordSection1
	{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="NL" link="blue" vlink="purple">
<div class="WordSection1">

<p style="margin-left:35.4pt"><span style="font-size:8.0pt;font-family:&quot;Arial&quot;,sans-serif">*****</span><o:p></o:p></p>
<p style="margin-left:35.4pt"><i><span style="font-size:8.0pt;font-family:&quot;Arial&quot;,sans-serif">kopieën</span></i><o:p></o:p></p>
<p style="margin-left:35.4pt"><span style="font-size:8.0pt;font-family:&quot;Arial&quot;,sans-serif">*****</span><o:p></o:p></p>
</div>
</body>
</html>

mganss · 2019-01-26T19:44:10Z

This is not an issue with HtmlSanitizer. When you read an ISO-8859-1 encoded file using File.ReadAllText without specifying the encoding, the file is decoded as UTF-8 by default, so the returned string will contain the U+FFFD "replacement character" � instead of "ë". At this point, the information is lost and can't be recovered afterwards. Also, keep in mind that a string in C# doesn't have an encoding that you can change; it's always UTF-16 internally.

I'm not sure if AngleSharp does encoding detection if you supply a byte stream to its HTML parser. I'll try and possibly add overloads of Sanitize that take a Stream instead of a string.
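
To illustrate the point about decoding, here is a minimal sketch that decodes the ISO-8859-1 bytes of "kopieën" once as UTF-8 and once as ISO-8859-1. The class and member names (EncodingDemo, Latin1Bytes, DecodeAsUtf8, DecodeAsLatin1) are made up for this example and are not part of HtmlSanitizer.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // "kopieën" encoded as ISO-8859-1: ë is the single byte 0xEB.
    public static readonly byte[] Latin1Bytes =
        { 0x6B, 0x6F, 0x70, 0x69, 0x65, 0xEB, 0x6E };

    // Decoding as UTF-8 (the default used by File.ReadAllText when no
    // encoding is given): 0xEB is not a valid UTF-8 sequence here, so the
    // decoder substitutes U+FFFD and the original character is lost.
    public static string DecodeAsUtf8(byte[] bytes) =>
        Encoding.UTF8.GetString(bytes);

    // Decoding with the encoding the file was actually written in.
    public static string DecodeAsLatin1(byte[] bytes) =>
        Encoding.GetEncoding("iso-8859-1").GetString(bytes);
}
```

If the encoding is known up front, the same idea should work when reading the file as a string, for example File.ReadAllText(path, Encoding.GetEncoding("iso-8859-1")), before passing the result to Sanitize.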

mganss · 2019-01-27T16:20:11Z

In 4.0.205 there is now an overload of SanitizeDocument that takes a Stream. You can use it like so to enable your use case:

using (var stream = File.OpenRead("path/to/your/file"))
{
    var sanitized = sanitizer.SanitizeDocument(stream);
    // ...
}

If you're on .NET Core, you need to add the System.Text.Encoding.CodePages NuGet package and call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); before sanitizing.
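
Putting the two steps together, a rough sketch for .NET Core might look like the following. The wrapper class and method (SanitizeFileExample, SanitizeFile) are hypothetical, and the Ganss.XSS namespace is assumed for the 4.x package; adjust it if your version differs.

```csharp
using System.IO;
using System.Text;
using Ganss.XSS; // assumed namespace for HtmlSanitizer 4.x

public static class SanitizeFileExample
{
    public static string SanitizeFile(string path)
    {
        // Makes legacy code page encodings available on .NET Core
        // (requires the System.Text.Encoding.CodePages package).
        // Typically this is done once at application startup.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var sanitizer = new HtmlSanitizer();

        // Pass the raw bytes instead of an already decoded string, so the
        // parser can take the charset declared in the document into account.
        using (var stream = File.OpenRead(path))
        {
            return sanitizer.SanitizeDocument(stream);
        }
    }
}
```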

mganss closed this as completed in 11b2716 Jan 27, 2019
