PROWAREtech
.NET: Strip/Remove HTML Tags from Text Using Regex
How to remove the tags from HTML text using C# and regular expressions.
See related: Find Keywords in Text and strip SCRIPT tags from HTML text
It is very easy to remove all HTML tags using Regex.Replace().
string html = "This is a <b>test</b>!<img src='test.jpg' />";
string text = Regex.Replace(html, "<[^>]*>", string.Empty);
Here is the above code snippet used in a complete example.
using System;
using System.Text.RegularExpressions;

namespace ConsoleAppRemoveHtml
{
    class Program
    {
        static void Main(string[] args)
        {
            string html = "This is a <b>test</b>!<img src='test.jpg' />";
            // Replace every tag with an empty string so the surrounding text is left untouched.
            string text = Regex.Replace(html, "<[^>]*>", string.Empty);
            Console.WriteLine(html);
            Console.WriteLine(text);
        }
    }
}
This code also removes HTML comments, provided the comment text does not itself contain a > character.
Sample program output:
This is a <b>test</b>!<img src='test.jpg' />
This is a test!
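One caveat is worth illustrating: the pattern <[^>]*> stops at the first > it meets, so a comment that contains > is only partly removed. The following is a minimal sketch, not a definitive implementation. The HtmlText class and its StripTags method are hypothetical names chosen for this example (they are not part of .NET). It removes comments first with a lazy, single-line pattern and then strips the remaining tags.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the class and method names are assumptions.
public static class HtmlText
{
    // Lazy match that spans newlines, so a '>' inside a comment does not end the match early.
    private static readonly Regex Comment = new Regex("<!--.*?-->", RegexOptions.Singleline);

    // The same tag pattern used in the example above.
    private static readonly Regex Tag = new Regex("<[^>]*>");

    public static string StripTags(string html)
    {
        // Remove whole comments before the generic tag pattern can cut them short.
        string withoutComments = Comment.Replace(html, string.Empty);
        return Tag.Replace(withoutComments, string.Empty);
    }
}
```

With this sketch, HtmlText.StripTags("a<!-- x > y -->b") returns "ab", whereas the tag pattern on its own would leave "a y -->b" behind.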
To remove only parts of an HTML document, or only the inline tags, consider the following options.
using System.Text.RegularExpressions;

Regex inline_tags = new Regex(@"<\/?(?:i|b|a|span|strong|cite|em|code)\b[^>]*>", RegexOptions.Compiled); // remove common inline tags but leave the text they contain; \b keeps tags such as <img> or <body> from matching
Regex strike = new Regex(@"<s\b[^<]*(?:(?!<\/s>)<[^<]*)*<\/s>", RegexOptions.Compiled); // remove struck-through text and everything contained within the tag
Regex sup = new Regex(@"<sup\b[^<]*(?:(?!<\/sup>)<[^<]*)*<\/sup>", RegexOptions.Compiled); // remove sup tag and everything contained within
Regex sub = new Regex(@"<sub\b[^<]*(?:(?!<\/sub>)<[^<]*)*<\/sub>", RegexOptions.Compiled); // remove sub tag and everything contained within
Regex header = new Regex(@"<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>", RegexOptions.Compiled); // remove header tag and everything contained within
Regex footer = new Regex(@"<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>", RegexOptions.Compiled); // remove footer tag and everything contained within
Regex head = new Regex(@"<head\b[^<]*(?:(?!<\/head>)<[^<]*)*<\/head>", RegexOptions.Compiled); // remove head tag and everything contained within
Regex nav = new Regex(@"<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>", RegexOptions.Compiled); // remove nav tag and everything contained within
Regex comment = new Regex("<!--.*?-->", RegexOptions.Compiled | RegexOptions.Singleline); // remove html comments

string StripSomeHTML(string html)
{
    html = comment.Replace(html, string.Empty);
    html = head.Replace(html, string.Empty);
    html = header.Replace(html, string.Empty);
    html = footer.Replace(html, string.Empty);
    html = nav.Replace(html, string.Empty);
    html = sub.Replace(html, " ");
    html = sup.Replace(html, " ");
    html = strike.Replace(html, " ");
    html = inline_tags.Replace(html, string.Empty);
    return html;
}
These are just a few common examples. To remove or strip other tags, reuse the matching pattern and substitute the tag name.
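Because the block-removal patterns differ only in the tag name, they can also be generated rather than written out by hand. The following is a rough sketch under stated assumptions: TagStripper, BuildBlockPattern and RemoveBlock are hypothetical names invented for this example, the tag name is passed through Regex.Escape, and RegexOptions.IgnoreCase is an addition that the patterns above do not use.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are assumptions, not an existing API.
public static class TagStripper
{
    // Builds the same "tag plus everything inside it" pattern for an arbitrary tag name.
    public static Regex BuildBlockPattern(string tagName)
    {
        string name = Regex.Escape(tagName);
        return new Regex(
            @"<" + name + @"\b[^<]*(?:(?!<\/" + name + @">)<[^<]*)*<\/" + name + @">",
            RegexOptions.Compiled | RegexOptions.IgnoreCase); // IgnoreCase is an assumption added here
    }

    // Replaces each matching block (tag and contents) with the given replacement text.
    public static string RemoveBlock(string html, string tagName, string replacement = "")
    {
        return BuildBlockPattern(tagName).Replace(html, replacement);
    }
}
```

For example, TagStripper.RemoveBlock("<p>Keep</p><aside class='x'>Drop <b>this</b></aside>", "aside") returns "<p>Keep</p>". As with the hand-written patterns, a block nested inside another block of the same tag ends at the first closing tag, so deeply nested markup may need a real HTML parser.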