PROWAREtech


.NET: Strip/Remove HTML Tags from Text Using Regex

How to remove the tags from HTML text using C# and regular expressions.

See related: Find Keywords in Text and strip SCRIPT tags from HTML text

Removing all HTML tags takes a single call to Regex.Replace(). The pattern <[^>]*> matches an opening <, any run of characters other than >, and the closing >.


string html = "This is a <b>test</b>!<img src='test.jpg' />";
string text = Regex.Replace(html, "<[^>]*>", string.Empty);

Here is the above code snippet used in a complete example.


using System.Text.RegularExpressions;
using System;

namespace ConsoleAppRemoveHtml
{
	class Program
	{
		static void Main(string[] args)
		{
			string html = "This is a <b>test</b>!<img src='test.jpg' />";
			string text = Regex.Replace(html, "<[^>]*>", string.Empty); // same call as the snippet above
			Console.WriteLine(html);
			Console.WriteLine(text);
		}
	}
}

This pattern also removes HTML comments, provided they contain no > character.

Sample program output:

This is a <b>test</b>!<img src='test.jpg' />
This is a test!
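
One caveat with comments: because <[^>]*> stops at the first >, a comment that itself contains > is only partly removed. The sketch below uses a made-up input string (an assumption for illustration) to compare the generic pattern on its own with removing comments first using the <!--.*?--> pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: the comment contains a '>' character.
string html = "Price<!-- was > 10 --> is <b>5</b>";

// The generic tag pattern ends its match at the first '>',
// so the tail of the comment is left behind.
string naive = Regex.Replace(html, "<[^>]*>", string.Empty);

// Removing comments first with a dedicated pattern avoids the leftover.
string noComments = Regex.Replace(html, "<!--.*?-->", string.Empty, RegexOptions.Singleline);
string clean = Regex.Replace(noComments, "<[^>]*>", string.Empty);

Console.WriteLine(naive); // Price 10 --> is 5
Console.WriteLine(clean); // Price is 5
```

The same limitation applies to attribute values that contain >, so the generic pattern is best suited to simple, well-formed markup.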

To remove only parts of an HTML document, or only the inline tags, consider the following options.


using System.Text.RegularExpressions;

Regex inline_tags = new Regex(@"<\/?(?:i|b|a|span|strong|cite|em|code)\b[^>]*>", RegexOptions.Compiled); // removes common inline tags but leaves the text they contain; \b keeps "i" and "b" from matching <img>, <br>, <body>, etc.
Regex strike = new Regex(@"<s\b[^<]*(?:(?!<\/s>)<[^<]*)*<\/s>", RegexOptions.Compiled); // removes struck-through text: the s tag and everything contained within
Regex sup = new Regex(@"<sup\b[^<]*(?:(?!<\/sup>)<[^<]*)*<\/sup>", RegexOptions.Compiled); // removes the sup tag and everything contained within
Regex sub = new Regex(@"<sub\b[^<]*(?:(?!<\/sub>)<[^<]*)*<\/sub>", RegexOptions.Compiled); // removes the sub tag and everything contained within
Regex header = new Regex(@"<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>", RegexOptions.Compiled); // removes the header tag and everything contained within
Regex footer = new Regex(@"<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>", RegexOptions.Compiled); // removes the footer tag and everything contained within
Regex head = new Regex(@"<head\b[^<]*(?:(?!<\/head>)<[^<]*)*<\/head>", RegexOptions.Compiled); // removes the head tag and everything contained within
Regex nav = new Regex(@"<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>", RegexOptions.Compiled); // removes the nav tag and everything contained within
Regex comment = new Regex("<!--.*?-->", RegexOptions.Compiled | RegexOptions.Singleline); // removes HTML comments, including multi-line ones

string StripSomeHTML(string html)
{
	html = comment.Replace(html, string.Empty);
	html = head.Replace(html, string.Empty);
	html = header.Replace(html, string.Empty);
	html = footer.Replace(html, string.Empty);
	html = nav.Replace(html, string.Empty);
	html = sub.Replace(html, " ");
	html = sup.Replace(html, " ");
	html = strike.Replace(html, " ");
	html = inline_tags.Replace(html, string.Empty);
	return html;
}
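
The element patterns above place \b (a word boundary) right after the tag name. The following sketch shows why that matters when tag names are short. It compares an alternation of i and b with and without the boundary, using a made-up input string; the names loose and strict are chosen here for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input mixing <b> with tags that merely start with "b" or "i".
string html = "<b>Bold</b><br><img src='a.png'> text";

// Without a word boundary, "b" also matches the start of "br"
// and "i" matches the start of "img".
Regex loose = new Regex(@"</?(?:i|b)[^>]*>");

// With \b, the tag name must end right after "i" or "b".
Regex strict = new Regex(@"</?(?:i|b)\b[^>]*>");

string looseResult = loose.Replace(html, string.Empty);
string strictResult = strict.Replace(html, string.Empty);

Console.WriteLine(looseResult);  // Bold text
Console.WriteLine(strictResult); // Bold<br><img src='a.png'> text
```

The same boundary is what lets the head pattern skip header elements and the s pattern skip span, sup and sub.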

These are just some common examples. To remove or strip other tags, follow the same pattern and substitute the tag name.
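
Since the element patterns differ only in the tag name, one way to extend them is a small factory function. This is a minimal sketch, not part of the original code: BuildElementRegex is a hypothetical helper name, and RegexOptions.IgnoreCase is added here as an assumption so that upper-case tags such as <ASIDE> also match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: builds a pattern that removes <tagName ...> ... </tagName>
// together with everything inside it, following the shape of the patterns above.
Regex BuildElementRegex(string tagName)
{
	string tag = Regex.Escape(tagName); // guard against regex metacharacters in the name
	return new Regex($@"<{tag}\b[^<]*(?:(?!</{tag}>)<[^<]*)*</{tag}>",
		RegexOptions.Compiled | RegexOptions.IgnoreCase); // IgnoreCase is an assumption, not in the original
}

Regex aside = BuildElementRegex("aside");
string html = "<p>Main text</p><aside class='note'><p>Side note</p></aside><p>More</p>";
string result = aside.Replace(html, string.Empty);

Console.WriteLine(result); // <p>Main text</p><p>More</p>
```

Like the patterns it is modeled on, this stops at the first matching closing tag, so an element nested inside another element of the same name is not handled correctly.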

