4 Best Ways to Strip HTML Tags from a Python String
by: Emily Rosemary Collins
Problem Formulation: Python developers often face the challenge of removing HTML tags from strings, especially when dealing with web scraping or text processing. The goal is to clean up a string containing HTML, like <p>Hello, World!</p>, and obtain the plain text without any markup: Hello, World! This article outlines four different methods to accomplish this task effectively.
Method 1: Using the re module (regular expressions)
This method involves using the built-in re (regular expression) module in Python to create a pattern that matches all HTML tags and replace them with an empty string. It’s a powerful tool for string manipulation and works well for most cases of HTML.
Here’s an example:
import re

def remove_html_tags(text):
    # Match anything between '<' and '>' (non-greedy) and remove it
    html_pattern = re.compile('<.*?>')
    clean_text = re.sub(html_pattern, '', text)
    return clean_text

sample_html = '<html>Visit <a href="https://www.example.com">my site</a>.</html>'
print(remove_html_tags(sample_html))

Output: Visit my site.
This snippet defines a function remove_html_tags that uses re.compile to compile a regular expression matching anything between less-than and greater-than symbols (HTML tags) and the re.sub method to substitute them with an empty string, effectively removing them.
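Note that a regex pass only removes the tags; any character entities in the source stay encoded. If decoding them also matters for your use case, one possible extension is to run the result through html.unescape from the standard library. The sketch below is illustrative only, and the helper name remove_tags_and_entities is not part of the original post:

import re
import html

def remove_tags_and_entities(text):
    # Strip the tags first, then decode entities such as &amp; or &lt;
    without_tags = re.sub(r'<.*?>', '', text)
    return html.unescape(without_tags)

sample = '<p>Fish &amp; chips &lt;tonight&gt;</p>'
print(remove_tags_and_entities(sample))  # Fish & chips <tonight>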
Method 2: Using BeautifulSoup
BeautifulSoup is a Python library designed for web scraping, which includes functions for parsing HTML and XML documents. It can also be used to remove HTML tags from strings quite effortlessly.
Here’s an example:
from bs4 import BeautifulSoup

def strip_html_tags(html_content):
    # Parse the markup and extract only the text nodes
    soup = BeautifulSoup(html_content, 'html.parser')
    stripped_text = soup.get_text(separator=' ')
    return stripped_text

html_data = '<title>Strip HTML Tags</title><p>This is <b>bold</b> text.</p>'
print(strip_html_tags(html_data))
Output: Strip HTML Tags This is bold text.
In the provided function strip_html_tags, the BeautifulSoup object represents the parsed HTML content. The method get_text() is invoked on the object to obtain the text without HTML tags, with spaces as separators for consecutive elements.
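If the separator introduces more whitespace than you want, get_text also accepts a strip argument that trims each text fragment before joining. A small variation of the function above might look like the following sketch (the name strip_html_tags_tidy is just illustrative):

from bs4 import BeautifulSoup

def strip_html_tags_tidy(html_content):
    # strip=True trims whitespace from each fragment before joining with the separator
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup.get_text(separator=' ', strip=True)

print(strip_html_tags_tidy('<p>This is <b>bold</b> text.</p>'))
# This is bold text.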
Method 3: Using lxml
lxml is a high-performance library for XML and HTML parsing. It can be utilized to clean HTML tags from a string rapidly, although it may be less forgiving than BeautifulSoup for broken HTML.
Here’s an example:
from lxml import html

def remove_html_from_string(html_string):
    # Build an HTML element tree and pull out all of its text nodes
    tree = html.fromstring(html_string)
    text = tree.text_content()
    return text

html_example = '<div>This is an <em>example</em> string with <a href="#">HTML</a> tags.</div>'
print(remove_html_from_string(html_example))
Output: This is an example string with HTML tags.
The remove_html_from_string function creates an html tree object and then calls text_content() to retrieve all the text contained within the HTML, minus the tags. This method is straightforward and effective but might need additional handling for malformed HTML.
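If stray newlines or runs of spaces from the surrounding markup end up in the extracted text, one generic clean-up (not specific to lxml) is to collapse the whitespace with a plain split/join, as in this hypothetical helper:

from lxml import html

def remove_html_and_normalize(html_string):
    # Extract the text, then collapse whitespace runs into single spaces
    text = html.fromstring(html_string).text_content()
    return ' '.join(text.split())

messy = '<div>\n  Spaced   <span>out</span>\n  text\n</div>'
print(remove_html_and_normalize(messy))  # Spaced out text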
Method 4: Using HTMLParser from the Standard Library
The HTMLParser class from the html.parser module in Python’s standard library can be used to parse HTML documents and easily extract text without the tags. It’s built in and does not require the installation of third-party libraries.
Here’s an example:
from html.parser import HTMLParser
from io import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        # Buffer that accumulates the text data as the parser encounters it
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

html_code = '<body>Sample text with <code>HTMLParser</code></body>'
print(strip_tags(html_code))
Output: Sample text with HTMLParser
The custom class MLStripper extends HTMLParser and overrides the handle_data method to write the text data into a StringIO object. The strip_tags function feeds HTML to the parser and returns the extracted text without the HTML tags.
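Because the class sets convert_charrefs=True, character references arrive in handle_data already decoded, so entities come out as plain text as well. A quick check of that behaviour, reusing the strip_tags function defined above, could look like this:

# Assumes the MLStripper / strip_tags definitions from the listing above.
entity_html = '<p>Ben &amp; Jerry&#39;s</p>'
print(strip_tags(entity_html))  # Ben & Jerry's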
Summary/Discussion
- Method 1: Using re. Strengths: Part of the standard library, fast, straightforward. Weaknesses: Regular expressions can be complex and may not handle malformed HTML well.
- Method 2: Using BeautifulSoup. Strengths: Robust, excellent for parsing, handles malformed HTML. Weaknesses: Third-party dependency, slightly slower performance.
- Method 3: Using lxml. Strengths: Very fast and efficient. Weaknesses: Third-party dependency, less tolerant of broken HTML.
- Method 4: Using HTMLParser. Strengths: Built-in, good for well-formed HTML. Weaknesses: More verbose than other methods, not as robust for malformed HTML.