Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV : Jordan Marshall

Python BeautifulSoup XML to Dict, JSON, DataFrame, CSV
by: Jordan Marshall
Blog post content copied from Finxter



Though Python’s BeautifulSoup module was designed to scrape HTML files, it can also be used to parse XML files.

In today’s professional marketplace, it is useful to be able to change an XML file into other formats, specifically dictionaries, CSV, JSON, and dataframes according to specific needs.

In this article, we will discuss that process.

Scraping XML with BeautifulSoup

💡 Extensible Markup Language or XML differs from HTML in that HTML primarily deals with how information is displayed on a webpage, and XML handles how data is stored and transmitted. XML also uses custom tags and is designed to be both human- and machine-readable.

When inspecting a webpage, a statement at the top of the page will denote what type of file you are viewing.

For an XML file, you may see <?xml version="1.0"?>.

As a side note, “version 1.0” is a little deceiving in that several modifications have been made since its inception in 1998; the version name has just not changed.

Despite the differences between HTML and XML, because BeautifulSoup creates a Python object tree, it can be used to parse both. The process for parsing both is similar. For this article, I will be using a sample XML file from W3Schools (w3schools.com).

Import the BeautifulSoup library and requests modules to scrape this file.

# Import needed libraries
from pprint import pprint
from bs4 import BeautifulSoup
import requests

Once these have been imported, request the content of the webpage.

# Request data
webpage = requests.get("https://www.w3schools.com/xml/cd_catalog.xml")
data = webpage.content
pprint(data)

At this point, I like to print just to make sure I am getting what I need. I use the pprint() function to make it more readable.

Next, create a BeautifulSoup object and declare the parser to be used. Because it is an XML file, use an XML parser. Note that the 'xml' parser depends on the lxml package, which must be installed separately.

# Create a BeautifulSoup object
soup = BeautifulSoup(data, 'xml')
print(soup.prettify())

With that printed, you can see the object tree created by BeautifulSoup. The parent, “<CATALOG>”, its child “<CD>”, and all of the children of “CD” are displayed.

Output of the first CD:

<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>

All that is left is to scrape the desired data and display it.

Using the enumerate() function and the find_all() method, each occurrence of a tag can be found and its contents placed into a list.

After that, using a for loop, unpack the created lists and create groupings. The .text attribute returns only the text of a tag, and the strip() method removes the surrounding white space.

Just for readability, print a blank line after each grouping.

# Scrape data
parent = soup.find('CATALOG')
for n, tag in enumerate(parent.find_all('CD')):
    title = [x for x in tag.find_all('TITLE')]
    artist = [x for x in tag.find_all('ARTIST')]
    country = [x for x in tag.find_all('COUNTRY')]
    company = [x for x in tag.find_all('COMPANY')]
    price = [x for x in tag.find_all('PRICE')]
    year = [x for x in tag.find_all('YEAR')]
    # view data
    for item in title:
        print('Title: ', item.text.strip())
    for item in artist:
        print('Artist: ', item.text.strip())
    for item in country:
        print('Country: ', item.text.strip())
    for item in company:
        print('Company: ', item.text.strip())
    for item in price:
        print('Price: ', item.text.strip())
    for item in year:
        print('Year: ', item.text.strip())
    print()

With that, the CDs should be cataloged in this format.

Title:  Empire Burlesque
Artist:  Bob Dylan
Country:  USA
Company:  Columbia
Price:  10.90
Year:  1985 
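
The six list comprehensions above can be condensed. The following is a minimal sketch rather than the author's code: it assumes a small inline XML string (sample_xml, a hypothetical two-CD excerpt of the catalog) so that it runs without a network request, and it loops over the direct children of each CD tag instead of naming every tag by hand.

```python
from bs4 import BeautifulSoup

# Hypothetical two-CD excerpt of the catalog, inlined so no request is needed
sample_xml = """<?xml version="1.0"?>
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <YEAR>1988</YEAR>
  </CD>
</CATALOG>"""

soup = BeautifulSoup(sample_xml, 'xml')  # the 'xml' parser requires lxml

records = []
for cd in soup.find_all('CD'):
    # find_all(True) matches every tag and skips whitespace-only text nodes;
    # recursive=False restricts the search to direct children of <CD>
    record = {child.name: child.text.strip()
              for child in cd.find_all(True, recursive=False)}
    records.append(record)

# Print each grouping followed by a blank line
for record in records:
    for name, value in record.items():
        print(f'{name.title()}: {value}')
    print()
```

Because the tag names are read from the document itself, this version keeps working if a CD gains or loses a child element.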

XML to Dictionary

Besides lists, dictionaries are a common structure for storing data in Python.

Information is stored in key: value pairs. Those pairs are stored within curly {} brackets.

Example: capital = {'Pennsylvania': 'Harrisburg', 'Michigan': 'Lansing'}

The key of the pair is case-sensitive and unique. The value can be any data type and may be duplicated.

Accessing the value of the pair can be done via its key. Since the key cannot be duplicated, finding a value in a large dictionary is easy so long as you know the key. A list of keys can be obtained using the keys() method.

Example: print(capital.keys())

Finding information in a dictionary is quick since you only search for a specific key.

Dictionaries are used quite often, if memory usage is not a concern, because of the quick access. For this reason, it is important to know how to convert information gained in an XML file to a dictionary.
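
A short sketch of the dictionary behavior described above, reusing the two state capitals from the example. The 'Ohio' lookup and the 'unknown' default are illustrative additions, not part of the original post.

```python
# Keys are strings here; values can be any data type
capital = {'Pennsylvania': 'Harrisburg', 'Michigan': 'Lansing'}

# Look up a value by its key
print(capital['Michigan'])

# Keys are case-sensitive: 'michigan' is a different key from 'Michigan'
print('michigan' in capital)

# get() returns a default instead of raising KeyError for a missing key
print(capital.get('Ohio', 'unknown'))

# Assigning to an existing key replaces its value; no duplicate key is created
capital['Michigan'] = 'Lansing'
print(list(capital.keys()))
```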

There are six basic steps to convert an XML to a dictionary:

  1. import xmltodict
  2. import pprint
  3. with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r', encoding='utf-8') as file:
    • cd_xml = file.read()
  4. cd_dict = xmltodict.parse(cd_xml)
  5. cd_dict_list = [dict(x) for x in cd_dict['CATALOG']['CD']]
  6. pprint.pprint(cd_dict_list)

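First, the conversion relies on xmltodict, a third-party package that is not part of the standard library and must be installed first (for example, with pip install xmltodict). Import that module and any other modules to be used.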

import xmltodict
import pprint

Second, the file needs to be opened, read, and assigned to a variable.

with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r', encoding='utf-8') as file:
    cd_xml = file.read()

Third, use xmltodict.parse() to convert the XML content to a dictionary, then view it.

cd_dict = xmltodict.parse(cd_xml)
cd_dict_list = [dict(x) for x in cd_dict['CATALOG']['CD']]
pprint.pprint(cd_dict_list)

The output of this is a nice clean list of dictionaries. To view all artists, a simple for loop can be used.

for item in cd_dict_list:
    print(item['ARTIST'])
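
If xmltodict is not available, a similar list of dictionaries can be built with the standard library's xml.etree.ElementTree. This is a sketch of an alternative technique, not the xmltodict approach used above; sample_xml is a hypothetical two-CD excerpt inlined for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-CD excerpt of the catalog, inlined for illustration
sample_xml = """<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <PRICE>10.90</PRICE>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <PRICE>9.90</PRICE>
  </CD>
</CATALOG>"""

root = ET.fromstring(sample_xml)  # root is the <CATALOG> element

# One dictionary per <CD>: child tag name -> child text
cd_dict_list = [
    {child.tag: (child.text or '').strip() for child in cd}
    for cd in root.findall('CD')
]

# View all artists, as in the loop above
for item in cd_dict_list:
    print(item['ARTIST'])
```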

XML to JSON

💡 JSON stands for JavaScript Object Notation. These files store data in key:value form like a Python dictionary. JSON files are used primarily to transmit data between web applications and servers.

Converting an XML file to a JSON file requires only a few lines of code. 

As always, import the needed libraries and modules.

import json
from pprint import pprint
import xmltodict

Again, you will see the use of xmltodict. Because a dictionary and JSON are structurally similar, first convert the file to a dictionary and then write it to a JSON file. The json.dumps() function serializes the dictionary to a JSON-formatted string, which is then written to a JSON file.

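# Read the XML file and convert it to a dictionary
with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog example.xml') as xml_file:
    data_dict = xmltodict.parse(xml_file.read())

# Serialize the dictionary to a JSON string and view it
json_data = json.dumps(data_dict)
pprint(json_data)

# Write the JSON string to a file; the with statement closes it automatically
with open('data.json', 'w') as json_file:
    json_file.write(json_data)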

Output:

('{"CATALOG": {"CD": [{"TITLE": "Empire Burlesque", "ARTIST": "Bob Dylan", '
 '"COUNTRY": "USA", "COMPANY": "Columbia", "PRICE": "10.90", "YEAR": "1985"}, '
 '{"TITLE": "Hide your heart", "ARTIST": "Bonnie Tyler", "COUNTRY": "UK", '
 '"COMPANY": "CBS Records", "PRICE": "9.90", "YEAR": "1988"}, {"TITLE": '
 '"Greatest Hits", "ARTIST": "Dolly Parton", "COUNTRY": "USA", "COMPANY": '
 '"RCA", "PRICE": "9.90", "YEAR": "1982"}, {"TITLE": "Still got the blues", '….)

The data that started as an XML file has now been written to a JSON file called data.json.
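
To check that nothing is lost in the conversion, the JSON string can be parsed back into Python objects. The sketch below assumes a hypothetical one-CD dictionary (data_dict) shaped like the xmltodict output, and uses the indent argument of json.dumps() to make the result easier to read.

```python
import json

# Hypothetical one-CD dictionary shaped like the xmltodict output above
data_dict = {
    'CATALOG': {
        'CD': [
            {'TITLE': 'Empire Burlesque', 'ARTIST': 'Bob Dylan',
             'COUNTRY': 'USA', 'COMPANY': 'Columbia',
             'PRICE': '10.90', 'YEAR': '1985'}
        ]
    }
}

# indent=2 produces human-readable, multi-line JSON
json_data = json.dumps(data_dict, indent=2)
print(json_data)

# json.loads() parses the string back into Python objects
restored = json.loads(json_data)
print(restored == data_dict)
```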

XML to DataFrame

There are a couple of ways to achieve this goal.

Using Python’s ElementTree is one. I am, however, partial to Pandas.

💡 Pandas is a great module for working with data, and it simplifies many daily tasks of a programmer and data scientist. I strongly suggest becoming familiar with this module.

For this code, use a combination of BeautifulSoup and Pandas.

Import the necessary libraries.

import pandas as pd
from bs4 import BeautifulSoup

To display the output fully, the pandas display options may need to be altered. I am going to set the maximum number of columns as well as the display width. This overrides the default settings.

Without doing this, you may find that some of your columns are replaced by '...' or that the remaining columns are wrapped onto lines below your first couple of columns.

# set max columns and display width
pd.set_option("display.max_columns", 10)
pd.set_option("display.width", 1000)

The width and columns can be changed according to your needs. With that completed, open and read the XML file. Store the contents in a variable.

# The with statement closes the file once its contents have been read
with open('C:\\Users\\Jordan Marshall\\Downloads\\cd_catalog.xml', 'r') as xml_file:
    contents = xml_file.read()

Next, create a BeautifulSoup object.

# BeautifulSoup object
soup = BeautifulSoup(contents, 'xml')

The next step is to extract the data and assign it to a variable. 

# Extract data and assign it to a variable
title = soup.find_all("TITLE")
artist = soup.find_all("ARTIST")
country = soup.find_all("COUNTRY")
company = soup.find_all("COMPANY")
price = soup.find_all("PRICE")
year = soup.find_all("YEAR")

Now a for loop can be used to extract the text.

Using the length of one of the variables as the loop bound means you do not need to know how many items are cataloged, even if data is later added or removed.

Place the text in an empty list.

# Text
cd_info = []
for i in range(0, len(title)):
    rows = [title[i].get_text(),
            artist[i].get_text(),
            country[i].get_text(),
            company[i].get_text(),
            price[i].get_text(),
            year[i].get_text()]
    cd_info.append(rows)

Lastly, create the data frame and name the columns.

# Create a dataframe with Pandas and print
df = pd.DataFrame(cd_info, columns=['Title', 'Artist', 'Country', 'Company', 'Price', 'Year'])
print(df)

Output

            Title                  Artist              Country         Company      Price     Year
0           Empire Burlesque       Bob Dylan           USA             Columbia     10.90     1985
1           Hide your heart        Bonnie Tyler        UK              CBS Records  9.90      1988
2           Greatest Hits          Dolly Parton        USA             RCA          9.90      1982

A nice, neat table containing each CD’s data has been created.
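
Every value scraped from XML arrives as text, so the Price and Year columns hold strings. The following sketch, which assumes hypothetical rows mirroring the first three CDs in the table above, converts them with pd.to_numeric() so that arithmetic works as expected.

```python
import pandas as pd

# Hypothetical rows mirroring the first three CDs in the table above
cd_info = [
    ['Empire Burlesque', 'Bob Dylan', 'USA', 'Columbia', '10.90', '1985'],
    ['Hide your heart', 'Bonnie Tyler', 'UK', 'CBS Records', '9.90', '1988'],
    ['Greatest Hits', 'Dolly Parton', 'USA', 'RCA', '9.90', '1982'],
]
df = pd.DataFrame(cd_info, columns=['Title', 'Artist', 'Country', 'Company', 'Price', 'Year'])

# Convert the numeric columns from strings to numbers
df['Price'] = pd.to_numeric(df['Price'])
df['Year'] = pd.to_numeric(df['Year'])

print(df.dtypes)
print(df['Price'].mean())
```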

XML to CSV

💡 A CSV (comma-separated values) file contains plain text that is easily readable by the user. It stores values as unformatted text, is commonly used to exchange tabular data between apps, and can be opened in any text editor or in a spreadsheet program such as Microsoft Excel.

Each line represents a new row of data, and each comma marks the start of a new column. Using the code from above, the XML file can be converted to a CSV file with one new line.

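# to_csv() returns None when given a path, so no assignment is needed;
# index=False omits the row-number column
df.to_csv('cd catalog.csv', index=False)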

With that, look for 'cd catalog.csv' in the script's current working directory, since a bare file name is written there. Opening it launches the default program used for spreadsheets, in this case Microsoft Excel.

Title                Artist           Country  Company         Price  Year
Empire Burlesque     Bob Dylan        USA      Columbia        10.90  1985
Hide your heart      Bonnie Tyler     UK       CBS Records     9.90   1988
Greatest Hits        Dolly Parton     USA      RCA             9.90   1982
Still got the blues  Gary Moore       UK       Virgin records  10.20  1990
Eros                 Eros Ramazzotti  EU       BMG             9.90   1997
One night only       Bee Gees         UK       Polydor         10.90  1998
Sylvias Mother       Dr.Hook          UK       CBS             8.10   1973
Maggie May           Rod Stewart      UK       Pickwick        8.50   1990
Romanza              Andrea Bocelli   EU       Polydor         10.80  1996
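
To see what such a file looks like as raw text, or to produce one without pandas, the standard library's csv module can be used. This is a sketch of an alternative technique under stated assumptions: header and rows are hypothetical stand-ins for the catalog data, and an in-memory buffer replaces the real file.

```python
import csv
import io

# Hypothetical header and rows standing in for the catalog data
header = ['Title', 'Artist', 'Country', 'Company', 'Price', 'Year']
rows = [
    ['Empire Burlesque', 'Bob Dylan', 'USA', 'Columbia', '10.90', '1985'],
    ['Hide your heart', 'Bonnie Tyler', 'UK', 'CBS Records', '9.90', '1988'],
]

# Write to an in-memory buffer; for a real file, use
# open('cd catalog.csv', 'w', newline='') instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Read the text back: each line becomes a dict keyed by the header row
records = list(csv.DictReader(io.StringIO(csv_text)))
print(records[1]['Artist'])
```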

July 16, 2022 at 08:34PM