7 min read

How to Convert XML to Dataframe Using Xlm2


xml code screenshot
A dataset that was only available in pdf or xml.

View raw source for this post

Summary

Everytime I run into a file with an .xml extension, I cringe. Though, admittedly, it’s a file format that you have to be familiar with when it comes to sending and receiving data over the web. R has a package xlm2 to assist in the conversion of nested data to tabular data.

Table of Contents

Overview

The task here is to take an easy and short xml file and convert it to a dataframe. The xml2 package was specifically designed for the task and is the successor to the xml package.

Background

Sanity Checks

Let’s face it: you wouldn’t be on this page if you could’ve figured it out using W3schools or the R help menu. Let’s start with some super basic stuff. Have you opened the .xml file using a browser? Do that first and see if you can get a sense of how the data is organized. Second, open the xml file in a linter and see if it’s formatted correctly. (Here’s an example) I’ve lost a ton of time double checking code only to figure out that some .json was missing a “}”. Finally, open it in a good text editor and beautify the code. (I use Atom.) My last project, I “peeled” the outer layers of the onion to get at the data. It was hacky and lacked reproducibility, but it worked.

Definitions

First, it’s important to get some defintions first. (W3schools has a really helpful set of tutorials on the topic.) A “node” in general and speaking in a broad way is an HTML element.

DOM (Document Object Model) - a tree structure that represents the HTML of the website, and every HTML element is a “node”.

Element Nodes - model the actual HTML elements in the document.

Attribute Nodes - model the various attributes in the different HTML elements. Attributes include id, class, title and style.

Text Nodes - model the text content inside the different HTML elements.

Root Node - the node on the very top of the document tree, usually called the document node.

Parent Node - a node that has children; represents an element that has at least one other element or text nested inside it.

Child Node - a node that has a parent; represents an element or text that is nested inside another element. For example, a <p> tag is often the child of a <div> tag.

Example

See [DOM model](https://en.wikipedia.org/wiki/Document_Object_Model) on Wikipedia

Figure 1: See DOM model on Wikipedia

Structure

XML structured data looks like this:

<?xml version="1.0" encoding="UTF-8"?> #xml prologue
<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>

Xpath

XPath is a syntax for defining and navigating parts of an XML document. They can be used across many languages. Xpath has over 200 functions according to W3schools. See the W3 page for xpath syntax.

XML Data

All of the functions below are from the xlm2 package and the help page.

# get W3 data
url <- "https://www.w3schools.com/xml/simple.xml"
# read breakfast menu
bkfst <- read_xml(x = url)
str(bkfst)
List of 2
 $ node:<externalptr> 
 $ doc :<externalptr> 
 - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
xml_name(bkfst)
[1] "breakfast_menu"

See XML Data Structure

xml_structure(bkfst)
<breakfast_menu>
  <food>
    <name>
      {text}
    <price>
      {text}
    <description>
      {text}
    <calories>
      {text}
  <food>
    <name>
      {text}
    <price>
      {text}
    <description>
      {text}
    <calories>
      {text}
  <food>
    <name>
      {text}
    <price>
      {text}
    <description>
      {text}
    <calories>
      {text}
  <food>
    <name>
      {text}
    <price>
      {text}
    <description>
      {text}
    <calories>
      {text}
  <food>
    <name>
      {text}
    <price>
      {text}
    <description>
      {text}
    <calories>
      {text}
# see text
xml_text(bkfst)
[1] "Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650Strawberry Belgian Waffles$7.95Light Belgian waffles covered with strawberries and whipped cream900Berry-Berry Belgian Waffles$8.95Light Belgian waffles covered with an assortment of fresh berries and whipped cream900French Toast$4.50Thick slices made from our homemade sourdough bread600Homestyle Breakfast$6.95Two eggs, bacon or sausage, toast, and our ever-popular hash browns950"

Extract XML

The real key is understanding how to define or set the “xpath” argument.

xml_find_all(bkfst, xpath = "//name")
{xml_nodeset (5)}
[1] <name>Belgian Waffles</name>
[2] <name>Strawberry Belgian Waffles</name>
[3] <name>Berry-Berry Belgian Waffles</name>
[4] <name>French Toast</name>
[5] <name>Homestyle Breakfast</name>
xml_text(xml_find_all(bkfst, xpath = "//name"))
[1] "Belgian Waffles"             "Strawberry Belgian Waffles" 
[3] "Berry-Berry Belgian Waffles" "French Toast"               
[5] "Homestyle Breakfast"        

Build table

library(tibble)
name <- xml_text(xml_find_all(bkfst, xpath = "//name"))
price <- xml_text(xml_find_all(bkfst, xpath = "//price"))
df <- tibble(name = name, price = price)
df
# A tibble: 5 x 2
  name                        price
  <chr>                       <chr>
1 Belgian Waffles             $5.95
2 Strawberry Belgian Waffles  $7.95
3 Berry-Berry Belgian Waffles $8.95
4 French Toast                $4.50
5 Homestyle Breakfast         $6.95

Lists and XML

Lists and xml are similar and easily converted within the xml2 package.

document <- list(root = list(parent = list(child = list(1))))
xmlDoc <- xml2::as_xml_document(document)
xml2::xml_structure(xmlDoc)
<root>
  <parent>
    <child>
      {text}
{xml_nodeset (1)}
[1] <child>1</child>

Conclusion

If you can get a basic understanding of html web pages and xpath syntax, then you should be able to parse xml files efficiently with xlm2.

Acknowledgements

(Get bibliographic stuff from “archetype hill”.)

References

Disclaimer

The views, analysis and conclusions presented within this paper represent the author’s alone and not of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article was correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimd as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.

Reproducibility

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.3 (2020-02-29)
 os       macOS Catalina 10.15.7      
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/Chicago             
 date     2021-03-05                  

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version date       lib source                           
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)                   
 blogdown      1.1.14  2021-02-27 [1] Github (rstudio/blogdown@b0fa6ed)
 bookdown      0.21    2020-10-13 [1] CRAN (R 3.6.3)                   
 bslib         0.2.4   2021-01-25 [1] CRAN (R 3.6.2)                   
 cachem        1.0.3   2021-02-04 [1] CRAN (R 3.6.2)                   
 callr         3.5.1   2020-10-13 [1] CRAN (R 3.6.2)                   
 cli           2.3.0   2021-01-31 [1] CRAN (R 3.6.2)                   
 codetools     0.2-18  2020-11-04 [1] CRAN (R 3.6.2)                   
 colorspace    2.0-0   2020-11-11 [1] CRAN (R 3.6.2)                   
 crayon        1.4.1   2021-02-08 [1] CRAN (R 3.6.2)                   
 DBI           1.1.1   2021-01-15 [1] CRAN (R 3.6.2)                   
 desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.0)                   
 devtools    * 2.3.2   2020-09-18 [1] CRAN (R 3.6.2)                   
 digest        0.6.27  2020-10-24 [1] CRAN (R 3.6.2)                   
 dplyr         1.0.4   2021-02-02 [1] CRAN (R 3.6.2)                   
 ellipsis      0.3.1   2020-05-15 [1] CRAN (R 3.6.2)                   
 evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.0)                   
 fastmap       1.1.0   2021-01-25 [1] CRAN (R 3.6.2)                   
 fs            1.5.0   2020-07-31 [1] CRAN (R 3.6.2)                   
 generics      0.1.0   2020-10-31 [1] CRAN (R 3.6.2)                   
 ggplot2     * 3.3.3   2020-12-30 [1] CRAN (R 3.6.2)                   
 ggthemes    * 4.2.4   2021-01-20 [1] CRAN (R 3.6.2)                   
 glue          1.4.2   2020-08-27 [1] CRAN (R 3.6.2)                   
 gtable        0.3.0   2019-03-25 [1] CRAN (R 3.6.0)                   
 htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 3.6.2)                   
 jquerylib     0.1.3   2020-12-17 [1] CRAN (R 3.6.2)                   
 jsonlite      1.7.2   2020-12-09 [1] CRAN (R 3.6.2)                   
 knitr         1.31    2021-01-27 [1] CRAN (R 3.6.2)                   
 lifecycle     0.2.0   2020-03-06 [1] CRAN (R 3.6.0)                   
 magrittr      2.0.1   2020-11-17 [1] CRAN (R 3.6.2)                   
 memoise       2.0.0   2021-01-26 [1] CRAN (R 3.6.2)                   
 munsell       0.5.0   2018-06-12 [1] CRAN (R 3.6.0)                   
 pillar        1.4.7   2020-11-20 [1] CRAN (R 3.6.2)                   
 pkgbuild      1.2.0   2020-12-15 [1] CRAN (R 3.6.2)                   
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 3.6.0)                   
 pkgload       1.1.0   2020-05-29 [1] CRAN (R 3.6.2)                   
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.0)                   
 processx      3.4.5   2020-11-30 [1] CRAN (R 3.6.2)                   
 ps            1.5.0   2020-12-05 [1] CRAN (R 3.6.2)                   
 purrr         0.3.4   2020-04-17 [1] CRAN (R 3.6.2)                   
 R6            2.5.0   2020-10-28 [1] CRAN (R 3.6.2)                   
 remotes       2.2.0   2020-07-21 [1] CRAN (R 3.6.2)                   
 rlang         0.4.10  2020-12-30 [1] CRAN (R 3.6.2)                   
 rmarkdown     2.7     2021-02-19 [1] CRAN (R 3.6.3)                   
 rprojroot     2.0.2   2020-11-15 [1] CRAN (R 3.6.2)                   
 sass          0.3.1   2021-01-24 [1] CRAN (R 3.6.2)                   
 scales        1.1.1   2020-05-11 [1] CRAN (R 3.6.2)                   
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.0)                   
 stringi       1.5.3   2020-09-09 [1] CRAN (R 3.6.2)                   
 stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.0)                   
 testthat      3.0.1   2020-12-17 [1] CRAN (R 3.6.2)                   
 tibble        3.0.6   2021-01-29 [1] CRAN (R 3.6.2)                   
 tidyselect    1.1.0   2020-05-11 [1] CRAN (R 3.6.2)                   
 usethis     * 2.0.1   2021-02-10 [1] CRAN (R 3.6.2)                   
 vctrs         0.3.6   2020-12-17 [1] CRAN (R 3.6.2)                   
 withr         2.4.1   2021-01-26 [1] CRAN (R 3.6.2)                   
 xfun          0.21    2021-02-10 [1] CRAN (R 3.6.2)                   
 xml2        * 1.3.2   2020-04-23 [1] CRAN (R 3.6.2)                   
 yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.0)                   

[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library