XML Parser in Rust
==================

**Status as of 2025/May/02: not implemented**

The purpose of this proposal is to replace libxml2 in librsvg with a
Rust-based XML parser.

Librsvg uses `libxml2 <https://gitlab.gnome.org/GNOME/libxml2>`_ to do
the initial XML parsing of an SVG document.  It does not let libxml2
build its own tree representation; instead, it uses the SAX2
"streaming" parser API and so librsvg builds a tree of its own with
tag names and attributes.

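
As an illustration of what "building a tree from a streaming API" involves, here is a minimal sketch of the general technique. It is not librsvg's code: ``SaxEvent``, ``XmlNode`` and ``build_tree`` are hypothetical names, and the events stand in for the callbacks that a SAX2-style parser would deliver.

```rust
// Hypothetical event type standing in for SAX2-style callbacks.
// These names are illustrative and are not part of librsvg or libxml2.
#[derive(Debug, Clone, PartialEq)]
pub enum SaxEvent {
    StartElement { name: String, attrs: Vec<(String, String)> },
    EndElement,
    Characters(String),
}

// Hypothetical tree node that only holds strings.
#[derive(Debug, PartialEq)]
pub enum XmlNode {
    Element {
        name: String,
        attrs: Vec<(String, String)>,
        children: Vec<XmlNode>,
    },
    Text(String),
}

/// Builds a tree from a stream of events by keeping a stack of the
/// elements that are currently open.  Returns `None` if the events are
/// unbalanced, if there is no root element, or if there are two roots.
pub fn build_tree<I>(events: I) -> Option<XmlNode>
where
    I: IntoIterator<Item = SaxEvent>,
{
    // Each entry is an open element whose children are still arriving.
    let mut stack: Vec<(String, Vec<(String, String)>, Vec<XmlNode>)> = Vec::new();
    let mut root: Option<XmlNode> = None;

    for event in events {
        match event {
            SaxEvent::StartElement { name, attrs } => {
                if stack.is_empty() && root.is_some() {
                    return None; // a second root element
                }
                stack.push((name, attrs, Vec::new()));
            }
            SaxEvent::Characters(text) => {
                // Text outside of any element is ignored in this sketch.
                if let Some((_, _, children)) = stack.last_mut() {
                    children.push(XmlNode::Text(text));
                }
            }
            SaxEvent::EndElement => {
                // Close the innermost open element and attach it to its parent.
                let (name, attrs, children) = stack.pop()?;
                let node = XmlNode::Element { name, attrs, children };
                match stack.last_mut() {
                    Some((_, _, siblings)) => siblings.push(node),
                    None => root = Some(node),
                }
            }
        }
    }

    if stack.is_empty() { root } else { None }
}
```

The stack mirrors the nesting of open tags, so the tree is complete only once the final "end element" event has arrived.
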
Pragmatically speaking, there is nothing wrong with using libxml2 in
the way that librsvg uses it:

* Libxml2 is fast.

* It is well-maintained, is fuzz-tested at scale, and is such a
  critical piece of infrastructure that people actually pay attention
  to it.

* It has built-in mitigations for common XML attacks like the
  "`billion laughs
  <https://en.wikipedia.org/wiki/Billion_laughs_attack>`_".

* Librsvg is careful to turn off features like network access and
  external XML entities, which are a well-known source of
  attacks.

However, libxml2 has had many CVEs and security problems in the past.
It is the sort of infrastructure that should be replaced with
memory-safe code at some point.

Steps
-----

1. Separate the XML tree from the SVG element tree.

2. Change the XML tree to one that is produced by the new Rust-based
   XML parser.

The sections below explore each of these steps.

Separating the XML tree from the SVG element tree
-------------------------------------------------

Librsvg has a tree data structure, managed by the ``rctree`` crate,
where each node is a combination of XML data (element name for the
tag, and a list of attributes with their string values) and the parsed
SVG data (individual structs for ``Group``, ``Path``, etc., plus
parsed properties and element-specific attributes).

From ``document.rs``:

.. code-block:: rust

    pub struct Document {
        /// Tree of nodes; the root is guaranteed to be an `<svg>` element.
        tree: Node,

        // ...
    }

From ``node.rs``:

.. code-block:: rust

    pub type Node = rctree::Node<NodeData>;

    pub enum NodeData {
        Element(Box<Element>),
        Text(Box<Chars>),
    }

From ``xml/attributes.rs``:

.. code-block:: rust

    pub struct Attributes {
        attrs: Box<[(QualName, AttributeValue)]>,

        // ...
    }

From ``element.rs``:

.. code-block:: rust

    pub struct Element {
        element_name: QualName,
        attributes: Attributes,
        specified_values: SpecifiedValues,
        pub element_data: ElementData,

        // ... some fields omitted
    }

    pub enum ElementData {
        Circle(Box<Circle>),
        ClipPath(Box<ClipPath>),
        Ellipse(Box<Ellipse>),
        // ...
    }

Here, ``struct Element`` is a combination of XML string data
(``element_name``, ``attributes``), plus the result of parsing those
strings into SVG and CSS-specific information (``specified_values``,
``element_data``).

**Goal:** ``Element`` should *not* contain XML string data.  It may
contain a pointer back to its corresponding XML node; whether that is
possible may depend on what the crate that represents the XML tree
lets us do.

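
One possible shape for that separation, purely as a sketch: the XML side keeps the strings, and the SVG side keeps only parsed data plus an index into the XML tree. All names here (``XmlTree``, ``XmlNodeId``, ``SvgElement``, ``SvgData``) are hypothetical; whether an index, a reference, or nothing at all is the right back-pointer depends on the XML crate that ends up being used.

```rust
/// Hypothetical handle to a node in the XML tree.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct XmlNodeId(pub usize);

/// String-only data, as it would come out of an XML parser.
#[derive(Debug)]
pub struct XmlElement {
    pub name: String,
    pub attributes: Vec<(String, String)>,
}

/// Hypothetical XML tree; a flat list of elements is enough for this sketch.
#[derive(Debug, Default)]
pub struct XmlTree {
    elements: Vec<XmlElement>,
}

impl XmlTree {
    /// Stores an element and returns a handle to it.
    pub fn add(&mut self, name: &str, attributes: &[(&str, &str)]) -> XmlNodeId {
        self.elements.push(XmlElement {
            name: name.to_string(),
            attributes: attributes
                .iter()
                .map(|&(k, v)| (k.to_string(), v.to_string()))
                .collect(),
        });
        XmlNodeId(self.elements.len() - 1)
    }

    pub fn get(&self, id: XmlNodeId) -> Option<&XmlElement> {
        self.elements.get(id.0)
    }
}

/// Parsed SVG data only; a simplified stand-in for `ElementData`.
#[derive(Debug, PartialEq)]
pub enum SvgData {
    Circle { r: f64 },
    Group,
}

/// A simplified stand-in for `Element` that holds no XML strings,
/// just a handle back to the XML node it was parsed from.
#[derive(Debug)]
pub struct SvgElement {
    pub xml_node: XmlNodeId,
    pub data: SvgData,
}

impl SvgElement {
    /// Looks up the element name in the XML tree instead of storing it.
    pub fn name<'a>(&self, xml: &'a XmlTree) -> Option<&'a str> {
        xml.get(self.xml_node).map(|e| e.name.as_str())
    }
}
```

With this layout, code that needs the original strings (for example, to report an error about an attribute) goes through the XML tree, while rendering code only touches the parsed data.
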
Things to consider
~~~~~~~~~~~~~~~~~~

* With the libxml2-based SAX2 parser, as soon as librsvg gets a "start
  element" event it will parse each value in the list of attributes.
  It will then use this information to construct an ``Element`` and
  then a ``Node``.  We may have to change this "build from the inside
  out" process to instead assume that an XML tree is available and
  full of strings, and later an SVG tree can be constructed from it.

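
The two-pass approach described above could look roughly like the following sketch: a complete, string-only XML tree is walked after parsing, and the SVG tree is produced from it. The types and the ``build_svg`` function are hypothetical and heavily simplified. In particular, falling back to ``0.0`` for a missing or invalid radius is an assumption made for brevity, not how librsvg handles attribute errors.

```rust
/// Hypothetical XML node that only holds strings.
pub struct XmlNode {
    pub name: String,
    pub attrs: Vec<(String, String)>,
    pub children: Vec<XmlNode>,
}

/// Simplified stand-in for the parsed, element-specific SVG data.
#[derive(Debug, PartialEq)]
pub enum SvgData {
    Circle { r: f64 },
    Group,
    Unknown,
}

/// Simplified stand-in for a node in the SVG element tree.
#[derive(Debug, PartialEq)]
pub struct SvgNode {
    pub data: SvgData,
    pub children: Vec<SvgNode>,
}

/// Returns the string value of an attribute, if present.
fn attr<'a>(node: &'a XmlNode, key: &str) -> Option<&'a str> {
    node.attrs
        .iter()
        .find(|(k, _)| k.as_str() == key)
        .map(|(_, v)| v.as_str())
}

/// Second pass: builds an SVG tree from an XML tree that already exists.
pub fn build_svg(xml: &XmlNode) -> SvgNode {
    let data = match xml.name.as_str() {
        "circle" => {
            // Assumption for this sketch: a bad or missing value becomes 0.0.
            let r = attr(xml, "r")
                .and_then(|s| s.parse::<f64>().ok())
                .unwrap_or(0.0);
            SvgData::Circle { r }
        }
        "g" => SvgData::Group,
        _ => SvgData::Unknown,
    };

    SvgNode {
        data,
        // Children are converted recursively, in document order.
        children: xml.children.iter().map(build_svg).collect(),
    }
}
```

Attribute parsing no longer happens inside the parser's "start element" callback; it happens in a separate pass that can see the whole XML tree.
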
Change the XML tree to one from a Rust-based parser
---------------------------------------------------

FIXME

* The code in ``css.rs`` that implements the ``selectors::Element``
  trait for nodes in the tree needs O(1) access to a node's parent and
  to its next sibling.

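
To make the O(1) requirement concrete, here is a sketch of one common way to get constant-time parent and next-sibling lookups: an index-based arena in which every node stores its links directly. ``Arena``, ``NodeId`` and their methods are hypothetical names. This is not a description of any particular XML crate, only of the property that a candidate tree representation would need to offer.

```rust
/// Hypothetical handle to a node stored in the arena.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeId(usize);

/// Per-node data plus the links needed for O(1) navigation.
struct NodeLinks {
    name: String,
    parent: Option<NodeId>,
    first_child: Option<NodeId>,
    last_child: Option<NodeId>,
    next_sibling: Option<NodeId>,
}

/// All nodes live in one vector; a `NodeId` is an index into it.
#[derive(Default)]
pub struct Arena {
    nodes: Vec<NodeLinks>,
}

impl Arena {
    /// Creates a detached node and returns its handle.
    pub fn new_node(&mut self, name: &str) -> NodeId {
        self.nodes.push(NodeLinks {
            name: name.to_string(),
            parent: None,
            first_child: None,
            last_child: None,
            next_sibling: None,
        });
        NodeId(self.nodes.len() - 1)
    }

    /// Appends `child` as the last child of `parent`.
    /// This sketch assumes `child` is currently detached.
    pub fn append_child(&mut self, parent: NodeId, child: NodeId) {
        match self.nodes[parent.0].last_child {
            Some(last) => self.nodes[last.0].next_sibling = Some(child),
            None => self.nodes[parent.0].first_child = Some(child),
        }
        self.nodes[parent.0].last_child = Some(child);
        self.nodes[child.0].parent = Some(parent);
    }

    /// O(1): a single indexed read.
    pub fn parent(&self, id: NodeId) -> Option<NodeId> {
        self.nodes[id.0].parent
    }

    /// O(1): a single indexed read.
    pub fn next_sibling(&self, id: NodeId) -> Option<NodeId> {
        self.nodes[id.0].next_sibling
    }

    pub fn first_child(&self, id: NodeId) -> Option<NodeId> {
        self.nodes[id.0].first_child
    }

    pub fn name(&self, id: NodeId) -> &str {
        &self.nodes[id.0].name
    }
}
```

A representation that stores children only as a list inside the parent, with no link from a node to its parent or to its next sibling, would force a search for these queries and would not meet the requirement.
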
Notes
-----

This used to be https://gitlab.gnome.org/GNOME/librsvg/-/issues/224
but that issue was mostly a wishlist item, rather than a specification
document like the present one.