Understanding Invalid Characters in XML

XML, the Extensible Markup Language, is a fundamental building block for data exchange and configuration files. But like any language, it has its own rules about what characters are allowed and how they should be used. In XML, certain characters can cause parsing errors if we are not careful. This article will explore invalid characters in XML and how to keep our data tidy and error-free.

1. What are Invalid Characters in XML?

XML documents must adhere to specific syntax rules to be considered well-formed. Any deviation from these rules makes the document not well-formed, and a conforming parser will refuse to process it. One such issue arises from the presence of characters that are not permitted within XML documents.

Two main categories of characters can cause trouble in XML:

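Reserved Characters: These are characters that have special meanings within XML itself, like angle brackets (<, >), the ampersand (&), and quotation marks (", '). Using < or & directly in your data confuses the XML parser and leads to errors; the others need escaping only in specific contexts, such as a quotation mark inside an attribute value delimited by the same kind of quote.

Unsupported Characters (often loosely called non-Unicode characters): XML is defined in terms of Unicode, which supports a vast range of characters from various writing systems worldwide. However, not every Unicode code point is allowed in an XML document. Most ASCII control characters (everything below U+0020 except tab, line feed, and carriage return), unpaired surrogates, and the code points U+FFFE and U+FFFF fall outside the character range that XML 1.0 permits.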

1.1 Valid Characters Allowed in XML

While XML imposes restrictions on certain characters, it also defines a set of valid characters that can be used without any issues. These characters include:

  • Basic Latin Alphabet: All characters from the basic Latin alphabet (A-Z, a-z) are valid in XML.
  • Digits: Numeric digits (0-9) are allowed.
  • Special Characters: Certain special characters such as hyphen (-), underscore (_), period (.), colon (:), and comma (,), among others, are permitted in XML.
  • Unicode Characters: XML supports a vast range of Unicode characters beyond the basic Latin alphabet, allowing representation of various languages, symbols, and emojis.
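The allowed range can also be checked in code before a string ever reaches a parser. The following is a minimal sketch, assuming the character ranges of the XML 1.0 Char production; the class and method names (XmlCharCheck, isValidXml10Char, firstInvalidIndex) are hypothetical and chosen only for illustration.

```java
final class XmlCharCheck {

    // True if the code point falls inside the ranges XML 1.0 allows:
    // tab, line feed, carriage return, and the three ranges below.
    static boolean isValidXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index (in UTF-16 code units) of the first character that
    // XML 1.0 does not allow, or -1 if every character is allowed.
    static int firstInvalidIndex(String text) {
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!isValidXml10Char(cp)) {
                return i;
            }
            // Supplementary characters occupy two code units
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Note that this check only covers unsupported characters. Reserved characters such as & and < pass it, because they are legal XML characters that simply need escaping.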

1.2 XML 1.1

XML 1.1 is a revision of XML 1.0 that changes a small number of rules in the specification. These changes aimed to address certain limitations present in XML 1.0 and to provide better support for internationalization and character handling. Some key changes introduced in XML 1.1 include:

  • Expanded Character Set Support in Names: XML 1.0 originally tied the characters allowed in element and attribute names to an early version of Unicode. XML 1.1 allows almost any character in names, so scripts added to Unicode later can be used without waiting for a new XML revision.
  • Support for Additional Control Characters: XML 1.1 permits the control characters U+0001 through U+001F, which were not allowed in XML 1.0. Apart from tab, line feed, and carriage return, they must be written as numeric character references. The null character (U+0000) remains forbidden.
  • Relaxed Restrictions on XML Names: XML 1.1 relaxes the naming rules by excluding only a small set of characters instead of listing the permitted ones. A name still cannot start with a digit, a hyphen, or a period.
  • Enhanced Internationalization Support: XML 1.1 recognizes two additional line-end characters, NEL (U+0085) and the Unicode line separator (U+2028), and encourages fully normalized text.

2. Handling Invalid Characters

Invalid characters cause XML parsers to reject a document, so it’s crucial to deal with them properly to protect the integrity of the data. Here are some ways to handle them:

2.1 Character Replacement

This technique involves replacing reserved characters with valid XML escape sequences. An escape sequence starts with an ampersand (&), followed by an entity name or by the decimal (&#38;) or hexadecimal (&#x26;) code of the character to be escaped, and ends with a semicolon (;).

For example, the ampersand (&) cannot appear literally in XML text because it introduces entity and character references. To include an ampersand in your XML data, escape it as &amp; and, likewise, replace < with &lt;. A literal > is tolerated in element content (except in the sequence ]]>), but escaping it as &gt; is good practice. Let’s examine a code example that contains unescaped reserved characters:

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.StringReader;

public class CharacterReplacementExample {

    public static Document parseXML(String xmlString) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();

            // Parse the XML string
            InputSource inputSource = new InputSource(new StringReader(xmlString));
            Document document = builder.parse(inputSource);

            System.out.println("XML parsed successfully!");
            return document;
        } catch (ParserConfigurationException | IOException | org.xml.sax.SAXException e) {
            // Malformed input, such as an unescaped '&', raises a SAXException
            System.out.println("Failed to parse XML: " + e.getMessage());
            System.out.println("Invalid characters in XML");
            System.out.println(xmlString);
            return null;
        }
    }

    public static void main(String[] args) {

        // Sample XML string with invalid characters
        String xmlData = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><item>This is some text with invalid characters >: Core Java & Fundamentals</item></root>";

        // Parse the XML data
        Document parsedXML = parseXML(xmlData);
    }
}

In this example, we define a static method called parseXML(). The parseXML() method attempts to parse the XML string using Java’s Document Object Model (DOM) API. If parsing fails, it catches the exception and prints its message together with the offending input. Here the parser rejects the unescaped & because it is not followed by an entity name. If the parsing is successful, it outputs the message XML parsed successfully!. The output from running the above code is:

Fig 1: Output of invalid XML characters in Java example.

To keep XML documents well-formed, reserved characters must be escaped using their predefined character entities. In the above example, if we replace > with &gt; and & with &amp; in the XML string:

    public static void main(String[] args) {
        // Sample XML string with valid characters
        String xmlData = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><item>This is some text with valid characters &gt;: Core Java &amp; Fundamentals</item></root>";

        // Parse the XML data
        Document parsedXML = parseXML(xmlData);
    }

We will get the following output which indicates that the parsing is successful.

XML parsed successfully!
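When the text comes from user input or another system, the replacement is usually done in code rather than by hand. The following is a minimal sketch of such a helper, assuming only the five predefined XML entities are needed; the class name XmlEscaper and the method escape are hypothetical and not part of the original example.

```java
final class XmlEscaper {

    // Replaces the five reserved characters with their predefined entities.
    // All other characters are copied unchanged.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

The helper should be applied to the text content only, before it is placed between the tags. Applying it to a complete XML string would also escape the markup itself.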

2.1.1 Handling Reserved Characters in XML Using CDATA Section

Another technique to deal with these reserved characters is to use a CDATA section. A CDATA (Character Data) section in XML allows us to include blocks of text containing characters that would otherwise be recognized as markup. XML parsers do not interpret the content inside a CDATA section as markup and treat it as raw character data instead. The only sequence that cannot appear inside a CDATA section is ]]>, because it marks the end of the section.

In Java, handling reserved characters in XML using the CDATA section is straightforward. Here’s how we can do it:

    public static void main(String[] args) {

        //CDATA Section
        String xmlData = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root><item><![CDATA[CDATA is useful for handling reserved characters like <, >, & in XML.]]></item></root>";
        
        // Parse the XML data
        Document parsedXML = parseXML(xmlData);
    }

In this example, the text enclosed within the <item> element is wrapped inside a CDATA section. This ensures that any reserved characters within the CDATA section will not be interpreted as XML markup.
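If the text to be wrapped is not known in advance, it may itself contain the terminator ]]>. A common workaround is to split the text across two adjacent CDATA sections at that point. The sketch below illustrates the idea; the class name CdataWrapper and the method wrap are hypothetical.

```java
final class CdataWrapper {

    // Wraps text in a CDATA section. Any "]]>" inside the text is split so
    // that "]]" ends one section and ">" starts the next one.
    static String wrap(String text) {
        String safe = text.replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + safe + "]]>";
    }
}
```

A parser reads the adjacent sections back as one continuous run of character data, so the original text is preserved. CDATA does not help with unsupported control characters, which remain illegal inside a CDATA section as well.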

2.2 Handling Non-Unicode Characters in XML

In addition to handling reserved characters, dealing with unsupported and non-printable characters is crucial when parsing XML documents. In ASCII, control characters are non-printable characters that are used for control purposes rather than representing printable characters. They are valid Unicode code points, but most of them lie outside the character range that XML 1.0 allows.

The “End of Text” character (\u0003) is one such control character. XML 1.0 forbids it entirely, even when written as a numeric character reference, so a document that contains it cannot be parsed as XML 1.0. To parse non-printable ASCII control characters in an XML document, we need to ensure that these characters are properly encoded and handled during the parsing process. Here are some suggestions for parsing XML documents containing non-printable ASCII values:

  • Specify Character Encoding: Ensure that the XML document specifies the correct character encoding, such as UTF-8 or UTF-16, in its declaration.
  • Handle Text Content: When retrieving text content from XML elements, ensure that you handle non-printable ASCII characters properly.
  • Handle Control Characters in Code: In a document declared as XML 1.1, control characters can be represented as numeric character references, such as &#3; for the “End of Text” character (\u0003) and &#x1E; for the “Record Separator” character (\u001E).

For example:

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.StringReader;

public class EncodingConversionExample {

    public static void main(String[] args) {
        
        // XML string with a raw control character (U+0003), which XML 1.0 forbids
        String xmlString = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
                            "<root>\n" +
                            "    <text>Non-Unicode Character: \u0003</text>\n" +
                            "</root>";

        // Parse XML using DOM
        parseXMLWithDOM(xmlString);
    }

    private static void parseXMLWithDOM(String xmlString) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            InputSource is = new InputSource(new StringReader(xmlString));
            Document document = builder.parse(is);

            // Read the text content of the first <text> element
            Element root = document.getDocumentElement();
            Node textNode = root.getElementsByTagName("text").item(0);
            String textContent = textNode.getTextContent();
            System.out.println("DOM Parsed Text Content: " + textContent);
        } catch (Exception e) {
            // A SAXParseException is expected here for the invalid character
            e.printStackTrace();
        }
    }
}

Running the above code would give the following error:

[Fatal Error] :3:34: An invalid XML character (Unicode: 0x3) was found in the element content of the document.
org.xml.sax.SAXParseException; lineNumber: 3; columnNumber: 34; An invalid XML character (Unicode: 0x3) was found in the element content of the document.

To fix the error, declare the document as XML 1.1 with the appropriate encoding and represent the control character using its numeric character reference:

    public static void main(String[] args) {

        // version="1.1" makes the &#3; character reference legal
        String xmlString = "<?xml version=\"1.1\" encoding=\"UTF-8\"?>\n" +
                            "<root>\n" +
                            "    <text>Non-Unicode Character: &#3;</text>\n" +
                            "</root>";

        // Parse XML using DOM
        parseXMLWithDOM(xmlString);
    }
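Switching to XML 1.1 is not always possible, for example when the consumer of the document only accepts XML 1.0. In that case, an alternative is to remove the unsupported characters before the text is written into the document. The following is a minimal sketch of that approach, assuming that silently dropping the characters is acceptable for the data at hand; the class name XmlSanitizer and its methods are hypothetical.

```java
final class XmlSanitizer {

    // Returns a copy of the text without the characters XML 1.0 forbids,
    // such as control characters below U+0020 (other than tab, LF, CR)
    // and unpaired surrogates.
    static String stripInvalidXml10Chars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlSanitizer::isAllowed)
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    // Character ranges permitted by XML 1.0
    private static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Dropping characters changes the data, so this trade-off should be a deliberate choice. Where the control characters carry meaning, the XML 1.1 approach shown above keeps them intact.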

3. Conclusion

This article explored the process of parsing XML documents containing invalid characters in Java. We discussed the importance of specifying proper character encoding, handling text content, and representing control characters using their corresponding numeric character references. By following these steps, we can ensure the accurate parsing and processing of XML documents.

4. Download the Source Code

This was an article on handling XML invalid characters with Java.

Download
You can download the full source code of this example here: java xml invalid characters

Omozegie Aziegbe

Omos holds a Master's degree in Information Engineering with Network Management from the Robert Gordon University, Aberdeen. Omos is a freelance web/application developer who is currently focused on developing Java enterprise applications with the Jakarta EE framework.