About Joe Stein

XML to Avro Conversion

We all know what XML is right?  Just in case not, no problem here is what it is all about.
 
 
 
 
 
 
 
 
 

<root>
<node>5</node>
</root>

Now, what the computer really needs is the number five and some context around it. In XML you (human and computer) can see how it represents context to five. Now lets say instead you have a business XML document like FPML

<FpML xmlns="http://www.fpml.org/2007/FpML-4-4" xmlns:fpml="http://www.fpml.org/2007/FpML-4-4" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4-4" xsi:schemaLocation="http://www.fpml.org/2007/FpML-4-4 ../fpml-main-4-4.xsd http://www.w3.org/2000/09/xmldsig# ../xmldsig-core-schema.xsd" xsi:type="RequestTradeConfirmation">
<!--  start of distinct  -->
<strike>
<strikePrice>32.00</strikePrice>
</strike>
<numberOfOptions>150000</numberOfOptions>
<optionEntitlement>1.00</optionEntitlement>
<equityPremium>
<payerPartyReference href="party2"/>
<receiverPartyReference href="party1"/>
<paymentAmount>
<currency>EUR</currency>
<amount>405000</amount>
</paymentAmount>
<paymentDate>
<unadjustedDate>2001-07-17Z</unadjustedDate>
<dateAdjustments>
<businessDayConvention>NONE</businessDayConvention>
</dateAdjustments>
</paymentDate>
<pricePerOption>
<currency>EUR</currency>
<amount>2.70</amount>
</pricePerOption>
</equityPremium>
</equityOption>
<calculationAgent>
<calculationAgentPartyReference href="party1"/>
</calculationAgent>
<documentation>
<masterAgreement>
<masterAgreementType>ISDA2002</masterAgreementType>
</masterAgreement>
<contractualDefinitions>ISDA2002Equity</contractualDefinitions>
<!--
 populate credit support document with correct value 
-->
<creditSupportDocument>TODO</creditSupportDocument>
</documentation>
<governingLaw>GBEN</governingLaw>
</trade>
<party id="party1">
<partyId>Party A</partyId>
</party>
<party id="party2">
<partyId>Party B</partyId>
</party>
</FpML>

That is a lot of extra unnecessary data points. Now lets look at this using Apache Avro.

With Avro, the context and the values are separated. This means the schema/structure of what the information is does not get stored or streamed over and over and over and over (and over) again.

The Avro schema is hashed. So the data structure only holds the value and the computer understands the fingerprint (the hash) of the schema and can retrieve the schema using the fingerprint.

0x d7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592

This type of implementation is pretty typical in the data space.

When you do this you can reduce your data between 20%-80%. When I tell folks this they immediately ask, “why such a large gap of unknowns”. The answer is because not every XML is created the same. But that is the problem because you are duplicating the information the computer needs to understand the data. XML is nice for humans to read, sure … but that is not optimized for the computer.

Here is a converter we are working on https://github.com/stealthly/xml-avro to help get folks off of XML and onto lower cost, open source systems. This allows you to keep parts of your systems (specifically the domain business code) using the XML and not having to be changed (risk mitigation) but store and stream the data with less overhead (optimize budget).
 

Reference: XML to Avro Conversion from our JCG partner Joe Stein at the All Things Hadoop blog.
Related Whitepaper:

Functional Programming in Java: Harnessing the Power of Java 8 Lambda Expressions

Get ready to program in a whole new way!

Functional Programming in Java will help you quickly get on top of the new, essential Java 8 language features and the functional style that will change and improve your code. This short, targeted book will help you make the paradigm shift from the old imperative way to a less error-prone, more elegant, and concise coding style that’s also a breeze to parallelize. You’ll explore the syntax and semantics of lambda expressions, method and constructor references, and functional interfaces. You’ll design and write applications better using the new standards in Java 8 and the JDK.

Get it Now!  

Leave a Reply


7 + = ten



Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.

Sign up for our Newsletter

20,709 insiders are already enjoying weekly updates and complimentary whitepapers! Join them now to gain exclusive access to the latest news in the Java world, as well as insights about Android, Scala, Groovy and other related technologies.

As an extra bonus, by joining you will get our brand new e-books, published by Java Code Geeks and their JCG partners for your reading pleasure! Enter your info and stay on top of things,

  • Fresh trends
  • Cases and examples
  • Research and insights
  • Two complimentary e-books