I am currently working on a project that has proven to be very challenging, architecturally speaking. I've been learning a lot along the way, and I have come up with some approaches to addressing the challenge that I feel are pretty doggone sweet, so I thought I'd share some of the details in case it helps spark ideas for other people.
Let me describe how this challenge came about, and the challenge itself.
I'm sitting in my office working on something when I hear several people in an informal meeting discussing how much in time and money it will cost them to pay several temps to enter data from a spreadsheet into a remote system. I wrestle with the idea of making work for myself for a few seconds, then inject myself into the conversation and ask them a few questions about the remote system and what ways it provides for inputting data. As it turns out, the remote system has an interface for receiving this same data in the form of an XML file. Simply drop the XML file into a special folder in their system and it is automatically imported. "So, can you turn our spreadsheets into xml files"?, they ask me. "Of course"!, I reply. And so it begins.
As I try to always do, I explored what potential future requests might entail surrounding this process, examined the possibility of this process' usefulness to other companies, and bore in mind the potential of using this process to accommodate OTHER spreadsheet sources and OTHER xml destination formats. The result was a full blown utility application that allows a user to define a spreadsheet layout, map that layout to a targeted xml layout, then be able to import spreadsheets and produce the xml files in the target format, fully populated and validated; upload those files to their FTP destination; and monitor the status of the uploaded files via feedback mechanisms built in to the target system.
Bear in mind this application is not finished, but I have a substantial amount of the architecture completed and code working up to the point of the FTP process, so its the ins and outs of some of those moving parts that I wanted to share with you.
System Overview: How it Works
1. User logs in and chooses 'Upload Claims'
2. selected spreadsheet is uploaded and validated
3. validated spreadsheet is imported via POI
4. xml node to spreadsheet column mappings are retrieved
for each row in the spreadsheet
5. blank xml file is created by applying an XML stylesheet (xslt) to the xsd on file;
6. node values are translated from their tokens to actual values
7. nodes are located within the xml file via their id and their value replaced with translated value
8. xml files are "cleaned up" (optional), removing all empty nodes and node trees by applying an XML stylesheet
9. completed xml files are FTP'd to their destination (based on client settings)
DETAILS
Maintaining XML Definitions in the System
So, the first aspect I tackled was the best way to maintain target XML definitions. Well, since I know I want to be able to validate the XML files I create, and since the existing methods of xml validation of necessity ARE a definition of the XML, it made perfect sense to me to start with either a DTD or an XML Schema that represents the targeted XML format. DTDs don't have near the depth of metadata that I would need for future use, so I opted to use XML Schemas (or, 'xsd' files). Every potential target XML format this system maintains will be represented by an existing XSD file stored in a subfolder of the app.
If you aren't familiar with XML Schemas, let me tell you they are invaluable if you want to validate your xml in any detail. You define elements, the data types they hold, any rules that should be applied to those values (such as "only one of the items in this list of values", "a date in the following format", etc.), what attributes an element has and the rules for the values THEY hold, what elements are required, how many instances of a particular element are legal, what order elements should appear in, whether an element is required or not, and pretty much ANYTHING else you can think of! If what you need to store about the XML is NOT available as a standard XSD item, then do what I did and create your OWN namespace within the XSD and add whatever other attributes you want!
For the system I am building, one vital step is to ensure that every node, or potential spot for storing a system value, had to have a unique id. Since there was nothing in the existing xs namespace that allowed for that, I created my own namespace named after the app and added an "ID" attribute to every element and attribute defined for the target XML format. Here's the header of my XSD (notice the 'xclaim' namespace defined):
xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
attributeFormDefault="unqualified"
xmlns:xclaim="http://www.somesite.com" >
Having defined the xclaim namespace allows me to legally do the following:
So, in my system, to add a new XML target, simply create an XSD schema file that 'describes' that format, complete with the xclaim namespace's 'id' attribute, and you have accomplished your goal!
Maintaining Spreadsheet Definitions
I wanted to be able to "head off at the pass" as much confusion or needed validation code as possible, so I opted to 'validate' an uploaded spreadsheet before attempting to process it. What this means to me is that, when a spreadsheet is uploaded I will first check to see that it contains all of the columns (appropriately named) as the user said it should and that all of those columns appear in the correct order. If the spreadsheet passes these two checks, I call it good and we proceed. Otherwise, I reject it and give them a nice message so saying.
In my UI I provide them a way to create and maintain spreadsheet definitions. Essentially they tell the app what the name of the column is and what order it appears in.

Simple enough, eh?
Maintaining Spreadsheet Column/XML Mappings
Ah, now THIS little bugger was much tougher than one may think at first glance. You see, properly populating your target XML poses many challenges, such as:
-not every column maps directly to only one xml node (element or attribute). This is a one to many relationship, for sure!;
-not every value needed in your xml file COMES from the incoming spreadsheet! Values such as today's date, the client's name, address, or other metadata, etc.;
-not every value needed in your xml file is a SINGLE value, but could very well be a CONCATENATED value! Such as in my case, where the values from three columns needed to be concated along with today's date and placed into a specific element's text, or where one attribute's value had to be the concatenation of a fixed string "COV" and an iterating value (COV1, COV2, etc.)
So you see, designing a storage mechanism to allow such fluidity and yet still be easily maintainable was no easy task to conquer! The solution I came up with accommodates all of the above, but still does have its limitations. Not a perfect solution, but close enough I'd say!
Here then is what the stored column-to-node mappings look like:

Some highlights of this approach:
-nodeID values correspond to the "xclaim:id" values spoken of earlier with regard to the XSD file;
-nodeID values with decimal counters indicate that there may be multiple occurrences of that node's branch in the final xml file. The decimal indicates which iteration of that branch the value belongs in;
-I created "tokens" as placeholders for those values that must be dynamically determined at run time. {column-x} refers to the value in column X, {column-x-name} is replaced with the actual column name, {iterator} is an incrementing value, {currentDateTime} is the current date and time, etc.;
-by including a targeted xsd for a node-value mapping (note the xsdID field), I can effectively map a single spreadsheet to multiple destination xml formats;
-the sortorder field is used in cases where a nodeid has multiple values indicated. This means that those multiple values should be concatenated in the order of 'sortorder'.
By using mysql's really awesome group_concat function, I can return a nice list of values, in their proper order. Here's the query I use to retrieve my mappings. Note the use of group_concat:
FROM mapping m
WHERE m.sheetid = <cfqueryparam value="#arguments.sheetID#" cfsqltype="cf_sql_integer" />
AND m.xsdID = <cfqueryparam value="#arguments.xsdID#" cfsqltype="cf_sql_integer" />
GROUP BY m.sheetID, m.xsdID, m.nodeID
ORDER BY m.nodeID
The results looks like this:

With the query above retrieved, I can now walk through each row of my imported spreadsheet and calculate the appropriate value for the targeted node IDs. :)
Creating the Blank, or Skeleton XML File
In this system, the only thing we maintain regarding a tarteted xml format is its XSD schema file. Since the XSD schema file is itself an XML file, the logical most expeditious approach to creating our skeleton xml is to leverage XML Stylesheets, or XSLT. I searched high and low for an XSLT that would transform a valid schema file into the skeleton XML it represents and could not find a single example. I thought surely such a thing would have been done many times over! So, since I couldn't find one, I wrote one. It's not all that lengthy really, so here it is in all its glory:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:str="http://exslt.org/strings"
xmlns:xalan="http://xml.apache.org/xalan"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xclaim="http://www.somesite.com"
exclude-result-prefixes="#default xsl str xalan xs xclaim">
<xsl:output method="xml" indent="no" encoding="ISO-8859-1" omit-xml-declaration="yes" />
<xsl:template match="/">
<xsl:element name="XACTDOC">
<xsl:apply-templates />
</xsl:element>
</xsl:template>
<!-- getting our elements, but excluding complexTypes with a name attribute (entity definitions, as i refer to them) -->
<xsl:template match="xs:complexType[not(@name)]/xs:sequence/xs:element|xs:complexType[not(@name)]/xs:all/xs:element|xs:complexType[not(@name)]/xs:choice/xs:element">
<xsl:call-template name="outputElement">
<xsl:with-param name="isEntity" select="false()" />
<xsl:with-param name="parentID" select="@xclaim:id" />
</xsl:call-template>
</xsl:template>
<xsl:template name="outputElement">
<xsl:param name="isEntity" />
<xsl:param name="parentID" />
<xsl:choose>
<!-- if the element we are looking at has a type specified, and that type is not a predefined xs: type (must be an entity!)... -->
<xsl:when test="@type and substring(@type,1,3)!='xs:'">
<xsl:element name="{@name}">
<xsl:call-template name="outputEntity">
<xsl:with-param name="thisname" select="@type" />
<xsl:with-param name="parentID" select="@xclaim:id" />
</xsl:call-template>
</xsl:element>
</xsl:when>
<xsl:otherwise><!-- dealing with an element that is NOT pre-defined -->
<!-- does this element have a type attribute that is enumerable? -->
<xsl:variable name="hasEnum">
<xsl:choose>
<xsl:when test="count(xs:complexType/xs:attribute[@name='type']/*/xs:restriction/xs:enumeration) > 0">
<xsl:value-of select="true()"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="false()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<xsl:variable name="thisnode">
<xsl:element name="{@name}"><!-- output the new element tag -->
<!-- if this element will contain text... -->
<xsl:if test="@type or count(xs:simpleType/xs:restriction|xs:complexType/xs:restriction)>0">
<xsl:value-of select="@xclaim:id" />
</xsl:if>
<!-- output all the attributes this element has. if it has a type attribute and the values are enumerable, put a marker in -->
<xsl:for-each select="xs:complexType/xs:attribute">
<xsl:attribute name="{@name}">
<xsl:choose>
<xsl:when test="$isEntity">
<!-- if this attribute is from a pre-defined entity, create a value placeholder for it that is unique yet predictable -->
<xsl:value-of select="concat($parentID,'-',substring-after(@xclaim:id,'_'))" />
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="@xclaim:id" />
</xsl:otherwise>
</xsl:choose>
</xsl:attribute>
</xsl:for-each>
<!-- this makes it all recursive! -->
<xsl:apply-templates />
</xsl:element>
</xsl:variable>
<xsl:copy-of select="$thisnode" />
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template name="outputEntity">
<xsl:param name="thisname" />
<xsl:param name="parentID" />
<xsl:for-each select="/xs:schema/xs:complexType">
<xsl:if test="@name=$thisname">
<xsl:for-each select="xs:sequence/xs:element|xs:choice/xs:element|xs:all/xs:element">
<xsl:call-template name="outputElement">
<xsl:with-param name="isEntity" select="true()" />
<xsl:with-param name="parentID" select="$parentID" />
</xsl:call-template>
</xsl:for-each>
</xsl:if>
</xsl:for-each>
</xsl:template>
<xsl:template match="xs:annotation/xs:documentation">
</xsl:template>
</xsl:stylesheet>
This XSLT is not quite generic enough to work with any XSD, but I hope to make it so before I'm finished (if anybody else wants to make that happen, please do!). The XSD I'm applying this XSLT to is fairly complex, and includes some "named complexTypes" as well. For those who aren't aware (I know I wasn't), a named complexType is almost the equivalent of defining a variable, and then re-using that value throughout the rest of your document simply by referring to the variable's name. For instance, if I have a chunk of my schema that defines an address block, and there are several places within my xsd where this exact same xml is repeated, rather than actually write it each time i can write it one time, give it a name such as "addressblock", and then wherever I want to include that I simply refer to it by name.
<cfset skeletonXML = xmltransform(rtrim(ltrim(ToSTring(targXSD))),ltrim(rtrim(xslt))) />
CONCLUSION - FOR NOW
There are a LOT more details to all of this that I haven't touched on, like the xpath statements I use to find and populate the values of nodes, how I reproduce node trees that have multiple occurrences, etc. I'll save more of the nitty gritty details of that sort of thing for another post. In the meantime, if you have any questions or input (especially on how to make my xslt more efficient!), please do leave a comment for me! Most of this was fairly new territory for me, so there's likely lots of room for improvement and efficiency, and I'm all ears.
You are not logged in, so your subscription status for this entry is unknown. You can login or register here.
No comments found.
