
Structure of OpenXML files


Foreword
OpenXML is the new file format adopted by the documents of the Office suite, starting with the 2007 version. This format, the result of a collaboration between Microsoft, Intel and Apple, among others, is completely royalty-free, and its longevity and its independence from any single vendor are secured by its adoption as an ECMA standard (ISO standardization should soon follow).

OpenXML breaks radically with the proprietary binary formats of previous versions of Office (up to the 2003 version, which began the transition) by adopting XML, an open language, as its storage format. Developers can now read, create, edit and display Office documents on different media without depending on Microsoft applications, using tools such as XSLT, SAX or DOM, either directly or through the OpenXML libraries that should not be long in appearing.

The OpenXML standard is very rich: its body runs to several thousand pages. In this introductory article we will restrict ourselves to the section dealing with the organization of the contents of OpenXML files, entitled Open Packaging Conventions (OPC), and more specifically to the application of these conventions to Office documents.

1. Anatomy of an OpenXML file
1.1. Internal structure of an OpenXML file

Office OpenXML documents are files compressed in the Zip format. Their content can therefore be viewed by decompressing them with any utility that recognizes the Zip format.

An unzipped Word 2007 file shows the following typical structure (1):
Contents of an Office OpenXML file (Word)
Once unpacked, an OpenXML file reveals a myriad of other files. In OpenXML terminology, the set of these files is called a package. This package is made up of a collection of files called parts, arranged in a tree of which they are the leaves. This fragmented structure contrasts with the monolithic XML format used by Office 2003, in which the entire document is contained in a single XML file. Here we are dealing with a modular structure, in which each part, and each directory containing parts, is stored as an entry of the Zip file.
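
As a minimal sketch, Python's standard zipfile module is enough to list the entries of a package. The small archive built in memory here is only a stand-in for a real .docx file (an assumption made so that the example is self-contained), and the helper name list_entries is hypothetical.

```python
import io
import zipfile


def list_entries(package):
    """Return the names of all entries stored in a Zip-based package."""
    with zipfile.ZipFile(package) as zf:
        return zf.namelist()


# Build a small stand-in archive in memory; a path to a real .docx file
# can be passed to list_entries() in exactly the same way.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("_rels/.rels", "<Relationships/>")
    zf.writestr("word/document.xml", "<document/>")

entries = list_entries(buffer)
print(entries)
```

Note that Zip entry names carry no leading slash, whereas the part names discussed in this article are written starting from the package root.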

1.2. Parts
OpenXML gives each part a unique name, made up of the logical path leading from the root of the package to the file that constitutes the part itself. For our example document, we can thus draw up this (partial) list of part names:
* /[Content_Types].xml
* /_rels/.rels
* /word/document.xml
* /word/styles.xml
* /word/_rels/document.xml.rels
* /docProps/core.xml
* /word/media/image1.png
* …

The parts that make up an OpenXML document can be divided into two categories:
1. Parts that contain the data (text, images, sounds, videos, etc.) that make up the document itself. These parts may contain data defined by OpenXML (WordprocessingML in our example), other XML data whose schema is not part of the OpenXML specifications, binary data (OLE objects, JPEG or PNG images, AVI videos, etc.), plain text, and so on.
2. Parts that contain information about the internal structure of the package, including the content type of the other parts and the logical links between them. These parts contain XML data whose schema is defined by the OpenXML standard (more precisely, by the Open Packaging Conventions).

OpenXML does not define what the names of parts of the first category, those that contain the document's data, should be. Since each application that generates OpenXML files can define its own part names, how can an application that reads them get to parts whose name and content it does not know in advance?

This is made possible by the parts of the second category, which contain information that identifies precisely the role of each of the other parts of the package and the links between them. These "special" parts are of two kinds: the content types file and the relationship files (relationship parts). Both are the key to accessing the data of an OpenXML document, so we will study them in detail.

1.3. The content types file
The content types file of an Office OpenXML document is the part named /[Content_Types].xml.
It is placed at the root of the package, and its name is invariably the same from one OpenXML document to another, as defined by the standard.
This file contains an XML document that covers all the parts that make up the package and associates a MIME type with each of them. Here is an excerpt of this file from our example document:
Contents of [Content_Types].xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Override PartName="/word/footnotes.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml"/>
<Default Extension="png" ContentType="image/png"/>
<Default Extension="emz" ContentType="image/x-emz"/>
<Default Extension="xls" ContentType="application/vnd.ms-excel"/>
<Override PartName="/word/comments.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml"/>
<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
<Override PartName="/word/numbering.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.numbering+xml"/>
<Override PartName="/word/styles.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.styles+xml"/>
<Override PartName="/word/endnotes.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/>

</Types>
The schema of this XML document consists of the root element Types and its two child elements, Default and Override.
Default defines a default MIME type, given by the value of the ContentType attribute, for the parts whose name ends with the extension given in the Extension attribute. Thus, in our example, all parts whose names end in ".png" will have the MIME type "image/png", and those ending in ".xml" will have the MIME type "application/xml".

The Override element indicates that a part has a MIME type other than the default defined for parts with the same extension. The name of the part that receives this specific MIME type is given in the PartName attribute. For example, the part named /word/document.xml has the MIME type application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml, and not application/xml. This MIME type, like all those of parts containing WordprocessingML, has been specifically defined by Microsoft for Office OpenXML documents.

Thanks to this content types file, a client application can determine the exact content of each part without having to make more or less risky guesses from its extension, or to search its content for a "magic" value indicating the nature of the file.
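
The lookup described in this section can be sketched as follows, using Python's standard xml.etree.ElementTree module. The function name content_type_of is hypothetical, the sample input is a shortened copy of the excerpt shown earlier, and comparing part names and extensions case-insensitively is a choice made for this sketch.

```python
import xml.etree.ElementTree as ET

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Shortened sample input, reduced to two Default entries and one Override.
CONTENT_TYPES_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
<Default Extension="png" ContentType="image/png"/>
<Default Extension="xml" ContentType="application/xml"/>
<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""


def content_type_of(part_name, content_types_xml):
    """Return the MIME type declared for a part, or None if nothing matches."""
    root = ET.fromstring(content_types_xml)
    # An Override for this exact part name takes precedence over any Default.
    for override in root.findall(CT_NS + "Override"):
        if override.get("PartName", "").lower() == part_name.lower():
            return override.get("ContentType")
    # Otherwise fall back to the Default declared for the part's extension.
    extension = part_name.rsplit(".", 1)[-1].lower()
    for default in root.findall(CT_NS + "Default"):
        if default.get("Extension", "").lower() == extension:
            return default.get("ContentType")
    return None


print(content_type_of("/word/document.xml", CONTENT_TYPES_XML))
print(content_type_of("/word/media/image1.png", CONTENT_TYPES_XML))
```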

1.4. Relationship files
These are XML documents containing a set of relationships. A relationship is a mapping between one part, the source part, which is always implicit, and another part of the package (the target). Relationship files are the backbone of the package.

The location of these files in the package and their naming follow several rules defined in the OpenXML specification. The first stipulates that the relationship file associated with a source part must be placed in a directory named _rels, located in the same directory as the source part. The second states that the name of the relationship file must be the name of the associated part with the extension .rels appended.

For example, the relationship file of the part /word/document.xml will be located in the package at /word/_rels, and will be named /word/_rels/document.xml.rels.
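
The two naming rules can be sketched in a few lines with Python's standard posixpath module. The function name rels_name_for is hypothetical, and passing "/" to stand for the package root is a convention of this sketch.

```python
import posixpath


def rels_name_for(part_name):
    """Return the name of the relationship file associated with a part."""
    directory, filename = posixpath.split(part_name)
    # Rule 1: the file lives in a _rels directory next to the source part.
    # Rule 2: its name is the part's file name with ".rels" appended.
    return posixpath.join(directory, "_rels", filename + ".rels")


print(rels_name_for("/word/document.xml"))
print(rels_name_for("/"))
```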

1.4.1. Contents of a relationship file
Consider the contents of a relationship file, for example /word/_rels/document.xml.rels:
Contents of /word/_rels/document.xml.rels
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
<Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footnotes" Target="footnotes.xml"/>
<Relationship Id="rId15" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer1.xml"/>

</Relationships>
The schema of this document consists of the root element Relationships and its child element Relationship.

Each Relationship element defines a relationship between the source part (document.xml here) and another part of the package. The characteristics of the relationship are spread across the attributes of this element:
* The Type attribute indicates the nature of the relationship. It takes the form of a URI whose semantics are specific to the application that produced the document. Each vendor implementing OpenXML defines its own set of URIs; those used by Office OpenXML documents all share the prefix "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"
* The Target attribute indicates the part (or resource) targeted by the relationship, in the form of a URI resolved relative to the source part
* The Id attribute is the unique identifier of the relationship; it is used within document.xml to refer to this relationship and, through it, to the target part
* The optional TargetMode attribute indicates whether the target of the relationship is a part located in the package or, if its value is "External", a resource external to the package; in the absence of this attribute, the target of the relationship is internal to the package

To put one of these relationships into natural language: the part /word/document.xml has a relationship, named rId2, with the part /word/styles.xml, and the type of this relationship is the URI "http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles". Reading the OpenXML specification tells us that this URI indicates that the target resource contains the definitions of the character styles, paragraph styles and so on used by the source part. We might have guessed as much from the name of the target part, but the URI confirms it formally and unambiguously.
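
As an illustration, the attributes described above can be read with Python's standard library. This is a sketch under stated assumptions: the function name read_relationships is hypothetical, the sample input is a shortened relationship file, and the hyperlink entry with its example.com target was added here only to show the TargetMode attribute.

```python
import posixpath
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
REL_PREFIX = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/"

# Shortened sample; the rId20 hyperlink entry is an illustrative addition.
RELS_XML = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
<Relationship Id="rId20" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="http://example.com/" TargetMode="External"/>
</Relationships>"""


def read_relationships(source_part, rels_xml):
    """Map each relationship Id to a (type, target, is_external) tuple."""
    base_dir = posixpath.dirname(source_part)
    relationships = {}
    for rel in ET.fromstring(rels_xml).findall(REL_NS + "Relationship"):
        target = rel.get("Target")
        external = rel.get("TargetMode") == "External"
        if not external:
            # Internal targets are resolved against the source part's directory.
            target = posixpath.normpath(posixpath.join(base_dir, target))
        relationships[rel.get("Id")] = (rel.get("Type"), target, external)
    return relationships


rels = read_relationships("/word/document.xml", RELS_XML)
for rel_id, (rel_type, target, external) in rels.items():
    print(rel_id, rel_type, target, external)
```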

1.4.2. The main relationship file
Among the relationship files that a package may contain, one is a little special, since it is associated not with a source part but with the root of the package itself. This relationship file has the same structure as the one seen above, and is invariably named /_rels/.rels (note that this name follows the same conventions as the other relationship files).
Consider the contents of this file for our example document:
Contents of /_rels/.rels
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" Target="docProps/app.xml"/>
<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" Target="docProps/core.xml"/>
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>
Of all the relationships that the main relationship file may contain, this tutorial is concerned only with the three most important:


| Relationship type URI | Mandatory | Description |
| --- | --- | --- |
| http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument | Yes | The target of this relationship is the main part of the document, which contains the text of the document (word/document.xml in our example) |
| http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties | No | The part targeted by this relationship contains metadata common to all Office documents, such as the creation date, title, description, creator, etc. (Dublin Core) |
| http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties | No | The part targeted by this relationship contains properties specific to the type of Office document represented by the main part; for WordprocessingML, these properties are the number of pages, characters, words, paragraphs, etc.; for SpreadsheetML, the number of worksheets, etc. |

The main relationship file is the preferred entry point for accessing any part of the package: from the part containing the body of the document, its related parts can be reached through its relationship file. If a related part itself has a relationship file, those relationships can be followed in turn, and so on, until the whole package tree has been traversed.
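
This traversal can be sketched as a recursive walk that starts from the package root. The code below is an illustration under stated assumptions: the names rels_entry_for, walk_package and make_rels are hypothetical, the package is a small stand-in built in memory, and the relationship type "urn:example:placeholder" is a placeholder rather than a real OpenXML type.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

REL_XMLNS = "http://schemas.openxmlformats.org/package/2006/relationships"
REL_NS = "{" + REL_XMLNS + "}"


def rels_entry_for(part_name):
    """Zip entry name (no leading slash) of a part's relationship file."""
    directory, filename = posixpath.split(part_name)
    return posixpath.join(directory, "_rels", filename + ".rels").lstrip("/")


def walk_package(zf, source="/", seen=None):
    """Yield every internal part reachable from `source` via relationships."""
    seen = set() if seen is None else seen
    entry = rels_entry_for(source)
    if entry not in zf.namelist():
        return  # this part has no relationship file of its own
    base_dir = posixpath.dirname(source)
    for rel in ET.fromstring(zf.read(entry)).findall(REL_NS + "Relationship"):
        if rel.get("TargetMode") == "External":
            continue  # external resources are not parts of the package
        target = posixpath.normpath(posixpath.join(base_dir, rel.get("Target")))
        if target not in seen:
            seen.add(target)
            yield target
            yield from walk_package(zf, target, seen)


def make_rels(*targets):
    """Build a relationship file with placeholder types (sample data only)."""
    items = "".join(
        '<Relationship Id="rId%d" Type="urn:example:placeholder" Target="%s"/>' % (i, t)
        for i, t in enumerate(targets, 1)
    )
    return '<Relationships xmlns="%s">%s</Relationships>' % (REL_XMLNS, items)


# Stand-in package built in memory so that the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("_rels/.rels", make_rels("word/document.xml"))
    zf.writestr("word/_rels/document.xml.rels", make_rels("styles.xml", "media/image1.png"))
    zf.writestr("word/document.xml", "<document/>")
    zf.writestr("word/styles.xml", "<styles/>")
    zf.writestr("word/media/image1.png", b"")

with zipfile.ZipFile(buffer) as zf:
    reachable = list(walk_package(zf))
print(reachable)
```

The seen set guards against visiting the same part twice when several relationships point to it.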

1.5. Minimal structure of an OpenXML file
We have seen that the parts containing the properties and metadata of the document are optional. What about the other parts? In fact, the same is true of most of the other parts in the tree of our example document: a document with no header or footer will have no parts named /word/footer.xml and /word/header.xml. Which parts must an Office OpenXML file include, at a minimum, to be readable by the Office applications? There are three:
* The content types file /[Content_Types].xml
* The main relationship file /_rels/.rels
* The part containing the body of the document (/word/document.xml for a Word document)
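
A package limited to these three parts can be assembled with Python's zipfile module, as in the sketch below. The function name build_minimal_docx is hypothetical, and the body of word/document.xml (a single paragraph in the WordprocessingML namespace) is an assumption of this example that goes beyond what the article covers; the result has not been checked against Office here.

```python
import io
import zipfile

XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'

CONTENT_TYPES = XML_DECL + (
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

ROOT_RELS = XML_DECL + (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)

# Assumed minimal body: one paragraph containing one run of text.
DOCUMENT = XML_DECL + (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>'
    '</w:document>'
)


def build_minimal_docx():
    """Return the bytes of a package holding only the three required parts."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", ROOT_RELS)
        zf.writestr("word/document.xml", DOCUMENT)
    return buffer.getvalue()


data = build_minimal_docx()
print(len(data), "bytes")
```

Writing the returned bytes to a file with a .docx extension gives a document to try in a word processor.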

2. Scenario: reading an Office OpenXML file
Now that we have examined how an OpenXML file is structured, we can establish the sequence of tasks an application must carry out to access the contents of the document.

This process takes place in 4 phases:

1. Open the main relationship file (/_rels/.rels) and extract the target of the relationship of type "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
2. Open the part /[Content_Types].xml and use the value obtained previously as the key to find the MIME type of the main part of the document
3. (Optional) Read the general and specific document properties by opening the parts targeted by the corresponding relationships in the main relationship file; the specific properties are interpreted according to the document type detected in phase 2
4. Read the main part of the document identified in phase 1, and access the other parts of the document through its relationship file
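
The phases above can be sketched as follows with Python's standard library. This is an illustration rather than a complete reader: the function names find_main_part, content_type_of and open_document are hypothetical, the optional phase 3 is left out, the package is a small stand-in built in memory, and Zip entry names are assumed to match part names exactly.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
OFFICE_DOCUMENT = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
WORD_MAIN = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"


def find_main_part(zf):
    """Phase 1: return the part targeted by the officeDocument relationship."""
    root = ET.fromstring(zf.read("_rels/.rels"))
    for rel in root.findall(REL_NS + "Relationship"):
        if rel.get("Type") == OFFICE_DOCUMENT:
            return posixpath.normpath(posixpath.join("/", rel.get("Target")))
    raise ValueError("no officeDocument relationship in /_rels/.rels")


def content_type_of(zf, part_name):
    """Phase 2: look the part up in [Content_Types].xml (Override, then Default)."""
    root = ET.fromstring(zf.read("[Content_Types].xml"))
    for override in root.findall(CT_NS + "Override"):
        if override.get("PartName") == part_name:
            return override.get("ContentType")
    extension = part_name.rsplit(".", 1)[-1]
    for default in root.findall(CT_NS + "Default"):
        if default.get("Extension") == extension:
            return default.get("ContentType")
    return None


def open_document(package):
    """Run phases 1, 2 and 4 and return (main part name, MIME type, raw body)."""
    with zipfile.ZipFile(package) as zf:
        main_part = find_main_part(zf)
        mime_type = content_type_of(zf, main_part)
        body = zf.read(main_part.lstrip("/"))  # phase 4: read the main part
    return main_part, mime_type, body


# Stand-in package built in memory so that the sketch is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "[Content_Types].xml",
        '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
        '<Default Extension="xml" ContentType="application/xml"/>'
        '<Override PartName="/word/document.xml" ContentType="' + WORD_MAIN + '"/>'
        '</Types>',
    )
    zf.writestr(
        "_rels/.rels",
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        '<Relationship Id="rId1" Type="' + OFFICE_DOCUMENT + '" Target="word/document.xml"/>'
        '</Relationships>',
    )
    zf.writestr("word/document.xml", "<document/>")

main_part, mime_type, body = open_document(buffer)
print(main_part, mime_type)
```
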
You may be surprised that the first two phases are needed to determine which type of Office document we are dealing with: could we not rely on the file extension, for example .docx for a Word document?

It is indeed possible, if you fully control the production of the documents (for example, because you, or rather your application, generate them) and can be sure that the extension matches the type. If that is not the case, for example if the documents come from a third party over which you have no control, consulting the main relationship file and identifying the MIME type is the most robust and reliable way to determine the type of a document and handle it accordingly.

Likewise, you might be tempted to bypass the relationship files when reaching the different parts of the document, on the assumption that they always have the same names: word/document.xml would always contain the body of the document, and word/styles.xml the styles used in the document. Again, this holds only if you fully control the generation of these documents. If the documents are generated by a process outside your control, or even directly by Office, you cannot be absolutely certain that this naming is always the same. Rather than make risky assumptions about the role of each part of the package, it is far wiser to rely on the relationship files to piece the puzzle together and identify the parts.
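
To illustrate the point, the sketch below locates the styles part by its relationship type rather than by its name. The function name find_parts_by_type is hypothetical, and the unusual target "custom/look.xml" is invented sample data, chosen to show that the lookup does not depend on the part being called styles.xml.

```python
import posixpath
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
STYLES_TYPE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"


def find_parts_by_type(source_part, rels_xml, type_uri):
    """Return the internal parts that the source part targets with this type."""
    base_dir = posixpath.dirname(source_part)
    return [
        posixpath.normpath(posixpath.join(base_dir, rel.get("Target")))
        for rel in ET.fromstring(rels_xml).findall(REL_NS + "Relationship")
        if rel.get("Type") == type_uri and rel.get("TargetMode") != "External"
    ]


# Invented sample: the styles part deliberately has an unexpected name.
RELS_XML = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="' + STYLES_TYPE + '" Target="custom/look.xml"/>'
    '</Relationships>'
)

style_parts = find_parts_by_type("/word/document.xml", RELS_XML, STYLES_TYPE)
print(style_parts)
```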

Conclusion
The technologies Microsoft chose for its OpenXML format, XML and Zip, open the door to its manipulation from virtually all current development platforms. All that remains is for you to take advantage of this information to manipulate OpenXML files with your favorite language!