Read and edit a Word Open XML document in C#

Introduction
With the release of Microsoft Office 2007, Microsoft introduced the Office Open XML document format for Word, Excel and PowerPoint. It succeeds the binary Office file formats (.doc, .xls and .ppt) that appeared with Office 97.

With this new format, standardized by ECMA, Word files become simple zip packages containing XML files. It is therefore no longer necessary to have Microsoft Word to create or view such files: a simple text editor or a home-made application is enough.

The purpose of this article is to show how to read a Word document in the Open XML format. Through several examples, we will see the code needed to load an Open XML document, retrieve its properties (author, creation date, etc.), search for a word in it, and extract the images it contains.

I. Reminder on the structure of a WordprocessingML document
WordprocessingML is the set of conventions used to represent a Word document in the Open XML format. For Excel documents it is SpreadsheetML, and for PowerPoint documents it is PresentationML.

This article is not intended to introduce the architecture and structure of an Open XML document; you should already be familiar with them. If that is not the case, I advise you to consult these links: white papers and structure of an Open XML document. We will nevertheless give a short reminder of the structure of a WordprocessingML document (or package), which you can inspect by adding the .zip extension to a .docx file and opening it with your zip tool.
The three main components of the new format are:

Parts: each file contained in the tree is a part. Most are XML files, but there may also be binary files (pictures, videos, OLE objects, etc.) or even other Open XML files if the Word document embeds some.

Content types: these are metadata stored in the [Content_Types].xml file that describe the type of content held in each part (JPEG image, styles file, relationships file, etc.). They also tell a consumer how a given part should be read.

Relationships: they define the associations between a source and a target part. Relationships specify how the parts are combined to form a document. They are defined in the .rels files.
The docProps folder contains the document property files.
The document.xml file is the main part of a WordprocessingML document and contains the text of the document.
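To make the notions of part and content type more concrete, here is a minimal sketch that lists the parts of a package together with their content types. So that it runs without a real .docx file, it first builds a small package in memory. The class name PartListing, the helper names, the placeholder XML and the generic "application/xml" content type are assumptions made for this illustration; a real Word package uses more specific content types.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;
using System.Text;

public static class PartListing
{
    // Adds a small XML part to the package (hypothetical helper for this sketch).
    static void AddXmlPart(Package package, string uri, string xml)
    {
        PackagePart part = package.CreatePart(new Uri(uri, UriKind.Relative), "application/xml");
        byte[] bytes = Encoding.UTF8.GetBytes(xml);
        using (Stream stream = part.GetStream())
        {
            stream.Write(bytes, 0, bytes.Length);
        }
    }

    // Builds a tiny package in memory whose part names mirror a Word package.
    public static MemoryStream BuildSamplePackage()
    {
        MemoryStream buffer = new MemoryStream();
        using (Package package = Package.Open(buffer, FileMode.Create, FileAccess.ReadWrite))
        {
            AddXmlPart(package, "/word/document.xml", "<document/>");
            AddXmlPart(package, "/docProps/core.xml", "<coreProperties/>");
        }
        buffer.Position = 0;
        return buffer;
    }

    // Returns one "uri -> content type" line per part found in the package.
    public static List<string> DescribeParts(Stream packageStream)
    {
        List<string> lines = new List<string>();
        using (Package package = Package.Open(packageStream, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                lines.Add(part.Uri + " -> " + part.ContentType);
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in DescribeParts(BuildSamplePackage()))
        {
            Console.WriteLine(line);
        }
    }
}
```

Pointing DescribeParts at a stream opened on a real .docx file should list document.xml, the property parts, and any images, each with its declared content type.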

II. Prerequisites
To make the task easier, version 3.0 of the Microsoft .NET Framework includes a new packaging API, provided in the WindowsBase.dll assembly. The classes that make up the packaging API are in the System.IO.Packaging namespace.

You will need the .NET Framework 3.0; its SDK is optional.
To add the reference to your project in Visual Studio: in the Project menu, click Add Reference. If WindowsBase is not listed in the .NET tab, choose the Browse tab and look for it in Program Files\Reference Assemblies\Microsoft\Framework\v3.0.

You should also import the following namespaces in your code:
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.IO.Packaging;

III. Opening a Word Open XML document
The first thing to do in order to read an Open XML document is to load it into an object of type Package. This takes a single line, thanks to the Open method of the Package class. Remember to call the Close method at the end of your processing to close the package. Here is the code to open an Open XML document for reading and writing:

String docWord = @"C:\monFichier.docx";
// open the package for reading and writing
Package officePackage = Package.Open(docWord, FileMode.Open, FileAccess.ReadWrite);

// put the processing code here

// close the package
officePackage.Close();
Now that we have our Package object, we will see how to get the various parts that make up the document.
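Since Package implements IDisposable, one way to avoid forgetting the call to Close is to wrap the package in a using block, which closes it even if an exception is thrown. The sketch below also opens the file with FileAccess.Read, which is enough when nothing is modified. The class name SafeOpen, its helpers and the temporary sample file are assumptions for this illustration; a real program would open an existing .docx instead.

```csharp
using System;
using System.IO;
using System.IO.Packaging;
using System.Text;

public static class SafeOpen
{
    // Creates a one-part package on disk so the sketch has something to open.
    public static string CreateSampleFile()
    {
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".docx");
        using (Package package = Package.Open(path, FileMode.Create, FileAccess.ReadWrite))
        {
            PackagePart part = package.CreatePart(
                new Uri("/word/document.xml", UriKind.Relative), "application/xml");
            byte[] bytes = Encoding.UTF8.GetBytes("<document/>");
            using (Stream stream = part.GetStream())
            {
                stream.Write(bytes, 0, bytes.Length);
            }
        }
        return path;
    }

    // Opens the package read-only and counts its parts.
    // Leaving the using block disposes the package, which closes it.
    public static int CountParts(string path)
    {
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            int count = 0;
            foreach (PackagePart part in package.GetParts())
            {
                count++;
            }
            return count;
        }
    }

    public static void Main()
    {
        string path = CreateSampleFile();
        try
        {
            Console.WriteLine("Parts found: " + CountParts(path));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```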

IV. Reading the document properties
IV-A. Overview

The files containing the document properties are stored in the docProps directory, located at the root of the package.
The core.xml file contains a set of properties common to all Open XML files. These properties include the creator's name, the creation date, the title and the description. So whether you are handling a docx, xlsx or pptx document, these properties are always found in the same place.

The app.xml file contains properties specific to each type of Open XML package. For a WordprocessingML package (docx), these properties include the number of characters, words, lines, paragraphs and pages in the document. For a SpreadsheetML package (xlsx), they include the titles of the worksheets. For a PresentationML package (pptx), they include the presentation format, the number of slides and the number of notes.

In this section we will only look at the core.xml file.
Here is an example of what this file might contain:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dcmitype="http://purl.org/dc/dcmitype/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <dc:title>OpenXML and dotnet</dc:title>
    <dc:subject>Read a Word 2007 file</dc:subject>
    <dc:creator>Florian</dc:creator>
    <dc:description>comments</dc:description>
    <dcterms:created xsi:type="dcterms:W3CDTF">2007-05-20T10:16:00Z</dcterms:created>
    <dcterms:modified xsi:type="dcterms:W3CDTF">2007-05-20T10:16:00Z</dcterms:modified>
    <cp:category>Article dotnet</cp:category>
    <cp:contentStatus>Ongoing</cp:contentStatus>
    <cp:keywords>dotnet OpenXML</cp:keywords>
    <cp:revision>2</cp:revision>
</cp:coreProperties>
Note the use of different namespaces.

IV-B. Getting the core.xml part
We first retrieve the core.xml file as a PackagePart object. The Package object (which we built earlier when loading the Open XML document) has a GetRelationshipsByType method, which takes a relationship type as its parameter (we pass the type corresponding to the core.xml part) and returns a collection of PackageRelationship objects. Each of these objects represents an association between a source and a target (in our case, a core properties part). Since a package contains at most one core properties part, GetRelationshipsByType will not return more than one element here. Be aware that the collection may be empty, because property files are not mandatory in an Open XML package. The PackageRelationship object has a TargetUri property that gives the URI of the target (here, the URI of the core.xml part). From this URI it is then easy to retrieve the corresponding part with the GetPart method of the Package object.

Here is an illustration of this principle in code:
// relationship type of the core properties part
const String corePropertiesRelType = @"http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties";
// namespaces used in core.xml
const String cpPropertiesSchema = @"http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
const String dcPropertiesSchema = @"http://purl.org/dc/elements/1.1/";
const String dctermsPropertiesSchema = @"http://purl.org/dc/terms/";

PackagePart corePart = null;
Uri documentUri = null;
// get the part containing the properties
foreach (PackageRelationship relationship in officePackage.GetRelationshipsByType(corePropertiesRelType))
{
    // there is at most one core properties part in the package
    documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
    corePart = officePackage.GetPart(documentUri);
    break;
}
As you will have gathered, this code can be reused to retrieve any part of an Open XML package. It is enough to replace the relationship type to get the corresponding part (a PackagePart object).

Important note:
We retrieve the PackageRelationship objects from the Package object (officePackage). The TargetUri property of these objects therefore returns a URI relative to the package (the root of the package). For the core.xml part, this will be "docProps/core.xml". To retrieve the core.xml part with the GetPart method, we need its absolute URI. That is why we use the ResolvePartUri method, which returns the absolute URI of a part from the URI of a source (here "/", i.e. the package) and the relative URI of that part.
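The effect of ResolvePartUri can be observed in isolation, without opening any package. The following sketch resolves a target against the package root and, for comparison, against a part. The class name UriResolution, the Resolve helper and the second sample target are assumptions chosen for the illustration.

```csharp
using System;
using System.IO.Packaging;

public static class UriResolution
{
    // Resolves a relationship target against the URI of the relationship's source.
    public static string Resolve(string sourceUri, string targetUri)
    {
        Uri resolved = PackUriHelper.ResolvePartUri(
            new Uri(sourceUri, UriKind.Relative),
            new Uri(targetUri, UriKind.Relative));
        return resolved.ToString();
    }

    public static void Main()
    {
        // The source is the package root, as for the core properties part.
        Console.WriteLine(Resolve("/", "docProps/core.xml"));                // /docProps/core.xml

        // The source is a part: the target is resolved relative to that part's folder.
        Console.WriteLine(Resolve("/word/document.xml", "media/image1.png")); // /word/media/image1.png
    }
}
```

The second call shows why the source matters: the same relative target leads to a different absolute URI depending on which part owns the relationship.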

 

IV-C. Reading properties
Now that we have the PackagePart object corresponding to the core.xml file, we can load its contents into an XmlDocument object in order to read them. From there on, it is nothing more than reading an XML file.

Here is the code to load the core.xml part into an XmlDocument and to get some information from it (author, title, list of keywords and creation date of the document):
if (corePart != null)
{
    // build an XmlNamespaceManager containing the namespaces used
    NameTable nt = new NameTable();
    XmlNamespaceManager nsmgr = new XmlNamespaceManager(nt);
    nsmgr.AddNamespace("dc", dcPropertiesSchema);
    nsmgr.AddNamespace("cp", cpPropertiesSchema);
    nsmgr.AddNamespace("dcterms", dctermsPropertiesSchema);

    // load the part into an XmlDocument
    XmlDocument doc = new XmlDocument(nt);
    doc.Load(corePart.GetStream());

    XmlNode nodeAuteur = doc.DocumentElement.SelectSingleNode("//dc:creator", nsmgr);
    if (nodeAuteur != null)
        labelAuteur.Text = nodeAuteur.InnerText;

    XmlNode nodeTitre = doc.DocumentElement.SelectSingleNode("//dc:title", nsmgr);
    if (nodeTitre != null)
        labelTitre.Text = nodeTitre.InnerText;

    XmlNode nodeMotsClefs = doc.DocumentElement.SelectSingleNode("//cp:keywords", nsmgr);
    if (nodeMotsClefs != null)
        labelMotsClefs.Text = nodeMotsClefs.InnerText;

    XmlNode nodeDate = doc.DocumentElement.SelectSingleNode("//dcterms:created", nsmgr);
    if (nodeDate != null)
        labelDate.Text = DateTime.Parse(nodeDate.InnerText).ToShortDateString();
}
The code first builds an XmlNamespaceManager containing the namespaces used in the core.xml file. It then instantiates an XmlDocument, passing it the same NameTable as the XmlNamespaceManager. Finally, it calls the GetStream method to obtain the contents of the part and fills the XmlDocument from this stream.

It then retrieves the required elements with the SelectSingleNode method.

IV-D. Modifying the properties
After reading the document properties, let us see how to modify their values. There is nothing complicated about it: simply reuse the XmlNode objects obtained previously and change their value.

Thus, to modify the author's name, write:
// update the value
nodeAuteur.InnerText = "Toto";

These changes are made to the XmlDocument loaded in memory, so it must then be saved back into the Open XML package. To do so, use the Save method of the XmlDocument object:
// save the XML properties back into their part
using (Stream stream = corePart.GetStream(FileMode.Create, FileAccess.Write))
{
    doc.Save(stream);
}
That's it!
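As an aside, the packaging API also exposes the core properties directly through the PackageProperties property of the Package class, so core.xml does not have to be loaded by hand. The sketch below is an alternative to the XML approach used in this article, not the method described above. It sets two properties while creating a package in memory, then reads them back. The class name CoreProperties, the helper names and the sample values are assumptions for the illustration.

```csharp
using System;
using System.IO;
using System.IO.Packaging;

public static class CoreProperties
{
    // Creates an in-memory package and sets two core properties on it.
    public static MemoryStream BuildSamplePackage()
    {
        MemoryStream buffer = new MemoryStream();
        using (Package package = Package.Open(buffer, FileMode.Create, FileAccess.ReadWrite))
        {
            package.PackageProperties.Creator = "Florian";
            package.PackageProperties.Title = "OpenXML and dotnet";
        }
        buffer.Position = 0;
        return buffer;
    }

    // Reads the creator and the title back through the PackageProperties API.
    public static string ReadSummary(Stream packageStream)
    {
        using (Package package = Package.Open(packageStream, FileMode.Open, FileAccess.Read))
        {
            PackageProperties properties = package.PackageProperties;
            return properties.Creator + " / " + properties.Title;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ReadSummary(BuildSamplePackage()));
    }
}
```

The manual approach remains useful for understanding how parts and relationships work, and it is the only option for parts that have no dedicated API.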

V. Reading the main part of an Open XML document
The main part of a Word Open XML document is generally the document.xml file in the word folder (see the tree at the beginning of this tutorial). This is where the textual content of a Word document is found. We will illustrate reading this part by writing a method that searches for any word (or expression) in a Word file.

The first thing to do is to retrieve the main part of the Open XML document as a PackagePart object. For that, we simply reuse the method seen previously, changing only the relationship type to look for:
// relationship type of the main part
const String officeDocRelType = @"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

PackagePart mainPart = null;
Uri documentUri = null;
// get the main part of the document
foreach (PackageRelationship relationship in officePackage.GetRelationshipsByType(officeDocRelType))
{
    // there is only one part of this type in the package
    documentUri = PackUriHelper.ResolvePartUri(new Uri("/", UriKind.Relative), relationship.TargetUri);
    mainPart = officePackage.GetPart(documentUri);
    break;
}
"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" is the relationship type of the main part of an Open XML document.

We can then load the contents of this part into an XmlDocument object and search for a word with the Contains method:
if (mainPart != null)
{
    // load the part into an XmlDocument
    XmlDocument doc = new XmlDocument();
    doc.Load(mainPart.GetStream());

    // case-sensitive search
    if (doc.DocumentElement.InnerText.Contains(textBoxRechercher.Text))
    {
        MessageBox.Show("Text found in the document");
    }
    else
    {
        MessageBox.Show("Cannot find the text in the document");
    }
}
And it is no more complicated than that!
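The search above is case-sensitive. If that is not what you want, a possible variant is to use IndexOf with a StringComparison value instead of Contains. The sketch below works on an XML string rather than on a package part so that it is self-contained. The class name TextSearch, the ContainsText helper and the heavily simplified sample XML are assumptions for the illustration.

```csharp
using System;
using System.Xml;

public static class TextSearch
{
    // Returns true when the searched text occurs in the concatenated text
    // of the XML document, ignoring case.
    public static bool ContainsText(string xml, string searched)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        string text = doc.DocumentElement.InnerText;
        return text.IndexOf(searched, StringComparison.OrdinalIgnoreCase) >= 0;
    }

    public static void Main()
    {
        // Hypothetical, heavily simplified stand-in for word/document.xml.
        string xml =
            "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
            "<w:body><w:p><w:r><w:t>Hello Open XML</w:t></w:r></w:p></w:body></w:document>";

        Console.WriteLine(ContainsText(xml, "open xml")); // True
        Console.WriteLine(ContainsText(xml, "binary"));   // False
    }
}
```

Keep in mind that InnerText concatenates the text of all elements without separators, so a match can span two adjacent runs or paragraphs.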

VI. Extracting images from a Word Open XML document
Processing images requires a little more code (but really not much more). The principle is always the same: images are parts like any other, which we retrieve with the technique seen above. Of course, reading them differs from reading an XML file.

The first thing to do is to retrieve the main part of the Open XML document. We did this just above, so there is no need to repeat the code. You may wonder why we need this part. Indeed, so far we retrieved the part we were interested in directly with the same few lines of code, changing only the relationship type. So why not simply do the same thing again?

For a very simple reason: the package does not know where these images are. To understand this better, take another look at the tree of an Open XML file at the beginning of the article. At the root is a _rels folder with a .rels file inside. Here is an excerpt of what it contains:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId3"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties"
        Target="docProps/app.xml" />
    <Relationship Id="rId2"
        Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
        Target="docProps/core.xml" />
    <Relationship Id="rId1"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
        Target="word/document.xml" />
</Relationships>
We find a few old acquaintances here. This file tells the package where certain kinds of parts are located (such as the main part or the property parts). As you can see, there is no trace of any image. Now go to the _rels folder inside the word folder and open the document.xml.rels file (which lists the relationships between document.xml and other parts):
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Id="rId1"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles"
        Target="styles.xml" />
    <Relationship Id="rId5"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
        Target="media/image2.png" />
    <Relationship Id="rId4"
        Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
        Target="media/image1.png" />
</Relationships>
Here at last are our images! Note in passing the reference to the styles part, which indicates that it can be retrieved in the same way as the images.

What do we learn from all this? Simply that some parts (such as document.xml or core.xml) are linked directly to the package, while others (such as images) are linked to another part (here the document.xml part). Think of it this way: an Open XML package is linked to a main part, to property parts, and so on. The main part may itself be linked to other parts (styles, images, videos, etc.). Images are thus known to the main part, which is itself known to the package. In short, to get the images you must first get the main part.
As written, the code we used to retrieve a part only retrieves parts related to the package itself. We will therefore change it a little so that it can retrieve parts related to the main part of the package. In our case, we want to retrieve the related images. Here is the new version:

// relationship type for an image
const String imageRelType = @"http://schemas.openxmlformats.org/officeDocument/2006/relationships/image";
List<PackagePart> listePackageParts = new List<PackagePart>();
Uri imageUri = null;
// retrieve the parts corresponding to the images; the images are related to mainPart
foreach (PackageRelationship relationship in mainPart.GetRelationshipsByType(imageRelType))
{
    // relationship.TargetUri contains media/image1.jpg (for example)
    imageUri = PackUriHelper.ResolvePartUri(new Uri(mainPart.Uri.ToString(), UriKind.Relative),
        relationship.TargetUri);
    listePackageParts.Add(officePackage.GetPart(imageUri));
    // there is not necessarily a single image, so no break here
}
What is new? We have of course changed the relationship type to the one corresponding to images. We then declared a list of PackagePart objects: this time there is not a single part to retrieve but several (depending, of course, on the number of images in the package). The next novelty is the call to GetRelationshipsByType. We no longer call it on the Package object but on the main part (a PackagePart object). As said previously, we want to retrieve the relationships of type image whose source is the main part.

The TargetUri property of the PackageRelationship objects returns the URI of an image part relative to the main part (for example media/image1.jpg). To get the part, we need its absolute URI, which we build with the ResolvePartUri method, passing it the URI of the source (the main part) and the relative URI of the image.

We now have a list of PackagePart objects corresponding to the images of the package. As in the previous examples, we can get their contents with the GetStream method (except, obviously, that this time the stream is not loaded into an XML document).
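The whole chain (package, then main part, then images) can be tried end to end with the following sketch. It builds a small package in memory with one main part and one image part, then walks the relationships as described above. The class name ImageParts, the helper names, the placeholder contents and the generic "application/xml" content type of the main part are assumptions for the illustration; the two relationship types are the ones used in this article.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;
using System.Text;

public static class ImageParts
{
    const string OfficeDocRelType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
    const string ImageRelType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image";

    // Writes raw bytes into a part (hypothetical helper for this sketch).
    static void Fill(PackagePart part, byte[] bytes)
    {
        using (Stream stream = part.GetStream())
        {
            stream.Write(bytes, 0, bytes.Length);
        }
    }

    // Builds an in-memory package: root -> main part -> one image part.
    public static MemoryStream BuildSamplePackage(byte[] imageBytes)
    {
        MemoryStream buffer = new MemoryStream();
        using (Package package = Package.Open(buffer, FileMode.Create, FileAccess.ReadWrite))
        {
            PackagePart main = package.CreatePart(
                new Uri("/word/document.xml", UriKind.Relative), "application/xml");
            Fill(main, Encoding.UTF8.GetBytes("<document/>"));
            package.CreateRelationship(main.Uri, TargetMode.Internal, OfficeDocRelType);

            PackagePart image = package.CreatePart(
                new Uri("/word/media/image1.png", UriKind.Relative), "image/png");
            Fill(image, imageBytes);
            main.CreateRelationship(
                new Uri("media/image1.png", UriKind.Relative), TargetMode.Internal, ImageRelType);
        }
        buffer.Position = 0;
        return buffer;
    }

    // Counts the bytes of a stream by reading it to the end.
    static long CountBytes(Stream stream)
    {
        byte[] chunk = new byte[1024];
        long total = 0;
        int read;
        while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
        {
            total += read;
        }
        return total;
    }

    // Walks package -> main part -> image parts and describes each image found.
    public static List<string> DescribeImages(Stream packageStream)
    {
        List<string> result = new List<string>();
        using (Package package = Package.Open(packageStream, FileMode.Open, FileAccess.Read))
        {
            foreach (PackageRelationship docRel in package.GetRelationshipsByType(OfficeDocRelType))
            {
                Uri mainUri = PackUriHelper.ResolvePartUri(
                    new Uri("/", UriKind.Relative), docRel.TargetUri);
                PackagePart main = package.GetPart(mainUri);
                foreach (PackageRelationship imgRel in main.GetRelationshipsByType(ImageRelType))
                {
                    Uri imageUri = PackUriHelper.ResolvePartUri(main.Uri, imgRel.TargetUri);
                    PackagePart image = package.GetPart(imageUri);
                    using (Stream stream = image.GetStream(FileMode.Open, FileAccess.Read))
                    {
                        result.Add(image.Uri + " (" + CountBytes(stream) + " bytes)");
                    }
                }
                break; // only one main part is expected
            }
        }
        return result;
    }

    public static void Main()
    {
        foreach (string line in DescribeImages(BuildSamplePackage(new byte[] { 1, 2, 3 })))
        {
            Console.WriteLine(line);
        }
    }
}
```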

Several options are now open to you. For example, you can build an object of type Image:
foreach (PackagePart imagePart in listePackageParts)
{
    Image image = Image.FromStream(imagePart.GetStream());
    // process the image as needed
}
Or you can save the images to disk:
// save each image part to disk
foreach (PackagePart imagePart in listePackageParts)
{
    // retrieve the name of the image (last segment of the part URI)
    String[] tab = imagePart.Uri.ToString().Split(new Char[] { '/' });
    String nomImage = tab[tab.Length - 1];
    using (Stream sourceStream = imagePart.GetStream())
    {
        // destination directory
        String path = Path.Combine(@"C:\", "images");
        if (!Directory.Exists(path))
            Directory.CreateDirectory(path);
        using (FileStream targetStream = new FileStream(Path.Combine(path, nomImage),
            FileMode.Create, FileAccess.Write))
        {
            // copy the part to the file, 1024 bytes at a time
            byte[] buffer = new byte[1024];
            int bytesRead = sourceStream.Read(buffer, 0, 1024);
            while (bytesRead > 0)
            {
                targetStream.Write(buffer, 0, bytesRead);
                bytesRead = sourceStream.Read(buffer, 0, 1024);
            }
        }
    }
}
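If you target .NET Framework 4 or later (an assumption, since this article is written for version 3.0), the manual read and write loop can be replaced by the Stream.CopyTo method. The sketch below copies a MemoryStream to a temporary file to stay self-contained; with a package, the source would be the stream returned by GetStream on an image part. The class name StreamCopy and the SaveTo helper are assumptions for the illustration.

```csharp
using System;
using System.IO;

public static class StreamCopy
{
    // Copies everything from the source stream to a new file
    // and returns the number of bytes written.
    public static long SaveTo(Stream source, string targetPath)
    {
        using (FileStream target = new FileStream(targetPath, FileMode.Create, FileAccess.Write))
        {
            source.CopyTo(target);
            return target.Length;
        }
    }

    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".bin");
        try
        {
            using (MemoryStream source = new MemoryStream(new byte[] { 1, 2, 3, 4 }))
            {
                Console.WriteLine("Bytes written: " + SaveTo(source, path));
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```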

VII. Going further
Microsoft has released a package of code snippets for Visual Studio 2005 covering the Open XML document formats (Word, Excel and PowerPoint). It is available here: snippets. These code samples are a treasure trove of information for better understanding how to manipulate the Open XML format.

VIII. Conclusion
Through this article we have seen the basics of reading Word documents in the Open XML format, using the building blocks of the .NET Framework 3.0 around the System.IO.Packaging namespace.
Office documents (Word in this example) can now be read without Office being installed and without using the Office Primary Interop Assemblies, as was the case with the old formats.