0

I need to extract the text of a specific page from an XPS document. The extracted text should be stored in a string, because I want to read it aloud using Microsoft's SpeechLib. Please provide examples in C# only.

Thanks

Tim Trabold
  • Since you have tagged your question as C#, almost all answers will be in C# anyway, but why only C#? Are you allergic to other languages? – Nikhil Agrawal Sep 04 '12 at 11:07
  • No, but my company develops in C#, so I have to as well. – Tim Trabold Sep 04 '12 at 11:11
  • So what? Write it in any other language and then use an online converter (like this one: http://www.developerfusion.com/tools/convert/csharp-to-vb/#convert-again) to change it into your desired language. At my last company I coded in C# and at my present one I code in VB, and the syntax was only a problem for the first 2 days. – Nikhil Agrawal Sep 04 '12 at 11:17
  • 1
    -1 http://www.WhatHaveYouTried.com (Please update your question to provide some examples of what you've tried and I will happily remove the downvote.) – JDB Sep 05 '12 at 13:10

4 Answers

10

Add references to ReachFramework and WindowsBase, and add the following using statements:

using System.Text;
using System.Windows.Xps.Packaging;

Then use this code:

// Open the XPS package read-only; "/path" is a placeholder for the file path.
XpsDocument _xpsDocument = new XpsDocument("/path", System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader
    = _xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
// FixedPages is indexed from zero. documentViewerElement is a DocumentViewer
// control in the surrounding UI; without one, substitute the page index you need.
IXpsFixedPageReader _page
    = _document.FixedPages[documentViewerElement.MasterPageNumber];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
  while (_pageContentReader.Read())
  {
    // Each run of text on a fixed page is stored in a Glyphs element.
    if (_pageContentReader.Name == "Glyphs" && _pageContentReader.HasAttributes)
    {
      string unicodeString = _pageContentReader.GetAttribute("UnicodeString");
      if (unicodeString != null)
      {
        _currentText.Append(unicodeString);
      }
    }
  }
}
string _fullPageText = _currentText.ToString();
_xpsDocument.Close();

The text is stored in the UnicodeString attribute of each Glyphs element, so you have to walk the fixed page's markup with its XmlReader.
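To try the Glyphs/UnicodeString idea without opening an XPS package, here is a minimal sketch that runs the same XmlReader loop over a fixed page's markup passed in as a string. The class and method names (FixedPageText, FromMarkup) are made up for illustration. Putting a space between runs is my own choice and not something the answer above does. The "{}" handling assumes the XPS convention that a UnicodeString starting with an open brace is escaped with a "{}" prefix.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class FixedPageText
{
    // Hypothetical helper: collects the UnicodeString attribute of every
    // Glyphs element found in the given FixedPage markup.
    public static string FromMarkup(string fixedPageXml)
    {
        var text = new StringBuilder();
        using (XmlReader reader = XmlReader.Create(new StringReader(fixedPageXml)))
        {
            while (reader.Read())
            {
                // Only start tags named Glyphs carry text runs.
                if (reader.NodeType != XmlNodeType.Element || reader.LocalName != "Glyphs")
                    continue;

                string value = reader.GetAttribute("UnicodeString");
                if (string.IsNullOrEmpty(value))
                    continue;

                // Assumption: a leading "{}" is the XPS escape prefix, not text.
                if (value.StartsWith("{}", StringComparison.Ordinal))
                    value = value.Substring(2);

                // Design choice: separate runs with a space so words from
                // neighbouring Glyphs elements are not glued together.
                if (text.Length > 0)
                    text.Append(' ');
                text.Append(value);
            }
        }
        return text.ToString();
    }
}
```

For a page whose markup contains two Glyphs elements with UnicodeString="Hello" and UnicodeString="world", FixedPageText.FromMarkup returns "Hello world". With a real document you would run the same loop over the reader you get from _page.XmlReader instead of creating one from a string.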

Loren Pechtel
Sanjay
  • 2
    @Tim Trabold: Feedback for answer will be helpful. – Sanjay Sep 12 '12 at 10:06
  • I am getting the exception as: Error 1 The type 'System.IO.Packaging.Package' is defined in an assembly that is not referenced. You must add a reference to assembly 'WindowsBase, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'. –  Sep 26 '13 at 05:33
  • + Cleared it.. Great Job. –  Sep 26 '13 at 06:30
  • I'm seeing into the .xps I'm trying to parse but I'm finding that sometimes what shows as multiple strings (albeit all on a line) are coming through as one string! – Loren Pechtel May 12 '14 at 04:16
  • How do I read an XPS file using ASP.NET? I am getting the error "documentViewerElement does not exist in the current context". – Anjali Jul 16 '14 at 11:04
  • @Anjali - DocumentViewer is a Windows control. Try adding a DocumentViewer to a Windows user control and hosting that in your web page. I don't have much experience with ASP.NET. Check this link about hosting a Windows control in a web page: http://www.4guysfromrolla.com/articles/052604-1.aspx – Sanjay Jul 16 '14 at 13:57
2

A method that returns the text from all pages (a modified version of Amir's code, hope that's OK):

/// <summary>
///   Get all text strings from an XPS file.
///   Returns a list of lists (one for each page) containing the text strings.
/// </summary>
private static List<List<string>> ExtractTextFromXps(string xpsFilePath)
{
   var xpsDocument = new XpsDocument(xpsFilePath, FileAccess.Read);
   var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
   if (fixedDocSeqReader == null)
   {
      xpsDocument.Close();   // release the file before bailing out
      return null;
   }

   const string UnicodeString = "UnicodeString";
   const string GlyphsString = "Glyphs";

   var textLists = new List<List<string>>();
   foreach (IXpsFixedDocumentReader fixedDocumentReader in fixedDocSeqReader.FixedDocuments)
   {
      foreach (IXpsFixedPageReader pageReader in fixedDocumentReader.FixedPages)
      {
         var pageContentReader = pageReader.XmlReader;
         if (pageContentReader == null)
            continue;

         var texts = new List<string>();
         while (pageContentReader.Read())
         {
            if (pageContentReader.Name != GlyphsString)
               continue;
            if (!pageContentReader.HasAttributes)
               continue;
            if (pageContentReader.GetAttribute(UnicodeString) != null)
               texts.Add(pageContentReader.GetAttribute(UnicodeString));
         }
         textLists.Add(texts);   
      }
   }
   xpsDocument.Close();
   return textLists;
}

Usage:

var txtLists = ExtractTextFromXps(@"C:\myfile.xps");

int pageIdx = 0;
foreach (List<string> txtList in txtLists)
{
   pageIdx++;
   Console.WriteLine("== Page {0} ==", pageIdx);
   foreach (string txt in txtList)
      Console.WriteLine(" "+txt);
   Console.WriteLine();
}
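Since the question asks for the text of one specific page as a single string, here is a small sketch that builds on the list-of-lists result above. PageTextJoiner and GetPageText are hypothetical names, the page number is assumed to be 1-based, and joining the strings with a space is just one reasonable choice.

```csharp
using System;
using System.Collections.Generic;

public static class PageTextJoiner
{
    // Hypothetical helper: turns the strings of one page (1-based pageNumber)
    // into a single string, e.g. to hand over to a speech engine.
    public static string GetPageText(IList<List<string>> pages, int pageNumber)
    {
        if (pages == null)
            throw new ArgumentNullException(nameof(pages));
        if (pageNumber < 1 || pageNumber > pages.Count)
            throw new ArgumentOutOfRangeException(nameof(pageNumber));

        // Lists are zero-based, so page 1 lives at index 0.
        return string.Join(" ", pages[pageNumber - 1]);
    }
}
```

Usage would look like string pageText = PageTextJoiner.GetPageText(txtLists, 2); after first checking that ExtractTextFromXps did not return null.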
salle55
1
    private string ReadXpsFile(string fileName)
    {
        XpsDocument _xpsDocument = new XpsDocument(fileName, System.IO.FileAccess.Read);
        IXpsFixedDocumentSequenceReader fixedDocSeqReader
            = _xpsDocument.FixedDocumentSequenceReader;
        StringBuilder _fullPageText = new StringBuilder();
        // Walk every fixed document: pages can be spread over several documents, so
        // indexing FixedDocuments[0].FixedPages by the sequence-wide page count can
        // throw ArgumentOutOfRangeException.
        foreach (IXpsFixedDocumentReader _document in fixedDocSeqReader.FixedDocuments)
        {
            foreach (IXpsFixedPageReader _page in _document.FixedPages)
            {
                System.Xml.XmlReader _pageContentReader = _page.XmlReader;
                if (_pageContentReader == null)
                    continue;
                while (_pageContentReader.Read())
                {
                    if (_pageContentReader.Name != "Glyphs" || !_pageContentReader.HasAttributes)
                        continue;
                    string unicodeString = _pageContentReader.GetAttribute("UnicodeString");
                    if (unicodeString != null)
                        _fullPageText.Append(unicodeString);
                }
            }
        }
        _xpsDocument.Close();
        return _fullPageText.ToString();
    }
Nurkhan
  • I get ArgumentOutOfRangeException using this code, _document.FixedPages only contains a single element (even tho the XPS contains multiple pages). See: http://i.imgur.com/gpcKxCX.png – salle55 Jan 30 '17 at 16:02
0

Full Code of Class:

using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Windows.Xps.Packaging;

namespace XPS_Data_Transfer
{
    internal static class XpsDataReader
    {
        public static List<string> ReadXps(string address, int pageNumber)
        {
            var xpsDocument = new XpsDocument(address, System.IO.FileAccess.Read);
            var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
            if (fixedDocSeqReader == null) return null;

            const string uniStr = "UnicodeString";
            const string glyphs = "Glyphs";
            // The 1-based pageNumber selects a fixed document, and only that
            // document's first page is read (one page per document is assumed).
            var document = fixedDocSeqReader.FixedDocuments[pageNumber - 1];
            var page = document.FixedPages[0];
            var currentText = new List<string>();
            var pageContentReader = page.XmlReader;

            if (pageContentReader == null) return null;
            while (pageContentReader.Read())
            {
                if (pageContentReader.Name != glyphs) continue;
                if (!pageContentReader.HasAttributes) continue;
                // Dashboard.CleanReversedPersianText is a helper from the author's own
                // project (not shown here); drop the call if you don't need it.
                if (pageContentReader.GetAttribute(uniStr) != null)
                    currentText.Add(Dashboard.CleanReversedPersianText(pageContentReader.GetAttribute(uniStr)));
            }
            xpsDocument.Close();
            return currentText;
        }
    }
}

This returns the text of the requested page of the given file as a list of strings.

Amir