
I'm trying to create a Node.js web app hosted on a Linux server. The app must read and parse a table in a Word document.

I've looked around and saw that PowerShell can trivially accomplish this. The problem is that PowerShell is an MS scripting language, and its Mac port (Pash) is very unstable and chokes whenever I try to execute something as simple as this:

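# Requires Windows with MS Word installed; $filename is the full path to the document.
$wd = New-Object -ComObject Word.Application
$wd.Visible = $true
$doc = $wd.Documents.Open($filename)
# Print the text of the last (bottom-right) cell of every table.
$doc.Tables | ForEach-Object {
  $_.Cell($_.Rows.Count, $_.Columns.Count).Range.Text
}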

I've looked into other solutions like Docsplit, but it's too generic (i.e., it converts an entire Word doc to plain text, which is not granular enough for my purposes).

Some suggested using the Saaspose API, but it costs a lot of money! I think I can do this myself.

Any ideas?

abbood
  • Is it a doc or a docx? – Andy Arismendi Apr 02 '13 at 16:53
  • The PowerShell method also requires that you have MS Word installed. I think that's going to be pretty unlikely on a Linux server (not to mention the licensing concerns around doing so). – alroc Apr 02 '13 at 17:16
  • Not sure if it's possible, but all you have is a table in a Word document. Can't you copy the table to Excel and then export the Excel document as a CSV file? – E.V.I.L. Apr 02 '13 at 17:21
  • @BobLobLaw I thought the same thing. However, that's the explicit requirement of my client, who said that Excel is just a lot of trouble. I don't want to just shift the burden onto him, you know. – abbood Apr 02 '13 at 19:44
  • @AndyArismendi it can be either.. whichever makes my life easier – abbood Apr 02 '13 at 19:45
  • @alroc PowerShell is totally out of the question, since the author of its port to Linux/Mac has just [asserted](https://github.com/Pash-Project/Pash/issues/65) that Pash doesn't work with Word, regardless of whether it's installed. – abbood Apr 02 '13 at 19:47

2 Answers


Here's a python module that can read/write docx files:

https://github.com/mikemaccana/python-docx
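
If the file is a .docx, the tables can also be read without any third-party library: a .docx is a zip archive whose main part, word/document.xml, stores tables as w:tbl / w:tr / w:tc elements. Below is a minimal sketch using only the Python standard library. The function name read_tables is made up for illustration and is not part of python-docx, and the sketch ignores less common layouts (for example, rows wrapped in content controls).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_tables(docx_file):
    """Return every table as a list of rows, each row a list of cell strings.

    docx_file may be a path or a binary file-like object.
    Hypothetical helper for illustration; nested tables show up as
    separate entries in the result.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            row = []
            for tc in tr.findall(W + "tc"):
                # A cell holds one or more paragraphs; join them with newlines.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.findall(W + "p")
                ]
                row.append("\n".join(paragraphs))
            rows.append(row)
        tables.append(rows)
    return tables
```

With a hypothetical file report.docx, `read_tables("report.docx")[0][-1][-1]` would give the text of the bottom-right cell of the first table, which is what the PowerShell snippet in the question prints for each table. A Node.js app could call such a script as a child process and have it print the rows as JSON.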

Andy Arismendi
  • I like this one. It says tables can be composed, but I wonder if they can be read? Let me try. – abbood Apr 03 '13 at 04:24

If you're deploying on a Linux machine, it's probably best to use Docsplit and then parse the output text, or you could try Apache POI.

Another option would be to try the MS COM API running on Wine, but I'm not sure if it's compatible.

MisterMetaphor
  • Please excuse my skepticism, but any open source project that isn't hosted on GitHub or another modern open source repo host makes me suspicious. It tells me that interest has waned and/or support no longer exists. For example, [Pash](http://pash.sourceforge.net/) stayed dormant from 2008 until 2012, when [Jay Bazuzi](https://github.com/JayBazuzi) decided to continue the [project](https://github.com/Pash-Project/Pash). – abbood Apr 02 '13 at 19:56
  • Apache isn't an "open source repo host" - it's a foundation that supports & manages the development and management of many high-profile open source projects. Discounting a project just because of where it's hosted is a bit short-sighted. – alroc Apr 02 '13 at 20:15
  • @alroc You're right, and I'm aware of that distinction. I wasn't referring just to the technicality of hosting types; I was referring more to the kind of support one would expect from different open source clusters/communities. In my experience, GitHub is by far the most vibrant cluster, and any open source project I've cloned from it was always accompanied by quick and excellent support. I couldn't say the same about the rest. I'm sure there are excellent ones out there, but for me it's more of a trial and error thing, something I like to avoid as best I can. – abbood Apr 03 '13 at 04:27