
I am using Apache POI to read .doc files, and I want to select some of the content to form new .doc files. Specifically, is it possible to write the content of a Paragraph in a Range to a new file? Thank you.

HWPFDocument doc = new HWPFDocument(fs);
Range range = doc.getRange();
for (int i = 0; i < range.numParagraphs(); i++) {
    // Here I want to write the content of each Paragraph
    // into a new .doc file ("doc1", "doc2", ...)
    // instead of calling doc.write(pathName), which only writes a single .doc file.
}
akash
flyingmouse
  • Do you mean you have a .doc with, let's say, 100 paragraphs, and you want 2 .doc files: the first with paragraphs 21-30, and the second with all the paragraphs EXCEPT 21-30, that is 1-20 and 31-100? If you want to split like that, then it seems to me that `doc.getRange()` alone will not work, as it covers ALL the paragraphs. Can you specify what your criteria for SPLITTING are? Maybe you want to extract one specific **chapter** into another file? – DenisFLASH Aug 02 '14 at 08:36
  • Thank you for your reply. My criteria are a little bit complex (font related). Your example is enough for this question: split it into 2 files, where 21-30 is one file and the rest is the other. – flyingmouse Aug 02 '14 at 08:51
  • Is it an obligation to work with .doc files (HWPFDocument)? POI offers many more possibilities for .docx files (XWPFDocument). If it is obligatory, I will keep trying with .doc, but there is a much better chance that I'll be able to help you with .docx – DenisFLASH Aug 02 '14 at 11:27
  • Thank you for your reply. I am dealing with WTO-related documents. All the documents downloaded from the official website are either .doc or .pdf. I am not sure whether all these files can be converted to .docx without any problems. I would appreciate it if .doc files could be handled. (P.S. the fonts and styles of the text need to be kept in the new files) – flyingmouse Aug 02 '14 at 13:40
  • I think XWPF is also acceptable if there is no problem saving .doc as .docx. My task is this: suppose I have a .doc file that contains information on different countries such as the US, Japan, and Italy. The first step is to extract the US-related information into a new file us.doc and the Japan-related information into a new file Japan.doc. – flyingmouse Aug 02 '14 at 13:50

2 Answers

1

Here is code that handles the current task. The selection criterion is deliberately simple: paragraphs with (0-based) indices 11..20 go to the file "us.docx", and those with indices 21..30 go to "japan.docx".

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;


public class SplitDocs {

    public static void main(String[] args) {

        FileInputStream in = null;
        HWPFDocument doc = null;

        XWPFDocument us = null;
        XWPFDocument japan = null;
        FileOutputStream outUs = null;
        FileOutputStream outJapan = null;

        try {
            in = new FileInputStream("wto.doc");
            doc = new HWPFDocument(in);

            us = new XWPFDocument();
            japan = new XWPFDocument();

            Range range = doc.getRange();

            for (int parIndex = 0; parIndex < range.numParagraphs(); parIndex++) {  
                Paragraph paragraph = range.getParagraph(parIndex);

                String text = paragraph.text();
                System.out.println("***Paragraph" + parIndex + ": " + text);

                if ( (parIndex >= 11) && (parIndex <= 20) ) {
                    createParagraphInAnotherDocument(us, text);
                } else if ( (parIndex >= 21) && (parIndex <= 30) ) {
                    createParagraphInAnotherDocument(japan, text);
                }
            }

            outUs = new FileOutputStream("us.docx");
            outJapan = new FileOutputStream("japan.docx");
            us.write(outUs);
            japan.write(outJapan);

            in.close();
            outUs.close();
            outJapan.close();

        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    private static void createParagraphInAnotherDocument(XWPFDocument document, String text) {
        XWPFParagraph newPar = document.createParagraph();
        newPar.createRun().setText(text, 0);
    }

}
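
One detail worth checking in the code above: as far as I know, the string returned by the HWPF paragraph.text() is raw text and usually still ends with the paragraph mark ('\r'), or with '\u0007' inside table cells. A minimal sketch of a cleanup step that could be applied to text before passing it to createParagraphInAnotherDocument follows. The class and method names (ParagraphTextCleaner, clean) are hypothetical helpers, not part of POI, and the set of characters to strip is an assumption you may need to adjust for your documents.

```java
final class ParagraphTextCleaner {

    // Removes trailing paragraph/cell marks from raw paragraph text.
    // Assumption: only '\r', '\n' and '\u0007' at the END of the string
    // are unwanted; characters inside the text are left untouched.
    static String clean(String raw) {
        int end = raw.length();
        while (end > 0) {
            char c = raw.charAt(end - 1);
            if (c == '\r' || c == '\n' || c == '\u0007') {
                end--;
            } else {
                break;
            }
        }
        return raw.substring(0, end);
    }

    public static void main(String[] args) {
        // Brackets make any leftover trailing characters visible.
        System.out.println("[" + clean("United States\r") + "]");
    }
}
```

In the loop above this would be used as createParagraphInAnotherDocument(us, ParagraphTextCleaner.clean(text)).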

I used .docx as the output because it is much easier to add new paragraphs to a .docx than to a .doc file. The method insertAfter(ParagraphProperties props, int styleIndex) for inserting a new Paragraph into a given Range is now deprecated (I use POI version 3.10), and I couldn't find an easy and logical way to create a new Paragraph object in an empty .doc file. By contrast, the XWPF call XWPFParagraph newPar = document.createParagraph(); is straightforward and clean.

However, this code uses .doc as the input, as required by your task. Hope this helps :)

P.S. Here we use a simple selection criterion based on paragraph indices. If you need something like a font-based criterion, as you said, you will probably want to post another question, or maybe you'll find the solution yourself. Anyway, things get easier with .docx.
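
If more target files are added, the index-based selection can be pulled out of the if/else chain. The following is only a sketch under stated assumptions: ParagraphRouter, add and targetFor are hypothetical names (nothing here is POI API), ranges are inclusive and 0-based like parIndex, and the first matching range wins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class ParagraphRouter {

    // Target file name -> inclusive {first, last} paragraph index range.
    // LinkedHashMap keeps insertion order, so the first matching range wins.
    private final Map<String, int[]> ranges = new LinkedHashMap<>();

    ParagraphRouter add(String fileName, int first, int last) {
        ranges.put(fileName, new int[] {first, last});
        return this;
    }

    // Returns the target file for a paragraph index, or empty if no range matches.
    Optional<String> targetFor(int parIndex) {
        for (Map.Entry<String, int[]> entry : ranges.entrySet()) {
            int[] r = entry.getValue();
            if (parIndex >= r[0] && parIndex <= r[1]) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ParagraphRouter router = new ParagraphRouter()
                .add("us.docx", 11, 20)
                .add("japan.docx", 21, 30);
        System.out.println(router.targetFor(15).orElse("(skipped)"));
        System.out.println(router.targetFor(5).orElse("(skipped)"));
    }
}
```

An empty result can either be skipped, as in the code above, or sent to a "rest" document if every paragraph outside the listed ranges should end up in one extra file.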

DenisFLASH
  • Thank you so much. However, I wonder whether the `String` loses the formatting information of the text? I will take your advice and work with .docx files. I posted another question, #25130419, about copying some content from one .docx file to another without losing the format. I hope you can have a look. Thank you again for your advice :) – flyingmouse Aug 05 '14 at 02:49
  • @flyingmouse you're welcome! Thank you for the green mark, I'm glad it helped. I'll look at that question. If I have time (kind of busy these days), I'll try to help. Good luck! – DenisFLASH Aug 05 '14 at 07:02
  • @Denis this solution will not move tables, images, or pretty much anything else other than the paragraph text. Is there any other solution that really splits the document? – WiredCoder Jul 05 '16 at 09:02
0

I have had the same situation; please check "Apache POI - Split Word document (docx) to pages" for a solution. One word of caution: while that solution is better than the one above in the sense that it generates formatted pages, it falls short in handling tables and images.

Community
WiredCoder