
This class opens a file, splits it on // and hands each chunk over as a String. I then use a set of patterns with Matcher to search the string and pull out snippets of data. Later, this will be used to reformat the data into many files in specific orders. So far the process has worked for several patterns, but when I pass EditorList or AuthorList, find returns null even though the data is clearly there. The program then crashes with a NullPointerException when it tries to use the null Strings. This is my first time using Pattern and Matcher; what obvious thing am I neglecting to do here?

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchMethod {
public static String cancerCat=null;    
public static String paperType="book";
public static String Paper=null;
public static String Title=null;
public static String Abstr=null;
public static String Publi=null;
public static String Editi=null;
public static String Pagen=null;
public static String Bookt=null;
public static String Years=null;
public static String Editl=null;
public static String Edito=null;
public static String Authl=null;
public static String Autho=null;
public static String Foren=null;
public static String Initi=null;
public static String Lastn=null;

public static Scanner scanner;
public static File file;

static Pattern PaperBegin = Pattern.compile("<PaperBegin>(.+?)</PaperBegin>");
static Pattern PaperTitle = Pattern.compile("<PaperTitle>(.+?)</PaperTitle>");
static Pattern Abstract =   Pattern.compile("<Abstract>(.+?)</Abstract>");
static Pattern BookTitle =  Pattern.compile("<BookTitle>(.+?)</BookTitle>");
static Pattern Publisher =  Pattern.compile("<Publisher>(.+?)</Publisher>");
static Pattern Edition =    Pattern.compile("<Edition>(.+?)</Edition>");
static Pattern Page =       Pattern.compile("<Page>(.+?)</Page>");
static Pattern EditorList = Pattern.compile("<EditorList>(.+?)</EditorList>");
static Pattern Editor =     Pattern.compile("<Editor>(.+?)</Editor>");
static Pattern Year =       Pattern.compile("<Year>(.+?)</Year>");   
static Pattern AuthorList = Pattern.compile("<AuthorList>(.+?)</AuthorList>");
static Pattern Author =     Pattern.compile("<Author>(.+?)</Author>");
static Pattern ForeName =   Pattern.compile("<ForeName>(.+?)</ForeName>");
static Pattern Initials =   Pattern.compile("<Initials>(.+?)</Initials>");
static Pattern LastName =   Pattern.compile("<LastName>(.+?)</LastName>");

public static String find (String text, Pattern pattern)
{
    String found=null;
    Matcher match = pattern.matcher(text);
    if (match.find()) {found = match.group(1);}
    System.out.println((pattern.toString()) + " found: "+found);
    return  found;
}

@SuppressWarnings("resource")
static void readBook (String book) throws FileNotFoundException
{
    file = new File (book);
    scanner = new Scanner(file).useDelimiter("\\//");
    while (scanner.hasNext()) 
    {           
        Paper=scanner.next();
        Title = find (Paper, PaperTitle);
        Abstr = find (Paper, Abstract);
        Publi = find (Paper, Publisher);
        Editi = find (Paper, Edition);
        Pagen = find (Paper, Page);
        Bookt = find (Paper, BookTitle);
        Years = find (Paper, Year);         
        Editl = find (Paper, EditorList);
        Authl = find (Paper, AuthorList);

        Matcher mEdito = Editor.matcher(Editl);
        Edito = mEdito.group(1);
        while (mEdito.find()) // while loop to find all editors
        {
            System.out.println("Searching editors");                
            Foren = find (Edito, ForeName);
            Initi = find (Edito, Initials);
            Lastn = find (Edito, LastName);
            System.out.println ("EDITORS: " + Bookt + "\t" + Foren + "\t" + Initi + "\t" + Lastn);
        }
        Matcher mAutho = Author.matcher(Authl);
        while (mAutho.find()) // while loop to find all authors
        {
            System.out.println("Searching authors");
            Autho = mAutho.group(1);
            Foren = find (Autho, ForeName);
            Initi = find (Autho, Initials);
            Lastn = find (Autho, LastName);
            System.out.println ("AUTHORS: " + Bookt + "\t" + Foren + "\t" + Initi + "\t" + Lastn);
        }   
    }
}

public static void main(String[] args) throws IOException 
{
    readBook ("CC_book.txt"); //opens text file to be mined


    //Start reading Colon Cancer Book Information

    //Start reading Endocrine Cancer Book Information

    //Start reading Lung Cancer Book Information

    //Start reading Other Cancer Book Information   

    //Start reading Pancreatic Cancer Book Information
    scanner.close();
}

}

Here is sample data from the file:

<PaperTitle>True incidence of all complications following immediate and delayed breast     reconstruction.</PaperTitle>
<Abstract>BACKGROUND: Improved self-image and psychological well-being after breast      reconstruction are well documented. To determine methods that optimized results with minimal morbidity, the authors examined their results and complications based on reconstruction method and timing. METHODS: The authors reviewed all breast reconstructions after mastectomy for breast cancer performed under the supervision of a single surgeon over a 6-year period at a tertiary referral center. Reconstruction method and timing, patient characteristics, and complication rates were reviewed. RESULTS: Reconstruction was performed on 240 consecutive women (94 bilateral and 146 unilateral; 334 total reconstructions). Reconstruction timing was evenly split between immediate (n = 167) and delayed (n = 167). Autologous tissue (n = 192) was more common than tissue expander/implant reconstruction (n = 142), and the free deep inferior epigastric perforator was the most common free flap (n = 124). The authors found no difference in the complication incidence with autologous reconstruction, whether performed immediately or delayed. However, there was a significantly higher complication rate following immediate placement of a tissue expander when compared with delayed reconstruction (p = 0.008). Capsular contracture was a significantly more common late complication following immediate (40.4 percent) versus delayed (17.0 percent) reconstruction (p &lt; 0.001; odds ratio, 5.2; 95 percent confidence interval, 2.3 to 11.6). CONCLUSIONS: Autologous reconstruction can be performed immediately or delayed, with optimal aesthetic outcome and low flap loss risk. However, the overall complication and capsular contracture incidence following immediate tissue expander/implant reconstruction was much higher than when performed delayed. Thus, tissue expander placement at the time of mastectomy may not necessarily save the patient an extra operation and may compromise the final aesthetic outcome.</Abstract>
<BookTitle>Book1</BookTitle>
<Publisher>Publisher01, Boston</Publisher>
<Edition>1st</Edition>
<EditorList>
<Editor>
    <LastName>Lewis</LastName>
    <ForeName>Philip M</ForeName>
    <Initials>PM</Initials>
</Editor>
<Editor>
    <LastName>Kiffer</LastName>
    <ForeName>Michael</ForeName>
    <Initials>M</Initials>
</Editor>
</EditorList>
<Page>19-28</Page>
<Year>2008</Year>
<AuthorList>
            <Author ValidYN="Y">
                <LastName>Sullivan</LastName>
                <ForeName>Stephen R</ForeName>
                <Initials>SR</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Fletcher</LastName>
                <ForeName>Derek R D</ForeName>
                <Initials>DR</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Isom</LastName>
                <ForeName>Casey D</ForeName>
                <Initials>CD</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Isik</LastName>
                <ForeName>F Frank</ForeName>
                <Initials>FF</Initials>
            </Author>
</AuthorList>
//
<PaperTitle>Polygenes, risk prediction, and targeted prevention of breast cancer.</PaperTitle>
<Abstract>BACKGROUND: New developments in the search for susceptibility alleles in complex disorders provide support for the possibility of a polygenic approach to the prevention and treatment of common diseases. METHODS: We examined the implications, both for individualized disease prevention and for public health policy, of findings concerning the risk of breast cancer that are based on common genetic variation. RESULTS: Our analysis suggests that the risk profile generated by the known, common, moderate-risk alleles does not provide sufficient discrimination to warrant individualized prevention. However, useful risk stratification may be possible in the context of programs for disease prevention in the general population. CONCLUSIONS: The clinical use of single, common, low-penetrance genes is limited, but a few susceptibility alleles may distinguish women who are at high risk for breast cancer from those who are at low risk, particularly in the context of population screening.</Abstract>
<BookTitle>Book2</BookTitle>
<Publisher>Publisher02, New York</Publisher>
<Edition>3rd</Edition>
<EditorList>
<Editor>
    <LastName>Bernstein</LastName>
    <ForeName>Arthur</ForeName>
    <Initials>A</Initials>
</Editor>
</EditorList>
<Page>2796-803</Page>
<Year>2008</Year>
<AuthorList>
            <Author ValidYN="Y">
                <LastName>Pharoah</LastName>
                <ForeName>Paul D P</ForeName>
                <Initials>PD</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Antoniou</LastName>
                <ForeName>Antonis C</ForeName>
                <Initials>AC</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Easton</LastName>
                <ForeName>Douglas F</ForeName>
                <Initials>DF</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Ponder</LastName>
                <ForeName>Bruce A J</ForeName>
                <Initials>BA</Initials>
            </Author>
</AuthorList>
//
<PaperTitle>Invasive breast cancer: predicting disease recurrence by using high-spatial-resolution signal enhancement ratio imaging.</PaperTitle>
<Abstract>PURPOSE: To retrospectively evaluate high-spatial-resolution signal enhancement ratio (SER) imaging for the prediction of disease recurrence in patients with breast cancer who underwent preoperative magnetic resonance (MR) imaging. MATERIALS AND METHODS: This retrospective study was approved by the institutional review board and was HIPAA compliant; informed consent was waived. From 1995 to 2002, gadolinium-enhanced MR imaging data were acquired with a three time point high-resolution method in women undergoing neoadjuvant therapy for invasive breast cancers. Forty-eight women (mean age, 49.1 years; range, 29.7-72.4 years) were divided into recurrence-free or recurrence groups. Volume measurements were tabulated for SER values between set ranges; cutoff criteria were defined to predict disease recurrence after surgery. Wilcoxon rank sum tests and the multivariate Cox proportional hazards regression model were used for evaluation. RESULTS: Breast tumor volume calculated from the number of voxels with SER values above a threshold corresponding to the upper limit of mean redistribution rate constant in benign tumors (0.88 minutes(-1)) and the volume of cancerous breast tissue infiltrating into the parenchyma were important predictors of disease recurrence. Seventy-five percent of patients with recurrence and 100% of deceased patients were identified as being at high risk for recurrence. Thirty percent of patients with recurrence and 67% of deceased patients were identified as having high risk before chemotherapy. No patients in the recurrence-free group were misidentified as likely to have recurrence. All three prechemotherapy parameters (total tumor volume, tumor volumes with high and low SER) and the postchemotherapy tumor volume with high SER were significantly different between the two groups. 
The multivariate Cox proportional hazards regression showed that, of the three prechemotherapy covariates, only the low SER and high SER tumor volumes (P = .017 and .049, respectively) were significant and independent predictors of tumor recurrence. Tumor volume with high SER was the only significant postchemotherapy covariate predictor (P = .038). CONCLUSION: High-spatial-resolution SER imaging may improve prediction for patients at high risk for disease recurrence and death.</Abstract>
<BookTitle>Book3</BookTitle>
<Publisher>Publisher03, London</Publisher>
<Edition>3rd</Edition>
<EditorList>
<Editor>
    <LastName>Anderson</LastName>
    <ForeName>John T</ForeName>
    <Initials>JT</Initials>
</Editor>
<Editor>
    <LastName>Hoffman</LastName>
    <ForeName>John A</ForeName>
    <Initials>JA</Initials>
</Editor>
<Editor>
    <LastName>Smithson</LastName>
    <ForeName>Joshua H</ForeName>
    <Initials>JH</Initials>
</Editor>
</EditorList>
<Page>79-87</Page>
<Year>2008</Year>
<AuthorList>
            <Author ValidYN="Y">
                <LastName>Li</LastName>
                <ForeName>Ka-Loh</ForeName>
                <Initials>KL</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Partridge</LastName>
                <ForeName>Savannah C</ForeName>
                <Initials>SC</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Joe</LastName>
                <ForeName>Bonnie N</ForeName>
                <Initials>BN</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Gibbs</LastName>
                <ForeName>Jessica E</ForeName>
                <Initials>JE</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Lu</LastName>
                <ForeName>Ying</ForeName>
                <Initials>Y</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Esserman</LastName>
                <ForeName>Laura J</ForeName>
                <Initials>LJ</Initials>
            </Author>
            <Author ValidYN="Y">
                <LastName>Hylton</LastName>
                <ForeName>Nola M</ForeName>
                <Initials>NM</Initials>
            </Author>
</AuthorList>
//
VLAZ
BigData
  • Is your scanner working properly? How many loops is it doing? Not sure about that \\// delimiter. Shouldn't it be \/\/? – Phil Nov 29 '14 at 01:01
  • Scanner works fine for the first 9 matches within the first loop. In a separate check the delimiter did work as formatted. The file has a "//" separating each article. – BigData Nov 29 '14 at 01:57
  • Parsing XML with regular expressions... :-/ Why not use an XML parser? – Jesper Nov 29 '14 at 22:40
  • @Phil_1984_: It should be `//`; the slash has no special meaning in a regex. But that's not hurting anything. (If it *were* special, you would need to escape both of them: `"\\/\\/"`.) – Alan Moore Nov 29 '14 at 22:45
  • I needed a quick turnaround on this, and never having done regex or xml parsing before, I grasped the code behind regex faster so that was the direction I went in. In retrospect, perhaps I should have gone with xml parsing. – BigData Dec 01 '14 at 13:58

2 Answers


I know Java must have plenty of tools for parsing XML, and parsing XML with regex can, among other problems, get pretty messy.


I'm not a Java programmer, but you're probably running into . not matching newlines by default. You can prefix your regexes with the (?s) switch; it only really matters for Editor, Author, EditorList and AuthorList, whose contents span several lines.

For example, your Author regex would look something like this.

static Pattern Author = Pattern.compile("(?s)<Author>(.+?)</Author>");

Source: Regular expression does not match newline obtained from Formatter object
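
To see the effect in isolation, here is a minimal sketch. The class and helper names (DotAllDemo, firstGroup) are made up for illustration, and the record is an abbreviated version of the sample data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Abbreviated record in the shape of the sample data (hypothetical excerpt).
        String paper = "<Edition>1st</Edition>\n"
                + "<EditorList>\n"
                + "<Editor>\n"
                + "    <LastName>Lewis</LastName>\n"
                + "</Editor>\n"
                + "</EditorList>\n";

        // Without (?s), '.' stops at line terminators, so the multi-line element is not found.
        System.out.println(firstGroup("<EditorList>(.+?)</EditorList>", paper));
        // With (?s), '.' also matches '\n', so the whole list is captured.
        System.out.println(firstGroup("(?s)<EditorList>(.+?)</EditorList>", paper));
        // Elements that sit on a single line match either way.
        System.out.println(firstGroup("<Edition>(.+?)</Edition>", paper));
    }
}
```

This would also explain why the single-line tags (PaperTitle, Publisher, Year and so on) were found while the two list tags came back null.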


As for something you said in a comment:

... This is another problem, since if it is empty it should just return another null string. ...

The reason your regexes won't do this is that you're using (.+?). If you change each occurrence of this, where applicable, to (.*?), then you permit empty strings. .+ requires at least one character (any character) between the opening and closing tags. .* doesn't require a character but grabs any that are present. And the ? makes the matching non-greedy, so it stops capturing as soon as the rest of the pattern can match.
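
A small Java sketch of both differences. The class and method names (QuantifierDemo, allGroups) and the Page inputs are invented for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects capture group 1 of every match, in order.
    static List<String> allGroups(String regex, String text) {
        List<String> groups = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            groups.add(m.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        String empty = "<Page></Page>";
        // .+? needs at least one character between the tags: no match at all.
        System.out.println(allGroups("<Page>(.+?)</Page>", empty).size());
        // .*? accepts zero characters: one match whose group is the empty string.
        System.out.println(allGroups("<Page>(.*?)</Page>", empty).size());

        String two = "<Page>1</Page><Page>2</Page>";
        // Greedy: a single match that runs on to the last closing tag.
        System.out.println(allGroups("<Page>(.+)</Page>", two));
        // Non-greedy: each element is matched separately.
        System.out.println(allGroups("<Page>(.+?)</Page>", two));
    }
}
```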

Consider the string: I like cats, I wonder if you like cats
"I (.*) cats" matches the whole string.
"I (.*?) cats" matches "I like cats", and if the global flag is on (in Java, on the next call to find()), separately matches "I wonder if you like cats"

Are you sure that AuthorList/EditorList is what's crashing it?

Your Author regex doesn't account for the attribute ValidYN at all, yet every instance in this sample data contains it, so you should be matching for it.

For Author regex try

<Author(?: [\w\-]*="[^"]*")*>(.+?)</Author>

This is a simple pattern that allows any number of attributes whose names consist of letters, digits, _ or hyphens and whose values are double-quoted and cannot themselves contain a quote.

Or, simpler, if ValidYN is the only attribute you'll encounter:

<Author ValidYN="(?:Y|N)">(.+?)</Author>

The first regex, however, may come in handy for dealing with other tags that may have attributes, if that issue should arise.
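
In Java source the pattern has to be written as a string literal, so every backslash and every double quote in it needs escaping; pasted in unescaped, it will not compile. A sketch of the escaped form, with (?s) added so the multi-line Author body is matched. The class and method names (AuthorAttributeDemo, lastNames) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AuthorAttributeDemo {
    // <Author(?: [\w\-]*="[^"]*")*>(.+?)</Author> as a Java string literal, prefixed with (?s).
    static final Pattern AUTHOR = Pattern.compile(
            "(?s)<Author(?: [\\w\\-]*=\"[^\"]*\")*>(.+?)</Author>");
    static final Pattern LAST_NAME = Pattern.compile("<LastName>(.+?)</LastName>");

    // Returns the last name of every Author element, with or without attributes.
    static List<String> lastNames(String authorList) {
        List<String> names = new ArrayList<>();
        Matcher author = AUTHOR.matcher(authorList);
        while (author.find()) {
            Matcher last = LAST_NAME.matcher(author.group(1));
            if (last.find()) {
                names.add(last.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Abbreviated AuthorList in the shape of the sample data.
        String authorList = "<AuthorList>\n"
                + "    <Author ValidYN=\"Y\">\n"
                + "        <LastName>Sullivan</LastName>\n"
                + "    </Author>\n"
                + "    <Author ValidYN=\"Y\">\n"
                + "        <LastName>Fletcher</LastName>\n"
                + "    </Author>\n"
                + "</AuthorList>";
        System.out.println(lastNames(authorList));
    }
}
```

The opening <AuthorList> tag is not mistaken for an Author element, because the pattern requires either an attribute or the closing > directly after "<Author".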

Regular Jo
  • This is the next issue in the list, but the current problem is that the editor list, when searched for, returns null. After the editor list, it will search for editors, then for the author list, then for specific authors. The program crashes as it tries to look for specific editors in the empty editorlist string. This is another problem, since if it is empty it should just return another null string. – BigData Nov 29 '14 at 10:50
  • @BigData Updated my response. I believe another problem is that the (?s) flag is not turned on, so . does not match newlines. – Regular Jo Nov 29 '14 at 21:03
  • I first implemented the change from + to *, but it had no effect on the output. I should note that although EditorList returns null, if I were to initialize String found to something else (ex: "Not Found!") then the search would return that as well. Both Author regexes that you suggested were flagged as containing uncompilable errors. It is moot, as the class does not proceed that far into the code before crashing due to an out of bounds error when editors are searched for. – BigData Dec 01 '14 at 14:03

A group can only be read after a successful find() (or matches() or lookingAt()); calling group(1) before that throws an IllegalStateException.

    Matcher mEdito = Editor.matcher(Editl);
    Edito = mEdito.group(1);
    while (mEdito.find()) // while loop to find all editors
    {

Must be

    Matcher mEdito = Editor.matcher(Editl);
    while (mEdito.find()) // while loop to find all editors
    {
        Edito = mEdito.group(1);
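
As a self-contained sketch of the corrected loop: the class and method names (EditorLoopDemo, editors) are invented here, find is a trimmed copy of the question's helper, and the Editor pattern carries (?s) because each element spans several lines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EditorLoopDemo {
    static final Pattern EDITOR = Pattern.compile("(?s)<Editor>(.+?)</Editor>");
    static final Pattern FORE_NAME = Pattern.compile("<ForeName>(.+?)</ForeName>");
    static final Pattern LAST_NAME = Pattern.compile("<LastName>(.+?)</LastName>");

    // Same idea as find() in the question: group 1 of the first match, or null.
    static String find(String text, Pattern pattern) {
        Matcher match = pattern.matcher(text);
        return match.find() ? match.group(1) : null;
    }

    // Returns "ForeName LastName" for every Editor element in the list.
    static List<String> editors(String editorList) {
        List<String> result = new ArrayList<>();
        Matcher mEdito = EDITOR.matcher(editorList);
        while (mEdito.find()) {
            String edito = mEdito.group(1); // read the group only after find() succeeded
            result.add(find(edito, FORE_NAME) + " " + find(edito, LAST_NAME));
        }
        return result;
    }

    public static void main(String[] args) {
        String editorList = "<EditorList>\n<Editor>\n"
                + "    <LastName>Lewis</LastName>\n    <ForeName>Philip M</ForeName>\n"
                + "</Editor>\n<Editor>\n"
                + "    <LastName>Kiffer</LastName>\n    <ForeName>Michael</ForeName>\n"
                + "</Editor>\n</EditorList>";
        System.out.println(editors(editorList));

        // Calling group() on a fresh matcher, as the original code did, fails.
        try {
            EDITOR.matcher(editorList).group(1);
        } catch (IllegalStateException e) {
            System.out.println("group(1) before find(): " + e);
        }
    }
}
```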

P.S.

JAXB would allow this XML to be read into Java objects (classes Author, Editor, etc.).
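
JAXB maps XML onto annotated classes and, since Java 11, is no longer bundled with the JDK, so it needs an extra dependency. As a dependency-free alternative (not JAXB itself), here is a rough sketch using the JDK's built-in DOM parser. It assumes each //-separated record becomes well-formed XML once wrapped in a single root element; the <Paper> wrapper and the names DomDemo, lastNames and childText are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomDemo {
    // Text of the first descendant with the given tag name, or null if absent.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : null;
    }

    // Last names of all elements called personTag ("Editor" or "Author") in one record.
    static List<String> lastNames(String record, String personTag) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The records have no single root element, so wrap them in a hypothetical one.
            Document doc = builder.parse(
                    new InputSource(new StringReader("<Paper>" + record + "</Paper>")));
            NodeList people = doc.getElementsByTagName(personTag);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < people.getLength(); i++) {
                names.add(childText((Element) people.item(i), "LastName"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("record is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String record = "<BookTitle>Book1</BookTitle>\n"
                + "<EditorList>\n<Editor>\n    <LastName>Lewis</LastName>\n</Editor>\n</EditorList>\n"
                + "<AuthorList>\n    <Author ValidYN=\"Y\">\n        <LastName>Sullivan</LastName>\n"
                + "    </Author>\n</AuthorList>\n";
        System.out.println(lastNames(record, "Editor"));
        System.out.println(lastNames(record, "Author"));
    }
}
```

With a parser, newlines inside elements and attributes such as ValidYN need no special handling.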


I saw that you also need the regex . to match newlines. This is the DOTALL option, which can be written in the regex as the embedded flag (?s) ("single-line" mode).

static Pattern EditorList = Pattern.compile("(?s)<EditorList>(.+?)</EditorList>");
...
static Pattern AuthorList = Pattern.compile("(?s)<AuthorList>(.+?)</AuthorList>");
Joop Eggen
  • This had no effect on the code, but thank you for trying to help. I have a short time frame to learn and implement the code; regex made more sense and was easier and quicker to learn. In retrospect I should have learned to parse XML. – BigData Dec 01 '14 at 14:05
  • I took a closer look, and saw that the regex was not okay either. – Joop Eggen Dec 01 '14 at 14:52