Problems finding a csv grammar where cells have a special meaning

Question

I'm trying to find a grammar for the following example csv:

a; test;test ;
;a; test;test ;
<ignore>; <ignore> ;test
a; <ignore> test;test
a; this is test ;test

The semicolon is used as the separator. Cells containing only the text <ignore> have a special meaning and should be represented by their own type in the EMF model. However, <ignore> test is not such a special value. Whitespace around semicolons must be ignored. Cells may contain any character except the semicolon.

So far I have come up with this grammar:

grammar com.example.Csv

import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate impEx "http://www.example.com/Csv"

Model:
    valueLine=ValueLine;

ValueLine:
    ';'? WHITE_SPACE values+=Value WHITE_SPACE (';' WHITE_SPACE values+=Value WHITE_SPACE)* ';'*;

Value:
    ( (=>'<ignore>') {IGNORE_VALUE} IGNORE_VALUE) | text=TEXT_VALUE;

terminal TEXT_VALUE:
    (!';')*;

IGNORE_VALUE:
    '<ignore>';

WHITE_SPACE:
    (' '|'\t')*;

But when I run my test case:

@InjectWith(CsvInjectorProvider.class)
@RunWith(XtextRunner.class)
public class ParserTest {

    @Inject
    private ParseHelper<Model> parser;

    @Test
    public void parseDomainmodel() throws Exception {
        Model parsed = parser.parse("abc;  <ignore>;  <ignore> \t;  <ignore> a;def");
        System.out.println(parsed.getValueLine().getValues());
    }
}

I see that the IGNORE_VALUE rule doesn't match <ignore>. The parser seems to use the TEXT_VALUE rule instead, because of the leading whitespace.

What do I need to do in order to parse the <ignore> values correctly?

score 0 · Answer 1 · answered Sep 21 '15 at 18:05

It looks like you are dealing with regular expressions in your grammar file. Try the following:

IGNORE_VALUE:
    '\<ignore\>';

If you also need to handle leading spaces, it should be something like:

IGNORE_VALUE:
    '\ *\<ignore\>';

Hopefully that helps.

answered Sep 21 '15 at 18:05

Mehdi Karamosly


score 0 · Accepted Answer · edited Oct 09 '15 at 20:19

The problem here is that the lexer performs a longest match. Since your TEXT_VALUE terminal matches pretty much anything up to the next semicolon, including leading whitespace, it gets chosen over '<ignore>'.
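
To make the longest-match effect concrete, here is a toy Java sketch. It is not how the generated Xtext lexer is implemented; the class LongestMatchDemo, its method names, and the tie-breaking rule (the keyword wins when both matches have the same length) are assumptions made purely for illustration.

```java
public class LongestMatchDemo {

    // Length matched by the '<ignore>' keyword at the start of the input, or 0 if it does not match.
    static int matchKeyword(String input) {
        return input.startsWith("<ignore>") ? "<ignore>".length() : 0;
    }

    // Length matched by a TEXT_VALUE-like rule: everything up to the next ';'.
    static int matchText(String input) {
        int end = input.indexOf(';');
        return end < 0 ? input.length() : end;
    }

    // Pick the rule with the longest match. Assumption: on a tie the keyword wins.
    static String winner(String input) {
        int keyword = matchKeyword(input);
        int text = matchText(input);
        return (keyword > 0 && keyword >= text) ? "IGNORE_KEYWORD" : "TEXT_VALUE";
    }

    public static void main(String[] args) {
        String[] cells = {"<ignore>;x", "  <ignore>;x", "<ignore> a;x"};
        for (String cell : cells) {
            System.out.println("'" + cell + "' -> " + winner(cell));
        }
    }
}
```

In this sketch only the first input yields the keyword. As soon as whitespace precedes <ignore> or other text follows it, the text rule produces the longer match and wins.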

I would suggest having only text columns in the grammar and deciding whether a column is ignored in later stages such as validation and highlighting.
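
A minimal sketch of that later analysis, in plain Java and independent of Xtext: every cell is treated as text first and is classified afterwards. The class CellClassifier, its methods, and the choice to skip empty cells (such as those produced by a leading or trailing semicolon) are assumptions for illustration. In a real project the same check would more likely live in a validator or in the model's post-processing.

```java
import java.util.ArrayList;
import java.util.List;

public class CellClassifier {

    static final String IGNORE_MARKER = "<ignore>";

    // A cell is "ignored" only if it consists of the marker and surrounding whitespace.
    static boolean isIgnored(String rawCell) {
        return rawCell.trim().equals(IGNORE_MARKER);
    }

    // Split a line on ';', trim each cell, and label it as IGNORE or TEXT(...).
    static List<String> classify(String line) {
        List<String> result = new ArrayList<>();
        for (String rawCell : line.split(";", -1)) {
            String cell = rawCell.trim();
            if (cell.isEmpty()) {
                continue; // assumption: empty cells carry no value and are skipped
            }
            result.add(isIgnored(cell) ? "IGNORE" : "TEXT(" + cell + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        // Same input as the test case in the question.
        System.out.println(classify("abc;  <ignore>;  <ignore> \t;  <ignore> a;def"));
        // prints [TEXT(abc), IGNORE, IGNORE, TEXT(<ignore> a), TEXT(def)]
    }
}
```

Because the check runs on the trimmed cell text, <ignore> a stays an ordinary text value while a bare <ignore> surrounded by spaces or tabs is recognised.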

edited Oct 09 '15 at 20:19

approxiblue


answered Oct 09 '15 at 18:55

Stefan Oehme
