
I'm trying to parse a categories file with PEG.js.

How can I group the categories? Each category is a set of non-empty lines followed by a blank line:

stopwords:fr:aux,au,de,le,du,la,a,et,avec

synonyms:en:flavoured, flavored

synonyms:en:sorbets, sherbets

en:Artisan products
fr:Produits artisanaux

< en:Artisan products
fr:Gressins artisanaux

en:Baby foods
fr:Aliments pour bébé, aliment pour bébé, alimentation pour bébé, aliment bébé, alimentation bébé, aliments bébé

< en:Baby foods
fr:Céréales pour bébé, céréales bébé

< en:Whisky
fr:Whisky écossais
es:Whiskies escoceses
wikipediacategory:Q8718387

For now I can parse line by line with this code:

start = stopwords* synonyms* category+

language_and_words = l:[^:]+ ":" w:[^\n]+ {return {language: l.join(''), words: w.join('')};}

stopwords = "stopwords:" w:language_and_words "\n"+ {return {stopwords: w};}

synonyms = "synonyms:" w:language_and_words "\n"+ {return {synonyms: w};}

category_line = "< "? w:language_and_words "\n"+ {return w;}

category = c:category_line+ {return c;}

I get:

{
    "language": "en",
    "words": "Artisan products"
},
{
    "language": "fr",
    "words": "Produits artisanaux"
}

but I want (for each group):

[
    {
        "language": "en",
        "words": "Artisan products"
    },
    {
        "language": "fr",
        "words": "Produits artisanaux"
    }
]

I also tried this, but it doesn't group the lines, and I get \n at the beginning of some lines.

category_line = "< "? w:language_and_words "\n" {return w;}

category = c:category_line+ "\n" {return c;}
LisaMM
Fabrice Theytaz

2 Answers


I found a partial solution:

start = category+

word = c:[^,\n]+ {return c.join('');}

words = w:word [,]? {return w.trim();}

parent = p:"< "? {return (p !== null);}

line = p:parent w:words+ "\n" {return {parent: p, words: w};}

category = l:line+ "\n"? {return l;}

I can parse this...

< fr:a,b
fr:aa,bb

en:d,e,f
fr:dd,ee, ffff

and get grouped:

[
    [ {...}, {...} ],
    [ {...}, {...} ]
]

But there is a problem with the "lang:" prefix at the beginning of each line: if I try to parse "lang:", my categories are no longer grouped.
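To make the target shape of a single line concrete, here is a minimal sketch in plain JavaScript. It is not a PEG.js rule: `parseLine` is a hypothetical hand-written helper, and the `language` field is an assumption about what a `line` rule that also parsed "lang:" might return next to `parent` and `words`.

```javascript
// Hypothetical helper (plain JavaScript, not part of PEG.js): turn one line
// such as "< fr:a,b" into { parent, language, words }.
function parseLine(line) {
  // A leading "< " marks a parent reference, as in the grammar above.
  const parent = line.startsWith("< ");
  const body = parent ? line.slice(2) : line;

  // Everything before the first ":" is taken as the language prefix.
  const colon = body.indexOf(":");
  if (colon === -1) {
    throw new Error("missing language prefix: " + line);
  }

  return {
    parent: parent,
    language: body.slice(0, colon),
    // Words are comma-separated; surrounding spaces are trimmed.
    words: body.slice(colon + 1).split(",").map(function (w) {
      return w.trim();
    })
  };
}
```

Comparing the objects a grammar produces against this helper on a few sample lines may help to locate where the grouping is lost.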

Fabrice Theytaz

I find it useful to break the parse down iteratively (problem decomposition, old-school à la Wirth). Here's a partial solution that I think gets you going in the right direction (I didn't parse the Line elements of the categories).

start = 
  stopwords 
  synonyms 
  category+

category "category"
  = category:(Line)+ categorySeparator { return category }

stopwords "stopwords"
  = stopwordLine*

stopwordLine "stopword line"
  = stopwordLine:StopWordMatch EndOfLine* { return stopwordLine }

StopWordMatch 
  = "stopwords:" match:Text { return match }

synonyms "synonyms"
  = synonymLine*

synonymLine "synonym line"
  = synonymLine:SynonymMatch EndOfLine* { return synonymLine }

SynonymMatch 
  = "synonyms:" match:Text { return match }

Line "line"
  = line:Text [\n] { return line }

Text "text"
  = [^\n]+ { return text() }

EndOfLine "(end of line)"
  = '\n'

EndOfFile 
  = !. { return "EOF"; }

categorySeparator "separator"
  = EndOfLine EndOfLine* / EndOfLine? EndOfFile

My use of mixed case is arbitrary and not very stylish. There's also a way to save the solutions online: http://peg.arcanis.fr/2WQ7CZ/
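As a way to check what the grammar returns, the grouping it aims for can be sketched in plain JavaScript without PEG.js. `groupLines` is a hypothetical reference function, not something the grammar uses. Note that it also treats the stopwords and synonyms lines as one-line groups, which the grammar handles with separate rules.

```javascript
// Hypothetical reference (plain JavaScript, no PEG.js): split the input into
// groups of consecutive non-empty lines, with blank lines as separators.
function groupLines(text) {
  const groups = [];
  let current = [];

  text.split("\n").forEach(function (line) {
    if (line.trim() === "") {
      // A blank line closes the current group, if there is one.
      if (current.length > 0) {
        groups.push(current);
        current = [];
      }
    } else {
      current.push(line);
    }
  });

  // The last group may not be followed by a blank line.
  if (current.length > 0) {
    groups.push(current);
  }
  return groups;
}
```

Running it on the same input as the parser and comparing the number and size of the groups is a quick sanity check on the `category` and `categorySeparator` rules.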

Fuhrmanator