
I'm writing a syntax-highlighting file for .bed files. The number of columns and their exact content vary between files, but they generally look like this:

chr1    11873   14409   uc001aaa.3  0   + 11873 11873   0   3   354,109,1189,   0,739,1347,
chr21   1000000 1230000 peakValue   200 -
chrX    11873   14409   selection
....
<string> <numeric> <numeric> <string> <numeric 0-1000> <+ or - or .> <numeric> <numeric> <numeric> <numeric> <comma separated list> <comma separated list>

So far I have the first column (chrom) and the strand column working:

bed.lang

<?xml version="1.0" encoding="UTF-8"?>

<language id="bed" _name="Bed" version="2.0" _section="Scientific">
  <metadata>
    <property name="mimetypes">text/bed</property>
    <property name="globs">*.bed</property>
  </metadata>

  <styles>

    <style id="chrom"        _name="Chrom"    map-to="bed:chr" />
    <style id="strand"       _name="Strand"   map-to="bed:strand" />

  </styles>

  <definitions>
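    <context id="bed">
      <include>

        <!-- Column 1: chromosome name at the start of the line -->
        <context id="1_chr" style-ref="chrom">
          <match extended="true">
            ^\w+
          </match>
        </context>

        <!-- Column 6: strand; the lookahead also accepts a strand that ends the line -->
        <context id="6_strand" style-ref="strand">
          <match extended="true">
            \t[+\-.](?=\t|$)
          </match>
        </context>

      </include>
    </context>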
  </definitions>
</language>

I'd like to extend this so that each column is formatted differently based on a scheme I can define, e.g. coordinates in one color, names in another, and scores in a third. The problem is that coordinates and scores are both plain numeric strings, so a pattern based on content alone can't tell them apart.

The 'simplest' solution I can see is a regex that selects a column by its position and matches nothing when the requested column number is greater than the number of columns on the line (i.e. it does not wrap around onto the next line).
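
One way such a positional selector might be sketched is with a single line-anchored match whose capture groups are styled through GtkSourceView's `sub-pattern` contexts, used in place of separate per-column contexts. This is an untested sketch, not a confirmed solution: the ids `bed-line`, `bed-chrom`, etc. and the style ids `coords`, `name` and `score` are hypothetical and would need matching `<style>` entries, and it assumes the engine simply skips a sub-pattern whose optional group did not participate in the match.

```xml
<!-- Sketch only. The style ids "coords", "name" and "score" are hypothetical;
     "chrom" and "strand" are the ones already declared in <styles>. -->
<context id="bed-line">
  <!-- One match per line. Each nested optional group adds one more column,
       so a line with fewer columns leaves the later groups unmatched. -->
  <match extended="true">
    ^([^\t\n]+)
    (?:\t([0-9]+)
    (?:\t([0-9]+)
    (?:\t([^\t\n]+)
    (?:\t([0-9]+)
    (?:\t([+\-.])
    )?)?)?)?)?
  </match>
  <include>
    <context id="bed-chrom"  sub-pattern="1" style-ref="chrom"/>
    <context id="bed-start"  sub-pattern="2" style-ref="coords"/>
    <context id="bed-end"    sub-pattern="3" style-ref="coords"/>
    <context id="bed-name"   sub-pattern="4" style-ref="name"/>
    <context id="bed-score"  sub-pattern="5" style-ref="score"/>
    <context id="bed-strand" sub-pattern="6" style-ref="strand"/>
  </include>
</context>
```

Because the whole line is consumed by one match, the columns are told apart by group number rather than by content, and no two contexts compete for the same text. The pattern only covers the first six columns; further columns would need more nested groups.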

Lookbehind doesn't seem to work (because of the angle bracket in the lookbehind syntax). Some regexes I've tried that don't behave nicely are:

  1. Building up iterative matches and formatting each one differently doesn't work. Multiple matches over the same text cause all syntax highlighting to fail.

    ^.+?\t
    ^.+?\t.+?\t
    ^.+?\t.+?\t.+?\t ...
    
  2. Selecting 'Numeric Strings'

    Single numeric string
    (?<=^\w\t)[0-9]+(?=\t)
    
    Numeric string doublets
    (?<=\t)[0-9]+\t[0-9]+(?=\t){1}
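
For reference, a lookbehind placed inside a .lang file has to have its `<` written as `&lt;`, since a bare `<` is not allowed in XML text. The fragment below is a minimal sketch of that escaping only; the context id `numeric_field` and the style id `coords` are hypothetical. As far as I know the underlying PCRE engine also requires lookbehinds to have a fixed length, so something like `(?<=^\w+\t)` would be rejected even when escaped.

```xml
<!-- Sketch only: "numeric_field" and "coords" are hypothetical names. -->
<context id="numeric_field" style-ref="coords">
  <match extended="true">
    (?&lt;=\t)[0-9]+(?=\t)
  </match>
</context>
```

This matches any numeric field that sits between two tabs, so on its own it still cannot distinguish a coordinate from a score.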
    

I'll keep hacking together an ugly solution, but I was wondering whether there is something more elegant that I'm not thinking of.

Artem
  • What are you actually trying to match? Could you clearly state what the demonstration text is, how the column data types are defined and what you actually don't know how to do? – ssc-hrep3 Jan 27 '17 at 23:16
  • I've explained a bit more about bed files. What I would like is a way of selecting particular columns in regex that would be compatible with different files which could be different column widths. – Artem Jan 27 '17 at 23:28
  • Okay, I see. But do you actually want to format the values according to their column number? So, the second and third column should for example have different colors even though they match the same (numeric) pattern? If so, this will get messy with regex... – ssc-hrep3 Jan 27 '17 at 23:33
  • Second and third are 'coordinates' so they'll probably be the same color, but it should be distinct from 'score' or 'blockCount' – Artem Jan 28 '17 at 00:20
  • You can escape a > in XML with &gt; (an ampersand followed by the two letters gt and a semicolon, in case it doesn't appear in my reply; sometimes Stack Overflow eats them), so use that instead of an unescaped greater-than sign (and use lt to escape a less-than sign and amp for an ampersand) – barefootliam Jan 28 '17 at 05:20
