I have an archive of PCL files. I would like to write a console app that reads a file, strips out all print control codes, and writes the codes to a separate file, leaving the rest of the document intact. I think I can do this with a regex, but I'm not sure how to approach the task. My language of choice is C#. Any advice you can provide will be greatly appreciated.

I've made progress with

    public static string RemoveBetween(string s, char begin, char end)
    {
        Regex regex = new Regex(string.Format("\\{0}.*?{1}", begin, end));
        return regex.Replace(s, string.Empty);
    }

    public static string[] getPclCodes(string line)
    {
        string pattern = "\\x1B.*?H";
        string[] pclCodes = Regex.Split(line, pattern);

        return pclCodes;
    }

but the codes come back as empty strings. I can strip them out of the PCL and write a .txt file, but I also need the codes themselves. I call getPclCodes before RemoveBetween. Any ideas?

110100100
  • This is really just a how-to question, so why not go to regular-expressions.info and start reading up on regex? They even have language-specific idiosyncrasies documented. – Brian Warshaw Sep 13 '12 at 16:55
  • Can you post an example of the file including at least one control code? Do you mean the list of codes specified at https://support.transfrm.com/attachments/token/ontu8wag731xpbi/?name=PCL.pdf in the About PCL5e section? If the codes don't follow a pattern and you just have to look for a set of "hardcoded" values, you may as well just use string replacement instead of regex. – Tyson Sep 13 '12 at 18:03
  • DICT D&T: 02/15/11 1229 TRANS D&T: 02/18/11 2004 BY: CJR (s0s0B &d@ &k10.000H (s0s0B &d@ &k10.000H (s0s0B &d@ &k10.000H (s0s0B &d@ &k10.000H (s0s0B &d@ &k10.000H Run: 02/22/11-12:27 by DOE,JANE A (s0s0B &d@ &k10.000H (s0s0B &d@ &k10.000H PT PROGRESS NOTES-Additional copy Page 1of1 (s0s0B &d@ &k10.000H (8U (s0p12h10v3T )10U )s0p12h10v3T (s0s0B &d@ &k10.000H This is the bottom of a sample file. I see that they end with .000H. I am unsure of how to identify the ESC character that starts the command. I can see it in Notepad++, but not here. – 110100100 Sep 13 '12 at 18:31

2 Answers

If I am understanding correctly, this should do the trick. I modified your method to accept both the line you want scanned and an out parameter of type MatchCollection. The method assigns the matches (the codes themselves) to that parameter before it splits the line. Regex.Split returns only the text between matches, which is why your version never gave you the codes.

    public static string[] getPclCodes(string line, out MatchCollection codes)
    {
        string pattern = "\\x1B.*?H";

        Regex regex = new Regex(pattern);
        codes = regex.Matches(line);

        string[] pclCodes = Regex.Split(line, pattern);

        return pclCodes;
    }

So now, in your Main or wherever you call getPclCodes from, you can do something like this.

        MatchCollection matches;
        string[] codes = getPclCodes(codeString, out matches);

        foreach (Match match in matches)
            Console.WriteLine(match.Value);

I am sure there is a better way, but this works, assuming we are on the same page.
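As one possible variation, the two steps can be folded into a single pass by giving Regex.Replace a match evaluator that records each code as it is removed. This is only a sketch: the class name PclSplitter and the method StripCodes are hypothetical, and it keeps the same pattern as above, so it only catches codes that end in H.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PclSplitter
{
    // Same pattern as above: ESC, then as few characters as possible up to an 'H'.
    private static readonly Regex CodePattern = new Regex("\\x1B.*?H");

    // Returns the line with the codes removed and appends each removed code to 'codes'.
    public static string StripCodes(string line, List<string> codes)
    {
        return CodePattern.Replace(line, match =>
        {
            codes.Add(match.Value);   // keep the code for the separate output file
            return string.Empty;      // and drop it from the document text
        });
    }
}
```

Calling it once per line, with one shared List<string>, leaves the cleaned text in the return value and the codes in the list, ready to be written to their own file.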

James Shaw

OP presumably wanted C#, but if anyone else just wants it using GNU sed, this works:

    sed 's/\x1B[^][@A-Z^\\]*[][@A-Z^\\]//g'

How it works: in each line, find and remove any character sequence that starts with ESC (\x1B) and continues up to and including the first of the ASCII characters 64-94 (i.e. A-Z or any of @[\]^), which is what terminates the sequence. The trailing g means every match on the line is replaced, not just the first.
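For readers who do want C#, the same idea can be sketched with .NET's Regex class. The class and method names below (PclEscapeFilter, FindCodes, RemoveCodes) are made up for illustration, and the character class is written as the hex range \x40-\x5E, which covers the same ASCII characters 64-94 as the sed expression. Like the sed version, it assumes every sequence ends at its first terminating character, so any binary data that follows a command would be left in the text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PclEscapeFilter
{
    // ESC, then any run of characters outside ASCII 64-94, then one character inside that range.
    private static readonly Regex EscapeSequence =
        new Regex(@"\x1B[^\x40-\x5E]*[\x40-\x5E]");

    // Collects every escape sequence found in the line, in order of appearance.
    public static List<string> FindCodes(string line)
    {
        var codes = new List<string>();
        foreach (Match match in EscapeSequence.Matches(line))
        {
            codes.Add(match.Value);
        }
        return codes;
    }

    // Returns the line with every escape sequence removed.
    public static string RemoveCodes(string line)
    {
        return EscapeSequence.Replace(line, string.Empty);
    }
}
```

Unlike a pattern that stops only at H, this also picks up sequences such as ESC(s0s0B and ESC&d@ from the sample in the comments, because B and @ both fall in the terminating range.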

scoobydoo