I want to search inside multiple big text files (200MB each) as fast as possible. I am using the command line tool ripgrep and I want to call it only once.

In the following string:

***foo***bar***baz***foo***bar***baz

(Each *** stands for an arbitrary run of characters, which may differ in kind and length.)

I want to match baz, but only if it follows the first occurrence of foo***bar***.

So in ***foo***bar***baz***foo***bar***baz it should match the first baz, and in ***foo***bar***qux***foo***bar***baz it should match nothing.

I tried several solutions, but none of them worked. Can this be done with a single regular expression?

trent
maikelmeyers
  • https://docs.rs/regex/1.3.1/regex/#syntax – maikelmeyers Dec 04 '19 at 14:19
  • And it has to be a "Oneliner"? Why? Is this homework? If it is, we do not do your homework for you, not at least without your first showing your valiant effort to do it yourself. And if this is a real-world problem, there are never such constraints; you do what you need to do to get a solution that works. Depending on the details of the problem (what language you are using, what you then need to do with each match, etc.), you might be able to match all of the occurrences and drop the first one. – Booboo Dec 04 '19 at 14:19
  • I want to search inside multiple big text files (200MB each) and it should be as fast as possible. So I use the command line tool ripgrep (https://github.com/BurntSushi/ripgrep) and I want to call it only once. So there is no language other than the regex, and there is nothing else to do with the match than show it in the terminal. – maikelmeyers Dec 04 '19 at 14:36
  • I must be misunderstanding, because if `baz` "follows the first occurrence of `foo***bar***`", then it's obvious that "a `foo***bar***` precedes it" (i.e. your two conditions are redundant). Can you clarify? – Aaron Dec 04 '19 at 14:37
  • Something dirty like `/baz(?<=foo.*bar.*)(?<!.*foo.*bar.*foo.*bar.*)/` (positive lookbehind for one occurrence, but negative lookbehind for multiple occurrences) may work here, but there's got to be a better way! https://regexr.com/4q441 – cmbuckley Dec 04 '19 at 14:38
  • According to the doc, `ripgrep` is based on the `Rust` regex engine and thus lacks support for *lookbehinds* (and *lookaheads*). Not the right tool for this job. – Booboo Dec 04 '19 at 14:42
  • If there is only a single `baz` to match per line, lookarounds aren't strictly necessary: you can match from the start of the line up to the `baz` (capturing it in a group if it needs to be extracted), using wildcards that cannot match more than a single occurrence of `foo***bar***`. – Aaron Dec 04 '19 at 14:48
  • @Aaron You are right. The two conditions are redundant :) "follows the first occurrence of `foo***bar***`" is enough – maikelmeyers Dec 04 '19 at 14:49
  • What do you mean by *``***`` stands for a different type and number of characters*? Do you mean that in `foo***bar`, the `***` can be anything **but not** contain `bar`? Can it contain `foo`? For example, does `foo1foo2foo3barbaz` match at index 0, at index 8 or not at all? – trent Dec 04 '19 at 15:04
  • I don't know your tool, but without lookarounds an idea could be to use a pattern with capturing groups like [`foo.*?bar.*?(?:(baz)|(foo))`](https://regex101.com/r/hIcM3a/1). If `cap[2]` is set, throw the match away. It means there has been another `foo` before `baz`. – bobble bubble Dec 04 '19 at 15:43
  • The [manual says](https://www.mankier.com/1/rg#--pcre2) there is a `-P` (`--pcre2`) flag that enables PCRE2, provided ripgrep was built with it. With PCRE2 you could use lookarounds or try something like `rg -P 'foo.*?bar.*?(?:foo.*(*SKIP)(*F)|baz.*)'` – bobble bubble Dec 04 '19 at 16:34
  • As a workaround: `rg 'foo.*bar' | grep -P 'foo.*?bar.*?(?:foo.*(*SKIP)(*F)|baz.*)'` – bobble bubble Dec 04 '19 at 17:47
  • @trentcl `***` cannot contain `foo` or `bar` – maikelmeyers Dec 05 '19 at 08:01
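
The capture-group idea from the comments (`foo.*?bar.*?(?:(baz)|(foo))`, discarding the match when the `foo` branch fires) can also be sketched without a regex engine, using only `str::find` from the standard library. This is a minimal sketch, assuming as clarified above that `***` contains neither `foo` nor `bar`; the function name `first_baz_strict` is hypothetical and not part of any tool mentioned here.

```rust
/// Returns the byte offset of `baz` if it follows the first `foo***bar***`
/// with no further `foo` in between; otherwise returns `None`.
/// (Hypothetical helper; assumes `***` never contains "foo" or "bar".)
fn first_baz_strict(input: &str) -> Option<usize> {
    // Locate the first "foo", then the first "bar" after it.
    let foo = input.find("foo")?;
    let after_foo = foo + "foo".len();
    let bar = after_foo + input[after_foo..].find("bar")?;
    let after_bar = bar + "bar".len();

    // Look at everything after that "bar".
    let rest = &input[after_bar..];
    let baz = rest.find("baz")?;

    // Mirrors the `(baz)|(foo)` alternation: whichever comes first decides.
    match rest.find("foo") {
        Some(next_foo) if next_foo < baz => None,
        _ => Some(after_bar + baz),
    }
}

#[test]
fn strict_examples() {
    assert_eq!(Some(15), first_baz_strict("***foo***bar***baz***foo***bar***baz"));
    assert_eq!(None, first_baz_strict("***foo***bar***qux***foo***bar***baz"));
    assert_eq!(None, first_baz_strict("***foo***bar***qux***foo***baz"));
}
```

Under this stricter reading, any `foo` that appears before `baz` rejects the line, even when no second `bar` follows it.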

1 Answer

I'm pretty sure that a regex is overkill in this case. A simple series of `find` calls can do the job:

fn find_baz(input: &str) -> Option<usize> {
    const FOO: &str = "foo";
    const BAR: &str = "bar";

    // 1: find the first "foo", then the first "bar" after it, then the first "baz" after that:
    let foo = input.find(FOO)?;
    let bar = input[foo..].find(BAR).map(|i| i + foo)?;
    let baz = input[bar..].find("baz").map(|i| i + bar)?;

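    // 2: reject the match if another "foo" followed by "bar" lies between
    // that "bar" and "baz":
    input[bar..baz]
        .find(FOO)
        .map(|i| i + bar)
        .and_then(|foo| input[foo..baz].find(BAR))
        // `xor` yields Some(baz) only when the search above found nothing.
        .xor(Some(baz))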
}

#[test]
fn found_it() {
    assert_eq!(Some(15), find_baz("***foo***bar***baz***foo***bar***baz"));
}

#[test]
fn found_it_2() {
    assert_eq!(Some(27), find_baz("***foo***bar***qux***foo***baz"));
}

#[test]
fn not_found() {
    assert_eq!(None, find_baz("***foo***bar***qux***foo***bar***baz"));
}

#[test]
fn not_found_2() {
    assert_eq!(None, find_baz("***foo***bar***qux***foo***"));
}
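
Since ripgrep reports matches line by line, one possible way to use the function above on a large file is to apply it to each line as it is read. The following is only a sketch under that assumption: `find_baz` is repeated so that the block is self-contained, and `matching_lines` is a hypothetical helper name. It accepts any `BufRead`, so it can be tried on in-memory data as well as on files.

```rust
use std::io::{self, BufRead};

fn find_baz(input: &str) -> Option<usize> {
    const FOO: &str = "foo";
    const BAR: &str = "bar";

    let foo = input.find(FOO)?;
    let bar = input[foo..].find(BAR).map(|i| i + foo)?;
    let baz = input[bar..].find("baz").map(|i| i + bar)?;

    input[bar..baz]
        .find(FOO)
        .map(|i| i + bar)
        .and_then(|foo| input[foo..baz].find(BAR))
        .xor(Some(baz))
}

/// Collects (1-based line number, byte offset of `baz`) for every line
/// on which `find_baz` reports a match. (Hypothetical helper.)
fn matching_lines<R: BufRead>(reader: R) -> io::Result<Vec<(usize, usize)>> {
    let mut hits = Vec::new();
    for (idx, line) in reader.lines().enumerate() {
        // `lines()` returns an error for I/O failures or invalid UTF-8.
        let line = line?;
        if let Some(offset) = find_baz(&line) {
            hits.push((idx + 1, offset));
        }
    }
    Ok(hits)
}
```

To scan a real file, the reader could be `std::io::BufReader::new(std::fs::File::open(path)?)`; for a quick check, a byte slice such as `text.as_bytes()` also implements `BufRead`.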
Boiethios