Consider this text:

...
bedeubedeu France The Provençal name for tripe
bee balmbee balm Bergamot
beechmastbeechmast Beech nut
beech nutbeech nut A small nut from the beech tree,

genus Fagus and Nothofagus, similar in
flavour to a hazelnut but not commonly used.
A flavoursome oil can be extracted from
them. Also called beechmast

beechwheatbeechwheat Buckwheat
beefbeef The meat of the animal known as a cow

(female) or bull (male) (NOTE: The Anglo-
saxon name ‘Ox’ is still used for some of what
were once the less desirable parts e.g. oxtail,
ox liver)

beef bourguignonnebeef bourguignonne See boeuf à la
bourguignonne
...

I would like to parse this text with Python and keep only the strings that appear exactly twice in a row, with the two copies adjacent. For example, an acceptable result would be

bedeu
bee balm
beechmast
beech nut
beechwheat
beef
beef bourguignonne

because the pattern is that each string is immediately followed by an identical copy, like this:

bedeubedeu
bee balmbee balm
beechmastbeechmast
beech nutbeech nut
beechwheatbeechwheat
beefbeef
beef bourguignonnebeef bourguignonne

So how can I search for adjacent, identical strings with a regular expression? I am testing my attempts here. Thanks!

pebox11

1 Answer

You can use the following regex:

(\b.+)\1

See demo

Or, to just match and capture the unique substring part:

(\b.+)(?=\1)

Another demo

The word boundary \b ensures that the match starts at a word boundary rather than in the middle of a word. Then .+ matches one or more characters other than a newline (with the re.DOTALL flag, also known as single-line mode, . matches newlines too), and the backreference \1 matches exactly the same sequence of characters that (\b.+) captured.

With the (?=\1) look-ahead version, the matched text does not contain the duplicate part, because a look-ahead only checks that the text follows and does not consume it.
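To make the difference between the two patterns concrete, here is a minimal Python 3 sketch run on two lines from the question's sample text. The variable names are only illustrative and are not part of the original answer.

```python
import re

# Two sample lines taken from the question's text.
sample = "bee balmbee balm Bergamot\nbeefbeef The meat of the animal known as a cow"

consuming = re.compile(r'(\b.+)\1')      # the match spans both copies
lookahead = re.compile(r'(\b.+)(?=\1)')  # the match spans the first copy only

full_spans = [m.group(0) for m in consuming.finditer(sample)]
first_copies = [m.group(0) for m in lookahead.finditer(sample)]

print(full_spans)    # ['bee balmbee balm', 'beefbeef']
print(first_copies)  # ['bee balm', 'beef']
```

In both cases group(1) holds the single copy. The look-ahead form just makes the whole match equal to it, which is convenient with re.findall or when replacing text.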

UPDATE

See Python demo:

#!/usr/bin/env python3
import re

# Capture a run that starts at a word boundary and is immediately repeated.
p = re.compile(r'(\b.+)\1')
test_str = "zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM\nabbrühenabbrühen"
for m in p.finditer(test_str):
    # Python 3 strings are Unicode, so no explicit .encode('utf-8') is needed.
    print(m.group(1))

Output:

zyme
abbrühen
Wiktor Stribiżew
  • Thank you very much for this correct answer. I wonder whether there is also a way to get only one half of the string with the regex (since that would give the wanted result), to save a second pass over the data for the final output. Thank you again `stribizhev`. – pebox11 Aug 25 '15 at 15:27
  • Excuse me, I think I should have posted this from the beginning: [`(\b.+)(?=\1)`](https://regex101.com/r/lL0cW7/2). Right? – Wiktor Stribiżew Aug 25 '15 at 15:29
  • No need thanking so much, upvoting is really enough :) BTW, you should have posted what you tried since I saw you tried something. – Wiktor Stribiżew Aug 25 '15 at 15:33
  • I now see that when I search with the proposed regex in this data: `zymezyme Yeast, the origin of the word enzyme, as the first enzymes were extracted from yeast Page 632 Thursday, August 19, 2004 7:50 PM` what I get is `[['zyme'], [], [], [' ', ' ']]`, i.e. it parses also the commas. I am using this code: `reg = re.compile(r"(\b.+)(?=\1)") for line in textfile: matches += [(reg.findall(line))] textfile.close()`, do you think this can be improved? – pebox11 Aug 26 '15 at 16:35
  • Also, why is `'abbrühenabbrühen'` parsed as `'abbr\xc3\xbchen'`? How can I avoid having these special characters parsed this way? – pebox11 Aug 26 '15 at 16:54
  • Please see my demo. It is almost a must to mention the programming language you are working with at the beginning. – Wiktor Stribiżew Aug 26 '15 at 19:38
  • Another bug that I saw is that it also catches strings like `in` in `containing`, so I guess we should set the boundary to be a line feed on the left and a space on the right? – pebox11 Aug 27 '15 at 11:47
  • Maybe you need [`p = re.compile(ur'(\b.+)\1\b')`](http://ideone.com/CzupiF)? The first `\b` word boundary already helps filter out substrings inside a larger word. – Wiktor Stribiżew Aug 27 '15 at 11:56
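
The stricter boundaries suggested in the last two comments (start of line on the left, whitespace or end of line on the right) could be sketched in Python 3 as follows. This is only one possible reading of that suggestion, not a pattern given in the thread: the `headword` pattern and the line "made in individual portions" are made up for illustration, and the sketch assumes every headword starts its line.

```python
import re

# Loose pattern from the answer: the doubled run may start at any word boundary.
loose = re.compile(r'(\b.+)\1')

# Hypothetical stricter pattern (an assumption, not from the thread):
# the doubled run must start the line and end at whitespace or end of line.
headword = re.compile(r'^(.+)\1(?=\s|$)', re.MULTILINE)

text = (
    "bee balmbee balm Bergamot\n"
    "beech nutbeech nut A small nut from the beech tree,\n"
    "made in individual portions\n"   # made-up line that trips the loose pattern
    "abbrühenabbrühen"
)

loose_hits = [m.group(1) for m in loose.finditer(text)]
strict_hits = [m.group(1) for m in headword.finditer(text)]

print(loose_hits)   # ['bee balm', 'beech nut', ' in', 'abbrühen']
print(strict_hits)  # ['bee balm', 'beech nut', 'abbrühen']
```

The loose pattern picks up ' in' because " in" is directly followed by the " in" of " individual". Anchoring at the line start removes that hit, but neither version can tell a doubled headword from a word that simply repeats itself: a line starting with something like "bonbon " would still yield "bon".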