0

I hope someone can help me. I am new to Python and just learning. I would like to know how to delete unwanted characters from a string.

For example,

I have some strings in a text file such as 'dogs op care 6A domain, cats op pv=2 domain 3, pig op care2 domain 3'

I don't need anything from 'op' onwards in each entry, i.e. what I would like to get is just 'dogs, cats, pig'.

I see 'op' as the pattern in all these strings, so I tried the code below:

import re
f = open('animalsop.txt','r')
s = f.read()
p = re.compile('op')
match = p.search(s)
print (s[:match.start()])

The output I get is just 'dog'

Why do I not get the cats and pig as well, since they contain 'op' too?

Any help would be greatly appreciated, because I would like to use the code to analyse a large amount of similar data.

The above code was derived from String splitting in Python using regex

credits to Varuna and kragniz

Tikku
  • I'd suggest using dr jimbob's answer, since some other answers here might break depending on input. For example, if you have a sentence that says `dog opportunities`, some answers here may break. dr jimbob's answer looks for spaces on either side. If you do use regex, you should use `\bop\b`, which ensures that whatever precedes/follows `op` is a non-word character (not `a-zA-Z0-9_`), or ` op `, which does pretty much what dr jimbob's answer does but in regex – ctwheels Oct 03 '17 at 15:12
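
A minimal sketch of the point made in the comment above. The input string is made up for illustration (it is not from the original data) and contains the word "opportunities" to show where a plain substring split goes wrong and how `\bop\b` avoids it:

```python
import re

# Hypothetical input: the first entry contains a word that starts with "op".
text = "dog opportunities op care, cats op pv=2 domain 3"

# Plain substring split: also cuts at the " op" inside " opportunities".
naive = [clause.strip().split(" op")[0] for clause in text.split(",")]

# Word-boundary split: only a standalone word "op" counts as the separator.
op_word = re.compile(r"\s*\bop\b")
bounded = [op_word.split(clause.strip(), maxsplit=1)[0]
           for clause in text.split(",")]

print(naive)    # ['dog', 'cats']
print(bounded)  # ['dog opportunities', 'cats']
```

Whether this matters depends on the data: if no entry ever contains a word beginning with "op", the plain split is enough.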

4 Answers

2

It's probably easiest to not use regular expressions to solve your problem.

Assuming a file named animalsop.txt that looks like:

dogs op care 6A domain
cats op pv=2 domain 3
pig op care2 domain 3

A Pythonic solution to your problem would be something like:

with open('animalsop.txt', 'r') as f:
    for line in f:
        before_op = line.split(' op ')[0]
        print(before_op)

The nice thing about the with construct for opening files in Python is that it ensures the file is closed when you are done, even if an error occurs inside the block.
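
As a small illustration of that behaviour, here is a sketch that writes a throwaway copy of the sample file to a temporary directory (the temporary path is only an assumption to keep the example self-contained) and checks the file object's `closed` attribute inside and after the block:

```python
import os
import tempfile

# Write a throwaway sample file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "animalsop.txt")
with open(path, "w") as f:
    f.write("dogs op care 6A domain\n")

with open(path, "r") as f:
    first_line = f.readline()
    open_inside = not f.closed   # the file is still open inside the block

closed_after = f.closed          # the file is closed once the block exits
print(open_inside, closed_after)  # True True
```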

If instead your animalsop.txt file is just one long line with the clauses separated by commas, like:

dogs op care 6A domain, cats op pv=2 domain 3, pig op care2 domain 3

Then you could do something like:

with open('animalsop.txt', 'r') as f:
    for line in f:
        for clause in line.split(','):
            before_op = clause.strip().split(' op')[0]
            print(before_op)

(clause.strip() removes any whitespace around the clause, such as the space after each comma.)
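
If you want the result as a single string such as 'dogs, cats, pig' rather than one printed line per animal, the same idea can be wrapped in a small helper. This is only a sketch: `names_before_op` is a made-up name, and it works on a string so it can be tried without a file (for a real file you could pass it `f.read()`).

```python
def names_before_op(text, separator=","):
    """Return the part of each clause that comes before ' op '."""
    names = []
    for clause in text.split(separator):
        clause = clause.strip()
        if not clause:
            continue  # skip empty clauses, e.g. after a trailing comma
        names.append(clause.split(" op ")[0])
    return names


line = "dogs op care 6A domain, cats op pv=2 domain 3, pig op care2 domain 3"
result = ", ".join(names_before_op(line))
print(result)  # dogs, cats, pig
```

A clause that contains no ' op ' at all is returned unchanged, which may or may not be what you want for your data.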

dr jimbob
  • Hi drjimbob, many thanks for the code. I did try that but the output looks like – Tikku Oct 03 '17 at 15:32
  • dog op, cat op, pig op – Tikku Oct 03 '17 at 15:32
  • Any suggestions on how I could get just dog, cat, pig without the 'op'? Many thanks – Tikku Oct 03 '17 at 15:33
  • I am sorry if I have caused confusion, but when I try the code, it returns only 'dog' and not the cat and pig. Am I doing anything wrong here, please? – Tikku Oct 03 '17 at 16:03
  • @Tikku - are you sure? If I have a file that consists of three lines: `dogs op care 6A domain`, `cats op pv=2 domain 3`, `pig op care2 domain 3` inside a file called `animalsop.txt`, and you paste the code snippet above, you'll get `dogs`, `cats`, and `pig` on three separate lines. – dr jimbob Oct 03 '17 at 16:04
  • Sorry, my mistake. Apologies. The text file I had was not in three separate lines one below the other, but in a single line separated by commas. When I put them line by line, they work. Thanks for that. But I need to extract the same result from a huge text file where the strings are separated only by commas. – Tikku Oct 03 '17 at 16:13
  • YOU ARE A GENIUS :-) – Tikku Oct 03 '17 at 16:24
  • Thanks very much for all your help and time – Tikku Oct 03 '17 at 16:24
  • with open('animalsop.txt', 'r') as f: for line in f: for clause in line.split(','): before_op = clause.strip().split(' op')[0] print(before_op) #this worked in the end – Tikku Oct 03 '17 at 16:25
1

Based on the examples you have provided, I suggest using the simple .split() string method and taking the first part, i.e. the part before " op".

partOfYourInterest = "dogs op care 6A domain".split(" op")[0]

For more strings, you can iterate over them, e.g. with a for loop:

text = ["dogs op care 6A domain","cats op pv=2 domain 3", "pig op care2 domain 3"]

for part in text:
    animal = part.split(" op")[0]
    print(animal)

And for your txt file it could look like:

with open('animalsop.txt', 'r') as f:
    for line in f:
        # use the loop variable `line` (not `part`) for each line of the file
        animal = line.split(" op")[0]
        print(animal)
Petr Matuska
  • Good solution @Petr Matuska – Marvin Oct 03 '17 at 14:41
  • Many thanks Petr Matuska. I tried the code and I got exactly what I wanted. However, I am wondering how to get the strings within quotes like this. It was easy to type text = ["dogs op care 6A domain","cats op pv=2 domain 3", "pig op care2 domain 3"], but could you suggest how I could do this for a huge text file? Many thanks – Tikku Oct 03 '17 at 15:35
  • Yes, you can open and read your txt file and process it line by line - I have edited my code. – Petr Matuska Oct 04 '17 at 07:50
0

If you want to use a regular expression you can use:

re.findall(r'\w+?(?= op)', s)

['dogs', 'cats', 'pig']
Evan Nowak
  • Thanks Evan for your kind code. It's easier when I can pick out dogs, cats and pig myself, but I am wondering how to pick them out when I use large data sets. – Tikku Oct 03 '17 at 16:04
  • The regex will work with any dataset, it just looks for the word before "op" – Evan Nowak Oct 03 '17 at 18:08
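
To connect this back to the question: `re.search` stops at the first match, which is why the original attempt printed only the first animal, whereas `re.findall` returns every non-overlapping match. A minimal sketch on the string from the question follows. The `\b` after `op` is an addition that is not in the answer above; it is there so that a word such as "opportunities" is not treated as `op`.

```python
import re

s = "dogs op care 6A domain, cats op pv=2 domain 3, pig op care2 domain 3"

# re.search returns only the first match, so slicing up to it gives
# just the text before the first "op".
first = re.search("op", s)
first_only = s[:first.start()]

# re.findall returns all matches: each word followed by a standalone " op".
all_animals = re.findall(r"\w+(?= op\b)", s)

print(repr(first_only))  # 'dogs '
print(all_animals)       # ['dogs', 'cats', 'pig']
```

Note that `\w+` picks up only the single word directly before `op`, so an entry like "guinea pig op ..." would yield just "pig".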
0

If you only want the first word of each entry, and string is your string, you can use:

string='dog fgfdggf fgs, cat afgfg, pig fggag'
strings=string.split(', ')
newstring=strings[0].split(' ', 1)[0]
for stri in strings[1:]:
    newstring=newstring+', '+stri.split(' ', 1)[0]
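
The same result can also be built with `str.join` instead of concatenating in a loop. This is just an alternative sketch of the snippet above, using the same sample string; `first_words` is a name introduced here for illustration:

```python
string = 'dog fgfdggf fgs, cat afgfg, pig fggag'

# Take the first word of every comma-separated entry, then join them back.
first_words = [part.split(' ', 1)[0] for part in string.split(', ')]
newstring = ', '.join(first_words)

print(newstring)  # dog, cat, pig
```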
Ioannis Nasios