I have a text file containing random strings. I want to use specific criteria to extract the strings that match them.

Example text :

B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293

Example criteria :

All the strings that contain characters separated by hyphens this way: XXX-XX-XXXX

Output : 'B311-SG-1700'

I tried creating a function, but I can't figure out how to express criteria for strings specifically, or how to apply them.

Jeza
  • Hi. Can you explain a bit more what you're trying to achieve? I understand that you have strings separated by spaces and not newlines, and that you want to match a specific part of those strings? What I don't understand is that your example "XXX-XX-XXXX" does not match your expected result "B311-SG-1700". I assume that you might want to use a regex. Something like this might work: "\b.{4}-.{2}-.{4}", which will match anything starting at a word boundary (e.g. after a space or at the beginning of the text). It would also be helpful to know which programming language you're working with. See also https://regex101.com/r/8hVqbI/1 – Wolfspirit Nov 24 '22 at 19:00
  • Hi! I'm so sorry, I forgot. The programming language is Python 3.8. Basically, what I'm trying to achieve is to extract these strings (separated either by spaces or newlines) and to write them to a text file, line by line. The text file is going to be converted into .csv format afterwards. – Jeza Nov 24 '22 at 19:43

2 Answers

You can use the re module to extract the pattern from the text:

import re

text = """\
B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293
BAKSJD873-JAN-1293 B312-SG-1700-ASJND83-ANSDN762"""

for m in re.findall(r"\b.{4}-.{2}-.{4}", text):
    print(m)

Prints:

B311-SG-1700
B312-SG-1700
Andrej Kesely

Based on your comment, here is a Python script that might do what you want (I'm not that familiar with Python).

import re

p = re.compile(r'\b(.{4}-.{2}-.{4})')

results = p.findall('B111-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293\nB211-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293 B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293')

print(results)

Output: ['B111-SG-1700', 'B211-SG-1700', 'B311-SG-1700']

You can read a file into a string like this:

with open("file.txt", "r") as text_file:
    data = text_file.read()

Then use findall over that. Depending on the size of the file, it might require a bit more work (e.g. reading line by line).
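
A minimal sketch of the line-by-line variant, assuming the input lives in file.txt and the matches should be written one per line to a separate text file, as described in the comments on the question. The function name extract_codes and the output file name matches.txt are hypothetical, chosen only for illustration. Since "." does not match a newline by default, scanning line by line should find the same matches as scanning the whole file at once.

```python
import re

# Same pattern as above: word boundary, 4 characters, hyphen,
# 2 characters, hyphen, 4 characters.
PATTERN = re.compile(r"\b.{4}-.{2}-.{4}")


def extract_codes(in_path, out_path):
    """Write every match found in in_path to out_path, one per line.

    Returns the number of matches written. (Hypothetical helper.)
    """
    count = 0
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        for line in src:  # iterating a file object yields one line at a time
            for match in PATTERN.findall(line):
                dst.write(match + "\n")
                count += 1
    return count
```

It could then be called as extract_codes("file.txt", "matches.txt"), which keeps only one line of the input in memory at a time.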

Wolfspirit
  • Thank you for the answer. It helped me a bit. I guess my question now is: is there a way to extract the same specific string (XXXX-XX-XXXX) regardless of which character comes right before it? For example: a space, a hyphen, a comma, etc. – Jeza Nov 24 '22 at 20:37
  • Could you provide some examples with the expected outputs? "\b" means a word boundary. You can replace that with ".?" (e.g. ".?(.{4}-.{2}-.{4})"), which means any single character or none. With the parentheses you can control which part should be included in the result (this is called a capturing group). That way something like "BAKSJD873-JA-1293" will produce the result "D873-JA-1293" as well. For specific characters, instead of ".?" you can use e.g. "[-.,\w]?", which limits the preceding character to the ones listed in the square brackets. Take a look at regex "metacharacters" for things similar to "\b", such as "\w" (word character) or "\s" (whitespace). – Wolfspirit Nov 24 '22 at 21:01
  • Yes, of course. So basically what I want as output is a list containing strings that meet these specific criteria: 1- Strings must be 11 characters long (including the hyphens). 2- Must start with a letter. 3- Can't have any spaces in the string. 4- The last 4 characters are numbers. 5- The 2 characters between the hyphens should be letters. I'm really not familiar with the syntax, that's why. – Jeza Nov 25 '22 at 13:41
  • As far as I understand, you're looking for "[a-zA-Z][^\s]{3}-[a-zA-Z]{2}-\d{4}", which means "a letter in the range a-z or A-Z, 3 characters which are not whitespace, a dash, 2 letters in the range a-z or A-Z, a dash, 4 digits". You can try it out here: https://regex101.com/r/x9FqCm/1 – Wolfspirit Nov 25 '22 at 14:45
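
A short sketch of how the pattern from the last comment might be applied in Python. The sample text is made up for illustration: it extends the question's example with two hypothetical tokens ("9311-SG-1700" and "B411-S1-1700") that should be rejected. Note that a string of the form XXXX-XX-XXXX is 12 characters long including the hyphens, so this sketch follows the pattern rather than the stated length of 11.

```python
import re

# Pattern from the comment above: a letter, 3 non-whitespace characters,
# a hyphen, 2 letters, a hyphen, 4 digits.
STRICT = re.compile(r"[a-zA-Z][^\s]{3}-[a-zA-Z]{2}-\d{4}")

# Illustrative input: the first token starts with a valid code; the others
# either start with a digit or have a digit between the hyphens.
text = "B311-SG-1700-ASJND83-ANSDN762 BAKSJD873-JAN-1293, 9311-SG-1700 B411-S1-1700"

matches = STRICT.findall(text)
print(matches)  # ['B311-SG-1700']
```

Because the pattern is not anchored, it can also match in the middle of a longer token (for example, "D873-JA-1293" inside "BAKSJD873-JA-1293"). If that is not wanted, prefixing the pattern with "\b" is one possible way to require the match to start at a word boundary.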