I have a JSON file containing text like:

dr. goldberg offers everything.parking is good.he's nice and easy to talk

How can I extract the sentence with the keyword "parking"? I don't need the other two sentences.

I tried this:

with open("test_data.json") as f:
    for line in f:
        if "parking" in line:
            print(line)

It prints the whole text rather than just that sentence.

I also tried a regex:

import re

with open("test_data.json") as f:
    for line in f:
        line = line.rstrip()
        if re.search("parking", line):
            print(line)

This gives the same result.

tripleee
  • When you read a line from a file object, you do not get a single sentence. It reads until it sees "\n". – Myjab Nov 22 '14 at 07:00
  • Use a simple regex. Use the pattern mentioned by dmitry_romanov, or try the pattern re.search(".*\.(.*parking.*\.)",a).group(1) – Myjab Nov 22 '14 at 07:22

3 Answers

You can use nltk.tokenize:

from nltk.tokenize import sent_tokenize, word_tokenize

with open("test_data.json") as f:
    text = f.read()

sentences = sent_tokenize(text)
# Keep every sentence whose tokens include the keyword.
my_sentence = [sent for sent in sentences if "parking" in word_tokenize(sent)]

For a more complete solution, you can wrap this in a function:

>>> def sentence_finder(text,word):
...    sentences=sent_tokenize(text)
...    return [sent for sent in sentences if word in word_tokenize(sent)]

>>> s="dr. goldberg offers everything. parking is good. he's nice and easy to talk"
>>> sentence_finder(s,'parking')
['parking is good.']
Mazdak

You can use the standard library re module:

import re

line = "dr. goldberg offers everything.parking is good.he's nice and easy to talk"
# Optional leading dot, then non-dot characters around the keyword.
res = re.search(r"\.?([^.]*parking[^.]*)", line)
if res is not None:
    print(res.group(1))

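It prints "parking is good" (the captured group does not include the trailing dot).

The idea is simple: you search for a sentence starting with an optional dot character ., then consume all non-dot characters, the word parking, and the remaining non-dot characters.

The question mark makes the leading dot optional, which handles the case where your sentence is at the start of the line.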
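If more than one sentence can mention the keyword, or sentences can end in ! or ?, the same idea might be extended roughly as below. This is only a sketch layered on top of the answer: sentences_with is a hypothetical helper name, and it still treats the dot in an abbreviation such as dr. as a sentence end.

```python
import re

def sentences_with(text, word):
    # Hypothetical helper, not part of the original answer.
    # A "sentence" here is any run of characters free of . ! ?
    # that contains `word` as a whole word.
    pattern = r"[^.!?]*\b" + re.escape(word) + r"\b[^.!?]*"
    return [match.strip() for match in re.findall(pattern, text)]

line = "dr. goldberg offers everything.parking is good.he's nice and easy to talk"
print(sentences_with(line, "parking"))  # ['parking is good']
```

re.escape keeps the keyword literal even if it contains regex metacharacters, and the \b anchors stop it from matching inside a longer word such as parkinglot.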

Tom Zych
dmitry_romanov
  • But that will fail on any sentence with a punctuated abbreviation, such as the previous sentence in the input. – tripleee Nov 22 '14 at 07:39
  • @tripleee, there is no syntax for meaning, I'm afraid. The dot `.` in `dr.` is the same as the one at the end of any sentence. If someone needs a solution that can read like a human, they either write a fragile regexp or train a neural network. Both are overkill, IMHO. Maybe `dr` was meant as `delta r`, like in a physics textbook, who knows? My solution will handle commas, etc. Terminators such as ! and ? are easy to add. – dmitry_romanov Nov 22 '14 at 17:41
  • For a question tagged nltk I would hope and expect a solution which handles at least the basics of actual human language. Yes, it's context-dependent, so a context-free tool such as regex is inherently inadequate. – tripleee Nov 22 '14 at 19:05
  • @tripleee I agree with you completely (right now I am playing with nltk, thank you for the link :-) ). Regarding "inadequate", we cannot say from here whether the OP is interested in a language-aware solution or not, nor can we say whether extra dependencies are allowed in his project (often I do not have such luxury at work). Those are his/her design decisions, not ours. So I fixed the pattern in his regexp solution so that it works on the data provided and gives exactly the result the OP asked for. That's all. – dmitry_romanov Nov 23 '14 at 06:03

How about parsing the string and looking at the values?

import json

def sen_or_none(string):
  # Return the sentence if it mentions "parking" (case-insensitive), else None.
  return string if "parking" in string.lower() else None

def walk(node):
  # Depth-first search; returns the first matching sentence found, or None.
  if isinstance(node, list):
    for item in node:
      v = walk(item)
      if v:
        return v
  elif isinstance(node, dict):
    for key, item in node.items():
      v = walk(item)
      if v:
        return v
  elif isinstance(node, str):
    # Treat "." as the sentence separator.
    for item in node.split("."):
      v = sen_or_none(item)
      if v:
        return v
  return None

with open('data.json') as data_file:
  print(walk(json.load(data_file)))
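Note that walk stops at the first match. If every matching sentence is wanted, one possible variation is a generator that yields all string values, followed by a filter. This is a sketch only: iter_strings, find_sentences and the sample data are made up for illustration, since the real layout of the JSON file is not shown in the question.

```python
import json

def iter_strings(node):
    # Yield every string value found anywhere in decoded JSON data.
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)

def find_sentences(node, keyword="parking"):
    # Split each string on "." and keep the pieces mentioning the keyword.
    keyword = keyword.lower()
    return [
        sentence.strip()
        for text in iter_strings(node)
        for sentence in text.split(".")
        if keyword in sentence.lower()
    ]

# Hypothetical sample standing in for the contents of data.json.
sample = json.loads("""
{"reviews": [
  {"text": "dr. goldberg offers everything.parking is good.he is nice"},
  {"text": "no parking nearby. long wait"}
]}
""")
print(find_sentences(sample))  # ['parking is good', 'no parking nearby']
```

As in the answer above, splitting on "." is naive and will also cut at abbreviations such as dr.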
generalpiston