
I can extract person names with the spaCy NER model, but the output includes lawyers, police officers, and every other person mentioned. My problem is to extract only the name of the person who is accused of, convicted of, or has committed the crime described in a news article.

For example, take the news article below: https://www.channelnewsasia.com/news/world/turkey-frees-opposition-figure-pending-terrorism-trial---anadolu-11095480

ANKARA: A Turkish court on Monday ordered the release on bail of a former opposition lawmaker while he is being tried on terrorism-related charges, state-owned Anadolu news agency said.

Eren Erdem, who lost his seat in mid-2018 elections that granted President Tayyip Erdogan sweeping new powers, has been jailed since June and accused of publishing illegal wiretaps while editor of an opposition newspaper in 2014.

He denies charges of assisting followers of U.S.-based cleric Fethullah Gulen, who is accused of orchestrating a failed 2016 putsch.

Eren Erdem is the prime accused and I need only this name, but the spaCy model extracts all the person names: Tayyip Erdogan (the president), Fethullah Gulen, Enis Berberoglu, Tuvan Gumrukcu, etc.

I need the name of the criminal, not the president or the police.

Can we do this using Python/NER?

Edit: Can we apply the knowledge graph concept here? I explored it a lot but couldn't find a convincing article for this case. It would be great if someone could walk through the concept or provide relevant article links.

Laster
  • Do you have any info before-hand about the people involved? Or is the input article the only thing you have? – Kraay89 Oct 17 '19 at 10:23
  • The input news article is the only input we have – Laster Oct 18 '19 at 06:03
  • Then I think there is no plug-and-play way to do this. This requires some sort of machine learning, with which I can't help you... – Kraay89 Oct 18 '19 at 06:57
  • Yes, that might be the case. I have already tried a few things. If ML is required, can anyone help me with that? – Laster Oct 20 '19 at 07:27

2 Answers


First, ask yourself how a human reader of the text manages to identify the criminal. The proper name denoting the criminal fills an argument slot of a verb, whether a copular verb as in "He is a criminal" or a semantically richer verb as in "the man also committed the murder 2 years ago". That argument function (the subject, in both examples) identifies the criminal entity. What you have to do is:

  1. Identify the sentence that mentions the criminal, including the so-called subcategorisation frame of the verb (which gives its arguments, e.g. "SUBJECT", "OBJECT", etc.).
  2. Parse the sentence so that the arguments become accessible (using nltk or spaCy), and run NER on it.
  3. Extract the entity that is both recognized by NER and subcategorized by the verb in the argument position that assigns the criminal role.
  4. If necessary, perform anaphora resolution: when a personal pronoun is used, it has to be matched with the entity it refers to (think of this as chaining references through pronouns).
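Steps 2 and 3 could look roughly like the sketch below. It is only an illustration under assumptions, not a finished solution: the verb list `ACCUSATION_LEMMAS` and the function name `accused_candidates` are made up for this example, and it relies on the `nsubjpass` and `dobj` dependency labels used by spaCy's English models. It does not handle pronouns or relative clauses (step 4).

```python
# Hypothetical verb list; extend it for your own corpus.
ACCUSATION_LEMMAS = {"accuse", "arrest", "charge", "convict", "jail", "sentence"}


def accused_candidates(doc, lemmas=ACCUSATION_LEMMAS):
    """Return PERSON entities that are the passive subject or the
    direct object of an accusation verb in a parsed spaCy Doc."""
    names = []
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue
        head_token = ent.root  # syntactic head token of the entity span
        # "X was jailed"      -> X is nsubjpass of "jail"
        # "police arrested X" -> X is dobj of "arrest"
        if (head_token.dep_ in {"nsubjpass", "dobj"}
                and head_token.head.lemma_ in lemmas):
            names.append(ent.text)
    return names


# Usage sketch (assumes spaCy and an English model such as
# en_core_web_sm are installed; article_text is your input string):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   print(accused_candidates(nlp(article_text)))
```

Whether a given name is caught depends on how the parser analyses the sentence, so treat the result as a list of candidates to inspect rather than a final answer.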

Really, there is no out-of-the-box model; it's rather a linguistic pipeline with a separate implementation for each step that takes you there. For anything more detailed, you would need to post some code and ask direct questions about the pipeline implementation.

You can use machine learning, but you still need to perform steps 1 and 2 for that anyway, so it is better to try those steps first.

CLpragmatics

I'm also using spaCy in my project to extract victim names, and I also get a lot of non-victim names such as police officers, doctors, suspects, etc. Tools like spaCy are very useful, but you need to help them identify which type of PERSON entity you want to extract. To filter out the names I want, what I do is:

  1. Analyze the articles and recognize some common patterns. Usually, articles from the same source follow the same formats. In your case, I checked a few articles from the given website and they follow formats like "Suspect name, age, was accused/arrested/other synonyms" or "Suspect name, who [description], was accused/arrested/other synonyms". This is a pretty common format for crime-related articles. There could be other formats, of course, but it's unlikely that there will be too many, since these sites usually follow a certain standard or the articles are written by a few authors.

What pattern do you see from this? The sentences that contain the suspect's name are often divided into three chunks. The [1] first one is the name followed by a comma, the [2] second one is either digits (age) or some description beginning with "who" followed by a comma, and the [3] third one contains a verb similar to "arrest", such as arrested, jailed, accused, etc.

In your example: "[1] Eren Erdem, [2] who lost his seat in mid-2018 elections that granted President Tayyip Erdogan sweeping new powers, [3] has been jailed since June and accused of publishing illegal wiretaps while editor of an opposition newspaper in 2014."

  2. Use a regular expression to catch only the phrases that have this pattern. In Python:

    import re

    # `text` is assumed to hold the article body as a string.
    pattern = r'(\w+\W+\w+){1,5},\swho\s(\w+\W+\w+){0,20},\s(\w+\W+){0,5}(arrested|jailed)\s(\w+\W+){0,10}'

    for result in re.finditer(pattern, text, flags=re.I):
        print(result.group())                # pass this to spaCy
        print(result.group().split(",")[0])  # or take the leading name directly

You can use machine learning, but there will always be some results that require tuning. You can also use scoring: if the article is about a suspect, the PERSON entity that occurs most often is usually the suspect, while other entities will probably be mentioned only a few times or just once.
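A minimal sketch of that scoring idea, using only the standard library. The function name `rank_people` and the sample mention list are made up for illustration; in practice the mentions would be the PERSON entity texts returned by spaCy. As an extra assumption, a bare surname such as "Erdem" is credited to a full name when exactly one full name ends with it.

```python
from collections import Counter


def rank_people(mentions):
    """Rank PERSON mentions by frequency, most frequent first."""
    full_names = [m for m in set(mentions) if " " in m]
    counts = Counter()
    for mention in mentions:
        # Credit "Erdem" to "Eren Erdem" only if the match is unambiguous.
        owners = [n for n in full_names if n.split()[-1] == mention]
        key = owners[0] if len(owners) == 1 else mention
        counts[key] += 1
    return counts.most_common()


# Hypothetical list of PERSON mentions, in order of appearance.
mentions = ["Eren Erdem", "Tayyip Erdogan", "Erdem", "Fethullah Gulen", "Erdem"]
print(rank_people(mentions)[0])
```

Frequency alone can mislead, for instance when a prominent politician is mentioned more often than the suspect, so it probably works best combined with the pattern or verb-based filters above.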

  • The real issue is that I have articles from Reuters, BBC, Times of India, NY Times, etc., and they don't all follow a similar pattern. – Laster Nov 26 '19 at 09:47