
Can one change the IPTC taxonomy to a boolean expression?

"For easing the exchange of news, the International Press Telecommunications Council (IPTC) has developed the NewsML Architecture (NAR). As part of this architecture, specific controlled vocabularies, such as the IPTC News Codes, are used to categorize news items. The Subject Codes is a thesaurus of 1300 terms used for categorizing the main topics (subjects) of each news item."

As of 2021, there are more than 1400 terms. The IPTC Subject Codes (from 2012) form a tree-like structure with three levels. My assumption is that a group of vocabulary terms defines the category of a news item.

My question: is it possible to convert the hierarchy to a boolean expression like this?

"armed conflict" OR "armed dispute" OR "civil riots" OR (("armed" OR "weapon") AND ("right-wing" OR "left-wing" OR "extremist" OR "dangerous" OR "confrontation"))

tursunWali

1 Answer


We at IPTC have looked at this question in the past when we built a rules-based classification engine as a Google News Initiative project. It's called IPTC EXTRA and it allows users to create rules based on boolean logic to classify documents against terms in the IPTC Media Topics controlled vocabulary (or any other CV).

The rule language, Extra Query Language (EQL), is more expressive than simple Boolean and/or/not operators. It also takes into account the proximity of words and some other characteristics: see the EXTRA User Manual for details.
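
To illustrate why proximity matters, here is a minimal Python sketch that contrasts a plain boolean AND with a simple word-distance check. It is not EQL and does not use the EXTRA engine; the function names `matches_all` and `within_distance`, the tokenizer, and the two sample texts are all assumptions made up for this illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def matches_all(text, terms):
    """Plain boolean AND: every term occurs somewhere in the text."""
    tokens = set(tokenize(text))
    return all(term in tokens for term in terms)


def within_distance(text, first, second, max_gap):
    """Proximity check: the two words occur at most `max_gap` tokens apart."""
    tokens = tokenize(text)
    first_pos = [i for i, t in enumerate(tokens) if t == first]
    second_pos = [i for i, t in enumerate(tokens) if t == second]
    return any(abs(i - j) <= max_gap for i in first_pos for j in second_pos)


# Hypothetical sample texts (not taken from any IPTC test set).
near = "Armed extremist groups clashed with police."
far = ("The armed forces held a parade. Later, a talk covered "
       "the history of extremist movements in literature.")

# Both texts satisfy the boolean rule "armed" AND "extremist" ...
print(matches_all(near, ["armed", "extremist"]))
print(matches_all(far, ["armed", "extremist"]))

# ... but only the first has the two words close together.
print(within_distance(near, "armed", "extremist", 3))
print(within_distance(far, "armed", "extremist", 3))
```

In this toy case the boolean rule accepts both texts, while the proximity rule rejects the second one, where the two words appear in unrelated sentences.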

You can see a set of test rules created for the EXTRA project on our GitHub repository. But please note that this is just a small subset of the rules that would be required to classify any content against the IPTC Media Topics vocabulary. At present, we don't know of a full set of rules for classifying all Media Topics.

Dharman
Brendan Quinn
  • Thank you. I noticed two things: 1. On the IPTC EXTRA website, I see this declaration: "Commercial Considerations: The core EXTRA platform will be open source and will not be directly monetized by the IPTC." When? 2. You said "Extra Query Language (EQL) is more expressive than simple Boolean and/or/not operators. We also look at proximity of words and some other characteristics"; I could not agree more. Is downgrading from an EQL expression to a Boolean expression possible (is there a tool)? Thank you. – tursunWali Feb 18 '21 at 01:31
  • @tursunWali, you could turn proximity-based rules into simple "or" searches, but it will hugely impact the quality of your classification. We don't provide a tool to do this, sorry. – Brendan Quinn Feb 19 '21 at 09:23
  • @Brendan, thank you, I understand the above. I'd like to go back to your first answer. 1. Should I understand "term" as a word or named entity? How can one get all the terms? I assume there are hundreds for each category. I looked here: https://iptc.org/standards/media-topics/ 2. You said "this is just a small subset of the rules that would be required to classify any content against the IPTC Media Topics vocabulary. At present, we don't know of a full set of rules for classifying all Media Topics." As I understand it, EXTRA is integrated with Elasticsearch; without a full set of rules, how does it do IPTC classification? – tursunWali Feb 19 '21 at 23:58
  • I wonder whether this ever came out: "To facilitate adoption and consistency, the IPTC will also create EXTRA extraction rules for tagging documents in two different languages with its industry standard Media Topics vocabulary." – tursunWali Mar 05 '21 at 04:30
  • Hi @tursunWali, sorry I missed your messages earlier. The full Media Topics vocabulary is at http://cv.iptc.org/newscodes/mediatopic/ and can be downloaded in RDF/XML, JSON-LD etc. As for the EXTRA classification rules, we never created more than that set of test rules. I'm afraid you would have to create your own set. – Brendan Quinn Mar 11 '21 at 13:48
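
For anyone who wants to try the naive conversion asked about in the question, here is a minimal Python sketch that flattens one branch of a concept tree into a single OR expression over its labels. The `CONCEPTS` list is a hypothetical, simplified structure with made-up IDs and labels; it is not the schema of the actual Media Topics download, so a real export would first have to be mapped into this shape. As noted in the comments above, an expression built from labels alone is likely to classify far less accurately than hand-written rules.

```python
from collections import defaultdict

# Hypothetical, simplified concepts: NOT the real Media Topics export schema.
# "broader" holds the ID of the parent concept, or None for a top-level one.
CONCEPTS = [
    {"id": "t1", "label": "conflict", "broader": None},
    {"id": "t2", "label": "armed conflict", "broader": "t1"},
    {"id": "t3", "label": "civil unrest", "broader": "t1"},
    {"id": "t4", "label": "guerrilla activity", "broader": "t2"},
    {"id": "t5", "label": "sport", "broader": None},
]


def build_children(concepts):
    """Map each parent ID to the list of its direct child IDs."""
    children = defaultdict(list)
    for concept in concepts:
        if concept["broader"] is not None:
            children[concept["broader"]].append(concept["id"])
    return children


def subtree_labels(root_id, concepts):
    """Return the labels of root_id and all its descendants, depth-first."""
    labels = {c["id"]: c["label"] for c in concepts}
    children = build_children(concepts)
    ordered, stack = [], [root_id]
    while stack:
        node = stack.pop()
        ordered.append(labels[node])
        # Push children in reverse so they are visited in listed order.
        stack.extend(reversed(children[node]))
    return ordered


def to_or_expression(root_id, concepts):
    """Join the quoted labels of a subtree with OR."""
    return " OR ".join(f'"{label}"' for label in subtree_labels(root_id, concepts))


print(to_or_expression("t1", CONCEPTS))
```

Calling `to_or_expression("t5", CONCEPTS)` on a concept without children simply returns that concept's own quoted label.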