Hi everyone! I am very sorry for this question, but I don't have any experience with regex and I would like to know whether something is truly possible to do.

I am working on a corpus of news stories taken from BBC News. However, some news items are repeated in my corpus, and I would like to know if something can be done to highlight these duplicates without sorting my data. Thank you so much, and I apologise again for this possibly naive question.

Antonio
  • Can you provide an example of the text you have? Is it multiple files? A single file with just headlines? All the text concatenated together? – mikdav Apr 03 '15 at 16:19
  • The question is not naive, but incomplete. You didn't say what form your data is in, how you want to highlight the duplicates, or what you mean by 'without sorting out'. – Lorenz Meyer Apr 03 '15 at 17:34
  • Dear Mikdav, I cannot give an example taken from my corpus, since the BBC asked me not to share the data extracted from their website, but if I just go to the BBC website and take a random news story, here is an example of what my corpus looks like: bbc.com/news/world-africa-32184638 Kenya al-Shabab: Kenyatta vows tough response to Garissa attack Kenyan President Uhuru Kenyatta has vowed to respond "in the severest ways possible" to the al-Shabab militant attack on Garissa university in which 148 people died. – Antonio Apr 04 '15 at 17:20
  • OK, I have played with XML annotations so as to reproduce the structure of my corpus. In other words, I have the link, the headline and the lead paragraph of every news story published on the BBC website from June to August 2014 (1 .txt file = 1 day of news stories). – Antonio Apr 04 '15 at 17:22
  • Dear Lorenz, my data have been collected in a .txt file (UTF-8). I said that I didn't want to sort my data because I know that TextPad allows you to delete duplicates, but to do that you need to sort your data first, which I don't want to do. I know that there are some programs which can do this, but I fear they might corrupt the data. This is the reason why I asked whether it is possible to remove duplicates with regex in TextPad. – Antonio Apr 04 '15 at 17:24

1 Answer

I usually sort the file with duplicate removal enabled and save the result to a different file, leaving the original file unchanged. Then I compare the two files with a diff tool (Total Commander, ExamDiff, ...).
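
If sorting is not wanted at all, a small script can flag repeats in place instead. The following is only a minimal Python sketch, assuming each news item sits on a single line of a UTF-8 text file (the real corpus layout may differ). The function names and the `>>> DUP: ` marker are made up for illustration. It writes a new file and never modifies the original.

```python
def mark_duplicates(lines, marker=">>> DUP: "):
    """Return the lines in their original order, prefixing each repeat
    of an earlier line with `marker`. Blank lines are never flagged."""
    seen = set()
    out = []
    for line in lines:
        key = line.strip()  # ignore leading/trailing whitespace when comparing
        if key and key in seen:
            out.append(marker + line)
        else:
            seen.add(key)
            out.append(line)
    return out


def process_file(src, dst):
    """Read `src` (UTF-8), write a marked copy to `dst`; `src` is left untouched."""
    with open(src, encoding="utf-8") as f:
        lines = f.read().splitlines()
    with open(dst, "w", encoding="utf-8") as f:
        f.write("\n".join(mark_duplicates(lines)) + "\n")
```

Calling `process_file("day1.txt", "day1_marked.txt")` (hypothetical file names) would produce a copy in which every repeated line starts with the marker, so the duplicates can be found with a plain text search while the original order is preserved.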

Peter