I had more than 1500 entries in txt format. The following picture shows an example of such an entry:
Every entry has a title and abstract.
I wanted to extract the title ('TI -') and the corresponding abstract ('AB -') of each entry and write them into an Excel spreadsheet. I tried to do so with the following code:
import xlsxwriter

headings = ['AB -', 'TI -']
ABSTRACT = []
TITLE = []

with open(r'C:\Users\A\Desktop\test\Ward.txt', encoding='utf8') as file:
    for i in file:
        if headings[0] in i:
            ABSTRACT.append(i)
        if headings[1] in i:
            TITLE.append(i)

zipped = list(zip(TITLE, ABSTRACT))
print(zipped)

with xlsxwriter.Workbook(r'C:\Users\A\Desktop\test\Ward.xlsx') as workbook:
    worksheet = workbook.add_worksheet()
    for row_num, data in enumerate(zipped):
        worksheet.write_row(row_num, 0, data)
I managed to extract the titles and abstracts. As I scrolled down the Excel file, however, I realised that certain abstracts do not correspond to their titles.
An example of the mismatch between the title and abstract.
I am not sure why this is happening. Due to the triple spacing between the acronym and the dash ('TI -' or 'AB -'), the likelihood of the parser extracting 'TI -' or 'AB -' from anywhere but the headings should be very low. In addition, every entry should have both a title heading and an abstract heading, which means there should not be a mismatch between titles and abstracts.
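Both assumptions can be tested rather than taken on trust. The following is a minimal sketch, not a confirmed fix: it pairs each title with the abstract found in the same entry instead of zipping two flat lists. It assumes that entries are separated by blank lines and that each tag starts its own line; `parse_entries` and the sample lines are hypothetical and not taken from the real file.

```python
import re

# Matches a line that starts with the TI or AB tag, any spacing around the dash.
TAG_PATTERN = re.compile(r"^(TI|AB)\s*-\s*(.*)$")


def parse_entries(lines):
    """Return one (title, abstract) pair per entry.

    Assumption: entries are separated by blank lines. A field missing
    from an entry becomes an empty string, so later rows cannot shift.
    Only the first line of each field is kept; continuation lines of a
    multi-line abstract are ignored in this sketch.
    """
    pairs = []
    entry = {}
    in_entry = False
    for line in lines:
        text = line.strip()
        if not text:
            # A blank line closes the current entry, if there is one.
            if in_entry:
                pairs.append((entry.get("TI", ""), entry.get("AB", "")))
            entry, in_entry = {}, False
            continue
        in_entry = True
        match = TAG_PATTERN.match(text)
        if match:
            entry[match.group(1)] = match.group(2)
    if in_entry:
        # The file may not end with a blank line.
        pairs.append((entry.get("TI", ""), entry.get("AB", "")))
    return pairs


# Made-up sample: the second entry has no abstract.
sample = [
    "TI  - First title\n",
    "AB  - First abstract\n",
    "\n",
    "TI  - Second title\n",
    "\n",
    "TI  - Third title\n",
    "AB  - Third abstract\n",
]
pairs = parse_entries(sample)
missing = [title for title, abstract in pairs if not abstract]
print(len(pairs), "entries;", len(missing), "without an abstract")
```

With two flat lists, a single entry lacking an 'AB -' line would make `zip` pair every later title with the abstract of the following entry. Comparing `len(TITLE)` with `len(ABSTRACT)` in the original script would show whether that is what happens here.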
Any advice on how to tackle this error would be very much appreciated. Thank you.