Extracted title and abstract from txt file to excel but some of the abstracts do not match the titles

Question

I had more than 1500 entries in txt format. The following picture shows an example of such an entry:

Every entry has a title and abstract.

I wanted to extract the title ('TI -') and the corresponding abstract ('AB -') of each entry and write them into an Excel spreadsheet. I tried to do so with the following code:

import xlsxwriter

headings = ['AB  -','TI  -']
ABSTRACT = []
TITLE = []

with open(r'C:\Users\A\Desktop\test\Ward.txt', encoding='utf8') as file:
    for i in file:
        if headings[0] in i:
            ABSTRACT.append(i)
        if headings[1] in i:
            TITLE.append(i)

zipped = list(zip(TITLE, ABSTRACT))
print(zipped)

with xlsxwriter.Workbook(r'C:\Users\A\Desktop\test\Ward.xlsx') as workbook:
    worksheet = workbook.add_worksheet()

    for row_num, data in enumerate(zipped):
        worksheet.write_row(row_num, 0, data)

I managed to extract the titles and abstracts. As I scrolled down the Excel file, however, I realised that certain abstracts do not correspond to their titles.

An example of the mismatch between the title and abstract.

I am not sure why this is happening. Because of the triple spacing between the acronym and the dash ('TI -' or 'AB -'), the likelihood of the parser extracting 'TI -' or 'AB -' from anywhere but the headings should be very low. In addition, every entry should have a title and an abstract heading, which means that there should not be a mismatch between titles and abstracts.
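
Both assumptions can be tested by counting the tag hits directly. The following is a minimal sketch, not part of the code above: `count_tag_hits` is a hypothetical helper, and it assumes the tags are written with two spaces before the dash, as in the `headings` list.

```python
from collections import Counter


def count_tag_hits(lines, tags=('TI  -', 'AB  -')):
    """Count lines containing each tag anywhere vs. at the start of the line.

    `tags` mirrors the `headings` list in the question (an assumption
    about how the tags are spaced in the file).
    """
    anywhere, at_start = Counter(), Counter()
    for line in lines:
        for tag in tags:
            if tag in line:              # the substring test used in the question
                anywhere[tag] += 1
            if line.startswith(tag):     # stricter: tag must begin the line
                at_start[tag] += 1
    return anywhere, at_start


if __name__ == '__main__':
    # Same file as in the question.
    with open(r'C:\Users\A\Desktop\test\Ward.txt', encoding='utf8') as file:
        anywhere, at_start = count_tag_hits(file)
    print('anywhere:', dict(anywhere))
    print('at start:', dict(at_start))
```

If the "anywhere" counts are higher than the "at start" counts, the substring test is matching tags inside other text. If the TI and AB counts differ, some entries lack one of the two fields, and because `zip(TITLE, ABSTRACT)` pairs the two lists purely by position, every row after the first such entry ends up shifted.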

Any advise to tackle this error will be very much appreciated. Thank you.

I suspect something irregular in Ward.txt at "TI - Attention-deficit/hyperactivity disorder". In the spreadsheet, for the rows after that entry, it seems like each "TI" entry could correspond to the "AB" entry in the previous row. Also, that title seems a lot less specific than the others; maybe it shouldn't have picked that up as a title... or it missed that one's abstract entry? — Poosh, Jan 12 '22 at 08:08
This isn't an Excel issue. Xlsxwriter is outputting the data it is told to output in the order it is told. You can verify that by replacing `write_row()` with `print()`. I would guess that the parsing logic is too brittle (that there is a TI or AB title with a different number of spaces or something like that). — jmcnamara, Jan 12 '22 at 09:13
@SolarMike Yes I did. In fact, I tried a couple of approaches - (1) text to columns in Excel, (2) extracting the abstracts from ScopusSearch using the titles (via EIDs) and (3) extracting the titles and abstracts from the txt file. For (1), I only retrieved 748 abstracts & 800 something titles. For (2), I only managed to retrieve 490 abstracts from 863 titles. Using python code to extract from the txt file yielded 1573 titles and abstracts. (I am a bit stuck here as I am unsure how to go about to tweak my code to tackle the inconsistencies in the text file.) — Apples, Jan 13 '22 at 16:30
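
The brittleness raised in the comments could be reduced by pairing titles and abstracts per entry instead of building two independent lists. The following is only a sketch under stated assumptions, since the layout of Ward.txt is not shown: it assumes entries are separated by blank lines, that each tag starts at the beginning of a line, and that wrapped field text continues on indented lines. `parse_records` and the `TAG` pattern are hypothetical names, not part of the original code.

```python
import re

# Assumed tag layout: 2-4 capitals/digits at line start, optional padding, then "-".
TAG = re.compile(r'^([A-Z][A-Z0-9]{1,3})\s*-\s?(.*)$')


def parse_records(lines):
    """Yield one (title, abstract) pair per blank-line-separated entry.

    A missing field becomes an empty string, so later rows cannot shift.
    """
    record, current = {}, None
    for raw in lines:
        line = raw.rstrip('\n')
        if not line.strip():
            # Blank line: assumed to close the current entry.
            if record:
                yield record.get('TI', ''), record.get('AB', '')
            record, current = {}, None
            continue
        match = TAG.match(line)
        if match:
            # A new field starts; remember its tag for continuation lines.
            current = match.group(1)
            record[current] = match.group(2).strip()
        elif current is not None:
            # Untagged line: treat it as a continuation of the current field.
            record[current] += ' ' + line.strip()
    if record:
        # Emit the last entry if the input does not end with a blank line.
        yield record.get('TI', ''), record.get('AB', '')
```

With this approach, `zipped = list(parse_records(file))` would replace the two `append` loops and the `zip` call, and the `write_row` loop could stay unchanged. Entries with an empty title or abstract then show up as blank cells in their own row, which makes the irregular entries easy to find.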
