Parse text into table with R or Python

Question

I'm trying to deal with unstructured data from NCBI's BioSample search results. The only way for me to export the metadata is as a .txt file, which contains 1,274 entries that look like this:

1: Pathogen: environmental/food/other sample from Escherichia coli
Identifiers: BioSample: SAMN30954130; Sample name: WSDA203; SRA: SRS15198261
Organism: Escherichia coli
Attributes:
    /strain="WSDA203"
    /collected by="Washington State Department of Agriculture"
    /sequenced by="Washington State Department of Agriculture"
    /collection date="2022-09-06"
    /geographic location="USA:WA"
    /isolation source="retail raw whole milk goat"
    /source type="food"
    /latitude and longitude="missing"
    /project name="GenomeTrakr; LFFM-FY3"
Accession: SAMN30954130 ID: 30954130

2: Pathogen: environmental/food/other sample from Escherichia coli
Identifiers: BioSample: SAMN30942192; Sample name: FSIS12214981; SRA: SRS15185796
Organism: Escherichia coli
Attributes:
    /strain="FSIS12214981"
    /collected by="USDA-FSIS"
    /collection date="2021"
    /isolation source="Animal-Cattle-Heifer (cecal)"
    /geographic location="USA:WA"
    /latitude and longitude="missing"
Accession: SAMN30942192 ID: 30942192

3: Pathogen: environmental/food/other sample from Escherichia coli
Identifiers: BioSample: SAMN10354025; SRA: SRS4000068; CFSAN: CFSAN087006
Organism: Escherichia coli
Attributes:
    /collection date="2002"
    /strain="2.3858"
    /attribute_package="environmental/food/other"
    /isolate name alias="CFSAN087006"
    /collected by="Pennsylvania State University | Escherichia coli Reference Center"
    /latitude and longitude="missing"
    /geographic location="USA:WA"
    /isolation source="romaine lettuce"
    /ontological term="lettuce vegetable food product:FOODON_00001998| romaine:CURATION_0000143"
    /IFSAC+ Category="vegetable row crops (leafy)"
    /source type="food"
    /PublicAccession="CFSAN087006"
    /ProjectAccession="PRJNA357722"
    /Species="coli"
    /Genus="Escherichia"
    /contact:first_name="Edward"
    /contact:last_name="Dudley"
    /contact:email="ECRC@psu.edu"
Accession: SAMN10354025 ID: 10354025

What's the easiest way to read this into a tabular format in R or Python? All I really need is the accession number and the attributes (i.e., 'strain' through 'project name' in the top entry). I'm floundering because the attribute lines are structured differently from the rest of the data, and each entry has a variable number of attributes.

Edit: As an example of what I'm trying to get as an end result, here is some R code resulting in a one-row table with information from the first entry of the raw data above:

accession <- "SAMN30954130"
strain <- "WSDA203"
collected_by <- "Washington State Department of Agriculture"
sequenced_by <- "Washington State Department of Agriculture"
collection_date <- "2022-09-06"
geographic_location <- "USA:WA"
isolation_source <- "retail raw whole milk goat"
source_type <- "food"
latitude_and_longitude <- "missing"
project_name <- "GenomeTrakr; LFFM-FY3"

example_data <- data.frame(accession, strain, collected_by, sequenced_by, collection_date,
                           geographic_location, isolation_source, source_type, latitude_and_longitude,
                           project_name)
print(example_data)

Thanks so much for any help.

Hit the nail with the right hammer. You might want to look at using [entrez](https://bioconnector.github.io/workshops/r-ncbi.html) to get your search results in friendlier form, and in searches, use both `R` and `Bioconductor`. — Chris, Oct 04 '22 at 00:17
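
If the text export is all you have, a line-by-line parse may be enough. Below is a minimal sketch in Python using only the standard library. It assumes that every attribute line has the form `/key="value"` and that every entry ends with an `Accession:` line, as in the sample above; the name `parse_biosample_text` and the column-name normalisation (non-word characters replaced by underscores) are illustrative choices, not part of any NCBI tooling.

```python
import csv
import re
import sys

# Assumed line shapes, inferred from the sample entries in the question.
ATTR_RE = re.compile(r'^\s*/(?P<key>[^=]+)="(?P<value>.*)"\s*$')
ACC_RE = re.compile(r'^Accession:\s*(?P<acc>\S+)')


def parse_biosample_text(text):
    """Return one dict per entry: the accession plus each /key="value" attribute."""
    rows = []
    current = {}
    for line in text.splitlines():
        attr = ATTR_RE.match(line)
        if attr:
            # Turn e.g. 'collected by' into 'collected_by' for use as a column name.
            key = re.sub(r'\W+', '_', attr.group('key').strip()).strip('_')
            current[key] = attr.group('value')
            continue
        acc = ACC_RE.match(line)
        if acc:
            # The Accession line closes an entry, so the row is emitted here.
            rows.append({'accession': acc.group('acc'), **current})
            current = {}
    return rows


sample = '''1: Pathogen: environmental/food/other sample from Escherichia coli
Identifiers: BioSample: SAMN30954130; Sample name: WSDA203; SRA: SRS15198261
Organism: Escherichia coli
Attributes:
    /strain="WSDA203"
    /collected by="Washington State Department of Agriculture"
    /sequenced by="Washington State Department of Agriculture"
    /collection date="2022-09-06"
    /geographic location="USA:WA"
    /isolation source="retail raw whole milk goat"
    /source type="food"
    /latitude and longitude="missing"
    /project name="GenomeTrakr; LFFM-FY3"
Accession: SAMN30954130 ID: 30954130

2: Pathogen: environmental/food/other sample from Escherichia coli
Identifiers: BioSample: SAMN30942192; Sample name: FSIS12214981; SRA: SRS15185796
Organism: Escherichia coli
Attributes:
    /strain="FSIS12214981"
    /collected by="USDA-FSIS"
    /collection date="2021"
    /isolation source="Animal-Cattle-Heifer (cecal)"
    /geographic location="USA:WA"
    /latitude and longitude="missing"
Accession: SAMN30942192 ID: 30942192
'''

rows = parse_biosample_text(sample)

# Collect the union of attribute names in first-seen order; 'accession' comes first.
columns = []
for row in rows:
    for key in row:
        if key not in columns:
            columns.append(key)

# Write a CSV table; attributes missing from an entry are left as empty cells.
writer = csv.DictWriter(sys.stdout, fieldnames=columns, restval='')
writer.writeheader()
writer.writerows(rows)
```

For the real export, the `sample` string would be replaced by the contents of the .txt file. If pandas is installed, passing `rows` to `pandas.DataFrame` should give the same table, with missing attributes filled in as NaN.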
