I have a dataframe of ~20,0000 observations. I am focused specifically on a column that has abstracts of scientific journals. I am attempting to pull plant species names out of these abstracts.
The genus has already been extracted from each abstract, so I could use a lookaround anchored on the genus to find the species name, since the species name directly follows the genus in the abstract (e.g. Genus species). The problem is that thousands of genera were pulled out of these articles, so writing out a typical alternation pattern by hand, for example
pattern = Malus|Gentiana|Acer|Quercus
together with the lookaround, would not be practical for thousands of genera. Is there a way (maybe a function) to keep the lookaround part of the pattern fixed and substitute in the genera (they are currently stored in a single-column data frame) to pull out the matches?
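To make the idea concrete, here is a minimal sketch of building the alternation programmatically instead of typing it out. It assumes Python's standard re module, since no language is fixed above. The list genera is a hypothetical stand-in for the single-column data frame, and build_pattern and the "next word" sub-pattern [a-z-]+ are illustrative assumptions, not a definitive solution.

```python
import re

# Hypothetical stand-in for the single-column data frame of genera.
genera = ["Malus", "Gentiana", "Acer", "Quercus", "Lagerstroemia", "Vaccinium"]

def build_pattern(genus_list):
    """Compile one regex: any listed genus followed by the next word."""
    # Escape each name so any special characters are treated literally.
    alternation = "|".join(re.escape(g) for g in genus_list)
    # Group 1 is the genus, group 2 is the word directly after it.
    return re.compile(r"\b(" + alternation + r")\s+([a-z-]+)", re.IGNORECASE)

pattern = build_pattern(genera)
match = pattern.search("suspension cultures of vaccinium pahalae. exogenous application")
print(match.group(0))  # vaccinium pahalae
```

The fixed part of the pattern stays in one place, and only the joined genus list changes, so the same function should work whether the list holds four names or several thousand.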
What I would like... example sentences that would appear in abstracts
ex 1.
axillary bud cultures were initiated from 3 types of nodal explants of lagerstroemia parviflora. the cultures derived from explants of seedlings, terminal twigs of a 50-year-old tree and basal-sprouts of another 50-year-old tree showed significant variation in responses at establishment, shoot proliferation and rooting stages. all the 3 types of explants exuded phenolic substances from their cut ends. the exudation was checked by suspending them in a solution of 25 mu m pvp 40 and 522.5 mu m citric acid; and by the addition of 100 mu m pvp 40 and 522.5 mu m citric acid in establishment medium. leaching continued upto rooting stage, therefore, pvp 40 and citric acid were added in ms medium used for successive transfer and rooting of microshoots. seedling and basal-sprout explants placed on ms medium with 0.44 mu m ba showed maximum shoot lengths 1.45 cm +/- 0.13 and 1.16 cm +/- 0.22, respectively. tree explants exhibited best axillary shoot elongation (0.8 cm +/- 0.07) on ms medium without plant growth regulators. the cultures derived from seedling and basal-sprout explants could be successfully maintained upto 6th successive transfers, whereas, those derived from tree explants died after 3rd transfer. microshoots obtained from seedling and basal-sprout explants showed 10% rooting on ms medium supplemented with 4.9 mu m iba.
ex 2.
the influence of headspace ethylene on anthocyanin, anthocyanidin, and carotenoid accumulation was studied in suspension cultures of vaccinium pahalae. exogenous application of ethrel (an ethylene-releasing compound) significantly reduced growth and secondary metabolite production, whereas incorporation of 5.0 or 10.0 mg l(-1) clcl(2) or nicl(2) effectively reduced ethylene accumulation and improved product accumulation, but agno(3) was toxic to cells. this study showed an overall negative impact of increased ethylene levels in the vessel headspace on phytochemical production in ohelo cell cultures.
I would like a pattern that looks ahead from lagerstroemia and vaccinium to get lagerstroemia parviflora and vaccinium pahalae, and to do this for thousands of other genera so that the matches are extracted in the format "genus species".
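Since the genus is already known for each abstract, the lookaround could also be built one row at a time. The sketch below assumes Python's re module. The rows list and the species_after helper are hypothetical names used only for illustration. Python requires a lookbehind to have a fixed width, which holds here because each pattern contains exactly one genus.

```python
import re

# Hypothetical rows: each abstract paired with its already-extracted genus.
rows = [
    {"genus": "lagerstroemia",
     "abstract": "3 types of nodal explants of lagerstroemia parviflora. the cultures"},
    {"genus": "vaccinium",
     "abstract": "suspension cultures of vaccinium pahalae. exogenous application"},
]

def species_after(genus, abstract):
    """Return 'genus species' using a lookbehind on the given genus, or None."""
    # (?<=...) asserts that the genus and one whitespace character come
    # immediately before the match; the match itself is only the next word.
    lookbehind = r"(?<=\b" + re.escape(genus) + r"\s)[a-z-]+"
    m = re.search(lookbehind, abstract, re.IGNORECASE)
    return f"{genus} {m.group(0)}" if m else None

for row in rows:
    print(species_after(row["genus"], row["abstract"]))
# lagerstroemia parviflora
# vaccinium pahalae
```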
I would also like to account for abstracts that mention multiple genera. For example...
in two turfgrass species, festuca arundinacea schreb, (tall fescue) and zoysia japonica steud, (zoysiagrass), regeneration culture systems using two types of bioreactors were developed, regenerants of tall fescue and zoysiagrass were efficiently produced by using an aeration-agitation type bioreactor and a rotating drum type bioreactor, respectively, the regenerants of each species were harvested from the bioreactors and cultivated in vitro during the preparation stage either on a 1/4 strength ng gellan gum (4 g l(-1)) medium without sucrose or with 30 g l(-1) sucrose, and under co2 concentration of 0.4 or 50 mmol mol(-1), a photoperiod of 24 h per day and a photosynthetic photon flux density of 125 mu mol m(-2) s(-1). the shoot and root lengths and shoot and root dry weights of tall fescue regenerants and the root dry weight of zoysiagrass regenerants were greater on the medium with sucrose than those on the medium without sucrose, regardless of the co2 concentration, the survival percentage, shoot number and shoot length of zoysiagrass regenerants growing on the medium without sucrose under 50 mmol mol(-1)of co2 were the highest among all the treatments.
In the above example, I would like to search for festuca and zoysia and extract festuca arundinacea schreb, (tall fescue) and zoysia japonica steud
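For abstracts that mention several genera, a rough sketch of the multi-match case is to scan for every occurrence rather than only the first. It assumes Python's re module. The genera list and the extract_species helper are hypothetical, and the species is assumed to be the single word after the genus.

```python
import re

# Hypothetical genus list; in practice it would come from the genus column.
genera = ["festuca", "zoysia", "malus"]

# One alternation built from the list, followed by the next word.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(g) for g in genera) + r")\s+[a-z-]+",
    re.IGNORECASE,
)

def extract_species(abstract):
    """Return each distinct 'genus species' pair in order of first appearance."""
    found = []
    for m in pattern.finditer(abstract):
        name = m.group(0).lower()
        if name not in found:
            found.append(name)
    return found

text = ("in two turfgrass species, festuca arundinacea schreb, (tall fescue) "
        "and zoysia japonica steud, (zoysiagrass), regeneration culture systems")
print(extract_species(text))  # ['festuca arundinacea', 'zoysia japonica']
```

This sketch stops after one word, so trailing author abbreviations such as schreb or steud are not captured. Picking those up would need an extra, optional word in the pattern.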