
I have lines of text containing multiple variables which correspond to a specific entry.

I have been trying to use regular expressions, such as the one below, with mixed success: the lines are quite standardised but do contain typos and inconsistencies.

import re
re.compile(r'matching factor').findall(text)

I was wondering what the best way to approach this case is: which data structures to use and how to loop through multiple lines of text. Here is a sample of the text containing the data I would like to scrape:

CHINA: National Grain Trade Centre: in auction of state reserves, govt. sold 70,418 t wheat (equivalent to 3.5% of total volume offered) at an average price of CNY2,507/t ($378.19) and 4,359 t maize (4.7%), at an average price of CNY1,290/t ($194.39). Separately, sold 2,100 t of 2013 wheat imports (1.5%) at CNY2,617/t ($394.25). 23 Oct

I would like to create a data set containing variables such as:

VOLUME - COMMODITY - PERCENTAGE SOLD - PRICE - DATE

Vadim Kotov
  • Could you include the exact output you would like to obtain from the sample you gave? Also, I think regexes might not be enough to handle this problem, since the source is too unstructured. – Arne Nov 07 '17 at 16:00
  • So I am looking to get three tuples: ('70,418', 'wheat', '3.5', '2,507', '23 Oct'); ('4,359', 'maize', '4.7', '1,290', '23 Oct') and ('2,100', 'wheat', '1.5', '2,617', '23 Oct') – Bullet Dodger Nov 07 '17 at 19:25
  • I think this question exceeds the scope of a StackOverflow post. You'd need to write a whole program around this problem, knowing very well which format your data has, what kind of errors you can tolerate, which entries may be missing and how you'd handle that. – Arne Nov 08 '17 at 08:44
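
For the single sample line in the question, a minimal regex sketch might look like the following. It is only an illustration under strong assumptions: every sale is written as "<volume> t <commodity> (<percentage>%) ... CNY<price>/t", and the date is the last thing on the line. The names `Sale`, `SALE_RE`, `DATE_RE`, `parse_line` and `parse_lines` are made up for this example, and lines with typos or missing fields would need extra handling (for instance, collecting unmatched lines for manual review).

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class Sale(NamedTuple):
    """One row of the target data set (hypothetical structure)."""
    volume: str
    commodity: str
    percentage: str
    price: str
    date: Optional[str]


# Assumed layout of one sale; anything that deviates from it is skipped.
SALE_RE = re.compile(
    r"""
    (?P<volume>\d[\d,]*)\s*t\s+              # volume in tonnes: 70,418 t
    (?:of\s+)?(?:\d{4}\s+)?                  # optional: of 2013
    (?P<commodity>[A-Za-z]+)                 # commodity: wheat, maize
    [^()]*?\(\D*?                            # text up to the percentage
    (?P<percentage>\d+(?:\.\d+)?)\s*%        # percentage sold: 3.5
    [^)]*\)                                  # rest of the parenthesis
    .*?CNY\s*(?P<price>\d[\d,]*(?:\.\d+)?)/t # price per tonne in CNY
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Assumed: the line ends with a day and a month name, such as 23 Oct.
DATE_RE = re.compile(r"(\d{1,2}\s+[A-Za-z]{3,9})\.?\s*$")


def parse_line(line: str) -> List[Sale]:
    """Return every sale found in one line; all share the line's date."""
    date_match = DATE_RE.search(line.strip())
    date = date_match.group(1) if date_match else None
    return [
        Sale(
            m.group("volume"),
            m.group("commodity").lower(),
            m.group("percentage"),
            m.group("price"),
            date,
        )
        for m in SALE_RE.finditer(line)
    ]


def parse_lines(lines: Iterable[str]) -> List[Sale]:
    """Loop over many lines and collect the rows in one flat list."""
    rows: List[Sale] = []
    for line in lines:
        rows.extend(parse_line(line))
    return rows


SAMPLE = (
    "CHINA: National Grain Trade Centre: in auction of state reserves, "
    "govt. sold 70,418 t wheat (equivalent to 3.5% of total volume offered) "
    "at an average price of CNY2,507/t ($378.19) and 4,359 t maize (4.7%), "
    "at an average price of CNY1,290/t ($194.39). Separately, sold 2,100 t "
    "of 2013 wheat imports (1.5%) at CNY2,617/t ($394.25). 23 Oct"
)

for row in parse_lines([SAMPLE]):
    print(row)
```

On the sample line this produces three records whose fields match the tuples requested in the comments. A list of named tuples is one reasonable choice of data structure here because it can be written straight to CSV or loaded into a data frame later; note that if a sale has no price, the lazy `.*?` may borrow the price of the next sale, so results on messier lines should be checked.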

0 Answers