1

I have a .txt file and I want to write a regex that extracts some of the data from it.

I have tried, but I can't get the result I'm looking for.

The data is laid out as a table, and other files may use the same format.

The data is pasted below; please treat it as the contents of the .txt file.

Any help will be appreciated.


            Tribhuwan  Diagnostic Centre |  HOSPITALROAD, Morne) 


                                                                                  East Champaran- 845401 (Bihar) 


                                 (FULLY AUTOMATED   & COMPUTERISED   LAB)        Mob. :+9162046  29003 
             Name        HAJAN sadshaj                    Booking Date           22/s/2020 
             G/A   male  18 Yrs                        Reporting Date         22/05/2020 
             Lab No.     10203693                              Sample Collected At    Lab 
             Ref. By Dr. I.C.U 
                  ;                                                                          UVLO 
             Test Name                                  Value         Unit            Biological Ref Interval 
                                           COMPLETE   BLOOD   COUNT (CBC) 
             TOTAL LEUCOCYTES    COUNT (TLC)            23160         cells/cmm       4000 - 11000 
             DIFFERENTIAL LEUCOCYTES  COUNT (DLC) 
             NEUTROPHILS                                93.4          %               45.0 - 65.0 
             LYMPHOCYTES                                 3.3          %               20.0 - 45.0 
             MONOCYTES                                   3.1          %               4.0 - 10.0 
             EOSINOPHILS                                0.2           %               0.0 - 5.0 
             BASOPHILS                                   0.0          %               0.0-1.0 
             ABSOLUTE   NEUTROPHILS                      21620.0                      3000.0 - 7000.0 
             ABSOLUTE   LYMPHOCYTES                      750.0                        800.0 - 4000.0 
             ABSOLUTE  MONOCYTES                         730.0                        0.0 - 1200.0 
             ABSOLUTE  EOSINOPHILS                       50.0                         0.0 - 500.0 
             ABSOLUTE  BASOPHILS                         10.0                         0.0 - 100.0 
             RBC  COUNT                                  4.31         Millions/cmm    3.80 - 5.80 
             HAEMOGLOBIN   (Hb)                          13.1         gm/dl            11.0 - 16.5 
             P.C.V/HCT                                   41.2         %                35.0 - 50.0 
             MCV                                         95.5         fl.              80.0 - 97.0 
             MCH                                         30.3         Picogram         26.5 - 35.5 
             MCHC                                        31.8         g/dl             31.5-35.5 
             RDW  / SD                                   49.7         FI               37.0 - 54.0 
             RDW  / CV                                   12.3         %                10.0 - 15.0 
             PLATELET  COUNT                             148000       /cmm             150000 - 450000 
             PDW                                         17.0         fl               10.0 - 18.0 
             MPV                                         13.3         fl               6.5 - 11.7 
             PCT                                         0.198        %                0.108 - 0.282 


Le 


_ 

I want to get only the first two columns from this.

Desired output (Test Name, Value):

             TOTAL LEUCOCYTES    COUNT (TLC)            23160       
             DIFFERENTIAL LEUCOCYTES  COUNT (DLC) 
             NEUTROPHILS                                93.4         
             LYMPHOCYTES                                 3.3         
             MONOCYTES                                   3.1         
             EOSINOPHILS                                0.2       
             BASOPHILS                                   0.0         
             ABSOLUTE   NEUTROPHILS                      21620.0                     
             ABSOLUTE   LYMPHOCYTES                      750.0                     
             ABSOLUTE  MONOCYTES                         730.0                       
             ABSOLUTE  EOSINOPHILS                       50.0                      
             ABSOLUTE  BASOPHILS                         10.0                      
             RBC  COUNT                                  4.31         
             HAEMOGLOBIN   (Hb)                          13.1         
             P.C.V/HCT                                   41.2         
             MCV                                         95.5         
             MCH                                         30.3         
             MCHC                                        31.8         
             RDW  / SD                                   49.7         
             RDW  / CV                                   12.3         
             PLATELET  COUNT                             148000       
             PDW                                         17.0         
             MPV                                         13.3         
             PCT                                         0.198        

jony

2 Answers

2

This kind of data is hard to parse with a regex, but you can try this one (it will probably need adjusting for other text files) (regex101):

import re

# variable `txt` holds the contents of the text file from the question
for col1, col2 in re.findall(r'^\s{13}([A-Z.]{2}[^\n\d]*[A-Z)])(?:\s*([\d.]+)|[^$])', txt, flags=re.MULTILINE):
    print('{:<50}{}'.format(col1, col2))

Prints:

TOTAL LEUCOCYTES    COUNT (TLC)                   23160
DIFFERENTIAL LEUCOCYTES  COUNT (DLC)              
NEUTROPHILS                                       93.4
LYMPHOCYTES                                       3.3
MONOCYTES                                         3.1
EOSINOPHILS                                       0.2
BASOPHILS                                         0.0
ABSOLUTE   NEUTROPHILS                            21620.0
ABSOLUTE   LYMPHOCYTES                            750.0
ABSOLUTE  MONOCYTES                               730.0
ABSOLUTE  EOSINOPHILS                             50.0
ABSOLUTE  BASOPHILS                               10.0
RBC  COUNT                                        4.31
HAEMOGLOBIN   (Hb)                                13.1
P.C.V/HCT                                         41.2
MCV                                               95.5
MCH                                               30.3
MCHC                                              31.8
RDW  / SD                                         49.7
RDW  / CV                                         12.3
PLATELET  COUNT                                   148000
PDW                                               17.0
MPV                                               13.3
PCT                                               0.198
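
For anyone adapting the pattern to other files, it may help to read it piece by piece. The sketch below is the same regex written with re.VERBOSE so that each part can carry a comment, and run on a few lines shaped like the report in the question. The names ROW, INDENT and sample are only illustrative, and the comments describe how the pattern behaves on this sample, not a general rule for other layouts.

```python
import re

# Same pattern as above, split up with re.VERBOSE so each part can be commented.
ROW = re.compile(r'''
    ^\s{13}            # 13 whitespace characters of indentation
    (                  # group 1: the test name
      [A-Z.]{2}        #   two capitals or dots first (skips "Name", "Test Name")
      [^\n\d]*         #   then anything except a digit or a newline
      [A-Z)]           #   ending on a capital letter or ")"
    )
    (?:
      \s*([\d.]+)      # group 2: the value, when digits follow the name
      | [^$]           # otherwise consume one character; the value stays empty
    )
''', flags=re.MULTILINE | re.VERBOSE)

INDENT = " " * 13
sample = "\n".join([
    INDENT + "Name        HAJAN sadshaj      Booking Date     22/s/2020",
    INDENT + "Test Name          Value     Unit     Biological Ref Interval",
    " " * 43 + "COMPLETE   BLOOD   COUNT (CBC)",
    INDENT + "TOTAL LEUCOCYTES    COUNT (TLC)            23160         cells/cmm       4000 - 11000",
    INDENT + "DIFFERENTIAL LEUCOCYTES  COUNT (DLC) ",
    INDENT + "NEUTROPHILS                                93.4          %               45.0 - 65.0",
])

rows = ROW.findall(sample)
for name, value in rows:
    print('{:<50}{}'.format(name, value))
```

The header lines are skipped because their second character is lowercase, and the centred title is skipped because it is indented by more than 13 characters. Both checks depend on this particular layout, which is why the pattern will likely need adjusting for other files.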
Andrej Kesely
0

You can use Python's regular-expression library to achieve what you want. I started writing a regex for your problem but haven't finished it. I'll update this post once I have something satisfactory.

Currently, the expression matches the first and second columns of every line that starts with whitespace, has an alphabetic first column and has a numeric second column. It still needs to handle lines that have only one column.

^\s+([a-zA-Z()\/. ]+)\s+(\d+(?:\.\d+)?)

You can write and test your regexes on regex101.com; it lets you see what each part is doing, which makes debugging easier.

[EDIT]

This one should do the trick, but you need to clean up your input string a bit before passing it to the regex. Assuming the title COMPLETE BLOOD COUNT (CBC) is always present, you can use Python's str.find method to locate it and discard everything before it.

(^\s+([a-zA-Z()\/. ]+)\s+((\d+(?:\.\d+)?)))|(^\s+(([a-zA-Z()\/. ]+))\s+\R)
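
As a rough sketch of how these two steps (cut off everything up to the title, then match name and optional value) could look in Python: the names TITLE, ROW and parse_cbc are made up for illustration, and the pattern is a reworked variant, not the exact expression above. Two assumptions are worth flagging. Python's built-in re module does not accept \R, so the sketch anchors on the end of the line with $ and re.MULTILINE instead. The title in the sample has several spaces between words, so it is located with a small regex instead of str.find.

```python
import re

# Title with flexible spacing; a plain str.find("COMPLETE BLOOD COUNT (CBC)")
# would miss it because the report puts several spaces between the words.
TITLE = re.compile(r'COMPLETE\s+BLOOD\s+COUNT\s+\(CBC\)')

# Hypothetical row pattern: leading blanks, a name that starts with a
# non-space character, then either a numeric value or the end of the line.
# Only horizontal whitespace ([ \t]) is used, so a match cannot spill over
# onto the next line.
ROW = re.compile(
    r'^[ \t]+([A-Za-z(/.][A-Za-z()/. ]*?)'
    r'(?:[ \t]+(\d+(?:\.\d+)?)(?=\s|$)|[ \t]*$)',
    re.MULTILINE,
)


def parse_cbc(text):
    """Return (test name, value) pairs found after the CBC title.

    The value is an empty string for rows that have no number.
    """
    m = TITLE.search(text)
    if m is None:
        return []
    return ROW.findall(text[m.end():])


INDENT = " " * 13
report = "\n".join([
    INDENT + "Lab No.     10203693        Sample Collected At    Lab",
    INDENT + "Test Name       Value     Unit     Biological Ref Interval",
    " " * 43 + "COMPLETE   BLOOD   COUNT (CBC) ",
    INDENT + "TOTAL LEUCOCYTES    COUNT (TLC)            23160         cells/cmm       4000 - 11000 ",
    INDENT + "DIFFERENTIAL LEUCOCYTES  COUNT (DLC) ",
    INDENT + "BASOPHILS                                   0.0          %               0.0-1.0 ",
    INDENT + "ABSOLUTE   NEUTROPHILS                      21620.0                      3000.0 - 7000.0 ",
])

pairs = parse_cbc(report)
for name, value in pairs:
    print('{:<50}{}'.format(name, value))
```

Dropping the header first means the row pattern does not have to tell lines such as Lab No. 10203693 apart from real results, at the cost of relying on the title being present in every file.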
totok