I'm doing some web scraping, and this is the format of the data:

Sr.No.  Course_Code Course_Name Credit  Grade   Attendance_Grade

The actual string that I receive has the following form:

1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M

I am interested in the Course_Code, the Course_Name, and the Grade. In this example the values would be:

Course_Code : CA727
Course_Name : PRINCIPLES OF COMPILER DESIGN
Grade : A

Is there some way to use a regular expression or some other technique to easily extract this information, instead of manually parsing the string? I'm using JRuby in 1.9 mode.

nikhil
  • Are these tab delimited or space delimited? Are credit, grade, and attendance all guaranteed to exist (not be empty)? – Phrogz Jun 05 '12 at 21:32
  • Yes, all the elements are guaranteed to exist; also, it's delimited by spaces. – nikhil Jun 05 '12 at 21:36
  • The delimiter is a single space. – nikhil Jun 05 '12 at 21:41
  • Sorry about that. I copied from the webpage and its formatting came along; my script doesn't retrieve the formatting. I have edited the post to fix this. – nikhil Jun 05 '12 at 21:51

5 Answers

Let's use Ruby's named captures and a self-describing regex!

course_line = /
    ^                  # Starting at the front of the string
    (?<SrNo>\d+)       # Capture one or more digits; call the result "SrNo"
    \s+                # Eat some whitespace
    (?<Code>\S+)       # Capture all the non-whitespace you can; call it "Code"
    \s+                # Eat some whitespace
    (?<Name>.+\S)      # Capture as much as you can
                       # (while letting the rest of the regex still work)
                       # Make sure you end with a non-whitespace character.
                       # Call this "Name"
    \s+                # Eat some whitespace
    (?<Credit>\S+)     # Capture all the non-whitespace you can; call it "Credit"
    \s+                # Eat some whitespace
    (?<Grade>\S+)      # Capture all the non-whitespace you can; call it "Grade"
    \s+                # Eat some whitespace
    (?<Attendance>\S+) # Capture all the non-whitespace; call it "Attendance"
    $                  # Make sure that we're at the end of the line now
/x

str = "1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M"
parts = str.match(course_line)

puts "
Course Code: #{parts['Code']}
Course Name: #{parts['Name']}
      Grade: #{parts['Grade']}".strip

#=> Course Code: CA727
#=> Course Name: PRINCIPLES OF COMPILER DESIGN
#=>       Grade: A
Phrogz
  • Fantastic! You have explained the regex beautifully. Thanks for the wonderful solution. The best part is that I can extract all the information out of this. – nikhil Jun 05 '12 at 21:44
  • You can use symbols to access a MatchData if the strings cause quote confusion (`"... #{parts[:Grade]} ..."`); it's just a matter of taste, really. – mu is too short Jun 05 '12 at 22:00

Just for fun:

str = "1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M"
tok = str.split(/\s+/)
# shift takes fields off the front, pop off the back; what remains is the name
data = {'Sr.No.' => tok.shift, 'Course_Code' => tok.shift,
        'Attendance_Grade' => tok.pop, 'Grade' => tok.pop, 'Credit' => tok.pop,
        'Course_Name' => tok.join(' ')}
pguardiario

Am I seeing it correctly that the delimiter is always 3 spaces? Then just:

serial_number, course_code, course_name, credit, grade, attendance_grade = 
  the_string.split('   ')
Jörg W Mittag

Assuming everything except for the course description consists of single words and there are no leading or trailing spaces:

/^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/

Your example string will yield the following match groups:

1.  1
2.  CA727
3.  PRINCIPLES OF COMPILER DESIGN
4.  3
5.  A
6.  M
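
As a rough sketch of how those groups might be pulled out in Ruby: the pattern below is copied from above, while the helper name `parse_course` and the returned hash keys are assumptions made for illustration only.

```ruby
# Positional-capture pattern copied from the answer above.
COURSE_RE = /^(\w+)\s+(\w+)\s+([\w\s]+)\s+(\w+)\s+(\w+)\s+(\w+)$/

# parse_course is a hypothetical helper name, not part of the original answer.
# Returns a Hash of the three fields of interest, or nil if the line
# does not match the pattern.
def parse_course(line)
  match = COURSE_RE.match(line)
  return nil unless match

  # captures returns the six groups in order; unused ones get a _ prefix.
  _sr_no, code, name, _credit, grade, _attendance = match.captures
  { 'Course_Code' => code, 'Course_Name' => name, 'Grade' => grade }
end

course = parse_course("1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M")
puts "Course_Code : #{course['Course_Code']}"
puts "Course_Name : #{course['Course_Name']}"
puts "Grade : #{course['Grade']}"
```

Because the name group is `[\w\s]+`, a course name containing punctuation (a hyphen or an ampersand, say) would not match and the helper would return nil, so the character class may need widening for real data.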
theglauber

This answer isn't very idiomatic Ruby, because in this case I think clarity is better than cleverness. All you really need to do to solve the problem you described is to split your lines on whitespace:

line = '1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M'
array = line.split(/\t|\s{2,}/)
puts array[1], array[2], array[4]

This assumes your data is regular. If not, you will need to work harder at tuning your regular expression and possibly handling edge cases where you don't have the required number of fields.
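
One possible way to guard against lines with the wrong number of fields is sketched below. The helper name `split_course_line` and the constant `EXPECTED_FIELDS` are hypothetical, and the check assumes every valid line has exactly six fields separated by a tab or by two or more whitespace characters.

```ruby
# Assumption: a well-formed line always carries six fields.
EXPECTED_FIELDS = 6

# split_course_line is a hypothetical helper. It splits on a tab or on a run
# of two or more whitespace characters, and returns nil when the number of
# resulting fields is not the expected one.
def split_course_line(line)
  fields = line.strip.split(/\t|\s{2,}/)
  return nil unless fields.size == EXPECTED_FIELDS

  fields
end

good = split_course_line('1   CA727   PRINCIPLES OF COMPILER DESIGN   3   A   M')
bad  = split_course_line('1 CA727 PRINCIPLES OF COMPILER DESIGN 3 A M')

puts good.values_at(1, 2, 4).inspect  # code, name, grade
puts bad.inspect                      # single-space input is rejected
```

Returning nil rather than raising keeps the caller free to skip or log malformed rows; raising an exception would be an equally reasonable choice.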

A Note for Posterity

The OP changed the input string, and modified the delimiter to a single space between fields. I'll leave my answer to the original question as-is (including the original input string for reference) as it may help others besides the OP in a less-specific case.

Todd A. Jacobs
  • A good and simple solution if multiple whitespaces had been the delimiter. Not sure if I should +1 for the general case or -1 for not solving the (clarified) question, so I'll just leave this comment. – Phrogz Jun 05 '12 at 21:50