
Little bit of a beginner here, working on a personal project to scrape my school's course offerings into an easy-to-read tabular format, but I am having trouble with the initial step of scraping the data from the site.

I just added the JSoup library to my project in Eclipse, and am now having trouble initializing the connection while following the JSoup documentation.

In the end, my goal is to grab each class name / time / description, but for now I want to just grab the name. The HTML of the source website appears like this:

<td class='CourseNum'><img src='images/minus.gif' class='ICS3330 SW' onclick="toggledetails('CS3330')

My first guess was to call getElementsByTag("td"), and then query these elements for the onclick attribute or the value of the class attribute, cleaning it up by removing the initial "I" and the " SW" suffix, leaving behind the name "CS3330".
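The cleanup step described above can be sketched in plain Java, without JSoup. This is only a minimal sketch: it assumes every class value has the shape "I" + course code + " SW", as in the one sample row shown, and the names CourseCodeCleaner and cleanClassValue are made up for illustration.

```java
/** Hypothetical helper: turns a class value such as "ICS3330 SW" into "CS3330". */
final class CourseCodeCleaner {

    private CourseCodeCleaner() {}

    static String cleanClassValue(String classValue) {
        String code = classValue.trim();
        // Drop the trailing " SW" marker if it is present.
        if (code.endsWith(" SW")) {
            code = code.substring(0, code.length() - " SW".length());
        }
        // Drop the leading "I" if it is present (assumed to be a prefix, not part of the code).
        if (code.startsWith("I")) {
            code = code.substring(1);
        }
        return code;
    }

    public static void main(String[] args) {
        System.out.println(cleanClassValue("ICS3330 SW")); // prints CS3330
    }
}
```

If some rows do not follow that shape, the two checks simply leave the missing part alone, so it is worth eyeballing the results against the page.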

Now onto the actual implementation:

Document doc = Jsoup.parse("UTF-8", "http://rabi.phys.virginia.edu/mySIS/CS2/page.php?Semester=1118&Type=Group&Group=CompSci").get();
Elements td = doc.getElementsByTag("td");

At this point, I am already running into problems (even though I am not straying far from the examples provided in the documentation) and would appreciate some guidance on getting my code to function!

edit: GOT IT! Thank you all!

asolanki

2 Answers


According to the documentation, you should be doing:

Document doc = Jsoup.connect(url).get();

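The parse() method is mainly for HTML you already have, such as a string or a file; to fetch a page from a URL string, use connect().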

Vlad
  • Type mismatch: cannot convert from org.jsoup.nodes.Document to javax.swing.text.Document – asolanki Aug 07 '11 at 20:49
  • @asolanki: this is a bug in your code: you're trying to use javax.swing.text.Document rather than org.jsoup.nodes.Document. In other words, don't use the Swing Document but rather the Document class that comes with JSoup. Again Vlad is correct and I recommend that you up-vote him too. – Hovercraft Full Of Eels Aug 07 '11 at 20:52
  • Fixed this -- it seems I imported Document from the wrong package. Damn IDEs making things seem simpler while obfuscating my understanding! – asolanki Aug 07 '11 at 20:56
  • Exactly. It really is this simple, then just use Firebug to see what ids and classes to read or manipulate. – nckbrz May 28 '14 at 12:06

I just downloaded JSoup and tried it out on your school's website and got this output:

Unit: Computer Science
   CS 1010: Introduction to Information Technology
   CS 1110: Introduction to Programming
   CS 1111: Introduction to Programming
   CS 1112: Introduction to Programming
   CS 1120: From Ada and Euclid to Quantum Computing and the World Wide Web
   CS 2102: Discrete Mathematics I
   CS 2110: Software Development Methods
   CS 2150: Program and Data Representation
   CS 2220: Engineering Software
   CS 2330: Digital Logic Design
   CS 2501: Special Topics in Computer Science
   CS 3102: Theory of Computation
   CS 3330: Computer Architecture
   CS 4102: Algorithms
   CS 4240: Principles of Software Design
   CS 4414: Operating Systems
   CS 4444: Introduction to Parallel Computing
   CS 4457: Computer Networks
   CS 4501: Special Topics in Computer Science
   CS 4753: Electronic Commerce Technologies
   CS 4810: Introduction to Computer Graphics
   CS 4993: Independent Study
   CS 4998: Distinguished BA Majors Research
   CS 6161: Design and Analysis of Algorithms
   CS 6190: Computer Science Perspectives
   CS 6354: Computer Architecture
   CS 6444: Introduction to Parallel Computing
   CS 6501: Special Topics in Computer Science
   CS 6610: Programming Languages
   CS 7457: Computer Networks
   CS 7993: Independent Study
   CS 7995: Supervised Project Research
   CS 8501: Special Topics in Computer Science
   CS 8524: Topics in Software Engineering
   CS 8897: Graduate Teaching Instruction
   CS 8999: Thesis
   CS 9999: Dissertation

Too flippin' cool! Vlad is right though; use the connect(...) method. +1 to Vlad.

Other suggestions and hints:
These are the constants that I used in my little program:

   private static final String URL = "http://rabi.phys.virginia.edu/mySIS/CS2/" +
        "page.php?Semester=1118&Type=Group&Group=CompSci";
   private static final String TD_TAG = "td";
   private static final String CLASS_ATTRIB = "class";
   private static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
   private static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
   private static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

And these are the variables I used inside the scraping method:

     String unitName = "";
     List<String> courseNumbNameList = new ArrayList<String>();
     String courseNumbName = "";

Edit 1
Based on your recent comments, I think that you're over-thinking things a bit. What worked well for me is this simple algorithm:

  • Create the 3 variables I have listed above
  • Get my document as Vlad recommends.
  • Create a td Elements variable and assign to it all elements that have a td tag.
  • Use a for loop with int i going from 0 to < td.size(), and get each Element with td.get(i).
  • Inside the loop check the element's class attribute.
  • If the attribute String equals the CLASS_ATTRIB_UNIT_NAME String (see above), get the element's text and use it to set the unitName variable.
  • If the attribute String equals CLASS_ATTRIB_COURSE_NUM set the courseNumbName to the element's text.
  • If the attribute String equals CLASS_ATTRIB_COURSE_NAME, append the element's text to the courseNumbName String, add the String to the array list, and reset courseNumbName to "".
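
The steps above can be sketched roughly as follows. This is not the original program: Cell is a hypothetical stand-in for a JSoup td Element (holding its class attribute value and its text) so that the loop runs without the library or a network connection, and the "Unit: " line and the ": " separator are guesses based on the printed output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of the grouping loop over td cells; names other than the constants are illustrative. */
final class CourseGrouper {

    /** Hypothetical stand-in for a td Element: its class attribute value and its text. */
    static final class Cell {
        final String cssClass;
        final String text;

        Cell(String cssClass, String text) {
            this.cssClass = cssClass;
            this.text = text;
        }
    }

    private static final String CLASS_ATTRIB_UNIT_NAME = "UnitName";
    private static final String CLASS_ATTRIB_COURSE_NUM = "CourseNum";
    private static final String CLASS_ATTRIB_COURSE_NAME = "CourseName";

    /** Walks the cells in document order and builds one output line per unit and per course. */
    static List<String> collect(List<Cell> cells) {
        String unitName = "";
        List<String> courseNumbNameList = new ArrayList<String>();
        String courseNumbName = "";

        for (int i = 0; i < cells.size(); i++) {
            Cell cell = cells.get(i);
            if (CLASS_ATTRIB_UNIT_NAME.equals(cell.cssClass)) {
                unitName = cell.text;
                courseNumbNameList.add("Unit: " + unitName);
            } else if (CLASS_ATTRIB_COURSE_NUM.equals(cell.cssClass)) {
                // Start a new entry with the course number.
                courseNumbName = cell.text;
            } else if (CLASS_ATTRIB_COURSE_NAME.equals(cell.cssClass)) {
                // Complete the entry with the course name, store it, and reset.
                courseNumbName += ": " + cell.text;
                courseNumbNameList.add(courseNumbName);
                courseNumbName = "";
            }
        }
        return courseNumbNameList;
    }

    public static void main(String[] args) {
        // Sample cells, assumed to mirror the order of the td elements on the page.
        List<Cell> cells = Arrays.asList(
                new Cell("UnitName", "Computer Science"),
                new Cell("CourseNum", "CS 1010"),
                new Cell("CourseName", "Introduction to Information Technology"));
        for (String line : collect(cells)) {
            System.out.println(line);
        }
    }
}
```

With JSoup itself, each Cell would presumably be built from a td Element's className() and text() values, taken from the Elements returned by doc.getElementsByTag(TD_TAG).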
Hovercraft Full Of Eels
  • This is very much like what I want! Here is where I am now: Document doc = Jsoup.connect(URL).get(); Elements tables = doc.getElementsByTag(TD_TAG); Now say that the HTML in question is: Introduction to Information Technology Any pointers on A) targeting this table element and B) cleaning up the innerHTML to make it a readable CSXXXX? – asolanki Aug 07 '11 at 21:02
  • @asolanki: post this code as an edit to your original question, not in a comment as it won't format correctly in comments. – Hovercraft Full Of Eels Aug 07 '11 at 21:04
  • @asolanki: please see **Edit 1** in my answer above. – Hovercraft Full Of Eels Aug 07 '11 at 21:12
  • thanks so much for your patience, it seems like I am almost completed with the task at hand. I have re-edited the OP for what I hope to be my last question. – asolanki Aug 07 '11 at 22:18
  • @asolanki: congrats on moving forward. Again, don't forget to up-vote Vlad for his contribution. – Hovercraft Full Of Eels Aug 07 '11 at 22:29
  • I will definitely upvote both you and Vlad as soon as I have enough reputation to do so :). You wouldn't happen to know how to eliminate the tag from my output would you? – asolanki Aug 07 '11 at 22:42
  • There's no need to worry about the img tag if you follow my recommendations above. They extract only the text, via the element's text() method. – Hovercraft Full Of Eels Aug 07 '11 at 23:55