Hi, I am working on a cloud computing project that uses Amazon. The part of the code where I am stuck is getting a user's wish list from Amazon. Because of permission restrictions, I fetched the entire page source for the given wish list URL. To extract the item ID, I compiled a pattern like this:

Pattern p = Pattern.compile("/dp/(\\w+)/");
Matcher matcher = p.matcher(content);
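
For readers who want the whole loop, here is a minimal sketch built around the pattern above. The class name `WishListIds`, the method `extractItemIds`, and the sample HTML are hypothetical and only serve the illustration. It assumes the page source is already held in a string, and it uses a `LinkedHashSet` on the assumption that the same product link can appear more than once on the page.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WishListIds {
    // Pattern from the question: captures the ID between "/dp/" and the next "/".
    private static final Pattern ITEM_ID = Pattern.compile("/dp/(\\w+)/");

    // Returns each item ID once, in the order it first appears in the page source.
    static List<String> extractItemIds(String content) {
        Set<String> ids = new LinkedHashSet<>();
        Matcher matcher = ITEM_ID.matcher(content);
        while (matcher.find()) {
            ids.add(matcher.group(1)); // group(1) is the (\\w+) capture
        }
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        // Made-up sample markup, not real wish list output.
        String content = "<a href=\"/dp/B00ABC1234/ref=x\">A</a>"
                + "<a href=\"/dp/B00ABC1234/ref=y\">A again</a>"
                + "<a href=\"/dp/0123456789/ref=z\">B</a>";
        System.out.println(extractItemIds(content)); // prints [B00ABC1234, 0123456789]
    }
}
```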

This was easy, and it now correctly lists all the products in that wish list with their item IDs. I also need the price for each. According to the page source, the price appears as:

<span class="a-size-base a-color-price a-text-bold">
                      $7.19
                    </span>

I need to write a pattern for this one, and I am confused and stuck. I am not good with regular expressions. Could anyone help, please? I saw online references for href, but I don't think those will work for me.

Thanks to dkatzel, I found the tool Jsoup. I tried the online converter at Online Jsoup Try, and when I use the CSS query div I get the required output. But how do I hard-code it in my Java program? I have the jsoup jar.

Ali
sa_nyc
  • I recommend you use an HTML parsing library like http://jsoup.org/ to do all this for you (unless you need to parse it yourself for school work). – dkatzel Dec 17 '13 at 21:21
  • I don't need to parse it myself. My main project is completely different. – sa_nyc Dec 17 '13 at 21:22
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – GriffeyDog Dec 17 '13 at 21:22
  • So what is the context that these prices appear in? Are they always in that kind of span tag with those class names? – mtanti Dec 17 '13 at 21:26
  • Yes. These are the only tags which contain the price. I could attach the page source but it will be very long – sa_nyc Dec 17 '13 at 21:26

2 Answers

An alternative answer where Jsoup is used.

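Document doc = Jsoup.parse(content); // content is the page source string from the question
Element e = doc.select("span.a-size-base").first(); // e.text() gives the price; e is null if nothing matches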

Include jsoup-1.x.x.jar in your project, or on the classpath when you compile, and add the following imports:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
Daniel B
Wouldn't a simple expression work?

\\$\\d+(?:\\.\\d+)

\\$ matches a literal $.

\\d+ matches digits.

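(?:\\.\\d+) matches the decimal part. As written, the group is required; append ? to it if prices without decimals (e.g. $7) should also match.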

The whole match is what you're looking for, I guess. If you don't need the dollar symbol, you can use either a capture group and take the first group (i.e. \\$(\\d+(?:\\.\\d+))) or a lookbehind (i.e. (?<=\\$)\\d+(?:\\.\\d+)).
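
To make the three options concrete, here is a small sketch that runs each pattern over the same input and reads the appropriate group. The class name `PricePatterns`, the helper `findAll`, and the sample string are hypothetical; the patterns themselves are the ones given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PricePatterns {
    // Whole match includes the dollar sign: read it with group(0).
    static final Pattern WITH_SYMBOL = Pattern.compile("\\$\\d+(?:\\.\\d+)");
    // Capture group around the number only: read it with group(1).
    static final Pattern CAPTURED = Pattern.compile("\\$(\\d+(?:\\.\\d+))");
    // Lookbehind: "$" must precede the match but is not part of it, so group(0) is the number.
    static final Pattern LOOKBEHIND = Pattern.compile("(?<=\\$)\\d+(?:\\.\\d+)");

    // Collects the requested group from every match of the pattern in content.
    static List<String> findAll(Pattern pattern, int group, String content) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(content);
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample input modelled on the span shown in the question.
        String content = "<span class=\"a-size-base a-color-price a-text-bold\">\n"
                + "    $7.19\n  </span> <span>$12.50</span>";
        System.out.println(findAll(WITH_SYMBOL, 0, content)); // prints [$7.19, $12.50]
        System.out.println(findAll(CAPTURED, 1, content));    // prints [7.19, 12.50]
        System.out.println(findAll(LOOKBEHIND, 0, content));  // prints [7.19, 12.50]
    }
}
```

Calling `group(1)` on a pattern that has no capture group, such as `WITH_SYMBOL`, throws `IndexOutOfBoundsException`, which is why the group number is passed in explicitly here.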

Jerry
  • I did `List price = new ArrayList(); Pattern pr=Pattern.compile("\\$\\d+(?:\\.\\d+)"); Matcher priceMatcher= pr.matcher(content); while(priceMatcher.find()) { if(!price.contains(priceMatcher.group(1))) price.add(priceMatcher.group(1)); } System.out.println("Prices fetched in iteration "+count); for(String s : price) { System.out.println(s); }` **gives IndexOutOfBoundsException("No group " + group);** – sa_nyc Dec 17 '13 at 21:29
  • @sa_nyc Use `.group(0)` since it's the whole match. – Jerry Dec 17 '13 at 21:30
  • If you want to match the whole tag, you would use this however: `\\s*(\\$\\d+(?:\\.\\d+))\\s*` and then use `.group(1)` because there's a capture group. – Jerry Dec 17 '13 at 21:33
  • Well, it worked with a small modification, i.e. an escape character for the double quotes within the tag. Thank you for saving the day. – sa_nyc Dec 17 '13 at 21:44
  • What do I do if I don't want to match the $ symbol? – sa_nyc Dec 18 '13 at 00:13
  • @sa_nyc Then you can use something like this: `\\s*\\$(\\d+(?:\\.\\d+))\\s*` and then use `.group(1)`. And sorry about the quotes, I'm not quite used to Java's regex. – Jerry Dec 18 '13 at 04:06
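
Pulling the comment thread together, the loop from the comments could be sketched roughly as follows. This is an illustration under assumptions, not the exact code that ended up working: the class name `WishListPrices` and the method `extractPrices` are hypothetical, and the tag-anchored pattern assumes the price span looks exactly like the one quoted in the question, with the same class names in the same order and the double quotes escaped for Java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WishListPrices {
    // Assumed markup: the span from the question, whitespace around the price,
    // and a capture group that leaves out the "$".
    private static final Pattern PRICE_SPAN = Pattern.compile(
            "<span class=\"a-size-base a-color-price a-text-bold\">"
            + "\\s*\\$(\\d+(?:\\.\\d+))\\s*</span>");

    static List<String> extractPrices(String content) {
        List<String> prices = new ArrayList<>();
        Matcher priceMatcher = PRICE_SPAN.matcher(content);
        while (priceMatcher.find()) {
            String price = priceMatcher.group(1); // valid: the pattern has one capture group
            if (!prices.contains(price)) {        // same de-duplication as in the comment
                prices.add(price);
            }
        }
        return prices;
    }

    public static void main(String[] args) {
        // Made-up sample input; the "$99.99" outside a price span is not matched.
        String span = "<span class=\"a-size-base a-color-price a-text-bold\">\n"
                + "                      $7.19\n                    </span>";
        String content = span + "<p>Save $99.99 today</p>" + span
                + "<span class=\"a-size-base a-color-price a-text-bold\"> $12.50 </span>";
        System.out.println(extractPrices(content)); // prints [7.19, 12.50]
    }
}
```

The `contains` check mirrors the loop in the comment; it would drop a second item that happens to cost the same as an earlier one, so it may be worth removing if each product's price is needed. Matching on exact markup is also brittle if the page layout changes, which is the case the Jsoup answer handles more gracefully.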