
Hi, I am trying to extract data from another site. I can retrieve the data, but I cannot get it into the format I want. How can I achieve that?

Here is the code I have so far:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;
import org.openqa.selenium.support.ui.Select;

public class Getdata2 {

    public static void main(String args[]) throws InterruptedException {

        WebDriver driver = new HtmlUnitDriver(BrowserVersion.getDefault());
        String sDate = "27/03/2014";

        String url="http://www.upmandiparishad.in/commodityWiseAll.aspx";
        driver.get(url);
        Thread.sleep(5000);

        new Select(driver.findElement(By.id("ctl00_ContentPlaceHolder1_ddl_commodity"))).selectByVisibleText("Jo");
        driver.findElement(By.id("ctl00_ContentPlaceHolder1_txt_rate")).sendKeys(sDate);

        Thread.sleep(3000);
        driver.findElement(By.id("ctl00_ContentPlaceHolder1_btn_show")).click();
        Thread.sleep(5000);


        WebElement findElement = driver.findElement(By.id("ctl00_ContentPlaceHolder1_GridView1"));
        String htmlTableText = findElement.getText();
        // This is the raw text of the whole table; post-process it as needed.
        htmlTableText=htmlTableText.replace("S.No.DistrictMarketPrice","");
        System.out.println(htmlTableText);


        // quit() closes every window and ends the session, so a separate close() is not needed.
        driver.quit();

    }
}

I want to extract my data like this:

1 Agra Achhnera NIL
2 Agra Agra NIL
3 Agra Fatehabad NIL
4 Agra FatehpurSikri NIL
5 Agra Jagner NIL
6 Agra Jarar NIL
7 Agra Khairagarh NIL
8 Agra Shamshabad NIL
9 Aligarh Atrauli NIL
10 Aligarh Chharra NIL
11 Aligarh Aligarh 1300.00
12 Aligarh Khair 1300.00
13 Allahabad Allahabad NIL
14 Allahabad Jasra NIL
15 Allahabad Leriyari NIL
16 Allahabad Sirsa NIL
17 AmbedkarNagar Akbarpur NIL
18 Ambedkar Nagar TandaAkbarpur NIL

How can I achieve my desired output?

Thanks in advance

songyuanyao
  • possible duplicate of [How to do web scraping using htmlunitsriver?](http://stackoverflow.com/questions/22807527/how-to-do-web-scraping-using-htmlunitsriver) – Nadun Apr 04 '14 at 07:14
  • How many accounts do you have? Why is that? – Nadun Apr 04 '14 at 07:18
  • I don't know why that account was blocked for 7 days, so I had to make a new one. Sorry. – user3496498 Apr 04 '14 at 07:19
  • @Nadun, can you understand my problem and help me solve it? – user3496498 Apr 04 '14 at 07:20
  • You can check this thread (maybe the second answer): [Parsing HTML table data with xpath and selenium in java](http://stackoverflow.com/questions/10323884/parsing-html-table-data-with-xpath-and-selenium-in-java) – stan Apr 04 '14 at 07:57

1 Answer


Note: You do not need regex. Selenium itself provides good tools to extract data from tables.

Let's analyze this. Looking at the page source of that website, here is how the table is arranged:

<table id="ctl00_ContentPlaceHolder1_GridView1">
    <tbody>
        <tr>
            <td></td>
            <td></td>
            <td></td>
            <td></td>
        </tr>
        ... more <tr>s
    </tbody>
</table>
  • First you get the "table rows".
  • This is done by using findElement and findElements.

(The code below is an example; modify it to fit your code.)

List<WebElement> tableRows = driver.findElement(By.id("ctl00_ContentPlaceHolder1_GridView1")).findElements(By.xpath(".//tbody/tr"));
  • Now loop through each of the List<WebElement> elements, which you got above.

You do this using

for (WebElement tableRow : tableRows) {
...
}
  • Next, each table row has 4 entries (i.e., 4 table cells).
  • Again, use findElements as shown above.
  • Store the result in a List<WebElement> (again, as shown above).

Code:

List<WebElement> tableCells = tableRow.findElements(By.xpath(".//td"));
  • Now, loop through each <td> WebElement.
  • Get the text within each element by calling the .getText() method on each WebElement.
  • Format the text output according to your needs.
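
The last step (formatting) can be sketched roughly as follows. This is only a minimal illustration, not part of Selenium: RowFormatter, formatRow and formatTable are hypothetical names, and the sketch assumes you have already copied each cell's getText() result into a List<String> per row. It also assumes that a row with no <td> cells (for example a header row made of <th> elements) should simply be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Selenium class): turns cell texts into output lines.
final class RowFormatter {

    // Joins the trimmed, non-empty cell texts of one row with single spaces.
    static String formatRow(List<String> cellTexts) {
        List<String> cleaned = new ArrayList<>();
        for (String text : cellTexts) {
            String trimmed = text.trim();
            if (!trimmed.isEmpty()) {
                cleaned.add(trimmed);
            }
        }
        return String.join(" ", cleaned);
    }

    // Formats every row and skips rows that produce no text at all
    // (assumption: such rows are headers or spacers you do not want).
    static List<String> formatTable(List<List<String>> rows) {
        List<String> lines = new ArrayList<>();
        for (List<String> row : rows) {
            String line = formatRow(row);
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample values taken from the desired output in the question.
        List<List<String>> rows = new ArrayList<>();
        rows.add(new ArrayList<String>()); // a row without <td> cells
        rows.add(Arrays.asList("11", "Aligarh", "Aligarh", "1300.00"));
        rows.add(Arrays.asList("12", "Aligarh", "Khair", "1300.00"));

        for (String line : formatTable(rows)) {
            System.out.println(line);
        }
    }
}
```

To connect this to the steps above, you would add one list per tableRow inside the for loop and fill it by calling getText() on each WebElement returned by tableRow.findElements(By.xpath(".//td")).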
Vish