How to do web scraping in Java – Part II

In this second part, we will build the program. For the earlier steps, see Part I.

Step IV : ( Using Jsoup )

  • Add the following code to the main method of your class.

  • final Document document;
     Scanner input = new Scanner(System.in);
     
     System.out.print("Enter the search : >>>> ");
     String name = input.nextLine();

    Note: Make sure Document is imported from the jsoup dependency (org.jsoup.nodes.Document) and Scanner from java.util.

Important:

  • Next, look at how the search engine builds its result URL. Run any query in your browser and note the pattern of the link in the address bar; here the search term is the last parameter (p=). Add the next line,
  •  document = Jsoup.connect("https://search.yahoo.com/search?ei=utf-8&fr=tightropetb&type=11745&p="+name).get();
  • Note that get() throws IOException, so main has to declare or handle it. In the video, we saw that the results are a list of <li> elements. We need a for loop to visit each of them, so first find out in the browser's Inspector which class they sit under.

scrap_stepII
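
One caveat with the search URL: the raw input is concatenated as-is, so a query with spaces or characters such as & may produce a malformed request. Below is a minimal sketch of encoding the term first, using only the standard library (it assumes Java 10 or later for this encode overload). The names SearchUrl and build are illustrative and not part of the original project.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class SearchUrl {
    // Base URL copied from the tutorial; the search term is appended after "p=".
    static final String BASE =
            "https://search.yahoo.com/search?ei=utf-8&fr=tightropetb&type=11745&p=";

    // Encodes the user's search term for use in a query string and appends it to the base URL.
    static String build(String term) {
        return BASE + URLEncoder.encode(term, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("java web scraping"));
        System.out.println(build("tom & jerry"));
    }
}
```

The returned string can be passed to Jsoup.connect(...) in place of the concatenated one.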

  • Using that class, write a for loop that selects each element in it. So the next line is,
  • for (Element row : document.select("ol.mb-15.reg.searchCenterMiddle li")) {
     
     }
  • Now decide what you want to take from each <li> element. Say I want only the headings and links of the search results.
  • So I look for the tags that hold the heading and its link.

For Heading(s):

scrap_stepIII

For link(s) :

scrap_stepIV.png

You can see the class names under which the heading and its link are found. So we just need to select those tags inside each <li> element. Inside the for loop:

  • String title = row.select(".title").text();
     // text of the span that displays the result's URL
     String url = row.select("span.fz-ms.fw-m.fc-12th.wr-bw.lh-17").text();
     System.out.println(title + "-> URL: " + url + "\n\n");
     title = title.replace(",", "|");
  • The last line is there because, when you export the results to CSV/Excel, a title containing “,” is split across separate cells unless the comma is replaced with “|”.
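
Replacing commas changes the title text. An alternative is sketched here, assuming the export follows the usual CSV quoting convention (RFC 4180): wrap the field in double quotes and double any quotes inside it. CsvField and escape are hypothetical names, not part of the original project.

```java
class CsvField {
    // Quotes a field only when it contains a comma, a double quote or a line break.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are doubled, then the whole field is wrapped in quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // Sample values standing in for a scraped title and URL.
        String title = "Java, Jsoup and \"scraping\"";
        String url = "www.example.com/jsoup";
        System.out.println(escape(title) + "," + escape(url));
    }
}
```

Spreadsheet programs that read CSV generally treat a quoted field as a single cell, so the title keeps its commas.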

That’s it. You can now build your program and enter any search term as input to get the scraped data. You can then export it to txt, csv or any other format.
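
As one possible way to do the export, the sketch below turns title/URL pairs into CSV lines and writes them with java.nio.file.Files. The file name results.csv, the ResultExporter class and the sample rows are assumptions for illustration. In the real program the lines would be collected inside the for loop.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ResultExporter {
    // Builds one CSV line from a title and a URL, replacing commas in the title with "|".
    static String toCsvLine(String title, String url) {
        return title.replace(",", "|") + "," + url;
    }

    // Writes a header row plus the given lines to the target file, overwriting an existing file.
    static void export(Path target, List<String> lines) throws IOException {
        List<String> out = new ArrayList<>();
        out.add("title,url");
        out.addAll(lines);
        Files.write(target, out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        // Sample rows standing in for the scraped results.
        lines.add(toCsvLine("Jsoup, a Java HTML parser", "jsoup.org"));
        lines.add(toCsvLine("Web scraping", "en.wikipedia.org/wiki/Web_scraping"));
        export(Paths.get("results.csv"), lines);
    }
}
```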

You can get the entire project here.

 
