Saturday, June 27, 2015

Java: How To Parse HTML (Using Jsoup Java library)

To parse the HTML of any web page, I use the Jsoup library. It is a lightweight and very useful tool that lets you perform complex operations for fetching and processing data from HTML.

You can download the latest version of the library from the official site. If you use Maven, just add the following to the <dependencies> section of your POM:

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.8.2</version>
</dependency>

If you have a standard (non-Maven) Java project, you should first convert it to a Maven project, as shown in the figure below:

After that, you will see pom.xml in your project directory.

As an example, I will create a JsoupParserExample class that fetches the google.com web page and prints the text of every link on it.

 JsoupParserExample.java
package com.gabdev.jsoup;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupParserExample {

    public static void main(String[] args) throws IOException {
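        // Fetch the page over HTTP and parse it into a DOM tree
        Document doc = Jsoup.connect("http://google.com").get();
        // Collect every <a> element in the document
        Elements links = doc.getElementsByTag("a");
        for (Element link : links) {
            // Print the visible anchor text of the link
            System.out.println(link.text());
        }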
    }
}
After running the code, you will see something like this in the output window:
Images
Maps
Play
YouTube
News
Mail
Drive
More »
These are the link (anchor) texts from google.com.
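
The anchor text is often only half of what you want from a link; usually you also need the address it points to. Below is a minimal sketch of one way to get both. To keep it runnable without a network connection, it parses a small inline HTML string instead of google.com. The class name JsoupLinkExample, the helper extractLinks, the sample markup, and the example.com base URL are all made up for illustration and are not from the original example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical example class: prints the text and absolute URL of each link.
class JsoupLinkExample {

    // Hypothetical helper: returns "text -> absolute URL" for every <a> that has an href.
    static List<String> extractLinks(String html, String baseUri) {
        // The base URI lets Jsoup resolve relative hrefs such as "/maps"
        Document doc = Jsoup.parse(html, baseUri);
        // CSS selector: only anchors that actually carry an href attribute
        Elements links = doc.select("a[href]");
        List<String> result = new ArrayList<>();
        for (Element link : links) {
            // absUrl resolves the href against the document's base URI
            result.add(link.text() + " -> " + link.absUrl("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample markup, assumed for this sketch only
        String html = "<p><a href=\"/maps\">Maps</a> "
                + "<a href=\"http://news.example.com/\">News</a> "
                + "<a>No target</a></p>";
        for (String line : extractLinks(html, "http://example.com/")) {
            System.out.println(line);
        }
    }
}
```

The third anchor in the sample has no href, so the a[href] selector skips it. The same select and absUrl calls should also work on a Document obtained with Jsoup.connect(...).get(), where the base URI is taken from the fetched page.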
