Java Web Scraping (7 Part Series)
1 Introduction to Web Scraping With Java
2 Web Scraping Handling Ajax Website
3 An Automatic Bill Downloader in Java
4 How to Log in to Almost Any Websites
5 Scraping E-Commerce Product Data
6 Serverless Web Scraping With Aws Lambda and Java
7 Introduction to Chrome Headless
In this article, I am going to show how to download bills (or any other file) from a website with HtmlUnit.
I suggest you read these articles first: Introduction to web scraping with Java and Autologin.
Since I am hosting this blog on Digital Ocean ($10 in credit if you sign up via this link), I will show how to write a bot that automatically downloads every bill you have.
Login
To submit the login form without needing to inspect the DOM, we will use the magic method I wrote in the previous article.
Then we have to go to the billing page: https://cloud.digitalocean.com/settings/billing
String baseUrl = "https://cloud.digitalocean.com";
String login = "email";
String password = "password";
try {
    WebClient client = Authenticator.autoLogin(baseUrl + "/login", login, password);
    HtmlPage page = client.getPage("https://cloud.digitalocean.com/settings/billing");
    if (page.asText().contains("You need to sign in for access to this page")) {
        throw new Exception(String.format("Error during login on %s, check your credentials", baseUrl));
    }
} catch (Exception e) {
    e.printStackTrace();
}
Fetching the bills
Let’s create a new class called Bill (or Invoice) to represent a bill:
Bill.java
public class Bill {
    private String label;
    private BigDecimal amount;
    private Date date;
    private String url;
    //... getters & setters
}
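The parsing code later in this article calls a four-argument constructor, and Jackson's default serialization reads properties through public getters, so the elided part of the class matters. Here is a minimal sketch of what the complete class might look like. The constructor and accessor layout is an assumption, not necessarily what the full code in the repository uses.

```java
import java.math.BigDecimal;
import java.util.Date;

// Sketch of a complete Bill class (assumed layout).
// Declared without `public` so it compiles from any file name;
// add the modifier back when the class lives in Bill.java.
class Bill {
    private String label;
    private BigDecimal amount;
    private Date date;
    private String url;

    // No-argument constructor, handy if the JSON is ever read back into objects.
    public Bill() {
    }

    // Matches a call of the form new Bill(label, amount, date, url).
    public Bill(String label, BigDecimal amount, Date date, String url) {
        this.label = label;
        this.amount = amount;
        this.date = date;
        this.url = url;
    }

    public String getLabel() { return label; }
    public void setLabel(String label) { this.label = label; }

    public BigDecimal getAmount() { return amount; }
    public void setAmount(BigDecimal amount) { this.amount = amount; }

    public Date getDate() { return date; }
    public void setDate(Date date) { this.date = date; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
}
```

BigDecimal is used for the amount rather than double so that values such as 0.35 keep their exact decimal representation.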
Now we need to inspect the DOM to see how we can extract the description, amount, date and URL of each bill. Open your favorite inspection tool.
We are lucky here: it’s a clean DOM, with a nice, well-structured table. Since HtmlUnit has many methods to handle HTML tables, we will use these:
- HtmlTable to store the table and iterate over its rows
- getCell to select the cells
Then, using the Jackson library, we will export the Bill objects to JSON and print them.
List<Bill> bills = new ArrayList<>();
ObjectMapper mapper = new ObjectMapper();
SimpleDateFormat dateFormat = new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH);

HtmlTable billsTable = (HtmlTable) page.getFirstByXPath("//table[@class='listing Billing--history']");
for (HtmlTableRow row : billsTable.getBodies().get(0).getRows()) {
    String label = row.getCell(1).asText();
    // We only want the invoice rows, not the payment ones
    if (!label.contains("Invoice")) {
        continue;
    }
    Date date = dateFormat.parse(row.getCell(0).asText());
    BigDecimal amount = new BigDecimal(row.getCell(2).asText().replace("$", ""));
    String url = ((HtmlAnchor) row.getCell(3).getFirstChild()).getHrefAttribute();

    // Requires a (label, amount, date, url) constructor on Bill
    Bill bill = new Bill(label, amount, date, url);
    bills.add(bill);

    // Print each bill as a JSON object
    System.out.println(mapper.writeValueAsString(bill));

    // The invoice download described next also goes here, inside the loop
}
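The cell parsing above assumes every amount looks like "$6.00" and every date looks like "January 1, 2016". As a hedged sketch, the two conversions could be pulled into small helpers that also tolerate surrounding whitespace and thousands separators. The class name BillCellParser and the extra clean-up rules are illustrative assumptions; they are not part of HtmlUnit or of the original code.

```java
import java.math.BigDecimal;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper for turning table cell text into typed values.
class BillCellParser {

    // Parses text such as "January 1, 2016" (the format assumed by the article).
    // Non-lenient mode rejects impossible dates like "February 30, 2015".
    static Date parseDate(String cellText) {
        SimpleDateFormat format = new SimpleDateFormat("MMMM d, yyyy", Locale.ENGLISH);
        format.setLenient(false);
        try {
            return format.parse(cellText.trim());
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable bill date: " + cellText, e);
        }
    }

    // Removes the dollar sign, thousands separators and surrounding whitespace
    // before building the BigDecimal. Throws NumberFormatException on bad input.
    static BigDecimal parseAmount(String cellText) {
        String cleaned = cellText.replace("$", "").replace(",", "").trim();
        return new BigDecimal(cleaned);
    }
}
```

Inside the loop, the calls would then read BillCellParser.parseDate(row.getCell(0).asText()) and BillCellParser.parseAmount(row.getCell(2).asText()). A new SimpleDateFormat is created per call because the class is not thread-safe.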
It’s almost finished; the last thing to do is download the invoice. It’s pretty easy: we will use the Page object to store the PDF, and call getContentAsStream on its web response. It’s better to check that the file has the right content type when doing this (application/pdf in our case).
Page invoicePdf = client.getPage(baseUrl + url);
if ("application/pdf".equals(invoicePdf.getWebResponse().getContentType())) {
    // try-with-resources closes both streams even if the copy fails
    try (InputStream in = invoicePdf.getWebResponse().getContentAsStream();
         OutputStream out = new FileOutputStream("DigitalOcean" + label + ".pdf")) {
        IOUtils.copy(in, out);
    }
}
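Because the file name is built from the bill label, it ends up containing spaces, and a label with other characters could produce an awkward or invalid path. The following sketch shows one way to write the stream to disk with only the standard library. The class InvoiceFiles, its method names and the sanitizing rule are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for saving a downloaded invoice.
class InvoiceFiles {

    // Replaces every run of characters outside [A-Za-z0-9._-] with "_"
    // and appends the .pdf extension.
    static String toFileName(String label) {
        return label.trim().replaceAll("[^A-Za-z0-9._-]+", "_") + ".pdf";
    }

    // Copies the stream into directory/<sanitized label>.pdf, creating the
    // directory if needed and overwriting an existing file of the same name.
    // The input stream is closed when the copy finishes or fails.
    static Path save(InputStream content, Path directory, String label) {
        try (InputStream in = content) {
            Files.createDirectories(directory);
            Path target = directory.resolve(toFileName(label));
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the snippet above, the call could look like InvoiceFiles.save(invoicePdf.getWebResponse().getContentAsStream(), Paths.get("invoices"), "DigitalOcean " + label), where the "invoices" directory is an arbitrary choice.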
That’s it, here is the output:
{"label":"Invoice for December 2015","amount":0.35,"date":1451602800000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for November 2015","amount":6.00,"date":1448924400000,"url":"/billing/XXXX.pdf"}
{"label":"Invoice for October 2015","amount":3.05,"date":1446332400000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for April 2015","amount":1.87,"date":1430431200000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for March 2015","amount":5.00,"date":1427839200000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for February 2015","amount":5.00,"date":1425164400000,"url":"/billing/XXXXX.pdf"}
{"label":"Invoice for January 2015","amount":1.30,"date":1422745200000,"url":"/billing/XXXXXX.pdf"}
{"label":"Invoice for October 2014","amount":3.85,"date":1414796400000,"url":"/billing/XXXXXX.pdf"}
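The date values in this output are epoch timestamps in milliseconds, which is how Jackson writes java.util.Date by default. To read them back as calendar dates, a small sketch with java.time can help. The class name BillDates is hypothetical, and the Europe/Paris zone in the usage note is a guess that happens to be consistent with the values shown, not something the article states.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

// Hypothetical helper for interpreting the "date" field of the JSON output.
class BillDates {

    // Converts epoch milliseconds to the calendar date in the given time zone.
    static LocalDate toLocalDate(long epochMillis, ZoneId zone) {
        return Instant.ofEpochMilli(epochMillis).atZone(zone).toLocalDate();
    }
}
```

For the first row, BillDates.toLocalDate(1451602800000L, ZoneId.of("Europe/Paris")) gives 2016-01-01, while the same instant in ZoneId.of("UTC") falls on 2015-12-31. The zone used for the conversion therefore decides which day a bill appears to belong to.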
As usual, you can find the full code in this GitHub repo.
If you like web scraping and are tired of taking care of proxies, JS rendering and captchas, you can check out our new web scraping API; the first 1000 API calls are on us.