Java Web Scraping (7 Part Series)
1 Introduction to Web Scraping With Java
2 Web Scraping Handling Ajax Website
3 An Automatic Bill Downloader in Java
4 How to Log in to Almost Any Websites
5 Scraping E-Commerce Product Data
6 Serverless Web Scraping With Aws Lambda and Java
7 Introduction to Chrome Headless
In the first article about Java web scraping, I showed how to extract data from the Craigslist website.
But what if the data you want, or the action you want to carry out on a website, requires authentication?
In this short tutorial, I will show you how to write a generic method that can handle most authentication forms.
Authentication mechanism
There are many different authentication mechanisms, the most frequent being a login form, sometimes with a CSRF token as a hidden input.
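A CSRF token typically sits in the form as a hidden input, and since we will submit the whole enclosing form, it gets sent back automatically. As an aside, here is a minimal sketch of spotting such a token in raw HTML with a plain regex — the form markup and the "csrf_token" field name are made up for illustration; a real scraper would rely on a DOM parser instead:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsrfExample {
    // Hypothetical login form markup; the "csrf_token" field name is an assumption
    static final String FORM_HTML =
        "<form action=\"/login\" method=\"post\">"
      + "<input type=\"hidden\" name=\"csrf_token\" value=\"a1b2c3\">"
      + "<input type=\"text\" name=\"user\">"
      + "<input type=\"password\" name=\"pw\">"
      + "</form>";

    // Extracts the value of a hidden input by name; returns null if absent
    static String hiddenValue(String html, String name) {
        Pattern p = Pattern.compile(
            "<input[^>]*type=\"hidden\"[^>]*name=\"" + Pattern.quote(name)
          + "\"[^>]*value=\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(hiddenValue(FORM_HTML, "csrf_token")); // a1b2c3
    }
}
```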
To auto-magically log into a website with your scrapers, the idea is:
- GET /loginPage
- Select the first <input type="password"> tag
- Select the first <input> before it that is not hidden
- Set the value attribute for both inputs
- Select the enclosing form, and submit it.
Hacker News Authentication
Let’s say you want to create a bot that logs into Hacker News (to submit a link or perform any other action that requires being authenticated).
Here is the login form and the associated DOM:
Now we can implement the login algorithm:
import java.io.IOException;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static WebClient autoLogin(String loginUrl, String login, String password)
        throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    WebClient client = new WebClient();
    client.getOptions().setCssEnabled(false);
    client.getOptions().setJavaScriptEnabled(false);

    HtmlPage page = client.getPage(loginUrl);
    HtmlInput inputPassword = page.getFirstByXPath("//input[@type='password']");
    // The first preceding input that is not hidden
    HtmlInput inputLogin = inputPassword.getFirstByXPath(".//preceding::input[not(@type='hidden')]");
    inputLogin.setValueAttribute(login);
    inputPassword.setValueAttribute(password);

    // Get the enclosing form
    HtmlForm loginForm = inputPassword.getEnclosingForm();
    // Submit the form
    page = client.getPage(loginForm.getWebRequest(null));
    // Return the cookie-filled client :)
    return client;
}
Then the main method, which:
- calls autoLogin with the right parameters
- goes to https://news.ycombinator.com
- checks for the logout link's presence to verify that we're logged in
- prints the cookies to the console
public static void main(String[] args) {
    String baseUrl = "https://news.ycombinator.com";
    String loginUrl = baseUrl + "/login?goto=news";
    String login = "login";
    String password = "password";

    try {
        System.out.println("Starting autoLogin on " + loginUrl);
        WebClient client = autoLogin(loginUrl, login, password);
        HtmlPage page = client.getPage(baseUrl);
        HtmlAnchor logoutLink = page.getFirstByXPath(
                String.format("//a[@href='user?id=%s']", login));
        if (logoutLink != null) {
            System.out.println("Successfully logged in!");
            // Print the cookies
            for (Cookie cookie : client.getCookieManager().getCookies()) {
                System.out.println(cookie.toString());
            }
        } else {
            System.err.println("Wrong credentials");
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
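The cookies printed by that loop are what keep the session alive, so they can be stored and replayed later. As a hedged sketch (not part of the original code), the standard library's java.net.HttpCookie can parse a saved Set-Cookie value back into the Cookie header a later request would send — the cookie name and value below are placeholders:

```java
import java.net.HttpCookie;
import java.util.List;

public class CookiePersist {
    // Parses a Set-Cookie header value and rebuilds the "Cookie" request
    // header a later request would send back to stay authenticated
    static String toCookieHeader(String setCookieHeader) {
        List<HttpCookie> cookies = HttpCookie.parse(setCookieHeader);
        StringBuilder sb = new StringBuilder();
        for (HttpCookie c : cookies) {
            if (sb.length() > 0) sb.append("; ");
            sb.append(c.getName()).append('=').append(c.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical session cookie, similar to what the loop above prints
        System.out.println(toCookieHeader("user=placeholder-token; Path=/; HttpOnly"));
        // prints user=placeholder-token
    }
}
```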
You can find the full code in this GitHub repo.
Go further
There are many cases where this method will not work: Amazon, Dropbox… and any other login form protected by two-step verification or a captcha.
Things that can be improved in this code:
- Handle the check for the logout link inside autoLogin
- Check for null inputs/form and throw an appropriate exception
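For the second point, here is a minimal sketch of a guard helper that the XPath lookups could be wrapped in — the class and method names are my own, not from the repo:

```java
public class LoginGuard {
    // Thrown when an expected element is missing from the login page
    static class LoginElementNotFoundException extends RuntimeException {
        LoginElementNotFoundException(String message) { super(message); }
    }

    // Returns the element unchanged, or throws with a descriptive message
    static <T> T require(T element, String description) {
        if (element == null) {
            throw new LoginElementNotFoundException(
                "Could not find " + description + " on the login page");
        }
        return element;
    }

    public static void main(String[] args) {
        System.out.println(require("dummy-input", "password input")); // dummy-input
        try {
            require(null, "login form");
        } catch (LoginElementNotFoundException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Inside autoLogin, a lookup would then read something like require(page.getFirstByXPath(...), "password input"), failing fast with a clear message instead of a NullPointerException.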
In an upcoming post, I will show you how to deal with captchas and virtual numeric keyboards using OCR and captcha-breaking APIs!
If you like web scraping and are tired of taking care of proxies, JavaScript rendering, and captchas, you can check out our new web scraping API; the first 1000 API calls are on us.