How to use proxy IP to crawl web pages in Java

I. Introduction

When crawling web pages, particularly at high request rates or against websites that restrict access by IP address, routing requests through a proxy IP can reduce the chance of being blocked and so improve the success rate of a crawl. Java's standard networking library supports proxies directly, which makes integrating a proxy IP relatively simple. This article explains how to set up and use a proxy IP for web crawling in Java, provides practical code examples, and briefly mentions the 98IP proxy service.

II. Basic concepts and preparation

2.1 Proxy IP basics

A proxy IP is the address of an intermediate server (the proxy server) that forwards client requests to the target server, so the target sees the proxy's address instead of the client's real IP address. In web crawling, this reduces the risk of the client being blocked by the target website because of frequent visits.

2.2 Preparation

  • Java development environment: Make sure that the Java Development Kit (JDK) and an integrated development environment (such as IntelliJ IDEA or Eclipse) are installed.
  • Dependent libraries: The java.net package in the Java standard library provides the basic classes for sending HTTP requests and configuring proxies. If you need more advanced features, consider a third-party library such as Apache HttpClient or OkHttp.
  • Proxy service: Select a reliable proxy service, such as 98IP Proxy, and obtain the IP address and port number of the proxy server, as well as the authentication information (if required).

III. Set the proxy IP using the Java standard library

3.1 Sample code

The following sample uses the HttpURLConnection class from the Java standard library to send a request through a proxy and read the page content:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ProxyExample {
    public static void main(String[] args) {
        // Target URL
        String targetUrl = "http://example.com";

        // Proxy server information (placeholders: replace with the host and port provided by 98IP)
        String proxyHost = "proxy.98ip.com";
        int proxyPort = 8080;

        HttpURLConnection connection = null;
        try {
            // Create the URL and the proxy object
            URL url = new URL(targetUrl);
            Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyHost, proxyPort));

            // Open the connection through the proxy
            connection = (HttpURLConnection) url.openConnection(proxy);
            connection.setRequestMethod("GET");

            // Timeouts (10 s is an arbitrary choice) so an unreachable proxy does not hang the crawler
            connection.setConnectTimeout(10_000);
            connection.setReadTimeout(10_000);

            // Read the response; try-with-resources closes the reader even if reading fails.
            // UTF-8 is assumed here; a real crawler should honor the Content-Type charset.
            StringBuilder content = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                String inputLine;
                while ((inputLine = in.readLine()) != null) {
                    // readLine() strips the line terminator, so add it back
                    content.append(inputLine).append('\n');
                }
            }

            // Print page content
            System.out.print(content);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the underlying connection
            if (connection != null) {
                connection.disconnect();
            }
        }
    }
}
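The example above passes a Proxy to a single openConnection call. When every request in the crawler should go through the same proxy, the standard library also allows a default to be installed once. The following is a minimal sketch of that idea, assuming Java 9 or later (for ProxySelector.of); the class name DefaultProxySketch and the helper selectorFor are illustrative names, and the host and port are the same placeholders as above.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.List;

class DefaultProxySketch {
    // Hypothetical helper: builds a selector that answers with one HTTP proxy
    // for every http/https URI it is asked about.
    static ProxySelector selectorFor(String host, int port) {
        return ProxySelector.of(new InetSocketAddress(host, port));
    }

    public static void main(String[] args) {
        // Placeholder host and port: replace with the values from your proxy provider
        ProxySelector selector = selectorFor("proxy.98ip.com", 8080);

        // After this call, url.openConnection() without an explicit Proxy
        // argument consults this selector
        ProxySelector.setDefault(selector);

        // Ask the selector directly which proxy it would pick for a target
        List<Proxy> chosen = selector.select(URI.create("http://example.com"));
        System.out.println("Proxy chosen for example.com: " + chosen);
    }
}
```

Because the default selector is JVM-wide, this approach suits a process that only crawls; passing an explicit Proxy per connection remains the better fit when different requests need different proxies.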


3.2 Notes

  • Proxy authentication: If the proxy service requires authentication, you need to set an Authenticator to handle authentication requests.
  • Exception handling: In actual applications, more detailed exception handling logic should be added to deal with network failures, unavailable proxy servers, and other situations.
  • Resource management: Make sure that connections and input streams are closed correctly after use to avoid resource leaks.
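The notes above can be sketched with the standard library alone. This is an illustrative outline rather than a complete implementation: the class name ProxyAuthSketch, its method names, and the user name and password are hypothetical, and the timeout value is an arbitrary choice. It registers a java.net.Authenticator that answers proxy challenges only, sets timeouts, and separates a timeout from other network failures.

```java
import java.io.IOException;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.SocketTimeoutException;
import java.net.URL;

class ProxyAuthSketch {
    // Registers a JVM-wide Authenticator that supplies credentials for proxy challenges only
    static void installProxyCredentials(String user, String password) {
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, password.toCharArray());
                }
                return null; // never hand the proxy credentials to an origin server
            }
        });
    }

    // Prepares a proxied connection with timeouts; nothing is sent until the response is read
    static HttpURLConnection open(String targetUrl, Proxy proxy, int timeoutMillis) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(targetUrl).openConnection(proxy);
        connection.setConnectTimeout(timeoutMillis);
        connection.setReadTimeout(timeoutMillis);
        return connection;
    }

    // Returns the HTTP status code, or -1 if the proxy or target could not be reached
    static int statusOrMinusOne(HttpURLConnection connection) {
        try {
            return connection.getResponseCode();
        } catch (SocketTimeoutException e) {
            System.err.println("Timed out (slow or unavailable proxy?): " + e.getMessage());
            return -1;
        } catch (IOException e) {
            System.err.println("Network failure: " + e.getMessage());
            return -1;
        } finally {
            connection.disconnect(); // release the connection in every case
        }
    }

    public static void main(String[] args) throws IOException {
        installProxyCredentials("your-username", "your-password"); // placeholder credentials
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy.98ip.com", 8080));
        System.out.println(statusOrMinusOne(open("http://example.com", proxy, 10_000)));
    }
}
```

Depending on the JDK version, Basic authentication for HTTPS tunnelling through a proxy may be disabled by default (see the jdk.http.auth.tunneling.disabledSchemes networking property), so an authenticated proxy that works for http:// targets can still fail for https:// ones until that setting is adjusted.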

IV. Use third-party libraries (such as Apache HttpClient)

Although the Java standard library covers basic proxy settings, a third-party library such as Apache HttpClient can simplify the code and offers richer features, such as connection pooling. The following example uses Apache HttpClient 4.x to send a request through a proxy:

import org.apache.http.HttpEntity;
import org.apache.http.HttpHost;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.DefaultProxyRoutePlanner;
import org.apache.http.util.EntityUtils;

public class HttpClientProxyExample {
    public static void main(String[] args) {
        // Target URL
        String targetUrl = "http://example.com";

        // Proxy server information (placeholders: replace with the host and port provided by 98IP)
        HttpHost proxy = new HttpHost("proxy.98ip.com", 8080);

        // Route planner that sends every request through the proxy
        DefaultProxyRoutePlanner routePlanner = new DefaultProxyRoutePlanner(proxy);

        // Create the HttpClient; try-with-resources closes it when done
        try (CloseableHttpClient httpClient = HttpClients.custom()
                .setRoutePlanner(routePlanner)
                .build()) {

            // Create the HttpGet request
            HttpGet request = new HttpGet(targetUrl);

            // Execute the request; the response is closed automatically as well
            try (CloseableHttpResponse response = httpClient.execute(request)) {

                // Get the response entity (may be null if the response has no body)
                HttpEntity entity = response.getEntity();

                // Print response content
                if (entity != null) {
                    System.out.println(EntityUtils.toString(entity));
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


V. Summary

This article has covered two ways to use a proxy IP for web crawling in Java: the Java standard library and a third-party library (Apache HttpClient). With a properly configured proxy, a crawler is less likely to be blocked, which improves its success rate. When choosing a proxy service, such as 98IP Proxy, consider its stability, speed, and coverage. I hope this article is a useful reference for Java developers doing web crawling.
