Parse in Gambling: How to Write Your Parser in 15 Minutes?

Step 1 – Parsing: What? Why? How?

In general, parsing is the sequential comparison of a sequence of words against the rules of a language. "Language" is meant here in the widest sense: it may be a human language (English, German, Japanese and so on) used for communication between people, or a formal language, such as any programming language.
Parsing web sites is the sequential syntactic analysis of information published on web pages. Note that this information is a hierarchical data set, structured using both human and computer languages. When creating a website, the developer inevitably has to determine the optimal page structure. But where can an example of an optimal page be found? There is no need to reinvent the wheel in the early stages of automating the optimization process! It is enough to analyse your direct competitors, especially in a niche as saturated and competitive as gambling. There is a lot of such data, so extracting it raises a number of non-trivial problems, such as:

collecting search engine results;
handling the large amount of information available on the net, which one person or even a team of analysts can hardly process;
keeping up with frequent updates: information sometimes changes every minute, so even a well-coordinated team of operators cannot maintain such a stream by hand. Automating the process saves time on monitoring changes (for instance, in casino promotions) and lets you update your own site automatically.

Compared to a human, a parser program can:

quickly crawl thousands of web pages;
neatly separate technical information from "human" content;
reliably select what is needed and discard the rest;
efficiently pack the final data into the required form.

In most cases, the result of parsing is a database or spreadsheet that is then processed further. Parsers are currently written in a wide range of programming languages, such as Python, R, C++, Delphi, Perl, Ruby and PHP. I choose Python as a versatile language with a simple syntax. That readable syntax is Python's distinguishing feature: it allows many programmers to read someone else's Python code without trouble.

Step 1 P.S. – Ways to Improve

If you want to improve your script in the future or write a smarter parser, you may find useful material on the Selenium downloads page (https://www.seleniumhq.org/download/) and in the Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
The result, whether a database or a spreadsheet, will certainly need further processing. However, what is done with the collected information afterwards is outside the topic of parsing.

Step 2 – Software implementation

First, let’s discuss the algorithm. What do we want to get? We want the optimal document structure for a given keyword. The most successful structure is most likely the one used by the pages that rank at the top of the search results for that keyword.
Thus, the work ahead can be divided into several parts:
1) Choose an example of the page to check the structure: https://casino-now.co.uk/mobile-casino/
2) Identify a keyword: mobile casino
3) Get a list of the most optimized competitors
4) Get the structure of their pages and check the optimization parameters for the keyword

To make sure the request is processed correctly, the script has to wait until the page has finished loading. The function that checks this is:

def readystate_complete(d):
    return d.execute_script("return document.readyState") == "complete"
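A predicate like this is meant to be polled: Selenium's WebDriverWait(driver, timeout).until(predicate) calls it repeatedly, passing the driver, until it returns a truthy value or the timeout runs out. The sketch below imitates that loop in plain Python so the mechanism can be tried without a browser. The names wait_until and FakeDriver are hypothetical helpers introduced only for this illustration; they are not part of Selenium.

```python
import time


def readystate_complete(d):
    # Same predicate as in the article: True once the page reports "complete".
    return d.execute_script("return document.readyState") == "complete"


def wait_until(driver, predicate, timeout=30.0, poll=0.5):
    """Hypothetical helper: poll predicate(driver) until truthy or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


class FakeDriver:
    """Hypothetical stand-in for a browser that 'loads' after a few polls."""

    def __init__(self, polls_until_ready):
        self.remaining = polls_until_ready

    def execute_script(self, script):
        self.remaining -= 1
        return "complete" if self.remaining <= 0 else "loading"


driver = FakeDriver(polls_until_ready=3)
loaded = wait_until(driver, readystate_complete, timeout=5, poll=0.01)
```

In a real script the same predicate is simply handed to WebDriverWait, so no hand-written loop is needed; the sketch only shows why the function takes the driver as its single argument.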

Next, we define the keyword whose search results we want to see and use the Selenium driver to simulate keyboard input:

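import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

mainKey = "mobile casino"
driver = webdriver.Firefox()
driver.get("http://www.google.com")
# find_element_by_name was removed in Selenium 4; use find_element(By.NAME, ...)
elem = driver.find_element(By.NAME, "q")
elem.send_keys(mainKey)
elem.submit()
# wait up to 30 seconds for the results page to finish loading
WebDriverWait(driver, 30).until(readystate_complete)
time.sleep(1)
htmltext = driver.page_source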

As a result, the source code of the page shown below is stored in the ‘htmltext’ variable:

Note that the robot icon in the screenshot means the browser is currently under remote control, in our case by Python.

Once the raw HTML text is obtained, it is time to extract the pages for parsing. The easiest way is to inspect the code of the element you are interested in and then use regular expressions to isolate the information you need, forming a list of objects. For example, to collect the competitors’ URLs:

Then let’s check the occurrence of the pattern of interest:

And write a regular expression that looks like this:
pages = re.compile('(.*?)' , re.DOTALL | re.IGNORECASE).findall(str(htmltext))
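The exact pattern depends on the markup that surrounds each result link, and that markup changes over time. As a minimal sketch, assuming each URL sits inside a <cite> element (an assumption made purely for illustration; the sample HTML below is made up and does not reproduce a real results page), the same findall approach looks like this:

```python
import re

# Hypothetical fragment of a results page; real markup differs and changes often.
sample_html = """
<div class="result"><cite>https://example.com/mobile-casino/</cite></div>
<div class="result"><CITE>https://example.org/casino</CITE></div>
<div class="result"><cite>
  https://example.net/games
</cite></div>
"""

# Non-greedy capture between the assumed <cite> tags.
# re.DOTALL lets '.' span line breaks; re.IGNORECASE also matches <CITE>.
pattern = re.compile(r"<cite>(.*?)</cite>", re.DOTALL | re.IGNORECASE)

# Strip the whitespace that surrounds a URL when the tag spans several lines.
pages = [match.strip() for match in pattern.findall(sample_html)]
```

The non-greedy (.*?) matters here: a greedy (.*) combined with re.DOTALL would swallow everything from the first opening tag to the last closing tag and return a single match.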

As a result, we get a list of the top 10 competitors’ pages for the keyword of interest.

The next stage is re-parsing: the same procedure as above, but applied to each URL found for the chosen keyword. The results are collected in a dataframe; the full version can be viewed on GitHub:
https://github.com/TinaWard/FirstStepForParsing/.
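To give an idea of what "structure" can mean in the re-parsing step, here is a small sketch that pulls the title and headings out of one page and checks them for the keyword. It uses only the standard library's html.parser; the class StructureParser, the function page_structure and the chosen fields are assumptions for illustration and are not taken from the repository above.

```python
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Hypothetical helper: collects the title and headings of one page."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []    # (tag, text) pairs in document order
        self._current = None  # title/heading tag being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag in self.HEADINGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # closing tag of a nested element such as <em>
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        else:
            self.headings.append((tag, text))
        self._current = None


def page_structure(url, html, keyword):
    """Return one row of structure data for a page (fields are illustrative)."""
    parser = StructureParser()
    parser.feed(html)
    key = keyword.lower()
    return {
        "url": url,
        "title": parser.title,
        "keyword_in_title": key in parser.title.lower(),
        "h1_count": sum(1 for tag, _ in parser.headings if tag == "h1"),
        "headings_with_keyword": sum(
            1 for _, text in parser.headings if key in text.lower()
        ),
    }


# Made-up page used only to exercise the function.
sample = (
    "<html><head><title>Best Mobile Casino Sites</title></head><body>"
    "<h1>Mobile Casino</h1><h2>Bonuses</h2>"
    "<h2>Top <em>mobile casino</em> apps</h2></body></html>"
)
row = page_structure("https://example.com/", sample, "mobile casino")
```

Calling page_structure once per competitor URL gives a list of dictionaries, which can then be loaded into a dataframe, for example with pandas.DataFrame(rows) if pandas is installed.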

The following is only the snippet of code that strips tags and scripts from the raw HTML document. Its result is the clean text of the site, which is used to calculate the keyword density.

from bs4 import BeautifulSoup

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on double spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

The parsing script produces a text document from which the numerical characteristics needed to evaluate competitors are calculated, for example the keyword density.
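The article does not give a formula for keyword density, so the sketch below uses one common definition as an assumption: the share of all words in the text that belong to occurrences of the key phrase, expressed as a percentage. The function name keyword_density and the simple ASCII word pattern are illustrative choices.

```python
import re

# Assumed tokenisation: lower-case runs of ASCII letters, digits and apostrophes.
WORD = re.compile(r"[a-z0-9']+")


def keyword_density(text, keyword):
    """Percentage of words in text that belong to occurrences of keyword."""
    words = WORD.findall(text.lower())
    key = WORD.findall(keyword.lower())
    if not words or not key:
        return 0.0
    n = len(key)
    # Count positions where the next n words equal the key phrase.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == key)
    return 100.0 * hits * n / len(words)


text = "Mobile casino bonuses: play at the best mobile casino today."
density = keyword_density(text, "mobile casino")  # 2 hits * 2 words / 10 words
```

For this ten-word sentence the two-word phrase occurs twice, so the function returns 40.0. Other definitions divide the number of occurrences by the word count without multiplying by the phrase length; whichever is chosen, it should be applied consistently to all competitors.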

The result is a file containing each competitor’s page address, the HTML page, the document structure and the calculated keyword density:
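A file of this kind can be written with the standard csv module. The sketch below is only an illustration: the column names, the helper write_report and the two rows are placeholders invented for the example, not results from the article, and the raw HTML column is left out for brevity.

```python
import csv
import io

# Placeholder rows; real values would come from the parsing steps above.
rows = [
    {"url": "https://example.com/", "h1_count": 1, "keyword_density": 2.4},
    {"url": "https://example.org/", "h1_count": 2, "keyword_density": 1.1},
]


def write_report(rows, handle):
    """Write one CSV line per competitor page to an open text handle."""
    fieldnames = ["url", "h1_count", "keyword_density"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained; for a real file use
# open("report.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_report(rows, buffer)
report = buffer.getvalue()
```

Passing newline="" when opening a real file is what the csv documentation recommends, so that the writer controls line endings itself.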

Wrapping Up

Thus, this article has covered two key points of parsing: automatic browser control using Selenium and processing of raw HTML pages with Beautiful Soup.
Build your web sites on best practice! Good luck!
Leave comments and suggest topics you would like to know more about: tina.ward@mail.uk
Wordcloud of the article. Have fun!
