Introduction
Ensuring the availability and reliability of critical web applications is a challenge for any organization. During my tenure at one of the largest education providers in the world, I encountered a recurring issue with a secure file transfer platform that frequently became unavailable due to database deadlocks. As part of my role, I researched, developed, and implemented an automated monitoring and remediation solution using Selenium and Python to address this challenge.
The Problem
The secure file transfer platform supported hundreds of concurrent users, making its uptime crucial. However, we repeatedly faced an issue where the application became unresponsive due to database deadlocks, causing the database connection to become unavailable. The only known workaround was restarting the SQL Server services, followed by restarting the application services—or vice versa, depending on the situation.
A major issue was that we had no proactive way of detecting downtime. We only became aware of failures when users reported them. While working with vendor support for a long-term fix, we needed an interim solution that could monitor the application, detect downtime, and apply the necessary workarounds automatically.
Choosing a Solution
Due to budget constraints, commercial monitoring solutions like New Relic were not an option. After thorough research, I determined that Selenium, a web automation framework, could be used to automate periodic login attempts and verify application availability. Selenium allowed us to interact with the web application just as a user would, making it an ideal choice.
Tools Used
Python: Scripting language for automation
Chrome Headless: Chrome browser running without a graphical user interface
Selenium: Web automation framework
PS Tools: pskill.exe is used to terminate the application process, while psService.exe is used to start and restart remote services, as both the database and application services are hosted in a Windows Server environment.
Download the Script
You can download the full monitoring script along with the required files from GitHub: Download from GitHub
Implementing the Monitoring Function
Importing modules:
selenium: imported to navigate through web pages and interact with web elements.
smtplib and email.message: imported to build and send notification emails.
datetime: used to write a date stamp in the log file, keeping a record of the script's activity.
time: used to wait for a specific amount of time before proceeding with the next task.
os: imported to create and delete the reboot_flag file, which is used to change the sequence of service restarts.
subprocess: imported to run the PS Tools commands that restart the services.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import smtplib
from email import message
from datetime import datetime
import time
import os
import subprocess
Setting up log file: The script opens a log file named uat-runlog in append mode to record the execution logs:
log_object = open('uat-runlog', 'a')
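The script formats a timestamp by hand on every log call. As a possible alternative, Python's standard logging module can add the timestamp automatically. The sketch below assumes the same uat-runlog file name and date format; the helper name build_logger and the logger name sft_monitor are illustrative and not part of the original script.

```python
import logging


def build_logger(path="uat-runlog"):
    """Return a logger that appends timestamped lines to the given file."""
    logger = logging.getLogger("sft_monitor")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.FileHandler(path, mode="a")
        # Mirrors the "<event> at: <timestamp>" layout used by the script.
        handler.setFormatter(
            logging.Formatter("%(message)s at: %(asctime)s", datefmt="%m/%d/%Y, %H:%M:%S")
        )
        logger.addHandler(handler)
    return logger
```

With this in place, a call such as build_logger().info("Login page loaded") would replace the paired datetime.now() and log_object.write() lines.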
Email notification: The following code defines a function send_failure_email() that sends an email notification when a login check fails, meaning the web application is down. The function sets the email parameters, creates a message object using message.Message(), and sets its headers and payload.
The function then connects to the SMTP server by calling smtplib.SMTP() with the server address and port number as arguments. The ehlo() method is called twice: once to identify the client and open the SMTP conversation, and again after starttls(), because the client must re-identify itself once the connection is encrypted. The function then logs in to the SMTP server using the login() method with from_addr and the SMTP password as arguments.
Finally, the function sends the message to the to_addr and to_addr2 email addresses with two calls to server.send_message(), each passing the message object, from_addr, and a to_addrs list as arguments.
def send_failure_email():
    # Sender, recipients, and content of the alert.
    from_addr = 'sftalert@yourdomain.com'
    to_addr = 'team1@yourdomain.com'
    to_addr2 = 'team2@yourdomain.com'
    subject = 'SFT Alert!'
    body = 'Login check has failed. Check application availability ASAP!'
    # Build the message headers and payload.
    msg = message.Message()
    msg.add_header('from', from_addr)
    msg.add_header('to', to_addr)
    msg.add_header('subject', subject)
    msg.set_payload(body)
    # Connect, upgrade the connection to TLS, and authenticate.
    server = smtplib.SMTP('smtp.yoursmtpserver.com', 587)
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login(from_addr, 'yoursmtppassword')
    # Send one copy to each team, then close the connection.
    server.send_message(msg, from_addr=from_addr, to_addrs=[to_addr])
    server.send_message(msg, from_addr=from_addr, to_addrs=[to_addr2])
    server.quit()
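A variation on the function above, offered as a sketch rather than as the script's actual code: it uses email.message.EmailMessage, addresses a single message to both teams, reads the password from an environment variable instead of hard-coding it, and lets a with block close the connection. The helper name build_failure_message and the variable name SFT_SMTP_PASSWORD are assumptions made for illustration.

```python
import os
import smtplib
from email.message import EmailMessage


def build_failure_message(from_addr, to_addrs):
    """Build the alert email; one message addressed to every recipient."""
    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = ", ".join(to_addrs)
    msg["Subject"] = "SFT Alert!"
    msg.set_content("Login check has failed. Check application availability ASAP!")
    return msg


def send_failure_email():
    from_addr = "sftalert@yourdomain.com"
    to_addrs = ["team1@yourdomain.com", "team2@yourdomain.com"]
    msg = build_failure_message(from_addr, to_addrs)
    # SFT_SMTP_PASSWORD is a hypothetical environment variable name.
    password = os.environ["SFT_SMTP_PASSWORD"]
    # The with block closes the SMTP connection even if sending fails.
    with smtplib.SMTP("smtp.yoursmtpserver.com", 587) as server:
        server.starttls()  # issues EHLO itself when needed
        server.login(from_addr, password)
        server.send_message(msg)  # sender and recipients come from the headers
```

Separating message construction from sending also makes the message easy to inspect without contacting a mail server.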
Monitoring: The code snippet below outlines the fundamental purpose of the script: to log in to the web application, click through it, and verify that specific elements load.
First, the code sets up the options and configuration for ChromeDriver, using the Selenium WebDriver library in Python to control the Chrome browser.
options.add_argument("--headless"): runs the Chrome browser in headless mode, meaning without a graphical user interface.
options.add_argument("--no-sandbox"): disables the Chrome sandbox, a security feature that isolates the browser's processes from the rest of the system. Because this weakens security, it is best reserved for trusted internal sites such as this one.
s = Service("chromedriver"): creates an instance of the Service class, which specifies the path to the ChromeDriver executable.
url = "https://uat-sft.yourdomain.com/": sets the URL of the web application that the script will interact with.
driver.get(url): instructs the Chrome browser to navigate to the specified URL.
The next part of the code is a try-except block that attempts to log in to the web application and perform certain checks. If the login and checks succeed, the code writes logs of its activities and returns "up". If they fail, the script catches the exception, logs the failure, sends a failure email using the send_failure_email() function, and returns "down".
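The up/down contract described above can be isolated from Selenium entirely. The following is a minimal sketch of the pattern, not the script's code: run_checks, its parameters, and the callables passed to it are hypothetical stand-ins for the browser steps, the log writer, and send_failure_email().

```python
def run_checks(steps, log, on_failure):
    """Run each (name, callable) step in order; return "up" or "down".

    The first step that raises stops the run, is logged, and triggers
    the failure callback (for example, sending an alert email).
    """
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            log(f"{name} failed: {exc!r}")
            on_failure()
            return "down"
        log(f"{name} passed")
    return "up"
```

Logging the name of the failing step, which the original script does not do, would make it easier to tell a login failure from a page that never loaded.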
Here is a breakdown of the code:
The script first waits a maximum of 10 seconds for the presence of the HTML username element with EC.presence_of_element_located((By.ID, "username")) and writes a log entry if the login page loaded successfully.
The script then finds the HTML elements username and password, enters the login credentials, and clicks the sign-in button with the following code:
driver.find_element(By.ID, "username").send_keys("platformuser@yourdomain.com")
driver.find_element(By.ID, "password").send_keys("platformuserpassword")
driver.find_element(By.ID, "signinButton").click()
Once logged in, the first page contains an HTML element called compose-delivery-link, which is the compose button for a secure delivery. The script waits a maximum of 10 seconds for the presence of this element with EC.presence_of_element_located((By.ID, "compose-delivery-link")) and writes a log entry if the login is successful.
Then the script clicks the compose button and waits a maximum of 10 seconds for the presence of the divSecureMessage element, which is the body of the secure message window. If the checks pass, it writes log entries, logs out by calling driver.get(logouturl), and logs the successful logout.
If the checks fail, the script catches the exception, logs the failure, sends a failure email using the send_failure_email() function, and returns "down". If the checks succeed, the script creates the reboot_flag file if it does not already exist, closes the log file and the web driver, and returns "up".
def monitor():
    # Configure headless Chrome.
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    s = Service("chromedriver")
    url = "https://uat-sft.yourdomain.com/"
    driver = webdriver.Chrome(options=options, service=s)
    try:
        # Inside the try block so that an unreachable site is also reported as down.
        driver.get(url)
        # Check 1: the login page loads.
        usernameelement = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "username"))
        )
        time.sleep(10)
        if usernameelement.is_displayed():
            print("Login page loaded!")
            now = datetime.now()
            log_object.write("Login page loaded at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        # Check 2: the login succeeds and the compose button appears.
        driver.find_element(By.ID, "username").send_keys("platformuser@yourdomain.com")
        driver.find_element(By.ID, "password").send_keys("platformuserpassword")
        now = datetime.now()
        log_object.write("Attempting login at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        driver.find_element(By.ID, "signinButton").click()
        time.sleep(10)
        composebuttonelement = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "compose-delivery-link"))
        )
        if composebuttonelement.is_displayed():
            now = datetime.now()
            log_object.write("Successfully logged in at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        # Check 3: the compose delivery page opens.
        now = datetime.now()
        log_object.write("Opening compose delivery page at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        driver.find_element(By.ID, "compose-delivery-link").click()
        time.sleep(10)
        divSecureMessageelement = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "divSecureMessage"))
        )
        if divSecureMessageelement.is_displayed():
            now = datetime.now()
            log_object.write("Compose delivery page loaded at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        now = datetime.now()
        log_object.write("All checks passed at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        # Log out first, then record the logout.
        logouturl = "https://uat-sft.yourdomain.com/bds/Logout.do"
        driver.get(logouturl)
        now = datetime.now()
        log_object.write("Successfully logged out at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        log_object.write("------------------------------------------------\n")
        log_object.close()
        # A healthy run (re)creates the flag that selects the restart order.
        if not os.path.exists('reboot_flag'):
            open('reboot_flag', 'x').close()
        return "up"
    except Exception:
        now = datetime.now()
        log_object.write("SFT health check failed at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        send_failure_email()
        return "down"
    finally:
        # Always end the browser session, including after a failed check.
        driver.quit()
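The WebDriverWait(driver, 10).until(...) calls in monitor() poll a condition until it returns a truthy value or the timeout elapses. The standard-library sketch below imitates that idea to show the mechanism; it is not Selenium's implementation, and wait_until and its parameters are illustrative names.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # hand back whatever the condition found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

Because a wait of this kind returns as soon as the element is found, the fixed time.sleep(10) calls that accompany each wait in the script may be unnecessary. They add up to 30 seconds to every healthy run.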
Corrective Actions:
In the following code, appstatus = monitor() assigns the output of the monitor() function to appstatus. monitor() checks the status of the application and returns "up" or "down".
If appstatus is "down", the code checks for the existence of a file called reboot_flag. If the file exists, it restarts the SQL services by calling psService.exe through subprocess.call(), then restarts the Tomcat server, first killing its process by calling pskill.exe. The reason for using pskill.exe instead of stopping the service with psService.exe is that the Tomcat service takes a significant amount of time to shut down gracefully. To avoid that wait, we forcefully kill the process and then call psService.exe to start the service again. Finally, the code deletes the reboot_flag file using os.remove("reboot_flag").
Once the reboot_flag file is deleted, the script starts again from the beginning on its next iteration to check whether the application is up and running. If the check still fails, the os.path.exists('reboot_flag') test is now false, so the code takes the else branch: it restarts the application services first and then the SQL services, and then creates the reboot_flag file again. This is how the reboot flag is used to alternate the service restart sequence.
def log_event(text):
    # Stamp each entry with the time it is written, not the time the failure was detected.
    log_object.write(text + " at: " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S") + "\n")

def restart_sql():
    log_event("Initiated SQL services restart")
    # Raw string keeps the leading \\ of the host; each argument is a separate list item.
    mssql_arguments = [r'\\sqlserver.yourdomain.com', 'restart', 'mssqlserver']
    subprocess.call(['psService.exe'] + mssql_arguments)
    time.sleep(60)
    log_event("SQL services have been restarted")

def restart_tomcat():
    log_event("Initiated application services restart")
    # Kill the Tomcat process instead of waiting for a graceful shutdown, then start the service.
    tomcat_kill_arguments = [r'\\applicationserver.yourdomain.com', 'Tomcat9']
    subprocess.call(['pskill.exe'] + tomcat_kill_arguments)
    time.sleep(60)
    tomcat_start_arguments = [r'\\applicationserver.yourdomain.com', 'start', 'Tomcat9']
    subprocess.call(['psService.exe'] + tomcat_start_arguments)
    time.sleep(60)
    log_event("Application services have been restarted")

appstatus = monitor()
if appstatus == "down":
    if os.path.exists('reboot_flag'):
        # Flag present: SQL first, then the application; remove the flag.
        restart_sql()
        restart_tomcat()
        os.remove("reboot_flag")
    else:
        # Flag absent: application first, then SQL; recreate the flag.
        restart_tomcat()
        restart_sql()
        open('reboot_flag', 'x').close()
    log_object.close()
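To try the flag logic without touching any server, the restart sequence can be planned as data first and executed afterwards. The sketch below is one way to do that under stated assumptions: plan_restart and the two helper functions are hypothetical names, the host and service names are the placeholders used in this article, and the commands are returned rather than run.

```python
import os

# Raw strings preserve the leading double backslash of the remote host names.
SQL_HOST = r"\\sqlserver.yourdomain.com"
APP_HOST = r"\\applicationserver.yourdomain.com"


def sql_restart_commands():
    return [["psService.exe", SQL_HOST, "restart", "mssqlserver"]]


def tomcat_restart_commands():
    # Kill the process first, then start the service again.
    return [
        ["pskill.exe", APP_HOST, "Tomcat9"],
        ["psService.exe", APP_HOST, "start", "Tomcat9"],
    ]


def plan_restart(flag_path="reboot_flag"):
    """Return the commands to run, in order, and toggle the flag file."""
    if os.path.exists(flag_path):
        # Flag present: SQL first, then the application; remove the flag.
        commands = sql_restart_commands() + tomcat_restart_commands()
        os.remove(flag_path)
    else:
        # Flag absent: application first, then SQL; recreate the flag.
        commands = tomcat_restart_commands() + sql_restart_commands()
        open(flag_path, "x").close()
    return commands
```

Each returned list could then be passed to subprocess.call(), with the script's 60-second pauses between commands. Printing the plan instead gives a dry run.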
Flowchart:
Conclusion
Through research and development, I was able to build a proactive monitoring solution using Selenium and Python that significantly reduced downtime. This automation eliminated the need for manual intervention, allowing engineers to focus on higher-priority tasks. Eventually, after months of investigation, the vendor provided a script to clean unnecessary data, permanently resolving the deadlock issue. However, during that time, our automation saved countless hours and prevented service disruptions.
By leveraging Selenium, Python, and system administration tools, we successfully implemented an automated recovery system that ensured seamless application availability with minimal human intervention.
Original article: Proactive Web Application Monitoring and Automated Recovery with Selenium and Python