As a developer, I get a thrill from watching machine-learning conference talks that demonstrate lifelike chatbots or impressive image-analysis tools.
Personally, though, I have failed several times to get started with handling and analyzing data, which many say is the foundation of ML. Data analysis can be tough to start learning: most of the tutorials I found went so deep into complex math formulas that I could not follow them after a few lectures.
This weekend I decided to try something basic, without any prior knowledge of Python or the other tools involved. I had to search for solutions all along the way, but it was a nice experience overall.
So, I decided to scrape my ride history from Uber’s website and try to find answers to questions like:
- How much have I paid to Uber in total?
- How much time have I spent on Uber rides or waiting for rides?
- How much have I saved by choosing ride-sharing as an option?
- How much have I paid per km for various Uber rides?
And much more, depending on how much data you scrape and how inquisitive you are 😀
The same data may be obtainable from the Uber developer APIs, but to learn the important concept of web scraping I decided to take the longer path.
So let’s start
Step 1 – Scrape data from Uber’s history
We can get Uber’s ride history from this link. It is a paginated endpoint that lists rides, and most of the information we need is available on the page itself.
To make things simpler, a browser plugin is available: Uber data extractor. We just need to open the trips page and click on the plugin. After some time, an Excel sheet with most of the information we need appears in our downloads.
This plugin iterates through the paginated endpoint https://riders.uber.com/trips?page={page-number} and saves the fetched data in an Excel sheet. The collected sheet has the following columns: trip_id, date, date_time, driver, car_type, city, price, payment_method, start_time, start_address, end_time, end_address.
But we are not finished yet. Two important parameters still have to be captured: distance and trip time.
This data is available on the details page of each ride, which can be accessed at https://riders.uber.com/trips/{trip-id}.
Thankfully, the data collected by the Chrome plugin includes the trip IDs, from which we can generate an array of the links we need to visit.
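Generating that array could look roughly like the sketch below. The helper name `build_trip_urls` and the sample IDs are made up for illustration; only the URL pattern comes from the page described above.

```python
# Hypothetical helper: turn trip IDs from the exported sheet into detail-page URLs.
TRIP_URL_TEMPLATE = "https://riders.uber.com/trips/{trip_id}"


def build_trip_urls(trip_ids):
    """Return one detail-page URL per non-empty trip ID, preserving order."""
    urls = []
    for trip_id in trip_ids:
        trip_id = str(trip_id).strip()
        if trip_id:  # skip blank cells from the sheet
            urls.append(TRIP_URL_TEMPLATE.format(trip_id=trip_id))
    return urls


# The IDs below are invented placeholders, not real trip IDs.
arr = build_trip_urls(["abc-123", " def-456 ", ""])
print(arr)
```

In practice the IDs would come from the trip_id column of the downloaded sheet rather than a hard-coded list.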
Now we can use Selenium and ChromeDriver to automate opening the links and printing the required fields. Since I am not very strong in Python, please forgive any basic mistakes.
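from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time

# Selenium 4 takes the chromedriver path through a Service object;
# the older executable_path / chrome_options arguments have been removed.
option = webdriver.ChromeOptions()
service = Service('/Users/atiwari4/Downloads/chromedriver')
browser = webdriver.Chrome(service=service, options=option)

# Open the trips page; Uber asks for a login first.
browser.get("https://riders.uber.com/trips")
# Pause for 60 seconds so the login can be completed by hand.
time.sleep(60)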
The code above opens https://riders.uber.com/trips, which in turn asks for authentication. We need to log in to our Uber account before the next steps can run. That is why I added time.sleep(60): it gives us a 60-second buffer to complete the login process.
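A fixed 60-second pause works, but it always waits the full minute. As a rough alternative, a small polling helper can continue as soon as some condition holds. The sketch below is plain Python, not part of Selenium; `wait_until` is a hypothetical name, and the login check is faked here so the example runs on its own.

```python
import time


def wait_until(condition, timeout=60.0, poll_interval=1.0):
    """Poll condition() until it is truthy or timeout seconds have passed.

    Returns True as soon as the condition holds, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Stand-in for a real check such as: lambda: "/trips" in browser.current_url
# (assumption: the browser ends up back on the trips page after a successful login).
attempts = {"count": 0}


def fake_login_finished():
    attempts["count"] += 1
    return attempts["count"] >= 3


logged_in = wait_until(fake_login_finished, timeout=5, poll_interval=0.01)
print(logged_in)
```

With a real browser object, the condition would inspect the page instead of a counter; which check reliably signals a finished login is something to verify against the actual site.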
for a in arr:
    row = []
    browser.get(a)
    # Give the details page a few seconds to render.
    time.sleep(3)
    # find_elements returns a list, which is simply empty when nothing matches.
    distance = browser.find_elements(By.XPATH, "/html/body/div[2]/div/div[2]/div/div/div[2]/div[3]/div/div[1]/div[2]/div[2]/div/div[2]/h5")
    trip_time = browser.find_elements(By.XPATH, "/html/body/div[2]/div/div[2]/div/div/div[2]/div[3]/div/div[1]/div[2]/div[2]/div/div[3]/h5")
    row.append(a)
    distancex = [x.text for x in distance]
    row.append(distancex)
    trip_timex = [x.text for x in trip_time]
    row.append(trip_timex)
    print(row)
Here arr is the array of ride detail page links.
Using XPath, we pick the required elements from the page and print them to the console. The code above may take a while, depending on the number of trips you have completed. For more information on web scraping with Selenium, refer to this article.
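Instead of copying the printed rows from the console, the scraped values could be written straight to a CSV file. This is only a sketch: the function names, the column names and the sample values are assumptions, and the real text depends on what the Uber page renders.

```python
import csv
import io


def flatten_row(url, distance_texts, trip_time_texts):
    """Collapse the lists of element texts into single strings for one CSV row."""
    return [url, " ".join(distance_texts), " ".join(trip_time_texts)]


def write_rows(rows, out_file):
    """Write a header plus one line per scraped trip to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["trip_url", "distance", "trip_time"])
    writer.writerows(rows)


# Invented sample values for illustration only.
rows = [
    flatten_row("https://riders.uber.com/trips/abc-123", ["5.2 km"], ["00:21:35"]),
    flatten_row("https://riders.uber.com/trips/def-456", [], []),  # nothing matched
]
buffer = io.StringIO()  # swap for open("trip_details.csv", "w", newline="") to write to disk
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Inside the scraping loop, `flatten_row(a, distancex, trip_timex)` would take the place of the `print(row)` call, with the rows collected in a list.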
Now my data-sheet looks something like this.
Step 2 – Analyze data using Google Colaboratory
Google Colaboratory (Colab) is a cloud environment for running live code.
We can import sheets into Colab in various ways. Since our sheet is already uploaded to Google Sheets, we can access it directly from there.
First, we need to authenticate our Google account with Colab.
!pip install --upgrade -q gspread
!pip install -U matplotlib
from google.colab import auth
auth.authenticate_user()
import gspread
# oauth2client is deprecated; credentials are obtained through google.auth instead.
from google.auth import default
creds, _ = default()
gc = gspread.authorize(creds)
Once we are authenticated, this is how we import the sheet from Google Sheets:
# Open our new sheet and read some data.
worksheet = gc.open('Uber').sheet1
# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
print(rows)
# Convert to a DataFrame and render.
import pandas as pd
df = pd.DataFrame.from_records(rows)
df.head()
df stands for DataFrame, which is provided by pandas. According to their documentation, it is:
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.
The head(n=5) function returns the top n (default 5) rows of the data frame.
The first row in our sheet actually holds the column names and should be made the header of the sheet.
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
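Another way to get the same result, sketched here on invented sample rows, is to split off the header row before the DataFrame is built. The values below are placeholders in the shape that get_all_values() returns (a list of rows).

```python
import pandas as pd

# Invented rows in the list-of-lists shape returned by worksheet.get_all_values().
rows = [
    ["trip_id", "city", "price"],
    ["abc-123", "Pune", "₹89.53"],
    ["def-456", "Pune", "₹120.00"],
]

# Use the first row as column names and the remaining rows as data.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.columns.tolist())
print(len(df))
```

Both approaches leave the same columns; this one just avoids carrying the header along as a data row first.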
Now we need to pick a column with unique values to use as the index.
df['trip_id'].is_unique
df = df.set_index('trip_id')
df.head()
The schema looks better now. But what about the data? We need to clean it before we start looking for answers. For example, values in the price column look something like ₹89.53. We need to remove the first 3 characters (the currency prefix in my sheet) from that column.
df['price'] = df['price'].str.slice(3)
Next, we need to remove the commas from the price column so that it can be converted to a numeric data type. We also need the date column to have a datetime data type.
df['price'] = df['price'].str.replace(',','')
df['price'] = pd.to_numeric(df['price'])
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
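Slicing a fixed number of characters depends on every price having exactly the same prefix. A slightly more forgiving variant, sketched below, pulls the number out with a regular expression. The helper `parse_price` is hypothetical and the sample strings are made up.

```python
import re

# One run of digits, optionally with thousands separators and a decimal part.
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Extract a float from a price string such as '₹1,289.53'.

    Returns None when the string contains no number (for example an empty cell).
    """
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_price("₹1,289.53"))
print(parse_price(""))
```

On the DataFrame this could be applied with `df['price'] = df['price'].map(parse_price)`, in place of the slice and replace steps.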
Now many rows will have NaN values. These are mostly cancelled rides, and we need to get rid of them.
import numpy as np
# Treat blank cells and the "unknown" placeholder as missing values.
# Assigning the result back avoids relying on in-place changes to a column selection.
df['end_time'] = df['end_time'].replace(['', 'Dropoff time unknown'], np.nan)
df['start_time'] = df['start_time'].replace('', np.nan)
# Drop every ride that lacks an end time, a start time, or a price.
df = df.dropna(subset=['end_time', 'start_time', 'price'])
Finally, the question of the decade 😀
How much have I paid to Uber?
Just type this to get your answer:
df['price'].sum()
It turns out that I have paid Rs. 26852.03 to Uber (including Uber Eats). We can also separate the various Uber services and see what we have paid for each of them individually.
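A per-service breakdown could be sketched with a groupby on the car_type column. The service names and prices below are invented sample data, not figures from my sheet.

```python
import pandas as pd

# Invented sample in the shape of the cleaned sheet.
df = pd.DataFrame({
    "car_type": ["UberGo", "Pool", "UberGo", "Pool", "UberEats"],
    "price": [100.0, 40.0, 150.0, 60.0, 300.0],
})

# Total spend and number of rides per service, highest spend first.
per_service = (
    df.groupby("car_type")["price"]
    .agg(total="sum", rides="count")
    .sort_values("total", ascending=False)
)
print(per_service)
```

Running the same groupby on the real DataFrame gives one row per service with its total and ride count.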
To get the number of rides taken with each Uber service, we need to run the following code:
print(df['car_type'].value_counts())
Here is the result –
We can also plot graphs in Colab using the matplotlib library. Here is an example of a time vs price graph.
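One possible version of such a graph is a line plot of the total paid per month. The sketch below uses invented dates and prices, and the monthly grouping is my own choice for illustration rather than the exact plot from the notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample data with the same date and price columns as the cleaned sheet.
df = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-02-03", "2018-03-15"]),
    "price": [100.0, 50.0, 80.0, 120.0],
})

# Sum the prices per calendar month ("MS" labels each bucket by month start).
monthly = df.set_index("date")["price"].resample("MS").sum()

fig, ax = plt.subplots()
ax.plot(monthly.index, monthly.values, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Amount paid")
ax.set_title("Monthly Uber spend")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

In a Colab cell the figure is rendered inline once the cell finishes.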
A thousand more answers can be retrieved from this dataset. Feel free to try new combinations and comment in the section below.
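As one example, the price-per-km question from the start of the post could be approached as below. This sketch assumes the scraped distances have already been merged into the sheet as a numeric column; the name `distance_km` and all values are hypothetical.

```python
import pandas as pd

# Invented sample; distance_km is a hypothetical column holding the scraped distance.
df = pd.DataFrame({
    "trip_id": ["abc-123", "def-456", "ghi-789"],
    "price": [100.0, 45.0, 0.0],
    "distance_km": [5.0, 3.0, 0.0],
}).set_index("trip_id")

# Skip zero-distance rows to avoid dividing by zero.
rides = df[df["distance_km"] > 0].copy()
rides["price_per_km"] = rides["price"] / rides["distance_km"]

# Overall rate: total paid divided by total distance.
overall = rides["price"].sum() / rides["distance_km"].sum()
print(rides["price_per_km"])
print(overall)
```

Grouping `rides` by car_type before dividing would give the rate for each service separately.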
Click here for the Colab notebook with many more inferences.