Introduction:
In this tutorial I will show you how to use Python to automatically check a website like Expedia every hour for flights on a route of your choice, and send the best fare for that route straight to your email. The end result is this nice email:
We will work as follows:
- Connect python to our web browser & access the website (Expedia in our example here)
- Choose the ticket type based on our preference (round trip, one way... etc.)
- Select the departure country
- Select the arrival country
- Select departure & return dates (the return date only applies to a round trip)
- Compile all available flights in a structured format (for those who love to do some exploratory data analysis!)
- Connect to your email
- Send the best rate for the current hour
Let's get started!
Importing Libraries:
Let's go ahead and import our libraries.
Selenium (for accessing websites & automation testing):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
Pandas (we will mainly use pandas for structuring our data):
import pandas as pd
time & datetime (for adding delays & getting the current time; we will see why later):
import time
import datetime
smtplib & MIMEMultipart (we need those for connecting to our email & sending our message):
import smtplib
from email.mime.multipart import MIMEMultipart
Note: I will not go too deeply into web scraping using selenium, but if you want a more detailed tutorial for scraping in general check my previous tutorials for scraping using Selenium and web scraping in general Part 1 & Part 2.
Let's get coding:
Connect to the web browser:
browser = webdriver.Chrome(executable_path='/chromedriver')
This will open an empty browser telling you that this browser is being controlled by automated test software like so:
Choose ticket:
Next I will quickly go to Expedia to check the interface and the options available to choose from. I right click + inspect on the ticket type (round trip, one way... etc.) to see the tags related to it.
As we can see below it has a 'label' tag with 'id = flight-type-roundtrip-label-hp-flight'.
Accordingly I will use those to store the tags & ids for the 3 different ticket types as follows:
# Setting ticket types paths
return_ticket = "//label[@id='flight-type-roundtrip-label-hp-flight']"
one_way_ticket = "//label[@id='flight-type-one-way-label-hp-flight']"
multi_ticket = "//label[@id='flight-type-multi-dest-label-hp-flight']"
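The three paths differ only in the id, so they could also be generated by a small helper. This is only a sketch: `xpath_by_id`, `TICKET_IDS` and `ticket_paths` are names made up for illustration, and the ids are the ones shown above, which may have changed on the live site.

```python
def xpath_by_id(tag, element_id):
    """Build an XPath that matches <tag id="element_id">."""
    return "//{}[@id='{}']".format(tag, element_id)


# The ids below are the ones found with right click + inspect in this tutorial.
TICKET_IDS = {
    'return': 'flight-type-roundtrip-label-hp-flight',
    'one_way': 'flight-type-one-way-label-hp-flight',
    'multi': 'flight-type-multi-dest-label-hp-flight',
}

# One XPath per ticket type, keyed by a short name.
ticket_paths = {name: xpath_by_id('label', element_id)
                for name, element_id in TICKET_IDS.items()}

print(ticket_paths['return'])
```

The same helper would work for the input fields used later, for example `xpath_by_id('input', 'flight-origin-hp-flight')`.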
Then I define a function to choose a ticket type:
def ticket_chooser(ticket):
    try:
        ticket_type = browser.find_element_by_xpath(ticket)
        ticket_type.click()
    except Exception as e:
        pass
The above sequence is the same sequence I will use for the rest of the code (look for tags & ids or other attributes and define a function to make the choice on the web page).
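If you are on a recent Selenium 4 release, the `find_element_by_xpath` helpers and the `executable_path` argument are no longer available. Below is a rough sketch of the same "find, then click" pattern for that case. `make_browser` and `click_xpath` are hypothetical names, not part of Selenium, and the driver path is whatever applies on your machine.

```python
def make_browser(driver_path):
    """Start Chrome the Selenium 4 way, passing the driver path through a Service object."""
    # Imported inside the function so the helper below can be tried without a driver installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    return webdriver.Chrome(service=Service(executable_path=driver_path))


def click_xpath(browser, xpath):
    """Click the first element matching xpath and report a failure instead of hiding it."""
    try:
        # "xpath" is the string value of By.XPATH.
        browser.find_element("xpath", xpath).click()
        return True
    except Exception as error:
        print("Could not click {}: {}".format(xpath, error))
        return False
```

Returning True or False instead of silently passing makes it easier to notice when the site has changed an id and a click no longer happens.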
Choose departure & arrival countries:
Below I define a function to choose the departure country.
def dep_country_chooser(dep_country):
    fly_from = browser.find_element_by_xpath("//input[@id='flight-origin-hp-flight']")
    time.sleep(1)
    fly_from.clear()
    time.sleep(1.5)
    fly_from.send_keys(' ' + dep_country)
    time.sleep(1.5)
    first_item = browser.find_element_by_xpath("//a[@id='aria-option-0']")
    time.sleep(1.5)
    first_item.click()
I follow the below logic:
- Find the element using its tag & attributes.
- Clear any value written in the country field.
- Type in the country I want (that will be passed into the function) using .send_keys
- Choose the first choice that appears from the drop down menu (also using its tag & id, which can be found by right click + inspect on the element when the drop down menu appears)
- Click this first choice.
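The fixed `time.sleep` pauses in these steps work, but they either wait too long or not long enough when the drop down menu is slow. Selenium ships `WebDriverWait` for this purpose. The sketch below shows the underlying idea with the standard library only; `wait_until` is a made-up helper name, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} seconds'.format(timeout))
        time.sleep(poll)


# Hypothetical use with the drop down from the steps above (an empty list is falsy,
# so the helper keeps polling until at least one matching element exists):
# options = wait_until(lambda: browser.find_elements_by_xpath("//a[@id='aria-option-0']"))
# options[0].click()
```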
Let's do the same for the arrival country.
def arrival_country_chooser(arrival_country):
    fly_to = browser.find_element_by_xpath("//input[@id='flight-destination-hp-flight']")
    time.sleep(1)
    fly_to.clear()
    time.sleep(1.5)
    fly_to.send_keys(' ' + arrival_country)
    time.sleep(1.5)
    first_item = browser.find_element_by_xpath("//a[@id='aria-option-0']")
    time.sleep(1.5)
    first_item.click()
Choosing the departure & return dates:
Departure date:
def dep_date_chooser(month, day, year):
    dep_date_button = browser.find_element_by_xpath("//input[@id='flight-departing-hp-flight']")
    dep_date_button.clear()
    dep_date_button.send_keys(month + '/' + day + '/' + year)
Very straightforward:
- Find the element on the web page like before.
- Clear whatever was written previously.
- Fill the element with the month, day & year entered in the function as arguments + the slashes for date format.
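Since the function expects the month, day and year as strings, one way to avoid typing them by hand is to derive them from a `datetime.date`. This is a small sketch with made-up helper names (`expedia_date_parts`, `expedia_date_string`); it assumes the MM/DD/YYYY format used in this tutorial.

```python
import datetime


def expedia_date_parts(date):
    """Return (month, day, year) as zero-padded strings, e.g. ('04', '01', '2019')."""
    return date.strftime('%m'), date.strftime('%d'), date.strftime('%Y')


def expedia_date_string(date):
    """Return the MM/DD/YYYY string that the date field is filled with."""
    return '/'.join(expedia_date_parts(date))


departure = datetime.date(2019, 4, 1)
print(expedia_date_string(departure))  # 04/01/2019
```

The parts can be passed straight on, for example `dep_date_chooser(*expedia_date_parts(departure))`.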
def return_date_chooser(month, day, year):
    return_date_button = browser.find_element_by_xpath("//input[@id='flight-returning-hp-flight']")
    for i in range(11):
        return_date_button.send_keys(Keys.BACKSPACE)
    return_date_button.send_keys(month + '/' + day + '/' + year)
For the return date, clearing the field with .clear() wasn't working for some reason (probably because the page autofills this field and doesn't let me override it with .clear()).
I worked around this with Keys.BACKSPACE, which simply tells Python to press backspace (to delete whatever is written in the date field). I put it in a for loop that presses backspace 11 times to delete all the characters of the date in the field.
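The number 11 is tied to the length of the autofilled date. As a sketch of a slightly more general version, the count could be taken from the field's current value instead. `clear_with_backspace` is a hypothetical helper; it assumes the field object offers `get_attribute('value')` and `send_keys`, as Selenium elements do.

```python
BACKSPACE = '\ue003'  # the same character that selenium's Keys.BACKSPACE holds


def clear_with_backspace(field, extra=1):
    """Press backspace once per character currently in the field, plus a small margin."""
    current_text = field.get_attribute('value') or ''
    presses = len(current_text) + extra
    for _ in range(presses):
        field.send_keys(BACKSPACE)
    return presses
```

For a date such as 04/01/2019 (10 characters) this presses backspace 11 times, which matches the hard-coded loop above.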
Getting the results:
Define the function that will click the search button.
def search():
    search_button = browser.find_element_by_xpath("//button[@class='btn-primary btn-action gcw-submit']")
    search_button.click()
    time.sleep(15)
    print('Results ready!')
The resulting webpage is as follows (with the fields I am interested in marked):
Compiling the data:
We will use this sequence to compile our data:
- First create a pandas DataFrame to hold our data
- Create variables for all the flight attributes (highlighted in the previous picture) to be stored in lists.
- Find all the elements for an attribute (for example all departure times)
- Store them as a list in the related variable we created
- Put all those lists side by side as columns in our DataFrame
- Save the DataFrame to an Excel sheet (in case we want to analyze this data later)
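One thing to watch in this sequence is that the scraped lists are not always the same length (a listing can be missing a price or a layover). The sketch below pads the short lists before the columns are put side by side. `rows_from_columns` is a made-up name, and the sample values are invented purely for illustration.

```python
from itertools import zip_longest


def rows_from_columns(columns):
    """Turn {'column name': [values...]} into a list of row dicts, padding short columns with None."""
    names = list(columns)
    return [dict(zip(names, values))
            for values in zip_longest(*columns.values(), fillvalue=None)]


# Invented sample data standing in for the scraped lists.
scraped = {
    'departure_time': ['10:05am', '11:30pm'],
    'airline': ['Airline A', 'Airline B'],
    'price': ['650'],  # one price missing on purpose
}
rows = rows_from_columns(scraped)
print(rows)
```

A list of dicts like `rows` can be handed to `pd.DataFrame(rows)` to get one column per key.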
df = pd.DataFrame()

def compile_data():
    global df
    global dep_times_list
    global arr_times_list
    global airlines_list
    global price_list
    global durations_list
    global stops_list
    global layovers_list

    # departure times
    dep_times = browser.find_elements_by_xpath("//span[@data-test-id='departure-time']")
    dep_times_list = [value.text for value in dep_times]

    # arrival times
    arr_times = browser.find_elements_by_xpath("//span[@data-test-id='arrival-time']")
    arr_times_list = [value.text for value in arr_times]

    # airline name
    airlines = browser.find_elements_by_xpath("//span[@data-test-id='airline-name']")
    airlines_list = [value.text for value in airlines]

    # prices
    prices = browser.find_elements_by_xpath("//span[@data-test-id='listing-price-dollars']")
    price_list = [value.text.split('$')[1] for value in prices]

    # durations
    durations = browser.find_elements_by_xpath("//span[@data-test-id='duration']")
    durations_list = [value.text for value in durations]

    # stops
    stops = browser.find_elements_by_xpath("//span[@class='number-stops']")
    stops_list = [value.text for value in stops]

    # layovers
    layovers = browser.find_elements_by_xpath("//span[@data-test-id='layover-airport-stops']")
    layovers_list = [value.text for value in layovers]

    # column name for this run's prices, e.g. price(2019-3-5---9:7)
    now = datetime.datetime.now()
    current_date = (str(now.year) + '-' + str(now.month) + '-' + str(now.day))
    current_time = (str(now.hour) + ':' + str(now.minute))
    current_price = 'price' + '(' + current_date + '---' + current_time + ')'

    for i in range(len(dep_times_list)):
        try:
            df.loc[i, 'departure_time'] = dep_times_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'arrival_time'] = arr_times_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'airline'] = airlines_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'duration'] = durations_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'stops'] = stops_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, 'layovers'] = layovers_list[i]
        except Exception as e:
            pass
        try:
            df.loc[i, str(current_price)] = price_list[i]
        except Exception as e:
            pass

    print('Excel Sheet Created!')
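The price column label in compile_data is assembled from separate year, month, day, hour and minute pieces. As a sketch, `strftime` can build a similar label in one call. `price_column_label` is a hypothetical name, and note one difference from the original: this version zero-pads the numbers, so the columns sort correctly as text.

```python
import datetime


def price_column_label(now=None):
    """Return a label such as 'price(2019-03-05---09:07)' for the given (or current) time."""
    now = now or datetime.datetime.now()
    return now.strftime('price(%Y-%m-%d---%H:%M)')


print(price_column_label(datetime.datetime(2019, 3, 5, 9, 7)))  # price(2019-03-05---09:07)
```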
Setting up our email functions:
In this part I will set up 3 functions:
- One to connect to my email.
- One to create the message.
- A final one to actually send it.
# email credentials
username = 'myemail@hotmail.com'
password = 'XXXXXXXXXXX'
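If you would rather not keep the password inside the script, one option is to read both values from environment variables. This is only a sketch: `load_credentials` and the variable names `FLIGHT_MAIL_USER` and `FLIGHT_MAIL_PASSWORD` are made up for this example.

```python
import os


def load_credentials(environ=os.environ):
    """Read the email login from environment variables instead of the script itself."""
    try:
        return environ['FLIGHT_MAIL_USER'], environ['FLIGHT_MAIL_PASSWORD']
    except KeyError as missing:
        # Tell the user which variable is missing instead of failing later at login.
        raise RuntimeError('Set the {} environment variable first'.format(missing)) from None


# Hypothetical use, once both variables are set in the shell that starts the script:
# username, password = load_credentials()
```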
Connect:
def connect_mail(username, password):
    global server
    server = smtplib.SMTP('smtp.outlook.com', 587)
    server.ehlo()
    server.starttls()
    server.login(username, password)
Create the message:
# Create message template for email
def create_msg():
    global msg
    msg = '\nCurrent Cheapest flight:\n\nDeparture time: {}\nArrival time: {}\nAirline: {}\nFlight duration: {}\nNo. of stops: {}\nPrice: {}\n'.format(
        cheapest_dep_time,
        cheapest_arrival_time,
        cheapest_airline,
        cheapest_duration,
        cheapest_stops,
        cheapest_price)
The variables used here, such as cheapest_arrival_time and cheapest_airline, will be defined later when we start running all our functions, so that they hold the values for each particular run.
Send the message:
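from email.mime.text import MIMEText

def send_email(msg):
    global message
    message = MIMEMultipart()
    message['Subject'] = 'Current Best flight'
    message['From'] = 'myemail@hotmail.com'
    message['To'] = 'myotheremail@hotmail.com'
    # attach the text as the body so the subject & address headers are sent along with it
    message.attach(MIMEText(msg, 'plain'))
    server.sendmail('myemail@hotmail.com', 'myotheremail@hotmail.com', message.as_string())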
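For comparison, here is a sketch of the same email step using `EmailMessage` from the standard library, which keeps the headers and the body in one object. `build_message` and `send_message` are made-up names, and the host and port are simply the ones used in this tutorial.

```python
import smtplib
from email.message import EmailMessage


def build_message(body, sender, recipient, subject='Current Best flight'):
    """Create an email with subject, sender, recipient and a plain-text body."""
    message = EmailMessage()
    message['Subject'] = subject
    message['From'] = sender
    message['To'] = recipient
    message.set_content(body)
    return message


def send_message(message, username, password, host='smtp.outlook.com', port=587):
    """Log in, send and disconnect; the with block closes the connection for us."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(message)
```

Building the message separately from sending it means the text can be printed and checked before any login is attempted.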
Let's run our code!
Now we will finally run our functions. We will use the below logic:
The data scraping part:
- Access our link for Expedia and sleep 5 seconds to allow the page to load.
- Choose "flights only" as I am not currently interested in other offers such as flights + hotel.
- Run our ticket chooser function for a return ticket.
- Run our departure country chooser (for Cairo since this is where I am currently at :D)
- Run our arrival country chooser (let's do New York)
- Run our departure date chooser (it is preferred to put 0 before your month or day like 01 for January for example as this is the format Expedia uses)
- Run our return date chooser.
- Run our Search & compile functions.
- Access the first row of our DataFrame since usually the first flight is the cheapest & best one on Expedia, but anyway if we want to go deeper we can filter by the minimum price and get that row.
- Assign the values in each column of the row we chose into variables to be used in our email message like (cheapest_dep_time, cheapest_arrival_time.... etc.)
- Run our email functions to create the message, connect & send the email.
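The steps above take the first row on the assumption that it is the cheapest. If you prefer to filter by the minimum price instead, a sketch could look like this. `parse_price` and `cheapest_index` are hypothetical helpers, and the sample prices in the usage note are invented.

```python
def parse_price(text):
    """Convert a scraped price such as '$1,234' or '1,234' into a float."""
    return float(text.replace('$', '').replace(',', '').strip())


def cheapest_index(price_texts):
    """Return the position of the lowest price, skipping entries that cannot be parsed."""
    best_index, best_price = None, None
    for index, text in enumerate(price_texts):
        try:
            price = parse_price(text)
        except (ValueError, AttributeError):
            continue  # empty or missing price, ignore this listing
        if best_price is None or price < best_price:
            best_index, best_price = index, price
    return best_index
```

With invented prices `['1,250', '980', '', '$1,010']` the function returns 1, and a row could then be picked with something like `df.iloc[cheapest_index(price_list)]`.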
The loop below will run 8 times at 1 hour intervals, so it will run for 8 hours. You can tweak the timing to your preference.
for i in range(8):
    link = 'https://www.expedia.com/'
    browser.get(link)
    time.sleep(5)

    # choose flights only
    flights_only = browser.find_element_by_xpath("//button[@id='tab-flight-tab-hp']")
    flights_only.click()

    ticket_chooser(return_ticket)
    dep_country_chooser('Cairo')
    arrival_country_chooser('New york')
    dep_date_chooser('04', '01', '2019')
    return_date_chooser('05', '02', '2019')
    search()
    compile_data()

    # save values for email
    current_values = df.iloc[0]
    cheapest_dep_time = current_values[0]
    cheapest_arrival_time = current_values[1]
    cheapest_airline = current_values[2]
    cheapest_duration = current_values[3]
    cheapest_stops = current_values[4]
    cheapest_price = current_values[-1]

    print('run {} completed!'.format(i))

    create_msg()
    connect_mail(username, password)
    send_email(msg)
    print('Email sent!')

    df.to_excel('flights.xlsx')

    time.sleep(3600)
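As written, an error in any single run (a page that fails to load, an element that is not found) stops the whole loop. Below is a sketch of a wrapper that keeps going and records what failed. `run_repeatedly` is a made-up name, and `job` stands for a function that holds the body of the loop above.

```python
import time


def run_repeatedly(job, runs=8, interval_seconds=3600, sleep=time.sleep):
    """Call job(run_number) `runs` times, pausing between runs; return the failures."""
    failures = []
    for run_number in range(runs):
        try:
            job(run_number)
        except Exception as error:
            # Remember the failure and carry on with the next scheduled run.
            failures.append((run_number, error))
            print('run {} failed: {}'.format(run_number, error))
        if run_number < runs - 1:
            sleep(interval_seconds)  # no need to wait after the last run
    return failures
```

The `sleep` parameter exists only so the function can be tried quickly with a stand-in that does not actually wait.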
Now I will be getting this email every hour for the next 8 hours:
I also have this neat excel sheet with all the flights and it will keep updating each hour with a new column for the current price:
Now you can take this further by applying so many other ideas such as:
- Accessing multiple websites and sending yourself the current best rates from each website.
- Running loops for multiple date ranges and checking which dates give the best prices on which websites.
- Checking how the price evolves over time for each airline.
That's it! I hope you found it useful.