Web crawling using Selenium (OECD better life index)

Introduction:

In this article I use Selenium to extract data from http://www.oecdbetterlifeindex.org.
For each country, I will be getting:
  • Population
  • Visitors per year
  • Renewable Energy
  • Indices for things like:
    • Housing
    • Income
    • Jobs
    • Community 
    • Education
    • Environment
    • Civic Engagement
    • Health
    • Life Satisfaction
    • Safety
    • Work-life Balance 

Importing Libraries:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

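Loading the Chrome driver:
I will be using Google Chrome to crawl the page, so we need the ChromeDriver executable (drivers for each web browser are available online and can be downloaded).

# executable_path works on Selenium 3 and early Selenium 4 releases;
# newer releases expect webdriver.Chrome(service=Service('/chromedriver'))
# with Service imported from selenium.webdriver.chrome.service
browser = webdriver.Chrome(executable_path='/chromedriver')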

Define our 1st function to:
1. Open the page for the country we pass into the function
2. Use XPath to find the 'td' tags that hold our desired values (you can check this by right-clicking the page and choosing Inspect to view its HTML)
3. Collect the text values in a list "info"
4. Check whether the list is too short and print that the page did not load properly if so. The list can come back empty when the page was not fully loaded at the moment we read it; if this happens, add a delay after browser.get(link) using time.sleep() to give the page time to load
5. Keep only the values we are looking for:
  • Population
  • Visitors per year
  • Renewable Energy %
6. Finally, loop over the column names in the list "info_columns" (defined later) and the values in "info" to fill our DataFrame
def info_getter(country):
    link = 'http://www.oecdbetterlifeindex.org/countries/{}/'.format(country)
    browser.get(link)
    # Every value we want sits in a <td> cell
    info_list = browser.find_elements(By.XPATH, '//td')
    info = [cell.text for cell in info_list]

    # Check before indexing: an unloaded page returns too few cells
    if len(info) < 5:
        print(country + ' not loaded properly... increase the wait time')
    else:
        print(country + ' loaded successfully')
        # Cells 0, 2 and 4 hold population, visitors per year and renewable energy
        info = [country, info[0], info[2], info[4]]

        # i is the row index set by the loop that calls this function
        for title, value in zip(info_columns, info):
            df.loc[i, title] = value
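
The "check, then keep only the values we want" steps can also be pulled out of the browser code so they are easy to try on their own. The following is a minimal sketch, not part of the original script: pick_info is a hypothetical helper name, and the sample cell texts are made up for illustration rather than taken from the site.

```python
def pick_info(country, cells):
    """Return [country, population, visitors, renewable], or None.

    `cells` is the list of <td> texts; positions 0, 2 and 4 are the
    ones the article keeps. pick_info is a hypothetical helper name.
    """
    wanted = (0, 2, 4)
    if len(cells) <= max(wanted):
        return None  # page not loaded: too few cells to index safely
    return [country] + [cells[k] for k in wanted]


# Made-up cell texts, for illustration only (not real OECD figures)
sample_cells = ['a', 'label', 'b', 'label', 'c']
print(pick_info('testland', sample_cells))  # ['testland', 'a', 'b', 'c']
print(pick_info('testland', []))            # None
```

Returning None for a short list lets the caller decide whether to retry or skip the country, instead of hitting an IndexError.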


Define our 2nd function to:
Do the same as the previous function, except this time we add time.sleep(4): the elements I am accessing here take time to load, so without a 4-second delay the list comes back empty.

def topics_getter(country):
    link = 'http://www.oecdbetterlifeindex.org/countries/{}/'.format(country)
    browser.get(link)
    # Give the topic scores time to render before reading them
    time.sleep(4)
    info_list = browser.find_elements(By.XPATH, "//div[@class='value']")
    info = [element.text for element in info_list]
    # Keep the first 11 values, one per topic
    info = info[0:11]

    if not info:
        print(country + ' not loaded properly... increase the wait time')
    else:
        print(country + ' loaded successfully')

        # i is the row index set by the loop that calls this function
        for topic, value in zip(topics, info):
            df.loc[i, topic] = value
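
A fixed 4-second sleep is either too long or too short depending on the connection. One possible alternative, sketched here under assumptions, is to poll until the values appear or a timeout passes. wait_for_values is a hypothetical helper, not something from the original script, and fake_fetch only stands in for the real browser call; Selenium also ships its own WebDriverWait class for this purpose.

```python
import time


def wait_for_values(fetch, timeout=10.0, interval=0.5):
    """Call fetch() until it returns a non-empty list or the timeout passes.

    Returns the values, or an empty list if nothing showed up in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        values = fetch()
        if values:
            return values
        if time.monotonic() >= deadline:
            return []
        time.sleep(interval)


# Stand-in for the browser: empty on the first two calls, then made-up values
attempts = {'count': 0}

def fake_fetch():
    attempts['count'] += 1
    return ['7.5', '6.1'] if attempts['count'] >= 3 else []

print(wait_for_values(fake_fetch, timeout=2.0, interval=0.01))  # ['7.5', '6.1']
```

In the scraper, fetch would be a small function that runs the find_elements call and returns the list of element texts.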



Create a DataFrame to hold the data and define our columns in 2 lists:
df = pd.DataFrame()

info_columns = ['Country', 'population(mil)', 'visitorsPerYear(mil)', 
               'Renewable_energy(%)']

topics = ['housing', 'income', 'jobs', 'community', 'education',
         'environment', 'civic_engagement', 'health', 
         'life_satisfaction', 'safety', 'work-life_balance']


Define our country list:
country_list = ['australia', 'austria', 'belgium', 'brazil', 'canada',
               'chile', 'czech-republic', 'denmark', 'estonia',
               'finland', 'france', 'germany', 'greece', 'hungary',
               'iceland', 'ireland', 'israel', 'italy', 'japan',
               'korea', 'latvia', 'luxembourg', 'mexico',
               'netherlands', 'new-zealand', 'norway', 'poland',
               'portugal', 'russian-federation', 'slovak-republic',
               'slovenia', 'south-africa', 'spain', 'sweden', 'switzerland',
               'turkey', 'united-kingdom', 'united-states']



Run our first function looping over the country list:
for i, country in enumerate(country_list):
    info_getter(country)

Snippet of the result:


Run our 2nd function looping over the country list:
for i, country in enumerate(country_list):
    topics_getter(country)

Snippet of the result:




Check our DataFrame:
df.head()
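
Both functions write into df through the global loop variable i, which works but ties them to the calling loop. As a rough sketch of another option, each country can be turned into one dict keyed by column name and the dicts collected in a list. build_row is a hypothetical helper and the values below are placeholders, not scraped data.

```python
info_columns = ['Country', 'population(mil)', 'visitorsPerYear(mil)',
                'Renewable_energy(%)']

topics = ['housing', 'income', 'jobs', 'community', 'education',
          'environment', 'civic_engagement', 'health',
          'life_satisfaction', 'safety', 'work-life_balance']


def build_row(info_values, topic_values):
    """Pair each column name with its value and return one dict per country."""
    row = dict(zip(info_columns, info_values))
    row.update(zip(topics, topic_values))
    return row


# Placeholder values, for illustration only
rows = [build_row(['testland', 'p', 'v', 'r'], [str(n) for n in range(11)])]
print(rows[0]['Country'], rows[0]['work-life_balance'])  # testland 10
```

A list of such dicts can be passed straight to pd.DataFrame(rows) to get one row per country.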


That's it!

