
Web crawling using Selenium (OECD better life index)

Introduction:

In this article I will be using Selenium to extract data from http://www.oecdbetterlifeindex.org.
For each country I will be getting data on:
  • Population
  • Visitors per year
  • Renewable Energy
  • Indices for things like:
    • Housing
    • Income
    • Jobs
    • Community 
    • Education
    • Environment
    • Civic Engagement
    • Health
    • Life Satisfaction
    • Safety
    • Work-life Balance 

Importing Libraries:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

Loading the Chrome driver:
We need a ChromeDriver executable since I will be using Google Chrome to crawl our webpage (drivers for each web browser are available online and can be downloaded; recent Selenium versions can also locate the driver for you automatically).

browser = webdriver.Chrome()  # Selenium 4+ finds ChromeDriver automatically; on older versions, pass the path to your downloaded driver

Define our 1st function to:
1. Access our link, loading the page for the country we pass into the function.
2. Use XPath to look for 'td' tags which hold our desired values (you can check this by right-clicking the page and choosing Inspect to view its HTML code).
3. Return the text values in an array "info".
4. Keep only the values we are looking for, which are:
   • Population
   • Visitors per year
   • Renewable Energy %
5. Check if the array is empty and print a warning if it is. The array can come back empty when the page was not fully loaded at the time we accessed the website; if this happens, put a delay after browser.get(link) using time.sleep() to wait for the page to load.
6. Finally, use a for loop to fill our values into a dataframe, pairing the columns we will later define in an array called "info_columns" with the values inside our "info" array.
def info_getter(country):
    link = 'http://www.oecdbetterlifeindex.org/countries/{}/'.format(country)
    browser.get(link)
    info_list = browser.find_elements(By.XPATH, '//td')
    info = [cell.text for cell in info_list]

    if not info:
        print(country + ' not loaded properly... slow down wait time')
    else:
        print(country + ' loaded successfully')
        # keep only population, visitors per year and renewable energy %
        info = [country, info[0], info[2], info[4]]
        for title, value in zip(info_columns, info):
            df.loc[i, title] = value  # i comes from the enumerate loop further down
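As noted in step 5, the array can come back empty when the page has not finished loading. Rather than a single fixed time.sleep, one option is a small retry helper that re-reads the page a few times before giving up. This is just a sketch of that idea; the names fetch_with_retry, fetch, attempts and delay are my own, not from the article:

```python
import time

def fetch_with_retry(fetch, country, attempts=3, delay=4):
    """Call fetch(country) up to `attempts` times, sleeping `delay`
    seconds between tries, until it returns a non-empty list."""
    for _ in range(attempts):
        info = fetch(country)
        if info:
            return info
        time.sleep(delay)
    return []
```

Here fetch would be a function wrapping browser.get(link) and the find_elements call, so a country whose page loads slowly simply gets a few more chances instead of being skipped.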


Define our 2nd function to:
Do the same as our previous function, except this time we add time.sleep with 4 seconds. The particular elements accessed here take time to load, so without a delay for the page to render, the array is returned empty.

def topics_getter(country):
    link = 'http://www.oecdbetterlifeindex.org/countries/{}/'.format(country)
    browser.get(link)
    time.sleep(4)  # these elements take a few seconds to render
    info_list = browser.find_elements(By.XPATH, "//div[@class='value']")
    info = [value.text for value in info_list]
    info = info[0:11]  # one value per topic, in page order

    if not info:
        print(country + ' not loaded properly... slow down wait time')
    else:
        print(country + ' loaded successfully')

        for topic, value in zip(topics, info):
            df.loc[i, topic] = value  # i comes from the enumerate loop further down



Create a dataframe to hold the data and define our columns in 2 arrays:
df = pd.DataFrame()

info_columns = ['Country', 'population(mil)', 'visitorsPerYear(mil)', 
               'Renewable_energy(%)']

topics = ['housing', 'income', 'jobs', 'community', 'education',
         'environment', 'civic_engagement', 'health', 
         'life_satisfaction', 'safety', 'work-life_balance']
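The zip-based filling that both functions rely on can be illustrated on its own with pandas and a made-up row of values (the numbers below are dummies for illustration, not real OECD figures):

```python
import pandas as pd

df = pd.DataFrame()

info_columns = ['Country', 'population(mil)', 'visitorsPerYear(mil)',
                'Renewable_energy(%)']

# dummy values standing in for what the scraper would return
info = ['france', '67.0', '89.4', '19.1']

i = 0  # row index, as set by enumerate() in the loops below
for title, value in zip(info_columns, info):
    df.loc[i, title] = value
```

Each (column, value) pair from zip lands in row i, so every country the loop visits becomes one row of the dataframe.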


Define our country list:
country_list = ['australia', 'austria', 'belgium', 'brazil', 'canada',
               'chile', 'czech-republic', 'denmark', 'estonia',
               'finland', 'france', 'germany', 'greece', 'hungary',
               'iceland', 'ireland', 'israel', 'italy', 'japan',
               'korea', 'latvia', 'luxembourg', 'mexico',
               'netherlands', 'new-zealand', 'norway', 'poland',
               'portugal', 'russian-federation', 'slovak-republic',
               'slovenia', 'south-africa', 'spain', 'sweden', 'switzerland',
               'turkey', 'united-kingdom', 'united-states']



Run our first function looping over the country list:
for i, country in enumerate(country_list):
    info_getter(country)

Snippet of the result:


Run our 2nd function looping over the country list:
for i, country in enumerate(country_list):
    topics_getter(country)

Snippet of the result:




Check our DataFrame:
df.head()


That's it!

