
Web crawling using Selenium (OECD better life index)

Introduction:

In this article I will be using Selenium to extract data from http://www.oecdbetterlifeindex.org.
For each country I will be getting data on:
  • Population
  • Visitors per year
  • Renewable Energy
  • Indices for things like:
    • Housing
    • Income
    • Jobs
    • Community 
    • Education
    • Environment
    • Civic Engagement
    • Health
    • Life Satisfaction
    • Safety
    • Work-life Balance 

Importing Libraries:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import time

Loading the Chrome driver:
We need a ChromeDriver executable since I will be using Google Chrome to crawl our webpage (drivers for each web browser are available online and can be downloaded; recent Selenium versions can also locate the driver for you automatically).

browser = webdriver.Chrome()  # Selenium 4+ finds ChromeDriver automatically; on older versions, pass the path to your downloaded driver

Define our 1st function to:
1. Access our link, loading the page for the country we pass into the function.
2. Use XPath to look for 'td' tags which hold our desired values (you can check this by right-clicking the page and choosing Inspect to view its HTML code).
3. Return the text values in an array "info".
4. Keep only the values we are looking for, which are:
   • Population
   • Visitors per year
   • Renewable Energy %
5. Check if the array is empty and print a warning if it is. The array can come back empty when the page was not fully loaded at the time we accessed the website; if this happens, put a delay after browser.get(link) using time.sleep() to wait for the page to load.
6. Finally, use a for loop to fill our values into a dataframe, pairing the columns we will later define in an array called "info_columns" with the values inside our "info" array.
def info_getter(country):
    link = 'http://www.oecdbetterlifeindex.org/countries/{}/'.format(country)
    browser.get(link)
    info_list = browser.find_elements(By.XPATH, '//td')
    info = [cell.text for cell in info_list]

    if not info:
        print(country + ' not loaded properly... slow down wait time')
    else:
        print(country + ' loaded successfully')
        # keep only population, visitors per year and renewable energy %
        info = [country, info[0], info[2], info[4]]
        for title, value in zip(info_columns, info):
            df.loc[i, title] = value  # i comes from the enumerate loop further down
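As noted in step 5, the array can come back empty when the page has not finished loading. Rather than a single fixed time.sleep, one option is a small retry helper that re-reads the page a few times before giving up. This is just a sketch of that idea; the names fetch_with_retry, fetch, attempts and delay are my own, not from the article:

```python
import time

def fetch_with_retry(fetch, country, attempts=3, delay=4):
    """Call fetch(country) up to `attempts` times, sleeping `delay`
    seconds between tries, until it returns a non-empty list."""
    for _ in range(attempts):
        info = fetch(country)
        if info:
            return info
        time.sleep(delay)
    return []
```

Here fetch would be a function wrapping browser.get(link) and the find_elements call, so a country whose page loads slowly simply gets a few more chances instead of being skipped.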


Define our 2nd function to:
Do the same as our previous function, except this time we add time.sleep with 4 seconds. The particular elements accessed here take time to load, so without a delay for the page to render, the array is returned empty.

def topics_getter(country):
    link = 'http://www.oecdbetterlifeindex.org/countries/{}/'.format(country)
    browser.get(link)
    time.sleep(4)  # these elements take a few seconds to render
    info_list = browser.find_elements(By.XPATH, "//div[@class='value']")
    info = [value.text for value in info_list]
    info = info[0:11]  # one value per topic, in page order

    if not info:
        print(country + ' not loaded properly... slow down wait time')
    else:
        print(country + ' loaded successfully')

        for topic, value in zip(topics, info):
            df.loc[i, topic] = value  # i comes from the enumerate loop further down



Create a dataframe to hold the data and define our columns in 2 arrays:
df = pd.DataFrame()

info_columns = ['Country', 'population(mil)', 'visitorsPerYear(mil)', 
               'Renewable_energy(%)']

topics = ['housing', 'income', 'jobs', 'community', 'education',
         'environment', 'civic_engagement', 'health', 
         'life_satisfaction', 'safety', 'work-life_balance']
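The zip-based filling that both functions rely on can be illustrated on its own with pandas and a made-up row of values (the numbers below are dummies for illustration, not real OECD figures):

```python
import pandas as pd

df = pd.DataFrame()

info_columns = ['Country', 'population(mil)', 'visitorsPerYear(mil)',
                'Renewable_energy(%)']

# dummy values standing in for what the scraper would return
info = ['france', '67.0', '89.4', '19.1']

i = 0  # row index, as set by enumerate() in the loops below
for title, value in zip(info_columns, info):
    df.loc[i, title] = value
```

Each (column, value) pair from zip lands in row i, so every country the loop visits becomes one row of the dataframe.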


Define our country list:
country_list = ['australia', 'austria', 'belgium', 'brazil', 'canada',
               'chile', 'czech-republic', 'denmark', 'estonia',
               'finland', 'france', 'germany', 'greece', 'hungary',
               'iceland', 'ireland', 'israel', 'italy', 'japan',
               'korea', 'latvia', 'luxembourg', 'mexico',
               'netherlands', 'new-zealand', 'norway', 'poland',
               'portugal', 'russian-federation', 'slovak-republic',
               'slovenia', 'south-africa', 'spain', 'sweden', 'switzerland',
               'turkey', 'united-kingdom', 'united-states']



Run our first function looping over the country list:
for i, country in enumerate(country_list):
    info_getter(country)

Snippet of the result:


Run our 2nd function looping over the country list:
for i, country in enumerate(country_list):
    topics_getter(country)

Snippet of the result:




Check our DataFrame:
df.head()


That's it!

