Introduction:
In this article, I will be using Selenium to extract data from http://www.oecdbetterlifeindex.org.
I will be getting data for each country regarding:
- Population
- Visitors per year
- Renewable Energy
- Indices for things like:
  - Housing
  - Income
  - Jobs
  - Community
  - Education
  - Environment
  - Civic Engagement
  - Health
  - Life Satisfaction
  - Safety
  - Work-life Balance
Importing Libraries:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import pandas as pd
Loading chrome driver:
We need the ChromeDriver executable since I will be using Google Chrome to crawl the webpage (drivers for each web browser are available online and can be downloaded).
browser = webdriver.Chrome(executable_path='/chromedriver')
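Note that on Selenium 4 and later the executable_path argument has been removed. If you are on a newer version, a minimal equivalent sketch (assuming the same '/chromedriver' path as above) would be:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: wrap the driver path in a Service object
service = Service(executable_path='/chromedriver')
browser = webdriver.Chrome(service=service)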
Define our 1st function to:
1. Access our link and load it according to the country we pass into the function.
2. Use XPath to look for 'td' tags which hold our desired values (these can be checked by right-clicking the page and inspecting its HTML code).
3. Return the text values in an array "info".
4. Keep only the values we are looking for here, which are:
   - Population
   - Visitors per year
   - Renewable Energy %
5. Check if the array is empty and, if so, print that the page was not loaded properly, since the array can come back empty when the page had not fully loaded at the time we accessed the website (if this happens when we run the function, we put a delay after browser.get(link) using time.sleep() to wait for the page to load).
6. Finally, create a for loop to fill our values into a dataframe, using the columns we will later set in an array called "info_columns" and the values inside our "info" array.
def info_getter(country):
    link = 'http://www.oecdbetterlifeindex.org/countries/{}/'.format(country)
    browser.get(link)
    # grab every <td> on the country page and keep its text
    info_list = browser.find_elements_by_xpath('//td')
    info = [cell.text for cell in info_list]
    if not info:
        print(country + ' not loaded properly... slow down wait time')
    else:
        print(country + ' loaded successfully')
        # keep only population, visitors per year and renewable energy %
        info = [country, info[0], info[2], info[4]]
        # i comes from the enumerate() loop we run further down
        for title, value in zip(info_columns, info):
            df.loc[i, title] = value
Define our 2nd function to:
This does the same as our previous function, except that this time we add time.sleep with 4 seconds: the particular elements I am accessing here take time to load, so without a delay for the page to load the array comes back empty.
def topics_getter(country):
    link = 'http://www.oecdbetterlifeindex.org/countries/{}/'.format(country)
    browser.get(link)
    time.sleep(4)  # give the index values time to render
    info_list = browser.find_elements_by_xpath("//div[@class='value']")
    info = [value.text for value in info_list]
    info = info[0:11]  # the first 11 values are the topic indices
    if not info:
        print(country + ' not loaded properly... slow down wait time')
    else:
        print(country + ' loaded successfully')
        for topic, value in zip(topics, info):
            df.loc[i, topic] = value
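As a side note, instead of a fixed time.sleep(4) you could use an explicit wait that polls until the value divs appear. A minimal sketch (using the same XPath, and assuming a 10-second timeout) that could replace the sleep inside topics_getter:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the index value divs to be present before reading them
WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//div[@class='value']"))
)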
Create a dataframe to hold the data & define our columns into 2 arrays:
df = pd.DataFrame()
info_columns = ['Country', 'population(mil)', 'visitorsPerYear(mil)', 'Renewable_energy(%)']
topics = ['housing', 'income', 'jobs', 'community', 'education', 'environment', 'civic_engagement',
          'health', 'life_satisfaction', 'safety', 'work-life_balance']
Define our country list:
country_list = ['australia', 'austria', 'belgium', 'brazil', 'canada', 'chile', 'czech-republic',
                'denmark', 'estonia', 'finland', 'france', 'germany', 'greece', 'hungary',
                'iceland', 'ireland', 'israel', 'italy', 'japan', 'korea', 'latvia', 'luxembourg',
                'mexico', 'netherlands', 'new-zealand', 'norway', 'poland', 'portugal',
                'russian-federation', 'slovak-republic', 'slovenia', 'south-africa', 'spain',
                'sweden', 'switzerland', 'turkey', 'united-kingdom', 'united-states']
Run our first function looping over the country list:
for i, country in enumerate(country_list):
    info_getter(country)
Snippet of the result:
Run our 2nd function looping over the country list:
for i, country in enumerate(country_list):
    topics_getter(country)
Snippet of the result:
Check our DataFrame:
df.head()
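Once the data looks right, it is good practice to close the driver; optionally, the results can be saved to disk (the filename below is just an example):

browser.quit()  # close Chrome and end the WebDriver session
df.to_csv('better_life_index.csv', index=False)  # optional: persist the scraped data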