
Web Scraping Using Python (Part 2)

In this article I will outline how to automate web scraping efficiently, so that you can collect whatever data you need and store it in a structured format for analysis.
This is the 2nd part of the series. In part 1 I went through just the basics of web scraping; if you'd like a quick overview, you can check part 1 at this link (https://oaref.blogspot.com/2019/01/web-scraping-using-python-part-1.html).
Alright, with that said, let's get to the exciting stuff!

Scraping all the data from One Product:

Note that I will continue with the ebay example we did in part 1.

Just like before, I will start by importing our libraries.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import re

I have decided to scrape data for cell phones from ebay, starting from this link (https://www.ebay.com/b/Cell-Phones-Smartphones/9355/bn_320094?rt=nc&_pgn=1)
 
First, just like we did in part 1, we will:
  1.  Perform a get request for the link we're interested in, after visually inspecting the web page.
  2.  Parse it with Beautiful Soup.
  3.  Make things more concise by getting only the portion we want into the items variable.

source = requests.get('https://www.ebay.com/b/Cell-Phones-Smartphones/9355/bn_320094?rt=nc&_pgn=1').text    
soup = BeautifulSoup(source, 'lxml')
items = soup.find('li', class_='s-item')



Here I am interested in 13 attributes, and these are the ones I will be getting for all products:
  1. The title of the product which is the first thing written.
  2. The description of the product (written under the title).
  3. The Brand of the product.
  4. The model.
  5. Any miscellaneous features (for some products we have style, color, connectivity... etc.).
  6. The origin of the product.
  7. Its price.
  8. Shipping information.
  9. Whether it comes from a top seller or not.
  10. Its rating (how many stars?).
  11. Number of reviewers who gave a rating.
  12. Qty sold (written for some products in red below the shipping information)
  13. Finally I will also retrieve the link to the product in case I want to get back to it later.

I need to highlight a few things to keep in mind here:
  • Some products have missing attributes (for example, a product might simply not have a model, or maybe the country of origin is not stated). We will deal with that by assigning the string "None" to any attribute that the script does not find as it goes through the web page.
  • For some products the Qty sold element contains "Watching" instead, which tells you how many people are watching the product. Since I am only interested in the Qty sold, we will have to work around that and return the value only when it really is the quantity sold and not the number of watchers.

Alright let's do this for each attribute one by one.

Title:

try:  
    item_title = items.find('h3', class_='s-item__title').text
except Exception as e:
    item_title = 'None'

print(item_title)

Here I simply used the find method just like we did in part 1, specifying 'h3' as the tag and 's-item__title' as the class, with .text at the end to return only the text we need.
The only difference this time is that I used try & except to store the string "None" in the variable if an error is raised, which comes in handy when the item does not have that attribute (a title in this case). When find does not locate a matching element it returns Python's None, and asking None for .text raises an AttributeError, which the except block catches.

Printing the result gives exactly what we want: the title of the first product on the webpage.


New *UNOPENDED* Apple iPhone SE - 16/64GB 4.0" Unlocked Smartphone
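
Since the same try & except pattern repeats for every attribute, it can be wrapped in a small helper. The sketch below is only an illustration: text_or_none is a hypothetical name that is not used elsewhere in this article, the HTML is a made-up miniature of a listing rather than real ebay markup, and I use Python's built-in 'html.parser' so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Made-up, simplified listing markup (assumption: the real page is far larger)
html = """
<li class="s-item">
  <h3 class="s-item__title">Sample phone</h3>
</li>
"""

item = BeautifulSoup(html, "html.parser").find("li", class_="s-item")


def text_or_none(parent, tag, css_class):
    """Return the text of the first matching child, or the string 'None'."""
    try:
        return parent.find(tag, class_=css_class).text
    except AttributeError:
        # find() returned None because nothing matched, so there is no .text
        return "None"


title = text_or_none(item, "h3", "s-item__title")        # element exists
subtitle = text_or_none(item, "div", "s-item__subtitle")  # element is missing

print(title)
print(subtitle)
```

Catching AttributeError instead of the broader Exception is a design choice in this sketch: it handles the "element not found" case while still letting unrelated errors surface.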

Description:

try:  
    item_desc = items.find('div', class_='s-item__subtitle').text
except Exception as e:
    item_desc = 'None'

print(item_desc)

Exactly like the title, I used .find with the relevant tag 'div' and the relevant class 's-item__subtitle', with .text at the end.

Again printing the result gives us the description we want.

NO-RUSH 14 DAYS SHIPPING ONLY!  US LOCATION!

Brand:

try:
    item_brand = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes1').text
except Exception as e:
    item_brand = 'None'

print(item_brand)

OK, perfect. Everything is the same as before.
Let's print the result:

Brand: Apple

Hmm... it looks OK, but we do not want "Brand:" written before the actual brand for every product. That would look a bit messy if we put this in an Excel sheet later.

Let's try again with a minor modification at the end of the 2nd line of code:

try:
    item_brand = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes1').text.split(' ')[1]
except Exception as e:
    item_brand = 'None'

print(item_brand)

Let's print:

Apple

Great, we got just the brand. What I did here is very simple: I added ".split(' ')" at the end, which splits the text it is called on wherever the separator given in its parentheses occurs. Here I specified a space, so the text is split on the spaces between words. The result is the list ["Brand:", "Apple"]. Next I simply added [1] to specify that I want the 2nd element of the list returned, since I am not interested in the "Brand:" part.
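
To see the split step in isolation, here is a tiny sketch that uses a hard-coded string in place of the scraped text (raw_brand is just an illustrative variable, not something returned by ebay):

```python
# Stand-in for the text returned by .text on the brand element
raw_brand = "Brand: Apple"

parts = raw_brand.split(" ")  # split on single spaces
brand = parts[1]              # second element, i.e. index 1

print(parts)  # ['Brand:', 'Apple']
print(brand)  # Apple
```

Keep in mind that [1] returns only one word, so a brand made of two words would be cut short with this approach.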

Model:


try:
    item_model = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes2').text.split(' ')[1:]
    item_model = ' '.join(item_model)
except Exception as e:
    item_model = 'None'
    
print(item_model)

For the model I did exactly what we did before, with another minor modification: [1:] at the end of the first line. This is because I want to return everything after "Model:", since the model will very likely be more than 1 word. This way I am telling python I want every element from index 1 onwards in the list.
In the 3rd line we used .join, which is the exact opposite of .split. It joins all the elements of the list, placing whatever string I specify before .join between them. Here I specified a space, so all the words in the list are returned with spaces between them.
Let's print the result:

Apple iPhone SE
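
The slicing and joining steps can be tried on their own with a hard-coded string. This is only a sketch: raw_model stands in for the scraped text, and the partition line at the end is an alternative I am adding for comparison, not something used in the rest of this article.

```python
# Stand-in for the text returned by .text on the model element
raw_model = "Model: Apple iPhone SE"

words = raw_model.split(" ")   # ['Model:', 'Apple', 'iPhone', 'SE']
model_words = words[1:]        # every element from index 1 onwards
model = " ".join(model_words)  # glue the words back together with spaces

print(model)  # Apple iPhone SE

# Alternative: keep everything after the first ': ' in a single step
model_alt = raw_model.partition(": ")[2]
print(model_alt)  # Apple iPhone SE
```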


Features:


try:
    item_features = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes3').text.split(' ')[1]
except Exception as e:
    item_features = 'None'

print(item_features) 

Same as before, nothing new here.

Result:
Bar


Origin:


try:  
    item_origin = items.find('span', class_='s-item__location s-item__itemLocation').text
    item_origin = re.sub('From ', '', item_origin)
except Exception as e:
    item_origin = 'None'
    
print(item_origin)


Here we are in the same situation as we were with "Model". We could handle the text like we did before, but I thought I would show you a different method using the "re" module, which is Python's library for regular expressions (you can check it out). For re.sub, you first give it the pattern to look for (here I put 'From ', including the trailing space), then whatever you want to replace it with (here I just put '', which means replace it with nothing). Finally, you specify the variable which holds your text.

Result:
None

This is exactly what we expect, since this first item indeed does not have any origin specified.
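
Because the first item has no origin, the re.sub call never gets to do anything visible here. The sketch below shows what it does on made-up strings of the same shape ("From China" and the others are my own examples, not values taken from the page):

```python
import re

# Made-up example of the text an origin element might contain
raw_origin = "From China"

# Replace every occurrence of 'From ' with nothing
origin = re.sub("From ", "", raw_origin)
print(origin)  # China

# Variation: '^' anchors the pattern to the start of the string,
# so a 'From ' appearing later in the text would be left alone
anchored = re.sub(r"^From ", "", "From Hong Kong")
print(anchored)  # Hong Kong

untouched = re.sub(r"^From ", "", "Shipped From Spain")
print(untouched)  # Shipped From Spain
```

For a fixed piece of text like this, raw_origin.replace('From ', '') gives the same result without regular expressions; re.sub becomes more useful once the pattern needs to be flexible.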


Price:


try:
    item_price = items.find('span', class_='s-item__price').text
except Exception as e:
    item_price = 'None'
    
print(item_price)

Result:
$187.99


Shipping:


try:
    item_shipping = items.find('span', class_='s-item__shipping s-item__logisticsCost').text
except Exception as e:
    item_shipping = 'None'
    
print(item_shipping)

Result:
$19.99 shipping


Top Seller:


try:
    item_top_seller = items.find('span', class_='s-item__etrs-text').text
except Exception as e:
    item_top_seller = 'None'

print(item_top_seller)  

Result:
None

Indeed it is not from a top seller.

Rating:


try:
    item_stars = items.find('span', class_='clipped').text.split(' ')[0]
except Exception as e:
    item_stars = 'None'
    
print(item_stars)

Result:
None

The product has no rating.

Number of reviews:


try:
    item_nreviews = items.find('a', class_='s-item__reviews-count').text.split(' ')[0]
except Exception as e:
    item_nreviews = 'None' 
    
print(item_nreviews)

Result:
None

There are no reviews.


Qty Sold:


try:
    item_qty_sold = items.find('span', class_='s-item__hotness s-item__itemHotness').text.split(' ')
    if item_qty_sold[1] == 'sold':
        item_qty_sold = item_qty_sold[0]
    else:
        item_qty_sold = 0
except Exception as e:
    item_qty_sold = 'None' 
print(item_qty_sold) 


OK, here is the 2nd issue we highlighted previously. This element on the webpage sometimes shows the Qty sold and sometimes shows how many people are watching. Since the pattern I normally see is "some number + sold", I added an if statement to check whether the 2nd element of the returned list is equal to "sold". If it is, I return the first element, which is just the number.
Otherwise I return zero. If the element is missing altogether, the except block sets the value to "None" as before.

Result:
0

Here it works as expected as we do not have a Qty Sold for this item.
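
The sold-versus-watching check can be tested without the web page by feeding it sample strings. This is a sketch under assumptions: parse_qty_sold is a hypothetical helper name, and the two sample strings are invented to match the "number + sold" and "number + watching" patterns described above.

```python
def parse_qty_sold(hotness_text):
    """Return the number as a string if the text reads '<number> sold', else 0."""
    parts = hotness_text.split(" ")
    # Check the length first so a one-word text cannot raise an IndexError
    if len(parts) > 1 and parts[1] == "sold":
        return parts[0]
    return 0


print(parse_qty_sold("120 sold"))     # 120
print(parse_qty_sold("35 watching"))  # 0
```

Note that, as in the code above, the quantity comes back as a string while the fallback is the integer 0, so the column will hold mixed types until it is cleaned.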

Item Link:


try:
    item_link = items.find('a', class_='s-item__link')['href']
except Exception as e:
    item_link = 'None'
    
print(item_link)

Getting links is something we did not address before, but it is nothing too complicated.
We follow the same sequence as always, but this time instead of using .text at the end we add ['href']. By right-clicking the title of the item and inspecting the HTML code, we see that right next to the class we have href = our link.



And the result is indeed the link.
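
Here is a small sketch of reading an attribute from a tag, using a made-up anchor element (the example.com URL is a placeholder, not a real ebay link) and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

# Made-up link element for illustration
html = '<a class="s-item__link" href="https://example.com/itm/123">Sample phone</a>'

link_tag = BeautifulSoup(html, "html.parser").find("a", class_="s-item__link")

# Square brackets read an attribute of the tag, much like a dictionary key
link = link_tag["href"]
print(link)  # https://example.com/itm/123

# .get() returns None instead of raising KeyError when the attribute is absent
missing = link_tag.get("title")
print(missing)  # None
```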




Scraping all the data for all products:

OK, now what if we want this data returned for all the products within the page? How would we do that?
Very simple: we make a very minor modification to our original 3 lines of code below:

source = requests.get('https://www.ebay.com/b/Cell-Phones-Smartphones/9355/bn_320094?rt=nc&_pgn=1').text    
soup = BeautifulSoup(source, 'lxml')
items = soup.find('li', class_='s-item')


Assigning soup.find('li', class_='s-item') to the items variable only returns the first element with the 'li' tag and class = 's-item'. Instead, we want to ask python to look for all the products within the full page's parsed HTML code held in the soup variable.
We do this by using find_all in a for loop, which repeats everything we did above for each element that matches the tag and class.
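
The difference between find and find_all is easy to see on a tiny page. The HTML below is invented for illustration (three fake listings), and I use 'html.parser' so the sketch does not depend on lxml:

```python
from bs4 import BeautifulSoup

# Invented miniature of a results page with three listings
html = """
<ul>
  <li class="s-item"><h3 class="s-item__title">Phone A</h3></li>
  <li class="s-item"><h3 class="s-item__title">Phone B</h3></li>
  <li class="s-item"><h3 class="s-item__title">Phone C</h3></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns only the first matching element
first = soup.find("li", class_="s-item")
print(first.find("h3", class_="s-item__title").text)  # Phone A

# find_all returns every matching element, which we can loop over
all_items = soup.find_all("li", class_="s-item")
titles = [li.find("h3", class_="s-item__title").text for li in all_items]
print(titles)  # ['Phone A', 'Phone B', 'Phone C']
```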

The full code will be as follows:


for items in soup.find_all('li', class_='s-item'):
    
    try:  
        item_title = items.find('h3', class_='s-item__title').text
    except Exception as e:
        item_title = 'None'

    print(item_title)

    try:  
        item_desc = items.find('div', class_='s-item__subtitle').text
    except Exception as e:
        item_desc = 'None'

    print(item_desc)

    try:
        item_brand = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes1').text.split(' ')[1]
    except Exception as e:
        item_brand = 'None'

    print(item_brand)

    try:
        item_model = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes2').text.split(' ')[1:]
        item_model = ' '.join(item_model)
    except Exception as e:
        item_model = 'None'

    print(item_model)

    try:
        item_features = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes3').text.split(' ')[1]
    except Exception as e:
        item_features = 'None'

    print(item_features)    

    try:  
        item_origin = items.find('span', class_='s-item__location s-item__itemLocation').text
        item_origin = re.sub('From ', '', item_origin)
    except Exception as e:
        item_origin = 'None'

    print(item_origin)

    try:
        item_price = items.find('span', class_='s-item__price').text
    except Exception as e:
        item_price = 'None'

    print(item_price)

    try:
        item_shipping = items.find('span', class_='s-item__shipping s-item__logisticsCost').text
    except Exception as e:
        item_shipping = 'None'

    print(item_shipping)

    try:
        item_top_seller = items.find('span', class_='s-item__etrs-text').text
    except Exception as e:
        item_top_seller = 'None'

    print(item_top_seller)    

    try:
        item_stars = items.find('span', class_='clipped').text.split(' ')[0]
    except Exception as e:
        item_stars = 'None'

    print(item_stars)

    try:
        item_nreviews = items.find('a', class_='s-item__reviews-count').text.split(' ')[0]
    except Exception as e:
        item_nreviews = 'None' 

    print(item_nreviews)

    try:
        item_qty_sold = items.find('span', class_='s-item__hotness s-item__itemHotness').text.split(' ')
        if item_qty_sold[1] == 'sold':
            item_qty_sold = item_qty_sold[0]
        else:
            item_qty_sold = 0
    except Exception as e:
        item_qty_sold = 'None'

    print(item_qty_sold)

    try:
        item_link = items.find('a', class_='s-item__link')['href']
    except Exception as e:
        item_link = 'None'

    print(item_link)
    print()

I will not show the result here, as it prints all the data we previously returned, but now for all the products on our page.


Putting our data in a structured format:

Now we only have 1 step left. We want to put all this data in a structured format.
We do this using a pandas dataframe to hold all this data.

First I start by creating the dataframe, assigning the columns we'll need and putting all this into a variable called df.

df = pd.DataFrame(columns = ['Title', 'description',
                             'Brand', 'Model', 'Features', 'Origin', 
                             'Price', 'Shipping',
                             'Top Seller','Stars', 'No. Of Reviews',
                             'Qty Sold',  'Link'])

Next we can simply use .loc on the dataframe to put our values into it every time we loop through a product.
With .loc we give the index label of the row (ours start from zero) and the column name.
So for example df.loc[0, 'Title'] = 'My product' will put this value into the zeroth row, which is our first row, under the Title column.

To do this efficiently, I define a counter variable n = 0 once, before the for loop (not inside it, otherwise it would be reset to zero on every pass).
Then I add the block of code below at the end of the loop body, indented to the same level as the rest of the loop, finishing with n += 1 so that each product goes into the next row.

    df.loc[n, 'Title'] = item_title
    df.loc[n, 'description'] = item_desc
    df.loc[n, 'Brand'] = item_brand
    df.loc[n, 'Model'] = item_model
    df.loc[n, 'Features'] = item_features
    df.loc[n, 'Origin'] = item_origin
    df.loc[n, 'Price'] = item_price
    df.loc[n, 'Shipping'] = item_shipping
    df.loc[n, 'Top Seller'] = item_top_seller
    df.loc[n, 'Stars'] = item_stars
    df.loc[n, 'No. Of Reviews'] = item_nreviews
    df.loc[n, 'Qty Sold'] = item_qty_sold
    df.loc[n, 'Link'] = item_link    
    
    n+=1
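
Since the placement of n is easy to get wrong, here is a complete miniature of the pattern. It is only a sketch: the scraped list holds made-up values standing in for the variables set inside the real loop, and only two columns are used to keep it short.

```python
import pandas as pd

# Made-up values standing in for what the scraping loop would produce
scraped = [("Phone A", "$100.00"), ("Phone B", "$150.00")]

df = pd.DataFrame(columns=["Title", "Price"])

n = 0  # defined once, BEFORE the loop starts
for item_title, item_price in scraped:
    # ... the try/except blocks from above would go here ...

    df.loc[n, "Title"] = item_title
    df.loc[n, "Price"] = item_price

    n += 1  # still INSIDE the loop, after the row has been filled

print(df)
```

In the real script the for line is for items in soup.find_all('li', class_='s-item'):, and the df.loc block sits at the bottom of that loop with the same indentation as the try blocks.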

Finally we can check if this worked by doing a quick df.head() which returns the first five rows of the dataframe.


df.head()

Result:


Perfect, we now have all this data in a structured format. One more step is to save it to an Excel file using the df.to_excel method:


df.to_excel('ebay_phones.xlsx')

This will save an Excel sheet with the data in your working directory. Note that pandas needs an Excel writer package such as openpyxl installed for this to work.


I hope you find this useful. In the next part I will discuss how to get this data from multiple pages and do some exploratory analysis.

To be continued!
 

Comments

  1. Thank you so much for making such a simple to understand and informative documentation on web scraping which was very much handful as a beginner.

  2. Fantastic! Your article is down to earth and easy to follow. I'm very thankful for such a wonderful blog.

  3. Thanks for the 2 part post its amazing! I wonder if you can share the .py file as I am struggling with the last part with the n=1 function and the tables. Thanks for the great effort!!!!

  4. Great article, but as similar to another comment I am struggling to get the n=1 function to work. I have tried placing n=0 at the start of the loop but to no avail. Any chance of putting the full code up or a .py file? Thanks
