In this article I will outline how to efficiently automate web scraping to return whatever data you may desire and then store it in a structured format for analysis.
This is the 2nd part of the series; in part 1 I went through the basics of web scraping. If you'd like a quick overview, you can check part 1 at this link (https://oaref.blogspot.com/2019/01/web-scraping-using-python-part-1.html).
Alright, with that said, let's get to the exciting stuff!
Scraping all the data from one product:
Note that I will continue with the eBay example we did in part 1. Just like before, I will start by importing our libraries.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import re
I have decided to scrape data for cell phones from eBay, starting from this link: https://www.ebay.com/b/Cell-Phones-Smartphones/9355/bn_320094?rt=nc&_pgn=1
First, just like we did in part 1, we will:
- Perform a GET request for the link we're interested in after visually inspecting the web page.
- Parse it with BeautifulSoup.
- Make things more concise by getting only the portion we want into the items variable.
source = requests.get('https://www.ebay.com/b/Cell-Phones-Smartphones/9355/bn_320094?rt=nc&_pgn=1').text
soup = BeautifulSoup(source, 'lxml')
items = soup.find('li', class_='s-item')
Here I am interested in 13 attributes, and those are the ones I will be getting for all products:
- The title of the product, which is the first thing written.
- The description of the product (written under the title).
- The brand of the product.
- The model.
- Any miscellaneous features (for some products we have style, color, connectivity, etc.).
- The origin of the product.
- Its price.
- Shipping information.
- Whether it comes from a top seller or not.
- Its rating (how many stars?).
- Number of reviewers who gave a rating.
- Qty sold (written for some products in red below the shipping information).
- Finally, I will also retrieve the link to the product in case I want to get back to it later.
I need to highlight a few things to keep in mind here:
- Some products have missing attributes (for example, a product might simply not have a model, or the country of origin might not be stated). The way we will deal with that is to return "None" for any attribute the script does not find as it goes along the web page.
- The Qty sold element for some products sometimes contains "Watching" instead, which tells you how many people are watching the product. Since I am interested only in the Qty sold whenever it is available, we will have to work around that and find a way to return the value only if it is actually the Qty sold and not how many people are watching the product.
Alright let's do this for each attribute one by one.
Title:
try:
    item_title = items.find('h3', class_='s-item__title').text
except Exception as e:
    item_title = 'None'

print(item_title)
Here I simply used the find method just like we did in part 1, specifying 'h3' as the tag and 's-item__title' as the class, with .text at the end to return only the text we need.
The only difference this time is that I used try & except to ask Python to put "None" into the variable if an error is raised, which will come in handy if an item does not have a given attribute (a title, in this case).
Printing the result gives exactly what we want: the title of the first product on the webpage.
New *UNOPENDED* Apple iPhone SE - 16/64GB 4.0" Unlocked Smartphone
Description:
try:
    item_desc = items.find('div', class_='s-item__subtitle').text
except Exception as e:
    item_desc = 'None'

print(item_desc)
Exactly the same as the title: I used .find with the relevant tag 'div' and the relevant class 's-item__subtitle', with .text at the end.
Again printing the result gives us the description we want.
NO-RUSH 14 DAYS SHIPPING ONLY! US LOCATION!
Brand:
try:
    item_brand = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes1').text
except Exception as e:
    item_brand = 'None'

print(item_brand)
OK perfect, everything is the same as before.
Let's print the result:
Brand: Apple
Hmm... looks OK, but we do not want "Brand:" written before the actual brand for every product. This will look a bit messy if we want to put this in an Excel sheet later.
Let's try again with a minor modification at the end of the 2nd line of code:
try:
    item_brand = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes1').text.split(' ')[1]
except Exception as e:
    item_brand = 'None'

print(item_brand)
Let's print:
Apple
Great, we got just the brand. What I did here is very simple: I added .split(' ') at the end, which splits any text based on whatever we specify between its brackets. Here I specified splits on the spaces between words, so the result is the list ["Brand:", "Apple"]. Then I added [1] to specify that I want the 2nd element of the list returned, since I am not interested in the "Brand:" part.
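If you want to see the split behaviour on its own, here is a quick standalone snippet (the sample string is made up just for demonstration):

text = 'Brand: Apple'
parts = text.split(' ')   # split on spaces -> ['Brand:', 'Apple']
print(parts[1])           # Apple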
Model:
try:
    item_model = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes2').text.split(' ')[1:]
    item_model = ' '.join(item_model)
except Exception as e:
    item_model = 'None'

print(item_model)
For the model I did exactly what we did before, with another minor modification: [1:] at the end of the first line. This is because I want to return everything after "Model:", since the model will quite likely be more than one word. This way I am telling Python I want everything from index 1 of the list onward.
In the 3rd line we used .join, which is the exact opposite of .split. It joins all elements of the returned list using whatever I specify before .join; here I specified a space, to return all the words in the list with spaces in between them.
Let's print the result:
Apple iPhone SE
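If the [1:] slice and the .join combination feel abstract, here is a tiny self-contained example (the sample string is made up):

text = 'Model: Apple iPhone SE'
words = text.split(' ')[1:]    # everything from index 1 onward: ['Apple', 'iPhone', 'SE']
model = ' '.join(words)        # glue the words back together with spaces
print(model)                   # Apple iPhone SE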
Features:
try:
    item_features = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes3').text.split(' ')[1]
except Exception as e:
    item_features = 'None'

print(item_features)
Same as before, nothing new here.
Result:
Bar
Origin:
try:
    item_origin = items.find('span', class_='s-item__location s-item__itemLocation').text
    item_origin = re.sub('From ', '', item_origin)
except Exception as e:
    item_origin = 'None'

print(item_origin)
Here we are in the same situation as with "Model". We could handle the text like we did before, but I thought I would show you a different method using the "re" module, which is great for regular expressions (you can check it out). For re.sub, you give it a sequence of characters to look for (here I put "From "), then whatever you want to replace that sequence with (here I put '', which means replace it with nothing), and finally the variable which holds your text.
Result:
None
Which is exactly what we expect since this first item indeed does not have any origin specified.
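To see re.sub on its own, here is a minimal example with a made-up location string:

import re

location = 'From China'
print(re.sub('From ', '', location))   # China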
Price:
try:
    item_price = items.find('span', class_='s-item__price').text
except Exception as e:
    item_price = 'None'

print(item_price)
Result:
$187.99
Shipping:
try:
    item_shipping = items.find('span', class_='s-item__shipping s-item__logisticsCost').text
except Exception as e:
    item_shipping = 'None'

print(item_shipping)
Result:
$19.99 shipping
Top Seller:
try:
    item_top_seller = items.find('span', class_='s-item__etrs-text').text
except Exception as e:
    item_top_seller = 'None'

print(item_top_seller)
Result:
None
Indeed it is not from a top seller.
Rating:
try:
    item_stars = items.find('span', class_='clipped').text.split(' ')[0]
except Exception as e:
    item_stars = 'None'

print(item_stars)
Result:
None
The product has no rating.
Number of reviews:
try:
    item_nreviews = items.find('a', class_='s-item__reviews-count').text.split(' ')[0]
except Exception as e:
    item_nreviews = 'None'

print(item_nreviews)
Result:
None
There are no reviews.
Qty Sold:
try:
    item_qty_sold = items.find('span', class_='s-item__hotness s-item__itemHotness').text.split(' ')
    if item_qty_sold[1] == 'sold':
        item_qty_sold = item_qty_sold[0]
    else:
        item_qty_sold = 0
except Exception as e:
    item_qty_sold = 'None'

print(item_qty_sold)
OK, here is the 2nd issue we highlighted previously. This element on the webpage sometimes denotes the Qty sold and sometimes how many people are watching. Since the pattern normally goes as "some number + sold", I added an if statement to check whether the 2nd element of the returned list equals "sold". If it does, I return the first element, which is just the number.
Else I return it as zero.
Result:
0
Here it works as expected, as we do not have a Qty sold for this item.
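To see why the if check matters, here is the same pattern applied to two made-up strings, one for each case we might encounter:

for text in ['128 sold', '43 watching']:
    parts = text.split(' ')
    qty = parts[0] if parts[1] == 'sold' else 0
    print(text, '->', qty)   # 128 sold -> 128, then 43 watching -> 0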
Item Link:
try:
    item_link = items.find('a', class_='s-item__link')['href']
except Exception as e:
    item_link = 'None'

print(item_link)
Getting links is something we did not address before, but it is nothing too complicated.
We follow the same sequence as always, but this time instead of using .text at the end we add ['href']. By right-clicking the title of the item and inspecting the HTML code, we see that right next to the class we have href = our link.
And the result is indeed the link.
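As a small aside, BeautifulSoup tags also offer a .get method, which returns None instead of raising an error when the attribute is missing. An alternative sketch of the same lookup would be the line below (note .find can still return None if the tag itself is absent, so the try & except stays useful):

item_link = items.find('a', class_='s-item__link').get('href')   # None if href is missing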
Scraping all the data for all products:
OK, now what if we want this data returned for all the products within the page? How would we do that? Very simple: we make a very minor modification to our original 3 lines of code below:
source = requests.get('https://www.ebay.com/b/Cell-Phones-Smartphones/9355/bn_320094?rt=nc&_pgn=1').text
soup = BeautifulSoup(source, 'lxml')
items = soup.find('li', class_='s-item')
Instead of assigning soup.find('li', class_='s-item') to the variable items, which only returns the first element with the 'li' tag and class 's-item', we want to ask Python to look for all the products within the full page's parsed HTML code stored in the soup variable.
We do this by using soup.find_all with a for loop, repeating everything we did above for each element that matches the tag and class.
The full code will be as follows:
for items in soup.find_all('li', class_='s-item'):

    try:
        item_title = items.find('h3', class_='s-item__title').text
    except Exception as e:
        item_title = 'None'
    print(item_title)

    try:
        item_desc = items.find('div', class_='s-item__subtitle').text
    except Exception as e:
        item_desc = 'None'
    print(item_desc)

    try:
        item_brand = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes1').text.split(' ')[1]
    except Exception as e:
        item_brand = 'None'
    print(item_brand)

    try:
        item_model = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes2').text.split(' ')[1:]
        item_model = ' '.join(item_model)
    except Exception as e:
        item_model = 'None'
    print(item_model)

    try:
        item_features = items.find('span', class_='s-item__dynamic s-item__dynamicAttributes3').text.split(' ')[1]
    except Exception as e:
        item_features = 'None'
    print(item_features)

    try:
        item_origin = items.find('span', class_='s-item__location s-item__itemLocation').text
        item_origin = re.sub('From ', '', item_origin)
    except Exception as e:
        item_origin = 'None'
    print(item_origin)

    try:
        item_price = items.find('span', class_='s-item__price').text
    except Exception as e:
        item_price = 'None'
    print(item_price)

    try:
        item_shipping = items.find('span', class_='s-item__shipping s-item__logisticsCost').text
    except Exception as e:
        item_shipping = 'None'
    print(item_shipping)

    try:
        item_top_seller = items.find('span', class_='s-item__etrs-text').text
    except Exception as e:
        item_top_seller = 'None'
    print(item_top_seller)

    try:
        item_stars = items.find('span', class_='clipped').text.split(' ')[0]
    except Exception as e:
        item_stars = 'None'
    print(item_stars)

    try:
        item_nreviews = items.find('a', class_='s-item__reviews-count').text.split(' ')[0]
    except Exception as e:
        item_nreviews = 'None'
    print(item_nreviews)

    try:
        item_qty_sold = items.find('span', class_='s-item__hotness s-item__itemHotness').text.split(' ')
        if item_qty_sold[1] == 'sold':
            item_qty_sold = item_qty_sold[0]
        else:
            item_qty_sold = 0
    except Exception as e:
        item_qty_sold = 'None'
    print(item_qty_sold)

    try:
        item_link = items.find('a', class_='s-item__link')['href']
    except Exception as e:
        item_link = 'None'
    print(item_link)

    print()
I will not show you the result here, as it would print all the data we previously returned, but now for all the products within our page.
Putting our data in a structured format:
Now we only have one step left: we want to put all this data in a structured format. We do this using a pandas DataFrame to hold it all.
First I start by creating the dataframe, assigning the columns we'll need and putting all this into a variable called df.
df = pd.DataFrame(columns=['Title', 'description', 'Brand', 'Model', 'Features', 'Origin',
                           'Price', 'Shipping', 'Top Seller', 'Stars', 'No. Of Reviews',
                           'Qty Sold', 'Link'])
Next we can simply use the .loc method of pandas dataframes to put our values into the dataframe every time we loop through a product.
The .loc method can take the index of the row (which starts from zero) and the column name.
So for example df.loc[0, 'Title'] = 'My product' will put this value into the zeroth row, which is our first row, under the Title column.
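As a quick self-contained illustration of that assignment (the df_demo name and the value are made up for the demo):

import pandas as pd

df_demo = pd.DataFrame(columns=['Title'])
df_demo.loc[0, 'Title'] = 'My product'   # writes into row 0 under the Title column
print(df_demo)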
To do this efficiently, I assign a variable n = 0 at the very beginning, before our loop block of code, to act as a counter within our loop starting from zero.
After that I add this block of code at the end of the loop, finishing by adding 1 to n each time we go through the loop.
    df.loc[n, 'Title'] = item_title
    df.loc[n, 'description'] = item_desc
    df.loc[n, 'Brand'] = item_brand
    df.loc[n, 'Model'] = item_model
    df.loc[n, 'Features'] = item_features
    df.loc[n, 'Origin'] = item_origin
    df.loc[n, 'Price'] = item_price
    df.loc[n, 'Shipping'] = item_shipping
    df.loc[n, 'Top Seller'] = item_top_seller
    df.loc[n, 'Stars'] = item_stars
    df.loc[n, 'No. Of Reviews'] = item_nreviews
    df.loc[n, 'Qty Sold'] = item_qty_sold
    df.loc[n, 'Link'] = item_link
    n += 1
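Putting the counter and these assignments together, the overall shape of the loop is sketched below (the scraping try & except blocks from the full code above are elided with a comment):

n = 0
for items in soup.find_all('li', class_='s-item'):
    # ... all the try & except blocks from the full code above go here ...
    df.loc[n, 'Title'] = item_title
    # ... the remaining df.loc assignments shown above ...
    n += 1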
Finally we can check if this worked by doing a quick df.head() which returns the first five rows of the dataframe.
df.head()
Result:
Perfect, we got all this data in a very structured format now. One more step here is to simply save it to an Excel file using df.to_excel:
df.to_excel('ebay_phones.xlsx')
This will save an Excel sheet with the data in your working directory.
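One optional tweak, if you don't want the DataFrame's row index written as its own column in the sheet, is to pass index=False (note that writing .xlsx files requires an Excel writer package such as openpyxl to be installed):

df.to_excel('ebay_phones.xlsx', index=False)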
I hope you find this useful. In the next part I will discuss how to get this data from multiple pages and do some exploratory analysis.
To be continued!