Introduction:
This is my first attempt at web scraping. I hope you find it useful, and if you have any tips or suggestions, please leave them in the comments below.
I will be showing you how to scrape data from e-commerce websites (taking eBay as an example here).
Disclaimer: Please note that you should not overload any website with too many GET requests while scraping, as this can strain their servers and get your IP banned, so do this at your own risk.
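As a small aside (my own sketch, not part of the original tutorial): one simple way to be polite is to pause between page requests. The two-second delay and the three-page range below are arbitrary assumptions, purely for illustration.
import time
import requests

base_url = 'https://www.ebay.com/b/Cell-Phone-Smartphone-Parts/43304/bn_151926?rt=nc&_pgn='
for page in range(1, 4):  # only a few pages, just for illustration
    html = requests.get(base_url + str(page)).text  # fetch one results page
    time.sleep(2)  # pause so we don't hammer the server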
Exploring the website:
I will start by exploring the website itself a bit, then I will use Python to get the data we need.
Before we start, let's see what type of data eBay has to offer.
After some exploration on eBay, I decided to dive into Electronics (Cell Phone & Smartphone Parts).
Looking at the first item here, a few things already stand out.
For example, we might want to get:
- Its title
- The price of the item ($26.38)
- Where it comes from (China)
- QTY sold (283)
- Whether it offers free shipping
So for some items there are plenty of other useful attributes we can scrape as well.
On top of all the previous attributes, some items offer extras such as:
- A more detailed description of the item
- Whether the seller is a top rated seller or not
- The brand of the item
- Type
- Color
If we could get those attributes in a structured format for the 637k results in the "Cell Phone & Smartphone Parts" category alone, imagine what kind of insights we could get.
Some ideas I can already think of are for example:
- Which brands sell the most in a particular price range?
- How much impact does free shipping have on sales of similar products in the same price range?
- Is the origin of the product a deciding factor for consumers?
OK great, so now our goal is simple: scrape some attributes and put them in a structured format.
Scraping (Brief Intro):
Generally speaking, the most straightforward method to scrape data from any website is to:
1. Decide what piece of information we want to get within the web page
2. Look up where this piece of information exists in the HTML code of the page.
3. Load the web page in Python.
4. Use the BeautifulSoup library in Python to parse the full HTML code of the page.
5. Finally, search for the piece of information within the parsed HTML code we have loaded in Python and return it in the desired format.
Note: You don't need to fully understand HTML, as all we'll do is simply look for patterns within the code.
Great, now let's begin with an example:
If you want to follow along with me, you can open the same eBay listings page I am using here (the "Cell Phone & Smartphone Parts" category page, whose URL appears in the code below).
Let's say I want to get the title of the first product we have here, so I would simply right click on the title of the product and click "Inspect Element" (I am doing this on Firefox. All browsers offer this functionality, just under different names).
This opens a tiny window below showing all the HTML code behind the webpage interface.
You can explore by moving along the code, and you will see that each piece you hover over will highlight the corresponding element on the webpage.
Also, by clicking the tiny arrow on the left of each line you can expand it for more details. For example, by clicking the tiny arrow next to "h3" we find exactly what we're looking for... the title of the product.
Here we can see that the title lies between the "h3" tags, as follows: <h3> + some attributes + our title + </h3>. We can use this later to tell Python to look within this code for the "h3" tags, but since there could be more than one "h3", we can narrow our choices a bit by asking Python to look for an "h3" that has a class attribute = "s-item__title".
Next we need to scroll up a little bit within the HTML code to find a piece of code that covers a wider range than just the title. We do this to narrow our search in Python (as we will see later when we get to coding) instead of going through the whole page, where similar tags could exist.
By scrolling through the HTML code we can see that all the details for each product fall under a general tag denoted as 'li' with a class = 's-item', and by hovering over that tag we can see that it indeed highlights the area containing all the details of the first product. For now we just need to remember this; things will be much clearer later.
Scraping (Coding):
Now I will go ahead and import the libraries we'll use:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import re
If you don't have any of those libraries installed you can run a pip install in bash/cmd.
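For reference, this is one way to install them from bash/cmd (lxml is included because we will use it as the parser below):
pip install beautifulsoup4 requests pandas numpy lxml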
- BeautifulSoup for parsing HTML
- Requests for fetching the webpage using its URL
- Pandas (a great library to structure data, manipulate it & also save it to csv or excel or other formats)
- Numpy (Used to run general mathematical operations on full arrays)
- re is mainly used for text manipulation and regular expressions
Next we will:
1. Use the requests.get method on the URL and store the result into the variable "source".
2. Use the BeautifulSoup library on the source variable with the 'lxml' parser, which is the preferred one for general-purpose use. (There are other parsers, but I am not really familiar with their best use cases. Anyway, you can check them in the library's documentation if you're interested.)
source = requests.get('https://www.ebay.com/b/Cell-Phone-Smartphone-Parts/43304/bn_151926?rt=nc&_pgn=1').text
soup = BeautifulSoup(source, 'lxml')
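As a side note (not something this post uses): if you don't have lxml installed, BeautifulSoup's built-in parser can be swapped in instead.
soup = BeautifulSoup(source, 'html.parser')  # Python's built-in parser, no extra package needed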
We can print the output to see what it looks like. (.prettify() is a great way to make the output a little more readable by indenting the HTML code; printing without it gives a very messy result.)
print(soup.prettify())
Below is a snippet of the output we would get:
<!DOCTYPE html>
<!--[if IE 9]><html class="ie9" lang="en"><![endif]-->
<!--[if gt IE 9]><!-->
<html lang="en">
<!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="Shop from the world's largest selection and best deals for Cell Phone & Smartphone Parts. Shop with confidence on eBay!" property="og:description"/>
<link href="https://ir.ebaystatic.com" rel="preconnect"/>
<title>
Cell Phone & Smartphone Parts | eBay
</title>
<meta content="Shop from the world's largest selection and best deals for Cell Phone & Smartphone Parts. Shop with confidence on eBay!" name="description"/>
<meta content="eBay" property="og:site_name"/>
<meta content="unsafe-url" name="referrer"/>
We could scroll through this output to look for the data we want, just like we did before on the website, but generally I prefer right click + Inspect Element as it is more convenient.
Now we can make use of the exploration we did earlier on the HTML code of eBay's webpage.
First, to cut things down a bit, we will return only the portion of the HTML code we want for this particular product and assign it to the variable "item" by using the find method on soup.
To do this we simply write within the find method the tag we are looking for, which is 'li' in this case. After that we also add the class, which is 's-item'. (Here we have to write "class_" and not just "class", because "class" is reserved in Python for creating classes.)
item = soup.find('li', class_='s-item')
print(item.prettify())
After printing the output we get a much more concise portion of the full HTML code, and by looking closely through it we can see the title we are looking for (inside the "h3" tag below).
<li class="s-item " data-widget="/lexbrwfe$1.0.0/src/common-utils/component-parts/widget-no-update" id="w5-items[0]"> <div class="s-item__wrapper clearfix"> <div class="s-item__image-section"> <div class="s-item__image"> <a _sp="p2489527.m4335.l8656" aria-hidden="true" data-track='{"eventFamily":"LST","eventAction":"ACTN","actionKind":"NAVSRC","operationId":"2489527","flushImmediately":false,"eventProperty":{"moduledtl":"mi:4335|iid:1|li:8656|luid:1|scen:Listings","parentrq":"431c27541680a16d7b8a263effff2df1","pageci":"b69cb580-4da1-4916-8fdc-edf3db098fd2"}}' href="https://www.ebay.com/itm/Per-Samsung-Galaxy-S5-G900F-i9600-LCD-Display-Touch-Screen-Digitizer-Nero-Bianco/263922230005?hash=item3d72fda2f5:m:m5UnWcGJ7JK-5gYlX3eey8A&var=563368623895" tabindex="-1"> <div class="s-item__image-wrapper"> <div class="s-item__image-helper"> </div> <img alt="Per Samsung Galaxy S5 G900F i9600 LCD Display Touch Screen Digitizer Nero Bianco" class="s-item__image-img" src="https://i.ebayimg.com/thumbs/images/m/m5UnWcGJ7JK-5gYlX3eey8A/s-l225.jpg"/> </div> </a> </div> </div> <div class="s-item__info clearfix"> <div class="s-item__title-hotness"> </div> <a _sp="p2489527.m4335.l8656" class="s-item__link" data-track='{"eventFamily":"LST","eventAction":"ACTN","actionKind":"NAVSRC","operationId":"2489527","flushImmediately":false,"eventProperty":{"moduledtl":"mi:4335|iid:1|li:8656|luid:1|scen:Listings","parentrq":"431c27541680a16d7b8a263effff2df1","pageci":"b69cb580-4da1-4916-8fdc-edf3db098fd2"}}' href="https://www.ebay.com/itm/Per-Samsung-Galaxy-S5-G900F-i9600-LCD-Display-Touch-Screen-Digitizer-Nero-Bianco/263922230005?hash=item3d72fda2f5:m:m5UnWcGJ7JK-5gYlX3eey8A&var=563368623895"> <h3 class="s-item__title" role="text"> Per Samsung Galaxy S5 G900F i9600 LCD Display Touch Screen Digitizer Nero Bianco </h3> </a> <div class="s-item__details clearfix"> <div class="s-item__detail s-item__detail--primary"> <span class="s-item__price"> $26.38 </span> </div> <span class="s-item__detail s-item__detail--secondary"> <span class="s-item__location s-item__itemLocation"> From China </span> </span> <div class="s-item__detail s-item__detail--primary"> <span class="s-item__shipping s-item__logisticsCost"> Free shipping </span> </div> <div class="s-item__detail s-item__detail--primary"> <span class="s-item__hotness s-item__itemHotness"> <span class="NEGATIVE"> 283 sold </span> </span> </div> </div> </div> </div> </li>
If you look a bit further into this portion of the code, you will probably notice that it contains all the other attributes for this product as well, like the price, where it's from, etc.
Great, next we will follow the same methodology to narrow the code further by using the find method again. This time we will use the tag 'h3' which holds the title we are looking for and we will again add the class as 's-item__title'.
item_title = item.find('h3', class_='s-item__title')
print(item_title.prettify())
This is the output we get:
<h3 class="s-item__title" role="text">
 Per Samsung Galaxy S5 G900F i9600 LCD Display Touch Screen Digitizer Nero Bianco
</h3>
Perfect, we are very close!
At this point it is very simple to extract text out of the result we have here. To do this we will only make a minor modification to the previous piece of code by simply adding .text at the end. (note that we now print the result without .prettify())
item_title = item.find('h3', class_='s-item__title').text
print(item_title)
And this is the result we get. Exactly what we need!
Per Samsung Galaxy S5 G900F i9600 LCD Display Touch Screen Digitizer Nero Bianco
Now that we know how to do this for the title of a product we can do the same for any other attribute we desire.
It is just as simple as looking up the element's/attribute's corresponding HTML bounds and extracting whatever we need.
For example, here is how to get the price using the same sequence (inspect element, find the relevant tag & class, and use .text to retrieve the related text).
Assign tag as 'span', class_= 's-item__price':
item_price = item.find('span', class_='s-item__price').text
print(item_price)
Result:
$26.38
It's that simple!
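Since the price comes back as a string like "$26.38", here is a small optional sketch of my own showing how the re library we imported earlier could strip the currency symbol and turn it into a number:
import re

price_text = '$26.38'  # what item.find(...).text returned above
price_value = float(re.sub(r'[^\d.]', '', price_text))  # keep only digits and the decimal point
print(price_value)  # 26.38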
Next we will use everything we did above within a for loop to return all the attributes we need for all the products on the page and store the data in a dataframe. We can also go further by going through all the pages and storing all this data for later analysis.
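As a rough preview of that (my own sketch, assuming every listing exposes the same tags we used above; real listings can miss fields, which Part 2 handles more carefully):
import requests
import pandas as pd
from bs4 import BeautifulSoup

source = requests.get('https://www.ebay.com/b/Cell-Phone-Smartphone-Parts/43304/bn_151926?rt=nc&_pgn=1').text
soup = BeautifulSoup(source, 'lxml')

records = []
for item in soup.find_all('li', class_='s-item'):  # every product card on the page
    title = item.find('h3', class_='s-item__title')
    price = item.find('span', class_='s-item__price')
    records.append({
        'title': title.text if title else None,
        'price': price.text if price else None,
    })

df = pd.DataFrame(records)
print(df.head())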
For Part 2: https://oaref.blogspot.com/2019/01/web-scraping-using-python-part-2.html