The following details automated data collection from an HTML page using Python and the BeautifulSoup (bs4) library. It also covers how to add that information to a preexisting Google Sheet. This tutorial assumes you have some basic knowledge of programming.
The example used throughout shows how to collect the information from the tables within the articles found at https://www.dtnpf.com/agriculture/web/ag/news/crops/more
Full code available at: https://github.com/mritzing/fertDataCollector
Prerequisites:
- Python (preferably 3.6 or newer)
- A web driver available for download here: http://chromedriver.chromium.org/downloads
- NOTE: Web drivers are browser specific, so you will need to find one that matches the browser you use
- It is recommended that you put the driver file in the same directory as your Python code; otherwise, you will need to specify its path when declaring the driver object (see the sketch below)
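If the driver is not in the same directory as your script, you can point Selenium at it explicitly. A minimal sketch, assuming the driver was saved at /path/to/chromedriver (substitute your actual path); the exact call depends on your Selenium version:

from selenium import webdriver

# Selenium 3.x style: pass the driver path directly (replace with your actual path)
driver = webdriver.Chrome(executable_path="/path/to/chromedriver")

# Selenium 4+ style: wrap the path in a Service object instead
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service("/path/to/chromedriver"))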
Getting page source:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome() # declare webdriver
url = "https://www.dtnpf.com/agriculture/web/ag/news/crops/more"
driver.get(url) # load page
html = driver.page_source # get html
soup = BeautifulSoup(html, "html.parser") # convert HTML into a BeautifulSoup object
print(soup) # print page source
The above sets the soup variable equal to the page source, which you can then use to parse out information.
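If printing the full source is too noisy, a smaller sanity check (an optional addition, not part of the original snippet) is to print only the page title:

# sanity check: print just the <title> tag rather than the whole document
print(soup.title)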
Two Parsing Methods:
There are two ways you can parse this page source: casting it to a string and using string manipulation functions (https://docs.python.org/2/library/string.html), or navigating the HTML tree using the bs4 library (https://www.crummy.com/software/BeautifulSoup/bs4/doc/). An example of each is shown below. Both have their advantages and disadvantages; it is up to you to figure out which works best in a given situation.
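To illustrate the difference before diving into the project code, here is a toy example (not from the project) that pulls the same link out of a small piece of HTML both ways:

from bs4 import BeautifulSoup

demoHtml = '<div><a href="https://example.com/a">First</a><a href="https://example.com/b">Second</a></div>'
demoSoup = BeautifulSoup(demoHtml, "html.parser")

# string manipulation: cast to a string, then locate the URL with find() and slicing
text = str(demoSoup)
start = text.find('href="') + len('href="')
end = text.find('"', start)
print(text[start:end]) # https://example.com/a

# bs4 tree navigation: let the parser find the tags for you
print([a["href"] for a in demoSoup.find_all("a")]) # both hrefs as a list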
String Parsing:
# returns href after specified string
def hrefSearch(searchStr, source):
    soupStr = str(source) # cast the soup object to a string
    results = []
    index = 0
    while index != -1: # if found
        index = soupStr.find(searchStr, index)
        if index == -1: # search string not found, nothing left to collect
            return results
        index = soupStr.find("href=", index) + 6 # skip past href=" to the start of the URL
        # because 6 was added above, check for 5 (-1 is not found, -1 + 6 = 5)
        if index == 5:
            return results
        endIndex = soupStr.find('"', index)
        results.append(soupStr[index:endIndex])
        index = endIndex + 1
    return results
For this project it was necessary to gather, from the page https://www.dtnpf.com/agriculture/web/ag/news/crops/more, all the links to articles titled DTN Fertilizer Trends. The function above, when called with hrefSearch("DTN Retail Fertilizer Trends", soup), returns a list of the URLs for those articles. It does this by treating the HTML source as one large string, searching for searchStr, and then grabbing the next href link that follows it.
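Putting the pieces together, a call might look like this (assuming soup was built as in the page-source step above):

links = hrefSearch("DTN Retail Fertilizer Trends", soup)
for link in links:
    print(link) # one URL per matching article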
BeautifulSoup Parsing:
# returns data from table on requested URL
def tableReturn(source):
    data = []
    # finds "DRY" <- table title, and then pulls parent to get info into a list of lists
    table = source.find(text="DRY").find_parent("table")
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele]) # get rid of empty values
    return data
Another way to tackle data collection is to navigate the HTML tree using the functions in bs4. Above, find was used to locate the title of the first table on the page (the "DRY" string), and find_parent then returned the table element containing it. The information in that table was extracted in the for loop and returned as a list of lists.
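As a rough sketch of how the two functions could work together (illustrative only, not the project's exact code), each article URL returned by hrefSearch can be loaded in the same driver and passed through tableReturn:

for link in hrefSearch("DTN Retail Fertilizer Trends", soup):
    driver.get(link) # load the article page
    articleSoup = BeautifulSoup(driver.page_source, "html.parser")
    print(tableReturn(articleSoup)) # list of lists, one table row per inner list

driver.quit() # close the browser when finished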
Vocab/Useful Tips:
- Webdriver: the browser-specific tool (e.g. ChromeDriver) that Selenium uses to drive a real browser so the page source can be loaded and retrieved
- BeautifulSoup: the bs4 library's parser object, which turns raw HTML into a tree you can search and navigate
- Breakpoints: