This tutorial covers automated data collection from an HTML page using Python and the BeautifulSoup (bs4) library, along with how to add that information to a preexisting Google Sheet.  It assumes that you have some basic knowledge of programming.

The example used throughout shows how to collect the information from the tables within the articles found at https://www.dtnpf.com/agriculture/web/ag/news/crops/more

Full code available at: https://github.com/mritzing/fertDataCollector

Prerequisites:

  • Python (preferably 3.6 or newer)
  • A web driver, available for download here: http://chromedriver.chromium.org/downloads
    • NOTE: Web drivers are browser-specific, so you will need to find one that suits your needs
    • It is recommended that you put the driver file in the same directory as your Python code; otherwise, you will need to specify its location when you create the driver object (a sketch follows this list)
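
If the driver file is not in your working directory (or on your PATH), you can point Selenium at it directly when you create the driver object. Below is a minimal sketch, assuming the driver has been saved as ./chromedriver next to your script; the keyword used to pass the path changed between Selenium releases, so match it to the version you have installed.

Driver path example
from selenium import webdriver

#"./chromedriver" is an assumed path; change it to wherever you saved the driver
driver = webdriver.Chrome(executable_path="./chromedriver") #Selenium 3 style

#newer Selenium versions wrap the path in a Service object instead:
#from selenium.webdriver.chrome.service import Service
#driver = webdriver.Chrome(service=Service("./chromedriver"))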

Getting page source:

Chrome driver declaration and request
from bs4 import BeautifulSoup
from selenium import webdriver
	
driver = webdriver.Chrome() #declare webdriver
url  = "https://www.dtnpf.com/agriculture/web/ag/news/crops/more"
driver.get(url) #load page
html = driver.page_source #get html 
soup = BeautifulSoup(html,"html.parser") #convert HTML into a BeautifulSoup object
print(soup) #print page source

The code above sets the soup variable equal to the parsed page source, which you can then use to extract information.
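
If you do not want a browser window popping up every time the script runs, you can pass options to the driver, and it is good practice to release the driver once the source has been captured. A minimal sketch of both, assuming Chrome and chromedriver are set up as described above:

Headless request and cleanup
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless") #run Chrome without opening a window
driver = webdriver.Chrome(options=options)

driver.get("https://www.dtnpf.com/agriculture/web/ag/news/crops/more")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit() #close the browser once the page source has been captured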

Two Parsing Methods:

There are two ways to parse this page source: casting it to a string and using string manipulation functions (https://docs.python.org/2/library/string.html), or navigating the HTML tree using the bs4 library (https://www.crummy.com/software/BeautifulSoup/bs4/doc/).  An example of each is shown below.  Both have their advantages and disadvantages; it is up to you to figure out which works best in a given situation.
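
To make the difference concrete, here is a small illustrative sketch on a made-up HTML snippet (not taken from the target site) showing both approaches pulling out the same link.

Parsing approach comparison
from bs4 import BeautifulSoup

snippet = '<p>Example Article</p><a href="/articles/1">Read more</a>'
soup = BeautifulSoup(snippet, "html.parser")

#string manipulation: treat the markup as one big string and slice out the value
text = str(soup)
start = text.find('href="') + len('href="')
end = text.find('"', start)
print(text[start:end]) #/articles/1

#tree navigation: let bs4 locate the tag and read its attribute
print(soup.find("a")["href"]) #/articles/1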

String Parsing:

Href Search Function
# returns every href that appears after the specified string in the page source
def hrefSearch(searchStr, source):
	soupStr = str(source) #cast the soup object to a string
	results = []
	index = 0
	while index != -1:
		index = soupStr.find(searchStr, index) #locate the next occurrence of the search string
		index = soupStr.find("href=", index) + 6 #jump past 'href="' to the start of the URL
		#find returns -1 when nothing is left; -1 + 6 = 5, so 5 means we are done
		if index == 5:
			return results
		endIndex = soupStr.find('"', index) #the closing quote marks the end of the URL
		results.append(soupStr[index:endIndex])
		index = endIndex + 1
	return results

For this project it was necessary to gather, from https://www.dtnpf.com/agriculture/web/ag/news/crops/more, all of the links to articles titled "DTN Retail Fertilizer Trends".  The function above, when called as hrefSearch("DTN Retail Fertilizer Trends", soup), returns a list of the URLs for those articles.  It does this by treating the HTML source as one large string, looking for searchStr, and then grabbing the next href link that appears after it.
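
As a usage sketch (the exact URLs printed will depend on what the page lists when you run it):

Href search usage
#assumes the driver/soup setup from the "Getting page source" section has already run
links = hrefSearch("DTN Retail Fertilizer Trends", soup)
for link in links:
	print(link)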

BeautifulSoup Parsing:


Table Return Function
#returns the data from the table on the requested article page
def tableReturn(source):
	data = []
	#find the "DRY" table title, then pull its parent to get the whole table
	table = source.find(text="DRY").find_parent("table")
	table_body = table.find('tbody')
	rows = table_body.find_all('tr')
	for row in rows:
		cols = row.find_all('td')
		cols = [ele.text.strip() for ele in cols]
		data.append([ele for ele in cols if ele]) #get rid of empty values
	return data

Another way to tackle data collection is to navigate the HTML tree using the functions in bs4.  In the function above, find locates the "DRY" string, which is the title of the first table on the page; find_parent then returns the table containing it.  The information in the table is extracted in the for loop and returned as a list of lists.
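
The two functions can then be chained together: hrefSearch supplies the article URLs and tableReturn pulls the table out of each page. A sketch of that loop is below; whether the collected hrefs are absolute or relative to the site's domain is an assumption here, so the urljoin call may not be needed in practice.

Combining the two functions
from urllib.parse import urljoin

base = "https://www.dtnpf.com" #assumed base URL in case the hrefs are relative
for link in hrefSearch("DTN Retail Fertilizer Trends", soup):
	driver.get(urljoin(base, link)) #load the article page
	articleSoup = BeautifulSoup(driver.page_source, "html.parser")
	print(tableReturn(articleSoup)) #list of lists, one inner list per table row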

Vocab/Useful Tips:

  • Webdriver: the Selenium component that launches and controls a real browser; this is what lets the script retrieve the fully rendered page source.
  • BeautifulSoup: the bs4 object that turns raw HTML into a searchable tree of tags (find, find_all, find_parent, etc.).
  • Breakpoints: pausing the script partway through (in your IDE or with Python's debugger) is the easiest way to inspect the soup object and work out which strings or tags to search for; a small sketch follows.
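
For the breakpoint tip, a small sketch using Python's built-in debugger (breakpoint() requires Python 3.7 or newer; older versions can use "import pdb; pdb.set_trace()" instead):

Breakpoint example
soup = BeautifulSoup(driver.page_source, "html.parser")
breakpoint() #execution pauses here; at the (Pdb) prompt try soup.title or str(soup)[:200]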