The following will detail automated data collection from an HTML page using Python and the BeautifulSoup (bs4) library, and how to add the collected information to a preexisting Google Sheet. This tutorial assumes you have some basic knowledge of programming.
The example used throughout will show how to collect information from the tables within articles found at this link: https://www.dtnpf.com/agriculture/web/ag/news/crops/more
Full code available at: https://github.com/mritzing/fertDataCollector
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # declare webdriver
url = "https://www.dtnpf.com/agriculture/web/ag/news/crops/more"
driver.get(url)  # load page
html = driver.page_source  # get html
soup = BeautifulSoup(html, "html.parser")  # convert HTML into a BeautifulSoup object
print(soup)  # print page source
The above sets the soup variable equal to the page source, which you can then use for parsing out information.
There are two ways you can parse this page source: casting it to a string and using string manipulation functions (https://docs.python.org/2/library/string.html), or navigating the HTML tree using the bs4 library (https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Below is an example of each. Both have their advantages and disadvantages; it is up to you to figure out which works best in a given situation.
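To make the contrast concrete before the full examples below, here is a minimal sketch of both approaches on a hypothetical one-tag snippet (the snippet and its URL are made up for illustration):

from bs4 import BeautifulSoup

snippet = '<a href="https://example.com/article">DTN Retail Fertilizer Trends</a>'
toy = BeautifulSoup(snippet, "html.parser")

# 1) string manipulation: search the raw markup directly
raw = str(toy)
start = raw.find('href="') + len('href="')
print(raw[start:raw.find('"', start)])  # https://example.com/article

# 2) tree navigation: let bs4 locate the tag and read its attribute
print(toy.find("a")["href"])            # https://example.com/article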
String Parsing:
# returns href after specified string
def hrefSearch(searchStr, source):
    soupStr = str(source)  # cast to string
    results = []
    index = 0
    while index != -1:  # if found
        index = soupStr.find(searchStr, index)
        index = soupStr.find("href=", index) + 6  # skip past 'href="'
        # because 6 was added above, check for 5 (-1 means not found; -1 + 6 = 5)
        if index == 5:
            return results
        endIndex = soupStr.find('"', index)
        results.append(soupStr[index:endIndex])
        index = endIndex + 1
    return results
For this project it was necessary to gather, from the page https://www.dtnpf.com/agriculture/web/ag/news/crops/more, all the links to articles titled "DTN Retail Fertilizer Trends". The above function, when called as hrefSearch("DTN Retail Fertilizer Trends", soup), returns a list of the URLs for those articles. It does this by treating the HTML source as one large string, searching for searchStr, and then grabbing the next href link after each occurrence of searchStr.
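A quick usage sketch, assuming soup was built as in the first code block:

# gather the article links and print them
links = hrefSearch("DTN Retail Fertilizer Trends", soup)
for link in links:
    print(link)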
BeautifulSoup Parsing:
# returns data from table on requested URL
def tableReturn(source):
    data = []
    # finds "DRY" <- table title, then pulls parent to get info into list of lists
    table = source.find(text="DRY").find_parent("table")
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])  # get rid of empty values
    return data
Another way to tackle data collection is to navigate the HTML tree using the functions in bs4. Above, the find function was used to locate the title of the first table on the page, and then find_parent returned the table containing the "DRY" string. The information in the table was then extracted in the for loop and returned as a list of lists.
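Putting the two pieces together, a sketch of the full collection loop might look like the following. This assumes the hrefs returned by hrefSearch are absolute URLs; if the site uses relative links you would need to prepend the domain.

# sketch: visit each article link and pull its table data
all_rows = []
for link in hrefSearch("DTN Retail Fertilizer Trends", soup):
    driver.get(link)  # load the article page
    article = BeautifulSoup(driver.page_source, "html.parser")
    all_rows.extend(tableReturn(article))  # accumulate rows from each article
driver.quit()  # close the browser when done

Finally, to add the collected rows to a preexisting Google Sheet, one option is the gspread library (see the full code on GitHub for the project's actual approach). The sketch below assumes you have created a Google service account, shared the sheet with it, and saved its key as credentials.json; the sheet name "Fertilizer Data" is a placeholder.

import gspread

# authenticate with a service account key file (path is an assumption)
gc = gspread.service_account(filename="credentials.json")

# open the target spreadsheet by name and select its first worksheet
sheet = gc.open("Fertilizer Data").sheet1

# append each collected row of table data to the sheet
for row in all_rows:
    sheet.append_row(row)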