I take the publicly available data on www.247sports.com and pull the last 10 years of signee data for all football players in the state of West Virginia. To pull the data, I build a pipeline that extracts the important information and assembles it into a dataset. In R, the rvest package is designed for exactly this. I then show a few interesting plots and write the data to a CSV file so that it can be easily shared.
I first have to check out https://247sports.com/Season/2020-Football/Commits/ to see how the page is structured and what markup (HTML or XML) is used to display the data from their database. I do this in Google Chrome by right-clicking on the page and selecting View Page Source. Looking at the markup for each individual player listed, I can tell that the page is HTML.
I need to get the CSS selectors for the data that I actually want to pull. I use SelectorGadget for this, installed in the Chrome browser. From there, I build a dataframe for each variable on the website so that I can combine these dataframes once all of the data has been extracted. Once I had a working web scraper that pulled the data, I was able to drop the code into a for loop that iterates through all years of interest and outputs a single dataframe that I use for plotting.
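To make the selector step concrete, here is a minimal sketch of pulling one dataframe per variable with rvest and then combining them. The HTML snippet, the class names (`.player`, `.name`, `.pos`, `.town`) and the helper `pull_text()` are all made up for illustration; the real 247sports markup and the selectors SelectorGadget reports will be different.

```r
library(rvest)

# Hypothetical stand-in for the page markup. The real page would be loaded
# with read_html(url), and its class names will not match these.
page <- minimal_html('
  <ul>
    <li class="player">
      <span class="name">Player A</span>
      <span class="pos">QB</span>
      <span class="town">Town One, WV</span>
    </li>
    <li class="player">
      <span class="name">Player B</span>
      <span class="pos">WR</span>
      <span class="town">Town Two, WV</span>
    </li>
  </ul>')

# Hypothetical helper: return the text of every node matching a CSS selector.
# html_elements() and html_text2() are available in rvest 1.0.0 and later.
pull_text <- function(page, css) {
  html_text2(html_elements(page, css))
}

# One dataframe per variable, each driven by its own (assumed) selector.
name_df <- data.frame(name = pull_text(page, ".player .name"),
                      stringsAsFactors = FALSE)
pos_df  <- data.frame(position = pull_text(page, ".player .pos"),
                      stringsAsFactors = FALSE)
town_df <- data.frame(town = pull_text(page, ".player .town"),
                      stringsAsFactors = FALSE)

# Combine the per-variable dataframes column-wise into one table of commits.
commits <- cbind(name_df, pos_df, town_df)
print(commits)
```

Binding columns this way only works if every selector returns exactly one value per player, so it is worth checking that the per-variable dataframes all have the same number of rows before combining them.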
I plan to convert this into an application where a person can select their state and years of interest and the web scraper will automatically pull all of the data. To do this, I need to create variables for the year and state and then insert those variables into the website address so that I pull the correct data (both state and year). Then, I need to repeat all of my code for all 10 years, so I’ll create a for loop that runs this code for every year of interest. Code hidden from output - available on my GitHub page.
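The year/state substitution and the loop can be sketched roughly as follows. This is not the hidden code from the post: `build_url()` and `scrape_year()` are hypothetical helpers, the `State` query parameter is a placeholder I have not verified against the site, and the year range 2011 to 2020 is an assumption. The scraping step is stubbed out so the sketch runs without touching the network.

```r
# Hypothetical helper: insert the year and state into the page address.
# The path follows the 2020 URL above; the "?State=" query parameter is an
# unverified placeholder for however the site actually filters by state.
build_url <- function(year, state) {
  sprintf("https://247sports.com/Season/%d-Football/Commits/?State=%s",
          year, state)
}

# Hypothetical stub for the scraper. A real version would call
# rvest::read_html(url) and extract the player fields; here it only
# records which page would have been requested.
scrape_year <- function(year, state) {
  url <- build_url(year, state)
  data.frame(year = year, state = state, url = url,
             stringsAsFactors = FALSE)
}

years <- 2011:2020   # assumed 10-year window
state <- "WV"

# Collect one dataframe per year, then stack them into a single dataframe.
results <- vector("list", length(years))
for (i in seq_along(years)) {
  results[[i]] <- scrape_year(years[i], state)
}
all_commits <- do.call(rbind, results)

print(all_commits[, c("year", "url")])
```

Filling a pre-sized list and calling `rbind` once at the end avoids growing a dataframe inside the loop, which gets slow as the number of years or states increases.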
I’ll extract all unique town names and then use an API from www.developer.tomtom.com to pull the lat/long for each city/state combination. I’ll then insert the lat/long into a new dataframe and plot the location of each commit.
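As a rough sketch of the lookup step, the code below reduces the commits to unique city/state combinations and builds one geocoding request URL per town. The sample rows are made up, `geocode_url()` is a hypothetical helper, and the endpoint shape (`/search/2/geocode/<query>.json?key=...`) is my understanding of the TomTom Geocoding API, so check it against their documentation before relying on it. No request is actually sent here.

```r
# Made-up sample rows standing in for the scraped commits.
commits <- data.frame(
  town  = c("Morgantown", "Charleston", "Morgantown", "Huntington"),
  state = c("WV", "WV", "WV", "WV"),
  stringsAsFactors = FALSE
)

# Keep each city/state combination once so every town is looked up one time.
towns <- unique(commits[, c("town", "state")])

# Hypothetical helper: build a geocoding request URL for one town.
# Endpoint shape is an assumption based on the TomTom Geocoding API docs.
geocode_url <- function(town, state, api_key) {
  query <- URLencode(paste0(town, ", ", state), reserved = TRUE)
  sprintf("https://api.tomtom.com/search/2/geocode/%s.json?key=%s",
          query, api_key)
}

api_key <- "YOUR_API_KEY"   # placeholder, not a real key
towns$request <- mapply(geocode_url, towns$town, towns$state,
                        MoreArgs = list(api_key = api_key),
                        USE.NAMES = FALSE)

print(towns$request)
# Each URL would then be fetched and the JSON response parsed for lat/long.
```

Looking up only the unique towns keeps the number of API calls small, and the resulting coordinates can be joined back onto the full commit table by town and state.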
My for loop worked and successfully compiled all of the data for each year into a single dataframe. I want to create a few plots to explore the data visually. The first plot is the most complicated: it shows the state map with a point for each commit, shaded by year and sized by the number of commits from that town over the time period.
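The data preparation behind a plot like this can be sketched as below: count commits per town, then join the counts and coordinates back onto each commit so every point carries its year and its town's total. The rows and the approximate coordinates are illustrative only, and the ggplot2 call is an assumed way to draw the points (it runs only if ggplot2 is installed and omits the state outline).

```r
# Made-up sample commits; the real dataframe comes from the scraper.
commits <- data.frame(
  town = c("Morgantown", "Morgantown", "Charleston", "Huntington", "Morgantown"),
  year = c(2018, 2019, 2019, 2020, 2020),
  stringsAsFactors = FALSE
)

# Approximate, illustrative coordinates standing in for the geocoder output.
coords <- data.frame(
  town = c("Morgantown", "Charleston", "Huntington"),
  lat  = c(39.63, 38.35, 38.42),
  lon  = c(-79.96, -81.63, -82.45),
  stringsAsFactors = FALSE
)

# Number of commits per town over the whole period (used for point size).
town_counts <- as.data.frame(table(town = commits$town),
                             responseName = "n_commits",
                             stringsAsFactors = FALSE)

# One row per commit, carrying its year, town total, and coordinates.
plot_data <- merge(merge(commits, town_counts, by = "town"),
                   coords, by = "town")
print(plot_data)

# Optional drawing step, assuming ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    plot_data,
    ggplot2::aes(x = lon, y = lat, size = n_commits, colour = factor(year))
  ) +
    ggplot2::geom_point(alpha = 0.7)
}
```

A state outline layer would normally be drawn underneath the points; that part is left out here because it depends on which map data source is used.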
This plot is a little easier to generate and shows where the commits went over the time period.
This plot shows the positions of the commits by town.
This plot shows the colleges by number of commits.
You can download the CSV file from my GitHub account here: Get CSV File
If you have a stat that you want to see or something you’d like to know, please email me. I’ll do my best to build it into the model.
Thanks!