Webscraping Athlete Results with Beautiful Soup from BJJHeroes.com

Web Scraping
Beautiful Soup
Python

Introduction

Unlike most mainstream sports that enjoy centralized national leagues, millions of dollars that can be spent to collect and organize data efficiently, and fans that would love to spend extra time tracking player statistics, Jiu-Jitsu continues to be very much a grassroots effort with a scattered data landscape.

As a result, reliable and easily accessible datasets are few and far between. This project demonstrates how to use Python and the Beautiful Soup library to collect over 50,000 match results from the BJJHeroes website.

Beautiful Soup Explained

Beautiful Soup is an HTML parsing library.

In practice, we first use the get function from the requests library to download a page's HTML inside a Python notebook. Beautiful Soup then parses that HTML so we can pull out the information we are looking for.

Importing Packages and Accessing Website

To begin, I am:

  • Importing requests to access the website
  • Importing pandas
  • Requesting the website and storing the response in a variable r so that we can access it
  • Showing the first 500 characters of the page's HTML to confirm we have access to it
Code and Output Example
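
As a minimal sketch of the request step, the snippet below wraps requests.get in a small helper. The URL, the User-Agent header, and the helper names fetch_page and preview are assumptions for illustration; they are not taken from the original notebook.

```python
import requests

# Assumed address of the A-Z fighter list; adjust it if the site layout differs.
URL = "https://www.bjjheroes.com/a-z-bjj-fighters-list"


def fetch_page(url, timeout=30):
    """Request a page and return the response, raising on HTTP errors."""
    # A browser-like User-Agent is an assumption; some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response


def preview(html, n=500):
    """Return the first n characters of the HTML to confirm we have access."""
    return html[:n]
```

In a notebook, this would be used as r = fetch_page(URL) followed by print(preview(r.text)).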

Finding the 'Tags' we are looking for

Now that we know we can access the website's HTML, here is an example of using the parser to return part of the page. We will assign the page's title tag to the variable tag.

Code and Output Example
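
A small sketch of this step is shown below. It parses a stand-in HTML string so that it runs offline; the title text is made up, and with a live response you would pass r.text instead.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (made up) so the example runs without a network call.
html = "<html><head><title>BJJ Fighters List</title></head><body></body></html>"

soup = BeautifulSoup(html, "html.parser")

tag = soup.title      # the first <title> element in the document
print(tag)            # <title>BJJ Fighters List</title>
print(tag.string)     # BJJ Fighters List
```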

Looking at the page, we see that the First Name, Last Name, Nickname, and Team are stored in a table, much like a dataframe.


BJJ heroes table



Each row of this table is stored as a Table Row (tr) element.

Within a tr, the individual cell entries are stored as Table Data (td) elements.

We can use the .find_all() method to return every td element on the page.
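
The sketch below runs .find_all("td") on a single made-up table row. The names, the team, and the /?p=101 link format are placeholders chosen for illustration, not values copied from the site.

```python
from bs4 import BeautifulSoup

# Stand-in for one row of the fighter table; names and links are made up.
html = """
<table>
  <tr>
    <td><a href="/?p=101">Jane</a></td>
    <td><a href="/?p=101">Doe</a></td>
    <td><a href="/?p=101">JD</a></td>
    <td>Example Team</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")   # every <td> element, in document order

for td in cells:
    link = td.find("a")       # None when the cell has no hyperlink
    print(td.get_text(strip=True), "|", link["href"] if link else None)
```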

What can we learn from the above output?

Scraping Table information into a Dataframe

Now that we have identified how the data we want is structured within the HTML, we can use a for loop to pull it into a dataframe.

For each row in the table, we will:

  • Find all cells in the row and confirm there are 4 columns
  • For each column, find the <a> tag and extract the name or text entry, plus the URL if one is present
  • Store these under the variables "first_name", "first_name_url", etc.
  • Append these stored values into the empty dictionary called "data", then convert it into a dataframe
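
The steps above can be sketched roughly as follows. The sample HTML, the FIELDS list, and the function name table_to_dataframe are assumptions for illustration; only the variable naming pattern ("first_name", "first_name_url") and the "data" dictionary come from the description above.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up sample table; the header row has no <td> cells and is skipped.
SAMPLE_HTML = """
<table>
  <tr><th>First Name</th><th>Last Name</th><th>Nickname</th><th>Team</th></tr>
  <tr>
    <td><a href="/?p=101">Jane</a></td><td><a href="/?p=101">Doe</a></td>
    <td><a href="/?p=101">JD</a></td><td>Example Team</td>
  </tr>
  <tr>
    <td><a href="/?p=102">John</a></td><td><a href="/?p=102">Roe</a></td>
    <td></td><td>Other Team</td>
  </tr>
</table>
"""

FIELDS = ["first_name", "last_name", "nickname", "team"]


def table_to_dataframe(html):
    """Collect the text and link of every 4-column row into a dataframe."""
    soup = BeautifulSoup(html, "html.parser")

    # Empty dictionary of lists: one list per text column and per URL column.
    data = {}
    for field in FIELDS:
        data[field] = []
        data[field + "_url"] = []

    for row in soup.find_all("tr"):
        cols = row.find_all("td")
        if len(cols) != 4:
            continue  # not a data row
        for field, col in zip(FIELDS, cols):
            link = col.find("a")
            data[field].append(col.get_text(strip=True))
            data[field + "_url"].append(link.get("href") if link else None)

    return pd.DataFrame(data)


df = table_to_dataframe(SAMPLE_HTML)
print(df[["first_name", "last_name", "first_name_url"]])
```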

Using Stored Athlete URL to Extract each Athlete's Match Results

Now that we have a hyperlink stored in the dataframe, we can use it to access each athlete's individual page with their match results.

Each athlete page contains another table, similar to the one we just used to obtain the basic information, so we can run the same kind of for loop on it to obtain the match results.

BJJ heroes table

In the following code, we will:

  • Create a function extract_athlete_data() to request each athlete's page using the base URL + the Athlete URL
  • Confirm there are 8 columns, and store the information from each row entry under the corresponding column name
  • Loop through each athlete_url in the original dataframe, extract the data and store it in a final list
  • Convert the list into a new dataframe that includes all athlete match results on BJJHeroes.com
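
A rough sketch of these steps follows. The base URL, the eight column names in MATCH_COLUMNS, and the helpers parse_match_rows and collect_all are assumptions made for illustration; only extract_athlete_data() and the 8-column check are named in the description above. Parsing is kept separate from the request so the logic can be tried on a made-up HTML string.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.bjjheroes.com"  # assumed base address
# Assumed column names for the 8-column match table.
MATCH_COLUMNS = ["Id", "Opponent", "W/L", "Method",
                 "Competition", "Weight", "Stage", "Year"]


def parse_match_rows(html, athlete_url):
    """Return one dict per 8-column row found in the page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for row in soup.find_all("tr"):
        cols = row.find_all("td")
        if len(cols) != 8:
            continue  # not a match-result row
        record = {"Athlete URL": athlete_url}
        for name, col in zip(MATCH_COLUMNS, cols):
            record[name] = col.get_text(strip=True)
        matches.append(record)
    return matches


def extract_athlete_data(athlete_url, base_url=BASE_URL):
    """Request one athlete page and parse its match rows."""
    response = requests.get(base_url + athlete_url,
                            headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    return parse_match_rows(response.text, athlete_url)


def collect_all(athlete_urls, fetch=extract_athlete_data):
    """Loop over athlete URLs and combine every match into one dataframe."""
    final = []
    for url in athlete_urls:
        final.extend(fetch(url))
    return pd.DataFrame(final)


# Offline demonstration with a made-up row instead of a live request.
SAMPLE_PAGE = ("<table><tr><td>1</td><td>John Roe</td><td>W</td><td>Armbar</td>"
               "<td>Example Open</td><td>ABS</td><td>F</td><td>2020</td></tr></table>")
df_final = collect_all(["/?p=101", "/?p=102"],
                       fetch=lambda url: parse_match_rows(SAMPLE_PAGE, url))
print(df_final)
```

Against the live site, df_final = collect_all(df["first_name_url"]) would issue one request per athlete, so adding a short pause between requests is worth considering.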

Replacing Athlete URL with Athlete Name

Now that we have all of the athlete results stored, the last step, which we could have done earlier, is to replace the "Athlete URL" column with the Athlete's name.

This code:

  • Takes the ID from 'Athlete URL' in both the original dataframe df and the final dataframe df_final
  • Merges the two dataframes on the matching ID found in the hyperlink
  • Drops the ID columns, since the 'First Name' and 'Last Name' columns have now been added, and shifts the columns so that the name of the athlete is at the beginning
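
As a sketch of the merge under stated assumptions: the two small dataframes below are made up, the /?p=<number> link format is assumed, and athlete_id is a hypothetical helper column name.

```python
import pandas as pd

# Made-up stand-ins for the two dataframes built earlier.
df = pd.DataFrame({
    "First Name": ["Jane", "John"],
    "Last Name": ["Doe", "Roe"],
    "Athlete URL": ["/?p=101", "/?p=102"],
})
df_final = pd.DataFrame({
    "Athlete URL": ["/?p=101", "/?p=101", "/?p=102"],
    "Opponent": ["John Roe", "Sam Poe", "Jane Doe"],
    "W/L": ["W", "L", "L"],
})

# Pull the numeric ID out of each hyperlink (assumes links look like /?p=101).
pattern = r"p=(\d+)"
df["athlete_id"] = df["Athlete URL"].str.extract(pattern, expand=False)
df_final["athlete_id"] = df_final["Athlete URL"].str.extract(pattern, expand=False)

# Attach the names to every match row via the shared ID.
merged = df_final.merge(
    df[["athlete_id", "First Name", "Last Name"]],
    on="athlete_id",
    how="left",
)

# Drop the ID and URL columns, then move the name columns to the front.
merged = merged.drop(columns=["athlete_id", "Athlete URL"])
front = ["First Name", "Last Name"]
merged = merged[front + [c for c in merged.columns if c not in front]]
print(merged)
```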

Conclusion

We now have a full dataframe of all of the match results included on the BJJHeroes website that we can use for analysis.

Some cleaning is required, as some duplicates were formed in the process of merging on the ID. There are also some athletes who appear twice on the A-Z list with different name spellings. Additionally, the rest of the data entered into the athlete stats tables is not uniformly formatted, but this will all be taken care of in another project.

If you are interested in the data as of 9/09/2024, you can access it below.

Click here to download the CSV file