Introduction

Unlike most mainstream sports, which enjoy centralized national leagues, budgets of millions of dollars for collecting and organizing data, and fans eager to spend extra time tracking player statistics, Jiu-Jitsu remains very much a grassroots effort with a scattered data landscape.

As a result, reliable and easy-to-access datasets are few and far between. This project demonstrates how to use Python and the BeautifulSoup library to collect over 50,000 match results from the BJJHeroes website.

BeautifulSoup Explained

BeautifulSoup is an HTML parsing library.

In practice, this means we first use the get function from the requests library to download a page's HTML inside a Python notebook. Then, using BeautifulSoup, we parse that HTML to pull out the information we are looking for.
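
As a minimal sketch of the parsing step, the snippet below runs BeautifulSoup on a small hard-coded HTML string instead of a live page, so no request is needed. The sample_html string and everything in it are made up for illustration and are not taken from BJJHeroes.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example (not from the real site)
sample_html = """
<html>
  <head><title>Sample Fighters List</title></head>
  <body>
    <table>
      <tr><td><a href="/?p=1">Jane</a></td><td>Example Team</td></tr>
    </table>
  </body>
</html>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")

# Access a tag by name, then read its text
page_title = soup_demo.title.text

# find_all returns every matching tag; get_text(strip=True) drops the markup
cell_texts = [td.get_text(strip=True) for td in soup_demo.find_all("td")]

# Attributes such as href are read with dictionary-style access
first_link = soup_demo.find("a")["href"]

print(page_title)
print(cell_texts)
print(first_link)
```

The same three operations (access a tag by name, find all tags of one type, read an attribute) are the only ones the rest of this project relies on.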

Importing Packages and Accessing Website

To begin, I am:

Importing requests to access the website
Importing pandas to store the results in dataframes
Importing BeautifulSoup to parse the HTML
Storing the response from the website in a variable r so that we can work with it
Printing the first 500 characters of the page's HTML to confirm we have access to it

Code and Output Example

In [1]:

import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get('https://www.bjjheroes.com/a-z-bjj-fighters-list')
# Print the first 500 characters of the HTML
print(r.text[0:500])

Out [1]:

<!doctype html>
<head dir="ltr" lang="en-US">
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<link href="//www.google-analytics.com" rel="dns-prefetch">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<link rel="pingback" href="https://www.bjjheroes.com/xmlrpc.php">
<link rel="icon" id="favicon" type="image/png" href="https://www.bjjheroes.com/wp-content/uploads/2020/03/favicon-16x">

Finding the 'Tags' we are looking for

Now that we know we can access the website's HTML, here is an example of using the parser to return part of the page. We will assign the page's title tag to the variable tag.

Code and Output Example

In [1]:

soup = BeautifulSoup(r.text, 'html.parser')
tag = soup.title
tag

Out [1]:

Looking at the page, we see that the First Name, Last Name, Nickname, and Team are stored in a table, much like a dataframe.

Each row of this table is a Table Row (tr) element.

Within a tr, the individual cells are stored as Table Data (td) elements.

We can use the .find_all() method to return every td element on the page.

In [1]:

tag = soup.find_all("td")

print('First 10 entries')
tag[:10]

Out [1]:

First 10 entries

<td class="column-1"><a href="/?p=8141">Aarae</a></td>,
<td class="column-2"><a href="/?p=8141">Alexander</a></td>,
<td class="column-3"></td>,
<td class="column-4">Team Lloyd Irvin</td>,
<td class="column-1"><a href="/?p=9246">Aaron</a></td>,
<td class="column-2"><a href="/?p=9246">Johnson</a></td>,
<td class="column-3"><a href="/?p=9246">Tex</a></td>,
<td class="column-4">Unity JJ</td>,
<td class="column-1"><a href="/?p=8494">Abdurakhman</a></td>,
<td class="column-2"><a href="/?p=8494">Bilarov</a></td>

What can we learn from the above output?

Each name contains a link (a href="..."), which we can use to get to each individual athlete's stats page
Each entry follows the same format, so we can reliably iterate through and store this information in a dataframe
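
The links in the table are relative (they start with /?p=), so they need the site's address in front of them before they can be requested. One way to do that, sketched here with the standard library's urljoin, is to resolve each href against the URL of the list page. The code later in this post uses simple string concatenation instead; urljoin is shown only as an alternative.

```python
from urllib.parse import urljoin

# URL of the A-Z list page that was requested earlier
list_page = "https://www.bjjheroes.com/a-z-bjj-fighters-list"

# Relative hrefs copied from the td output above
hrefs = ["/?p=8141", "/?p=9246", "/?p=8494"]

# Resolve each relative href against the list page URL
athlete_links = [urljoin(list_page, href) for href in hrefs]

for link in athlete_links:
    print(link)
```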

Scraping Table information into a Dataframe

Now that we have identified the structure of the data we want within the HTML, we can use a for loop to pull it into a dataframe.

We will:

Find all rows in the table and, for each row, confirm that it has 4 columns
For each column, find the <a> tag and extract its text (falling back to the cell's own text when there is no link), along with the URL if one is present
Store these under the variables "first_name", "first_name_url", etc.
Append these stored values as a dictionary to the list called "data", then convert the list into a dataframe

In [1]:

# Find all table rows
rows = soup.find_all("tr")

# Initialize empty list
data = []

# Iterate over the rows and extract
for row in rows:
    columns = row.find_all("td")
    if len(columns) == 4:  # confirm the column count in the row
        first_name_tag = columns[0].find("a")
        first_name = first_name_tag.text.strip() if first_name_tag else columns[0].text.strip()
        first_name_url = first_name_tag['href'] if first_name_tag else None

        last_name_tag = columns[1].find("a")
        last_name = last_name_tag.text.strip() if last_name_tag else columns[1].text.strip()
        last_name_url = last_name_tag['href'] if last_name_tag else None

        nick_name_tag = columns[2].find("a")
        nick_name = nick_name_tag.text.strip() if nick_name_tag else columns[2].text.strip()
        nick_name_url = nick_name_tag['href'] if nick_name_tag else None

        team_tag = columns[3].find("a")
        team = team_tag.text.strip() if team_tag else columns[3].text.strip()
        team_url = team_tag['href'] if team_tag else None
        
        # Append data to list
        data.append({
            "First Name": first_name,
            "Last Name": last_name,
            "Nick Name": nick_name,
            "Team": team,
            "Athlete URL": first_name_url,
        })

# Convert the list to DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

Out [1]:

	First Name	Last Name	Nick Name	Team	Athlete URL
0	Aarae	Alexander		Team Lloyd Irvin	/?p=8141
1	Aaron	Johnson	Tex	Unity JJ	/?p=9246
2	Abdurakhman	Bilarov		Team Nogueira	/?p=8494
3	Abmar	Barbosa			/?p=390
4	Abraham	Marte Messina		Yamasaki / Basico	/?p=3083
1379	Valdir	Canuto	Tio Chico	Zenith JJ	/?p=7505
1380	Nakapan	Phungephorn		BETA Academy	/?p=7512
1381	Eliot	Kelly		Yemaso JJ	/?p=7519
1382	Mauricio	Pereira	Mauricao	Behring JJ	/?p=7556
1383	Vinicius	Garcia			/?p=7636

[1384 rows x 5 columns]

Using Stored Athlete URL to Extract each Athlete's Match Results

Now that we have a hyperlink stored in the dataframe, we can use it to access each athlete's individual page with their match results.

Each athlete page has another table, similar to the one we just used to obtain the basic information, so once we have requested the page we can loop over its rows to collect the match results.

In the following code, we will:

Create a function extract_athlete_data() that requests an athlete's page by combining the site's base URL with the Athlete URL
Keep only rows with 8 columns, and store the value from each cell under the corresponding column name
Loop through each Athlete URL in the original dataframe, extract the data, and add it to a final list
Convert the list into a new dataframe that includes all athlete match results on BJJHeroes.com

In [1]:

# Function to append athlete url and extract data
def extract_athlete_data(athlete_url):
    full_url = f'https://www.bjjheroes.com{athlete_url}'
    response = requests.get(full_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find all table rows
    rows = soup.find_all("tr")

    # Empty list to store data
    data = []

    # Iterate over the rows and extract each match
    for row in rows:
        columns = row.find_all("td")
        if len(columns) == 8:  # confirm table column count
            opponent_tag = columns[1].find("a")
            opponent_name = opponent_tag.text.strip() if opponent_tag else columns[1].text.strip()
            result = columns[2].text.strip()
            method_tag = columns[3].find("a")
            method = method_tag.text.strip() if method_tag else columns[3].text.strip()
            event = columns[4].text.strip()
            weight = columns[5].text.strip()
            stage = columns[6].text.strip()
            year = columns[7].text.strip()

            # Append to list
            data.append({
                "Athlete URL": full_url,
                "Opponent Name": opponent_name,
                "Result": result,
                "Method": method,
                "Event": event,
                "Weight": weight,
                "Stage": stage,
                "Year": year
            })
    return data

# Store the final data
all_data = []

# Loop through each athlete in the initial DataFrame
for index, row in df.iterrows():
    athlete_url = row['Athlete URL']
    athlete_data = extract_athlete_data(athlete_url)
    all_data.extend(athlete_data)

# Convert the list to pandas DataFrame
df_final = pd.DataFrame(all_data)

# Display  DataFrame
print(df_final)

Out [1]:

	Athlete URL	Opponent Name	Result	Method	Event	Weight	Stage	Year
0	https://www.bjjheroes.com/?p=9246	Quentin Rosensweig	L	Inside heel hook	Kakuto 5	ABS	SPF	2015
1	https://www.bjjheroes.com/?p=9246	Neiman Gracie	L	RNC	NoGi Pan Ams	94KG	SF	2015
2	https://www.bjjheroes.com/?p=9246	Richie Martinez	L	Heel hook	Kakuto Challenge	ABS	SF	2015
3	https://www.bjjheroes.com/?p=9246	Leo Nogueira	L	Points	Atlanta W. Open	94KG	SF	2016
4	https://www.bjjheroes.com/?p=9246	Romulo Azevedo	L	N/A	UAEJJF NYC Pro	94KG	SF	2016
...	...	...	...	...	...	...	...
50828	https://www.bjjheroes.com/?p=7636	Cody Heller	W	N/A	Atlanta SM Open	ABS	4F	2019
50829	https://www.bjjheroes.com/?p=7636	Daniel Olivier	W	Canto choke	New Orleans Open	88KG	SF	2020
50830	https://www.bjjheroes.com/?p=7636	Joshua Murdock	W	Points	New Orleans Open	ABS	SF	2020
50831	https://www.bjjheroes.com/?p=7636	Kyle Raemisch	W	Mounted X choke	F2W 153	85KG	SPF	2020
50832	https://www.bjjheroes.com/?p=7636	Kevin Vieira	W	Hashimoto choke	Pan American	82KG	8F	2020

[50833 rows x 8 columns]
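
The loop above sends one request per athlete, which is well over a thousand requests in a row. A gentler variant, sketched below, pauses between requests and skips pages that fail instead of stopping the whole run. The function name fetch_all, the fetch callable, and the one-second default delay are assumptions made for illustration; in the notebook, fetch could simply be extract_athlete_data.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between calls.

    Returns (results, failed), where results is a flat list of rows
    and failed lists the URLs whose fetch raised an exception.
    """
    results, failed = [], []
    for i, url in enumerate(urls):
        if not url:  # skip rows with no stored link (None or empty)
            continue
        try:
            results.extend(fetch(url))
        except Exception as exc:  # broad on purpose: log and move on
            print(f"Skipping {url}: {exc}")
            failed.append(url)
        if i < len(urls) - 1:
            time.sleep(delay_seconds)  # be polite to the server
    return results, failed

# Stand-in for a real scraper so the example runs offline (hypothetical)
def fake_fetch(url):
    if url == "/?p=2":
        raise ValueError("simulated network error")
    return [{"Athlete URL": url, "Result": "W"}]

demo_rows, demo_failed = fetch_all(
    ["/?p=1", None, "/?p=2", "/?p=3"], fake_fetch, delay_seconds=0
)
print(len(demo_rows), demo_failed)
```

Keeping the list of failed URLs makes it possible to retry only those pages later rather than repeating the full scrape.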

Replacing Athlete URL with Athlete Name

Now that we have all of the athlete results stored, the last step, which we could have done earlier, is to replace the "Athlete URL" column with the athlete's name.

This code:

Extracts the numeric ID from the 'Athlete URL' column in both the original dataframe df and the final dataframe df_final
Merges the two dataframes on that matching ID
Drops the 'Athlete URL' and 'ID' columns, which are no longer needed once 'First Name' and 'Last Name' have been added, and reorders the columns so that the athlete's name comes first

In [1]:

df['ID'] = df['Athlete URL'].str.extract(r'\?p=(\d+)')
df_final['ID'] = df_final['Athlete URL'].str.extract(r'\?p=(\d+)')
df_merged = df_final.merge(df[['ID', 'First Name', 'Last Name']], on='ID', how='left')

# Drop the 'Athlete URL' and 'ID' columns 
df_merged.drop(columns=["Athlete URL", "ID"], inplace=True)

# Reorder columns to place 'First Name' and 'Last Name' at the beginning
df_merged = df_merged[["First Name", "Last Name", "Opponent Name", "Result", "Method", "Event", "Weight", "Stage", "Year"]]

# Final DataFrame
print(df_merged)

Out [1]:

	First Name	Last Name	Opponent Name	Result	Method	Event	Weight	Stage	Year
0	Aaron	Johnson	Quentin Rosensweig	L	Inside heel hook	Kakuto 5	ABS	SPF	2015
1	Aaron	Johnson	Neiman Gracie	L	RNC	NoGi Pan Ams	94KG	SF	2015
2	Aaron	Johnson	Richie Martinez	L	Heel hook	Kakuto Challenge	ABS	SF	2015
3	Aaron	Johnson	Leo Nogueira	L	Points	Atlanta W. Open	94KG	SF	2016
4	Aaron	Johnson	Romulo Azevedo	L	N/A	UAEJJF NYC Pro	94KG	SF	2016
...	...	...	...	...	...	...	...	...	...
51294	Vinicius	Garcia	Cody Heller	W	N/A	Atlanta SM Open	ABS	4F	2019
51295	Vinicius	Garcia	Daniel Olivier	W	Canto choke	New Orleans Open	88KG	SF	2020
51296	Vinicius	Garcia	Joshua Murdock	W	Points	New Orleans Open	ABS	SF	2020
51297	Vinicius	Garcia	Kyle Raemisch	W	Mounted X choke	F2W 153	85KG	SPF	2020
51298	Vinicius	Garcia	Kevin Vieira	W	Hashimoto choke	Pan American	82KG	8F	2020

[51299 rows x 9 columns]
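
The two dataframes can be joined because the pattern \?p=(\d+) pulls the same numeric ID out of the relative link stored in df and the full link stored in df_final. The sketch below shows that behaviour with Python's built-in re module on example strings; the helper name extract_id and the final URL without an ID are hypothetical.

```python
import re

ID_PATTERN = re.compile(r"\?p=(\d+)")

def extract_id(url):
    """Return the digits after '?p=' as a string, or None if absent."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None

relative_url = "/?p=9246"                           # form stored in df
absolute_url = "https://www.bjjheroes.com/?p=9246"  # form stored in df_final
no_id_url = "https://www.bjjheroes.com/"            # hypothetical URL with no ID

print(extract_id(relative_url))
print(extract_id(absolute_url))
print(extract_id(no_id_url))
```

Note that the ID comes back as a string, not an integer, which is fine for merging as long as both dataframes are treated the same way.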

Conclusion

We now have a full dataframe of all of the match results included on the BJJHeroes website that we can use for analysis.

Some cleaning is still required: merging on the ID created some duplicates, and some athletes appear twice on the A-Z list with different name spellings. Additionally, the rest of the data entered into the athlete stats tables is not uniformly formatted, but this will all be taken care of in another project.
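
One likely source of the extra rows is that a left merge repeats a match once for every row in the athlete list that shares its ID, so an athlete listed twice doubles their matches. The sketch below reproduces that effect on a tiny made-up dataset and shows one possible guard: dropping repeated IDs before merging and asking pandas to check the relationship with validate. The frames athletes and matches and their contents are invented for illustration.

```python
import pandas as pd

# Made-up data: athlete ID 1 appears twice with different spellings
athletes = pd.DataFrame({
    "ID": ["1", "1", "2"],
    "First Name": ["Jon", "John", "Ana"],
})
matches = pd.DataFrame({
    "ID": ["1", "1", "2"],
    "Result": ["W", "L", "W"],
})

# A plain left merge repeats each match of ID 1 for both spellings
naive = matches.merge(athletes, on="ID", how="left")
print(len(matches), len(naive))

# Keep the first row per ID so each match pairs with exactly one name
unique_athletes = athletes.drop_duplicates(subset="ID", keep="first")

# validate raises MergeError if the right side still has repeated IDs
clean = matches.merge(
    unique_athletes, on="ID", how="left", validate="many_to_one"
)
print(len(clean))
```

Keeping the first spelling is an arbitrary choice here; which spelling to keep for the real data is a cleaning decision this sketch does not settle.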

If you are interested in the data as of 9/09/2024, you can access it below.

Click here to download the CSV file