Using Python-Chess with Pandas for High-Volume PGN Parsing

The Lichess games database (linked here) is a collection of chess games played on Lichess, a popular online chess platform. The database is updated regularly, contains millions of games, and can be downloaded for analysis and study. It is a valuable resource for players of all levels who want to improve their game and learn from the strategies and tactics of other players.

Pandas is a popular open-source data manipulation and analysis library for Python. It provides easy-to-use data structures and data analysis tools for handling and manipulating structured data, including time series, tabular, and matrix data. Pandas also offers powerful tools for data cleaning, aggregation, filtering, and visualization, making it a popular choice for data scientists and analysts.

Python-chess is a Python library for working with chess: it can create and modify games, analyze positions, generate and parse chess notation, and communicate with chess engines.

Using these three resources in combination seems to be a promising way to explore chess in an analytically rigorous manner. I explore some of that here.

Open a PGN File with Multiple Games and Parse the Headers

You can use the pgn module of the python-chess library to read a large file that contains many games. The module provides a read_game function that reads one game per call, so calling it repeatedly on the same open file steps through the games one at a time. This makes it possible to process large files without loading the entire file into memory at once.
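
To make the one-game-at-a-time pattern concrete, here is a minimal sketch that wraps read_game in a generator. The helper name iter_games and the two tiny sample games are my own illustrations, not part of python-chess or the Lichess data; an in-memory string stands in for the large file.

```python
import io

import chess.pgn


def iter_games(handle):
    """Yield games one at a time; read_game returns None at end of input."""
    while True:
        game = chess.pgn.read_game(handle)
        if game is None:
            break
        yield game


# Two short made-up games standing in for a large PGN file (illustrative only).
SAMPLE_PGN = """[Event "Example A"]
[White "alice"]
[Black "bob"]
[Result "1-0"]

1. e4 e5 2. Qh5 Ke7 3. Qxe5# 1-0

[Event "Example B"]
[White "carol"]
[Black "dave"]
[Result "0-1"]

1. f3 e5 2. g4 Qh4# 0-1
"""

# Only one game is held in memory per iteration of the generator.
summary = [
    (game.headers["White"], game.headers["Result"], len(list(game.mainline_moves())))
    for game in iter_games(io.StringIO(SAMPLE_PGN))
]
print(summary)
```

With a real file, the same generator would be fed the handle from open(filepath) instead of the StringIO object.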

This first example uses the January 2013 Standard Rated games database from Lichess. The PGN file is 92.8 MB and includes 121,333 games and 8,155,187 individual moves. It takes over an hour for my computer to process.

import chess.pgn

filepath = '/home/ryan/Desktop/chess/lichess_pgns/lichess_db_standard_rated_2013-01.pgn'

with open(filepath) as f:
    game = chess.pgn.read_game(f)

# Iterating over the headers yields the tag names
header_list = list(game.headers)

header_list
['Event',
 'Site',
 'Date',
 'Round',
 'White',
 'Black',
 'Result',
 'BlackElo',
 'BlackRatingDiff',
 'ECO',
 'Opening',
 'Termination',
 'TimeControl',
 'UTCDate',
 'UTCTime',
 'WhiteElo',
 'WhiteRatingDiff']

import chess.pgn
import pandas as pd

def open_and_scrape_headers(filepath):
    """Read every game in the PGN file and return its headers as a DataFrame."""
    game_id = 0
    # header_list holds the tag names collected from the first game above
    g = pd.DataFrame(columns=['game_id'] + header_list)

    with open(filepath) as f:
        while True:
            game = chess.pgn.read_game(f)

            # If there are no more games, exit the loop
            if game is None:
                break

            game_id += 1

            # Use an empty string for any header missing from this game
            value_list = [game_id]
            for header in header_list:
                value_list.append(game.headers.get(header, ''))

            # Append the row; appending one row at a time is slow on large files
            g.loc[len(g)] = value_list

            if game_id % 20000 == 0:
                print('Now adding game_id: ' + str(game_id))

    return g

filepath = '/home/ryan/Desktop/chess/lichess_pgns/lichess_db_standard_rated_2013-01.pgn'
g = open_and_scrape_headers(filepath)
Now adding game_id: 20000
Now adding game_id: 40000
Now adding game_id: 60000
Now adding game_id: 80000
Now adding game_id: 100000
Now adding game_id: 120000
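
Much of the run time above likely comes from growing the DataFrame one row at a time. A common alternative, sketched here as an assumption rather than something I benchmarked on this file, is to collect plain dictionaries in a list and build the DataFrame once at the end. The helper name headers_to_frame and the sample rows are hypothetical; the function accepts any iterable of mappings with a get method, which includes the game.headers objects from python-chess.

```python
import pandas as pd


def headers_to_frame(header_maps, columns):
    """Build a DataFrame in one step from an iterable of header mappings."""
    rows = []
    for game_id, headers in enumerate(header_maps, start=1):
        row = {"game_id": game_id}
        for name in columns:
            # Missing headers become empty strings, as in the function above
            row[name] = headers.get(name, "")
        rows.append(row)
    # A single constructor call avoids repeated per-row appends
    return pd.DataFrame(rows, columns=["game_id"] + list(columns))


# Made-up header mappings standing in for game.headers (illustrative only).
sample_headers = [
    {"White": "alice", "Black": "bob", "Result": "1-0"},
    {"White": "carol", "Result": "0-1"},  # no Black tag
]
frame = headers_to_frame(sample_headers, ["White", "Black", "Result"])
print(frame)
```

One difference to be aware of: built this way, game_id comes out as an integer column rather than an object column.
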
g.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 121332 entries, 0 to 121331
Data columns (total 18 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   game_id          121332 non-null  object
 1   Event            121332 non-null  object
 2   Site             121332 non-null  object
 3   Date             121332 non-null  object
 4   Round            121332 non-null  object
 5   White            121332 non-null  object
 6   Black            121332 non-null  object
 7   Result           121332 non-null  object
 8   BlackElo         121332 non-null  object
 9   BlackRatingDiff  121332 non-null  object
 10  ECO              121332 non-null  object
 11  Opening          121332 non-null  object
 12  Termination      121332 non-null  object
 13  TimeControl      121332 non-null  object
 14  UTCDate          121332 non-null  object
 15  UTCTime          121332 non-null  object
 16  WhiteElo         121332 non-null  object
 17  WhiteRatingDiff  121332 non-null  object
dtypes: object(18)
memory usage: 17.6+ MB
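
Every column comes back as object (string) data, so ratings and timestamps need converting before any numeric analysis. The sketch below shows one way to do that on a small made-up frame. It assumes that unknown ratings appear as "?" and that UTCDate and UTCTime look like 2012.12.31 and 23:01:03; check both assumptions against the real file before relying on them.

```python
import pandas as pd

# Made-up rows that mimic the string-typed header columns (illustrative only).
sample = pd.DataFrame({
    "WhiteElo": ["1639", "1654", "?"],
    "BlackElo": ["1403", "?", "1500"],
    "UTCDate": ["2012.12.31", "2013.01.01", "2013.01.01"],
    "UTCTime": ["23:01:03", "00:04:12", "00:10:00"],
})

# errors="coerce" turns anything non-numeric (such as "?") into NaN
for col in ["WhiteElo", "BlackElo"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

# Combine the date and time strings into one datetime column (assumed format)
sample["utc_datetime"] = pd.to_datetime(
    sample["UTCDate"] + " " + sample["UTCTime"],
    format="%Y.%m.%d %H:%M:%S",
)
print(sample.dtypes)
```
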
g['Opening'].value_counts()
Van't Kruijs Opening                                    3995
Owen Defense                                            3192
Scandinavian Defense: Mieses-Kotroc Variation           2300
Modern Defense                                          1982
Horwitz Defense                                         1952
                                                        ... 
Benoni Defense: Czech Benoni Defense                       1
Vienna Game: Giraffe Attack                                1
English Defense: Perrin Variation                          1
Owen Defense: Wind Gambit                                  1
King's Pawn Game: Busch-Gass Gambit, Chiodini Gambit       1
Name: Opening, Length: 1846, dtype: int64
g[g['Opening'].str.contains('ondon')]['Opening'].value_counts()
Queen's Pawn Game: London System              124
Indian Game: London System                     30
London System                                  15
Scotch Game: Scotch Gambit, London Defense      5
Name: Opening, dtype: int64
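
Searching for 'ondon' sidesteps the capital letter, but it also matches the Scotch Game's London Defense, which is not a London System. As a small sketch on made-up data, str.contains accepts case=False for a case-insensitive match, and a more specific literal narrows the result to the London System itself.

```python
import pandas as pd

# A few opening names taken from the output above, plus one non-match.
openings = pd.Series([
    "Queen's Pawn Game: London System",
    "Indian Game: London System",
    "London System",
    "Scotch Game: Scotch Gambit, London Defense",
    "Owen Defense",
])

# Case-insensitive match on any opening that mentions London
any_london = openings[openings.str.contains("london", case=False, na=False)]

# Literal (non-regex) match restricted to the London System
london_system = openings[openings.str.contains("London System", regex=False, na=False)]

print(len(any_london), len(london_system))
```
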
g.to_csv('./2013-01_game_headers.csv',
         index=False)