Using Python-Chess with Pandas for High-Volume PGN Parsing
The Lichess games database (linked here) is a collection of chess games played on the Lichess website, which is a popular online chess platform. The database is regularly updated and contains millions of games, which can be downloaded in a variety of formats for analysis and study purposes. It is a valuable resource for chess players of all levels who want to improve their game and learn from the strategies and tactics of other players.
Pandas is a popular open-source data manipulation and analysis library for Python. It provides easy-to-use data structures and data analysis tools for handling and manipulating structured data, including time series, tabular, and matrix data. Pandas also offers powerful tools for data cleaning, aggregation, filtering, and visualization, making it a popular choice for data scientists and analysts.
Python-chess is a Python library for working with chess. It can create and modify games, analyze positions, generate and parse chess notation, and communicate with chess engines.
Using these three resources in combination seems to be a promising way to explore chess in an analytically rigorous manner. I explore some of that here.
Open a PGN File with Multiple Games and Parse the Headers
You can use the pgn module of the python-chess library (chess.pgn) to read a large file that contains many games. The module provides a read_game function that reads the next game from an open file each time it is called and returns None when no games are left, so you can loop over the games one at a time. This is useful for processing large files without loading the entire file into memory at once.
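As a minimal sketch of that read-one-game-at-a-time pattern, the snippet below wraps read_game in a generator. The two games in SAMPLE_PGN and the helper name iter_games are made up for illustration; an in-memory io.StringIO stands in for a real PGN file.

```python
import io
import chess.pgn

# Two tiny, made-up games standing in for a real multi-game PGN file.
SAMPLE_PGN = """[Event "Example A"]
[White "alice"]
[Black "bob"]
[Result "1-0"]

1. e4 e5 2. Nf3 Nc6 1-0

[Event "Example B"]
[White "carol"]
[Black "dave"]
[Result "1/2-1/2"]

1. d4 d5 1/2-1/2
"""


def iter_games(handle):
    """Yield games one at a time; read_game() returns None at end of input."""
    while True:
        game = chess.pgn.read_game(handle)
        if game is None:
            break
        yield game


# Only one game is held in memory at a time while the loop runs.
summary = [
    (game.headers["White"], game.headers["Black"], game.headers["Result"])
    for game in iter_games(io.StringIO(SAMPLE_PGN))
]
print(summary)
```

With a real file, the same generator can be given the handle returned by open(filepath).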
This first example uses the January 2013 Standard Rated games database from Lichess. The PGN file is 92.8 MB and includes 121,333 games and 8,155,187 individual moves. It takes over an hour for my computer to process.
import chess.pgn

filepath = '/home/ryan/Desktop/chess/lichess_pgns/lichess_db_standard_rated_2013-01.pgn'

# Read only the first game and collect its header names
with open(filepath) as f:
    game = chess.pgn.read_game(f)

header_list = []
for key in game.headers:
    header_list.append(key)

header_list
['Event',
'Site',
'Date',
'Round',
'White',
'Black',
'Result',
'BlackElo',
'BlackRatingDiff',
'ECO',
'Opening',
'Termination',
'TimeControl',
'UTCDate',
'UTCTime',
'WhiteElo',
'WhiteRatingDiff']
import chess.pgn
import pandas as pd

def open_and_scrape_headers(filepath):
    game_id = 0
    g = pd.DataFrame(columns=['game_id'] + header_list)
    with open(filepath) as f:
        while True:
            game = chess.pgn.read_game(f)
            # If there are no more games, exit the loop
            if game is None:
                break
            game_id += 1
            value_list = [game_id]
            for header in header_list:
                # Use an empty string when a game lacks this header
                value_list.append(game.headers.get(header, ''))
            g.loc[len(g)] = value_list
            if game_id % 20000 == 0:
                print('Now adding game_id: ' + str(game_id))
    return g
filepath = '/home/ryan/Desktop/chess/lichess_pgns/lichess_db_standard_rated_2013-01.pgn'
g = open_and_scrape_headers(filepath)
Now adding game_id: 20000
Now adding game_id: 40000
Now adding game_id: 60000
Now adding game_id: 80000
Now adding game_id: 100000
Now adding game_id: 120000
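Much of the hour-plus run time is probably spent in two places: read_game parses every move even though only the headers are kept, and appending with g.loc[len(g)] gets slower as the DataFrame grows. The sketch below is one possible alternative, not a benchmarked replacement. It uses chess.pgn.read_headers, which skips the move text, and builds the DataFrame once from a list of rows. The function name scrape_headers_fast and the sample games are hypothetical.

```python
import io
import chess.pgn
import pandas as pd

# Made-up sample; the second game deliberately lacks an Opening header.
SAMPLE_PGN = """[Event "Example A"]
[White "alice"]
[Black "bob"]
[Result "1-0"]
[Opening "Owen Defense"]

1. e4 b6 1-0

[Event "Example B"]
[White "carol"]
[Black "dave"]
[Result "0-1"]

1. d4 d5 0-1
"""


def scrape_headers_fast(handle, header_list):
    """Collect the requested headers for every game in an open PGN handle."""
    rows = []
    game_id = 0
    while True:
        # read_headers() returns only the headers and skips the moves
        headers = chess.pgn.read_headers(handle)
        if headers is None:
            break
        game_id += 1
        row = {"game_id": game_id}
        for header in header_list:
            row[header] = headers.get(header, "")  # empty string if missing
        rows.append(row)
    # Build the DataFrame once instead of growing it row by row
    return pd.DataFrame(rows, columns=["game_id"] + list(header_list))


df = scrape_headers_fast(io.StringIO(SAMPLE_PGN), ["White", "Black", "Opening"])
print(df)
```

One difference from the function above: game_id comes out as an integer column here rather than object, because the DataFrame is built from typed rows.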
g.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 121332 entries, 0 to 121331
Data columns (total 18 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   game_id          121332 non-null  object
 1   Event            121332 non-null  object
 2   Site             121332 non-null  object
 3   Date             121332 non-null  object
 4   Round            121332 non-null  object
 5   White            121332 non-null  object
 6   Black            121332 non-null  object
 7   Result           121332 non-null  object
 8   BlackElo         121332 non-null  object
 9   BlackRatingDiff  121332 non-null  object
 10  ECO              121332 non-null  object
 11  Opening          121332 non-null  object
 12  Termination      121332 non-null  object
 13  TimeControl      121332 non-null  object
 14  UTCDate          121332 non-null  object
 15  UTCTime          121332 non-null  object
 16  WhiteElo         121332 non-null  object
 17  WhiteRatingDiff  121332 non-null  object
dtypes: object(18)
memory usage: 17.6+ MB
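Every column comes back as object because PGN header values are strings. Before doing arithmetic on ratings or sorting by time, the relevant columns need converting. The sketch below shows one way on a small made-up frame. It assumes unknown ratings appear as "?" or an empty string and that UTCDate uses the YYYY.MM.DD form, so check both against the real data first.

```python
import pandas as pd

# Hypothetical sample rows using the same column names as the headers above
sample = pd.DataFrame({
    "WhiteElo": ["1639", "?", "1500"],
    "BlackElo": ["1403", "1200", ""],
    "UTCDate": ["2013.01.01", "2013.01.01", "2013.01.02"],
    "UTCTime": ["00:00:05", "00:01:10", "12:30:00"],
})

# errors="coerce" turns anything that is not a number into NaN
for col in ["WhiteElo", "BlackElo"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

# Combine date and time into one datetime column (the format is an assumption)
sample["utc_datetime"] = pd.to_datetime(
    sample["UTCDate"] + " " + sample["UTCTime"],
    format="%Y.%m.%d %H:%M:%S",
    errors="coerce",
)

print(sample.dtypes)
```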
g['Opening'].value_counts()
Van't Kruijs Opening 3995
Owen Defense 3192
Scandinavian Defense: Mieses-Kotroc Variation 2300
Modern Defense 1982
Horwitz Defense 1952
...
Benoni Defense: Czech Benoni Defense 1
Vienna Game: Giraffe Attack 1
English Defense: Perrin Variation 1
Owen Defense: Wind Gambit 1
King's Pawn Game: Busch-Gass Gambit, Chiodini Gambit 1
Name: Opening, Length: 1846, dtype: int64
g[g['Opening'].str.contains('ondon')]['Opening'].value_counts()
Queen's Pawn Game: London System 124
Indian Game: London System 30
London System 15
Scotch Game: Scotch Gambit, London Defense 5
Name: Opening, dtype: int64
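Searching for 'ondon' is a way to match both "London" and "london" without worrying about the first letter. A more explicit option is the case and na parameters of str.contains, sketched here on a small made-up Series (the lowercase entry and the missing value are hypothetical, added to show the behavior).

```python
import pandas as pd

openings = pd.Series(
    [
        "Queen's Pawn Game: London System",
        "Indian Game: London System",
        "london system",  # hypothetical lowercase entry
        "Owen Defense",
        None,             # hypothetical missing value
    ],
    name="Opening",
)

# case=False ignores letter case; na=False treats missing values as non-matches
mask = openings.str.contains("london", case=False, na=False)
print(openings[mask].value_counts())
```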
g.to_csv('./2013-01_game_headers.csv', index=False)