footix.data_io package

Submodules

footix.data_io.base_scrapper module

class footix.data_io.base_scrapper.Scraper(path, mapping_teams)[source]

Bases: object

Parameters:
base_url: str = ''
scraper_name: str | None = None
classmethod competitions()[source]
Return type:

list[str]

replace_name_team(df, columns)[source]
Parameters:
Return type:

DataFrame

get(url)[source]
Parameters:

url (str)

Return type:

str

static manage_path(path)[source]
Parameters:

path (str)

Return type:

Path

footix.data_io.data_reader module

class footix.data_io.data_reader.DataProtocol(*args, **kwargs)[source]

Bases: Protocol

Protocol for data readers.

class footix.data_io.data_reader.MatchupResult(home_team, away_team, result, away_goals, home_goals)[source]

Bases: object

A dataclass representing the result of a football match.

Parameters:
home_team

The name of the home team.

Type:

str

away_team

The name of the away team.

Type:

str

result

The final result of the match: (‘H’ for Home Win, ‘A’ for Away Win, ‘D’ for Draw).

Type:

str

away_goals

The number of goals scored by the away team.

Type:

float

home_goals

The number of goals scored by the home team.

Type:

float

home_team: str
away_team: str
result: str
away_goals: float
home_goals: float
static from_dict(dict_row)[source]

Factory method to create a MatchupResult object from a dictionary row.

Parameters

dict_row (dict): A dictionary containing the match results with keys:

  • ‘HomeTeam’: The name of the home team.

  • ‘AwayTeam’: The name of the away team.

  • ‘FTR’: The final result (‘H’ for Home Win, ‘A’ for Away Win, ‘D’ for Draw).

  • ‘FTAG’: The number of goals scored by the away team.

  • ‘FTHG’: The number of goals scored by the home team.

Returns

MatchupResult: An instance of the MatchupResult class populated with data from the dictionary row.

Parameters:

dict_row (dict)

Return type:

MatchupResult
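The key-to-field mapping documented for from_dict can be sketched roughly as follows. This is a stand-in dataclass that mirrors the documented fields, not the footix class itself; the float() conversion and the sample row are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MatchupResultSketch:
    """Stand-in mirroring the documented MatchupResult fields (not the footix class)."""

    home_team: str
    away_team: str
    result: str
    away_goals: float
    home_goals: float

    @staticmethod
    def from_dict(dict_row: dict) -> "MatchupResultSketch":
        # Map the documented dictionary keys onto the dataclass fields.
        return MatchupResultSketch(
            home_team=dict_row["HomeTeam"],
            away_team=dict_row["AwayTeam"],
            result=dict_row["FTR"],
            away_goals=float(dict_row["FTAG"]),  # float() conversion is an assumption
            home_goals=float(dict_row["FTHG"]),
        )


# Made-up row, used only to exercise the sketch.
row = {"HomeTeam": "Arsenal", "AwayTeam": "Chelsea", "FTR": "H", "FTAG": 1, "FTHG": 2}
match = MatchupResultSketch.from_dict(row)
```

A missing key raises KeyError in this sketch; how the real factory handles incomplete rows is not stated here.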

class footix.data_io.data_reader.EloDataReader(df_data)[source]

Bases: DataProtocol

Parameters:

df_data (DataFrame)

unique_teams()[source]
Return type:

list[str]

footix.data_io.footballdata module

Module for scraping and processing footballdata.co.uk data.

This module contains the ScrapFootballData class, which is responsible for downloading, storing, and preprocessing football match data from football-data.co.uk. It includes methods for data sanitization, team name mapping, and fixture retrieval.

Classes:

ScrapFootballData: Handles the scraping and processing of football match data.

Functions:

_process_season(season: str) -> str: Processes a season string into a standardized format.

class footix.data_io.footballdata.ScrapFootballData(competition, season, path, force_reload=False, mapping_teams=None)[source]

Bases: Scraper

Scraper for downloading and processing football match data from football-data.co.uk.

This class handles the retrieval, local storage, and preprocessing of football match data for a given competition and season. It supports automatic downloading, file management, column sanitization, and team name mapping.

Parameters:
  • competition (str) – The competition code (e.g., ‘E0’ for Premier League).

  • season (str) – The season string (e.g., ‘2020/2021’, ‘2020-2021’, or ‘2021’).

  • path (str) – Directory path to store the downloaded CSV files.

  • force_reload (bool, optional) – If True, forces re-download of data even if file exists.

  • mapping_teams (dict[str, str] | None, optional) – Optional mapping for team name normalization.
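One plausible reading of the mapping_teams parameter is a lookup table from source-specific team names to canonical ones. The sketch below illustrates that idea on plain dictionaries instead of a DataFrame to stay dependency-free; the helper name replace_team_names, the column names, and the sample data are all hypothetical, and the library's own replacement logic may differ.

```python
def replace_team_names(rows, columns, mapping_teams=None):
    """Return copies of rows with names in `columns` replaced via mapping_teams."""
    mapping = mapping_teams or {}
    normalized = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            # Names that are missing from the mapping are kept unchanged.
            new_row[col] = mapping.get(row[col], row[col])
        normalized.append(new_row)
    return normalized


# Made-up rows and mapping, used only to exercise the sketch.
rows = [{"home_team": "Man United", "away_team": "Spurs"}]
mapping = {"Man United": "Manchester United", "Spurs": "Tottenham"}
normalized = replace_team_names(rows, ["home_team", "away_team"], mapping)
```

Falling back to the original name keeps the mapping optional: passing None leaves every row as it was.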

base_url

Base URL for football-data.co.uk.

Type:

str

scraper_name

Name identifier for the scraper.

Type:

str

competition

Competition code.

Type:

str

season

Processed season string.

Type:

str

path

Path object for data storage.

Type:

Path

force_reload

Whether to force data reload.

Type:

bool

infered_url

Constructed URL for the CSV file.

Type:

str

df

Loaded and processed match data.

Type:

pd.DataFrame

download()[source]

Downloads and saves the competition data as a CSV file.

Return type:

None

load() -> pd.DataFrame[source]

Loads the data from file or downloads if not present.

Return type:

DataFrame

sanitize_columns()[source]

Converts DataFrame columns to snake_case.

get_fixtures() -> pd.DataFrame[source]

Returns the processed match data.

Return type:

DataFrame

base_url: str = 'https://www.football-data.co.uk/mmz4281/'
scraper_name: str | None = 'footballdata'
download()[source]

Download the competition data and save it as a CSV file.

Return type:

None

load()[source]

Load the CSV for the configured competition and season into a pandas DataFrame.

If a file named “{competition}_{season}.csv” exists under self.path and self.force_reload is False, it is loaded with pandas.read_csv. Otherwise self.download() is invoked to (re)create the CSV, which is then read.

Returns:

The loaded dataset.

Return type:

pd.DataFrame

Raises:

Notes

Relies on the instance attributes self.path (Path or str), self.competition (str), self.season (str), and self.force_reload (bool). This method may have the side effect of calling self.download().
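The load-or-download behaviour described above can be sketched as a small cache check. This is a minimal illustration under stated assumptions, not the library's code: it reads rows with the standard csv module rather than pandas.read_csv, and the load_or_download helper and its download callback are hypothetical names.

```python
import csv
from pathlib import Path
from typing import Callable


def load_or_download(path, competition, season, force_reload,
                     download: Callable[[Path], None]):
    """Read the cached CSV, calling download(csv_path) first when needed."""
    csv_path = Path(path) / f"{competition}_{season}.csv"
    if force_reload or not csv_path.exists():
        # Cache miss or forced refresh: (re)create the file before reading it.
        download(csv_path)
    with csv_path.open(newline="") as handle:
        return list(csv.DictReader(handle))
```

With force_reload left False, a second call for the same competition and season reads the existing file and never invokes the download callback.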

sanitize_columns()[source]

Convert DataFrame columns to snake_case.

get_fixtures()[source]

Return the processed match data DataFrame.

Returns:

The DataFrame containing match data.

Return type:

pd.DataFrame

footix.data_io.prediction_export module

Prediction export utilities for model predictions.

This module transforms model outputs into a normalized JSON-compatible structure for prediction record consumers.

footix.data_io.prediction_export.build_prediction_records_from_predictions(fixtures, goal_matrices, samples, payload_metadata=None, team_normalizer=None, confidence_gamma=0.7)[source]

Build prediction records from existing prediction artifacts.

Parameters:
  • fixtures (Sequence[Mapping[str, Any]]) – Raw fixtures payload from odds JSON.

  • goal_matrices (Mapping[str, GoalMatrix]) – Mapping from match key to score matrix predictions.

  • samples (Mapping[str, SampleProbaResult]) – Mapping from match key to posterior probability samples.

  • payload_metadata (Mapping[str, Any] | None) – Optional metadata extracted from odds payload.

  • team_normalizer (Callable[[str], str] | None) – Optional callable for team-name normalization.

  • confidence_gamma (float | None)

Returns:

Tuple of valid records and technical error reports.

Return type:

tuple[list[dict[str, Any]], list[dict[str, str]]]

footix.data_io.prediction_export.export_prediction_records_from_model(model, fixtures, payload_metadata=None, team_normalizer=None, predict_kwargs=None, sample_kwargs=None, confidence_gamma=0.7)[source]

Compute predictions from a model and export prediction records.

Parameters:
  • model (PredictionExportModel) – Predictive model supporting predict/get_samples.

  • fixtures (Sequence[Mapping[str, Any]]) – Raw fixtures payload from odds JSON.

  • payload_metadata (Mapping[str, Any] | None) – Optional metadata extracted from odds payload.

  • team_normalizer (Callable[[str], str] | None) – Optional callable for team-name normalization.

  • predict_kwargs (Mapping[str, Any] | None) – Optional extra kwargs forwarded to predict.

  • sample_kwargs (Mapping[str, Any] | None) – Optional extra kwargs forwarded to get_samples.

  • confidence_gamma (float | None)

Returns:

Tuple of valid records and technical error reports.

Return type:

tuple[list[dict[str, Any]], list[dict[str, str]]]

footix.data_io.understat module

exception footix.data_io.understat.ShotDataNotFound[source]

Bases: RuntimeError

Raised when the expected shotsData <script> block is not present.

exception footix.data_io.understat.FixtureDataNotFound[source]

Bases: RuntimeError

Raised when the fixture data are not present.

class footix.data_io.understat.ScrapUnderstat(competition, season, path, force_reload=False, mapping_teams=None)[source]

Bases: Scraper

Scraper for downloading and processing football match data from understat.com. This class is heavily inspired by (and partly copied from) its counterpart in penaltyblog: https://github.com/martineastwood/penaltyblog

This class retrieves, parses, and processes football match data for a given competition and season from Understat. It extracts fixture details, expected goals (xG), forecasts, and normalizes team names. The data is returned as a processed pandas DataFrame.

Parameters:
  • competition (str) – The competition code (e.g., ‘EPL’ for Premier League).

  • season (str) – The season string (e.g., ‘2020/2021’, ‘2020-2021’, or ‘2021’).

  • path (str) – Directory path for any required file operations.

  • force_reload (bool, optional) – If True, forces re-download or reprocessing of data.

  • mapping_teams (dict[str, str] | None, optional) – Optional mapping for team name normalization.

base_url

Base URL for understat.com.

Type:

str

scraper_name

Name identifier for the scraper.

Type:

str

season

Processed season string.

Type:

str

force_reload

Whether to force data reload.

Type:

bool

slug

Slug for the competition used in URL construction.

Type:

str

sanitize_columns(df)[source]

Converts DataFrame columns to snake_case.

Parameters:

df (DataFrame)

get_fixtures() -> pd.DataFrame[source]

Downloads, parses, and returns processed match data.

Return type:

DataFrame

_process_season(season: str) -> str: Processes the season string for URL usage.

base_url: str = 'https://understat.com/'
scraper_name: str | None = 'understat'
static sanitize_columns(df)[source]
Parameters:

df (DataFrame)

get_fixtures()[source]

Downloads and processes match fixtures using Understat’s API.

Uses the /getLeagueData/ API endpoint which requires specific headers.

Returns:

Processed fixtures with match details, xG, and forecasts.

Return type:

pd.DataFrame

Raises:

FixtureDataNotFound – If no fixture data is found in the API response.

get_shots(understat_id)[source]
Parameters:

understat_id (str)

Return type:

DataFrame
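The ShotDataNotFound exception documented earlier in this module suggests that shot data is pulled out of a shotsData <script> block. A rough sketch of such an extraction follows; the page layout (a hex-escaped JSON.parse('...') payload), the regular expression, and the extract_shots helper are assumptions for illustration, and the exception class is a local stand-in rather than the footix one.

```python
import json
import re


class ShotDataNotFound(RuntimeError):
    """Local stand-in for footix.data_io.understat.ShotDataNotFound."""


# Assumed page layout: <script>var shotsData = JSON.parse('...')</script>
_SHOTS_RE = re.compile(r"shotsData\s*=\s*JSON\.parse\('(.*?)'\)", re.DOTALL)


def extract_shots(html: str):
    """Return the decoded shotsData payload, or raise if the block is absent."""
    match = _SHOTS_RE.search(html)
    if match is None:
        raise ShotDataNotFound("shotsData <script> block not found")
    # The payload is assumed to be hex-escaped (for example \x7B for "{").
    decoded = match.group(1).encode("utf-8").decode("unicode_escape")
    return json.loads(decoded)
```

Raising a dedicated exception lets callers distinguish a page without shot data from a network or parsing failure.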

footix.data_io.utils_scrapper module

footix.data_io.utils_scrapper.check_competition_exists(competition)[source]

Check if the competition exists in the MAPPING_COMPETITIONS dictionary.

Parameters:

competition (str) – The name of the competition to check.

Returns:

True if the competition exists, False otherwise.

Return type:

bool

footix.data_io.utils_scrapper.process_string(input_string)[source]
footix.data_io.utils_scrapper.to_snake_case(name)[source]

Convert the string name into a snake case string. Shamelessly copied from: https://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case

Parameters:

name (str) – the name to convert

Returns:

the name in snake case

Return type:

str
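For reference, the two-pass regular expression approach commonly given in the linked Stack Overflow question looks roughly like this. The function name to_snake_case_sketch is hypothetical, and the footix implementation may differ in detail.

```python
import re


def to_snake_case_sketch(name: str) -> str:
    """Convert CamelCase or mixedCase to snake_case with two regex passes."""
    # Pass 1: split before a capitalised word, e.g. "PResponse" -> "P_Response".
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    # Pass 2: split a lowercase letter or digit followed by an uppercase letter.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


print(to_snake_case_sketch("HomeTeam"))  # home_team
print(to_snake_case_sketch("FTHG"))      # fthg
```

All-uppercase names such as FTHG contain no lowercase boundary, so they are only lowercased.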

footix.data_io.utils_scrapper.add_match_id(df)[source]

Add a stable match_id column in the form “Home - Away - YYYY-MM-DD”.

This normalizes the date formatting so match ids are consistent across scrapers that use different date string formats.

Parameters:

df (DataFrame)

Return type:

DataFrame
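The date normalization behind a stable match id can be illustrated with the standard library. This is a sketch under assumptions: the make_match_id helper and the list of accepted input formats are hypothetical, and it works on single values rather than DataFrame columns.

```python
from datetime import datetime

# Candidate input formats are assumptions for this sketch.
_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S")


def make_match_id(home: str, away: str, date: str) -> str:
    """Build 'Home - Away - YYYY-MM-DD' from a date string in a known format."""
    for fmt in _DATE_FORMATS:
        try:
            parsed = datetime.strptime(date, fmt)
        except ValueError:
            continue  # Try the next candidate format.
        return f"{home} - {away} - {parsed:%Y-%m-%d}"
    raise ValueError(f"Unrecognised date format: {date!r}")
```

Because every input is re-rendered as YYYY-MM-DD, the same fixture yields the same id whether the source wrote the date as 22/08/2021 or 2021-08-22 16:30:00.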

footix.data_io.utils_scrapper.canonicalize_matches_df(df, *, require_columns=None)[source]

Canonicalize a match dataframe.

Ensures date parsing, required columns present, sorts by date and adds a stable match_id.

Parameters:
  • df (DataFrame) – Input dataframe with match rows.

  • require_columns (list[str] | None) – List of columns that must be present (defaults to minimal match columns).

Returns:

The canonicalized dataframe.

Return type:

DataFrame
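The canonicalization steps listed above (column check, date parsing, sorting, match id) can be sketched on a list of dictionaries. Everything named here is an assumption for illustration: the canonicalize_rows helper, the default column names, and the ISO date input; the real function operates on a DataFrame and its default required columns are not listed in this reference.

```python
from datetime import date

# Assumed minimal columns; this sketch always needs them to build match_id.
DEFAULT_REQUIRED = ("date", "home_team", "away_team")


def canonicalize_rows(rows, require_columns=None):
    """Validate columns, parse ISO dates, sort by date, and add match_id."""
    required = set(DEFAULT_REQUIRED) | set(require_columns or ())
    out = []
    for row in rows:
        missing = sorted(col for col in required if col not in row)
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
        parsed = date.fromisoformat(row["date"])
        match_id = f"{row['home_team']} - {row['away_team']} - {parsed.isoformat()}"
        out.append({**row, "date": parsed, "match_id": match_id})
    # sorted() is stable, so same-day matches keep their input order.
    return sorted(out, key=lambda r: r["date"])
```

Validating before parsing means a malformed input fails with a message naming the missing columns instead of a KeyError deep in the date handling.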

Module contents

Data input/output utilities for football data sources.

This module provides interfaces and implementations for scraping and reading football data from multiple sources (football-data.co.uk, Understat, etc.).

Submodules:
  • footballdata: football-data.co.uk scraper

  • understat: Understat.com scraper

  • data_reader: Generic data reading utilities

  • base_scrapper: Base classes for data scrapers

  • utils_scrapper: Scraper utility functions

  • prediction_export: Prediction export utilities

class footix.data_io.ScrapFootballData(competition, season, path, force_reload=False, mapping_teams=None)[source]

Bases: Scraper

Scraper for downloading and processing football match data from football-data.co.uk.

This class handles the retrieval, local storage, and preprocessing of football match data for a given competition and season. It supports automatic downloading, file management, column sanitization, and team name mapping.

Parameters:
  • competition (str) – The competition code (e.g., ‘E0’ for Premier League).

  • season (str) – The season string (e.g., ‘2020/2021’, ‘2020-2021’, or ‘2021’).

  • path (str) – Directory path to store the downloaded CSV files.

  • force_reload (bool, optional) – If True, forces re-download of data even if file exists.

  • mapping_teams (dict[str, str] | None, optional) – Optional mapping for team name normalization.

base_url

Base URL for football-data.co.uk.

Type:

str

scraper_name

Name identifier for the scraper.

Type:

str

competition

Competition code.

Type:

str

season

Processed season string.

Type:

str

path

Path object for data storage.

Type:

Path

force_reload

Whether to force data reload.

Type:

bool

infered_url

Constructed URL for the CSV file.

Type:

str

df

Loaded and processed match data.

Type:

pd.DataFrame

download()[source]

Downloads and saves the competition data as a CSV file.

Return type:

None

load() -> pd.DataFrame[source]

Loads the data from file or downloads if not present.

Return type:

DataFrame

sanitize_columns()[source]

Converts DataFrame columns to snake_case.

get_fixtures() -> pd.DataFrame[source]

Returns the processed match data.

Return type:

DataFrame

base_url: str = 'https://www.football-data.co.uk/mmz4281/'
scraper_name: str | None = 'footballdata'
download()[source]

Download the competition data and save it as a CSV file.

Return type:

None

load()[source]

Load the CSV for the configured competition and season into a pandas DataFrame.

If a file named “{competition}_{season}.csv” exists under self.path and self.force_reload is False, it is loaded with pandas.read_csv. Otherwise self.download() is invoked to (re)create the CSV, which is then read.

Returns:

The loaded dataset.

Return type:

pd.DataFrame

Raises:

Notes

Relies on the instance attributes self.path (Path or str), self.competition (str), self.season (str), and self.force_reload (bool). This method may have the side effect of calling self.download().

sanitize_columns()[source]

Convert DataFrame columns to snake_case.

get_fixtures()[source]

Return the processed match data DataFrame.

Returns:

The DataFrame containing match data.

Return type:

pd.DataFrame

class footix.data_io.ScrapUnderstat(competition, season, path, force_reload=False, mapping_teams=None)[source]

Bases: Scraper

Scraper for downloading and processing football match data from understat.com. This class is heavily inspired by (and partly copied from) its counterpart in penaltyblog: https://github.com/martineastwood/penaltyblog

This class retrieves, parses, and processes football match data for a given competition and season from Understat. It extracts fixture details, expected goals (xG), forecasts, and normalizes team names. The data is returned as a processed pandas DataFrame.

Parameters:
  • competition (str) – The competition code (e.g., ‘EPL’ for Premier League).

  • season (str) – The season string (e.g., ‘2020/2021’, ‘2020-2021’, or ‘2021’).

  • path (str) – Directory path for any required file operations.

  • force_reload (bool, optional) – If True, forces re-download or reprocessing of data.

  • mapping_teams (dict[str, str] | None, optional) – Optional mapping for team name normalization.

base_url

Base URL for understat.com.

Type:

str

scraper_name

Name identifier for the scraper.

Type:

str

season

Processed season string.

Type:

str

force_reload

Whether to force data reload.

Type:

bool

slug

Slug for the competition used in URL construction.

Type:

str

sanitize_columns(df)[source]

Converts DataFrame columns to snake_case.

Parameters:

df (DataFrame)

get_fixtures() -> pd.DataFrame[source]

Downloads, parses, and returns processed match data.

Return type:

DataFrame

_process_season(season: str) -> str: Processes the season string for URL usage.

base_url: str = 'https://understat.com/'
scraper_name: str | None = 'understat'
static sanitize_columns(df)[source]
Parameters:

df (DataFrame)

get_fixtures()[source]

Downloads and processes match fixtures using Understat’s API.

Uses the /getLeagueData/ API endpoint which requires specific headers.

Returns:

Processed fixtures with match details, xG, and forecasts.

Return type:

pd.DataFrame

Raises:

FixtureDataNotFound – If no fixture data is found in the API response.

get_shots(understat_id)[source]
Parameters:

understat_id (str)

Return type:

DataFrame

footix.data_io.build_prediction_records_from_predictions(fixtures, goal_matrices, samples, payload_metadata=None, team_normalizer=None, confidence_gamma=0.7)[source]

Build prediction records from existing prediction artifacts.

Parameters:
  • fixtures (Sequence[Mapping[str, Any]]) – Raw fixtures payload from odds JSON.

  • goal_matrices (Mapping[str, GoalMatrix]) – Mapping from match key to score matrix predictions.

  • samples (Mapping[str, SampleProbaResult]) – Mapping from match key to posterior probability samples.

  • payload_metadata (Mapping[str, Any] | None) – Optional metadata extracted from odds payload.

  • team_normalizer (Callable[[str], str] | None) – Optional callable for team-name normalization.

  • confidence_gamma (float | None)

Returns:

Tuple of valid records and technical error reports.

Return type:

tuple[list[dict[str, Any]], list[dict[str, str]]]

footix.data_io.export_prediction_records_from_model(model, fixtures, payload_metadata=None, team_normalizer=None, predict_kwargs=None, sample_kwargs=None, confidence_gamma=0.7)[source]

Compute predictions from a model and export prediction records.

Parameters:
  • model (PredictionExportModel) – Predictive model supporting predict/get_samples.

  • fixtures (Sequence[Mapping[str, Any]]) – Raw fixtures payload from odds JSON.

  • payload_metadata (Mapping[str, Any] | None) – Optional metadata extracted from odds payload.

  • team_normalizer (Callable[[str], str] | None) – Optional callable for team-name normalization.

  • predict_kwargs (Mapping[str, Any] | None) – Optional extra kwargs forwarded to predict.

  • sample_kwargs (Mapping[str, Any] | None) – Optional extra kwargs forwarded to get_samples.

  • confidence_gamma (float | None)

Returns:

Tuple of valid records and technical error reports.

Return type:

tuple[list[dict[str, Any]], list[dict[str, str]]]