Treeffuser

Module contents

class Treeffuser(n_repeats: int = 30, n_estimators: int = 3000, early_stopping_rounds: int | None = 50, eval_percent: float = 0.1, num_leaves: int = 31, max_depth: int = -1, learning_rate: float = 0.1, max_bin: int = 255, subsample_for_bin: int = 200000, min_child_samples: int = 20, subsample: float = 1.0, subsample_freq: int = 0, n_jobs: int = -1, sde_name: str = 'vesde', sde_initialize_from_data: bool = False, sde_hyperparam_min: float | Literal['default'] | None = None, sde_hyperparam_max: float | Literal['default'] | None = None, seed: int | None = None, verbose: int = 0, extra_lightgbm_params: dict | None = None)

Bases: BaseTabularDiffusion

n_repeats : int

How many times to repeat the training dataset when fitting the score. That is, how many noisy versions of a point to generate for training.

n_estimators : int

LightGBM: Number of boosting iterations.

early_stopping_rounds : int or None

LightGBM: If None, no early stopping is performed. Otherwise, the model will stop training if no improvement is observed in the validation set for early_stopping_rounds consecutive iterations.

eval_percent : float

LightGBM: Fraction of the training data (between 0 and 1) held out for validation when early_stopping_rounds is not None.

num_leaves : int

LightGBM: Maximum tree leaves for base learners.

max_depth : int

LightGBM: Maximum tree depth for base learners, <=0 means no limit.

learning_rate : float

LightGBM: Boosting learning rate.

max_bin : int

LightGBM: Maximum number of bins into which feature values are bucketed. Used by LightGBM's histogram binning algorithm.

subsample_for_bin : int

LightGBM: Number of samples for constructing bins.

min_child_samples : int

LightGBM: Minimum number of data points required in a child (leaf). A split that would leave a child with fewer points is not made.

subsample : float

LightGBM: Subsample ratio of the training instance.

subsample_freq : int

LightGBM: Subsampling frequency, i.e. how often (in boosting iterations) to subsample the training data. A value <=0 disables subsampling.

n_jobs : int

LightGBM: Number of parallel threads. If set to -1, the number is set to the number of available cores.

sde_name : str

SDE: Name of the SDE to use. See treeffuser.sde.get_diffusion_sde for available SDEs.

sde_initialize_from_data : bool

SDE: Whether to initialize the SDE from the data. If True, the SDE hyperparameters are initialized with a heuristic based on the data (see treeffuser.sde.initialize.py). Otherwise, sde_hyperparam_min and sde_hyperparam_max are used. (default: False)

sde_hyperparam_min : float or "default"

SDE: The scale of the SDE at t=0 (see VESDE, VPSDE, SubVPSDE).

sde_hyperparam_max : float or "default"

SDE: The scale of the SDE at t=T (see VESDE, VPSDE, SubVPSDE).

seed : int

Random seed for generating the training data and fitting the model.

verbose : int

Verbosity of the score model.
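To make n_repeats and the two SDE scale hyperparameters concrete, the sketch below builds a noisy training set in plain Python. It assumes a variance-exploding schedule sigma(t) = sigma_min * (sigma_max / sigma_min) ** t on t in [0, 1], which is the usual form of a VE SDE. The function name make_noisy_copies, the default scales, and the exact perturbation are illustrative assumptions, not Treeffuser's internal code.

```python
import random


def make_noisy_copies(y, n_repeats=30, sigma_min=0.01, sigma_max=20.0, seed=0):
    """Return (y_noisy, t, z) rows: n_repeats noisy versions of each y value.

    Assumed schedule (illustrative): sigma(t) = sigma_min * (sigma_max / sigma_min) ** t.
    """
    rng = random.Random(seed)
    rows = []
    for value in y:
        for _ in range(n_repeats):
            t = rng.random()  # diffusion time drawn uniformly from [0, 1)
            sigma = sigma_min * (sigma_max / sigma_min) ** t  # noise scale at time t
            z = rng.gauss(0.0, 1.0)  # standard normal noise
            rows.append((value + sigma * z, t, z))
    return rows


rows = make_noisy_copies([1.0, 2.0, 3.0], n_repeats=30)
print(len(rows))  # 3 points x 30 repeats = 90 training rows
```

Under these assumptions the training set grows by a factor of n_repeats, so larger values give the score model more noisy views of each point at the cost of longer fitting.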

compute_nll(X: Float[ndarray, 'batch x_dim'], y: Float[ndarray, 'batch y_dim'], n_samples: int = 10, bandwidth: float | Literal['scott', 'silverman'] = 1.0, verbose: bool = False) -> float

Compute the negative log likelihood, -sum_{(y, x) in [y, X]} log p(y|x), where p is the conditional distribution learned by the model.

Parameters:
  • X (np.ndarray) – Input data with shape (batch, x_dim).

  • y (np.ndarray) – Target data with shape (batch, y_dim).

  • n_samples (int, optional) – Number of samples to draw if computing the negative log likelihood from samples. Default is 10.

  • bandwidth (Union[float, Literal["scott", "silverman"]], optional) – The bandwidth of the kernel. If a float, it is used directly as the kernel bandwidth. If a string, it must be "scott" or "silverman" and selects the bandwidth estimation method. Default is 1.0.

  • verbose (bool, optional) – If True, displays a progress bar for the sampling. Default is False.

Returns:

The computed negative log likelihood value.

Return type:

float

Note

The method handles 2D inputs ([“batch x_dim”], [“batch y_dim”]) by convention, but also works with 1D inputs ([“batch”]) for single-dimensional data.
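As a rough illustration of a sample-based negative log likelihood, the sketch below fits a one-dimensional Gaussian kernel density estimate to draws for each input and evaluates it at the observed target. The helper names and the random stand-in for model samples are assumptions for illustration; the library's own computation may differ in detail.

```python
import math
import random


def gaussian_kde_logpdf(y, samples, bandwidth):
    """Log density at y of a 1-D Gaussian KDE built on `samples`."""
    log_norm = math.log(bandwidth * math.sqrt(2.0 * math.pi))
    log_terms = [-0.5 * ((y - s) / bandwidth) ** 2 - log_norm for s in samples]
    m = max(log_terms)
    # log-sum-exp for numerical stability, then average over the samples
    return m + math.log(sum(math.exp(v - m) for v in log_terms)) - math.log(len(samples))


def nll_from_samples(ys, samples_per_point, bandwidth=1.0):
    """Negative log likelihood: -sum_i log p_hat(y_i | x_i)."""
    return -sum(
        gaussian_kde_logpdf(y, samples, bandwidth)
        for y, samples in zip(ys, samples_per_point)
    )


rng = random.Random(0)
ys = [0.0, 1.0]
# Stand-in for model draws: 10 samples per input, mirroring n_samples=10.
samples_per_point = [[rng.gauss(y, 1.0) for _ in range(10)] for y in ys]
nll = nll_from_samples(ys, samples_per_point, bandwidth=1.0)
```

With few samples the estimate is sensitive to the bandwidth, which is one reason to try several values or the "scott" and "silverman" rules.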

fit(X: Float[ndarray, 'batch x_dim'] | DataFrame, y: Float[ndarray, 'batch y_dim'] | Series | DataFrame, cat_idx: list[int] | None = None)

Fit the conditional diffusion model to the tabular data (X, y).

Parameters:
  • X (np.ndarray or pd.DataFrame) – Input data with shape (batch, x_dim).

  • y (np.ndarray or pd.Series or pd.DataFrame) – Target data with shape (batch, y_dim).

  • cat_idx (List[int], optional) – If X is a np.ndarray, list of indices of categorical features in X. If X is a DataFrame, setting cat_idx will raise an error. Instead, ensure that the categorical columns have dtype category, and they will be automatically detected as categorical features. E.g., X['column_name'] = X['column_name'].astype('category'). Default is None.

Returns:

self – The fitted model.

Return type:

TabularDiffusion

Note

The method handles 2D inputs ([“batch x_dim”], [“batch y_dim”]) by convention, but also works with 1D inputs ([“batch”]) for single-dimensional data.

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_new_score_model() -> ScoreModel

Return the score model.

get_new_sde() -> DiffusionSDE

Return the SDE model.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params – Parameter names mapped to their values.

Return type:

dict

property n_estimators_true: list[int]

The number of estimators that are actually used in the models (after early stopping), one for each dimension of the score (i.e. the dimension of y).

predict(X: Float[ndarray, 'batch x_dim'] | DataFrame, tol: float = 0.001, max_samples: int = 100, verbose: bool = False)

Predict the conditional mean of the response given the input data X using Monte Carlo estimates.

The method iteratively samples from the model until the change in the norm of the mean estimate is within a specified tolerance, or until a maximum number of samples is reached.

Parameters:
  • X (Float[ndarray, "batch x_dim"] or pd.DataFrame) – Input data with shape (batch, x_dim).

  • tol (float, optional) – Tolerance for the stopping criterion based on the relative change in the mean estimate. Default is 1e-3.

  • max_samples (int, optional) – Maximum number of samples to draw in the Monte Carlo simulation to ensure convergence. Default is 100.

  • verbose (bool, optional) – If True, displays a progress bar indicating the sampling progress. Default is False.

Returns:

The predicted conditional mean of the response for each input in X, shaped according to the original dimensionality of the target data provided during training.

Return type:

Float[ndarray, “batch y_dim”]

Raises:

ValueError – If the model has not been fitted yet.

Note

The method handles 2D inputs ([“batch x_dim”], [“batch y_dim”]) by convention, but also works with 1D inputs ([“batch”]) for single-dimensional data.
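The stopping rule described for predict can be sketched as a running Monte Carlo average. The version below is a minimal plain-Python illustration, assuming the criterion compares the norm of the change in the mean against tol times the norm of the previous mean. draw_sample is a hypothetical callable standing in for one draw from the fitted model.

```python
import math
import random


def monte_carlo_mean(draw_sample, tol=1e-3, max_samples=100):
    """Average draws until the relative change in the mean is within tol.

    draw_sample: hypothetical callable returning one sample as a list of floats.
    Returns the mean estimate and the number of samples used.
    """
    total = list(draw_sample())
    mean = total[:]
    n = 1
    while n < max_samples:
        sample = draw_sample()
        total = [a + b for a, b in zip(total, sample)]
        n += 1
        new_mean = [a / n for a in total]
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_mean, mean)))
        scale = math.sqrt(sum(b ** 2 for b in mean))
        mean = new_mean
        if change <= tol * max(scale, 1e-12):  # guard against a zero-norm mean
            break
    return mean, n


rng = random.Random(0)
mean, n_used = monte_carlo_mean(lambda: [rng.gauss(2.0, 0.1)], tol=1e-3, max_samples=100)
```

A tighter tol generally uses more samples before stopping, while max_samples bounds the cost when the estimate is slow to settle.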

predict_distribution(X: Float[ndarray, 'batch x_dim'], n_samples: int = 100, bandwidth: float | Literal['scott', 'silverman'] = 1.0, verbose: bool = False) -> list[KernelDensity]

Estimate the distribution of the predicted responses for the given input data X using Gaussian KDEs from sklearn.neighbors.KernelDensity.

Parameters:
  • X (Float[ndarray, "batch x_dim"]) – Input data with shape (batch, x_dim).

  • n_samples (int, optional) – Number of samples to draw for each input. Default is 100.

  • bandwidth (Union[float, Literal["scott", "silverman"]], optional) – The bandwidth of the kernel for the Kernel Density Estimation. If a float, it is used directly as the kernel bandwidth. If a string, it must be "scott" or "silverman" and selects the bandwidth estimation method. Default is 1.0.

  • verbose (bool, optional) – If True, displays a progress bar indicating the number of samples drawn. Default is False.

Returns:

A list of KernelDensity objects representing the estimated distributions for each input in X.

Return type:

List[KernelDensity]

Raises:

ValueError – If the model has not been fitted yet.

Note

The method handles 2D inputs ([“batch x_dim”], [“batch y_dim”]) by convention, but also works with 1D inputs ([“batch”]) for single-dimensional data.

sample(X: Float[ndarray, 'batch x_dim'] | DataFrame, n_samples: int, n_parallel: int = 10, n_steps: int = 50, seed=None, verbose: bool = False) -> Float[ndarray, 'n_samples batch y_dim']

Sample responses from the diffusion model conditional on the given input data X.

Parameters:
  • X (np.ndarray or pd.DataFrame) – Input data with shape (batch, x_dim).

  • n_samples (int) – Number of samples to draw for each input.

  • n_parallel (int, optional) – Number of parallel samples to draw. Default is 10.

  • n_steps (int, optional) – Number of steps to take by the SDE solver. Default is 50.

  • seed (int, optional) – Seed for the random number generator of the sampling. Default is None.

  • verbose (bool, optional) – If True, displays a progress bar indicating the number of samples drawn. Default is False.

Returns:

Samples drawn from the diffusion model.

Return type:

Float[ndarray, “n_samples batch y_dim”]

Raises:

ValueError – If the model has not been fitted yet.

Note

The method handles 2D inputs ([“batch x_dim”], [“batch y_dim”]) by convention, but also works with 1D inputs ([“batch”]) for single-dimensional data.
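Because the samples are indexed as (n_samples, batch, y_dim), per-input summaries reduce over the first axis. The sketch below computes a crude empirical central interval for each input in dependency-free Python. The predictive_interval helper and the _FakeModel stand-in are hypothetical, and the code assumes a three-dimensional result; with NumPy one would more likely call np.quantile(samples, q, axis=0).

```python
def predictive_interval(model, X, n_samples=200, level=0.9, seed=0):
    """Empirical central interval per input, from model.sample.

    Assumes model.sample returns an array-like indexed as [sample][batch][y_dim].
    Returns intervals[b][d] = (lower, upper).
    """
    samples = model.sample(X, n_samples=n_samples, seed=seed)
    n = len(samples)
    tail = (1.0 - level) / 2.0
    lo_idx = round(tail * (n - 1))  # nearest-rank style index, not interpolated
    hi_idx = round((1.0 - tail) * (n - 1))
    intervals = []
    for b in range(len(samples[0])):
        row = []
        for d in range(len(samples[0][b])):
            draws = sorted(float(samples[s][b][d]) for s in range(n))
            row.append((draws[lo_idx], draws[hi_idx]))
        intervals.append(row)
    return intervals


class _FakeModel:
    """Stand-in with the documented sample() signature, for illustration only."""

    def sample(self, X, n_samples, n_parallel=10, n_steps=50, seed=None, verbose=False):
        # Deterministic "samples": the s-th draw is simply s for every input.
        return [[[float(s)] for _ in X] for s in range(n_samples)]


intervals = predictive_interval(_FakeModel(), [[0.0], [1.0]], n_samples=101)
```

With a fitted model in place of the stand-in, the same call gives one (lower, upper) pair per input and target dimension.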

set_fit_request(*, cat_idx: bool | None | str = '$UNCHANGED$') -> Treeffuser

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

cat_idx (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cat_idx parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict) – Estimator parameters.

Returns:

self – Estimator instance.

Return type:

estimator instance
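The <component>__<parameter> naming can be illustrated with a toy helper that splits a key at the first double underscore. This is only a sketch of the convention, not scikit-learn's implementation, and the component name "model" is a hypothetical pipeline step.

```python
def split_nested_param(key):
    """Split 'component__parameter' into (component, parameter).

    Keys without a double underscore belong to the estimator itself,
    so the component is returned as None.
    """
    component, sep, parameter = key.partition("__")
    if not sep:
        return None, key
    return component, parameter


print(split_nested_param("learning_rate"))         # (None, 'learning_rate')
print(split_nested_param("model__learning_rate"))  # ('model', 'learning_rate')
```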

set_predict_request(*, max_samples: bool | None | str = '$UNCHANGED$', tol: bool | None | str = '$UNCHANGED$', verbose: bool | None | str = '$UNCHANGED$') -> Treeffuser

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to predict.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • max_samples (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for max_samples parameter in predict.

  • tol (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for tol parameter in predict.

  • verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in predict.

Returns:

self – The updated object.

Return type:

object