Treeffuser
Module contents
- class Treeffuser(n_repeats: int = 30, n_estimators: int = 3000, early_stopping_rounds: int | None = 50, eval_percent: float = 0.1, num_leaves: int = 31, max_depth: int = -1, learning_rate: float = 0.1, max_bin: int = 255, subsample_for_bin: int = 200000, min_child_samples: int = 20, subsample: float = 1.0, subsample_freq: int = 0, n_jobs: int = -1, sde_name: str = 'vesde', sde_initialize_from_data: bool = False, sde_hyperparam_min: float | Literal['default'] | None = None, sde_hyperparam_max: float | Literal['default'] | None = None, seed: int | None = None, verbose: int = 0, extra_lightgbm_params: dict | None = None)
Bases: BaseTabularDiffusion
- n_repeats (int)
How many times to repeat the training dataset when fitting the score. That is, how many noisy versions of each point to generate for training.
- n_estimators (int)
LightGBM: Number of boosting iterations.
- early_stopping_rounds (int or None)
LightGBM: If None, no early stopping is performed. Otherwise, the model will stop training if no improvement is observed in the validation set for early_stopping_rounds consecutive iterations.
- eval_percent (float)
LightGBM: Fraction of the training data to use for validation if early_stopping_rounds is not None.
- num_leaves (int)
LightGBM: Maximum tree leaves for base learners.
- max_depth (int)
LightGBM: Maximum tree depth for base learners, <=0 means no limit.
- learning_rate (float)
LightGBM: Boosting learning rate.
- max_bin (int)
LightGBM: Max number of bins that feature values will be bucketed in. This is used for LightGBM’s histogram binning algorithm.
- subsample_for_bin (int)
LightGBM: Number of samples for constructing bins.
- min_child_samples (int)
LightGBM: Minimum number of data points needed in a child (leaf). If a split would leave fewer than this number, the child is not created.
- subsample (float)
LightGBM: Subsample ratio of the training instances.
- subsample_freq (int)
LightGBM: Frequency of subsampling, i.e. how often (in boosting iterations) to subsample the training data; <=0 disables subsampling.
- n_jobs (int)
LightGBM: Number of parallel threads. If set to -1, the number is set to the number of available cores.
- sde_name (str)
SDE: Name of the SDE to use. See treeffuser.sde.get_diffusion_sde for available SDEs.
- sde_initialize_from_data (bool)
SDE: Whether to initialize the SDE from the data. If True, the SDE hyperparameters are initialized with a heuristic based on the data (see treeffuser.sde.initialize.py). Otherwise, sde_hyperparam_min and sde_hyperparam_max are used. (default: False)
- sde_hyperparam_min (float or “default”)
SDE: The scale of the SDE at t=0 (see VESDE, VPSDE, SubVPSDE).
- sde_hyperparam_max (float or “default”)
SDE: The scale of the SDE at t=T (see VESDE, VPSDE, SubVPSDE).
- seed (int)
Random seed for generating the training data and fitting the model.
- verbose (int)
Verbosity of the score model.
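The roles of n_repeats, sde_hyperparam_min, and sde_hyperparam_max can be illustrated with a minimal sketch. It assumes a variance-exploding SDE whose noise scale grows geometrically from the minimum to the maximum over t in [0, 1]; the helper names (vesde_sigma, make_noisy_dataset) and the default scales 0.01 and 20.0 are hypothetical, chosen only for illustration, and this is not the library's own training code.

```python
import random


def vesde_sigma(t, sigma_min, sigma_max):
    """Noise scale of a variance-exploding SDE at time t in [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t


def make_noisy_dataset(X, y, n_repeats, sigma_min=0.01, sigma_max=20.0, seed=0):
    """Repeat each (x, y) pair n_repeats times, each with fresh noise.

    Returns rows of (x, t, y_noisy, z), where z is the standard normal draw
    used to perturb y. The scales here are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for x_i, y_i in zip(X, y):
        for _ in range(n_repeats):
            t = rng.random()          # diffusion time, uniform in [0, 1)
            z = rng.gauss(0.0, 1.0)   # standard normal noise
            y_noisy = y_i + vesde_sigma(t, sigma_min, sigma_max) * z
            rows.append((x_i, t, y_noisy, z))
    return rows


X = [[0.0], [1.0], [2.0]]
y = [0.1, 0.9, 2.1]
rows = make_noisy_dataset(X, y, n_repeats=4)
print(len(rows))  # 3 points x 4 repeats = 12 rows
```

A larger n_repeats gives the score model more noisy views of each training point, at the cost of a proportionally larger training set.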
- compute_nll(X: Float[ndarray, 'batch x_dim'], y: Float[ndarray, 'batch y_dim'], n_samples: int = 10, bandwidth: float | Literal['scott', 'silverman'] = 1.0, verbose: bool = False) → float
Compute the negative log likelihood, -sum_{(y, x) in [y, X]} log p(y|x), where p is the conditional distribution learned by the model.
- Parameters:
X (np.ndarray) – Input data with shape (batch, x_dim).
y (np.ndarray) – Target data with shape (batch, y_dim).
n_samples (int, optional) – Number of samples to draw if computing the negative log likelihood from samples. Default is 10.
bandwidth (Union[float, Literal["scott", "silverman"]], optional) – The bandwidth of the kernel. If bandwidth is a float, it defines the bandwidth of the kernel. If bandwidth is a string, one of the “scott” and “silverman” estimation methods. Default is 1.0.
verbose (bool, optional) – If True, displays a progress bar for the sampling. Default is False.
- Returns:
The computed negative log likelihood value.
- Return type:
float
Note
The method handles 2D inputs ([“batch x_dim”], [“batch y_dim”]) by convention, but also works with 1D inputs ([“batch”]) for single-dimensional data.
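As a rough illustration of a sample-based estimate of this quantity, the sketch below fits a 1D Gaussian kernel density to the samples drawn for each input and sums the negative log densities at the observed targets. The function names are hypothetical and the Gaussian kernel is an assumption; this is not the library's implementation.

```python
import math


def gaussian_kde_logpdf(y, samples, bandwidth):
    """Log-density at scalar y of a 1D Gaussian KDE built from samples."""
    n = len(samples)
    exponents = [-0.5 * ((y - s) / bandwidth) ** 2 for s in samples]
    m = max(exponents)  # log-sum-exp trick for numerical stability
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    return log_sum - math.log(n * bandwidth * math.sqrt(2.0 * math.pi))


def nll_from_samples(y_true, samples_per_point, bandwidth=1.0):
    """Sum of -log p(y_i | x_i), with each p estimated from samples."""
    return -sum(
        gaussian_kde_logpdf(y_i, s_i, bandwidth)
        for y_i, s_i in zip(y_true, samples_per_point)
    )


# Two test points, each with made-up samples standing in for model draws.
y_true = [0.0, 1.0]
samples_per_point = [[-0.2, 0.1, 0.3], [0.8, 1.1, 1.4]]
print(nll_from_samples(y_true, samples_per_point, bandwidth=1.0))
```

With few samples the estimate depends strongly on the bandwidth, which is why both n_samples and bandwidth are exposed as arguments.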
- fit(X: Float[ndarray, 'batch x_dim'] | DataFrame, y: Float[ndarray, 'batch y_dim'] | Series | DataFrame, cat_idx: list[int] | None = None)
Fit the conditional diffusion model to the tabular data (X, y).
- Parameters:
X (np.ndarray or pd.DataFrame) – Input data with shape (batch, x_dim).
y (np.ndarray or pd.Series or pd.DataFrame) – Target data with shape (batch, y_dim).
cat_idx (List[int], optional) – If X is a np.ndarray, list of indices of categorical features in X. If X is a DataFrame, setting cat_idx will raise an error. Instead, ensure that the categorical columns have dtype category, and they will be automatically detected as categorical features. E.g., X['column_name'] = X['column_name'].astype('category'). Default is None.
- Returns:
self – The fitted model.
- Return type:
TabularDiffusion
Note
The method handles 2D inputs ([“batch x_dim”], [“batch y_dim”]) by convention, but also works with 1D inputs ([“batch”]) for single-dimensional data.
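A minimal usage sketch follows, using only the constructor, fit, and sample signatures documented on this page. It assumes the treeffuser and numpy packages are installed and that the class is importable as `from treeffuser import Treeffuser`; the synthetic data and the wrapper function name fit_and_sample are illustrative.

```python
def fit_and_sample(n_train=200, n_test=5, n_samples=20, seed=0):
    """Fit Treeffuser on synthetic data and draw conditional samples."""
    # Imports are local so the function can be defined without the packages.
    import numpy as np
    from treeffuser import Treeffuser  # assumed import path

    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n_train, 2))
    y = X[:, 0] + 0.1 * rng.normal(size=n_train)  # 1D target, shape (batch,)

    model = Treeffuser(seed=seed)
    model.fit(X, y)  # fit returns self, so calls could also be chained

    X_test = rng.uniform(-1.0, 1.0, size=(n_test, 2))
    return model.sample(X_test, n_samples=n_samples, seed=seed)
```

Calling `fit_and_sample()` returns the array produced by sample for the test inputs.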
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params – Parameter names mapped to their values.
- Return type:
dict
- property n_estimators_true: list[int]
The number of estimators that are actually used in the models (after early stopping), one for each dimension of the score (i.e. the dimension of y).
- predict(X: Float[ndarray, 'batch x_dim'] | DataFrame, tol: float = 0.001, max_samples: int = 100, verbose: bool = False)
Predict the conditional mean of the response given the input data X using Monte Carlo estimates.
The method iteratively samples from the model until the change in the norm of the mean estimate is within a specified tolerance, or until a maximum number of samples is reached.
- Parameters:
X (Float[ndarray, "batch x_dim"] or pd.DataFrame) – Input data with shape (batch, x_dim).
tol (float, optional) – Tolerance for the stopping criterion based on the relative change in the mean estimate. Default is 1e-3.
max_samples (int, optional) – Maximum number of samples to draw in the Monte Carlo simulation to ensure convergence. Default is 100.
verbose (bool, optional) – If True, displays a progress bar indicating the sampling progress. Default is False.
- Returns:
The predicted conditional mean of the response for each input in X, shaped according to the original dimensionality of the target data provided during training.
- Return type:
Float[ndarray, “batch y_dim”]
- Raises:
ValueError – If the model has not been fitted yet.
Note
The method handles 2D inputs ([“batch x_dim”], [“batch y_dim”]) by convention, but also works with 1D inputs ([“batch”]) for single-dimensional data.
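The stopping rule described above can be sketched for a single scalar response as follows. Here draw is a hypothetical zero-argument callable standing in for one model sample, and the exact form of the relative-change test is an assumption rather than the library's code.

```python
import random


def monte_carlo_mean(draw, tol=1e-3, max_samples=100):
    """Average draws until the relative change in the mean is below tol.

    Returns (mean, number_of_draws_used).
    """
    total = draw()
    mean = total
    n = 1
    while n < max_samples:
        total += draw()
        n += 1
        new_mean = total / n
        # Relative change of the running mean; guard against division by zero.
        change = abs(new_mean - mean) / max(abs(mean), 1e-12)
        mean = new_mean
        if change < tol:
            break
    return mean, n


rng = random.Random(0)
mean, n_used = monte_carlo_mean(lambda: rng.gauss(5.0, 1.0))
print(mean, n_used)
```

A smaller tol or a larger max_samples trades extra sampling time for a more stable estimate of the conditional mean.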
- predict_distribution(X: Float[ndarray, 'batch x_dim'], n_samples: int = 100, bandwidth: float | Literal['scott', 'silverman'] = 1.0, verbose: bool = False) → list[KernelDensity]
Estimate the distribution of the predicted responses for the given input data X using Gaussian KDEs from sklearn.neighbors.KernelDensity.
- Parameters:
X (Float[ndarray, "batch x_dim"]) – Input data with shape (batch, x_dim).
n_samples (int, optional) – Number of samples to draw for each input. Default is 100.
bandwidth (Union[float, Literal["scott", "silverman"]], optional) – The bandwidth of the kernel for the Kernel Density Estimation. If a float, it defines the bandwidth of the kernel. If a string, one of the “scott” or “silverman” estimation methods. Default is 1.0.
verbose (bool, optional) – If True, displays a progress bar indicating the number of samples drawn. Default is False.
- Returns:
A list of KernelDensity objects representing the estimated distributions for each input in X.
- Return type:
List[KernelDensity]
- Raises:
ValueError – If the model has not been fitted yet.
Note
The method handles 2D inputs ([“batch x_dim”], [“batch y_dim”]) by convention, but also works with 1D inputs ([“batch”]) for single-dimensional data.
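For orientation on the string options for bandwidth, the sketch below computes the usual Scott and Silverman rule-of-thumb factors as functions of the number of samples and the target dimension. The function names are hypothetical, and whether a given KDE implementation additionally scales these factors by the spread of the data is an assumption to check against its documentation.

```python
def scott_bandwidth(n_samples, n_features):
    """Scott's rule-of-thumb factor: n ** (-1 / (d + 4))."""
    return n_samples ** (-1.0 / (n_features + 4))


def silverman_bandwidth(n_samples, n_features):
    """Silverman's rule-of-thumb factor: (n * (d + 2) / 4) ** (-1 / (d + 4))."""
    return (n_samples * (n_features + 2) / 4.0) ** (-1.0 / (n_features + 4))


# With the default n_samples=100 and a one-dimensional target:
print(scott_bandwidth(100, 1))
print(silverman_bandwidth(100, 1))
```

Both factors shrink as n_samples grows, so drawing more samples yields a narrower kernel and a sharper density estimate.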
- sample(X: Float[ndarray, 'batch x_dim'] | DataFrame, n_samples: int, n_parallel: int = 10, n_steps: int = 50, seed=None, verbose: bool = False) → Float[ndarray, 'n_samples batch y_dim']
Sample responses from the diffusion model conditional on the given input data X.
- Parameters:
X (np.ndarray or pd.DataFrame) – Input data with shape (batch, x_dim).
n_samples (int) – Number of samples to draw for each input.
n_parallel (int, optional) – Number of parallel samples to draw. Default is 10.
n_steps (int, optional) – Number of steps to take by the SDE solver. Default is 50.
seed (int, optional) – Seed for the random number generator of the sampling. Default is None.
verbose (bool, optional) – Show a progress bar indicating the number of samples drawn. Default is False.
- Returns:
Samples drawn from the diffusion model.
- Return type:
Float[ndarray, “n_samples batch y_dim”]
- Raises:
ValueError – If the model has not been fitted yet.
Note
The method handles 2D inputs ([“batch x_dim”], [“batch y_dim”]) by convention, but also works with 1D inputs ([“batch”]) for single-dimensional data.
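Because the first axis of the returned array indexes samples, per-input summaries are obtained by reducing over that axis. The sketch below shows the indexing with plain nested lists laid out as samples[s][b][d]; the helper name summarize_samples and the toy values are illustrative.

```python
import statistics


def summarize_samples(samples):
    """Map samples[s][b][d] to per-input, per-dimension (mean, stdev)."""
    n_samples = len(samples)
    batch = len(samples[0])
    y_dim = len(samples[0][0])
    summary = []
    for b in range(batch):
        row = []
        for d in range(y_dim):
            # Collect all draws for input b and target dimension d.
            values = [samples[s][b][d] for s in range(n_samples)]
            row.append((statistics.fmean(values), statistics.stdev(values)))
        summary.append(row)
    return summary


# 3 samples, 2 inputs, 1 target dimension
samples = [
    [[1.0], [10.0]],
    [[2.0], [12.0]],
    [[3.0], [14.0]],
]
summary = summarize_samples(samples)
print(summary[0][0])  # (2.0, 1.0)
```

With a NumPy array of the documented shape, the same reduction is `samples.mean(axis=0)` and `samples.std(axis=0, ddof=1)`.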
- set_fit_request(*, cat_idx: bool | None | str = '$UNCHANGED$') → Treeffuser
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
cat_idx (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for cat_idx parameter in fit.
- Returns:
self – The updated object.
- Return type:
object
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
**params (dict) – Estimator parameters.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_predict_request(*, max_samples: bool | None | str = '$UNCHANGED$', tol: bool | None | str = '$UNCHANGED$', verbose: bool | None | str = '$UNCHANGED$') → Treeffuser
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
max_samples (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for max_samples parameter in predict.
tol (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for tol parameter in predict.
verbose (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for verbose parameter in predict.
- Returns:
self – The updated object.
- Return type:
object