The Sparsifier Object

class sparseklearn.sparsifier.Sparsifier(num_feat_full, num_feat_comp, num_samp, mask=None, transform='dct', D_indices=None, num_feat_shared=0, random_state=None)

Sparsifier.

Compresses data through sparsification. Permits several operations on sparsified data.

Parameters
num_feat_full : int

Dimension of a full sample.

num_feat_comp : int

The number of dimensions to keep in the compressed data.

num_samp : int

The number of samples in the dataset.

transform : {‘dct’, None}, defaults to ‘dct’.

The preconditioning transform. Determines which form of H to use in the preconditioning transform HD. Any value other than None also applies the diagonal matrix D (which can be set with the D_indices parameter). The discrete cosine transform (‘dct’) is currently the only supported method.

mask : np.ndarray, shape (n_datapoints, dim_mask), optional

Defaults to None. The user-provided mask. If None, the mask is generated using the generate_mask method.

num_feat_shared : int, defaults to 0.

The minimum number of dimensions to be shared across all samples in the compressed data.

D_indices : np.ndarray, shape (n_datapoints,), optional

Defaults to None. The user-provided diagonal of the preconditioning matrix D. If None, it is generated using the generate_D_indices method.

Attributes
mask : np.ndarray, shape (num_samp, num_feat_comp)

The mask used to sparsify the data. Array of integers; each row holds the indices of the entries of that sample that were kept.

D_indices : np.ndarray, shape (n_signflips,)

Defines the preconditioning matrix D. Array of integers giving the indices of the diagonal entries of D that have sign -1.
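To make the layout of the mask attribute concrete, here is a minimal NumPy sketch that builds an array of the documented shape. It is not the library's generate_mask method: the function name make_mask_sketch, the seed argument, and the sorted row order are assumptions made for this example.

```python
import numpy as np

def make_mask_sketch(num_samp, num_feat_full, num_feat_comp,
                     num_feat_shared=0, seed=None):
    """Hypothetical mask generator: one row of kept indices per sample."""
    rng = np.random.default_rng(seed)
    # Indices that every sample keeps.
    shared = rng.choice(num_feat_full, size=num_feat_shared, replace=False)
    # Remaining candidate indices, from which each sample draws the rest.
    rest = np.setdiff1d(np.arange(num_feat_full), shared)
    n_free = num_feat_comp - num_feat_shared
    rows = []
    for _ in range(num_samp):
        free = rng.choice(rest, size=n_free, replace=False)
        rows.append(np.sort(np.concatenate([shared, free])))
    return np.array(rows, dtype=int)

mask = make_mask_sketch(num_samp=5, num_feat_full=10,
                        num_feat_comp=4, num_feat_shared=2, seed=0)
```

Each row contains num_feat_comp distinct indices into a full sample, and at least num_feat_shared of them appear in every row.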

Methods

apply_HD(self, X)

Apply the preconditioning transform to X.

apply_mask(self, X, mask)

Apply the mask to X.

fit_sparsifier(self[, X, HDX, RHDX])

Fit the sparsifier to specified data.

invert_HD(self, HDX)

Apply the inverse of HD to HDX.

invert_mask_bool(self)

Compute the mask inverse.

pairwise_distances(self[, Y])

Computes the pairwise distance between each sparsified sample, or between each sparsified sample and each full sample in Y if Y is given.

pairwise_mahalanobis_distances(self, means, …)

Computes the Mahalanobis distance between each compressed sample and each full mean (each row of means).

weighted_means(self, W)

Computes weighted full means of sparsified samples.

weighted_means_and_variances(self, W)

Computes weighted full means and variances of sparsified samples.

apply_mask(self, X, mask)

Apply the mask to X.

Parameters
X : np.ndarray, shape (n, P)
mask : np.ndarray, shape (n, Q)
Returns
RX : np.ndarray, shape (n, Q)

Masked X. Row i of RX is X[i][mask[i]].
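The row-wise selection described above can be reproduced in plain NumPy. The sketch below uses np.take_along_axis and does not call the library; the array values are made up for illustration.

```python
import numpy as np

# Three samples with P = 4 features each.
X = np.arange(12).reshape(3, 4)
# Q = 2 kept indices per sample.
mask = np.array([[0, 2],
                 [1, 3],
                 [0, 1]])

# Row i of RX is X[i][mask[i]].
RX = np.take_along_axis(X, mask, axis=1)
print(RX)
```

Here RX has shape (3, 2), one row of kept entries per sample.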

apply_HD(self, X)

Apply the preconditioning transform to X.

Parameters
X : np.ndarray, shape (n, P)

The data to precondition. Each row is a datapoint.

Returns
HDX : np.ndarray, shape (n, P)

The transformed data.

invert_HD(self, HDX)

Apply the inverse of HD to HDX.

Parameters
HDX : np.ndarray, shape (n, P)

The preconditioned data. Each row is a datapoint.

Returns
X : np.ndarray, shape (n, P)

The raw data.
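The following NumPy sketch illustrates what a transform of the form HD and its inverse look like. It assumes H is the orthonormal DCT-II matrix applied along the feature axis and D is a diagonal matrix of ±1 whose -1 entries sit at D_indices; the library's exact DCT normalization is not stated here, and the helper names ending in _sketch are hypothetical.

```python
import numpy as np

def dct_matrix(P):
    """Orthonormal DCT-II matrix H of shape (P, P), so H @ H.T = I."""
    n = np.arange(P)
    H = np.sqrt(2.0 / P) * np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * P))
    H[0, :] /= np.sqrt(2.0)
    return H

def sign_vector(P, D_indices):
    """Diagonal of D: +1 everywhere except -1 at D_indices."""
    d = np.ones(P)
    d[D_indices] = -1.0
    return d

def apply_HD_sketch(X, D_indices):
    P = X.shape[1]
    # Each row x becomes H @ (d * x); written for row vectors.
    return (X * sign_vector(P, D_indices)) @ dct_matrix(P).T

def invert_HD_sketch(HDX, D_indices):
    P = HDX.shape[1]
    # H is orthogonal and D is its own inverse, so (HD)^-1 = D @ H.T.
    return (HDX @ dct_matrix(P)) * sign_vector(P, D_indices)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
D_indices = np.array([1, 4, 5])
HDX = apply_HD_sketch(X, D_indices)
X_back = invert_HD_sketch(HDX, D_indices)
```

Under these assumptions the transform is orthogonal, so it preserves the Euclidean norm of every row and the inverse recovers X up to floating-point error.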

invert_mask_bool(self)

Compute the mask inverse.

The mask is an array indicating which dimensions are kept for each datapoint. The inverse mask indicates, for each dimension, which datapoints keep it. For computational efficiency, the inverse mask is given as a sparse boolean array, whereas the mask is a (smaller) dense integer array.

Returns
mask_inverse : sparse.csr_matrix, bool, shape (P, N)

The mask inverse. The ij entry is 1 if the jth datapoint keeps the ith dimension under the mask, and 0 otherwise; in other words, 1 if i is in the list mask[j].
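The relationship between the mask and its inverse can be sketched with a dense boolean NumPy array. The method itself returns a sparse.csr_matrix; the dense array and the helper name invert_mask_dense are simplifications for this example.

```python
import numpy as np

def invert_mask_dense(mask, P):
    """Dense stand-in for the inverse mask, shape (P, N)."""
    N = mask.shape[0]
    inv = np.zeros((P, N), dtype=bool)
    # Set entry (i, j) whenever dimension i appears in mask[j].
    inv[mask, np.arange(N)[:, None]] = True
    return inv

mask = np.array([[0, 2],
                 [1, 3],
                 [0, 1]])
mask_inverse = invert_mask_dense(mask, P=4)
print(mask_inverse.astype(int))
```

Each column of mask_inverse contains exactly Q true entries, one for every index listed in the corresponding row of mask.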

fit_sparsifier(self, X=None, HDX=None, RHDX=None)

Fit the sparsifier to specified data.

Sets self.RHDX, the subsampled, preconditioned data. At least one of the parameters must be set. If RHDX is passed, then X and HDX are ignored. If HDX is passed, then X is ignored.

Parameters
X : np.ndarray, shape (num_samp, num_feat_full), defaults to None.

Dense, raw data.

HDX : np.ndarray, shape (num_samp, num_feat_full), defaults to None.

Dense, preconditioned data.

RHDX : np.ndarray, shape (num_samp, num_feat_comp), defaults to None.

Subsampled, preconditioned data.
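The precedence among the three inputs (RHDX over HDX over X) can be sketched as a standalone function. This is only an illustration of the documented rule, not the library's code: resolve_rhdx_sketch and its precondition argument are hypothetical, and the doubling function below merely stands in for the HD transform.

```python
import numpy as np

def resolve_rhdx_sketch(mask, precondition, X=None, HDX=None, RHDX=None):
    """Return subsampled, preconditioned data from the most processed input."""
    if RHDX is not None:
        # Already subsampled and preconditioned: X and HDX are ignored.
        return np.asarray(RHDX)
    if HDX is None:
        if X is None:
            raise ValueError("at least one of X, HDX, RHDX must be given")
        # Only raw data available: precondition it first.
        HDX = precondition(np.asarray(X))
    # Keep the entries selected by each sample's row of the mask.
    return np.take_along_axis(np.asarray(HDX), mask, axis=1)

mask = np.array([[0, 2], [1, 3]])
X = np.arange(8.0).reshape(2, 4)
double = lambda A: 2 * A  # placeholder for the HD transform

from_X = resolve_rhdx_sketch(mask, double, X=X)
from_HDX = resolve_rhdx_sketch(mask, double, X=X, HDX=X)
from_RHDX = resolve_rhdx_sketch(mask, double, X=X, RHDX=np.full((2, 2), 9.0))
```

When HDX is supplied the placeholder transform is never applied, and when RHDX is supplied neither the transform nor the mask is applied.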

pairwise_distances(self, Y=None)

Computes the pairwise distance between each sparsified sample, or between each sparsified sample and each full sample in Y if Y is given.

Parameters
Y : np.ndarray, shape (K, P), optional

Defaults to None. Full, transformed samples.

Returns
distances : np.ndarray, shape (K or N, N)

Distances between each pair of samples (if Y is None), or between each sample and each row of Y.
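One way to estimate the distance between a sparsified sample and a full sample is to compare them only on the entries the sparsified sample kept, then rescale. The sketch below does this in NumPy for the case where Y is given. The P/Q rescaling (which makes the squared distance an unbiased estimate when each mask row is drawn uniformly at random) and the helper name are assumptions; whether the library applies the same scaling is not stated here.

```python
import numpy as np

def masked_distances_to_full(RX, mask, Y, P):
    """Estimated Euclidean distances, shape (K, N), from kept entries only."""
    N, Q = RX.shape
    # Y[:, mask] has shape (K, N, Q): each row of Y restricted to the
    # entries kept by each sparsified sample.
    diff = Y[:, mask] - RX[None, :, :]
    sq = (diff ** 2).sum(axis=2)
    # Rescale by P/Q to compensate for the discarded dimensions (assumption).
    return np.sqrt(sq * P / Q)

rng = np.random.default_rng(0)
P = 6
X = rng.normal(size=(5, P))
Y = rng.normal(size=(3, P))

# With a mask that keeps every dimension, the estimate is exact.
full_mask = np.tile(np.arange(P), (5, 1))
D = masked_distances_to_full(X, full_mask, Y, P)
```

With Q smaller than P the result is only an estimate of the full distance, and its accuracy depends on how evenly the energy of the data is spread across dimensions, which is the purpose of the preconditioning transform.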

weighted_means(self, W)

Computes weighted full means of sparsified samples. This method is currently also used for hard assignments, where zeros in W are still multiplied through; this should be optimized for speed later.

Parameters
W : np.ndarray, shape (N, K)

Weights. Each row corresponds to a sample, each column to a set of weights. The columns of W should sum to 1. There is no necessary correspondence between the columns of W.

Returns
means : np.ndarray, shape (K, P)

Weighted full means. Each row corresponds to a possible independent set of weights (for example, a binary W with K columns would give the means of K clusters).
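As an illustration of how full-dimensional means can be recovered from sparsified samples, the sketch below averages each dimension over only those samples that kept it, weighting by W. Normalizing each dimension by the weight of the samples that kept it is an assumption of this example, as is the helper name; the library's exact normalization is not described here.

```python
import numpy as np

def weighted_means_sketch(RX, mask, W, P):
    """Per-dimension weighted means, shape (K, P), from kept entries only."""
    N, Q = RX.shape
    K = W.shape[1]
    num = np.zeros((K, P))
    den = np.zeros((K, P))
    rows = np.repeat(np.arange(N), Q)  # sample index of each kept entry
    cols = mask.ravel()                # full-space dimension of each kept entry
    vals = RX.ravel()
    for k in range(K):
        # Accumulate weighted values and weights per dimension.
        np.add.at(num[k], cols, W[rows, k] * vals)
        np.add.at(den[k], cols, W[rows, k])
    # Dimensions that received no weight are left at zero.
    safe = np.where(den > 0, den, 1.0)
    return np.where(den > 0, num / safe, 0.0)

# Three samples in P = 3 dimensions, each keeping Q = 2 entries.
mask = np.array([[0, 1], [1, 2], [0, 2]])
RX = np.array([[1.0, 2.0], [4.0, 5.0], [5.0, 7.0]])
W = np.full((3, 1), 1.0 / 3.0)  # a single set of uniform weights
means = weighted_means_sketch(RX, mask, W, P=3)
```

In this toy input every dimension is observed by two of the three samples, so each entry of means is the average of the two observed values.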

weighted_means_and_variances(self, W)

Computes weighted full means and variances of sparsified samples. This method is currently also used for hard assignments, where zeros in W are still multiplied through; this should be optimized for speed later.

Parameters
W : np.ndarray, shape (N, K)

Weights. Each row corresponds to a sample, each column to a set of weights. The columns of W should sum to 1. There is no necessary correspondence between the columns of W.

Returns
means : np.ndarray, shape (K, P)

Weighted full means. Each row corresponds to a possible independent set of weights (for example, a binary W with K columns would give the means of K clusters).

variances : np.ndarray, shape (K, P)

Weighted full variances. Each row corresponds to a possible independent set of weights (for example, a binary W with K columns would give the variances of K clusters).

pairwise_mahalanobis_distances(self, means, covariances, covariance_type)

Computes the Mahalanobis distance between each compressed sample and each full mean (each row of means).

Parameters
means : np.ndarray, shape (K, P)

The means with which to take the Mahalanobis distances. Each row of means is a single mean in P-dimensional space.

covariances : np.ndarray, shape (K, P) or shape (P,).

The non-zero entries of the covariance matrix. If covariance_type is ‘spherical’, must have shape (P,). If covariance_type is ‘diag’, must have shape (K, P).

covariance_type : {‘spherical’, ‘diag’}, string.

The form of the covariance matrix.

Returns
distances : np.ndarray, shape (N, K)

The pairwise Mahalanobis distances.
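For the ‘diag’ case, a Mahalanobis distance between a compressed sample and a full mean can be sketched by restricting both the mean and the diagonal covariance to the entries the sample kept. The helper name is hypothetical, and this example returns the unsquared, unscaled distance; whether the library squares or rescales the result is not stated here.

```python
import numpy as np

def masked_mahalanobis_diag(RX, mask, means, variances):
    """Diagonal-covariance Mahalanobis distances, shape (N, K)."""
    # Restrict each mean and each variance row to every sample's kept entries.
    mu = means[:, mask]        # shape (K, N, Q)
    var = variances[:, mask]   # shape (K, N, Q)
    sq = ((RX[None, :, :] - mu) ** 2 / var).sum(axis=2)  # shape (K, N)
    return np.sqrt(sq).T

rng = np.random.default_rng(0)
N, K, P = 5, 3, 6
X = rng.normal(size=(N, P))
means = rng.normal(size=(K, P))
variances = rng.uniform(0.5, 2.0, size=(K, P))

# With a mask that keeps every dimension, this matches the full computation.
full_mask = np.tile(np.arange(P), (N, 1))
dist = masked_mahalanobis_diag(X, full_mask, means, variances)
```

With a partial mask, the sum simply runs over fewer terms, so the value is a lower bound on the full diagonal Mahalanobis distance.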