Let
P be a set of
n points in
,
be an integer and
be a constant. An
ε-coreset is a subset
with appropriate non-negative weights (scalars), that approximates
any given set
of
k centers. That is, the sum of squared distances over every point in
P to its closest point in
Q is the same, up to a factor of
to the weighted sum of
C to the same
k centers. If the coreset is small, we can solve problems such as
k-means clustering or its variants (e.g., discrete
k-means, where the centers are restricted to be in
P, or other restricted zones) on the small coreset to get faster provable approximations. Moreover, it is known that such coreset support streaming, dynamic and distributed data using the classic merge-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing such coresets are randomized and their size has at least linear dependency on the dimension
d. We suggest the first such coreset of size
independent of
d. This is also the first deterministic coreset construction whose resulting size is not exponential in
d. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia using Amazon’s cloud.
Full article