Example with code
Distance vs. Dimensionality
- The x-axis represents the distance between two points
- The y-axis represents the count (this is a histogram)
- The play button increases the number of dimensions, causing the histogram to shift to the right. This means distances are increasing as dimensions go up!
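- The curse of dimensionality is that as the number of dimensions increases, the typical distance between any two points also increases (for standardized features like these, roughly in proportion to the square root of the number of dimensions). That makes it hard to group things together!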
import numpy as np
import pandas as pd
import plotly.express as px
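def generate_data(n, dim):
    # Two Gaussian clusters of n points each, centered at 0 and at 3
    x = np.random.normal(0, 1, (n, dim))
    y = np.random.normal(3, 1, (n, dim))
    data = np.concatenate([x, y], axis=0)
    # Standardize each feature to zero mean and unit variance
    normalized = (data - data.mean(axis=0)) / data.std(axis=0)
    # Euclidean distance between every pair of points (includes zero self-distances)
    distances = np.linalg.norm(normalized[:, None, :] - normalized[None, :, :], axis=2).flatten()
    return distances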
n = 100
dims = list(range(1, 10)) + list(range(10, 200, 10))
distances = {dim: generate_data(n, dim) for dim in dims}
df = pd.DataFrame(distances).melt(var_name='dims', value_name='samples')
fig = px.histogram(df, x='samples', animation_frame='dims')
fig.update_xaxes(range=[0, 20])
fig.show()
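As a rough numerical check on the histogram's rightward shift, the sketch below reuses the same two-cluster setup and compares the root-mean-square pairwise distance with `sqrt(2 * dim)`. Because every feature is standardized to unit variance, the two values should track each other closely. The function name `rms_pairwise_distance`, the `seed` parameter, and the chosen dimensions are illustrative assumptions, not part of the original example.

```python
import numpy as np

def rms_pairwise_distance(n, dim, seed=0):
    """Root-mean-square distance over all distinct pairs of points."""
    rng = np.random.default_rng(seed)  # seed is an assumption, for repeatability
    # Same setup as the example: two Gaussian clusters, standardized per feature
    x = rng.normal(0, 1, (n, dim))
    y = rng.normal(3, 1, (n, dim))
    data = np.concatenate([x, y], axis=0)
    normalized = (data - data.mean(axis=0)) / data.std(axis=0)
    dist = np.linalg.norm(normalized[:, None, :] - normalized[None, :, :], axis=2)
    # Keep each unordered pair once and drop the zero self-distances
    upper = dist[np.triu_indices(len(data), k=1)]
    return np.sqrt(np.mean(upper ** 2))

for dim in (1, 10, 100):
    rms = rms_pairwise_distance(100, dim)
    print(f"dim={dim:>3}  rms distance={rms:6.2f}  sqrt(2*dim)={np.sqrt(2 * dim):6.2f}")
```

Unlike the histogram code, this version counts each pair once and skips self-distances, so the zeros on the diagonal do not pull the average down.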