I started off simple, can SDS be used to find the single most predictive feature in a dataset.
I.e can it find the best parameters for dimension
, and threshold
in the function microtest
shown below, such that it best predicts the class of an input row.
def microtest(row, dimension, threshold):
in_class = row[dimension] > threshold
return in_class
To do this I used a Standard SDS:
SDS(I=I_sync(D=D_passive(DH), T=T_boolean(TM)), H=H_fixed(2000))
Only TM
and DH
needed to be specified for this task.
The hypothesis selection method returns a random DimensionThreshold
.
DimensionThreshold = collections.namedtuple(
"DimensionThreshold",
("dimension", "operator", "threshold")
)
Where dimension
is the index of the chosen feature, operator
is in {operator.lt, operator.gt}
(less than or greater than) and threshold is one of the values in the dataset from the chosen dimension.
The variant is called DH_plane
as it returns a region of a hyperspace defined as everything to one side of a plane.
def DH_plane(dataset, rng):
"""
Takes a "choice"able dataset where each row is expected to be of the same
length.
Each row must also be indexable and each value in a row is ordered with
respect to all other values in the same column.
Returns a single (feature_index, threshold, operator) tuple
"""
def DH():
random_row = rng.choice(dataset)
feature_num = rng.randrange(len(random_row)-1)
threshold = random_row[feature_num]
operator = random.choice(sds_ml.operators)
return DimensionThreshold(feature_num, operator, threshold)
return DH
The microtest selection method returns a partial function, formed by a generic microtest function applied to a randomly selected row from the dataset.
The generic function microtest
takes a hypothesis in the form of a DimensionThreshold
and a row from the dataset, and returns whether the class of the row matches the predicted class due to its position relative to plane defined by the hypothesis.
This whole operation, in short, evaluates the boolean expression
row class == hyp.operator(row[hyp.dimension], hyp.threshold)
Here is the full definition of TM_plane
.
def TM_plane(dataset, rng):
"""
Takes a (feature_index, threshold, operator) hyp and a row.
Applies operator to row[feature_index] and compares result to row[-1]
"""
def microtest(hyp, row):
feature = row[hyp.dimension]
comparison = hyp.operator(feature, hyp.threshold)
return comparison == row[-1]
def TM():
row = rng.choice(dataset)
return functools.partial(microtest, row=row)
return TM, microtest
I found some toy datasets here: 10 Standard Datasets for Practicing Applied Machine Learning
Two datasets looked likely:
My current task was regarding boolean classification, so I went with the Pima Indians Diabetes Dataset.
First, an aside. The "Indians" of the "Pima Indians Diabetes Dataset" are North Americans, so we're in a tricky space when it comes to using the name "Indian" and so I'll say Pima where necessary. Having said that, here's a relevant quote from the book "1941: The Americas before Columbus" by Charles C. Mann.
I abhor the term Native American," [Russel] Means declared in 1998. Matching his actions to his words, Means had joined and become prominent in an indigenous-rights group called the American Indian Movement. "We were colonised as American Indians," he wrote, "we were colonized as American Indians, and we will gain our freedom as American Indians, and then we will call ourselves any damn thing we choose" (At the same time, the common British usage of "Red Indian" to distinguish American natives from "East Indians" is unwelcome.)
The dataset has these features:
Input features
Output features:
Running this variant against the Pima Indians Diabetes Dataset with 1000 agents converges in somewhere around 3000-5000 iterations, which on my computer is around 25 seconds.
The largest cluster is at the hypothesis ("Plasma glucose concentration a 2 hours in an oral glucose tolerance test" > 143.00).
Here's some stats, for a dataset of size 768 with 268 "positive" instances.
Accuracy | 0.75 |
Cluster size | 0.3 |
True positive | 126 |
True Negative | 450 |
False positive | 50 |
False Negative | 142 |
Precision | 0.716 |
Recall | 0.470 |
Where accuracy with a proportion of correctly guessed rows, precision is the proportion of "in class" guesses being correct, and recall is the proportion of all "in class" rows in the database that were correctly identified.
I allowed agents to choose up to one threshold per dimension, (e.g. A > x, B < y, C > z), but not (A > x, A < y, A > z) and take the OR of all comparisons, basically the union of a number of regions defined as being on one side of a plane.
I needed to redefine the hypothesis selection method DH and the microtest selection method TM.
A hypothesis is now a set of DimensionThreshold
s, each one for a different dimension. The number is chosen uniformly at random between 1 and the number of dimensions in the dataset.
def DH_plane_union(dataset, rng):
"""
Takes a "choice"able dataset where each row is expected to be of the same
length.
Each row must also be indexable and each value in a row is ordered with
respect to all other values in the same column.
Returns between one dimension threshold, or one dimension threshold per dimension, with no repeated dimensions.
Returns an unordered set of (feature_index, threshold, operator) tuples
"""
dimension_count = max(len(row)-1 for row in dataset)
def DH():
hyp_dim_count = rng.randint(2, dimension_count)
hyp_dims = rng.sample(range(dimension_count), hyp_dim_count)
hyp = frozenset(
DimensionThreshold(
dimension=dimension,
operator=rng.choice(sds_ml.sds_ml.operators),
threshold=rng.choice(dataset)[dimension],
)
for dimension
in sorted(hyp_dims)
)
return hyp
return DH
The microtest selects a random row from the dataset and returns true if any of the DimensionThreshold
comparisons return true.
def TM_plane_union(dataset, rng):
def microtest(hyp, row):
comparison = any(
plane.operator(row[plane.dimension], plane.threshold)
for plane
in hyp
)
return comparison == row[-1]
def TM():
row = rng.choice(dataset)
return functools.partial(microtest, row=row)
return TM, microtest
Interestingly for the first few thousand iterations, you tend to see exactly the same result ("Plasma glucose concentration a 2 hours in an oral glucose tolerance test." > 143.00), even if you force the hypotheses to have at least two dimensions you see a combination of the same dimension plus a junk dimension, e.g. ("Plasma glucose concentration a 2 hours in an oral glucose tolerance test." > 143.00 OR "Diastolic blood pressure (mm Hg)." < 0).
After between 1000 and 3000 iterations a new cluster forms at ("Plasma glucose concentration a 2 hours in an oral glucose tolerance test" > 139.00 OR "Body mass index (weight in kg/(height in m)^2)." > 45.30)
Here are some stats for the dataset of size 768 with 268 "positive" instances.
Accuracy | 0.76 |
Cluster size | 0.58 |
True positive | 147 |
True Negative | 433 |
False positive | 67 |
False Negative | 121 |
Precision | 0.687 |
Recall | 0.549 |
I then tried "plane intersection", where agents would choose up to one threshold per dimension, and take the AND of all comparisons, basically the intersection of a number of regions defined as being one side of a plane.
I only needed to redefine TM
The microtest selects a random row from the dataset and returns true if any of the DimensionThreshold
comparisons return true.
def TM_plane_intersection(dataset, rng):
def microtest(hyp, row):
comparison = all(
plane.operator(row[plane.dimension], plane.threshold)
for plane
in hyp
)
return comparison == row[-1]
def TM():
row = rng.choice(dataset)
return functools.partial(microtest, row=row)
return TM, microtest
After 5000 iterations with 10000 agents the largest cluster is at ((insulin < 545.00) AND (bmi > 30.10) AND (glucose_concentration > 126.00)
Here are the stats for this variant.
Accuracy | 0.77 |
Cluster size | 0.32 |
True positive | 145 |
True Negative | 443 |
False positive | 57 |
False Negative | 123 |
Precision | 0.718 |
Recall | 0.541 |
My next interest is in defining hypotheses as the union of a set of intersections. This is probably when we'll reach the limit of uniformly random sampling, and population sampling will come in, but I'll try an example with uniformly random selection and see where I get.
Population sampling is where random numbers are sampled from a distribution representing the current distribution of active agents rather than sampling a uniform distribution. For example, rather than selecting a dimension at random, a dimension will be selected by randomly selecting an agent, and if the selected agent is active, using a dimension from their hypothesis, if the selected agent is inactive, a dimension is selected uniformly at random.
Please send questions or comments to andrew@aomartin.co.uk, but do so on the understanding that I may post the entire content here.