Quick Start
control and treatment are 2D array-like data. Each row is a sample with N features, so control is an Mc x N matrix and treatment is an Mt x N matrix. The data does not have to be entirely numeric, but only a subset of numeric features can be used for PSM.
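As a minimal sketch of what "a subset of numeric features" can mean in practice, the snippet below pulls the numeric columns out of mixed-type rows. The row contents and the helper name select_numeric are hypothetical and are not part of ctmatching.

```python
# Hypothetical mixed-type rows: a string ID followed by three numeric features.
control_rows = [
    ["c1", 10.0, 0.0, 7.0],
    ["c2", 1.0, 4.0, 8.0],
]
treatment_rows = [
    ["t1", 8.0, 3.0, 8.0],
    ["t2", 2.0, -3.0, 4.0],
]


def select_numeric(rows, columns):
    """Return an M x len(columns) list of floats built from the chosen columns."""
    return [[float(row[c]) for c in columns] for row in rows]


numeric_columns = [1, 2, 3]  # skip column 0, which holds the string ID
control_numeric = select_numeric(control_rows, numeric_columns)
treatment_numeric = select_numeric(treatment_rows, numeric_columns)

# Both groups must share the same number of features N.
assert len(control_numeric[0]) == len(treatment_numeric[0])
print(control_numeric)
print(treatment_numeric)
```

In ctmatching itself, the use_col argument of psm plays this role, as shown in the advanced example further down.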
Minimal usage
In the simplest case, we select the most similar control sample for each treatment sample.
First, import:
from __future__ import print_function
import numpy as np
from ctmatching import psm, grouper
Create some test data:
control = np.array([[10., 0., 7.], [1., 4., 8.],])
treatment = np.array([[8., 3., 8.], [2., -3., 4.],])
Perform psm:
# by default, use_col = None, stratify_order = None, independent = True, k = 1
selected_control_index, selected_control_index_for_each_treatment = psm(control, treatment)
Display the matching:
for treatment_sample, control_samples in grouper(
        control, treatment, selected_control_index_for_each_treatment):
    print("\n--- %s ---" % treatment_sample)
    print("--- match:")
    for control_sample in control_samples:
        print(" %s" % control_sample)
The output looks like:
--- [ 8. 3. 8.] ---
--- match:
[ 10. 0. 7.]
--- [ 2. -3. 4.] ---
--- match:
[ 1. 4. 8.]
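To build intuition for this result, here is a rough sketch of nearest-neighbor matching. It assumes plain Euclidean distance over all features. That is an assumption made for illustration and not a description of how psm works internally (the library may scale features or use a different metric). The helper names are hypothetical.

```python
import math

control = [[10.0, 0.0, 7.0], [1.0, 4.0, 8.0]]
treatment = [[8.0, 3.0, 8.0], [2.0, -3.0, 4.0]]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_control_index(sample, control_rows):
    """Index of the control row closest to sample (assumed metric: Euclidean)."""
    return min(range(len(control_rows)),
               key=lambda i: euclidean(sample, control_rows[i]))


matches = [nearest_control_index(t, control) for t in treatment]
for t, i in zip(treatment, matches):
    print("%s -> %s" % (t, control[i]))
```

With this toy data the sketch pairs the samples the same way as the output shown above.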
Want to make the third feature (column index 2) more important? Let’s do stratified matching:
# use stratified order
selected_control_index, selected_control_index_for_each_treatment = psm(
    control, treatment, stratify_order=[[2], [0, 1]])
for treatment_sample, control_samples in grouper(
        control, treatment, selected_control_index_for_each_treatment):
    print("\n--- %s ---" % treatment_sample)
    print("--- match:")
    for control_sample in control_samples:
        print(" %s" % control_sample)
The output looks like:
--- [ 8. 3. 8.] ---
--- match:
[ 1. 4. 8.]
--- [ 2. -3. 4.] ---
--- match:
[ 10. 0. 7.]
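One way to think about stratified matching is as a lexicographic comparison: candidates are ranked by their distance on the first feature group, and later groups only break ties. The sketch below illustrates that idea under the assumption of Euclidean distance within each group. The helper names are hypothetical, and the library's actual implementation may differ.

```python
import math

control = [[10.0, 0.0, 7.0], [1.0, 4.0, 8.0]]
treatment = [[8.0, 3.0, 8.0], [2.0, -3.0, 4.0]]


def group_distance(a, b, columns):
    """Euclidean distance restricted to the given column indices."""
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in columns))


def stratified_key(sample, candidate, stratify_order):
    """One distance per feature group; tuples compare group by group, in order."""
    return tuple(group_distance(sample, candidate, cols) for cols in stratify_order)


def stratified_match(sample, control_rows, stratify_order):
    """Index of the control row with the smallest lexicographic distance key."""
    return min(range(len(control_rows)),
               key=lambda i: stratified_key(sample, control_rows[i], stratify_order))


order = [[2], [0, 1]]  # third feature first, then the first two together
stratified_matches = [stratified_match(t, control, order) for t in treatment]
for t, i in zip(treatment, stratified_matches):
    print("%s -> %s" % (t, control[i]))
```

On the toy data this reproduces the swapped pairing shown above, because the third feature alone now decides each match.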
Advanced Usage
In this section, we apply ctmatching to the US re78 data. The full description of this data is here.
In control/treatment group matching, we may have the following considerations:
- Sometimes we only want to use selected columns for matching.
- Sometimes we want to search for the most similar sample by feature1 first, and only among samples with the same feature1 value start considering feature2.
- We may need to take multiple attributes into account together.
- We may want to match every treatment sample to multiple different control samples.
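Two of these considerations, restricting the match to selected columns and returning several control samples per treatment sample, can be sketched in a few lines. This is only an illustration under the assumption of Euclidean distance. The helper k_nearest_controls and the rows are hypothetical and do not come from ctmatching or the re78 data.

```python
import math


def k_nearest_controls(sample, control_rows, use_col, k):
    """Indices of the k control rows closest to sample on the columns in use_col."""
    def distance(i):
        row = control_rows[i]
        return math.sqrt(sum((float(sample[c]) - float(row[c])) ** 2
                             for c in use_col))
    # sorted() is stable, so ties keep their original row order.
    return sorted(range(len(control_rows)), key=distance)[:k]


# Hypothetical rows: a string ID followed by two numeric features.
control_rows = [
    ["c1", 40, 11],
    ["c2", 20, 9],
    ["c3", 46, 11],
    ["c4", 28, 12],
]
treated = ["t1", 37, 11]

# Column 0 (the ID) is excluded from the distance by leaving it out of use_col.
top2 = k_nearest_controls(treated, control_rows, use_col=[1, 2], k=2)
print([control_rows[i][0] for i in top2])
```

In ctmatching, the use_col and k arguments of psm cover the same needs, as the example below shows.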
OK, let’s take a look at the code:
from ctmatching import load_re78
control, treatment = load_re78() # load data
# use only the columns at zero-based indices 2 through 7; the second entry of use_col
# (column index 3) is the dominant feature, then the fourth entry (column index 5) comes next
selected_control_index, selected_control_index_for_each_treatment = psm(
    control, treatment,
    use_col = [2,3,4,5,6,7],
    stratify_order = [[1],[3],[0,2,4],[5]],
    independent = False,
    k = 2,
)
for treatment_sample, control_samples in grouper(
        control, treatment, selected_control_index_for_each_treatment):
    print("\n--- %s ---" % treatment_sample)
    print("--- match:")
    for control_sample in control_samples:
        print(" %s" % control_sample)
The output looks like:
--- ['NSW1' '1' '37' '11' '1' '0' '1' '1' '0.0' '0.0' '9930.046'] ---
--- match:
['PSID368' '0' '40' '11' '1' '0' '1' '1' '0.0' '0.0' '0.0']
['PSID375' '0' '46' '11' '1' '0' '1' '1' '0.0' '0.0' '2820.98']
--- ['NSW2' '1' '22' '9' '0' '1' '0' '1' '0.0' '0.0' '3595.894'] ---
--- match:
['PSID341' '0' '20' '9' '0' '1' '0' '1' '1500.798' '0.0' '12618.31']
['PSID334' '0' '19' '9' '0' '1' '0' '1' '1822.118' '0.0' '3372.172']
--- ['NSW3' '1' '30' '12' '1' '0' '0' '0' '0.0' '0.0' '24909.45'] ---
--- match:
['PSID99' '0' '28' '12' '1' '0' '0' '0' '16722.34' '4253.806' '7314.747']
['PSID159' '0' '28' '12' '1' '0' '0' '0' '6285.328' '2255.806' '7310.313']
...
Not too hard, right?
To go one step further, see the API reference for ctmatching.core.psm().