Quick Start

control and treatment are 2D array-like data in which each row is a sample with N features. Let’s say control is an Mc x N matrix and treatment is an Mt x N matrix. The columns don’t all have to be numeric, but only the numeric features can be used for PSM.
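As a small illustration of the last point, the sketch below pulls the numeric columns out of a mixed array with plain NumPy. The rows and the column positions are hypothetical and chosen only for the example; this is not how ctmatching handles the selection internally.

```python
import numpy as np

# Hypothetical mixed data: an ID column followed by two numeric columns.
mixed = np.array([
    ["NSW1", "37", "11"],
    ["NSW2", "22", "9"],
])

# Keep only the numeric columns (positions 1 and 2) and convert to float.
numeric_cols = [1, 2]
features = mixed[:, numeric_cols].astype(float)

print(features.shape)  # (number of samples, number of numeric features)
```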

Minimal usage

Here we simply select the most similar control sample for each treatment sample.

First, import:

from __future__ import print_function
import numpy as np
from ctmatching import psm, grouper

Create some test data:

control = np.array([[10., 0., 7.], [1., 4., 8.],])
treatment = np.array([[8., 3., 8.], [2., -3., 4.],])

Perform psm:

# by default, use_col = None, stratify_order = None, independent = True, k = 1
selected_control_index, selected_control_index_for_each_treatment = psm(control, treatment)

Display the matching:

for treatment_sample, control_samples in grouper(
    control, treatment, selected_control_index_for_each_treatment):

    print("\n--- %s ---" % treatment_sample)
    print("--- match:")
    for control_sample in control_samples:
        print("    %s" % control_sample)

The output looks like:

--- [ 8.  3.  8.] ---
--- match:
    [ 10.   0.   7.]

--- [ 2. -3.  4.] ---
--- match:
    [ 1.  4.  8.]
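To see why these pairs come out, here is a minimal sketch of nearest-neighbour matching on the same test data. It assumes plain Euclidean distance on the raw features; psm may scale features or use a different metric, so treat the helper nearest_control (a name made up for this example) as an illustration rather than the library's implementation.

```python
import numpy as np

control = np.array([[10., 0., 7.], [1., 4., 8.]])
treatment = np.array([[8., 3., 8.], [2., -3., 4.]])

def nearest_control(control, treatment):
    """Return, for each treatment row, the index of the closest control row."""
    # Pairwise differences via broadcasting: shape (Mt, Mc, N).
    diff = treatment[:, None, :] - control[None, :, :]
    # Euclidean distance between every treatment/control pair: shape (Mt, Mc).
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return dist.argmin(axis=1)

nearest = nearest_control(control, treatment)
print(nearest)
```

Under this assumption the first treatment sample pairs with control row 0 and the second with control row 1, which agrees with the matching displayed above.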

Want the third feature (column index 2) to matter most? Use stratified matching:

# match on column 2 first, then on columns 0 and 1 together
selected_control_index, selected_control_index_for_each_treatment = psm(
    control, treatment, stratify_order=[[2], [0, 1]])

for treatment_sample, control_samples in grouper(
    control, treatment, selected_control_index_for_each_treatment):

    print("\n--- %s ---" % treatment_sample)
    print("--- match:")
    for control_sample in control_samples:
        print("    %s" % control_sample)

The output looks like:

--- [ 8.  3.  8.] ---
--- match:
    [ 1.  4.  8.]

--- [ 2. -3.  4.] ---
--- match:
    [ 10.   0.   7.]
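One way to picture stratified matching is as a lexicographic comparison: candidates are ranked by their distance on the first feature group, and later groups only break ties. The sketch below assumes exactly that, with Euclidean distance inside each group; the helper stratified_nearest is hypothetical, and the library may bin or scale values differently.

```python
import numpy as np

control = np.array([[10., 0., 7.], [1., 4., 8.]])
treatment = np.array([[8., 3., 8.], [2., -3., 4.]])

def stratified_nearest(control, treatment, stratify_order):
    """Pick one control index per treatment row, comparing groups in order."""
    matches = []
    for t in treatment:
        # One distance per feature group, most important group first.
        keys = [
            tuple(np.linalg.norm(c[group] - t[group]) for group in stratify_order)
            for c in control
        ]
        # Tuples compare lexicographically, so earlier groups dominate.
        matches.append(min(range(len(control)), key=lambda i: keys[i]))
    return matches

stratified = stratified_nearest(control, treatment, [[2], [0, 1]])
print(stratified)
```

On the test data this picks control row 1 for the first treatment sample and control row 0 for the second, the same swap shown in the output above.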

Advanced Usage

In this section, we will use ctmatching on US re78 data. The full description of this data is here

When matching control and treatment groups, we may have the following considerations:

  • Sometimes we only want to use selected columns for matching.
  • Sometimes we want to find the most similar sample by feature1 first and, among samples with the same feature1 value, then consider feature2.
  • We may need to take multiple attributes into account together.
  • We may want every treatment sample to match multiple different control samples.

OK, let’s take a look at the code:

from ctmatching import psm, grouper, load_re78

control, treatment = load_re78() # load data

# Match on the columns at indices 2 through 7 only (use_col). Indices in
# stratify_order are positions within use_col: position 1 (column 3) is the
# dominant feature, position 3 (column 5) comes next, and so on.
selected_control_index, selected_control_index_for_each_treatment = psm(
    control, treatment,
    use_col = [2,3,4,5,6,7],
    stratify_order = [[1],[3],[0,2,4],[5]],
    independent = False,
    k = 2,
)

for treatment_sample, control_samples in grouper(
    control, treatment, selected_control_index_for_each_treatment):

    print("\n--- %s ---" % treatment_sample)
    print("--- match:")
    for control_sample in control_samples:
        print("    %s" % control_sample)

The output looks like:

--- ['NSW1' '1' '37' '11' '1' '0' '1' '1' '0.0' '0.0' '9930.046'] ---
--- match:
    ['PSID368' '0' '40' '11' '1' '0' '1' '1' '0.0' '0.0' '0.0']
    ['PSID375' '0' '46' '11' '1' '0' '1' '1' '0.0' '0.0' '2820.98']

--- ['NSW2' '1' '22' '9' '0' '1' '0' '1' '0.0' '0.0' '3595.894'] ---
--- match:
    ['PSID341' '0' '20' '9' '0' '1' '0' '1' '1500.798' '0.0' '12618.31']
    ['PSID334' '0' '19' '9' '0' '1' '0' '1' '1822.118' '0.0' '3372.172']

--- ['NSW3' '1' '30' '12' '1' '0' '0' '0' '0.0' '0.0' '24909.45'] ---
--- match:
    ['PSID99' '0' '28' '12' '1' '0' '0' '0' '16722.34' '4253.806' '7314.747']
    ['PSID159' '0' '28' '12' '1' '0' '0' '0' '6285.328' '2255.806' '7310.313']
...
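The call above asks for k = 2 matches per treatment sample with independent = False. How ctmatching implements these options is not shown here. As one plausible reading, the sketch below assumes that a control sample, once matched, is not reused, and it greedily takes the k closest remaining controls for each treatment sample in turn. The helper greedy_k_match and the one-feature data are hypothetical.

```python
import numpy as np

def greedy_k_match(control, treatment, k):
    """For each treatment row, take the k closest controls not yet used."""
    available = set(range(len(control)))
    matches = []
    for t in treatment:
        # Euclidean distance from this treatment row to every control row.
        dist = np.linalg.norm(control - t, axis=1)
        # Controls ordered by distance, skipping those already matched.
        ranked = [int(i) for i in np.argsort(dist) if int(i) in available]
        chosen = ranked[:k]
        available.difference_update(chosen)
        matches.append(chosen)
    return matches

# Hypothetical data with a single feature, so distances are easy to check.
toy_control = np.array([[0.], [1.], [3.], [10.]])
toy_treatment = np.array([[1.2], [2.5]])

matches = greedy_k_match(toy_control, toy_treatment, k=2)
print(matches)
```

Because the first treatment sample claims controls 1 and 0, the second has to fall back on controls 2 and 3 even though control 1 is closer to it than control 3. Greedy matching without reuse therefore depends on the order in which treatment samples are processed.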

Not too hard, right?

To go one step further, see the API reference for ctmatching.core.psm().