core

Main module of the control/treatment matching algorithm (propensity score matching).

ctmatching.core.normalize(train, test)[source]

Pre-processing: normalize the data by removing the mean and scaling out the variance.

Parameters:
  • train – 2-d ndarray like data.
  • test – 2-d ndarray like data.
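
As an illustration of this kind of normalization, here is a minimal NumPy sketch. It is not the library's implementation: the name normalize_sketch is hypothetical, and it assumes that the column mean and standard deviation are computed from train only and then applied to both arrays.

```python
import numpy as np

def normalize_sketch(train, test):
    """Standardize both arrays using the column statistics of `train` (assumption)."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (train - mean) / std, (test - mean) / std

train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[2.0, 20.0]])
train_n, test_n = normalize_sketch(train, test)
```

After this step every column of train_n has zero mean and unit variance, so no single feature dominates a distance computation just because of its scale.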
ctmatching.core.dist(X, Y)[source]

Calculate the distance matrix between X and Y.

X and Y are M x N matrices. The number of rows M may differ between them, but the number of columns N must be the same.

Parameters:
  • X – 2-d ndarray like data.
  • Y – 2-d ndarray like data.
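
A pairwise distance matrix of this shape can be sketched with NumPy broadcasting. The function name dist_sketch is hypothetical, and the Euclidean metric is an assumption made for illustration; the library may compute the distances differently.

```python
import numpy as np

def dist_sketch(X, Y):
    """Return D where D[i, j] is the Euclidean distance between X[i] and Y[j]."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    if X.shape[1] != Y.shape[1]:
        raise ValueError("X and Y must have the same number of columns")
    # Broadcast to shape (len(X), len(Y), N), then reduce over the feature axis.
    diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

X = np.array([[0.0, 0.0], [3.0, 4.0]])
Y = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = dist_sketch(X, Y)  # one row per sample in X, one column per sample in Y
```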
ctmatching.core.exam_input(control, treatment, stratify_order=None)[source]

Examine (validate) the input arguments.

ctmatching.core.stratified_matching(control, treatment, stratify_order)[source]

Calculate the order of matched control samples. Component function of psm().

Here’s how it’s done.

control = 1000 * 5 (1000 samples, 5-dimension vector)
treatment = 100 * 5 (100 samples, 5-dimension vector)
stratify_order = [[0], [1,2,3], [4]]

1. Construct 3 distance matrices, one for each of the 3 stratification rules. Each matrix has size 100 * 1000.

  • select the columns of treatment named by the rule (for the first rule, the first column)
  • select the same columns of control
  • compute the distance matrix
  • repeat this for all three rules

2. For each treatment sample, construct a 1000 * 3 matrix in which each column holds the distances from that sample to every control sample under one rule. Sort the rows by the first column, then the second column, and finally the third column. The first row index is now the nearest sample in the control group by means of stratification. Append the sorted row indices to the “indices” matrix.

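Parameters:
  • stratify_order – list of lists of column indices, one inner list per stratification rule, in priority order.
Returns nn_index:
 k-NN index, an M1 x M2 matrix, where M1 is the number of treatment samples and M2 is the number of control samples.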
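
The two steps above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: stratified_order_sketch is a hypothetical name, the per-rule distance is assumed to be Euclidean, and np.lexsort is used for the multi-column sort.

```python
import numpy as np

def stratified_order_sketch(control, treatment, stratify_order):
    """For each treatment sample, order control indices by stratified distance."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    # Step 1: one (n_treatment, n_control) distance matrix per stratification rule.
    dists = []
    for cols in stratify_order:
        t = treatment[:, cols][:, np.newaxis, :]
        c = control[:, cols][np.newaxis, :, :]
        dists.append(np.sqrt(((t - c) ** 2).sum(axis=2)))
    # Step 2: sort control samples by rule 1, ties broken by rule 2, and so on.
    nn_index = np.empty((len(treatment), len(control)), dtype=int)
    for i in range(len(treatment)):
        # np.lexsort treats the LAST key as the primary key, so reverse the rules.
        keys = tuple(d[i] for d in reversed(dists))
        nn_index[i] = np.lexsort(keys)
    return nn_index

control = np.array([[1, 0], [2, 4], [1, 5]])
treatment = np.array([[1, 4]])
nn_index = stratified_order_sketch(control, treatment, [[0], [1]])
```

In this small example control[1] is the closest sample on the second column, but it still ranks last because it differs from the treatment sample on the first, higher-priority column.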
ctmatching.core.non_stratified_matching(control, treatment)[source]

Find the indices of the k-nearest neighbors in the control group for each treatment sample.

Returns nn_index:
 k-NN index, an M1 x M2 matrix, where M1 is the number of treatment samples and M2 is the number of control samples.

Component function of psm().
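
Without stratification, the ordering reduces to sorting each row of a single distance matrix. The sketch below assumes a Euclidean metric and uses the hypothetical name nearest_control_order_sketch; it shows the general idea, not the library's exact code.

```python
import numpy as np

def nearest_control_order_sketch(control, treatment):
    """Row i lists all control indices, nearest to treatment[i] first."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    diff = treatment[:, np.newaxis, :] - control[np.newaxis, :, :]
    distance = np.sqrt((diff ** 2).sum(axis=2))  # shape (n_treatment, n_control)
    # A stable sort keeps the lower control index first when distances tie.
    return np.argsort(distance, axis=1, kind="stable")

control = np.array([[0.0], [10.0], [4.0]])
treatment = np.array([[1.0], [9.0]])
order = nearest_control_order_sketch(control, treatment)
```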

ctmatching.core.non_repeat_index_matching(nn_indices, k=1)[source]

Each treatment sample is matched against different samples from the control group, so no control sample is selected twice.

For example:

- treatment_sample1 matches: control_1, control_25, control_30
- treatment_sample2 matches: control_2, control_25, control_34

Because treatment_sample1 has already taken control_25, treatment_sample2 has to skip it and take control_2 and control_34 (its nearest and third-nearest neighbors).

Returns selected_control_index:
 all control samples selected for the entire treatment group.
Returns selected_control_index_for_each_treatment:
 the control samples selected for each treatment sample.

Component function of psm().
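
One way to picture this rule is a greedy pass over the treatment samples in order. The sketch below is an assumption about the general approach, not the library's code: non_repeat_sketch is a hypothetical name, and it raises a plain ValueError where the library defines its own exception.

```python
def non_repeat_sketch(nn_indices, k=1):
    """Greedily give each treatment sample its k nearest unused control samples."""
    used = set()
    per_treatment = []
    for row in nn_indices:          # row: control indices, nearest first
        picked = []
        for idx in row:
            idx = int(idx)
            if idx in used:
                continue            # already taken by an earlier treatment sample
            used.add(idx)
            picked.append(idx)
            if len(picked) == k:
                break
        if len(picked) < k:
            raise ValueError("not enough unused control samples")
        per_treatment.append(picked)
    selected = [idx for picked in per_treatment for idx in picked]
    return selected, per_treatment

nn = [[1, 25, 30, 2, 34],
      [2, 25, 34, 40, 1]]
selected, per_treatment = non_repeat_sketch(nn, k=3)
```

With a greedy pass the result depends on the order of the treatment samples, since earlier samples get first choice of the control group.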

ctmatching.core.independent_index_matching(nn_indices, k=1)[source]

Each treatment sample is matched against its first k nearest neighbors in the control group. Multiple treatment samples may match the same control sample.

Returns selected_control_index:
 all control samples selected for the entire treatment group.
Returns selected_control_index_for_each_treatment:
 the control samples selected for each treatment sample.

Component function of psm().
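
Since every treatment sample chooses on its own, the selection amounts to taking the first k columns of the neighbor index. A minimal sketch, with the hypothetical name independent_sketch and the assumption that nn_indices is already sorted nearest first:

```python
import numpy as np

def independent_sketch(nn_indices, k=1):
    """Take the first k neighbors of every row; repeats across rows are allowed."""
    nn_indices = np.asarray(nn_indices)
    per_treatment = nn_indices[:, :k]
    return per_treatment.ravel(), per_treatment

nn = np.array([[1, 25, 30],
               [25, 2, 34]])
selected, per_treatment = independent_sketch(nn, k=2)
```

Here control sample 25 appears in both rows, which is exactly what the non-repeat variant forbids.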

ctmatching.core.psm(control, treatment, use_col=None, stratify_order=None, independent=True, k=1)[source]

Propensity score matching main function.

If you want to know how the psm algorithm works internally, see stratified_matching(), non_stratified_matching(), non_repeat_index_matching(), and independent_index_matching(). Otherwise, just read the parameter definitions below.

Suppose we have m1 control samples and m2 treatment samples. Each sample is an n-dimensional vector.

Parameters:
  • control (numpy.ndarray) –

    control group sample data, m1 x n matrix. Example:

    [[c1_1, c1_2, ..., c1_n], # c means control
    
    [c2_1, c2_2, ..., c2_n], ..., [cm1_1, cm1_2, ..., cm1_n],]
  • treatment (numpy.ndarray) –

    treatment group sample data, m2 x n matrix. Example:

    [[t1_1, t1_2, ..., t1_n], # t means treatment
    
    [t2_1, t2_2, ..., t2_n], ..., [tm2_1, tm2_2, ..., tm2_n],]
  • use_col (list or numpy.ndarray) –

    (default None, use all) list of column index. Example:

    [0, 1, 4, 6, 7, 9] # use first, second, fifth, ... columns
    
  • stratify_order (list of list) –

    (default None, use normal nearest neighbor) list of list. Example:

    # for input data has 6 columns
    # first feature has highest priority
    # [second, third, fourth] features have second highest priority, compared by means of euclidean distance
    # fifth feature has third priority, ...
    [[0], [1, 2, 3], [4], [5]]
    
  • independent (boolean) – (default True) if True, the same control sample may be matched to more than one treatment sample.
  • k (int) – (default 1) number of control samples selected for each treatment sample.
Returns selected_control_index:
 all control samples selected for the entire treatment group.
Returns selected_control_index_for_each_treatment:
 the control samples selected for each treatment sample.

selected_control_index: selected control sample index. Example (k = 3):

(m2 * k)-length array: [7, 120, 43, 54, 12, 98, ..., 71, 37, 14]

selected_control_index_for_each_treatment: selected control sample index for each treatment sample. Example (k = 3):

# for treatment[0], we have control[7], control[120], control[43]
# matched by means of stratification.
[[7, 120, 43],
 [54, 12, 98],
 ...,
 [71, 37, 14],]
Raises:
  • InputError – if the input parameters are not valid.
  • NotEnoughControlSampleError – if there is not sufficient data for independent index matching.
ctmatching.core.grouper(control, treatment, selected_control_index_for_each_treatment)[source]

Generate pairs of a treatment sample and its matched control samples.
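
The pairing can be pictured as a generator that walks the treatment rows and looks up the matched control rows by index. This is a hedged sketch only: grouper_sketch is a hypothetical name, and the exact form of what the library yields is an assumption.

```python
import numpy as np

def grouper_sketch(control, treatment, selected_control_index_for_each_treatment):
    """Yield (treatment_sample, matched_control_samples) for every treatment row."""
    control = np.asarray(control)
    treatment = np.asarray(treatment)
    for t_sample, c_index in zip(treatment, selected_control_index_for_each_treatment):
        # Integer-array indexing picks the matched control rows in match order.
        yield t_sample, control[np.asarray(c_index, dtype=int)]

control = np.arange(10).reshape(5, 2)          # 5 control samples, 2 features
treatment = np.array([[100, 101], [200, 201]])
matches = [[0, 3], [4, 1]]                     # k = 2 matches per treatment sample
pairs = list(grouper_sketch(control, treatment, matches))
```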