Analyze Data

This tool analyzes handwriting datasets and, with the -f option, their features.

General usage

$ hwrt analyze_data --help
usage: hwrt analyze_data [-h] [-d FILE] [-f]

optional arguments:
  -h, --help            show this help message and exit
  -d FILE, --handwriting_datasets FILE
                        where are the pickled handwriting_datasets?
  -f, --features        analyze features

Plug-in System

The tool can be extended with plug-ins. To do so, edit the configuration file ~/.hwrtrc. The following two entries are important:

data_analyzation_plugins: /home/moose/Desktop/da.py
data_analyzation_queue:
  - TrainingCount:
    - filename: trainingcount.csv
  - Creator: null

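The value of data_analyzation_plugins is the path to the file that contains your self-written data analyzation classes. Each entry of data_analyzation_queue names a metric class together with its constructor arguments (null means the defaults are used). The plug-in file could look like this: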

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import time
from collections import defaultdict

# hwrt modules
from hwrt import HandwrittenData
from hwrt import utils
from hwrt import data_analyzation_metrics
from hwrt import geometry


class TrainingCount(object):
    """Analyze how many training examples exist for each symbol."""

    def __init__(self, filename="trainingcount.csv"):
        self.filename = data_analyzation_metrics.prepare_file(filename)

    def __repr__(self):
        return "TrainingCount(%s)" % self.filename

    def __str__(self):
        return "TrainingCount(%s)" % self.filename

    def __call__(self, raw_datasets):
        # Count the recordings of each symbol
        print_data = defaultdict(int)
        start_time = time.time()
        for i, raw_dataset in enumerate(raw_datasets):
            if i % 100 == 0 and i > 0:
                utils.print_status(len(raw_datasets), i, start_time)
            print_data[raw_dataset['handwriting'].formula_in_latex] += 1
        print("\r100%"+"\033[K\n")
        # Sort the data by highest value, descending
        print_data = sorted(print_data.items(),
                            key=lambda n: n[1],
                            reverse=True)
        # Write data to file; the with block closes it even if writing fails
        with open(self.filename, "a") as write_file:
            write_file.write("symbol,trainingcount\n")  # heading
            write_file.write("total,%i\n" %
                             sum(value for _, value in print_data))
            for symbol, value in print_data:
                write_file.write("%s,%i\n" % (symbol, value))
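
The contract visible in TrainingCount is small: the constructor receives the arguments given in the queue entry, and the instance is then called with the list of raw datasets. The following is a minimal sketch of a second plug-in written under that assumption. The class name TestsetShare and its default file name are hypothetical. It reads only the is_in_testset key of each dataset and writes its file directly instead of going through prepare_file, so it runs without hwrt installed (Python 3).

```python
import csv


class TestsetShare(object):
    """Hypothetical plug-in: count recordings inside and outside the test set."""

    def __init__(self, filename="testset_share.csv"):
        # Assumption: the queue entry passes 'filename' as a keyword argument
        self.filename = filename

    def __repr__(self):
        return "TestsetShare(%s)" % self.filename

    def __call__(self, raw_datasets):
        in_testset = sum(1 for rd in raw_datasets if rd["is_in_testset"])
        rows = [("testset", in_testset),
                ("other", len(raw_datasets) - in_testset)]
        # Mode "w" truncates the file, so repeated runs do not accumulate rows
        with open(self.filename, "w", newline="") as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(("group", "count"))
            writer.writerows(rows)
        # Returning the rows is a convenience of this sketch only
        return rows
```

Under the same assumption, such a class would be enabled by adding "- TestsetShare: null" to data_analyzation_queue.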

Default metrics

There are also many ready-to-use metrics:

Data analyzation metrics

Each metric works on a list of handwriting recordings. Metrics are applied like this:

>>> from hwrt.data_analyzation_metrics import Creator
>>> a = [{'is_in_testset': 0,
...    'formula_id': 31,
...    'handwriting': HandwrittenData(raw_data_id=2953),
...    'formula_in_latex': 'A',
...    'id': 2953},
...   {'is_in_testset': 0,
...    'formula_id': 31,
...    'handwriting': HandwrittenData(raw_data_id=4037),
...    'formula_in_latex': 'A',
...    'id': 4037},
...   {'is_in_testset': 0,
...    'formula_id': 31,
...    'handwriting': HandwrittenData(raw_data_id=4056),
...    'formula_in_latex': 'A',
...    'id': 4056}]
>>> creator_metric = Creator('creator.csv')
>>> creator_metric(a)

class hwrt.data_analyzation_metrics.AnalyzeErrors(filename='errors.txt', time_max_threshold=30000)

Analyze the number of errors in the dataset.

class hwrt.data_analyzation_metrics.Creator(filename='creator.csv')

Analyze who created most of the data.

class hwrt.data_analyzation_metrics.InstrokeSpeed(filename='instroke_speed.csv')

Analyze how fast the pen moved between the points of a stroke, in pixels/ms.

class hwrt.data_analyzation_metrics.InterStrokeDistance(filename='dist_between_strokes.csv')

Analyze the distance in px between strokes.

class hwrt.data_analyzation_metrics.TimeBetweenPointsAndStrokes(filename='average_time_between_points.txt', filename_strokes='average_time_between_strokes.txt')

For each recording: Store the average time between control points of one stroke / control points of two different strokes.

hwrt.data_analyzation_metrics.get_metrics(metrics_description)

Get metrics from a list of dictionaries.
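
To illustrate what "a list of dictionaries" means here, the sketch below builds metric objects from a structure shaped like the data_analyzation_queue entry of ~/.hwrtrc after parsing. It is an assumption about the mechanism, not hwrt's actual implementation: build_metrics, its registry argument, and the two stand-in classes are hypothetical names.

```python
class Creator(object):
    """Stand-in for a metric class; it only stores its filename."""

    def __init__(self, filename="creator.csv"):
        self.filename = filename


class TrainingCount(object):
    """Stand-in for a plug-in metric class."""

    def __init__(self, filename="trainingcount.csv"):
        self.filename = filename


def build_metrics(metrics_description, registry):
    """Instantiate one metric per entry; 'registry' maps names to classes."""
    metrics = []
    for entry in metrics_description:
        for name, params in entry.items():
            # params is None (null in the config) or a list of one-key dicts
            kwargs = {}
            for param in params or []:
                kwargs.update(param)
            metrics.append(registry[name](**kwargs))
    return metrics


# The queue from the configuration example, as parsed data
queue = [{"TrainingCount": [{"filename": "trainingcount.csv"}]},
         {"Creator": None}]
metrics = build_metrics(queue, {"Creator": Creator,
                                "TrainingCount": TrainingCount})
```

Each resulting object can then be called with the list of raw datasets, as in the Creator example above.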

hwrt.data_analyzation_metrics.prepare_file(filename)

Truncate the file and return the filename.
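
The behaviour described here can be sketched in a few lines. This is an illustration only: prepare_file_sketch is a hypothetical name, and the real function may additionally decide in which directory the file is placed. Truncating first is what allows a metric such as TrainingCount to open the file in append mode afterwards without keeping results from an earlier run.

```python
import os
import tempfile


def prepare_file_sketch(filename):
    """Empty the file (creating it if needed) and return its name."""
    # Opening in "w" mode truncates an existing file to zero length
    with open(filename, "w"):
        pass
    return filename


# Usage: a file with old content is empty afterwards
path = os.path.join(tempfile.mkdtemp(), "creator.csv")
with open(path, "w") as handle:
    handle.write("old content\n")
prepared = prepare_file_sketch(path)
size_after = os.path.getsize(prepared)  # 0
```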

hwrt.data_analyzation_metrics.sort_by_formula_id(raw_datasets)

Sort a list of formulas by id, where id represents the accepted formula id.

Parameters:

raw_datasets : list of dictionaries

A list of raw datasets.

Examples

The parameter raw_datasets has to have the following format:

>>> rd = [{'is_in_testset': 0,
...        'formula_id': 31,
...        'handwriting': HandwrittenData(raw_data_id=2953),
...        'formula_in_latex': 'A',
...        'id': 2953},
...       {'is_in_testset': 0,
...        'formula_id': 31,
...        'handwriting': HandwrittenData(raw_data_id=4037),
...        'formula_in_latex': 'A',
...        'id': 4037},
...       {'is_in_testset': 0,
...        'formula_id': 31,
...        'handwriting': HandwrittenData(raw_data_id=4056),
...        'formula_in_latex': 'A',
...        'id': 4056}]
>>> sort_by_formula_id(rd)
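
A minimal sketch of such a sort, assuming the result is ordered ascending by the formula_id key and returned as a new list; the page does not state whether the real function sorts in place. The name sort_by_formula_id_sketch is hypothetical, and the dictionaries are reduced to the two keys needed for the demonstration.

```python
def sort_by_formula_id_sketch(raw_datasets):
    """Return the datasets ordered by their 'formula_id' value."""
    # sorted() is stable: recordings with equal formula_id keep their order
    return sorted(raw_datasets, key=lambda rd: rd["formula_id"])


rd = [{"formula_id": 42, "id": 1},
      {"formula_id": 31, "id": 2},
      {"formula_id": 31, "id": 3}]
ordered = sort_by_formula_id_sketch(rd)
print([d["id"] for d in ordered])  # → [2, 3, 1]
```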
