This tool analyzes the recorded handwriting data by features.
$ hwrt analyze_data --help
usage: hwrt analyze_data [-h] [-d FILE] [-f]

optional arguments:
  -h, --help            show this help message and exit
  -d FILE, --handwriting_datasets FILE
                        where are the pickled handwriting_datasets?
  -f, --features        analyze features
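For example, the feature analysis can be run on a dataset like this (the path to the pickle file is hypothetical):

$ hwrt analyze_data -d ~/hwr-experiments/handwriting_datasets.pickle -f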
It can be extended by a plugin system. To do so, the configuration file ~/.hwrtrc has to be edited. The following two entries are important:
data_analyzation_plugins: /home/moose/Desktop/da.py
data_analyzation_queue:
  - TrainingCount:
      - filename: trainingcount.csv
  - Creator: null
The value of data_analyzation_plugins indicates where the file with self-written data analyzation classes is located; data_analyzation_queue lists the metrics that get applied, each followed by its constructor arguments. The plugin file could look like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import time
from collections import defaultdict

# hwrt modules
from hwrt import HandwrittenData
from hwrt import utils
from hwrt import data_analyzation_metrics
from hwrt import geometry


class TrainingCount(object):
    """Analyze how many training examples exist for each symbol."""

    def __init__(self, filename="trainingcount.csv"):
        self.filename = data_analyzation_metrics.prepare_file(filename)

    def __repr__(self):
        return "TrainingCount(%s)" % self.filename

    def __str__(self):
        return "TrainingCount(%s)" % self.filename

    def __call__(self, raw_datasets):
        write_file = open(self.filename, "a")
        write_file.write("symbol,trainingcount\n")  # heading

        # Count the training examples per LaTeX symbol
        print_data = defaultdict(int)
        start_time = time.time()
        for i, raw_dataset in enumerate(raw_datasets):
            if i % 100 == 0 and i > 0:
                utils.print_status(len(raw_datasets), i, start_time)
            print_data[raw_dataset['handwriting'].formula_in_latex] += 1
        print("\r100%" + "\033[K\n")

        # Sort the data by highest value, descending
        print_data = sorted(print_data.items(),
                            key=lambda n: n[1],
                            reverse=True)

        # Write data to file
        write_file.write("total,%i\n" %
                         sum(count for _, count in print_data))
        for symbol, count in print_data:
            write_file.write("%s,%i\n" % (symbol, count))
        write_file.close()
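Such a plugin metric can also be run by hand. A minimal sketch, assuming the pickled dataset stores its recordings under a 'handwriting_datasets' key (the file name is hypothetical):

import pickle

# Load the pickled recordings; the 'handwriting_datasets' key is an
# assumption about the layout of the pickle file.
with open("handwriting_datasets.pickle", "rb") as f:
    raw_datasets = pickle.load(f)['handwriting_datasets']

metric = TrainingCount(filename="trainingcount.csv")
metric(raw_datasets)  # appends the symbol counts to the prepared CSV file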
There are also many ready-to-use metrics:
Data analyzation metrics
Each metric works on a set of handwritten recordings. They are applied like this:
>>> from hwrt import data_analyzation_metrics
>>> a = [{'is_in_testset': 0,
...       'formula_id': 31,
...       'handwriting': HandwrittenData(raw_data_id=2953),
...       'formula_in_latex': 'A',
...       'id': 2953},
...      {'is_in_testset': 0,
...       'formula_id': 31,
...       'handwriting': HandwrittenData(raw_data_id=4037),
...       'formula_in_latex': 'A',
...       'id': 4037},
...      {'is_in_testset': 0,
...       'formula_id': 31,
...       'handwriting': HandwrittenData(raw_data_id=4056),
...       'formula_in_latex': 'A',
...       'id': 4056}]
>>> creator_metric = data_analyzation_metrics.Creator('creator.csv')
>>> creator_metric(a)
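The metric objects can also be constructed from a queue description as it appears in ~/.hwrtrc. A minimal sketch, assuming the helper that "gets metrics from a list of dictionaries" (see below) is named get_metrics and accepts the parsed YAML list unchanged:

>>> queue = [{'Creator': None}]  # what YAML parses '- Creator: null' to
>>> for metric in data_analyzation_metrics.get_metrics(queue):
...     metric(a)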
The module ships, among others, with the following metrics:

- Analyze the number of errors in the dataset.
- Analyze who created most of the data.
- Analyze how fast the points were in pixel/ms.
- Analyze how much distance in px is between strokes.
- For each recording: store the average time between control points of one stroke / control points of two different strokes.

and with helper functions to:

- get metrics from a list of dictionaries,
- truncate the output file and return the filename,
- sort a list of formulas by id, where id represents the accepted formula id.
Parameters:
    raw_datasets : list of dictionaries
Examples
The parameter raw_datasets of sort_by_formula_id has to have the following format:
>>> rd = [{'is_in_testset': 0,
...        'formula_id': 31,
...        'handwriting': HandwrittenData(raw_data_id=2953),
...        'formula_in_latex': 'A',
...        'id': 2953},
...       {'is_in_testset': 0,
...        'formula_id': 31,
...        'handwriting': HandwrittenData(raw_data_id=4037),
...        'formula_in_latex': 'A',
...        'id': 4037},
...       {'is_in_testset': 0,
...        'formula_id': 31,
...        'handwriting': HandwrittenData(raw_data_id=4056),
...        'formula_in_latex': 'A',
...        'id': 4056}]
>>> sort_by_formula_id(rd)
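prepare_file is the counterpart used by the metrics themselves: it decides where an output file such as trainingcount.csv is placed and truncates it before the metric starts writing; TrainingCount above calls it in its constructor. A minimal sketch (the returned path depends on the local hwrt configuration):

>>> from hwrt import data_analyzation_metrics
>>> csv_path = data_analyzation_metrics.prepare_file('trainingcount.csv')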