Simple xmgrace XVG file format¶
Gromacs produces graphs in the xmgrace (“xvg”) format. These are
simple multi-column data files. The class XVG
encapsulates
access to such files and adds a number of methods to access the data
(as NumPy arrays), compute aggregates, or quickly plot it.
The XVG
class is useful beyond reading xvg files. With the
array keyword or the XVG.set()
method one can load data from
an array instead of a file. The array should be simple “NXY” data
(typically: first column time or position, further columns scalar
observables). The data should be a NumPy numpy.ndarray
array
a
with shape
(M, N)
where M-1 is the
number of observables and N the number of observations, e.g.the
number of time points in a time series. a[0]
is the time or
position and a[1:]
the M-1 data columns.
Errors¶
The XVG.error
attribute contains the statistical error for
each timeseries. It is computed from the standard deviation of the
fluctuations from the mean and their correlation time. The parameters
for the calculations of the correlation time are set with
XVG.set_correlparameters()
.
See also
Plotting¶
The XVG.plot()
and XVG.errorbar()
methods are set up to
produce graphs of multiple columns simultaneously. It is
typically assumed that the first column in the selected (sub)array
contains the abscissa (“x-axis”) of the graph and all further columns
are plotted against the first one.
Data selection¶
Plotting from XVG
is fairly flexible as one can always pass
the columns keyword to select which columns are to be
plotted. Assuming that the data contains [t, X1, X2, X3]
, then one
can
plot all observable columns (X1 to X3) against t:
xvg.plot()
plot only X2 against t:
xvg.plot(columns=[0,2])
plot X2 and X3 against t:
xvg.plot(columns=[0,2,3])
plot X1 against X3:
xvg.plot(columns=[2,3])
Coarse grainining of data¶
It is also possible to coarse grain the data for plotting (which typically results in visually smoothing the graph because noise is averaged out).
Currently, two alternative algorithms to produce “coarse grained” (decimated) graphs are implemented and can be selected with the method keyword for the plotting functions in conjuction with maxpoints (the number of points to be plotted):
- mean histogram (default) — bin the data (using
numkit.timeseries.regularized_function()
and compute the mean for each bin. Gives the exact number of desired points but the time data are whatever the middle of the bin is. - smooth subsampled — smooth the data with a running average (other windows like Hamming are also possible) and then pick data points at a stepsize compatible with the number of data points required. Will give exact times but not the exact number of data points.
For simple test data, both approaches give very similar output.
For the special case of periodic data such as angles, one can use the
circular mean (“circmean”) to coarse grain. In this case, jumps across
the -180º/+180º boundary are added as masked datapoints and no line is
drawn across the jump in the plot. (Only works with the simple
XVG.plot()
method at the moment, errorbars or range plots are
not implemented yet.)
See also
Examples¶
In this example we generate a noisy time series of a sine wave. We store the time, the value, and an error. (In a real example, the value might be the mean over multiple observations and the error might be the estimated error of the mean.)
>>> import numpy as np
>>> import gromacs.formats
>>> X = np.linspace(-10,10,50000)
>>> yerr = np.random.randn(len(X))*0.05
>>> data = np.vstack((X, np.sin(X) + yerr, np.random.randn(len(X))*0.05))
>>> xvg = gromacs.formats.XVG(array=data)
Plot value for all time points:
>>> xvg.plot(columns=[0,1], maxpoints=None, color="black", alpha=0.2)
Plot bin-averaged (decimated) data with the errors, over 1000 points:
>>> xvg.errorbar(maxpoints=1000, color="red")
(see output in Figure Plot of Raw vs Decimated data)
In principle it is possible to use other functions to decimate the
data. For XVG.plot()
, the method keyword can be changed (see
XVG.decimate()
for allowed method values). For
XVG.errorbar()
, the method to reduce the data values (typically
column 1) is fixed to “mean” but the errors (typically columns 2 and 3)
can also be reduced with error_method = “rms”.
If one wants to show the variation of the raw data together with the decimated and smoothed data then one can plot the percentiles of the deviation from the mean in each bin:
>>> xvg.errorbar(columns=[0,1,1], maxpoints=1000, color="blue", demean=True)
The demean keyword indicates if fluctuations from the mean are
regularised [1]. The method XVG.plot_coarsened()
automates this approach and can plot coarsened data selected by the
columns keyword.
Footnotes
[1] | When error_method = “percentile” is selected for
XVG.errorbar() then demean does not actually
force a regularisation of the fluctuations from the
mean. Instead, the (symmetric) percentiles are computed
on the full data and the error ranges for plotting are
directly set to the percentiles. In this way one can
easily plot the e.g. 10th-percentile to 90th-percentile
band (using keyword percentile = 10). |
Classes and functions¶
-
class
gromacs.fileformats.xvg.
XVG
(filename=None, names=None, array=None, permissive=False, **kwargs)¶ Class that represents the numerical data in a grace xvg file.
All data must be numerical.
NAN
andINF
values are supported via python’sfloat()
builtin function.The
array
attribute can be used to access the the array once it has been read and parsed. Thema
attribute is a numpy masked array (good for plotting).Conceptually, the file on disk and the XVG instance are considered the same data. Whenever the filename for I/O (
XVG.read()
andXVG.write()
) is changed then the filename associated with the instance is also changed to reflect the association between file and instance.With the permissive =
True
flag one can instruct the file reader to skip unparseable lines. In this case the line numbers of the skipped lines are stored inXVG.corrupted_lineno
.A number of attributes are defined to give quick access to simple statistics such as
mean
: mean of all data columnsstd
: standard deviationmin
: minimum of datamax
: maximum of dataerror
: error on the mean, taking correlation times into account (see alsoXVG.set_correlparameters()
)tc
: correlation time of the data (assuming a simple exponential decay of the fluctuations around the mean)
These attributes are numpy arrays that correspond to the data columns, i.e. :attr:`XVG.array`[1:].
Note
- Only simple XY or NXY files are currently supported, not Grace files that contain multiple data sets separated by ‘&’.
- Any kind of formatting (i.e. xmgrace commands) is discarded.
Initialize the class from a xvg file.
Arguments: - filename
is the xvg file; it can only be of type XY or NXY. If it is supplied then it is read and parsed when
XVG.array
is accessed.- names
optional labels for the columns (currently only written as comments to file); string with columns separated by commas or a list of strings
- array
read data from array (see
XVG.set()
)- permissive
False
raises aValueError
and logs and errior when encountering data lines that it cannot parse.True
ignores those lines and logs a warning—this is a risk because it might read a corrupted input file [False
]- stride
Only read every stride line of data [1].
- savedata
True
includes the data (XVG.array`
and associated caches) when the instance is pickled (seepickle
); this is oftens not desirable because the data are already on disk (the xvg file filename) and the resulting pickle file can become very big.False
omits those data from a pickle. [False
]- metadata
dictionary of metadata, which is not touched by the class
-
array
¶ Represent xvg data as a (cached) numpy array.
The array is returned with column-first indexing, i.e. for a data file with columns X Y1 Y2 Y3 ... the array a will be a[0] = X, a[1] = Y1, ... .
-
decimate
(method, a, **kwargs)¶ Decimate data a to maxpoints using method.
If a is a 1D array then it is promoted to a (2, N) array where the first column simply contains the index.
If the array contains fewer than maxpoints points or if maxpoints is
None
then it is returned as it is. The default for maxpoints is 10000.Valid values for the reduction method:
- “mean”, uses
XVG.decimate_mean()
to coarse grain by averaging the data in bins along the time axis - “circmean”, uses
XVG.decimate_circmean()
to coarse grain by calculating the circular mean of the data in bins along the time axis. Use additional keywords low and high to set the limits. Assumes that the data are in degrees. - “min” and “max* select the extremum in each bin
- “rms”, uses
XVG.decimate_rms()
to coarse grain by computing the root mean square sum of the data in bins along the time axis (for averaging standard deviations and errors) - “percentile” with keyword per:
XVG.decimate_percentile()
reduces data in each bin to the percentile per - “smooth”, uses
XVG.decimate_smooth()
to subsample from a smoothed function (generated with a running average of the coarse graining step size derived from the original number of data points and maxpoints)
Returns: numpy array (M', N')
from a(M', N)
array withM' == M
(except whenM == 1
, see above) andN' <= N
(N'
is maxpoints).- “mean”, uses
-
decimate_circmean
(a, maxpoints, **kwargs)¶ Return data a circmean-decimated on maxpoints.
Histograms each column into maxpoints bins and calculates the weighted circular mean in each bin as the decimated data, using
numkit.timeseries.circmean_histogrammed_function()
. The coarse grained time in the first column contains the centers of the histogram time.If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.
Keywords low and high can be used to set the boundaries. By default they are [-pi, +pi].
This method returns a masked array where jumps are flagged by an insertion of a masked point.
Note
Assumes that the first column is time and that the data are in degrees.
Warning
Breaking of arrays only works properly with a two-column array because breaks are only inserted in the x-column (a[0]) where y1 = a[1] has a break.
-
decimate_error
(a, maxpoints, **kwargs)¶ Return data a error-decimated on maxpoints.
Histograms each column into maxpoints bins and calculates an error estimate in each bin as the decimated data, using
numkit.timeseries.error_histogrammed_function()
. The coarse grained time in the first column contains the centers of the histogram time.If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.
See also
Note
Assumes that the first column is time.
Does not work very well because often there are too few datapoints to compute a good autocorrelation function.
-
decimate_max
(a, maxpoints, **kwargs)¶ Return data a max-decimated on maxpoints.
Histograms each column into maxpoints bins and calculates the maximum in each bin as the decimated data, using
numkit.timeseries.max_histogrammed_function()
. The coarse grained time in the first column contains the centers of the histogram time.If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.
Note
Assumes that the first column is time.
-
decimate_mean
(a, maxpoints, **kwargs)¶ Return data a mean-decimated on maxpoints.
Histograms each column into maxpoints bins and calculates the weighted average in each bin as the decimated data, using
numkit.timeseries.mean_histogrammed_function()
. The coarse grained time in the first column contains the centers of the histogram time.If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.
Note
Assumes that the first column is time.
-
decimate_min
(a, maxpoints, **kwargs)¶ Return data a min-decimated on maxpoints.
Histograms each column into maxpoints bins and calculates the minimum in each bin as the decimated data, using
numkit.timeseries.min_histogrammed_function()
. The coarse grained time in the first column contains the centers of the histogram time.If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.
Note
Assumes that the first column is time.
-
decimate_percentile
(a, maxpoints, **kwargs)¶ Return data a percentile-decimated on maxpoints.
Histograms each column into maxpoints bins and calculates the percentile per in each bin as the decimated data, using
numkit.timeseries.percentile_histogrammed_function()
. The coarse grained time in the first column contains the centers of the histogram time.If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.
Note
Assumes that the first column is time.
Keywords: - per
- percentile as a percentage, e.g. 75 is the value that splits the data into the lower 75% and upper 25%; 50 is the median [50.0]
-
decimate_rms
(a, maxpoints, **kwargs)¶ Return data a rms-decimated on maxpoints.
Histograms each column into maxpoints bins and calculates the root mean square sum in each bin as the decimated data, using
numkit.timeseries.rms_histogrammed_function()
. The coarse grained time in the first column contains the centers of the histogram time.If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of maxpoints points is returned.
Note
Assumes that the first column is time.
-
decimate_smooth
(a, maxpoints, window='flat')¶ Return smoothed data a decimated on approximately maxpoints points.
- Produces a smoothed graph using the smoothing window window; “flat” is a running average.
- select points at a step size approximatelt producing maxpoints
If a contains <= maxpoints then a is simply returned; otherwise a new array of the same dimensions but with a reduced number of points (close to maxpoints) is returned.
Note
Assumes that the first column is time (which will never be smoothed/averaged), except when the input array a is 1D and therefore to be assumed to be data at equidistance timepoints.
TODO: - Allow treating the 1st column as data
-
default_color_cycle
= ['black', 'red', 'blue', 'orange', 'magenta', 'cyan', 'yellow', 'brown', 'green']¶ Default color cycle for
XVG.plot_coarsened()
:['black', 'red', 'blue', 'orange', 'magenta', 'cyan', 'yellow', 'brown', 'green']
-
default_extension
= 'xvg'¶ Default extension of XVG files.
-
error
¶ Error on the mean of the data, taking the correlation time into account.
See [FrenkelSmit2002] p526:
error = sqrt(2*tc*acf[0]/T)where acf() is the autocorrelation function of the fluctuations around the mean, y-<y>, tc is the correlation time, and T the total length of the simulation.
[FrenkelSmit2002] D. Frenkel and B. Smit, Understanding Molecular Simulation. Academic Press, San Diego 2002
-
errorbar
(**kwargs)¶ errorbar plot for a single time series with errors.
Set columns keyword to select [x, y, dy] or [x, y, dx, dy], e.g.
columns=[0,1,2]
. SeeXVG.plot()
for details. Only a single timeseries can be plotted and the user needs to select the appropriate columns with the columns keyword.By default, the data are decimated (see
XVG.plot()
) for the default of maxpoints = 10000 by averaging data in maxpoints bins.x,y,dx,dy data can plotted with error bars in the x- and y-dimension (use filled =
False
).For x,y,dy use filled =
True
to fill the region between y±dy. fill_alpha determines the transparency of the fill color. filled =False
will draw lines for the error bars. Additional keywords are passed topylab.errorbar()
.By default, the errors are decimated by plotting the 5% and 95% percentile of the data in each bin. The percentile can be changed with the percentile keyword; e.g. percentile = 1 will plot the 1% and 99% perentile (as will percentile = 99).
The error_method keyword can be used to compute errors as the root mean square sum (error_method = “rms”) across each bin instead of percentiles (“percentile”). The value of the keyword demean is applied to the decimation of error data alone.
See also
XVG.plot()
lists keywords common to both methods.
-
ma
¶ Represent data as a masked array.
The array is returned with column-first indexing, i.e. for a data file with columns X Y1 Y2 Y3 ... the array a will be a[0] = X, a[1] = Y1, ... .
inf and nan are filtered via
numpy.isfinite()
.
-
max
¶ Maximum of the data columns.
-
maxpoints_default
= 10000¶ Aim for plotting around that many points
-
mean
¶ Mean value of all data columns.
-
min
¶ Minimum of the data columns.
-
parse
(stride=None)¶ Read and cache the file as a numpy array.
Store every stride line of data; if
None
then the class default is used.The array is returned with column-first indexing, i.e. for a data file with columns X Y1 Y2 Y3 ... the array a will be a[0] = X, a[1] = Y1, ... .
-
plot
(**kwargs)¶ Plot xvg file data.
The first column of the data is always taken as the abscissa X. Additional columns are plotted as ordinates Y1, Y2, ...
In the special case that there is only a single column then this column is plotted against the index, i.e. (N, Y).
Keywords: - columns : list
Select the columns of the data to be plotted; the list is used as a numpy.array extended slice. The default is to use all columns. Columns are selected after a transform.
- transform : function
function
transform(array) -> array
which transforms the original array; must return a 2D numpy array of shape [X, Y1, Y2, ...] where X, Y1, ... are column vectors. By default the transformation is the identity [lambda x: x
].- maxpoints : int
limit the total number of data points; matplotlib has issues processing png files with >100,000 points and pdfs take forever to display. Set to
None
if really all data should be displayed. At the moment we simply decimate the data at regular intervals. [10000]- method
method to decimate the data to maxpoints, see
XVG.decimate()
for details- color
single color (used for all plots); sequence of colors (will be repeated as necessary); or a matplotlib colormap (e.g. “jet”, see
matplotlib.cm
). The default is to use theXVG.default_color_cycle
.- kwargs
All other keyword arguments are passed on to
pylab.plot()
.
-
plot_coarsened
(**kwargs)¶ Plot data like
XVG.plot()
with the range of all data shown.Data are reduced to maxpoints (good results are obtained with low values such as 100) and the actual range of observed data is plotted as a translucent error band around the mean.
Each column in columns (except the abscissa, i.e. the first column) is decimated (with
XVG.decimate()
) and the range of data is plotted alongside the mean usingXVG.errorbar()
(see for arguments). Additional arguments:Kewords: - maxpoints
number of points (bins) to coarsen over
- color
single color (used for all plots); sequence of colors (will be repeated as necessary); or a matplotlib colormap (e.g. “jet”, see
matplotlib.cm
). The default is to use theXVG.default_color_cycle
.- method
Method to coarsen the data. See
XVG.decimate()
The demean keyword has no effect as it is required to be
True
.See also
-
read
(filename=None)¶ Read and parse xvg file filename.
-
set
(a)¶ Set the array data from a (i.e. completely replace).
No sanity checks at the moment...
-
set_correlparameters
(**kwargs)¶ Set and change the parameters for calculations with correlation functions.
The parameters persist until explicitly changed.
Keywords: - nstep
only process every nstep data point to speed up the FFT; if left empty a default is chosen that produces roughly 25,000 data points (or whatever is set in ncorrel)
- ncorrel
If no nstep is supplied, aim at using ncorrel data points for the FFT; sets
XVG.ncorrel
[25000]- force
force recalculating correlation data even if cached values are available
- kwargs
see
numkit.timeseries.tcorrel()
for other options
-
std
¶ Standard deviation from the mean of all data columns.
-
tc
¶ Correlation time of the data.
See
XVG.error()
for details.
-
write
(filename=None)¶ Write array to xvg file filename in NXY format.
Note
Only plain files working at the moment, not compressed.
-
gromacs.fileformats.xvg.
break_array
(a, threshold=3.141592653589793, other=None)¶ Create a array which masks jumps >= threshold.
Extra points are inserted between two subsequent values whose absolute difference differs by more than threshold (default is pi).
Other can be a secondary array which is also masked according to a.
Returns (a_masked, other_masked) (where other_masked can be
None
)