Tabulated Data Inference
Overview
The library that I wrote uses a parser combinator (specifically, mpc) to automatically identify the data types in tabulated data. Its primary advantage is its ability to automatically identify:
- The delimiter (data delimited with `,`, `;`, and `\t` are supported)
- Date/time/datetime strings of (nearly) any format, with no prior knowledge of the format
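To give a flavor of the first point, here is a minimal sketch of one common consistency-based heuristic for inferring a delimiter: try each candidate and keep the one that splits every line into the same number of fields. This is my own illustration, not the library's actual algorithm:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative heuristic only; the library's implementation may differ.
// Returns the first candidate delimiter that appears the same (nonzero)
// number of times on every line, or '\0' if none qualifies.
char guessDelimiter(const std::vector<std::string> &lines) {
    if (lines.empty()) return '\0';
    for (char candidate : {',', ';', '\t'}) {
        const auto expected =
            std::count(lines.front().begin(), lines.front().end(), candidate);
        const bool consistent = expected > 0 &&
            std::all_of(lines.begin(), lines.end(), [&](const std::string &l) {
                return std::count(l.begin(), l.end(), candidate) == expected;
            });
        if (consistent) return candidate;
    }
    return '\0';
}
```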
Motivation and Problem Definition
For my senior capstone engineering project at Olin, I was working with air quality instruments such as the Aerodyne ACSM and the Brechtel SEMS. In writing software for real-time monitoring of these instruments and analysis of the data they generated, I noticed that every instrument had a different way of reporting data. Every datetime format was different; some data files were `.dat` files with tab-delimited data preceded by several lines of metadata, whereas others were standard `.csv` files; and the data types covered everything from logical values to bit strings to floating point numbers in scientific notation. As a result, much of the code I wrote either had to be instrument-dependent or had to reference a complex configuration file that specified how each instrument encoded its data.
As a project in my entrepreneurship class, I decided to investigate the feasibility of, and build prototypes for, a client + cloud software product that would, among other things, be able to automatically ingest any scientific instrument’s data and make sense of it. To that end, I wrote this library to parse tabulated data files and to identify and extract information about each column of data.
Solution
Identifying the delimiter and column names, even with metadata preceding the tabulated data, turned out to be a relatively straightforward problem to solve. The real challenge came in distinguishing a date/time/datetime string of any standard format from an arbitrary string (and, in some cases, from an integer). To solve this problem, I used a parser combinator and defined a grammar (sketched just after this list) that accepts strings representing:
- Boolean values
- Bit strings
- Datetime strings
- Time strings
- Date strings
- Integers
- Floating point numbers (in decimal notation)
- Floating point numbers (in scientific notation)
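As a rough sketch of the approach, here is a small, self-contained grammar of this kind defined with mpc's `mpca_lang`. The rule names and regular expressions are simplified illustrations of mine and are far narrower than the library's actual grammar:

```cpp
extern "C" {
#include "mpc.h"  // orangeduck/mpc, assumed to be on the include path
}

int main() {
    // One named rule per data type; the names become tags in the resulting AST.
    mpc_parser_t *Integer  = mpc_new("integer");
    mpc_parser_t *FloatDec = mpc_new("float_dec");
    mpc_parser_t *Date     = mpc_new("date");
    mpc_parser_t *Value    = mpc_new("value");

    // Illustrative rules only; the real grammar accepts many more
    // date/time/datetime formats than the single ISO date shown here.
    mpc_err_t *err = mpca_lang(MPCA_LANG_DEFAULT,
        " integer   : /-?[0-9]+/ ;                                    "
        " float_dec : /-?[0-9]+\\.[0-9]+/ ;                           "
        " date      : /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ ;  "
        " value     : /^/ (<date> | <float_dec> | <integer>) /$/ ;    ",
        Integer, FloatDec, Date, Value);
    if (err != NULL) { mpc_err_print(err); mpc_err_delete(err); return 1; }

    // Try to classify a single field; the matched rule's tag tells us its type.
    mpc_result_t r;
    if (mpc_parse("<field>", "2022-04-02", Value, &r)) {
        mpc_ast_print((mpc_ast_t *)r.output);  // the tree carries a 'date' tag
        mpc_ast_delete((mpc_ast_t *)r.output);
    } else {
        mpc_err_print(r.error);
        mpc_err_delete(r.error);
    }

    mpc_cleanup(4, Integer, FloatDec, Date, Value);
    return 0;
}
```

Because each alternative is a named rule, the AST records which rule matched; that is what makes both the classification and the format derivation discussed under Future Work possible.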
Using the Boost library’s date and time string formatting rules as a reference, the table below enumerates a (non-exhaustive) selection of the date/time/datetime formats that the library can classify:

Boost Format | Example | Classification
---|---|---
`%m-%d-%Y` | 04-02-2022 | Date string
`%d%m%y` | 181021 | Date string
`%m-%d-%y` | 04-02-22 | Date string
`%Y-%m-%d` (ISO extended) | 2022-04-02 | Date string
`%Y%m%d` (ISO format, date only) | 20220402 | Date string
N/A (extra character over `%m-%d-%Y`) | 12-34-56789 | Arbitrary string
N/A (expects 12-hour time when PM is used) | 13-34-2022 PM | Arbitrary string
`%H:%M:%S%F %p` | 10:03:22.0023 PM | Time string
`%H:%M:%S%F %p %ZP` | 13:03:22.0023 PM MST-07 | Time string
`%H%M%S%F%q` (ISO format, time only) | 131211-0700 | Time string
`%Y-%b-%d %H:%M:%S%F %ZP` (Boost default input) | 2005-Oct-15 13:12:11 MST-07 | Datetime string
`%Y%m%dT%H%M%S%F%q` (ISO format) | 20051015T131211-0700 | Datetime string
N/A (`%m/%d/%Y %H:%M` sans leading zeros) | 2/9/2022 0:16 | Datetime string
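To make the format flags concrete, here is a minimal sketch of parsing the Boost-default example from the table with Boost's `time_input_facet`. This snippet is my own illustration, separate from the library; the `%ZP` time-zone portion is omitted because parsing zones requires Boost's `local_time` machinery:

```cpp
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>
#include <locale>
#include <sstream>

int main() {
    // Parse "2005-Oct-15 13:12:11" using the Boost default input format.
    std::istringstream ss("2005-Oct-15 13:12:11");
    // The locale takes ownership of the facet, so no explicit delete is needed.
    ss.imbue(std::locale(std::locale::classic(),
                         new boost::posix_time::time_input_facet("%Y-%b-%d %H:%M:%S")));
    boost::posix_time::ptime pt;
    ss >> pt;
    std::cout << pt << std::endl;  // prints: 2005-Oct-15 13:12:11
    return 0;
}
```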
Example
The most challenging data files I used as test cases were the ones produced by the Brechtel SEMS. They contain tab-delimited data preceded by several lines of metadata and, for some reason, a blank line between the row of column names and the first row of data. They also include a variety of data types to parse, including date strings, time strings, floating point numbers, and logical values. For example, this file is used for the unit tests:
```
... (preceding lines removed for brevity)
#
#Transformations/Corrections
#---------------------------
#Desmearing: N = N0 * exp( -t / tau )
#Tau: 0.250
#CPC Counting Efficiency: nu = 1 - (C0 / (1 + exp((Dp - C1)/C2)))
#C0 = 1.025000
#C1 = 7.929000
#C2 = 0.845000
#
#=================================================
#StartDate StartTime EndDate EndTime ScanDirection Sheath_Press Sheath_Temp Sheath_Avg Sheath_Sdev Sheath_RH ColSamp_Avg ColSamp_Sdev ColSamp_RH ColSamp_Temp CPC_A_FlwAvg CPC_A_Sdev Sat_Temp Cond_Temp SEMS_Errors MCPC_Errors Impactr_Press Bin_Dia1 Bin_Dia2 Bin_Dia3 Bin_Dia4 Bin_Dia5 Bin_Dia6 Bin_Dia7 Bin_Dia8 Bin_Dia9 Bin_Dia10 Bin_Dia11 Bin_Dia12 Bin_Dia13 Bin_Dia14 Bin_Dia15 Bin_Dia16 Bin_Dia17 Bin_Dia18 Bin_Dia19 Bin_Dia20 Bin_Dia21 Bin_Dia22 Bin_Dia23 Bin_Dia24 Bin_Dia25 Bin_Dia26 Bin_Dia27 Bin_Dia28 Bin_Dia29 Bin_Dia30 Bin_Conc1 Bin_Conc2 Bin_Conc3 Bin_Conc4 Bin_Conc5 Bin_Conc6 Bin_Conc7 Bin_Conc8 Bin_Conc9 Bin_Conc10 Bin_Conc11 Bin_Conc12 Bin_Conc13 Bin_Conc14 Bin_Conc15 Bin_Conc16 Bin_Conc17 Bin_Conc18 Bin_Conc19 Bin_Conc20 Bin_Conc21 Bin_Conc22 Bin_Conc23 Bin_Conc24 Bin_Conc25 Bin_Conc26 Bin_Conc27 Bin_Conc28 Bin_Conc29 Bin_Conc30
181021 16:02:53 181021 16:03:25 1 1008 293.0 5.00 0.006 41 0.332 0.004 44 293.8 0.361 0.000 318.1 293.0 0 0 1.3 5.36 6.17 7.09 8.16 9.39 10.80 12.43 14.32 16.50 19.02 21.93 25.31 29.24 33.80 39.12 45.34 52.63 61.22 71.36 83.41 97.82 115.16 136.21 161.97 193.82 233.60 283.78 347.64 429.57 535.30 0.00 0.00 63.25 142.38 265.95 406.84 861.64 1936.16 2997.35 5231.52 6265.10 8117.19 10347.52 11147.31 13780.78 14020.80 13887.50 11887.87 8326.53 5639.37 2365.64 1078.11 564.21 367.12 359.97 126.26 39.08 19.28 4.18 36.05
181021 16:03:40 181021 16:04:10 0 1008 293.0 5.00 0.006 41 0.330 0.003 44 293.8 0.361 0.000 318.1 293.0 0 0 1.3 5.36 6.17 7.09 8.16 9.39 10.80 12.43 14.32 16.50 19.02 21.93 25.31 29.24 33.80 39.12 45.34 52.63 61.22 71.36 83.41 97.82 115.16 136.21 161.97 193.82 233.60 283.78 347.64 429.57 535.30 0.00 0.00 19.99 47.64 98.55 212.24 277.82 290.42 164.93 345.82 503.09 623.19 844.23 893.09 710.15 958.30 949.33 472.34 715.62 323.65 341.05 341.94 414.09 211.14 312.16 184.17 117.04 135.98 11.55 1.41
181021 16:04:25 181021 16:04:56 1 1008 293.0 5.00 0.006 41 0.330 0.000 43 293.9 0.361 0.002 318.1 293.0 0 0 1.3 5.36 6.17 7.09 8.16 9.39 10.80 12.44 14.32 16.50 19.02 21.93 25.31 29.24 33.80 39.12 45.34 52.64 61.22 71.36 83.41 97.82 115.16 136.20 161.97 193.81 233.59 283.75 347.61 429.52 535.23 0.00 0.00 18.67 51.15 117.25 264.05 380.46 294.87 334.65 449.28 716.25 585.34 774.78 1146.46 1148.51 1019.48 754.39 642.54 540.29 510.22 760.03 529.62 391.75 303.28 348.90 238.12 143.86 10.56 1.66 0.11
181021 16:05:10 181021 16:05:41 0 1008 293.1 5.00 0.009 42 0.330 0.000 42 294.0 0.360 0.001 318.1 293.0 0 0 1.3 5.36 6.17 7.09 8.16 9.39 10.80 12.44 14.32 16.50 19.02 21.93 25.31 29.24 33.80 39.12 45.34 52.64 61.22 71.36 83.41 97.82 115.16 136.20 161.96 193.80 233.57 283.73 347.57 429.46 535.15 0.00 0.00 1.12 3.77 10.65 29.93 90.24 274.72 328.70 376.08 442.66 954.42 706.54 732.08 704.70 1087.59 836.92 784.19 690.07 629.20 451.05 257.94 378.84 415.27 267.83 169.37 47.19 37.79 3.07 0.34
```

Using just three functions from the library, each field in the data file can be identified and classified. An excerpt from the example script shows how this works:
cout << "Finding delimiter..." << endl;
auto delimRet = getDelim(thisFileLines);
cout << "Finding fields..." << endl;
vector<vector<string>> fieldRet;
int consistentFields = getFields(thisFileLines, get<0>(delimRet), fieldRet, get<1>(delimRet));
cout << "Classifying fields..." << endl;
vector<tuple<string, FieldCls>> classificationRet;
classifyColumns(fieldRet, classificationRet, parser);
cout << "Data insights for " << fileTargets.at(fileIdx) << ":" << endl;
for (auto classification: classificationRet) {
cout << " - " << get<0>(classification) << " (" << FieldClsCorrespondingNames[get<1>(classification)] << ")"
<< endl;
}
This prints the following for the above example file:
```
Finding delimiter...
Finding fields...
Classifying fields...
Data insights for tests/test_targets/long_SEMS.dat:
 - #StartDate (date)
 - StartTime (time)
 - EndDate (date)
 - EndTime (time)
 - ScanDirection (logical)
 - Sheath_Press (int)
 - Sheath_Temp (float_dec) // Decimal floating point
 - ... (remaining lines removed for brevity)
```
Future Work
Because of the way the language grammar is structured, a prime opportunity for a new feature would be deriving the date/time string format from the abstract syntax tree generated for a given input. For example, using a function from the parser combinator library that prints a representation of the abstract syntax tree, we can see the tree generated for the input `2005-Oct-15 13:12:11 MST-07`:
```
>
  regex
  datetime|>
    date|>
      year_4_digit|regex:1:1 '2005'
      char:1:5 '-'
      month_b|regex:1:6 'Oct'
      char:1:9 '-'
      day|regex:1:10 '15'
    char:1:12 ' '
    time|>
      hour_24|regex:1:13 '13'
      char:1:15 ':'
      minute|regex:1:16 '12'
      char:1:18 ':'
      second|regex:1:19 '11'
    char:1:21 ' '
    tz_ZP|regex:1:22 'MST-07'
  regex
```
The `tz_ZP` tag, for instance, refers to the `%ZP` flag that the Boost library uses to print time zones in the `MST-07` format. More opportunities for improving the library are listed in the repository README.
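As a rough sketch of what that derivation might look like (my own hypothetical illustration against mpc's `mpc_ast_t` type, covering only the tags that appear in the example tree above):

```cpp
extern "C" {
#include "mpc.h"  // for mpc_ast_t
}
#include <cstring>
#include <string>

// Hypothetical tag-to-flag mapping; only the tags in the tree above are covered.
static std::string flagForTag(const char *tag) {
    if (std::strstr(tag, "year_4_digit")) return "%Y";
    if (std::strstr(tag, "month_b"))      return "%b";
    if (std::strstr(tag, "hour_24"))      return "%H";
    if (std::strstr(tag, "minute"))       return "%M";
    if (std::strstr(tag, "second"))       return "%S";
    if (std::strstr(tag, "tz_ZP"))        return "%ZP";
    if (std::strstr(tag, "day"))          return "%d";
    return "";
}

// Walk the AST: tagged leaves map to format flags, literal separator
// characters are copied through, and interior nodes concatenate children.
std::string deriveFormat(const mpc_ast_t *node) {
    const std::string flag = flagForTag(node->tag);
    if (!flag.empty()) return flag;
    if (std::strstr(node->tag, "char")) return node->contents;
    std::string out;
    for (int i = 0; i < node->children_num; i++)
        out += deriveFormat(node->children[i]);
    return out;
}
// Applied to the tree above, this yields "%Y-%b-%d %H:%M:%S %ZP".
```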
Additionally, as mentioned in the motivation section, I worked on this library as part of a project to prototype and make a business case for a product, so I also built a functional prototype of a desktop client that makes use of the library’s capabilities. I made this GUI in C# using Microsoft’s WinUI 3 platform for native Windows application development. This application can serve as a springboard for future work on integrating the library into useful user-facing products.