Non-uniform text file parsing



Hi,
I am trying to parse a collection of text files that look like the following. There are a number of them, but not all are structured exactly the same way. If I can figure out how to parse this file, I can handle the rest. I am a novice in bash, and I want to parse only the SizeInBytes column, but the document is non-uniform; even the delimiters are not the same. For example, between the collection name and the number of documents the delimiter is <tab,tab,space>, but the delimiter option accepts only a single character. I would really like to know how to handle this kind of parsing. The text shown below is only part of the file.




<COMMENT> TestbedName:          trec123-100-sample300-callan99
<COMMENT>
<COMMENT> RevisionHistory:
<COMMENT> v1a, October 28, 1999:
<COMMENT>   - Initial release.
<COMMENT>
<COMMENT> NumberOfCollections:            100
<COMMENT> NumberOfDocuments:           30,000
<COMMENT> SizeInBytes:            270,145,023
<COMMENT>
<COMMENT> CollectionName   NumberOfDocuments  SizeInBytes
<COMMENT>    ap88_1		300		   805,137
<COMMENT>    ap88_2		300		   762,665
<COMMENT>    ap88_3		300		   787,242
<COMMENT>    ap88_4		300		   823,265
<COMMENT>    ap88_5		300		   728,748
<COMMENT>    ap88_6		300		   835,178
<COMMENT>    ap88_7		300		   751,507
<COMMENT>    ap88_8		300		   788,955
<COMMENT>    ap89_1		300		   853,772
<COMMENT>    ap89_2		300		   824,850
<COMMENT>    ap89_3		300		   804,463
<COMMENT>    ap89_4		300		   807,761
<COMMENT>    ap89_5		300		   885,290
<COMMENT>    ap89_6		300		   838,144
<COMMENT>    ap89_7		300		   793,021
<COMMENT>    ap89_8		300		   834,519
<COMMENT>    ap90_1		300		   774,047
<COMMENT>    ap90_2		300		   875,806
<COMMENT>    ap90_3		300		   874,445
<COMMENT>    ap90_4		300		   845,463
<COMMENT>    ap90_5		300		   776,059
<COMMENT>    ap90_6		300		   820,116
<COMMENT>    ap90_7		300		   810,200
<COMMENT>    ap90_8		300		   794,558
<COMMENT>    doe_1		300		   229,975
<COMMENT>    doe_2		300		   245,362
<COMMENT>    doe_3		300		   255,834
<COMMENT>    doe_4		300		   236,867
<COMMENT>    doe_5		300		   255,995
<COMMENT>    doe_6		300		   255,932
<COMMENT>    fr88_1		300		 5,227,664
<COMMENT>    fr88_2		300		12,109,095
<COMMENT>
<COMMENT>  ap88_1:  300 docs, 805137 bytes
AP880224-0321 ap88_1
AP880218-0282 ap88_1
AP880225-0251 ap88_1
AP880217-0097 ap88_1
AP880324-0012 ap88_1
AP880322-0004 ap88_1
AP880217-0216 ap88_1
AP880220-0003 ap88_1
AP880309-0328 ap88_1
AP880319-0122 ap88_1
AP880321-0192 ap88_1
AP880225-0287 ap88_1
AP880319-0135 ap88_1
AP880322-0152 ap88_1
AP880222-0259 ap88_1
AP880222-0246 ap88_1
AP880223-0121 ap88_1
AP880225-0047 ap88_1
AP880312-0124 ap88_1
AP880311-0326 ap88_1
AP880219-0203 ap88_1
AP880319-0036 ap88_1
AP880316-0152 ap88_1
AP880219-0037 ap88_1
Looks like that would be a pretty straightforward job for Boost.Spirit.
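If you would rather stay in bash, as the question asks, here is a minimal sketch using awk. By default awk splits fields on any run of spaces and tabs, so the mixed <tab,tab,space> delimiter stops being a problem. The function name `size_in_bytes` is made up for illustration, and the filter assumes that a table row is a line with exactly four whitespace-separated fields: the `<COMMENT>` tag, a collection name, a numeric document count, and a size made of digits and commas. Files with a different layout may need a different test.

```shell
#!/usr/bin/env bash

# Print the SizeInBytes column, with the thousands separators removed.
# Assumption: a table row is "<COMMENT> name count size", i.e. exactly
# four whitespace-separated fields with a numeric count and size.
# Header, revision-history and "ap88_1:  300 docs, ..." lines do not
# match this pattern and are skipped.
size_in_bytes() {
    awk '$1 == "<COMMENT>" && NF == 4 && $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9,]+$/ {
        gsub(/,/, "", $4)   # 805,137 -> 805137
        print $4
    }' "$@"
}

# One row from the sample, with its tab,tab,space delimiters:
printf '<COMMENT>    ap88_1\t\t300\t\t   805,137\n' | size_in_bytes
# prints 805137
```

To run it on a file, pass the file name as an argument, for example `size_in_bytes testbed.txt` (a hypothetical file name). If you want to keep the commas, drop the `gsub` line.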