I wanted to check whether a key combination of three field values appeared more than once in an XML file. A colleague of mine gave me a head start on text processing with Awk. OS X ships with the BSD version of awk, so this came in quite handy.
The input XML:
<ehbo>
  <Order>
    <SiebelOrderID>10</SiebelOrderID>
    <SiebelOrderNumber>20</SiebelOrderNumber>
    <CordysOrderNumber>30</CordysOrderNumber>
  </Order>
  <Order>
    <SiebelOrderID>10</SiebelOrderID>
    <SiebelOrderNumber>20</SiebelOrderNumber>
    <CordysOrderNumber>30</CordysOrderNumber>
  </Order>
</ehbo>
The Awk script:
BEGIN {
    print "START";
    print "";
}

# For each field, keep the text between the first ">" and the next "<".
/SiebelOrderID/ {
    split($0, a, ">");
    split(a[2], b, "<");
    siebelOrderID = b[1];
}

/SiebelOrderNumber/ {
    split($0, c, ">");
    split(c[2], d, "<");
    SiebelOrderNumber = d[1];
}

/CordysOrderNumber/ {
    split($0, e, ">");
    split(e[2], f, "<");
    CordysOrderNumber = f[1];
}

# At the closing Order tag, count the combined key and reset the fields.
/\/Order/ {
    plep = siebelOrderID "_" SiebelOrderNumber "_" CordysOrderNumber;
    lijst[plep]++;
    siebelOrderID = "";
    SiebelOrderNumber = "";
    CordysOrderNumber = "";
}

END {
    # Print only the keys that were seen more than once, with their count.
    for (i in lijst) {
        if (lijst[i] > 1) {
            print i, lijst[i]
        }
    }
    print "";
    print "FINISHED!";
}
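As a possible shortcut, awk can do the splitting itself when the field separator is set to the regular expression `[<>]`. The sketch below is not the script above but a compact variant of it. It assumes every element sits on its own line, as in the sample input, and the function name `find_duplicate_orders` and the variable names are made up for illustration.

```shell
# Hypothetical helper: print key combinations that occur more than once.
# Assumes one XML element per line, as in the sample input.
find_duplicate_orders() {
    # With -F'[<>]' a line such as <SiebelOrderID>10</SiebelOrderID>
    # is split so that $2 holds the tag name and $3 holds the value.
    awk -F'[<>]' '
        $2 == "SiebelOrderID"     { id  = $3 }
        $2 == "SiebelOrderNumber" { num = $3 }
        $2 == "CordysOrderNumber" { cor = $3 }
        $2 == "/Order" {
            # Count the combined key, then reset for the next order.
            count[id "_" num "_" cor]++
            id = num = cor = ""
        }
        END {
            for (key in count)
                if (count[key] > 1)
                    print key, count[key]
        }
    ' "$1"
}
```

Calling `find_duplicate_orders input.xml` prints one line per duplicated combination. Comparing the tag name exactly, instead of matching a pattern anywhere on the line, also avoids accidental matches on similarly named elements.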
To make life easier, I put the script in a file called "parser.awk" and then used the following command in Terminal:
awk -f parser.awk input.xml
This prints each field combination that occurs more than once, followed by the number of times it occurs.
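The counting step behind that output can be tried on its own, without any XML. This is a minimal sketch: the function name `report_duplicates` is hypothetical, and the keys are typed in by hand in the same `id_number_number` form the script builds.

```shell
# Hypothetical helper: read one key per line from standard input
# and print every key that was seen more than once, with its count.
report_duplicates() {
    awk '
        { count[$0]++ }
        END {
            for (key in count)
                if (count[key] > 1)
                    print key, count[key]
        }
    '
}

# Three keys, one of which appears twice.
printf '%s\n' 10_20_30 11_21_31 10_20_30 | report_duplicates
# prints: 10_20_30 2
```

The array index is the key itself, so awk needs no separate lookup structure. A key that has not been seen yet starts at zero the first time it is incremented.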
References:
- http://www.unix.com/unix-dummies-questions-answers/19844-length-string.html
- http://www.vectorsite.net/tsawk.html
- http://commandlinemac.blogspot.nl/2008/12/learn-to-talk-awk.html