C# - RegEx - Match on XML declarations when the version # is not "1.0"
I want to use a regex for this.
I need to find errant XML declarations, or declarations whose version is not 1.0.
The following are all valid matches:
Bad declaration
<? xml ver="1.0" encoding="utf-8"?>
Bad declaration
<?xml version="1.0' encoding=utf-8>
Bad declaration
<?xml ?>
Bad declaration (doesn't start on first line)
..... <? xml ver="1.0" encoding="utf-8"?>
Version 1.1 (single quotes)
<?xml version='1.1' encoding='utf-8'?>
Version 1.1 (double quotes)
<?xml version="1.1" encoding="utf-8"?>
Erroneous version #
<?xml version='999999' encoding='utf-8'?>
Version 1.1 (multi-line). I'm not sure whether multi-line formatting is allowed, but I've seen it done, so I need to check for it.
<?xml version="1.1" encoding="utf-8" standalone="no" ?>
I only want matches on invalid XML declarations or on XML declarations with a version other than 1.0.
The following are valid XML 1.0 declarations. These should never return a match:
<?xml version="1.0" encoding="utf-8" standalone="no" ?>
<?xml version= "1.0" encoding= 'utf-8' standalone= "no" ?>
<?xml version="1.0" encoding="utf-8" standalone="no" ?>
XML 1.0's XML declaration grammar is:
XMLDecl      ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
VersionInfo  ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
Eq           ::= S? '=' S?
VersionNum   ::= '1.0'
EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
EncName      ::= [A-Za-z] ([A-Za-z0-9._] | '-')*
SDDecl       ::= S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"'))
S            ::= (#x20 | #x9 | #xD | #xA)+
This can be trivially converted to C#'s regex notation to write a regex that matches a valid declaration:
new Regex(@"
    \A<\?xml
    [ \t\n\r]+version[ \t\n\r]*=[ \t\n\r]*([""'])1\.0\1
    (?:[ \t\n\r]+encoding[ \t\n\r]*=[ \t\n\r]*([""'])[A-Za-z][A-Za-z0-9._-]*\2)?
    (?:[ \t\n\r]+standalone[ \t\n\r]*=[ \t\n\r]*([""'])(?:yes|no)\3)?
    [ \t\n\r]*
    \?>
    ", RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace)
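As a quick sanity check, the valid-declaration regex can be wrapped in a small helper and run against the samples from the question. This is only a sketch: the class name `XmlDeclCheck` and the method `HasValidDeclaration` are made up for illustration and are not part of the original answer. Note that the anchor must be `\A` (start of input); a lowercase `\a` means the bell character in .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlDeclCheck
{
    // Matches only a well-formed XML 1.0 declaration at the very start of the input.
    // Spaces inside [ ... ] stay literal even with IgnorePatternWhitespace.
    public static readonly Regex ValidDecl = new Regex(@"
        \A<\?xml
        [ \t\n\r]+version[ \t\n\r]*=[ \t\n\r]*([""'])1\.0\1
        (?:[ \t\n\r]+encoding[ \t\n\r]*=[ \t\n\r]*([""'])[A-Za-z][A-Za-z0-9._-]*\2)?
        (?:[ \t\n\r]+standalone[ \t\n\r]*=[ \t\n\r]*([""'])(?:yes|no)\3)?
        [ \t\n\r]*
        \?>",
        RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: true when the input begins with a valid 1.0 declaration.
    public static bool HasValidDeclaration(string xml)
    {
        return ValidDecl.IsMatch(xml);
    }
}
```

For example, `XmlDeclCheck.HasValidDeclaration("<?xml version= \"1.0\" encoding= 'utf-8' standalone= \"no\" ?>")` should return true, while the version 1.1 and `<?xml ?>` samples should return false.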
It can then be inverted using a negative look-ahead, so that it matches only if a valid declaration is missing:
new Regex(@"
    \A(?!<\?xml
    [ \t\n\r]+version[ \t\n\r]*=[ \t\n\r]*([""'])1\.0\1
    (?:[ \t\n\r]+encoding[ \t\n\r]*=[ \t\n\r]*([""'])[A-Za-z][A-Za-z0-9._-]*\2)?
    (?:[ \t\n\r]+standalone[ \t\n\r]*=[ \t\n\r]*([""'])(?:yes|no)\3)?
    [ \t\n\r]*
    \?>)
    ", RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace)
(I've used back-references to simplify the regex, but they are not necessary.)
Note that when this does match, it matches the empty string at the beginning of the input; it won't capture the invalid declaration for you. Add (<[^>]*>) after the look-ahead if you need a non-empty match.
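A minimal sketch of that last suggestion, assuming a named group is acceptable in place of the plain `(<[^>]*>)`: the class `BadXmlDeclFinder`, the method `TryFindBadDeclaration`, and the group name `tag` are all hypothetical names chosen for this example. Because the capture must start with `<` at the very beginning of the input, this variant does not report input that begins with other text (such as the "doesn't start on first line" sample); the empty-match version handles that case.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BadXmlDeclFinder
{
    // Negative look-ahead rejects a valid 1.0 declaration at the start of the input;
    // the named group then captures whatever leading tag is there instead.
    public static readonly Regex BadDecl = new Regex(@"
        \A(?!<\?xml
            [ \t\n\r]+version[ \t\n\r]*=[ \t\n\r]*([""'])1\.0\1
            (?:[ \t\n\r]+encoding[ \t\n\r]*=[ \t\n\r]*([""'])[A-Za-z][A-Za-z0-9._-]*\2)?
            (?:[ \t\n\r]+standalone[ \t\n\r]*=[ \t\n\r]*([""'])(?:yes|no)\3)?
            [ \t\n\r]*
            \?>)
        (?<tag><[^>]*>)",
        RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns true and the offending leading tag when the
    // input starts with '<' but not with a valid XML 1.0 declaration.
    public static bool TryFindBadDeclaration(string xml, out string tag)
    {
        Match m = BadDecl.Match(xml);
        tag = m.Success ? m.Groups["tag"].Value : string.Empty;
        return m.Success;
    }
}
```

Called on `<?xml version='1.1' encoding='utf-8'?>`, the helper should return true with `tag` set to the whole declaration; called on a document that starts with a valid 1.0 declaration, it should return false.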