So I needed to quickly validate a CSV and identify all the bad lines in it. I don’t like re-inventing the wheel, so I did a Google search for ‘csv validation regex’. I couldn’t find one, so I made my own!

An individual field is a sequence of non-commas: [^,]*. All fields except the last are followed by a comma. So, if we have 10 fields, each line should match ^([^,]*,){9}[^,]*$.
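
As a quick smoke test (made-up input lines; any grep with -E will do):

echo 'a,b,c,d,e,f,g,h,i,j' | grep -E -c '^([^,]*,){9}[^,]*$'   # 10 fields: prints 1
echo 'a,b,c' | grep -E -c '^([^,]*,){9}[^,]*$'   # too few fields: prints 0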

But some fields might have commas in them: a name column might contain “Smith, John”. If so, we want to ignore commas within these ‘enclosures’. Here the enclosure is the double quotation mark, so a field might also look like "[^"]*", read as “match any non-quotations between quotations”.
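
To see the enclosure pattern on its own (again, a made-up input):

echo '"Smith, John"' | grep -E -c '^"[^"]*"$'   # prints 1: the comma sits safely inside the quotes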

What if we have escaped quotation characters within those enclosures? Neither basic nor extended POSIX regexes can help us, but Perl regexes can use positive lookbehinds: (?<=\\)" matches a quotation only if a backslash precedes it. So, a field should match the expression "([^"]|(?<=\\)")*". Read this as “two quotations, in between which are only non-quotations OR escaped quotations”.
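
To convince yourself the lookbehind does its job (this assumes your grep accepts -P, i.e. it was built with PCRE support; printf avoids echo’s inconsistent handling of backslashes):

printf '%s\n' '"He said \"hi\""' | grep -P -c '^"([^"]|(?<=\\)")*"$'   # prints 1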

Our regex is getting pretty complicated, so let me simplify using variables. A:=[^,]*, our first type of field. B:="([^"]|(?<=\\)")*", our second type of field.

With these abstractions, we can easily construct our regex. We want ^((A|B),){9}(A|B)$. This expands to: ^((([^,]*)|("([^"]|(?<=\\)")*")),){9}(([^,]*)|("([^"]|(?<=\\)")*"))$. There are probably more parentheses in there than you really need!
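
If you would rather let the shell do that expansion for you, you can mirror A and B as variables (a sketch, assuming bash; the assembled pattern has slightly fewer redundant parentheses but matches the same lines):

A='[^,]*'
B='"([^"]|(?<=\\)")*"'
field="(${A}|${B})"
echo "^(${field},){9}${field}\$"   # prints the assembled pattern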

Let’s put this all together into grep to count all the bad lines:

grep -P -c -v '^((([^,]*)|("([^"]|(?<=\\)")*")),){9}(([^,]*)|("([^"]|(?<=\\)")*"))$' myfile.csv
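
Or, since the original goal was to identify the bad lines rather than just count them, swap -c for -n to print each offender with its line number:

grep -P -n -v '^((([^,]*)|("([^"]|(?<=\\)")*")),){9}(([^,]*)|("([^"]|(?<=\\)")*"))$' myfile.csv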

This can be improved a bit: if you know that certain fields must be numeric, for instance, you can make the regex test for that, but the version above covers the most general case. And if you are using other delimiting or enclosing characters, replace the commas and quotes with your own choices.
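
For a tab-separated file, for instance, the same idea looks like this (hypothetical filename; \t is fine because we are already relying on -P):

grep -P -c -v '^(([^\t]*|"([^"]|(?<=\\)")*")\t){9}([^\t]*|"([^"]|(?<=\\)")*")$' myfile.tsv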