I've written a script that uses sed to clean up .csv files by removing some bad commas and bad quotes ("bad" meaning they break an in-house program we use to transform these files):
    # remove all commas, and re-insert the good commas using clean.sed
    sed -f clean.sed $1 > $1.1st
    # remove all quotes
    sed 's/\"//g' $1.1st > $1.tmp
    # add the good quotes around good commas
    sed 's/\,/\"\,\"/g' $1.tmp > $1.tmp1
    # add leading quotes
    sed 's/^/\"/' $1.tmp1 > $1.tmp2
    # add trailing quotes
    sed 's/$/\"/' $1.tmp2 > $1.tmp3
    # remove utf characters
    sed 's/<feff>//' $1.tmp3 > $1.tmp4
    # replace original file with new stripped version and delete .tmp files
    cp -rf $1.tmp4 quotes_$1
Here is clean.sed:
    s/\",\"/XXX/g;
    :a
    s/,//g
    ta
    s/XXX/\",\"/g;
Then it removes the temp files and voilà, we have a new file whose name starts with the word "quotes" that we can use for our other processes.
My question is:
Why do I need a sed statement to remove the feff tag in that temp file? The original file doesn't appear to have it, but it always shows up in the replacement. At first I thought cp was causing this, but if I put in the sed statement to remove it before the cp, it isn't there.
Maybe I'm just missing something...
U+FEFF is the code point for a byte order mark. Your files most likely contain data saved in UTF-16 and the BOM has been corrupted by your 'cleaning process' which is most likely expecting ASCII. It's probably not a good idea to remove the BOM, but instead to fix your scripts to not corrupt it in the first place.
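One way to check what the script does to the BOM is to dump the first few bytes before and after the "add leading quotes" step. The following is only a sketch under assumptions: it uses a hypothetical file named sample.csv and assumes the BOM is stored as UTF-8 (bytes EF BB BF); a UTF-16 file would start with FF FE or FE FF instead.

```shell
# Hypothetical sample: a UTF-8 BOM (EF BB BF) followed by one CSV line.
printf '\357\273\277a,b\n' > sample.csv

# First four bytes of the original file, as hex.
before=$(head -c 4 sample.csv | od -An -tx1 | tr -d ' \n')

# Same substitution as the script's "add leading quotes" step.
sed 's/^/"/' sample.csv > sample.tmp

# First four bytes after the step: the quote (22) now precedes the BOM.
after=$(head -c 4 sample.tmp | od -An -tx1 | tr -d ' \n')

echo "before: $before"   # before: efbbbf61
echo "after:  $after"    # after:  22efbbbf
```

If your own files show the same pattern, the BOM was in the original all along. It only becomes visible once the inserted quote pushes it away from the start of the file, where editors normally hide it.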
To get rid of these in GNU emacs:
There is also a way to convert files with DOS line termination convention to Unix line termination convention.
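Outside Emacs, one common way to do that conversion from the shell is tr. This is a minimal sketch with hypothetical file names (dos.csv, unix.csv), not necessarily the method referred to above:

```shell
# Hypothetical input with DOS (CRLF) line endings.
printf 'a,b\r\nc,d\r\n' > dos.csv

# Delete every carriage return, leaving Unix (LF) line endings.
tr -d '\r' < dos.csv > unix.csv
```

Note that this deletes every carriage return in the file, including any that appear inside a field, so it is only safe when CR occurs solely as part of the line terminator.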