Author Topic: two problems with CSV-files (Read 11522 times)

Anson · « **on:** October 12, 2009, 02:27:41 am »

I wanted to import and export info with CSV-files, and while doing so, stumbled upon two problems. Further testing finally gave me the reasons what probably happened, but now i am looking for methods what to do ...

problem 1:
i had written a program to read a CSV file and work on the values in it, but it behaved strangely. Looking at the contents of the CSV showed nothing special, but finally i looked at every byte with my own program, and thus i found that in front of the first character were three more chars with the codes 239, 187 and 191. Notepad simply seems to ignore them, but other software might not.

when i had a closer look at the original CSV export template in PVD v14, I saw this:

Code: [Select]

%OPTIONS%
encoding="UTF8BOM"
replace=";<->,"
replace=""<->"
replace="#13#10<-> "
replace="#13<-> "
replace="#10<-> "
filter="CSV Files|*.csv"
%OPTIONS%
%HEAD%Title;Original Title;Year;Genre;Country;Actors;Director;Description;
%HEAD%{%value=title};{%value=origtitle};{%value=year};{%value=genre};{%value=country};{%value=actors};{%value=directors};"{%value=description}";

and that already was the solution to removing the three extra chars: i changed encoding="UTF8BOM" to encoding="UTF8" (without BOM) and everything works fine now.

now the questions:

Was the change to UTF8BOM intended ? in other/older versions and forum posts, i only had seen UTF8 and UTF-8 (which of the two is correct?), but never saw UTF8BOM ?
which encodings should be used to make CSV-files as compatible to as many other software as possible ?
and which encodings are available at all (does this depend on PVD or my system) ?

problem 2:
after importing my collection via CSV without obvious problems, i had to see that there was one problem: many comments were truncated in the middle. a closer look revealed that that happened always at some quotation marks which occured in the middle of a field.

Now the reason for the problem is obvious: PVD only imports fields "as is", respecting the userdefined delimiter and the userdefined quotation character, but only by ignoring the first and second occurence of the quotation character and the remaining string after the second quote char (assuming they enclose the field at the start and end of the field), but PVD does not interpret doubled quotation marks which are used by many other programs and even listed in a RFC.

This is no bug since the RFC and other sources give a warning: "While there are various specifications and implementations for the CSV format, there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.". Nonetheless, it would be nice if a future version of PVD would also understand those additional conventions.

Similar also applies to exporting to a CSV file: in the options section all characters which might cause problems are replaced, eg quotes are simply removed, linebreaks are replaced by space, and semicolons are replaced by commas. I found that problematic since it messes up lots of my comments, and nice features of PVD like the imbedded hyperlinks in memo fields don't work any longer when exporting and reimporting them since they will be missing the quote chars. Most serious will be the rare cases where a title is changed because it contains the delimiter or the quote char, since title and year are used to identify a movie.

now the questions:

would other people also be interested in more fully using import and export of CSV-files, or do most people use excel import because of this (but as i understood, exporting to excel is done with the same "normal" CSV-files and thus also the related problems) ?
did someone already come up with suggestions or even a solution how to change a few parameters to be able to import and export everything with as little loss as possible ?

here are some links for reference: Wikipedia and RFC 4180

and a short summary of these sources what most implementations understand:

fields of one record are on one line and are separated by one specific delimiter (originally "," but now often ";" since there are lots of "," and much fewer ";" in strings and most of all in numbers)
the delimiter separates values and doesn't terminate them (thus no delimiter is required after the last field; a final delimiter would add an empty field to the line/record)
any field may be enclosed with specific quotation characters
if a field contains the delimiter, the quotation char or linebreaks, it must be enclosed in those quote chars, and quote chars in the field are doubled to escape them
depending on implementation, whitespace at the start and end of fields should not be trimmed (to be sure, the whole field can be enclosed in quote chars)
a single header line is optional, but if present should hold the same number of fields as all the other lines, and the values should be the names of the fields

my first idea for a temporary solution for export only: changing the replace command in the options section to double quote chars instead of removing them and to no longer change ";" into "," and no longer remove linebreaks, and putting quote chars around all fields in the body of the template, or at least around all fields which at some time might include the delimiter or the quote chars, including titles, original titles, and all other strings and memo fields. this might work, but only as long as the data is not intended for reimport.

my first idea for a temporary solution for import and export: using a delimiter which is guaranteed to appear in no fields like maybe a TAB or other nonprintable char (if such chars are possible at all), but maybe someone uses TAB in comment fields. thus which char to use ? with a carefully selected delimiter, maybe quote chars wouldn't be needed at all and thus require no measures (except removing the replace commands in the header). And which second special char could be used as quote char, eg to avoid misinterpretations by PVD and other software when such software should attempt to match a userdefined quote char, which (when empty) might match any place in a string.

would anyone be interested in exploring this further, or even does have some working solution ?

nostra · « **Reply #1 on:** October 12, 2009, 02:49:32 am »

Quote

Was the change to UTF8BOM intended ?

Yes

Quote

which encodings should be used to make CSV-files as compatible to as many other software as possible ?

encoding="ANSI"

BUT in this case some international characters could be lost

Quote

and which encodings are available at all (does this depend on PVD or my system) ?

The export plugin has only 3 options:

UTF8
UTF8BOM
ANSI

Anson · « **Reply #2 on:** October 12, 2009, 10:31:20 pm »

thanks for the fast reply

Quote from: nostra on October 12, 2009, 02:49:32 am

Quote
Was the change to UTF8BOM intended ?
Yes

in case somebody else is wondering what all this means, here is some info i found in the meantime:

after you said "yes", i read a FAQ and also twice the RFC to get a better understanding of UTF8 and what BOM does. As I understand this, the only difference caused by UTF8BOM instead of UTF8 is that a textfile (including csv files) starts with three special chars, to announce to software that the file is no simple ascii file. These three chars are hex EF BB BF or decimal 239 187 191, and they are the UTF8 representation of the Unicode character FEFF.

Since many programs (including Windows Notepad) look at these chars, it's really nice to have UTF8BOM instead of UTF8 as default (from the FAQ: "A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files."). Only in few cases a file (mostly pure ascii files) needs to NOT start with the special chars of the BOM, eg Unix shell scripts which expect specific ASCII characters such as "#!" at the beginning. only then users should be aware of the possible problem and export files without BOM.

instead of switching back to UTF8, i now have changed my program to check for this start of the file too and act accordingly, also writing these codes to a file which i generate.

Quote

Quote
which encodings should be used to make CSV-files as compatible to as many other software as possible, and which encodings are available at all ?

The export plugin has only 3 options:

ANSI is most compatible, BUT in this case some international characters could be lost. And since eg lots of movie titles are chinese or whatever, i won't use ANSI in most cases.
UTF8 and UTF8BOM both take care of almost everything. they include US-ASCII completely as the first 128 chars, and the rest is also easy to handle. for the difference with and without BOM see above.

thanks again ... question/problems part 1 solved and finished :-)

anybody else interested in what i wrote as part 2, improving importing and exporting CSV by changing delimiter, quote char and replace options ?

to find other chars which might be usable as delimiters and quote chars, i counted all chars in a CSV where i exported ID, origtitle, year and my comment, and the only chars which didn't appear on this test were %@\_`|~ and the char 127 as well as all nonprintable chars 0-31 (except CR and LF). Any ideas or comments ?

nostra · « **Reply #3 on:** October 12, 2009, 11:12:11 pm »

Quote

anybody else interested in what i wrote as part 2, improving importing and exporting CSV by changing delimiter, quote char and replace options ?

I will improve the plugin to support escaping quote characters in the next version.

Anson · « **Reply #4 on:** October 13, 2009, 08:19:46 am »

Quote from: nostra on October 12, 2009, 11:12:11 pm

I will improve the plugin to support escaping quote characters in the next version.

big thanks !
with those changes, the existing features (like hyperlinks in memos) will be much more useful too !

just to be sure, since english is not our first language:
"the plugin" = import plugin? for handling escaped quote chars?

already now, using replace=""<->""" and ...;"{%value=description}";... in the export template should be simple and good enough for escaping quote chars, after another general small problem of replacements is solved:

i just tested it and currently you seem to apply each replacement twice ...

replace="o<->aeiou" generates Kaeiaeiouunga from Konga
replace="oo<->o" generates Four from Foooour
replace=""<->""" doubles a single " twice for a final """" instead of ""

Personal Video Database

News:

Author Topic: two problems with CSV-files (Read 11522 times)

Anson

two problems with CSV-files

nostra

Re: two problems with CSV-files

Anson

Re: two problems with CSV-files

nostra

Re: two problems with CSV-files

Anson

big thanks, and a small problem with all replacements