I wanted to import and export info with CSV-files, and while doing so, stumbled upon two problems. Further testing finally gave me the reasons what probably happened, but now i am looking for methods what to do ...
problem 1:i had written a program to read a CSV file and work on the values in it, but it behaved strangely. Looking at the contents of the CSV showed nothing special, but finally i looked at every byte with my own program, and thus i found that in front of the first character were three more chars with the codes 239, 187 and 191. Notepad simply seems to ignore them, but other software might not.
when i had a closer look at the original CSV export template in PVD v14, I saw this:
%OPTIONS%
encoding="UTF8BOM"
replace=";<->,"
replace=""<->"
replace="#13#10<-> "
replace="#13<-> "
replace="#10<-> "
filter="CSV Files|*.csv"
%OPTIONS%
%HEAD%Title;Original Title;Year;Genre;Country;Actors;Director;Description;
%HEAD%{%value=title};{%value=origtitle};{%value=year};{%value=genre};{%value=country};{%value=actors};{%value=directors};"{%value=description}";
and that already was the solution to removing the three extra chars: i changed encoding="UTF8BOM" to encoding="UTF8" (without BOM) and everything works fine now.
now the questions:- Was the change to UTF8BOM intended ? in other/older versions and forum posts, i only had seen UTF8 and UTF-8 (which of the two is correct?), but never saw UTF8BOM ?
- which encodings should be used to make CSV-files as compatible to as many other software as possible ?
- and which encodings are available at all (does this depend on PVD or my system) ?
problem 2:after importing my collection via CSV without obvious problems, i had to see that there was one problem: many comments were truncated in the middle. a closer look revealed that that happened always at some quotation marks which occured in the middle of a field.
Now the reason for the problem is obvious: PVD only imports fields "as is", respecting the userdefined delimiter and the userdefined quotation character, but only by ignoring the first and second occurence of the quotation character and the remaining string after the second quote char (assuming they enclose the field at the start and end of the field), but PVD does not interpret doubled quotation marks which are used by many other programs and even listed in a RFC.
This is no bug since the RFC and other sources give a warning: "
While there are various specifications and implementations for the CSV format, there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files.". Nonetheless, it would be nice if a future version of PVD would also understand those additional conventions.
Similar also applies to exporting to a CSV file: in the options section all characters which might cause problems are replaced, eg quotes are simply removed, linebreaks are replaced by space, and semicolons are replaced by commas. I found that problematic since it messes up lots of my comments, and nice features of PVD like the imbedded hyperlinks in memo fields don't work any longer when exporting and reimporting them since they will be missing the quote chars. Most serious will be the rare cases where a title is changed because it contains the delimiter or the quote char, since title and year are used to identify a movie.
now the questions:- would other people also be interested in more fully using import and export of CSV-files, or do most people use excel import because of this (but as i understood, exporting to excel is done with the same "normal" CSV-files and thus also the related problems) ?
- did someone already come up with suggestions or even a solution how to change a few parameters to be able to import and export everything with as little loss as possible ?
here are some links for reference:
Wikipedia and
RFC 4180and a short summary of these sources what most implementations understand:
- fields of one record are on one line and are separated by one specific delimiter (originally "," but now often ";" since there are lots of "," and much fewer ";" in strings and most of all in numbers)
- the delimiter separates values and doesn't terminate them (thus no delimiter is required after the last field; a final delimiter would add an empty field to the line/record)
- any field may be enclosed with specific quotation characters
- if a field contains the delimiter, the quotation char or linebreaks, it must be enclosed in those quote chars, and quote chars in the field are doubled to escape them
- depending on implementation, whitespace at the start and end of fields should not be trimmed (to be sure, the whole field can be enclosed in quote chars)
- a single header line is optional, but if present should hold the same number of fields as all the other lines, and the values should be the names of the fields
my first idea for a temporary solution for export only: changing the replace command in the options section to double quote chars instead of removing them and to no longer change ";" into "," and no longer remove linebreaks, and putting quote chars around all fields in the body of the template, or at least around all fields which at some time might include the delimiter or the quote chars, including titles, original titles, and all other strings and memo fields. this might work, but only as long as the data is not intended for reimport.
my first idea for a temporary solution for import and export: using a delimiter which is guaranteed to appear in no fields like maybe a TAB or other nonprintable char (if such chars are possible at all), but maybe someone uses TAB in comment fields. thus which char to use ? with a carefully selected delimiter, maybe quote chars wouldn't be needed at all and thus require no measures (except removing the replace commands in the header). And which second special char could be used as quote char, eg to avoid misinterpretations by PVD and other software when such software should attempt to match a userdefined quote char, which (when empty) might match any place in a string.
would anyone be interested in exploring this further, or even does have some working solution ?