1.. _vector.csv:
2
3Comma Separated Value (.csv)
4============================
5
6.. shortname:: CSV
7
8.. built_in_by_default::
9
10OGR supports reading and writing primarily non-spatial tabular data
11stored in text CSV files. CSV files are a common interchange format
12between software packages supporting tabular data and are also easily
13produced manually with a text editor or with end-user written scripts or
14programs.
15
16While in theory .csv files could have any extension, in order to
17auto-recognise the format OGR only supports CSV files ending with the
18extension ".csv". The datasource name may be either a single CSV file or
19point to a directory. For a directory to be recognised as a .csv
20datasource at least half the files in the directory need to have the
21extension .csv. One layer (table) is produced from each .csv file
22accessed.
23
24For files structured as CSV, but not ending
25with .CSV extension, the 'CSV:' prefix can be added before the filename
26to force loading by the CSV driver.
27
28The OGR CSV driver supports reading and writing. Because the CSV format
29has variable length text lines, reading is done sequentially. Reading
30features in random order will generally be very slow. OGR CSV layer
31might have a coordinate system stored in a .prj file (see GeoCSV
32specification). When reading a field named "WKT" is assumed to contain
33WKT geometry, but also is treated as a regular field. The OGR CSV driver
34returns all attribute columns as string data types if no field type
35information file (with .csvt extension) is available.
36
37Limited type recognition can be done for Integer, Real, String, Date
38(YYYY-MM-DD), Time (HH:MM:SS+nn), DateTime (YYYY-MM-DD HH:MM:SS+nn)
39columns through a descriptive file with the same name as the CSV file,
40but a .csvt extension. In a single line the types for each column have
41to be listed with double quotes and be comma separated (e.g.,
42"Integer","String"). It is also possible to specify explicitly the width
43and precision of each column, e.g.
44"Integer(5)","Real(10.7)","String(15)". The driver will then use these
45types as specified for the csv columns. Subtypes
46can be passed between parenthesis, such as "Integer(Boolean)",
47"Integer(Int16)" and "Real(Float32)". Starting with GDAL 2.1,
48accordingly with the `GeoCSV
49specification <http://giswiki.hsr.ch/GeoCSV>`__, the "CoordX" or
50"Point(X)" type can be used to specify a column with longitude/easting
51values, "CoordY" or "Point(Y)" for latitude/northing values and "WKT"
52for geometries encoded in WKT
53
54Starting with GDAL 2.2, the "JSonStringList", "JSonIntegerList",
55"JSonInteger64List" and "JSonRealList" types can be used in .csvt to map
56to the corresponding OGR StringList, IntegerList, Integer64List and
57RealList types. The field values are then encoded as JSon arrays, with
58proper CSV escaping.
59
60Automatic field type guessing can also be done
61if specifying the open options described in the below "Open options"
62section.
63
64Driver capabilities
65-------------------
66
67.. supports_create::
68
69.. supports_georeferencing::
70
71.. supports_virtualio::
72
73Format
74------
75
76CSV files have one line for each feature (record) in the layer (table).
77The attribute field values are separated by commas. At least two fields
78per line must be present. Lines may be terminated by a DOS (CR/LF) or
79Unix (LF) style line terminators. Each record should have the same
80number of fields. The driver will also accept a semicolon, a tabulation
81or a space character as field separator .
82This autodetection will work only if there's no other potential
83separator on the first line of the CSV file. Otherwise it will default
84to comma as separator.
85
86Complex attribute values (such as those containing commas, quotes or
87newlines) may be placed in double quotes. Any occurrences of double
88quotes within the quoted string should be doubled up to "escape" them.
89
90By default, the driver attempts to treat the first line of the file as a
91list of field names for all the fields. However, if one or more of the
92names is all numeric it is assumed that the first line is actually data
93values and dummy field names are generated internally (field_1 through
94field_n) and the first record is treated as a feature.
95Numeric values are treated as field names if they are
96enclosed in double quotes. Starting with GDAL 2.1, this behavior can be
97modified via the HEADERS open option.
98
99All CSV files are treated as UTF-8 encoded. A
100Byte Order Mark (BOM) at the beginning of the file will be parsed
101correctly. The option WRITE_BOM can be used to create a file
102with a Byte Order Mark, which can improve compatibility with some
103software (particularly Excel).
104
105Example (employee.csv):
106
107::
108
109   ID,Salary,Name,Comments
110   132,55000.0,John Walker,"The ""big"" cheese."
111   133,11000.0,Jane Lake,Cleaning Staff
112
113Note that the Comments value for the first data record is placed in
114double quotes because the value contains quotes, and those quotes have
115to be doubled up so we know we haven't reached the end of the quoted
116string yet.
117
118Many variations of textual input are sometimes called Comma Separated
119Value files, including files without commas, but fixed column widths,
120those using tabs as separators or those with other auxiliary data
121defining field types or structure. This driver does not attempt to
122support all such files, but instead to support simple .csv files that
123can be auto-recognised. Scripts or other mechanisms can generally be
124used to convert other variations into a form that is compatible with the
125OGR CSV driver.
126
127Reading CSV containing spatial information
128------------------------------------------
129
130Building point geometries
131~~~~~~~~~~~~~~~~~~~~~~~~~
132
133Consider the following CSV file (test.csv):
134
135::
136
137   Latitude,Longitude,Name
138   48.1,0.25,"First point"
139   49.2,1.1,"Second point"
140   47.5,0.75,"Third point"
141
142Starting with GDAL 2.1, it is possible to directly specify the potential
143names of the columns that can contain X/longitude and Y/latitude with
144the X_POSSIBLE_NAMES and Y_POSSIBLE_NAMES open option.
145
146*ogrinfo -ro -al test.csv -oo X_POSSIBLE_NAMES=Lon\* -oo
147Y_POSSIBLE_NAMES=Lat\* -oo KEEP_GEOM_COLUMNS=NO* will return :
148
149::
150
151   OGRFeature(test):1
152     Name (String) = First point
153     POINT (0.25 48.1)
154
155   OGRFeature(test):2
156     Name (String) = Second point
157     POINT (1.1 49.2)
158
159   OGRFeature(test):3
160     Name (String) = Third point
161     POINT (0.75 47.5)
162
163If CSV file does not have a header line, the dummy "field_n" names can be
164used as possible names for coordinate fields. For example plain XYZ point
165data can be opened as
166
167*ogrinfo -ro -al elevation.xyz -oo X_POSSIBLE_NAMES=field_1 -oo
168Y_POSSIBLE_NAMES=field_2 -oo Z_POSSIBLE_NAMES=field_3*
169
170Otherwise, if one or several columns contain a geometry definition
171encoded as WKT, WKB (encoded in hexadecimal) or GeoJSON (in which case
172the GeoJSON content must be formatted to follow CSV rules, that is to
173say it must be surrounded by double-quotes, and double-quotes inside the
174string must be repeated for proper escaping), the name of such column(s)
175the GEOM_POSSIBLE_NAMES open option.
176
177For older versions, it is possible to extract spatial information
178(points) from a CSV file which has columns for the X and Y coordinates,
179through the use of the :ref:`VRT <vector.vrt>` driver.
180
181You can write the associated VRT file (test.vrt):
182
183::
184
185   <OGRVRTDataSource>
186       <OGRVRTLayer name="test">
187           <SrcDataSource>test.csv</SrcDataSource>
188           <GeometryType>wkbPoint</GeometryType>
189           <LayerSRS>WGS84</LayerSRS>
190           <GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude"/>
191       </OGRVRTLayer>
192   </OGRVRTDataSource>
193
194and *ogrinfo -ro -al test.vrt* will return :
195
196::
197
198   OGRFeature(test):1
199     Latitude (String) = 48.1
200     Longitude (String) = 0.25
201     Name (String) = First point
202     POINT (0.25 48.1 0)
203
204   OGRFeature(test):2
205     Latitude (String) = 49.2
206     Longitude (String) = 1.1
207     Name (String) = Second point
208     POINT (1.1 49.200000000000003 0)
209
210   OGRFeature(test):3
211     Latitude (String) = 47.5
212     Longitude (String) = 0.75
213     Name (String) = Third point
214     POINT (0.75 47.5 0)
215
216Building line geometries
217~~~~~~~~~~~~~~~~~~~~~~~~
218
219Consider the following CSV file (test.csv):
220
221::
222
223   way_id,pt_id,x,y
224   1,1,2,49
225   1,2,3,50
226   2,1,-2,49
227   2,2,-3,50
228
229With a GDAL build with Spatialite enabled, *ogrinfo test.csv -dialect
230SQLite -sql "SELECT way_id, MakeLine(MakePoint(CAST(x AS float),CAST(y
231AS float))) FROM test GROUP BY way_id"* will return :
232
233::
234
235   OGRFeature(SELECT):0
236     way_id (String) = 1
237     LINESTRING (2 49,3 50)
238
239   OGRFeature(SELECT):1
240     way_id (String) = 2
241     LINESTRING (-2 49,-3 50)
242
243Open options
244------------
245
246The following open options can be specified
247(typically with the -oo name=value parameters of ogrinfo or ogr2ogr):
248
249-  **MERGE_SEPARATOR**\ =YES/NO (defaults to NO). Setting it to YES will
250   enable merging consecutive separators. Mostly useful when it is the
251   space character.
252-  **AUTODETECT_TYPE**\ =YES/NO (defaults to NO). Setting it to YES will
253   enable auto-detection of field data types. If while reading the
254   records (beyond the records used for autodetection), a value is found
255   to not correspond to the autodetected data type, a warning will be
256   emitted and the field will be emptied.
257-  **KEEP_SOURCE_COLUMNS**\ =YES/NO (default NO) keep a copy of the
258   original columns where the guessing is active, and the guessed type
259   is different from string. The name of the original columns will be
260   suffixed with "_original". This flag should be used only when
261   AUTODETECT_TYPE=YES.
262-  **AUTODETECT_WIDTH**\ =YES/NO/STRING_ONLY (defaults to NO). Setting
263   it to YES to detect the width of string and integer fields, and the
264   width and precision of real fields. Setting it to STRING_ONLY
265   restricts to string fields. Setting it to NO select default size and
266   width. If while reading the records (beyond the records used for
267   autodetection), a value is found to not correspond to the
268   autodetected width/precision, a warning will be emitted and the field
269   will be emptied.
270-  **AUTODETECT_SIZE_LIMIT**\ =size to specify the number of bytes to
271   inspect to determine the data type and width/precision. The default
272   will be 100000. Setting 0 means inspecting the whole file. Note :
273   specifying a value over 1 MB (or 0 if the file is larger than 1MB)
274   will prevent reading from standard input.
275-  **QUOTED_FIELDS_AS_STRING**\ =YES/NO (default NO). Only used if
276   AUTODETECT_TYPE=YES. Whether to enforce quoted fields as string
277   fields when set to YES. Otherwise, by default, the content of quoted
278   fields will be tested for real, integer, etc... data types.
279-  **X_POSSIBLE_NAMES**\ =list_of_names. (GDAL >= 2.1) Comma separated
280   list of possible names for X/longitude coordinate of a point. Each
281   name might be a pattern using the star character in starting and/or
282   ending position. E.g.: prefix*, \*suffix or \*middle*. The values in
283   the column must be floating point values. X_POSSIBLE_NAMES and
284   Y_POSSIBLE_NAMES must be both specified and a matching for each must
285   be found in the columns of the CSV file. Only one geometry column per
286   layer might be built when using X_POSSIBLE_NAMES/Y_POSSIBLE_NAMES.
287-  **Y_POSSIBLE_NAMES**\ =list_of_names. (GDAL >= 2.1) Comma separated
288   list of possible names for Y/latitude coordinate of a point. Each
289   name might be a pattern using the star character in starting and/or
290   ending position. E.g.: prefix*, \*suffix or \*middle*. The values in
291   the column must be floating point values. X_POSSIBLE_NAMES and
292   Y_POSSIBLE_NAMES must be both specified and a matching for each must
293   be found in the columns of the CSV file.
294-  **Z_POSSIBLE_NAMES**\ =list_of_names. (GDAL >= 2.1) Comma separated
295   list of possible names for Z/elevation coordinate of a point. Each
296   name might be a pattern using the star character in starting and/or
297   ending position. E.g.: prefix*, \*suffix or \*middle*. The values in
298   the column must be floating point values. Only taken into account in
299   combination with X_POSSIBLE_NAMES and Y_POSSIBLE_NAMES.
300-  **GEOM_POSSIBLE_NAMES**\ =list_of_names. (GDAL >= 2.1) Comma
301   separated list of possible names for geometry columns that contain
302   geometry definitions encoded as WKT, WKB (in hexadecimal form,
303   potentially in PostGIS 2.0 extended WKB) or GeoJSON. Each name might
304   be a pattern using the star character in starting and/or ending
305   position. E.g.: prefix*, \*suffix or \*middle\*
306-  **KEEP_GEOM_COLUMNS**\ =YES/NO (default YES) Expose the detected
307   X,Y,Z or geometry columns as regular attribute fields.
308-  **HEADERS**\ =YES/NO/AUTO (default AUTO) (GDAL >= 2.1) Whether the
309   first line of the file contains column names or not. When set to
310   AUTO, GDAL will assume the first line is column names if none of the
311   values are strictly numeric.
312-  **EMPTY_STRING_AS_NULL**\ =YES/NO (default NO) (GDAL >= 2.1) Whether
313   to consider empty strings as null fields on reading'.
314
315Creation Issues
316---------------
317
318The driver supports creating new databases (as a directory of .csv
319files), adding new .csv files to an existing directory or .csv files or
320appending features to an existing .csv table. Starting with GDAL 2.1,
321deleting or replacing existing features, or adding/modifying/deleting
322fields is supported, provided the modifications done are small enough to
323be stored in RAM temporarily before flushing to disk.
324
325Layer Creation options:
326
327-  **LINEFORMAT**: By default when creating new .csv files they are
328   created with the line termination conventions of the local platform
329   (CR/LF on win32 or LF on all other systems). This may be overridden
330   through use of the LINEFORMAT layer creation option which may have a
331   value of **CRLF** (DOS format) or **LF** (Unix format).
332-  **GEOMETRY**: By default, the geometry of
333   a feature written to a .csv file is discarded. It is possible to
334   export the geometry in its WKT representation by specifying
335   GEOMETRY=\ **AS_WKT**. It is also possible to export point geometries
336   into their X,Y,Z components (different columns in the csv file) by
337   specifying GEOMETRY=\ **AS_XYZ**, GEOMETRY=\ **AS_XY** or
338   GEOMETRY=\ **AS_YX**. The geometry column(s) will be prepended to the
339   columns with the attributes values. It is also possible to export
340   geometries in GeoJSON representation using SQLite SQL dialect query,
341   see example below.
342-  **CREATE_CSVT**\ =YES/NO: Create the
343   associated .csvt file (see above paragraph) to describe the type of
344   each column of the layer and its optional width and precision.
345   Default value : NO
346-  **SEPARATOR**\ =COMMA/SEMICOLON/TAB/SPACE:
347   Field separator character. Default value : COMMA
348-  **WRITE_BOM**\ =YES/NO: Write a UTF-8 Byte
349   Order Mark (BOM) at the start of the file. Default value : NO
350-  **GEOMETRY_NAME**\ =name (Starting with GDAL 2.1): Name of geometry
351   column. Only used if GEOMETRY=AS_WKT and CREATE_CSVT=YES. Defaults to
352   WKT
353-  **STRING_QUOTING**\ =IF_NEEDED/IF_AMBIGUOUS/ALWAYS (Starting with
354   GDAL 2.3): whether to double-quote strings. IF_AMBIGUOUS means that
355   string values that look like numbers will be quoted (it also implies
356   IF_NEEDED). Defaults to IF_AMBIGUOUS (behavior in older versions was
357   IF_NEEDED)
358
359Configuration options (set with ``--config key value`` on command line
360utilities):
361
362-  **OGR_WKT_PRECISION**\ =int: Number of decimals for coordinate
363   values. Default to 15. A heuristics is used to remove insignificant
364   trailing 00000x or 99999x that can appear when formatting decimal
365   numbers.
366-  **OGR_WKT_ROUND**\ =YES/NO: (GDAL >= 2.3) Whether to enable the above
367   mentioned heuristics to remove insignificant trailing 00000x or
368   99999x. Default to YES.
369
370VSI Virtual File System API support
371-----------------------------------
372
373The driver supports reading and writing to files managed by VSI Virtual
374File System API, which include "regular" files, as well as files in the
375/vsizip/ (read-write) , /vsigzip/ (read-only) , /vsicurl/ (read-only)
376domains.
377
378Writing to /dev/stdout or /vsistdout/ is also supported.
379
380Examples
381~~~~~~~~
382
383-  This example shows using ogr2ogr to transform a shapefile with point
384   geometry into a .csv file with the X,Y,Z coordinates of the points as
385   first columns in the .csv file
386
387   ::
388
389      ogr2ogr -f CSV output.csv input.shp -lco GEOMETRY=AS_XYZ
390
391-  This example shows using ogr2ogr to transform a shapefile into a .csv
392   file with geography field formatted using GeoJSON format.
393
394   ::
395
396      ogr2ogr -f CSV -dialect sqlite -sql "select AsGeoJSON(geometry) AS geom, * from input" output.csv input.shp
397
398- Convert a CSV into a GeoPackage. Specify the names of the coordinate columns and assign a coordinate reference system.
399
400   ::
401
402     ogr2ogr \
403       -f GPKG output.gpkg \
404       input.csv \
405       -oo X_POSSIBLE_NAMES=longitude \
406       -oo Y_POSSIBLE_NAMES=latitude \
407       -a_srs 'EPSG:4326'
408
409
410Particular datasources
411----------------------
412
413The CSV driver can also read files whose structure is close to CSV files
414:
415
416-  Airport data files NfdcFacilities.xls, NfdcRunways.xls,
417   NfdcRemarks.xls and NfdcSchedules.xls found on that `FAA
418   website <http://www.faa.gov/airports/airport_safety/airportdata_5010/menu/index.cfm>`__
419
420-  Files from the `USGS
421   GNIS <http://geonames.usgs.gov/domestic/download_data.htm>`__
422   (Geographic Names Information System)
423
424-  The allCountries file from `GeoNames <http://www.geonames.org>`__
425
426-  `Eurostat .TSV
427   files <http://epp.eurostat.ec.europa.eu/NavTree_prod/everybody/BulkDownloadListing?file=read_me.pdf>`__
428
429Other Notes
430-----------
431
432-  `GeoCSV specification <http://giswiki.hsr.ch/GeoCSV>`__ (supported by
433   GDAL >= 2.1)
434-  Initial development of the OGR CSV driver was supported by `DM
435   Solutions Group <http://www.dmsolutions.ca/>`__ and
436   `GoMOOS <http://www.gomoos.org/>`__.
437-  `Carto <https://carto.com/>`__ funded field type auto-detection and
438   open options related to geometry columns.
439