# datapackage-py

[![Travis](https://travis-ci.org/frictionlessdata/datapackage-py.svg?branch=master)](https://travis-ci.org/frictionlessdata/datapackage-py)
[![Coveralls](https://coveralls.io/repos/github/frictionlessdata/datapackage-py/badge.svg?branch=master)](https://coveralls.io/github/frictionlessdata/datapackage-py?branch=master)
[![PyPi](https://img.shields.io/pypi/v/datapackage.svg)](https://pypi.python.org/pypi/datapackage)
[![Github](https://img.shields.io/badge/github-master-brightgreen)](https://github.com/frictionlessdata/datapackage-py)
[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)

A library for working with [Data Packages](http://specs.frictionlessdata.io/data-package/).

> **[Important Notice]** We have released [Frictionless Framework](https://github.com/frictionlessdata/frictionless-py). This framework provides improved `datapackage` functionality extended to be a complete data solution. The change is not breaking for existing software, so no actions are required. Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration) from `datapackage` to Frictionless Framework.
> - we continue to bug-fix `datapackage@1.x` in this [repository](https://github.com/frictionlessdata/datapackage-py), and it remains available on [PyPi](https://pypi.org/project/datapackage/) as before
> - please note that the `frictionless@3.x` API, which we're working on at the moment, is not yet stable
> - we will release `frictionless@4.x` by the end of 2020 as the first SemVer/stable version

## Features

 - `Package` class for working with data packages
 - `Resource` class for working with data resources
 - `Profile` class for working with profiles
 - `validate` function for validating data package descriptors
 - `infer` function for inferring data package descriptors

## Contents

<!--TOC-->

  - [Getting Started](#getting-started)
    - [Installation](#installation)
  - [Documentation](#documentation)
    - [Introduction](#introduction)
    - [Working with Package](#working-with-package)
    - [Working with Resource](#working-with-resource)
    - [Working with Group](#working-with-group)
    - [Working with Profile](#working-with-profile)
    - [Working with Foreign Keys](#working-with-foreign-keys)
    - [Working with validate/infer](#working-with-validateinfer)
    - [Frequently Asked Questions](#frequently-asked-questions)
  - [API Reference](#api-reference)
    - [`cli`](#cli)
    - [`Package`](#package)
    - [`Resource`](#resource)
    - [`Group`](#group)
    - [`Profile`](#profile)
    - [`validate`](#validate)
    - [`infer`](#infer)
    - [`DataPackageException`](#datapackageexception)
    - [`TableSchemaException`](#tableschemaexception)
    - [`LoadError`](#loaderror)
    - [`CastError`](#casterror)
    - [`IntegrityError`](#integrityerror)
    - [`RelationError`](#relationerror)
    - [`StorageError`](#storageerror)
  - [Contributing](#contributing)
  - [Changelog](#changelog)

<!--TOC-->

## Getting Started

### Installation

The package follows semantic versioning, which means that major versions could include breaking changes. It's highly recommended to specify a `datapackage` version range in your `setup/requirements` file, e.g. `datapackage>=1.0,<2.0`.

```bash
$ pip install datapackage
```

#### OSX 10.14+

If you receive an error about the `cchardet` package when installing `datapackage` on Mac OSX 10.14 (Mojave) or higher, follow these steps:

1. Make sure you have the latest Xcode by running the following in a terminal: `xcode-select --install`
2. Then go to [https://developer.apple.com/download/more/](https://developer.apple.com/download/more/) and download the `command line tools`. Note that this requires an Apple ID.
3. Then, in a terminal, run `open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg`

You can read more about these steps in this [post](https://stackoverflow.com/questions/52509602/cant-compile-c-program-on-a-mac-after-upgrade-to-mojave).

## Documentation

### Introduction

Let's start with a simple example:

```python
from datapackage import Package

package = Package('datapackage.json')
package.get_resource('resource').read()
```
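
A package's resources can also be inspected programmatically. A minimal sketch building on the example above (it assumes `datapackage.json` lists at least one resource):

```python
from datapackage import Package

package = Package('datapackage.json')
for resource in package.resources:
    # resource.name and resource.tabular are documented Resource properties
    print(resource.name, resource.tabular)
```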

### Working with Package

A class for working with data packages. It provides various capabilities like loading local or remote data packages, inferring a data package descriptor, saving a data package descriptor, and many more.

Suppose we have some local CSV files in a `data` directory. Let's create a data package based on this data using the `Package` class:

> data/cities.csv

```csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,"41.89,12.51"
```

> data/population.csv

```csv
city,year,population
london,2017,8780000
paris,2017,2240000
rome,2017,2860000
```

First we create a blank data package:

```python
package = Package()
```

Now we're ready to infer a data package descriptor based on the data files we have. Because we have two CSV files, we use the glob pattern `**/*.csv`:

```python
package.infer('**/*.csv')
package.descriptor
#{'profile': 'tabular-data-package',
# 'resources': [
#   {'path': 'data/cities.csv',
#    'profile': 'tabular-data-resource',
#    'encoding': 'utf-8',
#    'name': 'cities',
#    'format': 'csv',
#    'mediatype': 'text/csv',
#    'schema': {...}},
#   {'path': 'data/population.csv',
#    'profile': 'tabular-data-resource',
#    'encoding': 'utf-8',
#    'name': 'population',
#    'format': 'csv',
#    'mediatype': 'text/csv',
#    'schema': {...}}]}
```

The `infer` method has found all our files and inspected them to extract useful metadata like profile, encoding, format, Table Schema etc. Let's tweak the descriptor a little bit:

```python
package.descriptor['resources'][1]['schema']['fields'][1]['type'] = 'year'
package.commit()
package.valid # True
```

Because our resources are tabular, we can read them as tabular data:

```python
package.get_resource('population').read(keyed=True)
#[{'city': 'london', 'year': 2017, 'population': 8780000},
# {'city': 'paris', 'year': 2017, 'population': 2240000},
# {'city': 'rome', 'year': 2017, 'population': 2860000}]
```

Let's save our data package to disk as a zip file:

```python
package.save('datapackage.zip')
```

To continue working with the data package we just load it again, this time using the local `datapackage.zip`:

```python
package = Package('datapackage.zip')
# Continue the work
```

This was only a basic introduction to the `Package` class. To learn more, take a look at the [`Package`](#package) API reference.

### Working with Resource

A class for working with data resources. You can read or iterate tabular resources using the `iter/read` methods, and read any resource as bytes using the `raw_iter/raw_read` methods.

Suppose we have a local CSV file. It could also be inline data or a remote link (all are supported by the `Resource` class), but say it's `data.csv` for now:

```csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
```

Let's create and read a resource. Because the resource is tabular, we can use the `resource.read` method with the `keyed` option to get an array of keyed rows:

```python
from datapackage import Resource

resource = Resource({'path': 'data.csv'})
resource.tabular # True
resource.read(keyed=True)
# [
#   {'city': 'london', 'location': '51.50,-0.11'},
#   {'city': 'paris', 'location': '48.85,2.30'},
#   {'city': 'rome', 'location': 'N/A'},
# ]
resource.headers
# ['city', 'location']
# (reading has to be started first)
```
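
Inline and remote sources follow the same pattern. A minimal sketch (the URL below is hypothetical):

```python
from datapackage import Resource

# Inline data: a list of rows with the header row first
inline = Resource({'data': [['city', 'location'], ['london', '51.50,-0.11']]})

# Remote file (hypothetical URL)
remote = Resource({'path': 'https://example.com/data.csv'})
```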

As we can see, our locations are just strings, but they should be geopoints. Also, Rome's location is not available, yet it's represented by the string `N/A` instead of Python's `None`. First we have to infer the resource metadata:

```python
resource.infer()
resource.descriptor
#{'path': 'data.csv',
# 'profile': 'tabular-data-resource',
# 'encoding': 'utf-8',
# 'name': 'data',
# 'format': 'csv',
# 'mediatype': 'text/csv',
# 'schema': {'fields': [{...}, {...}], 'missingValues': ['']}}
resource.read(keyed=True)
# Fails with a data validation error
```

Let's fix the unavailable location. There is a `missingValues` property in the Table Schema specification. As a first try we set `missingValues` to `N/A` in `resource.descriptor['schema']`. The resource descriptor can be changed in-place, but all changes have to be committed with `resource.commit()`:

```python
resource.descriptor['schema']['missingValues'] = 'N/A'
resource.commit()
resource.valid # False
resource.errors
# [<ValidationError: "'N/A' is not of type 'array'">]
```

Being good citizens, we've decided to check our resource descriptor's validity. And it's not valid! We should use an array for the `missingValues` property. Also, don't forget to keep the empty string as a missing value:

```python
resource.descriptor['schema']['missingValues'] = ['', 'N/A']
resource.commit()
resource.valid # True
```

All good. It looks like we're ready to read our data again:

```python
resource.read(keyed=True)
# [
#   {'city': 'london', 'location': [51.50, -0.11]},
#   {'city': 'paris', 'location': [48.85, 2.30]},
#   {'city': 'rome', 'location': None},
# ]
```

Now we see that:
- locations are arrays with numeric latitude and longitude
- Rome's location is a native Python `None`

And because there are no errors on data reading, we can be sure that our data is valid against our schema. Let's save our resource descriptor:

```python
resource.save('dataresource.json')
```

Let's check the newly-created `dataresource.json`. It contains the path to our data file, the inferred metadata and our `missingValues` tweak:

```json
{
    "path": "data.csv",
    "profile": "tabular-data-resource",
    "encoding": "utf-8",
    "name": "data",
    "format": "csv",
    "mediatype": "text/csv",
    "schema": {
        "fields": [
            {
                "name": "city",
                "type": "string",
                "format": "default"
            },
            {
                "name": "location",
                "type": "geopoint",
                "format": "default"
            }
        ],
        "missingValues": [
            "",
            "N/A"
        ]
    }
}
```

If we decide to improve it even more, we can update the `dataresource.json` file and then open it again using the local file name:

```python
resource = Resource('dataresource.json')
# Continue the work
```

This was only a basic introduction to the `Resource` class. To learn more, take a look at the [`Resource`](#resource) API reference.

### Working with Group

A class representing a group of tabular resources. Groups can be used to read multiple resources as one, or to export them, for example, to a database as one table. To define a group, add a `group: <name>` field to the corresponding resources. The group's metadata will be created from the "leading" resource's metadata (the first resource with the group name).

Suppose we have a data package with two tables partitioned by year and a shared schema stored separately:

> cars-2017.csv

```csv
name,value
bmw,2017
tesla,2017
nissan,2017
```

> cars-2018.csv

```csv
name,value
bmw,2018
tesla,2018
nissan,2018
```

> cars.schema.json

```json
{
    "fields": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "value",
            "type": "integer"
        }
    ]
}
```

> datapackage.json

```json
{
    "name": "datapackage",
    "resources": [
        {
            "group": "cars",
            "name": "cars-2017",
            "path": "cars-2017.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        },
        {
            "group": "cars",
            "name": "cars-2018",
            "path": "cars-2018.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        }
    ]
}
```

Let's read the resources separately:

```python
package = Package('datapackage.json')
package.get_resource('cars-2017').read(keyed=True) == [
    {'name': 'bmw', 'value': 2017},
    {'name': 'tesla', 'value': 2017},
    {'name': 'nissan', 'value': 2017},
]
package.get_resource('cars-2018').read(keyed=True) == [
    {'name': 'bmw', 'value': 2018},
    {'name': 'tesla', 'value': 2018},
    {'name': 'nissan', 'value': 2018},
]
```

On the other hand, these resources are defined with a `group: cars` field. It means we can treat them as a group:

```python
package = Package('datapackage.json')
package.get_group('cars').read(keyed=True) == [
    {'name': 'bmw', 'value': 2017},
    {'name': 'tesla', 'value': 2017},
    {'name': 'nissan', 'value': 2017},
    {'name': 'bmw', 'value': 2018},
    {'name': 'tesla', 'value': 2018},
    {'name': 'nissan', 'value': 2018},
]
```

We can use this approach when we need to save the data package to a storage, for example, to a SQL database. There is a `merge_groups` flag to enable the grouping behaviour:

```python
package = Package('datapackage.json')
package.save(storage='sql', engine=engine)
# SQL tables:
# - cars-2017
# - cars-2018
package.save(storage='sql', engine=engine, merge_groups=True)
# SQL tables:
# - cars
```
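
Here `engine` is a SQLAlchemy engine consumed by the SQL storage backend. A minimal sketch, assuming the SQL storage dependency (`tableschema-sql`) is installed:

```python
from sqlalchemy import create_engine

# An in-memory SQLite database is enough for a quick try
engine = create_engine('sqlite://')
```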

### Working with Profile

A component to represent a JSON Schema profile from the [Profiles Registry](https://specs.frictionlessdata.io/schemas/registry.json):

```python
from datapackage import Profile, exceptions

profile = Profile('data-package')

profile.name # data-package
profile.jsonschema # JSON Schema contents

try:
    valid = profile.validate(descriptor)
except exceptions.ValidationError as exception:
    for error in exception.errors:
        print(error)  # handle individual error
```

### Working with Foreign Keys

The library supports foreign keys described in the [Table Schema](http://specs.frictionlessdata.io/table-schema/#foreign-keys) specification. It means that if your data package descriptor uses the `resources[].schema.foreignKeys` property for some resources, data integrity will be checked on reading operations.

Suppose we have a data package:

```python
DESCRIPTOR = {
    'resources': [
        {
            'name': 'teams',
            'data': [
                ['id', 'name', 'city'],
                ['1', 'Arsenal', 'London'],
                ['2', 'Real', 'Madrid'],
                ['3', 'Bayern', 'Munich'],
            ],
            'schema': {
                'fields': [
                    {'name': 'id', 'type': 'integer'},
                    {'name': 'name', 'type': 'string'},
                    {'name': 'city', 'type': 'string'},
                ],
                'foreignKeys': [
                    {
                        'fields': 'city',
                        'reference': {'resource': 'cities', 'fields': 'name'},
                    },
                ],
            },
        }, {
            'name': 'cities',
            'data': [
                ['name', 'country'],
                ['London', 'England'],
                ['Madrid', 'Spain'],
            ],
        },
    ],
}
```

Let's check relations for the `teams` resource:

```python
from datapackage import Package

package = Package(DESCRIPTOR)
teams = package.get_resource('teams')
teams.check_relations()
# tableschema.exceptions.RelationError: Foreign key "['city']" violation in row "4"
```

As we can see, there is a foreign key violation. That's because our lookup table `cities` doesn't have the city of `Munich`, but we have a team from there. We need to fix it in the `cities` resource:

```python
package.descriptor['resources'][1]['data'].append(['Munich', 'Germany'])
package.commit()
teams = package.get_resource('teams')
teams.check_relations()
# True
```

Fixed! But not only a check operation is available. We can use the `relations` argument for the `resource.iter/read` methods to dereference a resource's relations:

```python
teams.read(keyed=True, relations=True)
#[{'id': 1, 'name': 'Arsenal', 'city': {'name': 'London', 'country': 'England'}},
# {'id': 2, 'name': 'Real', 'city': {'name': 'Madrid', 'country': 'Spain'}},
# {'id': 3, 'name': 'Bayern', 'city': {'name': 'Munich', 'country': 'Germany'}}]
```

Instead of a plain city name we've got a dictionary containing the city data. These `resource.iter/read` methods will fail with the same error as `resource.check_relations` if there is an integrity issue, but only if the `relations=True` flag is passed.

### Working with validate/infer

A standalone function to validate a data package descriptor:

```python
from datapackage import validate, exceptions

try:
    valid = validate(descriptor)
except exceptions.ValidationError as exception:
    for error in exception.errors:
        print(error)  # handle individual error
```

A standalone function to infer a data package descriptor:

```python
from datapackage import infer

descriptor = infer('**/*.csv')
#{'profile': 'tabular-data-package',
# 'resources': [
#   {'path': 'data/cities.csv',
#    'profile': 'tabular-data-resource',
#    'encoding': 'utf-8',
#    'name': 'cities',
#    'format': 'csv',
#    'mediatype': 'text/csv',
#    'schema': {...}},
#   {'path': 'data/population.csv',
#    'profile': 'tabular-data-resource',
#    'encoding': 'utf-8',
#    'name': 'population',
#    'format': 'csv',
#    'mediatype': 'text/csv',
#    'schema': {...}}]}
```

### Frequently Asked Questions

#### Accessing data behind a proxy server?

Before the `package = Package("https://xxx.json")` call, set these environment variables:

```python
import os

os.environ["HTTP_PROXY"] = 'xxx'
os.environ["HTTPS_PROXY"] = 'xxx'
```

## API Reference

### `cli`
```python
cli()
```
Command-line interface

```
Usage: datapackage [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  infer
  validate
```


### `Package`
```python
Package(self,
        descriptor=None,
        base_path=None,
        strict=False,
        unsafe=False,
        storage=None,
        schema=None,
        default_base_path=None,
        **options)
```
Package representation

__Arguments__
- __descriptor (str/dict)__: data package descriptor as a local path, URL or object
- __base_path (str)__: base path for all relative paths
- __strict (bool)__: strict flag to alter validation behavior.
    Setting it to `True` leads to throwing errors
    on any operation with an invalid descriptor
- __unsafe (bool)__:
    if `True` unsafe paths will be allowed. For more information see
    https://specs.frictionlessdata.io/data-resource/#data-location.
    Defaults to `False`
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong


#### `package.base_path`
Package's base path

__Returns__

`str/None`: returns the data package base path


#### `package.descriptor`
Package's descriptor

__Returns__

`dict`: descriptor


#### `package.errors`
Validation errors

Always empty in strict mode.

__Returns__

`Exception[]`: validation errors


#### `package.profile`
Package's profile

__Returns__

`Profile`: an instance of the `Profile` class


#### `package.resource_names`
Package's resource names

__Returns__

`str[]`: returns an array of resource names


#### `package.resources`
Package's resources

__Returns__

`Resource[]`: returns an array of `Resource` instances


#### `package.valid`
Validation status

Always true in strict mode.

__Returns__

`bool`: validation status


#### `package.get_resource`
```python
package.get_resource(name)
```
Get a data package resource by name.

__Arguments__
- __name (str)__: data resource name

__Returns__

`Resource/None`: returns a `Resource` instance or `None` if not found


#### `package.add_resource`
```python
package.add_resource(descriptor)
```
Add a new resource to the data package.

The data package descriptor will be validated with the newly added resource descriptor.

__Arguments__
- __descriptor (dict)__: data resource descriptor

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`Resource/None`: returns the added `Resource` instance or `None` if not added
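
For example, a minimal sketch adding an inline resource and fetching it back by name:

```python
from datapackage import Package

package = Package()
package.add_resource({'name': 'numbers', 'data': [['id'], [1], [2]]})
package.get_resource('numbers').name # numbers
```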


#### `package.remove_resource`
```python
package.remove_resource(name)
```
Remove a data package resource by name.

The data package descriptor will be validated after the resource descriptor removal.

__Arguments__
- __name (str)__: data resource name

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`Resource/None`: returns the removed `Resource` instance or `None` if not found


#### `package.get_group`
```python
package.get_group(name)
```
Returns a group of tabular resources by name.

For more information about groups see [Group](#group).

__Arguments__
- __name (str)__: name of a group of resources

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`Group/None`: returns a `Group` instance or `None` if not found


#### `package.infer`
```python
package.infer(pattern=False)
```
Infer a data package metadata.

> Argument `pattern` works only for local files

If `pattern` is not provided, only existing resources will be inferred
(added metadata like encoding, profile etc). If `pattern` is provided,
new resources with file names matching the pattern will be added and inferred.
It commits changes to the data package instance.

__Arguments__
- __pattern (str)__: glob pattern for new resources

__Returns__

`dict`: returns the data package descriptor


#### `package.commit`
```python
package.commit(strict=None)
```
Update the data package instance if there are in-place changes in the descriptor.

__Example__


```python
package = Package({
    'name': 'package',
    'resources': [{'name': 'resource', 'data': ['data']}]
})

package.name # package
package.descriptor['name'] = 'renamed-package'
package.name # package
package.commit()
package.name # renamed-package
```

__Arguments__
- __strict (bool)__: alter `strict` mode for further work

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success and false if not modified


#### `package.save`
```python
package.save(target=None,
             storage=None,
             merge_groups=False,
             to_base_path=False,
             **options)
```
Saves this data package

It saves it to storage if the `storage` argument is passed, or
saves this data package's descriptor to a JSON file if the `target` argument
ends with `.json`, or saves this data package to a zip file otherwise.

__Example__


It creates a zip file into `target` with the contents
of this Data Package and its resources. Every resource whose content
lives in the local filesystem will be copied to the zip file.
Consider the following Data Package descriptor:

```json
{
    "name": "gdp",
    "resources": [
        {"name": "local", "format": "CSV", "path": "data.csv"},
        {"name": "inline", "data": [4, 8, 15, 16, 23, 42]},
        {"name": "remote", "url": "http://someplace.com/data.csv"}
    ]
}
```

The final structure of the zip file will be:

```
./datapackage.json
./data/local.csv
```

The contents of `datapackage.json` will be the same as the
returned `datapackage.descriptor`. The resources' file names are generated
based on their `name` and `format` fields if they exist.
If a resource has no `name`, it will be named `resource-X`,
where `X` is the index of the resource in the `resources` list (starting at zero).
If the resource has a `format`, it will be lowercased and appended to the `name`,
becoming "`name.format`".

__Arguments__
- __target (string/filelike)__:
    the file path or a file-like object where
    the contents of this Data Package will be saved
- __storage (str/tableschema.Storage)__:
    storage name like `sql` or a storage instance
- __merge_groups (bool)__:
    save all the group's tabular resources into one bucket
    if a storage is provided (for example into one SQL table).
    Read more about [Group](#group).
- __to_base_path (bool)__:
    save the package to the package's base path
    using the "<base_path>/<target>" route
- __options (dict)__:
    storage options to use for storage creation

__Raises__
- `DataPackageException`: raises if there was some error writing the package

__Returns__

`bool/Storage`: on success returns true or a `Storage` instance

### `Resource`
```python
Resource(self,
         descriptor={},
         base_path=None,
         strict=False,
         unsafe=False,
         storage=None,
         package=None,
         **options)
```
Resource representation

__Arguments__
- __descriptor (str/dict)__: data resource descriptor as a local path, URL or object
- __base_path (str)__: base path for all relative paths
- __strict (bool)__:
    strict flag to alter validation behavior. Setting it to `True`
    leads to throwing errors on any operation with an invalid descriptor
- __unsafe (bool)__:
    if `True` unsafe paths will be allowed. For more information see
    https://specs.frictionlessdata.io/data-resource/#data-location.
    Defaults to `False`
- __storage (str/tableschema.Storage)__: storage name like `sql` or a storage instance
- __options (dict)__: storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong


#### `resource.data`
Return resource data


#### `resource.descriptor`
Resource's descriptor

__Returns__

`dict`: descriptor


#### `resource.errors`
Validation errors

Always empty in strict mode.

__Returns__

`Exception[]`: validation errors


#### `resource.group`
Group name

__Returns__

`str`: group name


#### `resource.headers`
Resource's headers

> Only for tabular resources (reading has to be started first or it's `None`)

__Returns__

`str[]/None`: returns data source headers


#### `resource.inline`
Whether the resource is inline

__Returns__

`bool`: returns true if the resource is inline


#### `resource.local`
Whether the resource is local

__Returns__

`bool`: returns true if the resource is local


#### `resource.multipart`
Whether the resource is multipart

__Returns__

`bool`: returns true if the resource is multipart


#### `resource.name`
Resource name

__Returns__

`str`: name


#### `resource.package`
Package instance if the resource belongs to some package

__Returns__

`Package/None`: a package instance if available


#### `resource.profile`
Resource's profile

__Returns__

`Profile`: an instance of the `Profile` class


#### `resource.remote`
Whether the resource is remote

__Returns__

`bool`: returns true if the resource is remote


#### `resource.schema`
Resource's schema

> Only for tabular resources

For tabular resources it returns a `Schema` instance to interact with the data schema.
Read the API documentation - [tableschema.Schema](https://github.com/frictionlessdata/tableschema-py#schema).

__Returns__

`tableschema.Schema`: schema


#### `resource.source`
Resource's source

The combination of `resource.source` and `resource.inline/local/remote/multipart`
provides a predictable interface to work with resource data.

__Returns__

`list/str`: returns the `data` or `path` property


#### `resource.table`
Return the resource table


#### `resource.tabular`
Whether the resource is tabular

__Returns__

`bool`: returns true if the resource is tabular


#### `resource.valid`
Validation status

Always true in strict mode.

__Returns__

`bool`: validation status


#### `resource.iter`
```python
resource.iter(integrity=False, relations=False, **options)
```
Iterates through the resource data and emits rows cast based on table schema.

> Only for tabular resources

__Arguments__


    keyed (bool):
        yield keyed rows in a form of `{header1: value1, header2: value2}`
        (default is false; the form of rows is `[value1, value2]`)

    extended (bool):
        yield extended rows in a form of `[rowNumber, [header1, header2], [value1, value2]]`
        (default is false; the form of rows is `[value1, value2]`)

    cast (bool):
        disable data casting if false
        (default is true)

    integrity (bool):
        if true, the actual size in BYTES and the SHA256 hash of the file
        will be checked against `descriptor.bytes` and `descriptor.hash`
        (other hashing algorithms are not supported and will be skipped silently)

    relations (bool):
        if true, foreign key fields will be checked and resolved to their references

    foreign_keys_values (dict):
        three-level dictionary of foreign key references optimized
        to speed up the validation process, in a form of
        `{resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}`.
        If not provided but `relations` is true, it will be created
        before the validation process by the *index_foreign_keys_values* method

    exc_handler (func):
        optional custom exception handler callable.
        Can be used to defer raising errors (i.e. "fail late"), e.g.
        for data validation purposes. Must support the signature below

__Custom exception handler__


```python
def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    '''Custom exception handler (example)

    # Arguments:
        exc(Exception):
            Deferred exception instance
        row_number(int):
            Data row number that triggers exception exc
        row_data(OrderedDict):
            Invalid data row source data
        error_data(OrderedDict):
            Data row source data field subset responsible for the error, if
            applicable (e.g. invalid primary or foreign key fields). May be
            identical to row_data.
    '''
    # ...
```

__Raises__
- `DataPackageException`: base class of any error
- `CastError`: data cast error
- `IntegrityError`: integrity checking error
- `UniqueKeyError`: unique key constraint violation
- `UnresolvedFKError`: unresolved foreign key reference error

__Returns__

`Iterator[list]`: yields rows
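
A minimal usage sketch, assuming `resource` is a tabular resource:

```python
# Stream rows one by one instead of loading everything into memory
for row in resource.iter(keyed=True):
    print(row)
```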


#### `resource.read`
```python
resource.read(integrity=False,
              relations=False,
              foreign_keys_values=False,
              **options)
```
Read the whole resource and return it as an array of rows

> Only for tabular resources
> It has the same API as `resource.iter`, except for:

__Arguments__
- __limit (int)__: limit the count of rows to read and return

__Returns__

`list[]`: returns rows
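
For example, a sketch reading only the first ten keyed rows:

```python
rows = resource.read(keyed=True, limit=10)
```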


#### `resource.check_integrity`
```python
resource.check_integrity()
```
Checks resource integrity

> Only for tabular resources

It checks the size in BYTES and the SHA256 hash of the file
against `descriptor.bytes` and `descriptor.hash`
(other hashing algorithms are not supported and will be skipped silently).

__Raises__
- `exceptions.IntegrityError`: raises if there are integrity issues

__Returns__

`bool`: returns True if no issues
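
A sketch of a guarded integrity check, assuming the descriptor declares `bytes` and/or a SHA256 `hash`:

```python
from datapackage import exceptions

try:
    resource.check_integrity()
except exceptions.IntegrityError as exception:
    print(exception)  # e.g. a size or hash mismatch
```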


#### `resource.check_relations`
```python
resource.check_relations(foreign_keys_values=False)
```
Check relations

> Only for tabular resources

It checks foreign keys and raises an exception if there are integrity issues.

__Raises__
- `exceptions.RelationError`: raises if there are relation issues

__Returns__

`bool`: returns True if no issues


#### `resource.drop_relations`
```python
resource.drop_relations()
```
Drop relations

> Only for tabular resources

Removes relations data from memory

__Returns__

`bool`: returns True


#### `resource.raw_iter`
```python
resource.raw_iter(stream=False)
```
Iterate over data chunks as bytes.

If `stream` is true, a file-like object will be returned.

__Arguments__
- __stream (bool)__: if true, a file-like object will be returned

__Returns__

`bytes[]/filelike`: returns bytes[]/filelike


#### `resource.raw_read`
```python
resource.raw_read()
```
Returns resource data as bytes.

__Returns__

`bytes`: returns resource data in bytes
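
For example, a sketch dumping the raw bytes to a local file (the target name is arbitrary):

```python
with open('resource-dump.bin', 'wb') as f:
    f.write(resource.raw_read())
```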


#### `resource.infer`
```python
resource.infer(**options)
```
Infer resource metadata

Like name, format, mediatype, encoding, schema and profile.
It commits these changes into the resource instance.

__Arguments__
- __options__:
    options will be passed to the `tableschema.infer` call,
    for more control on the results (e.g. for setting `limit`, `confidence` etc.).

__Returns__

`dict`: returns resource descriptor


#### `resource.commit`
```python
resource.commit(strict=None)
```
Update the resource instance if there are in-place changes in the descriptor.

__Arguments__
- __strict (bool)__: alter `strict` mode for further work

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success and false if not modified


#### `resource.save`
```python
resource.save(target, storage=None, to_base_path=False, **options)
```
Saves this resource

Into storage if the `storage` argument is passed, or
saves this resource's descriptor to a JSON file otherwise.

__Arguments__
- __target (str)__:
    path where to save the resource
- __storage (str/tableschema.Storage)__:
    storage name like `sql` or a storage instance
- __to_base_path (bool)__:
    save the resource to the resource's base path
    using the "<base_path>/<target>" route
- __options (dict)__:
    storage options to use for storage creation

__Raises__
- `DataPackageException`: raises an error if something goes wrong

__Returns__

`bool`: returns true on success

### `Group`
```python
Group(self, resources)
```
Group representation

__Arguments__
- __resources (Resource[])__: list of TABULAR resources


#### `group.headers`
Group's headers

__Returns__

`str[]/None`: returns headers


#### `group.name`
Group name

__Returns__

`str`: name


#### `group.schema`
Group's schema

__Returns__

`tableschema.Schema`: schema


#### `group.iter`
```python
group.iter(**options)
```
Iterates through the group data and emits rows cast based on table schema.

> It concatenates all the resources and has the same API as `resource.iter`


#### `group.read`
```python
group.read(limit=None, **options)
```
Read the whole group and return it as an array of rows

> It concatenates all the resources and has the same API as `resource.read`


#### `group.check_relations`
```python
group.check_relations()
```
Check the group's relations

The same as `resource.check_relations` but without the optional
argument *foreign_keys_values*. This method tests the foreignKeys of the
whole group at once, optimizing the process by creating the foreign_keys_values
hashmap only once before testing the set of resources.


### `Profile`
```python
Profile(self, profile)
```
Profile representation

__Arguments__
- __profile (str)__: profile name in the registry or a URL to a JSON Schema

__Raises__
- `DataPackageException`: raises an error if something goes wrong


#### `profile.jsonschema`
JSON Schema content

__Returns__

`dict`: returns the profile's JSON Schema contents


#### `profile.name`
Profile name

__Returns__

`str/None`: name if available


#### `profile.validate`
```python
profile.validate(descriptor)
```
Validate a data package `descriptor` against the profile.

__Arguments__
- __descriptor (dict)__: retrieved and dereferenced data package descriptor

__Raises__
- `ValidationError`: raises if not valid

__Returns__

`bool`: returns True if valid


### `validate`
```python
validate(descriptor)
```
Validate a data package descriptor.

__Arguments__
- __descriptor (str/dict)__: package descriptor (one of):
    - local path
    - remote url
    - object

__Raises__
- `ValidationError`: raises on invalid

__Returns__

`bool`: returns true on valid


### `infer`
```python
infer(pattern, base_path=None)
```
Infer a data package descriptor.

> Argument `pattern` works only for local files

__Arguments__
- __pattern (str)__: glob file pattern

__Returns__

`dict`: returns data package descriptor


### `DataPackageException`
```python
DataPackageException(self, message, errors=[])
```
Base class for all DataPackage/TableSchema exceptions.

If there are multiple errors, they can be read from the exception object:

```python
try:
    pass  # lib action
except DataPackageException as exception:
    if exception.multiple:
        for error in exception.errors:
            print(error)  # handle error
```


#### `datapackageexception.errors`
List of nested errors

__Returns__

`DataPackageException[]`: list of nested errors


#### `datapackageexception.multiple`
Whether it's a nested exception

__Returns__

`bool`: whether it's a nested exception


### `TableSchemaException`
```python
TableSchemaException(self, message, errors=[])
```
Base class for all TableSchema exceptions.


### `LoadError`
```python
LoadError(self, message, errors=[])
```
All loading errors.


### `CastError`
```python
CastError(self, message, errors=[])
```
All value cast errors.


### `IntegrityError`
```python
IntegrityError(self, message, errors=[])
```
All integrity errors.


### `RelationError`
```python
RelationError(self, message, errors=[])
```
All relations errors.


### `StorageError`
```python
StorageError(self, message, errors=[])
```
All storage errors.


## Contributing

> The project follows the [Open Knowledge International coding standards](https://github.com/okfn/coding-standards).

The recommended way to get started is to create and activate a project virtual environment.
To install the package and development dependencies into the active environment:

```bash
$ make install
```

To run tests with linting and coverage:

```bash
$ make test
```

## Changelog

Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted [commit history](https://github.com/frictionlessdata/datapackage-py/commits/master).

#### v1.15

> WARNING: it can be breaking for some setups, please read the discussions below

- Fixed header management according to the specs:
  - https://github.com/frictionlessdata/datapackage-py/pull/257
  - https://github.com/frictionlessdata/datapackage-py/issues/256
  - https://github.com/frictionlessdata/forum/issues/1

#### v1.14

- Added experimental options for picking/skipping fields/rows

#### v1.13

- Added the `unsafe` option to Package and Resource (#262)

#### v1.12

- Use `chardet` for encoding detection by default. For `cchardet`: `pip install datapackage[cchardet]`

#### v1.11

- `resource/package.save` now accepts a `to_base_path` argument (#254)
- `package.save` now returns a `Storage` instance if available

#### v1.10

- Added an ability to check a tabular resource's integrity

#### v1.9

- Added the `resource.package` property

#### v1.8

- Added support for [groups of resources](#group)

#### v1.7

- Added support for [compression of resources](https://frictionlessdata.io/specs/patterns/#compression-of-resources)

#### v1.6

- Added support for a custom request session

#### v1.5

Updated behaviour:
- Added support for Python 3.7

#### v1.4

New API added:
- added `skip_rows` support to the resource descriptor

#### v1.3

New API added:
- property `package.base_path` is now publicly available

#### v1.2

Updated behaviour:
- CLI command `$ datapackage infer` now outputs only a JSON-formatted data package descriptor.

#### v1.1

New API added:
- Added an integration between `Package/Resource` and `tableschema.Storage` - https://github.com/frictionlessdata/tableschema-py#storage. It allows loading and saving data packages from/to different storages like SQL/BigQuery/etc.
1631