Welcome to GeoBases documentation!

This module defines a class GeoBase to manipulate geographical data (or not). It loads static files containing data, then provides tools to play with it.

It relies on four other modules:

  • GeoUtils: to compute haversine distances between points
  • LevenshteinUtils: to calculate distances between strings. A good string metric is needed to recognize things like station names in schedule files where the station id is missing
  • GeoGridModule: to handle geographical indexation
  • SourcesManagerModule: to handle data sources

Examples for airports:

>>> geo_a = GeoBase(data='airports', verbose=False)
>>> sorted(geo_a.findNearKey('ORY', 50)) # Orly, airports <= 50km
[(0.0, 'ORY'), (18.8..., 'TNF'), (27.8..., 'LBG'), (34.8..., 'CDG')]
>>> geo_a.get('CDG', 'city_code')
'PAR'
>>> geo_a.distance('CDG', 'NCE')
694.516...

Examples for stations:

>>> geo_t = GeoBase(data='stations', verbose=False)
>>>
>>> # Nice, stations <= 3km
>>> point = (43.70, 7.26)
>>> [geo_t.get(k, 'name') for d, k in sorted(geo_t.findNearPoint(point, 3))]
['Nice-Ville', 'Nice-Riquier', 'Nice-St-Roch']
>>>
>>> geo_t.get('frpaz', 'name')
'Paris-Austerlitz'
>>> geo_t.distance('frnic', 'frpaz')
683.526...

Even with a ('iata_code', 'location_type') key, the points of reference data (ori_por) still contains a few duplicates:

>>> geo = GeoBase(data='ori_por', key_fields=['iata_code', 'location_type'])
In skipped zone, dropping line 1: "iata_code...".
/!\ [lno ...] CRK+C is duplicated #1, first found lno ...: creation of ...
/!\ [lno ...] EAP+C is duplicated #1, first found lno ...: creation of ...
/!\ [lno ...] OSF+C is duplicated #1, first found lno ...: creation of ...
/!\ [lno ...] RDU+C is duplicated #1, first found lno ...: creation of ...
Import successful from ...
Available fields for things: ...
class GeoBases.GeoBaseModule.GeoBase(data, **kwargs)[source]

Bases: GeoBases.VisualMixinModule.VisualMixin

This is the main and only class. After __init__, a file is loaded in memory, and the user may use the instance to get information.

__init__(data, **kwargs)[source]

Initialization

The kwargs parameters given when creating the object may be:

  • source : None by default, file-like to the source
  • paths : None by default, path or list of paths to the source. This will only be used if source is None.
  • headers : [] by default, list of fields in the data
  • key_fields : None by default, list of fields defining the key for a line, None means line numbers will be used to generate keys
  • indices : [] by default, an iterable of additional indexed fields
  • delimiter : '^' by default, delimiter for each field
  • subdelimiters : {} by default, a { 'field' : 'delimiter' } dict to define subdelimiters
  • join : [] by default, list of dict defining join clauses. A join clause is a dict { 'fields' : fields, 'with' : [base, fields]}, for example { 'fields' : 'country_code', 'with' : ['countries', 'code']}
  • quotechar : '"' by default, this is the string defined for quoting
  • limit : None by default, put an int if you want to load only the first lines
  • skip : None by default, put an int if you want to skip the first lines during loading
  • discard_dups : False by default, boolean to discard key duplicates or handle them
  • verbose : True by default, toggle verbosity
Parameters:
  • data – the type of data, 'airports', 'stations', and many more available. 'feed' will create an empty instance.
  • kwargs – additional parameters
Raises:

ValueError, if the data parameter is not recognized

Returns:

None

>>> geo_a = GeoBase(data='airports')
Import successful from ...
Available fields for things: ...
>>> geo_t = GeoBase(data='stations')
Import successful from ...
Available fields for things: ...
>>> geo_f = GeoBase(data='feed')
No source specified, skipping loading...
Available fields for things: ...
No geocode support, skipping grid...
>>> geo_c = GeoBase(data='odd')
Traceback (most recent call last):
ValueError: Wrong data type "odd". Not in ['aircraft', ...]

Import some custom data.

>>> p = 'DataSources/Airports/GeoNames/airports_geonames_only_clean.csv'
>>> fl = open(op.join(op.realpath(op.dirname(__file__)), p))
>>> GeoBase(data='feed',
...         source=fl,
...         headers=['iata_code', 'name', 'city'],
...         key_fields='iata_code',
...         delimiter='^',
...         verbose=False).get('ORY', 'name')
'Paris-Orly'
>>> fl.close()
>>> GeoBase(data='airports',
...         headers=['iata_code', 'cname', 'city'],
...         join=[],
...         verbose=False).get('ORY', 'cname')
'Paris-Orly'
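
The loading parameters above can be sketched in a few lines. This is only an illustration under stated assumptions, not the GeoBases implementation: load_rows is a hypothetical helper, the sample data is made up, and joining multi-field keys with '+' is inferred from the duplicate warnings shown earlier (CRK+C).

```python
import io


def load_rows(source, headers, key_fields=None, delimiter='^',
              skip=None, limit=None):
    """Hypothetical loader: one dict per line, indexed by a key."""
    if isinstance(key_fields, str):
        key_fields = [key_fields]
    things = {}
    for lno, line in enumerate(source, start=1):
        if skip is not None and lno <= skip:
            continue  # skip the first lines
        if limit is not None and len(things) >= limit:
            break  # only load the first lines
        values = line.rstrip('\n').split(delimiter)
        row = dict(zip(headers, values))
        if key_fields is None:
            key = str(lno)  # no key fields: use the line number
        else:
            key = '+'.join(row[f] for f in key_fields)
        things[key] = row
    return things


# Made-up sample using the default '^' delimiter
sample = io.StringIO("ORY^Paris-Orly^PAR\nCDG^Paris-Charles de Gaulle^PAR\n")
things = load_rows(sample, ['iata_code', 'name', 'city'],
                   key_fields='iata_code')
print(things['ORY']['name'])
```

Note that this sketch silently overwrites duplicated keys, whereas GeoBase reports and handles them according to discard_dups.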
addGrid(radius=50, precision=5, force=False, verbose=True)[source]

Create the grid for geographical indexation.

This operation is automatically performed at initialization if there is geocode support in headers.

Parameters:
  • radius – the grid accuracy, in kilometers. The precision parameter is used to define grid size
  • precision – the hash length. This is only used if radius is None, otherwise this parameter (a hash length) is computed from the radius
  • force – False by default, force grid update if it already exists
  • verbose – toggle verbosity
Returns:

None

>>> geo_o.addGrid(radius=50, force=True, verbose=True)
/!\ Grid already built, overriding...
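
To give an intuition of what a "hash length" means for a grid, here is a sketch assuming a geohash-style encoding. The documentation above only states that precision is a hash length, so the use of geohash, and the helper names geohash and build_grid, are assumptions made for illustration.

```python
from collections import defaultdict

BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'


def geohash(lat, lng, precision=5):
    """Standard geohash encoding: interleave longitude/latitude bits."""
    lat_range = [-90.0, 90.0]
    lng_range = [-180.0, 180.0]
    chars = []
    bits, value = 0, 0
    use_lng = True  # the first bit encodes the longitude
    while len(chars) < precision:
        rng, coord = (lng_range, lng) if use_lng else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            value = value << 1
            rng[1] = mid  # keep the lower half
        use_lng = not use_lng
        bits += 1
        if bits == 5:  # 5 bits make one base32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return ''.join(chars)


def build_grid(locations, precision=5):
    """Group keys by cell: a longer hash means smaller cells."""
    grid = defaultdict(set)
    for key, (lat, lng) in locations.items():
        grid[geohash(lat, lng, precision)].add(key)
    return dict(grid)


# Made-up locations: A and B are close, C is far away
locations = {'A': (57.64911, 10.40744), 'B': (57.65, 10.41), 'C': (-33.0, 151.0)}
grid = build_grid(locations, precision=3)
print(grid[geohash(57.64911, 10.40744, 3)])
```

Because cells have fixed borders, two close points may fall in different cells, and a cell may contain points farther away than the requested radius. This is why grid-based searches usually also look at neighbouring cells and re-check real distances, which is what the double_check parameter of the search methods is for.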
addIndex(fields, force=False, verbose=True)[source]

Add an index on an iterable of fields.

Parameters:
  • fields – the iterable of fields
  • force – False by default, force index update if it already exists
  • verbose – toggle verbosity
>>> geo_o.addIndex('iata_code', force=True, verbose=True)
/!\ Index on ('iata_code',) already built, overriding...
Built index for fields ('iata_code',)

Index on multiple fields.

>>> geo_o.addIndex(('icao_code', 'location_type'), verbose=True)
Built index for fields ('icao_code', 'location_type')

Do not force.

>>> geo_o.addIndex('iata_code', force=False, verbose=True)
/!\ Index on ('iata_code',) already built, exiting...
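
Conceptually, an index on an iterable of fields can be seen as a mapping from tuples of field values to the keys holding those values. The sketch below is an assumption about the general technique, not the GeoBases code; build_index and the sample data are hypothetical.

```python
from collections import defaultdict


def build_index(things, fields):
    """Map each tuple of field values to the list of matching keys."""
    if isinstance(fields, str):
        fields = (fields,)  # a single field is a 1-tuple of fields
    index = defaultdict(list)
    for key, row in things.items():
        index[tuple(row.get(f) for f in fields)].append(key)
    return dict(index)


# Made-up sample data
things = {
    'NCE': {'iata_code': 'NCE', 'location_type': 'A'},
    'NCE@1': {'iata_code': 'NCE', 'location_type': 'C'},
    'ORY': {'iata_code': 'ORY', 'location_type': 'A'},
}
by_code = build_index(things, 'iata_code')
by_code_type = build_index(things, ('iata_code', 'location_type'))
print(by_code[('NCE',)])
print(by_code_type[('NCE', 'A')])
```

With such a structure, a search on indexed fields becomes a dictionary lookup instead of a scan over all keys.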
biasFuzzyCache(fuzzy_value, field, max_results=None, min_match=0.75, from_keys=None, biased_result=())[source]

If the fuzzy search algorithms fail on a single example, it is possible to use a first cache which will bypass the search and force the result.

Parameters:
  • fuzzy_value – the value, like 'Marseille'
  • field – the field we look into, like 'name'
  • max_results – if None, returns all, if an int, only returns the first ones
  • min_match – filter out matches under this threshold
  • from_keys – if None, it takes all keys into consideration, else takes from_keys iterable of keys as search domain
  • biased_result – the expected result
Returns:

None

>>> geo_t.fuzzyFindCached('Marseille Saint Ch.', 'name')[0]
(0.8..., 'frmsc')
>>> geo_t.biasFuzzyCache('Marseille Saint Ch.',
...                      field='name',
...                      biased_result=[(1.0, 'Me!')])
>>> geo_t.fuzzyFindCached('Marseille Saint Ch.', 'name')[0]
(1.0, 'Me!')
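
The interplay between the bias cache and the regular cache can be sketched as two dictionaries consulted in order. This is a simplified illustration, not the GeoBases code: CachedSearch and its method names are hypothetical, and the cache entry here only contains part of the parameters.

```python
class CachedSearch(object):
    """Hypothetical wrapper adding a bias cache and a result cache."""

    def __init__(self, search):
        self._search = search  # the real (slow) search function
        self._cache = {}       # results already computed
        self._bias = {}        # results forced by the user
        self.calls = 0         # number of real searches performed

    def bias(self, value, field, result, max_results=None, min_match=0.75):
        self._bias[(value, field, max_results, min_match)] = result

    def clear_bias(self):
        self._bias = {}

    def find(self, value, field, max_results=None, min_match=0.75):
        entry = (value, field, max_results, min_match)
        if entry in self._bias:
            return self._bias[entry]  # the bias wins over everything
        if entry not in self._cache:
            self.calls += 1
            self._cache[entry] = self._search(value, field,
                                              max_results, min_match)
        return self._cache[entry]


# A stand-in search function always returning the same made-up match
searcher = CachedSearch(lambda v, f, m, t: [(0.8, 'frmsc')])
print(searcher.find('Marseille Saint Ch.', 'name'))
searcher.bias('Marseille Saint Ch.', 'name', [(1.0, 'Me!')])
print(searcher.find('Marseille Saint Ch.', 'name'))
```

Since the entry includes max_results and min_match, a bias only applies to searches made with the same parameters.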
clearFuzzyBiasCache()[source]

Clear biasing cache for fuzzy searches.

>>> geo_t.clearFuzzyBiasCache()
clearFuzzyCache()[source]

Clear cache for fuzzy searches.

>>> geo_t.clearFuzzyCache()
delete(key, field=None)[source]

Method to manually remove a value in the base.

Parameters:key – the key we want to delete
Returns:None
>>> data = geo_t.get('frxrn') # Output all data in one dict
>>> geo_t.delete('frxrn')
>>> geo_t.get('frxrn', 'name')
Traceback (most recent call last):
KeyError: 'Thing not found: frxrn'

How to reverse the delete if data has been stored:

>>> geo_t.set('frxrn', **data)
>>> geo_t.get('frxrn', 'name')
'Redon'

We can delete just a field.

>>> geo_t.delete('frxrn', 'lat')
>>> geo_t.get('frxrn', 'lat')
Traceback (most recent call last):
KeyError: "Field 'lat' [for key 'frxrn'] not in ...
>>> geo_t.get('frxrn', 'name')
'Redon'

And put it back again.

>>> geo_t.set('frxrn', lat='47.65179')
>>> geo_t.get('frxrn', 'lat')
'47.65179'
distance(key0, key1)[source]

Compute distance between two elements.

This is just a wrapper around the original haversine function, but it is probably one of the most used features :)

Parameters:
  • key0 – the first key
  • key1 – the second key
Returns:

the distance (km)

>>> geo_t.distance('frnic', 'frpaz')
683.526...
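
For reference, the haversine formula itself fits in a few lines. This is a generic sketch, not the GeoUtils code: the function signature and the Earth radius of 6371 km (a common mean value) are assumptions, so results may differ slightly from the figures shown in this documentation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine(lat_lng0, lat_lng1):
    """Great-circle distance in km between two (lat, lng) in degrees."""
    lat0, lng0 = map(radians, lat_lng0)
    lat1, lng1 = map(radians, lat_lng1)
    a = (sin((lat1 - lat0) / 2) ** 2 +
         cos(lat0) * cos(lat1) * sin((lng1 - lng0) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# A quarter of the equator
print(haversine((0, 0), (0, 90)))
```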
dropGrid(verbose=True)[source]

Delete grid.

Returns:None
>>> geo_t.dropGrid()
>>> geo_t.hasGrid()
False

Without a grid, searches only work with grid=False; attempting to use the grid fails.

>>> sorted(geo_t.findNearKey('frbve', grid=False))[0:3]
[(0.0, 'frbve'), (7.63..., 'fr2698'), (9.07..., 'fr3065')]
>>> sorted(geo_t.findNearKey('frbve'))[0:3]
Traceback (most recent call last):
ValueError: Attempting to use grid, but grid is None

Adding the grid again.

>>> geo_t.addGrid(radius=50, verbose=True)
>>> sorted(geo_t.findNearKey('frbve'))[0:3]
[(0.0, 'frbve'), (7.63..., 'fr2698'), (9.07..., 'fr3065')]
dropIndex(fields=None, verbose=True)[source]

Drop an index on an iterable of fields.

If fields is not given all indexes are dropped.

Parameters:fields – the iterable of fields, if None, all indexes will be dropped
>>> geo_o.hasIndex(('icao_code', 'location_type'))
True
>>> geo_o.dropIndex(('icao_code', 'location_type'))
>>> geo_o.hasIndex(('icao_code', 'location_type'))
False
findClosestFromKey(key, N=1, from_keys=None, grid=True, double_check=True)[source]

Same as findClosestFromPoint, except the point is given not by a (lat, lng), but with its key, like 'ORY' or 'SFO'. We just look up in the base to retrieve latitude and longitude, then call findClosestFromPoint.

Parameters:
  • key – the key of the element (like 'SFO')
  • N – the N closest results wanted
  • from_keys – if None, it takes all keys into consideration, else takes from_keys iterable of keys to perform findClosestFromKey. This is useful when we have names and have to perform a matching based on name and location (see fuzzyFindNearPoint).
  • grid – boolean, use grid or not
  • double_check – when using grid, perform an additional check on results distance, this is useful because the grid is approximate, so the results are only as accurate as the grid size
Returns:

an iterable of (distance, key) like [(3.2, 'SFO'), (4.5, 'LAX')]

>>> list(geo_a.findClosestFromKey('ORY')) # Orly
[(0.0, 'ORY')]
>>> list(geo_a.findClosestFromKey('ORY', N=3))
[(0.0, 'ORY'), (18.80..., 'TNF'), (27.80..., 'LBG')]
>>> # Corner case, from_keys empty is not used
>>> list(geo_t.findClosestFromKey('ORY', N=2, from_keys=()))
[]
>>> list(geo_t.findClosestFromKey(None, N=2))
[]

No grid.

>>> list(geo_o.findClosestFromKey('ORY', grid=False))
[(0.0, 'ORY')]
>>> list(geo_a.findClosestFromKey('ORY', N=3, grid=False))
[(0.0, 'ORY'), (18.80..., 'TNF'), (27.80..., 'LBG')]
>>> list(geo_t.findClosestFromKey('frnic', N=1, grid=False))
[(0.0, 'frnic')]

Custom keys as search domain.

>>> keys = ('frpaz', 'frply', 'frbve')
>>> list(geo_t.findClosestFromKey('frnic',
...                               N=2,
...                               grid=False,
...                               from_keys=keys))
[(482.79..., 'frbve'), (683.52..., 'frpaz')]
findClosestFromPoint(lat_lng, N=1, from_keys=None, grid=True, double_check=True)[source]

Concept close to findNearPoint, but here we do not look for the things radius-close to a point, we look for the closest thing from this point, given by latitude/longitude.

Parameters:
  • lat_lng – the lat_lng of the point (a tuple (lat, lng))
  • N – the N closest results wanted
  • from_keys – if None, it takes all keys into consideration, else takes from_keys iterable of keys to perform findClosestFromPoint. This is useful when we have names and have to perform a matching based on name and location (see fuzzyFindNearPoint).
  • grid – boolean, use grid or not
  • double_check – when using grid, perform an additional check on results distance, this is useful because the grid is approximate, so the results are only as accurate as the grid size
Returns:

an iterable of (distance, key) like [(3.2, 'SFO'), (4.5, 'LAX')]

>>> point = (43.70, 7.26) # Nice
>>> list(geo_a.findClosestFromPoint(point))
[(5.82..., 'NCE')]
>>> list(geo_a.findClosestFromPoint(point, N=3))
[(5.82..., 'NCE'), (30.28..., 'CEQ'), (79.71..., 'ALL')]
>>> list(geo_t.findClosestFromPoint(point, N=1))
[(0.56..., 'frnic')]
>>> # Corner case, from_keys empty is not used
>>> list(geo_t.findClosestFromPoint(point, N=2, from_keys=()))
[]
>>> list(geo_t.findClosestFromPoint(None, N=2))
[]

No grid.

>>> list(geo_o.findClosestFromPoint(point, grid=False))
[(0.60..., 'NCE@1')]
>>> list(geo_a.findClosestFromPoint(point, grid=False))
[(5.82..., 'NCE')]
>>> list(geo_a.findClosestFromPoint(point, N=3, grid=False))
[(5.82..., 'NCE'), (30.28..., 'CEQ'), (79.71..., 'ALL')]
>>> list(geo_t.findClosestFromPoint(point, N=1, grid=False))
[(0.56..., 'frnic')]

Custom keys as search domain.

>>> keys = ('frpaz', 'frply', 'frbve')
>>> list(geo_t.findClosestFromPoint(point,
...                                 N=2,
...                                 grid=False,
...                                 from_keys=keys))
[(482.84..., 'frbve'), (683.89..., 'frpaz')]
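
The difference between a radius search and a closest search lies only in the selection step, as the brute-force sketch below shows (the equivalent of grid=False). Everything here is illustrative: find_near, find_closest and haversine are hypothetical helpers, the Earth radius is assumed, and the coordinates are rough approximations.

```python
import heapq
from math import asin, cos, radians, sin, sqrt


def haversine(p0, p1, radius_km=6371.0):
    """Great-circle distance in km (assumed mean Earth radius)."""
    lat0, lng0 = map(radians, p0)
    lat1, lng1 = map(radians, p1)
    a = (sin((lat1 - lat0) / 2) ** 2 +
         cos(lat0) * cos(lat1) * sin((lng1 - lng0) / 2) ** 2)
    return 2 * radius_km * asin(sqrt(a))


def find_near(locations, lat_lng, radius=50, from_keys=None):
    """Yield (distance, key) for every key within the radius."""
    keys = locations if from_keys is None else from_keys
    for key in keys:
        d = haversine(lat_lng, locations[key])
        if d <= radius:
            yield d, key


def find_closest(locations, lat_lng, N=1, from_keys=None):
    """Return the N smallest (distance, key), whatever the distance."""
    keys = locations if from_keys is None else from_keys
    return heapq.nsmallest(
        N, ((haversine(lat_lng, locations[k]), k) for k in keys))


# Approximate coordinates, for illustration only
locations = {'ORY': (48.72, 2.36), 'CDG': (49.01, 2.55), 'NCE': (43.66, 7.21)}
paris = (48.84, 2.367)
print(sorted(find_near(locations, paris, 20)))
print(find_closest(locations, paris, N=2))
```

heapq.nsmallest avoids sorting all distances when only a few results are wanted.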
findNearKey(key, radius=50, from_keys=None, grid=True, double_check=True)[source]

Same as findNearPoint, except the point is given not by a (lat, lng), but with its key, like 'ORY' or 'SFO'. We just look up in the base to retrieve latitude and longitude, then call findNearPoint.

Parameters:
  • key – the key of the element (like 'SFO')
  • radius – the radius of the search (kilometers)
  • from_keys – if None, it takes all keys into consideration, else takes from_keys iterable of keys to perform search.
  • grid – boolean, use grid or not
  • double_check – when using grid, perform an additional check on results distance, this is useful because the grid is approximate, so the results are only as accurate as the grid size
Returns:

an iterable of (distance, key) like [(3.2, 'SFO'), (4.5, 'LAX')]

>>> sorted(geo_o.findNearKey('ORY', 10)) # Orly, por <= 10km
[(0.0, 'ORY'), (6.94..., 'XJY'), (9.96..., 'QFC')]
>>> sorted(geo_a.findNearKey('ORY', 50)) # Orly, airports <= 50km
[(0.0, 'ORY'), (18.8..., 'TNF'), (27.8..., 'LBG'), (34.8..., 'CDG')]
>>> sorted(geo_t.findNearKey('frnic', 3)) # Nice station, stations <= 3km
[(0.0, 'frnic'), (2.2..., 'fr4342'), (2.3..., 'fr5737')]

No grid.

>>> # Orly, airports <= 50km
>>> sorted(geo_a.findNearKey('ORY', 50, grid=False))
[(0.0, 'ORY'), (18.8..., 'TNF'), (27.8..., 'LBG'), (34.8..., 'CDG')]
>>> 
>>> # Nice station, stations <= 3km
>>> sorted(geo_t.findNearKey('frnic', 3, grid=False))
[(0.0, 'frnic'), (2.2..., 'fr4342'), (2.3..., 'fr5737')]
>>> 
>>> keys = ['ORY', 'CDG', 'SFO']
>>> sorted(geo_a.findNearKey('ORY', 50, grid=False, from_keys=keys))
[(0.0, 'ORY'), (34.8..., 'CDG')]
findNearPoint(lat_lng, radius=50, from_keys=None, grid=True, double_check=True)[source]

Returns a list of nearby things from a point (given latitude and longitude), and a radius for the search. Note that the haversine function, which computes distances at the surface of a sphere, returns kilometers here, so the radius should be in kilometers.

Parameters:
  • lat_lng – the lat_lng of the point (a tuple (lat, lng))
  • radius – the radius of the search (kilometers)
  • from_keys – if None, it takes all keys into consideration, else takes from_keys iterable of keys to perform search.
  • grid – boolean, use grid or not
  • double_check – when using grid, perform an additional check on results distance, this is useful because the grid is approximate, so the results are only as accurate as the grid size
Returns:

an iterable of (distance, key) like [(3.2, 'SFO'), (4.5, 'LAX')]

>>> # Paris, airports <= 20km
>>> [geo_a.get(k, 'name') for d, k in
...  sorted(geo_a.findNearPoint((48.84, 2.367), 20))]
['Paris-Orly', 'Paris-Le Bourget']
>>>
>>> # Nice, stations <= 3km
>>> [geo_t.get(k, 'name') for d, k in
...  sorted(geo_t.findNearPoint((43.70, 7.26), 3))]
['Nice-Ville', 'Nice-Riquier', 'Nice-St-Roch']
>>>
>>> # Wrong geocode
>>> sorted(geo_t.findNearPoint(None, 5))
[]

No grid mode.

>>> # Paris, airports <= 20km
>>> [geo_a.get(k, 'name') for d, k in
...  sorted(geo_a.findNearPoint((48.84, 2.367), 20, grid=False))]
['Paris-Orly', 'Paris-Le Bourget']
>>> 
>>> # Nice, stations <= 3km
>>> [geo_t.get(k, 'name') for d, k in
...  sorted(geo_t.findNearPoint((43.70, 7.26), 3, grid=False))]
['Nice-Ville', 'Nice-Riquier', 'Nice-St-Roch']
>>> 
>>> # Paris, airports <= 50km with from_keys input list
>>> sorted(geo_a.findNearPoint((48.84, 2.367), 50,
...                            from_keys=['ORY', 'CDG', 'BVE'],
...                            grid=False))
[(12.76..., 'ORY'), (23.40..., 'CDG')]
findWith(conditions, from_keys=None, reverse=False, mode='and', index=True, verbose=False)[source]

Get an iterator over all keys whose fields match the given conditions.

For example, if you want to know all airports in Paris.

Parameters:
  • conditions – a list of ('field', 'value') conditions
  • reverse – we look for keys where the field is not the particular value. Note that this negation is done at the lower level, before combining conditions. So if you have two conditions with mode='and', expect results matching not condition 1 and not condition 2.
  • mode – either 'or' or 'and', how to handle several conditions
  • from_keys – if given, we will look for results from this iterable of keys
  • index – boolean to disable index when searching
  • verbose – toggle verbosity during search
Returns:

an iterable of (v, key) where v is the number of matched conditions

>>> list(geo_a.findWith([('city_code', 'PAR')]))
[(1, 'ORY'), (1, 'TNF'), (1, 'CDG'), (1, 'BVA')]
>>> len(list(geo_o.findWith([('comment', '')], reverse=True))) 
212
>>> len(list(geo_o.findWith([('__dup__', [])]))) 
6264
>>> # Counting duplicated keys
>>> len(list(geo_o.findWith([('__par__', [])], reverse=True))) 
5377

Testing indexes.

>>> list(geo_o.findWith([('iata_code', 'MRS')], mode='and', verbose=True))
["and" mode] Using index for ('iata_code',): value(s) ('MRS',)
[(1, 'MRS'), (1, 'MRS@1')]
>>> geo_o.addIndex('iata_code', force=True)
/!\ Index on ('iata_code',) already built, overriding...
Built index for fields ('iata_code',)
>>> geo_o.addIndex('location_type')
Built index for fields ('location_type',)

Now querying with simple indexes (dropping multiple index if it exists).

>>> geo_o.dropIndex(('iata_code', 'location_type'), verbose=False)
>>> list(geo_o.findWith([('iata_code', 'NCE'), ('location_type', ('A',))],
...                     mode='and',
...                     verbose=True))
["and" mode] Using index for ('iata_code',) and ('location_type',): value(s) ('NCE',); (('A',),)
[(2, 'NCE')]

Multiple index.

>>> geo_o.addIndex(('iata_code', 'location_type'), verbose=False)
>>> list(geo_o.findWith([('iata_code', 'NCE'), ('location_type', ('A',))],
...                     mode='and',
...                     verbose=True))
["and" mode] Using index for ('iata_code', 'location_type'): value(s) ('NCE', ('A',))
[(2, 'NCE')]

Mode “or” with index.

>>> geo_o.addIndex('city_code_list')
Built index for fields ('city_code_list',)
>>> list(geo_o.findWith([('iata_code', 'NCE'), ('city_code_list', ('NCE',))],
...                     mode='or',
...                     verbose=True))
["or" mode] Using index for ('iata_code',) and ('city_code_list',): value(s) ('NCE',); (('NCE',),)
[(2, 'NCE@1'), (2, 'NCE')]
>>> list(geo_o.findWith([('iata_code', 'NCE'), ('city_code_list', ('NCE',))],
...                     mode='or',
...                     index=False,
...                     verbose=True))
[(2, 'NCE'), (2, 'NCE@1')]

Testing several conditions.

>>> c_1 = [('city_code_list', ('PAR',))]
>>> c_2 = [('location_type', ('H',))]
>>> len(list(geo_o.findWith(c_1)))
17
>>> len(list(geo_o.findWith(c_2))) 
100
>>> len(list(geo_o.findWith(c_1 + c_2, mode='and'))) 
2
>>> len(list(geo_o.findWith(c_1 + c_2, mode='or'))) 
111
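
The way mode and reverse combine can be sketched without any index. This is a plain scan written for illustration, not the GeoBases code: find_with and the sample data are hypothetical. Each condition is negated first when reverse is True, then the conditions are combined according to the mode.

```python
def find_with(things, conditions, mode='and', reverse=False):
    """Return (number of matched conditions, key) for matching keys."""
    if mode not in ('and', 'or'):
        raise ValueError('mode must be "and" or "or", not %r' % mode)
    results = []
    for key, row in things.items():
        # Negation happens per condition, before combining them
        matched = sum((row.get(f) == v) != reverse for f, v in conditions)
        if mode == 'and' and matched == len(conditions):
            results.append((matched, key))
        elif mode == 'or' and matched > 0:
            results.append((matched, key))
    return results


# Made-up sample data
things = {
    'ORY': {'city_code': 'PAR', 'location_type': 'A'},
    'CDG': {'city_code': 'PAR', 'location_type': 'A'},
    'NCE': {'city_code': 'NCE', 'location_type': 'A'},
    'PAR': {'city_code': 'PAR', 'location_type': 'C'},
}
conds = [('city_code', 'PAR'), ('location_type', 'A')]
print(find_with(things, conds, mode='and'))
print(find_with(things, conds, mode='or'))
print(find_with(things, conds, mode='and', reverse=True))
```

In this sample, the reversed 'and' search returns nothing, because no element is both outside Paris and not of type 'A'.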
static fuzzyClean(value)[source]

Cleaning from LevenshteinUtils.

>>> GeoBase.fuzzyClean('antibes ville 2')
'antibes'
fuzzyFind(fuzzy_value, field, max_results=None, min_match=0.75, from_keys=None)[source]

Fuzzy searches retrieve information on a thing when we do not know its code. We compare fuzzy_value, which is supposed to be the value of a field (e.g. a city or a name), to all things in the base, and we output the best matches. Matching is performed using the Levenshtein module, with a modified version of the Levenshtein ratio, adapted to the type of data.

Example: we look up 'Marseille Saint Ch.' in our base and we find the corresponding code by comparing all station names with 'Marseille Saint Ch.'.

Parameters:
  • fuzzy_value – the value, like 'Marseille'
  • field – the field we look into, like 'name'
  • max_results – max number of results, None means all results
  • min_match – filter out matches under this threshold
  • from_keys – if None, it takes all keys into consideration, else takes from_keys iterable of keys to perform fuzzyFind. This is useful when we have geocodes and have to perform a matching based on name and location (see fuzzyFindNearPoint).
Returns:

an iterable of (distance, key) like [(0.97, 'SFO'), (0.55, 'LAX')]

>>> geo_t.fuzzyFind('Marseille Charles', 'name')[0]
(0.8..., 'frmsc')
>>> geo_a.fuzzyFind('paris de gaulle', 'name')[0]
(0.78..., 'CDG')
>>> geo_a.fuzzyFind('paris de gaulle',
...                 field='name',
...                 max_results=3,
...                 min_match=0.55)
[(0.78..., 'CDG'), (0.60..., 'HUX'), (0.57..., 'LBG')]

Some corner cases.

>>> geo_a.fuzzyFind('paris de gaulle', 'name', max_results=None)[0]
(0.78..., 'CDG')
>>> geo_a.fuzzyFind('paris de gaulle', 'name',
...                 max_results=1, from_keys=[])
[]
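
The principle of a fuzzy search can be sketched with a plain Levenshtein distance turned into a similarity ratio. Note that GeoBases uses a modified ratio adapted to its data, which is not reproduced here: the ratio below (1 - distance / longest length), the lowercasing, and the helper names are assumptions, so the scores will not match those shown in this documentation.

```python
def levenshtein(a, b):
    """Classic edit distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def ratio(a, b):
    """Similarity in [0, 1], 1.0 meaning identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_find(value, names, max_results=None, min_match=0.75):
    """Return (ratio, key) sorted by decreasing ratio."""
    matches = [(ratio(value.lower(), name.lower()), key)
               for key, name in names.items()]
    matches = sorted((m for m in matches if m[0] >= min_match), reverse=True)
    return matches[:max_results]


# Made-up subset of station names
names = {'frnic': 'Nice-Ville', 'frpaz': 'Paris-Austerlitz'}
print(fuzzy_find('nice ville', names))
```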
fuzzyFindCached(fuzzy_value, field, max_results=None, min_match=0.75, from_keys=None, verbose=False, d_range=None)[source]

Same as fuzzyFind but with a caching and bias system.

Parameters:
  • fuzzy_value – the value, like 'Marseille'
  • field – the field we look into, like 'name'
  • max_results – max number of results, None means all results
  • min_match – filter out matches under this threshold
  • from_keys – if None, it takes all keys into consideration, else takes from_keys iterable of keys to perform fuzzyFind. This is useful when we have geocodes and have to perform a matching based on name and location (see fuzzyFindNearPoint).
  • verbose – display information on caching for a certain range of similarity
  • d_range – the range of similarity
Returns:

an iterable of (distance, key) like [(0.97, 'SFO'), (0.55, 'LAX')]

>>> geo_t.fuzzyFindCached('Marseille Saint Ch.', 'name')[0]
(0.8..., 'frmsc')
>>> geo_a.fuzzyFindCached('paris de gaulle',
...                       field='name',
...                       verbose=True,
...                       d_range=(0, 1))[0]
[0.79]           paris+de+gaulle ->   paris+charles+de+gaulle (  CDG)
(0.78..., 'CDG')
>>> geo_a.fuzzyFindCached('paris de gaulle',
...                       field='name',
...                       min_match=0.60,
...                       max_results=2,
...                       verbose=True,
...                       d_range=(0, 1))
[0.79]           paris+de+gaulle ->   paris+charles+de+gaulle (  CDG)
[0.61]           paris+de+gaulle ->        bahias+de+huatulco (  HUX)
[(0.78..., 'CDG'), (0.60..., 'HUX')]

Some biasing:

>>> geo_a.biasFuzzyCache('paris de gaulle',
...                      field='name',
...                      biased_result=[(0.5, 'Biased result')])
>>> geo_a.fuzzyFindCached('paris de gaulle',
...                       field='name',
...                       max_results=None,
...                       verbose=True,
...                       d_range=(0, 1))
Using bias: ('paris+de+gaulle', 'name', None, 0.75, None)
[(0.5, 'Biased result')]
>>> geo_a.clearFuzzyBiasCache()
>>> geo_a.fuzzyFindCached('paris de gaulle',
...                       field='name',
...                       max_results=None,
...                       verbose=True,
...                       min_match=0.75)
[(0.78..., 'CDG')]
fuzzyFindNearPoint(lat_lng, radius, fuzzy_value, field, max_results=None, min_match=0.75, from_keys=None, grid=True, double_check=True)[source]

Same as fuzzyFind, but we search only within a radius from a geocode.

Parameters:
  • lat_lng – the lat_lng of the point (a tuple (lat, lng))
  • radius – the radius of the search (kilometers)
  • fuzzy_value – the value, like 'Marseille'
  • field – the field we look into, like 'name'
  • max_results – if None, returns all, if an int, only returns the first ones
  • min_match – filter out matches under this threshold
  • from_keys – if None, it takes all keys into consideration, else takes a from_keys iterable of keys to perform search.
  • grid – boolean, use grid or not
  • double_check – when using grid, perform an additional check on results distance, this is useful because the grid is approximate, so the results are only as accurate as the grid size
Returns:

an iterable of (distance, key) like [(0.97, 'SFO'), (0.55, 'LAX')]

>>> geo_a.fuzzyFind('Brussels', 'name', min_match=0.60)[0]
(0.61..., 'BQT')
>>> geo_a.get('BQT', 'name')  # Brussels just matched on Brest!!
'Brest'
>>> geo_a.get('BRU', 'name') # We wanted BRU for 'Bruxelles'
'Bruxelles National'
>>> 
>>> # Now a request limited to a circle of 20km around BRU gives BRU
>>> point = (50.9013, 4.4844)
>>> geo_a.fuzzyFindNearPoint(point,
...                          radius=20,
...                          fuzzy_value='Brussels',
...                          field='name',
...                          min_match=0.40)[0]
(0.46..., 'BRU')
>>> 
>>> # Now a request limited to some input keys
>>> geo_a.fuzzyFindNearPoint(point,
...                          radius=2000,
...                          fuzzy_value='Brussels',
...                          field='name',
...                          max_results=1,
...                          min_match=0.30,
...                          from_keys=['ORY', 'CDG'])
[(0.33..., 'ORY')]
get(key, field=None, **kwargs)[source]

Simple get on the base.

Get the value of a field for a given key. For example, you can get the city_code_list of CDG. Use None as field value to get all information in a dictionary. You can give an additional keyword argument default, to avoid a KeyError on the key parameter.

Parameters:
  • key – the key of the element (like 'SFO')
  • field – the field (like 'name' or 'iata_code')
  • kwargs – other named arguments, use 'default' to avoid KeyError on key (not KeyError on field). Use 'ext_field' to fetch data from the joined base.
Raises:

KeyError if the key is not in the base

Returns:

the needed information

>>> geo_a.get('CDG', 'city_code')
'PAR'
>>> geo_t.get('frnic', 'name')
'Nice-Ville'
>>> geo_t.get('frnic')
{'info': 'Desserte Voyageur-Infrastructure', 'code': 'frnic', ...}

Cases of unknown key.

>>> geo_t.get('frmoron', 'name', default='There')
'There'
>>> geo_t.get('frmoron', 'name')
Traceback (most recent call last):
KeyError: 'Thing not found: frmoron'
>>> geo_t.get('frmoron', 'name', default=None)
>>> geo_t.get('frmoron', default='There')
'There'

Cases of unknown field: this is considered a bug and always fails, even when default is given.

>>> geo_t.get('frnic', 'not_a_field', default='There')
Traceback (most recent call last):
KeyError: "Field 'not_a_field' [for key 'frnic'] not in ['__dup__', ...
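
The asymmetry between unknown keys and unknown fields can be summarized in a small sketch. It is a hypothetical standalone function written for illustration (the things dictionary and the sentinel are assumptions), mirroring the behavior shown in the examples above.

```python
_MISSING = object()  # sentinel, so that default=None is a valid default


def get(things, key, field=None, default=_MISSING):
    """Hypothetical get: default protects the key, never the field."""
    if key not in things:
        if default is _MISSING:
            raise KeyError('Thing not found: %s' % key)
        return default
    row = things[key]
    if field is None:
        return row  # all information, as a dictionary
    if field not in row:
        # An unknown field is a programming error: always raise
        raise KeyError("Field '%s' [for key '%s'] not in %s" %
                       (field, key, sorted(row)))
    return row[field]


# Made-up sample data
things = {'frnic': {'name': 'Nice-Ville'}}
print(get(things, 'frnic', 'name'))
print(get(things, 'frmoron', 'name', default='There'))
```

Using a sentinel rather than None as the default marker is what allows default=None to return None instead of raising.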
getFromAllDuplicates(key, field=None, **kwargs)[source]

Get all duplicates data, parent key included.

Parameters:
  • key – the key of the element (like 'SFO')
  • field – the field (like 'name' or 'iata_code')
  • kwargs – other named arguments, use ‘default’ to avoid key failure
Returns:

the list of values for the given field iterated on all duplicates for the key, including the key itself

>>> for n in geo_o.getFromAllDuplicates('ORY', 'name'):
...     print(n)
Paris Orly Airport
>>> geo_o.getFromAllDuplicates('THA', 'name')
['Tullahoma Regional Airport/William Northern Field', 'Tullahoma']

One parent, one duplicate example.

>>> geo_o.get('THA@1', '__par__')
['THA']
>>> geo_o.get('THA', '__dup__')
['THA@1']

Using getFromAllDuplicates on the parent or on a duplicate returns the same elements, starting with the given key.

>>> geo_o.getFromAllDuplicates('THA', '__key__')
['THA', 'THA@1']
>>> geo_o.getFromAllDuplicates('THA@1', '__key__')
['THA@1', 'THA']

Corner cases are handled in the same way as get method.

>>> geo_o.getFromAllDuplicates('nnnnnnoooo', default='that')
'that'
>>> it = geo_o.getFromAllDuplicates('THA', field=None)
>>> [e['__key__'] for e in it]
['THA', 'THA@1']
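
The parent/duplicate bookkeeping shown above can be sketched as follows. This is an illustration inferred from the examples (keys like 'THA@1', fields __dup__, __par__ and __key__), not the GeoBases code; add_thing is a hypothetical helper and the numbering rule for further duplicates is an assumption.

```python
def add_thing(things, key, row):
    """Insert a row, renaming it KEY@n if the key is already taken."""
    row = dict(row, __dup__=[], __par__=[])
    if key not in things:
        row['__key__'] = key
        things[key] = row
        return key
    parent = things[key]
    # Assumed numbering: KEY@1, KEY@2, ... in order of appearance
    dup_key = '%s@%d' % (key, len(parent['__dup__']) + 1)
    row['__key__'] = dup_key
    row['__par__'] = [key]             # the duplicate knows its parent
    parent['__dup__'].append(dup_key)  # the parent knows its duplicates
    things[dup_key] = row
    return dup_key


things = {}
add_thing(things, 'THA', {'name': 'Tullahoma Regional Airport/William Northern Field'})
add_thing(things, 'THA', {'name': 'Tullahoma'})
print(things['THA']['__dup__'])
print(things['THA@1']['__par__'])
```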
getJoinBase(fields, verbose=True)[source]

Get the joined base from the fields which have a join.

Parameters:
  • fields – the iterable of fields
  • verbose – boolean, toggle verbosity
Returns:

a GeoBase object or None if fields are not joined

>>> geo_o.getJoinBase('iata_code')
Fields "('iata_code',)" do not have join, cannot retrieve external base.
>>> geo_o.getJoinBase('country_code') 
<GeoBases.GeoBaseModule.GeoBase object at 0x...>
getLocation(key, **kwargs)[source]

Returns geocode as (float, float) or None.

Parameters:
  • key – the key of the element (like 'SFO')
  • kwargs – other named arguments, use ‘default’ to avoid KeyError on key (not None on wrong value).
Returns:

the location, a tuple of floats like (lat, lng), or None if any problem happened during execution

>>> geo_o.getLocation('AGN')
(57.5..., -134...)

Behavior on unknown key.

>>> geo_o.getLocation('UNKNOWN')
Traceback (most recent call last):
KeyError: 'Thing not found: UNKNOWN'
>>> geo_o.getLocation('UNKNOWN', default=(0, 0))
(0, 0)
hasDuplicates(key)[source]

Tell if a key has duplicates.

Parameters:key – the key of the element (like 'SFO')
Returns:the number of duplicates
>>> geo_o.hasDuplicates('MRS')
1
>>> geo_o.hasDuplicates('MRS@1')
1
>>> geo_o.hasDuplicates('PAR')
0
hasGeoSupport(key=None)[source]

Check if data type has geocoding support.

If a key parameter is given, check the geocode support of this specific key.

Parameters:key – if key parameter is not None, we check the geocode support for this specific key, not for the general data with fields attribute
Returns:boolean for geocoding support
>>> geo_t.hasGeoSupport()
True
>>> geo_f.hasGeoSupport()
False

For a specific key.

>>> geo_o.hasGeoSupport('ORY')
True
>>> geo_o.set('EMPTY')
>>> geo_o.hasGeoSupport('EMPTY')
False
>>> geo_o.delete('EMPTY') # avoid messing other tests
hasGrid()[source]

Tells if the grid has been built.

Returns:a boolean
>>> geo_t.hasGrid()
True
>>> geo_t.dropGrid()
>>> geo_t.hasGrid()
False
>>> geo_t.addGrid()
hasIndex(fields=None)[source]

Tells if an iterable of fields is indexed.

Default value is None for fields, this will test the presence of any index.

Parameters:fields – the iterable of fields
Returns:a boolean
>>> geo_o.hasIndex('iata_code')
True
>>> geo_o.hasIndex(('iata_code', 'asciiname'))
False
>>> geo_o.hasIndex()
True
hasJoin(fields=None)[source]

Tells if an iterable of fields has join information.

Default value is None for fields, this will test the presence of any join information.

Parameters:fields – the iterable of fields
Returns:a boolean
>>> geo_o.hasJoin('iata_code')
False
>>> geo_o.hasJoin('tvl_por_list')
True
>>> geo_o.hasJoin()
True
hasParents(key)[source]

Tell if a key has parents.

Parameters:key – the key of the element (like 'SFO')
Returns:the number of parents
>>> geo_o.hasParents('MRS')
0
>>> geo_o.hasParents('MRS@1')
1
>>> geo_o.hasParents('PAR')
0
static hasTrepSupport()[source]

Check if module has OpenTrep support.

keys()[source]

Returns a list of all keys in the base.

Returns:the list of all keys
>>> geo_a.keys()
['AGN', 'AGM', 'AGJ', 'AGH', ...
static phonemes(value, method='dmetaphone')[source]

Compute phonemes for any value.

Parameters:
  • value – the input value
  • method – change the phonetic method used
Returns:

the phonemes

>>> GeoBase.phonemes('sheekago')
['XKK', None]
>>> GeoBase.phonemes('sheekago', 'nysiis')
'SACAG'
phoneticFind(value, field, method='dmetaphone', from_keys=None, verbose=False)[source]

Phonetic search.

Parameters:
  • value – the value for which we look for a match
  • field – the field, like 'name'
  • method – change the phonetic method used
  • from_keys – if None, it takes all keys in consideration, else takes from_keys iterable of keys to perform search.
  • verbose – toggle verbosity
Returns:

an iterable of (phonemes, key) matching

>>> list(geo_o.get(k, 'name') for _, k in
...      geo_o.phoneticFind(value='chicago',
...                         field='name',
...                         method='dmetaphone',
...                         verbose=True))
Looking for phonemes like ['XKK', None] (for "chicago")
['Chicago']
>>> list(geo_o.get(k, 'name') for _, k in
...      geo_o.phoneticFind('chicago', 'name', 'nysiis'))
['Chicago']

Alternate methods.

>>> list(geo_o.phoneticFind('chicago', 'name', 'dmetaphone'))
[(['XKK', None], 'CHI')]
>>> list(geo_o.phoneticFind('chicago', 'name', 'metaphone'))
[('XKK', 'CHI')]
>>> list(geo_o.phoneticFind('chicago', 'name', 'nysiis'))
[('CACAG', 'CHI')]
save(path=None, safe=False, headers=None, verbose=True)[source]

Save the data structure in the initial loaded file.

Parameters:
  • path – None as default. If no path is given, we try to save to the default path defined in the configuration file. Otherwise we try to save to the given path.
  • safe – default is False. If safe is False, the data is dumped in the initial loaded file. If True, a filename.new will be created to dump the data.
  • headers – the headers of data which will be dumped. Leave default to use headers defined in configuration. Otherwise, this must be a list of fields.
  • verbose – toggle verbosity
Returns:

None
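
The effect of safe can be pictured with a minimal sketch. This is not the GeoBase implementation: dump_rows, the '^' delimiter and the file names are hypothetical, and the only behaviour taken from the description above is that safe=True writes to filename.new instead of overwriting the loaded file.

```python
import os
import tempfile


def dump_rows(rows, path, safe=False, delimiter='^'):
    """Write rows to path, or to path + '.new' when safe is True.

    Hypothetical helper: the delimiter and row layout are assumptions.
    """
    target = path + '.new' if safe else path
    with open(target, 'w') as out:
        for row in rows:
            out.write(delimiter.join(row) + '\n')
    return target


# Create a small "initial loaded file" in a temporary directory.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'stations.csv')
with open(source, 'w') as out:
    out.write('frnic^Nice-Ville\n')

# With safe=True the original file is left untouched.
written = dump_rows([('frnic', 'Nice Gare SNCF')], source, safe=True)
```

Here written ends with '.new', and the content of the original file is unchanged.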

set(key, **kwargs)[source]

Method to manually change a value in the base.

Parameters:
  • key – the key we want to change a value of
  • kwargs – the keyword arguments containing new data
Returns:

None

Here are a few examples.

>>> geo_t.get('frnic', 'name')
'Nice-Ville'
>>> geo_t.set('frnic', name='Nice Gare SNCF')
>>> geo_t.get('frnic', 'name')
'Nice Gare SNCF'
>>> geo_t.set('frnic', name='Nice-Ville') # tearDown

We may even add new fields.

>>> geo_t.set('frnic', new_field='some_value')
>>> geo_t.get('frnic', 'new_field')
'some_value'

We can create just the key.

>>> geo_t.set('NEW_KEY_1')
>>> geo_t.get('NEW_KEY_1')
{'__gar__': [], ..., '__lno__': 0, '__key__': 'NEW_KEY_1'}
>>> geo_t.delete('NEW_KEY_1') # tearDown

Examples with an empty base.

>>> geo_f.keys()
[]

Set a new key with a dict, then get the data back.

>>> d = {
...     'code' : 'frnic',
...     'name' : 'Nice',
... }
>>> geo_f.set('frnic', **d)
>>> geo_f.keys()
['frnic']
>>> geo_f.get('frnic', 'name')
'Nice'

The base fields are not automatically updated when setting data.

>>> geo_f.fields
[]

You can manually update the fields.

>>> geo_f.syncFields()
>>> geo_f.fields
['__dup__', '__gar__', '__key__', '__lno__', '__par__', 'code', 'name']
syncFields(mode='all', sort=True)[source]

Iterate through the collection to look for all available fields, then assign the result to self.fields.

If you execute this method, be aware that fields order may change depending on how dictionaries return their keys. To have better consistency, we automatically sort the found fields. You can change this behavior with the sort parameter.

Parameters:
  • mode – 'all' or 'any'; 'all' will look for fields shared by all keys, 'any' will look for all fields from all keys
  • sort – sort the fields found
Returns:

None

>>> from pprint import pprint
>>> pprint(geo_t.fields)
['__key__',
 '__dup__',
 '__par__',
 '__lno__',
 'code',
 'lines@raw',
 'lines',
 'name',
 'info',
 'lat',
 'lng',
 '__gar__']

Fields synchronisation, common fields for all keys.

>>> geo_t.set('frnic', new_field='Nice Gare SNCF')
>>> geo_t.syncFields(mode='all')
>>> pprint(geo_t.fields) # did not change, except order
['__dup__',
 '__gar__',
 '__key__',
 '__lno__',
 '__par__',
 'code',
 'info',
 'lat',
 'lines',
 'lines@raw',
 'lng',
 'name']

Fields synchronisation, all fields for all keys.

>>> geo_t.syncFields(mode='any')
>>> pprint(geo_t.fields) # notice the new field 'new_field'
['__dup__',
 '__gar__',
 '__key__',
 '__lno__',
 '__par__',
 'code',
 'info',
 'lat',
 'lines',
 'lines@raw',
 'lng',
 'name',
 'new_field']

Restore previous state, drop new field and synchronize fields again.

>>> geo_t.delete('frnic', 'new_field')
>>> geo_t.syncFields()
>>> pprint(geo_t.fields)
['__dup__',
 '__gar__',
 '__key__',
 '__lno__',
 '__par__',
 'code',
 'info',
 'lat',
 'lines',
 'lines@raw',
 'lng',
 'name']
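
The difference between mode='all' and mode='any' amounts to an intersection versus a union of the fields found on each key. The sketch below illustrates that idea on a plain dictionary. The sync_fields function and the sample data are hypothetical and are not the GeoBase implementation.

```python
def sync_fields(things, mode='all', sort=True):
    """Collect field names from a {key: {field: value}} mapping.

    Hypothetical stand-in for the behaviour described for syncFields.
    """
    field_sets = [set(row) for row in things.values()]
    if not field_sets:
        fields = set()
    elif mode == 'all':
        # Fields shared by every key.
        fields = set.intersection(*field_sets)
    elif mode == 'any':
        # Fields present on at least one key.
        fields = set.union(*field_sets)
    else:
        raise ValueError("mode must be 'all' or 'any', got %r" % (mode,))
    # Sorting gives a stable order regardless of dictionary ordering.
    return sorted(fields) if sort else list(fields)


things = {
    'frnic': {'code': 'frnic', 'name': 'Nice-Ville', 'new_field': 'some_value'},
    'frpaz': {'code': 'frpaz', 'name': 'Paris-Austerlitz'},
}
shared = sync_fields(things, mode='all')  # ['code', 'name']
every = sync_fields(things, mode='any')   # ['code', 'name', 'new_field']
```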
static trepSearch(fuzzy_value, trep_format='S', from_keys=None, verbose=False)[source]

OpenTrep integration.

If not hasTrepSupport(), main_trep is not defined and trepSearch will raise an exception if called.

Parameters:
  • fuzzy_value – the fuzzy value
  • trep_format – the format given to OpenTrep
  • from_keys – if None, it takes all keys in consideration, else takes from_keys iterable of keys to perform search.
  • verbose – toggle verbosity
Returns:

an iterable of (distance, key) like [(0.97, 'SFO'), (0.55, 'LAX')]

>>> if GeoBase.hasTrepSupport():
...     print geo_t.trepSearch('sna francisco los agneles') 
[(31.5192, 'SFO'), (46.284, 'LAX')]
>>> if GeoBase.hasTrepSupport():
...     print geo_t.trepSearch('sna francisco', verbose=True) 
 -> Raw result: SFO/31.5192
 -> Fmt result: ([(31.5192, 'SFO')], '')
[(31.5192, 'SFO')]
updateGrid(verbose=True)[source]

Update the grid for geographical indexation.

Parameters:
  • verbose – toggle verbosity
Returns:

None

We use the grid for a query.

>>> sorted(geo_t.findNearKey('frbve'))[0:3]
[(0.0, 'frbve'), (7.63..., 'fr2698'), (9.07..., 'fr3065')]

Now we add a new key to the data.

>>> geo_t.set('NEW_KEY_3', **{
...     'lat' : '45.152',
...     'lng' : '1.528',
... })

If we run the query again, the result is wrong when using the grid, because it is not up-to-date.

>>> sorted(geo_t.findNearKey('frbve'))[0:3]
[(0.0, 'frbve'), (7.63..., 'fr2698'), (9.07..., 'fr3065')]
>>> sorted(geo_t.findNearKey('frbve', grid=False))[0:3]
[(0.0, 'frbve'), (0.07..., 'NEW_KEY_3'), (7.63..., 'fr2698')]

Now we update the grid, then the query works.

>>> geo_t.updateGrid()
>>> sorted(geo_t.findNearKey('frbve'))[0:3]
[(0.0, 'frbve'), (0.07..., 'NEW_KEY_3'), (7.63..., 'fr2698')]
>>> geo_t.delete('NEW_KEY_3') # avoid messing other tests

Note that updateGrid will not create the grid if it does not exist.

>>> geo_f.updateGrid()
No grid to update.
updateIndex(fields=None, verbose=True)[source]

Update index on fields.

If fields is not given, all indexes are updated.

Parameters:
  • fields – the iterable of fields, if None, all indexes will be updated
  • verbose – toggle verbosity

Here is an example, we drop the index then make a query.

>>> geo_o.dropIndex('iata_code')
>>> list(geo_o.findWith([('iata_code', 'NCE')])) # not indexed
[(1, 'NCE'), (1, 'NCE@1')]

Now we index and make the same query.

>>> geo_o.addIndex('iata_code')
Built index for fields ('iata_code',)
>>> list(geo_o.findWith([('iata_code', 'NCE')])) # indexed
[(1, 'NCE'), (1, 'NCE@1')]

Now we add a new key to the data.

>>> geo_o.set('NEW_KEY_2', **{
...     'iata_code' : 'NCE',
... })

If we run the query again, the result is wrong when using the index, because it is not up-to-date.

>>> list(geo_o.findWith([('iata_code', 'NCE')])) # indexed
[(1, 'NCE'), (1, 'NCE@1')]
>>> list(geo_o.findWith([('iata_code', 'NCE')], index=False))
[(1, 'NCE'), (1, 'NEW_KEY_2'), (1, 'NCE@1')]

Now we update the index, then the query works.

>>> geo_o.updateIndex('iata_code')
Built index for fields ('iata_code',)
>>> list(geo_o.findWith([('iata_code', 'NCE')])) # indexed, up to date
[(1, 'NCE'), (1, 'NEW_KEY_2'), (1, 'NCE@1')]
>>> geo_o.delete('NEW_KEY_2') # avoid messing other tests

Note that updateIndex will not create an index if it does not exist.

>>> geo_f.updateIndex('iata_code')
No index to update on "iata_code".
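
Why an index goes stale can be shown with a toy model. The TinyBase class below is a hypothetical illustration, not GeoBase code: it keeps a separate value-to-keys mapping per field, which is only as fresh as its last rebuild, and its update_index refuses to create an index that was never added.

```python
class TinyBase(object):
    """Toy key/value store with optional per-field indexes (illustration only)."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.indexes = {}  # field -> {value: [keys]}

    def add_index(self, field):
        index = {}
        for key, row in self.rows.items():
            index.setdefault(row.get(field), []).append(key)
        self.indexes[field] = index

    def update_index(self, field):
        # Mirrors the documented behaviour: never creates a missing index.
        if field not in self.indexes:
            return False
        self.add_index(field)
        return True

    def find_with(self, field, value, index=True):
        if index and field in self.indexes:
            # Fast path: answers from the index, which may be out of date.
            return sorted(self.indexes[field].get(value, []))
        # Full scan: always reflects the current data.
        return sorted(k for k, row in self.rows.items()
                      if row.get(field) == value)


base = TinyBase({'NCE': {'iata_code': 'NCE'}, 'ORY': {'iata_code': 'ORY'}})
base.add_index('iata_code')
base.rows['NEW_KEY_2'] = {'iata_code': 'NCE'}  # data changes, index does not

stale = base.find_with('iata_code', 'NCE')                 # ['NCE']
scanned = base.find_with('iata_code', 'NCE', index=False)  # ['NCE', 'NEW_KEY_2']
base.update_index('iata_code')
fresh = base.find_with('iata_code', 'NCE')                 # ['NCE', 'NEW_KEY_2']
```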

This module defines a class VisualMixin which will be used by the GeoBase as a mixin.

class GeoBases.VisualMixinModule.VisualMixin[source]

Bases: object

Main class used as mixin for the GeoBase class.

buildDashboardData(keep=10, dashboard_weight=None, from_keys=None)[source]

Build dashboard data.

Parameters:
  • keep – the number of values kept after counting for each field
  • dashboard_weight – the field used as weight for the graph. Leave None if you just want to count the number of keys
  • from_keys – only use this iterable of keys if not None
Returns:

a dictionary of fields counters information
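
The roles of keep and dashboard_weight can be sketched with collections.Counter. This is an assumption about the general idea, not the actual implementation: count_field is a hypothetical helper, and converting weights with float is also an assumption.

```python
from collections import Counter


def count_field(rows, field, keep=10, weight=None):
    """Count the values of a field, optionally weighted, keeping the top ones."""
    counter = Counter()
    for row in rows:
        # Without a weight field each key counts for 1.
        counter[row[field]] += 1 if weight is None else float(row[weight])
    return counter.most_common(keep)


rows = [
    {'country_code': 'FR', 'page_rank': '0.25'},
    {'country_code': 'FR', 'page_rank': '0.25'},
    {'country_code': 'FR', 'page_rank': '0.25'},
    {'country_code': 'US', 'page_rank': '1.0'},
    {'country_code': 'US', 'page_rank': '0.5'},
    {'country_code': 'AQ', 'page_rank': '0'},
]
by_count = count_field(rows, 'country_code', keep=2)
# [('FR', 3), ('US', 2)]
by_weight = count_field(rows, 'country_code', keep=2, weight='page_rank')
# [('US', 1.5), ('FR', 0.75)]
```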

buildGraphData(graph_fields, graph_weight=None, with_types=False, directed=False, from_keys=None)[source]

Build graph data.

Parameters:
  • graph_fields – iterable of fields used to define the nodes. Nodes are the values of these fields. Edges represent the data.
  • graph_weight – field used to define the weight of nodes and edges. If None, the weight is 1 for each key.
  • with_types – boolean telling whether values are typed by the field they come from. If False (untyped graph), only one node is created when the same value is found across different fields. If True, one node is created per field. Default is False, meaning untyped graphs.
  • directed – boolean, if the graph is directed or not, default is False.
  • from_keys – only use this iterable of keys if not None
Returns:

the nodes data

>>> nodes = g.buildGraphData(
...     graph_fields=['continent_name', 'country_code'],
...     graph_weight='page_rank'
... )
>>> edges = nodes['Antarctica']['edges'].values()
>>> sorted(edges[0].items())
[('from', 'Antarctica'), ('to', 'AQ'), ('weight', 0)]
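
The effect of with_types on node identity can be illustrated with a small sketch. The build_nodes function and the sample rows are hypothetical and much simpler than the real data structure (no edges, and a weight of 1 per key). They only show how typing a value by its field splits or merges nodes.

```python
def build_nodes(rows, graph_fields, with_types=False):
    """Return {node_id: weight}, counting one unit per row and field."""
    nodes = {}
    for row in rows:
        for field in graph_fields:
            value = row[field]
            # Typed graphs keep the field name in the node identity.
            node_id = (field, value) if with_types else value
            nodes[node_id] = nodes.get(node_id, 0) + 1
    return nodes


rows = [
    {'city_code': 'PAR', 'iata_code': 'PAR'},
    {'city_code': 'PAR', 'iata_code': 'CDG'},
]
untyped = build_nodes(rows, ['city_code', 'iata_code'])
# {'PAR': 3, 'CDG': 1}: 'PAR' is a single node shared by both fields
typed = build_nodes(rows, ['city_code', 'iata_code'], with_types=True)
# three nodes: ('city_code', 'PAR'), ('iata_code', 'PAR'), ('iata_code', 'CDG')
```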
dashboardVisualize(output='example', output_dir=None, keep=10, dashboard_weight=None, from_keys=None, verbose=True)[source]

Dashboard display (aggregated view).

Parameters:
  • output – set the name of the rendered files
  • output_dir – set the directory of the rendered files, will be created if it does not exist
  • keep – the number of values kept after counting for each field
  • dashboard_weight – the field used as weight for the graph. Leave None if you just want to count the number of keys
  • from_keys – only display this iterable of keys if not None
  • verbose – toggle verbosity
Returns:

this is the tuple of (names of templates rendered, (list of html templates, list of static files))

graphVisualize(graph_fields, graph_weight=None, with_types=False, from_keys=None, output='example', output_dir=None, verbose=True)[source]

Graph display (like force directed graph).

Parameters:
  • graph_fields – iterable of fields used to define the nodes. Nodes are the values of these fields. Edges represent the data.
  • graph_weight – field used to define the weight of nodes and edges. If None, the weight is 1 for each key.
  • with_types – boolean telling whether values are typed by the field they come from. If False (untyped graph), only one node is created when the same value is found across different fields. If True, one node is created per field. Default is False, meaning untyped graphs.
  • from_keys – only display this iterable of keys if not None
  • output – set the name of the rendered files
  • output_dir – set the directory of the rendered files, will be created if it does not exist
  • verbose – toggle verbosity
Returns:

this is the tuple of (names of templates rendered, (list of html templates, list of static files))

visualize(output='example', output_dir=None, icon_label=None, icon_weight=None, icon_color=None, icon_type='auto', from_keys=None, add_lines=None, add_anonymous_icons=None, add_anonymous_lines=None, link_duplicates=True, draw_join_fields=True, catalog=None, line_colors=None, use_3D=False, verbose=True, warnings=False)[source]

Map and table display.

Parameters:
  • output – set the name of the rendered files
  • output_dir – set the directory of the rendered files, will be created if it does not exist
  • icon_label – set the field which will appear as map icons title
  • icon_weight – set the field defining the map icons circle surface
  • icon_color – set the field defining the map icons colors
  • icon_type – set the icon size, either 'B', 'S', 'auto' or None for no-icons mode
  • from_keys – only display this iterable of keys if not None
  • add_lines – list of (key1, key2, ..., keyN) to draw additional lines
  • add_anonymous_icons – list of geocodes, like [(lat1, lng1), (lat2, lng2), ..., (latN, lngN)], to draw additional icons from geocodes not in the data
  • add_anonymous_lines – list of lists of geocodes, like [[(lat1, lng1), (lat2, lng2), ..., (latN, lngN)], ...], to draw additional lines from geocodes not in the data
  • link_duplicates – boolean toggling lines between duplicated keys, default True
  • draw_join_fields – boolean toggling drawing of join fields containing geocode information, default True
  • catalog – dictionary of {'value': 'color'} to have specific colors for some categories, which is computed with the icon_color field
  • line_colors – tuple of 4 colors to change the default line colors; the four values are for the four line types: those computed with link_duplicates, those given with add_lines, those given with add_anonymous_lines, those computed with draw_join_fields
  • use_3D – toggle 3D visualizations
  • verbose – toggle verbosity
  • warnings – toggle warnings, even more verbose
Returns:

this is the tuple of (names of templates rendered, (list of html templates, list of static files))