Monday, 12 August 2013

Starbase - a Python wrapper for Stargate (HBase REST API)

Introduction

Recently, while working on a Big Data project at "Goldmund, Wyldebeast & Wunderliebe" (http://www.goldmund-wyldebeast-wunderliebe.com), I got a chance to get acquainted with Hadoop.

Hadoop was chosen after studying tons of articles on the web, reading (and writing) white papers, and running basic performance tests (sometimes hard, if you're on a tight schedule). HBase became our Hadoop database of choice, and Cloudera Manager was chosen as the bundle - one to rule them all.

Python is my personal (and primary) programming language of choice and is also the primary language in the company I work for.

When starting to work with a new technology, I ideally want a clean and easy (pythonic!) API to work with.

I was surprised that I couldn't find any working Python wrapper around Stargate (the official REST API for HBase), so I finally decided to write one in my free time.

Ladies and gentlemen, let me introduce starbase.

I assume that the reader of this article already has some understanding of databases (and is familiar with such terms as table, table column and table row). Below, I will provide some code samples and briefly explain what each of them does.

Finally, you are welcome to report any issues related to starbase at the issue tracker on the starbase GitHub page:
https://github.com/barseghyanartur/starbase/issues

Installation

Install the latest version from the Cheese Shop (PyPI).
$ pip install starbase

Usage and examples

Working with the API starts with creating a connection instance.

Required imports

>>> from starbase import Connection

Create a connection instance

Defaults to 127.0.0.1:8000. If your settings are different, specify them when creating the connection instance.
>>> c = Connection()

Show tables

Assuming that there are two existing tables named table1 and table2, the following would be printed out.
>>> c.tables()
['table1', 'table2']

Operating with table schema

Whenever you need to operate on a table, you first need a table instance.
Create a table instance (note that no table is created at this step).
>>> t = c.table('table3')

Create a new table

Create a table named table3 with columns column1, column2, column3 (this is the point where the table is actually created). In the example below, column1, column2 and column3 are column families (in short - columns). Columns are declared in the table schema.

>>> t.create('column1', 'column2', 'column3')
201

Check if table exists

>>> t.exists()
True
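
The exists and create calls shown above combine naturally into a "create only if missing" helper. The following is a minimal sketch, not part of starbase: ensure_table and FakeTable are hypothetical names, and FakeTable is only an in-memory stand-in so that the example runs without an HBase server.

```python
def ensure_table(table, *columns):
    """Create the table only if it does not exist yet.

    `table` is any object exposing exists() and create(*columns), such as
    the table instance `t` above. Returns True if the table was created.
    """
    if table.exists():
        return False
    table.create(*columns)
    return True


class FakeTable:
    """Hypothetical in-memory stand-in; it mimics only exists() and create()."""

    def __init__(self):
        self.created_with = None

    def exists(self):
        return self.created_with is not None

    def create(self, *columns):
        self.created_with = columns
        return 201


fake = FakeTable()
first = ensure_table(fake, 'column1', 'column2', 'column3')   # creates the table
second = ensure_table(fake, 'column1', 'column2', 'column3')  # table exists, no-op
```

With a real connection, the same helper could be called as ensure_table(t, 'column1', 'column2', 'column3').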

Show table columns

>>> t.columns()
['column1', 'column2', 'column3']

Add columns to the table

Add the given columns (column4, column5, column6, column7).
>>> t.add_columns('column4', 'column5', 'column6', 'column7')
200

Drop columns from table

Drop the given columns (column6, column7).
>>> t.drop_columns('column6', 'column7')
201
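
When a schema has to be brought in line with a desired list of column families, it can help to compute what to add and what to drop before calling anything. The sketch below is plain Python and does not touch starbase; plan_column_changes is a hypothetical helper name.

```python
def plan_column_changes(current, wanted):
    """Return (to_add, to_drop) lists that turn `current` into `wanted`."""
    current_set, wanted_set = set(current), set(wanted)
    # Keep the caller's ordering so the resulting calls are predictable.
    to_add = [c for c in wanted if c not in current_set]
    to_drop = [c for c in current if c not in wanted_set]
    return to_add, to_drop


current = ['column1', 'column2', 'column3', 'column6', 'column7']
wanted = ['column1', 'column2', 'column3', 'column4', 'column5']
to_add, to_drop = plan_column_changes(current, wanted)
```

With a real table instance, current would come from t.columns(), and the two lists could then be passed on as t.add_columns(*to_add) and t.drop_columns(*to_drop). Keep in mind that dropping a column family also discards the data stored in it.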

Drop entire table schema

>>> t.drop()
200

Operating with table data

Insert data into a single row

HBase is a key/value store. In HBase, columns (also named column families) are part of the declared table schema and have to be defined when a table is created. Columns have qualifiers, which are not declared in the table schema. The number of column qualifiers is not limited.

Within a single row, a value is mapped by a column family and a qualifier (in terms of the key/value store concept). The value might be anything castable to a string (JSON objects, data structures, XML, etc).

In the example below, key11, key12, key21, etc. are the qualifiers, while column1, column2 and column3 are column families.

Column families must be composed of printable characters. Qualifiers can be made of any arbitrary bytes.

Table rows are identified by row keys - unique identifiers (UID, or so-called primary key). In the example below, my-key-1 is the row key (UID).

To recap all of the above, HBase maps (row key, column family, column qualifier, timestamp) to a value.

>>> t.insert(
...     'my-key-1',
...     {
...         'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
...         'column2': {'key21': 'value 21', 'key22': 'value 22'},
...         'column3': {'key31': 'value 31', 'key32': 'value 32'}
...     }
... )
200
Note that you may also use the native way of naming the columns and cells (qualifiers). The following stores the same cells as the previous example (here under the row key my-key-1a).
>>> t.insert(
...     'my-key-1a',
...     {
...         'column1:key11': 'value 11', 'column1:key12': 'value 12', 'column1:key13': 'value 13',
...         'column2:key21': 'value 21', 'column2:key22': 'value 22',
...         'column3:key31': 'value 31', 'column3:key32': 'value 32'
...     }
... )
200
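
The relation between the two notations can be sketched in a few lines of plain Python. This is only an illustration of the naming scheme, not starbase's internal code; to_native is a hypothetical helper name.

```python
def to_native(row):
    """Flatten {'family': {'qualifier': value}} into {'family:qualifier': value}."""
    flat = {}
    for family, cells in row.items():
        for qualifier, value in cells.items():
            flat['%s:%s' % (family, qualifier)] = value
    return flat


nested = {
    'column1': {'key11': 'value 11', 'key12': 'value 12'},
    'column2': {'key21': 'value 21'},
}
native = to_native(nested)
```

Here native ends up holding the keys 'column1:key11', 'column1:key12' and 'column2:key21', each mapped to the same value as in the nested form.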

Update row data

>>> t.update(
...     'my-key-1',
...     {'column4': {'key41': 'value 41', 'key42': 'value 42'}}
... )
200

Remove row, row column or row cell

Remove a row cell (qualifier).
>>> t.remove('my-key-1', 'column4', 'key41')
200
Remove a row column (column family).
>>> t.remove('my-key-1', 'column4')
200
Remove an entire row.
>>> t.remove('my-key-1')
200

Fetch table data

Fetch a single row with all columns.
>>> t.fetch('my-key-1')
{
    'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
    'column2': {'key21': 'value 21', 'key22': 'value 22'},
    'column3': {'key31': 'value 31', 'key32': 'value 32'}
}
Fetch a single row with selected columns (limit to column1 and column2 columns).
>>> t.fetch('my-key-1', ['column1', 'column2'])
{
    'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
    'column2': {'key21': 'value 21', 'key22': 'value 22'},
}
Narrow the result set even more (limit to cells key11 and key13 of column column1 and cell key32 of column column3).
>>> t.fetch('my-key-1', {'column1': ['key11', 'key13'], 'column3': ['key32']})
{
    'column1': {'key11': 'value 11', 'key13': 'value 13'},
    'column3': {'key32': 'value 32'}
}
Note that you may also use the native way of naming the columns and cells (qualifiers). The example below does exactly the same as the example above.
>>> t.fetch('my-key-1', ['column1:key11', 'column1:key13', 'column3:key32'])
{
    'column1': {'key11': 'value 11', 'key13': 'value 13'},
    'column3': {'key32': 'value 32'}
}
If you set the perfect_dict argument to False, you'll get the native data structure.
>>> t.fetch('my-key-1', ['column1:key11', 'column1:key13', 'column3:key32'], perfect_dict=False)
{
    'column1:key11': 'value 11', 'column1:key13': 'value 13',
    'column3:key32': 'value 32'
}
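
To make the difference between the two result shapes concrete, here is a small plain-Python sketch that regroups a native (flat) result into the nested form. It is an illustration only and makes no claim about how starbase does this internally; to_nested is a hypothetical helper name.

```python
def to_nested(native):
    """Group {'family:qualifier': value} into {'family': {'qualifier': value}}."""
    nested = {}
    for name, value in native.items():
        # Split on the first colon only: the qualifier part may itself
        # contain colons, the column family part may not.
        family, qualifier = name.split(':', 1)
        nested.setdefault(family, {})[qualifier] = value
    return nested


flat_row = {
    'column1:key11': 'value 11', 'column1:key13': 'value 13',
    'column3:key32': 'value 32',
}
nested_row = to_nested(flat_row)
```

The resulting nested_row has the same shape as the output of the fetch calls that leave perfect_dict at its default.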

Batch operations with table data

Batch operations (insert and update) work similarly to normal insert and update, but are done in a batch. You are advised to operate in batch as much as possible.

Batch insert

In the example below, we will insert 5000 records in a batch.
>>> data = {
...     'column1': {'key11': 'value 11', 'key12': 'value 12', 'key13': 'value 13'},
...     'column2': {'key21': 'value 21', 'key22': 'value 22'},
... }
>>> b = t.batch()
>>> for i in range(0, 5000):
...     b.insert('my-key-%s' % i, data)
>>> b.commit(finalize=True)
{'method': 'PUT', 'response': [200], 'url': 'table3/bXkta2V5LTA='}
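
The last part of the url value in the response above looks cryptic, but it is simply the Base64-encoded row key of the first record in the batch. A quick check using only the standard library:

```python
import base64

url = 'table3/bXkta2V5LTA='

# The part before the slash is the table name, the rest is the encoded row key.
table_name, encoded_key = url.split('/', 1)
row_key = base64.b64decode(encoded_key).decode('utf-8')
```

Decoding gives my-key-0, which is the first key produced by the loop.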

Batch update

In the example below, we will update 5000 records in a batch.
>>> data = {
...     'column3': {'key31': 'value 31', 'key32': 'value 32'},
... }
>>> b = t.batch()
>>> for i in range(0, 5000):
...     b.update('my-key-%s' % i, data)
>>> b.commit(finalize=True)
{'method': 'POST', 'response': [200], 'url': 'table3/bXkta2V5LTA='}
Note: The table batch method accepts an optional size argument (int). If set, an auto-commit is fired each time the stack is full.
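
The auto-commit behaviour described in the note can be pictured with a small stand-alone buffer. This is a sketch of the general idea under my own assumptions, not starbase's implementation; SizedBatch and its flush callback are hypothetical names.

```python
class SizedBatch:
    """Illustrative buffer that flushes itself whenever `size` items are queued."""

    def __init__(self, flush, size=None):
        self._flush = flush  # callable receiving a list of (row_key, data) pairs
        self._size = size    # None disables auto-commit
        self._stack = []

    def insert(self, row_key, data):
        self._stack.append((row_key, data))
        if self._size is not None and len(self._stack) >= self._size:
            self.commit()

    def commit(self):
        # Send whatever is queued and start over with an empty stack.
        if self._stack:
            self._flush(list(self._stack))
            self._stack = []


sent = []  # stands in for the HTTP requests a real batch would make
b = SizedBatch(flush=sent.append, size=1000)
for i in range(2500):
    b.insert('my-key-%s' % i, {'column1': {'key11': 'value 11'}})
b.commit()  # flush the remainder
chunk_sizes = [len(chunk) for chunk in sent]
```

In this run the 2500 records leave the buffer in three chunks of 1000, 1000 and 500. With starbase itself, going by the note above, the equivalent would presumably be t.batch(size=1000) followed by a final commit.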

Table data search (row scanning)

Table scanning is in development. At the moment it is only possible to fetch all rows from a given table. The result set is returned as a generator.
>>> t.fetch_all_rows()
<generator object results at 0x28e9190>
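
Because the result is a generator, rows are produced lazily and you can stop early without reading the whole table. The sketch below shows this with itertools.islice; first_rows and fake_rows are hypothetical names, and the shape of each yielded row is an assumption made for the sake of the example.

```python
from itertools import islice


def first_rows(rows, limit):
    """Return at most `limit` items from any iterable of rows."""
    return list(islice(rows, limit))


def fake_rows():
    """Stand-in generator; the row structure here is assumed, not documented."""
    for i in range(5000):
        yield {'column1': {'key11': 'value %s' % i}}


sample = first_rows(fake_rows(), 3)
```

With a real table instance this would read first_rows(t.fetch_all_rows(), 3).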

License

The starbase package is GPL 2.0/LGPL 2.1 licensed.

4 comments:

  1. Artur,
    I have used this wrapper with great success with my development HBase. Is there a flag that I can set within your package to echo/output/log the http requests?

    1. Hello. I have made an example of how to log HTTP requests in `starbase`.

      See this:

      https://github.com/barseghyanartur/starbase/blob/master/examples/logging_http_requests.py

      Best regards,

  2. Is there any batch size defined, like in the case of the happybase wrapper?

  3. like for happybase:
    batch = table.batch(batch_size=batch_size)
