
Sunday 27 May 2018

Getting Apache Drill working on Windows 10

Apache Drill is a program that lets you query many different data sources with a single SQL query. It supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files.

Basically, I want to be able to query CSV files, sqlite files, parquet files and other RDBMSs using SQL, through a unified interface, from Python. I'm looking into Apache Drill in the hope that it will solve all of these problems; I'll just be using it in local mode.

Download Apache Drill from here: https://drill.apache.org/download/. I did the direct file download and got version 1.13.0.

To get Drill working, you'll also need the Oracle JDK v8 from here: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. I got version jdk-8u171-windows-x64.exe.

Use 7-Zip to extract the tar.gz file to some directory. From the command prompt, cd to that directory; it should have the following contents:

bin
conf
git.properties
jars
KEYS
LICENSE
NOTICE
README.md
sample-data
winutils

cd to the bin folder and try to run

sqlline -u jdbc:drill:zk=local

The first error I got was: "WARN: JAVA_HOME not found in your environment. Please set the JAVA_HOME variable in your environment to match the location of your Java installation". This is easy to fix: make sure the Java JDK is installed and create an environment variable called JAVA_HOME with the value "C:\Program Files\Java\jdk1.8.0_171" (assuming that is where you installed it).
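
For example, a quick way to create the variable is from the command prompt with setx (adjust the path to match your JDK install; open a new command prompt afterwards for it to take effect):

setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_171"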

After doing that, the next error I got when running sqlline -u jdbc:drill:zk=local was: Error: Failure in connecting to Drill: org.apache.drill.exec.rpc.RpcException: Failure setting up ZK for client. (state=,code=0), followed by a long Java stack trace. Then, no matter what I typed, it simply said No current connection. To fix this one, you have to put quotes around the argument passed to sqlline, like this:

sqlline -u "jdbc:drill:zk=local"

Note that to exit sqlline I found Ctrl+D to be the easiest; this simulates an end-of-file character and the program stops gracefully. Now I can run commands like:

USE cp; 
SELECT employee_id, first_name FROM `employee.json`;
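
The default dfs storage plugin also lets you query local files directly, which covers the CSV and parquet part of my wish list. As a rough sketch: nation.parquet ships with Drill in the sample-data folder shown above, while example.csv is a made-up path; by default Drill exposes each CSV row as a single columns array unless you turn on extractHeader in the dfs plugin's csv format settings, and if backslashes give you trouble in the path, try forward slashes:

SELECT * FROM dfs.`C:\Users\james\Downloads\apache-drill-1.13.0\sample-data\nation.parquet`;
SELECT columns[0], columns[1] FROM dfs.`C:\Users\james\Downloads\example.csv` LIMIT 5;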

Connecting to sqlite databases from Apache Drill

Of course you can connect to sqlite databases directly using the sqlite3 console program, but I would like to query sqlite dbs, CSVs, parquet files and other RDBMSs all from the same interface. To do that I need the JDBC driver for sqlite, which is available here: https://bitbucket.org/xerial/sqlite-jdbc/downloads/. I got version sqlite-jdbc-3.21.0.jar. It has to be copied into the jars\3rdparty folder of the Apache Drill install directory, which is C:\Users\james\Downloads\apache-drill-1.13.0\jars\3rdparty on my system. Make sure you close and reopen sqlline/Drill before trying to use the driver.

Unfortunately, we can't access sqlite dbs in the same way as plain files; we have to create a new storage plugin for each sqlite database we have. I downloaded the sample database chinook.db from here: http://www.sqlitetutorial.net/sqlite-sample-database/. Extract chinook.db somewhere and remember where it is. I got the list of tables by opening the db in SQLiteStudio.

Now we need to open the Drill web console by navigating to http://localhost:8047/storage/ (make sure sqlline/Drill is running in the background). We need to create a new storage plugin, which we'll call chinook. At the bottom of the page there is a text box; type chinook in there and press the Create New Storage Plugin button. You'll be met with a text area that expects JSON. Enter the following, but replace the path I have entered with the path to your own database file:

{
  "type": "jdbc",
  "driver": "org.sqlite.JDBC",
  "url": "jdbc:sqlite:C:\\Users\\james\\Downloads\\chinook.db",
  "username": null,
  "password": null,
  "enabled": true
}

Now we can query the sqlite database using Drill (you may need to close and restart sqlline/Drill first):

SELECT * FROM chinook.customers WHERE lastname like 'g%';
+------------+-----------+-----------+--------------------------------------------------+--------------------------+
| CustomerId | FirstName | LastName  | Company                                          | Address                  |
+------------+-----------+-----------+--------------------------------------------------+--------------------------+
| 1          | Luís      | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lim |
| 7          | Astrid    | Gruber    | null                                             | Rotenturmstraße 4, 1010  |
| 19         | Tim       | Goyer     | Apple Inc.                                       | 1 Infinite Loop          |
| 23         | John      | Gordon    | null                                             | 69 Salem Street          |
| 27         | Patrick   | Gray      | null                                             | 1033 N Park Ave          |
| 42         | Wyatt     | Girard    | null                                             | 9, Place Louis Barthou   |
| 56         | Diego     | Gutiérrez | null                                             | 307 Macacha Güemes       |
+------------+-----------+-----------+--------------------------------------------------+--------------------------+
7 rows selected (0.185 seconds)
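
You should also be able to get the list of tables from Drill itself through the INFORMATION_SCHEMA, rather than opening the db in SQLiteStudio. I used SQLiteStudio as described above, so treat this as an untested sketch (the JDBC plugin's schema may show up as chinook or something like chinook.default):

SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.`TABLES` WHERE TABLE_SCHEMA LIKE 'chinook%';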

Unfortunately, not all tables could be read. If I try to open the employees table, I get:

 SELECT * FROM chinook.employees;
Error: DATA_READ ERROR: Failure while attempting to read from database.

sql SELECT *
FROM employees
plugin sqlite
Fragment 0:0

[Error Id: ac42e2eb-a8ac-478e-b5bd-de663f7bc93e on DESKTOP:31010] (state=,code=0)

The sqlite JDBC driver doesn't seem to like the datetime fields. Use something like SQLiteStudio to change the datatype of those fields to strings, and then everything will work. You'll just have to convert the string dates back to actual date objects in code later on.
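
Alternatively, the conversion can be done at query time with Drill's TO_TIMESTAMP function. This is only a sketch based on the chinook schema; it assumes the columns are called BirthDate and HireDate and that the stored strings look like 2002-08-14 00:00:00:

SELECT EmployeeId, LastName, TO_TIMESTAMP(HireDate, 'yyyy-MM-dd HH:mm:ss') AS HireDate FROM chinook.employees;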

Using Apache Drill from Python

I think the easiest way of doing this is to use pydrill (run pip install pydrill to get it). This uses the REST API, however, which is probably not the most efficient option. Using the ODBC drivers with pyodbc would likely be better, but I have not tested that yet.

from pydrill.client import PyDrill
from pydrill.exceptions import ImproperlyConfigured

# connect to the Drill REST endpoint (sqlline/Drill must already be running)
drill = PyDrill(host='localhost', port=8047)

if not drill.is_active():
    raise ImproperlyConfigured('Please run Drill first')

res = drill.query('SELECT * FROM chinook.customers LIMIT 5')

# each result is a dict of column name -> value
for result in res:
    print(result)

# pandas dataframe
df = res.to_dataframe()
print(df)
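
Following on from the datetime workaround above, here is a rough sketch of converting the stringified employee dates back into proper datetime objects on the Python side (it assumes the columns are named BirthDate and HireDate, as in the chinook schema):

import pandas as pd
from pydrill.client import PyDrill

drill = PyDrill(host='localhost', port=8047)

# read the whole employees table through the chinook storage plugin
emp = drill.query('SELECT * FROM chinook.employees').to_dataframe()

# the dates come back as strings after the workaround above, so parse them
for col in ['BirthDate', 'HireDate']:
    emp[col] = pd.to_datetime(emp[col])

print(emp.dtypes)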