Data is the new oil…
This is one phrase we read in almost every data analysis article on the web. However, not very often we read about the toolset needed to build a database and to store these data. It is as if we take for granted the storing step and center our attention on modeling data and building algorithms.
Although there are tons of complex and expensive software and hardware to perform this kind of task massively, for most small to medium projects people usually rely on Excel or CSV files. But as our collection of data and the analysis requirements increases could challenge our computer’s capabilities and software packages as MS Excel to do the job.
Here is when SQL and Python could help us handle the task.
I will be using for this article os, sqlite3 and pandas python libraries. They are very common libraries, so, you shouldn’t have any problem importing them or downloading them with pip.
We are going to explore the basic concepts, but still, by the end of this article we are going to be able to:
- Build a database from scratch
- Build a database importing an Excel/CSV file
- Convert a database into a Data Frame to manage data
Although I start this article talking about oil, for the rest I will reference a database built on one of my previous articles “Automate with python my… wine management?”.
I leave here the link if you haven’t read it already:
1. Build a database from scratch
Once imported the libraries into Python, the first step is to create the database. Given the exercise incorporates data from wine preferences, I will call it WineDB.db. The extension “.db” indicates the file is stored as structured database data. These types of files could be manipulated with SQL language.
The second step is to create a cursor, I think about this as activating the WineDB database.
Once “activated” the database, the next step is to create the structure. If you are a frequent excel user, think of it as the columns of the table. For this example I will be using 12 columns:
No: Consecutive number, Nombre: Name, Tipo: Wine type (red, white or rose), Composicion: Describes if the wine is made from a single grape (mono varietal) or a combination of more than one (blend), Varietal: Indicates the grape varietal of the wine (Zinfandel, Merlot, Malbec, Shiraz, etc.), Cosecha: Wine vintage, Pais: Country, Region: Region, Compra: Year when the wine was bought, Costo: Cost of the wine, Consumo: Year consumed, and Calificacion: Wine score from 1 to 5.
Something to keep in mind is that after coding in SQL, even when doing it in Python, we need to commit our code, like if our program requires us to validate or confirm the instructions, as shown below:
In the code above, I define each “column” within its unique characteristics. To start, the ‘No’ column will be the primary key of the table, just as in MS Access or other related software, this will be a constant when connecting with other tables of the database. Another criterion is the TEXT NOT NULL, meaning that it always needs to have text data otherwise will give an error.
As in the previous, the rest of the columns also had their characteristics. Remember that it is important to define them well or we could have some problems later populating or manipulating our tables.
Now that we have our structure defined, the next task is to fill with data our database. There are several forms to do this, but we are going to cover the first two.
For the first we can insert one record, this is the simplest to execute, by activating the cursor and inserting a value with the SQL notation. Although is a very linear task the caveat is that it only allows us to write one record at a time.
The second example is very similar to the first, but before it is needed to build a list of tuples with the records we want to import in the table. Although the code changes a little, you will see in the image below that the instruction and the logic of the code remain the same.
Be aware that in both cases the data should correspond with the structure defined previously, as well as to end by committing your changes.
With these simple steps of code, we end building our first database.
If we open the DB file, it will show something like the image below, where at the right of the window is detailed the structure of “columns” we previously define and on the left, we can see the four data entries we appended.
2. Build a database importing an Excel/CSV file
Another way to import data on a database is for example when we had previously stored data on an Excel/CSV file and want to transfer it into a database. For this, I will build another table and exemplify how easy it is.
It takes a few lines of code to create a new database and import an Excel file. I’m almost sure that even if we omit the creating columns instructions will work as well, but I print the code just as I write it.
Reviewing the code below, the first line reference the name of the excel file where the data is stored. As in the first database, the next three lines of code build the table structure, and the last two lines of code at the bottom, mentions the only changes between the first and the second database, it first read the Excel file with pandas and then export the data to the SQL database.
And… that’s it! We could check in the database and see that the code successfully imported 273 wine records from the Excel file.
SQL also allows us to perform queries to analyze data, to mention some examples, we could answer questions such as how looks the complete data stored, what are the wines ranked with the highest score?, listing Red or White wines, or even mention which wines cost more than 500 Mexican pesos or 25 dollars.
3. Convert a database into a Data Frame to manage data
SQL allows you to interact and analyze data in several ways, but if you are like me, most likely you might prefer to return to Python language to analyze data or to incorporate this into a larger program.
When finishing editing the SQL table it is time to close the connection and then, like when importing data, we can read the SQL database through pandas. As a data frame, the data can be manipulated or manipulated, for example, we can append a new wine record through a dictionary and update it on the SQL database.
When we consult the SQL file, the database gets updated from the initial 272 records to 273.
Although we only review some of the first and basic concepts, we can easily escalate this and incorporate them in more complex programs. For example, apply the same logic in a program that gathers information through web scraping and store the data on a SQL database, that latter could be used as input data on a machine learning model.
Hope you like this article but mostly you find it useful…