Efficiently Reading and Writing Large Datasets with Pandas and SQL

Devesh Poojari
Feb 11, 2023


Read, Process, Write Large Data Efficiently with Pandas & SQL


Working with large datasets can often be a challenge, especially when it comes to reading and writing data to and from databases. The traditional approach of loading the entire dataset into memory can lead to system crashes and slow processing times.

In this article, we’ll explore a better solution: reading and writing data in chunks. This technique lets us handle large amounts of data efficiently, making it a valuable tool for anyone working with databases and DataFrames. We’ll use pandas, a popular data analysis library, to demonstrate how to read a large database table into a pandas DataFrame in manageable chunks, and how to write a large DataFrame back into a database.

Reading Large Datasets from a Database Table in Chunks

When working with massive amounts of data stored in a database, it’s important to handle it in an efficient and manageable way. The pd.read_sql() function in pandas offers a convenient way to read data from a database table into a pandas DataFrame. With the chunksize parameter, you can control the number of rows loaded into memory at a time, allowing you to process the data in manageable chunks and manipulate it as needed. The examples below read a large dataset from a Postgres database using pandas.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://db_username:db_password@db_host:db_port/db_name")

conn = engine.connect().execution_options(stream_results=True)

for chunk_dataframe in pd.read_sql("SELECT * FROM schema.table_name", conn,
                                   chunksize=50000):
    print(f"Dataframe with {len(chunk_dataframe)} rows")
    # ... do something (calculation/manipulation) with dataframe ...

In the above code,

  • We create a SQLAlchemy engine using the create_engine() function from the SQLAlchemy library.
  • We create a connection to the PostgreSQL database with stream_results=True (more on this later).
  • We then pass this connection to the pd.read_sql() function along with a SQL query to select all rows from the table.
  • We also specify a chunksize of 50000 rows, which means that the pd.read_sql() function will return a new DataFrame containing 50,000 rows at a time.
  • We can then use a for loop to iterate over the chunks of data returned by the pd.read_sql() function.
  • In this example, we simply print the number of rows in each chunk, but in a real-world scenario you would do some additional processing on each chunk before moving on to the next one, as sketched below.
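For instance, a common pattern is to aggregate each chunk as it arrives rather than concatenating everything into a single DataFrame. Here is a minimal sketch, assuming a hypothetical numeric column called amount in the table:

total_rows = 0
total_amount = 0.0

for chunk_dataframe in pd.read_sql("SELECT * FROM schema.table_name", conn,
                                   chunksize=50000):
    total_rows += len(chunk_dataframe)
    # "amount" is a placeholder column name; replace it with one from your table
    total_amount += chunk_dataframe["amount"].sum()

print(f"Processed {total_rows} rows, total amount = {total_amount}")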

stream_results: In SQLAlchemy, when you execute a query, the results are normally loaded into memory all at once, which can be inefficient for large result sets. When stream_results is enabled (set to True), the query instead uses a server-side cursor, and rows are fetched from the database as needed, greatly reducing memory usage.
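As an aside, stream_results can also be set once when the engine is created, so that every connection from that engine streams by default. A small sketch of this variant:

from sqlalchemy import create_engine

# Enable streaming for every connection created by this engine,
# instead of calling execution_options() on each connection separately.
engine = create_engine(
    "postgresql+psycopg2://db_username:db_password@db_host:db_port/db_name",
    execution_options={"stream_results": True},
)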

Writing Large Datasets to a Database Table

After processing your data, it may be necessary to write it back to a database table. While the to_sql() method provided by pandas is a convenient way to do this, its default INSERT-based approach may not be the most efficient for large data. Fortunately, to_sql() accepts a method parameter that lets us customize how rows are written, and this is where the COPY command comes into play.

The COPY method is widely regarded as one of the fastest ways to insert data into a SQL database. The COPY statement is used to quickly load large amounts of data from a file into a table, or to export data from a table to a file. Its basic syntax is straightforward, which makes it easy to bulk-insert large amounts of data into a database table.

COPY [table_name] ([column1, column2, ...]) FROM [file_path] [WITH (options)]

Below, we’ll explore the COPY method and how it can be used to write large data to a database table efficiently. Whether you are working with a small or massive amount of data, the COPY method is a useful tool to have in your arsenal for quickly and efficiently writing data to a database.

In Python, one way is to write the DataFrame to a file on disk and then use the above statement to bulk-insert the data quickly, as sketched just below. Most use cases, however, prefer to avoid creating a file, so afterwards we will use an in-memory buffer object instead.
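For reference, the file-based variant might look roughly like this (the file path, table, and column names here are placeholders):

# Dump the DataFrame to a CSV on disk, then stream that file into Postgres
# with COPY via psycopg2's copy_expert. All names and paths are placeholders.
df.to_csv("/tmp/table_name.csv", index=False, header=False)

dbapi_conn = engine.raw_connection()
try:
    with dbapi_conn.cursor() as cur, open("/tmp/table_name.csv") as f:
        cur.copy_expert(
            'COPY schema.table_name ("column1", "column2") FROM STDIN WITH CSV', f)
    dbapi_conn.commit()
finally:
    dbapi_conn.close()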

Note: This method is only applicable to databases that support the COPY ... FROM command (such as PostgreSQL).

import csv
from io import StringIO

def copy_insert(table, conn, keys, data_iter):
    # Gets a DBAPI connection that provides a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        string_buffer = StringIO()
        writer = csv.writer(string_buffer)
        writer.writerows(data_iter)
        string_buffer.seek(0)

        columns = ', '.join(['"{}"'.format(k) for k in keys])
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns)
        cur.copy_expert(sql=sql, file=string_buffer)

Now let’s understand what's happening above.

1. The copy_insert function is a utility function that inserts data into a database table using the COPY FROM method, which is faster than standard INSERT statements.

2. The function takes four parameters:

  • table is a pandas.io.sql.SQLTable object that represents the target table in the database.
  • conn is the SQLAlchemy connection object for the database.
  • keys is a list of column names.
  • data_iter is an iterable that provides the rows to be inserted.

3. The function starts by getting a DBAPI connection from the SQLAlchemy connection object and creating a cursor.

4. The rows to be inserted are written to a StringIO buffer as CSV, which is then passed to the copy_expert method of the cursor. The table_name variable is built from the schema name and table name, or from just the table name, depending on whether the table has a schema defined (for example, MySQL does not use a separate schema, while Postgres does).

5. The copy_expert method executes the COPY statement, with the SQL string as the sql argument and the buffer as the file argument, which bulk-inserts the data from the CSV buffer into the database table.

Now, to insert the data, we call the DataFrame’s to_sql() method and pass copy_insert as its method argument:

df.to_sql(name="table_name", schema="schema_name", con=engine,
          if_exists="append", index=False, method=copy_insert)

name: the name of the table in the database.

schema: the name of the schema in the database to which the table belongs.

con: the SQLAlchemy engine object that represents the connection to the database.

if_exists: a string that specifies the behaviour if the table already exists, “append” in this case. With “append”, new rows are added to the existing table.

index: a boolean value that specifies whether the DataFrame index should be written as a separate column in the table, False in this case.

method: the insertion method to use for writing data to the table. It can be a callable, and here we pass the copy_insert function defined earlier.
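Putting the pieces together, a full read-transform-write loop might look like the sketch below (the source and target table names and the transformation are placeholders):

# Read a large source table in chunks, transform each chunk,
# and append it to a target table using the copy_insert method.
src_conn = engine.connect().execution_options(stream_results=True)

for chunk_dataframe in pd.read_sql("SELECT * FROM schema.source_table", src_conn,
                                   chunksize=50000):
    # Placeholder transformation: adjust to your own processing logic
    chunk_dataframe["amount"] = chunk_dataframe["amount"] * 1.1
    chunk_dataframe.to_sql(name="target_table", schema="schema_name", con=engine,
                           if_exists="append", index=False, method=copy_insert)

src_conn.close()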

And there you have it, the data is inserted quickly and efficiently!

That’s all for today.

Thank you for the read and for staying with me for so long. I am going to be writing more beginner-friendly posts in the future too. For more posts like this, you can follow me on Medium. I welcome feedback and constructive criticism and can be reached on Linkedin.

Thanks for your time. Keep Learning and Keep growing.

Thank You!

Happy Learning :)
