qsv: Blazing-fast CSV data-wrangling toolkit
Table of Contents | |
---|---|
Hi-ho "Quicksilver" away! logo details | qsv is a command line program for querying, slicing, indexing, analyzing, filtering, enriching, transforming, sorting, validating, joining & converting CSV files. Commands are simple, composable & "blazing fast". * Commands * Installation Options * Whirlwind Tour / Notebooks / Lessons & Exercises * Cookbook * FAQ * Performance Tuning * 👉 Benchmarks 🚀 * Environment Variables * Feature Flags * Goals/Non-goals * Testing * NYC School of Data 2022/csv,conf,v8 slides * Sponsor |
Try it out at qsv.dathere.com!
Command | Description |
---|---|
apply ✨🚀🧠🤖🔣👆 | Apply series of string, date, math & currency transformations to given CSV column/s. It also has some basic NLP functions (similarity, sentiment analysis, profanity, eudex, language & name gender) detection. |
applydp 🚀🔣👆 | applydp is a slimmed-down version of apply with only Datapusher+ relevant subcommands/operations (qsvdp binary variant only). |
behead | Drop headers from a CSV. |
cat 🗄️ | Concatenate CSV files by row or by column. |
clipboard | Provide input from the clipboard or save output to the clipboard. |
count 📇🏎️🐻❄️ | Count the rows in a CSV file. (11.87 seconds for a 15gb, 27m row NYC 311 dataset without an index. Instantaneous with an index.) If the polars feature is enabled, uses Polars' multithreaded, mem-mapped CSV reader for fast counts even without an index |
datefmt 🚀👆 | Formats recognized date fields (19 formats recognized) to a specified date format using strftime date format specifiers. |
dedup 🤯🚀👆 | Remove duplicate rows (See also extdedup , extsort , sort & sortcheck commands). |
describegpt 🌐🤖🪄 | Infer extended metadata about a CSV using a GPT model from OpenAI's API or an LLM from another API compatible with the OpenAI API specification such as Ollama or Jan. |
diff 🚀 | Find the difference between two CSVs with ludicrous speed! e.g. compare two CSVs with 1M rows x 9 columns in under 600ms! |
enum 👆 | Add a new column enumerating rows by adding a column of incremental or uuid identifiers. Can also be used to copy a column or fill a new column with a constant value. |
excel 🚀 | Exports a specified Excel/ODS sheet to a CSV file. |
exclude 📇👆 | Removes a set of CSV data from another set based on the specified columns. |
explode 🔣👆 | Explode rows into multiple ones by splitting a column value based on the given separator. |
extdedup | Remove duplicate rows from an arbitrarily large CSV/text file using a memory-mapped, on-disk hash table. Unlike the dedup command, this command does not load the entire file into memory nor does it sort the deduped file. |
extsort 🚀 | Sort an arbitrarily large CSV/text file using a multithreaded external merge sort algorithm. |
fetch ✨🧠🌐 | Fetches data from web services for every row using HTTP Get. Comes with HTTP/2 adaptive flow control, jql JSON query language support, dynamic throttling (RateLimit) & caching with available persistent caching using Redis or a disk-cache. |
fetchpost ✨🧠🌐 | Similar to fetch , but uses HTTP Post. (HTTP GET vs POST methods) |
fill 👆 | Fill empty values. |
fixlengths | Force a CSV to have same-length records by either padding or truncating them. |
flatten | A flattened view of CSV records. Useful for viewing one record at a time. e.g. qsv slice -i 5 data.csv | qsv flatten . |
fmt | Reformat a CSV with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.) |
frequency 📇😣🏎️👆🪄 | Build frequency tables of each column. Uses multithreading to go faster if an index is present. |
geocode ✨🧠🌐🚀🔣👆 | Geocodes a location against an updatable local copy of the Geonames cities database. With caching and multi-threading, it geocodes up to 360,000 records/sec! |
headers 🗄️ | Show the headers of a CSV. Or show the intersection of all headers between many CSV files. |
index | Create an index (📇) for a CSV. This is very quick (even the 15gb, 28m row NYC 311 dataset takes all of 14 seconds to index) & provides constant time indexing/random access into the CSV. With an index, count , sample & slice work instantaneously; random access mode is enabled in luau ; and multithreading (🏎️) is enabled for the frequency , split , stats , schema & tojsonl commands. |
input | Read CSV data with special commenting, quoting, trimming, line-skipping & non-UTF8 encoding handling rules. Typically used to "normalize" a CSV for further processing with other qsv commands. |
join 👆 | Inner, outer, right, cross, anti & semi joins. Automatically creates a simple, in-memory hash index to make it fast. |
joinp ✨🚀🐻❄️ | Inner, outer, right, cross, anti, semi & asof joins using the Pola.rs engine. Unlike the join command, joinp can process files larger than RAM, is multithreaded, has join key validation, pre-join filtering, supports asof joins (which is particularly useful for time series data) & its output columns can be coalesced. However, joinp doesn't have an --ignore-case option. |
json 👆 | Convert JSON to CSV. |
jsonl 🚀🔣 | Convert newline-delimited JSON (JSONL/NDJSON) to CSV. See tojsonl command to convert CSV to JSONL. |
luau 👑 ✨📇🌐🔣 | Create multiple new computed columns, filter rows, compute aggregations and build complex data pipelines by executing a Luau 0.635 expression/script for every row of a CSV file (sequential mode), or using random access with an index (random access mode). Can process a single Luau expression or full-fledged data-wrangling scripts using lookup tables with discrete BEGIN, MAIN and END sections. It is not just another qsv command, it is qsv's Domain-specific Language (DSL) with numerous qsv-specific helper functions to build production data pipelines. |
partition 👆 | Partition a CSV based on a column value. |
prompt | Open a file dialog to either pick a file as input or save output to a file. |
pseudo 🔣👆 | Pseudonymise the value of the given column by replacing them with an incremental identifier. |
py ✨🔣 | Create a new computed column or filter rows by evaluating a python expression on every row of a CSV file. Python's f-strings is particularly useful for extended formatting, [with the ability to evaluate Python expressions as |