Quantcast
Channel: Planet MySQL
Viewing all articles
Browse latest Browse all 18766

Indexed CSV

$
0
0

Why is there no index for CSV files?  Indexes are very simple.

If the first column of your CSV file is in sorted order you can do a binary search to find your data.  But what if you need to find data in the second or third column?

If you have a separate index file pointing to the first byte of each line you could seek to that position in the CSV file and get your data.  Given a file with only your needed column and the byte offset and length of the line you can search the index to find the pointer to the position in the CSV file.

SAMPLE DATA
“record”,”title”,”data”
1,”test1”,”data1”
2,”test3”,”data3”
3,”test2”,”data2”

SAMPLE INDEX
“test1”,24,17
“test2”,58,17
“test3”,41,17
“title”,0,23

Here is a simple perl program to create just such an index.

#!/usr/bin/perl
# csvindex.pl

use strict;
use warnings;

use Text::CSV_XS;

my $file = $ARGV[0];

my $csv = Text::CSV_XS->new ({
binary    => 1,
auto_diag => 1,
sep_char  => ','    # not really needed as this is the default
});

open(my $data, '<:encoding(utf8)', $file) or die "Could not open '$file'\n";

my $head = 0;
my $tail = 0 ;

while (my $fields = $csv->getline( $data )) {
$tail = tell $data ;
printf("\"%s\",%d,%d\n",$fields->[$ARGV[1]], $head, $tail-$head);
$head = $tail;
}

This command create an index of the second column (number start with 0).

$ perl csvindex.pl 2009.csv 1 | sort > 2009.cidx

A binary search can be done with the linux ‘look’ command.

$ look \"OK\" 2009.cidx
"OK",2134,59

Here is a simple program to seek into a file and print ‘n’ number of characters.

#!/usr/bin/perl
# csvseek.pl
use strict;
use warnings;

use Text::CSV_XS;

my $file = $ARGV[0];

open(my $data, '<:encoding(utf8)', $file) or die "Could not open '$file'\n";

seek $data, $ARGV[1], 0 ;

my $line = "" ;

read $data, $line, $ARGV[2];
chomp($line);

print $line . "\n" ;

Here is how it is used.

$ perl csvseek.pl  2009.csv 2134 59
OKLAHOMA,OK,40,"4,765","1,251","2,160","1,024",6,"9,206",

I’d like to see these tools added to CSVKit.

What I’d really like to see is this index/s added to MySQL’s CSV engine.  Would anyone like to help me get this done?

Tweet


PlanetMySQL Voting: Vote UP / Vote DOWN

Viewing all articles
Browse latest Browse all 18766

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>