• Winston Ng


Do you find yourself extracting specific data from multiple Excel spreadsheets, comma-separated value (CSV) files, and text files in order to aggregate and analyze it? Doing this once in a while with just a few files may not be too cumbersome, but it can quickly become an insurmountable task as the number of files grows and the aggregation has to be performed more often.

One way to simplify this is to write a parser for each of the file types and formats you'll be processing, then ETL (extract-transform-load) the data into a database or a central Excel file for analysis. This is a viable option if the file types and their formats are known and don't change frequently. Otherwise, every change in the type or format of the files requires developers to update the parser and redeploy the application.
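To make the parser-per-format approach concrete, here is a minimal sketch in Python. It assumes only two input formats (comma-separated `.csv` files and tab-delimited `.txt` files, each with a header row); the names `PARSERS`, `parse_csv`, `parse_tsv` and `extract` are made up for illustration and are not from any particular project.

```python
import csv
import io
import os
from typing import Callable, Dict, List, Sequence

Row = Dict[str, str]


def parse_csv(text: str) -> List[Row]:
    """Parse comma-separated text whose first line is a header row."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_tsv(text: str) -> List[Row]:
    """Parse tab-delimited text whose first line is a header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# One hand-written parser per known file type (hypothetical registry).
PARSERS: Dict[str, Callable[[str], List[Row]]] = {
    ".csv": parse_csv,
    ".txt": parse_tsv,
}


def extract(filename: str, text: str, wanted: Sequence[str]) -> List[Row]:
    """Pick the parser by file extension and keep only the wanted columns."""
    suffix = os.path.splitext(filename)[1].lower()
    if suffix not in PARSERS:
        # A new file type means new code and a new deployment.
        raise ValueError(f"no parser registered for {suffix!r}")
    rows = PARSERS[suffix](text)
    return [{col: row[col] for col in wanted} for row in rows]
```

Calling `extract("sales.csv", text, ["region", "amount"])` returns one dictionary per data row containing just those two columns. The weakness is visible in the code itself: a renamed column or an unfamiliar extension fails until someone edits the source and redeploys.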

Trifacta, a cloud-based service we recently discovered while working on a project, can be a good solution to this problem. It is a data "wrangling" (extraction) service built upon Apache Spark. You load in a sample of a new file type or format and use its intuitive visual interface to create a "recipe" that extracts only the data you need. Once the recipe is created, you can upload all the files that need wrangling and have the system output only the data you need.

Trifacta also has an API (Application Programming Interface) for uploading and wrangling files, so it can be integrated with another system to automate the whole process. In the system we developed, external files requiring data extraction are fed in directly via HTTP or SFTP upload and sent to Trifacta for extraction; the output data is then validated and transformed before being stored in our database. This data is then used to generate management reports presented via a mobile app.
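The flow described above (receive a file, hand it to the wrangling service, validate and transform the result, store it) can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: the `wrangle` argument is a hypothetical stand-in for whatever function calls the wrangling service and is not Trifacta's actual API, and the `report_data` table and the `region` and `amount` columns are invented for the example.

```python
import sqlite3
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]


def validate_and_transform(rows: List[Row]) -> List[Tuple[str, float]]:
    """Drop rows without a numeric amount; trim text and convert types."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (KeyError, ValueError):
            continue  # skip rows that fail validation
        clean.append((row.get("region", "").strip(), amount))
    return clean


def process_upload(
    raw: bytes,
    wrangle: Callable[[bytes], List[Row]],  # hypothetical service wrapper
    db: sqlite3.Connection,
) -> int:
    """Run one uploaded file through extraction, validation and storage."""
    rows = wrangle(raw)                   # extraction happens in the service
    clean = validate_and_transform(rows)  # our own checks on its output
    db.execute(
        "CREATE TABLE IF NOT EXISTS report_data (region TEXT, amount REAL)"
    )
    db.executemany("INSERT INTO report_data VALUES (?, ?)", clean)
    db.commit()
    return len(clean)  # number of rows stored
```

Passing the extraction step in as a function keeps the pipeline independent of the file format: when a recipe changes on the service side, this code stays the same as long as the output columns do.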

The key value proposition here is that any change in a file type or format, or the addition of a new file type, can readily be handled through the Trifacta user interface without waiting for a development and deployment cycle. It is definitely worth a try if data extraction is something you're struggling with.
