Compare lists from text files using Matlab – an application for resource exploration

INTRODUCTION

With today’s post I would like to share a Matlab script I used often to compare lists of ID numbers stored in separate text files. The ID numbers can be of anything: oil and gas wells, mining diamond drill hole locations, gravity or resistivity measurement stations, outcrop locations, you name it. And the script will handle any combination of ASCII characters (numbers, letters, etcetera).

I included below here some test files for you to try with the code, which is in the next section. Please refer to the comment section in the code for usage and file description. Have fun, and if you try it on your lists, let me know how it works for you.

TEST FILES

These files are in doc format; they need to be downloaded and saved as plain txt files. Test1 and Test2 contain 2 short lists of (fake) Canadian Oil and Gas wells; the former is the result of a search using a criterion (wells inside a polygonal area), the latter is a list of wells that have digital wireline logs. Test3 and Test4 are short lists of (fake) diamond drill holes.

*** Notice that script is setup to compare list in Test2.txt against list in Test1.txt and not the other way around. If using this on your own lists, you will have to decide in advance which subset of wells you are interested in.

Test1

Test2

Test3

Test4

THE SCRIPT

Here is the code:

% Script to compare lists of wells from 2 text files
% (wells are identified by a well ID string)
%
%%%% INPUT
% must be text files with 1-line header
%
% Sample test files provided:
% File 1 -  test1.txt - list of wells satisfying a criteria
%           (for example wells with digital wireline logs)
%
% File 2 -  test2.txt - list of all wells within an area
%           (for example returned using polygon search)
%
% These FILES contain (dummy) oil and gas wells from Canada
% i.e. wells with a 16-digit identifier of format 1XX021611428W600
% Header line is simply  'UWI' (Unique Well Identifier)
% For example:
% UWI
% 1XX021611428W600
% 1XX023511428W600
% ...
% ...
% To use with other files either
% 1 - actual name of the files must be changed to test1.txt and test2.txt
% or
% 2 - names in line 47 and 51 must be changed to 'yourfile1.txt' and
% 'yourfile2.txt', respectively
%
%%%% OUTPUT
% script extracts wells from list 2 that satisfy criteria and saves to
% File 3 -  out.xls
%
%%%%
% To use different header line and well identifier formats n and h must be
% changed accordingly
% For example suppose we had a file of diamond drill holes from a mining
% exploration permit, containing Diamond Drill Hole IDs in this format:
% DDH-ID (header line)
% DDH-23
% DDH-57
% ...
% ...
% we would need to change both h and n to 6
% try it with test3.txt and test4.txt
%
%% import data
curves=textread('test1.txt','%c');
% open file - this is list of wells with digital wireline logs (curves)
% notice script reads each character in the file as a row
%
inside=textread('test2.txt','%c');
% open file - this is list of wells inside rectangle or polygon
% notice script reads each character in the file as a row
%
%% cleanup, reshape variables, convert to cell array
n=16;       % number of characters in well identifier (for example, 16)
h=3;        % number of characters in header line (for example, 3)
[r,c]=size(inside); l=(r-h)/n;          % remove characters in the header
inside1=reshape(inside((h+1):r),n,l)';  % line and reshape
cellinside=cellstr(inside1);            % create cell array from char array
%
[t,s]=size(curves); f=(t-h)/n;          % remove characters in the header
curves1=reshape(curves((h+1):t),n,f)';  % line and reshape
cellcurves=cellstr(curves1);            % create cell array from char array
%
%% compare lists
for k=1:l
    for j=1:f
TF(k,j) = strcmpi(cellinside(k),cellcurves(j));
    end
end
% this is the kernel of the script:
% compare list of wells with logs (curves) vs. list of wells inside area
%
% From Matlab help: the command TF = strcmpi(s1,s2)
% compares the strings s1 and s2 and returns logical 1 (true) if they are
% the same (ignoring case), and returns logical 0 (false) otherwise
%
%% extract matching UWI and save output to Excel file
TTFF=max(int8(TF)')';
vi=logical(TTFF);
exp=cellinside(vi);
xlswrite('out.xls',exp);
%
%% optional house cleaning
clear inside curves n h r c l t s f TF TTFF j k vi

You could achieve the same result in Excel using a formula like this (applied to a single spreadsheet with lists in 2 columns, A and B):

=IF(COUNTIF(A$2:A$522,B2),"OK","Missing")

There’s a couple of good examples and alternative Excel solutions here. However I like Excel a lot less because the automation (with Macros, visual basic) is more cumbersome, and because it is more prone to errors that are less easy to QC (contrary to Matlab, which will give you error messages and has a straightforward debugger). When Excel is really necessary is in those instances in which in addition to finding a subset of wells using the 2 lists you want to extract information from other columns, such as coordinates, ground elevation, date drilled, operator, common well name, etcetera; then VLOOKUP is the right tool. I will show you how to do just that in a future post.

AN AFTERTHOUGHT

Since a picture is worth a thousand words I thought I’d add a pictorial example. Below is a geological map from a mining exploration area in Tuscany, Italy, where high grade asbestos was mined in the past, and gold exploration is under way. I added to this map black square symbols to  indicate locations where stream sediments were collected during a geochemical prospecting campaign in the ’80s (samples derived from the sediments are analyzed for concentrations in gold pathfinder elements). Suppose we had 2 lists: List1 contains all locations with anomalous concentrations in 1 or more of the pathfinder elements (I marked these locations with red diamond symbols). List2 contains all locations (I marked those with green cross symbols) with outcropping fault contact between the metamorphic basement and the more recent sedimentary rock formations (faults bounding horst structures formed in extensional settings are believed to act as conduits for gold bearing fluids). If we run the Matlab script using List1 and List2 it would return a list of locations with fault contact AND anomalous pathfinder concentrations. Those are indicated by arrows on the map –  the location with just the green cross and those with just the red diamond would not be included.

Sources (through the Regione Toscana): for map, University of Siena, Department of Earth Science, Geological Map of the Regione Toscana scale 1:10,000, sheet 343070;  for Stream Sediment data: ENI/RIMIN, unpublished Industry Report, 1989

5 thoughts on “Compare lists from text files using Matlab – an application for resource exploration

  1. Pingback: Visualization tips for geoscientists: Surfer «

  2. The expert Matlab user would have noticed that I used strcmpi in its simplest form, which compares two strings for equality. In this particular case, the choice made it necessary having a double loop in the code:

    for k=1:l
    for j=1:f
    TF(k,j) = strcmpi(cellinside(k),cellcurves(j));
    end
    end

    I did that for demonstration purposes so that the causal reader with some programming experience but little knowledge of Matlab or of cell arrays may follow more easily.

    A better way to use strcmpi would be in its second form*, like this:

    for i=1:l
    TF(i) = strcmpi(cellinside(i),cellcurves);
    end

    which compares a string with each element of a cell array of strings. This code has only one for loop and it wil lbe faster, noticeably so with longer lists.

    In a future post I will put together some other examples like this, and also some more advanced one, all with application to geoscience data manipulation, and show how to write more efficient matlab code, through vectorization and other techniques.

    ————
    * there is a third form, which compares directly strings array, which could be vectorize to avoid for loops altogether, but it requires arrays (lists) of the same size (same number of items).

  3. Pingback: Visualization tips for geoscientists – Matlab «

  4. Matteo,
    I have a similar problem but the thing is that I have only one .txt file and from it, I need to extract some parameters and their values but all this information is on lines of text. I am going to paste one part of my file only to be a little bit more specific. I would like that you advice me in how to solve this problem.
    01/22/2013 13:39:05>         
    Host : 172.31.110.59:7003 User-Agent : TR069 Client 1.0 Content-Type : text/xml; charset=utf-8

    Device.HomePlug.Interface.1.AssociatedDevice.1.TxPhyRate
    146

    Device.HomePlug.Interface.1.AssociatedDevice.1.RxPhyRate
    120

    The point is that I want to extract for example the time, the value of the RxPhyRate that is in the next line and is 120 and so on for other parameters.

    I would appreciate any advice regarding this issue.

    • Hi Raymundo

      It looks like you are trying to scrape data from a website. Although I am by no means an expert on that topic, I am reading about it right now and I think there are far better ways to do that than using Matlab.
      Having siad that, if I had to use Matlab what I would do is to read the time in the first line:
      01/22/2013 13:39:05>
      then get read of the first part of what follows which appears only once:
      Host : 172.31.110.59:7003 User-Agent : TR069 Client 1.0 Content-Type : text/xml; charset=utf-8
      At that point I would be left with repeating N times a cycle of skipping a bunch of text and reading a value.
      Your best bet is the textscan, which has a lot more flexibility than textread.
      There are lots of examples in this documentation, including the kind of operations you willl need to do:
      http://www.mathworks.se/help/matlab/ref/textscan.html

      If you get stuck with your code, I suggest you post specific questions on either the Mathworks forum, e.g.
      http://www.mathworks.com/matlabcentral/answers/44344
      or, stackexchange
      http://stackoverflow.com/questions/5531082/matlab-how-to-read-every-nth-line-of-a-text-file?rq=1
      but be prepared to show some preliminary code.

      Good luck.
      Matteo

Leave a Reply