Help please - How to sort multiple fields from a flat file dataset

  • Question

  • Hello all,

    I have multiple, large flat file data sets that I need to sort, and I'm not having any luck finding a way to do it.  I have had advice to use a DB engine like MySQL, but unfortunately the data sets are large and need to be processed from scratch, from multiple vendors, many times per day.

    For those reasons, the turnaround time for loading and sorting the DB is prohibitive.

    Right now the work is being done on a leased LPAR (Logical Partition) on an IBM mainframe, and we would like to try to offload some of the work to a server, if possible.

    Here is a sample of what is needed in the sort, but I am struggling to code a way to achieve this in Visual Basic .NET using Visual Studio 2015.  The actual code to do this on the mainframe is just a few lines.

    I need to sort in the following order:

    field 1 ascending

    field 2 ascending within field 1

    field 3 descending within field 2, within field 1

    field 4 ascending within field 3, within field 2 within field 1

    Field 3 needs to be descending so that the most current record, by date, floats to the top within the sort.

    Is there any way to do this using Visual Basic .NET 2015 or 2017?

    -Ron

    Tuesday, September 26, 2017 1:13 PM

All replies

  • I would suggest asking your question in the following forum without mentioning Visual Studio. See if it's possible via SQL alone; if so, then it can be done in Visual Studio.

    https://dba.stackexchange.com/

    Whenever I have something complex in regard to SQL, I write it in SQL Server Management Studio (I use SQL Server), get the SQL working, and then implement it in code.

    Also consider:

    • Using stored procedures.
    • Evaluating whether indexes would help with ordering the data.
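
    As a sketch of what I mean (the table, column, and connection names below are made up), the four-key order from the question maps directly to an ORDER BY clause, which can then be run from VB.NET once it works in Management Studio:

    Imports System.Data.SqlClient

    Module SqlSortDemo
        Sub Main()
            ' The ORDER BY mirrors the requested sort: fields 1, 2 and 4 ascending, field 3 descending.
            Dim sql = "SELECT * FROM Records " &
                      "ORDER BY field1 ASC, field2 ASC, field3 DESC, field4 ASC"

            Using conn As New SqlConnection("Server=.;Database=Staging;Integrated Security=true"),
                  cmd As New SqlCommand(sql, conn)
                conn.Open()
                Using reader = cmd.ExecuteReader()
                    While reader.Read()
                        Console.WriteLine(reader(0))   ' process each sorted row here
                    End While
                End Using
            End Using
        End Sub
    End Module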

    Please remember to mark the replies as answers if they help and unmark them if they provide no help; this will help others who are looking for solutions to the same or similar problems. Contact via my Twitter (Karen Payne) or Facebook (Karen Payne) via my MSDN profile, but I will not answer coding questions on either.
    VB Forums - moderator

    Tuesday, September 26, 2017 2:27 PM
  • I would suggest asking your question in the following forum without mentioning Visual Studio. See if it's possible via SQL alone; if so, then it can be done in Visual Studio.

    Hi Karen,

    Thank you for the response, but I'm unsure if I understand you correctly.

    Are you suggesting that I use a DB engine such as SQL Server?

    If yes, then that is not an option, as I cannot accept the time constraints of using a relational DB.  The files being used are varied, come from multiple vendors, and need to be pre-processed very quickly.  The amount of data and the delivery window make using a DB engine unacceptable.  Just doing the initial load of the database takes more time than it takes to process the entire file and produce the final product on the mainframe.

    Or are you suggesting I get the SQL commands to perform the data manipulation the way I need it, and then convert that to some sort of Visual Basic logic that does not use a DB engine?

    -Ron

    Tuesday, September 26, 2017 2:38 PM
  • In regard to SQL Server, I was not suggesting it; I only indicated that it is what we use, and that we do mass imports from multiple companies via a Windows service.

    What I was suggesting was to look at doing this via SQL no matter what the engine is, as long as it supports at least SQL-92 or better.


    Please remember to mark the replies as answers if they help and unmark them if they provide no help; this will help others who are looking for solutions to the same or similar problems. Contact via my Twitter (Karen Payne) or Facebook (Karen Payne) via my MSDN profile, but I will not answer coding questions on either.
    VB Forums - moderator

    Tuesday, September 26, 2017 3:16 PM
  • I am familiar with the mainframe way of doing things with JCL and job steps. Many years ago I wrote many jobs with one or more sort steps, often sorting data to/from tape. Oh, now I remember; about thirty years ago the third-party mainframe sorting utility was Syncsort and I see they are still in business.

    I think that, conceptually, sorting in descending order is just a matter of flipping the bits of the key, but I am not sure of the details. Sorting by multiple keys is simply a matter of concatenating the keys. I could dig into the details, but I don't think we need to do that.

    See Sorting Data; that will get you started. See the Secondary Descending Sort example, which can be extended to four fields.
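
    To illustrate, and purely as a sketch (the Record class and property names below are placeholders, not your actual layout), the LINQ version of the four-key sort in VB.NET is along these lines:

    Imports System.Collections.Generic
    Imports System.Linq

    Public Class Record
        Public Property Field1 As String
        Public Property Field2 As String
        Public Property Field3 As Date      ' the date field that must sort descending
        Public Property Field4 As String
    End Class

    Module LinqSortDemo
        Sub Main()
            Dim records As New List(Of Record)()   ' assume this gets populated from the flat file

            ' Primary key ascending, then each subsequent key nested within the previous one.
            Dim sorted = records.
                OrderBy(Function(r) r.Field1).
                ThenBy(Function(r) r.Field2).
                ThenByDescending(Function(r) r.Field3).
                ThenBy(Function(r) r.Field4).
                ToList()
        End Sub
    End Module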

    If you have too much data to be processed by a single in-memory sort, then one possibility is to sort portions of the data at a time and then merge them into one file. I have written many merges in my time, though nothing in the past quarter century. It is possible.

    Something else that seems obvious to me but might not be so obvious to you: it might enhance performance to read just the keys into main memory, each with a pointer to the original record, sort those, and then put the data back together afterwards. What I am trying to explain is quite simple, so if you think you understand it then you probably do.
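
    A bare-bones sketch of that idea, assuming fixed-length 400-byte records with the sort key in the leading 50 columns (both of those are assumptions, not facts about your data):

    Imports System.Collections.Generic
    Imports System.IO

    Module KeyPointerSort
        Const RecordLength As Integer = 400   ' fixed-length records assumed

        Sub SortByKeyPointer(inputPath As String, outputPath As String)
            Dim entries As New List(Of Tuple(Of String, Long))()

            ' Pass 1: keep only each record's key and its byte offset in memory.
            Using fs As New FileStream(inputPath, FileMode.Open, FileAccess.Read)
                Dim buffer(RecordLength - 1) As Byte
                Dim offset As Long = 0
                While fs.Read(buffer, 0, RecordLength) = RecordLength
                    Dim record = System.Text.Encoding.ASCII.GetString(buffer)
                    entries.Add(Tuple.Create(record.Substring(0, 50), offset))   ' key columns assumed
                    offset += RecordLength
                End While
            End Using

            ' Sort the small (key, offset) pairs instead of the full records.
            ' A descending key field would need a custom comparison (or complemented bytes).
            entries.Sort(Function(a, b) String.CompareOrdinal(a.Item1, b.Item1))

            ' Pass 2: write the records out in key order by seeking back to each offset.
            Using src As New FileStream(inputPath, FileMode.Open, FileAccess.Read),
                  dst As New FileStream(outputPath, FileMode.Create, FileAccess.Write)
                Dim buffer(RecordLength - 1) As Byte
                For Each e In entries
                    src.Seek(e.Item2, SeekOrigin.Begin)
                    src.Read(buffer, 0, RecordLength)
                    dst.Write(buffer, 0, RecordLength)
                Next
            End Using
        End Sub
    End Module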



    Sam Hobbs
    SimpleSamples.Info



    Wednesday, September 27, 2017 2:56 AM
  • the data sets are large and need to be processed from scratch, from multiple vendors, many times per day.

    I think you have answered your own question.  You can't even think about the actual sorting until you have determined what form the source data is going to take.  If your current sorting is on an IBM mainframe then I suspect the data is effectively a flat file, and the sort will be some version of the merge sort - multiple passes through the file, creating sorted sub-files and then merging them.  It is disk-file based, although the O/S would work out whether the files ever actually need to be written to disk.

    You could replicate that for the PC - there are many examples of the algorithm available - but whether it would be efficient in the Windows environment (without a lot of detail work with memory mapped files) is hard to say.
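
    As a rough illustration only, a sketch of that split-sort-merge pattern in VB.NET might look like the following. It assumes one record per line, and the column positions in CompareKeys are invented; they would have to match the real record layout.

    Imports System.Collections.Generic
    Imports System.IO
    Imports System.Linq

    Module ExternalSort
        Const ChunkSize As Integer = 1000000   ' lines per in-memory chunk; tune to available RAM

        Sub ExternalMergeSort(inputPath As String, outputPath As String)
            Dim runFiles As New List(Of String)()

            ' Pass 1: read chunks, sort each in memory, write sorted runs to temp files.
            Using reader As New StreamReader(inputPath)
                While Not reader.EndOfStream
                    Dim chunk As New List(Of String)()
                    While chunk.Count < ChunkSize AndAlso Not reader.EndOfStream
                        chunk.Add(reader.ReadLine())
                    End While
                    chunk.Sort(AddressOf CompareKeys)
                    Dim runPath = Path.GetTempFileName()
                    File.WriteAllLines(runPath, chunk)
                    runFiles.Add(runPath)
                End While
            End Using

            ' Pass 2: k-way merge of the sorted runs into the final output.
            Dim readers = runFiles.Select(Function(p) New StreamReader(p)).ToList()
            Dim heads = readers.Select(Function(r) r.ReadLine()).ToList()
            Using writer As New StreamWriter(outputPath)
                While heads.Any(Function(h) h IsNot Nothing)
                    Dim best = -1   ' index of the run with the smallest current line
                    For i = 0 To heads.Count - 1
                        If heads(i) IsNot Nothing AndAlso
                           (best = -1 OrElse CompareKeys(heads(i), heads(best)) < 0) Then
                            best = i
                        End If
                    Next
                    writer.WriteLine(heads(best))
                    heads(best) = readers(best).ReadLine()
                End While
            End Using

            For Each r In readers : r.Dispose() : Next
            For Each p In runFiles : File.Delete(p) : Next
        End Sub

        ' Invented fixed-column comparison: fields 1, 2 and 4 ascending, field 3 descending.
        Function CompareKeys(a As String, b As String) As Integer
            Dim c = String.CompareOrdinal(a.Substring(0, 10), b.Substring(0, 10))    ' field 1
            If c <> 0 Then Return c
            c = String.CompareOrdinal(a.Substring(10, 10), b.Substring(10, 10))      ' field 2
            If c <> 0 Then Return c
            c = String.CompareOrdinal(b.Substring(20, 8), a.Substring(20, 8))        ' field 3, reversed
            If c <> 0 Then Return c
            Return String.CompareOrdinal(a.Substring(28, 10), b.Substring(28, 10))   ' field 4
        End Function
    End Module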

    The solution, of course, is to use the facility of a database, which is already optimised for this type of process.  But the time involved in getting that existing flat file into a database (and then, presumably, out again) is prohibitive.

    So moving this one task from the mainframe to the PC is unlikely to be practical.  What you should be looking at is moving a much larger portion of the process, including receiving the initial data from the client and at least some of the processing after the sort, into a PC environment.  If that's not practical, then it is unlikely that you will be able to improve things by just sorting on a PC instead of a mainframe.

    Wednesday, September 27, 2017 3:30 AM
  • You can't even think about the actual sorting until you have determined what form the source data is going to take.  If your current sorting is on an IBM mainframe then

    Since it is currently being done, albeit in the mainframe environment, they already know what form the source data is in, so perhaps what you are trying to say is not clear.

    the sort will be some version of the merge sort - multiple passes through the file, creating sorted sub-files and then merging them. 

    There are many ways to sort data. If the data (or at least the keys) can all fit into memory, then the possibilities are the same as for any other data that fits in memory. And now that we have SSDs, there might be ways to use external disks more efficiently.

    You could replicate that for the PC - there are many examples of the algorithm available - but whether it would be efficient in the Windows environment (without a lot of detail work with memory mapped files) is hard to say.

    I can't think of any significant difference between a mainframe and a PC.

    The solution, of course, is to use the facility of a database, which is already optimised for this type of process.  But the time involved in getting that existing flat file into a database (and then, presumably, out again) is prohibitive.


    Internally, a database would do the sort somehow; the sorting of the keys, that is. The requirements, as stated, are only to sort the data; nothing else that databases do needs to be done. I sure can't think of anything relevant that databases do that can't be done without them.



    Sam Hobbs
    SimpleSamples.Info

    Wednesday, September 27, 2017 4:40 AM

  • I have multiple, large flat file data sets that I need to sort and I'm not having any luck finding a way to do it.  

    Right now the work is being done on a leased LPAR (Logical Partition) on an IBM mainframe, and we would like to try to offload some of the work to a server, if possible.

    Here is a sample of what is needed in the sort, but I am struggling to code a way to achieve this in Visual Basic .NET using Visual Studio 2015.

    Is there any way to do this using Visual Basic .NET 2015 or 2017?


    Before reinventing the wheel, have you considered trying any of the available
    sorting solutions? Here's one example:

    Sort Solution - Mainframe Sorting Power for Windows
    http://www.mwlabs.de/psortsol.htm

    "Sort Solution is a high-performance sorting library that represents a new 
    milestone in bringing Mainframe sorting power to the desktop."

    "Available as standalone Sorting Utility and 32-Bit DLL / ActiveX for developers

    Maximum file size only limited by the Operating System ...

    Sort millions or even billions of records

    Sort files of any size with only 2-3 megabytes of RAM. ...

    Use up to 64 keys in one sort pass"

    "Easy to integrate into all applications written with a 32-Bit languages like 
    Visual Basic, Visual C++, Delphi, C++-Builder and others"

    ...

    "Contents of the Evaluation Package

    ...

    Libraries and support files for C/C++ and Visual Basic"

    - Wayne

    Wednesday, September 27, 2017 7:36 AM

  • Before reinventing the wheel, have you considered trying any of the available
    sorting solutions? Here's one example:

    Sort Solution - Mainframe Sorting Power for Windows
    http://www.mwlabs.de/psortsol.htm


    Bummer. I just checked the link for the download of Sort Solution and I get
    a "not found". In any event, the basic suggestion stands - look for available
    libraries or utilities that can handle large sorts. I'm sure there are a
    number available. e.g. -

    Freeware Command Line Sort Utility CMSort
    http://www.chmaas.handshake.de/delphi/freeware/cmsort/cmsort.htm

    - Wayne

    Wednesday, September 27, 2017 8:26 AM
  • Hi All,

    I ended up reformatting all the files into a standard fixed-field 400-byte record.  I then sequenced the file and cut it down to just the sequence number and my various sort fields.  That reduced the records from 400 bytes to 68 bytes: 50 bytes of sort fields and 18 bytes of sequence number.

    I then wrote a small program that did bulk inserts into a SQLite database, using transactions to commit 500K records at a time.
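
    For anyone curious, the batching pattern was roughly the following (a sketch using System.Data.SQLite; the table and column layout here is simplified):

    Imports System.Collections.Generic
    Imports System.Data
    Imports System.Data.SQLite

    Module BulkLoad
        Sub LoadKeys(lines As IEnumerable(Of String))
            Using conn As New SQLiteConnection("Data Source=keys.db")
                conn.Open()
                Dim tx = conn.BeginTransaction()
                Using cmd As New SQLiteCommand(
                        "INSERT INTO keys (sortkey, seq) VALUES (@k, @s)", conn, tx)
                    cmd.Parameters.Add("@k", DbType.String)
                    cmd.Parameters.Add("@s", DbType.String)

                    Dim count = 0
                    For Each line In lines
                        cmd.Parameters("@k").Value = line.Substring(0, 50)    ' 50 bytes of sort fields
                        cmd.Parameters("@s").Value = line.Substring(50, 18)   ' 18-byte sequence number
                        cmd.ExecuteNonQuery()
                        count += 1

                        ' Commit and open a new transaction every 500K rows.
                        If count Mod 500000 = 0 Then
                            tx.Commit()
                            tx = conn.BeginTransaction()
                            cmd.Transaction = tx
                        End If
                    Next
                End Using
                tx.Commit()
            End Using
        End Sub
    End Module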

    The load finished in just under an hour.  I was very impressed with that.  The DB was just under 500MB.

    I then ran the command below on the DB.

    SELECT DISTINCT * FROM table ORDER BY field1,field2,field3,field4 DESC,field5;

    I started that process around 2:00am last night and it is still running.

    Full disclosure, I had to move the DB to another server before I ran the query and I think I moved it to the older (5400 RPM) drive.

    I will move it to the SSD drive once the query completes and report back.

    -Ron


    • Edited by NJDevils28 Wednesday, September 27, 2017 10:58 AM
    Wednesday, September 27, 2017 10:56 AM
  • I ended up reformatting all the files into a standard fixed-field 400-byte record.  I then sequenced the file and cut it down to just the sequence number and my various sort fields.  That reduced the records from 400 bytes to 68 bytes: 50 bytes of sort fields and 18 bytes of sequence number.

    If you already have a flat file with identifiable fields, then there is no need to use a database.  Any of the available file-sorting algorithms will perform much faster than the database load and sort.  Of course, if your subsequent processing of the data benefits from the database load, apart from the sorting feature, that may well make the database worthwhile.  But if you are simply going to export the data back into a flat file, there is no point in getting it into the database just to sort.  See, for instance:
    http://www.ashishsharma.me/2011/08/external-merge-sort.html

    There are many other similar examples available.

    Wednesday, September 27, 2017 11:33 AM
  • If you already have a flat file with identifiable fields, then there is no need to use a database.  Any of the available file-sorting algorithms will perform much faster than the database load and sort.

    Hi, and thank you for that link.  I will look into it.

    Wednesday, September 27, 2017 1:05 PM
  • I just checked the link for the download of Sort Solution and I get
    a "not found".

    Note that it says it is for Windows 95, Windows 98, Windows NT 4.x and Windows 2000, so it was written before .NET was available. It might be that a utility that specializes in sorting is more efficient than the sorting in .NET, but don't assume so.

    How is Array.Sort in C# so super-fast?

    Performance of built-in .NET collection sorters

    C# Sort and OrderBy comparison



    Sam Hobbs
    SimpleSamples.Info

    Wednesday, September 27, 2017 4:42 PM
  • Thank you!
    Wednesday, September 27, 2017 5:08 PM