locked
What is the fastest and efficient way to read from a large CSV file in C#? RRS feed

  • Pytanie

  • I have tried different CSV parsers and nothing seems to work. I have a CSV file that has more than 500,000 rows and 26 fields (double data type). I want to read each row, process the data and write it to two different files. Also i will be creating the output files for every 50,000 rows. 
    poniedziałek, 20 lipca 2020 20:54

Wszystkie odpowiedzi

  • If you don't need to write the code yourself, then you can check this https://download.cnet.com/CSV-Splitter/3000-2074_4-75910188.html (I know I used it and it does a great job).

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    poniedziałek, 20 lipca 2020 21:13
  • You could split the files then process them. During processing you could have a cancellation option and also use a delegate to show progress.

    Rough chunking process.

    public class FileOperations
    {
        private static string _baseFileName = "Your base file name for splitting";
        private static bool WriteChunk(IEnumerator<string> lineIterator, int splitSize, int chunk)
        {
            // in this case place split file in a folder below the app named ChunkFiles
            // Then create a file name with chunk param which is the index from the caller for statement
            var fileName = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "ChunkFiles", $"{_baseFileName}_{chunk + 1}.txt");
    
            using (var writer = File.CreateText(fileName))
            {
    
                for (int index = 0; index < splitSize; index++)
                {
                    if (!lineIterator.MoveNext())
                    {
                        return false;
                    }
                    writer.WriteLine(lineIterator.Current);
                }
            }
            return true;
        }
        /// <summary>
        /// File to split
        /// </summary>
        private static string _incomingFileName = "SomeFileName";
        /// <summary>
        /// Split file by splitSize (lines to split by)
        /// </summary>
        /// <param name="splitSize"></param>
        public static void SplitFileVanilla(int splitSize)
        {
            if (!File.Exists(_incomingFileName))
            {
                throw new FileNotFoundException(_incomingFileName);
            }
    
            using (var lineIterator = File.ReadLines(_incomingFileName).GetEnumerator())
            {
                var stillGoing = true;
    
                for (int chunkIndex = 0; stillGoing; chunkIndex++)
                {
                    stillGoing = WriteChunk(lineIterator, splitSize, chunkIndex);
                }
            }
        }
    }
    
    


    Please remember to mark the replies as answers if they help and unmarked them if they provide no help, this will help others who are looking for solutions to the same or similar problem. Contact via my Twitter (Karen Payne) or Facebook (Karen Payne) via my MSDN profile but will not answer coding question on either.

    NuGet BaseConnectionLibrary for database connections.

    StackOverFlow
    profile for Karen Payne on Stack Exchange

    poniedziałek, 20 lipca 2020 22:53
  • I have tried different CSV parsers and nothing seems to work. [...] 

    Did you try the TextFieldParser class too (from .NET Framework), which is usually suitable for C# and VB programs?

    What issues did you meet with existing parsers?


    wtorek, 21 lipca 2020 07:26
  • Please clarify what you mean by fast and efficient. These are often opposite terms. Fast would be loading something up into memory to maximize speed but this is often inefficient. In most cases it depends upon context.

    Please clarify the context. Is this a backend process that runs nightly? If so then fast is only fast enough to have it complete before morning. However if you're doing this as part of a UI then fast would be fast enough to not have the UI freeze but that doesn't mean it is temporally fast.

    Also please identify what tools you've already tried and why they weren't good enough.


    Michael Taylor http://www.michaeltaylorp3.net

    wtorek, 21 lipca 2020 13:47
  • TextFieldParser scaling was unfavorable. It is slow and becomes yet slower on larger files.
    wtorek, 21 lipca 2020 14:36
  • I am talking in terms of processing speed. I have tried different CSV Parser and Readers available on CodeProject and GitHub. 
    wtorek, 21 lipca 2020 14:38
  • I am talking in terms of processing speed. I have tried different CSV Parser and Readers available on CodeProject and GitHub. 

    Did you identify the fastest one for your typical data?


    wtorek, 21 lipca 2020 14:47
  • So how fast does it need to be? Is this a backend process where it is OK to use more memory for faster performance or does it run as part of a UI and therefore the UI needs to be responsive. Please provide more details so we can provide a more focused answer rather than going back and forth until we fully understand your requirements. 

    Please also list all existing versions you've tried and why they weren't sufficient. This will dramatically speed up the time to get you to an answer.


    Michael Taylor http://www.michaeltaylorp3.net

    wtorek, 21 lipca 2020 15:33
  • You can't open big files in a standard way, but you can create a connection to a CSV file.
    ...
    Open large CSV
    Navigate to Data >> Get & Transform Data >> From File >> From Text/CSV and import the CSV file.
    After a while, you are going to get a window with the file preview.
    Click the little triangle next to the load button.
    wtorek, 21 lipca 2020 15:38
  • I have tried different CSV parsers and nothing seems to work. I have a CSV file that has more than 500,000 rows and 26 fields (double data type). I want to read each row, process the data and write it to two different files. Also i will be creating the output files for every 50,000 rows. 
    using System;
    using System.Collections.Generic;
    using System.Text;
    
    namespace csConsoleApp
    {
        /// <summary>
        /// There's exactly 1 way to process CSVs.  It isn't fast and it isn't efficient.  But this is DotNET 
        /// Framework, so you should have kissed efficiency goodbye before you started.  All interpreted languages 
        /// are garbage for performance ie speed and efficiency at runtime.  
        /// 
        /// DotNET, Python, Perl, Javascript, Lua, JRE, on and on and on - It's fast and easy to produce functional 
        /// programs.  End of benefits.  That's fine though - it allows for ROBUST and POWERFUL applications in the 
        /// blink of an eye.
        /// 
        /// For example, you should never get an error using this code unless you send a null or empty string into 
        /// the getFields function.
        /// 
        /// Mind you that this is the fastest and cleanest you can get - it assumes that your CSV doesn't contain 
        /// any string literals or escape characters, which appear commonly in CSV files for example where a comma 
        /// (ie delimiter character) actually appears in a data field.
        /// 
        /// If you seriously have nothing but double-type numeric data, this should work fine for you.  Just read your 
        /// file line-by-line and put each line through the getFields function, and extract the double-as-string data 
        /// for parsing into real numerics later.
        /// </summary>
        public partial class csvParserSimplest
        {
            /// <summary>
            /// A CSV is really "a delimited flatfile" where the delimiter is typically 
            /// a single comma or tabstop character.
            /// </summary>
            public char delimiter = ',';
    
            /// <summary>
            /// Extracts the fields from a CSV file.
            /// </summary>
            /// <param name="csvline">
            /// A single line of Comma Separated Values (or whatever delimiter you like).
            /// </param>
            /// <returns>
            /// An array of strings, where each element is one of the text fields discovered.
            /// </returns>
            public string[] getFields(string csvline)
            {
                List<string> r = new List<string>();
    
                /*
                    Tracking the index where the last delimiter was found (and passed) in the line.
                */
                int lastfieldstart = 0;
                for (int i = 0; i < csvline.Length; i++)
                {
                    if (csvline[i] == this.delimiter)
                    {
                        /*
                            Delimiter found.  Backtrack to where the previous delimiter ended
                        */
                        r.Add(csvline.Substring(lastfieldstart, i - lastfieldstart));
                        // If your data file contains leading or trailing whitespace in the datafields, then 
                        // you'll want to add .Trim() just inside the last parenthese above.
    
                        /*
                            Set the new fieldstart to the char past the current delimiter char
                        */
                        lastfieldstart = i + 1;
                    }
                }
    
                /*
                    A line may end without a proper delimiter.  Make sure you get the last field included!
                */
                if (lastfieldstart < csvline.Length) 
                { 
                    r.Add(csvline.Substring(lastfieldstart, csvline.Length - lastfieldstart)); 
                    // If your data file contains leading or trailing whitespace in the datafields, then 
                    // you'll want to add .Trim() just inside the last parenthese above.
                }
    
                return r.ToArray();
            }
        }
    }
    

    That's File A - your CSV Parser class.

    using System;
    using System.Collections.Generic;
    using System.Text;
    
    namespace csConsoleApp
    {
    
        class Program
        {
    
            static void Main(string[] args)
            {
                csvParserSimplest csv = new csvParserSimplest();
    
                // Replace the next line with your code to read your file line-by-line, passing each 
                // line through csv.getFields.
                string[] csvflds = csv.getFields(@"Field 0,Field 1,Field 2,Field 3,Field 4,Field 5,");
                foreach (string f in csvflds)
                {
                    Console.WriteLine(f);
                }
            }
        }
    }
    

    That's File B - your console app.


    Before you can learn anything new you have to learn that there's stuff you don't know.

    wtorek, 21 lipca 2020 16:00
  • It depends on the format of the CSV file. Not all CSV files use the same format. The main inconsistency is how strings are stored. If there are no strings with commas then the simplest and most efficient way to parse the data is with the String.Split Method. Just read each line in and split it into an array. You say double data type and that implies that none of those fields have commas in them. It would really help me help you if I had details as described above. If each line begins with a string then you might can process that data separately from the rest of the data.


    Sam Hobbs
    SimpleSamples.Info

    niedziela, 2 sierpnia 2020 18:47
  • Hi,

    Has your issue been resolved?

    If so, please click on the "Mark as answer" option of the reply that solved your question, so that it will help other members to find the solution quickly if they face a similar issue.

    Best Regards,

    Timon


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.

    czwartek, 6 sierpnia 2020 08:29
  • The fastest is to use SSIS by far. Even when compared to highly threaded C# apps trawling through large it handles it in orders of magnitude better.

    Normally don't like suggesting a different technology to solve a problem, but having proved the difference I thought I would mention it. Beside right tools for the right job...

    wtorek, 18 sierpnia 2020 11:05
  • The fastest approach requires memory mapping your file, reading it in meaningful chunks, offloading those chunks to separate tasks/threads for parsing and storing it.

    imho 500k rows is not that much, so the question is how much time does your current approach take to complete the job?


    wtorek, 18 sierpnia 2020 11:21
  • The fastest is to use SSIS by far.

    Anything that SSIS can do can be done in C#, it is just a matter of knowing how.


    Sam Hobbs
    SimpleSamples.Info

    wtorek, 18 sierpnia 2020 13:03
  • Loading 10GB CSV file is slow in SSIS too. Especially if there are errors (such as double quotes inside the column).

    Looking for new opportunities

    For every expert, there is an equal and opposite expert. - Becker's Law


    My blog


    My TechNet articles

    wtorek, 18 sierpnia 2020 14:42